geom(4)
NAME
GEOM -- modular disk I/O request transformation framework.
DESCRIPTION
The GEOM framework provides an infrastructure in which "classes" can perform transformations on disk I/O requests on their path from the upper kernel to the device drivers and back.

Transformations in a GEOM context range from the simple geometric displacement performed in typical disk partitioning modules, over RAID algorithms and device multipath resolution, to full-blown cryptographic protection of the stored data.

Compared to traditional "volume management", GEOM differs from most and in some cases all previous implementations in the following ways:

o GEOM is extensible. It is trivially simple to write a new class of transformation and it will not be given stepchild treatment. If someone for some reason wanted to mount IBM MVS diskpacks, a class recognizing and configuring their VTOC information would be a trivial matter.

o GEOM is topologically agnostic. Most volume management implementations have very strict notions of how classes can fit together; very often one fixed hierarchy is provided, for instance subdisk - plex - volume.

Being extensible means that new transformations are treated no differently than existing transformations.

Fixed hierarchies are bad because they make it impossible to express the intent efficiently. In the fixed hierarchy above it is not possible to mirror two physical disks and then partition the mirror into subdisks. Instead one is forced to make subdisks on the physical volumes and to mirror these two and two, resulting in a much more complex configuration. GEOM, on the other hand, does not care in which order things are done; the only restriction is that cycles in the graph will not be allowed.
TERMINOLOGY and TOPOLOGY
GEOM is quite object oriented and consequently the terminology borrows a lot of context and semantics from the OO vocabulary:

A "class", represented by the data structure "g_class", implements one particular kind of transformation. Typical examples are MBR disk partition, BSD disklabel, and RAID5 classes.

An instance of a class is called a "geom" and represented by the data structure "g_geom". In a typical i386 FreeBSD system, there will be one geom of class MBR for each disk.

A "provider", represented by the data structure "g_provider", is the front gate at which a geom offers service. A provider is "a disk-like thing which appears in /dev" - a logical disk in other words. All providers have three main properties: name, sectorsize and size.

A "consumer" is the backdoor through which a geom connects to another geom provider and through which I/O requests are sent.

o A geom has zero or more providers.

o A consumer can be attached to zero or one providers.

o A provider can have zero or more consumers attached.

All geoms have a rank-number assigned, which is used to detect and prevent loops in the acyclic directed graph. This rank number is assigned as follows:

1. A geom with no attached consumers has rank=1.

2. A geom with attached consumers has a rank one higher than the highest rank of the geoms of the providers its consumers are attached to.
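The attachment rules and the two rank rules above can be illustrated with a small userland model. This is only a sketch: the class names `Geom`, `Provider` and `Consumer` and the helpers `new_provider`, `new_consumer`, `reaches` and `attach` are hypothetical and are not the kernel's data structures or API. Loop prevention is done here with a plain reachability walk as a stand-in; how the kernel uses the rank numbers for that purpose is not shown.

```python
class Geom:
    """Hypothetical stand-in for an instance of a class (a "geom")."""

    def __init__(self, name):
        self.name = name
        self.providers = []  # a geom has zero or more providers
        self.consumers = []  # consumers owned by this geom

    def rank(self):
        # Collect the ranks of the geoms underneath attached consumers.
        below = [c.provider.geom.rank()
                 for c in self.consumers if c.provider is not None]
        # Rule 1: no attached consumers gives rank 1.
        # Rule 2: otherwise one higher than the highest rank below.
        return 1 + max(below) if below else 1


class Provider:
    """Hypothetical stand-in for a provider: name, sectorsize and size."""

    def __init__(self, geom, name, sectorsize, size):
        self.geom = geom
        self.name = name
        self.sectorsize = sectorsize
        self.size = size
        self.consumers = []  # zero or more consumers attached


class Consumer:
    """Hypothetical stand-in for a consumer."""

    def __init__(self, geom):
        self.geom = geom
        self.provider = None  # attached to zero or one providers


def new_provider(geom, name, sectorsize, size):
    provider = Provider(geom, name, sectorsize, size)
    geom.providers.append(provider)
    return provider


def new_consumer(geom):
    consumer = Consumer(geom)
    geom.consumers.append(consumer)
    return consumer


def reaches(start, target):
    """True if target is start itself or lies below start in the graph."""
    if start is target:
        return True
    return any(c.provider is not None and reaches(c.provider.geom, target)
               for c in start.consumers)


def attach(consumer, provider):
    """Attach a consumer to a provider, refusing anything that forms a cycle."""
    if consumer.provider is not None:
        raise ValueError("consumer is already attached")
    if reaches(provider.geom, consumer.geom):
        raise ValueError("attach would create a loop")
    consumer.provider = provider
    provider.consumers.append(consumer)
```

With a disk geom offering "da0", an MBR geom attached to it and a BSD geom attached to one of the MBR providers, `rank()` gives 1, 2 and 3 respectively, and an attempt to attach the disk geom to a provider of the BSD geom is refused.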
SPECIAL TOPOLOGICAL MANEUVERS
In addition to the straightforward attach, which attaches a consumer to a provider, and detach, which breaks the bond, a number of special topological maneuvers exist to facilitate configuration and to improve the overall flexibility.

TASTING is a process that happens whenever a new class or new provider is created. It gives the class a chance to automatically configure an instance on providers which it recognizes as its own. A typical example is the MBR disk-partition class, which will look for the MBR table in the first sector and, if it is found and validated, will instantiate a geom to multiplex according to the contents of the MBR.

A new class will be offered to all existing providers in turn and a new provider will be offered to all classes in turn.

Exactly what a class does to recognize if it should accept the offered provider is not defined by GEOM, but the sensible set of options are:

o Examine specific data structures on the disk.

o Examine properties like sectorsize or mediasize for the provider.

o Examine the rank number of the provider's geom.

o Examine the method name of the provider's geom.

ORPHANIZATION is the process by which a provider is removed while it potentially is still being used.

When a geom orphans a provider, all future I/O requests will "bounce" on the provider with an error code set by the geom. Any consumers attached to the provider will receive notification about the orphanization when the eventloop gets around to it, and they can take appropriate action at that time.

A geom which came into being as a result of a normal taste operation should self-destruct unless it has a way to keep functioning without the orphaned provider. Geoms like diskslicers should therefore self-destruct, whereas RAID5 or mirror geoms will be able to continue as long as they do not lose quorum.

[...] provider for it.

o The geoms on top of the disk receive the orphanization event and orphan all their providers in turn. Providers which are not attached to will typically self-destruct right away. This process continues in a quasi-recursive fashion until all relevant pieces of the tree have heard the bad news.

o Eventually the buck stops when it reaches geom_dev at the top of the stack.

o Geom_dev will call destroy_dev(9) to stop any more requests from coming in. It will sleep until all (if any) outstanding I/O requests have been returned. It will explicitly close (i.e., zero the access counts), a change which will propagate all the way down through the mesh. It will then detach and destroy its geom.

o The geom whose provider is now detached will destroy the provider, detach and destroy its consumer and destroy its geom.

o This process percolates all the way down through the mesh, until the cleanup is complete.

While this approach seems byzantine, it does provide the maximum flexibility and robustness in handling disappearing devices.

The one absolutely crucial detail to be aware of is that if the device driver does not return all I/O requests, the tree will not unravel.

SPOILING is a special case of orphanization used to protect against stale metadata. It is probably easiest to understand spoiling by going through an example.

Imagine a disk, "da0", on top of which an MBR geom provides "da0s1" and "da0s2", and on top of "da0s1" a BSD geom provides "da0s1a" through "da0s1e". Both the MBR and BSD geoms have autoconfigured based on data structures on the disk media. Now imagine the case where "da0" is opened for writing and those data structures are modified or overwritten: the geoms would then be operating on stale metadata unless some notification system can inform them otherwise.

To avoid this situation, when the open of "da0" for write happens, all attached consumers are told about this, and geoms like MBR and BSD will self-destruct as a result.
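The write-count bookkeeping behind the "da0" example can be sketched as a small userland model. Everything here is an assumption made for illustration: the `Disk` class, its `KNOWN_CLASSES` tuple and the `open_write`, `close_write` and `taste` methods are hypothetical names, not kernel interfaces, and only a single level of auto-configured geoms is modelled. Retasting on the last write close is included so the model is complete.

```python
class Disk:
    """Hypothetical model of a provider such as "da0"."""

    # Classes that get a chance to taste the provider (illustrative only).
    KNOWN_CLASSES = ("MBR",)

    def __init__(self, name, on_disk):
        self.name = name
        self.on_disk = set(on_disk)  # metadata currently present on the media
        self.write_count = 0
        self.geoms = []              # auto-configured geoms on top of the disk
        self.taste()

    def taste(self):
        # Each class instantiates a geom only if it recognizes its metadata.
        self.geoms = [cls for cls in self.KNOWN_CLASSES if cls in self.on_disk]

    def open_write(self):
        self.write_count += 1
        if self.write_count == 1:
            # Zero to non-zero: spoil, so the auto-configured geoms
            # self-destruct rather than run on possibly stale metadata.
            self.geoms = []

    def close_write(self):
        if self.write_count == 0:
            raise ValueError("not open for writing")
        self.write_count -= 1
        if self.write_count == 0:
            # Non-zero to zero: offer the provider for tasting again.
            self.taste()
```

A second writer opening the disk does not spoil anything further, and geoms only reappear after the last writer closes, and then only if their on-disk structures are still present.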
When "da0" is closed again, it will be offered for tasting again and, if the data structures for MBR and BSD are still there, new geoms will instantiate themselves anew.

Now for the fine print:

If any of the paths through the MBR or BSD module were open, they would have opened downwards with an exclusive bit, rendering it impossible to open "da0" for writing in that case. Conversely, the requested exclusive bit would render it impossible to open a path through the MBR geom while "da0" is open for writing.

From this it also follows that changing the size of open geoms can only be done with their cooperation.

Finally: the spoiling only happens when the write count goes from zero to non-zero and the retasting only when the write count goes from non-zero to zero.

[...] configure yet another mirror copy on the mirror geom, request a synchronization, and finally drop the first mirror copy. We have now in essence moved a mounted file system from one disk to another while it was being used. At this point the mirror geom can be deleted from the path again; it has served its purpose.

CONFIGURE is the process where the administrator issues instructions for a particular class to instantiate itself. There are multiple ways to express intent in this case: a particular provider can be specified with a level of override, forcing for instance a BSD disklabel module to attach to a provider which was not found palatable during the TASTE operation.

Finally, IO is the reason we even do this: it concerns itself with sending I/O requests through the graph.

I/O REQUESTS, represented by struct bio, originate at a consumer, are scheduled on its attached provider and, when processed, are returned to the consumer. It is important to realize that the struct bio which enters through the provider of a particular geom does not "come out on the other side". Even simple transformations like MBR and BSD will clone the struct bio, modify the clone, and schedule the clone on their own consumer. Note that cloning the struct bio does not involve cloning the actual data area specified in the I/O request.

In total four different I/O requests exist in GEOM: read, write, delete, and get attribute.

Read and write are self explanatory.

Delete indicates that a certain range of data is no longer used and that it can be erased or freed as the underlying technology supports. Technologies like flash adaptation layers can arrange to erase the relevant blocks before they become reassigned, and cryptographic devices may want to fill random bits into the range to reduce the amount of data available for attack.

It is important to recognize that a delete indication is not a request and consequently there is no guarantee that the data actually will be erased or made unavailable unless guaranteed by specific geoms in the graph. If "secure delete" semantics are required, a geom should be pushed which converts delete indications into (a sequence of) write requests.

Get attribute supports inspection and manipulation of out-of-band attributes on a particular provider or path. Attributes are named by ASCII strings and they will be discussed in a separate section below.

(stay tuned while the author rests his brain and fingers: more to come.)
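The clone-and-modify handling of I/O requests described in this section can be sketched in userland as follows. This is an illustrative model only: `Bio` is a hypothetical stand-in for struct bio with invented field names, and `clone`, `slice_start` and `delete_as_write` are hypothetical helpers, not kernel functions. Zero-filling in `delete_as_write` is just one possible choice for a "secure delete" geom.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Bio:
    """Hypothetical stand-in for struct bio."""
    cmd: str                           # "read", "write", "delete" or "getattr"
    offset: int                        # byte offset on the provider
    length: int                        # byte length of the request
    data: Optional[bytearray] = None   # data area, if the request has one
    parent: Optional["Bio"] = None     # the request this one was cloned from


def clone(bio):
    """Clone the request descriptor; the data area is shared, not copied."""
    return Bio(bio.cmd, bio.offset, bio.length, data=bio.data, parent=bio)


def slice_start(bio, slice_offset, slice_size):
    """What a simple slicing geom (MBR, BSD) might do with an incoming bio."""
    if bio.offset < 0 or bio.offset + bio.length > slice_size:
        raise ValueError("request falls outside the slice")
    child = clone(bio)
    # Only the clone is modified; the original request stays untouched.
    child.offset += slice_offset
    # A real geom would now schedule the clone on its own consumer.
    return child


def delete_as_write(bio):
    """Turn a delete indication into a write request over the same range."""
    if bio.cmd != "delete":
        raise ValueError("not a delete indication")
    child = clone(bio)
    child.cmd = "write"
    child.data = bytearray(bio.length)  # zero-filled buffer (an assumption)
    return child
```

For a write to offset 0 of a slice that starts at sector 63, `slice_start(bio, 63 * 512, slice_size)` returns a new request at offset 32256 whose `data` is the very same buffer as the original and whose `parent` points back to it.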
HISTORY
This software was developed for the FreeBSD Project by Poul-Henning Kamp and NAI Labs, the Security Research Division of Network Associates, Inc. under DARPA/SPAWAR contract N66001-01-C-8035 ("CBOSS"), as part of the DARPA CHATS research program. The first precursor for GEOM was a gruesome hack to Minix 1.2 and was never distributed. An earlier attempt to implement a less general scheme in FreeBSD never succeeded.