utf8(5)
NAME
utf8 -- UTF-8, a transformation format of ISO 10646
SYNOPSIS
ENCODING "UTF-8"
DESCRIPTION
The UTF-8 encoding represents UCS-4 characters as a sequence of octets, using between 1 and 6 octets for each character. It is backwards compatible with ASCII, so 0x00-0x7f refer to the ASCII character set. The multibyte encoding of a non-ASCII character consists entirely of bytes whose high order bit is set. The actual encoding is represented by the following table:

[0x00000000 - 0x0000007f] [00000000.0bbbbbbb] -> 0bbbbbbb
[0x00000080 - 0x000007ff] [00000bbb.bbbbbbbb] -> 110bbbbb, 10bbbbbb
[0x00000800 - 0x0000ffff] [bbbbbbbb.bbbbbbbb] -> 1110bbbb, 10bbbbbb, 10bbbbbb
[0x00010000 - 0x001fffff] [00000000.000bbbbb.bbbbbbbb.bbbbbbbb] -> 11110bbb, 10bbbbbb, 10bbbbbb, 10bbbbbb
[0x00200000 - 0x03ffffff] [000000bb.bbbbbbbb.bbbbbbbb.bbbbbbbb] -> 111110bb, 10bbbbbb, 10bbbbbb, 10bbbbbb, 10bbbbbb
[0x04000000 - 0x7fffffff] [0bbbbbbb.bbbbbbbb.bbbbbbbb.bbbbbbbb] -> 1111110b, 10bbbbbb, 10bbbbbb, 10bbbbbb, 10bbbbbb, 10bbbbbb

If more than a single representation of a value exists (for example, 0x00; 0xC0 0x80; 0xE0 0x80 0x80) the shortest representation is always used. Longer ones are detected as an error as they pose a potential security risk, and destroy the 1:1 character:octet sequence mapping.
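The encoding table can be sketched in a few lines of Python. This is an illustration only, not the FreeBSD libc implementation; the names utf8_encode and _RANGES are hypothetical. It follows the 1 to 6 octet scheme described here and always emits the shortest representation.

```python
# Illustrative sketch of the encoding table (hypothetical helper names).
# Each entry: (upper bound of range, number of octets, lead-octet prefix)
_RANGES = [
    (0x0000007F, 1, 0x00),
    (0x000007FF, 2, 0xC0),
    (0x0000FFFF, 3, 0xE0),
    (0x001FFFFF, 4, 0xF0),
    (0x03FFFFFF, 5, 0xF8),
    (0x7FFFFFFF, 6, 0xFC),
]


def utf8_encode(code):
    """Return the shortest octet sequence for a 31-bit character value."""
    if not 0 <= code <= 0x7FFFFFFF:
        raise ValueError("value outside the 31-bit range")
    # Pick the first (smallest) range that can hold the value.
    for upper, length, prefix in _RANGES:
        if code <= upper:
            break
    # Trailing octets carry 6 bits each, built from the low bits upward.
    tail = []
    for _ in range(length - 1):
        tail.append(0x80 | (code & 0x3F))
        code >>= 6
    # The remaining bits fit in the payload of the lead octet.
    return bytes([prefix | code] + tail[::-1])
```

For example, utf8_encode(0x20AC) yields the three octets 0xE2 0x82 0xAC, matching the third row of the table.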
COMPATIBILITY
The utf8 encoding supersedes the utf2(5) encoding. The only differences between the two are that utf8 handles the full 31-bit character set of ISO 10646 whereas utf2(5) is limited to a 16-bit character set, and that utf2(5) accepts redundant, non-"shortest form" representations of characters.
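The shortest-form difference can be illustrated with a small Python decoder. This is a sketch under stated assumptions, not the actual utf8 or utf2(5) code: the function utf8_decode_one and its strict parameter are hypothetical, and strict=False models only the acceptance of redundant forms, not the 16-bit limit of utf2(5).

```python
def utf8_decode_one(octets, strict=True):
    """Decode one character from the start of octets (hypothetical helper).

    Returns (value, number of octets consumed). With strict=True a
    non-shortest form raises ValueError; strict=False accepts it.
    """
    if not octets:
        raise ValueError("empty input")
    lead = octets[0]
    if lead < 0x80:
        return lead, 1
    # The count of leading 1 bits in the lead octet gives the length.
    length = 0
    mask = 0x80
    while lead & mask:
        length += 1
        mask >>= 1
    if length < 2 or length > 6:
        raise ValueError("invalid lead octet")
    if len(octets) < length:
        raise ValueError("truncated sequence")
    # Payload bits of the lead octet are those below the terminating 0 bit.
    value = lead & (mask - 1)
    for octet in octets[1:length]:
        if octet & 0xC0 != 0x80:
            raise ValueError("invalid continuation octet")
        value = (value << 6) | (octet & 0x3F)
    # Smallest value that genuinely needs this many octets.
    minimum = (0, 0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000)[length]
    if strict and value < minimum:
        raise ValueError("non-shortest form")
    return value, length
```

With strict=True the redundant sequence 0xC0 0x80 is rejected, while strict=False decodes it to the value 0x00 and reports two octets consumed.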
SEE ALSO
euc(5), utf2(5)

Rob Pike and Ken Thompson, "Hello World", Proceedings of the Winter 1993 USENIX Technical Conference, USENIX Association, January 1993.

F. Yergeau, UTF-8, a transformation format of ISO 10646, January 1998, RFC 2279.

The Unicode Standard, Version 3.0, The Unicode Consortium, 2000, as amended by the Unicode Standard Annex #27: Unicode 3.1 and by the Unicode Standard Annex #28: Unicode 3.2.
STANDARDS
The utf8 encoding is compatible with RFC 2279 and Unicode 3.2.

FreeBSD 5.4                      April 7, 2004                      FreeBSD 5.4