Unicode support
===============

		 Last update: 2005-01-17, version 1.4

This file is maintained by H. Peter Anvin <unicode@lanana.org> as part
of the Linux Assigned Names And Numbers Authority (LANANA) project.
The current version can be found at:

	    http://www.lanana.org/docs/unicode/admin-guide/unicode.rst

Introduction
------------

The Linux kernel code has been rewritten to use Unicode to map
characters to fonts.  By downloading a single Unicode-to-font table,
both the eight-bit character sets and UTF-8 mode are changed to use
the font as indicated.

This changes the semantics of the eight-bit character tables subtly.
The four character tables are now:

=============== =============================== ================
Map symbol	Map name			Escape code (G0)
=============== =============================== ================
LAT1_MAP	Latin-1 (ISO 8859-1)		ESC ( B
GRAF_MAP	DEC VT100 pseudographics	ESC ( 0
IBMPC_MAP	IBM code page 437		ESC ( U
USER_MAP	User defined			ESC ( K
=============== =============================== ================

In particular, ESC ( U is no longer "straight to font", since the font
might be completely different from the IBM character set.  This
permits, for example, the use of block graphics even with a Latin-1
font loaded.

Note that although these codes are similar to ISO 2022, neither the
codes nor their uses match ISO 2022; Linux has two 8-bit codes (G0 and
G1), whereas ISO 2022 has four 7-bit codes (G0-G3).

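As a rough illustration of the G0 escape codes in the table above, the
following Python sketch builds the selection sequences as strings.  The
names ``G0_SELECT``, ``select_g0`` and ``with_map`` are hypothetical and
exist only for this example; the table prints each sequence with spaces
for readability, while the sketch assumes the bytes are sent without
them.

```python
# Illustrative sketch only: these names are hypothetical, not kernel
# or library interfaces.
ESC = "\x1b"

# G0 selection sequences from the table ("ESC ( B" and so on), written
# here without the spaces the table uses for readability.
G0_SELECT = {
    "LAT1_MAP": ESC + "(B",   # Latin-1 (ISO 8859-1)
    "GRAF_MAP": ESC + "(0",   # DEC VT100 pseudographics
    "IBMPC_MAP": ESC + "(U",  # IBM code page 437
    "USER_MAP": ESC + "(K",   # user-defined map
}


def select_g0(map_symbol: str) -> str:
    """Return the escape sequence that selects map_symbol as G0."""
    return G0_SELECT[map_symbol]


def with_map(map_symbol: str, text: str) -> str:
    """Wrap text in map_symbol, then switch G0 back to Latin-1."""
    return select_g0(map_symbol) + text + select_g0("LAT1_MAP")
```

Writing ``with_map("GRAF_MAP", "lqqqk")`` to a console that honours
these sequences would be expected to draw the top edge of a box rather
than the letters themselves.
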
In accordance with the Unicode standard/ISO 10646, the range U+F000 to
U+F8FF has been reserved for OS-wide allocation (the Unicode Standard
refers to this as a "Corporate Zone"; since this is inaccurate for
Linux, we call it the "Linux Zone").  U+F000 was picked as the starting
point since it lets the direct-mapping area start on a large
power-of-two boundary (in case 1024- or 2048-character fonts ever
become necessary).  This leaves U+E000 to U+EFFF as the End User Zone.

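The split of the private use range described above can be sketched in a
few lines of Python.  The range boundaries come from the text; the
names ``END_USER_ZONE``, ``LINUX_ZONE`` and ``zone_of`` are assumptions
made for this example only.

```python
# Illustrative sketch only: names are hypothetical, boundaries are the
# ones stated in the text.
END_USER_ZONE = range(0xE000, 0xF000)  # U+E000..U+EFFF
LINUX_ZONE = range(0xF000, 0xF900)     # U+F000..U+F8FF


def zone_of(codepoint: int) -> str:
    """Classify a code point according to the zones described above."""
    if codepoint in END_USER_ZONE:
        return "End User Zone"
    if codepoint in LINUX_ZONE:
        return "Linux Zone"
    return "outside both zones"


# U+F000 is a multiple of 2048, so a directly mapped font of up to
# 2048 characters would start on an aligned boundary.
ALIGNED_FOR_2048 = (0xF000 % 2048 == 0)
```

For instance, ``zone_of(0xF8D0)`` reports the Linux Zone, while
``zone_of(0xE000)`` reports the End User Zone.
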
[v1.2]: The Unicode range from U+F000 up to U+F7FF has been
hard-coded to map directly to the loaded font, bypassing the
translation table.  The user-defined map now defaults to U+F000 to
U+F0FF, emulating the previous behaviour.  In practice, this range
might be shorter; for example, vgacon can only handle 256-character
(U+F000..U+F0FF) or 512-character (U+F000..U+F1FF) fonts.


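The direct mapping described above amounts to adding the glyph index to
U+F000.  The sketch below assumes that reading; the helper names
``glyph_to_codepoint`` and ``utf8_for_glyph`` and the ``font_size``
parameter are hypothetical and are not kernel interfaces.

```python
# Illustrative sketch only: these helpers are hypothetical.
DIRECT_BASE = 0xF000  # first directly mapped code point (from the text)
DIRECT_LAST = 0xF7FF  # last directly mapped code point (from the text)


def glyph_to_codepoint(glyph_index: int, font_size: int = 256) -> int:
    """Return the Linux Zone code point that addresses glyph_index."""
    if not 0 <= glyph_index < font_size:
        raise ValueError("glyph index outside the loaded font")
    codepoint = DIRECT_BASE + glyph_index
    if codepoint > DIRECT_LAST:
        raise ValueError("code point outside the direct-mapping range")
    return codepoint


def utf8_for_glyph(glyph_index: int, font_size: int = 256) -> bytes:
    """Return the UTF-8 bytes a program would emit to show that glyph."""
    return chr(glyph_to_codepoint(glyph_index, font_size)).encode("utf-8")
```

With a 256-character font, glyph 0x41 would be addressed as U+F041,
which is the three bytes EF 81 81 in UTF-8.
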
Actual characters assigned in the Linux Zone
--------------------------------------------

In addition, the following characters not present in Unicode 1.1.4
have been defined; these are used by the DEC VT graphics map.  [v1.2]
THIS USE IS OBSOLETE AND SHOULD NO LONGER BE USED; PLEASE SEE BELOW.

====== ======================================
U+F800 DEC VT GRAPHICS HORIZONTAL LINE SCAN 1
U+F801 DEC VT GRAPHICS HORIZONTAL LINE SCAN 3
U+F803 DEC VT GRAPHICS HORIZONTAL LINE SCAN 7
U+F804 DEC VT GRAPHICS HORIZONTAL LINE SCAN 9
====== ======================================

The DEC VT220 uses a 6x10 character matrix, and these characters form
a smooth progression in the DEC VT graphics character set.  I have
omitted the scan 5 line, since it is also used as a block-graphics
character, and hence has been coded as U+2500 FORMS LIGHT HORIZONTAL.

[v1.3]: These characters have been officially added to Unicode 3.2.0;
they are added at U+23BA, U+23BB, U+23BC, U+23BD.  Linux now uses the
new values.

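Text that still contains the obsolete Linux Zone values could be
converted along the following lines.  This is a sketch that assumes the
four old code points correspond to the four new ones in the order
listed above (scan 1, 3, 7, 9); the names ``OBSOLETE_TO_STANDARD`` and
``modernize`` are hypothetical.

```python
# Illustrative sketch only.  The pairing assumes the obsolete and the
# Unicode 3.2.0 code points correspond in the order they are listed.
OBSOLETE_TO_STANDARD = {
    0xF800: 0x23BA,  # scan 1
    0xF801: 0x23BB,  # scan 3
    0xF803: 0x23BC,  # scan 7
    0xF804: 0x23BD,  # scan 9
}


def modernize(text: str) -> str:
    """Replace obsolete Linux Zone scan-line characters in text."""
    return text.translate(OBSOLETE_TO_STANDARD)
```

Characters outside the table, including the rest of the Linux Zone,
pass through unchanged.
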
[v1.2]: The following characters have been added to represent common
keyboard symbols that are unlikely to ever be added to Unicode proper
since they are horribly vendor-specific.  This, of course, is an
excellent example of horrible design.

====== ======================================
U+F810 KEYBOARD SYMBOL FLYING FLAG
U+F811 KEYBOARD SYMBOL PULLDOWN MENU
U+F812 KEYBOARD SYMBOL OPEN APPLE
U+F813 KEYBOARD SYMBOL SOLID APPLE
====== ======================================

Klingon language support
------------------------

In 1996, Linux was the first operating system in the world to add
support for the artificial language Klingon, created by Marc Okrand
for the "Star Trek" television series.  This encoding was later
adopted by the ConScript Unicode Registry and proposed (but ultimately
rejected) for inclusion in Unicode Plane 1.  Thus, it remains as a
Linux/CSUR private assignment in the Linux Zone.

This encoding has been endorsed by the Klingon Language Institute.
For more information, contact them at:

	http://www.kli.org/

Since the characters in the beginning of the Linux Zone have been more
of the dingbats/symbols/forms type and this is a language, I have
located it at the end, on a 16-cell boundary in keeping with standard
Unicode practice.

.. note::

  This range is now officially managed by the ConScript Unicode
  Registry.  The normative reference is at:

	https://www.evertype.com/standards/csur/klingon.html

Klingon has an alphabet of 26 characters, a positional numeric writing
system with 10 digits, and is written left-to-right, top-to-bottom.

Several glyph forms for the Klingon alphabet have been proposed.
However, since the set of symbols appears to be consistent throughout,
with only the actual shapes being different, in keeping with standard
Unicode practice these differences are considered font variants.

======	=======================================================
U+F8D0	KLINGON LETTER A
U+F8D1	KLINGON LETTER B
U+F8D2	KLINGON LETTER CH
U+F8D3	KLINGON LETTER D
U+F8D4	KLINGON LETTER E
U+F8D5	KLINGON LETTER GH
U+F8D6	KLINGON LETTER H
U+F8D7	KLINGON LETTER I
U+F8D8	KLINGON LETTER J
U+F8D9	KLINGON LETTER L
U+F8DA	KLINGON LETTER M
U+F8DB	KLINGON LETTER N
U+F8DC	KLINGON LETTER NG
U+F8DD	KLINGON LETTER O
U+F8DE	KLINGON LETTER P
U+F8DF	KLINGON LETTER Q
	- Written <q> in standard Okrand Latin transliteration
U+F8E0	KLINGON LETTER QH
	- Written <Q> in standard Okrand Latin transliteration
U+F8E1	KLINGON LETTER R
U+F8E2	KLINGON LETTER S
U+F8E3	KLINGON LETTER T
U+F8E4	KLINGON LETTER TLH
U+F8E5	KLINGON LETTER U
U+F8E6	KLINGON LETTER V
U+F8E7	KLINGON LETTER W
U+F8E8	KLINGON LETTER Y
U+F8E9	KLINGON LETTER GLOTTAL STOP

U+F8F0	KLINGON DIGIT ZERO
U+F8F1	KLINGON DIGIT ONE
U+F8F2	KLINGON DIGIT TWO
U+F8F3	KLINGON DIGIT THREE
U+F8F4	KLINGON DIGIT FOUR
U+F8F5	KLINGON DIGIT FIVE
U+F8F6	KLINGON DIGIT SIX
U+F8F7	KLINGON DIGIT SEVEN
U+F8F8	KLINGON DIGIT EIGHT
U+F8F9	KLINGON DIGIT NINE

U+F8FD	KLINGON COMMA
U+F8FE	KLINGON FULL STOP
U+F8FF	KLINGON SYMBOL FOR EMPIRE
======	=======================================================

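Because the letters and digits in the table above occupy consecutive
code points, the assignments can be derived from two base values.  The
sketch below does this in Python; the names ``LETTER_NAMES``,
``LETTERS`` and ``klingon_digits`` are hypothetical, and writing the
most significant digit first is an assumption based on the
left-to-right, positional description above.

```python
# Illustrative sketch only: names are hypothetical, code points are
# the ones listed in the table.
LETTER_BASE = 0xF8D0  # KLINGON LETTER A
DIGIT_BASE = 0xF8F0   # KLINGON DIGIT ZERO

# Letter names in table order; the suffix after "KLINGON LETTER".
LETTER_NAMES = [
    "A", "B", "CH", "D", "E", "GH", "H", "I", "J", "L", "M", "N", "NG",
    "O", "P", "Q", "QH", "R", "S", "T", "TLH", "U", "V", "W", "Y",
    "GLOTTAL STOP",
]

# Map each letter name to its code point in the Linux Zone.
LETTERS = {name: LETTER_BASE + i for i, name in enumerate(LETTER_NAMES)}


def klingon_digits(number: int) -> str:
    """Return number written with Klingon digits, most significant first."""
    if number < 0:
        raise ValueError("only non-negative integers are handled here")
    return "".join(chr(DIGIT_BASE + int(digit)) for digit in str(number))
```

For example, ``LETTERS["TLH"]`` evaluates to 0xF8E4, matching the
table, and ``klingon_digits(2005)`` yields the four characters U+F8F2,
U+F8F0, U+F8F0, U+F8F5.
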
Other Fictional and Artificial Scripts
--------------------------------------

Since the assignment of the Klingon Linux Unicode block, a registry of
fictional and artificial scripts has been established by John Cowan
<jcowan@reutershealth.com> and Michael Everson <everson@evertype.com>.
The ConScript Unicode Registry is accessible at:

	  https://www.evertype.com/standards/csur/

The ranges used fall at the low end of the End User Zone and hence
cannot be normatively assigned, but it is recommended that people who
wish to encode fictional scripts use these codes, in the interest of
interoperability.  For Klingon, CSUR has adopted the Linux encoding.
The CSUR people are driving the addition of Tengwar and Cirth to
Unicode Plane 1; the addition of Klingon to Unicode Plane 1 has been
rejected and so the above encoding remains official.