CJK Unified Ideographs

CJKV ideograph in traditional and simplified Chinese, Korean, Vietnamese and Japanese

The Chinese, Japanese and Korean (CJK) scripts share a common background, and their shared characters are collectively known as CJK characters. In the process called Han unification, these shared characters were identified and named CJK Unified Ideographs. As of Unicode 11.0, Unicode defines a total of 87,887 CJK Unified Ideographs.[1]

The terms ideographs or ideograms may be misleading, since the Chinese script is not strictly a pictographic or ideographic system.

Historically, Vietnam used Chinese ideographs too, so the abbreviation "CJKV" is sometimes used. In Vietnam, this writing system was replaced by the Latin-based Vietnamese alphabet in the 1920s.

CJK Unified Ideographs blocks

CJK Unified Ideographs

The basic block named CJK Unified Ideographs (4E00–9FFF) contains 20,976 basic Chinese characters in the range U+4E00 through U+9FEF. The block not only includes characters used in the Chinese writing system but also kanji used in the Japanese writing system and hanja, whose use is diminishing in Korea. Many characters in this block are used in all three writing systems, while others are in only one or two of the three. Chinese characters were also used in Vietnam's Nôm script (now obsolete). The first 20,902 characters in the block are arranged according to the Kangxi Dictionary ordering of radicals. In this system the characters written with the fewest strokes are listed first. The remaining characters were added later, and so are not in radical order.

The block is the result of Han unification,[2] which was somewhat controversial in the Far East.[3] Since Chinese, Japanese and Korean characters were coded in the same location, the appearance of a selected glyph could depend on the particular font being used. However, the source separation rule states that characters encoded separately in an earlier character set would remain separate in the new Unicode encoding.[4]

Using variation selectors, it is possible to specify certain variant CJK ideograms within Unicode. The Adobe-Japan1 character set, which has 14,683 ideographic variation sequences,[5] is an extreme example of the use of variation selectors.[6]
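As a minimal sketch of how such a sequence is put together: an ideographic variation sequence is a base ideograph followed by one selector from the Variation Selectors Supplement (VS17 to VS256, U+E0100 to U+E01EF). The helper names `make_ivs` and `is_ivs` are hypothetical, and the base character U+845B is used for illustration only. Whether a particular pair is actually registered (for example in the Adobe-Japan1 collection) has to be checked against the Ideographic Variation Database, which this sketch does not consult.

```python
# Variation Selectors Supplement: VS17 (U+E0100) through VS256 (U+E01EF).
IVS_FIRST = 0xE0100
IVS_LAST = 0xE01EF


def make_ivs(base: str, vs_number: int) -> str:
    """Return `base` followed by variation selector VS<vs_number> (17..256)."""
    if len(base) != 1:
        raise ValueError("base must be a single code point")
    selector = IVS_FIRST + (vs_number - 17)
    if not IVS_FIRST <= selector <= IVS_LAST:
        raise ValueError("ideographic variation selectors are VS17..VS256")
    return base + chr(selector)


def is_ivs(sequence: str) -> bool:
    """True if `sequence` is one code point plus one selector in VS17..VS256.

    This checks the shape of the sequence only, not whether it is registered.
    """
    return len(sequence) == 2 and IVS_FIRST <= ord(sequence[1]) <= IVS_LAST


# Illustrative pair only: the same base character with two different selectors.
variant_a = make_ivs("\u845b", 17)
variant_b = make_ivs("\u845b", 18)
print([f"U+{ord(c):04X}" for c in variant_a])  # ['U+845B', 'U+E0100']
```

Both strings share the same base code point, so software that ignores variation selectors still sees the same character. Only a font with glyphs for the requested sequence will render the two differently.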

Charts

4E00-62FF, 6300-77FF, 7800-8CFF, 8D00-9FFF.

Sources

Note: Most characters appear in multiple sources, making the sum of individual character counts (102,424) far more than the number of encoded characters (20,976).[7]

Country or region | Code | Standard[8] | Character count | Total
China | G0 | GB 2312-80 | 6,763 | 20,916
China | G1 | GB 12345-90 | 2,202 |
China | G3 | GB 7589-87 traditional form | 4,834 |
China | G5 | GB 7590-87 traditional form | 2,841 |
China | G7 | Modern Chinese general character chart (Simplified Chinese: 现代汉语通用字表) | 42 |
China | G8 | GB8565-88 | 290 |
China | G9 | GB18030-2000 | 8 |
China | GCE | National Academy for Educational Research | 3 |
China | GE | GB16500-95 | 3,779 |
China | GFC | Modern Chinese Standard Dictionary (现代汉语规范词典) | 2 |
China | GGFZ | General Chinese Standard Dictionary (通用规范汉字字典) | 1 |
China | GH | GB/T 15564-1995 | 59 |
China | GHZ | Hanyu Da Zidian | 1 |
China | GK | GB 12052-89 | 89 |
China | GKX | Kangxi Dictionary | 2 |
Hong Kong | H | Hong Kong Supplementary Character Set | 2,292 | 15,375
Hong Kong | HB0 | Computer Chinese Glyph and Character Code Mapping Table, Technical Report C-26 (電腦用中文字型與字碼對照表, 技術通報C-26) | 9 |
Hong Kong | HB1 | Big-5, Level 1 | 5,401 |
Hong Kong | HB2 | Big-5, Level 2 | 7,650 |
Hong Kong | HD | Hong Kong Supplementary Character Set, 2016 | 23 |
Japan | J0 | JIS X 0208-1990 | 6,356 | 12,565
Japan | J1 | JIS X 0212-1990 | 3,058 |
Japan | J13 | JIS X 0213:2004 level-3 characters replacing J1 characters | 1,037 |
Japan | J13A | JIS X 0213:2004 level-3 character addendum from JIS X 0213:2000 level-3 replacing J1 character | 2 |
Japan | J14 | JIS X 0213:2004 level-4 characters replacing J1 characters | 1,704 |
Japan | J3 | JIS X 0213-2004 Level 3 | 95 |
Japan | J3A | JIS X 0213-2004 Level 3 addendum | 7 |
Japan | J4 | JIS X 0213-2004 Level 4 | 301 |
Japan | JARIB | ARIB STD-B24 | 3 |
Japan | JMJ | Character Information Development and Maintenance Project for e-Government "MojiJoho-Kiban Project" (文字情報基盤整備事業) | 2 |
North Korea | KP0 | KPS 9566-97 | 4,652 | 15,011
North Korea | KP1 | KPS 10721-2000 | 10,359 |
South Korea | K0 | KS C 5601-87 (now KS X 1001:2004) | 4,620 | 15,392
South Korea | K1 | KS C 5657-91 (now KS X 1002:2004) | 2,856 |
South Korea | K2 | PKS C 5700-1:1994 | 7,911 |
South Korea | K3 | PKS C 5700-2:1994 | 1 |
South Korea | K4 | PKS 5700-3:1998 | 4 |
Taiwan | T1 | CNS 11643-1992 plane 1 | 5,413 | 18,372
Taiwan | T2 | CNS 11643-1992 plane 2 | 7,650 |
Taiwan | T3 | CNS 11643-1992 plane 3 | 4,144 |
Taiwan | T4 | CNS 11643-1992 plane 4 | 895 |
Taiwan | T5 | CNS 11643-1992 plane 5 | 64 |
Taiwan | T6 | CNS 11643-1992 plane 6 | 31 |
Taiwan | T7 | CNS 11643-1992 plane 7 | 16 |
Taiwan | TC | CNS 11643-1992 plane 12 | 1 |
Taiwan | TF | CNS 11643-1992 plane 15 | 158 |
Vietnam | V0 | TCVN 5773-1993 | 593 | 4,759
Vietnam | V1 | TCVN 6056-1995 | 3,310 |
Vietnam | V2 | VHN 01-1998 | 763 |
Vietnam | V3 | VHN 02-1998 | 91 |
Vietnam | VU | Vietnamese horizontal extensions | 2 |
n/a | UTC | UTC sources | 34 | 34

In Unicode 4.1, 14 HKSCS-2004 characters and 8 GB 18030 characters were assigned to the code points U+9FA6 through U+9FBB.

CJK Unified Ideographs Extension A

The block named CJK Unified Ideographs Extension A (3400–4DBF) contains 6,582 additional characters in the range U+3400 through U+4DB5 that were added in Unicode 3.0 (1999).

Charts

3400-4DBF.

Sources

Note: Most characters appear in more than one source, making the sum of individual character counts (18,753) far more than the number of encoded characters (6,582).[7]

Country or region | Code | Standard[8] | Character count | Total
China | G3 | GB 7589-87 traditional form | 2,391 | 6,192
China | G5 | GB 7590-87 traditional form | 1,226 |
China | G7 | Modern Chinese general character chart | 120 |
China | GHZ | Hanyu Da Zidian | 339 |
China | GKX | Kangxi Zidian | 1,890 |
China | GS | Singapore Chinese characters | 226 |
Hong Kong | H | Hong Kong Supplementary Character Set | 572 | 572
Japan | J3 | JIS X 0213-2004 Level 3 | 2 | 738
Japan | J4 | JIS X 0213-2004 Level 4 | 78 |
Japan | JA | Japanese IT Vendors Contemporary Ideographs, 1993 | 574 |
Japan | JA3 | JIS X 0213:2004 level-3 characters replacing JA characters | 17 |
Japan | JA4 | JIS X 0213:2004 level-4 characters replacing JA characters | 67 |
North Korea | KP0 | KPS 9566-97 | 1 | 3,189
North Korea | KP1 | KPS 10721-2000 | 3,188 |
South Korea | K3 | PKS C 5700-2:1994 | 1,833 | 1,835
South Korea | K4 | PKS 5700-3:1998 | 2 |
Taiwan | T3 | CNS 11643-1992 plane 3 | 2,178 | 5,906
Taiwan | T4 | CNS 11643-1992 plane 4 | 2,917 |
Taiwan | T5 | CNS 11643-1992 plane 5 | 395 |
Taiwan | T6 | CNS 11643-1992 plane 6 | 197 |
Taiwan | T7 | CNS 11643-1992 plane 7 | 133 |
Taiwan | TF | CNS 11643-1992 plane 15 | 86 |
Vietnam | V0 | TCVN 5773-1993 | 138 | 308
Vietnam | V2 | VHN 01-1998 | 151 |
Vietnam | V3 | VHN 02-1998 | 19 |
n/a | UTC | UTC sources | 13 | 13

CJK Unified Ideographs Extension B

The block named CJK Unified Ideographs Extension B (20000–2A6DF) contains 42,711 characters in the range U+20000 through U+2A6D6 that were added in Unicode 3.1 (2001). These include most of the characters used in the Kangxi Dictionary that are not in the basic CJK Unified Ideographs block, as well as many Nôm characters that were formerly used to write Vietnamese.

Charts

20000-215FF, 21600-230FF, 23100-245FF, 24600-260FF, 26100-275FF, 27600-290FF, 29100-2A6DF.

Sources

Note: Many characters appear in more than one source, making the sum of individual character counts (73,955) far more than the number of encoded characters (42,711).[7]

Country or region | Code | Standard[8] | Character count | Total
China | G3 | GB 7589-87 traditional form | 1 | 30,525
China | G4K | Siku Quanshu | 522 |
China | G9 | GB18030-2000 | 6 |
China | GBK | Encyclopedia of China | 86 |
China | GCH | Cihai | 247 |
China | GCY | Ciyuan | 66 |
China | GFZ | Founder Press System | 65 |
China | GHC | Hanyu Da Cidian | 553 |
China | GHZ | Hanyu Da Zidian | 10,510 |
China | GKX | Kangxi Dictionary | 18,469 |
Hong Kong | H | Hong Kong Supplementary Character Set | 1,703 | 1,703
Japan | J3 | JIS X 0213-2004 Level 3 | 25 | 303
Japan | J3A | JIS X 0213-2004 Level 3 addendum | 1 |
Japan | J4 | JIS X 0213-2004 Level 4 | 277 |
Macau | MAC | Macao Information System Character Set (澳門資訊系統字集) | 1 | 1
North Korea | KP1 | KPS 10721-2000 | 5,766 | 5,766
South Korea | K4 | PKS 5700-3:1998 | 166 | 166
Taiwan | T3 | CNS 11643-1992 plane 3 | 25 | 30,178
Taiwan | T4 | CNS 11643-1992 plane 4 | 3,408 |
Taiwan | T5 | CNS 11643-1992 plane 5 | 8,111 |
Taiwan | T6 | CNS 11643-1992 plane 6 | 5,934 |
Taiwan | T7 | CNS 11643-1992 plane 7 | 6,299 |
Taiwan | TF | CNS 11643-1992 plane 15 | 6,401 |
Vietnam | V0 | TCVN 5773-1993 | 1,515 | 5,260
Vietnam | V2 | VHN 01-1998 | 2,290 |
Vietnam | V3 | VHN 02-1998 | 425 |
Vietnam | V4 | Dictionary on Nom (Từ điển chữ Nôm); Dictionary on Nom of Tay ethnic (Từ điển chữ Nôm Tày); Lookup Table for Nom in the South (Bảng tra chữ Nôm miền Nam) | 1 |
Vietnam | VU | Vietnamese horizontal extensions | 1,029 |
n/a | UCI | UTC sources | 4 | 53
n/a | USAT | SAT (Taishō Tripiṭaka digitization project) | 1 |
n/a | UTC | UTC sources | 48 |

CJK Unified Ideographs Extension C

The block named CJK Unified Ideographs Extension C (2A700–2B73F) contains 4,149 characters in the range U+2A700 through U+2B734 that were added in Unicode 5.2 (2009).

Charts

2A700-2B73F.

Sources

Note: Some characters appear in more than one source, making the sum of individual character counts (4,534) more than the number of encoded characters (4,149).[7]

Country or region | Code | Standard[8] | Character count | Total
China | GBK | Encyclopedia of China | 74 | 1,120
China | GCH | Cihai | 264 |
China | GCY | Ciyuan | 1 |
China | GCYY | Chinese Academy of Surveying and Mapping ideographs | 55 |
China | GFZ | Founder Press System | 1 |
China | GGH | Old Chinese Dictionary (古代汉语词典) | 51 |
China | GHC | Hanyu Da Cidian | 14 |
China | GHZ | Hanyu Da Zidian | 1 |
China | GJZ | Commercial Press ideographs | 61 |
China | GKX | Kangxi Dictionary | 6 |
China | GXC | Xiandai Hanyu Cidian | 25 |
China | GZFY | Dictionary of Chinese Dialects (汉语方言大辞典) | 202 |
China | GZJW | Collections of Bronze Inscriptions from Yin and Zhou Dynasties (殷周金文集成引得) | 365 |
Hong Kong | H | Hong Kong Supplementary Character Set | 1 | 1
Japan | JK | Japanese Kokuji Collection | 367 | 367
Macau | MAC | Macao Information System Character Set (澳門資訊系統字集) | 16 | 16
North Korea | KP1 | KPS 10721-2000 | 8 | 8
South Korea | K5 | Korean IRG Hanja Character Set | 404 | 404
Taiwan | TC | CNS 11643-1992 plane 12 | 634 | 1,750
Taiwan | TD | CNS 11643-1992 plane 13 | 766 |
Taiwan | TE | CNS 11643-1992 plane 14 | 350 |
Vietnam | V1 | TCVN 6056:1995 | 1 | 787
Vietnam | V4 | Dictionary on Nom (Từ điển chữ Nôm); Dictionary on Nom of Tay ethnic (Từ điển chữ Nôm Tày); Lookup Table for Nom in the South (Bảng tra chữ Nôm miền Nam) | 784 |
Vietnam | VU | Vietnamese horizontal extensions | 2 |
n/a | UCI | UTC sources | 1 | 81
n/a | UTC | UTC sources | 80 |

CJK Unified Ideographs Extension D

The block named CJK Unified Ideographs Extension D (2B740–2B81F) contains 222 characters in the range U+2B740 through U+2B81D that were added in Unicode 6.0 (2010).

Charts

2B740–2B81F.

Sources

Note: Some characters appear in more than one source, making the sum of individual character counts (226) more than the number of encoded characters (222).[7]

Country or region | Code | Standard[8] | Character count | Total
China | GCH | Cihai | 1 | 76
China | GIDC | ID System of the Ministry of Public Security of China | 32 |
China | GXC | Xiandai Hanyu Cidian | 4 |
China | GZH | Zhonghua Zihai | 39 |
Japan | JH | Hanyo-Denshi Program (汎用電子情報交換環境整備プログラム) | 107 | 107
Taiwan | TB | CNS 11643-1992 plane 11 | 24 | 24
n/a | UTC | UTC sources | 19 | 19

CJK Unified Ideographs Extension E

The block named CJK Unified Ideographs Extension E (2B820–2CEAF) contains 5,762 characters in the range U+2B820 through U+2CEA1 that were added in Unicode 8.0 (2015).

Charts

2B820–2CEAF.

Sources

Note: Some characters appear in more than one source, making the sum of individual character counts (5,793) more than the number of encoded characters (5,762).[7]

Country or region | Code | Standard[8] | Character count | Total
China | GBK | Encyclopedia of China | 15 | 2,814
China | GCH | Cihai | 112 |
China | GCY | Ciyuan | 3 |
China | GCYY | Chinese Academy of Surveying and Mapping ideographs | 98 |
China | GDZ | Geology Press ideographs | 1 |
China | GGH | Old Chinese Dictionary (古代汉语词典) | 175 |
China | GHC | Hanyu Da Cidian | 7 |
China | GIDC | ID System of the Ministry of Public Security of China | 36 |
China | GJZ | Commercial Press ideographs | 147 |
China | GKX | Kangxi Dictionary | 22 |
China | GRM | People's Daily ideographs | 3 |
China | GWZ | Hanyu Da Cidian Press ideographs | 12 |
China | GXC | Xiandai Hanyu Cidian | 57 |
China | GXH | Xinhua Zidian | 4 |
China | GZFY | Hanyu Fangyan Dacidian (汉语方言大辞典, Dictionary of Chinese Dialects) | 712 |
China | GZJW | Collections of Bronze Inscriptions from Yin and Zhou Dynasties (殷周金文集成引得) | 1,410 |
Japan | JK | Japanese Kokuji Collection | 415 | 415
Macau | MAC | Macao Information System Character Set (澳門資訊系統字集) | 48 | 48
Taiwan | TC | CNS 11643-1992 plane 12 | 323 | 1,257
Taiwan | TD | CNS 11643-1992 plane 13 | 595 |
Taiwan | TE | CNS 11643-1992 plane 14 | 339 |
Vietnam | V4 | Dictionary on Nom (Từ điển chữ Nôm); Dictionary on Nom of Tay ethnic (Từ điển chữ Nôm Tày); Lookup Table for Nom in the South (Bảng tra chữ Nôm miền Nam) | 1,028 | 1,031
Vietnam | VU | Vietnamese horizontal extensions | 3 |
n/a | UCI | UTC sources | 1 | 228
n/a | UTC | UTC sources | 227 |

CJK Unified Ideographs Extension F

The block named CJK Unified Ideographs Extension F (2CEB0–2EBEF) contains 7,473 characters in the range U+2CEB0 through U+2EBE0 that were added in Unicode 10.0 (2017). It includes more than 1,000 Sawndip characters for Zhuang.

Charts

2CEB0–2EBEF.

Sources

Note: Some characters appear in more than one source, making the sum of individual character counts (7,650) more than the number of encoded characters (7,473).[7]

Country or region | Code | Standard[8] | Character count | Total
China | GCY | Ciyuan | 122 | 1,304
China | GFC | Modern Chinese Standard Dictionary (现代汉语规范词典) | 27 |
China | GIDC | ID System of the Ministry of Public Security of China | 1 |
China | GLGYJ | Zhuang Liao Songs Research (壮族嘹歌研究) | 1 |
China | GOCD | Oxford English-Chinese Chinese-English Dictionary (牛津英汉汉英词典) | 2 |
China | GPGLG | Zhuang Folk Song Culture Series - Pingguo County Liao Songs (壮族民歌文化丛书・平果嘹歌) | 70 |
China | GXHZ | Xinhua Big Dictionary (新华大字典) | 51 |
China | GZ | Ancient Zhuang Character Dictionary (古壮字字典) | 995 |
China | GZJW | Collections of Bronze Inscriptions from Yin and Zhou Dynasties (殷周金文集成引得) | 33 |
China | GZYS | Chinese Ancient Ethnic Characters Research (中国民族古文字研究) | 2 |
Japan | JMJ | Character Information Development and Maintenance Project for e-Government "MojiJoho-Kiban Project" (文字情報基盤整備事業) | 1,645 | 1,645
South Korea | KC | Korean History On-Line (한국 역사 정보 통합 시스템) | 1,793 | 1,793
Macau | MAC | Macao Information System Character Set (澳門資訊系統字集) | 22 | 22
Vietnam | VU | Vietnamese horizontal extensions | 1 | 1
n/a | USAT | SAT (Taishō Tripiṭaka digitization project) | 2,884 | 2,885
n/a | UTC | UTC sources | 1 |

CJK Compatibility Ideographs

The block named CJK Compatibility Ideographs (F900–FAFF) was created to retain round-trip compatibility with other standards. Only twelve of its characters have the "Unified Ideograph" property: U+FA0E, FA0F, FA11, FA13, FA14, FA1F, FA21, FA23, FA24, FA27, FA28 and FA29.[1] None of the other characters in this and other "Compatibility" blocks relate to CJK Unification.
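The assigned ranges quoted in this article, together with the twelve unified characters listed above, are enough to test whether a code point is a CJK Unified Ideograph as of Unicode 11.0 and to recompute the total of 87,887. The following is a minimal sketch: the names `UNIFIED_RANGES`, `COMPAT_UNIFIED`, `unified_block` and `total_unified` are hypothetical, and the ranges are hard-coded from the text rather than read from PropList.txt, so they do not cover later Unicode versions.

```python
# Assigned (inclusive) ranges as quoted in this article for Unicode 11.0.
UNIFIED_RANGES = {
    "CJK Unified Ideographs": (0x4E00, 0x9FEF),
    "CJK Unified Ideographs Extension A": (0x3400, 0x4DB5),
    "CJK Unified Ideographs Extension B": (0x20000, 0x2A6D6),
    "CJK Unified Ideographs Extension C": (0x2A700, 0x2B734),
    "CJK Unified Ideographs Extension D": (0x2B740, 0x2B81D),
    "CJK Unified Ideographs Extension E": (0x2B820, 0x2CEA1),
    "CJK Unified Ideographs Extension F": (0x2CEB0, 0x2EBE0),
}

# The twelve characters in CJK Compatibility Ideographs that are unified.
COMPAT_UNIFIED = frozenset({
    0xFA0E, 0xFA0F, 0xFA11, 0xFA13, 0xFA14, 0xFA1F,
    0xFA21, 0xFA23, 0xFA24, 0xFA27, 0xFA28, 0xFA29,
})


def unified_block(cp: int):
    """Return the block name if `cp` is a unified ideograph, else None."""
    for name, (first, last) in UNIFIED_RANGES.items():
        if first <= cp <= last:
            return name
    if cp in COMPAT_UNIFIED:
        return "CJK Compatibility Ideographs"
    return None


def total_unified() -> int:
    """Count every code point covered by the ranges plus the twelve extras."""
    in_ranges = sum(last - first + 1 for first, last in UNIFIED_RANGES.values())
    return in_ranges + len(COMPAT_UNIFIED)


print(total_unified())         # 87887
print(unified_block(0x9FC3))   # CJK Unified Ideographs
print(unified_block(0xF900))   # None
```

The per-block sizes computed this way (20,976; 6,582; 42,711; 4,149; 222; 5,762; 7,473) agree with the counts given in each block's description.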

Charts

F900–FAFF.

Sources

Note: Some characters appear in more than one source, making the sum of individual character counts (34) more than the number of encoded Unified characters (12).[7]

Country or region | Code | Standard[8] | Character count | Total
China | G9 | GB18030-2000 | 12 | 12
Japan | J3 | JIS X 0213-2004 Level 3 | 3 | 8
Japan | J4 | JIS X 0213-2004 Level 4 | 3 |
Japan | JA | Japanese IT Vendors Contemporary Ideographs, 1993 | 1 |
Japan | JA3 | JIS X 0213:2004 level-3 characters replacing JA characters | 1 |
Taiwan | TF | CNS 11643-1992 plane 15 | 1 | 1
Vietnam | V2 | VHN 01-1998 | 1 | 1
n/a | UTC | UTC sources | 12 | 12

UTC Sources

The Ideographic Rapporteur Group (IRG) bears the formal responsibility for developing extensions to the encoded repertoires of unified CJK ideographs. The Unicode Consortium participates in this group as a liaison member of ISO. The characters submitted by the Unicode Technical Committee bear the prefix "UTC". All CJK Unified Ideographs in ISO/IEC 10646 are required to have at least one source identifier. Changes to IRG source information, however, can leave a given ideograph without any such source. In such cases, the ideograph is included in the U-source database to guarantee that it has at least one source. Such ideographs are indicated by a source prefix of "UCI" instead of "UTC".[9]

The UTC sources consist of the following:

  • ABC Chinese-English Dictionary by John DeFrancis
  • The Adobe-CNS1 glyph collection
  • The Adobe-Japan1 glyph collection
  • A Complete Checklist of Species and Subspecies of Chinese Birds (中国鸟类系统检索)
  • The Great Nom Dictionary (Đại Tự Điển Chữ Nôm)
  • Annotations to Shuowen Jiezi (annotated by Duan Yucai)
  • GB18030-2000
  • Required Character List Supplied by The Church of Jesus Christ of Latter-day Saints (Hong Kong)
  • New Commercial Dictionary (商务新词典), Hong Kong
  • Defect reports filed against the Unicode Standard or other direct communication with the Unicode editorial committee
  • Unicode Technical Committee (UTC) documents
  • Modern Chinese Dictionary (现代汉语词典), by Chinese Academy of Social Sciences, Linguistics Research Institute, Dictionary Editorial Office
  • Working Group (WG2) documents
  • Wenlin (文林) http://www.wenlin.com/

Known issues

Disunification

U+4039

The character U+4039 (䀹) was a unification of two different characters (one with jiā 夾 phonetic and one with shǎn 㚒 phonetic) until Unicode 5.0. However, they were lexically different characters that should not have been unified; they have different pronunciations and different meanings.

The proposal to disunify U+4039[10] was accepted, and the new character was encoded at U+9FC3 (鿃) in Unicode 5.1.

Three other glyphs in Extension B

In CJK Unified Ideographs Extension B, some characters were incorrectly unified with others. These include U+2017B (𠅻), U+204AF (𠒯) and U+24CB2 (𤲲). The first two wrongly unify glyphs from Mainland Chinese and Vietnamese sources, while the last wrongly unifies glyphs from Mainland Chinese and Taiwanese sources.[11]

Unifiable variants and exact duplicates in Extension B

Also in CJK Unified Ideographs Extension B, hundreds of glyph variants were encoded.[12] In addition to the deliberate encoding of close glyph variants, six exact duplicates (where the same character has inadvertently been encoded twice) and two semi-duplicates (where the CJK-B character represents a de facto disunification of two glyph forms unified in the corresponding BMP character) were encoded by mistake:[13]

  • U+34A8 㒨 = U+20457 𠑗 : U+20457 is the same as the China-source glyph for U+34A8, but it is significantly different from the Taiwan-source glyph for U+34A8
  • U+3DB7 㶷 = U+2420E 𤈎 : same glyph shapes
  • U+8641 虁 = U+27144 𧅄 : U+27144 is the same as the Korean-source glyph for U+8641, but it is significantly different from the Chinese Mainland-, Taiwan- and Japan-source glyphs for U+8641
  • U+204F2 𠓲 = U+23515 𣔕 : same glyph shapes, but ordered under different radicals
  • U+249BC 𤦼 = U+249E9 𤧩 : same glyph shapes
  • U+24BD2 𤯒 = U+2A415 𪐕 : same glyph shapes, but ordered under different radicals
  • U+26842 𦡂 = U+26866 𦡦 : same glyph shapes
  • U+FA23 﨣 = U+27EAF 𧺯 : same glyph shapes (U+FA23 﨣 is a unified CJK ideograph, despite its name "CJK COMPATIBILITY IDEOGRAPH-FA23.")
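Software that searches or compares CJK text may want to treat the pairs listed above as equivalent. The following is a minimal sketch of such folding, assuming that the first code point of each pair is the preferred form. That choice, and the names `DUPLICATES` and `fold_duplicates`, are illustrative and not defined by Unicode. For the two semi-duplicates (U+34A8 and U+8641) the fold discards a glyph distinction, so it is suited to matching rather than to storage.

```python
# Maps the second code point of each pair listed above to the first one.
# Treating the first member as "preferred" is an assumption of this sketch.
DUPLICATES = {
    0x20457: 0x34A8,   # semi-duplicate
    0x2420E: 0x3DB7,
    0x27144: 0x8641,   # semi-duplicate
    0x23515: 0x204F2,
    0x249E9: 0x249BC,
    0x2A415: 0x24BD2,
    0x26866: 0x26842,
    0x27EAF: 0xFA23,
}


def fold_duplicates(text: str) -> str:
    """Replace each duplicate code point with its listed counterpart."""
    return text.translate(DUPLICATES)


def same_after_folding(a: str, b: str) -> bool:
    """Compare two strings while ignoring the duplicate encodings."""
    return fold_duplicates(a) == fold_duplicates(b)


print(same_after_folding(chr(0x3DB7), chr(0x2420E)))  # True
```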

Other CJK Ideographs in Unicode, not Unified

Apart from the seven blocks of "Unified Ideographs," Unicode has about a dozen more blocks with non-unified CJK characters. These are mainly CJK radicals, strokes, punctuation, marks, symbols and compatibility characters. Although some characters have their (decomposable) counterparts in other blocks, the usages can be different.

Four blocks of compatibility characters are included for compatibility with legacy text handling systems and older character sets.

They include forms of characters for vertical text layout and rich text characters that Unicode recommends handling through other means. Therefore, their use is discouraged.

Usually, compatibility characters are those that would not have been encoded except for compatibility and round-trip convertibility with other standards. However, the number of CJK ideographs in non-Unicode standards is too large to fit into Unicode's CJK Compatibility Ideographs blocks. Instead, code points are assigned when the affected characters have been approved by the Unicode Consortium but have not yet been assigned code points within the CJK Unified Ideographs blocks.

Font support

The blocks CJK Unified Ideographs and CJK Unified Ideographs Extension A, being parts of the Basic Multilingual Plane, are supported by the majority of CJK fonts. However, Japanese and Korean fonts usually have fewer characters (about 13,000 and 8,000, respectively) than Chinese fonts. Extensions B, C and D are supported by the additional fonts MingLiU-ExtB, MingLiU_HKSCS-ExtB, PMingLiU-ExtB and SimSun-ExtB, included in Microsoft Windows since Vista.[14]
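Whether a block lies in the Basic Multilingual Plane or in the Supplementary Ideographic Plane follows directly from its code points: the plane number is the code point divided by 0x10000. The sketch below illustrates this with the first code point of each block named in this article. The function name `plane_of` and the table `BLOCK_STARTS` are hypothetical.

```python
# First code point of each unified block, as quoted in this article.
BLOCK_STARTS = {
    "CJK Unified Ideographs": 0x4E00,
    "Extension A": 0x3400,
    "Extension B": 0x20000,
    "Extension C": 0x2A700,
    "Extension D": 0x2B740,
    "Extension E": 0x2B820,
    "Extension F": 0x2CEB0,
}

PLANE_NAMES = {0: "BMP", 2: "SIP"}


def plane_of(cp: int) -> int:
    """Return the Unicode plane number (0..16) of a code point."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    return cp >> 16  # each plane spans 0x10000 code points


for name, start in BLOCK_STARTS.items():
    print(f"{name}: {PLANE_NAMES[plane_of(start)]}")
```

Only the basic block and Extension A report BMP. Extensions B through F all fall in plane 2, which is why they need the additional fonts mentioned above.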

Unicode version history

CJK Unified Ideographs additions per Unicode version
Unicode version | Addition | Plane | Characters added | Total characters
1.0 (1991) | CJK Unified Ideographs | Basic Multilingual Plane (BMP) | 20,902 | 20,914
1.0 (1991) | CJK Compatibility Ideographs | BMP | 12 |
3.0 (1999) | CJK Unified Ideographs Extension A | BMP | 6,582 | 27,496
3.1 (2001) | CJK Unified Ideographs Extension B | Supplementary Ideographic Plane (SIP) | 42,711 | 70,207
4.1 (2005) | CJK Unified Ideographs: Ideographs from HKSCS-2004 and GB 18030-2000 not in ISO 10646 | BMP | 22 | 70,229
5.1 (2008) | CJK Unified Ideographs: Ideographs from Adobe Japan and disunification of U+4039 | BMP | 8 | 70,237
5.2 (2009) | CJK Unified Ideographs Extension C | SIP | 4,149 | 74,394
5.2 (2009) | 8 other characters from ARIB #47, #95, #93 and HKSCS | BMP | 8 |
6.0 (2010) | CJK Unified Ideographs Extension D | SIP | 222 | 74,616
6.1 (2012) | 1 character corresponding to Adobe-Japan1-6 CID+20156 | BMP | 1 | 74,617
8.0 (2015) | CJK Unified Ideographs Extension E | SIP | 5,762 | 80,388
8.0 (2015) | 9 other characters | BMP | 9 |
10.0 (2017) | CJK Unified Ideographs Extension F | SIP | 7,473 | 87,882
10.0 (2017) | 21 other characters | BMP | 21 |
11.0 (2018) | CJK Unified Ideographs | BMP | 5 | 87,887
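The running totals in the version history can be recomputed from the per-version additions. The sketch below transcribes the "Characters added" and "Total" figures from the table and checks that each stated total equals the cumulative sum. The names `ADDITIONS` and `running_totals` are hypothetical.

```python
# (version, characters added in that version, total stated in the table)
ADDITIONS = [
    ("1.0", [20902, 12], 20914),
    ("3.0", [6582], 27496),
    ("3.1", [42711], 70207),
    ("4.1", [22], 70229),
    ("5.1", [8], 70237),
    ("5.2", [4149, 8], 74394),
    ("6.0", [222], 74616),
    ("6.1", [1], 74617),
    ("8.0", [5762, 9], 80388),
    ("10.0", [7473, 21], 87882),
    ("11.0", [5], 87887),
]


def running_totals(additions):
    """Return (version, cumulative total) pairs computed from the additions."""
    total = 0
    result = []
    for version, added, _stated in additions:
        total += sum(added)
        result.append((version, total))
    return result


for (version, computed), (_, _, stated) in zip(running_totals(ADDITIONS), ADDITIONS):
    status = "ok" if computed == stated else "MISMATCH"
    print(f"Unicode {version}: computed {computed}, stated {stated} ({status})")
```

Every row reports "ok", and the final figure of 87,887 matches the total given for Unicode 11.0 at the top of the article.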

Notes

  1. "Unicode 11.0 UCD: PropList.txt". 2018-03-15. Retrieved 2018-06-06.
  2. The Unicode Standard 4.0, Appendix A - Han Unification History
  3. Suzanne Topping, "The secret life of Unicode"
  4. "Chapter 11 - East Asian scripts", The Unicode standard, 4.0.
  5. "Ideographic Variation Database". 2017-12-12. Retrieved 2016-08-15.
  6. PRI 108: Combined registration of the Adobe Japan1 collection and of sequences in that collection
  7. "Unihan_IRGSources.txt (from Unihan.zip)". 2018-05-18. Retrieved 2018-06-05.
  8. "UAX #38: Unicode Han Database (Unihan)". Unicode Consortium. 2018-05-18.
  9. Jenkins, John H. (2018-05-18). "UAX #45: U-source Ideographs". Unicode Consortium.
  10. Andrew West and John Jenkins, proposal of disunification of U+4039
  11. Eiso Chan (陈永聪), Comments on four error glyphs on CJK Unified Ideographs Ext B & E.
  12. unifiable glyph variants
  13. Cook, Richard (6 October 2003). "Defect Report on Duplicate Encoded CJK Forms" (PDF). ISO/IEC JTC1/SC2/WG2. Retrieved 2012-03-28.
  14. Lunde, Ken (2009). CJKV Information Processing. O'Reilly. pp. 633–634. ISBN 978-0-596-51447-1.
This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.