CESU-8

The Compatibility Encoding Scheme for UTF-16: 8-Bit (CESU-8) is a variant of UTF-8 that is described in Unicode Technical Report #26.[1] A Unicode code point from the Basic Multilingual Plane (BMP), i.e. a code point in the range U+0000 to U+FFFF, is encoded in the same way as in UTF-8. A Unicode supplementary character, i.e. a code point in the range U+10000 to U+10FFFF, is first represented as a surrogate pair, as in UTF-16, and then each of the two surrogate code points is encoded separately using the three-byte UTF-8 pattern. Therefore, CESU-8 needs six bytes (three bytes per surrogate) for each Unicode supplementary character, while UTF-8 needs only four.
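The two cases described above can be sketched in Python. This is a minimal illustration, not a standard-library codec: the function name cesu8_encode is hypothetical, and the sketch relies on Python's "surrogatepass" error handler to emit the three-byte form for surrogate code points.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode a str as CESU-8 (illustrative sketch; the name is hypothetical)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            # BMP code point: same bytes as UTF-8.
            # "surrogatepass" keeps lone surrogates from raising an error.
            out += ch.encode("utf-8", "surrogatepass")
        else:
            # Supplementary character: form the UTF-16 surrogate pair first.
            v = cp - 0x10000
            high = 0xD800 | (v >> 10)
            low = 0xDC00 | (v & 0x3FF)
            # Then encode each surrogate with the three-byte UTF-8 pattern.
            for surrogate in (high, low):
                out += chr(surrogate).encode("utf-8", "surrogatepass")
    return bytes(out)


if __name__ == "__main__":
    sample = "\U00010400"
    print(cesu8_encode(sample).hex(" "))    # CESU-8 bytes for U+10400
    print(sample.encode("utf-8").hex(" "))  # UTF-8 bytes for comparison
```

For a supplementary character the first line printed has six bytes and the second has four, which is the size difference noted above.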

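The encoding of a Unicode supplementary character works out to 11101101 1010yyyy 10xxxxxx 11101101 1011xxxx 10xxxxxx, where yyyy is the four-bit value obtained by subtracting one from the top five bits (bits 16 to 20) of the code point, and the x bits are the remaining 16 bits of the code point in order.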
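The bit pattern above can also be computed directly from the code point with shifts and masks, without forming the surrogates explicitly. The following is a sketch under that reading of the pattern; the function name cesu8_supplementary is an assumption made for illustration.

```python
def cesu8_supplementary(cp: int) -> bytes:
    """Build the six CESU-8 bytes for a supplementary code point (sketch)."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    yyyy = (cp >> 16) - 1  # top five bits minus one, fits in four bits
    return bytes([
        0xED,                        # 11101101
        0xA0 | yyyy,                 # 1010yyyy
        0x80 | ((cp >> 10) & 0x3F),  # 10xxxxxx (bits 15..10)
        0xED,                        # 11101101
        0xB0 | ((cp >> 6) & 0x0F),   # 1011xxxx (bits 9..6)
        0x80 | (cp & 0x3F),          # 10xxxxxx (bits 5..0)
    ])


if __name__ == "__main__":
    print(cesu8_supplementary(0x10400).hex(" "))
```

Subtracting one from the top five bits is the same step as subtracting 0x10000 from the code point before splitting it into the two 10-bit surrogate payloads.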

CESU-8 is not an official part of the Unicode Standard, because Unicode Technical Reports are informative documents only.[2] It should be used exclusively for internal processing and never for external data exchange.

Supporting CESU-8 in HTML documents is prohibited by the W3C[3][4] and WHATWG[5] HTML standards, as it would present a cross-site scripting vulnerability.[6]

CESU-8 is similar to Java's Modified UTF-8 but does not have the special encoding of the NUL character (U+0000).
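As an illustration of that one difference: Modified UTF-8 writes U+0000 as the two bytes C0 80 instead of a single 00 byte. Because every byte of a multi-byte CESU-8 sequence is 0x80 or higher, a 00 byte in CESU-8 data can only stand for U+0000, so a plain byte replacement is enough for this sketch. The function name cesu8_to_modified_utf8 is hypothetical.

```python
def cesu8_to_modified_utf8(data: bytes) -> bytes:
    """Convert CESU-8 bytes to Modified UTF-8 (illustrative sketch).

    The only change is that U+0000 becomes C0 80, so the output
    never contains a zero byte.
    """
    return data.replace(b"\x00", b"\xc0\x80")


if __name__ == "__main__":
    print(cesu8_to_modified_utf8(b"A\x00B").hex(" "))
```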

The Oracle database uses CESU-8 for its "UTF8" character set. Standard UTF-8 can be obtained using the character set "AL32UTF8" (since Oracle version 9.0).

Examples

Encoding     Unicode code point
             U+0045   U+0205   U+10400
Character    E        ȅ        𐐀
UTF-8        45       C8 85    F0 90 90 80
UTF-16       0045     0205     D801 DC00
CESU-8       45       C8 85    ED A0 81 ED B0 80
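The CESU-8 bytes in the example can be checked by decoding them back to text. The sketch below is deliberately lenient and is not a validating decoder: it uses Python's "surrogatepass" handler to read the three-byte surrogate forms, then round-trips through UTF-16 to join each surrogate pair. It would also accept ordinary four-byte UTF-8 sequences, which strict CESU-8 does not contain, and it raises an error on unpaired surrogates. The name cesu8_decode is hypothetical.

```python
def cesu8_decode(data: bytes) -> str:
    """Decode CESU-8 bytes to a str (lenient illustrative sketch)."""
    # Step 1: read the bytes as UTF-8 while allowing surrogate code points.
    with_surrogates = data.decode("utf-8", "surrogatepass")
    # Step 2: write the code units out as UTF-16 and decode them again,
    # which combines each high/low surrogate pair into one character.
    utf16 = with_surrogates.encode("utf-16-le", "surrogatepass")
    return utf16.decode("utf-16-le")


if __name__ == "__main__":
    text = cesu8_decode(bytes.fromhex("45 C8 85 ED A0 81 ED B0 80"))
    print([f"U+{ord(c):04X}" for c in text])
```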

References

  1. McGowan, Rick. "Unicode Technical Report #26 - Compatibility Encoding Scheme for UTF-16: 8-Bit (CESU-8)". Unicode Consortium.
  2. "About Unicode Technical Reports - Types of Unicode Technical Reports: UAX, UTS, UTR". Unicode Consortium.
  3. "8.2.2.3. Character encodings". HTML 5.1 Standard. W3C.
  4. "8.2.2.3. Character encodings". HTML 5 Standard. W3C.
  5. "12.2.3.3 Character encodings". HTML Living Standard. WHATWG.
  6. "<meta> - HTML". MDN Web Docs. Mozilla.
This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.