Punycode

Punycode is a representation of Unicode with the limited ASCII character subset used for Internet host names. Using Punycode, host names containing Unicode characters are transcoded to a subset of ASCII consisting of letters, digits, and hyphen, which is called the Letter-Digit-Hyphen (LDH) subset. For example, München (German name for Munich) is encoded as Mnchen-3ya.
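As a quick illustration, the example above can be reproduced with the "punycode" codec that ships with Python's standard library. This is only a sketch of one convenient way to experiment; the variable names are illustrative.

```python
# Round-trip a label through Python's built-in "punycode" codec.
label = "München"
encoded = label.encode("punycode")    # bytes using only letters, digits, hyphen
decoded = encoded.decode("punycode")  # back to the original Unicode string

print(encoded.decode("ascii"))
print(decoded)
```

The codec works on a single label at a time and does not add any prefix.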

While the Domain Name System (DNS) technically supports arbitrary sequences of octets in domain name labels, the DNS standards recommend the use of the LDH subset of ASCII conventionally used for host names, and require that string comparisons between DNS domain names be case-insensitive. The Punycode syntax is a method of encoding strings containing Unicode characters, such as internationalized domain names (IDNs), into the LDH subset of ASCII favored by DNS. It is specified in IETF Request for Comments 3492.[1]

Encoding procedure

As stated in RFC 3492, "Punycode is an instance of a more general algorithm called Bootstring, which allows strings composed from a small set of 'basic' code points to uniquely represent any string of code points drawn from a larger set." Punycode defines parameters for the general Bootstring algorithm to match the characteristics of Unicode text. This section demonstrates the procedure for Punycode encoding, using the example of the string "bücher" (Bücher is German for books), which is translated into the label "bcher-kva".

Separation of ASCII characters

First, all basic ASCII characters in the string are copied from input to output, skipping over any other characters. For example, "bücher" is copied to "bcher". If any characters were copied, an ASCII hyphen is then added to the output (e.g., "bücher" → "bcher-"). Since the hyphen is itself a basic character, it may already appear in the copied string before this additional one. This causes no ambiguity, because no later part of the encoding process can introduce another ASCII hyphen; the last ASCII hyphen, if any, marks the end of the basic characters.
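The separation step can be sketched in a few lines of Python. The helper names copy_basic and split_encoded are hypothetical, chosen only for this illustration; the second one shows how a decoder can rely on the last hyphen as the delimiter.

```python
def copy_basic(text: str) -> str:
    """Copy the ASCII characters of text and append the delimiter hyphen."""
    basic = "".join(ch for ch in text if ord(ch) < 128)
    # The delimiter is added only if at least one character was copied.
    return basic + "-" if basic else basic


def split_encoded(label: str) -> tuple[str, str]:
    """Split an encoded label at its last hyphen into (basic, extended)."""
    basic, _, extended = label.rpartition("-")
    return basic, extended


print(copy_basic("bücher"))
print(split_encoded("bcher-kva"))
```

If the label contains no hyphen at all, rpartition returns an empty basic part, which matches the case where no ASCII characters were copied.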

Encoding of non-ASCII character insertions as code numbers

The next part of the encoding process first requires an understanding of the decoder, which is a finite-state machine with two state variables, i and n. n is the code point that may be inserted next, and i is an index into the string ranging from zero (representing a potential insertion at the start) to the current length of the extended string (representing a potential insertion at the end).

i starts at zero, and n starts at 128 (the first non-ASCII code point). The states are visited in a fixed, strictly increasing order: each state change increments i, unless i is already at its maximum, in which case i is reset to zero and n is incremented by 1. At each state, the code point denoted by n is either inserted at position i or not.
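The state machine can be sketched as follows, assuming the number of states to skip before each insertion is already known. The function name apply_deltas and the list-of-integers input are hypothetical simplifications; a real decoder reads these numbers from the encoded text and checks for overflow.

```python
def apply_deltas(basic: str, deltas: list[int]) -> str:
    """Replay the decoder state machine for a list of skip counts."""
    out = list(basic)
    n, i = 128, 0                      # initial state
    for delta in deltas:
        i += delta                     # skip `delta` states in one jump
        n += i // (len(out) + 1)       # each full pass over the positions bumps n
        i %= len(out) + 1              # position within the current pass
        out.insert(i, chr(n))          # perform the insertion
        i += 1                         # move past the inserted character
    return "".join(out)


print(apply_deltas("bcher", [745]))
print(apply_deltas("bcher", [745, 5]))
```

Advancing by division and remainder is equivalent to stepping through the states one at a time, only faster.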

The code numbers generated by the encoder represent how many possibilities to skip before an insertion is made. There are six possible places to insert a character in the current string "bcher" (including before the first character and after the last one). The decoder must step n through 124 code points, from 128 (the first one after the end of ASCII) up to but not including "ü" (code point 252), and each of them could be inserted at any of the six positions. One further possible insertion of "ü" itself, at position zero before the "b", must also be skipped. The decoder therefore has to be told to skip a total of (6 × 124) + 1 = 745 possible insertions before reaching the required one. Once the character is inserted, there are seven possible places to insert another character.
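The counting argument above can be turned into a small encoder-side sketch that produces the code numbers for a whole string. The function name insertion_deltas is hypothetical; the loop follows the general shape of the encoding procedure in RFC 3492 but omits its overflow checks.

```python
def insertion_deltas(text: str) -> list[int]:
    """Return the skip counts the decoder needs to rebuild text."""
    cps = [ord(ch) for ch in text]
    h = sum(1 for cp in cps if cp < 128)   # characters handled so far
    n, delta, deltas = 128, 0, []
    while h < len(cps):
        m = min(cp for cp in cps if cp >= n)   # next code point to insert
        delta += (m - n) * (h + 1)             # skip whole passes for unused code points
        n = m
        for cp in cps:
            if cp < n:
                delta += 1                     # a position to skip
            elif cp == n:
                deltas.append(delta)           # insertion happens here
                delta = 0
                h += 1
        delta += 1
        n += 1
    return deltas


print(insertion_deltas("bücher"))
```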

Re-encoding of code numbers as ASCII sequences

Punycode uses generalized variable-length integers to represent these values. For example, this is how "kva" is used to represent the code number 745:

A number system with little-endian ordering is used which allows variable-length codes without separate delimiters: a digit lower than a threshold value marks that it is the most-significant digit, hence the end of the number. The threshold value depends on the position in the number and also on previous insertions, to increase efficiency. Correspondingly the weights of the digits vary.

In this case, a number system with 36 digits is used, with the case-insensitive 'a' through 'z' equal to the numbers 0 through 25, and '0' through '9' equal to 26 through 35. Thus "kva" corresponds to "10 21 0".

To decode this string of digits, the threshold starts out as 1 and the weight is 1. The first digit is the units digit; 10 with a weight of 1 equals 10. After this, the threshold value is adjusted. For the sake of simplicity, assume it is now 2. (Under the parameters of RFC 3492, the thresholds for the three digits of this first number are actually 1, 1 and 26; the total is the same because the final digit is 0.) The weight of the second digit is the previous weight times 36 minus the previous threshold value: 1 × (36 − 1) = 35. Therefore, the sum of the first two digits is (10 × 1) + (21 × 35). Since the second digit is not less than the threshold value of 2, there is more to come. The weight of the third digit is the previous weight times 36 minus the new threshold value: 35 × (36 − 2) = 35 × 34. The third digit in this example is 0, which is less than 2, meaning that it is the last (most significant) part of the number. Therefore, "kva" represents the number (10 × 1) + (21 × 35) + (0 × 35 × 34) = 745.
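The decoding walk-through can be checked with a short sketch that takes the thresholds as an explicit list, so that both the simplified values used in the text and other choices can be tried. The name decode_number and its signature are assumptions made for this illustration only.

```python
DIGITS = "abcdefghijklmnopqrstuvwxyz0123456789"  # 'a'..'z' = 0..25, '0'..'9' = 26..35


def decode_number(code: str, thresholds: list[int], base: int = 36) -> int:
    """Decode one little-endian variable-length integer."""
    value, weight = 0, 1
    for ch, t in zip(code.lower(), thresholds):
        digit = DIGITS.index(ch)
        value += digit * weight
        if digit < t:
            return value          # a digit below the threshold ends the number
        weight *= base - t        # weight of the next digit
    raise ValueError("number is not terminated")


print(decode_number("kva", [1, 2, 2]))    # simplified thresholds from the text
print(decode_number("kva", [1, 1, 26]))   # thresholds under RFC 3492's parameters
```

Both calls give the same total here because the last digit is 0, so its weight does not matter.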

The threshold itself is determined by an algorithm keeping it between 1 and 26 inclusive, meaning the last character of an encoding will always be alphabetic. The case can then be used to provide information about the original case of the string.
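A minimal sketch of the threshold rule and the matching encoder for one number is shown below. The constants (base 36, minimum 1, maximum 26, initial bias 72) are taken from RFC 3492 rather than from the text above, and the function names are hypothetical; the bias adaptation that RFC 3492 performs after each number is left out.

```python
DIGITS = "abcdefghijklmnopqrstuvwxyz0123456789"
BASE, TMIN, TMAX, INITIAL_BIAS = 36, 1, 26, 72   # parameters from RFC 3492


def threshold(position: int, bias: int = INITIAL_BIAS) -> int:
    """Threshold for the digit at the given zero-based position."""
    k = BASE * (position + 1)
    return min(max(k - bias, TMIN), TMAX)        # clamp to the range 1..26


def encode_number(value: int, bias: int = INITIAL_BIAS) -> str:
    """Encode one integer as a little-endian variable-length digit string."""
    out, position = [], 0
    while True:
        t = threshold(position, bias)
        if value < t:
            out.append(DIGITS[value])            # final digit, always a letter
            return "".join(out)
        out.append(DIGITS[t + (value - t) % (BASE - t)])
        value = (value - t) // (BASE - t)
        position += 1


print([threshold(p) for p in range(3)])
print(encode_number(745))
```

Because the final digit is always below a threshold of at most 26, it always maps to a letter, which is what leaves its case free to carry extra information.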

For the insertion of a second special character in "bücher", the first possibility is "büücher" with code "bcher-kvaa", the second "bücüher" with code "bcher-kvab", etc. After "bücherü" with code "bcher-kvae" comes "ýbücher" with code "bcher-kvaf" (different from "übücher" coded "bcher-jvab"), etc.
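These examples can be reproduced with Python's built-in "punycode" codec, which is assumed here to follow RFC 3492; the dictionary name codes is illustrative.

```python
# Encode each example string and collect the resulting labels.
examples = ["büücher", "bücüher", "bücherü", "ýbücher", "übücher"]
codes = {text: text.encode("punycode").decode("ascii") for text in examples}

for text, code in codes.items():
    print(text, code)
```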

To make the encoding and decoding algorithms simple, no attempt has been made to prevent some encoded values from encoding inadmissible Unicode values: however, these should be checked for and detected during decoding.

Punycode is designed to work across all scripts and to be self-optimizing, adapting to the character set ranges within the string as it operates. It is optimized for the case where the string is composed of zero or more ASCII characters plus characters from only one other script system, but it will cope with any arbitrary Unicode string. For DNS use, the domain name string is assumed to have been normalized using Nameprep and (for top-level domains) filtered against an officially registered language table before being Punycode-encoded. The DNS protocol also sets limits on the acceptable lengths of the output Punycode string.

Internationalized domain names

To prevent non-internationalized domain names containing hyphens from being accidentally interpreted as Punycode, the Punycode sequences of internationalized domain names have a so-called ASCII Compatible Encoding (ACE) prefix, "xn--", prepended.[2] Thus the domain name "bücher.tld" would be represented in ASCII as "xn--bcher-kva.tld".
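The prefixing can be sketched per label as follows. The helper to_ace is hypothetical and deliberately skips Nameprep and length checks; for comparison, the snippet also uses Python's built-in "idna" codec, which implements the older IDNA 2003 rules.

```python
def to_ace(domain: str) -> str:
    """Prefix each non-ASCII label with "xn--" after Punycode-encoding it."""
    labels = []
    for label in domain.split("."):
        if label.isascii():
            labels.append(label)                 # plain LDH label, left as-is
        else:
            labels.append("xn--" + label.encode("punycode").decode("ascii"))
    return ".".join(labels)


ace_manual = to_ace("bücher.tld")
ace_codec = "bücher.tld".encode("idna").decode("ascii")
print(ace_manual, ace_codec)
```

Only labels containing non-ASCII characters receive the prefix, so ordinary labels such as "tld" pass through unchanged.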

References

  1. RFC 3492, Punycode: A Bootstring encoding of Unicode for Internationalized Domain Names in Applications (IDNA), A. Costello, The Internet Society (March 2003)
  2. Internet Assigned Numbers Authority (2003-02-14). "Completion of IANA Selection of IDNA Prefix". www.atm.tut.fi. Retrieved 2017-09-22.
This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.