Western Latin character sets (computing)

This article compares several binary representations of character sets for common Western European languages. These encodings were designed to represent Italian, Spanish, Portuguese, French, German, Dutch, English, Danish, Swedish, Norwegian, and Icelandic, which use the Latin alphabet, a few additional letters and letters with precomposed diacritics, some punctuation, and various symbols (including some Greek letters). Although they are called "Western European", many of these languages are spoken all over the world. These character sets also happen to support many other languages, such as Malay, Swahili, and Classical Latin.

This material is technically obsolete, having been functionally replaced by Unicode. However, it continues to have historical interest.

Summary

The ISO-8859 series of 8-bit character sets encodes all Latin character sets used in Europe, although the same code points have different meanings in different members of the series, which caused some difficulty. The arrival of Unicode, with a unique code point for every character, resolved these issues.

  • ISO/IEC 8859-1 or Latin-1 is the most used and also defines the first 256 codes in Unicode.
  • ISO/IEC 8859-15 modifies ISO-8859-1 to fully support Estonian, Finnish and French and add the euro sign.
  • Windows-1252 is a superset of ISO-8859-1 that includes the characters from ISO-8859-15 and popular punctuation such as curved quotation marks. Web page tools for Windows commonly use Windows-1252 but label the page as ISO-8859-1. HTML 5 addresses this by mandating that pages labeled as ISO-8859-1 be interpreted as Windows-1252.
  • IBM CP437, being intended for English only, has very little in the way of accented letters but has far more graphics characters than the others and also some Greek characters that are useful as technical symbols.
  • IBM CP850 has all the printable characters that ISO-8859-1 has (albeit arranged differently) and still manages to have enough graphics characters to build a usable text-mode user interface.
  • IBM CP858 differs from CP850 by only one character: a dotless i (ı), rarely used outside Turkey, was replaced by the euro sign (€).[1]
  • IBM CP859 contains all the printable characters that ISO-8859-15 has, so unlike CP850 it supports the €, Finnish and French.
  • IBM code pages 037, 500, and 1047 are EBCDIC encodings that include all of the ISO-8859-1 characters.
  • The Mac OS Roman character set (often referred to as MacRoman and known by the IANA as simply MACINTOSH) has most, but not all, of the same characters as ISO-8859-1 but in a very different arrangement; and it also adds many technical and mathematical characters (though it lacks the important ×) and more diacritics. Older Macintosh web browsers were known to munge the few characters that were in ISO-8859-1 but not their native Macintosh character set when editing text from Web sites. Conversely, in Web material prepared on an older Macintosh, many characters were displayed incorrectly when read by other operating systems.
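Two of the points above can be checked with a short Python sketch. It assumes Python's standard codec names (`latin-1`, `cp1252`, `cp850`, `cp858`), which are Python's labels and not names used in this article, and the sample byte string is made up for illustration.

```python
# Hypothetical bytes as a Windows tool might write them:
# curved quotation marks (0x93, 0x94) around "caf" + 0xE9.
raw = b"\x93caf\xe9\x94"

# Read strictly as ISO-8859-1, 0x93 and 0x94 are C1 control codes.
as_latin1 = raw.decode("latin-1")

# Read as Windows-1252 (what HTML 5 requires for pages labeled ISO-8859-1),
# the same bytes are curved quotation marks.
as_cp1252 = raw.decode("cp1252")

# Collect every byte value that CP850 and CP858 decode differently.
differing = [b for b in range(256)
             if bytes([b]).decode("cp850") != bytes([b]).decode("cp858")]

print(repr(as_latin1))
print(repr(as_cp1252))
print([hex(b) for b in differing])
```

With Python's bundled codecs, the only differing byte should be 0xD5, the position where CP858 replaces the dotless i with the euro sign.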

History

The earlier seven-bit U.S. ASCII encoding has characters sufficient to properly represent only English, Latin, and Swahili. It is missing some letters and letter-diacritic combinations used in other Latin-alphabet languages. However, since there was no other choice on most U.S.-supplied computer platforms, ASCII was unavoidable in most of the non-English-speaking world (seven-bit encoding was necessitated by the limitations of early computing networks). There was the ISO 646 group of encodings which replaced some of the symbols in ASCII with local characters, but space was very limited, and some of the symbols replaced were quite common in things like programming languages.

Although seven-bit communication was the norm, most computers internally used eight-bit bytes, and they mostly put some form of characters in the 128 higher byte positions. In the early days most of these were system specific, but gradually a few standards were settled on.

In recent years, as storage and memory costs have fallen, the complications arising from multiple meanings of a given eight-bit code (there are seven ISO-Latin code sets alone) have ceased to be justified. All major operating systems have moved to Unicode as their main internal representation. However, Windows does not support Unicode through its 8-bit character interfaces (which would mean supporting UTF-8 in standard interfaces such as fopen), so many applications continue to be restricted to these legacy character sets.

The euro sign

The introduction of the euro created significant pressure to support the euro sign (€), and most 8-bit character sets had to be adapted in some way.

  • Apple with MacRoman and Sun Microsystems with Solaris OS simply replaced the generic currency sign (¤). This caused significant difficulty because organisations had found other uses for it, such as the company logo.
  • ISO introduced a further variant of ISO 8859, ISO 8859-15, which replaced the generic currency sign with the euro sign as well as making some other replacements of symbols with letters with diacritics. ISO 8859-15 never received widespread adoption.
  • Windows-1252 placed the euro sign in a gap (position 0x80) in the existing C1 control codes.

All of these issues have been resolved as operating systems have been upgraded to support Unicode as standard, which encodes the euro sign at U+20AC (decimal 8364).
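The different placements of the euro sign can be illustrated with a minimal Python sketch. The codec names are Python's own labels (an assumption of this sketch), and `encode_or_none` is a hypothetical helper introduced only for the example.

```python
EURO = "\u20ac"  # EURO SIGN, U+20AC

# Python codec names (an assumption of this sketch) for the sets discussed.
codecs_to_try = ["latin-1", "iso8859-15", "cp1252", "cp850", "cp858",
                 "mac_roman", "utf-8"]

def encode_or_none(text, codec):
    """Return the encoded bytes, or None if the codec cannot represent text."""
    try:
        return text.encode(codec)
    except UnicodeEncodeError:
        return None

euro_bytes = {name: encode_or_none(EURO, name) for name in codecs_to_try}

for name, data in euro_bytes.items():
    print(f"{name:12} {data.hex().upper() if data else 'not representable'}")
```

Sets that predate the euro, such as ISO-8859-1 and the original CP850, have no byte for it at all, which is why each vendor had to pick a position to sacrifice.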

Comparison table

Code points U+0000 to U+007F are not shown in this table, as they are mapped identically in all character sets listed here. The ASCII standard defines the mapping of these first 128 characters (0-127).

The table is arranged by Unicode code point. Character sets are referred to here by their IANA names in upper case.

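Character | Code point | ISO-8859-1 | ISO-8859-15 | WINDOWS-1252 | IBM437 | IBM850 | MACINTOSH
NBSP | U+00A0 | A0 | A0 | A0 | FF | FF | CA
¡ | U+00A1 | A1 | A1 | A1 | AD | AD | C1
¢ | U+00A2 | A2 | A2 | A2 | 9B | BD | A2
£ | U+00A3 | A3 | A3 | A3 | 9C | 9C | A3
¤ | U+00A4 | A4 |    | A4 |    | CF |
¥ | U+00A5 | A5 | A5 | A5 | 9D | BE | B4
¦ | U+00A6 | A6 |    | A6 |    | DD |
§ | U+00A7 | A7 | A7 | A7 |    | F5 | A4
¨ | U+00A8 | A8 |    | A8 |    | F9 | AC
© | U+00A9 | A9 | A9 | A9 |    | B8 | A9
ª | U+00AA | AA | AA | AA | A6 | A6 | BB
« | U+00AB | AB | AB | AB | AE | AE | C7
¬ | U+00AC | AC | AC | AC | AA | AA | C2
SHY | U+00AD | AD | AD | AD |    | F0 |
® | U+00AE | AE | AE | AE |    | A9 | A8
¯ | U+00AF | AF | AF | AF |    | EE | F8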
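Character | Code point | ISO-8859-1 | ISO-8859-15 | WINDOWS-1252 | IBM437 | IBM850 | MACINTOSH
° | U+00B0 | B0 | B0 | B0 | F8 | F8 | A1
± | U+00B1 | B1 | B1 | B1 | F1 | F1 | B1
² | U+00B2 | B2 | B2 | B2 | FD | FD |
³ | U+00B3 | B3 | B3 | B3 |    | FC |
´ | U+00B4 | B4 |    | B4 |    | EF | AB
µ | U+00B5 | B5 | B5 | B5 | E6 | E6 | B5
¶ | U+00B6 | B6 | B6 | B6 |    | F4 | A6
· | U+00B7 | B7 | B7 | B7 | FA | FA | E1
¸ | U+00B8 | B8 |    | B8 |    | F7 | FC
¹ | U+00B9 | B9 | B9 | B9 |    | FB |
º | U+00BA | BA | BA | BA | A7 | A7 | BC
» | U+00BB | BB | BB | BB | AF | AF | C8
¼ | U+00BC | BC |    | BC | AC | AC |
½ | U+00BD | BD |    | BD | AB | AB |
¾ | U+00BE | BE |    | BE |    | F3 |
¿ | U+00BF | BF | BF | BF | A8 | A8 | C0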
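Character | Code point | ISO-8859-1 | ISO-8859-15 | WINDOWS-1252 | IBM437 | IBM850 | MACINTOSH
À | U+00C0 | C0 | C0 | C0 |    | B7 | CB
Á | U+00C1 | C1 | C1 | C1 |    | B5 | E7
Â | U+00C2 | C2 | C2 | C2 |    | B6 | E5
Ã | U+00C3 | C3 | C3 | C3 |    | C7 | CC
Ä | U+00C4 | C4 | C4 | C4 | 8E | 8E | 80
Å | U+00C5 | C5 | C5 | C5 | 8F | 8F | 81
Æ | U+00C6 | C6 | C6 | C6 | 92 | 92 | AE
Ç | U+00C7 | C7 | C7 | C7 | 80 | 80 | 82
È | U+00C8 | C8 | C8 | C8 |    | D4 | E9
É | U+00C9 | C9 | C9 | C9 | 90 | 90 | 83
Ê | U+00CA | CA | CA | CA |    | D2 | E6
Ë | U+00CB | CB | CB | CB |    | D3 | E8
Ì | U+00CC | CC | CC | CC |    | DE | ED
Í | U+00CD | CD | CD | CD |    | D6 | EA
Î | U+00CE | CE | CE | CE |    | D7 | EB
Ï | U+00CF | CF | CF | CF |    | D8 | EC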
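Character | Code point | ISO-8859-1 | ISO-8859-15 | WINDOWS-1252 | IBM437 | IBM850 | MACINTOSH
Ð | U+00D0 | D0 | D0 | D0 |    | D1 |
Ñ | U+00D1 | D1 | D1 | D1 | A5 | A5 | 84
Ò | U+00D2 | D2 | D2 | D2 |    | E3 | F1
Ó | U+00D3 | D3 | D3 | D3 |    | E0 | EE
Ô | U+00D4 | D4 | D4 | D4 |    | E2 | EF
Õ | U+00D5 | D5 | D5 | D5 |    | E5 | CD
Ö | U+00D6 | D6 | D6 | D6 | 99 | 99 | 85
× | U+00D7 | D7 | D7 | D7 |    | 9E |
Ø | U+00D8 | D8 | D8 | D8 |    | 9D | AF
Ù | U+00D9 | D9 | D9 | D9 |    | EB | F4
Ú | U+00DA | DA | DA | DA |    | E9 | F2
Û | U+00DB | DB | DB | DB |    | EA | F3
Ü | U+00DC | DC | DC | DC | 9A | 9A | 86
Ý | U+00DD | DD | DD | DD |    | ED |
Þ | U+00DE | DE | DE | DE |    | E8 |
ß | U+00DF | DF | DF | DF | E1 | E1 | A7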
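Character | Code point | ISO-8859-1 | ISO-8859-15 | WINDOWS-1252 | IBM437 | IBM850 | MACINTOSH
à | U+00E0 | E0 | E0 | E0 | 85 | 85 | 88
á | U+00E1 | E1 | E1 | E1 | A0 | A0 | 87
â | U+00E2 | E2 | E2 | E2 | 83 | 83 | 89
ã | U+00E3 | E3 | E3 | E3 |    | C6 | 8B
ä | U+00E4 | E4 | E4 | E4 | 84 | 84 | 8A
å | U+00E5 | E5 | E5 | E5 | 86 | 86 | 8C
æ | U+00E6 | E6 | E6 | E6 | 91 | 91 | BE
ç | U+00E7 | E7 | E7 | E7 | 87 | 87 | 8D
è | U+00E8 | E8 | E8 | E8 | 8A | 8A | 8F
é | U+00E9 | E9 | E9 | E9 | 82 | 82 | 8E
ê | U+00EA | EA | EA | EA | 88 | 88 | 90
ë | U+00EB | EB | EB | EB | 89 | 89 | 91
ì | U+00EC | EC | EC | EC | 8D | 8D | 93
í | U+00ED | ED | ED | ED | A1 | A1 | 92
î | U+00EE | EE | EE | EE | 8C | 8C | 94
ï | U+00EF | EF | EF | EF | 8B | 8B | 95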
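Character | Code point | ISO-8859-1 | ISO-8859-15 | WINDOWS-1252 | IBM437 | IBM850 | MACINTOSH
ð | U+00F0 | F0 | F0 | F0 |    | D0 |
ñ | U+00F1 | F1 | F1 | F1 | A4 | A4 | 96
ò | U+00F2 | F2 | F2 | F2 | 95 | 95 | 98
ó | U+00F3 | F3 | F3 | F3 | A2 | A2 | 97
ô | U+00F4 | F4 | F4 | F4 | 93 | 93 | 99
õ | U+00F5 | F5 | F5 | F5 |    | E4 | 9B
ö | U+00F6 | F6 | F6 | F6 | 94 | 94 | 9A
÷ | U+00F7 | F7 | F7 | F7 | F6 | F6 | D6
ø | U+00F8 | F8 | F8 | F8 |    | 9B | BF
ù | U+00F9 | F9 | F9 | F9 | 97 | 97 | 9D
ú | U+00FA | FA | FA | FA | A3 | A3 | 9C
û | U+00FB | FB | FB | FB | 96 | 96 | 9E
ü | U+00FC | FC | FC | FC | 81 | 81 | 9F
ý | U+00FD | FD | FD | FD |    | EC |
þ | U+00FE | FE | FE | FE |    | E7 |
ÿ | U+00FF | FF | FF | FF | 98 | 98 | D8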
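Character | Code point | ISO-8859-1 | ISO-8859-15 | WINDOWS-1252 | IBM437 | IBM850 | MACINTOSH
ı | U+0131 |    |    |    |    | D5 | F5
Œ | U+0152 |    | BC | 8C |    |    | CE
œ | U+0153 |    | BD | 9C |    |    | CF
Š | U+0160 |    | A6 | 8A |    |    |
š | U+0161 |    | A8 | 9A |    |    |
Ÿ | U+0178 |    | BE | 9F |    |    | D9
Ž | U+017D |    | B4 | 8E |    |    |
ž | U+017E |    | B8 | 9E |    |    |
ƒ | U+0192 |    |    | 83 | 9F | 9F | C4
ˆ | U+02C6 |    |    | 88 |    |    | F6
ˇ | U+02C7 |    |    |    |    |    | FF
˘ | U+02D8 |    |    |    |    |    | F9
˙ | U+02D9 |    |    |    |    |    | FA
˚ | U+02DA |    |    |    |    |    | FB
˛ | U+02DB |    |    |    |    |    | FE
˜ | U+02DC |    |    | 98 |    |    | F7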
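Character | Code point | ISO-8859-1 | ISO-8859-15 | WINDOWS-1252 | IBM437 | IBM850 | MACINTOSH
˝ | U+02DD |    |    |    |    |    | FD
Γ | U+0393 |    |    |    | E2 |    |
Θ | U+0398 |    |    |    | E9 |    |
Σ | U+03A3 |    |    |    | E4 |    |
Φ | U+03A6 |    |    |    | E8 |    |
Ω | U+03A9 |    |    |    | EA |    | BD
α | U+03B1 |    |    |    | E0 |    |
δ | U+03B4 |    |    |    | EB |    |
ε | U+03B5 |    |    |    | EE |    |
π | U+03C0 |    |    |    | E3 |    | B9
σ | U+03C3 |    |    |    | E5 |    |
τ | U+03C4 |    |    |    | E7 |    |
φ | U+03C6 |    |    |    | ED |    |
– | U+2013 |    |    | 96 |    |    | D0
— | U+2014 |    |    | 97 |    |    | D1
‗ | U+2017 |    |    |    |    | F2 |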
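Character | Code point | ISO-8859-1 | ISO-8859-15 | WINDOWS-1252 | IBM437 | IBM850 | MACINTOSH
‘ | U+2018 |    |    | 91 |    |    | D4
’ | U+2019 |    |    | 92 |    |    | D5
‚ | U+201A |    |    | 82 |    |    | E2
“ | U+201C |    |    | 93 |    |    | D2
” | U+201D |    |    | 94 |    |    | D3
„ | U+201E |    |    | 84 |    |    | E3
† | U+2020 |    |    | 86 |    |    | A0
‡ | U+2021 |    |    | 87 |    |    | E0
• | U+2022 |    |    | 95 |    |    | A5
… | U+2026 |    |    | 85 |    |    | C9
‰ | U+2030 |    |    | 89 |    |    | E4
‹ | U+2039 |    |    | 8B |    |    | DC
› | U+203A |    |    | 9B |    |    | DD
⁄ | U+2044 |    |    |    |    |    | DA
ⁿ | U+207F |    |    |    | FC |    |
₧ | U+20A7 |    |    |    | 9E |    |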
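Character | Code point | ISO-8859-1 | ISO-8859-15 | WINDOWS-1252 | IBM437 | IBM850 | MACINTOSH
€ | U+20AC |    | A4 | 80 |    | (D5)[nb 1][2][3] | DB
™ | U+2122 |    |    | 99 |    |    | AA
∂ | U+2202 |    |    |    |    |    | B6
∆ | U+2206 |    |    |    |    |    | C6
∏ | U+220F |    |    |    |    |    | B8
∑ | U+2211 |    |    |    |    |    | B7
∙ | U+2219 |    |    |    | F9 |    |
√ | U+221A |    |    |    | FB |    | C3
∞ | U+221E |    |    |    | EC |    | B0
∩ | U+2229 |    |    |    | EF |    |
∫ | U+222B |    |    |    |    |    | BA
≈ | U+2248 |    |    |    | F7 |    | C5
≠ | U+2260 |    |    |    |    |    | AD
≡ | U+2261 |    |    |    | F0 |    |
≤ | U+2264 |    |    |    | F3 |    | B2
≥ | U+2265 |    |    |    | F2 |    | B3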
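Character | Code point | ISO-8859-1 | ISO-8859-15 | WINDOWS-1252 | IBM437 | IBM850 | MACINTOSH
⌐ | U+2310 |    |    |    | A9 |    |
⌠ | U+2320 |    |    |    | F4 |    |
⌡ | U+2321 |    |    |    | F5 |    |
─ | U+2500 |    |    |    | C4 | C4 |
│ | U+2502 |    |    |    | B3 | B3 |
┌ | U+250C |    |    |    | DA | DA |
┐ | U+2510 |    |    |    | BF | BF |
└ | U+2514 |    |    |    | C0 | C0 |
┘ | U+2518 |    |    |    | D9 | D9 |
├ | U+251C |    |    |    | C3 | C3 |
┤ | U+2524 |    |    |    | B4 | B4 |
┬ | U+252C |    |    |    | C2 | C2 |
┴ | U+2534 |    |    |    | C1 | C1 |
┼ | U+253C |    |    |    | C5 | C5 |
═ | U+2550 |    |    |    | CD | CD |
║ | U+2551 |    |    |    | BA | BA |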
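Character | Code point | ISO-8859-1 | ISO-8859-15 | WINDOWS-1252 | IBM437 | IBM850 | MACINTOSH
╒ | U+2552 |    |    |    | D5 |    |
╓ | U+2553 |    |    |    | D6 |    |
╔ | U+2554 |    |    |    | C9 | C9 |
╕ | U+2555 |    |    |    | B8 |    |
╖ | U+2556 |    |    |    | B7 |    |
╗ | U+2557 |    |    |    | BB | BB |
╘ | U+2558 |    |    |    | D4 |    |
╙ | U+2559 |    |    |    | D3 |    |
╚ | U+255A |    |    |    | C8 | C8 |
╛ | U+255B |    |    |    | BE |    |
╜ | U+255C |    |    |    | BD |    |
╝ | U+255D |    |    |    | BC | BC |
╞ | U+255E |    |    |    | C6 |    |
╟ | U+255F |    |    |    | C7 |    |
╠ | U+2560 |    |    |    | CC | CC |
╡ | U+2561 |    |    |    | B5 |    |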
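Character | Code point | ISO-8859-1 | ISO-8859-15 | WINDOWS-1252 | IBM437 | IBM850 | MACINTOSH
╢ | U+2562 |    |    |    | B6 |    |
╣ | U+2563 |    |    |    | B9 | B9 |
╤ | U+2564 |    |    |    | D1 |    |
╥ | U+2565 |    |    |    | D2 |    |
╦ | U+2566 |    |    |    | CB | CB |
╧ | U+2567 |    |    |    | CF |    |
╨ | U+2568 |    |    |    | D0 |    |
╩ | U+2569 |    |    |    | CA | CA |
╪ | U+256A |    |    |    | D8 |    |
╫ | U+256B |    |    |    | D7 |    |
╬ | U+256C |    |    |    | CE | CE |
▀ | U+2580 |    |    |    | DF | DF |
▄ | U+2584 |    |    |    | DC | DC |
█ | U+2588 |    |    |    | DB | DB |
▌ | U+258C |    |    |    | DD |    |
▐ | U+2590 |    |    |    | DE |    |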
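Character | Code point | ISO-8859-1 | ISO-8859-15 | WINDOWS-1252 | IBM437 | IBM850 | MACINTOSH
░ | U+2591 |    |    |    | B0 | B0 |
▒ | U+2592 |    |    |    | B1 | B1 |
▓ | U+2593 |    |    |    | B2 | B2 |
■ | U+25A0 |    |    |    | FE | FE |
◊ | U+25CA |    |    |    |    |    | D7
ﬁ | U+FB01 |    |    |    |    |    | DE
ﬂ | U+FB02 |    |    |    |    |    | DF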
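Rows of the table above can be approximated with a short Python sketch. It assumes that Python's bundled codecs (`latin-1`, `iso8859-15`, `cp1252`, `cp437`, `cp850`, `mac_roman`) correspond to the table's columns; their mappings may differ from the table in individual cells, for example where vendors changed a code page over time. `table_row` is a hypothetical helper, not part of any library.

```python
COLUMNS = {            # table heading -> Python codec name (assumed match)
    "ISO-8859-1": "latin-1",
    "ISO-8859-15": "iso8859-15",
    "WINDOWS-1252": "cp1252",
    "IBM437": "cp437",
    "IBM850": "cp850",
    "MACINTOSH": "mac_roman",
}

def table_row(char):
    """Return the code point, then one hex byte (or '') per column."""
    cells = [f"U+{ord(char):04X}"]
    for codec in COLUMNS.values():
        try:
            cells.append(char.encode(codec).hex().upper())
        except UnicodeEncodeError:
            cells.append("")   # character is absent from this set
    return cells

for ch in "\u00e9\u00a4\u0152":   # é, ¤, Œ
    print(ch, table_row(ch))
```

An empty string plays the role of an empty table cell: the character simply has no byte in that character set.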

Notes

  • The mappings shown for the IBM code pages are those supplied by Microsoft on the Unicode site. Refer to the Unicode Consortium's document on the differences between IBM's and Microsoft's mappings for these code pages.
  • IBM437 and IBM850 defined printable characters for the control code ranges. While these could not be used when printing text through DOS, as they would be trapped before reaching the screen, they could be used by applications that used screen memory directly.
  • Macintosh has an Apple logo at 0xF0, and translates it to U+F8FF in the Private Use Area for Unicode.
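The last two notes can be observed in Python, under the assumption that its `mac_roman` and `cp437` codecs follow the mappings described here. This is a small sketch of how one implementation behaves, not a statement about every converter.

```python
# Mac OS Roman 0xF0 is the Apple logo; Python's codec maps it to U+F8FF,
# a code point in Unicode's Private Use Area.
apple = b"\xf0".decode("mac_roman")

# Python's cp437 codec treats byte 0x01 as the control code U+0001 and
# not as a printable character placed in the control code range.
ctrl = b"\x01".decode("cp437")

print(f"U+{ord(apple):04X}", f"U+{ord(ctrl):04X}")
```

Because U+F8FF is a private-use code point, text containing it will only show an Apple logo on systems whose fonts assign that glyph to it.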

References

  nb 1. IBM's PC DOS 2000, released in 1998, changed its definition of code page 850 to what IBM called modified code page 850, which includes the euro sign at code point 213, instead of adding support for the new code page 858. The reason might have been existing restrictions in the codepage switching logic under MS-DOS/PC DOS, which limited .CPI files to 64 KB in size, or about six codepages at most. This limitation was circumvented in some OEM versions of MS-DOS and in Windows NT, and does not exist in DR-DOS. Further, the parser in MS-DOS/PC DOS limits the number of country/codepage entries in COUNTRY.SYS files to a maximum of 146 or 438, a limitation that does not exist in DR-DOS. Adding support for codepage 858 might therefore have meant dropping another codepage (e.g. codepage 850) at the same time, which might not have been viable then, given that some applications were hard-wired to use codepage 850.
  1. "00858". Code pages by CPGID. IBM. Archived from the original on 2016-06-06. Retrieved 2016-06-06.
  2. Paul, Matthias (2001-08-15). "Changing codepages in FreeDOS" (Technical design specification based on fd-dev post ). Archived from the original on 2016-06-06. Retrieved 2016-06-06. The new official ID for the Multilingual "codepage 850 with EURO SIGN" is 858, not 850. IBM will switch to use 858 instead of their 850 variant with future issues of their products. […] I can only guess why they didn't add 858 to their EGAx.CPI, COUNTRY.SYS, and KEYBOARD.SYS files in PC DOS 2000. Many third-party applications are designed to work with 850 and didn't know about 858 at the time PC DOS 2000 was released, so it's easier for everyone, but unfortunately it's not compatible. […] As explained above, COUNTRY.SYS and KEYBOARD.SYS contain only two codepage entries for a given country in Western issues of DOS. (In Arabic and Hebrew issues there can be up to 8 codepages for one country, in theory there is no limit below the range of allowed codepages 1..65534). […] The problem is that removing support for 850 might have caused compatibility problems with applications which are hard-wired to use 850. Adding 858 as a third choice to all the files would have increased the file and table sizes significantly. The COUNTRY.SYS file parser in MS-DOS/PC DOS IO.SYS/IBMBIO.COM sets aside a 6 Kb (for DOS 6) scratchpad to load all the info. This allows a maximum of 438 entries in a COUNTRY.SYS file to be accepted, otherwise you will get the message "COUNTRY.SYS too large.". The NLSFUNC parser does not have this limitation, and the file parsers in DR-DOS (kernel and NLSFUNC) also do not know of such a restriction. Older issues of MS-DOS/PC DOS even had a 2 Kb buffer for a maximum of 146 entries.
  3. Paul, Matthias (2001-08-27). "Changing codepages in FreeDOS (follow-up)". Archived from the original on 2014-10-01. Retrieved 2013-05-08. […] one could also create custom .CPI files in the traditional FONT style without difficulties, but you could only store up to […] six codepages in such a file if it should be useable by MS-DOS/PC DOS (some OEM issues and NT can handle files larger than 64 Kb, but MS-DOS/PC DOS can not).
This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.