UTF-EBCDIC

UTF-EBCDIC is a character encoding used to represent Unicode characters. It is meant to be EBCDIC-friendly, so that legacy EBCDIC applications on mainframes may process the characters without much difficulty. Its advantages for existing EBCDIC-based systems are similar to UTF-8's advantages for existing ASCII-based systems. UTF-EBCDIC is specified in Unicode Technical Report #16.

To produce the UTF-EBCDIC encoded version of a series of Unicode code points, an encoding based on UTF-8 (known in the specification as UTF-8-Mod) is applied first, creating what the specification calls an I8 sequence. The main difference between this encoding and UTF-8 is that it allows Unicode code points U+0080 through U+009F (the C1 control codes) to be represented as a single byte and therefore later mapped to corresponding EBCDIC control codes. To achieve this, UTF-8-Mod uses 101XXXXX instead of 10XXXXXX as the format for trailing bytes in a multi-byte sequence. As each trailing byte holds only 5 bits rather than 6, the UTF-8-Mod encoding of a code point above U+009F is never shorter and often longer than its UTF-8 encoding. For example, code points U+4000 through U+FFFF take four bytes in UTF-8-Mod but three in UTF-8.
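
The bit layout described above can be sketched in Python. This is a minimal illustration, not a reference implementation: the name `encode_i8` is made up for this example, the start-byte patterns (110, 1110, 11110, 111110) are assumed to mirror UTF-8's, the length boundaries are read off the start-byte values in the code page chart, and surrogate code points are not rejected.

```python
def encode_i8(cp: int) -> bytes:
    """Encode one code point as a UTF-8-Mod (I8) byte sequence (sketch)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if cp <= 0x9F:
        # ASCII and the C1 controls stay single bytes.
        return bytes([cp])
    # (sequence length, highest code point, start-byte marker)
    forms = ((2, 0x3FF, 0xC0), (3, 0x3FFF, 0xE0),
             (4, 0x3FFFF, 0xF0), (5, 0x3FFFFF, 0xF8))
    for length, limit, marker in forms:
        if cp <= limit:
            break
    trail = []
    for _ in range(length - 1):
        # Each trailing byte is 101xxxxx: marker 0xA0 plus 5 payload bits.
        trail.append(0xA0 | (cp & 0x1F))
        cp >>= 5
    # Whatever bits remain go into the start byte.
    return bytes([marker | cp] + trail[::-1])


print(encode_i8(0x41).hex())    # "A": one byte, same as ASCII
print(encode_i8(0x85).hex())    # C1 control NEL: still one byte
print(encode_i8(0xA0).hex())    # first two-byte code point
print(encode_i8(0x4E2D).hex())  # four bytes here, three in UTF-8
```

Note that the result is still ASCII-based. The byte substitution that turns it into UTF-EBCDIC is a separate step.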

The UTF-8-Mod transformation leaves the data in an ASCII-based format (for example, U+0041 "A" is still encoded as 01000001), so each byte is fed through a reversible (one-to-one) lookup table to produce the final UTF-EBCDIC encoding. For example, 01000001 in this table maps to 11000001; thus the UTF-EBCDIC encoding of U+0041 (Unicode's "A") is 0xC1 (EBCDIC's "A").
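
The byte-for-byte substitution can be illustrated with a small Python sketch. The table here is transcribed from the code page layout chart in this article rather than taken from the technical report, so treat it as an assumption. The six byte values for which the chart lists no entry are written as `--` and left unmapped, and their positions (0xEF and 0xFA to 0xFE) are inferred. The names `ROWS`, `EBCDIC_TO_I8`, `I8_TO_EBCDIC` and `i8_to_utf_ebcdic` are invented for this example.

```python
# One string per chart row (high nibble 0..F); each cell is the I8 byte
# that the UTF-EBCDIC byte at that position stands for.
ROWS = [
    "00 01 02 03 9C 09 86 7F 97 8D 8E 0B 0C 0D 0E 0F",
    "10 11 12 13 9D 0A 08 87 18 19 92 8F 1C 1D 1E 1F",
    "80 81 82 83 84 85 17 1B 88 89 8A 8B 8C 05 06 07",
    "90 91 16 93 94 95 96 04 98 99 9A 9B 14 15 9E 1A",
    "20 A0 A1 A2 A3 A4 A5 A6 A7 A8 A9 2E 3C 28 2B 7C",
    "26 AA AB AC AD AE AF B0 B1 B2 21 24 2A 29 3B 5E",
    "2D 2F B3 B4 B5 B6 B7 B8 B9 BA BB 2C 25 5F 3E 3F",
    "BC BD BE BF C0 C1 C2 C3 C4 60 3A 23 40 27 3D 22",
    "C5 61 62 63 64 65 66 67 68 69 C6 C7 C8 C9 CA CB",
    "CC 6A 6B 6C 6D 6E 6F 70 71 72 CD CE CF D0 D1 D2",
    "D3 7E 73 74 75 76 77 78 79 7A D4 D5 D6 5B D7 D8",
    "D9 DA DB DC DD DE DF E0 E1 E2 E3 E4 E5 5D E6 E7",
    "7B 41 42 43 44 45 46 47 48 49 E8 E9 EA EB EC ED",
    "7D 4A 4B 4C 4D 4E 4F 50 51 52 EE EF F0 F1 F2 F3",
    "5C F4 53 54 55 56 57 58 59 5A F5 F6 F7 F8 F9 --",
    "30 31 32 33 34 35 36 37 38 39 -- -- -- -- -- 9F",
]

EBCDIC_TO_I8 = {}
for high, row in enumerate(ROWS):
    for low, cell in enumerate(row.split()):
        if cell != "--":
            EBCDIC_TO_I8[high * 16 + low] = int(cell, 16)

# The mapping is one-to-one, so inverting it loses nothing.
I8_TO_EBCDIC = {i8: ebcdic for ebcdic, i8 in EBCDIC_TO_I8.items()}


def i8_to_utf_ebcdic(data: bytes) -> bytes:
    """Replace every I8 byte with its UTF-EBCDIC counterpart."""
    return bytes(I8_TO_EBCDIC[b] for b in data)


print(i8_to_utf_ebcdic(b"A").hex())             # c1
print(i8_to_utf_ebcdic(b"\xc5\xa0").hex())      # 8041 (U+00A0)
```

Because the substitution is applied per byte, it never changes the length of a sequence. Only the UTF-8-Mod step decides how many bytes a code point occupies.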

This encoding form is rarely used, even on the EBCDIC-based mainframes for which it was designed. IBM EBCDIC-based mainframe operating systems, such as z/OS, usually use UTF-16 for complete Unicode support. For example, DB2 UDB, COBOL, PL/I, Java and the IBM XML toolkit support UTF-16 on IBM mainframes.

Codepage layout

There are 160 characters with single-byte encodings in UTF-EBCDIC (compared to 128 in UTF-8). As the chart shows, the single-byte portion resembles IBM-1047 rather than IBM-37, which is evident from the position of the square brackets: UTF-EBCDIC, like IBM-1047, places [ and ] at hex AD and BD, whereas CCSID 37 places them at hex BA and BB.
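
The CCSID 37 side of this comparison can be checked with Python's built-in `cp037` codec. The standard library has no IBM-1047 codec, so the sketch only confirms where CCSID 37 puts the brackets and that it does not use AD and BD for them.

```python
# Where does CCSID 37 (Python codec "cp037") place the square brackets?
brackets_cp037 = "[]".encode("cp037")
print(brackets_cp037.hex())  # babb

# The UTF-EBCDIC / IBM-1047 bracket positions hold other characters in CCSID 37.
other = bytes([0xAD, 0xBD]).decode("cp037")
print(other != "[]")  # True
```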

UTF-EBCDIC
Rows give the high nibble of the byte and columns the low nibble.

    _0        _1        _2        _3        _4        _5        _6        _7        _8        _9        _A        _B        _C        _D        _E        _F
0_  NUL 0000  SOH 0001  STX 0002  ETX 0003  ST 009C   HT 0009   SSA 0086  DEL 007F  EPA 0097  RI 008D   SS2 008E  VT 000B   FF 000C   CR 000D   SO 000E   SI 000F
1_  DLE 0010  DC1 0011  DC2 0012  DC3 0013  OSC 009D  LF 000A   BS 0008   ESA 0087  CAN 0018  EM 0019   PU2 0092  SS3 008F  FS 001C   GS 001D   RS 001E   US 001F
2_  PAD 0080  HOP 0081  BPH 0082  NBH 0083  IND 0084  NEL 0085  ETB 0017  ESC 001B  HTS 0088  HTJ 0089  VTS 008A  PLD 008B  PLU 008C  ENQ 0005  ACK 0006  BEL 0007
3_  DCS 0090  PU1 0091  SYN 0016  STS 0093  CCH 0094  MW 0095   SPA 0096  EOT 0004  SOS 0098  SGCI 0099 SCI 009A  CSI 009B  DC4 0014  NAK 0015  PM 009E   SUB 001A
4_  SP 0020   +00       +01       +02       +03       +04       +05       +06       +07       +08       +09       . 002E    < 003C    ( 0028    + 002B    | 007C
5_  & 0026    +0A       +0B       +0C       +0D       +0E       +0F       +10       +11       +12       ! 0021    $ 0024    * 002A    ) 0029    ; 003B    ^ 005E
6_  - 002D    / 002F    +13       +14       +15       +16       +17       +18       +19       +1A       +1B       , 002C    % 0025    _ 005F    > 003E    ? 003F
7_  +1C       +1D       +1E       +1F       2 0000    2 0020    2 0040    2 0060    2 0080    ` 0060    : 003A    # 0023    @ 0040    ' 0027    = 003D    " 0022
8_  2 00A0    a 0061    b 0062    c 0063    d 0064    e 0065    f 0066    g 0067    h 0068    i 0069    2 00C0    2 00E0    2 0100    2 0120    2 0140    2 0160
9_  2 0180    j 006A    k 006B    l 006C    m 006D    n 006E    o 006F    p 0070    q 0071    r 0072    2 01A0    2 01C0    2 01E0    2 0200    2 0220    2 0240
A_  2 0260    ~ 007E    s 0073    t 0074    u 0075    v 0076    w 0077    x 0078    y 0079    z 007A    2 0280    2 02A0    2 02C0    [ 005B    2 02E0    2 0300
B_  2 0320    2 0340    2 0360    2 0380    2 03A0    2 03C0    2 03E0    3 0000    3 0400    3 0800    3 0C00    3 1000    3 1400    ] 005D    3 1800    3 1C00
C_  { 007B    A 0041    B 0042    C 0043    D 0044    E 0045    F 0046    G 0047    H 0048    I 0049    3 2000    3 2400    3 2800    3 2C00    3 3000    3 3400
D_  } 007D    J 004A    K 004B    L 004C    M 004D    N 004E    O 004F    P 0050    Q 0051    R 0052    3 3800    3 3C00    4 4000    4 8000    4 10000   4 18000
E_  \ 005C    4 20000   S 0053    T 0054    U 0055    V 0056    W 0057    X 0058    Y 0059    Z 005A    4 28000   4 30000   4 38000   5 40000   5 100000  (undef)
F_  0 0030    1 0031    2 0032    3 0033    4 0034    5 0035    6 0036    7 0037    8 0038    9 0039    (undef)   (undef)   (undef)   (undef)   (undef)   APC 009F


      Outside row F_ (whose cells are the digit characters themselves), a cell that pairs a single digit from 2 to 5 with a code point is the start byte for a sequence of that many bytes. The hexadecimal code point shown in the cell is the lowest character value encoded using that start byte. This value can be greater than the value obtained by following the start byte with continuation bytes that are all 0x41 (decimal 65, the continuation byte that adds zero bits), because that sequence may be an invalid overlong form.

      Cells that show only a plus sign and a hexadecimal number are continuation bytes. The number is the value of the 5 bits they add.

      Start bytes 0x74 through 0x78 and 0xB7 (the cells showing 2 0000 through 2 0080 and 3 0000) can never appear in properly encoded UTF-EBCDIC text, because any possible continuation would result in an invalid overlong form. For example, even 0x76 0x73 (which maps to the UTF-8-Mod sequence 0xC2 0xBF), the largest value reachable from 0x76, would merely be an overlong encoding of U+005F (properly encoded as UTF-8-Mod 0x5F, UTF-EBCDIC 0x6D).
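
The rule behind the start-byte values in the chart can be sketched in Python. The sketch works on UTF-8-Mod bytes, before the lookup table is applied, and covers only the start bytes 0xC0 to 0xF9 that the chart lists. The function names and the `MIN_FOR_LENGTH` table are invented for this example, and the start-byte bit patterns are assumed to mirror UTF-8's.

```python
# Smallest code point that genuinely needs a sequence of each length.
MIN_FOR_LENGTH = {2: 0xA0, 3: 0x400, 4: 0x4000, 5: 0x40000}


def sequence_length(lead: int) -> int:
    """Sequence length announced by a UTF-8-Mod start byte."""
    if 0xC0 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF7:
        return 4
    if 0xF8 <= lead <= 0xF9:
        return 5
    raise ValueError("not a start byte covered by this sketch")


def lowest_code_point(lead: int):
    """Lowest non-overlong code point for a start byte, or None if none exists."""
    n = sequence_length(lead)
    payload_bits = 7 - n          # 5, 4, 3 or 2 bits live in the start byte
    shift = 5 * (n - 1)           # each continuation byte adds 5 bits
    lowest = (lead & ((1 << payload_bits) - 1)) << shift
    highest = lowest | ((1 << shift) - 1)
    if highest < MIN_FOR_LENGTH[n]:
        return None               # every continuation would be overlong
    return max(lowest, MIN_FOR_LENGTH[n])


never_valid = [hex(b) for b in range(0xC0, 0xFA) if lowest_code_point(b) is None]
print(never_valid)                    # ['0xc0', '0xc1', '0xc2', '0xc3', '0xc4', '0xe0']
print(hex(lowest_code_point(0xF0)))   # 0x4000, not 0x0
```

In the chart, these six UTF-8-Mod bytes sit at UTF-EBCDIC positions 0x74 to 0x78 and 0xB7. The start byte 0xF0 shows the other case from the notes: all-zero continuation bytes would give 0, but the lowest value it can validly encode is 0x4000.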

Oracle UTFE

Oracle UTFE is an Oracle database variation of Unicode 3.0 UTF-8 for EBCDIC platforms. Like the CESU-8 variant of UTF-8, it encodes each supplementary character as two 4-byte characters (a surrogate pair) rather than as a single 4- or 5-byte character. It is used only on EBCDIC platforms.[1]
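
The size difference can be worked through in Python. This sketch assumes, as the description above suggests, that UTFE encodes each surrogate with the same byte lengths as UTF-8-Mod. It does not reproduce Oracle's actual implementation, and the function names are invented for this example.

```python
def i8_length(cp: int) -> int:
    """Number of bytes UTF-8-Mod needs for one code point."""
    limits = (0x9F, 0x3FF, 0x3FFF, 0x3FFFF, 0x3FFFFF)
    for length, limit in enumerate(limits, start=1):
        if cp <= limit:
            return length
    raise ValueError("code point too large for this sketch")


def surrogate_pair(cp: int):
    """Split a supplementary code point into its UTF-16 surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


for cp in (0x1F600, 0x10FFFF):
    high, low = surrogate_pair(cp)
    direct = i8_length(cp)                      # single 4- or 5-byte sequence
    paired = i8_length(high) + i8_length(low)   # two 4-byte sequences
    print(f"U+{cp:X}: direct {direct} bytes, as surrogates {paired} bytes")
```

Every surrogate lies between U+D800 and U+DFFF, which is above U+3FFF, so each one takes four bytes and a pair always takes eight.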

Advantages:

  • It is the only Unicode character set for EBCDIC platforms.
  • The length of SQL CHAR types can be specified in number of characters.
  • The binary order of SQL CHAR columns is the same as the binary order of SQL NCHAR columns if the data consists of the same supplementary characters. Consequently, these columns sort the same for identical strings.[1]

Disadvantages:

  • Supplementary characters occupy eight bytes (two 4-byte characters) instead of four or five. Consequently, supplementary characters need to be converted.
  • UTFE is not a Unicode standard encoding. Clients requiring UTF-8 encoding must convert data on retrieval and storage.[1]

References

  1. Baird, Cathy; Chiba, Dan; Chu, Winson; Fan, Jessica; Ho, Claire; Law, Simon; Lee, Geoff; Linsley, Peter; Matsuda, Keni; Oscroft, Tamzin; Takeda, Shige; Tanaka, Linus; Tozawa, Makoto; Trute, Barry; Tsujimoto, Mayumi; Wu, Ying; Yau, Michael; Yu, Tim; Wang, Chao; Wong, Simon; Zhang, Weiran; Zheng, Lei; Zhu, Yan; Moore, Valarie (2002) [1996]. "Appendix A: Locale Data". Oracle9i Database Globalization Support Guide (PDF) (Release 2 (9.2) ed.). Oracle Corporation. Oracle A96529-01. Archived (PDF) from the original on 2017-02-14. Retrieved 2017-02-14.
This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.