Unicode Format (UTF)

Unicode Format

In the Unicode standard, every character is assigned a unique number called a Code Point. Code Points range from 0h to 10FFFFh, so the range contains 1,114,112 Code Points in total. A Code Point is written as U+nnnn in hexadecimal (e.g. U+0041 is 'A'). Several methods, called Unicode Transformation Formats (UTF), are defined to represent Code Points in different forms.
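
As a small illustration of the numbering described above, the following Python sketch maps a character to its Code Point label and back. The helper name code_point_label and the two constants are made up for this example; ord and chr are Python built-ins.

```python
def code_point_label(ch: str) -> str:
    """Return the U+nnnn label (at least four hex digits) for one character."""
    return f"U+{ord(ch):04X}"

# Hypothetical constants for this example only.
MAX_CODE_POINT = 0x10FFFF
TOTAL_CODE_POINTS = MAX_CODE_POINT + 1  # 0h through 10FFFFh inclusive

print(code_point_label("A"))  # U+0041
print(chr(0x41))              # A
print(TOTAL_CODE_POINTS)      # 1114112
```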

UTF-32

UTF-32 represents each Code Point as a single 32-bit unsigned integer. This is the most straightforward method. However, it is rarely used for files and data exchange, because every character requires 4 bytes of storage, which is wasteful for most text.
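
The fixed 4-byte cost can be seen with Python's built-in "utf-32-be" codec. This is only a sketch; the helper name utf32_values is an assumption for illustration.

```python
from typing import List

def utf32_values(text: str) -> List[int]:
    """Encode text as UTF-32 (big endian, no BOM) and return one integer per character."""
    data = text.encode("utf-32-be")
    return [int.from_bytes(data[i:i + 4], "big") for i in range(0, len(data), 4)]

sample = "A\u4e2d"  # 'A' followed by one Chinese character (U+4E2D)
print(len(sample.encode("utf-32-be")))         # 8 (2 characters x 4 bytes)
print([hex(v) for v in utf32_values(sample)])  # ['0x41', '0x4e2d']
```

Each 32-bit value is the Code Point itself, which is why no further decoding step is needed.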

UTF-16

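It is one of the most commonly used formats. Code Points from U+0000 to U+FFFF are represented as a single 16-bit integer, except for U+D800 to U+DFFF, which are reserved for surrogates and never represent characters by themselves. The remaining Code Points, U+10000 to U+10FFFF, are represented as two 16-bit integers (a surrogate pair).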
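
In UTF-16, a Code Point from U+10000 upward is split into a pair of 16-bit values known as a surrogate pair. The arithmetic can be sketched in Python as follows; the function name to_utf16_units is hypothetical, and the last line checks the result against Python's own "utf-16-be" codec.

```python
from typing import List

def to_utf16_units(code_point: int) -> List[int]:
    """Return the 16-bit unit(s) that UTF-16 uses for one Code Point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate values are not characters")
    if code_point < 0x10000:
        return [code_point]                # fits in one 16-bit integer
    offset = code_point - 0x10000          # a 20-bit value
    high = 0xD800 + (offset >> 10)         # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # lower 10 bits
    return [high, low]

print([hex(u) for u in to_utf16_units(0x4E2D)])   # ['0x4e2d']
print([hex(u) for u in to_utf16_units(0x1F600)])  # ['0xd83d', '0xde00']
print("\U0001F600".encode("utf-16-be").hex(" "))  # d8 3d de 00
```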

In practice, the Code Points that require two 16-bit integers are rarely used, so many people simply treat Unicode Code Points as 16-bit values. UTF-16 is further divided into two sub-types according to byte order: UTF-16 Little Endian and UTF-16 Big Endian. Please refer to Little-Endian vs. Big-Endian for additional information.
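
The difference between the two byte orders is easy to see on a single character. A minimal Python sketch, using the built-in "utf-16-le" and "utf-16-be" codecs (which write no BOM):

```python
sample = "A"  # U+0041

little = sample.encode("utf-16-le")  # low byte first
big = sample.encode("utf-16-be")     # high byte first

print(little.hex(" "))  # 41 00
print(big.hex(" "))     # 00 41
```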

Note: Unifier simply labels UTF-16 as Unicode in Unicode Format Options.

UTF-8

UTF-8 stores each Code Point as a variable-length sequence of 1 to 4 bytes. Code Points from U+0000 to U+007F (ASCII) are represented in a single byte, so an ASCII string keeps the same size, and the same bytes, when it is converted to UTF-8.
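
The ASCII compatibility can be checked directly. In this Python sketch the sample string is arbitrary:

```python
ascii_text = "Hello, world"

as_ascii = ascii_text.encode("ascii")
as_utf8 = ascii_text.encode("utf-8")

print(as_ascii == as_utf8)  # True
print(len(as_utf8))         # 12
```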

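Code Points from U+0080 to U+07FF take two bytes, those from U+0800 to U+FFFF take three bytes, and those above U+FFFF take four bytes. Most Code Points in the 16-bit range, including Chinese and Japanese characters, fall in the three-byte group. If a string in an Asian language (e.g. Chinese, Japanese) is converted from UTF-16 to UTF-8, its size therefore increases by about 50%.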
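
The sequence lengths and the size increase for Chinese text can be measured with a short Python sketch. The helper name utf8_length and the two-character sample string are assumptions for illustration.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes UTF-8 needs for a single character."""
    return len(ch.encode("utf-8"))

for ch in ("A", "\u00e9", "\u4e2d", "\U0001F600"):
    print(f"U+{ord(ch):04X}: {utf8_length(ch)} byte(s)")
# U+0041: 1 byte(s)
# U+00E9: 2 byte(s)
# U+4E2D: 3 byte(s)
# U+1F600: 4 byte(s)

chinese = "\u4e2d\u6587"                       # two Chinese characters
utf16_size = len(chinese.encode("utf-16-le"))  # 4 bytes, no BOM
utf8_size = len(chinese.encode("utf-8"))       # 6 bytes
growth = (utf8_size - utf16_size) / utf16_size
print(f"{growth:.0%}")                         # 50%
```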

UTF-8 is widely used in byte-oriented systems, especially the Web and e-mail. Most Web browsers and e-mail clients support Unicode in UTF-8 format. It is the recommended format for HTML-based files.

UTF-7

UTF-7 is similar to UTF-8, but it encodes Code Points in 7-bit units. Some older e-mail systems can transmit data in 7-bit units only, and UTF-7 was designed for compatibility with these systems. It is rarely used.
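
The 7-bit property can be checked with Python's built-in "utf-7" codec. This sketch only verifies that every encoded byte stays below 80h and that the text survives a round trip; the sample string is arbitrary.

```python
text = "Price: \u00a31"  # contains the non-ASCII pound sign U+00A3

encoded = text.encode("utf-7")
seven_bit_only = all(byte < 0x80 for byte in encoded)
round_trip = encoded.decode("utf-7")

print(encoded)             # only 7-bit (ASCII) bytes appear here
print(seven_bit_only)      # True
print(round_trip == text)  # True
```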

Byte-Order Mark (BOM)

To identify the transformation format used, a special character, U+FEFF, is placed as a signature at the beginning of the data stream. The BOM is useful when the file or data stream is not known to be in either big-endian or little-endian format. For example, all text files on the Windows platform carry the same .txt extension, so the BOM is used to identify a text file as plain ASCII (no BOM), UTF-16 or UTF-8. The Notepad application in Windows 2000/XP recognizes the BOM properly.

The exact bytes that make up the BOM for each format are listed below.

Unicode Transformation Format	Byte Order Mark (Hex)
UTF-32 Big Endian	00 00 FE FF
UTF-32 Little Endian	FF FE 00 00
UTF-16 Big Endian	FE FF
UTF-16 Little Endian	FF FE
UTF-8	EF BB BF
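
The table above can be turned into a simple detection routine. The following Python sketch is one possible approach, not how any particular application does it; the function name detect_bom and the returned labels are assumptions, while the BOM constants come from Python's standard codecs module.

```python
import codecs
from typing import Optional

# The UTF-32 entries come before the UTF-16 entries because the UTF-32
# Little Endian BOM (FF FE 00 00) begins with the UTF-16 Little Endian BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32 Big Endian"),
    (codecs.BOM_UTF32_LE, "UTF-32 Little Endian"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16 Big Endian"),
    (codecs.BOM_UTF16_LE, "UTF-16 Little Endian"),
]

def detect_bom(data: bytes) -> Optional[str]:
    """Return the format named by a leading BOM, or None if no BOM is present."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None

print(detect_bom(b"\xef\xbb\xbfhello"))  # UTF-8
print(detect_bom(b"\xff\xfeA\x00"))      # UTF-16 Little Endian
print(detect_bom(b"hello"))              # None
```

Note that FF FE 00 00 could in principle also be UTF-16 Little Endian text that starts with a NUL character; this sketch assumes the UTF-32 reading, so the result should be treated as a hint rather than proof.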

Note: Unifier supports conversion to UTF-16 (simply called Unicode) and UTF-8 format.

See Also
Little-Endian vs. Big-Endian