Unicode Codec Tool

What is Unicode Codec?

The HTML standard and the Java programming language each define a way to represent Unicode code points using only plain ASCII characters. Unicode Codec is a tool that converts between raw Unicode characters and these encoded code points.

HTML Character Reference

In HTML, any Unicode character can be represented by a character reference. Here are some examples.

&#174;    is the ® character
&#x2660;  is the black spade suit symbol (♠)

The code point can be written as a decimal or hexadecimal number. Some commonly used characters can also be written as character entity references, which are easier to remember. For example, &reg; represents the ® character.

For more information, see the "What is HTML Character Reference?" section.
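
As an illustration of the decimal and hexadecimal forms described above, here is a minimal Java sketch. The class name HtmlNumericRefSketch and its encode method are hypothetical and are not the tool's actual implementation. The sketch assumes that plain ASCII passes through unchanged and does not escape markup characters such as & or <.

```java
// Hypothetical helper for illustration only; not the tool's own code.
final class HtmlNumericRefSketch {

    /**
     * Replaces every non-ASCII code point with a numeric character reference.
     * With hex == true the hexadecimal form is produced, otherwise the decimal form.
     */
    static String encode(String text, boolean hex) {
        StringBuilder out = new StringBuilder();
        // codePoints() yields whole code points, so characters beyond 0xffff
        // become a single reference rather than two surrogate halves.
        text.codePoints().forEach(cp -> {
            if (cp < 0x80) {
                out.appendCodePoint(cp); // assumption: ASCII is left as-is
            } else if (hex) {
                out.append("&#x").append(Integer.toHexString(cp).toUpperCase()).append(';');
            } else {
                out.append("&#").append(cp).append(';');
            }
        });
        return out.toString();
    }
}
```

With this sketch, encoding the ® character in decimal mode gives &#174;, and encoding the black spade suit symbol in hexadecimal mode gives &#x2660;.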

Java Unicode Notation

In Java source code, any Unicode character can be represented by a '\u' escape sequence. The '\u' token must be followed by exactly four hexadecimal digits. Characters beyond 0xffff must be represented by surrogate pairs, that is, by two consecutive '\u' escapes.

Here are some examples.

\u00f7   is ÷
\u2194 is ↔
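
The following Java sketch shows one way such escapes could be produced. The class name JavaEscapeSketch is hypothetical, and leaving ASCII unchanged is an assumption. Because a Java String is a sequence of UTF-16 code units, walking it one char at a time yields a surrogate pair automatically for any character beyond 0xffff.

```java
// Hypothetical helper for illustration only; not the tool's own code.
final class JavaEscapeSketch {

    /** Replaces every non-ASCII UTF-16 code unit with a 4-digit hexadecimal escape. */
    static String encode(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 0x80) {
                out.append(c); // assumption: ASCII is left as-is
            } else {
                // A supplementary character occupies two chars (high and low
                // surrogate), so it is written as two escapes in a row.
                out.append(String.format("\\u%04x", (int) c));
            }
        }
        return out.toString();
    }
}
```

For the ÷ character this returns \u00f7. For U+1F600, which lies beyond 0xffff, it returns the surrogate pair \ud83d\ude00.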

Unicode Codec

The Unicode Codec window, shown below, can be opened with the F5 key or the Tools | Unicode Codec menu command. The upper text field holds the raw Unicode text and the lower one holds the encoded text. Both encoding and decoding are supported.

    [Screenshot: Unicode Codec window (unicodecodec1)]

Text Encoding

1) Select the required codec (HTML or Java).
2) For HTML encoding, additional options are available. Click the Option button to set them.
3) Type the text into the upper text field.
4) Press the Encode button.

Text Decoding

1) Select the appropriate codec (HTML or Java).
2) Type the text into the lower text field.
3) Press the Decode button.
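
To give a rough idea of what decoding involves for each codec, here is a minimal Java sketch. The class DecodeSketch and its methods are hypothetical and do not describe how the tool itself is implemented. The sketch handles only numeric HTML references (not named entities), and its Java decoder does not treat an escaped backslash specially.

```java
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not the tool's own code.
final class DecodeSketch {
    // Numeric HTML references: hexadecimal (group 1) or decimal (group 2).
    private static final Pattern HTML_REF =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));");
    // Java escapes: a backslash, one or more 'u', then four hexadecimal digits.
    private static final Pattern JAVA_ESC =
            Pattern.compile("\\\\u+([0-9a-fA-F]{4})");

    static String decodeHtml(String text) {
        return replaceAll(HTML_REF, text, m -> {
            int cp = m.group(1) != null
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2));
            // Leave the reference untouched if it is not a valid code point.
            return Character.isValidCodePoint(cp)
                    ? new String(Character.toChars(cp))
                    : m.group();
        });
    }

    static String decodeJava(String text) {
        // Each escape is one UTF-16 code unit; adjacent surrogate escapes
        // recombine into a single character once both are appended.
        return replaceAll(JAVA_ESC, text,
                m -> String.valueOf((char) Integer.parseInt(m.group(1), 16)));
    }

    private static String replaceAll(Pattern p, String text, Function<Matcher, String> f) {
        Matcher m = p.matcher(text);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(text, last, m.start()); // copy the text between matches
            out.append(f.apply(m));            // copy the decoded character
            last = m.end();
        }
        return out.append(text, last, text.length()).toString();
    }
}
```

With this sketch, decodeHtml turns &#174; back into the ® character, and decodeJava turns \u00f7 back into ÷.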

HTML Encoding Option

In HTML, a Unicode character can be represented as a decimal or hexadecimal number, and some commonly used characters can also be represented as character entities. Encoding to HTML character references therefore has additional options, which are set in the option window shown below.

[Screenshot: HTML encoding option window (unicodecodec2)]

All of these options apply to the encoder only. If the "Prefer character entities to numeric entities" checkbox is enabled, the encoder first tries to convert the Unicode text into HTML character entities. Characters that have no character entity are converted to numeric entities.
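
The fallback behaviour described above can be sketched in Java as follows. The class EntityPreferenceSketch, its preferEntities parameter and its five-entry entity table are hypothetical; the real HTML entity table is much larger. Numeric output is shown in decimal only to keep the sketch short.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not the tool's own code.
final class EntityPreferenceSketch {
    // Tiny illustrative subset of the HTML named character entities.
    private static final Map<Integer, String> NAMED = new HashMap<>();
    static {
        NAMED.put(0x00AE, "reg");
        NAMED.put(0x00A9, "copy");
        NAMED.put(0x00F7, "divide");
        NAMED.put(0x2194, "harr");
        NAMED.put(0x2660, "spades");
    }

    static String encode(String text, boolean preferEntities) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            // Look up a name only when the preference is switched on.
            String name = preferEntities ? NAMED.get(cp) : null;
            if (cp < 0x80) {
                out.appendCodePoint(cp); // assumption: ASCII is left as-is
            } else if (name != null) {
                out.append('&').append(name).append(';');
            } else {
                // No entity known (or preference off): numeric fallback.
                out.append("&#").append(cp).append(';');
            }
        });
        return out.toString();
    }
}
```

With the preference enabled, the ® character becomes &reg;, while a character missing from the table, such as U+2603, falls back to &#9731;. With the preference disabled, ® becomes &#174;.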

Tips

·	You may drag and drop a text file onto the text boxes. The program detects the presence of a Unicode BOM and can read Unicode, Unicode Big Endian and UTF-8 text files. If no BOM is found, the program reads the file using the default system code page.
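
BOM detection of this kind can be sketched in Java as shown below. The class BomSketch is hypothetical. The sketch assumes that "Unicode" means UTF-16 little endian and "Unicode Big Endian" means UTF-16 big endian, following the usual Windows naming. It uses Charset.defaultCharset() as a stand-in for the system code page, which may not match the code page on every Java version, and it does not check for UTF-32 BOMs.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not the tool's own code.
final class BomSketch {

    /** Guesses the charset from the first bytes of a file. */
    static Charset detect(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) {
            return StandardCharsets.UTF_8;     // UTF-8 BOM
        }
        if (startsWith(head, 0xFF, 0xFE)) {
            return StandardCharsets.UTF_16LE;  // assumed to be "Unicode"
        }
        if (startsWith(head, 0xFE, 0xFF)) {
            return StandardCharsets.UTF_16BE;  // assumed to be "Unicode Big Endian"
        }
        return Charset.defaultCharset();       // no BOM: fall back to the default
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            // Mask to compare as unsigned, since Java bytes are signed.
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

The BOM bytes themselves are not part of the text, so a reader built on this sketch would skip them before decoding the rest of the file.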

See Also
What is HTML Character Reference?
HTML Character Reference Table