I thought I’d write this blog now in order to give you all a heads-up on what’s coming soon in Toad for Oracle (more to follow...).
As some of you may already be aware, we are releasing our next version of Toad (version 10.0) later this year and it will have one or two “under the cover” differences. Unicode support is one of those differences and is probably one of the most significant technology upgrades in Toad’s history.
Now, for those of you not yet using Unicode in your Oracle database, this may not be of much interest, but I’ve found loads of companies either using (or planning to use) Unicode-enabled Oracle databases as far afield as Hong Kong, Scandinavia, UK, Germany and North America for all sorts of languages including Simplified Chinese, Japanese, Korean, Russian, Greek and Hebrew.
Why? Because Unicode helps solve the problem of the world’s computers and software not being able to recognise languages from different countries. If we could define and group together collections of characters for all the world’s different languages, and assign a code to each character, then each code could be represented in a computer using bytes. Unicode therefore means “one code”.
The purpose of Part 1 of this blog is to explain in general terms what the problem is that Unicode fixes, what Unicode is and also explain how Oracle implements Unicode in their database.
What is the problem?
With the increase in globalization and with so many companies having offices and databases around the globe, dealing with different languages has always been a challenge. There is no single character set that contains enough characters to represent the needs of the many languages participating in e-business operations today. Although many countries speak English, the character set the English language uses (7-bit ASCII) simply cannot encode all the various letters, punctuation and technical symbols used by other countries.
ASCII and other national character sets are not globally universal and cannot handle multiple languages. Even the various national character sets available within Europe are based on ASCII and are unable to store characters from languages such as Chinese, Korean and Russian. There are also potential conflicts between character sets because they might share the same numeric code value.
In the late 1980s, several organizations started working on dealing with these problems and try to come up with a global character set. This work continued in earnest during the 1990s with the advent of the World Wide Web. With so many organizations now conducting business over the internet, the need for a global character set to support all major scripts, legacy data and implementations, conform to international standards and enable the worldwide interchange of data has become crucial.
So what actually is Unicode?
There are many definitions available of what Unicode is. I’ve chosen the Wikipedia definition:
Unicode is a computingindustry standard allowing computers to consistently represent and manipulate text expressed in most of the world's writing systems. Developed in tandem with the Universal Character Set standard and published in book form as The Unicode Standard, Unicode consists of a repertoire of more than 100,000 characters, a set of code charts for visual reference, an encoding methodology and set of standard character encodings, an enumeration of character properties such as upper and lower case, a set of reference data computer files, and a number of related items, such as character properties, rules for normalization, decomposition, collation, rendering and bidirectional display order (for the correct display of text containing both right-to-left scripts, such as Arabic or Hebrew, and left-to-right scripts).
The Unicode Consortium, the non-profit organization that coordinates Unicode's development, has the ambitious goal of eventually replacing existing character encoding schemes with Unicode and its standard Unicode Transformation Format (UTF) schemes, as many of the existing schemes are limited in size and scope and are incompatible with multilingual environments.
In a nutshell: Unicode is a universal encoded character set that allows you to store information from any language.
Unfortunately, Unicode means different things to different people. I’ve heard talk about UTF-8 as though it’s Unicode, so there is quite a bit of confusion on the subject. Below you will find a generalized glossary of Unicode terms which you might find helpful.
“UTF-8 is not Unicode, an encoding is not a character set, a character set is not Unicode and Multi-byte is not Unicode or a character set. Unicode does not specify languages but scripts. A Font is not Unicode and not all fonts can display all Unicode characters!” – Mark Lerch
Glossary of (some) Unicode Terms
Encodings
An Encoding is a way to get the platonic character in the Unicode map out of the clouds and down into a computer file. There are a variety of encodings to serve various purposes. Why not just represent every character with 4 bytes and be done with it? One answer is because that particular method would instantly quadruple the size of every document, email, web page and everything else. Therefore some of these encodings were created for size, some for optimization, some for memory and so forth.
Encodings transform the Unicode Code Points into particular byte value formats, and are therefore called “Unicode Transformation Formats,” or “UTFs.”
Byte Order Mark (BOM)
How does an application know which encoding was used when a file was created? It knows through the use of a “Byte Order Mark.” This special set of bytes is placed at the very beginning of a file to indicate whether a file was encoded using UTF-8, UTF-16 or UTF-32.
Character Set – a set of characters available to be digitally represented by a computer. Each character is assigned a unique character code (a sequence of byte values) which is used to represent it. The ASCII characters (mainly the English alphabet) have the same character code in all character sets. Other characters, such as the Euro, may have a different character code in each character set, if it is present at all. The term “Character Set” is itself somewhat ambiguous and controversial. One could further qualify a character set as “single byte” (Western scripts), “double byte” (Asian scripts) or “unicode” (all scripts).
SBCS – Single Byte Character Set. Defines a character set which encodes using a single byte per character. All the earlier and traditional Western character sets are SBCS.
DBCS – Double Byte Character Set. One use of this term is to mean a character set which encodes using two bytes per character. This, in itself, is ambiguous because Double Byte Character Sets also encode using one byte if they are encoding an ASCII character. Another use of this term is to indicate a Unicode encoding which uses two bytes – UTF16. This is an incorrect use of the term. The most common use of the term is to indicate a CJK (Chinese/Japanese/Korean) character set. These are not Unicode character sets, but rather very large character sets requiring two bytes for each character.
MBCS – Multi Byte Character Set. A controversial term, it defines a character set which uses variable-width encoding. The number of bytes used to store each character is a function of the encoding, not the character set, thus the confusion. Furthermore, when using UTF-8 and only Latin characters, there will always be a single byte written per character. This can hardly be called “multi-byte” even though a Character Set is being used which can potentially require more than one byte per character.
System Locale - In use prior to Windows XP, this phrase referred to the code pages which were the defaults for the system. Only applications which are not fully Unicode use them. In XP this phrase has been replaced with “Language for non-Unicode programs.” [2]
So how does Oracle implement Unicode in their databases?
Oracle's support of Unicode is quite comprehensive. Oracle Database 11g, for example, provides full support for Unicode 5.0, the current standard for multilingual support. This support allows customers to develop, deploy, and host multiple languages in a single central database or as part of a grid. Oracle also offers the flexibility to store all data in a Unicode database in UTF-8 or to incrementally store select columns in the Unicode datatype in UTF-8 or UTF-16.
Oracle-supported Unicode Character Sets
Oracle has been supporting Unicode since version 7.2. This table shows the chronology of Oracle’s character set support:
AL24UTFFSS
AL24UTFFSS was the first Unicode character set supported by Oracle. It was introduced in Oracle 7.2. The AL24UTFFSS encoding scheme was based on the Unicode 1.1 standard, which is now obsolete. AL24UTFFSS has been de-supported from Oracle 9i. The migration path for existing AL24UTFFSS databases is to upgrade the database to 8.0 or 8.1, then upgrade the character set to UTF8 before upgrading the database further to 9i or 10g.
UTF-8 is a variable-length encoding mechanism. It uses between 1 and 4 bytes to represent any one of the million-plus Unicode characters. It has at least two characteristics which are very appealing: it is completely backward-compatible with ASCII formatted files and it typically results in the smallest encoded string.
UTF8 was the UTF-8 encoded character set introduced in Oracle 8 and 8i. It followed the Unicode 2.1 standard between Oracle 8.0 and 8.1.6, and was upgraded to Unicode version 3.0 for versions 8.1.7, 9i, 10g and 11g. To maintain compatibility with existing installations this character set will remain at Unicode 3.0 in future Oracle releases. Although specific supplementary characters were not assigned to Unicode until version 3.1, the allocation for these characters were already defined in 3.0, so if supplementary characters are inserted in a UTF8 database, it will not corrupt the actual data inside the database. They will be treated as 2 separate undefined characters, occupying 6 bytes in storage. Oracle recommends that customers switch to AL32UTF8 for full supplementary character support.
UTFE
This is the UTF8 database character set for the EBCDIC platforms. It has the same properties as UTF8 on ASCII based platforms. The EBCDIC Unicode transformation format is documented in Unicode Technical Report #1 UTF-EBCDIC. Which can be found at
http://www.unicode.org/unicode/reports/tr16/
AL32UTF8
This is the UTF-8 encoded character set introduced in Oracle 9i. AL32UTF8 is the database character set that supports the latest version (5.0 in Oracle 11.1) of the Unicode standard. It also provides support for the newly defined supplementary characters. All supplementary characters are stored as 4 bytes. AL32UTF8 was introduced because when UTF8 was designed there wasn’t the concept of supplementary characters; there UTF8 has a maximum of 3 bytes per character. Changing the design of UTF8 would break backward compatibility, so a new character set was introduced. The introduction of surrogate pairs should mean that no significant architecture changes are needed in future versions of the Unicode standard, so currently the plan is to keep enhancing AL32UTF8 as necessary to support future versions of the Unicode standard. For example, in Oracle 10.1 this character set was implemented the Unicode 3.2 standard, in Oracle 10.2 that has been updated to support the Unicode 4.01 standard and in Oracle 11.1 to the Unicode 5.0 standard.
Please note that pre-Oracle 9 software can have some serious problems connecting to a AL32UTF8 database.
UTF-16 is a variable length encoding which uses 2 or 4 bytes to represent every Unicode Code Point. For characters in the BMP (Plane 0), 2 bytes are used. For characters in other Planes, 4 bytes are used, which are called surrogate pairs.
AL16UTF16
This is the first UTF-16 encoded character set in Oracle. It was introduced in Oracle 9i as the default national character set (NLS_NCHAR_CHARACTERSET). AL16UTF16 supports the latest version (5.0 in Oracle 11.1) of the Unicode standard. It also provides support for the newly define supplementary characters. All supplementary characters are stored as 4 bytes. As with AL32UTF8, the plan is to keep enhancing AL16UTF16 as necessary to support future versions of the Unicode standard. AL16UTF16 cannot be used as a database character set (NLS_CHARACTERSET), only as the national character set (NLS_NCHAR_CHARACTERSET). The database character set is used to identify and to hold SQL, SQL metadata and PL/SQL source code. It must have either single byte 7-bit ASCII or single byte EBCDIC as a subset, whichever is native to the deployment platform.
Therefore, it is not possible to use a fixed-width, multi-byte character set (such as AL16UTF16) as the database character set. Trying to create a database with AL16UTF16 as the database character set in 9i and up will give “ORA-12706: THIS CREATE DATABASE CHARACTER SET IS NOT ALLOWED”. AL16UTF16 is always in Big Endian byte order, regardless of the processor endianess. [3]
What if I want to migrate some of my databases to Unicode?
Johannes Ahrends is one of our Oracle and Unicode experts based in our Cologne office in Germany and is a regular Toad World blogger. He has been performing a lot of testing of Unicode in Oracle and makes some observations and recommendations for those thinking of such a migration.
- The size of a CHAR/VARCHAR: the maximum size of a character field is still in bytes. So for example a VARCHAR2 field in the Oracle space can contain only 4000 bytes whereas a CHAR field only 2000 bytes. This can cause problems if you migrate to Unicode as for example the Western European special characters like the Euro symbol or the German “umlaut” now has a length of two bytes instead of one.
- CLOB: If the database is running with a Unicode character set all CLOBs are stored with a fixed length character set. Up to 9i it was UCS2. Since Oracle10g its AL16UTF18. So by changing the database character set to Unicode your CLOB rows will double in size (Metalink Note 257772.1)
- To migrate to a Unicode database its most of the time necessary to unload and load all data (e.g. create a new database). The only way for a direct update is if you are on US7ASCII as this is a clear subset of all Unicode character sets.
- As a best practice the table definitions should have the character semantic (e.g. VARCHAR2(50 CHAR)) as the default is still “BYTE”. This default can be changed on session or system level with the parameter NLS_LENGTH_SEMANTICS.
- Johannes has written a blog specifically about how to migrate a database to Unicode which you can find here: http://www.toadworld.com/Community/Blogs/tabid/67/EntryID/306/Default.aspx
Thank you to Mark Lerch from the Toad Development Team for his kind permission to reproduce articles he has written on the subject.
There is often confusion as to how to setup your Windows PC/notebook to actually work with Unicode data in an IDE such as Toad.
Part 2 of this blog will help you understand what you need to do to take full advantage of Toad 10’s ability to work with Unicode.
[2] Mark Lerch – Toad development