Unicode 101

Unicode is a standard for encoding, representing, and processing text in computer software. The standard is updated and maintained by the Unicode Consortium (https://home.unicode.org/). The main goal of Unicode is to allow multiple languages to be used and combined in software without having to switch between code pages.

Code Points

In Unicode, every symbol is identified by its unique code point: a number that identifies the symbol. The current version of the Unicode standard (13.0) has indexed over 140,000 symbols. The first 128 code points are compatible with the ASCII set. The website https://unicode-table.com/ provides a reference for all code points available in Unicode.
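
The relationship between characters and code points can be shown with a short Python snippet (Python is used here purely as a neutral illustration; it is not DataFlex code):

    # ord() returns the code point of a character, chr() does the reverse.
    print(hex(ord("A")))     # 0x41   -> inside the ASCII-compatible range
    print(hex(ord("ö")))     # 0xf6   -> U+00F6
    print(hex(ord("€")))     # 0x20ac -> U+20AC
    print(chr(0x1F600))      # a code point outside the first 65,536 (an emoji)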

Composite Characters

While most characters consist of only a single code point, many can also be built by combining multiple code points. These are called composite characters; common examples are characters with diacritics. In many cases, composite characters can be represented in multiple forms. For example, the ö character is available as a single code point (U+00F6) and as a combination of two code points (U+006F and U+0308).
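
The two forms can be made visible with a small Python illustration (not DataFlex code); both strings render as the same ö on screen but contain different code point sequences:

    single   = "\u00F6"           # LATIN SMALL LETTER O WITH DIAERESIS
    combined = "\u006F\u0308"     # LATIN SMALL LETTER O + COMBINING DIAERESIS
    print(single, combined)             # both display as ö
    print(len(single), len(combined))   # 1 versus 2 code points
    print(single == combined)           # False: the sequences differ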

The existence of multiple forms that represent the same text can cause difficulties when searching in strings. For example, the Pos function or the Contains operator may be looking for one form while the data contains the other. To prevent these difficulties, strings can be normalized to the same form using the NormalizeString function.
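
The following Python sketch (illustrative only; in DataFlex the NormalizeString function serves the same purpose) shows how a search can fail across forms and how normalization fixes it:

    import unicodedata

    haystack = "K\u006F\u0308ln"    # "Köln" stored in decomposed form
    needle   = "K\u00F6ln"          # search text entered in precomposed form
    print(needle in haystack)       # False: the forms differ

    # Normalizing both sides to the same form (NFC here) makes the search work.
    print(unicodedata.normalize("NFC", needle) in
          unicodedata.normalize("NFC", haystack))    # True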

Encodings

The Unicode standard defines multiple encodings. These encodings define how the code points are represented in binary data. The code points as described above are the same for each encoding; the difference is how the individual code points are stored.

UTF-8

With UTF-8, code points are broken down into a maximum of 4 code units of 1 byte each. The first 128 code points are stored in a single code unit, while higher code points need 2, 3 or 4 code units / bytes. This makes UTF-8 the most compact encoding for Western languages and for structured document formats like XML, JSON and HTML. UTF-8 also has the advantage of being partially compatible with ANSI and OEM for ASCII characters.
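
The variable byte counts can be seen in this Python illustration (not DataFlex code):

    # UTF-8 uses 1 to 4 one-byte code units per code point.
    for ch in ["A", "ö", "€", "😀"]:
        data = ch.encode("utf-8")
        print(f"U+{ord(ch):04X} -> {len(data)} byte(s): {data.hex(' ')}")
    # U+0041 -> 1 byte(s): 41
    # U+00F6 -> 2 byte(s): c3 b6
    # U+20AC -> 3 byte(s): e2 82 ac
    # U+1F600 -> 4 byte(s): f0 9f 98 80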

UTF-16

With UTF-16, code points are broken down into 1 or 2 code units of 2 bytes each. This means that the first 65,536 code points (also known as the Basic Multilingual Plane) all fit into a single code unit. Higher code points are encoded as two code units, together referred to as a surrogate pair. UCS-2 was the predecessor of UTF-16 and only supported those first 65,536 code points (so it always used 1 code unit / 2 bytes per code point).

The standard defines two variations of UTF-16, UTF-16BE and UTF-16LE, where the difference is the byte order within each code unit. The most common variant on Windows platforms is UTF-16LE.
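
Both the surrogate-pair mechanism and the byte-order difference can be illustrated in Python (not DataFlex code):

    # A BMP code point fits in one 2-byte code unit; higher code points need
    # a surrogate pair (two code units, 4 bytes).
    print("€".encode("utf-16-le").hex(" "))    # ac 20
    print("😀".encode("utf-16-le").hex(" "))   # 3d d8 00 de  (surrogate pair)

    # BE and LE differ only in the byte order within each code unit.
    print("€".encode("utf-16-be").hex(" "))    # 20 ac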

UTF-32

In this encoding, code units are 4 bytes long and every code point fits in a single code unit. This encoding is not commonly used because it requires significantly more memory and storage: even the most common ASCII characters require 4 bytes instead of 1.
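
For comparison, the same characters encoded as UTF-32 in a Python illustration (not DataFlex code):

    # UTF-32 always uses one 4-byte code unit per code point.
    for ch in ["A", "€", "😀"]:
        print(f"U+{ord(ch):04X}: {ch.encode('utf-32-le').hex(' ')}")
    # U+0041: 41 00 00 00
    # U+20AC: ac 20 00 00
    # U+1F600: 00 f6 01 00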

Unicode in Practice

While Unicode is being adopted by more and more software vendors, there is still a lot of non-Unicode software out there. And within Unicode, the different encodings cause further variation.

Windows started with ANSI, an 8-bit character encoding that uses code pages to define the meaning of the upper 128 characters. DataFlex predates Windows and uses OEM, which is compatible with ANSI except that the code pages differ. OEM is available in Windows as a separate set of code pages.
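
The difference between an ANSI and an OEM code page can be demonstrated in Python using two common Western European code pages (cp1252 for ANSI, cp850 for OEM); this is an illustration, not DataFlex code:

    print("é".encode("cp1252").hex(" "))   # e9      (ANSI, Windows-1252)
    print("é".encode("cp850").hex(" "))    # 82      (OEM, code page 850)
    print("é".encode("utf-8").hex(" "))    # c3 a9   (UTF-8 uses two bytes here)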

When Windows initially adopted Unicode, Microsoft moved all of its APIs to work with UCS-2, which used 16 bits per character. Instead of changing the existing APIs, they duplicated all APIs to maintain compatibility with non-Unicode software. Over time these UCS-2 APIs were migrated to UTF-16, which also supports characters outside the Basic Multilingual Plane through surrogate pairs.

On the web and in non-Windows environments, UTF-8 gained popularity, and in more recent Windows versions Microsoft has increased the support for UTF-8.

Character Translations in DataFlex 2021 and Higher

DataFlex 2021 represented a fundamental shift of the core character encoding from OEM to UTF-8 and, consequently, of the various translations between DataFlex and the other components of the system.

We chose UTF-8 as our new core encoding because it provides the best backwards compatibility, is native to the Web and is best for Western languages. Source code is stored as UTF-8 with a byte order mark (BOM). Non-ASCII characters are allowed in string literals and comments, and any source file that does not contain a BOM is interpreted as OEM.
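
The UTF-8 BOM is simply the three bytes EF BB BF at the start of the file, as this Python illustration shows (the utf-8-sig codec writes the BOM automatically; again, not DataFlex code):

    import codecs
    print(codecs.BOM_UTF8.hex(" "))            # ef bb bf
    print("Ö".encode("utf-8-sig").hex(" "))    # ef bb bf c3 96  (BOM + content)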

DataFlex 2021 and higher uses the wide Windows APIs and supports this through a new WString type that performs automatic conversions between UTF-8 and UTF-16.
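
Conceptually, the conversion such a bridge performs is a re-encoding of the same code points, sketched here in Python (an illustration of the idea, not the actual WString implementation):

    # The same text as UTF-8 bytes and as UTF-16LE bytes for a wide API call.
    utf8_bytes  = "Köln".encode("utf-8")
    utf16_bytes = utf8_bytes.decode("utf-8").encode("utf-16-le")
    print(utf8_bytes.hex(" "))    # 4b c3 b6 6c 6e
    print(utf16_bytes.hex(" "))   # 4b 00 f6 00 6c 00 6e 00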

We can see how DataFlex uses character translations in the following diagram:  

Character Translations in DataFlex 19.1 (and earlier)

Older revisions of DataFlex are not Unicode based; they use OEM as their main character encoding. This means that character translations are used throughout the system. Many of these translations are so-called lossy translations, where unsupported characters are stripped out. We can see how prior revisions of DataFlex used character translations in the following diagram:


See Also

Unicode in DataFlex

What's New in DataFlex 2022

What's New in DataFlex 2021