Unicode in DataFlex

Beginning with DataFlex 2021, DataFlex is fully Unicode. The language itself (compiler and runtime) works with UTF-8 as its default encoding. In practice, this means that source code is stored as UTF-8, so string literals and comments can contain any Unicode character. String variables also store their data in memory as UTF-8, so they too can contain any Unicode character. When communicating with external APIs, conversions to UTF-16 will take place. To make this easier from within the DataFlex language, there is a new WString type. Strings are automatically converted to UTF-16 when they are moved to this type.

String

In earlier versions, string variables could in practice contain any binary data, but the runtime always treated them as OEM strings. Strings are now treated as UTF-8, which means that every string function and command assumes the data to be UTF-8 encoded. Since UTF-8 is a variable-length encoding, a character can be more than a single byte long. The string functions have been adjusted for this: functions like Mid and Left treat their position and length parameters as character counts, not byte counts. The Length function likewise returns the number of characters in a string, which can differ from the number of bytes. A new function named SizeOfString returns the number of bytes used by a String.
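
As an illustration of character-based positions, here is a minimal sketch. It assumes the source file is saved as UTF-8 and that the accented letters are stored as single precomposed characters; the variable names are only illustrative. Note that Mid takes the length first and the start position second.

```dataflex
String sName sPart

Move "Zoë Müller" to sName

// Positions and lengths count characters, not bytes
Move (Left(sName, 3)) to sPart      // expected: "Zoë" (3 characters, 4 bytes)
Showln sPart

// Mid(string, length, start): 6 characters starting at character 5
Move (Mid(sName, 6, 5)) to sPart    // expected: "Müller"
Showln sPart
```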

It is still possible to convert a string to OEM or ANSI in memory, but instead of ToOEM and ToANSI this is now done using Utf8ToOEM and Utf8ToANSI.
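
A short sketch of such an in-memory conversion follows. It assumes the conversion functions take a string and return the converted string, and it uses AnsiToUtf8 (named later in this topic) for the way back; the variable names are illustrative.

```dataflex
String sUtf8 sAnsi sOem

Move "Crème brûlée" to sUtf8

// Convert the UTF-8 data to the legacy encodings in memory
Move (Utf8ToAnsi(sUtf8)) to sAnsi
Move (Utf8ToOem(sUtf8))  to sOem

// sAnsi and sOem now hold non-UTF-8 bytes, so they should only be
// handed to code that expects that encoding. Convert back before
// using regular string functions on the data.
Move (AnsiToUtf8(sAnsi)) to sUtf8
```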

Collating

String comparisons with Unicode are much more complicated than with OEM / ANSI. DataFlex uses the ICU Library for comparing strings according to the Unicode standards. Multiple collations are supported and can be configured via the new DF_LOCALE_CODE string attribute. This global attribute defaults to the language of the operating system. It can be changed at runtime and is configured using ISO 639 language codes. See http://www.localeplanet.com/icu/iso639.html for available codes. Note that when using the embedded database, the indexes are still built according to the old collating system configured via DF_Collate.cfg, for backwards compatibility.
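
The sketch below shows how the attribute might be read and changed at runtime. It assumes DF_LOCALE_CODE is accessed with Get_Attribute / Set_Attribute like other global attributes and that the comparison operators follow the configured locale; "sv" (Swedish) is chosen only because Swedish collation places "ä" after "z".

```dataflex
String sLocale sFirst sSecond

// Read the current collation locale (defaults to the OS language)
Get_Attribute DF_LOCALE_CODE to sLocale
Showln "Current locale: " sLocale

// Switch to Swedish collation rules at runtime
Set_Attribute DF_LOCALE_CODE to "sv"

Move "ä" to sFirst
Move "z" to sSecond

// Under Swedish rules "ä" is expected to sort after "z";
// under many other locales it sorts near "a"
If (sFirst > sSecond) Begin
    Showln "ä sorts after z"
End
Else Begin
    Showln "ä sorts before z"
End
```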

String Functions

SizeOfString

This function returns the size of a string in bytes (UTF-8 code units). This can differ from the value returned by the Length function, which returns the number of characters (Unicode code points) in a string.
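
A minimal sketch of the difference, assuming a UTF-8 source file and a precomposed "é" (U+00E9), which takes two bytes in UTF-8:

```dataflex
String sText
Integer iChars iBytes

Move "Café" to sText

Move (Length(sText))       to iChars   // characters: expected 4
Move (SizeOfString(sText)) to iBytes   // bytes: expected 5

Showln "Characters: " iChars ", bytes: " iBytes
```

Use SizeOfString, not Length, when sizing a buffer for the string's data.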

PointerToString

This function converts a pointer to string data in memory into a string. It can be used within an expression, and the resulting string can then be moved into a string variable. It replaces the special behavior that the Address (now Pointer) type had when an address was moved to a string. That implicit conversion is now illegal; PointerToString performs the same operation explicitly.
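
The following sketch shows the explicit conversion. For illustration the pointer is taken from a local string with AddressOf; in real code it would typically come from an external function. It assumes the pointer refers to zero-terminated UTF-8 data.

```dataflex
String sSource sCopy
Pointer pData

Move "Hello" to sSource
Move (AddressOf(sSource)) to pData

// "Move pData to sCopy" is no longer allowed;
// PointerToString reads the string data the pointer refers to
Move (PointerToString(pData)) to sCopy
Showln sCopy
```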

WString

A lot of external APIs, such as the Windows APIs, work with UTF-16 encoding. When calling these APIs, strings need to be converted, because DataFlex String variables are UTF-8 encoded. To make this easier, a new WString type was added to the language. When moving strings to / from this type, the data is automatically converted to UTF-16. WStrings can be passed to external functions as parameters or as a pointer (in case of a return buffer).

It is recommended to only use this WString type when actually calling an external API, and not instead of the regular String type. Even though string manipulations and string functions do work, internally the data is converted between UTF-16 and UTF-8 for each operation, which will slow down your application.

When working with COM there is no need to use the WString, as variants are already UTF-16 encoded (they always have been).

Example: WString Parameters

When an external function has a string as a parameter (technically this is usually a pointer to a string) like this:

External_Function WritePrivateProfileString "WritePrivateProfileStringA" Kernel32.dll ;
    String sSection String sKeyName String sValue String sFileName Returns Integer

Then converting it to its wide version is now as easy as changing it into:

External_Function WritePrivateProfileString "WritePrivateProfileStringW" Kernel32.dll ;
    WString sSection WString sKeyName WString sValue WString sFileName Returns Integer

When calling this function, you can simply use strings as parameters. These can come from a parameter, an expression or a constant without problems. The runtime will automatically convert them to a WString before actually calling the external function. As with string, the runtime is smart enough to pass a pointer to the wide string when executing the external function. So, the line below will work properly:

Move (WritePrivateProfileString(sSection, "", "", psFilename(Self))) to iRes

Example: WString with Pointer Parameters

It is common practice to define external APIs with Pointer (formerly also called Address) parameters. This is needed to allow passing 0 (NULL) as a parameter, or when the function returns a string in a buffer. With WString, this works the same way as it used to with String parameters.

External_Function GetModuleFileNameW "GetModuleFileNameW" Kernel32.dll ;
    Handle   hModule ;
    Pointer  lpFilename ;
    UInteger nSize ;
    Returns  UInteger

This function returns a string in the buffer that is passed. The size of the buffer is passed as separate parameter. Calling this function can be done like this:

Integer iNumChars
WString wApplicationFileName
String sApplicationFileName

Move (Repeat(Character(0), 1024)) to wApplicationFileName
Move (GetModuleFileNameW(0, AddressOf(wApplicationFileName), 1024)) to iNumChars
Move (CString(wApplicationFileName)) to sApplicationFileName

So, we define a WString and fill it with 1024 null characters. Do note that Repeat generates a UTF-8 string, which is then converted to UTF-16 when it is put into the WString buffer. Then we call the external function, passing a pointer to the WString. The external function changes the WString buffer. On the last line we convert the UTF-16 result string to a regular UTF-8 string.

The CString function is used to adjust the length of the string. DataFlex strings (both WString and String) can contain null (0) characters, while in other environments a 0 usually terminates the string. To support this, DataFlex strings store their length separately. The external function will adjust the content of the string and write a 0 terminator, but it will not change the stored length of the string (which remains 1024 characters). Calling the CString function fixes that by truncating the string at the first 0 character.

WString Functions

A couple of WString specific functions have been added to make working with WString easier:

SizeOfWString

This returns the number of WChars (UTF-16 code units of two bytes each) in a WString.
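
As a rough sketch, the same text can be measured in both representations. Moving a String into a WString performs the UTF-8 to UTF-16 conversion described above; the variable names are illustrative.

```dataflex
String  sText
WString wText
Integer iWChars iBytes

Move "Café" to sText

// Moving a String into a WString converts the data to UTF-16
Move sText to wText

Move (SizeOfWString(wText)) to iWChars   // UTF-16 code units
Move (SizeOfString(sText))  to iBytes    // UTF-8 bytes

Showln "UTF-16 code units: " iWChars ", UTF-8 bytes: " iBytes
```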

PointerToWString

This takes a pointer to a WString (or Char array with two bytes per character) as a parameter and returns a WString.
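
A typical use is an API that returns a pointer to a wide string. The sketch below uses the Windows function GetCommandLineW, which returns such a pointer; the External_Function declaration is an assumption written for this example, not taken from the DataFlex packages.

```dataflex
// Assumed declaration: GetCommandLineW takes no parameters and
// returns a pointer to a zero-terminated UTF-16 string
External_Function GetCommandLineW "GetCommandLineW" Kernel32.dll Returns Pointer

Pointer pCommandLine
WString wCommandLine
String  sCommandLine

Move (GetCommandLineW()) to pCommandLine

// Copy the UTF-16 data the pointer refers to into a WString
Move (PointerToWString(pCommandLine)) to wCommandLine

// Moving the WString to a String converts it to UTF-8
Move wCommandLine to sCommandLine
Showln sCommandLine
```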

External_Function Wrapper Functions

To properly support Unicode, DataFlex uses the wide versions of Windows API functions that use strings. A lot of these Windows APIs are called using External_Function within the DataFlex packages. We have updated our packages while maintaining as much backwards compatibility as possible. In most cases where we had to make changes that require changes in the calling code, we provide wrapper functions that perform the necessary conversions.

In most cases these wrapper functions are slower, so our own code usually does not use the wrapper functions but calls the wide versions (functions ending with a W) directly. It is recommended, but not mandatory, that developers also convert their interfaces to use the wide versions.

Direct_Output / Append_Output / Direct_Input

The file paths passed to these commands are now assumed to be regular UTF-8 strings and they can contain Unicode characters. Using the Read, ReadLn, Write and WriteLn commands does not perform any conversions on the data so strings will be written / interpreted as UTF-8 data. When working with OEM or ANSI files, the conversion functions (Utf8ToAnsi, AnsiToUtf8, OemToUtf8 and Utf8toOem) can be used to properly convert the data. Note that text files written with previous versions of DataFlex will usually contain OEM encoded strings unless conversions were made in the source code.
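
The sketch below shows one way these conversions might be applied around sequential file I/O. The file names are hypothetical, and the read loop is simplified (a final line without a line terminator may need extra handling).

```dataflex
String sLine sAnsiLine

// Write an ANSI file for a consumer that does not understand UTF-8
Direct_Output "customers_ansi.txt"
Move "Müller, Zoë" to sLine
Move (Utf8ToAnsi(sLine)) to sAnsiLine
Writeln sAnsiLine
Close_Output

// Read a text file written by an older DataFlex version (OEM encoded)
// and convert each line to UTF-8 before using it
Direct_Input "legacy_oem.txt"
Readln sLine
While (not(SeqEof))
    Move (OemToUtf8(sLine)) to sLine
    Showln sLine
    Readln sLine
Loop
Close_Input
```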

Database

The embedded database does not support Unicode and data written to it is converted to OEM by the runtime. It is backwards compatible and the database can be shared with older revisions of DataFlex. The sorting of the indexes is done according to the Df_collate.cfg in bin or bin64. Note that string comparisons in the language are now performed using the new Unicode comparisons and can be different than the embedded database collation. See Table Character Format in DataFlex 2021 for more information.

It is recommended to use SQL databases, with MS SQL as the recommended backend. To work with Unicode on MS SQL, use the NChar and NVarChar data types. Data is stored as UTF-16, and the SQL drivers will perform the necessary conversions for you.

Note that if the df_table_character_format attribute in your existing SQL tables is set to OEM, your data will be stored as OEM in the database. When converting fields to NVarChar or NChar the automatic conversion of your data by MS SQL will likely fail as it interprets your data as ANSI. So, it is recommended to convert your existing SQL data from OEM to ANSI before converting to Unicode data types.

Source Code

DataFlex uses UTF-8 encoding for all source code files. When you create new files in the Studio, they are automatically created as UTF-8 and use Byte Order Marks (or BOMs) to signify their encoding style. Source files from previous revisions of DataFlex used OEM encoding and those files do not need to be converted to UTF-8 to compile. Editing an older, OEM encoded, source file in the Studio will automatically convert it to the new UTF-8 encoding.

This change in the encoding of source files is one of the main reasons we recommend creating separate copies of your application and library workspaces before using DataFlex 2021 or higher.

You can elect to continue to use OEM encoding for source files (for backward compatibility) by configuring the Studio to “Save source files as OEM”. Select this option on the Editor tab of the Configure Studio menu option.

Strings for Binary Data

In the past, DataFlex strings were sometimes used to store binary data. We recommend no longer using that technique; use UChar arrays instead. The string functions and the debugger will now try to interpret strings as UTF-8 data, and binary data is in many cases not valid UTF-8. String functions no longer translate their parameters directly to memory offsets; they interpret them as character offsets and analyze the string to convert those to memory positions. This will go wrong when the data is not valid UTF-8. An added advantage of using UChar arrays is that Max_Argument_Size does not apply to them.
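
A minimal sketch of holding binary data in a UChar array follows. The byte values are arbitrary and chosen because the sequence is not valid UTF-8; the variable names are illustrative.

```dataflex
UChar[] ucData
Integer iSize iIndex

// Arbitrary bytes; 0xFF can never appear in valid UTF-8
Move 255 to ucData[0]
Move 0   to ucData[1]
Move 128 to ucData[2]

// The array size is the number of bytes, with no encoding involved
Move (SizeOfArray(ucData)) to iSize

For iIndex from 0 to (iSize - 1)
    Showln "Byte " iIndex ": " ucData[iIndex]
Loop
```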

Replace TYPE Definitions with Structs

The TYPE command should not be used anymore. This is an obsolete way of defining structs, and it is no longer recommended. Instead, use the Struct command for defining structs. The 2021 Studio generates compiler warnings when it encounters Type commands.

This also means that the commands ZeroType, FillType, GetBuff, GetBuff_String, Put, Put_String, ArrayPut, Size_of_field, StoreField and RetrieveField should not be used. Also, the automatically created [TypeName]_Size constant cannot be used; use SizeOfType() instead. Again, the 2021 Studio generates compiler warnings when these commands are encountered.
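
As a sketch of the recommended style, a structure that would previously have been declared with Type can be written as a Struct. The tRect layout and the member names are only an example.

```dataflex
// Declared with Struct instead of the obsolete Type command
Struct tRect
    Integer iLeft
    Integer iTop
    Integer iRight
    Integer iBottom
End_Struct

tRect rcWindow
Integer iSize

// Members are accessed directly; no Put / GetBuff commands are needed
Move 0   to rcWindow.iLeft
Move 0   to rcWindow.iTop
Move 640 to rcWindow.iRight
Move 480 to rcWindow.iBottom

// SizeOfType replaces the old automatically created tRect_Size constant
Move (SizeOfType(tRect)) to iSize
Showln "Size of tRect in bytes: " iSize
```

When an external function expects a pointer to the structure, AddressOf(rcWindow) can be passed, as with the WString buffer earlier in this topic.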

See Also

Unicode 101