Unicode Support in SocketTools

SocketTools includes support for Unicode in several different ways, depending on the edition and development tools used. This article discusses how Unicode support is implemented for the current version of each SocketTools edition.

Character Encoding and Unicode

Legacy applications for earlier versions of Windows typically used code pages and multi-byte character sets (MBCS), where characters can be represented by one or two bytes. These character sets were predominantly used in consumer versions of Windows prior to Windows 2000. In the Win32 API, those functions that accept string parameters were given an "A" (ANSI) suffix to indicate they expected 8-bit characters using the current code page.

The current standard is to use Unicode, which supports over 135,000 characters and more than 130 languages. In Windows, Unicode is natively implemented using UTF-16 encoding, which means that each character is represented by two (or sometimes four) bytes. Win32 functions that accept UTF-16 encoded strings (also called "wide strings") were given a "W" suffix to indicate they expected 16-bit characters. The SocketTools API uses the same conventions, with "A" and "W" versions of all functions that accept string parameters.

However, UTF-16 encoding presents a problem for most Internet protocols because they were designed with the presumption that a single character could be represented by a single byte. At a lower level, TCP also handles all data as "octets" which are 8-bit (byte) values. To address this, there is another Unicode encoding called UTF-8 which represents Unicode characters as a sequence of one to four bytes. Because the first 128 characters (code points 0 through 127) that make up the ASCII character set are encoded identically in UTF-8, an ASCII string is by definition also a UTF-8 string.
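The byte counts described above can be checked with a short sketch. It is written in Python purely for illustration and does not use the SocketTools API; the helper name encoded_sizes is hypothetical.

```python
def encoded_sizes(text):
    """Return (UTF-16 byte count, UTF-8 byte count) for a string."""
    # "utf-16-le" matches the byte order Windows uses and omits the byte order mark.
    return len(text.encode("utf-16-le")), len(text.encode("utf-8"))

# An ASCII letter, a Greek letter, the euro sign, and an emoji that lies
# outside the Basic Multilingual Plane (a surrogate pair in UTF-16).
for ch in ("A", "\u03b4", "\u20ac", "\U0001F600"):
    utf16_bytes, utf8_bytes = encoded_sizes(ch)
    print(f"U+{ord(ch):04X}: UTF-16 = {utf16_bytes} bytes, UTF-8 = {utf8_bytes} bytes")

# A pure ASCII protocol command is byte-for-byte identical when encoded as UTF-8.
command = "HELO example.com\r\n"
assert command.encode("ascii") == command.encode("utf-8")
```

The last assertion is the property that lets protocols designed around single-byte characters carry UTF-8 without changing their ASCII command syntax.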

Today, most services recognize and accept UTF-8 as a standard character encoding. Internet protocols have been changed or expanded to either explicitly opt in to UTF-8 encoding, or implicitly expect it as the default. However, there is one notable exception: the Domain Name System (DNS). It only accepts a limited subset of ASCII characters in domain name labels. Names that use Unicode characters are called Internationalized Domain Names (IDNs), and to make them compatible with DNS, they are converted to ASCII using an alternative encoding called Punycode.

Internationalized Domain Names

Domain names that contain Unicode characters must be encoded before they are resolved to an IP address. This is accomplished using an encoding called Punycode, which converts a Unicode domain name into a format that is compatible with DNS. For example, the domain name δοκιμή.sockettools.com is an IDN. When this name is passed to SocketTools, the library automatically encodes it using Punycode, resulting in the domain name xn--jxalpdlp.sockettools.com, which is then resolved to an IP address.
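The encoding step in this example can be reproduced with a minimal sketch. It uses Python's built-in "punycode" codec rather than anything from SocketTools, and the names ACE_PREFIX and to_ace are hypothetical. It skips the Nameprep step, so it assumes the input is already lower case and normalized.

```python
ACE_PREFIX = "xn--"  # marks a label that has been Punycode-encoded

def to_ace(hostname):
    """Encode each non-ASCII label of a host name with Punycode."""
    labels = []
    for label in hostname.split("."):
        if label.isascii():
            # ASCII labels are already valid in DNS and are left alone.
            labels.append(label)
        else:
            labels.append(ACE_PREFIX + label.encode("punycode").decode("ascii"))
    return ".".join(labels)

# The Greek label is written with escapes so the code points are unambiguous.
greek_host = "\u03b4\u03bf\u03ba\u03b9\u03bc\u03ae.sockettools.com"
print(to_ace(greek_host))
```

Only the label that contains Unicode characters changes; the ASCII labels pass through untouched, which is why the result still ends in sockettools.com.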

Previous versions of SocketTools required that the developer explicitly perform this encoding if necessary. With the current version, this is done automatically with any host name (or the host name portion of a URL) that contains Unicode characters. Whenever you call a function or method in SocketTools that requires a host name, that value will go through a normalization process that removes any leading or trailing whitespace, converts it to lower case and performs any necessary encoding. If the host name is specified using UTF-16 or UTF-8 encoding, it's also checked to ensure that the encoding is valid and invisible code points are removed (this process is referred to as Nameprep).
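The normalization steps listed above (trim, lower-case, validate and encode) can be approximated as follows. This is only a sketch in Python: normalize_host is a hypothetical name, and the built-in "idna" codec it relies on implements the older IDNA 2003 rules, so the exact behavior of SocketTools may differ.

```python
def normalize_host(name):
    """Trim, lower-case and IDNA-encode a host name."""
    cleaned = name.strip().lower()
    # Python's built-in "idna" codec applies Nameprep and Punycode to each
    # label, and raises UnicodeError if the name cannot be encoded.
    return cleaned.encode("idna").decode("ascii")

# Mixed case and surrounding whitespace are removed before encoding.
print(normalize_host("  \u0394\u03bf\u03ba\u03b9\u03bc\u03ae.SocketTools.COM "))

# A zero-width space (U+200B) is one of the invisible code points Nameprep drops.
print(normalize_host("sock\u200bettools.com"))
```

Lower-casing is done explicitly because the codec leaves names that are already pure ASCII unchanged.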

SocketTools .NET Edition

The SocketTools .NET classes use the managed System.String type, which is a UTF-16 encoded string. All methods that accept a string parameter will automatically encode it as UTF-8 before sending it to the remote host. Text data that is received will be converted from UTF-8 to a UTF-16 encoded string prior to being returned to the caller. This conversion only occurs with string types, not binary data exchanged using byte arrays. The versions of methods that use byte arrays or MemoryStream objects will exchange data as-is without any encoding.

This conversion can have implications for some methods, such as ReadLine and WriteLine which will convert strings between UTF-16 and UTF-8. For example, if you use WriteLine to send a text string to a service, that text will be automatically converted to UTF-8 and then written to the socket. The service that is reading the text that you’re sending must be able to recognize and process UTF-8 encoding correctly. It may choose to retain the UTF-8 encoding, or it may convert it back to UTF-16 prior to storing or displaying the text.
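To see why the receiving service has to be UTF-8 aware, consider what arrives on the wire. The sketch below is Python rather than .NET and does not call any SocketTools method; receive_line is a hypothetical stand-in for whatever the remote service does with the bytes it reads.

```python
text = "\u03b4\u03bf\u03ba\u03b9\u03bc\u03ae"   # the Greek word from the IDN example
wire = (text + "\r\n").encode("utf-8")          # the bytes that reach the socket

def receive_line(data, encoding):
    """Decode one received line and strip the line terminator."""
    return data.decode(encoding).rstrip("\r\n")

# Six Greek letters take two bytes each in UTF-8, plus the CRLF terminator.
print(len(wire))

# A service that decodes the line as UTF-8 recovers the original text.
print(receive_line(wire, "utf-8") == text)

# A service that assumes a single-byte code page sees twelve unrelated characters.
print(receive_line(wire, "latin-1") == text)
```

A service that treats the line as opaque bytes and stores it unchanged is also safe, since the UTF-8 encoding is retained.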

SocketTools Library Edition

The SocketTools libraries provide an API that includes both Unicode and ANSI versions of functions. If a function accepts a string parameter, or a structure which contains strings, the Unicode version of the function will have a "W" suffix and the ANSI (MBCS) version will have an "A" suffix. For example, the function FtpGetFile actually has two implementations, FtpGetFileW and FtpGetFileA. With C/C++, the version used by default is determined by whether the project configuration defines the UNICODE macro. If your project is configured to use Unicode, then a call to FtpGetFile is changed to actually call FtpGetFileW; otherwise, it will call FtpGetFileA.

It is recommended that projects use Unicode whenever possible. When the Unicode version of a function is called, the UTF-16 (wide) string parameters are converted to use UTF-8 encoding when required. This can have implications for some functions, such as InetReadLineW and InetWriteLineW which will convert strings between UTF-16 and UTF-8. For example, if you use InetWriteLineW to send Unicode text to a service, that text will be automatically converted to UTF-8 and then written to the socket. The service that is reading the text that you're sending must be able to recognize and process UTF-8 encoding correctly. It may choose to retain the UTF-8 encoding, or it may convert it back to UTF-16 prior to storing or displaying the text.

It is possible to use Unicode strings with legacy projects built using multi-byte character sets (MBCS). The MBCS versions of SocketTools functions (those with the "A" suffix) now require UTF-8 encoded strings to ensure there are no conversion errors and to prevent potential data loss. This is also the current best practice for Windows and can be accomplished in a few different ways; the recommended approach is to use a manifest that specifies UTF-8 as the active code page for the application. If you use the ANSI version of a function, remember that the text returned to you will be UTF-8 encoded. The MultiByteToWideChar function can be used to convert it back to a UTF-16 encoded string if required.

If you are upgrading SocketTools from an older version and your application uses the legacy MBCS functions, you may find that certain strings are not being converted as expected. Older versions of SocketTools did not support UTF-8 with all functions and were not capable of handling conversions where the data contained a mix of languages whose character sets could not represent one another. The current version of SocketTools always uses Unicode internally, including with MBCS (ANSI) builds. We recommend updating your project to use Unicode, or using conversion functions to convert from the current locale to Unicode. If your project uses C/C++, Microsoft provides a collection of useful conversion macros in the atlconv.h header file.
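The mixed-language problem is easy to demonstrate. The following sketch is in Python and does not involve SocketTools; fits_code_page is a hypothetical helper, and cp1253 and cp1251 are the Windows Greek and Cyrillic code pages.

```python
# A Greek word followed by a Cyrillic word.
mixed = "\u03b4\u03bf\u03ba\u03b9\u03bc\u03ae \u0442\u0435\u0441\u0442"

def fits_code_page(text, code_page):
    """Return True if every character can be represented in the code page."""
    try:
        text.encode(code_page)
        return True
    except UnicodeEncodeError:
        return False

# Neither legacy code page can hold both words; UTF-8 can.
for code_page in ("cp1253", "cp1251", "utf-8"):
    print(code_page, fits_code_page(mixed, code_page))

# Converting UTF-8 text to UTF-16 is the same step MultiByteToWideChar
# performs on Windows when it is called with the CP_UTF8 code page.
utf8_bytes = mixed.encode("utf-8")
utf16_bytes = utf8_bytes.decode("utf-8").encode("utf-16-le")
```

Because UTF-8 can represent every Unicode character, the round trip between UTF-8 and UTF-16 is lossless, which is not true of a conversion through a single legacy code page.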

It is important to note that this conversion only applies to C/C++ style null-terminated strings (an array of characters terminated with a null character). In the SocketTools API these are your typical LPTSTR or LPCTSTR types. Those functions which accept byte values or pointers to byte arrays (LPBYTE) are handled as-is and are not subject to UTF-8 encoding.

SocketTools ActiveX Edition

The SocketTools ActiveX controls use the BSTR string type, which is a special UTF-16 encoded string typically used with COM and ActiveX. Property values use BSTRs and all methods accept VARIANTARG parameters (which can represent any data type). If the method expects a string value, it will attempt to convert the VARIANTARG parameter to a BSTR, which will then be converted to a UTF-8 encoded string. If text is returned back to the caller, it will be converted to a UTF-16 encoded string and returned as a BSTR value.

When using Visual Basic, the conversion between UTF-16 and UTF-8 encoding only occurs with String types, and not byte arrays. If your code calls a method and passes in a byte array, then the data will be sent or received as-is without any encoding. Generally speaking, this entire conversion process will be transparent and handled automatically.

Although strings in Visual Basic are internally managed as UTF-16 strings, the default common controls used in Visual Basic 6.0 do not support Unicode. Those controls, such as buttons, text boxes and labels, will automatically convert the Unicode text to ANSI using the current code page. This means that text in the end-user's native language (depending on system settings) will display correctly, although text in other languages using different character sets may not. Also note that the VB6 IDE is not Unicode aware and may display corrupted string values or invalid characters (for example, in tooltip values when debugging).

For Unicode support in Visual Basic 6.0, it's recommended that you use third-party controls. An alternative that some developers have used is the Microsoft Forms 2.0 Object Library (FM20.DLL) that is part of Microsoft Office. It includes a collection of controls that support Unicode; however, they are not redistributable and Microsoft has stated that their use with VB6 is unsupported.
