Understanding text encodings

All computers use encoding systems to store character strings as a series of bytes. The oldest and most familiar encoding scheme is the ASCII encoding which defines 128 character codes (using Integer values 0-127). These characters include only the upper and lowercase English alphabet, numbers, some symbols, and invisible control codes used in early computers. You can use the String.ChrByte function to get the character that corresponds to a particular ASCII code value.
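As a small illustration, the sketch below maps an ASCII code to its character and back again. The variable names are made up for this example, and String.Asc is used on the assumption that your project targets API 2.0.

```xojo
' Get the single-byte string for ASCII code 65, which is "A"
Var letter As String = String.ChrByte(65)

' Go the other way: get the code of the first character of a string
Var source As String = "A"
Var code As Integer = source.Asc ' 65
```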

Over the years, ASCII was extended and other encodings were created to handle more and more characters and languages. In particular, the Unicode standard was designed to handle the encoding of any language and a mixture of languages in the same string. In your Xojo projects, any Strings you create in code (as constants, variables, or literals) use the UTF-8 encoding, which is what is most commonly used today.

If the strings you work with are created, saved, and read only within your own apps, you shouldn't have to worry about encoding issues because the encoding used is stored along with the content of the string.

If you are creating apps that open or modify text files or data created outside of your app, then it's possible that the text was encoded using something other than UTF-8. For these situations you need to understand how text encodings work and what changes you may need to make to your code so that it recognizes the text as it was encoded. If your app assumes the text was encoded as UTF-8 but it was in fact encoded as WindowsLatin1, then you may find that some characters do not display properly.

Note

Strings in Structures do not contain encoding information because they are just a series of bytes.

From ASCII to Unicode

As you know, computers don't really store or understand characters. They store each character as a numeric code. For example, "A" is ASCII character number 65. When the computer industry was in its infancy, each computer maker came up with their own numbering scheme. A numbering scheme is sometimes called a character set. It is a mapping of letters, numbers, symbols, and invisible codes (like the carriage return or line feed) to numbers. Only with a shared character set can information be exchanged between computers made by different manufacturers.

In 1963 the American Standards Association (which later changed its name to the American National Standards Institute) announced the American Standard Code for Information Interchange (ASCII) which was based on the character set available on an English language typewriter.

Over the years, computers became more and more popular outside of the United States and ASCII started to show its weaknesses. The ASCII character set defines only 128 characters. That covers what is available on an English-language typewriter, plus some special “control” characters that can be used on computers to control output. It doesn't include special characters that are commonly used in typeset books such as curved quotes or the curved apostrophe, bullet characters, and long dashes—like this one. Also, many languages (like French and German) use accented characters that are not defined as part of the ASCII specification.

When the Macintosh and Windows operating systems were introduced, each OS defined extensions to standard ASCII by defining codes from 128-255. This enabled both operating systems to handle accented characters and other symbols that are not supported by the ASCII standard. However, the Macintosh and Windows extensions do not agree with one another. Cross-platform applications have to build in some way of managing text that uses characters in the 128-255 range.
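The disagreement is easy to see with a single byte from the 128-255 range. The following is a minimal sketch, not a complete cross-platform strategy: the byte value 233 and the variable names are chosen only for illustration, and the encodings are applied with DefineEncoding, which is described later in this topic.

```xojo
' A single byte with the value 233 (&hE9), as it might arrive from a file
Var rawByte As String = String.ChrByte(233)

' Interpreted with the Windows extension, this byte is "é"
Var asWindows As String = rawByte.DefineEncoding(Encodings.WindowsLatin1)

' Interpreted with the Macintosh extension, the same byte is "È"
Var asMac As String = rawByte.DefineEncoding(Encodings.MacRoman)
```

The bytes are identical in both strings. Only the encoding used to interpret them differs, which is why text that looks correct on one platform can show the wrong accented characters on the other.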

The problem is even worse for users of languages that don't use the standard Roman alphabetic characters at all, such as Japanese, Chinese, or Hebrew. Because there are so many characters, the character sets devised to support some of these languages use two bytes of data per character (rather than one byte per character, as in ASCII).

Apple eventually created various text encodings to make it easier to manage data. MacRoman is a text encoding for files that use ASCII along with Apple's extended characters for Western European languages. MacJapanese is a text encoding for files that store Japanese characters. There are others as well. But these encodings were Mac-specific. They didn't make exchanging data with other operating systems any easier, and mixing data with different encodings (typing a sentence in Japanese in the middle of an English-language document, for example) was problematic.

In 1986, people working at Xerox and Apple had different problems to solve that required the same solution. Before long, the concept of a universal character encoding that contained all the characters for all languages became the obvious solution. The universal encoding was dubbed “Unicode” by one of the people at Xerox who helped to create it. Unicode solves all of these problems. Any character you need from any language is supported and will be the same character on any computer that supports Unicode. And as a bonus, you can mix characters from different languages together in one document since all are defined in Unicode.

Unicode support began appearing on the Macintosh with System 7.6 and on Windows with Windows 95. You could translate files between other text encodings and Unicode but Unicode was still the exception and not the rule. It wasn't until Mac OS X and Windows 2000 that Unicode became the standard.

Computer users are still in a transition. Some use older systems where Unicode is not the standard, while all new systems running macOS, Windows, or Linux use Unicode as the standard encoding. As a result, you may have to deal with text files in different encodings for a while. At some point it may be so rare that you can assume all files are in Unicode format, but until then you may need to modify your code so that your application operates properly when it encounters text in other encodings.

Handling text encodings

Unfortunately, there is no perfectly accurate way to determine the encoding of a file. You have to know what encoding the file is using. For example, if it is coming from an English-speaking user of Windows, it's probably Windows ANSI.
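One heuristic that can help narrow things down is to check whether the bytes are at least valid for a candidate encoding. The sketch below assumes the TextEncoding.IsValidData method, uses a made-up fileData variable as a stand-in for bytes read from a file, and falls back to WindowsLatin1 purely as an example. Passing the check does not prove the text really is UTF-8; it only rules out data that cannot be.

```xojo
' fileData is a hypothetical stand-in for bytes read from a file
Var fileData As String = String.ChrByte(233)
Var guessed As String

If Encodings.UTF8.IsValidData(fileData) Then
  ' The bytes form valid UTF-8, so treat them as UTF-8
  guessed = fileData.DefineEncoding(Encodings.UTF8)
Else
  ' Not valid UTF-8; fall back to an assumed legacy encoding
  guessed = fileData.DefineEncoding(Encodings.WindowsLatin1)
End If
```

Choose the fallback encoding from what you know about where the file came from, since this check cannot tell one single-byte encoding from another.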

If the encoding of a string is defined, you can use the Encoding function to get its encoding, like this:

Var theEncoding As TextEncoding

theEncoding = Encoding(myString)

where the variable myString contains the string whose encoding is to be determined and theEncoding is a TextEncoding object. If the encoding is not defined, the Encoding function returns Nil.

Any String you create in Xojo has the UTF-8 encoding by default:

Var testString As String = "Hello"
Var theEncoding As TextEncoding

theEncoding = Encoding(testString) ' theEncoding.Name = "UTF-8"

When you get text from another source, such as an external file, a database, or the internet, it may have a different encoding. As mentioned above, there's no reliable way to look at text and know what its encoding is, but you should be able to find out from the originator of the text what the encoding is.

If you get text in a different encoding, you'll want to tell Xojo what the encoding is. You can use the DefineEncoding function for this. So if you get text from a file that is in MacRoman encoding, then you can tell Xojo its encoding like this:

' InputText is the text from the file

InputText = InputText.DefineEncoding(Encodings.MacRoman)

With the correct encoding specified, the text will be properly displayed in your app.

In other cases you may want to convert the text from one encoding to another. For example, if the incoming text is MacRoman but you really want to work in UTF-8, you can convert the encoding.

' Define the encoding as MacRoman
InputText = InputText.DefineEncoding(Encodings.MacRoman)

' Convert the encoding to UTF-8
InputText = InputText.ConvertEncoding(Encodings.UTF8)
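To make the difference between the two calls concrete, the sketch below counts bytes before and after the conversion. DefineEncoding only labels the existing bytes, while ConvertEncoding rewrites them. The byte value 142 and the variable names are chosen for illustration, and String.Bytes and String.Length are assumed to be available (API 2.0).

```xojo
' The single byte 142 (&h8E) is "é" in MacRoman
Var sampleText As String = String.ChrByte(142)
sampleText = sampleText.DefineEncoding(Encodings.MacRoman)

' Defining the encoding does not change the data: still 1 byte
Var bytesBefore As Integer = sampleText.Bytes

' ConvertEncoding rewrites the bytes so the same character is stored as UTF-8
sampleText = sampleText.ConvertEncoding(Encodings.UTF8)

' "é" takes 2 bytes in UTF-8, but it is still 1 character
Var bytesAfter As Integer = sampleText.Bytes
Var characterCount As Integer = sampleText.Length
```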

You may also want to convert the encoding if your app needs to output text in a specific encoding for use by another app. If you wanted to output your text as MacRoman, you would convert its encoding:

OutputText = OutputText.ConvertEncoding(Encodings.MacRoman)

Text encoding and files

When dealing with text data in files, it is particularly important to handle encodings properly. Refer to Accessing Text Files for information about encoding with text files.
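As a brief preview, reading a file whose encoding you already know might look like the sketch below. The file name legacy.txt, its location, and the choice of WindowsLatin1 are assumptions made for this example only.

```xojo
' "legacy.txt" is a hypothetical file name used only for this sketch
Var f As FolderItem = SpecialFolder.Documents.Child("legacy.txt")

If f <> Nil And f.Exists Then
  Try
    Var stream As TextInputStream = TextInputStream.Open(f)

    ' Tell the stream how the file's bytes were encoded
    stream.Encoding = Encodings.WindowsLatin1
    Var contents As String = stream.ReadAll
    stream.Close

    ' Work in UTF-8 from here on
    contents = contents.ConvertEncoding(Encodings.UTF8)
  Catch e As IOException
    System.DebugLog("Could not read the file: " + e.Message)
  End Try
End If
```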

Getting individual characters

As was mentioned earlier, you can use String.ChrByte to obtain an individual ASCII character. To obtain a character in a particular encoding, use the TextEncoding.Chr function by passing it the code for the character you want. It requires that you specify both the encoding and the character value. For example, the following returns the ÷ sign (code 247) in the variable divChar:

Var divChar As String = Encodings.UTF8.Chr(247)