
Encodings and Charsets

by Ioannis Panagopoulos

I always dreaded the "unrecognized characters" problems encountered in Windows applications when dealing with old data. They appeared, and still appear, usually when data from an old MS-DOS application is ported to current database management applications (MS Access, Microsoft SQL Server, .NET DataSets). The Greek characters seem to have undergone a major change from old to new, and applications do not automatically know how to cope with them; I guess the same applies to other languages as well. In this article I will try to shed some light on these issues and offer a possible solution. To understand the "root of evil", we first need to understand encodings and character sets. And to do that we need to learn some history...

Each character displayed is associated with a number usually referred to as its "character code". This number is the one the computer uses to locate the representation of the character in memory. Initially, the first character codes used in computers were the 128 character codes defined by ASCII. Among those 128 characters were the Latin characters of the English alphabet. English-speaking computer users were very happy with them, but the same cannot be said for those poor people who happened to write and read in a language with different characters. So what did the rest of the world do to solve this problem?

Well, if you think about it, the ASCII character codes are stored in 1 byte (8 bits), where 7 bits are used for the code and the 8th bit is reserved as the "parity" bit. The parity bit? Yeah, it turns out that ASCII was originally used as a telegraphic code and promoted as such by Bell data services. Since this bit ceased to serve that purpose in computers, every programmer who needed custom characters for his or her alphabet improvised new characters with codes from 128-255, simply by setting this bit to 1 and assigning new character representations to the resulting extra set of 128 codes. Each one of those new alphabets became a character set, was given a specific number, and thus became a codepage.

To summarize:

  • Initially we had the ASCII table, with codes from 0-127, stored in 1 byte and handling only the Latin alphabet.
  • Later, a number of different character sets were created where codes 0-127 are the same as ASCII and codes 128-255 vary from set to set. These are called "codepages" or "charsets".

For example, the Latin letter "A" has character code 65 in every charset, since it is one of the initial 128 characters. On the contrary, the Greek letter "Α" has character code 128 in ANSI Codepage 737 (Greek DOS) and character code 164 in ANSI Codepage 869 (Greek Modern DOS). This means that if you receive a string of characters from an old application, you need to experiment with the codepages to find out which codepage the characters belong to.
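
To see this in practice, here is a minimal C# sketch (with using System and using System.Text in scope; it assumes these codepages are available on your machine, and on .NET Core/.NET 5+ you would first need to register the CodePagesEncodingProvider from the System.Text.Encoding.CodePages package):

String greekAlpha = "Α";                                                  // The Greek capital A

byte[] dosGreek = Encoding.GetEncoding(737).GetBytes(greekAlpha);        // Greek DOS
byte[] modernDosGreek = Encoding.GetEncoding(869).GetBytes(greekAlpha);  // Greek Modern DOS

Console.WriteLine(dosGreek[0]);                                           // 128
Console.WriteLine(modernDosGreek[0]);                                     // 164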

This clearly is a mess. The resolution is to come up with a universal encoding scheme, and that is what Unicode represents. Each code represents a specific character: all characters in the world, for all languages, have a Unicode code associated with them. But how is this code stored in memory? Do we need to use 1 byte, 2 bytes or more? There are several ways this number can be stored in memory, and those are the "encodings". So different encodings are different representations of the same Unicode code in memory. To make this clear, consider the following example concerning the Greek character "Α".

In the codepage era the character has many codes depending on the charset used.

  • The Greek character Α has character code 128 in ANSI Codepage 737 (Greek DOS).
  • The Greek character Α has character code 164 in ANSI Codepage 869 (Greek Modern DOS).
  • For other, non-Greek characters there may be a codepage where each character needs two bytes (the codepage consists of two-byte entries).

In the Unicode era the character has a single Unicode code, 0x391, but many encodings, depending on how it is stored in memory:

  • The Greek character Α with code 0x391 is stored as CE 91 in two bytes in UTF-8.
  • The Greek character Α with code 0x391 is stored as 03 91 in two bytes in UTF-16.
  • The Greek character Α has yet another encoding in UTF-7, and so on.

Note that the fact that characters are no longer stored in 1 byte changes the 1 char = 1 byte view we are used to. Clearly we should forget about that.
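
As a rough sketch of the above (again with using System and using System.Text in scope; note that Encoding.Unicode in .NET is little-endian UTF-16, so its bytes come out as 91 03, while Encoding.BigEndianUnicode gives the 03 91 order listed above):

String alpha = "Α";                                          // Greek capital A, Unicode code 0x391

byte[] utf8 = Encoding.UTF8.GetBytes(alpha);                 // CE 91
byte[] utf16be = Encoding.BigEndianUnicode.GetBytes(alpha);  // 03 91
byte[] utf7 = Encoding.UTF7.GetBytes(alpha);                 // a different, longer sequence

Console.WriteLine(alpha.Length);                             // 1 character...
Console.WriteLine(utf8.Length);                              // ...but 2 bytes in UTF-8
Console.WriteLine(BitConverter.ToString(utf8));              // CE-91
Console.WriteLine(BitConverter.ToString(utf16be));           // 03-91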

Now in C# we distinguish the following cases:

We know the encoding of a string and need to get the actual bytes used to represent the string. For example, we have a string in Unicode (the default) and want to get the actual bytes:

String test = "Α";                         // The Greek Α
Encoding dst = Encoding.Unicode;
byte[] byteCode = dst.GetBytes(test);

The code above returns the actual bytes used to store the string: 91 03, since Encoding.Unicode is little-endian UTF-16.

Note that the normal ToCharArray() method would, as expected, return a single two-byte char (in .NET a char is a UTF-16 code unit).
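
For example, a quick sketch of the difference (the values shown in the comments assume the Greek Α, U+0391, from the snippet above):

char[] chars = test.ToCharArray();                 // one two-byte char
Console.WriteLine(chars.Length);                   // 1
Console.WriteLine(((int)chars[0]).ToString("X"));  // 391, the Unicode code

byte[] bytes = Encoding.Unicode.GetBytes(test);    // the raw bytes
Console.WriteLine(BitConverter.ToString(bytes));   // 91-03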

For a string to be displayed properly, it must be in Unicode. So it must be translated from the encoding/charset it is in to Unicode. This is achieved through the Encoding.Convert(Encoding source, Encoding destination, byte[] bytes) method. So, to translate a string expressed in an encoding we need to guess, we do as follows:

String Test = "The received string in unknown encoding";
Encoding src = Encoding.GetEncoding(737);    // Play here with candidate encodings (to get the list of supported encodings try Encoding.GetEncodings())
Encoding dst = Encoding.Unicode;

byte[] ByteCodes = src.GetBytes(Test);
byte[] ResultCodes = Encoding.Convert(src, dst, ByteCodes);

String Result = new String(dst.GetChars(ResultCodes));

If the result is readable, congratulations: you have found the encoding and can translate!
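
If you want to automate the guessing a bit, a hypothetical helper like the one below decodes the same raw bytes with a few candidate Greek codepages so you can eyeball which result reads correctly (the byte values and the codepage list here are just assumptions for illustration):

byte[] legacyBytes = { 0x80, 0x81, 0x82 };       // e.g. bytes read from an old MSDOS data file
int[] candidates = { 737, 869, 1253, 28597 };    // Greek DOS, Greek Modern DOS, Windows Greek, ISO 8859-7

foreach (int codepage in candidates)
{
    Encoding src = Encoding.GetEncoding(codepage);
    byte[] unicodeBytes = Encoding.Convert(src, Encoding.Unicode, legacyBytes);
    String decoded = new String(Encoding.Unicode.GetChars(unicodeBytes));
    Console.WriteLine(codepage + ": " + decoded);
}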

Download a demo exe app for Encodings here. If you need the source, leave a message (EncodingsDemo.exe (18,00 kb)).
