Introduction
This article is for beginners who are puzzled by the big term "Unicode", and for users who ask questions like "how do I store non-English or non-ASCII text in the database and get it back?". A few months ago I was in the same situation, where most of the questions were variations of the same thing: "how do I get non-ASCII text out of the database and print it in the application?". This article is meant to answer those questions for those users and beginner programmers.
This article will, most specifically, help you understand what Unicode is and why it is used today (and why it was created in the first place). It also covers its encoding forms (what UTF-8 and UTF-16 are, how they differ and when to use which) and then moves on to using these characters in several kinds of .NET applications. Note that I will also use ASP.NET web applications, to show the scenario in a web-based environment too. The .NET Framework supports these encodings and code pages, letting you share your data with other products that understand the Unicode standard, and it provides several classes to kick-start an application that needs to support global languages.
Finally, I will use a database example (Microsoft SQL Server) to show how to write data to the database and read it back. It is quite simple, no big deal at least for me. Once that has been done, you can download the sample and execute the commands on your machine to test the Unicode characters yourself. Let us begin now.
I will not talk about the Unicode standard itself in depth; instead I will talk about the .NET implementation of Unicode. Also note that the character values in this article are given in decimal form, not in the U+XXXX hexadecimal form. At the end I will also show how to convert the decimal value into its hexadecimal representation.
Starting with Unicode
What Unicode is
Unicode is a standard for character encoding. You can think of it as a standard for converting every character to its binary notation and every binary notation back to its character representation. A computer can only store binary data, which is why non-binary data has to be converted into a binary representation before it can be stored on the machine.
Originally there were not many schemes that let developers and programmers represent their data in languages other than English, largely because application globalization was not common back then. Only the English language was used, and the early code pages contained the codes needed to encode and decode English letters (lower and upper case) and some special characters. ASCII is one of them. ASCII encoded 128 characters, mostly from the English language, in 7 bits each. ASCII doesn't only include codes for text; it also includes control codes that direct how text should be processed, many of which are no longer used. It was the most widely used standard, because technology was limited and it fulfilled the needs of that time.
As computers became more widely used, developers wanted their applications to work in the client's own locale, so a new standard was required; otherwise every developer could invent his own code page to represent various characters, which would have destroyed the interoperability among machines. Unicode originated back in the late 1980s (see the history section of the Wikipedia article), but adoption was slow, partly because its original design cost 2 bytes for every character. It could represent far more characters than ASCII: the original 16-bit design allowed 65,536 characters, and the standard has since grown to more than a million code points, enough to cover all of the world's current scripts. That is why Unicode is used so widely: to support characters globally and to ensure that text sent from one machine is mapped back to the correct string on another, with no data loss (by data loss I mean sentences not being rendered back correctly).
Unicode vs. UTF-8, UTF-16 and UTF-32
Beginners stumble upon UTF-8, UTF-16 and UTF-32 and then finally on Unicode, and they think of them as different things. Well, no, they're not. The standard itself is Unicode; UTF-8, UTF-16 and UTF-32 are the names of its encoding forms, with different code-unit sizes. UTF-8 uses 1-byte code units (but remember, a single character can take several bytes if required; at the end of this article I will explain which of these schemes you should use and why, so please read to the end) and so on.
UTF-8
UTF-8 is the variable-length Unicode encoding form. Its code unit is 8 bits (1 byte), but a character can span multiple bytes, so this encoding can hold every Unicode character. It was designed to be backward compatible with ASCII, for machines that don't support Unicode at all: the first 128 characters are encoded exactly as in ASCII (1 byte), the next 1,920 characters cover most widely used scripts such as Latin supplements, Greek, Cyrillic, Arabic and so on (2 bytes), and the remaining code points take 3 or 4 bytes. (See the Wikipedia article on UTF-8.)
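To see the variable length in practice, here is a minimal sketch using the .NET Encoding class (the sample characters are just examples I picked):
- // Sketch: UTF-8 byte counts grow with the code point
- // requires: using System; using System.Text;
- Console.WriteLine(Encoding.UTF8.GetByteCount("a"));   // 1 byte
- Console.WriteLine(Encoding.UTF8.GetByteCount("α"));   // 2 bytes
- Console.WriteLine(Encoding.UTF8.GetByteCount("क"));   // 3 bytes
- Console.WriteLine(Encoding.UTF8.GetByteCount("😀"));  // 4 bytes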
UTF-16
UTF-16 is also a variable-length Unicode encoding form; the difference is that its code unit is 2 bytes, so a character takes either 2 or 4 bytes depending on the character (more specifically, on where it sits in the character set). It started out as a fixed 2-byte encoding, but it was made variable-length because 2 bytes cannot reach the characters outside the Basic Multilingual Plane; those are encoded as surrogate pairs of 4 bytes.
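A minimal sketch of a surrogate pair in .NET (the emoji is only an example of a character outside the BMP):
- string s = "😀";                              // a character outside the BMP
- Console.WriteLine(s.Length);                  // 2 UTF-16 code units (a surrogate pair)
- Console.WriteLine(char.ConvertToUtf32(s, 0)); // 128512, the actual code point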
UTF-32
UTF-32 uses exactly 32 bits (4 bytes) per character. Regardless of code point, character set or language, this encoding always uses 4 bytes for each character. The only real advantage of UTF-32 (as the Wikipedia article notes) is that characters are directly indexable, which is not possible in the variable-length UTF encodings. Its biggest disadvantage, I believe, is the 4 bytes per character, even if you are only going to use Latin or ASCII characters.
Getting to the .NET Framework
Enough background on the Unicode standard. Now I will give an overview of the .NET Framework and its support for Unicode. That support is built around the primitive type char. A char in the .NET Framework is 2 bytes and represents a UTF-16 code unit. You can convert your characters and strings to whichever Unicode encoding you need, but internally chars and strings are UTF-16 (2 bytes per code unit).
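A minimal sketch of what that means in code (the Greek letter is just an example):
- char c = 'α';                      // one .NET char = one 16-bit UTF-16 code unit
- Console.WriteLine(sizeof(char));   // 2 (bytes)
- Console.WriteLine((int)c);         // 945, the code unit value
- Console.WriteLine(Encoding.UTF8.GetByteCount(c.ToString())); // 2 bytes when re-encoded as UTF-8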
Char (C# Reference) .NET documentation
Char Structure (System)
The preceding documents contain different content, but they are related: char is the System.Char structure in the .NET Framework. The .NET Framework supports Unicode characters by default and will render them on the screen without any separate code from you; you only need to make sure the data source uses the right encoding. All of the application frameworks built on .NET support Unicode, such as WPF, WCF and ASP.NET. You can use Unicode characters in all of these applications and .NET will render the codes into their character notation. Do read the following section.
Console applications
Console applications are a good point to note here, because I said that every .NET application supports Unicode but I didn't mention console applications. The problem isn't really Unicode support; it is neither the platform nor the console framework itself. It is that console applications have very limited text rendering: displaying a wide variety of characters is a matter of glyphs, and you should read about glyphs to understand why.
When I started experimenting with console applications to test their Unicode support, I was amazed to see that it doesn't only depend on the underlying framework or the library being used; there is another factor to consider before relying on Unicode output, and that is the font family of your console. There are several fonts available if you open the properties of your console window.
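As a side note (a hedged sketch, not something the tests below rely on), you can also tell the console to emit UTF-8 explicitly; the font must still contain the glyphs for the characters to show up:
- // Sketch: switch the console output to UTF-8 before writing non-ASCII text
- Console.OutputEncoding = Encoding.UTF8;
- Console.WriteLine('क');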
Let us now try a few basic examples: first characters from the range 0-127, then characters from further code points, to see how the console application behaves and how other applications respond to the same data.
ASCII codes
First I will try an ASCII code (a very basic one, "a") to see whether the console behaves correctly or messes something up. I used the following code in the console application:
- using System;
- using System.Text;
-
- namespace ConsoleUnicode
- {
-     class Program
-     {
-         static void Main(string[] args)
-         {
-             char a = 'a';
-             Console.WriteLine(String.Format("{0} character has code: {1}",
-                 a, Encoding.UTF8.GetBytes(a.ToString())[0].ToString()));
-
-             Console.Read();
-         }
-     }
- }
The response to this code was like this:
You can see that it makes no difference whether you call this code ASCII or Unicode, because "a" is 97 in both of them. That was pretty basic. Now let us take a step further.
Non-ASCII codes
Let us now try a Greek letter, the first one in the row: alpha. If we execute code similar to the preceding, replacing the "a" with alpha, you will see the following result:
Well so far so good.
Let us take a bigger step now: why not try Hindi? Hindi comes up regularly in questions about how to store and extract text from a database. Let us now try Hindi characters in the console application.
Nope, I didn't specify a question mark! That was meant to be the "k"-sounding Hindi character (क), but what gets printed is a question mark. Why is that so?
That is not a problem with Unicode, but with the console application's limited support for global fonts. To back up this claim, I added another line of code to store the same character inside a text file with Unicode support. The following code stores the bytes of the character (using the UTF-8 encoding).
Note: Notepad can support ASCII, Unicode and other schemes so be sure your file supports the character set before saving the data.
- // Requires: using System.IO;
- File.WriteAllBytes("F:\\file.txt", Encoding.UTF8.GetBytes(a.ToString()));
The preceding code was executed for the same character: the console printed a question mark, but the file showed something else.
This shows that the characters are fully supported in the .NET Framework, but the font also matters: the glyph for the character must be available in the font for it to be rendered, otherwise the application shows a placeholder instead (in other frameworks you get a square box denoting that the character is not supported).
Conclusion
So, if your application has a problem displaying Unicode characters in a console, make sure that the characters you're trying to display are supported by the font family you're using. Loading a Hindi character in a console whose font does not contain the glyph is exactly this problem. That ends the discussion of Unicode support in console applications: it works, provided the font family supports that code page (or at least that code point).
Unicode support in other application frameworks
Now let us see how well Unicode is supported in other frameworks, such as WPF and ASP.NET. I will not cover Windows Forms; the process is similar to WPF. ASP.NET and WPF have access to a wide variety of fonts and glyphs that can cover nearly all characters. So let us move from desktop frameworks to a web framework, and then finally test the SQL Server database with each of these frameworks, to see what supporting Unicode characters looks like.
Let me introduce the data source first
Before I continue to any framework, I would like to introduce the data source that I will use in the article to show how you can read and write the data in Unicode format from multiple data sources. In this article, I will use:
- Notepad, that supports multiple encodings, ASCII, Unicode and so on.
- SQL Server database to store the data in rows and columns.
You can use either of these data sources (the first one is available to you if you're using a Windows-based OS) and both support reading and writing Unicode data. If you write the data and create the file from code, nothing extra is needed. If, however, you create the file yourself in Notepad, make sure you select UTF-8 encoding (not the option labelled "Unicode", which is UTF-16) before hitting the Save button; otherwise the file gets the default ANSI encoding and the Unicode data will be lost when saved. You can use Notepad as the data source, or if you have SQL Server then you can use SQL Server; either one will satisfy your needs.
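If you do create the file from code, you can pass the encoding explicitly; a minimal sketch (the path and text are just examples):
- // Sketch: write a text file with an explicit UTF-8 encoding
- // requires: using System.IO; using System.Text;
- File.WriteAllText("F:\\file.txt", "यूनिकोड डेटा", Encoding.UTF8);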
Using SQL Server Database
You can use a SQL Server database in your project too. If you're going to use the source code given here, you may need to create a sample database and, inside this newly created database (or inside your current testing database), create a new table to hold the Language and UnicodeData columns. You can run the following SQL command to do that.
- CREATE TABLE UnicodeData (
-     Language nvarchar(50),
-     UnicodeData nvarchar(500)
- );
Make sure you've selected the correct database to create the table in, or run USE DATABASE_NAME before executing this command. Note that the columns are nvarchar; if you insert rows directly in T-SQL, prefix the Unicode string literals with N (for example N'यूनिकोड'), otherwise SQL Server treats them as varchar and the characters are lost. Initially, I filled the table with the following data.
Language | UnicodeData
Arabic | بِسْمِ اللهِ الرَّحْمٰنِ الرَّحِيْمِ
Hindi | यूनिकोड डेटा में हिंदी
Russian | рцы слово твердо
English | Love for all, hatred for none!
Urdu | یونیکوڈ ڈیٹا میں اردو
Now there is enough data, in enough languages, to test our frameworks. I am sure the console would never support all of it, so why even try? Yet if you want to see the output in a console application, I won't stop you.
WPF and Unicode
The only problem we hit in the console application was the limited glyph coverage of its font family, and that problem goes away in WPF. In WPF you can use many fonts (system fonts or your own custom fonts) to display different characters in your applications, exactly the way you want them to appear.
WPF supports all of the characters; we will see why I say that. First, let us print the same characters in WPF, starting with a simple plain-text expression. Once that is done, I will check whether fonts are a factor in WPF or not. Stay tuned.
- 'a' character
First of all, I will print the 'a' character on the screen and show its encoded value, which is the same as in ASCII. The following code does that:
- text.Text = String.Format("character '{0}' has a code: {1}", "a", ((int)'a'));
An Int32 can hold the numeric (decimal) value of any Unicode character.
Now the preceding code, once executed, would print the following output.
Quite similar output to that of the console application. Moving forward now.
- 'α' character
Now, trying the same with that Greek character results in the following screen:
- 'क' character
Now for the problematic character, the Hindi character, to see what effect it has in our application. When we change the code to print क, we get:
This shows that WPF really does support the character, because the font family, Segoe UI, contains glyphs for these Unicode characters, in this instance the Devanagari (Hindi) set.
Testing SQL Server data in WPF
We saw how the console application treated the data; now let us test our WPF application to see how it treats Unicode data coming from SQL Server, and whether it displays the raw data correctly on the screen or we need to do something extra. I will open a SqlConnection to my database and run a SqlCommand over it.
You will need a connectionString for your SQL Server database.
- // Create the connection.
- using (SqlConnection conn = new SqlConnection("server=your_server;database=db_name;Trusted_Connection=true;"))
- {
-     // Do remember to open the connection!
-     conn.Open();
-
-     // Connection established
-     SqlCommand command = new SqlCommand("SELECT * FROM UnicodeData", conn);
-
-     // For better readability
-     text.FontSize = 13;
-
-     // Header line for the data coming from the database
-     text.Text = "Language\t | \tUnicodeData" + Environment.NewLine + Environment.NewLine;
-
-     using (SqlDataReader reader = command.ExecuteReader())
-     {
-         while (reader.Read())
-         {
-             // Write the data
-             text.Text += reader[0] + "\t \t | \t" + reader[1] + Environment.NewLine;
-         }
-     }
- }
Now the WPF shows me the following output on the screen.
The preceding image shows that no extra effort is required on our part for the data to be rendered correctly; WPF does that for us.
Adding and retrieving the data
People often say that they have stored the data in the correct format, but when they extract it, they get it back in the wrong format. Hindi, Arabic, Urdu and Japanese users in particular ask such questions, so I thought I should also give an overview of what happens when a user stores data in the data source. I used the following code to insert 3 rows into the database.
- SqlCommand insert = new SqlCommand(@"INSERT INTO UnicodeData (Language, UnicodeData)
-                                      VALUES (@lang, @data)", conn);
-
- // Placeholders: replace with the language name and a few characters of that language
- var lang = "language in every case";
- var udata = "a few characters in that particular language";
-
- // Adding the parameters
- insert.Parameters.Add(new SqlParameter("@lang", lang));
- insert.Parameters.Add(new SqlParameter("@data", udata));
-
- if (insert.ExecuteNonQuery() > 0)
- {
-     // Pause the app so the message can be seen
-     MessageBox.Show("Data was stored in the database, now moving forward to load the data.");
- }
The data I inserted was:
Language | UnicodeData
Greek | Ελληνικών χαρακτήρων σε Unicode δεδομένων
Chinese | 祝好运
Japanese | 幸運
So now the database table should look like this:
Fonts do matter in WPF too
In the console application the font family mattered. The same question arises: "Does the font family matter in WPF too?" The answer is, "Yes! It does matter", but the underlying process is different. For every font family, WPF maps characters to the glyphs that family provides; if a character cannot be mapped to a glyph in the selected font, WPF falls back to a default font family that does support that character.
If you read the FontFamily class documentation on MSDN, you will find a quite interesting section named "Font Fallback", which states the following.
Quote:
Font fallback refers to the automatic substitution of a font other than the font that is selected by the client application. There are two primary reasons why font fallback is invoked:
- The font that is specified by the client application does not exist on the system.
- The font that is specified by the client application does not contain the glyphs that are required to render the text.
Now wait, it doesn't end there. It doesn't mean WPF picks an arbitrary font, or draws a box or a question mark. What actually happens is wonderful (in my opinion): WPF uses a default fallback font family and thus provides a sensible, non-custom font for that character. You should read that documentation to understand fonts in WPF. Anyhow, let us change the font in our WPF application and see what happens.
Adding the following code would change the font family of my TextBlock:
- text.FontFamily = new FontFamily("Consolas");
The output on the screen now is something like this:
What we see is quite similar to the previous output. That is because only the characters that the Consolas font family could map were rendered in Consolas; the rest (Urdu, Arabic, Hindi and so on) were mapped back to the Segoe UI font by the font fallback mechanism, so the application shows the data without the user ever seeing question marks and square boxes. This is one of the WPF features that I love; everything happens in the background.
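If you want to be explicit about the fallback order, WPF also accepts a comma-separated list of family names in the FontFamily string; a minimal sketch (the font names are just examples):
- // Sketch: WPF tries the families from left to right until a glyph is found
- text.FontFamily = new FontFamily("Consolas, Segoe UI");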
ASP.NET support for Unicode
Now let us see whether ASP.NET also supports Unicode data, or whether we need to do extra work to display Unicode characters. I have always said that "ASP.NET runs right on top of the .NET Framework, so anything that runs on the .NET Framework can be used on the back-end of ASP.NET, even if it cannot be used on the front-end". Now it is time to put Unicode support to the test.
Testing SQL Server data in ASP.NET
I will skip the three characters that I showed in the WPF example, because everyone knows that these characters, if they can be written in HTML, can be served by ASP.NET. So I am going straight to the data coming from SQL Server.
I will use the same code from the WPF application and test it in my ASP.NET website. Once rendered, it shows the following output.
I know the formatting of the paragraphs is a bit messy, but I didn't want to fiddle with the HTML and CSS of the web document, so I leave it to you to see that everything is working. The output is similar to the WPF output and shows that the characters are rendered correctly; the font family of this web page is also Segoe UI.
Fonts matter in ASP.NET too
The font also matters in ASP.NET, but there are some other factors that come into play here.
- ASP.NET, on the server side, would send the response as correctly formatted HTML, with all of the Unicode characters embedded in the document.
- The web browser plays a major role in supporting the character set. You have to rely on the browser to render the characters that were sent; unlike WPF, the server has no font fallback it can impose on the client. Most modern browsers, however, implement their own fallback mechanism that behaves much like WPF's.
(Now it does make sense for using <meta charset="utf-8" />, doesn't it?)
- The OS being used should also support Unicode; you cannot expect everyone to be running the latest version of everything on the best machine with the most popular OS.
Given these conditions, you can assume that ASP.NET, on the server side, will do everything it can to ensure that the data is not lost (by lost I mean represented as a square box, or a diamond with a question mark in it, and so on). So in ASP.NET it is not only the font that matters; the web browser and the character set matter too, which makes it harder for developers to predict what will happen. WPF has the .NET Framework running in the background to perform the fallback; ASP.NET doesn't get that chance, because a user running Windows 95 (rare as that would be) can still make a web request to your ASP.NET 5 based website.
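On the server side you can at least be explicit about the response encoding; a minimal sketch, assuming a Web Forms page with an HttpResponse at hand:
- // Sketch: tell the client explicitly that the response body is UTF-8
- Response.ContentType = "text/html";
- Response.ContentEncoding = Encoding.UTF8; // pairs with <meta charset="utf-8" /> in the markup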
Changing the font to Consolas in ASP.NET and testing in my Windows 8.1 based Google Chrome web browser, I don't find any difference.
But if you look closely, the fonts have not fallen back; there is a slight difference in them (see the Arabic and Russian for a slight change), which shows that the browser itself has its own support for rendering these characters.
Bonus data and examples
Guidance for users using Notepad
If you're going to use Notepad, you will need the File class from the System.IO namespace. You can use File.ReadAllBytes and then decode the bytes with the correct encoding scheme. For example, the following code reads all of the bytes from the file.
- byte[] bytes = File.ReadAllBytes("location");
Then to write these bytes on the screen, you can use the Encoding class of the System.Text namespace, to convert them to a String using an encoding. See the following:
- string data = Encoding.UTF8.GetString(bytes);
Now data contains the Unicode characters, and you can display them on the screen using whatever output method your framework provides. You can also go the other way: convert the string back to a byte array and store all the bytes in the file. Look below.
- string data = GetData();
- byte[] bytes = Encoding.UTF8.GetBytes(data);
- File.WriteAllBytes("location", bytes);
Now the file (if it was saved with a Unicode encoding) will contain the Unicode characters you stored in it. The process is very similar to using SQL Server; only the classes and methods differ. You cannot say that either one is better than the other; both are free and don't take much time.
Converting the decimal to Unicode representation (U+XXXX)
Usually, people think that 65 and U+0041 represent different characters. Well, they don't; they represent the same character, 'A'. You should read the Wikipedia article on the Basic Latin block before moving on; it covers all of the basic characters and special characters.
The basic idea is that 65 is the decimal form of the value, whereas 41 is the hexadecimal form of the same value, and both represent the same character. You can convert from one base to the other just as you would convert any number between bases. In the .NET Framework, the code to get the numeric value of a character and to get the character back from a number is:
- // Character to its numeric (decimal) value
- char a = 'A';
- int code = (int)a;    // 65
-
- // Numeric value back to the character
- int value = 65;
- char b = (char)value; // 'A'
Using this mechanism, you can easily get the integer value of a character and the character for an integer value. Once you have the decimal representation, you can convert it into the U+XXXX Unicode notation of that character. Have a look at the following code:
- string str = String.Format("Character '{0}' has code: {1} and Unicode value: U+{2}",'A', ((int)'A'), ((int)'A').ToString("X4"));
What happens when the preceding code executes?
The preceding image is worth a thousand words, isn't it? It clears up the ambiguity about how these numbers are represented and what the actual value is. You can write the value either way, whichever you prefer.
Do not use bytes to assume you have the value of a character
File.ReadAllBytes is fine for getting the raw data into a byte array, but never assume that a single character is held in a single byte. A byte can hold values in the range 0-255; any character beyond that is encoded as a sequence of several bytes, which is why you get a byte array. If you read a single byte of such a sequence, you get a value that is not the character's code, even though the text renders correctly when decoded as a whole. For example, have a look at the following line of code.
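A minimal sketch of such a line, using the Greek alpha from the earlier examples (this reproduces the idea, not the exact line from the original test):
- char alpha = 'α';                // code point 945 (U+03B1)
- byte[] bytes = Encoding.UTF8.GetBytes(alpha.ToString());
- Console.WriteLine(bytes[0]);     // 206 (0xCE), only the first byte of a 2-byte sequence
- Console.WriteLine(bytes.Length); // 2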
That is absolutely the wrong value; the correct value of alpha ('α') is 945.
The values in the byte array are laid out to preserve the encoding scheme, not to hold the value of each character at its own index.
Why to prefer UTF-8 instead of other encodings
I said previously to prefer UTF-8 over the other encoding schemes; let me explain why. The major factor is how light-weight UTF-8 is. It is an 8-bit variable-length encoding form defined by the Unicode Consortium. It uses 1 byte for the ASCII range (0-127), 2 bytes for the next 1,920 code points, and 3 or 4 bytes beyond that. UTF-32, by contrast, is fixed at 4 bytes per character, and UTF-16 uses a minimum of 2 bytes per character.
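A minimal sketch of the size difference for plain English text (the sample string is arbitrary):
- string s = "Hello";
- Console.WriteLine(Encoding.UTF8.GetByteCount(s));    // 5 bytes
- Console.WriteLine(Encoding.Unicode.GetByteCount(s)); // 10 bytes (UTF-16)
- Console.WriteLine(Encoding.UTF32.GetByteCount(s));   // 20 bytes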
UTF-8 is a natural fit for English-based websites: its first 128 code points map directly to ASCII, which keeps compatibility with older software that only understands ASCII. If you go to the MSDN documentation and read the example in the Examples section, you will find that UTF-8 is usually the best choice among the Unicode encodings. There are many other threads discussing the difference; here are a few for further reading.
- Wikipedia: Comparison of Unicode encodings
- Jon Skeet's answer on Stack Overflow about UTF-8 vs. Unicode
- UTF8, UTF16 and UTF32 (on Stack Overflow)
- Difference between UTF-8 and UTF-16 (on Stack Overflow)
Classes provided in System.Text
In the .NET Framework, the System.Text namespace is responsible for character encodings. It includes the Encoding class, which can encode a string into bytes and decode bytes back into a string, and the framework offers several encodings to choose from.
The ambiguity comes from the static members Unicode, UTF8, UTF7 and UTF32 on the Encoding class (note that there is no UTF16 member; Encoding.Unicode is UTF-16). There is also a dedicated class for each of these encodings, so you can use either form, Encoding.UTF8 or UTF8Encoding, in your application. The documentation for System.Text has many more resources to help you understand encodings in the .NET Framework.
Note: UTF8Encoding inherits from Encoding. More generally, every class ending with Encoding in the System.Text namespace inherits from the Encoding class and offers the same functionality as the corresponding member of the Encoding class.
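A minimal sketch of the two forms side by side (the BOM flag is just one example of why you might pick the class form):
- // Both produce UTF-8 bytes; the class form exposes extra options such as BOM emission
- byte[] viaMember = Encoding.UTF8.GetBytes("data");
- byte[] viaClass  = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false).GetBytes("data");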
Points of Interest
Unicode
Unicode is a standard for character encoding and decoding for computers. You can use its various encoding forms, such as UTF-8 (8-bit code units), UTF-16 (16-bit code units) and so on. These encodings are used to globalize applications and to provide a localized interface to users, enabling them to use applications in their own language, not just English.
Why to use UTF-8
UTF-8 is a variable-length encoding form of Unicode that can accommodate every character; its size ranges from 1 byte to 4 bytes, depending on the code point of the character. Most websites, news applications and media devices use UTF-8 because it is light-weight and efficient.
Unicode in .NET
The .NET Framework has built-in support for Unicode characters. The char type in the .NET Framework represents a UTF-16 code unit. You can use Unicode characters in various .NET applications: console applications, WPF applications and web applications based on ASP.NET. You can convert strings to and from bytes and characters to and from integers. In a console application you need to ensure that the character's code page is supported by the font family you're using, otherwise you might see a question mark on the screen.
You do not need to write anything to support Unicode, it is there by default.
History
First version of the article.