Oct 28th

Defining Character Sets for HTML: Important for Different Languages and Alphabets

Author: Editor | Filed under HTML Tutorials

A character set determines how the bytes that make up the text of an HTML document are translated into readable characters. A browser such as Windows Internet Explorer interprets the bytes in a document according to the character set that applies to it, so the browser must know which character set to use in order to display the HTML page correctly. ASCII was the standard character set for a long time; it covers the digits 0-9, the upper- and lower-case English letters A-Z, and a small number of punctuation and special characters. Many countries use characters that are not part of ASCII, so browsers have traditionally fallen back on a larger default character set, ISO-8859-1.
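To make the point about bytes and translation concrete, here is a minimal Python sketch. It is only an illustration: the sample word and the variable names are assumptions, not part of any browser's implementation. It shows the same bytes read correctly under one character set and wrongly under another.

```python
# The sample word "café", written with an escape so the source file
# itself does not depend on any particular encoding.
word = "caf\u00e9"

# Bytes as an editor saving in ISO-8859-1 would write them:
# the letter é is the single byte 0xE9.
latin1_bytes = word.encode("iso-8859-1")

# Read back with the right character set, the text is intact.
as_latin1 = latin1_bytes.decode("iso-8859-1")

# Read back as UTF-8, the lone 0xE9 byte is invalid and is replaced
# by U+FFFD, the Unicode replacement character.
as_utf8 = latin1_bytes.decode("utf-8", errors="replace")

# The reverse mistake: UTF-8 bytes (é is 0xC3 0xA9) read as ISO-8859-1
# turn into two unrelated characters.
mojibake = word.encode("utf-8").decode("iso-8859-1")

# ASCII has no é at all, so encoding fails outright.
try:
    word.encode("ascii")
    fits_in_ascii = True
except UnicodeEncodeError:
    fits_in_ascii = False

print(repr(as_latin1), repr(as_utf8), repr(mojibake), fits_in_ascii)
```

The garbled result in `mojibake` is the kind of display error a reader sees when a page is saved in one character set and interpreted in another.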

Different character sets serve different regions. ISO-8859-1, known as Latin alphabet part 1, covers the languages of North America, Western Europe, Latin America, the Caribbean, Canada and Africa. The languages of Eastern Europe use ISO-8859-2, described as Latin alphabet part 2. South-eastern European languages, Esperanto and a few others use ISO-8859-3, known as Latin alphabet part 3. ISO-8859-4 covers a number of Scandinavian and Baltic languages. ISO-8859-5 is used for languages written in the Cyrillic alphabet, such as Bulgarian, Belarusian, Russian and Macedonian. Languages written in the Arabic alphabet use ISO-8859-6.
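The regional sets above share the ASCII range but assign different characters to the upper byte values. As a rough sketch (the chosen byte, 0xE6, and the variable names are arbitrary assumptions for illustration), Python's built-in codecs can show one byte decoding to four different letters:

```python
import unicodedata

# A handful of the ISO-8859 parts mentioned in the text.
CHARSETS = ["iso-8859-1", "iso-8859-2", "iso-8859-5", "iso-8859-7"]

# One byte above the ASCII range; which letter it means depends
# entirely on the character set applied to it.
byte = b"\xe6"

decoded = {name: byte.decode(name) for name in CHARSETS}

for name, ch in decoded.items():
    # Print the code point and the official Unicode name of the result.
    print(f"{name}: U+{ord(ch):04X} {unicodedata.name(ch)}")
```

This is why a page has to state which character set it uses: the byte alone does not say whether it is a Western European, Central European, Cyrillic or Greek letter.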

Greek, along with the mathematical symbols derived from the Greek alphabet, uses ISO-8859-7, described as Latin/Greek part 7. The Nordic languages use ISO-8859-10, also described as Latin 6 (Lappish, Nordic and Eskimo). Japanese and Korean are supported by the ISO-2022-JP-2 and ISO-2022-KR character sets respectively. This is less complex than it seems, but the details matter when you define a character set for a page. Each of the character sets described above covers only a limited repertoire of characters, so none of them works well in a multilingual environment.
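The limitation can be checked directly. The sketch below is an illustration under assumptions: the helper `can_encode` and the sample strings are hypothetical names chosen here. It tests whether a piece of text fits in a given character set.

```python
def can_encode(text: str, charset: str) -> bool:
    """Return True if every character of text exists in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


# Sample words, written with escapes so the source stays ASCII-only.
russian = "\u041f\u0440\u0438\u0432\u0435\u0442"  # "Privet" in Cyrillic
greek = "\u0393\u03b5\u03b9\u03b1"                # "Geia" in Greek
mixed = russian + " " + greek                     # both on one page

for charset in ("iso-8859-5", "iso-8859-7", "utf-8"):
    print(
        charset,
        can_encode(russian, charset),
        can_encode(greek, charset),
        can_encode(mixed, charset),
    )
```

Each regional set handles its own alphabet, but a page mixing Russian and Greek fits in neither of them.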

The Unicode standard aims to cover all the characters, symbols and punctuation marks in use around the world, independent of platform, language or program. It is developed by the Unicode Consortium, and its ultimate aim is to replace the existing character sets. Unicode has been widely adopted: it is implemented in Java, XML, CORBA 3.0, WML and ECMAScript (JavaScript), and it is supported by many operating systems and by all modern browsers.

Unicode can be implemented by several character encodings, of which UTF-8 and UTF-16 are the most commonly used. In UTF-8 a character takes from 1 to 4 bytes, and the encoding can represent any character in Unicode. It is the preferred encoding for e-mail and web pages. UTF-16 is also a variable-length encoding, using 2 or 4 bytes per character, and it can likewise encode the whole Unicode repertoire. It is used in major operating systems such as Microsoft Windows 2000, XP, 2003, Vista and CE, and in the Java and .NET byte code environments.
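The byte counts above are easy to verify, and the same sketch can show one common way to tell a browser that a page is UTF-8: the HTML `<meta charset="utf-8">` declaration. Everything named here (`samples`, `encoded_length`, `build_page`, the page title) is a hypothetical illustration, not a prescribed method.

```python
import html

# One sample character for each UTF-8 length, written as escapes.
samples = {
    "A": "\u0041",           # Latin capital A
    "e-acute": "\u00e9",     # Latin small e with acute
    "euro": "\u20ac",        # euro sign
    "emoji": "\U0001F600",   # a character outside the 16-bit range
}


def encoded_length(ch: str, encoding: str) -> int:
    """Number of bytes one character occupies in the given encoding."""
    return len(ch.encode(encoding))


utf8_lengths = {name: encoded_length(ch, "utf-8") for name, ch in samples.items()}
# "utf-16-le" is used so that no byte order mark is counted.
utf16_lengths = {name: encoded_length(ch, "utf-16-le") for name, ch in samples.items()}


def build_page(body_text: str) -> bytes:
    """Build a small HTML page whose declared charset matches its bytes."""
    markup = (
        "<!DOCTYPE html>\n"
        "<html>\n<head>\n"
        '<meta charset="utf-8">\n'
        "<title>Charset demo</title>\n"
        "</head>\n<body>\n"
        f"<p>{html.escape(body_text)}</p>\n"
        "</body>\n</html>\n"
    )
    # The declaration only helps if the file really is saved as UTF-8.
    return markup.encode("utf-8")


page = build_page("caf\u00e9 \u20ac")
print(utf8_lengths)
print(utf16_lengths)
print(len(page), "bytes")
```

The design point is that the declaration and the actual encoding of the file must agree; declaring UTF-8 while saving in ISO-8859-1 produces the garbled characters that character sets are meant to prevent.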
