If you simply type your text into an HTML editor, structure it with HTML elements, and then view the result in your web browser, the text will typically be displayed correctly. This might sound self-evident, but it isn’t. The HTML file does not contain the letters and other characters you entered, only bytes; that is, ones and zeros.
When saving the characters, your HTML editor has to use a specific method to transform the characters into bytes. The web browser must use the same method to transform the bytes back into characters in order to correctly recognise all the characters in the HTML document. This method is known as character encoding. There are many different encodings, for example for saving Western European, Cyrillic, or Arabic characters. We will address the topic of character encoding more closely in a separate chapter.
A character encoding relies on a translation table, which assigns a number or code to every character that can be entered. For example, in the Unicode character table the Latin letter “a” has the number 97, “b” is 98, “c” is 99, and so on. The collection of characters in such a table is known as the character set.
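The round trip from characters to bytes and back can be sketched in a few lines of Python. This is only an illustration, assuming UTF-8 as the encoding; the variable names are made up for the example.

```python
# Look up the number that the Unicode table assigns to each letter.
codes = {ch: ord(ch) for ch in "abc"}

# Transform the characters into bytes, as an editor does when saving.
saved = "abc".encode("utf-8")

# Transform the bytes back into characters, as a browser does when reading.
restored = saved.decode("utf-8")

print(codes)        # {'a': 97, 'b': 98, 'c': 99}
print(list(saved))  # [97, 98, 99]
print(restored)     # abc
```

For these three letters the byte values are identical to the code numbers, because UTF-8 stores every ASCII character as a single byte.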
Because many different encodings can be used for HTML files, it is essential that the browser decodes the file with the same encoding in which it was saved. You should therefore choose an editor that lets you select the encoding when saving. The HTML file should also declare which character encoding it uses, or the web server should at least convey it in the HTTP response. If such information is missing, then the HTML file is defective in terms of HTML standards. The web browser should not have to “guess” which character encoding is being used.
Nevertheless, web browsers are quite tolerant in this respect and typically fall back on a default encoding. On Western European systems, that fallback is often ISO 8859-1 (Latin-1) or the closely related Windows-1252. If your editor also saves files in that encoding, it seems as if there is no problem. However, people from Asia, Eastern Europe, or elsewhere, whose browsers fall back on a different encoding, may visit your website. If their browser reads an HTML file with an encoding other than the one it was saved in, the result is an incoherent, garbled mess.
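The garbling described above is easy to reproduce. The following sketch assumes a file saved as UTF-8 and a reader that wrongly decodes it as ISO 8859-1; the sample word is arbitrary.

```python
text = "café"

# The editor saves the text as UTF-8; "é" becomes the two bytes C3 A9.
saved = text.encode("utf-8")

# A reader that assumes ISO 8859-1 turns each byte into its own character.
garbled = saved.decode("iso-8859-1")

# A reader that uses the matching encoding recovers the original text.
correct = saved.decode("utf-8")

print(garbled)  # cafÃ©
print(correct)  # café
```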
Therefore, HTML gives you the opportunity to tell the browser which character encoding you are using. In the head of the document, within a so-called meta tag, you can provide information on the character encoding. Providing this information is definitely recommended, because the browser then knows how the bytes from the HTML file should be converted back into characters. It is then up to the browser to show the characters as intended, for example to a visitor from Asia.
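As a rough sketch, a page that declares its encoding might look like the one embedded in the Python string below. The short form `<meta charset="utf-8">` comes from HTML5 and is an assumption here; older documents use the longer `<meta http-equiv="Content-Type" content="text/html; charset=utf-8">`. The page content and variable names are invented for the example.

```python
# A minimal HTML page whose head declares the character encoding.
page = """<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>Example</title>
</head>
<body>
<p>Grüße aus Köln</p>
</body>
</html>
"""

# The declaration only helps if the bytes really are in the declared
# encoding, so the page is encoded as UTF-8 as well.
data = page.encode("utf-8")

# A browser that follows the declaration gets the original text back.
print(data.decode("utf-8") == page)  # True
```

The declaration and the actual encoding of the file have to agree; declaring UTF-8 while saving in another encoding brings the garbling back.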
The common ISO 8859 encodings work with a character set of at most 256 characters. This has the advantage that every character is represented by exactly one byte, but it also means that only 256 different characters can be saved. Nevertheless, the character set of HTML, by which we mean the set of usable characters independent of how they are saved, is not limited by the encoding being used. In principle, all Unicode characters can be used in an HTML document. More capable encodings, such as UTF-8, can encode every Unicode character directly into bytes.
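The difference between a one-byte encoding and UTF-8 can be illustrated as follows. The characters chosen are arbitrary examples; the euro sign serves as a character that ISO 8859-1 does not contain.

```python
# In ISO 8859-1, every supported character takes exactly one byte.
latin1_e = "é".encode("iso-8859-1")

# UTF-8 uses a variable number of bytes per character.
utf8_e = "é".encode("utf-8")
utf8_euro = "€".encode("utf-8")

# The euro sign is not among the 256 characters of ISO 8859-1,
# so it cannot be saved in that encoding at all.
try:
    "€".encode("iso-8859-1")
    euro_fits = True
except UnicodeEncodeError:
    euro_fits = False

print(len(latin1_e), len(utf8_e), len(utf8_euro))  # 1 2 3
print(euro_fits)                                   # False
```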
If you want to enter characters that aren’t included in the character set of the encoding you are using, you have two options: either use a special numeric notation (a numeric character reference), or use the named characters (named character references, also called entities) that HTML provides for frequently used special characters. Consult the HTML character reference table to find out more about both options.
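Both notations can be tried out with Python's standard library. In this sketch the euro sign again stands in for a character missing from ISO 8859-1; `&#8364;` is its numeric notation (8364 is its Unicode number) and `&euro;` its named form.

```python
import html

# Ask the encoder to write characters it cannot represent in
# numeric notation instead of failing.
price = "5 €".encode("iso-8859-1", errors="xmlcharrefreplace")
print(price)  # b'5 &#8364;'

# Both notations stand for the same character.
from_numeric = html.unescape("&#8364;")
from_named = html.unescape("&euro;")
print(from_numeric == from_named == "€")  # True
```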
Masking HTML-specific characters
If your text contains characters that have a special meaning in HTML, you have to mask (escape) them. Replace these characters as follows:
Replace & with &amp;
Replace < with &lt;
Replace > with &gt;
Replace " with &quot;
Note:
The most dangerous of these is the less-than sign (<). If you don’t mask this character, you can thoroughly confuse the web browser, because it assumes that an HTML tag follows. Failing to mask the other characters has much milder consequences, although masking them is still recommended.
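If text is inserted into a page by a program, the masking can be automated. The sketch below uses `html.escape` from Python's standard library, which performs the four replacements listed above (and, by default, also masks the single quote). The sample sentence is made up.

```python
import html

raw = 'Tom & Jerry say "1 < 2 > 0"'

# Replace the HTML-specific characters with their masked forms.
escaped = html.escape(raw)
print(escaped)  # Tom &amp; Jerry say &quot;1 &lt; 2 &gt; 0&quot;

# Unmasking restores the original text.
print(html.unescape(escaped) == raw)  # True
```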