HTML5 Character Encodings


In this article, we are going to learn about HTML5 character encodings. A character encoding specifies a mapping between bytes and text. We must declare the correct character encoding for an HTML document to display properly.

Need for character encoding

The encoding of an HTML or XML page should always be specified. Without it, the characters in your content risk being misinterpreted. This is not just a problem for human readability; increasingly, machines must also be able to interpret your data.
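The risk of misinterpretation can be illustrated with a short Python sketch. The sample word and the variable names are chosen only for illustration; the point is that the same bytes produce different text depending on which encoding the reader assumes.

```python
# The same text becomes different bytes under different encodings.
text = "café"
utf8_bytes = text.encode("utf-8")          # é is stored as two bytes: C3 A9
latin1_bytes = text.encode("iso-8859-1")   # é is stored as one byte: E9

# Decoding UTF-8 bytes with the wrong encoding produces garbled text.
misread = utf8_bytes.decode("iso-8859-1")
print(misread)   # cafÃ©

# Decoding with the encoding that was actually used recovers the text.
correct = utf8_bytes.decode("utf-8")
print(correct)   # café
```

A browser that guesses the wrong encoding for a page makes the same kind of mistake as the `misread` line above.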

Let’s look at the following topics for a better understanding of HTML5 character encodings.

Unicode Byte Order Mark (BOM)

A byte order mark (BOM) consists of the character code U+FEFF at the beginning of a data stream, where it can be used as a signature defining the byte order and encoding form, primarily of unmarked plaintext files.

Many Windows programs (including Windows Notepad) add the bytes 0xEF, 0xBB, 0xBF at the start of any document saved as UTF-8. This is the UTF-8 encoding of the Unicode byte order mark (BOM), and is commonly referred to as a UTF-8 BOM even though it is not relevant to byte order.

For an HTML5 document, you can place a Unicode Byte Order Mark (BOM) at the start of the file. It acts as a signature for the encoding used.
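As a rough sketch, the UTF-8 BOM bytes mentioned above can be detected and removed in Python. The helper name `strip_utf8_bom` is hypothetical; `codecs.BOM_UTF8` and the `utf-8-sig` codec are part of the Python standard library.

```python
import codecs

def strip_utf8_bom(data):
    """Hypothetical helper: return (had_bom, data_without_bom) for a byte string."""
    if data.startswith(codecs.BOM_UTF8):
        return True, data[len(codecs.BOM_UTF8):]
    return False, data

# Simulate a file saved by an editor that writes a UTF-8 BOM.
raw = codecs.BOM_UTF8 + "<!DOCTYPE html>".encode("utf-8")

had_bom, body = strip_utf8_bom(raw)
print(had_bom)   # True
print(body)      # b'<!DOCTYPE html>'

# The plain "utf-8" codec keeps the BOM as the character U+FEFF,
# while "utf-8-sig" drops a leading BOM if one is present.
with_bom = raw.decode("utf-8")
decoded = raw.decode("utf-8-sig")
print(decoded)   # <!DOCTYPE html>
```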

In the Beginning: ASCII

Computers store information as binary ones and zeros. The American Standard Code for Information Interchange (ASCII) was developed to standardise the storage of alphanumeric characters. It assigns a distinct 7-bit binary number to each character, covering the numerals, the upper- and lower-case letters of the English alphabet, and some special characters.

The primary shortcoming of ASCII is that it excludes non-English letters. ASCII is still widely used today, and its characters keep the same byte values in Windows-1252, ISO-8859-1, and UTF-8.
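The 7-bit codes and the limitation described above can be seen in a small Python sketch. The function name `is_ascii` and the sample strings are assumptions made for illustration.

```python
# Each ASCII character maps to a number from 0 to 127, which fits in 7 bits.
for ch in ["A", "a", "0", "~"]:
    code = ord(ch)
    print(repr(ch), code, format(code, "07b"))

def is_ascii(text):
    """Return True if every character has a code below 128."""
    return all(ord(c) < 128 for c in text)

print(is_ascii("Hello"))   # True
print(is_ascii("Héllo"))   # False

# Encoding a non-English letter as ASCII fails.
try:
    "é".encode("ascii")
    encodable = True
except UnicodeEncodeError:
    encodable = False
print(encodable)   # False
```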

In Windows: ANSI

ANSI (also called Windows-1252) was the default character set in Windows up to Windows 95. ANSI extends ASCII with additional international characters. Each character is stored in a single byte (8 bits), which allows up to 256 distinct characters.

All browsers support ANSI since it has been Windows' standard character set.

In HTML 4: ISO-8859-1

ISO-8859-1 was the character set most commonly used in HTML 4.

ISO-8859-1 is an extension of ASCII with additional international characters.
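Windows-1252 and ISO-8859-1 are both single-byte extensions of ASCII, but they are not identical. The sketch below, with sample strings chosen only for illustration, shows that they agree on a typical accented letter and differ on the euro sign, which Windows-1252 stores as the byte 0x80 and ISO-8859-1 cannot represent.

```python
# Both encodings store each character in exactly one byte.
sample = "naïve"
cp1252_bytes = sample.encode("cp1252")
latin1_bytes = sample.encode("iso-8859-1")
print(len(cp1252_bytes), len(latin1_bytes))   # 5 5
print(cp1252_bytes == latin1_bytes)           # True

# Windows-1252 assigns the euro sign to byte 0x80.
euro_cp1252 = "€".encode("cp1252")
print(euro_cp1252)   # b'\x80'

# ISO-8859-1 has no euro sign, so encoding it raises an error.
try:
    "€".encode("iso-8859-1")
    euro_in_latin1 = True
except UnicodeEncodeError:
    euro_in_latin1 = False
print(euro_in_latin1)   # False
```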

Syntax

<meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1">

In HTML 4, the <meta> tag can be used to specify a character set other than ISO-8859-1 −

Syntax

<meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-8">

All HTML 4 processors also support UTF-8 −

<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">

In HTML5: Unicode UTF-8

The character sets mentioned above are limited in size and incompatible with one another in multilingual environments, so the Unicode Consortium created the Unicode Standard.

The Unicode Standard covers (nearly) all the characters, punctuation, and symbols used worldwide. Unicode makes it possible to process, store, and send text regardless of platform or language.
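UTF-8 stores each Unicode character in one to four bytes, and ASCII characters keep their single-byte values. The following Python sketch prints the code point and UTF-8 bytes for a few sample characters; the characters themselves are chosen only for illustration.

```python
# One sample character for each UTF-8 length, from one byte to four bytes.
samples = ["A", "é", "€", "😀"]

for ch in samples:
    encoded = ch.encode("utf-8")
    hex_bytes = " ".join(format(b, "02X") for b in encoded)
    print(f"U+{ord(ch):04X} {ch} -> {len(encoded)} byte(s): {hex_bytes}")

byte_lengths = [len(ch.encode("utf-8")) for ch in samples]
print(byte_lengths)   # [1, 2, 3, 4]
```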

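Note − UTF-8 is the standard character encoding for HTML5. Declare it explicitly, because a browser that finds no declaration may fall back to a legacy encoding.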

The HTML5 specification encourages web developers to use the UTF-8 character set.

<meta charset="UTF-8">

A character set other than UTF-8 can be specified in the <meta> tag −

<meta charset="ISO-8859-1">
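To show how the two `<meta>` forms above carry the same information, here is a simplified Python sketch that reads the declared character set from an HTML string. The names `CharsetFinder` and `find_charset` are hypothetical, and this is not the detection algorithm browsers actually use; it only handles the two declaration styles shown in this article.

```python
from html.parser import HTMLParser

class CharsetFinder(HTMLParser):
    """Hypothetical helper: records the first charset declared in a <meta> tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attr = dict(attrs)
        if attr.get("charset"):
            # HTML5 form: <meta charset="UTF-8">
            self.charset = attr["charset"].strip().lower()
        elif (attr.get("http-equiv") or "").lower() == "content-type":
            # HTML 4 form: content="text/html;charset=ISO-8859-1"
            for part in (attr.get("content") or "").split(";"):
                name, _, value = part.strip().partition("=")
                if name.strip().lower() == "charset" and value:
                    self.charset = value.strip().lower()

def find_charset(html):
    """Return the declared charset in lower case, or None if none is declared."""
    finder = CharsetFinder()
    finder.feed(html)
    return finder.charset

print(find_charset('<meta charset="UTF-8">'))   # utf-8
print(find_charset(
    '<meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1">'
))   # iso-8859-1
```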

Updated on: 11-Oct-2023
