
Why It's Necessary To Specify The Character Encoding In An Html5 Document If The Default Character Encoding For Html5 Is Utf-8?

I have the following HTML5 document:

Beträge: 20€


Solution 1:

HTTP/1.1 specifies that browsers should treat all text as ISO-8859-1 unless told otherwise:

When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1"

At the same time, HTML5 specifies that

If the transport layer specifies an encoding, and it is supported, return that encoding with the confidence certain, and abort these steps.

So HTTP/1.1 defaults to ISO-8859-1, and that default overrides everything else.
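The transport-layer step quoted above can be sketched roughly as follows. This is only a simplified illustration: `TransportCharset` and `fromContentType` are hypothetical names, the parsing is deliberately naive, and a real browser follows the full HTML5 encoding sniffing algorithm rather than anything this short.

```java
import java.nio.charset.Charset;
import java.util.Locale;

public class TransportCharset {
    // Hypothetical helper: returns the charset named in a Content-Type header
    // value, or null if none is given or the JVM does not support it.
    // Limitation: this uses Java's own charset names, not the browser's label table.
    static Charset fromContentType(String contentType) {
        if (contentType == null) {
            return null;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.isSupported(name) ? Charset.forName(name) : null;
                } catch (IllegalArgumentException e) {
                    return null; // empty or syntactically illegal charset name
                }
            }
        }
        return null; // no charset parameter: the browser has to decide some other way
    }

    public static void main(String[] args) {
        System.out.println(fromContentType("text/html; charset=utf-8"));
        System.out.println(fromContentType("text/html"));
    }
}
```

If the server sends `Content-Type: text/html; charset=utf-8`, the helper returns UTF-8 and no further guessing is needed; with a bare `text/html` it returns null, which is the situation the question describes.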

If you encode

Beträge: 20€

with UTF-8, and then decode it as ISO-8859-1, you get exactly the garbled output:

BetrÃ¤ge: 20â¬

as the following code snippet demonstrates (in Java, though the language doesn't really matter):

new String("Beträge: 20€".getBytes("utf-8"), "iso-8859-1")
// result: BetrÃ¤ge: 20â¬
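To see why the text comes out garbled, it helps to look at the bytes. The following is a minimal sketch (the class and method names are made up for illustration) that prints the UTF-8 bytes of "ä" and "€" and then decodes them as ISO-8859-1. Unicode escapes are used so that the result does not depend on the encoding of the source file.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Hex dump of the UTF-8 bytes of a string, separated by spaces.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Encode as UTF-8, then (wrongly) decode the bytes as ISO-8859-1.
    static String misdecode(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println(utf8Hex("\u00E4"));            // ä -> C3 A4
        System.out.println(utf8Hex("\u20AC"));            // € -> E2 82 AC
        System.out.println(misdecode("\u00E4").length()); // 2 characters: Ã and ¤
        System.out.println(misdecode("\u20AC").length()); // 3 characters: â, U+0082, ¬
    }
}
```

Each UTF-8 byte becomes one ISO-8859-1 character, so "ä" turns into two characters and "€" into three. The middle one of those three is the invisible control character U+0082, which is why only "â¬" is visible in the output.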

The browser actually does warn you about it. E.g. Firefox displays the following warning in the console:

The character encoding of the HTML document was not declared. The document will render with garbled text in some browser configurations if the document contains characters from outside the US-ASCII range. The character encoding of the page must be declared in the document or in the transfer protocol.

To obtain the correct output, you have to manually switch the encoding from ISO-8859-1 to UTF-8 (in Firefox, this is under View -> Text Encoding -> Unicode instead of "Western").


So, to conclude: I don't see where it even says that "the default character encoding for HTML5 is UTF-8". All it says seems to be:

Authors are encouraged to use UTF-8. Conformance checkers may advise authors against using legacy encodings.

Solution 2:

Because the statement "the default character encoding for HTML5 is UTF-8" is wrong. The claim is spread by some websites, but as Marcel Dopita writes in "Don’t be fooled by w3schools, UTF-8 is not the default HTML5 charset", it is incorrect: the W3C recommendation in fact has a "suggested default encoding" of Windows-1252 for English locales.

It is sometimes stated that "HTTP/1.1 defaults to ISO-8859-1". This was true in the 1999 standard (RFC 2616), but the 2014 revision (RFCs 7230-7235) removed the default charset, so the default behaviour is now specified only by the HTML5 recommendation. Also, even if the transport layer does specify "iso-8859-1", HTML5 does not support it as a separate encoding: the encoding specification says it should be treated as a label for Windows-1252.
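The difference between the two encodings is visible in the euro sign example. As a small sketch (the class name is hypothetical, and it assumes the JVM ships the "windows-1252" charset, which common JDKs do but which the Java specification does not guarantee), the same three UTF-8 bytes can be decoded both ways:

```java
import java.nio.charset.Charset;

public class Latin1VsWindows1252 {
    // The UTF-8 encoding of the euro sign (U+20AC).
    static final byte[] EURO_UTF8 = {(byte) 0xE2, (byte) 0x82, (byte) 0xAC};

    // Decode the euro sign's UTF-8 bytes with the named (wrong) charset.
    static String decode(String charsetName) {
        return new String(EURO_UTF8, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        String latin1 = decode("ISO-8859-1");   // byte 0x82 -> control character U+0082
        String cp1252 = decode("windows-1252"); // byte 0x82 -> U+201A (low quotation mark)
        System.out.println(Integer.toHexString(latin1.charAt(1)));
        System.out.println(Integer.toHexString(cp1252.charAt(1)));
    }
}
```

Strict ISO-8859-1 maps byte 0x82 to an invisible control character, while Windows-1252 maps it to "‚". A browser that treats "iso-8859-1" as a label for Windows-1252 would therefore show "â‚¬" for a mis-decoded euro sign rather than "â¬".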
