index.html
+47 -28 (47 additions & 28 deletions)
@@ -2014,10 +2014,10 @@ <h3>Choosing character encodings</h3>
2014
2014
</aside>
2015
2015
2016
2016
<div class="req" id="char-use-utf8">
2017
-
<p class="advisement">Specify UTF-8 for all document formats, protocols, or serialization forms unless you have a good reason not to.</p>
2017
+
<p class="advisement">Use UTF-8 for all document formats, protocols, or serialization forms.</p>
2018
2018
</div>
2019
2019
2020
-
<p>When specifying the serialization of text, whether it be in a file, format, or protocol, UTF-8 is the best choice for nearly all applications.</p>
2020
+
<p>UTF-8 is the best choice for nearly all applications.</p>
2021
2021
2022
2022
<aside class="note">
2023
2023
<p>Web APIs and text processing are usually specified using strings rather than trying to grapple with the raw byte sequences in a specific [=character encoding form=]. As noted in [[[#char_string]]], these strings are typically represented using UTF-16 [=code units=] ({{DOMString}}) or, less commonly, as Unicode [=code points=] ({{USVString}}). Because the conversion between these forms and UTF-8 is algorithmic, lossless, and usually invisible to users, and since UTF-16 is a comparatively poor choice for serialization, UTF-8 is the preferred [=character encoding=] for storage and transmission.</p>
@@ -2029,33 +2029,8 @@ <h3>Choosing character encodings</h3>
2029
2029
2030
2030
<p>New protocols and formats, as well as existing formats deployed in new contexts, are required to use the UTF-8 character encoding. This policy applies to IETF and Web standards and is articulated in [[RFC2277]], [[RFC3629]], [[Encoding]], [[design-principles]], and many more. The only specifications that need <a>legacy character encodings</a> are those that work with older protocols or formats and even there UTF-8 is strongly recommended.</p>
2031
2031
2032
-
<div class="req" id="char_identification">
2033
-
<p class="advisement">Specifications that allow multiple [=character encoding forms=] MUST provide character encoding identification mechanisms such that the encoding of text can be reliably identified.</p>
2034
-
<details class="links"><summary>explanations & examples</summary>
2035
-
<p><a href="https://www.w3.org/TR/charmod/#sec-Encodings">Choice and Identification of Character Encodings, C015</a>, in <cite>Character Model for the World Wide Web: Fundamentals</cite></p>
2036
-
</details>
2037
-
</div>
2038
-
2039
-
<div class="req" id="char_enc_rules">
2040
-
<p class="advisement">When basing a protocol, format, or API on a protocol, format, or API that already has rules for choosing, applying, or labeling the character encoding, specifications SHOULD use the existing rules rather than change these rules.</p>
2041
-
<details class="links"><summary>explanations & examples</summary>
2042
-
<p><a href="https://www.w3.org/TR/charmod/#sec-Encodings">Choice and Identification of Character Encodings, C017</a>, in <cite>Character Model for the World Wide Web: Fundamentals</cite></p>
2043
-
</details>
2044
-
</div>
2045
-
2046
-
<p class="issue">The above needs more work to incorporate the guidance to use UTF-8 when the protocol/format is used in a new context.</p>
2047
-
2048
-
<div class="req" id="char_charset">
2049
-
<p class="advisement">Specifications SHOULD avoid using the terms 'character set' and 'charset' to refer to a character encoding, except when the latter is used to refer to the MIME charset parameter or its IANA-registered values. The terms [=character encoding=] or [=character encoding form=] are RECOMMENDED.</p>
2050
-
<details class="links"><summary>explanations & examples</summary>
2051
-
<p><a href="https://www.w3.org/TR/charmod/#sec-EncodingIdent">Mandating a unique character encoding, C020</a>, in <cite>Character Model for the World Wide Web: Fundamentals</cite></p>
2052
-
</details>
2053
-
</div>
2054
-
2055
-
<p class="issue">Is the above MUSTard needed?</p>
2056
-
2057
2032
<div class="req" id="char-use-encoding-std">
2058
-
<p class="advisement">If a specification permits [=legacy character encodings=], it <del>SHOULD</del> MUST restrict the set of [=character encodings=] to those listed in the [[[Encoding]]] in the section "Names and Labels". Other encodings SHOULD NOT be used, except by private agreement.</p>
2033
+
<p class="advisement">If, for historical reasons, a specification permits [=legacy character encodings=], it MUST restrict the set of [=character encodings=] to those listed in the [[[Encoding]]] in the section "Names and Labels". Other encodings SHOULD NOT be used, except by private agreement.</p>
2059
2034
<details class="links"><summary>explanations & examples</summary>
2060
2035
<p><a href="https://www.w3.org/TR/charmod/#sec-EncodingIdent">Character encoding identification, C021</a>, in <cite>Character Model for the World Wide Web: Fundamentals</cite></p>
2061
2036
<p><a href="https://www.w3.org/TR/charmod/#sec-EncodingIdent">Character encoding identification, C022</a>, in <cite>Character Model for the World Wide Web: Fundamentals</cite></p>
@@ -2079,6 +2054,50 @@ <h3>Identifying character encodings</h3>
2079
2054
</ul>
2080
2055
</aside>
2081
2056
2057
+
<div class="req" id="char_identification">
2058
+
<p class="advisement">Specifications that allow multiple [=character encoding forms=] MUST provide a mechanism, such as a field or parameter, that clearly identifies the encoding of text.</p>
2059
+
<details class="links"><summary>explanations & examples</summary>
2060
+
<p><a href="https://www.w3.org/TR/charmod/#sec-Encodings">Choice and Identification of Character Encodings, C015</a>, in <cite>Character Model for the World Wide Web: Fundamentals</cite></p>
2061
+
</details>
2062
+
</div>
2063
+
2064
+
<p>[=Character encodings=] cannot be reliably detected just from the byte values. If encodings other than UTF-8 are permitted, there has to be some mechanism for the [=consumer=] to determine what the encoding is.</p>
2065
+
2066
+
<aside class="example" title="Examples of character encoding mechanisms">
2067
+
<p>Here are a few examples of ways that some common specifications indicate encoding:</p>
<td>New MIME types should not specify a <code>charset</code> parameter. They should always specify UTF-8 instead.</td>
2086
+
</tr>
2087
+
</table>
2088
+
</aside>
2089
+
2090
+
2091
+
<div class="req" id="char_enc_rules">
2092
+
<p class="advisement">If a protocol, format, or API is based on a format that already has rules for choosing, applying, or labeling the character encoding, the specification MUST NOT define a separate mechanism for identifying the encoding.</p>
2093
+
<details class="links"><summary>explanations & examples</summary>
2094
+
<p><a href="https://www.w3.org/TR/charmod/#sec-Encodings">Choice and Identification of Character Encodings, C017</a>, in <cite>Character Model for the World Wide Web: Fundamentals</cite></p>
2095
+
</details>
2096
+
</div>
2097
+
2098
+
<div class="req" id="char_enc_rules">
2099
+
<p class="advisement">If a specification is based on a format that permits encodings other than UTF-8, the specification SHOULD restrict the encoding to UTF-8.</p>
2100
+
</div>
2082
2101
2083
2102
<div class="req" id="char_heuristics">
2084
2103
<p class="advisement">Specifications MUST NOT propose the use of heuristics to determine the encoding of data.</p>