Fixes #151
Adds a new subsection about byte-oriented formats.
Adds guidance about ByteString and isomorphic strings.
Moves the guidance about not defining legacy encodings to the encoding section.
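
For reviewers less familiar with the [INFRA] terms used in the new text, here is a minimal sketch of what isomorphic decode and isomorphic encode do. It is an illustration only, not part of this change: the function names `isomorphicDecode` and `isomorphicEncode` are hypothetical helpers written for this comment, and the sketch assumes JavaScript strings, where each code unit stands in for one code point in the range U+0000 to U+00FF.

```typescript
// Hypothetical helper: isomorphic decode as described in Infra.
// Each byte becomes the code point with the same numeric value.
function isomorphicDecode(bytes: Uint8Array): string {
  let result = "";
  bytes.forEach((byte) => {
    result += String.fromCharCode(byte);
  });
  return result;
}

// Hypothetical helper: isomorphic encode as described in Infra.
// Only valid for isomorphic strings (every code point is U+00FF or below).
function isomorphicEncode(input: string): Uint8Array {
  const bytes = new Uint8Array(input.length);
  for (let i = 0; i < input.length; i++) {
    const codeUnit = input.charCodeAt(i);
    if (codeUnit > 0xff) {
      throw new RangeError(`not an isomorphic string (index ${i})`);
    }
    bytes[i] = codeUnit;
  }
  return bytes;
}

// The UTF-8 bytes for "café": the final character takes two bytes.
const wireBytes = new Uint8Array([0x63, 0x61, 0x66, 0xc3, 0xa9]);

// Isomorphic decode keeps one code point per byte, so the result has
// five code points ("cafÃ©"), not the four characters of the text.
const isomorphic = isomorphicDecode(wireBytes);

// Encoding the isomorphic string returns exactly the original bytes.
const roundTripped = isomorphicEncode(isomorphic);
```

The round trip is lossless for bytes, which is why this form suits byte-level processing, but the resulting string is not readable text unless the bytes happen to be ASCII.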
index.html
25 additions & 18 deletions
@@ -1671,33 +1671,38 @@ <h3>Choosing a definition of 'string'</h3>
1671
1671
<p class="advisement">Avoid mixing {{DOMString}} and {{USVString}} in a single document or protocol operation. Often you will choose a {{DOMString}} over a {{USVString}}, since the latter requires extra processing that does not benefit most document formats or protocols.</p>
<p>[[[#char_choosing]]] for additional best practices related to [=character encodings=].</p>
1680
+
<p><a href="https://www.w3.org/TR/string-meta/#protocol-strings">Strings that are part of a legacy protocol or format</a>, in <cite>Strings on the Web: Language and Direction Metadata</cite> [[STRING-META]]</p>
1681
+
</div>
1682
+
1683
+
<p>Sometimes specifications need to deal with byte-oriented contexts. For example, the specification might be defining a binary format or working with a byte-oriented protocol.</p>
1684
+
1685
+
1686
+
<div class="req" id="char-string-dom-usv-bytes">
1687
+
<p class="advisement">Specify fields in protocols that are 'string-like' as {{DOMString}} or, rarely, {{USVString}}, unless there is some reason to interact with specific byte values or for which the UTF-8 [=character encoding=] cannot be assumed.</p>
1688
+
</div>
1689
+
1690
+
<p>If the field in question is meant to be treated as a string, working with (Unicode) characters will be more reliable than trying to work with byte values. The data encoded into these fields will be deserialized from the wire format into your local in-memory string representation, such as the [[DOM]], JavaScript strings, or your platform's native Unicode string type and later it will need to be serialized into the wire format using some [=character encoding form=] (usually UTF-8).</p>
1691
+
1674
1692
<div class="req" id="char_string_byte">
1675
-
<p class="advisement">Specifications SHOULD NOT define a string as a {{ByteString}} or as a sequence of bytes ('byte string'). For binary data or sequences of bytes, use {{Uint8Array}} instead.</p>
1693
+
<p class="advisement">Specify {{ByteString}} only when working with protocols (such as HTTP) or formats that don't distinguish between bytes and strings. If you need to represent a sequence of bytes, use {{Uint8Array}}.</p>
1676
1694
<details class="links"><summary>explanations & examples</summary>
1677
1695
<p><a href="https://www.w3.org/TR/charmod/#sec-Strings">String concepts, C011</a>, in <cite>Character Model for the World Wide Web: Fundamentals</cite>.</p>
1678
-
<p><a href="https://www.w3.org/TR/string-meta/#protocol-strings">Strings that are part of a legacy protocol or format</a>, in <cite>Strings on the Web: Language and Direction Metadata</cite> [[STRING-META]]</p>
1679
1696
<p><a href="https://www.w3.org/TR/design-principles/#idl-string-types">IDL String Types</a> in <cite>Web Platform Design Principles</cite> [[DESIGN-PRINCIPLES]]</p>
1680
1697
</details>
1681
1698
</div>
1682
1699
1683
-
<div class="req" id="char_string_no_legacy">
1684
-
<p class="advisement">Specifications SHOULD NOT add or define support for <a>legacy character encodings</a> unless there is a specific reason to do so.</p>
1685
-
<details class="links"><summary>explanations & examples</summary>
1686
-
<p>See also <a href="#char_choosing"></a>.</p>
1687
-
</details>
1688
-
</div>
1689
-
1690
-
<p>The type {{ByteString}} defines strings as sequences of bytes (octets). Interpretation of byte strings thus requires the specification of a <a>character encoding form</a>. UTF-8 is the preferred encoding for wire and document formats [[ENCODING]], but there is generally no reason to specify strings in terms of the underlying byte values.</p>
1700
+
<p>{{ByteString}} isn’t a general-purpose string type. The type {{ByteString}} defines strings as sequences of bytes (octets). Interpretation of byte strings thus requires the specification of a [=character encoding form=]. UTF-8 is the preferred encoding for wire and document formats on the Web [[ENCODING]] or the Internet in general [[RFC3629]]. If the field is encoded in UTF-8, there is rarely a reason to interact with it as a byte sequence.</p>
1691
1701
1692
-
<aside class="note">
1693
-
<p>Specifications for document formats or protocols often deal with the specific byte values used for various fields or values or with the <a>character encoding</a> used for serializing the data. It is therefore tempting to specify a text field ("string") as a {{ByteString}} which uses the <a>UTF-8</a> <a>character encoding form</a>.</p>
1694
-
1695
-
<p>It is preferable, however, to specify these fields as a {{DOMString}} (or, rarely, a {{USVString}}), since the data encoded into these fields must be serialized from and deserialized into in-memory string representations, such as the [[DOM]] or JavaScript strings or your platform's native Unicode string type.</p>
1696
-
</aside>
1697
-
1698
-
<p>See <a href="#char_choosing"></a> for additional best practices.</p>
1702
+
<p>If a specification needs to interact with or process specific byte values, such as when working with a binary format, and does not or cannot rely on the later UTF-8 serialization of a {{DOMString}} or {{USVString}}, it might be necessary to specify the use of an [=isomorphic string=] [[INFRA]] for processing. The specification will then use an [=isomorphic encode=] to serialize the string to bytes and an [=isomorphic decode=] when deserializing from the wire or storage format.</p>
1699
1703
1700
1704
</section>
1705
+
</section>
1701
1706
1702
1707
1703
1708
@@ -1900,7 +1905,9 @@ <h3>Choosing character encodings</h3>
1900
1905
</ul>
1901
1906
</aside>
1902
1907
1903
-
1908
+
<div class="req" id="char_string_no_legacy">
1909
+
<p class="advisement">Specifications SHOULD NOT add or define support for <a>legacy character encodings</a> unless there is a specific reason to do so.</p>
1910
+
</div>
1904
1911
1905
1912
<div class="req" id="char_identification">
1906
1913
<p class="advisement">Specifications MUST either specify a unique character encoding, or provide character encoding identification mechanisms such that the encoding of text can be reliably identified.</p>