index.html
Lines changed: 3 additions & 3 deletions
@@ -1678,15 +1678,15 @@ <h4>Characters stored in byte sequences</h4>
1678
1678
<p><a href="https://www.w3.org/TR/string-meta/#protocol-strings">Strings that are part of a legacy protocol or format</a>, in <cite>Strings on the Web: Language and Direction Metadata</cite> [[STRING-META]]</p>
1679
1679
</div>
1680
1680
1681
-
<p>Prior to the widespread adoption of Unicode, the basic definition of a string was a sequence of bytes in some (usually national or language-specific) [=coded character set=]. The general term <strong><em>byte string</em></strong> was sometimes used for this definition of a string.</p>
1681
+
<p>Prior to the widespread adoption of Unicode, it was common to define a string as a <strong><em>byte string</em></strong>, in which a string was simply a sequence of byte values rather than a sequence of characters or [=code points=]. A familiar manifestation of byte strings is a <code lang="zxx" translate="no">char*</code> in the C programming language.</p>
1682
1682
1683
-
<p>A familiar manifestation of byte strings is the <code>char*</code> type in the C programming language. Interpreting such byte strings requires the specification of a [=character encoding form=], because different [=character encodings=] use the same byte values for different purposes. Many [=legacy character encodings=] are stateful: processing such encodings often requires starting at the beginning of the byte buffer, so that character state is retained and the [=abstract character=] can be decoded, processed, or modified successfully.</p>
1683
+
<p>Processing or interpreting a byte string depends on the [=character encoding form=]. Many [=legacy character encodings=] are stateful: processing such encodings often requires starting at the beginning of the byte buffer, so that character state is retained and the [=abstract character=] can be decoded, processed, or modified successfully. A given byte value in such an encoding might mean different things depending on the bytes adjacent to it. For example, the exact same byte value might stand alone to represent a character or, depending on the preceding bytes, be part of a multibyte sequence that represents a different character. The rules for determining how to interpret each byte or byte sequence are different for different [=legacy character encodings=].</p>
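The point made in the added paragraph, that one byte value can stand alone or be part of a multibyte sequence, can be illustrated with a short sketch. Shift_JIS is chosen here only as an assumed example of a legacy multibyte encoding; the commit does not name one, and the variable names are hypothetical.

```python
# Illustrative sketch only: Shift_JIS is an assumed example encoding,
# not one named in the text under review.

# The byte 0x41 on its own decodes to the ASCII letter "A".
lone = b"\x41".decode("shift_jis")

# After the lead byte 0x83, the same byte value 0x41 is a trail byte:
# together the two bytes encode KATAKANA LETTER A (U+30A2).
pair = b"\x83\x41".decode("shift_jis")

# A naive byte-level search therefore reports a false match for "A".
data = "\u30a2".encode("shift_jis")      # b"\x83\x41"
false_match = b"A" in data               # True, although the text has no "A"

# Starting in the middle of the buffer loses the lead byte, so the
# remaining bytes are misread as two separate ASCII letters.
buffer = "\u30a2A".encode("shift_jis")   # b"\x83\x41\x41"
from_start = buffer.decode("shift_jis")       # "\u30a2A" (correct)
from_middle = buffer[1:].decode("shift_jis")  # "AA" (wrong)

print(lone, pair, false_match, from_start, from_middle)
```

Searching or slicing at the byte level is only safe once the encoding's rules have been applied from a known good starting position, which is why the paragraph stresses starting at the beginning of the buffer.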
<p><a href="https://www.w3.org/TR/charmod/#sec-Strings" target="_blank">String concepts</a> in [[[CHARMOD]]])</p>
1687
1687
</div>
1688
1688
1689
-
<p>UTF-8 is the preferred encoding for wire and document formats on the Web [[ENCODING]] or the Internet in general [[RFC3629]]. When content is encoded in UTF-8, there is rarely a reason to interact with it as a byte sequence. Most Web APIs and interfaces are more concerned with the [=code point=] sequence, since that represents the characters in question, rather than the specific byte values.</p>
1689
+
<p>UTF-8 is the preferred [=character encoding=] for wire and document formats on the Web [[ENCODING]] or the Internet in general [[RFC3629]]. When content is encoded in UTF-8, there is rarely a reason to interact with it as a byte sequence. Most Web APIs and interfaces are more concerned with the [=code point=] sequence, since that represents the characters in question, rather than the specific byte values.</p>
1690
1690
1691
1691
<p>Sometimes specifications do need to deal with the storage, interpretation, and manipulation of byte values. In particular, many document formats and protocols were defined around the use of 7-bit [[ASCII]] bytes, while allowing the inclusion or interchange of non-ASCII data values via the use of various character or data encoding schemes. Sometimes this is done by designating a [=character encoding form=], such as with the <code>charset</code> parameter of the <code>text</code> media types. Or it might be done by encoding byte values using some special syntax, an example of which would be [=percent encoding=].</p>
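As a rough illustration of the two mechanisms the last paragraph mentions, the sketch below percent-encodes the UTF-8 bytes of a non-ASCII character so that they travel as 7-bit ASCII, then decodes the recovered bytes under two different encoding labels. It uses Python's standard `urllib.parse` module; the sample text and variable names are assumptions made for the example.

```python
from urllib.parse import quote, unquote_to_bytes

# Hypothetical sample text containing one non-ASCII character (U+00E9).
text = "\u00e9"

# Percent-encoding: each UTF-8 byte becomes an ASCII "%XX" escape.
utf8_bytes = text.encode("utf-8")     # b"\xc3\xa9"
escaped = quote(text, safe="")        # "%C3%A9"

# The receiver recovers the raw byte values from the escapes...
raw = unquote_to_bytes(escaped)       # b"\xc3\xa9"

# ...but still has to be told which character encoding to apply,
# for example through a charset parameter.
as_utf8 = raw.decode("utf-8")         # "\u00e9", the intended character
as_latin1 = raw.decode("iso-8859-1")  # "\u00c3\u00a9", two wrong characters

print(escaped, raw, as_utf8 == text, as_latin1 == text)
```

The same two bytes yield one character under UTF-8 and two unrelated characters under ISO-8859-1, which is the reason a byte-oriented format needs an explicit encoding declaration.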