Commit e477cfe

committed
Fix broken family-example link
1 parent 8cdf2bb commit e477cfe

1 file changed

+21
-56
lines changed

index.html

Lines changed: 21 additions & 56 deletions
Original file line number | Diff line number | Diff line change
@@ -1546,59 +1546,6 @@ <h3 id="char_def">Characters and character encoding basics</h3>
15461546
</tr>
15471547
</table>
15481548

1549-
<!--
1550-
<p>Here is the word for "Unicode" in the Hindi language (which uses the Devanagari script):</p>
1551-
1552-
<p class="bigtext" lang="hi">&#x092F;&#x0942;&#x0928;&#x093F;&#x0915;&#x094B;&#x0921;</p>
1553-
1554-
<p>This word contains four [=visual text units=] (four [=grapheme clusters=]):</p>
1555-
1556-
<p><span class="bigtext" lang="hi">&#x092F;&#x0942;&nbsp;<span>&#x0928;&#x093F;</span>&nbsp;<span>&#x0915;&#x094B;</span>&nbsp;<span>&#x0921;</span></p>
1557-
1558-
<p>Several of these [=grapheme clusters=] are made up of more than one Unicode [=code point=] because of the way that the Devanagari script works. Devanagari is an example of a script that uses combining characters. In fact, use of such characters is required to write text in this script. In this case, the four [=grapheme clusters=] are composed from seven [=abstract characters=], each of which is assigned a [=Unicode Scalar Value=] to serve as its [=code point=]. This sequence of [=code points=] can be encoded into a byte sequence using the UTF-8 [=character encoding=]:</p>
1559-
1560-
<table class="charTermExample">
1561-
<tr>
1562-
<th style="width:25%">Character</th>
1563-
<td class="bigtext">&#x092f;</td>
1564-
<td class="bigtext">&#x0942;</td>
1565-
<td class="bigtext">&#x0928;</td>
1566-
<td class="bigtext">&#x093f;</td>
1567-
<td class="bigtext">&#x0915;</td>
1568-
<td class="bigtext">&#x094b;</td>
1569-
<td class="bigtext">&#x0921;</td>
1570-
</tr>
1571-
<tr>
1572-
<th>Code Point</th>
1573-
<td><code>U+092F</code></td>
1574-
<td><code>U+0942</code></td>
1575-
<td><code>U+0928</code></td>
1576-
<td><code>U+093F</code></td>
1577-
<td><code>U+0915</code></td>
1578-
<td><code>U+094B</code></td>
1579-
<td><code>U+0921</code></td>
1580-
</tr>
1581-
<tr>
1582-
<th>UTF-8 Code Units</th>
1583-
<td><code>E0 A4 AF</code></td>
1584-
<td><code>E0 A5 82</code></td>
1585-
<td><code>E0 A4 A8</code></td>
1586-
<td><code>E0 A4 BF</code></td>
1587-
<td><code>E0 A4 95</code></td>
1588-
<td><code>E0 A5 8B</code></td>
1589-
<td><code>E0 A4 A1</code></td>
1590-
</tr>
1591-
1592-
<p>Я❤️🇨🇭🐄!</p>
1593-
1594-
1595-
1596-
</table>
1597-
1598-
<p></p>
1599-
1600-
-->
1601-
16021549
</aside>
16031550

16041551
<div class="req" id="char_sounds">
@@ -3442,11 +3389,16 @@ <h3>Truncating or limiting the length of strings</h3>
34423389

34433390
<p>Keep in mind that, while the examples chosen here are roughly the same length, other languages might require more characters to convey the same concepts. For example, the Scottish Gaelic translation would be <q lang="gd">Is urrainn dhomh glainne ithe, chan eil e gam ghoirteachadh</q>, which is significantly longer than the English. Many languages have a different grammatical structure as well, so that key information (such as the verb) appears at the end of the sentence (as is common in Hindi or Japanese).</p>
34443391

3445-
<p>Finally, don't forget that the limit will also interact with the truncation boundary chosen (as shown in [[[#example-code-unit-trunc-bad]]]): if the truncation is done naively at the 15th byte, the resulting string might contain only a partial character. For example, the Marathi could experience this problem: <span class="codepoint"><bdi lang="ma">मी का�...</bdi></span>.</p>
3392+
<p>Finally, don't forget that the limit will also interact with the truncation boundary chosen (as shown in [[[#example-code-unit-trunc-bad]]]): if the truncation is done naively at the 15th byte, the resulting string might contain only a partial character. For example, the Marathi could experience this problem:</p>
3393+
3394+
<p class="bigtext" lang="mr">मी का�...</p>
3395+
3396+
</aside>
3397+
<aside class="example" id="family-example" title="Emoji sequences as an example of grapheme clusters">
34463398

34473399
<p>Certain emoji are another example of the complex relationship between [=visual text units=] and [=code points=]. The emoji character for "family" has a code point in Unicode: <span class="codepoint" translate="no"><bdi lang="en">&#x1F46A;</bdi><code class="uname">U+1F46A FAMILY</code></span>. It can also be formed by using a sequence of [=code points=]: <code class="uname">U+1F468 U+200D U+1F469 U+200D U+1F466</code>.</p>
34483400

3449-
<p>The character <span class="codepoint" translate="no"><img alt="ZWJ" src="./images/200D.png"><span class="uname" translate="no">U+200D ZERO WIDTH JOINER</span></span> is used to "join" separate emoji characters together (it also has a role in joining characters in various writing systems of the world). This compositional mechanism can be used to create other family variations. For example, the sequence <span class="codepoint" translate="no"><bdi translate="no">&#x1f468;&#x200d;&#x1f469;&#x200d;&#x1f467;&#x200d;&#x1f466;</bdi><code class="uname" translate="no">U+1F468 U+200D U+1F469 U+200D U+1F467 U+200D U+1F466</code></span> results in a composed emoji character for a "family: man, woman, girl, boy" on systems that support this kind of composition. That long character sequence still represents just a single [=visual text unit=]. Other characters, such as skin tone modifiers, can further extend the [=grapheme cluster=]:</p>
3401+
<p>The character <span class="codepoint" translate="no"><img alt="ZWJ" src="./images/200D.png"><span class="uname" translate="no">U+200D ZERO WIDTH JOINER</span></span> is used to "join" separate emoji characters together (it also has a role in joining characters in various writing systems of the world). This compositional mechanism can be used to create other family variations. For example, the sequence <span class="codepoint" translate="no"><bdi translate="no">&#x1f468;&#x200d;&#x1f469;&#x200d;&#x1f467;&#x200d;&#x1f466;</bdi><code class="uname" translate="no">U+1F468 U+200D U+1F469 U+200D U+1F467 U+200D U+1F466</code></span> results in a composed emoji character for a "family: man, woman, girl, boy" on systems that support this kind of composition. That long character sequence still represents just a single [=visual text unit=]. Other characters, such as skin tone modifiers, can further extend the [=grapheme cluster=]. Here are just a few of the possible emoji sequences for representing a family:</p>
34503402

34513403

34523404
<table class="cpExample" style="width:95%">
@@ -3472,7 +3424,20 @@ <h3>Truncating or limiting the length of strings</h3>
34723424
</tr>
34733425
</table>
34743426

3475-
<p>Many common emoji can <em>only</em> be formed using sequences of code points, but should be treated as a single [=visual text unit=] when displaying or processing the text. The simplest composed "family" emoji sequence "👨‍👩‍👦" consists of 5 code points. The byte limit of 15 truncates in the middle of the child family member: "👨‍👩‍�‍". If the truncation is done on the [=grapheme cluster=] boundary, the entire family is removed.</p>
3427+
<p>Many common emoji can <em>only</em> be formed using sequences of code points, but should be treated as a single [=visual text unit=] when displaying or processing the text. The simplest composed family emoji sequence consists of 5 code points:</p>
3428+
3429+
<table class="cpExample" style="width:95%">
3430+
<tr>
3431+
<td style="text-align:center;width:15%"><img src="./images/emoji-image-2.png" class="emoji-image" alt="&#x1F468;&#x200d;&#x1f469;&#x200d;&#x1f466;"></td>
3432+
<td><code class="uname">U+1F468 U+200D U+1F469 U+200D U+1F466</code></td>
3433+
</tr>
3434+
</table>
3435+
3436+
<p>A limit of 15 bytes (UTF-8 [=code units=]) would truncate this sequence in the middle of the child family member: </p>
3437+
3438+
<p class="bigtext">👨‍👩‍�‍</p>
3439+
3440+
<p>If the truncation were done on the [=grapheme cluster=] boundary instead, the entire family would be removed.</p>
34763441
</aside>
34773442

34783443
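
The truncation behaviour described in the family example above can be checked with a short script. The following is a minimal Python sketch, not part of the commit: the function names are hypothetical, and the `clusters` helper uses a deliberately simplified rule (join on U+200D, combining marks, and skin tone modifiers) rather than the full grapheme cluster algorithm, so it does not handle flags, Hangul, or conjuncts formed with a virama.

```python
import unicodedata

ZWJ = "\u200d"  # U+200D ZERO WIDTH JOINER


def truncate_naive(text: str, max_bytes: int) -> bytes:
    """Cut the UTF-8 byte sequence at max_bytes, possibly mid-character."""
    return text.encode("utf-8")[:max_bytes]


def truncate_code_points(text: str, max_bytes: int) -> str:
    """Cut at max_bytes, then drop any incomplete trailing code point."""
    # The input was valid UTF-8, so only the tail can be malformed.
    return truncate_naive(text, max_bytes).decode("utf-8", errors="ignore")


def _extends_cluster(prev: str, ch: str) -> bool:
    # ASSUMPTION: simplified stand-in for real grapheme cluster rules.
    if ch == ZWJ or prev == ZWJ:
        return True
    if unicodedata.category(ch).startswith("M"):  # combining marks
        return True
    return 0x1F3FB <= ord(ch) <= 0x1F3FF  # emoji skin tone modifiers


def clusters(text: str) -> list[str]:
    """Split text into approximate grapheme clusters."""
    out: list[str] = []
    for ch in text:
        if out and _extends_cluster(out[-1][-1], ch):
            out[-1] += ch
        else:
            out.append(ch)
    return out


def truncate_clusters(text: str, max_bytes: int) -> str:
    """Keep only whole clusters whose UTF-8 encoding fits in max_bytes."""
    kept, used = [], 0
    for cluster in clusters(text):
        size = len(cluster.encode("utf-8"))
        if used + size > max_bytes:
            break
        kept.append(cluster)
        used += size
    return "".join(kept)


if __name__ == "__main__":
    family = "\U0001F468\u200d\U0001F469\u200d\U0001F466"
    print(len(family.encode("utf-8")))            # byte length of the sequence
    print(truncate_naive(family, 15))             # ends with a partial character
    print(repr(truncate_code_points(family, 15))) # child dropped, trailing ZWJ kept
    print(repr(truncate_clusters(family, 15)))    # whole family removed
```

The five code points of the family sequence occupy 18 bytes in UTF-8 (4 + 3 + 4 + 3 + 4), so a 15-byte limit lands one byte into the final code point. Truncating on a code point boundary avoids the partial character but still leaves a dangling joiner, while truncating on the cluster boundary removes the whole sequence, as the example text describes. Production code would more likely rely on a library that implements the full segmentation rules.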
