<p>Several of these [=grapheme clusters=] are made up of more than one Unicode [=code point=] because of the way that the Devanagari script works. Devanagari is an example of a script that uses combining characters. In fact, use of such characters is required to write text in this script. In this case, the four [=grapheme clusters=] are composed from seven [=abstract characters=], each of which is assigned a [=Unicode Scalar Value=] to serve as its [=code point=]. This sequence of [=code points=] can be encoded into a byte sequence using the UTF-8 [=character encoding=]:</p>
1559
-
1560
-
<table class="charTermExample">
1561
-
<tr>
1562
-
<th style="width:25%">Character</th>
1563
-
<td class="bigtext">य</td>
1564
-
<td class="bigtext">ू</td>
1565
-
<td class="bigtext">न</td>
1566
-
<td class="bigtext">ि</td>
1567
-
<td class="bigtext">क</td>
1568
-
<td class="bigtext">ो</td>
1569
-
<td class="bigtext">ड</td>
1570
-
</tr>
1571
-
<tr>
1572
-
<th>Code Point</th>
1573
-
<td><code>U+092F</code></td>
1574
-
<td><code>U+0942</code></td>
1575
-
<td><code>U+0928</code></td>
1576
-
<td><code>U+093F</code></td>
1577
-
<td><code>U+0915</code></td>
1578
-
<td><code>U+094B</code></td>
1579
-
<td><code>U+0921</code></td>
1580
-
</tr>
1581
-
<tr>
1582
-
<th>UTF-8 Code Units</th>
1583
-
<td><code>E0 A4 AF</code></td>
1584
-
<td><code>E0 A5 82</code></td>
1585
-
<td><code>E0 A4 A8</code></td>
1586
-
<td><code>E0 A4 BF</code></td>
1587
-
<td><code>E0 A4 95</code></td>
1588
-
<td><code>E0 A5 8B</code></td>
1589
-
<td><code>E0 A4 A1</code></td>
1590
-
</tr>
1591
-
1592
-
<p>Я❤️🇨🇭🐄!</p>
1593
-
1594
-
1595
-
1596
-
</table>
1597
-
1598
-
<p></p>
1599
-
1600
-
-->
1601
-
1602
1549
</aside>
1603
1550
1604
1551
<div class="req" id="char_sounds">
@@ -3442,11 +3389,16 @@ <h3>Truncating or limiting the length of strings</h3>
3442
3389
3443
3390
<p>Keep in mind that, while the examples chosen here are roughly the same length, other languages might require more characters to convey the same concepts. For example, the Scottish Gaelic translation would be <q lang="gd">Is urrainn dhomh glainne ithe, chan eil e gam ghoirteachadh</q>, which is significantly longer than the English. Many languages have different grammatical structure as well, so that key information (such as the verb) might appear at the end of the sentence (as is common in Hindi or Japanese).</p>
3444
3391
3445
-
<p>Finally, don't forget that the limit will also interact with the truncation boundary chosen (as shown in [[[#example-code-unit-trunc-bad]]]): if the truncation is done naively at the 15th byte, the resulting string might contain only a partial character. For example, the Marathi could experience this problem: <span class="codepoint"><bdi lang="ma">मी का�...</bdi></span>.</p>
3392
+
<p>Finally, don't forget that the limit will also interact with the truncation boundary chosen (as shown in [[[#example-code-unit-trunc-bad]]]): if the truncation is done naively at the 15th byte, the resulting string might contain only a partial character. For example, the Marathi could experience this problem:</p>
3393
+
3394
+
<p class="bigtext" lang="ma">मी का�...</p>
3395
+
3396
+
</aside>
3397
+
<aside class="example" id="family-example" title="Emoji sequences as an example of grapheme clusters">
3446
3398
3447
3399
<p>Another example of the complex relationship between [=visual text units=] and [=code points=] are certain emoji. The emoji character for "family" has a code point in Unicode: <span class="codepoint" translate="no"><bdi lang="en">👪</bdi><code class="uname">U+1F46A FAMILY</code></span>. It can also be formed by using a sequence of [=code points=]: <code class="uname">U+1F468 U+200D U+1F469 U+200D U+1F466</code>.</p>
3448
3400
3449
-
<p>The character <span class="codepoint" translate="no"><img alt="ZWJ" src="./images/200D.png"><span class="uname" translate="no">U+200D ZERO WIDTH JOINER</span></span> is used to "join" separate emoji characters together (it also has a role in joining characters in various writing systems of the world). This compositional mechanism can be used to create other family variations. For example, the sequence <span class="codepoint" translate="no"><bdi translate="no">👨‍👩‍👧‍👦</bdi><code class="uname" translate="no">U+1F468 U+200D U+1F469 U+200D U+1F467 U+200D U+1F466</code></span> results in a composed emoji character for a "family: man, woman, girl, boy" on systems that support this kind of composition. That long character sequence still represents just a single [=visual text unit=]. Other characters, such as skin tone modifiers, can further extend the [=grapheme cluster=]:</p>
3401
+
<p>The character <span class="codepoint" translate="no"><img alt="ZWJ" src="./images/200D.png"><span class="uname" translate="no">U+200D ZERO WIDTH JOINER</span></span> is used to "join" separate emoji characters together (it also has a role in joining characters in various writing systems of the world). This compositional mechanism can be used to create other family variations. For example, the sequence <span class="codepoint" translate="no"><bdi translate="no">👨‍👩‍👧‍👦</bdi><code class="uname" translate="no">U+1F468 U+200D U+1F469 U+200D U+1F467 U+200D U+1F466</code></span> results in a composed emoji character for a "family: man, woman, girl, boy" on systems that support this kind of composition. That long character sequence still represents just a single [=visual text unit=]. Other characters, such as skin tone modifiers, can further extend the [=grapheme cluster=]. Here are just a few of the emoji sequences possible for representing a family:</p>
3450
3402
3451
3403
3452
3404
<table class="cpExample" style="width:95%">
@@ -3472,7 +3424,20 @@ <h3>Truncating or limiting the length of strings</h3>
3472
3424
</tr>
3473
3425
</table>
3474
3426
3475
-
<p>Many common emoji can <em>only</em> be formed using sequences of code points, but should be treated as a single [=visual text unit=] when displaying or processing the text. The simplest composed "family" emoji sequence "👨👩👦" consists of 5 code points. The byte limit of 15 truncates in the middle of the child family member: "👨👩�". If the truncation is done on the [=grapheme cluster=] boundary, the entire family is removed.</p>
3427
+
<p>Many common emoji can <em>only</em> be formed using sequences of code points, but should be treated as a single [=visual text unit=] when displaying or processing the text. The simplest composed family emoji sequence consists of 5 code points:</p>
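Reviewer's sketch, not part of the patch: the 15-byte behaviour described for the family sequence can be reproduced with a few lines of Python. The function names `truncate_naive` and `truncate_on_code_point` are hypothetical, chosen only for illustration. The sequence used is the one given earlier in this example (`U+1F468 U+200D U+1F469 U+200D U+1F466`), which occupies 18 bytes in UTF-8.

```python
# Simplest composed family sequence: MAN, ZWJ, WOMAN, ZWJ, BOY (5 code points).
FAMILY = "\U0001F468\u200D\U0001F469\u200D\U0001F466"


def truncate_naive(text: str, limit: int) -> str:
    """Cut the UTF-8 bytes at `limit`; a split character decodes as U+FFFD."""
    return text.encode("utf-8")[:limit].decode("utf-8", errors="replace")


def truncate_on_code_point(text: str, limit: int) -> str:
    """Cut at `limit` bytes, then drop any incomplete trailing character."""
    return text.encode("utf-8")[:limit].decode("utf-8", errors="ignore")


if __name__ == "__main__":
    print(len(FAMILY), len(FAMILY.encode("utf-8")))
    # Naive cut: the child is split and shows up as a replacement character.
    print([f"U+{ord(c):04X}" for c in truncate_naive(FAMILY, 15)])
    # Code point cut: no broken character, but a dangling ZWJ is left behind.
    print([f"U+{ord(c):04X}" for c in truncate_on_code_point(FAMILY, 15)])
```

Cutting on a code point boundary avoids the replacement character but still leaves a partial family ending in a joiner. Truncating on a [=grapheme cluster=] boundary needs a segmentation implementation, which Python's standard library does not provide, so it is left out of this sketch.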