Monday, April 4, 2022

Emoji Are Not Born, They Are Made

Unicode now accepting proposals for Emoji 16.0

It’s hard to believe that just as Emoji 14.0 begins to appear on your device of choice this year, the Unicode Emoji Subcommittee [ESC] has already begun to plan for Emoji 16.0. That’s right, as of today — April 4, 2022 — applications to submit ideas for new emoji are open through July 31, 2022! 👁️📝👁️

So, how do you ensure your proposal is the best it can be? Well, here are some tips for consideration as you prepare it.

Check whether the emoji already exists!

✅ First: See if it’s already been approved.

🤔 Second, is it being reviewed?

🧑🏾‍🏫 Tip: Don’t skip any of the fields in the form! Incomplete proposals won’t be processed and will be returned. The ESC team members get a lot of submissions and complete proposals help them evaluate the submissions.

Be sure your proposal meets the criteria for consideration.

We recommend being faithful to the criteria for inclusion as much as possible and to consult the Emoji Subcommittee’s priorities, guidelines, strategies, reports, and audits. Many of the new provisional candidates for Emoji 15.0 are the result of these documents: pink heart, shaking face, rightwards pushing hand. The following are just some of the many considerations for writing a compelling proposal:
  • Multiple Uses
    Does the candidate emoji have significant metaphorical references or symbolism and not merely represent itself?
  • Use in sequences
    How is the emoji used with other emoji to communicate something new?
  • Breaking new ground
    Does the emoji represent something that is not already representable?
  • Distinctiveness
    Explain how and why this emoji represents a distinct, visually iconic entity that is relevant to a global audience.
  • Compatibility
    Is it needed for compatibility with frequently-used emoji in popular existing systems, such as WeChat, Twitter, etc.?
  • Frequency of Use
    Is there a high frequency of use? There should be empirical evidence of high usage in literature, movies, graphic novels, etc. worldwide.
Examples can be found on this page under “Selection Factors”.

Well, let’s get going! How do I propose an emoji?

📝 Submit a proposal

My proposal wasn’t selected :(

We recognize that it will come as a disappointment if your proposal is not one of the few selected for inclusion. 💕 There are loads of reasons why this may have happened.
  • It can already be represented by a sequence
    (Ex. Garbage fire 🗑️🔥, Can of worms 🥫🪱)
  • 🔍 It’s too specific
    We can’t add every type of flower, every breed of dog, every color of drink
  • 💰 Very few are selected
    Roughly thirty emoji characters are added each year
  • 🐣 It’s a transient concept
    Think less “memes” and more “stable long-standing concepts”. Can you cite how this concept has existed in a communicative manner such as literature, movies, graphic novels, etc.?
  • ♾️ It’s open-ended
    There is no compelling evidence to add it over others of a similar type
  • Many other factors for exclusion

Why can’t we make EVERYTHING an emoji?

Any emoji additions have to take into consideration usage frequency, trade-offs with other choices, font file size, and the burden on developers (and users!) to make it easier to send and receive emoji. That’s why the Emoji Subcommittee set out to reduce the number of emoji we encode in any given year.

Reconciling the rapid, transient nature of modern communication with the formal, methodical process required by a standards body like the Unicode Consortium is the name of the game these days. Until the sending and receiving of images is standardized in some manner so you can send any image in the world alongside your text messages not just code points ... well, Unicode is here for the world’s emoji character needs. 🫂💖


Over 144,000 characters are available for adoption to help the Unicode Consortium’s work on digitally disadvantaged languages

[badge]

Monday, March 28, 2022

The Past and Future of Flag Emoji

Emoji Flags are dead, long live Emoji Flags 🏁 🏁 🏁

By Jennifer Daniel, Unicode Emoji Subcommittee Chair

With Emoji 16.0 submissions open from April 4, 2022 through July 31, 2022, the Unicode Emoji Subcommittee members stand with open arms for your future hair pick, khanda, and pink heart emoji proposals (BTW, if you were planning to prepare proposals for those concepts, we have some good news for you: they are already Emoji 15.0 draft candidates!).

That being said, there is one particular type of emoji for which the Unicode Consortium will no longer accept proposals. Flag emoji of any category.

Flag emoji have always been subject to special criteria due to their open-ended nature, infrequent use, and burden on implementations. Today, nine of the ten original flag emoji are among the twenty most frequently shared flags. (The only outlier is Russia.) The addition of other flags and thousands of valid sequences into the Unicode Standard has not resulted in wider adoption. Flags don’t stand still: they are constantly evolving, and due to their open-ended nature, the addition of one creates exclusivity at the expense of others.

Why do flag emoji exist in the first place?

Well, the shorter, more technical answer is: The country flags use a generative mechanism, and were encoded early on for compatibility reasons.

The longer answer requires a flashback to the 1990s. KDDI and SoftBank, two Japanese mobile phone carriers, had early emoji sets which included 10 country flags: 🇨🇳 🇩🇪 🇪🇸 🇫🇷 🇬🇧 🇮🇹 🇯🇵 🇰🇷 🇷🇺 🇺🇸¹. A possibly apocryphal explanation is that they were used to denote what to grab for dinner: "American 🇺🇸 or Italian 🇮🇹?" (Such an innocent time in emoji history, pre-hamburger 🍔 emoji). Alas, as Unicode stepped in to create meaningful interoperability between these carrier-specific encodings, they were presented with a problem: why should these 10 countries have flag emoji when others do not?

The original emoji set included ten flags (shown above).
¹ Interestingly, Windows has never supported flag emoji 🔮. So, if you are reading this on a Windows device and flags aren't displaying, simply refer to the image above of the ten original flag emoji.

Various ideas were considered. The Unicode Consortium isn’t in the business of determining what is a country and what isn’t. That’s when the Consortium chose ISO 3166-1 alpha-2 as the source for valid country designations. ISO 3166 is a widely-accepted standard, and this particular mechanism represents each country with 2 letters, such as “US” (for the United States), “FR” (France), or “CN” (China).
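As a rough illustration of this generative mechanism, the sketch below maps a two-letter region code to a flag sequence by pairing up regional indicator symbols, which begin at U+1F1E6 for the letter A. The function names `country_flag` and `region_code` are made up for this example, and the code does not check whether the two letters are actually an assigned ISO 3166-1 code.

```python
REGIONAL_INDICATOR_A = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A


def country_flag(alpha2: str) -> str:
    """Map two ASCII letters such as "US" to a pair of regional indicators."""
    if len(alpha2) != 2 or not (alpha2.isascii() and alpha2.isalpha()):
        raise ValueError(f"expected two ASCII letters, got {alpha2!r}")
    return "".join(
        chr(REGIONAL_INDICATOR_A + ord(ch) - ord("A")) for ch in alpha2.upper()
    )


def region_code(flag: str) -> str:
    """Inverse mapping: recover the two letters from a flag sequence."""
    return "".join(chr(ord(ch) - REGIONAL_INDICATOR_A + ord("A")) for ch in flag)


# Whether these render as flags depends on the font and platform.
print(country_flag("US"), country_flag("FR"), country_flag("CN"))
```

Because the flag is just two letters in disguise, a platform without flag artwork can still fall back to showing the region letters.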

It wasn’t a perfect solution, but by allowing the 10 flag emoji (and the rest of the country flags) to be accurately interchanged between DoCoMo, KDDI, SoftBank, Google, Apple, and others, it worked just fine.

Why this flag emoji but not that one?

Today, the largest emoji category is flags (Out of only ~3600 emoji, there are over 200 flags!). But, did you know that there are over 5,000 geographically-recognized regions that are also “valid”? These are known as subdivision regions and are based on ISO 3166-2. (These include states in the US, regions in Italy, provinces in Argentina, and so on.)

First, what does “valid” mean to the Unicode Standard? Well, think of it this way. Today, anyone could make a font of 5,000 emoji flags using these sequences. They are valid sequences. They are legit sequences. They won’t break. Any platform, application, or font can implement them. The significant difference here is that valid doesn’t mean they are recommended for implementation.

Back to ISO. ISO groups countries in a more formal way than say FIFA or The Olympics. For example, the four regions of the UK are regularly used in sport but not recognized in ISO 3166-1. In 2016, the Unicode Consortium started looking into solutions to support their inclusion (with the technical feasibility of adding more if needed in the future). This was the impetus for adding a general mechanism to make all ISO 3166-2 codes be valid for flags. However, only three of the 5,000 ISO 3166-2 codes have widely adopted emoji— England, Scotland, and Wales. (Northern Ireland remains in limbo until an “official flag” is formalized).

Flags for England, Scotland, and Wales were included in Emoji 5.0
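For readers curious what such a subdivision sequence looks like, here is a minimal sketch. It assumes the emoji tag sequence form: a waving black flag (U+1F3F4), followed by tag characters that mirror the lowercase ASCII subdivision code at offset U+E0000, closed by CANCEL TAG (U+E007F). The helper name `subdivision_flag` is hypothetical, and the code only builds a sequence; it does not check the code against ISO 3166-2.

```python
BLACK_FLAG = "\U0001F3F4"  # WAVING BLACK FLAG, the base character
CANCEL_TAG = "\U000E007F"  # CANCEL TAG, terminates the tag sequence
TAG_OFFSET = 0xE0000       # tag characters mirror ASCII at this offset


def subdivision_flag(code: str) -> str:
    """Build a flag tag sequence from a code such as "gbeng" or "GB-ENG"."""
    cleaned = code.replace("-", "")
    if not (cleaned.isascii() and cleaned.isalnum()):
        raise ValueError(f"expected ASCII letters and digits, got {code!r}")
    tags = "".join(chr(TAG_OFFSET + ord(ch)) for ch in cleaned.lower())
    return BLACK_FLAG + tags + CANCEL_TAG


# England, Scotland, and Wales are the widely supported ones; most other
# codes will render as a plain black flag on today's platforms.
for code in ("gbeng", "gbsct", "gbwls"):
    print(code, subdivision_flag(code))
```

A sequence built this way for any other subdivision is well formed, which is the sense in which the article calls thousands of flags "valid" without their being recommended for implementation.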

So, with so many “valid sequences” why hasn’t anyone taken advantage of this sweet sweet rich flag opportunity?

At the time, in 2016, adding a few flags seemed reasonable but in retrospect was short-sighted. If the Emoji Subcommittee recommends the addition of a Catalonia flag emoji, then it looks like favoritism unless all the other subdivisions of Spain are added. And if those are added, what about the subdivisions of Japan or Namibia, or the Cantons of Liechtenstein? The inclusion of new flags will always continue to emphasize the exclusion of others. And there isn’t much room for the fluid nature of politics — countries change but Unicode additions are forever — once a character is added it can never be removed. (That being said, font designers can always update the designs as regimes change).

What happens when a country changes the design of its flag? 

Once Unicode designates a codepoint for a flag, Unicode is not part of the process that determines how the flag finally looks.
 
However, sometimes flags get redesigned and people have questions about what happens next. Whenever there is a change that requires an update to an existing flag emoji, the update can happen once a new design is officially recognized. This was the case in 2023, for example, when the Martinique assembly voted for and adopted the official flag.
 
The implementation of the new design takes some time and doesn’t happen immediately. Given the complexity of flag designs, artwork provided by an official representative is the safest way to ensure codepoints are accurately representing the country. Then, it can be deployed across various platforms.

How are flag emoji used?

Flags are very specific in what they mean, and they don’t represent concepts used multiple times a day or even multiple times a year. You could say flag emoji have transcended the messaging experience and are primarily found in more auto-biographical contexts. (Like your TikTok bio. Or, maybe you add a flag to your username on Twitter.) But, even then flags are not as commonly found in biographical spaces as you may expect. (The top five emoji found in Twitter bios? ❤️✨💙💜💛.)

Despite being the largest emoji category with a strong association tied to identity, flags are by far the least used. (There are exceptions: usage of the rainbow flag is above median!) That raises the question, “So, why not encode more identity flags?” Well, we have seen the same results for flags as we have seen for other emoji: a very long tail of rarely used options. They also tend to change over time! In the past six years since adding a Pride Flag to the Unicode Standard (2016) it’s already been redesigned. Many times. Identities are fluid and unstoppable, which makes mapping them to a formal, unchanging universal character set incompatible.


Why does usage matter in selecting emoji?

Any emoji additions have to take into consideration usage frequency, trade-offs with other choices, font file size, and the burden on developers (and users!) to make it easier to send and receive emoji. That’s why the Emoji Subcommittee set out to reduce the number of emoji we encode in any given year. Flags are also super hard to discern at emoji sizes — it’s quite easy to send a different flag than you intended (and with each additional flag the problem gets worse). The simple truth is that if more people used flags then there would be more of an argument to encode them. The Unicode Standard subset is just not a viable solution here for implementers nor users. Fortunately, there are seemingly infinite other ways to exchange images of flags that are more flexible and decentralized, such as stickers, gifs, and image attachments.

What is Unicode doing about it?

We realize closing this door may come as a disappointment — after all, flags often serve as a rallying cry to be seen, heard, recognized, and understood.

The Internet is a different place now than it was in the 90’s — the distribution of imagery online is unstoppable! Given how flags are commonly used this is a reasonable path forward: If you care to denote your affiliation with a region be it geographic, political, or identity (or all three) you can add a flag to your avatar image, share videos, or send a gif or sticker to razz your friend during a sports game (and of course there is always ⚽ ⚽ ⚽ ⚽ ⚽).


The more emoji can operate as building blocks, the more versatile, fluid, and useful they become! Rather than relying on Unicode to add new emoji for every concept under the Sun (this is simply not attainable) the citizens of the world have proven to be infinitely creative and fluid: often using existing emoji like the colored hearts (❤️️ 🧡 💛 💚 💙 💜 🤎 🖤 🤍) to express themselves. Hearts are among the most frequently used type of emoji and the nine colored hearts are often juxtaposed next to each other to denote markers of emotion (“I’m sorry 💙” or “love you ❤️”) and identity or affiliation that are not represented with atomic emoji in the Unicode Standard (ex. “Pan African pride ❤️️💚🖤”, “Hi I’m bi 💖💙💜”, and yes even sports teams “Go Mets! 💙🧡” ).

With this in mind, the Emoji Subcommittee has put forth a strategy to add a pink heart, a light blue heart, and a gray heart to the Unicode Standard. These are colors commonly found in gender flags (gender fluid pride flag), sexuality flags (bisexual pride flag), in sports team colors (Go Spurs!) and even some regional flags (Brussels). As of this year, these three heart emoji advanced as draft candidates, and you can expect them to land on your device of choice sometime next year.

In some ways we have returned to where we first started: Adding three new emoji to support a seemingly infinite number of concepts. This time if it fails, at least we’ll be left with lots of heart emoji that have multiple uses. ❤️🧡💛💚💙💜🤎🖤🤍



In light of this change, we’d like to clarify a few additional frequently asked questions with regard to emoji flags:

Wait, if a country gains independence and is recognised by ISO, does that mean no flag emoji for them?
Flags for countries with Unicode region codes are automatically recommended, with no proposals necessary! First their codes and translated names are added to Unicode’s Common Locale Data Repository [CLDR], and then the emoji become valid in the next version of Unicode. These emoji are also automatically recommended for general interchange and wide deployment.

What about flags that change designs for geopolitical reasons?
Unicode does not specify the appearance of flag emoji. It is the responsibility of font designers to update their fonts as politics change. For example, no Unicode changes were required when Mauritania changed its flag: https://emojipedia.org/flag-mauritania/

My region was assigned a 3166-2 code. Do we have to submit a proposal?
No, the Emoji Subcommittee is no longer taking in any proposals for flags of any kind.

As a recent example, Kurdistan (a subdivision of Iraq) became an official subdivision in ISO 3166-2 (IQ-KR) on May 3, 2021. The corresponding Unicode subdivision code (iqkr) is slated for release in CLDR v41 on Apr 6, 2022. At that point the flag for Kurdistan will officially be valid — any platform, app, or font could support it. But that doesn’t mean it automatically gets in the queue for everyone’s phone. Only countries with ISO 3166-1 region codes are automatically recommended and require no proposal to move forward.

So what warrants an ISO 3166-1 assignment vs ISO 3166-2?
ISO 3166-1 is for countries recognized by the United Nations and ISO 3166-2 is for parts of countries.

Why is Antarctica part of ISO 3166-1 but Africa isn’t? There seems to be no rational explanation with regard to why islands with no inhabitants have a flag while regions with millions of people have no emoji flag.
It’s true, there are "Exceptional reservations." Antarctica has an ISO 3166-1 alpha 2 code: AQ. But WHY does it have an ISO 3166-1 code? Because ISO 3166 decided to (ages ago) include it, probably since the whole continent is "shared."

For historical reasons, you may see other exceptions like 🇦🇨 AC Ascension Island, 🇨🇵 CP Clipperton Island, or 🇩🇬 DG Diego Garcia.

Why don’t we have asexual, bisexual, pansexual, and non-binary pride flags? And if 🏴󠁧󠁢󠁷󠁬󠁳󠁿 and 🏴󠁧󠁢󠁳󠁣󠁴󠁿 get Unicode flags, surely there’s room for the Aboriginal and Torres Strait Islander flags?
Before diving into the facts of why these flags are not part of the universal character set, we want to first take a moment to consider what people mean when they ask these questions and what Unicode means when they decline these flag proposals. Because this question is not one we take lightly. In the course of world history, groups have used flags as a rallying cry to be seen, heard, recognized, and understood. In the Unicode Consortium’s mission to digitize the world’s languages, improve communication online, and achieve meaningful interoperability between platforms, the requests for flags have become a lightning rod for these rallying cries.

When people ask for a new flag emoji, we recognize that the underlying request is about more than simply a new emoji. And when we say, “We aren’t adding more flags,” we are only saying changing the Unicode Standard is not an effective mechanism for this recognition.

What if I submit a proposal for a flag despite this policy?
Your proposal will not be processed.

Relevant docs/Further Reading
https://www.unicode.org/L2/L2021/21128-esc-recs.pdf
https://www.unicode.org/L2/L2021/21167.htm
https://www.unicode.org/L2/L2021/21172-esc-recs.pdf
https://www.unicode.org/emoji/proposals.html#Flags
http://www.unicode.org/L2/L2019/19084-trans-flag.pdf

___________________________________________

This article was updated on Dec 12, 2024 to clarify how, when, and why flag emoji design can change.

Thursday, March 24, 2022

Unicode CLDR v41 Beta available for testing

The Unicode CLDR v41 Beta is now available for testing. The beta has already been integrated into the development version of ICU.

The XML data, JSON data, charts, and specification are available for review. These may change if showstopper bugs are found. We would especially appreciate feedback from non-ICU consumers of CLDR data. Feedback can be filed at CLDR Tickets.

The release is scheduled for April 06, 2022.

CLDR v41 is a limited-submission release. Most work was on tooling, with only specified updates to the data, namely Phase 3 of the grammatical units of measurement project. The required grammar data for the Modern coverage level increased, with 40 locales adding an average of 4% new data each. Ukrainian grew the most, by 15.6%.

The tooling changes are targeted at the v42 general submission release. They include a number of features and improvements such as progress meter widgets in the Survey Tool.

Finally, the Basic level has been modified to make it easier to onboard new languages, and easier for implementations to filter locale data based on coverage levels.

The following table shows the number of Languages/Locales in this version. (See the v41 Locale Coverage table for more information.)

Level     Languages  Locales  Notes
Modern    89         361      Suitable for full UI internationalization.
Moderate  13         32       Suitable for full “document content” internationalization, such as formats in a spreadsheet.
Basic     22         21       Suitable for locale selection, such as choice of language in mobile phone settings.
Total     124        414      Total of all languages/locales with ≥ Basic coverage.

Beyond the member organizations of the Unicode Consortium, many dedicated communities and individuals regularly contribute to updating their locales, including:
  • Modern: Cherokee, Cantonese, Scottish Gaelic, Sorbian (Lower), Sorbian (Upper)
  • Moderate: Asturian [nearly Modern], Breton, Faroese, Fulah (Adlam), Kaingang, Nheengatu, Quechua, Sardinian
  • Basic: Bosnian (Cyrillic), Interlingua, Kabuverdianu, Māori, Romansh, Tajik, Tatar, Tongan, Uzbek (Cyrillic), Wolof
For details, see the Unicode CLDR v41 Release Note.
The next version of CLDR, version 42, is slated to start General Submission on May 18, 2022.

Unicode CLDR provides key building blocks for software supporting the world’s languages. CLDR data is used by all major software systems (including all mobile phones) for their software internationalization and localization, adapting software to the conventions of different languages.



Thursday, March 3, 2022

Update on the Internationalization & Unicode Conference



This is an update on the annual Internationalization & Unicode Conference. As some of you know, Object Management Group (OMG), our events and logistics partner for the annual Internationalization and Unicode Conference (IUC), is moving in a different strategic direction.

We decided to mutually end the partnership and are now in the process of transferring the various resources from OMG to the Unicode Consortium.

ACKNOWLEDGMENTS

We would like to take this opportunity to thank the OMG team, especially Mike Narducci and Carol David, for their support and dedication in making IUC such a mainstay for the global internationalization community.

Unicode would also like to thank the dedicated group of volunteers who worked with Rick McGowan on the program committee. Some of them have been on the committee from the early days even before we began working with OMG in 2006. This speaks to the strong commitment by the individuals as well as the organizations supporting their involvement over the years.

THE WAY FORWARD

While the ending of this partnership creates some challenges, it is also an opportunity to reshape how Unicode approaches community building and training. And given how the meeting and event landscape continues to evolve, it is a great time to explore best practices and apply lessons learned from other meetings and groups.

To that end, Unicode staff and a small group of volunteers convened late last year and will continue meeting in the coming 60-90 days to create the future IUC.

REQUEST

The Unicode Consortium is always looking to improve its conference. We recognize IUC as a key opportunity each year for knowledge-sharing, community building, and evangelization and want your help to shape the future IUC. Please give us your input and ideas by EOD on Friday, March 11th in one of these brief questionnaires.

(1) Survey for previous attendees
(2) Survey for those who have yet to attend

NEXT STEPS

Once we have additional community input and an update on our specific plan, we will share that information with the broader community via this blog and other channels, including on the meeting website at www.unicodeconference.org.

In the meantime, thanks for your time and ideas!



Wednesday, March 2, 2022

Avoiding Source Code Spoofing

Unicode has convened a group of experts in programming languages, tooling, and security to provide guidance and recommendations on how to better handle international text in source code, as well as providing code to help implementations.

Recent reports have highlighted problems in the review of source code containing non-ASCII Unicode characters (the so-called “Trojan Source exploit”). A person reviewing a submission of source code could be fooled into thinking that the code was okay, when it was actually malicious. The basic problem occurs when the actual text is different from what the reader perceives it to be, based on what is displayed. This can result either from the presence of characters used in right-to-left scripts (such as Arabic or Hebrew) that can change the visual ordering of text, or from the presence of characters that look like others (also known as “confusables”).
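To make the first half of the problem concrete, here is a minimal sketch of a check a review tool might run: it reports any explicit bidirectional formatting characters found in a piece of source text. The function name `find_bidi_controls` and the choice of which nine characters to flag are assumptions for illustration, not the expert group's recommendation, and the sketch does nothing about confusable characters.

```python
import unicodedata

# Explicit embedding, override, and isolate controls (an assumed watch list).
BIDI_CONTROLS = {
    "\u202A", "\u202B", "\u202C", "\u202D", "\u202E",
    "\u2066", "\u2067", "\u2068", "\u2069",
}


def find_bidi_controls(source: str):
    """Return (line, column, code point, name) for each control found."""
    findings = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                name = unicodedata.name(ch, "UNKNOWN")
                findings.append((line_no, col, f"U+{ord(ch):04X}", name))
    return findings


# A string literal whose displayed order may differ from its stored order.
sample = 'access = "user\u202E \u2066# admin\u2069 \u2066"\nprint(access)\n'
for finding in find_bidi_controls(sample):
    print(finding)
```

A real tool would probably allow these characters in some contexts, since they are legitimate in right-to-left text; flagging them for human review is a more cautious default than rejecting them outright.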

The problems here are not solely a security issue: text with different writing directions or confusable characters can be hard to work with. Finding a solution here is important from both security and usability points of view. Developers of source code editors or compilers should not be required to have a deep knowledge of Unicode to provide good user experience and robust security mitigations.

Unicode’s mission is to allow everyone to use their own languages on computers and mobile devices. The above issues are part and parcel of a character set that covers all the writing systems of the world – and have been documented in the Unicode Standard since its very first version in 1991. Unicode’s past efforts have focused on misleading URLs and identifiers, and correct visual ordering of plain text. And while much of this material is relevant to source code, this group of experts will now collect, curate, and supplement that early documentation with concrete recommendations to support source code editors and compilers.

While it may seem that it is easiest to simply go back to limiting source code to only ASCII characters, ASCII-only environments make it much harder to write and maintain software that can be used all over the world – a fundamental requirement for modern software. Moreover, this approach disadvantages software developers who use languages other than English.

More details on the source code spoofing issue, the proposed plan, and formation of this group are found in document L2/22-007R2.



Wednesday, February 23, 2022

Unicode CLDR v41 Alpha available for testing

The Unicode CLDR v41 Alpha is now available for testing. The alpha has already been integrated into the development version of ICU. We would especially appreciate feedback from non-ICU consumers of CLDR data. Feedback can be filed at CLDR Tickets.

Alpha means that the main data and charts are available for review, but the specification, JSON data, and other components are not yet ready for review. Some data may change if showstopper bugs are found. The planned schedule is:
  • Mar 09 — Beta (data)
  • Mar 23 — Beta2 (spec)
  • Apr 06 — Release
CLDR v41 is a limited-submission release. Most work was on tooling, with only specified updates to the data, namely Phase 3 of the grammatical units of measurement project. The required grammar data for the Modern coverage level increased, with 40 locales adding an average of 4% new data each. Ukrainian grew the most, by 15.6%.

The tooling changes are targeted at the v42 general submission release. They include a number of features and improvements such as progress meter widgets in the Survey Tool.

Finally, the Basic level has been modified to make it easier to onboard new languages, and easier for implementations to filter locale data based on coverage levels.

The following table shows the number of Languages/Locales in this version. (See the v41 Locale Coverage table for more information.)

Level     Languages  Locales  Notes
Modern    89         361      Suitable for full UI internationalization.
Moderate  13         32       Suitable for full “document content” internationalization, such as formats in a spreadsheet.
Basic     22         21       Suitable for locale selection, such as choice of language in mobile phone settings.
Total     124        414      Total of all languages/locales with ≥ Basic coverage.

Beyond the member organizations of the Unicode Consortium, many dedicated communities and individuals regularly contribute to updating their locales, including:
  • Modern: Cherokee, Cantonese, Scottish Gaelic, Sorbian (Lower), Sorbian (Upper)
  • Moderate: Asturian [nearly Modern], Breton, Faroese, Fulah (Adlam), Kaingang, Nheengatu, Quechua, Sardinian
  • Basic: Bosnian (Cyrillic), Interlingua, Kabuverdianu, Māori, Romansh, Tajik, Tatar, Tongan, Uzbek (Cyrillic), Wolof
Unicode CLDR provides key building blocks for software supporting the world’s languages. CLDR data is used by all major software systems (including all mobile phones) for their software internationalization and localization, adapting software to the conventions of different languages.



Friday, February 11, 2022

Unicode 15.0 Alpha Review

The repertoire for Unicode 15.0 is now open for early review and comment. During alpha review the repertoire is reasonably mature and stable, but is not yet completely locked down. Discussion regarding whether certain characters should be removed from the repertoire for publication is welcome. Character names and code point assignments are reasonably firm, but suggestions for improvement may still be entertained.

This early review is provided so that reviewers may consider the character repertoire issues prior to the start of beta review (currently scheduled to start in late May, 2022). Once beta review begins, the repertoire, code points, and character names will all be locked down, and no longer be subject to changes.

Feedback for the alpha review should be reported under PRI #442 using the Unicode contact form by April 5, 2022.



Wednesday, February 9, 2022

Enhancements to Unicode Regular Expressions

A new revision of UTS #18, Unicode Regular Expressions, is now available.

Regular expressions are a key tool in software development. Back in 2000, few regular expression engines supported Unicode, even at a basic level. UTS #18 set out to raise the bar, describing how regular expression engines could be adapted to deal with Unicode correctly and completely. Since that time, major programming languages and libraries have adopted level 1 features (supporting all Unicode literals, basic character properties, subtraction, intersection, ...), and some also adopted some level 2 features (full character properties, grapheme clusters, ...).

The main focus in this release is on handling the complement of properties of strings. The distinction is drawn between code point complement and full complement, followed by explicitly defining the complement operator [^...] to be code point complement, and providing the reasons for doing so in an annex. The important difference between [A--B] and [A&&[^B]] is outlined — setting out the reasons why the latter is insufficient to represent set difference.
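The difference can be seen in a toy model, sketched here under loud assumptions: a character class is modeled as a Python set of strings, the lowercase ASCII letters stand in for all code points, and `code_point_complement` is a made-up helper that mimics [^...] by returning only single code points. This illustrates the idea described above and is not an implementation of UTS #18.

```python
import string

# Tiny stand-in for the set of all code points (an assumption of this sketch).
UNIVERSE = set(string.ascii_lowercase)


def code_point_complement(cls):
    """Model [^...]: single code points not in the class, never strings."""
    return {cp for cp in UNIVERSE if cp not in cls}


A = {"a", "b", "ch", "ll"}  # a class containing two multi-code-point strings
B = {"b", "ll"}

difference = A - B                             # models [A--B]
via_complement = A & code_point_complement(B)  # models [A&&[^B]]

print(sorted(difference))       # ['a', 'ch']
print(sorted(via_complement))   # ['a']
```

In this model the string "ch" survives the set difference but is lost when going through the complement, because the code point complement of B cannot contain any multi-code-point string.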

For the EBNF in general, and for character classes with strings in particular, examples were added and the text clarified. A new annex provides examples for how character classes can be parsed.

