Saturday, October 30, 2010

Unicode 6.0 Sorting

Mountain View, CA, USA – October 29, 2010 – The new version of Unicode Technical Standard #10, Unicode Collation Algorithm (UCA), has been updated for Unicode Version 6.0, adding support for 2,088 characters in sorting, searching, and matching. Also in this release are new data files for support of the Unicode Common Locale Data Repository (CLDR), which provides customization for different languages.

Reorderable Categories. The data files for CLDR order characters strictly by certain major categories. This allows programmers to parametrically reorder these groups of characters to put them in the desired order for different languages. For example, numbers can be ordered after letters, or Cyrillic before Latin. The reorderable categories are:

whitespace, punctuation, general symbols, currency symbols, and numbers, then Latin, Greek, Coptic, Cyrillic, ..., Egyptian Hieroglyphs, and finally, CJK.
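As a rough illustration of parametric reordering, the sketch below sorts strings by a group rank that can be swapped out per language. It is a toy under stated assumptions: the group labels, `group_of`, and `make_key` are hypothetical names, the group assignment is guessed from Python's `unicodedata` properties, and no real UCA or CLDR weights are used.

```python
import unicodedata

# Abbreviated default group order (hypothetical labels, not CLDR identifiers).
DEFAULT_ORDER = ["space", "punct", "symbol", "currency", "digit",
                 "Latn", "Grek", "Cyrl"]

def group_of(ch):
    """Rough stand-in for the reorderable group of a character."""
    cat = unicodedata.category(ch)
    if ch.isspace():
        return "space"
    if cat.startswith("P"):
        return "punct"
    if cat == "Sc":
        return "currency"
    if cat.startswith("S"):
        return "symbol"
    if cat == "Nd":
        return "digit"
    name = unicodedata.name(ch, "")
    for prefix, group in (("LATIN", "Latn"), ("GREEK", "Grek"), ("CYRILLIC", "Cyrl")):
        if name.startswith(prefix):
            return group
    return "other"

def make_key(order):
    """Build a sort key that ranks characters by group first, then by letter."""
    rank = {group: i for i, group in enumerate(order)}
    def key(s):
        # Characters in unlisted groups sort after all listed groups.
        return [(rank.get(group_of(ch), len(order)), ch.casefold()) for ch in s]
    return key

words = ["zebra", "42", "яблоко", "apple"]
default_sorted = sorted(words, key=make_key(DEFAULT_ORDER))

# Same data, with Cyrillic before Latin and numbers after letters.
CUSTOM_ORDER = ["space", "punct", "symbol", "currency",
                "Cyrl", "Latn", "Grek", "digit"]
custom_sorted = sorted(words, key=make_key(CUSTOM_ORDER))
```

Only the order list changes between the two calls; the per-character data stays the same, which is the point of keeping the groups strictly separated.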

Distinguishing Symbols from Punctuation. UCA provides an option for ignoring certain characters when comparing strings. By default, these are whitespace, punctuation, and general symbols. The data files for CLDR modify that default so that symbols are compared significantly, while still ignoring whitespace and punctuation. Thus, for example, "I♥NY" is not sorted the same as "I☠NY".
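The difference between the two defaults can be sketched as follows. This is a simplification, assuming that "ignoring" means dropping the character before comparison and that General_Category is a good enough proxy for the whitespace, punctuation, and symbol classes; `is_ignored` and `compare_key` are hypothetical helpers, not part of any collation library.

```python
import unicodedata

def is_ignored(ch, ignore_symbols):
    """Decide whether a character is skipped when comparing strings."""
    cat = unicodedata.category(ch)
    if ch.isspace() or cat.startswith("P"):
        return True  # whitespace and punctuation are ignored in both setups
    # General symbols (approximated here as S* except currency) are only
    # ignored when the caller asks for the plain UCA default.
    return ignore_symbols and cat.startswith("S") and cat != "Sc"

def compare_key(s, ignore_symbols):
    """Toy comparison key: the string with ignored characters removed."""
    return "".join(ch for ch in s if not is_ignored(ch, ignore_symbols))

a, b = "I♥NY", "I☠NY"
same_by_default = compare_key(a, True) == compare_key(b, True)
same_in_cldr = compare_key(a, False) == compare_key(b, False)
hyphen_vs_space = compare_key("I-NY", False) == compare_key("I NY", False)
```

With symbols ignored, both strings reduce to "INY" and compare equal; with the CLDR-style setting they differ, while punctuation and whitespace variants still compare equal.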

Special Database Values. The data files for CLDR provide special weights for two noncharacters:

1. A special noncharacter <HIGH> (U+FFFF) for specification of a range in a database, allowing "Sch" ≤ X ≤ "Sch<HIGH>" to pick all strings starting with "sch" plus those that sort equivalently.

2. A special noncharacter <LOW> (U+FFFE) for merged database fields, allowing "Disílva<LOW>John" to sort next to "Disilva<LOW>John".
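Both special values can be illustrated with a toy two-level key. The weights below are invented for the example (code points of case-folded base letters, with `LOW` forced below and `HIGH` forced above everything else); `primary` and `sort_key` are hypothetical helpers, and real implementations would use the CLDR collation data instead.

```python
import unicodedata

HIGH = "\uffff"  # noncharacter treated as higher than any other character
LOW = "\ufffe"   # noncharacter treated as lower than any other character

def primary(s):
    """Toy primary key: accents and case removed, LOW lowest, HIGH highest."""
    out = []
    for ch in unicodedata.normalize("NFD", s):
        if ch == LOW:
            out.append(0)
        elif ch == HIGH:
            out.append(0x110000)  # above the largest code point
        elif not unicodedata.combining(ch):
            out.extend(ord(c) for c in ch.casefold())
    return out

def sort_key(s):
    """Primary key first; accent and case differences only break ties."""
    return (primary(s), unicodedata.normalize("NFD", s))

# 1. Range query: everything between "Sch" and "Sch<HIGH>".
names = ["Schmidt", "schön", "Schubert", "Scott", "Saul"]
lo, hi = primary("Sch"), primary("Sch" + HIGH)
matches = [n for n in names if lo <= primary(n) <= hi]

# 2. Merged fields: surname and given name joined with <LOW>.
people = [("Disílva", "John"), ("Disilva", "Paul"), ("Disilva", "John")]
merged = sorted(people, key=lambda p: sort_key(p[0] + LOW + p[1]))
```

In the first case the range picks up "Schmidt", "schön", and "Schubert" but not "Scott" or "Saul". In the second, the two "John" records end up adjacent because the accent on the surname is only consulted after the whole merged field has been compared at the primary level.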

The version of CLDR using these new data files is planned for release at the start of December, 2010.

The text of the UCA standard has been clarified in different areas. Implementers should pay special attention to the changes regarding ill-formed sequences, noncharacters, and unassigned code points in CJK blocks.

For more information, see:

* The UCA Standard 6.0.0: http://www.unicode.org/reports/tr10/
* The UCA charts: http://unicode.org/charts/collation/
* The UCA data: http://unicode.org/Public/UCA/6.0.0/
* Merged database fields: http://unicode.org/reports/tr10/#Interleaved_Levels

About The Unicode Consortium

The Unicode Consortium is a non-profit organization founded to develop, extend and promote use of the Unicode Standard and related globalization standards. The membership of the consortium represents a broad spectrum of corporations and organizations in the computer and information processing industry.

Members are: Adobe, Apple, Google, Government of Bangladesh, Government of India, IBM, Microsoft, Monotype Imaging, Oracle, The Society for Natural Language Technology Research, SAP, The University of California (Berkeley), The University of California (Santa Cruz), Yahoo!, plus well over a hundred Associate, Liaison, and Individual members.

For more information, please contact the Unicode Consortium. http://www.unicode.org/contacts.html

Unicode 6.0 Internationalized Domain Names

Mountain View, CA, USA – October 29, 2010 – The new version of Unicode Technical Standard #46, Unicode IDNA Compatibility Processing, has been updated for Unicode Version 6.0, adding support for 2,088 characters in internationalized domain names (IDN).

The specification provides two main features for use with the new specification for internationalized domain names released in August 2010 (IDNA2008):

1. A comprehensive mapping to reflect user expectations for casing and other variants of domain names. This mapping is allowed by IDNA2008, and follows the same principles as in the previous version of that specification (IDNA2003, in force from 2003 until August 2010). It thus provides users consistency between old and new versions.

2. A compatibility mechanism that supports internationalized domain names valid under the IDNA2003 specification and the IDNA2008 specification. This second feature allows browsers, search engines, and other clients to handle both old and new domain names during the transitional period until registries update their rules to follow IDNA2008.
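The two features can be sketched together in a few lines. This is only an approximation: it replaces the normative UTS #46 mapping table with lowercasing plus NFC, and it hard-codes the four characters that UTS #46 lists as deviations between IDNA2003 and IDNA2008 (a detail taken from the specification, not from this announcement). The function name `map_label` and its `transitional` flag are hypothetical.

```python
import unicodedata

# Characters handled differently by IDNA2003 and IDNA2008:
# sharp s, final sigma, zero width non-joiner, zero width joiner.
DEVIATIONS = {"\u00df": "ss", "\u03c2": "\u03c3", "\u200c": "", "\u200d": ""}

def map_label(label, transitional):
    """Toy mapping of one domain label.

    Lowercasing and NFC stand in for the full mapping table. When
    transitional is true, the deviation characters are mapped the way
    IDNA2003 mapped them, so that older registrations still resolve.
    """
    mapped = unicodedata.normalize("NFC", label.lower())
    if transitional:
        mapped = "".join(DEVIATIONS.get(ch, ch) for ch in mapped)
    return mapped

old_style = map_label("Fa\u00df", transitional=True)   # "fass"
new_style = map_label("Fa\u00df", transitional=False)  # sharp s is kept
```

A client that wants to reach domains registered under either specification could try the transitional form during the changeover period and switch the flag off once registries have moved to IDNA2008.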

UTS #46 supplies normative data tables that are synchronized with the latest version of Unicode, allowing implementations to update without recalculation.

This new release of UTS #46 also provides a custom option to recognize legacy international domain names containing special ASCII characters such as "_".
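A minimal sketch of what such an option might look like in a label check, assuming the strict rule is the familiar "letters, digits, and hyphen, with no leading or trailing hyphen" restriction. The function `ascii_label_ok` and its `use_std3_ascii_rules` parameter are hypothetical names chosen for the example, and the relaxed branch is deliberately permissive.

```python
import string

# Letters, digits, and hyphen: the characters allowed under the strict rule.
LDH = set(string.ascii_letters + string.digits + "-")

def ascii_label_ok(label, use_std3_ascii_rules=True):
    """Check a single all-ASCII label.

    With the strict rules on, only letters, digits, and inner hyphens pass.
    With the rules off, other ASCII characters such as "_" are tolerated;
    only the label separator "." is still rejected.
    """
    if not label or not label.isascii():
        return False
    if use_std3_ascii_rules:
        return (set(label) <= LDH
                and not label.startswith("-")
                and not label.endswith("-"))
    return "." not in label

strict = ascii_label_ok("my_host")
relaxed = ascii_label_ok("my_host", use_std3_ascii_rules=False)
```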

Tuesday, October 12, 2010

Unicode Version 6.0: Support for Popular Symbols in Asia

The newly finalized Unicode Version 6.0 adds 2,088 characters, with over 1,000 new symbols.

A long-awaited feature of Unicode 6.0 is the encoding of hundreds of symbols for mobile phones. These emoji characters are in widespread use, especially in Japan, and have become an essential part of text messages there and elsewhere. Unicode 6.0 now provides for data interchange between different mobile vendors and across the internet. The new symbols cover many domains: maps and transport, phases of the moon, UI symbols (such as fast-forward), and many others.

A late-breaking addition is the newly created official symbol for the Indian rupee. With the help of the Indian government and our colleagues in ISO, the consortium was able to accelerate the encoding process. Once computers and mobile phones update to the new version of Unicode, people will be able to use the rupee sign like they use $ or € now.

This October 2010 release includes the Unicode Character Database (UCD), Unicode Standard Annexes (UAXes), and code charts. With the release of these components, implementers are able to update their software to Unicode 6.0 without delay. The final text of the core specification will be available in early 2011.

To access Unicode 6.0, see http://www.unicode.org/versions/Unicode6.0.0.

For more information on emoji, see http://unicode.org/faq/emoji_dingbats.html

For a formatted version of this message with images, see http://unicode.org/press/pr-6.0.html.

Tuesday, August 31, 2010

Public Review Issue #172: Proposed Update Unicode IDNA Compatibility Processing

The Unicode Technical Committee has posted a new issue for public review and comment. Details are on the following web page:

http://www.unicode.org/review/

Review period for the new item closes on September 9, 2010.

Please see the page for links to discussion and relevant documents. Briefly, the new issue is:

#172 Proposed Update UTS #46, Unicode IDNA Compatibility Processing

http://www.unicode.org/reports/tr46/proposed.html

There is a proposed update with the following features: alignment with Unicode 6.0, the addition of conformance test files, and support of the IDNA2003 option UseSTD3ASCIIRules=false.

Feedback is requested on both the draft text http://www.unicode.org/reports/tr46/proposed.html and the draft data files http://unicode.org/Public/idna/6.0.0/

If you have comments for official UTC consideration, please post them by submitting your comments through our feedback & reporting page:

http://www.unicode.org/reporting.html

If you wish to discuss issues on the Unicode mail list, then please use the following link to subscribe (if necessary). Please be aware that discussion comments on the Unicode mail list are not automatically recorded as input to the UTC. You must use the reporting link above to generate comments for UTC consideration.

http://www.unicode.org/consortium/distlist.html

Tuesday, August 24, 2010

Public Review Issue #176: Properties of Two Khmer Characters

The Unicode Technical Committee has posted a new issue for public review and comment. Details are on the following web page: http://www.unicode.org/review/.

Review period for the new issue closes on October 25, 2010.

Please see the page for links to discussion and relevant documents. Briefly, the new issue is:

PRI #176: Properties of Two Khmer Characters

The UTC is considering potential changes to the General_Category property values and default collation weighting of two Khmer characters, U+17B4 KHMER VOWEL INHERENT AQ and U+17B5 KHMER VOWEL INHERENT AA. The UTC is seeking feedback on this topic. In particular, the UTC would be interested in learning of any current implementations which might be adversely affected by any of the proposed modifications to the General_Category and/or default collation weighting of these two characters. Please see the background document http://www.unicode.org/review/pr-176.html for details on the proposal.
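Implementers who want to see what their own runtime currently reports for these two characters can query the character database it ships with. The snippet below uses Python's standard `unicodedata` module; the `props` dictionary is just a hypothetical container for the example, and the General_Category printed depends on the Unicode version bundled with the interpreter, so no particular value is assumed here.

```python
import unicodedata

# Collect name and General_Category for the two characters under review.
props = {}
for cp in (0x17B4, 0x17B5):
    ch = chr(cp)
    props[cp] = (unicodedata.name(ch), unicodedata.category(ch))

for cp, (name, category) in props.items():
    print(f"U+{cp:04X} {name}: General_Category={category}")

# The answer reflects this version of the Unicode Character Database.
print("Unicode data version:", unicodedata.unidata_version)
```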

If you have comments for official UTC consideration, please post them by submitting your comments through our feedback & reporting page: http://www.unicode.org/reporting.html.

If you wish to discuss issues on the Unicode mail list, then please use the following link to subscribe (if necessary). Please be aware that discussion comments on the Unicode mail list are not automatically recorded as input to the UTC. You must use the reporting link above to generate comments for UTC consideration. http://www.unicode.org/consortium/distlist.html.


----
All of the Unicode Consortium lists are strictly opt-in lists for members or interested users of our standards. We make every effort to remove users who do not wish to receive e-mail from us. To see why you are getting this mail and how to remove yourself from our lists if you want, please see http://www.unicode.org/consortium/distlist.html#announcements.

Public Review Issue #175: CLDR 1.9 Collation Changes

The Unicode CLDR committee is making Unicode locale-sensitive collation a major focus for the next release, CLDR 1.9. There are specific changes for a large number of languages, plus a change in the default ordering of punctuation vs symbols for all languages.

Please see the background document for more information: http://www.unicode.org/review/pr-175.html

If you have any feedback on any of the actions, please file a ticket with CLDR as described in the background document.

Review period for this issue closes on October 1, 2010.

If you wish to discuss issues on the CLDR Users mail list, then please use the following link to subscribe (if necessary). Please be aware that discussion comments on the mail list are not automatically recorded as input to the committee. You must use the submission mechanism described in the background document to generate comments for consideration. http://www.unicode.org/consortium/distlist.html


Friday, August 6, 2010

Unicode Security and Domain Names

The Unicode Consortium has released three important specifications related to Internationalized Domain Names (IDNs) and Security.

UTS #46: Unicode IDNA Compatibility Processing
http://www.unicode.org/reports/tr46/

UTR #36: Unicode Security Considerations
http://www.unicode.org/reports/tr36/

UTR #39: Unicode Security Mechanisms
http://www.unicode.org/reports/tr39/


UTS #46: Unicode IDNA Compatibility Processing

Client software, such as browsers and emailers, faces a difficult transition from the version of international domain names approved in 2003 (IDNA2003), to the revision approved in 2010 (IDNA2008). The specification in this document provides a mechanism that minimizes the impact of this transition for client software, allowing client software to access domains that are valid under either system. The specification provides two main features: one is a comprehensive mapping to support current user expectations for casing and other variants of domain names. Such a mapping is allowed by IDNA2008. The second is a compatibility mechanism that supports the existing domain names that were allowed under IDNA2003. This second feature is intended to improve client behavior during the transitional period.


UTR #36: Unicode Security Considerations

Because Unicode contains such a large number of characters and incorporates the varied writing systems of the world, incorrect usage can expose programs or systems to possible security attacks. This is especially important as more and more products are internationalized.

This document describes some of the security considerations that programmers, system analysts, standards developers, and users should take into account, and provides specific recommendations to reduce the risk of problems.


UTR #39: Unicode Security Mechanisms

Because Unicode contains such a large number of characters and incorporates the varied writing systems of the world, incorrect usage can expose programs or systems to possible security attacks. This document specifies mechanisms that can be used to detect possible security problems.

Monday, August 2, 2010

New PRI: Proposed Draft UTR #49 Unicode Character Categories

The Unicode Technical Committee has posted a new issue for public review and comment. Details are on the following web page: http://www.unicode.org/review/

Review period for the new issue closes on October 25, 2010. Please see the page for links to discussion and relevant documents.

Briefly, the new issue is:
PRI #174 Proposed Draft UTR #49, "Unicode Character Categories"
http://www.unicode.org/reports/tr49/tr49-1.html
This proposed draft UTR presents an approach to the categorization of Unicode characters, and documents a data file that implementers can use for defining Unicode character categories.

If you have comments for official UTC consideration, please post them by submitting your comments through our feedback & reporting page: http://www.unicode.org/reporting.html

If you wish to discuss issues on the Unicode mail list, then please use the following link to subscribe (if necessary). Please be aware that discussion comments on the Unicode mail list are not automatically recorded as input to the UTC. You must use the reporting link above to generate comments for UTC consideration. http://www.unicode.org/consortium/distlist.html