Determine displayed width of char and str types according to
Unicode Standard Annex #11
and other portions of the Unicode standard.
See the Rules for determining width section
for the exact rules.
This crate is #![no_std].
use unicode_width::UnicodeWidthStr;
let teststr = "Hello, world!";
let width = UnicodeWidthStr::width(teststr);
println!("{}", teststr);
println!("The above string is {} columns wide.", width);
§"cjk" feature flag
This crate has one Cargo feature flag, "cjk"
(enabled by default).
It enables the UnicodeWidthChar::width_cjk
and UnicodeWidthStr::width_cjk methods,
which perform an alternate width calculation
more suited to CJK contexts. The flag also unseals the
UnicodeWidthChar and UnicodeWidthStr traits.
Disabling the flag (with default-features = false in Cargo.toml)
will reduce the amount of static data needed by the crate.
use unicode_width::UnicodeWidthStr;
let teststr = "“𘀀”";
assert_eq!(teststr.width(), 4);
#[cfg(feature = "cjk")]
assert_eq!(teststr.width_cjk(), 6);
§Rules for determining width
This crate currently uses the following rules to determine the width of a character or string, in order of decreasing precedence. These may be tweaked in the future.
- In the following cases, the width of a string differs from the sum of the widths of its constituent characters:
  - The sequence "\r\n" has width 1.
  - Emoji-specific ligatures:
    - Well-formed, fully-qualified emoji ZWJ sequences have width 2.
    - Emoji modifier sequences have width 2.
    - Emoji presentation sequences have width 2.
    - Outside of an East Asian context, text presentation sequences have width 1 if their base character:
      - Has the Emoji_Presentation property, and
      - Is not in the Enclosed Ideographic Supplement block.
  - Script-specific ligatures:
    - For all the following ligatures, the insertion of any number of default-ignorable combining marks anywhere in the sequence will not change the total width. In addition, for all non-Arabic ligatures, the insertion of any number of '\u{200D}' ZERO WIDTH JOINERs will not affect the width.
    - Arabic: A character sequence consisting of one character with Joining_Group=Lam, followed by any number of characters with Joining_Type=Transparent, followed by one character with Joining_Group=Alef, has total width 1. For example: لا, لآ, ڸا, لٟٞأ
    - Buginese: "\u{1A15}\u{1A17}\u{200D}\u{1A10}" (<a, -i> ya, ᨕᨗᨐ) has total width 1.
    - Hebrew: "א\u{200D}ל" (Alef-Lamed, אל) has total width 1.
    - Khmer: Coeng signs consisting of '\u{17D2}' followed by a character in '\u{1780}'..='\u{1782}' | '\u{1784}'..='\u{1787}' | '\u{1789}'..='\u{178C}' | '\u{178E}'..='\u{1793}' | '\u{1795}'..='\u{1798}' | '\u{179B}'..='\u{179D}' | '\u{17A0}' | '\u{17A2}' | '\u{17A7}' | '\u{17AB}'..='\u{17AC}' | '\u{17AF}' have width 0.
    - Lisu: Tone letter combinations consisting of a character in the range '\u{A4F8}'..='\u{A4FB}' followed by a character in the range '\u{A4FC}'..='\u{A4FD}' have width 1. For example: ꓹꓼ
    - Old Turkic: "\u{10C32}\u{200D}\u{10C03}" (𐰲𐰃) has total width 1.
    - Tifinagh: A sequence of a Tifinagh consonant in the range '\u{2D31}'..='\u{2D65}' | '\u{2D6F}', followed by either '\u{2D7F}' TIFINAGH CONSONANT JOINER or '\u{200D}', followed by another Tifinagh consonant, has total width 1. For example: ⵏ⵿ⴾ
  - In an East Asian context only, <, =, or > have width 2 when followed by '\u{0338}' COMBINING LONG SOLIDUS OVERLAY. The two characters may be separated by any number of characters whose canonical decompositions consist only of characters meeting one of the following requirements:
    - Has Canonical_Combining_Class greater than 1, or
    - Is a default-ignorable combining mark.
- In all other cases, the width of the string equals the sum of its character widths:
  - '\u{2D7F}' TIFINAGH CONSONANT JOINER has width 1 (outside of the ligatures described previously).
  - '\u{115F}' HANGUL CHOSEONG FILLER and '\u{17A4}' KHMER INDEPENDENT VOWEL QAA have width 2.
  - '\u{17D8}' KHMER SIGN BEYYAL has width 3.
  - The following have width 0:
    - Characters with the Default_Ignorable_Code_Point property.
    - Characters with the Grapheme_Extend property.
    - The following 8 characters, all of which have NFD decompositions consisting of two Grapheme_Extend characters: '\u{0CC0}' KANNADA VOWEL SIGN II, '\u{0CC7}' KANNADA VOWEL SIGN EE, '\u{0CC8}' KANNADA VOWEL SIGN AI, '\u{0CCA}' KANNADA VOWEL SIGN O, '\u{0CCB}' KANNADA VOWEL SIGN OO, '\u{1B3B}' BALINESE VOWEL SIGN RA REPA TEDUNG, '\u{1B3D}' BALINESE VOWEL SIGN LA LENGA TEDUNG, and '\u{1B43}' BALINESE VOWEL SIGN PEPET TEDUNG.
    - Characters with a Hangul_Syllable_Type of Vowel_Jamo (V) or Trailing_Jamo (T).
    - The following Prepended_Concatenation_Marks:
    - Characters with the Grapheme_Cluster_Break=Prepend property, that are not also Prepended_Concatenation_Marks.
    - '\u{A8FA}' DEVANAGARI CARET.
  - Characters with an East_Asian_Width of Fullwidth or Wide have width 2.
  - Characters fulfilling all of the following conditions have width 2 in an East Asian context, and width 1 otherwise:
    - Has an East_Asian_Width of Ambiguous, or has a canonical decomposition to an Ambiguous character followed by '\u{0338}' COMBINING LONG SOLIDUS OVERLAY, or is '\u{0387}' GREEK ANO TELEIA, and
    - Does not have a General_Category of Letter or Modifier_Symbol.
  - All other characters have width 1.
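The precedence order in the rules above (string-level sequences first, then a per-character sum) can be illustrated with a small std-only sketch. This is not the crate's implementation: the names toy_str_width, toy_char_width, and is_khmer_coeng_target are hypothetical, only the "\r\n", Khmer coeng, and Lisu tone-letter sequences are recognized, no Unicode property tables are consulted, and the tolerance for inserted default-ignorable marks or ZERO WIDTH JOINERs is ignored.

```rust
/// Hypothetical helper: characters that form a zero-width coeng sign
/// when they follow U+17D2 (ranges copied from the Khmer rule).
fn is_khmer_coeng_target(c: char) -> bool {
    matches!(
        c,
        '\u{1780}'..='\u{1782}'
            | '\u{1784}'..='\u{1787}'
            | '\u{1789}'..='\u{178C}'
            | '\u{178E}'..='\u{1793}'
            | '\u{1795}'..='\u{1798}'
            | '\u{179B}'..='\u{179D}'
            | '\u{17A0}'
            | '\u{17A2}'
            | '\u{17A7}'
            | '\u{17AB}'..='\u{17AC}'
            | '\u{17AF}'
    )
}

/// Toy per-character width: only the fixed-value rules are covered.
/// A real implementation would look up Unicode properties here.
fn toy_char_width(c: char) -> usize {
    match c {
        '\u{17D8}' => 3,              // KHMER SIGN BEYYAL
        '\u{115F}' | '\u{17A4}' => 2, // HANGUL CHOSEONG FILLER, KHMER INDEPENDENT VOWEL QAA
        '\u{A8FA}' => 0,              // DEVANAGARI CARET
        _ => 1,                       // simplification: everything else counts as 1
    }
}

/// Toy string width: try the two-character sequences first,
/// then fall back to the per-character width.
fn toy_str_width(s: &str) -> usize {
    let chars: Vec<char> = s.chars().collect();
    let mut width = 0usize;
    let mut i = 0usize;
    while i < chars.len() {
        match (chars[i], chars.get(i + 1).copied()) {
            // "\r\n" has width 1 as a unit.
            ('\r', Some('\n')) => {
                width += 1;
                i += 2;
            }
            // Khmer coeng sign: both characters together have width 0.
            ('\u{17D2}', Some(next)) if is_khmer_coeng_target(next) => {
                i += 2;
            }
            // Lisu tone letter combination: width 1 as a unit.
            ('\u{A4F8}'..='\u{A4FB}', Some('\u{A4FC}'..='\u{A4FD}')) => {
                width += 1;
                i += 2;
            }
            // No sequence rule applies: add the single character's width.
            (c, _) => {
                width += toy_char_width(c);
                i += 1;
            }
        }
    }
    width
}
```

Under these simplifications, toy_str_width("a\r\nb") evaluates to 3, because the line break counts once rather than twice. For real text, use UnicodeWidthStr::width, which applies the full rule set.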
§Canonical equivalence
Canonically equivalent strings are assigned the same width (CJK and non-CJK).
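To see why this guarantee needs zero-width combining marks, consider "é" written as the single character '\u{00E9}' and as 'e' followed by '\u{0301}' COMBINING ACUTE ACCENT. The std-only sketch below is an illustration, not the crate's code: the hypothetical toy_width_ignoring_marks treats only the Combining Diacritical Marks block ('\u{0300}'..='\u{036F}') as width 0 and every other character as width 1.

```rust
/// Hypothetical toy width: characters in U+0300..=U+036F count as 0,
/// everything else counts as 1. Real width rules are far broader.
fn toy_width_ignoring_marks(s: &str) -> usize {
    s.chars()
        .filter(|c| !matches!(*c, '\u{0300}'..='\u{036F}'))
        .count()
}

/// Number of Unicode scalar values in the string, for comparison.
fn scalar_count(s: &str) -> usize {
    s.chars().count()
}
```

The two spellings differ in scalar count (1 versus 2), but both get width 1 from the toy function, which is the behavior the canonical equivalence guarantee describes.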
Constants§
- UNICODE_VERSION - The version of Unicode that this version of unicode-width is based on.
Traits§
- UnicodeWidthChar - Methods for determining displayed width of Unicode characters.
- UnicodeWidthStr - Methods for determining displayed width of Unicode strings.