Description
A standard for ENS name normalization.
Status
draft
Created
2023-04-03
Author
  • adraffy

ENSIP-15: ENS Name Normalization Standard

Abstract

This ENSIP standardizes the Ethereum Name Service (ENS) name normalization process outlined in ENSIP-1 § Name Syntax.

Motivation

Specification

Definitions

Algorithm

  1. Split the name into labels.
  2. Normalize each label.
  3. Join the labels together into a name again.
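
The three steps above can be sketched as follows. This is a minimal illustration, assuming labels are separated by 2E (.) FULL STOP (the Split and Join steps are not spelled out in this section) and that the caller supplies a per-label function. The names `normalize_name` and `normalize_label` are hypothetical, not part of the standard.

```python
from typing import Callable


def normalize_name(name: str, normalize_label: Callable[[str], str]) -> str:
    # 1. Split the name into labels (assumed separator: "." FULL STOP).
    labels = name.split(".")
    # 2. Normalize each label; the callback is expected to raise on failure.
    normalized = [normalize_label(label) for label in labels]
    # 3. Join the labels together into a name again.
    return ".".join(normalized)
```

For example, `normalize_name("A.B", str.lower)` uses `str.lower` as a stand-in for the real per-label step, which is far stricter than lowercasing.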

Normalize

  1. Tokenize — transform the label into Text and Emoji tokens.
    • If there are no tokens, the label cannot be normalized.
  2. Apply NFC to each Text token.
    • Example: Text["à"] [61 300] → Text["à"] [E0]
  3. Strip FE0F from each Emoji token.
  4. Validate — check if the tokens are valid and obtain the Label Type.
    • The Label Type and Restricted state may be presented to the user for additional security.
  5. Concatenate the tokens together.
    • Return the normalized label.

Examples:

  1. "_$A" [5F 24 41] → "_$a" [5F 24 61] → ASCII
  2. "E︎̃" [45 FE0E 303] → "ẽ" [1EBD] → Latin
  3. "𓆏🐸" [1318F 1F438] → "𓆏🐸" [1318F 1F438] → Restricted: Egyp
  4. "nı̇ck" [6E 131 307 63 6B] → error: Disallowed character
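
Steps 2, 3, and 5 of Normalize can be sketched with Python's standard `unicodedata` module. This is an illustration only: the token representation (a list of `("text" | "emoji", [codepoints])` pairs) and the name `normalize_tokens` are assumptions, and the Validate step is deliberately left out.

```python
import unicodedata

FE0F = 0xFE0F


def normalize_tokens(tokens):
    """tokens: list of ("text" | "emoji", [codepoints]) pairs (assumed shape)."""
    if not tokens:
        raise ValueError("label cannot be normalized: no tokens")
    parts = []
    for kind, cps in tokens:
        s = "".join(map(chr, cps))
        if kind == "text":
            # Step 2: apply NFC to each Text token.
            parts.append(unicodedata.normalize("NFC", s))
        else:
            # Step 3: strip FE0F from each Emoji token.
            parts.append(s.replace(chr(FE0F), ""))
    # Step 4 (Validate) would run here in a full implementation.
    # Step 5: concatenate the tokens together.
    return "".join(parts)
```

With this sketch, a Text token holding [61 300] comes back as the single codepoint E0, matching the NFC example in step 2.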

Tokenize

Convert a label into a list of Text and Emoji tokens, each with a payload of codepoints. The complete list of character types and emoji sequences can be found in spec.json.

  1. Allocate an empty codepoint buffer.
  2. Find the longest Emoji Sequence that matches the remaining input.
    • Example: 👨🏻‍💻 [1F468 1F3FB 200D 1F4BB]
      • Match (1): 👨️ [1F468] man
      • Match (2): 👨🏻 [1F468 1F3FB] man: light skin tone
      • Match (4): 👨🏻‍💻 [1F468 1F3FB 200D 1F4BB] man technologist: light skin tone — longest match!
    • FE0F is optional in the input during matching.
      • Example: 👨‍❤️‍👨 [1F468 200D 2764 FE0F 200D 1F468]
        • Match: 1F468 200D 2764 FE0F 200D 1F468 — fully-qualified
        • Match: 1F468 200D 2764 200D 1F468 — missing FE0F
        • No match: 1F468 FE0F 200D 2764 FE0F 200D 1F468 — extra FE0F
        • No match: 1F468 200D 2764 FE0F FE0F 200D 1F468 — has (2) FE0F
    • This is equivalent to /^(emoji1|emoji2|...)/ where \uFE0F is replaced with \uFE0F? and * is replaced with \x2A.
  3. If an Emoji Sequence is found:
    • If the buffer is nonempty, emit a Text token, and clear the buffer.
    • Emit an Emoji token with the fully-qualified matching sequence.
    • Remove the matched sequence from the input.
  4. Otherwise:
    1. Remove the leading codepoint from the input.
    2. Determine the character type:
      • If Valid, append the codepoint to the buffer.
        • This set can be precomputed from the union of characters in all groups and their NFD decompositions.
      • If Mapped, append the corresponding mapped codepoint(s) to the buffer.
      • If Ignored, do nothing.
      • Otherwise, the label cannot be normalized.
  5. Repeat until all the input is consumed.
  6. If the buffer is nonempty, emit a final Text token with its contents.
    • Return the list of emitted tokens.

Examples:

  1. "xyz👨🏻" [78 79 7A 1F468 1F3FB] → Text["xyz"] + Emoji["👨🏻"]
  2. "A💩︎︎b" [41 FE0E 1F4A9 FE0E FE0E 62] → Text["a"] + Emoji["💩️"] + Text["b"]
  3. "a™️" [61 2122 FE0F] → Text["atm"]
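
The tokenizer loop can be sketched as below. The tables `EMOJI`, `VALID`, `MAPPED`, and `IGNORED` are tiny hypothetical stand-ins for the data in spec.json, chosen only to reproduce the examples above, and the function names are illustrative. Instead of the regular expression mentioned in step 2, the sketch tries every sequence and keeps the longest match, which is slower but easier to follow.

```python
FE0F = 0xFE0F

# Toy stand-ins for the tables in spec.json (illustrative, far from complete).
EMOJI = [
    [0x1F468],
    [0x1F468, 0x1F3FB],
    [0x1F468, 0x1F3FB, 0x200D, 0x1F4BB],
    [0x1F468, 0x200D, 0x2764, 0xFE0F, 0x200D, 0x1F468],
    [0x1F4A9, 0xFE0F],
]
VALID = set(range(0x61, 0x7B))                 # a-z only, for the demo
MAPPED = {0x41: [0x61], 0x2122: [0x74, 0x6D]}  # "A" -> "a", "TM" sign -> "tm"
IGNORED = {0xFE0E, 0xFE0F}


def match_length(seq, cps, pos):
    """Input codepoints consumed by seq at pos (FE0F optional in input), or 0."""
    i = pos
    for cp in seq:
        if i < len(cps) and cps[i] == cp:
            i += 1
        elif cp != FE0F:
            return 0
    return i - pos


def tokenize(label):
    cps = [ord(c) for c in label]
    tokens, buf, pos = [], [], 0
    while pos < len(cps):
        # Find the longest Emoji Sequence matching the remaining input.
        best_len, best_seq = 0, None
        for seq in EMOJI:
            n = match_length(seq, cps, pos)
            if n > best_len:
                best_len, best_seq = n, seq
        if best_seq is not None:
            if buf:
                tokens.append(("text", buf))
                buf = []
            # Emit the fully-qualified sequence, not the raw input.
            tokens.append(("emoji", list(best_seq)))
            pos += best_len
            continue
        cp = cps[pos]
        pos += 1
        if cp in VALID:
            buf.append(cp)
        elif cp in MAPPED:
            buf.extend(MAPPED[cp])
        elif cp in IGNORED:
            pass
        else:
            raise ValueError(f"disallowed character: {cp:X}")
    if buf:
        tokens.append(("text", buf))
    return tokens
```

Because `match_length` only skips an FE0F that the sequence itself contains, an input with an extra FE0F does not match the longer sequence, in line with the "No match" cases listed in step 2.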

Validate

Given a list of Emoji and Text tokens, determine if the label is valid and return the Label Type. If any assertion fails, the name cannot be normalized.

  1. If only Emoji tokens:
    • Return "Emoji"
  2. If a single Text token and every character is ASCII (00..7F):
    • 5F (_) LOW LINE can only occur at the start.
      • Must match /^_*[^_]*$/
      • Examples: "___" and "__abc" are valid, "abc__" and "_abc_" are invalid.
    • The 3rd and 4th characters must not both be 2D (-) HYPHEN-MINUS.
      • Must not match /^..--/
      • Examples: "ab-c" and "---a" are valid, "xn--" and "----" are invalid.
    • Return "ASCII"
      • The label is free of Fenced and Combining Mark characters, and not confusable.
  3. Concatenate all the tokens together.
    • 5F (_) LOW LINE can only occur at the start.
    • The first and last characters cannot be Fenced.
      • Examples: "a’s" and "a・a" are valid, "’85" and "joneses’" and "・a・" are invalid.
    • Fenced characters cannot be contiguous.
      • Examples: "a・a’s" is valid, "6’0’’" and "a・・a" are invalid.
  4. The first character of every Text token must not be a Combining Mark.
  5. Concatenate the Text tokens together.
  6. Find the first Group that contains every text character:
    • If no group is found, the label cannot be normalized.
  7. If the group is not CM Whitelisted:
    • Apply NFD to the concatenated text characters.
    • For every contiguous sequence of NSM characters:
      • Each character must be unique.
        • Example: "x̀̀" [78 300 300] has (2) grave accents.
      • The number of NSM characters cannot exceed Maximum NSM (4).
        • Example: "إؐؑؒؓؔ"‎ [625 610 611 612 613 614] has (6) NSM.
  8. Wholes — check if text characters form a confusable.
  9. The label is valid.
    • Return the name of the group as the Label Type.
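
The positional rules in steps 2 and 3 can be sketched as two small predicates. The `FENCED` set below is an assumption that contains only the two Fenced characters used in the examples above; the full list lives in spec.json. The function names are illustrative.

```python
import re

# Hypothetical subset: only the Fenced characters that appear in the examples.
FENCED = {"\u2019", "\u30fb"}


def ascii_rules_ok(label: str) -> bool:
    # Underscores may only form a prefix: /^_*[^_]*$/
    if re.fullmatch(r"_*[^_]*", label) is None:
        return False
    # The 3rd and 4th characters must not both be hyphens: /^..--/
    if re.match(r"..--", label):
        return False
    return True


def fenced_ok(label: str) -> bool:
    if not label:
        return True
    # The first and last characters cannot be Fenced.
    if label[0] in FENCED or label[-1] in FENCED:
        return False
    # Fenced characters cannot be contiguous.
    return not any(a in FENCED and b in FENCED for a, b in zip(label, label[1:]))
```

`re.fullmatch` is used in place of the `^...$` anchors so that a trailing newline cannot slip through, which is a Python-specific detail rather than something the rules require.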

Examples:

  1. Emoji["💩️"] + Emoji["💩️"] → "Emoji"
  2. Text["abc$123"] → "ASCII"
  3. Emoji["🚀️"] + Text["à"] → "Latin"
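
The NSM checks in step 7 can be sketched with the standard `unicodedata` module. One loud assumption: the sketch treats any character with general category Mn or Me as NSM, whereas the authoritative NSM list comes from spec.json. The name `nsm_ok` is illustrative, and the check only applies to groups that are not CM Whitelisted.

```python
import unicodedata
from itertools import groupby

MAX_NSM = 4  # Maximum NSM, as stated in step 7


def is_nsm(ch: str) -> bool:
    # Assumption: approximate the NSM set by general category.
    return unicodedata.category(ch) in ("Mn", "Me")


def nsm_ok(text: str) -> bool:
    decomposed = unicodedata.normalize("NFD", text)
    # Walk each contiguous run of NSM characters.
    for marks, run in groupby(decomposed, key=is_nsm):
        if not marks:
            continue
        run = list(run)
        if len(run) > MAX_NSM:
            return False          # too many marks in a row
        if len(set(run)) != len(run):
            return False          # a mark is repeated
    return True
```

Applying NFD first matters: 625 decomposes to 627 655, so the second example in step 7 contains six marks after decomposition even though only five follow the base letter in the input.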

Wholes

A label is whole-script confusable if a similar-looking valid label can be constructed using one alternative character from a different group. The complete list of Whole Confusables can be found in spec.json. Each Whole Confusable has a set of non-confusing characters ("valid") and a set of confusing characters ("confused"), where each character may be a member of one or more groups.

Example: Whole Confusable for "g"

| Type | Code | Form | Character | Latn | Hani | Japn | Kore | Armn | Cher | Lisu |
| :-: | -: | :-: | :- | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| valid | 67 | g | LATIN SMALL LETTER G | A | A | A | A | | | |
| confused | 581 | ց | ARMENIAN SMALL LETTER CO | | | | | B | | |
| confused | 13C0 | Ꮐ | CHEROKEE LETTER NAH | | | | | | C | |
| confused | 13F3 | Ᏻ | CHEROKEE LETTER YU | | | | | | C | |
| confused | A4D6 | ꓖ | LISU LETTER GA | | | | | | | D |

  1. Allocate an empty character buffer.
  2. Start with the set of ALL groups.
  3. For each unique character in the label:
    • If the character is Confused (a member of a Whole Confusable):
      • Retain groups with Whole Confusable characters excluding the Confusable Extent of the matching Confused character.
      • If no groups remain, the label is not confusable.
      • The Confusable Extent is the fully-connected graph formed from different groups with the same confusable and different confusables of the same group.
        • The mapping from Confused to Confusable Extent can be precomputed.
      • In the table above, Whole Confusable for "g", the rectangle formed by each capital letter is a Confusable Extent:
        • A is [g] ⊗ [Latin, Han, Japanese, Korean]
        • B is [ց] ⊗ [Armn]
        • C is [Ꮐ, Ᏻ] ⊗ [Cher]
        • D is [ꓖ] ⊗ [Lisu]
      • A Confusable Extent can span multiple characters and multiple groups. Consider the (incomplete) Whole Confusable for "o":
        • 6F (o) LATIN SMALL LETTER O → Latin, Han, Japanese, and Korean
        • 3007 (〇) IDEOGRAPHIC NUMBER ZERO → Han, Japanese, Korean, and Bopomofo
        • Confusable Extent is [o, 〇] ⊗ [Latin, Han, Japanese, Korean, Bopomofo]
    • If the character is Unique, the label is not confusable.
      • This set can be precomputed from characters that appear in exactly one group and are not Confused.
    • Otherwise:
      • Append the character to the buffer.
  4. If any Confused characters were found:
    • If there are no buffered characters, the label is confusable.
    • If any of the remaining groups contain all of the buffered characters, the label is confusable.
    • Example: "0х" [30 445]
      1. 30 (0) DIGIT ZERO
        • Not Confused or Unique, add to buffer.
      2. 445 (х) CYRILLIC SMALL LETTER HA
        • Confusable Extent is [х, 4B3 (ҳ) CYRILLIC SMALL LETTER HA WITH DESCENDER] ⊗ [Cyrillic]
        • Whole Confusable excluding the extent is [78 (x) LATIN SMALL LETTER X, ...] → [Latin, ...]
        • Remaining groups: ALL ∩ [Latin, ...] → [Latin, ...]
      3. There was (1) buffered character:
        • Latin also contains 30 → "0x" [30 78]
      4. The label is confusable.
  5. The label is not confusable.

A label composed of confusable characters isn't necessarily confusable.
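
The Wholes procedure can be sketched with toy data. Everything in the tables below is hypothetical: `GROUPS` holds two tiny groups, `CONFUSED_COMPLEMENT` is the precomputed mapping from a Confused character to the groups of its Whole Confusable outside its own Confusable Extent, and `UNIQUE` is the precomputed Unique set. Real values come from spec.json.

```python
# Toy stand-ins for spec.json data (illustrative only).
GROUPS = {
    "Latin": set("0ax"),
    "Cyrillic": set("0\u0445\u044f"),  # 0, CYRILLIC HA, CYRILLIC YA
}
# Confused character -> groups of its Whole Confusable, excluding its extent.
CONFUSED_COMPLEMENT = {"\u0445": {"Latin"}}
# Characters in exactly one group that are not Confused.
UNIQUE = {"a", "x", "\u044f"}


def is_whole_confusable(label: str) -> bool:
    buffer = []                    # 1. empty character buffer
    remaining = set(GROUPS)        # 2. start with the set of ALL groups
    found_confused = False
    for ch in dict.fromkeys(label):        # 3. each unique character, in order
        if ch in CONFUSED_COMPLEMENT:
            found_confused = True
            remaining &= CONFUSED_COMPLEMENT[ch]
            if not remaining:
                return False       # no groups remain: not confusable
        elif ch in UNIQUE:
            return False           # a Unique character settles it
        else:
            buffer.append(ch)
    if found_confused:             # 4. at least one Confused character
        if not buffer:
            return True
        if any(all(ch in GROUPS[g] for ch in buffer) for g in remaining):
            return True
    return False                   # 5. the label is not confusable
```

With these tables, `is_whole_confusable("0\u0445")` follows the worked example above: "0" is buffered, the Cyrillic letter narrows the remaining groups to Latin, and Latin also contains "0".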

Split

Join

Description of spec.json

Description of nf.json

Derivation

Backwards Compatibility

Security Considerations

Copyright

Copyright and related rights waived via CC0.

Appendix: Reference Specifications

Appendix: Additional Resources

Appendix: Validation Tests

A list of validation tests is provided with the following interpretation:

Annex: Beautification

Follow the algorithm, except: