Understanding how text data is encoded and interpreted by machines and software is essential knowledge for any linguist moving into NLP, as well as for any NLP developer.
Text encoding is the process of converting text characters and symbols into unique sequences of binary numbers, or bits, so that computers can read, store, transmit and process them. It also involves several other considerations, such as character set selection, encoding schemes (or methods) and metadata for representing text (for example, when you save a text file as UTF-8, that choice determines both the character set (Unicode) and the encoding scheme (UTF-8)).
A fundamental part of text encoding is character encoding, which deals specifically with assigning numerical values to characters, the minimal units of text with semantic value.
A character set is a group of characters used to represent text, such as the Latin alphabet, Arabic script characters or Han characters.
A character set map is a mapping or table that associates characters in one encoding with their corresponding representations in another encoding, which is useful for transliteration, internationalization, localization, etc.
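In its simplest form, a character set map can be a dictionary from source characters to target strings. The snippet below is a toy, hypothetical sketch of Cyrillic-to-Latin transliteration; the mapping covers only a few letters and is purely illustrative:

```python
# Toy character set map: a few Cyrillic letters and their Latin counterparts.
# Hypothetical and illustrative only; real transliteration schemes are
# language- and standard-specific.
CYRILLIC_TO_LATIN = {
    "ф": "f",
    "д": "d",
    "а": "a",
}

def transliterate(text: str, mapping: dict) -> str:
    """Replace every mapped character; leave everything else unchanged."""
    return "".join(mapping.get(ch, ch) for ch in text)

print(transliterate("фа", CYRILLIC_TO_LATIN))  # -> "fa"
```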
Following the evolution of computing, a few different ways to encode text have been developed. The most common ones are:
- Unicode – the most comprehensive encoding standard, covering characters from all major writing systems. It can be implemented in several forms, such as UTF-8 (the most space-efficient for Latin-based text), UTF-16 and UTF-32, which use different numbers of bytes per character
- ASCII (American Standard Code for Information Interchange) – a scheme that represents characters using 7 bits (usually stored in 8-bit bytes) and covers only basic Latin characters
- ISO-8859 – a family of extended ASCII standards that add characters for particular languages and regions
Here is an example of how the character “A” is represented in binary using different encoding methods:
| Encoding | Binary value |
|----------|--------------|
| ASCII | 01000001 |
| UTF-8 | 01000001 |
| UTF-16 | 00000000 01000001 |
| UTF-32 | 00000000 00000000 00000000 01000001 |
| ISO-8859 | 01000001 |
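You can verify these values yourself by encoding a character and printing the resulting bytes as bits. The sketch below uses Python's built-in str.encode(); the UTF-16 and UTF-32 codecs are used in their explicit big-endian forms ("utf-16-be", "utf-32-be") so that no byte-order mark is added and the output matches the table:

```python
def to_bits(data: bytes) -> str:
    """Format a byte string as space-separated 8-bit groups."""
    return " ".join(f"{byte:08b}" for byte in data)

char = "A"
for encoding in ("ascii", "utf-8", "utf-16-be", "utf-32-be", "iso-8859-1"):
    print(f"{encoding:<11} {to_bits(char.encode(encoding))}")
# ascii       01000001
# utf-8       01000001
# utf-16-be   00000000 01000001
# utf-32-be   00000000 00000000 00000000 01000001
# iso-8859-1  01000001
```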
Characters and symbols can also be represented by their hexadecimal code point value in Unicode; here are a couple of examples:
| Char | Unicode value |
|------|---------------|
| A | U+0041 |
| 糖 | U+7CD6 |
| چ | U+0686 |
| დ | U+10D3 |
| ф | U+0444 |
| ¥ | U+00A5 |
| ? | U+003F |
*You can check the Unicode value of a character by typing it in Microsoft Word and pressing ALT+X.
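Programmatically, Python's built-in ord() returns a character's code point and chr() goes the other way, so the table above can be reproduced with a short loop:

```python
for ch in ("A", "糖", "چ", "დ", "ф", "¥", "?"):
    print(f"{ch}  U+{ord(ch):04X}")   # e.g. "A  U+0041", "糖  U+7CD6", ...

print(chr(0x7CD6))  # the reverse direction -> 糖
```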
Ensuring that the same encoding is consistently used when working with text data is very important, as encoding mismatches cause characters to be misinterpreted and the text to come out corrupted (often called mojibake). Different encodings can also result in different byte lengths for the same characters, as shown above; note that in Python 3, len() on a string counts Unicode code points, while the number of bytes depends on the encoding passed to encode().
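A small sketch of both points: the same string has different byte lengths under different encodings, and decoding bytes with the wrong encoding silently produces garbled text:

```python
text = "café"
print(len(text))                   # 4  -- code points, regardless of encoding
print(len(text.encode("utf-8")))   # 5  -- "é" takes two bytes in UTF-8
print(len(text.encode("utf-16")))  # 10 -- BOM plus two bytes per character

# Decoding UTF-8 bytes with the wrong encoding silently corrupts the text:
print(text.encode("utf-8").decode("iso-8859-1"))  # -> "cafÃ©"
```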
Proper handling of special characters, diacritics and ligatures is also very important, and some of the ways of dealing with them are:
- normalization – converting characters with diacritics to their base forms (“ü” to “u”; “å” to “a”; “ć” to “c”); see the sketch after this list
- tokenization – consider how the text is being tokenized; for example, some NLP tokenizers treat special characters and ligatures as separate tokens, which may affect further analyses (for example, whether the ligatures “ﬁ” and “ﬂ” are kept as single characters or expanded to “fi” and “fl”; whether “state-of-the-art” is kept whole or split into “state”, “of”, “the” and “art”)
- word and character embeddings in NLP models – consider if embeddings include special characters and diacritics
- sorting and collation (determining the order in which characters or words are arranged) – consider locale-specific sorting and collation rules when working with multilingual data
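As a minimal sketch of the normalization and ligature points above, Python's standard unicodedata module can decompose accented characters (NFKD) so that the combining marks can be stripped, and its compatibility normalization (NFKC) expands ligatures such as “ﬁ” into plain letters:

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Decompose characters (NFKD), then drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_diacritics("über, håll, ćao"))   # -> "uber, hall, cao"

# NFKC compatibility normalization expands ligatures into separate letters:
print(unicodedata.normalize("NFKC", "ﬁle"))  # -> "file"
```

For locale-specific sorting and collation, the standard locale module's strxfrm() can serve as a sort key once the appropriate locale has been set.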