Hanzipy is a Python library that simplifies working with Chinese characters. This library is very useful for linguists, digital humanities researchers, NLP engineers, Chinese language learners and anyone working with digitized Chinese materials.
The main tasks this library can perform include converting Chinese characters to pinyin, identifying and analyzing character radicals, determining the frequency of characters in a text, and providing detailed information about individual Chinese characters, such as their meanings, pronunciations, and stroke orders. Additionally, it offers tools for simplifying or traditionalizing Chinese characters.
The library draws on well-known datasets on Chinese language structure, usage, and frequency: CEDICT, Gavin Grover’s Decomposition Data, Leiden Word Frequency Data, and Jun Da Character Frequency Data.
Installation
!pip install hanzipy
import hanzipy
Imports
Hanzipy is divided into two modules: dictionary and decomposer. To begin using hanzipy, import them into your Python script. The dictionary module lets you look up detailed information about Chinese characters, while the decomposer module lets you decompose characters into their constituent parts, such as radicals and strokes.
# Import decomposer
from hanzipy.decomposer import HanziDecomposer
decomposer = HanziDecomposer()
# Import dictionary
from hanzipy.dictionary import HanziDictionary
dictionary = HanziDictionary()
Decomposition Module
Task 1: Decompose Character
You can break down a character into its components (radicals and strokes) with the function decomposer.decompose(character, decomposition_type=None).
There are 214 traditional radicals, which are the standardized components, often representing a common element or meaning shared across characters. Strokes are the individual lines or marks used to write Chinese characters. There are generally 8 to 12 basic types of strokes, but when considering all variations, there can be around 30 different stroke types in Chinese characters.
print(decomposer.decompose("智"))
{'character': '智', 'once': ['知', '日'], 'radical': ['矢', '口', '日'], 'graphical': ['丿', '一', '人', '一', '口', '口', '一']}
To get radicals only:
print(decomposer.decompose("智", 2))
{'character': '智', 'components': ['矢', '口', '日']}
To decompose a string of several characters, use the function decomposer.decompose_many(characters). It returns a single dictionary keyed by character.
print(decomposer.decompose_many("说话"))
{'说': {'character': '说', 'once': ['讠', '兑'], 'radical': ['讠', '丷', '口', '儿'], 'graphical': ['㇊', '丶', '丷', '口', '丿', '乚']}, '话': {'character': '话', 'once': ['讠', '舌'], 'radical': ['讠', '舌'], 'graphical': ['㇊', '丶', '㇒', '一', '丨', '口']}}
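Because decompose_many returns one dictionary keyed by character, its result can be post-processed with ordinary Python. The sketch below finds the components that all characters in a string have in common. It uses the output shown above, copied in as a literal so that it runs without hanzipy installed; the helper name shared_components is hypothetical and not part of the library.

```python
# Result of decomposer.decompose_many("说话"), copied from the output above.
decomposed = {
    "说": {
        "character": "说",
        "once": ["讠", "兑"],
        "radical": ["讠", "丷", "口", "儿"],
        "graphical": ["㇊", "丶", "丷", "口", "丿", "乚"],
    },
    "话": {
        "character": "话",
        "once": ["讠", "舌"],
        "radical": ["讠", "舌"],
        "graphical": ["㇊", "丶", "㇒", "一", "丨", "口"],
    },
}


def shared_components(result, level="radical"):
    """Return the components at the given level that every character shares."""
    component_sets = [set(entry[level]) for entry in result.values()]
    if not component_sets:
        return set()
    return set.intersection(*component_sets)


print(shared_components(decomposed))          # radical level
print(shared_components(decomposed, "once"))  # first-level split
```

Both calls single out the speech radical 讠, which is what links the two characters.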
You can also check if a certain component exists:
print(decomposer.component_exists("&"))
False
Task 2: Radical Meaning
Radicals often carry semantic or phonetic significance. With the function below you can look up the meaning of a certain radical: decomposer.get_radical_meaning(radical)
print(decomposer.get_radical_meaning("心"))
heart
Task 3: Search Characters by Components
This function searches for all characters that include a given radical or stroke and returns them as a list: decomposer.get_characters_with_component(component).
print(decomposer.get_characters_with_component("鬥"))
['鬧', '鬩']
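One way to picture this lookup is as an inverted index from components to characters. The sketch below builds such an index from a few radical-level decompositions copied from the outputs earlier in this article. It only illustrates the idea and makes no claim about how hanzipy implements get_characters_with_component; the names sample and build_component_index are hypothetical.

```python
from collections import defaultdict

# Radical-level decompositions copied from the outputs shown earlier.
sample = {
    "智": ["矢", "口", "日"],
    "说": ["讠", "丷", "口", "儿"],
    "话": ["讠", "舌"],
}


def build_component_index(decompositions):
    """Map each component to the sorted list of characters containing it."""
    index = defaultdict(set)
    for character, components in decompositions.items():
        for component in components:
            index[component].add(character)
    return {component: sorted(chars) for component, chars in index.items()}


index = build_component_index(sample)
print(index["讠"])  # characters containing the speech radical
print(index["口"])  # characters containing the mouth radical
```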
Dictionary Module
This module provides functions to retrieve information like pinyin, meaning, and other linguistic details of Chinese characters.
Task 1: Convert Characters to Pinyin
Pinyin is the romanization system for Standard Chinese. It is used as an input method on electronic devices, in teaching Chinese, and for ordering entries in dictionaries.
The function used for getting pinyin for characters is dictionary.get_pinyin(character).
# Example Chinese word
word = "你好"
# Convert to pinyin
pinyin = dictionary.get_pinyin(word)
print(pinyin)
['ni3 hao3']
However, because this function looks its input up in the dictionary, it only returns pinyin for single characters and for words that have a dictionary entry. To get pinyin for a phrase, a sentence or a larger chunk of text, you need to define a helper function:
def get_pinyin_for_text(text):
    # Convert each character to pinyin and join results
    return " ".join("".join(dictionary.get_pinyin(char)) for char in text)
# Convert to pinyin
pinyin = get_pinyin_for_text("从文本到智能")
print(pinyin)
cong2Cong2cong2 Wen2wen2 ben3 dao4 zhi4 Neng2neng2
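The output above runs several readings of the same character together (for example cong2Cong2cong2), because every reading returned for a character is joined without a separator. One possible variant keeps a single reading per character. This is a sketch under assumptions: the lookup argument stands in for dictionary.get_pinyin, and the stub readings are reconstructed from the output above rather than taken from the library. Picking the first reading is a simplification that ignores context.

```python
def pinyin_for_text(text, lookup):
    """Return one lowercase reading per character, separated by spaces.

    `lookup` is any callable that maps a character to a list of readings,
    for example dictionary.get_pinyin.
    """
    syllables = []
    for char in text:
        readings = lookup(char)
        if readings:
            # Simplification: take the first reading and ignore context.
            syllables.append(readings[0].lower())
        else:
            # Keep characters with no known reading (punctuation, Latin text).
            syllables.append(char)
    return " ".join(syllables)


# Stub readings reconstructed from the output above (an assumption).
STUB_READINGS = {
    "从": ["cong2", "Cong2", "cong2"],
    "文": ["Wen2", "wen2"],
    "本": ["ben3"],
    "到": ["dao4"],
    "智": ["zhi4"],
    "能": ["Neng2", "neng2"],
}


def stub_lookup(char):
    """Hypothetical stand-in for dictionary.get_pinyin."""
    return STUB_READINGS.get(char, [])


print(pinyin_for_text("从文本到智能", stub_lookup))
```

To use it with the library, pass dictionary.get_pinyin as the lookup. How that function behaves for characters missing from the dictionary is not covered here, so you may need a try/except around the call.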
Task 2: Character Lookup
You can look up a character’s definition in the dictionary with the function dictionary.definition_lookup(), which takes two parameters: character and script_type (optional). Script type can be “simplified” or “traditional”. The function returns a list of dictionaries, one per dictionary entry, each containing the traditional and simplified forms, the pinyin and the definition.
print(dictionary.definition_lookup("语"))
[{'traditional': '語', 'simplified': '语', 'pinyin': 'yu3', 'definition': 'dialect/language/speech'}, {'traditional': '語', 'simplified': '语', 'pinyin': 'yu4', 'definition': 'to tell to'}]
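Since a character can have more than one reading, it is often handy to group the senses by pinyin. The sketch below does this with the lookup result shown above, copied in as a literal so that it runs without hanzipy installed. The helper senses_by_reading is hypothetical, and splitting on "/" assumes the slash-separated definition format visible in that output.

```python
# Result of dictionary.definition_lookup("语"), copied from the output above.
entries = [
    {"traditional": "語", "simplified": "语", "pinyin": "yu3",
     "definition": "dialect/language/speech"},
    {"traditional": "語", "simplified": "语", "pinyin": "yu4",
     "definition": "to tell to"},
]


def senses_by_reading(entries):
    """Map each pinyin reading to its list of individual senses."""
    grouped = {}
    for entry in entries:
        # Senses within one definition are separated by a slash.
        senses = [s for s in entry["definition"].split("/") if s]
        grouped.setdefault(entry["pinyin"], []).extend(senses)
    return grouped


for reading, senses in senses_by_reading(entries).items():
    print(reading, senses)
```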
Task 3: Dictionary Search
This function allows you to search through all possible dictionary entries that contain a specific character, including its use in compound words, idioms, and phrases: dictionary.dictionary_search().
search_results = dictionary.dictionary_search("智")
print(search_results)
[{'traditional': '不智', 'simplified': '不智', 'pinyin': 'bu4 zhi4', 'definition': 'unwise'}, {'traditional': '不經一事,不長一智', 'simplified': '不经一事,不长一智', 'pinyin': 'bu4 jing1 yi1 shi4 , bu4 zhang3 yi1 zhi4', 'definition': "you can't gain knowledge without practical experience (idiom)/wisdom only comes with experience"}, …]
You can also refine the search a bit, by filtering results to get only words starting or ending with a specific character:
character = "智"
starting_with_char = [entry for entry in search_results if entry['simplified'].startswith(character)]
ending_with_char = [entry for entry in search_results if entry['simplified'].endswith(character)]
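You can also sort the results to bring the most frequent ones to the top. Note that this relies on each entry having a 'frequency' key; entries without one are treated as 0 and keep their original order: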
sorted_by_frequency = sorted(search_results, key=lambda entry: entry.get('frequency', 0), reverse=True)
top_frequent_results = sorted_by_frequency[:5]
If you set the search_parameter to “only”, you should get matches for a specific sequence of characters, which helps narrow the results down to exact matches or compound words. In practice, however, this behaves unexpectedly: the results list all dictionary entries for each character separately. So, to get an exact match for a phrase, you need to define a function:
def exact_phrase_search(phrase):
    # Perform a general search for the phrase
    search_results = dictionary.dictionary_search(phrase)
    # Filter results to include only those that exactly match the phrase
    exact_matches = [
        entry for entry in search_results
        if entry['simplified'] == phrase or entry['traditional'] == phrase
    ]
    return exact_matches
phrase = "一條道走到黑"
results = exact_phrase_search(phrase)
print(results)
[{'traditional': '一條道走到黑', 'simplified': '一条道走到黑', 'pinyin': 'yi1 tiao2 dao4 zou3 dao4 hei1', 'definition': "to stick to one's ways/to cling to one's course"}]
Task 4: Character Frequency
You can retrieve the frequency of a character, based on the Jun Da Character Frequency Data and the Leiden Word Frequency Data, with the function dictionary.get_character_frequency(). The frequency indicates how common or rare the character is within the underlying corpus, and so how often it appears in written Chinese. The function returns a dictionary with the character’s rank (“number”), its count, a percentage value, its pinyin and its meaning.
print(dictionary.get_character_frequency("智"))
{'number': 883, 'character': '智', 'count': '38064', 'percentage': '87.0842656094', 'pinyin': 'zhi4', 'meaning': 'wisdom/knowledge'}
The “number” is the character’s rank by frequency within the dataset. You can use it to look a character up by rank with this function:
print(dictionary.get_character_in_frequency_list_by_position(883))
{'number': 883, 'character': '智', 'count': '38064', 'percentage': '87.0842656094', 'pinyin': 'zhi4', 'meaning': 'wisdom/knowledge'}
The “count” represents the total number of occurrences of the character within the dataset.
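In the output above, “count” and “percentage” are strings while “number” is an integer, so they need converting before any arithmetic or sorting. A minimal sketch, using that output copied in as a literal; the helper normalize_frequency is hypothetical and not part of hanzipy.

```python
# Result of dictionary.get_character_frequency("智"), copied from above.
raw = {
    "number": 883,
    "character": "智",
    "count": "38064",
    "percentage": "87.0842656094",
    "pinyin": "zhi4",
    "meaning": "wisdom/knowledge",
}


def normalize_frequency(entry):
    """Return a copy with 'count' as int and 'percentage' as float."""
    normalized = dict(entry)
    normalized["count"] = int(entry["count"])
    normalized["percentage"] = float(entry["percentage"])
    return normalized


freq = normalize_frequency(raw)
print(freq["count"], round(freq["percentage"], 2))
```

The original dictionary is left untouched, so the raw strings remain available if you need them.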
Task 5: Phonetic Regularity
Phonetic regularity refers to how a character’s pronunciation corresponds to its phonetic component(s) in the character’s structure. This function can be applied either to a single character or to a decomposition object that provides details about the character’s structure: dictionary.determine_phonetic_regularity().
The function returns a dictionary keyed by each reading of the character. For each reading it lists the character’s components, their pronunciation (pinyin) and a regularity score for each component (0 = no match; 1 = exact match; 2 = syllable match; 3 = initial match; 4 = final match).
print(dictionary.determine_phonetic_regularity("油"))
{'you2': {'character': '油', 'component': ['氵', '由', '氵', '二', '丨', '丨', '凵'], 'phonetic_pinyin': ['shui3', 'you2', 'shui3', 'er4', 'gun3', 'shu4', 'kan3'], 'regularity': [0, 1, 0, 0, 0, 0, 0]}}
One more example, this time with a final match:
print(dictionary.determine_phonetic_regularity("体"))
{'ti3': {'character': '体', 'component': ['亻', '本', '亻', '木', '木', '一'], 'phonetic_pinyin': ['ren2', 'ben3', 'ren2', 'Mu4', 'mu4', 'yi1'], 'regularity': [0, 0, 0, 0, 0, 4]}}
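The parallel lists in these results are easier to read once the scores are turned into labels and the non-matching components are dropped. The sketch below does that for the two outputs above, copied in as literals so that it runs without hanzipy installed. The label table follows the score meanings listed earlier, and the helper matching_components is hypothetical.

```python
# Score meanings as listed in the text above.
REGULARITY_LABELS = {
    0: "no match",
    1: "exact match",
    2: "syllable match",
    3: "initial match",
    4: "final match",
}

# Results of dictionary.determine_phonetic_regularity(), copied from above.
result_you = {
    "you2": {
        "character": "油",
        "component": ["氵", "由", "氵", "二", "丨", "丨", "凵"],
        "phonetic_pinyin": ["shui3", "you2", "shui3", "er4", "gun3", "shu4", "kan3"],
        "regularity": [0, 1, 0, 0, 0, 0, 0],
    }
}
result_ti = {
    "ti3": {
        "character": "体",
        "component": ["亻", "本", "亻", "木", "木", "一"],
        "phonetic_pinyin": ["ren2", "ben3", "ren2", "Mu4", "mu4", "yi1"],
        "regularity": [0, 0, 0, 0, 0, 4],
    }
}


def matching_components(result):
    """List (reading, component, component pinyin, label) for nonzero scores."""
    matches = []
    for reading, info in result.items():
        rows = zip(info["component"], info["phonetic_pinyin"], info["regularity"])
        for component, pinyin, score in rows:
            if score != 0:
                matches.append((reading, component, pinyin, REGULARITY_LABELS[score]))
    return matches


print(matching_components(result_you))
print(matching_components(result_ti))
```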
References and useful resources: