Getting Started with Hanzipy: Python Library for Chinese Language Data

Hanzipy is a Python library that simplifies working with Chinese characters. It is useful for linguists, digital humanities researchers, NLP engineers, Chinese language learners, and anyone working with digitized Chinese materials.

The library's main tasks include converting Chinese characters to pinyin, identifying and analyzing character radicals, determining character frequency, and providing detailed information about individual characters, such as their meanings, pronunciations, and component structure. Additionally, it offers tools for converting between simplified and traditional characters.

The library draws on datasets that are well established in the study of Chinese language structure, usage, and frequency: CEDICT, Gavin Grover's decomposition data, the Leiden word frequency data, and Jun Da's character frequency data.

Installation

!pip install hanzipy

import hanzipy

Imports

Hanzipy is divided into 2 modules: dictionary and decomposer. To begin using hanzipy, you’ll need to import its modules into your Python script. The dictionary module allows you to look up detailed information about Chinese characters, while the decomposer module lets you decompose characters into their constituent parts, such as radicals and strokes.

# Import the decomposer
from hanzipy.decomposer import HanziDecomposer
decomposer = HanziDecomposer()

# Import the dictionary
from hanzipy.dictionary import HanziDictionary
dictionary = HanziDictionary()

Decomposition Module

Task 1: Decompose Character

You can break a character down into its components (radicals and strokes) with this function: decomposer.decompose(character, decomposition_type=None). The optional decomposition_type selects the level of decomposition: 1 decomposes the character only once, 2 decomposes it into radicals, and 3 decomposes it into graphical strokes; if omitted, all three levels are returned.

There are 214 traditional radicals, which are the standardized components, often representing a common element or meaning shared across characters. Strokes are the individual lines or marks used to write Chinese characters. There are generally 8 to 12 basic types of strokes, but when considering all variations, there can be around 30 different stroke types in Chinese characters.

print(decomposer.decompose("智"))

{'character': '智', 'once': ['知', '日'], 'radical': ['矢', '口', '日'], 'graphical': ['丿', '一', '人', '一', '口', '口', '一']}

To get radicals only:

print(decomposer.decompose("智", 2))

{'character': '智', 'components': ['矢', '口', '日']}

To decompose a string of characters, use decomposer.decompose_many(characters). The function returns a single dictionary keyed by character.

print(decomposer.decompose_many("说话"))

{'说': {'character': '说', 'once': ['讠', '兑'], 'radical': ['讠', '丷', '口', '儿'], 'graphical': ['㇊', '丶', '丷', '口', '丿', '乚']}, '话': {'character': '话', 'once': ['讠', '舌'], 'radical': ['讠', '舌'], 'graphical': ['㇊', '丶', '㇒', '一', '丨', '口']}}
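Because decompose_many() returns one dictionary keyed by character, its output is easy to post-process. The sketch below uses the sample result for "说话" shown above as literal data and collects the unique radicals across the whole string:

```python
# Post-processing the dictionary returned by decompose_many().
# The sample data below is the output shown above for "说话".
result = {
    "说": {"character": "说", "once": ["讠", "兑"],
           "radical": ["讠", "丷", "口", "儿"]},
    "话": {"character": "话", "once": ["讠", "舌"],
           "radical": ["讠", "舌"]},
}

# Collect the unique radicals across the whole string,
# preserving first-seen order.
seen = []
for info in result.values():
    for radical in info["radical"]:
        if radical not in seen:
            seen.append(radical)

print(seen)  # ['讠', '丷', '口', '儿', '舌']
```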

You can also check if a certain component exists:

print(decomposer.component_exists("&"))

False

Task 2: Radical Meaning

Radicals often carry semantic or phonetic significance. With the function below you can look up the meaning of a certain radical: decomposer.get_radical_meaning(radical)

print(decomposer.get_radical_meaning("心"))

heart

Task 3: Search Characters by Components

This function searches for all characters that include a certain radical or stroke and returns them as a list: decomposer.get_characters_with_component(component).

print(decomposer.get_characters_with_component("鬥"))

['鬧', '鬩']

Dictionary Module

This module provides functions to retrieve information like pinyin, meaning, and other linguistic details of Chinese characters.

Task 1: Convert Characters to Pinyin

Pinyin is the romanization system for Standard Chinese. It is used as an input method on electronic devices, for teaching Chinese, and for ordering entries in dictionaries.

The function used for getting pinyin for characters is dictionary.get_pinyin(character).

# Example Chinese word

word = "你好"

# Convert to pinyin

pinyin = dictionary.get_pinyin(word)

print(pinyin)

['ni3 hao3']

However, since this function performs a dictionary lookup, it only provides pinyin for single characters and words that exist as entries. If you want pinyin for a phrase, sentence, or larger chunk of text, you need to define a new function:

def get_pinyin_for_text(text):
    # Look up each character and concatenate all of its readings
    return " ".join("".join(dictionary.get_pinyin(char)) for char in text)

# Convert to pinyin

pinyin = get_pinyin_for_text("从文本到智能")

print(pinyin)

cong2Cong2cong2 Wen2wen2 ben3 dao4 zhi4 Neng2neng2

Note that characters with several dictionary readings (including capitalized proper-noun readings, such as surnames) have all of their readings concatenated, which is why 从, 文 and 能 appear multiple times in the output.
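One way to clean this up is to pick a single reading per character. The helper below is a sketch: it operates on the reading list that get_pinyin() returns and prefers the first all-lowercase reading, since capitalized readings in CEDICT are usually proper nouns such as surnames. get_pinyin_for_text could then call first_reading(dictionary.get_pinyin(char)) instead of joining all readings.

```python
def first_reading(readings):
    """Pick one pinyin reading from the list returned by get_pinyin().

    Prefers the first all-lowercase reading (capitalized readings are
    usually proper nouns); falls back to the first reading otherwise.
    """
    for reading in readings:
        if reading == reading.lower():
            return reading
    return readings[0]

# Sample reading lists, as seen in the output above for 从 and 文
print(first_reading(["cong2", "Cong2", "cong2"]))  # cong2
print(first_reading(["Wen2", "wen2"]))             # wen2
```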

Task 2: Character Lookup

You can look up a character's definition in the dictionary with dictionary.definition_lookup(), which takes two parameters: character and script_type (optional). Script type can be "simplified" or "traditional". The function returns a list of entries, each containing the traditional and simplified forms, pinyin, and a definition.

print(dictionary.definition_lookup("语"))

[{'traditional': '語', 'simplified': '语', 'pinyin': 'yu3', 'definition': 'dialect/language/speech'}, {'traditional': '語', 'simplified': '语', 'pinyin': 'yu4', 'definition': 'to tell to'}]

Task 3: Dictionary Search

This function allows you to search through all possible dictionary entries that contain a specific character, including its use in compound words, idioms, and phrases: dictionary.dictionary_search().

search_results = dictionary.dictionary_search("智")

print(search_results)

[{'traditional': '不智', 'simplified': '不智', 'pinyin': 'bu4 zhi4', 'definition': 'unwise'}, {'traditional': '不經一事,不長一智', 'simplified': '不经一事,不长一智', 'pinyin': 'bu4 jing1 yi1 shi4 , bu4 zhang3 yi1 zhi4', 'definition': "you can't gain knowledge without practical experience (idiom)/wisdom only comes with experience"}, …]

You can also refine the search by filtering the results to get only words starting or ending with a specific character:

char = "智"
starting_with_char = [entry for entry in search_results if entry['simplified'].startswith(char)]
ending_with_char = [entry for entry in search_results if entry['simplified'].endswith(char)]
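Filtering like this works on plain lists of entry dictionaries, so it can be tried without a live search. The snippet below uses two sample entries, one taken from the dictionary_search("智") output above and one hypothetical, to show both filters:

```python
# Filtering search results by the position of a character.
# '不智' is from the dictionary_search("智") output shown above;
# the '智慧' entry here is a hypothetical example.
search_results = [
    {"simplified": "不智", "pinyin": "bu4 zhi4", "definition": "unwise"},
    {"simplified": "智慧", "pinyin": "zhi4 hui4", "definition": "wisdom"},
]

char = "智"
starting = [e["simplified"] for e in search_results if e["simplified"].startswith(char)]
ending = [e["simplified"] for e in search_results if e["simplified"].endswith(char)]

print(starting)  # ['智慧']
print(ending)    # ['不智']
```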

You can also rank the results by frequency. Note that the entries returned by dictionary_search() do not include a frequency field themselves, so this pattern only works once you have attached one (for single-character entries, get_character_frequency() can supply it):

sorted_by_frequency = sorted(search_results, key=lambda entry: entry.get('frequency', 0), reverse=True)
top_frequent_results = sorted_by_frequency[:5]
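End to end, the ranking looks like this. The frequency counts below are purely illustrative, not real corpus numbers, since hanzipy's search entries do not carry a frequency value by default:

```python
# Sketch of ranking search entries after a frequency value has been
# attached. The counts are hypothetical, for illustration only.
search_results = [
    {"simplified": "不智", "frequency": 1200},
    {"simplified": "智能", "frequency": 95000},
    {"simplified": "智慧", "frequency": 64000},
]

sorted_by_frequency = sorted(search_results,
                             key=lambda entry: entry.get("frequency", 0),
                             reverse=True)
top = [entry["simplified"] for entry in sorted_by_frequency[:2]]
print(top)  # ['智能', '智慧']
```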

If you set the search type parameter to "only", you should get matches for a specific sequence of characters, narrowing the search down to exact matches or compound words. However, this behaves somewhat oddly: the results actually list all possible dictionary entries for each character separately. So, to get an exact match for a phrase, you need to define a function:

def exact_phrase_search(phrase):
    # Perform a general search for the phrase
    search_results = dictionary.dictionary_search(phrase)
    # Keep only entries that exactly match the phrase
    exact_matches = [
        entry for entry in search_results
        if entry['simplified'] == phrase or entry['traditional'] == phrase
    ]
    return exact_matches

phrase = "一條道走到黑"
results = exact_phrase_search(phrase)
print(results)

[{'traditional': '一條道走到黑', 'simplified': '一条道走到黑', 'pinyin': 'yi1 tiao2 dao4 zou3 dao4 hei1', 'definition': "to stick to one's ways/to cling to one's course"}]

Task 4: Character Frequency

You can retrieve a character's frequency, based on Jun Da's character frequency data, with this function: dictionary.get_character_frequency(). The frequency value indicates how common the character is in the corpus the list was built from. The function returns a dictionary with the character's rank, raw count, cumulative percentage, pinyin, and meaning.

print(dictionary.get_character_frequency("智"))

{'number': 883, 'character': '智', 'count': '38064', 'percentage': '87.0842656094', 'pinyin': 'zhi4', 'meaning': 'wisdom/knowledge'}

The "number" field is the character's rank in the frequency list, and you can look a character up by rank with this function:

print(dictionary.get_character_in_frequency_list_by_position(883))

{'number': 883, 'character': '智', 'count': '38064', 'percentage': '87.0842656094', 'pinyin': 'zhi4', 'meaning': 'wisdom/knowledge'}

The "count" is the total number of occurrences of the character within the dataset, and the "percentage" is cumulative: the share of the corpus covered by this character together with all higher-ranked characters.

Task 5: Phonetic Regularity

Phonetic regularity refers to how a character’s pronunciation corresponds to its phonetic component(s) in the character’s structure. This function can be applied either to a single character or to a decomposition object that provides details about the character’s structure: dictionary.determine_phonetic_regularity().

The function returns, for each pronunciation of the character, its components and their pinyin, along with a regularity score for each component (0 = no match; 1 = exact match; 2 = syllable match; 3 = initial match; 4 = final match).

print(dictionary.determine_phonetic_regularity("油"))

{'you2': {'character': '油', 'component': ['氵', '由', '氵', '二', '丨', '丨', '凵'], 'phonetic_pinyin': ['shui3', 'you2', 'shui3', 'er4', 'gun3', 'shu4', 'kan3'], 'regularity': [0, 1, 0, 0, 0, 0, 0]}}

One more example, this time showing a final match:

print(dictionary.determine_phonetic_regularity("体"))

{'ti3': {'character': '体', 'component': ['亻', '本', '亻', '木', '木', '一'], 'phonetic_pinyin': ['ren2', 'ben3', 'ren2', 'Mu4', 'mu4', 'yi1'], 'regularity': [0, 0, 0, 0, 0, 4]}}
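The numeric scores are easy to turn into readable labels. The sketch below maps the score codes listed above to text and, using the 油 (you2) output shown earlier as literal data, keeps only the components that are phonetically related to the character:

```python
# Interpreting the regularity scores from determine_phonetic_regularity(),
# using the 油 (you2) output above as sample data.
LABELS = {
    0: "no match",
    1: "exact match",
    2: "syllable match",
    3: "initial match",
    4: "final match",
}

components = ["氵", "由", "氵", "二", "丨", "丨", "凵"]
regularity = [0, 1, 0, 0, 0, 0, 0]

# Keep only the components that are phonetically related
phonetic_hints = [
    (comp, LABELS[score])
    for comp, score in zip(components, regularity)
    if score != 0
]
print(phonetic_hints)  # [('由', 'exact match')]
```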

References and useful resources:

hanzipy · PyPI

Synkied/hanzipy: Hanzipy is a Chinese character and NLP module for Chinese language processing for python. It is primarily written to help provide a framework for Chinese language learners to explore Chinese. (github.com)

nieldlr/hanzi: HanziJS is a Chinese character and NLP module for Chinese language processing for Node.js (github.com)
