Module kuzukiri.kuzukiri

Classes

class Segmenter (terminals, parentheses, force, max_buf_length)

Text Segmentation Class

Args

terminals : Optional[set[str]]
a set of terminal characters (Default: {'。', '.', '!', '?', '\n'})
parentheses : Optional[dict[str, str]]
pairs of opening and closing parentheses (Default: {'「': '」', '『': '』', '(': ')', '[': ']', '【': '】'})
force : Optional[set[str]]
a set of terminal characters that end a segment even inside parentheses (Default: empty set)
max_buf_length : Optional[int]
maximum buffer size (Default: 1000)
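
The following pure-Python sketch illustrates how the terminals, parentheses, and force arguments could interact. It is not the library's implementation: the function name sketch_split is hypothetical, max_buf_length is left out because its exact behavior is not described here, and two details are assumptions made for illustration (the terminal character stays at the end of its segment, and a forced split resets the open-parenthesis state).

```python
DEFAULT_TERMINALS = {"。", ".", "!", "?", "\n"}
DEFAULT_PARENTHESES = {"「": "」", "『": "』", "(": ")", "[": "]", "【": "】"}


def sketch_split(text, terminals=None, parentheses=None, force=None):
    """Hypothetical illustration of terminal/parenthesis/force handling."""
    terminals = DEFAULT_TERMINALS if terminals is None else terminals
    parentheses = DEFAULT_PARENTHESES if parentheses is None else parentheses
    force = set() if force is None else force

    segments = []
    buf = []    # characters of the segment being built
    stack = []  # closing characters expected for currently open parentheses

    for ch in text:
        buf.append(ch)
        if ch in force:
            # Forced terminals split regardless of open parentheses.
            # Assumption: the parenthesis state is reset afterwards.
            segments.append("".join(buf))
            buf, stack = [], []
        elif stack and ch == stack[-1]:
            stack.pop()                    # matching close parenthesis
        elif ch in parentheses:
            stack.append(parentheses[ch])  # open parenthesis
        elif ch in terminals and not stack:
            # Ordinary terminals split only outside parentheses.
            segments.append("".join(buf))
            buf = []

    if buf:
        segments.append("".join(buf))      # trailing text without a terminal
    return segments
```

In this sketch, a '。' inside '「…」' does not end the segment, whereas a character listed in force would end it even there.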

Methods

def split(self, text)

Execute text segmentation

Args

text (str) : target text

Returns

List[str]
list of segmented texts

def split_with_norm(self, text)

Execute text segmentation with normalization

After splitting, NFKC normalization and trimming are performed.

Args

text (str) : target text

Returns

List[str]
list of segmented texts
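
The post-processing described for split_with_norm can be approximated with the standard library. This is a minimal sketch, assuming that "trimming" means stripping leading and trailing whitespace and that normalization is applied before trimming. The helper name normalize_segments is hypothetical and is not part of this module.

```python
import unicodedata


def normalize_segments(segments):
    """Apply NFKC normalization, then strip surrounding whitespace."""
    return [unicodedata.normalize("NFKC", s).strip() for s in segments]
```

NFKC folds compatibility forms into their canonical equivalents, so full-width Latin letters and digits become ASCII and half-width katakana become full-width katakana. Whether the library drops segments that are empty after trimming is not stated here, so the sketch keeps them.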