Module kuzukiri.kuzukiri
Classes
class Segmenter (terminals, parentheses, force, max_buf_length)
Text Segmentation Class
Args
terminals (Optional[set[str]]) : a set of terminal characters that end a segment (Default: {'。', '.', '!', '?', '\n'})
parentheses (Optional[dict[str, str]]) : pairs of opening and closing parentheses (Default: {'「': '」', '『': '』', '(': ')', '[': ']', '【': '】'})
force (Optional[set[str]]) : a set of terminal characters that end a segment even inside parentheses (Default: empty set)
max_buf_length (Optional[int]) : maximum buffer length (Default: 1000)
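The interaction of these four arguments can be sketched in pure Python. This is an illustration only, not kuzukiri's implementation: the function name `sketch_split` is hypothetical, and several behaviours are assumptions (the terminal character stays at the end of its segment, `max_buf_length` counts characters, and reaching it forces a cut).

```python
def sketch_split(text, terminals=None, parentheses=None, force=None,
                 max_buf_length=1000):
    """Hypothetical sketch of parenthesis-aware splitting (not the library code)."""
    if terminals is None:
        terminals = {"。", ".", "!", "?", "\n"}
    if parentheses is None:
        parentheses = {"「": "」", "『": "』", "(": ")", "[": "]", "【": "】"}
    if force is None:
        force = set()

    segments = []
    buf = []    # characters of the segment being built
    stack = []  # closing characters we are still waiting for
    for ch in text:
        buf.append(ch)
        if stack and ch == stack[-1]:
            stack.pop()                     # innermost parenthesis closed
        elif ch in parentheses:
            stack.append(parentheses[ch])   # parenthesis opened
        # Assumption: a forced terminal or a full buffer cuts unconditionally;
        # an ordinary terminal cuts only outside parentheses.
        cut = (ch in force
               or (ch in terminals and not stack)
               or len(buf) >= max_buf_length)
        if cut:
            segments.append("".join(buf))
            buf = []
            stack.clear()
    if buf:
        segments.append("".join(buf))       # trailing text without a terminal
    return segments


quoted = sketch_split("彼は「はい。そうです。」と言った。次の文。")
unforced = sketch_split("「一行目\n二行目」")
forced = sketch_split("「一行目\n二行目」", force={"\n"})
```

In this sketch, the terminals inside 「…」 do not split the first sentence, while passing `force={"\n"}` makes the newline split even though the parenthesis is still open.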
Methods
def split(self, text)
Execute text segmentation
Args
text (str) : target text
Returns
List[str] : list of segmented texts
def split_with_norm(self, text)
Execute text segmentation with normalization
After splitting, NFKC normalization and whitespace trimming are performed.
Args
text (str) : target text
Returns
List[str] : list of segmented texts
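The post-processing step described for split_with_norm can be approximated with the standard library. This is a minimal sketch, assuming normalization is applied to each segment before trimming; the helper name `normalize_segment` and the sample segments are hypothetical and are not part of kuzukiri.

```python
import unicodedata


def normalize_segment(segment):
    """Apply NFKC normalization, then strip surrounding whitespace."""
    return unicodedata.normalize("NFKC", segment).strip()


# Hypothetical segments, as a splitter might return them before normalization.
segments = ["  ＡＢＣ１２３！ ", "ｶﾀｶﾅ。\n"]
normalized = [normalize_segment(s) for s in segments]
```

NFKC folds full-width Latin letters and digits to their ASCII forms and half-width katakana to full-width katakana, which is why it is a common choice for cleaning Japanese text.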