ChonjukChapterParser Class Documentation
Overview
The ChonjukChapterParser class is responsible for extracting chapter annotations from Tibetan text data that contains specific chapter markers.
Input Data Format
chX-"Chapter Title" Chapter Text
X represents the chapter number.
Chapter Title is the title of the chapter in double quotes.
Chapter Text is the body of the chapter.
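As an illustration of the marker format above, here is a minimal sketch of a regular expression that recognizes chX-"Chapter Title" markers. The pattern and the name CHAPTER_MARKER are assumptions made for this example; the class itself may detect markers differently.

```python
import re

# Hypothetical pattern for markers of the form chX-"Chapter Title".
# It is an assumption for illustration, not the pattern used by the class.
CHAPTER_MARKER = re.compile(r'ch(?P<number>\d+)-"(?P<title>[^"]*)"')

sample = 'ch1-"First Title" body one ch2-"Second Title" body two'

# Collect (chapter number, chapter title) pairs in order of appearance.
markers = [(m.group("number"), m.group("title")) for m in CHAPTER_MARKER.finditer(sample)]
print(markers)
```

The chapter number is kept as a string here, matching the string values shown for chapter_number in the example annotations.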
Class Methods
__init__(self)
Initializes the ChonjukChapterParser instance.
Sets up the configuration needed for parsing chapter annotations.
get_updated_text(self, text: str) -> str
Cleans the input text by removing chapter markers.
Returns the cleaned text.
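The cleaning step can be sketched roughly as follows. The helper strip_chapter_markers and its pattern are hypothetical; in particular, dropping the single space that follows a marker is an assumption, and the real get_updated_text may treat whitespace differently.

```python
import re

# Hypothetical marker pattern; the optional trailing space is removed together
# with the marker (an assumption about whitespace handling).
MARKER = re.compile(r'ch\d+-"[^"]*" ?')


def strip_chapter_markers(text: str) -> str:
    """Return the text with every chX-"Title" marker removed."""
    return MARKER.sub("", text)


cleaned = strip_chapter_markers('intro ch1-"Title" body')
print(cleaned)
```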
get_annotations(self, text: str) -> List[Dict]
Extracts chapter annotations from the input text.
Computes the updated annotation spans after the chapter markers are removed.
Returns a list of chapter annotations.
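One way to obtain spans that refer to the cleaned text is to track how many marker characters have been removed so far and shift each offset by that amount. The sketch below assumes this approach; chapter_spans and MARKER are hypothetical names, and only the dictionary keys are taken from the example annotations in this document.

```python
import re
from typing import Dict, List

# Hypothetical marker pattern, including one optional trailing space.
MARKER = re.compile(r'ch(\d+)-"([^"]*)" ?')


def chapter_spans(text: str) -> List[Dict]:
    """Return chapter annotations whose spans index the marker-free text."""
    annotations = []
    removed = 0  # number of marker characters dropped so far
    matches = list(MARKER.finditer(text))
    for i, match in enumerate(matches):
        removed += match.end() - match.start()
        # The chapter body begins right after its marker.
        start = match.end() - removed
        # It runs until the next marker, or to the end of the text.
        next_start = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        end = next_start - removed
        annotations.append(
            {
                "chapter_number": match.group(1),
                "chapter_title": match.group(2),
                "Chapter": {"start": start, "end": end},
            }
        )
    return annotations


raw = 'ab ch1-"T" xyz ch2-"U" pq'
spans = chapter_spans(raw)
cleaned = MARKER.sub("", raw)
print(spans)
```

Slicing the cleaned text with a span, for example cleaned[start:end], then yields the body of that chapter.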
parse(self, input: str, output_path: Path = PECHAS_PATH, metadata: Union[Dict, Path] = None)
Extracts chapter annotations from the text.
Instantiates the Pecha class and saves the chapter annotations to the output path.
Example Usage
Here is an example of how to use the ChonjukChapterParser to parse text and extract chapter annotations.
from pathlib import Path
# Initialize the parser
parser = ChonjukChapterParser()
# Example input text
input_text = '''
རྒྱ་གར་སྐད་དུ། བོ་དྷི་སཏྭ་ཙརྱ་ཨ་བ་ཏཱ་ར།
བོད་སྐད་དུ། བྱང་ཆུབ་སེམས་དཔའི་སྤྱོད་པ་ལ་འཇུག་པ།
ch1-"བྱང་ཆུབ་སེམས་ཀྱི་ཕན་ཡོན་བཤད་པ།" བདེ་གཤེགས་ཆོས་ཀྱི་སྐུ་མངའ་སྲས་བཅས་དང་། །
ཕྱག་འོས་ཀུན་ལའང་གུས་པར་ཕྱག་འཚལ་ཏེ། །
ch2-"སྡིག་པ་བཤགས་པ།" དགེ་བ་བསྒོམ་ཕྱིར་བདག་གི་དད་པའི་ཤུགས། །
'''
# Parse the input text and save to an output path
parser.parse(input_text, output_path=Path("/path/to/output"))
After running the above code, the chapter annotations are extracted from the input text and saved to the specified output path. The annotations attribute of parser would look like this:
assert parser.annotations == [
{
"chapter_number": "1",
"chapter_title": "བྱང་ཆུབ་སེམས་ཀྱི་ཕན་ཡོན་བཤད་པ།",
"Chapter": {"start": 145, "end": 446},
},
{
"chapter_number": "2",
"chapter_title": "སྡིག་པ་བཤགས་པ།",
"Chapter": {"start": 449, "end": 896},
},
]
The file structure on the output path would look like this:
- output_path(dir)
  - I00B6F749(dir)
    - base(dir)
      - da0c.txt
    - layers(dir)
      - da0c
        - Chapter-123.json
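To locate the saved files afterwards, a directory walk along the following lines may help. This is a sketch that assumes the layout shown above (a base text under base and layer files under layers); find_pecha_files is a hypothetical helper, not part of the library, and the temporary directory only mimics the layout so the example runs on its own.

```python
import tempfile
from pathlib import Path


def find_pecha_files(output_path: Path):
    """Return (base text files, layer JSON files) found under output_path."""
    base_files = sorted(output_path.glob("*/base/*.txt"))
    layer_files = sorted(output_path.glob("*/layers/*/*.json"))
    return base_files, layer_files


# Recreate the example layout with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "I00B6F749" / "base").mkdir(parents=True)
    (root / "I00B6F749" / "layers" / "da0c").mkdir(parents=True)
    (root / "I00B6F749" / "base" / "da0c.txt").write_text("", encoding="utf-8")
    (root / "I00B6F749" / "layers" / "da0c" / "Chapter-123.json").write_text("{}", encoding="utf-8")

    base_files, layer_files = find_pecha_files(root)
    base_names = [p.name for p in base_files]
    layer_names = [p.name for p in layer_files]

print(base_names, layer_names)
```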