Tutorials

A Story of Parsing, Annotating, and Serializing Tibetan Text

Let’s follow a story of how we can process a Tibetan text through our pipeline. We’ll use a simple example of a Tibetan verse with its translation.

Our Sample Data

Let’s say we have this Tibetan text with its English translation:

བདེ་གཤེགས་སྤྱན་རས་གཟིགས་དབང་ཕྱུག་ལ་ཕྱག་འཚལ་ལོ། །
I pay homage to the Lord Avalokiteśvara.

དེ་ཡི་མཚན་ཉིད་རྣམས་ནི་མཐོང་བ་མེད། །
His characteristics cannot be seen.

དེ་ཡི་སྐུ་ནི་མཐོང་བ་མེད། །
His body cannot be seen.

དེ་ཡི་ཡི་གེ་ནི་མཐོང་བ་མེད། །
His letters cannot be seen.

Chapter 1: The Parser’s Tale

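Our parser’s job is to break this text into meaningful segments and record where each one sits in the source. It assumes the layout shown above: each verse is one Tibetan line followed by one English line, and verses are separated by a blank line. Let’s create a parser that understands Tibetan verses: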

from typing import List, Dict, Any
from openpecha.pecha import Pecha
from openpecha.pecha.annotations import AnnotationModel, AnnotationType

class TibetanVerseParser:
    def __init__(self):
        self.segments = []
        self.current_position = 0
    
    def parse(self, text: str) -> List[Dict[str, Any]]:
        """
        Parse Tibetan text into verses and their translations.

        Offsets are character (code point) positions in `text`.
        """
        # Reset state so that repeated calls do not accumulate segments
        self.segments = []
        self.current_position = 0

        # Split by double newlines to separate verses
        verses = text.split('\n\n')

        for verse in verses:
            block = verse.strip()
            # Locate the verse in the source instead of assuming separator widths
            block_start = text.find(block, self.current_position)

            # Split into Tibetan and English
            lines = block.split('\n')
            if len(lines) >= 2:
                cursor = block_start
                for line, segment_type in zip(lines[:2], ('tibetan', 'translation')):
                    content = line.strip()
                    start = text.find(content, cursor)
                    end = start + len(content)
                    self.segments.append({
                        'text': content,
                        'start': start,
                        'end': end,
                        'type': segment_type
                    })
                    cursor = end

            # Move past this verse even when it was skipped, so offsets never drift
            self.current_position = block_start + len(block)

        return self.segments

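# Let's try our parser on the sample data from above
our_tibetan_text = (
    "བདེ་གཤེགས་སྤྱན་རས་གཟིགས་དབང་ཕྱུག་ལ་ཕྱག་འཚལ་ལོ། །\n"
    "I pay homage to the Lord Avalokiteśvara.\n\n"
    "དེ་ཡི་མཚན་ཉིད་རྣམས་ནི་མཐོང་བ་མེད། །\n"
    "His characteristics cannot be seen.\n\n"
    "དེ་ཡི་སྐུ་ནི་མཐོང་བ་མེད། །\n"
    "His body cannot be seen.\n\n"
    "དེ་ཡི་ཡི་གེ་ནི་མཐོང་བ་མེད། །\n"
    "His letters cannot be seen."
)
parser = TibetanVerseParser()
segments = parser.parse(our_tibetan_text)
print("Parsed segments:", segments)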
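A quick way to sanity-check the parser's output is to slice the source text with each segment's offsets and compare the result with the stored text. The sketch below assumes segments shaped like the dictionaries built above; check_offsets is a hypothetical helper name, not part of openpecha.

```python
from typing import Any, Dict, List


def check_offsets(text: str, segments: List[Dict[str, Any]]) -> List[int]:
    """Return the indexes of segments whose offsets do not match their text."""
    mismatched = []
    for index, segment in enumerate(segments):
        # Python slices strings by code point, the same unit len() counts
        if text[segment["start"]:segment["end"]] != segment["text"]:
            mismatched.append(index)
    return mismatched


# One verse from the sample data, with offsets derived from len()
tibetan = "དེ་ཡི་སྐུ་ནི་མཐོང་བ་མེད། །"
english = "His body cannot be seen."
sample_text = tibetan + "\n" + english
english_start = len(tibetan) + 1  # +1 for the newline between the two lines
sample_segments = [
    {"text": tibetan, "start": 0, "end": len(tibetan), "type": "tibetan"},
    {
        "text": english,
        "start": english_start,
        "end": english_start + len(english),
        "type": "translation",
    },
]

print(check_offsets(sample_text, sample_segments))  # [] when every offset matches
```

Running a check like this right after parsing catches offset drift before any annotation is built on top of it.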

Chapter 2: The Annotation Adventure

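Now that we have our segments, let’s add annotations to mark them as Tibetan verses and translations. Each annotation pairs a text selector, which says where the segment sits in the base text, with annotation data, which says what kind of segment it is: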

def create_verse_annotations(pecha: Pecha, segments: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """
    Create annotations for Tibetan verses and their translations.

    Returns plain dictionaries, one per segment, in the order given.
    """
    annotations = []
    
    for i, segment in enumerate(segments):
        # Create text selector
        text_selector = {
            "@type": "TextSelector",
            "resource": "base",
            "offset": {
                "@type": "Offset",
                "begin": {
                    "@type": "BeginAlignedCursor",
                    "value": segment['start']
                },
                "end": {
                    "@type": "BeginAlignedCursor",
                    "value": segment['end']
                }
            }
        }
        
        # Create annotation data
        annotation_data = {
            "@type": "AnnotationData",
            "@id": f"verse_{i}",
            "key": "verse_type",
            "value": {
                "@type": "String",
                "value": segment['type']
            }
        }
        
        # Create the annotation
        annotation = {
            "@type": "Annotation",
            "@id": f"ann_{i}",
            "target": text_selector,
            "data": [annotation_data]
        }
        
        annotations.append(annotation)
    
    return annotations

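# Create annotations
# `pecha` is assumed to be an existing Pecha instance; the function above
# only reads `segments` and does not access the pecha itself
annotations = create_verse_annotations(pecha, segments)
print("Created annotations:", annotations)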
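The begin and end values in each selector are only useful if a consumer can turn them back into text. The sketch below shows one way to do that for annotation dictionaries shaped like the ones above. It assumes the offsets count Unicode code points, which is the unit Python's len() and string slicing use; selected_text is a hypothetical helper, not an openpecha function.

```python
from typing import Any, Dict


def selected_text(base_text: str, annotation: Dict[str, Any]) -> str:
    """Return the span of base_text that the annotation's selector points at."""
    offset = annotation["target"]["offset"]
    return base_text[offset["begin"]["value"]:offset["end"]["value"]]


# A single Tibetan stack is several code points:
# sa + subjoined pa + subjoined ya, followed by the letter na
stack = "\u0f66\u0fa4\u0fb1\u0f53"
print(len(stack))                  # 4 code points
print(len(stack.encode("utf-8")))  # 12 bytes, so byte offsets would differ

base_text = stack + "\nsecond line"
annotation = {
    "@type": "Annotation",
    "@id": "ann_demo",
    "target": {
        "@type": "TextSelector",
        "resource": "base",
        "offset": {
            "@type": "Offset",
            "begin": {"@type": "BeginAlignedCursor", "value": len(stack) + 1},
            "end": {"@type": "BeginAlignedCursor", "value": len(base_text)},
        },
    },
    "data": [],
}
print(selected_text(base_text, annotation))
```

Because Tibetan syllables are built from stacked code points, counting bytes or visible glyphs instead would give different numbers and point at the wrong span.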

Chapter 3: The Serializer’s Journey

Finally, let’s create a serializer to package everything together:

class TibetanVerseSerializer:
    def __init__(self):
        self.annotation_store = {
            "@type": "AnnotationStore",
            "@id": "tibetan_verse_store",
            "resources": [
                {
                    "@type": "TextResource",
                    "@id": "base",
                    "@include": "verses.txt"
                }
            ],
            "annotationsets": [
                {
                    "@type": "AnnotationDataSet",
                    "@id": "verse_annotation",
                    "keys": [
                        {
                            "@type": "DataKey",
                            "@id": "verse_type"
                        }
                    ],
                    "data": []
                }
            ],
            "annotations": []
        }
    
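    def serialize(self, pecha: Pecha, annotations: List[Dict[str, Any]]) -> Dict[str, Any]:
        """
        Serialize the pecha and its annotations.

        Each annotation's data is stored once in the dataset and referenced
        from the annotation by "@id" and "set", matching the final output.
        """
        dataset = self.annotation_store["annotationsets"][0]
        # Start from empty lists so repeated calls do not duplicate entries
        dataset["data"] = []
        self.annotation_store["annotations"] = []

        for annotation in annotations:
            references = []
            for data in annotation["data"]:
                # Store the full annotation data in the dataset
                dataset["data"].append(data)
                # Keep only a reference to it inside the annotation
                references.append({
                    "@type": "AnnotationData",
                    "@id": data["@id"],
                    "set": dataset["@id"]
                })
            # Copy the annotation so the caller's list is left unchanged
            self.annotation_store["annotations"].append({**annotation, "data": references})

        return self.annotation_store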

# Let's serialize our data
serializer = TibetanVerseSerializer()
serialized_data = serializer.serialize(pecha, annotations)

# Save the serialized data
import json
with open('tibetan_verses.json', 'w', encoding='utf-8') as f:
    json.dump(serialized_data, f, ensure_ascii=False, indent=2)
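The store above points at verses.txt through "@include", so the base text has to exist next to the JSON file for that reference to resolve. Here is a minimal sketch of writing both files side by side, assuming the base text is exactly the string that was parsed. save_store is a hypothetical helper; the file names are the ones used above.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict


def save_store(store: Dict[str, Any], base_text: str, directory: Path) -> Path:
    """Write the base text and the annotation store into the same directory."""
    directory.mkdir(parents=True, exist_ok=True)

    # The file name comes from the store itself, so the two cannot disagree
    base_path = directory / store["resources"][0]["@include"]
    # newline="" writes "\n" untouched; translating it would shift every offset
    with open(base_path, "w", encoding="utf-8", newline="") as f:
        f.write(base_text)

    json_path = directory / "tibetan_verses.json"
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(store, f, ensure_ascii=False, indent=2)
    return json_path


# Small stand-in store; a real one would come from the serializer above
demo_store = {
    "@type": "AnnotationStore",
    "@id": "tibetan_verse_store",
    "resources": [
        {"@type": "TextResource", "@id": "base", "@include": "verses.txt"}
    ],
    "annotations": [],
}

with tempfile.TemporaryDirectory() as tmp:
    json_path = save_store(demo_store, "line one\nline two", Path(tmp))
    reloaded = json.loads(json_path.read_text(encoding="utf-8"))
    base = (Path(tmp) / "verses.txt").read_text(encoding="utf-8")

print(reloaded == demo_store, base == "line one\nline two")
```

Writing the base text from the same string that was parsed keeps the stored offsets valid for the file on disk.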

The Final Output

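After running our pipeline, we get a JSON file that looks like this. The listing is abridged: the lines beginning with "// ..." stand in for omitted entries and do not appear in the real file, and the exact offset values depend on the length of each line in code points: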

{
  "@type": "AnnotationStore",
  "@id": "tibetan_verse_store",
  "resources": [
    {
      "@type": "TextResource",
      "@id": "base",
      "@include": "verses.txt"
    }
  ],
  "annotationsets": [
    {
      "@type": "AnnotationDataSet",
      "@id": "verse_annotation",
      "keys": [
        {
          "@type": "DataKey",
          "@id": "verse_type"
        }
      ],
      "data": [
        {
          "@type": "AnnotationData",
          "@id": "verse_0",
          "key": "verse_type",
          "value": {
            "@type": "String",
            "value": "tibetan"
          }
        },
        {
          "@type": "AnnotationData",
          "@id": "verse_1",
          "key": "verse_type",
          "value": {
            "@type": "String",
            "value": "translation"
          }
        }
        // ... more annotation data ...
      ]
    }
  ],
  "annotations": [
    {
      "@type": "Annotation",
      "@id": "ann_0",
      "target": {
        "@type": "TextSelector",
        "resource": "base",
        "offset": {
          "@type": "Offset",
          "begin": {
            "@type": "BeginAlignedCursor",
            "value": 0
          },
          "end": {
            "@type": "BeginAlignedCursor",
            "value": 45
          }
        }
      },
      "data": [
        {
          "@type": "AnnotationData",
          "@id": "verse_0",
          "set": "verse_annotation"
        }
      ]
    }
    // ... more annotations ...
  ]
}

Epilogue: What We’ve Learned

In this story, we’ve seen how to:

Parse Tibetan text into meaningful segments
Add annotations to mark different types of content
Serialize everything into a structured format

The resulting JSON file can be used by other tools to:

Display the text with proper formatting
Extract specific types of content
Perform analysis on the text
Create translations or other derived works
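As an illustration of extracting specific types of content, the sketch below reads a store shaped like the final output and returns the text of every segment with a given verse_type. It assumes annotations refer to their data by "@id" and "set" as in that output, and it falls back to inline data when no matching entry is found. texts_of_type and the two make_* helpers are hypothetical names introduced for this example.

```python
from typing import Any, Dict, List


def texts_of_type(store: Dict[str, Any], base_text: str, verse_type: str) -> List[str]:
    """Collect the text of every annotation whose verse_type matches."""
    # Index each AnnotationData entry by (dataset id, data id)
    data_index = {}
    for dataset in store["annotationsets"]:
        for data in dataset["data"]:
            data_index[(dataset["@id"], data["@id"])] = data

    results = []
    for annotation in store["annotations"]:
        for ref in annotation["data"]:
            # Resolve the reference; use the entry itself if it is inline data
            data = data_index.get((ref.get("set"), ref["@id"]), ref)
            value = data.get("value", {}).get("value")
            if data.get("key") == "verse_type" and value == verse_type:
                offset = annotation["target"]["offset"]
                results.append(base_text[offset["begin"]["value"]:offset["end"]["value"]])
    return results


def make_data(index: int, verse_type: str) -> Dict[str, Any]:
    return {
        "@type": "AnnotationData",
        "@id": f"verse_{index}",
        "key": "verse_type",
        "value": {"@type": "String", "value": verse_type},
    }


def make_annotation(index: int, begin: int, end: int) -> Dict[str, Any]:
    return {
        "@type": "Annotation",
        "@id": f"ann_{index}",
        "target": {
            "@type": "TextSelector",
            "resource": "base",
            "offset": {
                "@type": "Offset",
                "begin": {"@type": "BeginAlignedCursor", "value": begin},
                "end": {"@type": "BeginAlignedCursor", "value": end},
            },
        },
        "data": [{"@type": "AnnotationData", "@id": f"verse_{index}", "set": "verse_annotation"}],
    }


# One verse from the sample data
tibetan = "དེ་ཡི་ཡི་གེ་ནི་མཐོང་བ་མེད། །"
english = "His letters cannot be seen."
base_text = tibetan + "\n" + english

store = {
    "@type": "AnnotationStore",
    "annotationsets": [
        {
            "@type": "AnnotationDataSet",
            "@id": "verse_annotation",
            "data": [make_data(0, "tibetan"), make_data(1, "translation")],
        }
    ],
    "annotations": [
        make_annotation(0, 0, len(tibetan)),
        make_annotation(1, len(tibetan) + 1, len(base_text)),
    ],
}

print(texts_of_type(store, base_text, "translation"))
```

The same lookup works for any other value stored under verse_type, so new segment types need no changes to the reader.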

Remember that this is just one way to process Tibetan text. You can extend this pipeline to handle more complex cases, such as:

Multiple translations
Commentary layers
Cross-references
Metadata about the text
And much more!
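As one example of such an extension, multiple translations could be handled by treating every line after the first in a verse block as a further translation. This is only a sketch under that assumed layout: parse_with_translations and the translation_index field are hypothetical, and the second English line in the sample is a placeholder, not a real alternative translation.

```python
from typing import Any, Dict, List


def parse_with_translations(text: str) -> List[Dict[str, Any]]:
    """Parse blocks of one Tibetan line followed by any number of translations."""
    segments: List[Dict[str, Any]] = []
    position = 0
    for block in text.split("\n\n"):
        # Drop blank lines so that only real content advances the position
        lines = [line.strip() for line in block.split("\n") if line.strip()]
        for line_index, content in enumerate(lines):
            # Only whitespace lies between position and the next line's content
            start = text.find(content, position)
            end = start + len(content)
            segment: Dict[str, Any] = {
                "text": content,
                "start": start,
                "end": end,
                "type": "tibetan" if line_index == 0 else "translation",
            }
            if line_index > 0:
                # Number the translations of a verse starting from 0
                segment["translation_index"] = line_index - 1
            segments.append(segment)
            position = end
    return segments


sample = (
    "དེ་ཡི་སྐུ་ནི་མཐོང་བ་མེད། །\n"
    "His body cannot be seen.\n"
    "Placeholder for a second translation.\n\n"
    "དེ་ཡི་ཡི་གེ་ནི་མཐོང་བ་མེད། །\n"
    "His letters cannot be seen."
)

multi_segments = parse_with_translations(sample)
for segment in multi_segments:
    print(segment["type"], segment.get("translation_index"), segment["start"], segment["end"])
```

The annotation step could then carry translation_index as a second piece of annotation data, leaving the verse_type key unchanged.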