COLRC is an online database of words in the Coeur d'Alene language. It still holds countless documents, either scanned PDFs or images of old books, that need to be digitized so that language revitalization members can catalogue more words for the database and have access to these amazing resources.
Gladys Reichard collected dozens of stories from the Coeur d'Alene (CdA) tribe. She transcribed them both by hand and by typing them on a special keyboard made for the language's orthography. In every document, the CdA sentences appear above the English.
My work focused on conducting OCR on the typed documents.
The documents posed several challenges: they were poorly scanned, and they contained typewriter mistakes and noise.
All data used were in CdA. Each sample paired an image of a single line of text with a separate file containing my best guess at the transcription.
350 actual screenshots from the various documents with accompanying transcriptions.
I measured the Zipf score and found it was high, at around 1.23 (calculations done in survey_data.py). However, because some characters are rare, it was difficult to capture every character in the data, though I tried my best.
hä xuxʷiýä tsɩtsɩḿi'ĺt kuḿ ɫa la'ʷ
stu'ᵘshä'pmät. hɔi ɫä häpi'lumxʷ äku'stus xuic
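The Zipf score mentioned above can be read as the slope of a rank-frequency curve on log-log axes. The sketch below shows one common way to estimate it, a least-squares fit over token counts. This is an assumption about the method and is not taken from survey_data.py; the function name `zipf_exponent` is illustrative.

```python
import math
from collections import Counter


def zipf_exponent(tokens):
    """Estimate the Zipf exponent s, assuming frequency ~ C / rank**s.

    Fits a least-squares line to log(frequency) against log(rank)
    and returns the negated slope. Illustrative only: the original
    calculation in survey_data.py may differ.
    """
    # Frequencies sorted from most to least common; index + 1 is the rank.
    counts = sorted(Counter(tokens).values(), reverse=True)
    if len(counts) < 2:
        raise ValueError("need at least two distinct tokens")
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(freq) for freq in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return -cov / var
```

Passing `text.split()` gives a word-level estimate, while passing the individual characters of the text gives a character-level one. Which of the two the reported 1.23 refers to is not stated here.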
1,000 'fake' (synthetic) samples
Because I was a one-woman team, I did not have time to collect a significant number of ground-truth pairs. I therefore decided to create synthetic data using various Python scripts and packages found here
I used two different types of models: TrOCR and Tesseract.
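As a rough illustration of the text side of synthetic data, the sketch below resamples words from real transcriptions and favours words that contain rare characters. It is a hypothetical example and not the scripts referred to above: the function name `sample_synthetic_lines`, the weighting scheme, and the default line length are all assumptions. Rendering each line to an image in a typewriter-style font would be a separate step that is not shown.

```python
import random
from collections import Counter


def sample_synthetic_lines(real_lines, n_lines, words_per_line=6, seed=0):
    """Build synthetic text lines by resampling words from real transcriptions.

    Words containing rare characters are drawn more often, so that
    infrequent characters show up more in the synthetic set.
    Hypothetical helper, not the original project code.
    """
    rng = random.Random(seed)
    words = [w for line in real_lines for w in line.split()]
    if not words:
        raise ValueError("no words to sample from")
    char_counts = Counter(ch for w in words for ch in w)
    # Weight each word by the inverse count of its rarest character.
    weights = [1.0 / min(char_counts[ch] for ch in w) for w in words]
    return [
        " ".join(rng.choices(words, weights=weights, k=words_per_line))
        for _ in range(n_lines)
    ]
```

Random word order does not produce grammatical CdA, so lines like these would only help a model learn character shapes, not language structure.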
Tesseract is an open-source OCR (Optical Character Recognition) engine built on a Long Short-Term Memory (LSTM) neural network, which is a special type of Recurrent Neural Network. It accepts an image (PNG, JPG) and outputs the text found in that image. Currently, Tesseract supports 100+ languages, including some Indigenous languages.
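A minimal sketch of running Tesseract on one line image from Python is shown below. It assumes the `tesseract` command-line tool is installed and on the PATH. The helper names are hypothetical, and `eng` is only a placeholder for whichever language model is actually used.

```python
import subprocess


def build_tesseract_command(image_path, lang="eng", psm=7):
    """Return the argument list for a single-image Tesseract run.

    Using `stdout` as the output base makes Tesseract print the text
    instead of writing a file; --psm 7 treats the image as one text line.
    """
    return ["tesseract", str(image_path), "stdout", "-l", lang, "--psm", str(psm)]


def ocr_line(image_path, lang="eng", psm=7):
    """Run Tesseract on one line image and return the recognised text."""
    result = subprocess.run(
        build_tesseract_command(image_path, lang, psm),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if Tesseract fails
    )
    return result.stdout.strip()
```

Page segmentation mode 7 is chosen here because each sample in this project is a single line of text; full-page scans would need a different mode.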
Tesstrain is the official training framework/toolkit for creating and fine-tuning Tesseract OCR models. It allows you to train custom models for specific fonts, languages, or document styles that are not well covered by the default Tesseract models.
Training takes ground-truth data (image + text pairs) as input, which teach the model what characters look like in your specific context.
Tesstrain allows fine-tuning from an existing pre-trained model (transfer learning) rather than training from scratch, which saves significant time and data.
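Before training, it can help to check that every line image has a transcription. The sketch below assumes the tesstrain naming convention of `name.png` next to `name.gt.txt` in one ground-truth directory; the function name is hypothetical and only plain `.png` images are handled.

```python
from pathlib import Path


def collect_ground_truth_pairs(ground_truth_dir):
    """Return (image_path, transcription) pairs and a list of unpaired images.

    Assumes each `name.png` should sit next to a `name.gt.txt` file
    holding the transcription of that single line.
    """
    pairs, missing = [], []
    for image_path in sorted(Path(ground_truth_dir).glob("*.png")):
        text_path = image_path.with_suffix(".gt.txt")
        if text_path.exists():
            # UTF-8 matters here: the orthography uses many non-ASCII characters.
            text = text_path.read_text(encoding="utf-8").strip()
            pairs.append((image_path, text))
        else:
            missing.append(image_path)
    return pairs, missing
```

Any paths returned in `missing` are images that still need a transcription before they can be used for training.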
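Fine-tuning with tesstrain is driven by `make` variables. The sketch below builds such a command from Python; the helper name, the default iteration count, and any model name passed in (for example `cda`) are assumptions for illustration, not the settings used in this project.

```python
def build_tesstrain_command(model_name, start_model, tessdata_dir, max_iterations=10000):
    """Return a `make training` invocation for tesstrain fine-tuning.

    START_MODEL names the pre-trained model to continue from; leaving it
    out would make tesstrain train from scratch instead.
    """
    return [
        "make",
        "training",
        f"MODEL_NAME={model_name}",        # name of the new model
        f"START_MODEL={start_model}",      # existing model to fine-tune
        f"TESSDATA={tessdata_dir}",        # folder holding the start model
        f"MAX_ITERATIONS={max_iterations}",
    ]
```

The command would be run from inside a checkout of the tesstrain repository, for example with `subprocess.run(cmd, cwd="tesstrain", check=True)`. By default tesstrain looks for the ground-truth pairs in `data/MODEL_NAME-ground-truth`.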