Tokenizing Balochi with Hugging Face’s Tokenizers and fastai/spaCy

From the blog of Alex Strick van Linschoten
In this post I want to walk through how I trained my first tokenizer(s) on a small Balochi-language corpus. I used the Hugging Face Tokenizers library and fastai / spaCy to get a sense of the interfaces involved. I also did some naive pre-processing to get the corpus into a format that the tokenizer could handle. I’m not sure if this is...
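To make the pre-processing idea concrete, here is a minimal sketch of what a naive cleaning pass could look like. It is not the cleaning used in this post: the function name `clean_lines`, the choice of NFC normalization, and the whitespace collapsing are all assumptions made for illustration, and the sample string is a neutral stand-in rather than real Balochi text.

```python
import re
import unicodedata


def clean_lines(raw_text):
    """Yield normalized, non-empty lines from raw corpus text.

    Hypothetical helper: the steps below are illustrative assumptions,
    not the pre-processing the post actually used.
    """
    for line in raw_text.splitlines():
        # Compose characters (NFC) so equivalent sequences compare equal.
        line = unicodedata.normalize("NFC", line)
        # Collapse runs of whitespace (tabs, repeated spaces) into one space.
        line = re.sub(r"\s+", " ", line).strip()
        # Skip lines that are empty after cleaning.
        if line:
            yield line


# Stand-in sample text with messy spacing and a blank line.
sample = "  hello   world \n\n\tsecond\tline  \n"
cleaned = list(clean_lines(sample))
```

A list of lines like `cleaned` is the kind of input a tokenizer trainer can consume; for example, the Hugging Face Tokenizers library provides `Tokenizer.train_from_iterator`, which accepts an iterator of strings.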