Papers (2)

#sentence embeddings

No One-Size-Fits-All: Building Systems For Translation to Bashkir, Kazakh, Kyrgyz, Tatar and Chuvash Using Synthetic And Original Data

The paper compares several approaches to translating into five low-resource Turkic languages, rather than forcing a single method to fit all of them.

#low-resource machine translation #Turkic languages #NLLB-200

FiNERweb: Datasets and Artifacts for Scalable Multilingual Named Entity Recognition

Intermediate

Jonas Golde, Patrick Haller et al. · Dec 15 · arXiv

FiNERweb is a dataset-creation pipeline for training models to recognize named entities, such as people and places, across 91 languages and 25 writing systems.

#multilingual NER #named entity recognition #LLM supervision
