Trond Trosterud & Sjur Nørstebø Moshagen: An Open Source Infrastructure for Language Technology
An essential part of building a language technology
infrastructure "for the rest of us" is reuse and flexibility.
Most of the languages not covered by existing language technology
have no computer programs able to process text or speech in the
language. Making such programs is not an easy task. The dominant
languages combine an abundance of accessible text resources with
an impoverished inflectional structure.
Methodologies for these languages capitalise upon this, and use
statistical and list-based methods where the primitives are the
words (character strings between spaces). The majority of the
languages of the world are in the opposite situation, and combine
a paucity of textual resources with a rich morphological
structure, ranging from 8 forms of each noun in the Scandinavian
languages to more than a thousand forms of each verb in Finnish.
And on top of that comes the productive compounding of Northern
Europe.
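To make the contrast concrete, here is a minimal sketch of why a
word-list approach scales badly once morphology is productive. The
stems and suffix slots are invented for illustration; they are not
real Finnish or Sámi data, and this is not how the infrastructure
presented in the talk is implemented.

```python
from itertools import product

# Hypothetical suffix slots for a toy agglutinative language.
# Each word form picks exactly one suffix (possibly empty) per slot.
SLOTS = [
    ["", "ta", "ma"],        # slot 1: made-up derivation markers
    ["", "n", "t", "mme"],   # slot 2: made-up person markers
    ["", "ko", "pa"],        # slot 3: made-up clitics
]

def generate_forms(stem, slots=SLOTS):
    """Return every stem+suffix combination, one suffix per slot."""
    return [stem + "".join(combo) for combo in product(*slots)]

def full_list_size(stems, slots=SLOTS):
    """Entries a list-based method must store: one per word form."""
    forms_per_stem = 1
    for slot in slots:
        forms_per_stem *= len(slot)
    return len(stems) * forms_per_stem

def rule_based_size(stems, slots=SLOTS):
    """Entries a stem-plus-suffix description must store."""
    return len(stems) + sum(len(slot) for slot in slots)

stems = ["bana", "silu", "toro"]   # invented stems
forms = generate_forms("bana")

print(len(forms))                  # forms of a single stem
print(full_list_size(stems))       # grows multiplicatively
print(rule_based_size(stems))      # grows additively
```

Adding a fourth slot multiplies the size of the word list but only
adds a few entries to the stem-plus-suffix description, which is
the situation the paragraph above describes.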
We will present the infrastructure and tools used and developed by
the Divvun and Giellatekno groups at the University of Tromsø to
develop proofing tools, localisation, intelligent language
learning resources and other linguistic resources for almost 40
minority languages. The main strength of this infrastructure
lies in its portability and reusability. The resources available
for developing these tools are limited, and thus one cannot
afford to redo the same or similar work again and again in
different projects, the way it is often done for the largest
language communities in the world.
Summing up: Our approach makes language technology solutions for
complex or lesser-resourced languages possible, and our
open-source infrastructure makes it an efficient and doable
approach for the larger open-source community. Join in!
Download video: http://videos.fscons.org/fscons/2013/c362_-_2013-11-10_1515_-_an_open_source_infrastructure_for_language_technology_-_trond_trosterud_-_sjur_norstebo_moshagen_-_35.ogv