Studying colexification through massively parallel corpora
Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
ORCID iD: 0000-0002-6027-4156
2016 (English)
In: The Lexical Typology of Semantic Shifts / [ed] Päivi Juvonen, Maria Koptjevskaja-Tamm, Berlin: Walter de Gruyter, 2016, p. 157-176
Chapter in book (Refereed)
Abstract [en]

Large-sample studies in lexical typology are limited by whatever lexical information is available or can be obtained for all the languages in the study. Various types of word lists, from simple Swadesh lists to large dictionaries, can be used for this purpose. Unfortunately, these resources often present only a very fragmentary view of a given language’s vocabulary. As a complement, we propose an additional source of lexical information: parallel texts. Books such as the New Testament have been translated into thousands of languages, and it is possible to automatically extract word lists from their vocabulary, which can then be applied to lexical typological studies. In particular, we focus on studying colexification using a sample of 1 001 different languages, based on 1 142 translations of the New Testament. We find that although the automatically extracted word lists contain errors, their quality can be sufficiently good to find real areal patterns, such as the ‘tree’/‘fire’ colexification that is widespread in the Sahul area.

Place, publisher, year, edition, pages
Berlin: Walter de Gruyter, 2016. p. 157-176
Keywords [en]
colexification, lexical typology, word alignment, parallel texts, multilingual NLP
National Category
Language Technology (Computational Linguistics); General Language Studies and Linguistics
Research subject
Linguistics; Computational Linguistics
Identifiers
URN: urn:nbn:se:su:diva-159765
DOI: 10.1515/9783110377675-006
ISBN: 9783110377521 (print)
OAI: oai:DiVA.org:su-159765
DiVA, id: diva2:1245613
Available from: 2018-09-05 Created: 2018-09-05 Last updated: 2018-09-06
Bibliographically approved

Author
Östling, Robert