Change search
Refine search result
1 - 39 of 39
CiteExportLink to result list
Permanent link
Cite
Citation style
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf
Rows per page
  • 5
  • 10
  • 20
  • 50
  • 100
  • 250
Sort
  • Standard (Relevance)
  • Author A-Ö
  • Author Ö-A
  • Title A-Ö
  • Title Ö-A
  • Publication type A-Ö
  • Publication type Ö-A
  • Issued (Oldest first)
  • Issued (Newest first)
  • Created (Oldest first)
  • Created (Newest first)
  • Last updated (Oldest first)
  • Last updated (Newest first)
  • Disputation date (earliest first)
  • Disputation date (latest first)
  • Standard (Relevance)
  • Author A-Ö
  • Author Ö-A
  • Title A-Ö
  • Title Ö-A
  • Publication type A-Ö
  • Publication type Ö-A
  • Issued (Oldest first)
  • Issued (Newest first)
  • Created (Oldest first)
  • Created (Newest first)
  • Last updated (Oldest first)
  • Last updated (Newest first)
  • Disputation date (earliest first)
  • Disputation date (latest first)
Select
The maximal number of hits you can export is 250. When you want to export more records please use the Create feeds function.
  • 1. Berggren, Max
    et al.
    Karlgren, Jussi
    Östling, Robert
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Parkvall, Mikael
    Stockholm University, Faculty of Humanities, Department of Linguistics, General Linguistics.
    Inferring the location of authors from words in their texts2015In: Proceedings of the 20th Nordic Conference of Computational Linguistics: NODALIDA 2015 / [ed] Beáta Megyesi, Linköping: Linköping University Electronic Press, ACL Anthology , 2015, p. 211-218Conference paper (Refereed)
    Abstract [en]

    For the purposes of computational dialectology or other geographically bound text analysis tasks, texts must be annotated with their or their authors' location. Many texts are locatable but most have no ex- plicit annotation of place. This paper describes a series of experiments to determine how positionally annotated microblog posts can be used to learn location indicating words which then can be used to locate blog texts and their authors. A Gaussian distribution is used to model the locational qualities of words. We introduce the notion of placeness to describe how locational words are.

    We find that modelling word distributions to account for several locations and thus several Gaussian distributions per word, defining a filter which picks out words with high placeness based on their local distributional context, and aggregating locational information in a centroid for each text gives the most useful results. The results are applied to data in the Swedish language.

  • 2. Bjerva, Johannes
    et al.
    Grigonyte, Gintare
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Östling, Robert
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Plank, Barbara
    Neural Networks and Spelling Features for Native Language Identification2017In: The Twelfth Workshop on Innovative Use of NLP for Building Educational Applications: Proceedings of the Workshop, Association for Computational Linguistics, 2017, p. 235-239Conference paper (Refereed)
    Abstract [en]

    We present the RUG-SU team's submission at the Native Language Identification Shared Task 2017. We combine several approaches into an ensemble, based on spelling error features, a simple neural network using word representations, a deep residual network using word and character features, and a system based on a recurrent neural network. Our best system is an ensemble of neural networks, reaching an F1 score of 0.8323. Although our system is not the highest ranking one, we do outperform the baseline by far.

  • 3. Bjerva, Johannes
    et al.
    Östling, Robert
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Cross-lingual Learning of Semantic Textual Similarity with Multilingual Word Representations2017In: Proceedings of the 21st Nordic Conference on Computational Linguistics / [ed] Jörg Tiedemann, Linköping: Linköping University Electronic Press, 2017, p. 211-215, article id 024Conference paper (Refereed)
    Abstract [en]

    Assessing the semantic similarity between sentences in different languages is challenging. We approach this problem by leveraging multilingual distributional word representations, where similar words in different languages are close to each other. The availability of parallel data allows us to train such representations on a large amount of languages. This allows us to leverage semantic similarity data for languages for which no such data exists. We train and evaluate on five language pairs, including English, Spanish, and Arabic. We are able to train wellperforming systems for several language pairs, without any labelled data for that language pair.

  • 4. Bjerva, Johannes
    et al.
    Östling, Robert
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Han Veiga, Maria
    Tiedemann, Jörg
    Augenstein, Isabelle
    What Do Language Representations Really Represent?2019In: Computational linguistics - Association for Computational Linguistics (Print), ISSN 0891-2017, E-ISSN 1530-9312, Vol. 45, no 2, p. 381-389Article in journal (Refereed)
    Abstract [en]

    A neural language model trained on a text corpus can be used to induce distributed representations of words, such that similar words end up with similar representations. If the corpus is multilingual, the same model can be used to learn distributed representations of languages, such that similar languages end up with similar representations. We show that this holds even when the multilingual corpus has been translated into English, by picking up the faint signal left by the source languages. However, just as it is a thorny problem to separate semantic from syntactic similarity in word representations, it is not obvious what type of similarity is captured by language representations. We investigate correlations and causal relationships between language representations learned from translations on one hand, and genetic, geographical, and several levels of structural similarity between languages on the other. Of these, structural similarity is found to correlate most strongly with language representation similarity, whereas genetic relationships—a convenient benchmark used for evaluation in previous work—appears to be a confounding factor. Apart from implications about translation effects, we see this more generally as a case where NLP and linguistic typology can interact and benefit one another.

  • 5.
    Börstell, Carl
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Sign Language.
    Hörberg, Thomas
    Stockholm University, Faculty of Humanities, Department of Linguistics, General Linguistics.
    Östling, Robert
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Distribution and duration of signs and parts of speech in Swedish Sign Language2016In: Sign Language and Linguistics, ISSN 1387-9316, E-ISSN 1569-996X, Vol. 19, no 2, p. 143-196Article in journal (Refereed)
    Abstract [en]

    In this paper, we investigate frequency and duration of signs and parts of speech in Swedish Sign Language (SSL) using the SSL Corpus. The duration of signs is correlated with frequency, with high-frequency items having shorter duration than low-frequency items. Similarly, function words (e.g. pronouns) have shorter duration than content words (e.g. nouns). In compounds, forms annotated as reduced display shorter duration. Fingerspelling duration correlates with word length of corresponding Swedish words, and frequency and word length play a role in the lexicalization of fingerspellings. The sign distribution in the SSL Corpus shows a great deal of cross-linguistic similarity with other sign languages in terms of which signs appear as high-frequency items, and which categories of signs are distributed across text types (e.g. conversation vs. narrative). We find a correlation between an increase in age and longer mean sign duration, but see no significant difference in sign duration between genders.

  • 6.
    Börstell, Carl
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Sign Language.
    Östling, Robert
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Iconic Locations in Swedish Sign Language: Mapping Form to Meaning with Lexical Databases2017In: Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa / [ed] Jörg Tiedemann, Linköping: Linköping University Electronic Press, 2017, p. 221-225, article id 026Conference paper (Refereed)
    Abstract [en]

    In this paper, we describe a method for mapping the phonological feature location of Swedish Sign Language (SSL) signs to the meanings in the Swedish semantic dictionary SALDO. By doing so, we observe clear differences in the distribution of meanings associated with different locations on the body. The prominence of certain locations for specific meanings clearly point to iconic mappings between form and meaning in the lexicon of SSL, which pinpoints modalityspecific properties of the visual modality.

  • 7.
    Börstell, Carl
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Sign Language.
    Östling, Robert
    University of Helsinki, Finland.
    Visualizing Lects in a Sign Language Corpus: Mining Lexical Variation Data in Lects of Swedish Sign Language2016In: Workshop Proceedings: 7th Workshop on the Representation and Processing of Sign Languages: Corpus Mining / [ed] Eleni Efthimiou, Stavroula-Evita Fotinea, Thomas Hanke, Julie Hochgesang, Jette Kristoffersen, Johanna Mesch, Paris: ELRA , 2016, p. 13-18Conference paper (Refereed)
    Abstract [en]

    In this paper, we discuss the possibilities for mining lexical variation data across (potential) lects in Swedish Sign Language (SSL). The data come from the SSL Corpus (SSLC), a continuously expanding corpus of SSL, its latest release containing 43 307 annotated sign tokens, distributed over 42 signers and 75 time-aligned video and annotation files. After extracting the raw data from the SSLC annotation files, we created a database for investigating lexical distribution/variation across three possible lects, by merging the raw data with an external metadata file, containing information about the age, gender, and regional background of each of the 42 signers in the corpus. We go on to present a first version of an easy-to-use graphical user interface (GUI) that can be used as a tool for investigating lexical variation across different lects, and demonstrate a few interesting finds. This tool makes it easier for researchers and non-researchers alike to have the corpus frequencies for individual signs visualized in an instant, and the tool can easily be updated with future expansions of the SSLC.

  • 8. Cap, Fabienne
    et al.
    Adesam, Yvonne
    Ahrenberg, Lars
    Borin, Lars
    Bouma, Gerlof
    Forsberg, Markus
    Kann, Viggo
    Östling, Robert
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Smith, Aaron
    Wirén, Mats
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Nivre, Joakim
    SWORD: Towards Cutting-Edge Swedish Word Processing2016In: Proceedings of SLTC 2016, 2016Conference paper (Refereed)
    Abstract [en]

    Despite many years of research on Swedish language technology, there is still no well-documented standard for Swedish word processing covering the whole spectrum from low-level tokenization to morphological analysis and disambiguation. SWORD is a new initiative within the SWE-CLARIN consortium aiming to develop documented standards for Swedish word processing. In this paper, we report on a pilot study of Swedish tokenization, where we compare the output of six different tokenizers on four different text types. For one text type (Wikipedia articles), we also compare to the tokenization produced by six manual annotators.

  • 9.
    Dalianis, Hercules
    et al.
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Östling, RobertStockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.Weegar, RebeckaStockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.Wirén, MatsStockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Special Issue of Selected Contributions from the Seventh Swedish Language Technology Conference (SLTC 2018)2019Conference proceedings (editor) (Other academic)
    Abstract [en]

    This Special Issue contains three papers that are extended versions of abstracts presented at the Seventh Swedish Language Technology Conference (SLTC 2018), held at Stockholm University 8–9 November 2018.1 SLTC 2018 received 34 submissions, of which 31 were accepted for presentation. The number of registered participants was 113, including both attendees at SLTC 2018 and two co-located workshops that took place on 7 November. 32 participants were internationally affiliated, of which 14 were from outside the Nordic countries. Overall participation was thus on a par with previous editions of SLTC, but international participation was higher.

  • 10.
    Ek, Adam
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Wirén, Mats
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Östling, Robert
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Nilsson Björkenstam, Kristina
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Grigonytė, Gintarė
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Gustafson Capková, Sofia
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Identifying Speakers and Addressees in Dialogues Extracted from Literary Fiction2018In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018) / [ed] Nicoletta Calzolari, Khalid Choukri, Christopher Cieri, Thierry Declerck, Koiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis, Takenobu Tokunaga, European Language Resources Association, 2018, p. 817-824Conference paper (Refereed)
    Abstract [en]

    This paper describes an approach to identifying speakers and addressees in dialogues extracted from literary fiction, along with a dataset annotated for speaker and addressee. The overall purpose of this is to provide annotation of dialogue interaction between characters in literary corpora in order to allow for enriched search facilities and construction of social networks from the corpora. To predict speakers and addressees in a dialogue, we use a sequence labeling approach applied to a given set of characters. We use features relating to the current dialogue, the preceding narrative, and the complete preceding context. The results indicate that even with a small amount of training data, it is possible to build a fairly accurate classifier for speaker and addressee identification across different authors, though the identification of addressees is the more difficult task.

  • 11.
    Kurfali, Murathan
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Östling, Robert
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Noisy Parallel Corpus Filtering through Projected Word Embeddings2019In: Proceedings of the Fourth Conference on Machine Translation (WMT), Association for Computational Linguistics, 2019, Vol. 3, p. 279-283Conference paper (Refereed)
    Abstract [en]

    We present a very simple method for parallel text cleaning of low-resource languages, based on projection of word embeddings trained on large monolingual corpora in high-resource languages. In spite of its simplicity, we approach the strong baseline system in the downstream machine translation evaluation.

  • 12.
    Kurfali, Murathan
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Östling, Robert
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Zero-shot transfer for implicit discourse relation classification2019In: 20th Annual Meeting of the Special Interest Group on Discourse and Dialogue: Proceedings of the Conference, Association for Computational Linguistics, 2019, p. 226-231Conference paper (Refereed)
    Abstract [en]

    Automatically classifying the relation between sentences in a discourse is a challenging task, in particular when there is no overt expression of the relation. It becomes even more challenging by the fact that annotated training data exists only for a small number of languages, such as English and Chinese. We present a new system using zero-shot transfer learning for implicit discourse relation classification, where the only resource used for the target language is unannotated parallel text. This system is evaluated on the discourse-annotated TEDMDB parallel corpus, where it obtains good results for all seven languages using only English training data.

  • 13.
    Loftsson, Hrafn
    et al.
    Reykjaviks universitet, Island.
    Östling, Robert
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Tagging a Morphologically Complex Language Using an Averaged Perceptron Tagger: The Case of Icelandic2013In: Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013), Linköping University Electronic Press, Linköpings universitet, 2013, p. 105-119Conference paper (Refereed)
    Abstract [en]

    In this paper, we experiment with using Stagger, an open-source implementation of an Averaged Perceptron tagger, to tag Icelandic, a morphologically complex language. By adding languagespecific linguistic features and using IceMorphy, an unknown word guesser, we obtain state-of- the-art tagging accuracy of 92.82%. Furthermore, by adding data from a morphological database, and word embeddings induced from an unannotated corpus, the accuracy increases to 93.84%. This is equivalent to an error reduction of 5.5%, compared to the previously best tagger for Icelandic, consisting of linguistic rules and a Hidden Markov Model.

  • 14.
    Nilsson Björkenstam, Kristina
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Björkstrand, Thomas
    Stockholm University, Faculty of Humanities, Department of Linguistics, Sign Language.
    Grigonyté, Gintaré
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Gustafson-Capková, Sofia
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Mesch, Johanna
    Stockholm University, Faculty of Humanities, Department of Linguistics, Sign Language.
    Östling, Robert
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Schönström, Krister
    Stockholm University, Faculty of Humanities, Department of Linguistics, Swedish as a Second Language for the Deaf.
    Wallin, Lars
    Stockholm University, Faculty of Humanities, Department of Linguistics, Sign Language.
    Wirén, Mats
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    SWE-CLARIN partner presentation: Natural Language Processing Resources from the Department of Linguistics, Stockholm University2014In: The first Swedish national SWE-CLARIN workshop: LT-based e-HSS in Sweden – taking stock and looking ahead / [ed] Lars Borin, 2014Conference paper (Other academic)
    Abstract [en]

    The aim of the CLARIN Research Infrastructure and SWE-CLARIN is to facilitate for scholars in the humanities and social sciences to access primary data in the form of natural language, and to provide tools for exploring, annotating and analysing these data. This paper gives an overview of the resources and tools developed at the Department of Linguistics at Stockholm University planned to be made available within the SWE-CLARIN project. The paper also outlines our collaborations with neighbouring areas in the humanities and social sciences where these resources and tools will be put to use.

  • 15.
    Nilsson Björkenstam, Kristina
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Wirén, Mats
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Östling, Robert
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Modelling the informativeness and timing of non-verbal cues in parent–child interaction2016In: The 54th Annual Meeting of the Association for Computational Linguistics: Proceedings of the 7th Workshop on Cognitive Aspects of Computational Language Learning, Stroudsburg, PA, USA: Association for Computational Linguistics, 2016, p. 82-90Conference paper (Refereed)
    Abstract [en]

    How do infants learn the meanings of their first words? This study investigates the informativeness and temporal dynamics of non-verbal cues that signal the speaker's referent in a model of early word–referent mapping. To measure the information provided by such cues, a supervised classifier is trained on information extracted from a multimodally annotated corpus of 18 videos of parent–child interaction with three children aged 7 to 33 months. Contradicting previous research, we find that gaze is the single most informative cue, and we show that this finding can be attributed to our fine-grained temporal annotation. We also find that offsetting the timing of the non-verbal cues reduces accuracy, especially if the offset is negative. This is in line with previous research, and suggests that synchrony between verbal and non-verbal cues is important if they are to be perceived as causally related.

  • 16.
    Nilsson Björkenstam, Kristina
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Östling, Robert
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Wirén, Mats
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Modelling the informativeness of different modalities in parent-child interaction2015In: Workshop on Extensive and Intensive Recordings of Children's Language Environment / [ed] Alex Cristia, Melanie Soderstrom, 2015Conference paper (Refereed)
  • 17.
    Sjons, Johan
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Hörberg, Thomas
    Stockholm University, Faculty of Humanities, Department of Linguistics, General Linguistics.
    Östling, Robert
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Bjerva, Johannes
    Articulation rate in Swedish child-directed speech increases as a function of the age of the child even when surprisal is controlled for2017In: / [ed] Francisco Lacerda, David House, Mattias Heldner, Joakim Gustafson, Sofia Strömbergsson, Marcin Włodarczak, The International Speech Communication Association (ISCA), 2017, p. 1794-1798Conference paper (Refereed)
    Abstract [en]

    In earlier work, we have shown that articulation rate in Swedish child-directed speech (CDS) increases as a function of the age of the child, even when utterance length and differences in articulation rate between subjects are controlled for. In this paper we show on utterance level in spontaneous Swedish speech that i) for the youngest children, articulation rate in CDS is lower than in adult-directed speech (ADS), ii) there is a significant negative correlation between articulation rate and surprisal (the negative log probability) in ADS, and iii) the increase in articulation rate in Swedish CDS as a function of the age of the child holds, even when surprisal along with utterance length and differences in articulation rate between speakers are controlled for. These results indicate that adults adjust their articulation rate to make it fit the linguistic capacity of the child.

  • 18. Tjong Kim Sang, Erik
    et al.
    Bollmann, Marcel
    Boschker, Remko
    Casacuberta, Francisco
    Dietz, Feike
    Dipper, Stefanie
    Domingo, Miguel
    van der Goot, Robe
    van Koppen, Marjo
    Ljubešić, Nikola
    Östling, Robert
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Petran, Florian
    Pettersson, Eva
    Scherrer, Yves
    Schraagen, Marijn
    Sevens, Leen
    Tiedemann, Jörg
    Vanallemeersch, Tom
    Zervanou, Kalliopi
    The CLIN27 Shared Task: Translating Historical Text to Contemporary Language for Improving Automatic Linguistic Annotation2017In: Computational Linguistics in the Netherlands Journal, ISSN 2211-4009, Vol. 7, p. 53-64Article in journal (Refereed)
    Abstract [en]

    The CLIN27 shared task evaluates the effect of translating historical text to modern text with the goal of improving the quality of the output of contemporary natural language processing tools applied to the text. We focus on improving part-of-speech tagging analysis of seventeenth-century Dutch. Eight teams took part in the shared task. The best results were obtained by teams employing character-based machine translation. The best system obtained an error reduction of 51% in comparison with the baseline of tagging unmodified text. This is close to the error reduction obtained by human translation (57%).

  • 19.
    Wirén, Mats
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    N. Björkenstam, Kristina
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Östling, Robert
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Modelling the Informativeness of Non-Verbal Cues in Parent–Child Interaction2017In: Proceedings of Interspeech 2017 / [ed] Francisco Lacerda, David House, Mattias Heldner, Joakim Gustafson, Sofia Strömbergsson, Marcin Włodarczak, The International Speech Communication Association (ISCA), 2017, p. 2203-2207Conference paper (Refereed)
    Abstract [en]

    Non-verbal cues from speakers, such as eye gaze and hand positions, play an important role in word learning. This is consistent with the notion that for meaning to be reconstructed, acoustic patterns need to be linked to time-synchronous patterns from at least one other modality. In previous studies of a multimodally annotated corpus of parent–child interaction, we have shown that parents interacting with infants at the early word-learning stage (7–9 months) display a large amount of time-synchronous patterns, but that this behaviour tails off with increasing age of the children. Furthermore, we have attempted to quantify the informativeness of the different nonverbal cues, that is, to what extent they actually help to discriminate between different possible referents, and how critical the timing of the cues is. The purpose of this paper is to generalise our earlier model by quantifying informativeness resulting from non-verbal cues occurring both before and after their associated verbal references.

  • 20.
    Östling, Robert
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics. University of Helsinki, Finland.
    A Bayesian model for joint word alignment and part-of-speech transfer2016In: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, Osaka, Japan: Association for Computational Linguistics, 2016, p. 620-629Conference paper (Refereed)
    Abstract [en]

    Current methods for word alignment require considerable amounts of parallel text to deliver accurate results, a requirement which is met only for a small minority of the world’s approximately 7,000 languages. We show that by jointly performing word alignment and annotation transfer in a novel Bayesian model, alignment accuracy can be improved for language pairs where annotations are available for only one of the languages—a finding which could facilitate the study and processing of a vast number of low-resource languages. We also present an evaluation where our method is used to perform single-source and multi-source part-of-speech transfer with 22 translations of the same text in four different languages. This allows us to quantify the considerable variation in accuracy depending on the specific source text(s) used, even with different translations into the same language.

  • 21.
    Östling, Robert
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    A Construction Grammar Method for Disambiguating Swedish Compounds2010In: SLTC 2010 Workshop on Compounds and Multiword Expressions, 2010Conference paper (Refereed)
    Abstract [en]

    This study discusses the structure of Swedish compounds within the framework of Construction Grammar, and applies the result to Word Sense Disambiguation of compound components. A construction-based approach is shown to achieve significantly better results than a set of baselines.

  • 22.
    Östling, Robert
    Stockholm University, Faculty of Humanities, Department of Linguistics.
    Bayesian Models for Multilingual Word Alignment2015Doctoral thesis, monograph (Other academic)
    Abstract [en]

    In this thesis I explore Bayesian models for word alignment, how they can be improved through joint annotation transfer, and how they can be extended to parallel texts in more than two languages. In addition to these general methodological developments, I apply the algorithms to problems from sign language research and linguistic typology.

    In the first part of the thesis, I show how Bayesian alignment models estimated with Gibbs sampling are more accurate than previous methods for a range of different languages, particularly for languages with few digital resources available—which is unfortunately the state of the vast majority of languages today. Furthermore, I explore how different variations to the models and learning algorithms affect alignment accuracy.

    Then, I show how part-of-speech annotation transfer can be performed jointly with word alignment to improve word alignment accuracy. I apply these models to help annotate the Swedish Sign Language Corpus (SSLC) with part-of-speech tags, and to investigate patterns of polysemy across the languages of the world.

    Finally, I present a model for multilingual word alignment which learns an intermediate representation of the text. This model is then used with a massively parallel corpus containing translations of the New Testament, to explore word order features in 1001 languages.

  • 23.
    Östling, Robert
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Bayesian Word Alignment for Massively Parallel Texts2014In: Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, volume 2: Short Papers, Association for Computational Linguistics, 2014, p. 123-127Conference paper (Refereed)
    Abstract [en]

    There has been a great amount of work done in the field of bitext alignment, but the problem of aligning words in massively parallel texts with hundreds or thousands of languages is largely unexplored. While the basic task is similar, there are also important differences in purpose, method and evaluation between the problems. In this work, I present a non-parametric Bayesian model that can be used for simultaneous word alignment in massively parallel corpora. This method is evaluated on a corpus containing 1144 translations of the New Testament.

  • 24.
    Östling, Robert
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Part of Speech Tagging: Shallow or Deep Learning?2018In: Northern European Journal of Language Technology (NEJLT), ISSN 2000-1533, Vol. 5, no 1, p. 1-15Article in journal (Refereed)
    Abstract [en]

    Deep neural networks have advanced the state of the art in numerous fields, but they generally suffer from low computational efficiency and the level of improvement compared to more efficient machine learning models is not always significant. We perform a thorough PoS tagging evaluation on the Universal Dependencies treebanks, pitting a state-of-the-art neural network approach against UDPipe and our sparse structured perceptron-based tagger, efselab. In terms of computational efficiency, efselab is three orders of magnitude faster than the neural network model, while being more accurate than either of the other systems on 47 of 65 treebanks.

  • 25.
    Östling, Robert
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Stagger: A modern POS tagger for Swedish2012In: / [ed] Pierre Nugues, 2012Conference paper (Refereed)
  • 26.
    Östling, Robert
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Stagger: an Open-Source Part of Speech Tagger for Swedish2013In: Northern European Journal of Language Technology (NEJLT), ISSN 2000-1533, Vol. 3, p. 1-18Article in journal (Refereed)
    Abstract [en]

    This work presents Stagger, a new open-source part of speech tagger for Swedish based on the Averaged Perceptron. By using the SALDO morphological lexicon and semi-supervised learning in the form of Collobert and Weston embeddings, it reaches an accuracy of 96.4% on the standard Stockholm-Umeå Corpus dataset, making it the best single part of speech tagging system reported for Swedish. Accuracy increases to 96.6% on the latest version of the corpus, where the annotation has been revised to increase consistency. Stagger is also evaluated on a new corpus of Swedish blog posts, investigating its out-of-domain performance.

  • 27.
    Östling, Robert
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Studying colexification through massively parallell corpora2016In: The Lexical Typology of Semantic Shifts / [ed] Päivi Juvonen, Maria Koptjevskaja-Tamm, Berlin: Walter de Gruyter, 2016, p. 157-176Chapter in book (Refereed)
    Abstract [en]

    Large-sample studies in lexical typology are limited by whatever lexical information is available or can be obtained for all the languages in the study. Various types of word lists, from simple Swadesh lists to large dictionaries, can be used for this purpose. Unfortunately, these resources often present only a very fragmentary view of a given language’s vocabulary. As a complement, we propose an additional source of lexical information: parallel texts. Books such as the New Testament have been translated into thousands of languages, and it is possible to automatically extract word lists from their vocabulary, which can then be applied to lexical typological studies. In particular, we focus on studying colexification using a sample of 1 001 different languages, based on 1 142 translations of the New Testament. We find that although the automatically extracted word lists contain errors, their quality can be sufficiently good to find real areal patterns, such as the ‘tree’/’fire’ colexification that is widespread in the Sahul area.

  • 28.
    Östling, Robert
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Svenska dialektkartor på sekunden2015In: Språkbruk, ISSN 0358-9293, Vol. 3, p. 10-13Article in journal (Other (popular science, discussion, etc.))
  • 29.
    Östling, Robert
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Word order typology through multilingual word alignment2015In: The 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing: Proceedings of the Conference, Volume 2: Short Papers, 2015, p. 205-211Conference paper (Refereed)
    Abstract [en]

    With massively parallel corpora of hundreds or thousands of translations of the same text, it is possible to automatically perform typological studies of language structure using very large language samples. We investigate the domain of wordorder using multilingual word alignment and high-precision annotation transfer in a corpus with 1144 translations in 986 languages of the New Testament. Results are encouraging, with 86% to 96% agreementbetween our method and the manually created WALS database for a range of different word order features. Beyond reproducing the categorical data in WALS and extending it to hundreds of other languages, we also provide quantitative data for therelative frequencies of different word orders, and show the usefulness of this for language comparison. Our method has applications for basic research in linguistic typology, as well as for NLP tasks like transfer learning for dependency parsing, which has been shown to benefit from word order information.

  • 30.
    Östling, Robert
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Bjerva, Johannes
    SU-RUG at the CoNLL-SIGMORPHON 2017 shared task: Morphological inflection with attentional sequence-to-sequence models2017In: Proceedings of the CoNLL SIGMORPHON 2017 Shared Task: Universal Morphological Reinflection / [ed] Mans Hulden, Vancouver, Canada: Association for Computational Linguistics, 2017, p. 110-113Conference paper (Refereed)
    Abstract [en]

    This paper describes the Stockholm University/University of Groningen (SU-RUG) system for the SIGMORPHON 2017 shared task on morphological inflection. Our system is based on an attentional sequence-to-sequence neural network model using Long Short-Term Memory (LSTM) cells, with joint training of morphological inflection and the inverse transformation, i.e. lemmatization and morphological analysis. Our system outperforms the baseline with a large margin, and our submission ranks as the 4th best team for the track we participate in (task 1, high resource).

  • 31.
    Östling, Robert
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Börstell, Carl
    Stockholm University, Faculty of Humanities, Department of Linguistics, General Linguistics. Radboud University, Netherlands.
    Courtaux, Servane
    Visual Iconicity Across Sign Languages: Large-Scale Automated Video Analysis of Iconic Articulators and Locations2018In: Frontiers in Psychology, ISSN 1664-1078, E-ISSN 1664-1078, Vol. 9, article id 725Article in journal (Refereed)
    Abstract [en]

    We use automatic processing of 120,000 sign videos in 31 different sign languages to show a cross-linguistic pattern for two types of iconic form–meaning relationships in the visual modality. First, we demonstrate that the degree of inherent plurality of concepts, based on individual ratings by non-signers, strongly correlates with the number of hands used in the sign forms encoding the same concepts across sign languages. Second, we show that certain concepts are iconically articulated around specific parts of the body, as predicted by the associational intuitions by non-signers. The implications of our results are both theoretical and methodological. With regard to theoretical implications, we corroborate previous research by demonstrating and quantifying, using a much larger material than previously available, the iconic nature of languages in the visual modality. As for the methodological implications, we show how automatic methods are, in fact, useful for performing large-scale analysis of sign language data, to a high level of accuracy, as indicated by our manual error analysis.

  • 32.
    Östling, Robert
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Börstell, Carl
    Stockholm University, Faculty of Humanities, Department of Linguistics, Sign Language.
    Gärdenfors, Moa
    Stockholm University, Faculty of Humanities, Department of Linguistics, Sign Language.
    Wirén, Mats
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Universal Dependencies for Swedish Sign Language2017In: Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa / [ed] Jörg Tiedemann, Linköping: Linköping University Electronic Press, 2017, p. 303-308Conference paper (Refereed)
    Abstract [en]

    We describe the first effort to annotate a signed language with syntactic dependency structure: the Swedish Sign Language portion of the Universal Dependencies treebanks. The visual modality presents some unique challenges in analysis and annotation, such as the possibility of both hands articulating separate signs simultaneously, which has implications for the concept of projectivity in dependency grammars. Our data is sourced from the Swedish Sign Language Corpus, and if used in conjunction these resources contain very richly annotated data: dependency structure and parts of speech, video recordings, signer metadata, and since the whole material is also translated into Swedish the corpus is also a parallel text.

  • 33.
    Östling, Robert
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Börstell, Carl
    Stockholm University, Faculty of Humanities, Department of Linguistics, General Linguistics.
    Wallin, Lars
    Stockholm University, Faculty of Humanities, Department of Linguistics, Sign Language.
    Enriching the Swedish Sign Language Corpus with Part of Speech Tags Using Joint Bayesian Word Alignment and Annotation Transfer2015In: Proceedings of the 20th Nordic Conference of Computational Linguistics: NODALIDA 2015, May 11-13, 2015, Vilnius, Lithuania / [ed] Beáta Megyesi, Linköping University Electronic Press, 2015, p. 263-268Conference paper (Refereed)
    Abstract [en]

    We have used a novel Bayesian model of joint word alignment and part of speech (PoS) annotation transfer to enrich the Swedish Sign Language Corpus with PoS tags. The annotations were then hand-corrected in order to both improve annotation quality for the corpus, and allow the empirical evaluation presented herein.

  • 34.
    Östling, Robert
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Grigonyte, Gintare
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Transparent text quality assessment with convolutional neural networks2017In: The Twelfth Workshop on Innovative Use of NLP for Building Educational Applications: Proceedings of the Workshop, Association for Computational Linguistics, 2017, p. 282-286Conference paper (Refereed)
    Abstract [en]

    We present a very simple model for text quality assessment based on a deep convolutional neural network, where the only supervision required is one corpus of user-generated text of varying quality, and one contrasting text corpus of consistently high quality. Our model is able to provide local quality assessments in different parts of a text, which allows visual feedback about where potentially problematic parts of the text are located, as well as a way to evaluate which textual features are captured by our model. We evaluate our method on two corpora: a large corpus of manually graded student essays and a longitudinal corpus of language learner written production, and find that the text quality metric learned by our model is a fairly strong predictor of both essay grade and learner proficiency level.

  • 35.
    Östling, Robert
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Scherrer, Yves
    Tiedemann, Jörg
    Tang, Gongbo
    Nieminen, Tommi
    The Helsinki Neural Machine Translation System2017In: Proceedings of the Conference on Machine Translation (WMT): Shared Task Papers, Association for Computational Linguistics, 2017, Vol. 2, p. 338-347Conference paper (Refereed)
    Abstract [en]

    We introduce the Helsinki Neural Machine Translation system (HNMT) and how it is applied in the news translation task at WMT 2017, where it ranked first in both the human and automatic evaluations for English–Finnish. We discuss the successof English–Finnish translations and the overall advantage of NMT over a strong SMT baseline. We also discuss our sub-missions for English–Latvian, English–Chinese and Chinese–English.

  • 36.
    Östling, Robert
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Smolentzov, André
    Stockholm University, Faculty of Humanities, Department of Linguistics.
    Tyrefors Hinnerich, Björn
    Stockholm University, Faculty of Social Sciences, Department of Economics.
    Höglin, Erik
    Automated Essay Scoring for Swedish2013In: Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, Association for Computational Linguistics, 2013, p. 42-47Conference paper (Refereed)
    Abstract [en]

    We present the first system developed for automated grading of high school essays written in Swedish. The system uses standard text quality indicators and is able to compare vocabulary and grammar to large reference corpora of blog posts and newspaper articles. The system is evaluated on a corpus of 1 702 essays, each graded independently by the student’s own teacher and also in a blind re-grading process by another teacher. We show that our system’s performance is fair, given the low agreementbetween the two human graders, and furthermore show how it could improve efficiency in a practical setting where one seeks to identify incorrectly graded essays.

  • 37.
    Östling, Robert
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Tiedemann, Jörg
    Continuous multilinguality with language vectors2017In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Short Papers / [ed] Mirella Lapata, Phil Blunsom, Alexander Koller, Association for Computational Linguistics, 2017, Vol. 2, p. 644-649Conference paper (Refereed)
    Abstract [en]

    Most existing models for multilingual natural language processing (NLP) treat language as a discrete category, and make predictions for either one language or the other. In contrast, we propose using continuous vector representations of language. We show that these can be learned efficiently with a character-based neural language model, and used to improve inference about language varieties not seen during training. In experiments with 1303 Bible translations into 990 different languages, we empirically explore the capacity of multilingual language models, and also show that the language vectors capture genetic relationships between languages.

  • 38.
    Östling, Robert
    et al.
    University of Helsinki, Finland.
    Tiedemann, Jörg
    University of Helsinki, Finland.
    Efficient word alignment with Markov Chain Monte Carlo2016In: Prague Bulletin of Mathematical Linguistics, ISSN 0032-6585, E-ISSN 1804-0462, no 106, p. 125-146Article in journal (Refereed)
    Abstract [en]

    We present efmaral, a new system for efficient and accurate word alignment using a Bayesian model with Markov Chain Monte Carlo (MCMC) inference. Through careful selection of data structures and model architecture we are able to surpass the fast_align system, commonly used for performance-critical word alignment, both in computational efficiency and alignment accuracy. Our evaluation shows that a phrase-based statistical machine translation (SMT) system produces translations of higher quality when using word alignments from efmaral than from fast_align, and that translation quality is on par with what is obtained using giza++, a tool requiring orders of magnitude more processing time. More generally we hope to convince the reader that Monte Carlo sampling, rather than being viewed as a slow method of last resort, should actually be the method of choice for the SMT practitioner and others interested in word alignment.

  • 39.
    Östling, Robert
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Wirén, Mats
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Compounding in a Swedish Blog Corpus2013In: Computer mediated discourse across languages / [ed] Laura Álvarez López, Charlotta Seiler Brylla & Philip Shaw, Stockholm: Acta Universitatis Stockholmiensis, 2013, p. 45-63Chapter in book (Refereed)
1 - 39 of 39
CiteExportLink to result list
Permanent link
Cite
Citation style
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf