  • 1.
    Wikse Barrow, Carla
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Phonetics. Karolinska Institutet, Sweden.
    Nilsson Björkenstam, Kristina
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Strömbergsson, Sofia
    Subjective ratings of age-of-acquisition: exploring issues of validity and rater reliability. 2019. In: Journal of Child Language, ISSN 0305-0009, E-ISSN 1469-7602, Vol. 46, no 2, p. 199-213. Article in journal (Refereed)
    Abstract [en]

    This study aimed to investigate concerns of validity and reliability in subjective ratings of age-of-acquisition (AoA), through exploring characteristics of the individual rater. An additional aim was to validate the obtained AoA ratings against two corpora – one of child speech and one of adult speech – specifically exploring whether words over-represented in the child-speech corpus are rated with lower AoA than words characteristic of the adult-speech corpus. The results show that less than one-third of participating informants’ ratings are valid and reliable. However, individuals with high familiarity with preschool-aged children provide more valid and reliable ratings, compared to individuals who do not work with or have children of their own. The results further show a significant, age-adjacent difference in rated AoA for words from the two different corpora, thus strengthening their validity. The study provides AoA data, of high specificity, for 100 child-specific and 100 adult-specific Swedish words.

  • 2. Rosén, Dan
    et al.
    Wirén, Mats
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Volodina, Elena
    Error Coding of Second-Language Learner Texts Based on Mostly Automatic Alignment of Parallel Corpora. 2018. In: CLARIN Annual Conference 2018: Proceedings / [ed] Inguna Skadina, Maria Eskevich, 2018, p. 181-184. Conference paper (Refereed)
    Abstract [en]

    Error coding of second-language learner text, that is, detecting, correcting and annotating errors, is a cumbersome task which in turn requires interpretation of the text to decide what the errors are. This paper describes a system with which the annotator corrects the learner text by editing it prior to the actual error annotation. During the editing, the system automatically generates a parallel corpus of the learner and corrected texts. Based on this, the work of the annotator consists of three independent tasks that are otherwise often conflated in error coding: correcting the learner text, repairing inconsistent alignments, and performing the actual error annotation.

  • 3. Ibbotson, Paul
    et al.
    Hartman, Rose M.
    Nilsson Björkenstam, Kristina
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Frequency filter: an open access tool for analysing language development. 2018. In: Language, Cognition and Neuroscience, ISSN 2327-3798, E-ISSN 2327-3801, Vol. 33, no 10, p. 1325-1339. Article in journal (Refereed)
    Abstract [en]

    We present an open-access analytic tool, which allows researchers to simultaneously control for and combine language data from the child, the caregiver, multiple languages, and across multiple time points to make inferences about the social and cognitive factors driving the shape of language development. We demonstrate how the tool works in three domains of language learning and across six languages. The results demonstrate the usefulness of this approach as well as providing deeper insight into three areas of language production and acquisition: egocentric language use, the learnability of nouns versus verbs, and imageability. We have made the Frequency Filter tool freely available as an R-package for other researchers to use at https://github.com/rosemm/FrequencyFilter.

  • 4.
    Ek, Adam
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Wirén, Mats
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Östling, Robert
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Nilsson Björkenstam, Kristina
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Grigonytė, Gintarė
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Gustafson Capková, Sofia
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Identifying Speakers and Addressees in Dialogues Extracted from Literary Fiction. 2018. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018) / [ed] Nicoletta Calzolari, Khalid Choukri, Christopher Cieri, Thierry Declerck, Koiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis, Takenobu Tokunaga, European Language Resources Association, 2018, p. 817-824. Conference paper (Refereed)
    Abstract [en]

    This paper describes an approach to identifying speakers and addressees in dialogues extracted from literary fiction, along with a dataset annotated for speaker and addressee. The overall purpose of this is to provide annotation of dialogue interaction between characters in literary corpora in order to allow for enriched search facilities and construction of social networks from the corpora. To predict speakers and addressees in a dialogue, we use a sequence labeling approach applied to a given set of characters. We use features relating to the current dialogue, the preceding narrative, and the complete preceding context. The results indicate that even with a small amount of training data, it is possible to build a fairly accurate classifier for speaker and addressee identification across different authors, though the identification of addressees is the more difficult task.

  • 5. Megyesi, Beáta
    et al.
    Granstedt, Lena
    Johansson, Sofia
    Stockholm University, Faculty of Humanities, Department of Swedish Language and Multilingualism, Scandinavian Languages.
    Prentice, Julia
    Rosén, Dan
    Schenström, Carl-Johan
    Sundberg, Gunlög
    Stockholm University, Faculty of Humanities, Department of Swedish Language and Multilingualism, Scandinavian Languages.
    Wirén, Mats
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Volodina, Elena
    Learner Corpus Anonymization in the Age of GDPR: Insights from the Creation of a Learner Corpus of Swedish. 2018. In: Proceedings of the 7th Workshop on NLP for Computer Assisted Language Learning at SLTC 2018 (NLP4CALL 2018), Linköping: Linköping University Electronic Press, 2018, p. 47-56, article id 006. Conference paper (Refereed)
    Abstract [en]

    This paper reports on the status of learner corpus anonymization for the ongoing research infrastructure project SweLL. The main project aim is to deliver and make available for research a well-annotated corpus of essays written by second language (L2) learners of Swedish. As practice shows, annotation of learner texts is a sensitive process demanding many compromises between ethical and legal demands on the one hand, and research and technical demands on the other. Below is a concise description of the current status of pseudonymization of language learner data to ensure anonymity of the learners, with numerous examples of the above-mentioned compromises.

  • 6.
    Östling, Robert
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Part of Speech Tagging: Shallow or Deep Learning? 2018. In: Northern European Journal of Language Technology (NEJLT), ISSN 2000-1533, Vol. 5, no 1, p. 1-15. Article in journal (Refereed)
    Abstract [en]

    Deep neural networks have advanced the state of the art in numerous fields, but they generally suffer from low computational efficiency and the level of improvement compared to more efficient machine learning models is not always significant. We perform a thorough PoS tagging evaluation on the Universal Dependencies treebanks, pitting a state-of-the-art neural network approach against UDPipe and our sparse structured perceptron-based tagger, efselab. In terms of computational efficiency, efselab is three orders of magnitude faster than the neural network model, while being more accurate than either of the other systems on 47 of 65 treebanks.

  • 7. Strömbergsson, Sofia
    et al.
    Nilsson Björkenstam, Kristina
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Götze, Jana
    Edlund, Jens
    Simulating Speech Errors in Swedish, Norwegian and English. 2018. Conference paper (Refereed)
  • 8.
    Östling, Robert
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Börstell, Carl
    Stockholm University, Faculty of Humanities, Department of Linguistics, General Linguistics. Radboud University, Netherlands.
    Courtaux, Servane
    Visual Iconicity Across Sign Languages: Large-Scale Automated Video Analysis of Iconic Articulators and Locations. 2018. In: Frontiers in Psychology, ISSN 1664-1078, E-ISSN 1664-1078, Vol. 9, article id 725. Article in journal (Refereed)
    Abstract [en]

    We use automatic processing of 120,000 sign videos in 31 different sign languages to show a cross-linguistic pattern for two types of iconic form–meaning relationships in the visual modality. First, we demonstrate that the degree of inherent plurality of concepts, based on individual ratings by non-signers, strongly correlates with the number of hands used in the sign forms encoding the same concepts across sign languages. Second, we show that certain concepts are iconically articulated around specific parts of the body, as predicted by the associational intuitions by non-signers. The implications of our results are both theoretical and methodological. With regard to theoretical implications, we corroborate previous research by demonstrating and quantifying, using a much larger material than previously available, the iconic nature of languages in the visual modality. As for the methodological implications, we show how automatic methods are, in fact, useful for performing large-scale analysis of sign language data, to a high level of accuracy, as indicated by our manual error analysis.

  • 9.
    Strömbergsson, Sofia
    et al.
    Division of Speech and Language Pathology, Department of Clinical Science, Intervention and Technology (CLINTEC), Karolinska Institutet (KI), Stockholm, Sweden.
    Edlund, Jens
    Department of Speech, Music and Hearing, KTH, Stockholm, Sweden.
    Götze, Jana
    Division of Speech and Language Pathology, Department of Clinical Science, Intervention and Technology (CLINTEC), Karolinska Institutet (KI), Stockholm, Sweden.
    Nilsson Björkenstam, Kristina
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Approximating phonotactic input in children’s linguistic environments from orthographic transcripts. 2017. In: Proceedings of the 18th Annual Conference of the International Speech Communication Association (INTERSPEECH 2017), Stockholm: The International Speech Communication Association (ISCA), 2017, p. 2214-2217. Conference paper (Refereed)
    Abstract [en]

    Child-directed spoken data is the ideal source of support for claims about children’s linguistic environments. However, phonological transcriptions of child-directed speech are scarce, compared to sources like adult-directed speech or text data. Acquiring reliable descriptions of children’s phonological environments from more readily accessible sources would mean considerable savings of time and money. The first step towards this goal is to quantify the reliability of descriptions derived from such secondary sources. We investigate how phonological distributions vary across different modalities (spoken vs. written), and across the age of the intended audience (children vs. adults). Using a previously unseen collection of Swedish adult- and child-directed spoken and written data, we combine lexicon look-up and grapheme-to-phoneme conversion to approximate phonological characteristics. The analysis shows distributional differences across datasets both for single phonemes and for longer phoneme sequences. Some of these are predictably attributed to lexical and contextual characteristics of text vs. speech. The generated phonological transcriptions are remarkably reliable. The differences in phonological distributions between child-directed speech and secondary sources highlight a need for compensatory measures when relying on written data or on adult-directed spoken data, and/or for continued collection of actual child-directed speech in research on children’s language environments.

  • 10.
    Sjons, Johan
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Hörberg, Thomas
    Stockholm University, Faculty of Humanities, Department of Linguistics, General Linguistics.
    Östling, Robert
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Bjerva, Johannes
    Articulation rate in Swedish child-directed speech increases as a function of the age of the child even when surprisal is controlled for. 2017. In: Proceedings of the 18th Annual Conference of the International Speech Communication Association (INTERSPEECH 2017) / [ed] Marcin Włodarczak, Stockholm: The International Speech Communication Association (ISCA), 2017, p. 1794-1798. Conference paper (Refereed)
  • 11.
    Östling, Robert
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Tiedemann, Jörg
    Continuous multilinguality with language vectors. 2017. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, 2017, p. 644-649. Conference paper (Refereed)
    Abstract [en]

    Most existing models for multilingual natural language processing (NLP) treat language as a discrete category, and make predictions for either one language or the other. In contrast, we propose using continuous vector representations of language. We show that these can be learned efficiently with a character-based neural language model, and used to improve inference about language varieties not seen during training. In experiments with 1303 Bible translations into 990 different languages, we empirically explore the capacity of multilingual language models, and also show that the language vectors capture genetic relationships between languages.

  • 12. Bjerva, Johannes
    et al.
    Östling, Robert
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Cross-lingual Learning of Semantic Textual Similarity with Multilingual Word Representations. 2017. In: Proceedings of the 21st Nordic Conference on Computational Linguistics / [ed] Jörg Tiedemann, Linköping: Linköping University Electronic Press, 2017, p. 211-215, article id 024. Conference paper (Refereed)
    Abstract [en]

    Assessing the semantic similarity between sentences in different languages is challenging. We approach this problem by leveraging multilingual distributional word representations, where similar words in different languages are close to each other. The availability of parallel data allows us to train such representations on a large number of languages. This allows us to leverage semantic similarity data for languages for which no such data exists. We train and evaluate on five language pairs, including English, Spanish, and Arabic. We are able to train well-performing systems for several language pairs, without any labelled data for that language pair.

  • 13.
    Börstell, Carl
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Sign Language.
    Östling, Robert
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Iconic Locations in Swedish Sign Language: Mapping Form to Meaning with Lexical Databases. 2017. In: Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa / [ed] Jörg Tiedemann, Linköping: Linköping University Electronic Press, 2017, p. 221-225, article id 026. Conference paper (Refereed)
    Abstract [en]

    In this paper, we describe a method for mapping the phonological feature location of Swedish Sign Language (SSL) signs to the meanings in the Swedish semantic dictionary SALDO. By doing so, we observe clear differences in the distribution of meanings associated with different locations on the body. The prominence of certain locations for specific meanings clearly points to iconic mappings between form and meaning in the lexicon of SSL, which pinpoints modality-specific properties of the visual modality.

  • 14.
    Grigonyte, Gintare
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Schneider, Gerold
    English Department, University of Zurich, Switzerland.
    Measuring Encoding Efficiency in Swedish and English Language Learner Speech Production. 2017. In: The 18th Annual Conference of the International Speech Communication Association Interspeech 2017 / [ed] Marcin Włodarczak, The International Speech Communication Association (ISCA), 2017, article id 337. Conference paper (Refereed)
    Abstract [en]

    We use n-gram language models to investigate how far language approximates an optimal code for human communication in terms of Information Theory [1], and what differences there are between Learner proficiency levels. Although the language of lower level learners is simpler, it is less optimal in terms of information theory, and as a consequence more difficult to process.

  • 15.
    Marklund, Ellen
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Phonetics.
    Cortes, Elísabet Eir
    Stockholm University, Faculty of Humanities, Department of Linguistics, Phonetics.
    Sjons, Johan
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    MMN responses in adults after exposure to bimodal and unimodal frequency distributions of rotated speech. 2017. In: Proceedings of Interspeech 2017, The International Speech Communication Association (ISCA), 2017, p. 1804-1808. Conference paper (Refereed)
    Abstract [en]

    The aim of the present study is to further the understanding of the relationship between perceptual categorization and exposure to different frequency distributions of sounds. Previous studies have shown that speech sound discrimination proficiency is influenced by exposure to different distributions of speech sound continua varying along one or several acoustic dimensions, both in adults and in infants. In the current study, adults were presented with either a bimodal or a unimodal frequency distribution of spectrally rotated sounds along a continuum (a vowel continuum before rotation). Categorization of the sounds, quantified as amplitude of the event-related potential (ERP) component mismatch negativity (MMN) in response to two of the sounds, was measured before and after exposure. It was expected that the bimodal group would have a larger MMN amplitude after exposure whereas the unimodal group would have a smaller MMN amplitude after exposure. Contrary to expectations, the MMN amplitude was smaller overall after exposure, and no difference was found between groups. This suggests that either the previously reported sensitivity to frequency distributions of speech sounds is not present for non-speech sounds, or the MMN amplitude is not a sensitive enough measure of categorization to detect an influence from passive exposure, or both.

  • 16.
    Wirén, Mats
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Nilsson Björkenstam, Kristina
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Östling, Robert
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Modelling the Informativeness of Non-Verbal Cues in Parent–Child Interaction. 2017. In: Proceedings of the 18th Annual Conference of the International Speech Communication Association (INTERSPEECH 2017), Stockholm: The International Speech Communication Association (ISCA), 2017, p. 2203-2207, article id 1143. Conference paper (Refereed)
    Abstract [en]

    Non-verbal cues from speakers, such as eye gaze and hand positions, play an important role in word learning. This is consistent with the notion that for meaning to be reconstructed, acoustic patterns need to be linked to time-synchronous patterns from at least one other modality. In previous studies of a multimodally annotated corpus of parent–child interaction, we have shown that parents interacting with infants at the early word-learning stage (7–9 months) display a large amount of time-synchronous patterns, but that this behaviour tails off with increasing age of the children. Furthermore, we have attempted to quantify the informativeness of the different nonverbal cues, that is, to what extent they actually help to discriminate between different possible referents, and how critical the timing of the cues is. The purpose of this paper is to generalise our earlier model by quantifying informativeness resulting from non-verbal cues occurring both before and after their associated verbal references.

  • 17.
    Bjerva, Johannes
    et al.
    University of Groningen.
    Grigonyte, Gintare
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Östling, Robert
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Plank, Barbara
    University of Groningen.
    Neural Networks and Spelling Features for Native Language Identification. 2017. In: Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications, Association for Computational Linguistics, 2017, p. 235-239. Conference paper (Refereed)
    Abstract [en]

    We present the RUG-SU team's submission at the Native Language Identification Shared Task 2017. We combine several approaches into an ensemble, based on spelling error features, a simple neural network using word representations, a deep residual network using word and character features, and a system based on a recurrent neural network. Our best system is an ensemble of neural networks, reaching an F1 score of 0.8323. Although our system is not the highest ranking one, we do outperform the baseline by far.

  • 18. Volodina, Elena
    et al.
    Pilán, Ildikó
    Borin, Lars
    Gintare, Grigonyte
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Nilsson Björkenstam, Kristina
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Proceedings of the Joint 6th Workshop on NLP for Computer Assisted Language Learning and 2nd Workshop on NLP for Research on Language Acquisition. 2017. Conference proceedings (editor) (Refereed)
    Abstract [en]

    For the second year in a row we brought two related themes of NLP for Computer-Assisted Language Learning and NLP for Language Acquisition together. The goal of organizing joint workshops is to provide a meeting place for researchers working on language learning issues including both empirical and experimental studies and NLP-based applications. The resulting volume covers a variety of topics from the two fields and - hopefully - showcases the challenges and achievements in the field.

    The seven papers in this volume cover native language identification in learner writings, using syntactic complexity development in language learner language to identify reading comprehension texts of appropriate level, exploring the potential of parallel corpora to predict mother-language specific problem areas for learners of another language, tools for learning languages - both well-resourced ones such as English as well as endangered or under-resourced ones such as Yakut and Võro, as well as exploring the potential of automatically identifying and correcting word-level errors in Swedish learner writing.

  • 19.
    Östling, Robert
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Bjerva, Johannes
    SU-RUG at the CoNLL-SIGMORPHON 2017 shared task: Morphological inflection with attentional sequence-to-sequence models. 2017. In: Proceedings of the CoNLL SIGMORPHON 2017 Shared Task: Universal Morphological Reinflection, Vancouver, Canada: Association for Computational Linguistics, 2017, p. 110-113. Conference paper (Refereed)
    Abstract [en]

    This paper describes the Stockholm University/University of Groningen (SU-RUG) system for the SIGMORPHON 2017 shared task on morphological inflection. Our system is based on an attentional sequence-to-sequence neural network model using Long Short-Term Memory (LSTM) cells, with joint training of morphological inflection and the inverse transformation, i.e. lemmatization and morphological analysis. Our system outperforms the baseline with a large margin, and our submission ranks as the 4th best team for the track we participate in (task 1, high resource).

  • 20. Tjong Kim Sang, Erik
    et al.
    Bollmann, Marcel
    Boschker, Remko
    Casacuberta, Francisco
    Dietz, Feike
    Dipper, Stefanie
    Domingo, Miguel
    van der Goot, Robe
    van Koppen, Marjo
    Ljubešić, Nikola
    Östling, Robert
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Petran, Florian
    Pettersson, Eva
    Scherrer, Yves
    Schraagen, Marijn
    Sevens, Leen
    Tiedemann, Jörg
    Vanallemeersch, Tom
    Zervanou, Kalliopi
    The CLIN27 Shared Task: Translating Historical Text to Contemporary Language for Improving Automatic Linguistic Annotation. 2017. In: Computational Linguistics in the Netherlands Journal, ISSN 2211-4009, Vol. 7, p. 53-64. Article in journal (Refereed)
    Abstract [en]

    The CLIN27 shared task evaluates the effect of translating historical text to modern text with the goal of improving the quality of the output of contemporary natural language processing tools applied to the text. We focus on improving part-of-speech tagging analysis of seventeenth-century Dutch. Eight teams took part in the shared task. The best results were obtained by teams employing character-based machine translation. The best system obtained an error reduction of 51% in comparison with the baseline of tagging unmodified text. This is close to the error reduction obtained by human translation (57%).

  • 21.
    Östling, Robert
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Scherrer, Yves
    University of Helsinki.
    Tiedemann, Jörg
    University of Helsinki.
    Tang, Gongbo
    Uppsala University.
    Nieminen, Tommi
    University of Helsinki.
    The Helsinki Neural Machine Translation System. 2017. In: Proceedings of the Second Conference on Machine Translation, Copenhagen, Denmark: Association for Computational Linguistics, 2017, p. 338-347. Conference paper (Refereed)
  • 22.
    Östling, Robert
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Grigonyte, Gintare
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Transparent text quality assessment with convolutional neural networks. 2017. In: Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications, 2017, p. 282-286. Conference paper (Refereed)
    Abstract [en]

    We present a very simple model for text quality assessment based on a deep convolutional neural network, where the only supervision required is one corpus of user-generated text of varying quality, and one contrasting text corpus of consistently high quality. Our model is able to provide local quality assessments in different parts of a text, which allows visual feedback about where potentially problematic parts of the text are located, as well as a way to evaluate which textual features are captured by our model. We evaluate our method on two corpora: a large corpus of manually graded student essays and a longitudinal corpus of language learner written production, and find that the text quality metric learned by our model is a fairly strong predictor of both essay grade and learner proficiency level.

  • 23.
    Nilsson Björkenstam, Kristina
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Grigonyté, Gintaré
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Trump säger det igen, igen och igen. 2017. In: Språktidningen, ISSN 1654-5028, no 2, p. 24-27. Article in journal (Other (popular science, discussion, etc.))
  • 24.
    Östling, Robert
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Börstell, Carl
    Stockholm University, Faculty of Humanities, Department of Linguistics, Sign Language. Stockholm University, Faculty of Humanities, Department of Linguistics, General Linguistics.
    Gärdenfors, Moa
    Stockholm University, Faculty of Humanities, Department of Linguistics, Sign Language.
    Wirén, Mats
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Universal Dependencies for Swedish Sign Language. 2017. In: Proceedings of the 21st Nordic Conference on Computational Linguistics / [ed] Jörg Tiedemann, Linköping: Linköping University Electronic Press, 2017, p. 303-308. Conference paper (Refereed)
    Abstract [en]

    We describe the first effort to annotate a signed language with syntactic dependency structure: the Swedish Sign Language portion of the Universal Dependencies treebanks. The visual modality presents some unique challenges in analysis and annotation, such as the possibility of both hands articulating separate signs simultaneously, which has implications for the concept of projectivity in dependency grammars. Our data is sourced from the Swedish Sign Language Corpus, and if used in conjunction these resources contain very richly annotated data: dependency structure and parts of speech, video recordings, signer metadata, and since the whole material is also translated into Swedish the corpus is also a parallel text.

  • 25.
    Östling, Robert
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    A Bayesian model for joint word alignment and part-of-speech transfer2016In: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, Osaka, Japan: Association for Computational Linguistics, 2016, p. 620-629Conference paper (Refereed)
    Abstract [en]

    Current methods for word alignment require considerable amounts of parallel text to deliver accurate results, a requirement which is met only for a small minority of the world’s approximately 7,000 languages. We show that by jointly performing word alignment and annotation transfer in a novel Bayesian model, alignment accuracy can be improved for language pairs where annotations are available for only one of the languages—a finding which could facilitate the study and processing of a vast number of low-resource languages. We also present an evaluation where our method is used to perform single-source and multi-source part-of-speech transfer with 22 translations of the same text in four different languages. This allows us to quantify the considerable variation in accuracy depending on the specific source text(s) used, even with different translations into the same language.

  • 26. Volodina, Elena
    et al.
    Megyesi, Beata
    Wirén, Mats
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Granstedt, Lena
    Prentice, Julia
    Reichenberg, Monica
    Sundberg, Gunlög
    Stockholm University, Faculty of Humanities, Department of Swedish Language and Multilingualism, Scandinavian Languages.
    A Friend in Need? Research agenda for electronic Second Language infrastructure2016Conference paper (Refereed)
    Abstract [en]

    In this article, we describe the research and societal needs as well as ongoing efforts to shape Swedish as a Second Language (L2) infrastructure. Our aim is to develop an electronic research infrastructure that would stimulate empirical research into learners' language development by preparing data and developing language technology methods and algorithms that can successfully deal with deviations in the learner language.

  • 27.
    Sjons, Johan
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Hörberg, Thomas
    Stockholm University, Faculty of Humanities, Department of Linguistics, General Linguistics.
    Articulation rate in child-directed speech increases as a function of child age2016In: Fonetik 2016, 2016Conference paper (Other academic)
    Abstract [en]

    It has been shown that articulation rate (AR), the number of produced linguistic units per time unit with pauses excluded, is lower in child-directed speech (CDS) than in adult-directed speech (ADS). The present study is the first corpus-based longitudinal study to investigate AR in Swedish CDS as a function of child age while also controlling for utterance length in terms of number of syllables and for individual differences between speakers. AR in transcribed utterances of 7 parents directed at their respective child during different ages was analyzed with mixed effects modeling. Results show a significantly higher AR in longer than in shorter utterances and a significant increase in AR as a function of infant age. Future studies include comparison with entropy-based measures.
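
The definition of articulation rate given in this abstract can be illustrated with a minimal sketch. It assumes syllables as the linguistic unit and seconds as the time unit; the function name and parameters are hypothetical and are not taken from the study.

```python
def articulation_rate(n_syllables, utterance_duration_s, pause_durations_s=()):
    """Syllables per second of speaking time, with pauses excluded.

    All names and units here are illustrative assumptions.
    """
    # Speaking time is the total utterance duration minus all pauses.
    speaking_time = utterance_duration_s - sum(pause_durations_s)
    if speaking_time <= 0:
        raise ValueError("speaking time must be positive")
    return n_syllables / speaking_time


# 12 syllables in a 4-second utterance containing two 0.5-second pauses.
rate = articulation_rate(12, 4.0, [0.5, 0.5])
```

Excluding pauses is what separates articulation rate from plain speaking rate, which would divide by the full 4 seconds instead.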

  • 28.
    Börstell, Carl
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Sign Language.
    Hörberg, Thomas
    Stockholm University, Faculty of Humanities, Department of Linguistics, General Linguistics.
    Östling, Robert
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Distribution and duration of signs and parts of speech in Swedish Sign Language2016In: Sign Language and Linguistics, ISSN 1387-9316, E-ISSN 1569-996X, Vol. 19, no 2, p. 143-196Article in journal (Refereed)
    Abstract [en]

    In this paper, we investigate frequency and duration of signs and parts of speech in Swedish Sign Language (SSL) using the SSL Corpus. The duration of signs is correlated with frequency, with high-frequency items having shorter duration than low-frequency items. Similarly, function words (e.g. pronouns) have shorter duration than content words (e.g. nouns). In compounds, forms annotated as reduced display shorter duration. Fingerspelling duration correlates with word length of corresponding Swedish words, and frequency and word length play a role in the lexicalization of fingerspellings. The sign distribution in the SSL Corpus shows a great deal of cross-linguistic similarity with other sign languages in terms of which signs appear as high-frequency items, and which categories of signs are distributed across text types (e.g. conversation vs. narrative). We find a correlation between an increase in age and longer mean sign duration, but see no significant difference in sign duration between genders.

  • 29.
    Grigonyté, Gintaré
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Nilsson Björkenstam, Kristina
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Language-independent exploration of repetition and variation in longitudinal child-directed speech: A tool and resources2016In: Proceedings of the joint workshop on NLP for Computer Assisted Language Learning and NLP for Language Acquisition at SLTC, Umeå, 16th November 2016 / [ed] Elena Volodina, Gintarė Grigonytė, Ildikó Pilán, Kristina Nilsson Björkenstam, Lars Borin, Linköping: Linköping University Electronic Press, 2016, p. 41-50Conference paper (Refereed)
    Abstract [en]

    We present a language-independent tool, called Varseta, for extracting variation sets in child-directed speech. This tool is evaluated against a gold standard corpus annotated with variation sets, MINGLE-3-VS, and used to explore variation sets in 26 languages in CHILDES-26-VS, a comparable corpus derived from the CHILDES database. The tool and the resources are freely available for research.

  • 30.
    Wirén, Mats
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Nilsson Björkenstam, Kristina
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Grigonytė, Gintarė
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Cortes, Elisabet Eir
    Stockholm University, Faculty of Humanities, Department of Linguistics, Phonetics.
    Longitudinal Studies of Variation Sets in Child-directed Speech2016In: The 54th Annual Meeting of the Association for Computational Linguistics: Proceedings of the 7th Workshop on Cognitive Aspects of Computational Language Learning, Stroudsburg, PA, USA: Association for Computational Linguistics, 2016, p. 44-52Conference paper (Refereed)
    Abstract [en]

    One of the characteristics of child-directed speech is its high degree of repetitiousness. Sequences of repetitious utterances with a constant intention, variation sets, have been shown to be correlated with children’s language acquisition. To obtain a baseline for the occurrences of variation sets in Swedish, we annotate 18 parent–child dyads using a generalised definition according to which the varying form may pertain not just to the wording but also to prosody and/or non-verbal cues. To facilitate further empirical investigation, we introduce a surface algorithm for automatic extraction of variation sets which is easily replicable and language-independent. We evaluate the algorithm on the Swedish gold standard, and use it for extracting variation sets in Croatian, English and Russian. We show that the proportion of variation sets in child-directed speech decreases consistently as a function of children's age across Swedish, Croatian, English and Russian.
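
The abstract does not spell out the surface algorithm, so the following is only a rough sketch of one way a surface-based extraction could work. It assumes that successive utterances are compared by word overlap using Python's `difflib`, and that a hypothetical threshold decides whether an utterance continues the current set; the published algorithm may differ in all of these respects.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Word-level similarity ratio between two utterances (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()


def extract_variation_sets(utterances, threshold=0.5, min_size=2):
    """Group successive utterances that resemble the one before them.

    The threshold and min_size values are illustrative assumptions.
    """
    sets, current = [], []
    for utt in utterances:
        if current and similarity(current[-1], utt) >= threshold:
            current.append(utt)
        else:
            # The running group ends here; keep it only if it is large enough.
            if len(current) >= min_size:
                sets.append(current)
            current = [utt]
    if len(current) >= min_size:
        sets.append(current)
    return sets


# Made-up toy utterances, not corpus data.
utterances = [
    "look at the ball",
    "look at the red ball",
    "where is daddy",
    "do you want juice",
]
variation_sets = extract_variation_sets(utterances)
```

Because the sketch looks only at surface word forms, it needs no language-specific resources. It cannot capture the prosodic or non-verbal variation covered by the generalised definition used for the manual annotation.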

  • 31.
    Nilsson Björkenstam, Kristina
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Wirén, Mats
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Östling, Robert
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Modelling the informativeness and timing of non-verbal cues in parent–child interaction2016In: The 54th Annual Meeting of the Association for Computational Linguistics: Proceedings of the 7th Workshop on Cognitive Aspects of Computational Language Learning, Stroudsburg, PA, USA: Association for Computational Linguistics, 2016, p. 82-90Conference paper (Refereed)
    Abstract [en]

    How do infants learn the meanings of their first words? This study investigates the informativeness and temporal dynamics of non-verbal cues that signal the speaker's referent in a model of early word–referent mapping. To measure the information provided by such cues, a supervised classifier is trained on information extracted from a multimodally annotated corpus of 18 videos of parent–child interaction with three children aged 7 to 33 months. Contradicting previous research, we find that gaze is the single most informative cue, and we show that this finding can be attributed to our fine-grained temporal annotation. We also find that offsetting the timing of the non-verbal cues reduces accuracy, especially if the offset is negative. This is in line with previous research, and suggests that synchrony between verbal and non-verbal cues is important if they are to be perceived as causally related.

  • 32.
    Bjerva, Johannes
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics. University of Groningen.
    Börstell, Carl
    Stockholm University, Faculty of Humanities, Department of Linguistics, Sign Language.
    Morphological complexity influences Verb–Object order in Swedish Sign Language2016In: Proceedings of the 1st Workshop on Computational Linguistics for Linguistic Complexity (CL4LC) / [ed] Dominique Brunato, Felice Dell'Orletta, Giulia Venturi, Thomas François & Philippe Blache, Osaka: International Committee on Computational Linguistics (ICCL), 2016, p. 137-141Conference paper (Refereed)
    Abstract [en]

    Computational linguistic approaches to sign languages could benefit from investigating how complexity influences structure. We investigate whether morphological complexity has an effect on the order of Verb (V) and Object (O) in Swedish Sign Language (SSL), on the basis of elicited data from five Deaf signers. We find a significant difference in the distribution of the orderings OV vs. VO, based on an analysis of morphological weight. While morphologically heavy verbs exhibit a general preference for OV, humanness seems to affect the ordering in the opposite direction, with [+human] Objects pushing towards a preference for VO.

  • 33. Volodina, Elena
    et al.
    Grigonyté, Gintaré
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Pilán, Ildikó
    Nilsson Björkenstam, Kristina
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Borin, Lars
    Proceedings of the joint workshop on NLP for Computer Assisted Language Learning and NLP for Language Acquisition at SLTC: Umeå 16th November 20162016Conference proceedings (editor) (Refereed)
  • 34. Schneider, Gerold
    et al.
    Grigonyte, Gintare
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Statistical sequence and parsing models for descriptive linguistics and psycholinguistics2016In: New Approaches to English Linguistics: Building bridges / [ed] Olga Timofeeva, Anne-Christine Gardner, Alpo Honkapohja, Sarah Chevalier, John Benjamins Publishing Company, 2016, p. 281-320Chapter in book (Refereed)
    Abstract [en]

    This study shows that using computational linguistic models is beneficial for descriptive linguistics and psycholinguistics. It applies two models to various English genres and learner language: 1) surprisal and 2) a syntactic parser, allowing us to investigate the role of ambiguity and the interplay between idiom and syntax principles. We find that surprisal and ambiguity are higher for learner language, while parser scores and model fit are lower. In addition, the random application of alternations leads to more ambiguous sentences. Failures to generate optimal orderings in the sense of relevance theory, such as those exhibited in nonnative-like utterances by language learners, increase processing load for both human and automatic processors. As human and automatic parsing difficulties correlate, we suggest syntactic parsers as psycholinguistic processing models.

  • 35.
    Östling, Robert
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Studying colexification through massively parallell corpora2016In: The Lexical Typology of Semantic Shifts / [ed] Päivi Juvonen, Maria Koptjevskaja-Tamm, Berlin: Walter de Gruyter, 2016, p. 157-176Chapter in book (Refereed)
    Abstract [en]

    Large-sample studies in lexical typology are limited by whatever lexical information is available or can be obtained for all the languages in the study. Various types of word lists, from simple Swadesh lists to large dictionaries, can be used for this purpose. Unfortunately, these resources often present only a very fragmentary view of a given language’s vocabulary. As a complement, we propose an additional source of lexical information: parallel texts. Books such as the New Testament have been translated into thousands of languages, and it is possible to automatically extract word lists from their vocabulary, which can then be applied to lexical typological studies. In particular, we focus on studying colexification using a sample of 1 001 different languages, based on 1 142 translations of the New Testament. We find that although the automatically extracted word lists contain errors, their quality can be sufficiently good to find real areal patterns, such as the ‘tree’/‘fire’ colexification that is widespread in the Sahul area.
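
Once word lists are available, checking for colexification amounts to asking whether two concepts share a word form in a given language. The sketch below assumes a hypothetical data layout, a dictionary from language to concept to a set of word forms. The language names and forms are made-up placeholders, not data from the study.

```python
def colexifying_languages(lexicons, concept_a, concept_b):
    """Return the languages in which the two concepts share a word form."""
    return sorted(
        lang
        for lang, lexicon in lexicons.items()
        # A non-empty intersection means one form expresses both concepts.
        if lexicon.get(concept_a, set()) & lexicon.get(concept_b, set())
    )


# Placeholder lexicons: "lang1" uses one form for both concepts.
lexicons = {
    "lang1": {"tree": {"aba"}, "fire": {"aba"}},
    "lang2": {"tree": {"miti"}, "fire": {"moto"}},
    "lang3": {"tree": {"ka", "kayu"}, "fire": {"ka"}},
}
shared = colexifying_languages(lexicons, "tree", "fire")
```

Allowing several forms per concept reflects the noise in automatically extracted word lists, where a concept may be aligned to more than one candidate form.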

  • 36.
    Grigonyte, Gintare
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Kvist, Maria
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Wirén, Mats
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Velupillai, Sumithra
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Henriksson, Aron
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Swedification patterns of Latin and Greek affixes in clinical text2016In: Nordic Journal of Linguistics, ISSN 0332-5865, E-ISSN 1502-4717, Vol. 39, no 1, p. 5-37Article in journal (Refereed)
    Abstract [en]

    Swedish medical language is rich with Latin and Greek terminology which has undergone a Swedification since the 1980s. However, many original expressions are still used by clinical professionals. The goal of this study is to obtain precise quantitative measures of how the foreign terminology is manifested in Swedish clinical text. To this end, we explore the use of Latin and Greek affixes in Swedish medical texts in three genres: clinical text, scientific medical text and online medical information for laypersons. More specifically, we use frequency lists derived from tokenised Swedish medical corpora in the three domains, and extract word pairs belonging to types that display both the original and Swedified spellings. We describe six distinct patterns explaining the variation in the usage of Latin and Greek affixes in clinical text. The results show that to a large extent affixes in clinical text are Swedified and that prefixes are used more conservatively than suffixes.
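
The pair extraction step described in this abstract can be sketched as follows. The rewrite rules (`ph` to `f`, `th` to `t`, `c` to `k`) are illustrative assumptions and not the patterns identified in the study. The frequency list is a made-up toy example.

```python
# Hypothetical rewrite rules mapping an original spelling to a Swedified one.
RULES = [("ph", "f"), ("th", "t"), ("c", "k")]


def swedify(word):
    """Apply every rewrite rule to the word, in order."""
    for old, new in RULES:
        word = word.replace(old, new)
    return word


def find_spelling_pairs(freq):
    """Find words whose Swedified variant also occurs in the frequency list.

    Returns tuples of (original, original count, Swedified, Swedified count).
    """
    pairs = []
    for word, count in freq.items():
        candidate = swedify(word)
        if candidate != word and candidate in freq:
            pairs.append((word, count, candidate, freq[candidate]))
    return sorted(pairs)


# Toy frequency list with invented counts.
freq = {
    "cardiologi": 3,
    "kardiologi": 12,
    "nefropathi": 1,
    "nefropati": 7,
    "feber": 20,
}
pairs = find_spelling_pairs(freq)
```

Comparing the two counts in each pair gives a simple indication of which spelling dominates in a corpus. Blanket string replacement will overgenerate on real data, so a realistic version would need position-sensitive rules.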

  • 37. Cap, Fabienne
    et al.
    Adesam, Yvonne
    Ahrenberg, Lars
    Borin, Lars
    Bouma, Gerlof
    Forsberg, Markus
    Kann, Viggo
    Östling, Robert
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Smith, Aaron
    Wirén, Mats
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Nivre, Joakim
    SWORD: Towards Cutting-Edge Swedish Word Processing2016In: Proceedings of SLTC 2016, 2016Conference paper (Refereed)
    Abstract [en]

    Despite many years of research on Swedish language technology, there is still no well-documented standard for Swedish word processing covering the whole spectrum from low-level tokenization to morphological analysis and disambiguation. SWORD is a new initiative within the SWE-CLARIN consortium aiming to develop documented standards for Swedish word processing. In this paper, we report on a pilot study of Swedish tokenization, where we compare the output of six different tokenizers on four different text types. For one text type (Wikipedia articles), we also compare to the tokenization produced by six manual annotators.

  • 38.
    Börstell, Carl
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Sign Language.
    Wirén, Mats
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Mesch, Johanna
    Stockholm University, Faculty of Humanities, Department of Linguistics, Sign Language.
    Gärdenfors, Moa
    Stockholm University, Faculty of Humanities, Department of Linguistics, Sign Language.
    Towards an Annotation of Syntactic Structure in the Swedish Sign Language Corpus2016In: Workshop Proceedings: 7th Workshop on the Representation and Processing of Sign Languages: Corpus Mining / [ed] Eleni Efthimiou, Stavroula-Evita Fotinea, Thomas Hanke, Julie Hochgesang, Jette Kristoffersen, Johanna Mesch, Paris: ELRA, 2016, p. 19-24Conference paper (Refereed)
    Abstract [en]

    This paper describes on-going work on extending the annotation of the Swedish Sign Language Corpus (SSLC) with a level of syntactic structure. The basic annotation of SSLC in ELAN consists of six tiers: four for sign glosses (two tiers for each signer; one for each of a signer’s hands), and two for written Swedish translations (one for each signer). In an additional step by Östling et al. (2015), all glosses of the corpus have been further annotated for parts of speech. Building on the previous steps, we are now developing annotation of clause structure for the corpus, based on meaning and form. We define a clause as a unit in which a predicate asserts something about one or more elements (the arguments). The predicate can be a (possibly serial) verbal or nominal. In addition to predicates and their arguments, criteria for delineating clauses include non-manual features such as body posture, head movement and eye gaze. The goal of this work is to arrive at two additional annotation tier types in the SSLC: one in which the sign language texts are segmented into clauses, and the other in which the individual signs are annotated for their argument types.

  • 39.
    Östling, Robert
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Börstell, Carl
    Stockholm University, Faculty of Humanities, Department of Linguistics, General Linguistics.
    Wallin, Lars
    Stockholm University, Faculty of Humanities, Department of Linguistics, Sign Language.
    Enriching the Swedish Sign Language Corpus with Part of Speech Tags Using Joint Bayesian Word Alignment and Annotation Transfer2015In: Proceedings of the 20th Nordic Conference of Computational Linguistics: NODALIDA 2015, May 11-13, 2015, Vilnius, Lithuania / [ed] Beáta Megyesi, Linköping University Electronic Press, 2015, p. 263-268Conference paper (Refereed)
    Abstract [en]

    We have used a novel Bayesian model of joint word alignment and part of speech (PoS) annotation transfer to enrich the Swedish Sign Language Corpus with PoS tags. The annotations were then hand-corrected in order to both improve annotation quality for the corpus, and allow the empirical evaluation presented herein.

  • 40. Berggren, Max
    et al.
    Karlgren, Jussi
    Östling, Robert
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Parkvall, Mikael
    Stockholm University, Faculty of Humanities, Department of Linguistics, General Linguistics.
    Inferring the location of authors from words in their texts2015In: Proceedings of the 20th Nordic Conference of Computational Linguistics: NODALIDA 2015 / [ed] Beáta Megyesi, Linköping: Linköping University Electronic Press, ACL Anthology, 2015, p. 211-218Conference paper (Refereed)
    Abstract [en]

    For the purposes of computational dialectology or other geographically bound text analysis tasks, texts must be annotated with their or their authors' location. Many texts are locatable but most have no explicit annotation of place. This paper describes a series of experiments to determine how positionally annotated microblog posts can be used to learn location-indicating words which then can be used to locate blog texts and their authors. A Gaussian distribution is used to model the locational qualities of words. We introduce the notion of placeness to describe how locational words are.

    We find that modelling word distributions to account for several locations and thus several Gaussian distributions per word, defining a filter which picks out words with high placeness based on their local distributional context, and aggregating locational information in a centroid for each text gives the most useful results. The results are applied to data in the Swedish language.

  • 41. Bielinskiene, Agne
    et al.
    Boizou, Loic
    Grigonyté, Gintaré
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Kovalevskaite, Jolanta
    Rimkute, Erika
    Utka, Andrius
    Lietuvių kalbos terminų automatinis atpažinimas ir apibrėžimas2015 (ed. 1)Book (Refereed)
    Abstract [en]

    This book presents the most recent advances in the field of Lithuanian terminology extraction as well as the first attempt at automatic extraction of Lithuanian term-defining contexts. The first work in descriptive terminology by Lithuanian researchers appeared in the early 2000s, i.e. R. Marcinkevičienė (2000) and I. Zeller (dissertation "Term recognition and their analysis", 2005). Nevertheless, the larger proportion of research on Lithuanian terminology is still dominated by the prescriptive view, in which much attention is given to principles and norms of terminology, as well as to diachronic aspects of terminology. Chapter 1 describes the differences between descriptive and prescriptive terminology. The authors want to emphasize that prescriptive terminology involves standardisation and approval of terms, with decisions based on existing terminology dictionaries, documents, standards, lexicons and databases of approved terms. In corpus-based terminology management, which is one of the branches of descriptive terminology, the main focus is instead placed on the usage of terms in natural language in a corpus, rather than on standardisation. The empirical research approaches benefit from various automatic term analysis and term extraction tools, which come in handy in corpus-based terminology management. New terminology research has shown that it is very important to harmonize the methods of prescriptive and descriptive terminology. The combination of both methods allows faster processing of ever-growing data, which is very relevant to the challenges of modern lexicography, including the quick and efficient creation of dynamic lexicographical sources.

  • 42.
    Nilsson Björkenstam, Kristina
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Östling, Robert
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Wirén, Mats
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Modelling the informativeness of different modalities in parent-child interaction2015In: Workshop on Extensive and Intensive Recordings of Children's Language Environment / [ed] Alex Cristia, Melanie Soderstrom, 2015Conference paper (Refereed)
  • 43.
    Grigonyte, Gintare
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Clematide, Simon
    University of Zurich.
    Volk, Martin
    University of Zurich.
    Utka, Andrius
    Vytautas Magnus University.
    Proceedings of the Workshop on Innovative Corpus Query and Visualization Tools, NODALIDA 20152015Conference proceedings (editor) (Refereed)
    Abstract [en]

    Recent years have seen an increased interest in and availability of many different kinds of corpora. These range from small, but carefully annotated treebanks to large parallel corpora and very large monolingual corpora for big data research.

    It remains a challenge to offer flexible and powerful query tools for multilayer annotations of small corpora. When dealing with large corpora, query tools also need to scale in terms of processing speed and reporting through statistical information and visualization options. This becomes evident, for example, when dealing with very large corpora (such as complete Wikipedia corpora) or multi-parallel corpora (such as Europarl or JRC Acquis).

    The QueryVis workshop has gathered researchers who develop and evaluate new corpus query and visualization tools for linguistics, language technology and related disciplines. The papers focus on the design of query languages, and on various new visualization options for monolingual and parallel corpora, both for written and spoken language.

    We hope that QueryVis will stimulate discussions and trigger new ideas for the workshop participants and any reader of the proceedings. The preparation of the workshop and the reviewing of the submissions has already been an inspiring experience.

    All papers were peer-reviewed by three program committee members. We would like to thank all reviewers and contributors for their work and for sharing their thoughts and experiences with us.

    Let us all join our forces to make corpus exploration a rewarding, entertaining, and exciting experience which will grant us ever new insights into language and thought.

  • 44.
    Östling, Robert
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Svenska dialektkartor på sekunden2015In: Språkbruk, ISSN 0358-9293, Vol. 3, p. 10-13Article in journal (Other (popular science, discussion, etc.))
  • 45.
    Nilsson Björkenstam, Kristina
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Wirén, Mats
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Variation sets in child-directed speech2015In: / [ed] Ellen Marklund, Iris-Corinna Schwarz, 2015Conference paper (Refereed)
  • 46.
    Cortes, Elisabet Eir
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Phonetics.
    Gerholm, Tove
    Stockholm University, Faculty of Humanities, Department of Linguistics, General Linguistics.
    Marklund, Ellen
    Stockholm University, Faculty of Humanities, Department of Linguistics, Phonetics.
    Marklund, Ulrika
    Stockholm University, Faculty of Humanities, Department of Linguistics, Phonetics.
    Molnar, Monika
    Nilsson Björkenstam, Kristina
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Schwarz, Iris-Corinna
    Stockholm University, Faculty of Humanities, Department of Linguistics, Phonetics.
    Sjons, Johan
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    WILD 2015: Book of Abstracts2015Conference proceedings (editor) (Other academic)
    Abstract [en]

    WILD 2015 is the second Workshop on Infant Language Development, held June 10-12 2015 in Stockholm, Sweden. WILD 2015 was organized by Stockholm Babylab and the Department of Linguistics, Stockholm University. About 150 delegates met over three conference days, convening on infant speech perception, social factors of language acquisition, bilingual language development in infancy, early language comprehension and lexical development, neurodevelopmental aspects of language acquisition, methodological issues in infant language research, modeling infant language development, early speech production, and infant-directed speech. Keynote speakers were Alejandrina Cristia, Linda Polka, Ghislaine Dehaene-Lambertz, Angela D. Friederici and Paula Fikkert.

    Organizing this conference would of course not have been possible without our funding agencies Vetenskapsrådet and Riksbankens Jubiléumsfond. We would like to thank Francisco Lacerda, Head of the Department of Linguistics, and the Departmental Board for agreeing to host WILD this year. We would also like to thank the administrative staff for their help and support in this undertaking, especially Ann Lorentz-Baarman and Linda Habermann.

    The WILD 2015 Organizing Committee: Ellen Marklund, Iris-Corinna Schwarz, Elísabet Eir Cortes, Johan Sjons, Ulrika Marklund, Tove Gerholm, Kristina Nilsson Björkenstam and Monika Molnar.

  • 47.
    Östling, Robert
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Word order typology through multilingual word alignment2015In: The 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing: Proceedings of the Conference, Volume 2: Short Papers, 2015, p. 205-211Conference paper (Refereed)
    Abstract [en]

    With massively parallel corpora of hundreds or thousands of translations of the same text, it is possible to automatically perform typological studies of language structure using very large language samples. We investigate the domain of word order using multilingual word alignment and high-precision annotation transfer in a corpus with 1144 translations in 986 languages of the New Testament. Results are encouraging, with 86% to 96% agreement between our method and the manually created WALS database for a range of different word order features. Beyond reproducing the categorical data in WALS and extending it to hundreds of other languages, we also provide quantitative data for the relative frequencies of different word orders, and show the usefulness of this for language comparison. Our method has applications for basic research in linguistic typology, as well as for NLP tasks like transfer learning for dependency parsing, which has been shown to benefit from word order information.

  • 48.
    Grigonyte, Gintare
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Baldwin, Timothy
    University of Melbourne.
    Automatic Detection of Multilingual Dictionaries on the Web2014In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, 2014, p. 93-98Conference paper (Refereed)
  • 49.
    Östling, Robert
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Bayesian Word Alignment for Massively Parallel Texts. 2014. In: Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, volume 2: Short Papers, Association for Computational Linguistics, 2014, p. 123-127. Conference paper (Refereed)
    Abstract [en]

    There has been a great amount of work done in the field of bitext alignment, but the problem of aligning words in massively parallel texts with hundreds or thousands of languages is largely unexplored. While the basic task is similar, there are also important differences in purpose, method and evaluation between the problems. In this work, I present a non-parametric Bayesian model that can be used for simultaneous word alignment in massively parallel corpora. This method is evaluated on a corpus containing 1144 translations of the New Testament.

  • 50.
    Grigonyte, Gintare
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Schneider, Gerold
    From lexical bundles to surprisal: Measuring the idiom principle. 2014. In: Lexical bundles in English non-fiction writing: forms and functions, 2014. Conference paper (Refereed)
    Abstract [en]

    Lexical bundles (LB) testify to Sinclair's idiom principle (SIP), and measure formulaicity, complexity and (non-)creativity (FCN). We exploit the information-theoretic measure of surprisal to analyze these.

    Frequency as a measure of LB has been criticized (McEnery et al., 2006:208–220); instead, collocation measures were suggested, until Biber (2009:286–290) raised three criticisms. First, mutual information (MI) ranks rare collocations, which often include idioms, highest. We answer that idioms are also formulaic, and there are collocation measures which have a bias towards frequent collocations.

    Second, MI doesn't respect word order. We thus use directed word transition probabilities like surprisal (Levy and Jaeger 2007):

    3-gram surprisal =

    Third, formulaic sequences are often discontinuous. We thus sum over sequences, use 3-grams as atoms, and address syntactic surprisal.

    We argue that abstracting to surprisal as a measure of LB and FCN is appropriate, as it expresses reader expectations and text entropy. We use surprisal to analyse differences between:

    1. spoken and written learner language (L2);
    2. L2 across proficiency levels;
    3. L2 compared with L1

    We test the hypothesis of Pawley and Syder (1983) and Levy and Jaeger (2007) that native speakers best manage the tug-of-war between formulaicity and expressiveness, thus minimizing comprehension difficulty, in line with the uniform information density principle.
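
    As a rough illustration of the kind of measure this abstract describes (not the authors' implementation; the exact formula is missing from this listing), the sketch below assumes that the surprisal of a word is the negative base-2 log probability of that word given the two preceding words, and that sentence surprisal is the sum over all positions. The function names, the add-one smoothing and the toy corpus are assumptions made for illustration only.

```python
import math
from collections import Counter

def train_trigram_counts(sentences):
    """Count trigrams and their two-word contexts over tokenized sentences."""
    trigrams, contexts, vocab = Counter(), Counter(), set()
    for tokens in sentences:
        padded = ["<s>", "<s>"] + list(tokens) + ["</s>"]
        vocab.update(padded)
        for i in range(2, len(padded)):
            context = (padded[i - 2], padded[i - 1])
            trigrams[context + (padded[i],)] += 1
            contexts[context] += 1
    return trigrams, contexts, vocab

def trigram_surprisal(tokens, trigrams, contexts, vocab):
    """Sum of -log2 P(w_i | w_{i-2}, w_{i-1}), with add-one smoothing (an assumption)."""
    padded = ["<s>", "<s>"] + list(tokens) + ["</s>"]
    total = 0.0
    for i in range(2, len(padded)):
        context = (padded[i - 2], padded[i - 1])
        count = trigrams[context + (padded[i],)]
        prob = (count + 1) / (contexts[context] + len(vocab))
        total += -math.log2(prob)
    return total

# Toy corpus (hypothetical data): two formulaic sequences.
corpus = [
    "at the end of the day".split(),
    "at the end of the week".split(),
]
trigrams, contexts, vocab = train_trigram_counts(corpus)

formulaic = trigram_surprisal("at the end of the day".split(), trigrams, contexts, vocab)
scrambled = trigram_surprisal("day the of end the at".split(), trigrams, contexts, vocab)
print(formulaic < scrambled)
```

    Under these assumptions, a sequence that follows the word transitions seen in training receives lower total surprisal than the same words in scrambled order, which is the sense in which a directed measure respects word order.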

  • 51.
    Schneider, Gerold
    et al.
    Grigonyte, Gintare
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    From surprisal to tagging and syntactic parsing: measuring the idiom and syntax principle. 2014. Conference paper (Refereed)
    Abstract [en]

    We introduced surprisal as an abstraction from lexical bundles to lexical bundleness. There are forces beyond lexical bundles: on the one hand word-sequence abstractions to word classes, on the other hand the syntax principle (SSP) in contradistinction to the idiom principle (SIP). We ultimately aim for a model of their mutual influence (Sinclair 1991). We motivate the use of models, then abstract to word-class models using a part-of-speech tagger, and to syntactic models, using a large-scale parser. Part-of-speech taggers assign word classes based on sequences. They typically achieve high accuracy. Areas of low accuracy and low tagger confidence for word class assignment indicate low model fit, and thus often high entropy and a lack of formulaic sequences. Tagger model fit can be used as a measure of morphosyntactic bundleness. Although creative language (SSP) is rarer, it needs to be respected. We thus also use a syntactic parser language model (Schneider 2008) which combines SSP in the form of a hand-written competence grammar and SIP as probabilistic performance disambiguation, paying tribute to Hoey (2005)'s insights on lexical priming. We show that parser model fit is lower on low-level L2 texts, as we can expect according to Pawley and Syder (1983). Finally, we introduce measures of syntactic surprisal.

  • 52.
    Utka, A.
    et al.
    Grigonyté, Gintaré
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Kapočiūtė-Dzikienė, J.