  • 1. Berggren, Max
    et al.
    Karlgren, Jussi
    Östling, Robert
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Parkvall, Mikael
    Stockholm University, Faculty of Humanities, Department of Linguistics, General Linguistics.
    Inferring the location of authors from words in their texts. 2015. In: Proceedings of the 20th Nordic Conference of Computational Linguistics: NODALIDA 2015 / [ed] Beáta Megyesi, Linköping: Linköping University Electronic Press, ACL Anthology, 2015, p. 211-218. Conference paper (Refereed)
    Abstract [en]

    For the purposes of computational dialectology or other geographically bound text analysis tasks, texts must be annotated with their or their authors' location. Many texts are locatable but most have no explicit annotation of place. This paper describes a series of experiments to determine how positionally annotated microblog posts can be used to learn location-indicating words which then can be used to locate blog texts and their authors. A Gaussian distribution is used to model the locational qualities of words. We introduce the notion of placeness to describe how locational words are.

    We find that modelling word distributions to account for several locations and thus several Gaussian distributions per word, defining a filter which picks out words with high placeness based on their local distributional context, and aggregating locational information in a centroid for each text gives the most useful results. The results are applied to data in the Swedish language.

  • 2. Bielinskiene, Agne
    et al.
    Boizou, Loic
    Grigonyte, Gintare
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Kovalevskaite, Jolanta
    Markievicz, Irena
    Rimkute, Erika
    Utka, Andrius
    Viliunas, Giedrius
    Švietimo ir mokslo terminų žodynas (Dictionary of Terms of Science and Education). 2013. Other (Other academic)
  • 3. Bielinskiene, Agne
    et al.
    Boizou, Loic
    Grigonyté, Gintaré
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Kovalevskaite, Jolanta
    Rimkute, Erika
    Utka, Andrius
    Lietuvių kalbos terminų automatinis atpažinimas ir apibrėžimas. 2015 (ed. 1). Book (Refereed)
    Abstract [en]

    This book presents the most recent advances in the field of Lithuanian terminology extraction as well as the first attempt at automatic extraction of Lithuanian term-defining contexts. The first work in descriptive terminology by Lithuanian researchers appeared in the early 2000s, i.e. R. Marcinkevičienė (2000) and I. Zeller (dissertation "Term recognition and their analysis", 2005). Nevertheless, the larger proportion of research on Lithuanian terminology is still dominated by the prescriptive view, in which much attention and research is given to principles and norms of terminology, as well as diachronic aspects of terminology. Chapter 1 describes the differences between descriptive and prescriptive terminology. The authors want to emphasize that prescriptive terminology involves standardisation and approval of terms, with decisions based on existing terminology dictionaries, documents, standards, lexicons and databases of approved terms. In corpus-based terminology management, by contrast, which is one of the branches of descriptive terminology, the main focus is placed on the usage of terms in natural language in a corpus, rather than on standardisation. Empirical research approaches benefit from various automatic term analysis and term extraction tools, which come in handy in corpus-based terminology management. New terminology research has shown that it is very important to harmonize the methods of prescriptive and descriptive terminology. The combination of both methods allows faster processing of ever-growing data, which is very relevant to the challenges of modern lexicography, including quick and efficient creation of dynamic lexicographical sources.

  • 4.
    Bjerva, Johannes
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Genetic Algorithms in the Brill Tagger: Moving towards language independence. 2013. Independent thesis Advanced level (degree of Master (One Year)), 10 credits / 15 HE credits. Student thesis
    Abstract [en]

    The viability of using rule-based systems for part-of-speech tagging was revitalised when a simple rule-based tagger was presented by Brill (1992). This tagger is based on an algorithm which automatically derives transformation rules from a corpus, using an error-driven approach. In addition to performing on par with state of the art stochastic systems for part-of-speech tagging, it has the advantage that the automatically derived rules can be presented in a human-readable format.

    In spite of its strengths, the Brill tagger is quite language dependent, and performs much better on languages similar to English than on languages with richer morphology. This issue is addressed in this paper through defining rule templates automatically with a search that is optimised using Genetic Algorithms. This allows the Brill GA-tagger to search a large search space for templates which in turn generate rules which are appropriate for various target languages, which has the added advantage of removing the need for researchers to define rule templates manually.

    The Brill GA-tagger performs significantly better (p<0.001) than the standard Brill tagger on all 9 target languages (Chinese, Japanese, Turkish, Slovene, Portuguese, English, Dutch, Swedish and Icelandic), with an error rate reduction of between 2% and 15% for each language.
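
    The "error rate reduction" quoted in this abstract is a relative measure: the share of the baseline tagger's errors that the new tagger removes. A minimal sketch of that arithmetic follows; the function name and the accuracy figures in the usage note are illustrative assumptions, not values from the thesis.

```python
def error_rate_reduction(baseline_accuracy: float, new_accuracy: float) -> float:
    """Relative reduction in error rate when moving from a baseline to a new system.

    Both arguments are accuracies in the range [0, 1].
    """
    baseline_error = 1.0 - baseline_accuracy
    new_error = 1.0 - new_accuracy
    if baseline_error <= 0.0:
        raise ValueError("the baseline makes no errors, so there is nothing to reduce")
    # Fraction of the baseline's errors that the new system eliminates.
    return (baseline_error - new_error) / baseline_error
```

    For example, with hypothetical accuracies of 0.90 for the baseline and 0.91 for the new tagger, `error_rate_reduction(0.90, 0.91)` is approximately 0.10, i.e. a 10% error rate reduction from a one-point gain in accuracy.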

  • 5.
    Bjerva, Johannes
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Predicting the N400 Component in Manipulated and Unchanged Texts with a Semantic Probability Model. 2012. Independent thesis Basic level (degree of Bachelor), 10 credits / 15 HE credits. Student thesis
    Abstract [en]

    Within the field of computational linguistics, recent research has made successful advances in integrating word space models with n-gram models. This is of particular interest when a model that encapsulates both semantic and syntactic information is desirable. A potential application for this can be found in the field of psycholinguistics, where the neural response N400 has been found to occur in contexts with semantic incongruities. Previous research has found correlations between cloze probabilities and N400, while more recent research has found correlations between cloze probabilities and language models.

    This essay attempts to uncover whether or not a more direct connection between integrated models and N400 can be found, hypothesizing that low probabilities elicit strong N400 responses and vice versa. In an EEG experiment, participants read a text manipulated using a language model, and a text left unchanged. Analysis of the results shows that the manipulations to some extent yielded results supporting the hypothesis. Further results are found when analysing responses to the unchanged text. However, no significant correlations between N400 and the computational model are found. Future research should improve the experimental paradigm, so that a larger scale EEG recording can be used to construct a large EEG corpus.

  • 6.
    Bjerva, Johannes
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics. University of Groningen.
    Börstell, Carl
    Stockholm University, Faculty of Humanities, Department of Linguistics, Sign Language.
    Morphological complexity influences Verb–Object order in Swedish Sign Language. 2016. In: Proceedings of the 1st Workshop on Computational Linguistics for Linguistic Complexity (CL4LC) / [ed] Dominique Brunato, Felice Dell'Orletta, Giulia Venturi, Thomas François & Philippe Blache, Osaka: International Committee on Computational Linguistics (ICCL), 2016, p. 137-141. Conference paper (Refereed)
    Abstract [en]

    Computational linguistic approaches to sign languages could benefit from investigating how complexity influences structure. We investigate whether morphological complexity has an effect on the order of Verb (V) and Object (O) in Swedish Sign Language (SSL), on the basis of elicited data from five Deaf signers. We find a significant difference in the distribution of the orderings OV vs. VO, based on an analysis of morphological weight. While morphologically heavy verbs exhibit a general preference for OV, humanness seems to affect the ordering in the opposite direction, with [+human] Objects pushing towards a preference for VO.

  • 7.
    Bjerva, Johannes
    et al.
    University of Groningen.
    Grigonyte, Gintare
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Östling, Robert
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Plank, Barbara
    University of Groningen.
    Neural Networks and Spelling Features for Native Language Identification. 2017. In: Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications, Association for Computational Linguistics, 2017, p. 235-239. Conference paper (Refereed)
    Abstract [en]

    We present the RUG-SU team's submission at the Native Language Identification Shared Task 2017. We combine several approaches into an ensemble, based on spelling error features, a simple neural network using word representations, a deep residual network using word and character features, and a system based on a recurrent neural network. Our best system is an ensemble of neural networks, reaching an F1 score of 0.8323. Although our system is not the highest ranking one, we do outperform the baseline by far.

  • 8. Bjerva, Johannes
    et al.
    Östling, Robert
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Cross-lingual Learning of Semantic Textual Similarity with Multilingual Word Representations. 2017. In: Proceedings of the 21st Nordic Conference on Computational Linguistics / [ed] Jörg Tiedemann, Linköping: Linköping University Electronic Press, 2017, p. 211-215, article id 024. Conference paper (Refereed)
    Abstract [en]

    Assessing the semantic similarity between sentences in different languages is challenging. We approach this problem by leveraging multilingual distributional word representations, where similar words in different languages are close to each other. The availability of parallel data allows us to train such representations on a large number of languages. This allows us to leverage semantic similarity data for languages for which no such data exists. We train and evaluate on five language pairs, including English, Spanish, and Arabic. We are able to train well-performing systems for several language pairs, without any labelled data for that language pair.

  • 9.
    Byström, Emil
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Knowledge-based Coreference Resolution in Swedish. 2012. Independent thesis Basic level (degree of Bachelor), 10 credits / 15 HE credits. Student thesis
    Abstract [en]

    Automatic coreference resolution is the automatic identification of expressions with the same referents. The state-of-the-art systems are data-driven and based on machine learning algorithms. Data-driven approaches to coreference resolution require large amounts of annotated data, which is time-consuming and expensive to obtain. Haghighi and Klein [1] present a knowledge-based approach where coreference is resolved with heuristics using rich syntactic and semantic features. Haghighi and Klein's system is interesting because its performance is in line with data-driven systems and its requirement for annotated data is low. In the present study, a knowledge-based system for coreference resolution in Swedish was implemented and its performance evaluated. The system is based on the system of Haghighi and Klein. To be able to evaluate and implement the algorithm, a database annotated with coreferential chains is needed. As there is no freely available resource with data annotated with coreference in Swedish, the annotation of the gold standard part of SUC 2.0 is also described. Results from the evaluation of the implementation show that the syntactic and semantic filters implemented did not improve baseline results. The filters falsely allow or constrain coreference as insufficient linguistic information is available. It is argued that focusing on rich syntactic and semantic features improves future work on knowledge-based coreference resolution in Swedish.

  • 10.
    Börstell, Carl
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Sign Language.
    Hörberg, Thomas
    Stockholm University, Faculty of Humanities, Department of Linguistics, General Linguistics.
    Östling, Robert
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Distribution and duration of signs and parts of speech in Swedish Sign Language. 2016. In: Sign Language and Linguistics, ISSN 1387-9316, E-ISSN 1569-996X, Vol. 19, no 2, p. 143-196. Article in journal (Refereed)
    Abstract [en]

    In this paper, we investigate frequency and duration of signs and parts of speech in Swedish Sign Language (SSL) using the SSL Corpus. The duration of signs is correlated with frequency, with high-frequency items having shorter duration than low-frequency items. Similarly, function words (e.g. pronouns) have shorter duration than content words (e.g. nouns). In compounds, forms annotated as reduced display shorter duration. Fingerspelling duration correlates with word length of corresponding Swedish words, and frequency and word length play a role in the lexicalization of fingerspellings. The sign distribution in the SSL Corpus shows a great deal of cross-linguistic similarity with other sign languages in terms of which signs appear as high-frequency items, and which categories of signs are distributed across text types (e.g. conversation vs. narrative). We find a correlation between an increase in age and longer mean sign duration, but see no significant difference in sign duration between genders.

  • 11.
    Börstell, Carl
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Sign Language.
    Wirén, Mats
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Mesch, Johanna
    Stockholm University, Faculty of Humanities, Department of Linguistics, Sign Language.
    Gärdenfors, Moa
    Stockholm University, Faculty of Humanities, Department of Linguistics, Sign Language.
    Towards an Annotation of Syntactic Structure in the Swedish Sign Language Corpus. 2016. In: Workshop Proceedings: 7th Workshop on the Representation and Processing of Sign Languages: Corpus Mining / [ed] Eleni Efthimiou, Stavroula-Evita Fotinea, Thomas Hanke, Julie Hochgesang, Jette Kristoffersen, Johanna Mesch, Paris: ELRA, 2016, p. 19-24. Conference paper (Refereed)
    Abstract [en]

    This paper describes on-going work on extending the annotation of the Swedish Sign Language Corpus (SSLC) with a level of syntactic structure. The basic annotation of SSLC in ELAN consists of six tiers: four for sign glosses (two tiers for each signer; one for each of a signer’s hands), and two for written Swedish translations (one for each signer). In an additional step by Östling et al. (2015), all glosses of the corpus have been further annotated for parts of speech. Building on the previous steps, we are now developing annotation of clause structure for the corpus, based on meaning and form. We define a clause as a unit in which a predicate asserts something about one or more elements (the arguments). The predicate can be a (possibly serial) verbal or nominal. In addition to predicates and their arguments, criteria for delineating clauses include non-manual features such as body posture, head movement and eye gaze. The goal of this work is to arrive at two additional annotation tier types in the SSLC: one in which the sign language texts are segmented into clauses, and the other in which the individual signs are annotated for their argument types.

  • 12.
    Börstell, Carl
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Sign Language.
    Östling, Robert
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Iconic Locations in Swedish Sign Language: Mapping Form to Meaning with Lexical Databases. 2017. In: Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa / [ed] Jörg Tiedemann, Linköping: Linköping University Electronic Press, 2017, p. 221-225, article id 026. Conference paper (Refereed)
    Abstract [en]

    In this paper, we describe a method for mapping the phonological feature location of Swedish Sign Language (SSL) signs to the meanings in the Swedish semantic dictionary SALDO. By doing so, we observe clear differences in the distribution of meanings associated with different locations on the body. The prominence of certain locations for specific meanings clearly points to iconic mappings between form and meaning in the lexicon of SSL, which pinpoints modality-specific properties of the visual modality.

  • 13. Cap, Fabienne
    et al.
    Adesam, Yvonne
    Ahrenberg, Lars
    Borin, Lars
    Bouma, Gerlof
    Forsberg, Markus
    Kann, Viggo
    Östling, Robert
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Smith, Aaron
    Wirén, Mats
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Nivre, Joakim
    SWORD: Towards Cutting-Edge Swedish Word Processing. 2016. In: Proceedings of SLTC 2016, 2016. Conference paper (Refereed)
    Abstract [en]

    Despite many years of research on Swedish language technology, there is still no well-documented standard for Swedish word processing covering the whole spectrum from low-level tokenization to morphological analysis and disambiguation. SWORD is a new initiative within the SWE-CLARIN consortium aiming to develop documented standards for Swedish word processing. In this paper, we report on a pilot study of Swedish tokenization, where we compare the output of six different tokenizers on four different text types. For one text type (Wikipedia articles), we also compare to the tokenization produced by six manual annotators.

  • 14.
    Cortes, Elisabet Eir
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Phonetics.
    Gerholm, Tove
    Stockholm University, Faculty of Humanities, Department of Linguistics, General Linguistics.
    Marklund, Ellen
    Stockholm University, Faculty of Humanities, Department of Linguistics, Phonetics.
    Marklund, Ulrika
    Stockholm University, Faculty of Humanities, Department of Linguistics, Phonetics.
    Molnar, Monika
    Nilsson Björkenstam, Kristina
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Schwarz, Iris-Corinna
    Stockholm University, Faculty of Humanities, Department of Linguistics, Phonetics.
    Sjons, Johan
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    WILD 2015: Book of Abstracts. 2015. Conference proceedings (editor) (Other academic)
    Abstract [en]

    WILD 2015 is the second Workshop on Infant Language Development, held June 10-12 2015 in Stockholm, Sweden. WILD 2015 was organized by Stockholm Babylab and the Department of Linguistics, Stockholm University. About 150 delegates met over three conference days, convening on infant speech perception, social factors of language acquisition, bilingual language development in infancy, early language comprehension and lexical development, neurodevelopmental aspects of language acquisition, methodological issues in infant language research, modeling infant language development, early speech production, and infant-directed speech. Keynote speakers were Alejandrina Cristia, Linda Polka, Ghislaine Dehaene-Lambertz, Angela D. Friederici and Paula Fikkert.

    Organizing this conference would of course not have been possible without our funding agencies Vetenskapsrådet and Riksbankens Jubiléumsfond. We would like to thank Francisco Lacerda, Head of the Department of Linguistics, and the Departmental Board for agreeing to host WILD this year. We would also like to thank the administrative staff for their help and support in this undertaking, especially Ann Lorentz-Baarman and Linda Habermann.

    The WILD 2015 Organizing Committee: Ellen Marklund, Iris-Corinna Schwarz, Elísabet Eir Cortes, Johan Sjons, Ulrika Marklund, Tove Gerholm, Kristina Nilsson Björkenstam and Monika Molnar.

  • 15.
    Drangert, Lisette
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Longitudinella förändringar av yttranden inom variationsmängder i barnriktat tal: En korpusstudie av yttrandetyper och verb. 2016. Independent thesis Basic level (degree of Bachelor), 10 credits / 15 HE credits. Student thesis
    Abstract [en]

    Variation sets are a feature in child-directed speech characterized by successive utterances in which the adult speaker repeats and reorders their message with a constant intent. The aim of this study was to investigate variation sets over time in speech directed to children aged 7-33 months. The purpose was to study which types of utterances dominate the variation sets at different ages, and which utterance types tend to co-occur within the variation sets. Furthermore, intent was studied, as well as change in verb tense in these variation sets. A script was written to categorize types of utterances with data from a corpus consisting of child-directed speech. A quantitative analysis was performed on the results based on four different age groups.

    The complexity of the utterances within variation sets was shown to grow with the increasing age of the children. Furthermore, a noticeable difference was observed in the intent of the adult speaker, correlating with the age of the child, and also a decrease in the use of interjections combined with yes/no questions and complex utterances the older the children were. A suggested interpretation of the result was that adults tend to take both sides of the conversation when the children are young, as opposed to when they speak to older, more verbal children who can provide the answer themselves.

  • 16.
    Ek, Adam
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Blending Words or: How I Learned to Stop Worrying and Love the Blendguage: A computational study of lexical blending in Swedish. 2018. Independent thesis Advanced level (degree of Master (One Year)), 10 credits / 15 HE credits. Student thesis
    Abstract [en]

    This thesis investigates Swedish lexical blends. A lexical blend is defined as the concatenation of two words, where at least one word has been reduced. Lexical blends are approached from two perspectives. First, the thesis investigates lexical blends as they appear in the Swedish language. It is found that there is a significant statistical relationship between the two source words in terms of orthographic, phonemic and syllabic length and frequency in a reference corpus. Furthermore, some uncommon lexical blends created from pronouns and interjections are described. Lexical blends are also described in terms of semantic construction and similarity to other word formation processes. Secondly, the thesis develops a model which predicts source words of lexical blends. To predict the source words, a logistic regression model is used. The evaluation shows that using a ranking approach, the correct source words are the highest ranking word pair in 32.2% of the cases. In the top 10 ranking word pairs, the correct word pair is found in 60.6% of the cases. The results are lower than in previous studies, but the number of blends used is also smaller. It is shown that lexical blends which overlap are easier to predict than lexical blends which do not overlap. Using feature ablation, it is shown that semantic and frequency-related features are the most important for the prediction of source words.

  • 17.
    Ek, Adam
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Wirén, Mats
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Östling, Robert
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Nilsson Björkenstam, Kristina
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Grigonytė, Gintarė
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Gustafson Capková, Sofia
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Identifying Speakers and Addressees in Dialogues Extracted from Literary Fiction. 2018. In: 11th edition of the Language Resources and Evaluation Conference, European Language Resources Association, 2018. Conference paper (Refereed)
    Abstract [en]

    This paper describes an approach to identifying speakers and addressees in dialogues extracted from literary fiction, along with a dataset annotated for speaker and addressee. The overall purpose of this is to provide annotation of dialogue interaction between characters in literary corpora in order to allow for enriched search facilities and construction of social networks from the corpora. To predict speakers and addressees in a dialogue, we use a sequence labeling approach applied to a given set of characters. We use features relating to the current dialogue, the preceding narrative, and the complete preceding context. The results indicate that even with a small amount of training data, it is possible to build a fairly accurate classifier for speaker and addressee identification across different authors, though the identification of addressees is the more difficult task.

  • 18. Eklund, Robert
    et al.
    Wirén, Mats
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Effects of open and directed prompts on filled pauses and utterance production. 2010. In: Proceedings from Fonetik 2010, Lund, June 2–4, 2010 / [ed] Susanne Schötz and Gilbert Ambrazaitis, Lund: Mediatryck, 2010, p. 23-28. Conference paper (Other academic)
    Abstract [en]

    This paper describes an experiment where open and directed prompts were alternated when collecting speech data for the deployment of a call-routing application. The experiment tested whether open and directed prompts resulted in any differences with respect to the filled pauses exhibited by the callers, which is interesting in the light of the “many-options” hypothesis of filled pause production. The experiment also investigated the effects of the prompts on utterance form and meaning of the callers.

  • 19.
    Eklås Tejman, Claudia
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Automatisk citatidentifiering för nyhetstext på svenska. 2015. Independent thesis Basic level (degree of Bachelor), 10 credits / 15 HE credits. Student thesis
    Abstract [en]

    The strategies for marking quotations in Swedish differ from most other European languages. Since most systems for quotation identification are developed for English, there was a need for a quotation identification system specifically adapted for Swedish. A gold standard of 100 quotes from SUC 3.0 and 206 quotes from unformatted, web-crawled news data was compiled to analyse the syntactic structures and marking patterns of Swedish quotation. A rule-based system for quotation identification based on the patterns was developed. It achieved an F-score of 0.79 for the raw news data that contained the gold standard quotes and was able to identify 13 of 19 marking patterns. It could not determine whether the quotes ended after the reporting phrase or not, since the raw text data lacked formatting for the most common way to mark the end of a quote in Swedish.

  • 20.
    Ekman, Sara
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Automatisk extraktion av nyckelord ur ett kundforum. 2018. Independent thesis Basic level (degree of Bachelor), 10 credits / 15 HE credits. Student thesis
    Abstract [en]

    Conversations in a customer forum span different topics and the language is inconsistent. The text type does not meet the demands of automatic keyword extraction. This essay examines how keywords can be automatically extracted despite these difficulties. The study focuses on three factors in keyword extraction. The first factor concerns how the established keyword extraction method TF*IDF performs compared to four methods created with the unusual material in mind. The next factor deals with different ways to calculate word frequency. The third factor concerns whether the methods use only posts, only titles, or both in their extractions. Non-parametric tests were conducted to evaluate the extractions. A number of Friedman's tests show that the methods in some cases differ in their ability to identify relevant keywords. In post-hoc tests performed between the highest performing methods, one of the new methods performs significantly better than the other new methods but not better than TF*IDF. No difference was found between the use of different text types or ways to calculate word frequency. For future research, a reliability test of manually annotated keywords is recommended. A larger sample size should be used than in the current study, and further suggestions are given to improve the results of keyword extraction.
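
    For readers unfamiliar with the TF*IDF baseline named in this abstract, the following is a minimal sketch of one textbook variant (relative term frequency times the natural log of inverse document frequency). The thesis does not state which weighting or tokenisation it uses, so the formula, the function names and the toy forum posts are assumptions for illustration only.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per tokenised document.

    `documents` is a list of token lists. The score is
    (count / document length) * ln(N / document frequency).
    """
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


def top_keywords(documents, k=3):
    """Pick the k highest-scoring terms per document (ties broken alphabetically)."""
    return [
        sorted(doc_scores, key=lambda term: (-doc_scores[term], term))[:k]
        for doc_scores in tf_idf(documents)
    ]


# Hypothetical, already tokenised forum posts.
posts = [
    ["router", "restart", "router"],
    ["invoice", "restart"],
]
keywords = top_keywords(posts, k=1)
```

    A term that occurs in every post, such as "restart" here, receives a score of zero under this variant, which is why terms specific to a single post rank first.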

  • 21.
    Engdahl, Johan
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Tremänning eller syssling: Automatisk sökning i bloggar efter ordisoglosser i Sverige. 2012. Independent thesis Basic level (degree of Bachelor), 10 credits / 15 HE credits. Student thesis
    Abstract [en]

    Sometimes two dialects use different words for the same thing. The aim of this study is to show what can be automated in the search for word isoglosses. This is investigated by writing and evaluating a program that searches for word isoglosses in Sweden by analysing blog text. An isogloss is a geographical boundary between two different linguistic features, for example prosody or stress, or, as in this case, words. The program maps the writer's municipality to the words from the blog texts in a database. In addition, the program lets the user search either for how common a word is in Sweden's municipalities compared with the national average, or for which of two different words is more common within each municipality, according to a two-sided proportion test. The results of the searches were written to a file and then plotted manually. The evaluation shows that the program can find some word isoglosses between municipalities, and that the maps to some extent agree with the results reported by Parkvall (Parkvall, 2011; Parkvall, 2012). This indicates that the program is a good starting point for similar studies. One possible improvement is to allow the user to use regular expressions to remove ambiguity.
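
    The two-sided proportion test mentioned in the abstract can be sketched as a pooled two-proportion z-test. This is an assumption about the general technique, not the program described in the thesis; the function name and the way the counts are set up (occurrences of a word out of all tokens in two samples) are hypothetical.

```python
import math


def two_proportion_z_test(hits_a, n_a, hits_b, n_b):
    """Two-sided z-test for the difference between two proportions.

    Returns (z, p_value) using the pooled-proportion standard error.
    Assumes the pooled proportion is strictly between 0 and 1.
    """
    p_a = hits_a / n_a
    p_b = hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: a word occurs 50 times in 100 tokens from one
# municipality and 10 times in 100 tokens from the comparison sample.
z, p = two_proportion_z_test(50, 100, 10, 100)
```

    A positive z means the word is relatively more frequent in the first sample; a small p-value suggests the difference is unlikely to be due to chance alone.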

  • 22.
    Grigonyte, Gintare
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Baldwin, Timothy
    University of Melbourne.
    Automatic Detection of Multilingual Dictionaries on the Web2014In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, 2014, p. 93-98Conference paper (Refereed)
  • 23.
    Grigonyte, Gintare
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Clematide, Simon
    Institute of Computational Linguistics, University of Zurich, Switzerland.
    Rinaldi, Fabio
    Institute of Computational Linguistics, University of Zurich, Switzerland.
    How preferred are preferred terms?2013In: eLex 2013 / [ed] Kosem, I., Kallas, J., Gantar, P., Krek, S., Langemets, M., Tuulik, M., 2013, p. 452-459Conference paper (Refereed)
  • 24.
    Grigonyte, Gintare
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Clematide, Simon
    University of Zurich.
    Volk, Martin
    University of Zurich.
    Utka, Andrius
    Vytautas Magnus University.
    Proceedings of the Workshop on Innovative Corpus Query and Visualization Tools, NODALIDA 20152015Conference proceedings (editor) (Refereed)
    Abstract [en]

    Recent years have seen an increased interest in and availability of many different kinds of corpora. These range from small, but carefully annotated treebanks to large parallel corpora and very large monolingual corpora for big data research.

    It remains a challenge to offer flexible and powerful query tools for multilayer annotations of small corpora. When dealing with large corpora, query tools also need to scale in terms of processing speed and reporting through statistical information and visualization options. This becomes evident, for example, when dealing with very large corpora (such as complete Wikipedia corpora) or multi-parallel corpora (such as Europarl or JRC Acquis).

    The QueryVis workshop has gathered researchers who develop and evaluate new corpus query and visualization tools for linguistics, language technology and related disciplines. The papers focus on the design of query languages, and on various new visualization options for monolingual and parallel corpora, both for written and spoken language.

    We hope that QueryVis will stimulate discussions and trigger new ideas for the workshop participants and any reader of the proceedings. The preparation of the workshop and the reviewing of the submissions have already been an inspiring experience.

    All papers were peer-reviewed by three program committee members. We would like to thank all reviewers and contributors for their work and for sharing their thoughts and experiences with us.

    Let us all join our forces to make corpus exploration a rewarding, entertaining, and exciting experience which will grant us ever new insights into language and thought.

  • 25.
    Grigonyte, Gintare
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Hammarberg, Björn
    Stockholm University, Faculty of Humanities, Department of Linguistics, General Linguistics.
    Pronunciation and Spelling: the Case of Misspellings in Swedish L2 Written Essays2014In: Human Language Technologies - The Baltic Perspective, Baltic HLT 2014 / [ed] Andrius Utka, Gintarė Grigonytė, Jurgita Kapočiūtė-Dzikienė, Jurgita Vaičenonienė, Amsterdam: IOS Press, 2014, p. 95-98Conference paper (Refereed)
    Abstract [en]

    This research presents an investigation performed on the ASU corpus. We analyse to what extent the pronunciation of intended words is reflected in spelling errors made by L2 Swedish learners. We also propose a method that helps to automatically discriminate the misspellings affected by pronunciation from other types of misspellings.

  • 26.
    Grigonyte, Gintare
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Kvist, Maria
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Wirén, Mats
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Velupillai, Sumithra
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Henriksson, Aron
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Swedification patterns of Latin and Greek affixes in clinical text2016In: Nordic Journal of Linguistics, ISSN 0332-5865, E-ISSN 1502-4717, Vol. 39, no 1, p. 5-37Article in journal (Refereed)
    Abstract [en]

    Swedish medical language is rich with Latin and Greek terminology which has undergone a Swedification since the 1980s. However, many original expressions are still used by clinical professionals. The goal of this study is to obtain precise quantitative measures of how the foreign terminology is manifested in Swedish clinical text. To this end, we explore the use of Latin and Greek affixes in Swedish medical texts in three genres: clinical text, scientific medical text and online medical information for laypersons. More specifically, we use frequency lists derived from tokenised Swedish medical corpora in the three domains, and extract word pairs belonging to types that display both the original and Swedified spellings. We describe six distinct patterns explaining the variation in the usage of Latin and Greek affixes in clinical text. The results show that to a large extent affixes in clinical text are Swedified and that prefixes are used more conservatively than suffixes.

  • 27.
    Grigonyte, Gintare
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Schneider, Gerold
    From lexical bundles to surprisal: Measuring the idiom principle2014In: Lexical bundles in English non-fiction writing: forms and functions, 2014Conference paper (Refereed)
    Abstract [en]

    Lexical bundles (LB) testify to Sinclair's idiom principle (SIP), and measure formulaicity, complexity and (non-) creativity (FCN). We exploit the information-theoretic measure of surprisal to analyze these.

    Frequency as a measure of LB has been criticized (McEnery et al., 2006:208–220); collocation measures were suggested instead, until Biber (2009:286–290) raised three criticisms. First, MI ranks rare collocations, which often include idioms, highest. We answer that idioms are also formulaic, and that there are collocation measures with a bias towards frequent collocations.

    Second, MI doesn't respect word order. We thus use directed word transition probabilities like surprisal (Levy and Jaeger 2007): 3-gram surprisal = [...]

    Third, formulaic sequences are often discontinuous. We thus sum over sequences, use 3-grams as atoms, and address syntactic surprisal.

    We argue that abstracting to surprisal as a measure of LB and FCN is appropriate, as it expresses reader expectations and text entropy. We use surprisal to analyse differences between:

    1. spoken and written learner language (L2);
    2. L2 across proficiency levels;
    3. L2 compared with L1

    We test Pawley and Syder (1983)'s and Levy and Jaeger (2007)'s hypothesis that native speakers play the tug-of-war between formulaicity and expressiveness best, thus minimizing comprehension difficulty, according to the uniform information density principle.
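
    The abstract's own 3-gram surprisal formula did not survive in this record. As a rough illustration only, surprisal is commonly defined as the negative log probability of a word given its context; the sketch below estimates it from trigram counts. The add-alpha smoothing, the padding symbols and all function names are assumptions, not the authors' setup.

```python
import math
from collections import Counter


def train_trigram_counts(sentences):
    """Count trigrams and their two-word histories in tokenised sentences."""
    trigrams, histories = Counter(), Counter()
    for sent in sentences:
        padded = ["<s>", "<s>"] + sent + ["</s>"]
        for i in range(2, len(padded)):
            histories[(padded[i - 2], padded[i - 1])] += 1
            trigrams[(padded[i - 2], padded[i - 1], padded[i])] += 1
    return trigrams, histories


def trigram_surprisal(w1, w2, w3, trigrams, histories, vocab_size, alpha=1.0):
    """Surprisal in bits, -log2 P(w3 | w1, w2), with add-alpha smoothing."""
    prob = (trigrams[(w1, w2, w3)] + alpha) / (
        histories[(w1, w2)] + alpha * vocab_size
    )
    return -math.log2(prob)


# Hypothetical toy corpus.
corpus = [["a", "b", "c"], ["a", "b", "c"], ["a", "b", "d"]]
vocab = {w for sent in corpus for w in sent} | {"</s>"}
tri, hist = train_trigram_counts(corpus)
frequent = trigram_surprisal("a", "b", "c", tri, hist, len(vocab))
rare = trigram_surprisal("a", "b", "d", tri, hist, len(vocab))
```

    The more frequent continuation receives the lower surprisal, which is the sense in which formulaic sequences are less surprising than creative ones.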

  • 28.
    Grigonyte, Gintare
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Schneider, Gerold
    English Department, University of Zurich, Switzerland.
    Measuring Encoding Efficiency in Swedish and English Language Learner Speech Production2017In: The 18th Annual Conference of the International Speech Communication Association Interspeech 2017 / [ed] Marcin Włodarczak, The International Speech Communication Association (ISCA), 2017, article id 337Conference paper (Refereed)
    Abstract [en]

    We use n-gram language models to investigate how far language approximates an optimal code for human communication in terms of Information Theory [1], and what differences there are between learner proficiency levels. Although the language of lower level learners is simpler, it is less optimal in terms of information theory, and as a consequence more difficult to process.

  • 29.
    Grigonyté, Gintaré
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Kvist, Maria
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences. Karolinska Institutet, Sweden.
    Velupillai, Sumithra
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Wirén, Mats
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Improving Readability of Swedish Electronic Health Records through Lexical Simplification: First Results2014In: Proceedings of the 3rd Workshop on Predicting and Improving Text Readability for Target Reader Populations (PITR), Stroudsburg, USA: Association for Computational Linguistics, 2014, p. 74-83Conference paper (Refereed)
    Abstract [en]

    This paper describes part of an ongoing effort to improve the readability of Swedish electronic health records (EHRs). An EHR contains systematic documentation of a single patient’s medical history across time, entered by healthcare professionals with the purpose of enabling safe and informed care. Linguistically, medical records exemplify a highly specialised domain, which can be superficially characterised as having telegraphic sentences involving displaced or missing words, abundant abbreviations, spelling variations including misspellings, and terminology. We report results on lexical simplification of Swedish EHRs, by which we mean detecting the unknown, out-of-dictionary words and trying to resolve them either as compounded known words, abbreviations or misspellings.

  • 30.
    Grigonyté, Gintaré
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Kvist, Maria
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences. Karolinska Institute, Sweden.
    Velupillai, Sumithra
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Wirén, Mats
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Spelling Variation of Latin and Greek words in Swedish Medical Text2014Conference paper (Refereed)
  • 31.
    Grigonyté, Gintaré
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Nilsson Björkenstam, Kristina
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Language-independent exploration of repetition and variation in longitudinal child-directed speech: A tool and resources2016In: Proceedings of the joint workshop on NLP for Computer Assisted Language Learning and NLP for Language Acquisition at SLTC, Umeå, 16th November 2016 / [ed] Elena Volodina, Gintarė Grigonytė, Ildikó Pilán, Kristina Nilsson Björkenstam, Lars Borin, Linköping: Linköping University Electronic Press, 2016, p. 41-50Conference paper (Refereed)
    Abstract [en]

    We present a language-independent tool, called Varseta, for extracting variation sets in child-directed speech. This tool is evaluated against a gold standard corpus annotated with variation sets, MINGLE-3-VS, and used to explore variation sets in 26 languages in CHILDES-26-VS, a comparable corpus derived from the CHILDES database. The tool and the resources are freely available for research.

  • 32.
    Hammarberg, Björn
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, General Linguistics.
    Grigonyté, Gintaré
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Non-Native Writers’ Errors – a Challenge to a Spell-Checker2014In: 1st Nordic workshop on evaluation of spellchecking and proofing tools (NorWEST2014), 2014, p. 3Conference paper (Refereed)
    Abstract [en]

    Spell checkers are widely used and, if they do their job properly, are also highly useful. Usually they are built on the assumption that the text to be corrected is written by a mature native speaker. However, non-native speakers are in even greater need of spell checkers than native speakers. Current spell checkers, however, do not take the linguistic problems of learners into account and are thus poor at identifying errors and supplying adequate corrections. There are a number of linguistic complexities specific to non-native learners that a spell checker would need to handle in order to be successful.

  • 33.
    Hjelm, Hans
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Extraction of Cross Language Term Correspondences2006In: Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC'2006), 2006Conference paper (Refereed)
  • 34.
    Hjelm, Hans
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Identifying Cross Language Term Equivalents Using Statistical Machine Translation and Distributional Association Measures2007In: Proceedings of Nodalida 2007, the 16th Nordic Conference of Computational Linguistics / [ed] Nivre, Heiki-Jaan Kaalep, Kadri Muischnek and Mare Koit, 2007Conference paper (Refereed)
    Abstract [en]

    This article presents a comparison of the accuracy of a number of different approaches for identifying cross language term equivalents (translations). The methods investigated are on the one hand associative measures, commonly used in word-space models or in Information Retrieval and on the other hand a Statistical Machine Translation (SMT) approach. I have performed tests on six language pairs, using the JRC-Acquis parallel corpus as training material and Eurovoc as a gold standard. The SMT approach is shown to be more effective than the associative measures. The best results are achieved by taking a weighted average of the scores of the SMT approach and disparate associative measures.

  • 35.
    Hjelm, Hans
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Buitelaar, Paul
    Multilingual Evidence Improves Clustering-based Taxonomy Extraction2008In: Proceedings of the 18th European Conference on Artificial Intelligence (ECAI 2008), 2008Conference paper (Refereed)
  • 36.
    Hjelm, Hans
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Schwarz, Christoph
    LiSa - Morphological Analysis for Information Retrieval2006In: Proceedings of the 15th NODALIDA conference, Joensuu 2005, 2006Conference paper (Refereed)
  • 37.
    Hultin, Felix
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Understanding Context-free Grammars through Data Visualization2016Independent thesis Basic level (degree of Bachelor), 10 credits / 15 HE creditsStudent thesis
    Abstract [en]

    Ever since the late 1950s, context-free grammars have played an important role within the field of linguistics, been a part of introductory courses and expanded into other fields of study. Meanwhile, data visualization in modern web development has made it possible to do feature-rich visualization in the browser. In this thesis, these two developments are united by developing a browser-based app to write context-free grammars, parse sentences and visualize the output. A user experience study with usability tests and user interviews is conducted in order to investigate the possible benefits and disadvantages of said visualization when writing context-free grammars. The results show that participants used the data visualization to a limited extent: it helped them to see whether sentences were parsed and, if a sentence was not parsed, at which position parsing went wrong. Future improvements to the software and studies of them are proposed, as well as the expansion of the field of data visualization within linguistics.

  • 38.
    Hägglöf, Hillevi
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Tengstrand, Lisa
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    A Random Indexing Approach to Unsupervised Selectional Preference Induction2011Independent thesis Basic level (degree of Bachelor), 10 credits / 15 HE creditsStudent thesis
    Abstract [en]

    A selectional preference is the relation between a head-word and plausible arguments of that head-word. Estimation of the association feature between these words is important to natural language processing applications such as Word Sense Disambiguation. This study presents a novel approach to selectional preference induction within a Random Indexing word space. This is a spatial representation of meaning where distributional patterns enable estimation of the similarity between words. Using only frequency statistics about words to estimate how strongly one word selects another, the aim of this study is to develop a flexible method that is not language dependent and does not require any annotated resources, in contrast to methods from previous research. In order to optimize the performance of the selectional preference model, experiments including parameter tuning and variation of corpus size were conducted. The selectional preference model was evaluated in a pseudo-word evaluation which lets the selectional preference model decide which of two arguments has a stronger correlation to a given verb. Results show that varying parameters and corpus size does not affect the performance of the selectional preference model in a notable way. The conclusion of the study is that the language model used does not provide adequate tools to model selectional preferences. This might be due to a noisy representation of head-words and their arguments.
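
    To make the notion of a Random Indexing word space concrete, here is a minimal sketch under stated assumptions; it is not the model built in the thesis. The dimensionality, the number of non-zero entries, the window size and all function names are hypothetical choices for the example.

```python
import math
import random
from collections import defaultdict


def random_index_vector(dim, nonzeros, rng):
    """Sparse ternary index vector: a few +1/-1 entries, zeros elsewhere."""
    vec = [0] * dim
    for pos in rng.sample(range(dim), nonzeros):
        vec[pos] = rng.choice((-1, 1))
    return vec


def build_word_space(sentences, dim=200, nonzeros=6, window=1, seed=0):
    """Sum the index vectors of neighbouring words into each word's context vector."""
    rng = random.Random(seed)
    index = {}
    for sent in sentences:
        for word in sent:
            if word not in index:
                index[word] = random_index_vector(dim, nonzeros, rng)
    context = defaultdict(lambda: [0] * dim)
    for sent in sentences:
        for i, word in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    ctx = context[word]
                    for d, value in enumerate(index[sent[j]]):
                        ctx[d] += value
    return dict(context)


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical toy corpus: "cat" and "dog" occur in identical contexts.
space = build_word_space([["the", "cat", "sleeps"], ["the", "dog", "sleeps"]])
similarity = cosine(space["cat"], space["dog"])
```

    Words that share all their contexts end up with identical context vectors, so their cosine similarity is 1 up to floating-point error; in a realistic corpus the similarity is graded rather than all-or-nothing.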

  • 39. Ibbotson, Paul
    et al.
    Hartman, Rose M.
    Nilsson Björkenstam, Kristina
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Frequency filter: an open access tool for analysing language development2018In: Language, Cognition and Neuroscience, ISSN 2327-3798, E-ISSN 2327-3801, Vol. 33, no 10, p. 1325-1339Article in journal (Refereed)
    Abstract [en]

    We present an open-access analytic tool, which allows researchers to simultaneously control for and combine language data from the child, the caregiver, multiple languages, and across multiple time points to make inferences about the social and cognitive factors driving the shape of language development. We demonstrate how the tool works in three domains of language learning and across six languages. The results demonstrate the usefulness of this approach as well as providing deeper insight into three areas of language production and acquisition: egocentric language use, the learnability of nouns versus verbs, and imageability. We have made the Frequency Filter tool freely available as an R-package for other researchers to use at https://github.com/rosemm/FrequencyFilter.

  • 40.
    Kasaty, Anna
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Koponen, Eeva
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Klintfors, Eeva
    Stockholm University, Faculty of Humanities, Department of Linguistics, Phonetics.
    Swedish Nominal Morphophonology Implemented within the Two-level Model in PC-Kimmo1998Independent thesis Basic level (degree of Bachelor), 10 credits / 15 HE creditsStudent thesis
    Abstract [en]

    This paper presents a description of Swedish morphophonology and an attempt to create a Swedish pronunciation morpheme lexicon as a part of a text-to-speech system at Telia Research AB.

  • 41.
    Koponen, Eeva
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Phonetics. Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Klintfors, Eeva
    Stockholm University, Faculty of Humanities, Department of Linguistics, Phonetics.
    Effects of Target-Word Frequency Rate on Sound-Meaning-Connection in Five to Fifteen Month-Old Swedish Infants1999Independent thesis Basic level (degree of Bachelor), 10 credits / 15 HE creditsStudent thesis
    Abstract [en]

    The purpose of this study was to examine the effects of manipulating target-word frequency rate and target-word phrase position on sound-meaning-connection in five to fifteen month old Swedish infants. Three different test conditions, each one of them a film showing objects and corresponding phrases made of randomly generated artificial words, were designed. The structure of the first, high variability test condition included context-dependent information and the structures of the second and the third, low variability test conditions were characterised by frequent nonsense target-word rate, target-words occurring in phrase final position. The aim of the artificial input language was to ensure the novelty of test material, and to simulate the type of learning situation - when the semantic content of words is arbitrary - facing young infants in the beginning of language learning. Analyses of the informants' looking behaviour, prior to and after exposure to the objects and the corresponding audio input, were performed. Results showed that the structure of the high variability test condition and the structure of the low variability test conditions were associated with significant between-group differences. This finding indicates that the nonsense phrases in low variability test conditions managed to 'explain' the objects just like semantically meaningful phrases do. When compared with past research, these findings seem to suggest that experience-dependent mechanisms may support, besides word segmentation, even more complicated aspects of language learning, such as acquisition of syntax.

  • 42.
    Lindström, Mathias
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Automatic Segmentation of Swedish Medical Words with Greek and Latin Morphemes: A Computational Morphological Analysis2015Independent thesis Basic level (degree of Bachelor), 10 credits / 15 HE creditsStudent thesis
    Abstract [en]

    Raw text data online has increased the need for designing artificial systems capable of processing raw data efficiently and at a low cost in the field of natural language processing (NLP). A well-developed morphological analysis is an important cornerstone of NLP, in particular when word look-up is an important stage of processing. Morphological analysis has many advantages, including reducing the number of word forms to be stored computationally, as well as being cost-efficient and time-efficient. NLP is relevant in the field of medicine, especially in automatic text analysis, which is a relatively young field in Swedish medical texts. Much of the stored information is highly unstructured and disorganized.

    Using raw corpora, this paper aims to contribute to automatic morphological segmentation by experimenting with state-of-the-art tools for unsupervised and semi-supervised word segmentation of Swedish words in medical texts. The results show that a reasonable segmentation depends more on a high number of word types than on a special type of corpus. The results also show that semi-supervised word segmentation in the form of annotated training data greatly increases the performance.

  • 43.
    Ljunglöf, Peter
    et al.
    Göteborgs universitet.
    Wirén, Mats
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Syntactic parsing2010In: Handbook of Natural Language Processing / [ed] Nitin Indurkhya & Fred J. Damerau, Boca Raton, Florida: Chapman & Hall/CRC , 2010, 2, p. 59-91Chapter in book (Other (popular science, discussion, etc.))
    Abstract [en]

    This chapter presents basic techniques for grammar-driven natural language parsing, that is, analyzing a string of words (typically a sentence) to determine its structural description according to a formal grammar. In most circumstances, this is not a goal in itself but rather an intermediary step for the purpose of further processing, such as the assignment of a meaning to the sentence. To this end, the desired output of grammar-driven parsing is typically a hierarchical, syntactic structure suitable for semantic interpretation (the topic of Chapter 5). The string of words constituting the input will usually have been processed in separate phases of tokenization (Chapter 2) and lexical analysis (Chapter 3), which is hence not part of parsing proper.
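
    As a small illustration of grammar-driven parsing in the sense described above, the following sketch implements a CKY recogniser for a grammar in Chomsky normal form. It is one standard technique chosen for the example and is not taken from the chapter; the toy lexicon, the rules and the function name are hypothetical.

```python
from itertools import product


def cky_recognize(words, lexicon, rules, start="S"):
    """Return True if `words` can be derived from `start`.

    lexicon: dict mapping a word to its set of preterminal categories.
    rules:   dict mapping (B, C) to the set of A for binary rules A -> B C.
    """
    n = len(words)
    if n == 0:
        return False
    # chart[i][j] holds the categories that span words[i:j].
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        chart[i][i + 1] = set(lexicon.get(word, ()))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for b, c in product(chart[i][k], chart[k][j]):
                    chart[i][j] |= rules.get((b, c), set())
    return start in chart[0][n]


# Hypothetical toy grammar.
lexicon = {"the": {"Det"}, "dog": {"N"}, "cat": {"N"}, "chased": {"V"}}
rules = {("Det", "N"): {"NP"}, ("V", "NP"): {"VP"}, ("NP", "VP"): {"S"}}
accepted = cky_recognize("the dog chased the cat".split(), lexicon, rules)
```

    The recogniser only answers whether a structural description exists; storing back-pointers in each chart cell would be needed to recover the hierarchical structure itself.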

  • 44.
    Loftsson, Hrafn
    et al.
    Reykjaviks universitet, Island.
    Östling, Robert
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Tagging a Morphologically Complex Language Using an Averaged Perceptron Tagger: The Case of Icelandic2013In: Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013), Linköping University Electronic Press, Linköpings universitet, 2013, p. 105-119Conference paper (Refereed)
    Abstract [en]

    In this paper, we experiment with using Stagger, an open-source implementation of an Averaged Perceptron tagger, to tag Icelandic, a morphologically complex language. By adding language-specific linguistic features and using IceMorphy, an unknown word guesser, we obtain state-of-the-art tagging accuracy of 92.82%. Furthermore, by adding data from a morphological database, and word embeddings induced from an unannotated corpus, the accuracy increases to 93.84%. This is equivalent to an error reduction of 5.5%, compared to the previously best tagger for Icelandic, consisting of linguistic rules and a Hidden Markov Model.

  • 45.
    Marklund, Ellen
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Phonetics.
    Cortes, Elísabet Eir
    Stockholm University, Faculty of Humanities, Department of Linguistics, Phonetics.
    Sjons, Johan
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    MMN responses in adults after exposure to bimodal and unimodal frequency distributions of rotated speech2017In: Proceedings of Interspeech 2017, The International Speech Communication Association (ISCA), 2017, p. 1804-1808Conference paper (Refereed)
    Abstract [en]

    The aim of the present study is to further the understanding of the relationship between perceptual categorization and exposure to different frequency distributions of sounds. Previous studies have shown that speech sound discrimination proficiency is influenced by exposure to different distributions of speech sound continua varying along one or several acoustic dimensions, both in adults and in infants. In the current study, adults were presented with either a bimodal or a unimodal frequency distribution of spectrally rotated sounds along a continuum (a vowel continuum before rotation). Categorization of the sounds, quantified as amplitude of the event-related potential (ERP) component mismatch negativity (MMN) in response to two of the sounds, was measured before and after exposure. It was expected that the bimodal group would have a larger MMN amplitude after exposure whereas the unimodal group would have a smaller MMN amplitude after exposure. Contrary to expectations, the MMN amplitude was smaller overall after exposure, and no difference was found between groups. This suggests that either the previously reported sensitivity to frequency distributions of speech sounds is not present for non-speech sounds, or the MMN amplitude is not a sensitive enough measure of categorization to detect an influence from passive exposure, or both.

  • 46.
    Megyesi, Beáta
    et al.
    Department of Linguistics and Philology, Uppsala University.
    Granstedt, Lena
    Department of Language Studies, Umeå University.
    Johansson, Sofia
    Stockholm University, Faculty of Humanities, Department of Swedish Language and Multilingualism.
    Rosén, Dan
    Språkbanken, Department of Swedish, University of Gothenburg.
    Schenström, Carl-Johan
    Språkbanken, Department of Swedish, University of Gothenburg.
    Sundberg, Gunlög
    Stockholm University, Faculty of Humanities, Department of Swedish Language and Multilingualism.
    Wirén, Mats
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Volodina, Elena
    Språkbanken, Department of Swedish, University of Gothenburg.
    Learner Corpus Anonymization in the Age of GDPR: Insights from the Creation of a Learner Corpus of Swedish2018In: Proceedings of 7th Workshop on NLP for Computer Assisted Language Learning at SLTC 2018, Linköping, Sweden, 2018, p. 47-56Conference paper (Refereed)
    Abstract [en]

    This paper reports on the status of learner corpus anonymization for the ongoing research infrastructure project SweLL. The main project aim is to deliver and make available for research a well-annotated corpus of essays written by second language (L2) learners of Swedish. As practice shows, annotation of learner texts is a sensitive process demanding a lot of compromises between ethical and legal demands on the one hand, and research and technical demands on the other. Below is a concise description of the current status of pseudonymization of language learner data to ensure anonymity of the learners, with numerous examples of the above-mentioned compromises.

  • 47.
    Nilsson Björkenstam, Kristina
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    SUC-CORE: A Balanced Corpus Annotated with Noun Phrase Coreference2013In: Northern European Journal of Language Technology (NEJLT), ISSN 2000-1533, Vol. 3, no 2, p. 19-39Article in journal (Refereed)
    Abstract [en]

    This paper describes SUC-CORE, a subset of the Stockholm Umeå Corpus and the Swedish Treebank annotated with noun phrase coreference. While most coreference annotated corpora consist of texts of similar types within related domains, SUC-CORE consists of both informative and imaginative prose and covers a wide range of literary genres and domains. This allows for exploration of coreference across different text types, but it also means that there are limited amounts of data within each type. Future work on coreference resolution for Swedish should include making more annotated data available for the research community.

  • 48.
    Nilsson Björkenstam, Kristina
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    The MINGLE annotation scheme: Multimodal annotation of parent-child interaction in a free play setting (version 1.0). 2012. Report (Other academic)
    Abstract [en]

    A cognitive model of language learning must be dialogue-driven and multimodal to reflect how parent and child interact, using words, eye gaze, and object manipulation. We present a scheme for multimodal annotation of parent-child interaction. The purpose is to add verbal and non-verbal annotation to a corpus of longitudinal video and sound recordings of parent-child dyads. In this guideline, we describe the transcription of adult and child speech and vocalizations, and the annotation of both empty-hand gestures and object-related actions by both adults and children.

  • 49.
    Nilsson Björkenstam, Kristina
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    What is a corpus and why are corpora important tools? 2013. Conference paper (Other academic)
  • 50.
    Nilsson Björkenstam, Kristina
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Björkstrand, Thomas
    Stockholm University, Faculty of Humanities, Department of Linguistics, Sign Language.
    Grigonyté, Gintaré
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Gustafson-Capková, Sofia
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Mesch, Johanna
    Stockholm University, Faculty of Humanities, Department of Linguistics, Sign Language.
    Östling, Robert
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Schönström, Krister
    Stockholm University, Faculty of Humanities, Department of Linguistics, Swedish as a Second Language for the Deaf.
    Wallin, Lars
    Stockholm University, Faculty of Humanities, Department of Linguistics, Sign Language.
    Wirén, Mats
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    SWE-CLARIN partner presentation: Natural Language Processing Resources from the Department of Linguistics, Stockholm University. 2014. In: The first Swedish national SWE-CLARIN workshop: LT-based e-HSS in Sweden – taking stock and looking ahead / [ed] Lars Borin, 2014. Conference paper (Other academic)
    Abstract [en]

    The aim of the CLARIN Research Infrastructure and SWE-CLARIN is to make it easier for scholars in the humanities and social sciences to access primary data in the form of natural language, and to provide tools for exploring, annotating and analysing these data. This paper gives an overview of the resources and tools developed at the Department of Linguistics at Stockholm University that are planned to be made available within the SWE-CLARIN project. The paper also outlines our collaborations with neighbouring areas in the humanities and social sciences where these resources and tools will be put to use.
