  • 1.
    Andersson, Marta
    et al.
    Stockholm University, Faculty of Humanities, Department of English.
    Kurfali, Murathan
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Östling, Robert
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    A sentiment-annotated dataset of English causal connectives. 2020. In: Proceedings of the 14th Linguistic Annotation Workshop / [ed] Stefanie Dipper, Amir Zeldes, 2020, p. 24-33. Conference paper (Refereed)
    Abstract [en]

    This paper investigates the semantic prosody of three causal connectives: due to, owing to and because of in seven varieties of the English language. While research in the domain of English causality exists, we are not aware of studies that would cover the domain of causal connectives in English. Our claim is that connectives such as because of link two arguments, (at least) one of which will include a phrase that contributes to the interpretation of the relation as positive or negative, and hence define the prosody of the connective used. As our results demonstrate, the majority of the prosodies identified are negative for all three connectives; the proportions are stable across the varieties of English studied, and contrary to our expectations, we find no significant differences between the functions of the connectives and discourse preferences. Further, we investigate whether automating the sentiment annotation procedure via a simple language-model-based classifier is possible. The initial results highlight the complexity of the task and the need for more complicated systems, probably aided by other related datasets, to achieve reasonable performance.

    Download full text (pdf)
    fulltext
  • 2. Basirat, Ali
    et al.
    de Lhoneux, Miryam
    Kulmizev, Artur
    Kurfali, Murathan
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Nivre, Joakim
    Östling, Robert
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Polyglot Parsing for One Thousand and One Languages (And Then Some). 2019. Conference paper (Other academic)
  • 3. Berggren, Max
    et al.
    Karlgren, Jussi
    Östling, Robert
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Parkvall, Mikael
    Stockholm University, Faculty of Humanities, Department of Linguistics, General Linguistics.
    Inferring the location of authors from words in their texts. 2015. In: Proceedings of the 20th Nordic Conference of Computational Linguistics: NODALIDA 2015 / [ed] Beáta Megyesi, Linköping: Linköping University Electronic Press / ACL Anthology, 2015, p. 211-218. Conference paper (Refereed)
    Abstract [en]

    For the purposes of computational dialectology or other geographically bound text analysis tasks, texts must be annotated with their or their authors' location. Many texts are locatable but most have no explicit annotation of place. This paper describes a series of experiments to determine how positionally annotated microblog posts can be used to learn location-indicating words which then can be used to locate blog texts and their authors. A Gaussian distribution is used to model the locational qualities of words. We introduce the notion of placeness to describe how locational words are.

    We find that modelling word distributions to account for several locations and thus several Gaussian distributions per word, defining a filter which picks out words with high placeness based on their local distributional context, and aggregating locational information in a centroid for each text gives the most useful results. The results are applied to data in the Swedish language.
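    The abstract above describes a concrete pipeline: estimate a location distribution per word from geotagged posts, filter words by their placeness, and aggregate locations into a centroid per text. Below is a minimal, hypothetical Python sketch of that general idea, not the authors' implementation; the per-word Gaussian estimate, the variance-based placeness score and the 10-occurrence cut-off are assumptions made for illustration.

```python
from collections import defaultdict

def fit_word_gaussians(geotagged_posts):
    """Fit a simple 2-D Gaussian (mean latitude/longitude, spherical variance) per word.

    geotagged_posts: iterable of (latitude, longitude, text) tuples from microblog data.
    Returns {word: (mean_lat, mean_lon, variance)}.
    """
    coords = defaultdict(list)
    for lat, lon, text in geotagged_posts:
        for word in set(text.lower().split()):
            coords[word].append((lat, lon))

    models = {}
    for word, points in coords.items():
        if len(points) < 10:  # too rare to estimate a distribution
            continue
        mlat = sum(p[0] for p in points) / len(points)
        mlon = sum(p[1] for p in points) / len(points)
        var = sum((p[0] - mlat) ** 2 + (p[1] - mlon) ** 2 for p in points) / len(points)
        models[word] = (mlat, mlon, var)
    return models

def placeness(word, models):
    """A word has high 'placeness' if its occurrences cluster tightly in space."""
    if word not in models:
        return 0.0
    return 1.0 / (1.0 + models[word][2])

def locate_text(text, models, threshold=0.2):
    """Aggregate the means of high-placeness words into a centroid for the text."""
    picked = [models[w][:2] for w in set(text.lower().split())
              if placeness(w, models) >= threshold]
    if not picked:
        return None
    return (sum(p[0] for p in picked) / len(picked),
            sum(p[1] for p in picked) / len(picked))
```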

  • 4. Bielinskiene, Agne
    et al.
    Boizou, Loic
    Grigonyté, Gintaré
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Kovalevskaite, Jolanta
    Rimkute, Erika
    Utka, Andrius
    Lietuvių kalbos terminų automatinis atpažinimas ir apibrėžimas [Automatic recognition and definition of Lithuanian terms]. 2015 (ed. 1). Book (Refereed)
    Abstract [en]

    This book presents the most recent advances in the field of Lithuanian terminology extraction, as well as the first attempt at automatic extraction of Lithuanian term-defining contexts. The first work in descriptive terminology by Lithuanian researchers appeared in the early 2000s, i.e. R. Marcinkevičienė (2000) and I. Zeller (dissertation "Term recognition and their analysis", 2005). Nevertheless, the larger proportion of research on Lithuanian terminology is still dominated by the prescriptive view, in which much attention is given to principles and norms of terminology, as well as to diachronic aspects of terminology. Chapter 1 describes the differences between descriptive and prescriptive terminology. The authors emphasize that prescriptive terminology involves standardisation and approval of terms, with decisions based on existing terminology dictionaries, documents, standards, lexicons and databases of approved terms. In corpus-based terminology management, one of the branches of descriptive terminology, the main focus is instead placed on the usage of terms in natural language in a corpus rather than on standardisation. The empirical research approaches benefit from various automatic term analysis and term extraction tools, which come in handy in corpus-based terminology management. Recent terminology research has shown that it is very important to harmonize the methods of prescriptive and descriptive terminology. The combination of both methods allows faster processing of ever-growing data, which is very relevant to the challenges of modern lexicography, including the quick and efficient creation of dynamic lexicographical resources.

    Download full text (pdf)
    fulltext
  • 5.
    Bjerva, Johannes
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Genetic Algorithms in the Brill Tagger: Moving towards language independence. 2013. Independent thesis Advanced level (degree of Master (One Year)), 10 credits / 15 HE credits. Student thesis
    Abstract [en]

    The viability of using rule-based systems for part-of-speech tagging was revitalised when a simple rule-based tagger was presented by Brill (1992). This tagger is based on an algorithm which automatically derives transformation rules from a corpus, using an error-driven approach. In addition to performing on par with state of the art stochastic systems for part-of-speech tagging, it has the advantage that the automatically derived rules can be presented in a human-readable format.

    In spite of its strengths, the Brill tagger is quite language dependent, and performs much better on languages similar to English than on languages with richer morphology. This issue is addressed in this paper through defining rule templates automatically with a search that is optimised using Genetic Algorithms. This allows the Brill GA-tagger to search a large search space for templates which in turn generate rules which are appropriate for various target languages, which has the added advantage of removing the need for researchers to define rule templates manually.

    The Brill GA-tagger performs significantly better (p<0.001) than the standard Brill tagger on all 9 target languages (Chinese, Japanese, Turkish, Slovene, Portuguese, English, Dutch, Swedish and Icelandic), with an error rate reduction of between 2% and 15% for each language.

    Download full text (pdf)
    fulltext
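    The thesis above replaces hand-written Brill rule templates with templates found by a genetic algorithm. The sketch below illustrates that kind of template search in Python under stated assumptions: the template encoding, the GA operators and the `evaluate_tagger` fitness helper (assumed to train a Brill tagger with the given templates and return held-out accuracy) are invented for the example and are not the thesis implementation.

```python
import random

# A template is encoded as a tuple of (feature, relative position) slots,
# e.g. (("tag", -1), ("word", 1)) means "previous tag and next word".
FEATURES = ["tag", "word"]
POSITIONS = [-2, -1, 1, 2]

def random_template(max_slots=2):
    n = random.randint(1, max_slots)
    return tuple((random.choice(FEATURES), random.choice(POSITIONS)) for _ in range(n))

def crossover(a, b):
    # Combine the first half of one template with the second half of another.
    return a[: len(a) // 2] + b[len(b) // 2 :] or a

def mutate(template, rate=0.2):
    # Re-draw each slot with probability `rate`.
    return tuple(
        (random.choice(FEATURES), random.choice(POSITIONS)) if random.random() < rate else slot
        for slot in template
    )

def evolve_templates(evaluate_tagger, population=20, generations=10):
    """Standard GA loop: keep the fittest half, refill by crossover and mutation.

    evaluate_tagger(templates) -> held-out tagging accuracy (hypothetical helper).
    """
    pop = [random_template() for _ in range(population)]
    for _ in range(generations):
        scored = sorted(pop, key=lambda t: evaluate_tagger([t]), reverse=True)
        parents = scored[: population // 2]
        children = [
            mutate(crossover(random.choice(parents), random.choice(parents)))
            for _ in range(population - len(parents))
        ]
        pop = parents + children
    return sorted(pop, key=lambda t: evaluate_tagger([t]), reverse=True)
```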
  • 6.
    Bjerva, Johannes
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Predicting the N400 Component in Manipulated and Unchanged Texts with a Semantic Probability Model. 2012. Independent thesis Basic level (degree of Bachelor), 10 credits / 15 HE credits. Student thesis
    Abstract [en]

    Within the field of computational linguistics, recent research has made successful advances in integrating word space models with n-gram models. This is of particular interest when a model that encapsulates both semantic and syntactic information is desirable. A potential application for this can be found in the field of psycholinguistics, where the neural response N400 has been found to occur in contexts with semantic incongruities. Previous research has found correlations between cloze probabilities and N400, while more recent research has found correlations between cloze probabilities and language models.

    This essay attempts to uncover whether or not a more direct connection between integrated models and N400 can be found, hypothesizing that low probabilities elicit strong N400 responses and vice versa. In an EEG experiment, participants read a text manipulated using a language model, and a text left unchanged. Analysis of the results shows that the manipulations to some extent yielded results supporting the hypothesis. Further results are found when analysing responses to the unchanged text. However, no significant correlations between N400 and the computational model are found. Future research should improve the experimental paradigm, so that a larger scale EEG recording can be used to construct a large EEG corpus.

    Download full text (pdf)
    Bjerva2012
  • 7.
    Bjerva, Johannes
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics. University of Groningen, The Netherlands.
    Börstell, Carl
    Stockholm University, Faculty of Humanities, Department of Linguistics, Sign Language.
    Morphological complexity influences Verb–Object order in Swedish Sign Language. 2016. In: Computational Linguistics for Linguistic Complexity: Proceedings of the Workshop / [ed] Dominique Brunato, Felice Dell'Orletta, Giulia Venturi, Thomas François, Philippe Blache, International Committee on Computational Linguistics (ICCL), 2016, p. 137-141. Conference paper (Refereed)
    Abstract [en]

    Computational linguistic approaches to sign languages could benefit from investigating how complexity influences structure. We investigate whether morphological complexity has an effect on the order of Verb (V) and Object (O) in Swedish Sign Language (SSL), on the basis of elicited data from five Deaf signers. We find a significant difference in the distribution of the orderings OV vs. VO, based on an analysis of morphological weight. While morphologically heavy verbs exhibit a general preference for OV, humanness seems to affect the ordering in the opposite direction, with [+human] Objects pushing towards a preference for VO.

    Download full text (pdf)
    fulltext
  • 8. Bjerva, Johannes
    et al.
    Grigonyte, Gintare
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Östling, Robert
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Plank, Barbara
    Neural Networks and Spelling Features for Native Language Identification. 2017. In: The Twelfth Workshop on Innovative Use of NLP for Building Educational Applications: Proceedings of the Workshop, Association for Computational Linguistics, 2017, p. 235-239. Conference paper (Refereed)
    Abstract [en]

    We present the RUG-SU team's submission at the Native Language Identification Shared Task 2017. We combine several approaches into an ensemble, based on spelling error features, a simple neural network using word representations, a deep residual network using word and character features, and a system based on a recurrent neural network. Our best system is an ensemble of neural networks, reaching an F1 score of 0.8323. Although our system is not the highest ranking one, we do outperform the baseline by far.

    Download full text (pdf)
    fulltext
  • 9. Bjerva, Johannes
    et al.
    Östling, Robert
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Cross-lingual Learning of Semantic Textual Similarity with Multilingual Word Representations. 2017. In: Proceedings of the 21st Nordic Conference on Computational Linguistics / [ed] Jörg Tiedemann, Linköping: Linköping University Electronic Press, 2017, p. 211-215, article id 024. Conference paper (Refereed)
    Abstract [en]

    Assessing the semantic similarity between sentences in different languages is challenging. We approach this problem by leveraging multilingual distributional word representations, where similar words in different languages are close to each other. The availability of parallel data allows us to train such representations on a large amount of languages. This allows us to leverage semantic similarity data for languages for which no such data exists. We train and evaluate on five language pairs, including English, Spanish, and Arabic. We are able to train well-performing systems for several language pairs, without any labelled data for that language pair.

    Download full text (pdf)
    fulltext
  • 10. Bjerva, Johannes
    et al.
    Östling, Robert
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Han Veiga, Maria
    Tiedemann, Jörg
    Augenstein, Isabelle
    What Do Language Representations Really Represent? 2019. In: Computational linguistics - Association for Computational Linguistics (Print), ISSN 0891-2017, E-ISSN 1530-9312, Vol. 45, no 2, p. 381-389. Article in journal (Other academic)
    Abstract [en]

    A neural language model trained on a text corpus can be used to induce distributed representations of words, such that similar words end up with similar representations. If the corpus is multilingual, the same model can be used to learn distributed representations of languages, such that similar languages end up with similar representations. We show that this holds even when the multilingual corpus has been translated into English, by picking up the faint signal left by the source languages. However, just as it is a thorny problem to separate semantic from syntactic similarity in word representations, it is not obvious what type of similarity is captured by language representations. We investigate correlations and causal relationships between language representations learned from translations on one hand, and genetic, geographical, and several levels of structural similarity between languages on the other. Of these, structural similarity is found to correlate most strongly with language representation similarity, whereas genetic relationships, a convenient benchmark used for evaluation in previous work, appear to be a confounding factor. Apart from implications about translation effects, we see this more generally as a case where NLP and linguistic typology can interact and benefit one another.

    Download full text (pdf)
    fulltext
  • 11.
    Boglind, Fredrik
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Aligning Historical Ciphertext and Plaintext Using Statistical Machine Translation Methods. 2024. Independent thesis Basic level (degree of Bachelor), 10 credits / 15 HE credits. Student thesis
    Abstract [en]

    This thesis examines how generative alignment models, specifically IBM Models 1 and 2 commonly used in statistical machine translation, can be used for the task of aligning historical ciphertexts with their corresponding plaintexts. The aim is to address a problem in the field of Historical Cryptology: matching separate ciphertexts and plaintexts that have been stored apart in archives. Using synthetically generated data mimicking historical ciphers, the thesis evaluates how well these models can be adapted for cryptographic alignment and cipher key recreation. In an experiment involving synthetically generated historical cryptographic text, model parameters are optimized, and performance is assessed across various cryptographic features, including homophonic encryption levels, error types, and text lengths. Results indicate that while IBM Model 2 generally outperforms Model 1, both models perform poorly when handling different types of errors, particularly additions and deletions in ciphertexts. The study establishes a baseline for alignment tasks in Historical Cryptology and demonstrates the potential and limitations of applying statistical machine translation techniques to cryptographic problems. The findings suggest that while generative alignment models can be adapted to be used with encrypted historical texts, further refinements may be necessary.

    Download full text (pdf)
    fulltext
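    The thesis above adapts IBM alignment models to ciphertext-plaintext pairs. For reference, here is a minimal sketch of the textbook IBM Model 1 EM estimation over symbol pairs, not the thesis code; treating cipher symbols as source tokens and plaintext letters as target tokens is the assumed adaptation.

```python
from collections import defaultdict

def ibm_model1(pairs, iterations=10):
    """Estimate translation probabilities t(plain | cipher) with IBM Model 1 EM.

    pairs: list of (cipher_symbols, plain_letters) sequence pairs.
    Returns a dict t[(plain, cipher)] = probability.
    """
    plain_vocab = {p for _, plain in pairs for p in plain}
    t = defaultdict(lambda: 1.0 / max(len(plain_vocab), 1))  # uniform initialisation

    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for cipher, plain in pairs:
            for p in plain:
                # Normalisation: how much probability mass p receives from this pair.
                z = sum(t[(p, c)] for c in cipher)
                for c in cipher:
                    frac = t[(p, c)] / z if z > 0 else 0.0
                    count[(p, c)] += frac
                    total[c] += frac
        for (p, c), n in count.items():
            t[(p, c)] = n / total[c] if total[c] > 0 else 0.0
    return dict(t)

def align(cipher, plain, t):
    """Link each plaintext letter to its most probable cipher symbol."""
    return [max(cipher, key=lambda c: t.get((p, c), 0.0)) for p in plain]
```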
  • 12.
    Byström, Emil
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Knowledge-based Coreference Resolution in Swedish. 2012. Independent thesis Basic level (degree of Bachelor), 10 credits / 15 HE credits. Student thesis
    Abstract [en]

    Automatic coreference resolution is the automatic identification of expressions with the same referents. The state-of-the-art systems are data-driven and based on machine learning algorithms. Data-driven approaches to coreference resolution require large amounts of annotated data, which is time-consuming and expensive to obtain. Haghighi and Klein [1] present a knowledge-based approach where coreference is resolved with heuristics using rich syntactic and semantic features. Haghighi and Klein's system is interesting because its performance is in line with data-driven systems and the requirements on annotated data are low. In the present study a knowledge-based system for coreference resolution in Swedish was implemented and its performance evaluated. The system is based on the system of Haghighi and Klein. To be able to evaluate and implement the algorithm, a database annotated with coreferential chains is needed. As there is no freely available resource with data annotated with coreference in Swedish, the annotation of the gold standard part of SUC 2.0 is also described. Results from the evaluation of the implementation show that the syntactic and semantic filters implemented did not improve baseline results. The filters falsely allow or constrain coreference as insufficient linguistic information is available. It is argued that focusing on rich syntactic and semantic features improves future work on knowledge-based coreference resolution in Swedish.

  • 13.
    Börstell, Carl
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Sign Language.
    Hörberg, Thomas
    Stockholm University, Faculty of Humanities, Department of Linguistics, General Linguistics.
    Östling, Robert
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Distribution and duration of signs and parts of speech in Swedish Sign Language. 2016. In: Sign Language and Linguistics, ISSN 1387-9316, E-ISSN 1569-996X, Vol. 19, no 2, p. 143-196. Article in journal (Refereed)
    Abstract [en]

    In this paper, we investigate frequency and duration of signs and parts of speech in Swedish Sign Language (SSL) using the SSL Corpus. The duration of signs is correlated with frequency, with high-frequency items having shorter duration than low-frequency items. Similarly, function words (e.g. pronouns) have shorter duration than content words (e.g. nouns). In compounds, forms annotated as reduced display shorter duration. Fingerspelling duration correlates with word length of corresponding Swedish words, and frequency and word length play a role in the lexicalization of fingerspellings. The sign distribution in the SSL Corpus shows a great deal of cross-linguistic similarity with other sign languages in terms of which signs appear as high-frequency items, and which categories of signs are distributed across text types (e.g. conversation vs. narrative). We find a correlation between an increase in age and longer mean sign duration, but see no significant difference in sign duration between genders.

  • 14.
    Börstell, Carl
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Sign Language.
    Wirén, Mats
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Mesch, Johanna
    Stockholm University, Faculty of Humanities, Department of Linguistics, Sign Language.
    Gärdenfors, Moa
    Stockholm University, Faculty of Humanities, Department of Linguistics, Sign Language.
    Towards an Annotation of Syntactic Structure in the Swedish Sign Language Corpus. 2016. In: Workshop Proceedings: 7th Workshop on the Representation and Processing of Sign Languages: Corpus Mining / [ed] Eleni Efthimiou, Stavroula-Evita Fotinea, Thomas Hanke, Julie Hochgesang, Jette Kristoffersen, Johanna Mesch, Paris: ELRA, 2016, p. 19-24. Conference paper (Refereed)
    Abstract [en]

    This paper describes on-going work on extending the annotation of the Swedish Sign Language Corpus (SSLC) with a level of syntactic structure. The basic annotation of SSLC in ELAN consists of six tiers: four for sign glosses (two tiers for each signer; one for each of a signer’s hands), and two for written Swedish translations (one for each signer). In an additional step by Östling et al. (2015), all glosses of the corpus have been further annotated for parts of speech. Building on the previous steps, we are now developing annotation of clause structure for the corpus, based on meaning and form. We define a clause as a unit in which a predicate asserts something about one or more elements (the arguments). The predicate can be a (possibly serial) verbal or nominal. In addition to predicates and their arguments, criteria for delineating clauses include non-manual features such as body posture, head movement and eye gaze. The goal of this work is to arrive at two additional annotation tier types in the SSLC: one in which the sign language texts are segmented into clauses, and the other in which the individual signs are annotated for their argument types.

    Download full text (pdf)
    fulltext
  • 15.
    Börstell, Carl
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Sign Language.
    Östling, Robert
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Iconic Locations in Swedish Sign Language: Mapping Form to Meaning with Lexical Databases. 2017. In: Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa / [ed] Jörg Tiedemann, Linköping: Linköping University Electronic Press, 2017, p. 221-225, article id 026. Conference paper (Refereed)
    Abstract [en]

    In this paper, we describe a method for mapping the phonological feature location of Swedish Sign Language (SSL) signs to the meanings in the Swedish semantic dictionary SALDO. By doing so, we observe clear differences in the distribution of meanings associated with different locations on the body. The prominence of certain locations for specific meanings clearly points to iconic mappings between form and meaning in the lexicon of SSL, which pinpoints modality-specific properties of the visual modality.

    Download full text (pdf)
    fulltext
  • 16. Cap, Fabienne
    et al.
    Adesam, Yvonne
    Ahrenberg, Lars
    Borin, Lars
    Bouma, Gerlof
    Forsberg, Markus
    Kann, Viggo
    Östling, Robert
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Smith, Aaron
    Wirén, Mats
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Nivre, Joakim
    SWORD: Towards Cutting-Edge Swedish Word Processing. 2016. In: Proceedings of SLTC 2016, 2016. Conference paper (Refereed)
    Abstract [en]

    Despite many years of research on Swedish language technology, there is still no well-documented standard for Swedish word processing covering the whole spectrum from low-level tokenization to morphological analysis and disambiguation. SWORD is a new initiative within the SWE-CLARIN consortium aiming to develop documented standards for Swedish word processing. In this paper, we report on a pilot study of Swedish tokenization, where we compare the output of six different tokenizers on four different text types. For one text type (Wikipedia articles), we also compare to the tokenization produced by six manual annotators.

    Download full text (pdf)
    fulltext
  • 17.
    Cohen, Julianne
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Automatisk kvalitetsbedömning av medicinska översättningar [Automatic quality assessment of medical translations]. 2022. Independent thesis Basic level (degree of Bachelor), 10 credits / 15 HE credits. Student thesis
    Download full text (pdf)
    fulltext
  • 18.
    Cortes, Elisabet Eir
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Phonetics.
    Gerholm, Tove
    Stockholm University, Faculty of Humanities, Department of Linguistics, General Linguistics.
    Marklund, Ellen
    Stockholm University, Faculty of Humanities, Department of Linguistics, Phonetics.
    Marklund, Ulrika
    Stockholm University, Faculty of Humanities, Department of Linguistics, Phonetics.
    Molnar, Monika
    Nilsson Björkenstam, Kristina
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Schwarz, Iris-Corinna
    Stockholm University, Faculty of Humanities, Department of Linguistics, Phonetics.
    Sjons, Johan
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    WILD 2015: Book of Abstracts. 2015. Conference proceedings (editor) (Other academic)
    Abstract [en]

    WILD 2015 is the second Workshop on Infant Language Development, held June 10-12 2015 in Stockholm, Sweden. WILD 2015 was organized by Stockholm Babylab and the Department of Linguistics, Stockholm University. About 150 delegates met over three conference days, convening on infant speech perception, social factors of language acquisition, bilingual language development in infancy, early language comprehension and lexical development, neurodevelopmental aspects of language acquisition, methodological issues in infant language research, modeling infant language development, early speech production, and infant-directed speech. Keynote speakers were Alejandrina Cristia, Linda Polka, Ghislaine Dehaene-Lambertz, Angela D. Friederici and Paula Fikkert.

    Organizing this conference would of course not have been possible without our funding agencies Vetenskapsrådet and Riksbankens Jubiléumsfond. We would like to thank Francisco Lacerda, Head of the Department of Linguistics, and the Departmental Board for agreeing to host WILD this year. We would also like to thank the administrative staff for their help and support in this undertaking, especially Ann Lorentz-Baarman and Linda Habermann.

    The WILD 2015 Organizing Committee: Ellen Marklund, Iris-Corinna Schwarz, Elísabet Eir Cortes, Johan Sjons, Ulrika Marklund, Tove Gerholm, Kristina Nilsson Björkenstam and Monika Molnar.

    Download full text (pdf)
    fulltext
  • 19.
    Dalianis, Hercules
    et al.
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Östling, Robert
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Weegar, Rebecka
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Wirén, Mats
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Special Issue of Selected Contributions from the Seventh Swedish Language Technology Conference (SLTC 2018). 2019. Conference proceedings (editor) (Other academic)
    Abstract [en]

    This Special Issue contains three papers that are extended versions of abstracts presented at the Seventh Swedish Language Technology Conference (SLTC 2018), held at Stockholm University 8–9 November 2018. SLTC 2018 received 34 submissions, of which 31 were accepted for presentation. The number of registered participants was 113, including both attendees at SLTC 2018 and at two co-located workshops that took place on 7 November. 32 participants were internationally affiliated, of which 14 were from outside the Nordic countries. Overall participation was thus on a par with previous editions of SLTC, but international participation was higher.

  • 20.
    Drangert, Lisette
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Longitudinella förändringar av yttranden inom variationsmängder i barnriktat tal: En korpusstudie av yttrandetyper och verb [Longitudinal changes of utterances within variation sets in child-directed speech: A corpus study of utterance types and verbs]. 2016. Independent thesis Basic level (degree of Bachelor), 10 credits / 15 HE credits. Student thesis
    Abstract [en]

    Variation sets are a feature of child-directed speech characterized by successive utterances in which the adult speaker repeats and reorders their message with a constant intent. The aim of this study was to investigate variation sets over time in speech directed to children aged 7-33 months. The purpose was to study which types of utterances dominate the variation sets at different ages, and which utterance types tend to co-occur within the variation sets. Furthermore, intent was studied, as well as change in verb tense within these variation sets. A script was written to categorize types of utterances using data from a corpus of child-directed speech. A quantitative analysis was performed on the results based on four different age groups.

    The complexity of the utterances within variation sets was shown to grow with the increasing age of the children. Furthermore, a noticeable difference was observed in the intent of the adult speaker, correlating with the age of the child, as well as a decrease in the use of interjections combined with yes/no-questions and complex utterances the older the children were. A suggested interpretation of the result is that adults tend to take both sides of the conversation when the children are young, as opposed to when they speak to older, more verbal children who can provide the answer themselves.

    Download full text (pdf)
    fulltext
  • 21.
    Ek, Adam
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Blending Words or: How I Learned to Stop Worrying and Love the Blendguage: A computational study of lexical blending in Swedish. 2018. Independent thesis Advanced level (degree of Master (One Year)), 10 credits / 15 HE credits. Student thesis
    Abstract [en]

    This thesis investigates Swedish lexical blends. A lexical blend is defined as the concatenation of two words, where at least one word has been reduced. Lexical blends are approached from two perspectives. First, the thesis investigates lexical blends as they appear in the Swedish language. It is found that there is a significant statistical relationship between the two source words in terms of orthographic, phonemic and syllabic length and frequency in a reference corpus. Furthermore, some uncommon lexical blends created from pronouns and interjections are described. Lexical blends are also described in terms of their semantic construction and their similarity to other word-formation processes. Secondly, the thesis develops a model which predicts the source words of lexical blends. To predict the source words a logistic regression model is used. The evaluation shows that, using a ranking approach, the correct source words are the highest-ranking word pair in 32.2% of the cases. In the top 10 ranking word pairs, the correct word pair is found in 60.6% of the cases. The results are lower than in previous studies, but the number of blends used is also smaller. It is shown that lexical blends which overlap are easier to predict than lexical blends which do not overlap. Using feature ablation, it is shown that semantic and frequency-related features are the most important for the prediction of source words.

    Download full text (pdf)
    Blending Words or: How I Learned to Stop Worrying and Love the Blendguage
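    The thesis above scores candidate source-word pairs with logistic regression and ranks them. A minimal, hypothetical sketch of such a ranking setup with scikit-learn follows; the character-overlap and length features are simplifications invented for the example and are not the thesis feature set.

```python
from sklearn.linear_model import LogisticRegression

def common_prefix(a, b):
    """Length of the shared prefix of two strings."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n

def pair_features(blend, w1, w2):
    """Toy features for a candidate source-word pair (w1, w2) of a blend."""
    prefix = common_prefix(blend, w1)              # shared start with the first word
    suffix = common_prefix(blend[::-1], w2[::-1])  # shared end with the second word
    return [prefix, suffix, abs(len(w1) - len(w2)), len(w1) + len(w2) - len(blend)]

def train(blends_with_candidates):
    """blends_with_candidates: list of (blend, [(w1, w2, is_correct), ...])."""
    X, y = [], []
    for blend, candidates in blends_with_candidates:
        for w1, w2, is_correct in candidates:
            X.append(pair_features(blend, w1, w2))
            y.append(int(is_correct))
    model = LogisticRegression(max_iter=1000)
    model.fit(X, y)
    return model

def rank_candidates(model, blend, candidates):
    """Sort candidate (w1, w2) pairs by the model's probability of being the source words."""
    probs = model.predict_proba([pair_features(blend, w1, w2) for w1, w2 in candidates])[:, 1]
    return sorted(zip(candidates, probs), key=lambda x: x[1], reverse=True)
```

The ranking evaluation described in the abstract then reduces to checking whether the gold pair appears at rank 1 or within the top 10 of `rank_candidates`.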
  • 22.
    Ek, Adam
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Wirén, Mats
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Distinguishing Narration and Speech in Prose Fiction Dialogues. 2019. In: Proceedings of the Digital Humanities in the Nordic Countries 4th Conference / [ed] Costanza Navarretta, Manex Agirrezabal, Bente Maegaard, CEUR-WS.org, 2019, p. 124-132. Conference paper (Refereed)
    Abstract [en]

    This paper presents a supervised method for a novel task, namely, detecting elements of narration in passages of dialogue in prose fiction. The method achieves an F1-score of 80.8%, exceeding the best baseline by almost 33 percentage points. The purpose of the method is to enable a more fine-grained analysis of fictional dialogue than has previously been possible, and to provide a component for the further analysis of narrative structure in general.

    Download full text (pdf)
    fulltext
  • 23.
    Ek, Adam
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Wirén, Mats
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Östling, Robert
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Nilsson Björkenstam, Kristina
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Grigonytė, Gintarė
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Gustafson Capková, Sofia
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Identifying Speakers and Addressees in Dialogues Extracted from Literary Fiction. 2018. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018) / [ed] Nicoletta Calzolari, Khalid Choukri, Christopher Cieri, Thierry Declerck, Koiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis, Takenobu Tokunaga, European Language Resources Association, 2018, p. 817-824. Conference paper (Refereed)
    Abstract [en]

    This paper describes an approach to identifying speakers and addressees in dialogues extracted from literary fiction, along with a dataset annotated for speaker and addressee. The overall purpose of this is to provide annotation of dialogue interaction between characters in literary corpora in order to allow for enriched search facilities and construction of social networks from the corpora. To predict speakers and addressees in a dialogue, we use a sequence labeling approach applied to a given set of characters. We use features relating to the current dialogue, the preceding narrative, and the complete preceding context. The results indicate that even with a small amount of training data, it is possible to build a fairly accurate classifier for speaker and addressee identification across different authors, though the identification of addressees is the more difficult task.

    Download full text (pdf)
    fulltext
  • 24. Eklund, Robert
    et al.
    Wirén, Mats
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Effects of open and directed prompts on filled pauses and utterance production. 2010. In: Proceedings from Fonetik 2010, Lund, June 2–4, 2010 / [ed] Susanne Schötz and Gilbert Ambrazaitis, Lund: Mediatryck, 2010, p. 23-28. Conference paper (Other academic)
    Abstract [en]

    This paper describes an experiment where open and directed prompts were alternated when collecting speech data for the deployment of a call-routing application. The experiment tested whether open and directed prompts resulted in any differences with respect to the filled pauses exhibited by the callers, which is interesting in the light of the “many-options” hypothesis of filled pause production. The experiment also investigated the effects of the prompts on utterance form and meaning of the callers.

    Download full text (pdf)
    FULLTEXT01
  • 25.
    Eklås Tejman, Claudia
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Automatisk citatidentifiering för nyhetstext på svenska [Automatic quotation identification for Swedish news text]. 2015. Independent thesis Basic level (degree of Bachelor), 10 credits / 15 HE credits. Student thesis
    Abstract [en]

    The strategies for marking quotations in Swedish differ from those of most other European languages. Since most systems for quotation identification are developed for English, there was a need for a quotation identification system specifically adapted to Swedish. A gold standard of 100 quotes from SUC 3.0 and 206 quotes from unformatted, web-crawled news data was compiled to analyse the syntactic structures and marking patterns of Swedish quotations. A rule-based system for quotation identification based on these patterns was developed. It achieved an F-score of 0.79 for the raw news data that contained the gold standard quotes and was able to identify 13 of 19 marking patterns. It could not determine whether the quotes ended after the reporting phrase or not, since the raw text data lacked the formatting for the most common way to mark the end of a quote in Swedish.

  • 26.
    Ekman, Sara
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Automatisk extraktion av nyckelord ur ett kundforum [Automatic keyword extraction from a customer forum]. 2018. Independent thesis Basic level (degree of Bachelor), 10 credits / 15 HE credits. Student thesis
    Abstract [en]

    Conversations in a customer forum span different topics and the language is inconsistent. The text type does not meet the usual demands of automatic keyword extraction. This essay examines how keywords can be automatically extracted despite these difficulties. The study focuses on three aspects of keyword extraction. The first concerns how the established keyword extraction method TF*IDF performs compared to four methods created with this unusual material in mind. The second concerns different ways to calculate word frequency. The third concerns whether the methods use only posts, only titles, or both in their extractions. Non-parametric tests were conducted to evaluate the extractions. A number of Friedman's tests show that the methods in some cases differ in their ability to identify relevant keywords. In post-hoc tests performed between the highest-performing methods, one of the new methods performs significantly better than the other new methods but not better than TF*IDF. No difference was found between the use of different text types or ways to calculate word frequency. For future research, reliability testing of manually annotated keywords is recommended. A larger sample size than in the current study should be used, and further suggestions are given to improve the results of keyword extraction.

    Download full text (pdf)
    Automatisk extraktion av nyckelord ur ett kundforum
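    Since the study above uses TF*IDF as its established baseline for keyword extraction, here is a minimal, generic TF*IDF sketch for reference, not the code used in the thesis; treating each tokenised forum thread as one document is an assumption made for the example.

```python
import math
from collections import Counter

def tf_idf_keywords(documents, top_n=10):
    """Rank words in each document by TF*IDF.

    documents: list of token lists (here, one tokenised forum thread per document).
    Returns, for each document, the top_n (word, score) pairs.
    """
    n_docs = len(documents)
    # Document frequency: the number of documents each word occurs in.
    df = Counter(word for doc in documents for word in set(doc))

    results = []
    for doc in documents:
        tf = Counter(doc)
        scores = {
            word: (count / len(doc)) * math.log(n_docs / df[word])
            for word, count in tf.items()
        }
        results.append(sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n])
    return results
```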
  • 27.
    Engdahl, Johan
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Tremänning eller syssling: Automatisk sökning i bloggar efter ordisoglosser i Sverige [Tremänning or syssling: Automatic search of blogs for word isoglosses in Sweden]. 2012. Independent thesis Basic level (degree of Bachelor), 10 credits / 15 HE credits. Student thesis
    Abstract [sv] (English translation)

    Sometimes two dialects use different words for the same thing. The aim of this study is to show what can be automated in the search for word isoglosses. This is investigated by writing and evaluating a program that searches for word isoglosses in Sweden by analysing blog text. An isogloss is a geographical boundary between two different linguistic features, for example prosody or stress, or, as in this case, words. The program maps the writer's municipality to the words from the blog texts in a database. In addition, the program lets the user search either for how common a word is in Sweden's municipalities compared with the national average, or for which of two different words is the most common within each municipality, according to a two-sided proportion test. The results of the searches were written to a file and then plotted manually. The evaluation shows that the program can find some word isoglosses between municipalities, and that the maps to some extent agree with the results reported by Parkvall (Parkvall, 2011; Parkvall, 2012). This indicates that the program is a good starting point for similar studies. A possible improvement would be to let the user use regular expressions in order to remove ambiguity.

    Download full text (pdf)
    Tremänning eller syssling
  • 28. Erolcan Er, Mustafa
    et al.
    Kurfali, Murathan
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Zeyrek, Deniz
    Lightweight Connective Detection Using Gradient Boosting. 2024. In: ISA 2024: 20th Joint ACL - ISO Workshop on Interoperable Semantic Annotation at LREC-COLING 2024, Workshop Proceedings, European Language Resources Association, 2024, p. 53-59. Conference paper (Refereed)
    Abstract [en]

    In this work, we introduce a lightweight discourse connective detection system. Employing gradient boosting trained on straightforward, low-complexity features, the proposed approach sidesteps the computational demands of current approaches that rely on deep neural networks. Considering its simplicity, our approach achieves competitive results while offering significant gains in processing time, even on CPU. Furthermore, the stable performance across two unrelated languages suggests the robustness of our system in the multilingual scenario. The model is designed to support the annotation of discourse relations, particularly in scenarios with limited resources, while minimizing performance loss.
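    As a rough illustration of the setup described above (token-level connective detection with gradient boosting over simple features), here is a hypothetical scikit-learn sketch; the specific features (hashed token identity, neighbouring tokens, capitalisation, length) are placeholders chosen for the example, not the paper's feature set.

```python
from sklearn.ensemble import GradientBoostingClassifier

def token_features(tokens, i):
    """Low-complexity features for token i: the token, its neighbours, simple shape cues."""
    def bucket(tok):
        return hash(tok.lower()) % 50000  # crude vocabulary hashing for the sketch
    return [
        bucket(tokens[i]),
        bucket(tokens[i - 1]) if i > 0 else -1,
        bucket(tokens[i + 1]) if i < len(tokens) - 1 else -1,
        int(tokens[i][0].isupper()),
        len(tokens[i]),
    ]

def train_connective_detector(sentences, labels):
    """sentences: list of token lists; labels: parallel lists of 0/1 (1 = discourse connective)."""
    X = [token_features(toks, i) for toks in sentences for i in range(len(toks))]
    y = [lab for labs in labels for lab in labs]
    clf = GradientBoostingClassifier(n_estimators=200, max_depth=3)
    clf.fit(X, y)
    return clf

def detect(clf, tokens):
    """Return the tokens predicted to function as discourse connectives."""
    preds = clf.predict([token_features(tokens, i) for i in range(len(tokens))])
    return [tok for tok, p in zip(tokens, preds) if p == 1]
```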

  • 29.
    Gillholm, Katarina
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Neural maskinöversättning av gawarbati [Neural machine translation of Gawarbati]. 2023. Independent thesis Basic level (degree of Bachelor), 10 credits / 15 HE credits. Student thesis
    Abstract [en]

    Recent neural models have led to huge improvements in machine translation, but performance is still suboptimal for languages without large parallel datasets, so-called low-resource languages. Gawarbati is a small, threatened low-resource language with only 5000 parallel sentences. This thesis uses transfer learning and hyperparameters optimized for small datasets to explore possibilities and limitations for neural machine translation from Gawarbati to English. Transfer learning, where the parent model was trained on parallel data between Hindi and English, improved results by 1.8 BLEU and 1.3 chrF. Hyperparameters optimized for small datasets increased BLEU by 0.6 but decreased chrF by 1. Combining transfer learning and hyperparameters optimized for small datasets led to a decrease in performance of 0.5 BLEU and 2.2 chrF. The neural models outperform a word-based statistical machine translation system and GPT-3. The highest-performing model only achieved 2.8 BLEU and 19 chrF, which illustrates the limitations of machine translation for low-resource languages and the critical need for more data.

    Download full text (pdf)
    neural_maskinoversattning_av_gawarbati
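    The thesis above reports results as BLEU and chrF differences between configurations. For reference, a minimal sketch of computing such corpus-level scores with the sacreBLEU library is shown below; the file names are placeholders and this is not the evaluation script used in the thesis.

```python
from sacrebleu.metrics import BLEU, CHRF

def evaluate(hypotheses, references):
    """Corpus-level BLEU and chrF for one system output against one reference set.

    hypotheses: list of translated sentences (strings).
    references: list of reference sentences (strings), in the same order.
    """
    bleu = BLEU().corpus_score(hypotheses, [references])
    chrf = CHRF().corpus_score(hypotheses, [references])
    return bleu.score, chrf.score

if __name__ == "__main__":
    # Placeholder file names; one sentence per line is assumed.
    with open("hypotheses.txt", encoding="utf-8") as f:
        hyps = [line.strip() for line in f]
    with open("references.txt", encoding="utf-8") as f:
        refs = [line.strip() for line in f]
    bleu_score, chrf_score = evaluate(hyps, refs)
    print(f"BLEU: {bleu_score:.1f}  chrF: {chrf_score:.1f}")
```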
  • 30.
    Gotting, Olof
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Generating Conceptual Metaphoric Paraphrases. 2021. Independent thesis Basic level (degree of Bachelor), 10 credits / 15 HE credits. Student thesis
    Abstract [en]

    Metaphoric Paraphrase generation is a relatively new and unexplored Natural Language Generation task. The aim of the task is to develop computational systems that paraphrase literal sentences into cogent metaphoric ones. Challenges in the field include representation of common sense knowledge and ensuring meaning retention when dealing with phrases that are dissimilar in their literal sense. This thesis will deal with the specific task of paraphrasing literal adjective phrases into metaphoric noun phrases, taking into consideration the preceding context of the adjective phrase. Two different systems were developed as part of this study. The systems are identical, apart from the fact that one is endowed with a knowledge representation based on Conceptual Metaphor Theory. The paraphrases generated by the systems, along with paraphrases written by a native speaker of English, were scored on the parameters of meaning retention and creativity by a crowd-sourced panel. Both systems were able to generate cogent metaphoric paraphrases, although fairly unreliably compared to the human. The system endowed with Conceptual Metaphor Theory knowledge got a lower average meaning retention score and a higher average creativity score than the system without Conceptual Metaphor Theory knowledge representation. In addition to that it was found that less similarity in sentence embeddings of literal sentences and metaphoric paraphrases of them correlates with a higher level of perceived meaning retention and a lower perceived creativity of the metaphoric paraphrase. It was also found that less difference in GPT-2 log probability between literal sentences and metaphoric paraphrases of them correlates with humans evaluating the paraphrases as less creative.

    Download full text (pdf)
    Gotting_Generating_Conceptual_Metaphoric_Paraphrases
  • 31.
    Gren, Gustaf
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    A Tale of Two Domains: Automatic Identification of Hate Speech in Cross-Domain Scenarios. 2023. Independent thesis Basic level (degree of Bachelor), 10 credits / 15 HE credits. Student thesis
    Abstract [en]

    As our lives become more and more digital, our exposure to certain phenomena increases, one of which is hate speech. Thus, automatic hate speech identification is needed. This thesis explores three strategies for hate speech detection in cross-domain scenarios: using a model trained on annotated data from a previous domain, a model trained on data from a novel methodology of automatic data derivation (with cross-domain scenarios in mind), and using ChatGPT as a domain-agnostic classifier. Results showed that cross-domain scenarios remain a challenge for hate speech detection, results which are discussed in terms of both technical and ethical considerations.

    Download full text (pdf)
    fulltext
  • 32.
    Grigonyte, Gintare
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Baldwin, Timothy
    Automatic Detection of Multilingual Dictionaries on the Web. 2014. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Short Papers), Association for Computational Linguistics, 2014, p. 93-98. Conference paper (Refereed)
    Download full text (pdf)
    Automatic Detection of Multilingual Dictionaries on the Web
  • 33.
    Grigonyte, Gintare
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Clematide, Simon
    Institute of Computational Linguistics, University of Zurich, Switzerland.
    Rinaldi, Fabio
    Institute of Computational Linguistics, University of Zurich, Switzerland.
    How preferred are preferred terms? 2013. In: eLex 2013 / [ed] Kosem, I., Kallas, J., Gantar, P., Krek, S., Langemets, M., Tuulik, M., 2013, p. 452-459. Conference paper (Refereed)
    Download full text (pdf)
    fulltext
  • 34.
    Grigonyte, Gintare
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Hammarberg, Björn
    Stockholm University, Faculty of Humanities, Department of Linguistics, General Linguistics.
    Pronunciation and Spelling: the Case of Misspellings in Swedish L2 Written Essays. 2014. In: Human Language Technologies - The Baltic Perspective, Baltic HLT 2014 / [ed] Andrius Utka, Gintarė Grigonytė, Jurgita Kapočiūtė-Dzikienė, Jurgita Vaičenonienė, Amsterdam: IOS Press, 2014, p. 95-98. Conference paper (Refereed)
    Abstract [en]

    This research presents an investigation performed on the ASU corpus. We analyse to what extent the pronunciation of intended words is reflected in the spelling errors made by L2 Swedish learners. We also propose a method that helps to automatically discriminate misspellings affected by pronunciation from other types of misspellings.

    Download full text (pdf)
    Pronunciation and Spelling: the Case of Misspellings in Swedish L2 Written Essays
  • 35.
    Grigonyte, Gintare
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Kvist, Maria
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Wirén, Mats
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Velupillai, Sumithra
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Henriksson, Aron
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Swedification patterns of Latin and Greek affixes in clinical text. 2016. In: Nordic Journal of Linguistics, ISSN 0332-5865, E-ISSN 1502-4717, Vol. 39, no 1, p. 5-37. Article in journal (Refereed)
    Abstract [en]

    Swedish medical language is rich with Latin and Greek terminology which has undergone a Swedification since the 1980s. However, many original expressions are still used by clinical professionals. The goal of this study is to obtain precise quantitative measures of how the foreign terminology is manifested in Swedish clinical text. To this end, we explore the use of Latin and Greek affixes in Swedish medical texts in three genres: clinical text, scientific medical text and online medical information for laypersons. More specifically, we use frequency lists derived from tokenised Swedish medical corpora in the three domains, and extract word pairs belonging to types that display both the original and Swedified spellings. We describe six distinct patterns explaining the variation in the usage of Latin and Greek affixes in clinical text. The results show that to a large extent affixes in clinical text are Swedified and that prefixes are used more conservatively than suffixes.

  • 36.
    Grigonyte, Gintare
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Schneider, Gerold
    From lexical bundles to surprisal: Measuring the idiom principle. 2014. In: Lexical bundles in English non-fiction writing: forms and functions, 2014. Conference paper (Refereed)
    Abstract [en]

    Lexical bundles (LB) testify to Sinclair's idiom principle (SIP), and measure formulaicity, complexity and (non-)creativity (FCN). We exploit the information-theoretic measure of surprisal to analyze these. Frequency as a measure of LB has been criticized (McEnery et al., 2006:208–220); instead, collocation measures were suggested, until Biber (2009:286–290) raised three criticisms. First, MI ranks rare collocations, which often include idioms, highest. We answer that idioms are also formulaic, and that there are collocation measures with a bias towards frequent collocations. Second, MI does not respect word order. We thus use directed word transition probabilities like surprisal (Levy and Jaeger 2007), where the 3-gram surprisal of a word is -log2 P(w_i | w_{i-2}, w_{i-1}). Third, formulaic sequences are often discontinuous. We thus sum over sequences, use 3-grams as atoms, and address syntactic surprisal. We argue that abstracting to surprisal as a measure of LB and FCN is appropriate, as it expresses reader expectations and text entropy. We use surprisal to analyse differences between:

    1. spoken and written learner language (L2);
    2. L2 across proficiency levels;
    3. L2 compared with L1

    We test Pawley and Syder (1983)'s and Levy and Jaeger (2007)'s hypothesis that native speakers play the tug-of-war between formulaicity and expressiveness best, thus minimizing comprehension difficulty, according to the uniform information density principle.
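    To make the surprisal measure above concrete, here is a minimal sketch of per-token 3-gram surprisal estimated from a training corpus; the add-one smoothing (so unseen trigrams do not get infinite surprisal) is an assumption for the example, not necessarily the authors' choice.

```python
import math
from collections import Counter

def train_trigram_counts(corpus):
    """corpus: list of token lists. Returns (trigram counts, bigram counts, vocabulary size)."""
    tri, bi = Counter(), Counter()
    vocab = set()
    for sent in corpus:
        toks = ["<s>", "<s>"] + sent
        vocab.update(sent)
        for i in range(2, len(toks)):
            tri[(toks[i - 2], toks[i - 1], toks[i])] += 1
            bi[(toks[i - 2], toks[i - 1])] += 1
    return tri, bi, len(vocab)

def trigram_surprisal(sentence, tri, bi, vocab_size):
    """Surprisal of each token: -log2 P(w_i | w_{i-2}, w_{i-1}), add-one smoothed."""
    toks = ["<s>", "<s>"] + sentence
    surprisals = []
    for i in range(2, len(toks)):
        context = (toks[i - 2], toks[i - 1])
        p = (tri[(context[0], context[1], toks[i])] + 1) / (bi[context] + vocab_size)
        surprisals.append(-math.log2(p))
    return surprisals
```

Averaging these per-token values over a text gives one number per learner production, which can then be compared across proficiency levels and between L1 and L2 as the abstract proposes.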

  • 37.
    Grigonyte, Gintare
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Schneider, Gerold
    Measuring Encoding Efficiency in Swedish and English Language Learner Speech Production. 2017. In: Proceedings of Interspeech 2017 / [ed] Francisco Lacerda, David House, Mattias Heldner, Joakim Gustafson, Sofia Strömbergsson, Marcin Włodarczak, The International Speech Communication Association (ISCA), 2017, p. 1779-1783. Conference paper (Refereed)
    Abstract [en]

    We use n-gram language models to investigate how far language approximates an optimal code for human communication in terms of Information Theory [1], and what differences there are between Learner proficiency levels. Although the language of lower level learners is simpler, it is less optimal in terms of information theory, and as a consequence more difficult to process.

    Download full text (pdf)
    fulltext
  • 38.
    Grigonyté, Gintaré
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Kvist, Maria
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences. Karolinska Institutet, Sweden.
    Velupillai, Sumithra
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Wirén, Mats
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Improving Readability of Swedish Electronic Health Records through Lexical Simplification: First Results. 2014. In: Proceedings of the 3rd Workshop on Predicting and Improving Text Readability for Target Reader Populations (PITR), Stroudsburg, USA: Association for Computational Linguistics, 2014, p. 74-83. Conference paper (Refereed)
    Abstract [en]

    This paper describes part of an ongoing effort to improve the readability of Swedish electronic health records (EHRs). An EHR contains systematic documentation of a single patient’s medical history across time, entered by healthcare professionals with the purpose of enabling safe and informed care. Linguistically, medical records exemplify a highly specialised domain, which can be superficially characterised as having telegraphic sentences involving displaced or missing words, abundant abbreviations, spelling variations including misspellings, and terminology. We report results on lexical simplification of Swedish EHRs, by which we mean detecting the unknown, out-of-dictionary words and trying to resolve them either as compounded known words, abbreviations or misspellings.

    Download full text (pdf)
    Improving Readability of Swedish Electronic Health Records through Lexical Simplification: First Results
  • 39.
    Grigonyté, Gintaré
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Kvist, Maria
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences. Karolinska Institute, Sweden.
    Velupillai, Sumithra
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Wirén, Mats
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Spelling Variation of Latin and Greek words in Swedish Medical Text2014Conference paper (Refereed)
  • 40.
    Grigonyté, Gintaré
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Nilsson Björkenstam, Kristina
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Language-independent exploration of repetition and variation in longitudinal child-directed speech: A tool and resources2016In: Proceedings of the joint workshop on NLP for Computer Assisted Language Learning and NLP for Language Acquisition at SLTC, Umeå, 16th November 2016 / [ed] Elena Volodina, Gintarė Grigonytė, Ildikó Pilán, Kristina Nilsson Björkenstam, Lars Borin, Linköping: Linköping University Electronic Press, 2016, p. 41-50Conference paper (Refereed)
    Abstract [en]

    We present a language-independent tool, called Varseta, for extracting variation sets in child-directed speech. This tool is evaluated against a gold standard corpus annotated with variation sets, MINGLE-3-VS, and used to explore variation sets in 26 languages in CHILDES-26-VS, a comparable corpus derived from the CHILDES database. The tool and the resources are freely available for research.
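    For readers unfamiliar with the notion, a variation set is a run of consecutive child-directed utterances that partially repeat one another. The toy function below groups consecutive utterances that share at least one word type; it only illustrates the concept and is not Varseta's actual algorithm, and the overlap criterion is an arbitrary assumption.

    # Toy illustration of variation sets (not the Varseta algorithm).
    def variation_sets(utterances, min_overlap=1):
        """Group consecutive utterances that share at least min_overlap word types."""
        sets, current = [], [utterances[0]]
        for utt in utterances[1:]:
            if len(set(utt.split()) & set(current[-1].split())) >= min_overlap:
                current.append(utt)
            else:
                if len(current) > 1:
                    sets.append(current)
                current = [utt]
        if len(current) > 1:
            sets.append(current)
        return sets

    child_directed = ["put the ball there", "put it there", "there you go", "look a dog"]
    print(variation_sets(child_directed))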

  • 41.
    Hjelm, Hans
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Extraction of Cross Language Term Correspondences2006In: Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC'2006), 2006Conference paper (Refereed)
  • 42.
    Hjelm, Hans
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Identifying Cross Language Term Equivalents Using Statistical Machine Translation and Distributional Association Measures2007In: Proceedings of Nodalida 2007, the 16th Nordic Conference of Computational Linguistics / [ed] Nivre, Heiki-Jaan Kaalep, Kadri Muischnek and Mare Koit, 2007Conference paper (Refereed)
    Abstract [en]

    This article presents a comparison of the accuracy of a number of different approaches for identifying cross-language term equivalents (translations). The methods investigated are, on the one hand, associative measures commonly used in word-space models or in Information Retrieval and, on the other hand, a Statistical Machine Translation (SMT) approach. I have performed tests on six language pairs, using the JRC-Acquis parallel corpus as training material and Eurovoc as a gold standard. The SMT approach is shown to be more effective than the associative measures. The best results are achieved by taking a weighted average of the scores of the SMT approach and the disparate associative measures.
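    The final combination step, a weighted average of the SMT score and the associative-measure scores for each candidate pair, might look roughly like the sketch below. The candidate pairs, scores and weights are invented for illustration; the article's actual measures, weighting scheme and tuning are not reproduced here.

    # Illustrative score combination (invented scores and weights).
    def combine_scores(smt_score, assoc_scores, smt_weight=0.5):
        """Weighted average: smt_weight on SMT, the rest split over associative measures."""
        assoc_weight = (1 - smt_weight) / len(assoc_scores)
        return smt_weight * smt_score + assoc_weight * sum(assoc_scores)

    candidates = {
        ("avtal", "agreement"): (0.82, [0.70, 0.65]),
        ("avtal", "treaty"): (0.40, [0.55, 0.30]),
    }
    best = max(candidates, key=lambda pair: combine_scores(*candidates[pair]))
    print(best)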

  • 43.
    Hjelm, Hans
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Buitelaar, Paul
    Multilingual Evidence Improves Clustering-based Taxonomy Extraction2008In: Proceedings of the 18th European Conference on Artificial Intelligence (ECAI 2008), 2008Conference paper (Refereed)
  • 44.
    Hjelm, Hans
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Schwarz, Christoph
    LiSa - Morphological Analysis for Information Retrieval2006In: Proceedings of the 15th NODALIDA conference, Joensuu 2005, 2006Conference paper (Refereed)
  • 45.
    Hultin, Felix
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Understanding Context-free Grammars through Data Visualization2016Independent thesis Basic level (degree of Bachelor), 10 credits / 15 HE creditsStudent thesis
    Abstract [en]

    Ever since the late 1950s, context-free grammars have played an important role within the field of linguistics, have been part of introductory courses and have expanded into other fields of study. Meanwhile, data visualization in modern web development has made it possible to do feature-rich visualization in the browser. In this thesis, these two developments are united by developing a browser-based app for writing context-free grammars, parsing sentences and visualizing the output. A user experience study with usability tests and user interviews is conducted in order to investigate the possible benefits and disadvantages of such visualization when writing context-free grammars. The results show that the data visualization was used by participants only to a limited extent: it helped them to see whether sentences were parsed and, if a sentence was not parsed, at which position parsing went wrong. Future improvements to the software, and studies of them, are proposed, as well as an expansion of the field of data visualization within linguistics.
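    The core operation the thesis visualizes, writing a context-free grammar and parsing a sentence with it, can be reproduced outside the browser in a few lines of Python. The example below uses NLTK's chart parser purely as a stand-in; the grammar is a made-up toy and the thesis' own app is not based on NLTK.

    import nltk

    # A toy context-free grammar and a chart parse of one sentence.
    grammar = nltk.CFG.fromstring("""
        S  -> NP VP
        NP -> Det N
        VP -> V NP
        Det -> 'the' | 'a'
        N  -> 'dog' | 'ball'
        V  -> 'chases'
    """)

    parser = nltk.ChartParser(grammar)
    for tree in parser.parse("the dog chases a ball".split()):
        tree.pretty_print()  # text rendering of the parse tree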

    Download full text (pdf)
    Understanding Context-free Grammars through Data Visualization
  • 46.
    Hägglöf, Hillevi
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Tengstrand, Lisa
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    A Random Indexing Approach to Unsupervised Selectional Preference Induction2011Independent thesis Basic level (degree of Bachelor), 10 credits / 15 HE creditsStudent thesis
    Abstract [en]

    A selectional preference is the relation between a head-word and plausible arguments of that head-word. Estimation of the association between these words is important to natural language processing applications such as Word Sense Disambiguation. This study presents a novel approach to selectional preference induction within a Random Indexing word space. This is a spatial representation of meaning where distributional patterns enable estimation of the similarity between words. Using only frequency statistics about words to estimate how strongly one word selects another, the aim of this study is to develop a flexible method that is not language-dependent and does not require any annotated resources, which is in contrast to methods from previous research. In order to optimize the performance of the selectional preference model, experiments including parameter tuning and variation of corpus size were conducted. The selectional preference model was evaluated in a pseudo-word evaluation, which lets the model decide which of two arguments has a stronger correlation to a given verb. Results show that varying parameters and corpus size does not affect the performance of the selectional preference model in a notable way. The conclusion of the study is that the language model used does not provide adequate tools to model selectional preferences. This might be due to a noisy representation of head-words and their arguments.
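    As background, a Random Indexing word space assigns each word a sparse random index vector and builds a word's context vector by summing the index vectors of the words it co-occurs with inside a window; similarity is then measured with cosine. The sketch below shows this construction in miniature; the dimensionality, sparsity, window size and toy corpus are arbitrary assumptions, and the thesis' actual selectional-preference setup is not reproduced.

    import math
    import random
    from collections import defaultdict

    DIM, NONZERO, WINDOW = 300, 6, 2   # arbitrary toy settings
    random.seed(0)

    def index_vector():
        """A sparse ternary random index vector."""
        vec = [0] * DIM
        for pos in random.sample(range(DIM), NONZERO):
            vec[pos] = random.choice([-1, 1])
        return vec

    def build_space(sentences):
        """Sum, for each word, the index vectors of its within-window neighbours."""
        index = defaultdict(index_vector)
        context = defaultdict(lambda: [0] * DIM)
        for sent in sentences:
            for i, w in enumerate(sent):
                for j in range(max(0, i - WINDOW), min(len(sent), i + WINDOW + 1)):
                    if j != i:
                        context[w] = [a + b for a, b in zip(context[w], index[sent[j]])]
        return context

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    space = build_space([["eat", "an", "apple"], ["eat", "a", "sandwich"]])
    print(cosine(space["apple"], space["sandwich"]))  # toy data, toy numbers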

    Download full text (pdf)
    fulltext
  • 47. Ibbotson, Paul
    et al.
    Hartman, Rose M.
    Nilsson Björkenstam, Kristina
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Frequency filter: an open access tool for analysing language development2018In: Language, Cognition and Neuroscience, ISSN 2327-3798, E-ISSN 2327-3801, Vol. 33, no 10, p. 1325-1339Article in journal (Refereed)
    Abstract [en]

    We present an open-access analytic tool, which allows researchers to simultaneously control for and combine language data from the child, the caregiver, multiple languages, and across multiple time points to make inferences about the social and cognitive factors driving the shape of language development. We demonstrate how the tool works in three domains of language learning and across six languages. The results demonstrate the usefulness of this approach and provide deeper insight into three areas of language production and acquisition: egocentric language use, the learnability of nouns versus verbs, and imageability. We have made the Frequency Filter tool freely available as an R package for other researchers to use at https://github.com/rosemm/FrequencyFilter.

  • 48.
    Kann, Amanda
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Massively Multilingual Token-Based Typology Using the Parallel Bible Corpus2024In: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Torino, Italia: ELRA and ICCL , 2024, p. 11070-11079Conference paper (Refereed)
    Abstract [en]

    The parallel Bible corpus is a uniquely broad multilingual resource, covering over 1400 languages. While this data is potentially highly useful for extending language coverage in both token-based typology research and various low-resource NLP applications, the restricted register and translational nature of the Bible texts have raised concerns as to whether they are sufficiently representative of language use outside of their specific context. In this paper, we analyze the reliability and generalisability of word order statistics extracted from the Bible corpus from two angles: stability across different translations in the same language, and comparability with Universal Dependencies corpora and typological database classifications from URIEL and Grambank. We find that variation between same-language translations is generally low and that agreement with other data sources and previous work is generally high, suggesting that the impact of issues specific to massively parallel texts is smaller than previously posited.
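    A token-based word-order statistic of the kind compared in the paper can be computed directly from dependency-annotated sentences, for example the proportion of objects that follow their verbal head. The sketch below operates on hand-written toy triples; the relation label, the data format and the example sentence are illustrative assumptions rather than the paper's actual extraction pipeline.

    # Toy word-order statistic: proportion of "obj" dependents after their head.
    def vo_proportion(sentences):
        """sentences: lists of (position, deprel, head_position) triples."""
        vo = total = 0
        for sent in sentences:
            for pos, deprel, head in sent:
                if deprel == "obj":
                    total += 1
                    if pos > head:  # object follows its verbal head
                        vo += 1
        return vo / total if total else float("nan")

    # "the dog chased the cat": object "cat" (pos 5) follows its head "chased" (pos 3)
    toy = [[(1, "det", 2), (2, "nsubj", 3), (3, "root", 0), (4, "det", 5), (5, "obj", 3)]]
    print(vo_proportion(toy))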

    Download full text (pdf)
    fulltext
  • 49.
    Kasaty, Anna
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Koponen, Eeva
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Klintfors, Eeva
    Stockholm University, Faculty of Humanities, Department of Linguistics, Phonetics.
    Swedish Nominal Morphophonology Implemented within the Two-level Model in PC-Kimmo1998Independent thesis Basic level (degree of Bachelor), 10 credits / 15 HE creditsStudent thesis
    Abstract [en]

    This paper presents a description of Swedish morphophonology and an attempt to create a Swedish pronunciation morpheme lexicon as part of a text-to-speech system at Telia Research AB.

  • 50.
    Koponen, Eeva
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Phonetics. Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Klintfors, Eeva
    Stockholm University, Faculty of Humanities, Department of Linguistics, Phonetics.
    Effects of Target-Word Frequency Rate on Sound-Meaning-Connection in Five to Fifteen Month-Old Swedish Infants1999Independent thesis Basic level (degree of Bachelor), 10 credits / 15 HE creditsStudent thesis
    Abstract [en]

    The purpose of this study was to examine the effects of manipulating target-word frequency rate and target-word phrase position on sound-meaning connection in five- to fifteen-month-old Swedish infants. Three test conditions were designed, each of them a film showing objects and corresponding phrases made of randomly generated artificial words. The first, high-variability test condition included context-dependent information, while the second and third, low-variability test conditions were characterised by a high nonsense target-word rate, with target words occurring in phrase-final position. The aim of the artificial input language was to ensure the novelty of the test material and to simulate the type of learning situation, in which the semantic content of words is arbitrary, facing young infants at the beginning of language learning. Analyses of the informants' looking behaviour, prior to and after exposure to the objects and the corresponding audio input, were performed. Results showed that the high-variability test condition and the low-variability test conditions were associated with significant between-group differences. This finding indicates that the nonsense phrases in the low-variability test conditions managed to 'explain' the objects just as semantically meaningful phrases do. Compared with past research, these findings suggest that experience-dependent mechanisms may support not only word segmentation but also more complicated aspects of language learning, such as the acquisition of syntax.

    Download full text (pdf)
    fulltext