  • 1.
    Adesam, Yvonne
    Stockholm University, Faculty of Humanities, Department of Linguistics.
    The Multilingual Forest: Investigating High-quality Parallel Corpus Development. 2012. Doctoral thesis, monograph (Other academic)
    Abstract [en]

    This thesis explores the development of parallel treebanks, collections of language data consisting of texts and their translations, with syntactic annotation and alignment, linking words, phrases, and sentences to show translation equivalence. We describe the semi-manual annotation of the SMULTRON parallel treebank, consisting of 1,000 sentences in English, German and Swedish. This description is the starting point for answering the first of two questions in this thesis.

    • What issues need to be considered to achieve a high-quality, consistent, parallel treebank?

    The units of annotation and the choice of annotation schemes are crucial for quality, and some automated processing is necessary to increase the size. Automatic quality checks and evaluation are essential, but manual quality control is still needed to achieve high quality.

    Additionally, we explore improving the automatically created annotation for one language, using information available from the annotation of the other languages. This leads us to the second of the two questions in this thesis.

    • Can we improve automatic annotation by projecting information available in the other languages?

    Experiments with automatic alignment, which is projected from two language pairs, L1–L2 and L1–L3, onto the third pair, L2–L3, show an improvement in precision, in particular if the projected alignment is intersected with the system alignment. We also construct a test collection for experiments on annotation projection to resolve prepositional phrase attachment ambiguities. While majority vote projection improves the annotation, compared to the basic automatic annotation, using linguistic clues to correct the annotation before majority vote projection is even better, although more laborious. However, some structural errors cannot be corrected by projection at all, as different languages have different wording, and thus different structures.

  • 2.
    Alemu Argaw, Atelach
    et al.
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Asker, Lars
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Cöster, Rickard
    SICS.
    Karlgren, Jussi
    SICS.
    Sahlgren, Magnus
    SICS.
    Dictionary-based Amharic-French information retrieval. 2006. In: Accessing multilingual information repositories: 6th workshop of the Cross-Language Evaluation Forum, CLEF 2005, Vienna, Austria, 21-23 September, 2005, revised selected papers / [ed] Carol Peters, Fredric C. Gey, Julio Gonzalo, Henning Müller, Gareth J. F. Jones, Michael Kluck, Bernardo Magnini, Maarten de Rijke, Berlin: Springer Berlin/Heidelberg, 2006, p. 83-92. Conference paper (Other academic)
    Abstract [en]

    We present four approaches to the Amharic-French bilingual track at CLEF 2005. All experiments use a dictionary-based approach to translate the Amharic queries into French bags-of-words, but while one approach uses word sense discrimination on the translated side of the queries, the other one includes all senses of a translated word in the query for searching. We used two search engines, the SICS experimental engine and Lucene, hence four runs with the two approaches. Non-content-bearing words were removed both before and after the dictionary lookup. TF/IDF values supplemented by a heuristic function were used to remove the stop words from the Amharic queries, and two French stop word lists were used to remove them from the French translations. In our experiments, we found that the SICS search engine performs better than Lucene and that using the word-sense-discriminated keywords produces a slightly better result than the full set of non-discriminated keywords.

  • 3.
    Angelov, Krasimir
    et al.
    University of Gothenburg, Sweden.
    Liefke, Kristina
    Goethe University, Germany.
    Loukanova, Roussanka
    Stockholm University, Faculty of Science, Department of Mathematics. Stockholm University, Faculty of Humanities, Department of Philosophy.
    Moortgat, Michael
    Utrecht University, The Netherlands.
    Tojo, Satoshi
    School of Information Science, JAIST, Japan.
    Proceedings of the Symposium on Logic and Algorithms in Computational Linguistics 2018 (LACompLing2018). 2018. Conference proceedings (editor) (Refereed)
    Abstract [en]

    Computational linguistics studies natural language in its various manifestations from a computational point of view, both on the theoretical level (modeling grammar modules dealing with natural language form and meaning, and the relation between these two) and on the practical level (developing applications for language and speech technology). Right from the start in the 1950s, there have been strong links with computer science, logic, and many areas of mathematics - one can think of Chomsky's contributions to the theory of formal languages and automata, or Lambek's logical modeling of natural language syntax. The symposium on Logic and Algorithms in Computational Linguistics 2018 (LACompLing2018) assesses the place of logic, mathematics, and computer science in present-day computational linguistics. It intends to be a forum for presenting new results as well as work in progress.

  • 4.
    Bell, Linda
    et al.
    TeliaSonera (R & D).
    Boye, Johan
    TeliaSonera (R & D).
    Gustafson, Joakim
    TeliaSonera (R & D).
    Heldner, Mattias
    TeliaSonera (R & D).
    Lindström, Anders
    TeliaSonera (R & D).
    Wirén, Mats
    TeliaSonera (R & D).
    The Swedish NICE Corpus – Spoken dialogues between children and embodied characters in a computer game scenario. 2005. In: Proceedings Interspeech 2005 - Eurospeech: 9th European Conference on Speech Communication and Technology, Lisbon, Portugal: ISCA, 2005, p. 2765-2768. Conference paper (Refereed)
    Abstract [en]

    This article describes the collection and analysis of a Swedish database of spontaneous and unconstrained children-machine dialogues. The Swedish NICE corpus consists of spoken dialogues between children aged 8 to 15 and embodied fairytale characters in a computer game scenario. Compared to previously collected corpora of children's computer-directed speech, the Swedish NICE corpus contains extended interactions, including three-party conversation, in which the young users used spoken dialogue as the primary means of progression in the game.

  • 5.
    Bell, Linda
    et al.
    TeliaSonera (R & D).
    Boye, Johan
    TeliaSonera (R & D).
    Gustafson, Joakim
    TeliaSonera (R & D).
    Wirén, Mats
    TeliaSonera (R & D).
    Modality Convergence in a Multimodal Dialogue System. 2000. In: Proceedings of Götalog, 2000, p. 29-34. Conference paper (Other academic)
    Abstract [en]

    When designing multimodal dialogue systems allowing speech as well as graphical operations, it is important to understand not only how people make use of the different modalities in their utterances, but also how the system might influence a user's choice of modality by its own behavior. This paper describes an experiment in which subjects interacted with two versions of a simulated multimodal dialogue system. One version used predominantly graphical means when referring to specific objects; the other used predominantly verbal referential expressions. The purpose of the study was to find out what effect, if any, the system's referential strategy had on the user's behavior. The results provided limited support for the hypothesis that the system can influence users to adopt another modality for the purpose of referring.

  • 6. Berndorfer, Stefan
    et al.
    Henriksson, Aron
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Automated Diagnosis Coding with Combined Text Representations. 2017. In: Studies in Health Technology and Informatics, ISSN 0926-9630, E-ISSN 1879-8365, Vol. 235, p. 201-2015. Article in journal (Refereed)
    Abstract [en]

    Automated diagnosis coding can be provided efficiently by learning predictive models from historical data; however, discriminating between thousands of codes while allowing a variable number of codes to be assigned is extremely difficult. Here, we explore various text representations and classification models for assigning ICD-9 codes to discharge summaries in MIMIC-III. It is shown that the relative effectiveness of the investigated representations depends on the frequency of the diagnosis code under consideration and that the best performance is obtained by combining models built using different representations.

  • 7. Bjerva, Johannes
    et al.
    Grigonyte, Gintare
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Östling, Robert
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Plank, Barbara
    Neural Networks and Spelling Features for Native Language Identification. 2017. In: The Twelfth Workshop on Innovative Use of NLP for Building Educational Applications: Proceedings of the Workshop, Association for Computational Linguistics, 2017, p. 235-239. Conference paper (Refereed)
    Abstract [en]

    We present the RUG-SU team's submission at the Native Language Identification Shared Task 2017. We combine several approaches into an ensemble, based on spelling error features, a simple neural network using word representations, a deep residual network using word and character features, and a system based on a recurrent neural network. Our best system is an ensemble of neural networks, reaching an F1 score of 0.8323. Although our system is not the highest ranking one, we do outperform the baseline by far.

  • 8. Bjerva, Johannes
    et al.
    Östling, Robert
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Cross-lingual Learning of Semantic Textual Similarity with Multilingual Word Representations. 2017. In: Proceedings of the 21st Nordic Conference on Computational Linguistics / [ed] Jörg Tiedemann, Linköping: Linköping University Electronic Press, 2017, p. 211-215, article id 024. Conference paper (Refereed)
    Abstract [en]

    Assessing the semantic similarity between sentences in different languages is challenging. We approach this problem by leveraging multilingual distributional word representations, where similar words in different languages are close to each other. The availability of parallel data allows us to train such representations on a large number of languages. This allows us to leverage semantic similarity data for languages for which no such data exists. We train and evaluate on five language pairs, including English, Spanish, and Arabic. We are able to train well-performing systems for several language pairs, without any labelled data for that language pair.

  • 9. Bjerva, Johannes
    et al.
    Östling, Robert
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Han Veiga, Maria
    Tiedemann, Jörg
    Augenstein, Isabelle
    What Do Language Representations Really Represent? 2019. In: Computational linguistics - Association for Computational Linguistics (Print), ISSN 0891-2017, E-ISSN 1530-9312, Vol. 45, no 2, p. 381-389. Article in journal (Refereed)
    Abstract [en]

    A neural language model trained on a text corpus can be used to induce distributed representations of words, such that similar words end up with similar representations. If the corpus is multilingual, the same model can be used to learn distributed representations of languages, such that similar languages end up with similar representations. We show that this holds even when the multilingual corpus has been translated into English, by picking up the faint signal left by the source languages. However, just as it is a thorny problem to separate semantic from syntactic similarity in word representations, it is not obvious what type of similarity is captured by language representations. We investigate correlations and causal relationships between language representations learned from translations on one hand, and genetic, geographical, and several levels of structural similarity between languages on the other. Of these, structural similarity is found to correlate most strongly with language representation similarity, whereas genetic relationships (a convenient benchmark used for evaluation in previous work) appear to be a confounding factor. Apart from implications about translation effects, we see this more generally as a case where NLP and linguistic typology can interact and benefit one another.

  • 10.
    Boye, Johan
    et al.
    TeliaSonera (R & D).
    Gustafson, Joakim
    TeliaSonera (R & D).
    Wirén, Mats
    TeliaSonera (R & D).
    Robust spoken language understanding in a computer game. 2006. In: Speech Communication, ISSN 0167-6393, E-ISSN 1872-7182, Vol. 48, no 3-4, p. 335-353. Article in journal (Refereed)
    Abstract [en]

    We present and evaluate a robust method for the interpretation of spoken input to a conversational computer game. The scenario of the game is that of a player interacting with embodied fairy-tale characters in a 3D world via spoken dialogue (supplemented by graphical pointing actions) to solve various problems. The player himself cannot directly perform actions in the world, but interacts with the fairy-tale characters to have them perform various tasks, and to get information about the world and the problems to solve. Hence the role of spoken dialogue as the primary means of control is obvious and natural to the player. Naturally, this means that robust spoken language understanding becomes a critical component. To this end, the paper describes a semantic representation formalism and an accompanying parsing algorithm which works off the output of the speech recogniser's statistical language model. The evaluation shows that the parser is robust in the sense of considerably improving on the noisy output of the speech recogniser.

  • 11.
    Boye, Johan
    et al.
    TeliaSonera (R & D).
    Wirén, Mats
    TeliaSonera (R & D).
    Negotiative Spoken-Dialogue Interfaces to Databases. 2003. In: Proceedings of Diabruck, Wallerfangen, Germany, 2003. Conference paper (Refereed)
    Abstract [en]

    The aim of this paper is to develop a principled and empirically motivated approach to robust, negotiative spoken dialogue with databases. Robustness is achieved by limiting the set of representable utterance types. Still, the vast majority of utterances that occur in practice can be handled.

  • 12.
    Boye, Johan
    et al.
    TeliaSonera (R & D).
    Wirén, Mats
    TeliaSonera (R & D).
    Robust parsing and spoken negotiative dialogue with databases. 2008. In: Natural Language Engineering, ISSN 1351-3249, E-ISSN 1469-8110, Vol. 14, no 3, p. 289-312. Article in journal (Refereed)
    Abstract [en]

    This paper presents a robust parsing algorithm and semantic formalism for the interpretation of utterances in spoken negotiative dialogue with databases. The algorithm works in two passes: a domain-specific pattern-matching phase and a domain-independent semantic analysis phase. Robustness is achieved by limiting the set of representable utterance types to an empirically motivated subclass which is more expressive than propositional slot–value lists, but much less expressive than first-order logic. Our evaluation shows that in actual practice the vast majority of utterances that occur can be handled, and that the parsing algorithm is highly efficient and accurate.

  • 13.
    Boye, Johan
    et al.
    TeliaSonera (R & D).
    Wirén, Mats
    TeliaSonera (R & D).
    Robust Parsing of Utterances in Negotiative Dialogue. 2003. In: Proceedings 8th European Conference on Speech Communication and Technology (Eurospeech), Geneva, Switzerland, 2003. Conference paper (Refereed)
    Abstract [en]

    This paper presents an algorithm for domain-dependent parsing of utterances in negotiative dialogue. To represent such utterances, the algorithm outputs semantic expressions that are more expressive than propositional slot-filler structures. It is very fast and robust, yet precise and capable of correctly combining information from different utterance fragments.

  • 14.
    Boye, Johan
    et al.
    TeliaSonera (R & D).
    Wirén, Mats
    TeliaSonera (R & D).
    Gustafson, Joakim
    TeliaSonera (R & D).
    Contextual reasoning in multimodal dialogue systems: two case studies. 2004. In: Proceedings of The 8th Workshop on the Semantics and Pragmatics of Dialogue Catalogue'04, Barcelona, 2004, p. 19-21. Conference paper (Refereed)
    Abstract [en]

    This paper describes an approach to contextual reasoning for interpretation of spoken multimodal dialogue. The approach is based on combining recency-based search for antecedents with an object-oriented domain representation in such a way that the search is highly constrained by the type information of the antecedents. By furthermore representing candidate antecedents from the dialogue history and visual context in a uniform way, a single machinery (based on β-reduction in lambda calculus) can be used for resolving many kinds of underspecified utterances. The approach has been implemented in two highly different domains.

  • 15.
    Börstell, Carl
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Sign Language.
    Östling, Robert
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Iconic Locations in Swedish Sign Language: Mapping Form to Meaning with Lexical Databases. 2017. In: Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa / [ed] Jörg Tiedemann, Linköping: Linköping University Electronic Press, 2017, p. 221-225, article id 026. Conference paper (Refereed)
    Abstract [en]

    In this paper, we describe a method for mapping the phonological feature location of Swedish Sign Language (SSL) signs to the meanings in the Swedish semantic dictionary SALDO. By doing so, we observe clear differences in the distribution of meanings associated with different locations on the body. The prominence of certain locations for specific meanings clearly points to iconic mappings between form and meaning in the lexicon of SSL, which pinpoints modality-specific properties of the visual modality.

  • 16. Cap, Fabienne
    et al.
    Adesam, Yvonne
    Ahrenberg, Lars
    Borin, Lars
    Bouma, Gerlof
    Forsberg, Markus
    Kann, Viggo
    Östling, Robert
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Smith, Aaron
    Wirén, Mats
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Nivre, Joakim
    SWORD: Towards Cutting-Edge Swedish Word Processing. 2016. In: Proceedings of SLTC 2016, 2016. Conference paper (Refereed)
    Abstract [en]

    Despite many years of research on Swedish language technology, there is still no well-documented standard for Swedish word processing covering the whole spectrum from low-level tokenization to morphological analysis and disambiguation. SWORD is a new initiative within the SWE-CLARIN consortium aiming to develop documented standards for Swedish word processing. In this paper, we report on a pilot study of Swedish tokenization, where we compare the output of six different tokenizers on four different text types. For one text type (Wikipedia articles), we also compare to the tokenization produced by six manual annotators.

  • 17.
    Carter, David
    et al.
    SRI International.
    Rayner, Manny
    SRI International.
    Eklund, Robert
    TeliaSonera (R & D).
    Kaja, Jaan
    TeliaSonera (R & D).
    Lyberg, Bertil
    TeliaSonera (R & D).
    Sautermeister, Per
    TeliaSonera (R & D).
    Wirén, Mats
    TeliaSonera (R & D).
    Neumeyer, Leonardo
    SRI International.
    Weng, Fuliang
    SRI International.
    Common speech/language issues. 2000. In: The spoken language translator / [ed] Manny Rayner, David Carter, Pierrette Bouillon, Vassilis Digalakis, Mats Wirén, Cambridge: Cambridge University Press, 2000, p. 284-294. Chapter in book (Other academic)
  • 18.
    Carter, David
    et al.
    SRI International.
    Rayner, Manny
    SRI International.
    Eklund, Robert
    TeliaSonera (R & D).
    MacDermid, Catriona
    TeliaSonera (R & D).
    Wirén, Mats
    TeliaSonera (R & D).
    Evaluation. 2000. In: The spoken language translator / [ed] Manny Rayner, David Carter, Pierrette Bouillon, Vassilis Digalakis, Mats Wirén, Cambridge: Cambridge University Press, 2000, p. 297-312. Chapter in book (Other academic)
  • 19.
    Domeij, Rickard
    Stockholm University, Faculty of Humanities, Department of Linguistics.
    Datorstödd språkgranskning under skrivprocessen: svensk språkkontroll ur användarperspektiv. 2003. Doctoral thesis, comprehensive summary (Other academic)
    Abstract [translated from Swedish]

    Computer-aided language checking, covering spelling, punctuation, word choice and grammar, is one of the largest application areas of language technology. Such checking functions have been available in word processors for English for more than a decade. The development of corresponding technology for Swedish lags behind, but for some years now it has also been possible to check Swedish text in a word processor. Unfortunately, language technology solutions cannot be transferred unchanged from English to Swedish, partly because Swedish is morphologically and morphosyntactically richer. This is evident, for example, in the more intricate agreement system of Swedish and its unlimited capacity for forming new compound words. Special analysis methods are therefore required to prevent compounds from being flagged as suspected misspellings in a Swedish spelling checker. Checking Swedish grammar additionally requires techniques for detecting and correcting agreement errors and erroneously split compounds, which have become increasingly common, partly because many people write with word processors and are influenced by foreign languages, English in particular. The thesis describes how effective and adequate language checking functions for Swedish were developed during the 1990s in various projects at Nada, KTH in Stockholm. There are, however, unavoidable limitations in the underlying technology, which manifest themselves as missed errors, misinterpretations and false alarms. This limited, fragmentary and not always reliable computer checking can be assumed to create problems in the interaction between computer and writer, which raises questions about the usability of the technology and its effects on language and language ability. The thesis presents two user studies that aim to investigate these problems against the background of cognitive writing research. A combination of text analysis and think-aloud methodology is used to study how the writer and the writing are affected by computer-aided language checking. The results show, among other things, that a language checking program could help writers discover, define and correct linguistic problems that they themselves had overlooked for various reasons. However, the program sometimes did not provide sufficiently instructive help for a writer to be able to correct the problem. A writer could also be misled into making an incorrect or less fortunate change in the case of false alarms or incorrect diagnoses. The results further suggest that the cognitive relief offered by the program can lead the writer to resolve a problem in a correct but less fortunate way, because she does not devote the same mental effort to the problem as when working manually. In light of these results, the thesis discusses how the usability of existing language checking programs can be improved.

  • 20.
    Dziadek, Juliusz
    et al.
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Henriksson, Aron
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Duneld, Martin
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Improving Terminology Mapping in Clinical Text with Context-Sensitive Spelling Correction. 2017. In: Studies in Health Technology and Informatics, ISSN 0926-9630, E-ISSN 1879-8365, Vol. 235, p. 241-245. Article in journal (Refereed)
    Abstract [en]

    The mapping of unstructured clinical text to an ontology facilitates meaningful secondary use of health records but is non-trivial due to lexical variation and the abundance of misspellings in hurriedly produced notes. Here, we apply several spelling correction methods to Swedish medical text and evaluate their impact on SNOMED CT mapping; first in a controlled evaluation using medical literature text with induced errors, followed by a partial evaluation on clinical notes. It is shown that the best-performing method is context-sensitive, taking into account trigram frequencies and utilizing a corpus-based dictionary.

  • 21.
    Ek, Adam
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Blending Words or: How I Learned to Stop Worrying and Love the Blendguage: A computational study of lexical blending in Swedish. 2018. Independent thesis Advanced level (degree of Master (One Year)), 10 credits / 15 HE credits. Student thesis
    Abstract [en]

    This thesis investigates Swedish lexical blends. A lexical blend is defined as the concatenation of two words, where at least one word has been reduced. Lexical blends are approached from two perspectives. First, the thesis investigates lexical blends as they appear in the Swedish language. It is found that there is a significant statistical relationship between the two source words in terms of orthographic, phonemic and syllabic length and frequency in a reference corpus. Furthermore, some uncommon lexical blends created from pronouns and interjections are described. Lexical blends are also described in terms of their semantic construction and their similarity to other word formation processes. Secondly, the thesis develops a model which predicts source words of lexical blends. To predict the source words a logistic regression model is used. The evaluation shows that using a ranking approach, the correct source words are the highest ranking word pair in 32.2% of the cases. In the top 10 ranking word pairs, the correct word pair is found in 60.6% of the cases. The results are lower than in previous studies, but the number of blends used is also smaller. It is shown that lexical blends which overlap are easier to predict than lexical blends which do not overlap. Using feature ablation, it is shown that semantic and frequency-related features are the most important for the prediction of source words.

  • 22.
    Ek, Adam
    Stockholm University, Faculty of Humanities, Department of Linguistics.
    Extracting social networks from fiction: Imaginary and invisible friends: Investigating the social world of imaginary friends. 2017. Independent thesis Basic level (degree of Bachelor), 10 credits / 15 HE credits. Student thesis
    Abstract [en]

    This thesis develops an approach to extract the social relations between characters in literary text to create a social network. The approach uses co-occurrences of named entities, keywords associated with the named entities, and the dependency relations that exist between the named entities to construct the network. Literary texts contain a large number of pronouns that represent the named entities. To resolve the antecedents of pronouns, a pronoun resolution system is implemented based on a standard pronoun resolution algorithm. The results indicate that the pronoun resolution system finds the correct named entity in 60.4% of all cases. The social network is evaluated by comparing character importance rankings based on graph properties with independently human-generated importance rankings. The generated social networks correlate moderately to strongly with the independent character ranking.

  • 23.
    Ek, Adam
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Wirén, Mats
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Östling, Robert
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Nilsson Björkenstam, Kristina
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Grigonytė, Gintarė
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Gustafson Capková, Sofia
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Identifying Speakers and Addressees in Dialogues Extracted from Literary Fiction. 2018. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018) / [ed] Nicoletta Calzolari, Khalid Choukri, Christopher Cieri, Thierry Declerck, Koiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis, Takenobu Tokunaga, European Language Resources Association, 2018, p. 817-824. Conference paper (Refereed)
    Abstract [en]

    This paper describes an approach to identifying speakers and addressees in dialogues extracted from literary fiction, along with a dataset annotated for speaker and addressee. The overall purpose of this is to provide annotation of dialogue interaction between characters in literary corpora in order to allow for enriched search facilities and construction of social networks from the corpora. To predict speakers and addressees in a dialogue, we use a sequence labeling approach applied to a given set of characters. We use features relating to the current dialogue, the preceding narrative, and the complete preceding context. The results indicate that even with a small amount of training data, it is possible to build a fairly accurate classifier for speaker and addressee identification across different authors, though the identification of addressees is the more difficult task.

  • 24.
    Eklund, Robert
    et al.
    TeliaSonera (R & D).
    Wirén, Mats
    TeliaSonera (R & D).
    ”Njutandes av en Monte Christo no 5 och en iskall Mojito”: Observationer om användning av s-particip. 2006. In: Svenskans beskrivning 28, Förhandlingar vid Tjugoåttonde sammankomsten för svenskans beskrivning, 2006, p. 97-108. Conference paper (Refereed)
  • 25.
    Enqvist, Sebastian
    et al.
    Stockholm University, Faculty of Humanities, Department of Philosophy. University of Amsterdam, The Netherlands.
    Seifan, Fatemeh
    Venema, Yde
    An expressive completeness theorem for coalgebraic modal mu-calculi. 2017. In: Logical Methods in Computer Science, ISSN 1860-5974, E-ISSN 1860-5974, Vol. 13, no 2, article id 14. Article in journal (Refereed)
    Abstract [en]

    Generalizing standard monadic second-order logic for Kripke models, we introduce monadic second-order logic interpreted over coalgebras for an arbitrary set functor. We then consider invariance under behavioral equivalence of MSO-formulas. More specifically, we investigate whether the coalgebraic mu-calculus is the bisimulation-invariant fragment of the monadic second-order language for a given functor. Using automata theoretic techniques and building on recent results by the third author, we show that in order to provide such a characterization result it suffices to find what we call an adequate uniform construction for the coalgebraic type functor. As direct applications of this result we obtain a partly new proof of the Janin-Walukiewicz Theorem for the modal mu-calculus, avoiding the use of syntactic normal forms, and bisimulation invariance results for the bag functor (graded modal logic) and all exponential polynomial functors (including the game functor). As a more involved application, involving additional non-trivial ideas, we also derive a characterization theorem for the monotone modal mu-calculus, with respect to a natural monadic second-order language for monotone neighborhood models.

  • 26.
    Eriksson, Anders
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics. Fonetik.
    Lacerda, Francisco
    Stockholm University, Faculty of Humanities, Department of Linguistics. Fonetik.
    Charlatanry in forensic speech science: A problem to be taken seriously. 2007. In: International Journal of Speech, Language and the Law: (formerly Forensic Linguistics: ISSN 1350-1771), ISSN 1748-8885, Vol. 14, no 2, p. 169-193. Article in journal (Refereed)
    Abstract [en]

    A lie detector which can reveal lie and deception in some automatic and perfectly reliable way is an old idea we have often met with in science fiction books and comic strips. This is all very well. It is when machines claimed to be lie detectors appear in the context of criminal investigations or security applications that we need to be concerned. In the present paper we will describe two types of ‘deception’ or ‘stress detectors’ (euphemisms to refer to what quite clearly are known as ‘lie detectors’). Both types of detection are claimed to be based on voice analysis but we found no scientific evidence to support the manufacturers’ claims. Indeed, our review of scientific studies will show that these machines perform at chance level when tested for reliability. Given such results and the absence of scientific support for the underlying principles it is justified to view the use of these machines as charlatanry and we argue that there are serious ethical and security reasons to demand that responsible authorities and institutions should not get involved in such practices.

  • 27. Eyben, Florian
    et al.
    Salomão, Gláucia Laís
    Stockholm University, Faculty of Humanities, Department of Linguistics, Phonetics. KTH (Royal Institute of Technology), Sweden.
    Sundberg, Johan
    Scherer, Klaus R.
    Schuller, Björn W.
    Emotion in the singing voice—a deeper look at acoustic features in the light of automatic classification. 2015. In: EURASIP Journal on Audio, Speech, and Music Processing, ISSN 1687-4714, E-ISSN 1687-4722, article id 19. Article in journal (Refereed)
    Abstract [en]

    We investigate the automatic recognition of emotions in the singing voice and study the worth and role of a variety of relevant acoustic parameters. The data set contains phrases and vocalises sung by eight renowned professional opera singers in ten different emotions and a neutral state. The states are mapped to ternary arousal and valence labels. We propose a small set of relevant acoustic features based on our previous findings on the same data and compare it with a large-scale state-of-the-art feature set for paralinguistics recognition, the baseline feature set of the Interspeech 2013 Computational Paralinguistics ChallengE (ComParE). A feature importance analysis with respect to classification accuracy and correlation of features with the targets is provided in the paper. Results show that the classification performance with both feature sets is similar for arousal, while the ComParE set is superior for valence. Intra-singer feature ranking criteria further improve the classification accuracy significantly in a leave-one-singer-out cross-validation.

  • 28.
    Glant, Oliver
    Stockholm University, Faculty of Humanities, Department of Linguistics.
    Attitydanalys av svenska produktomdömen – behövs språkspecifika verktyg? 2018. Independent thesis Basic level (degree of Bachelor), 10 credits / 15 HE credits. Student thesis
    Abstract [en]

    Sentiment analysis of Swedish data is often performed using English tools and machine translation. This thesis compares using a neural network trained on Swedish data with a corresponding one trained on English data. Two datasets were used: approximately 200,000 non-neutral Swedish reviews from the company Prisjakt Sverige AB, one of the largest annotated datasets used for Swedish sentiment analysis, and 1,000,000 non-neutral English reviews from Amazon.com. Both networks were evaluated on 11,638 randomly selected reviews, in Swedish and in English machine translation. The test set had the same overrepresentation of positive reviews as the Swedish dataset (84% were positive). The results suggest that English tools can be used with machine translation for sentiment analysis of Swedish reviews, without loss of classification ability. However, the English tool required 33% more training data to achieve maximum performance. Evaluation on the unbalanced test set required extra consideration regarding statistical measures. F1-measure turned out to be reliable only when calculated for the underrepresented class. It then showed a strong correlation with the Matthews correlation coefficient, which has been found to be more reliable. This warrants further investigation into whether the correlation is valid for all different balances, which would simplify comparison between studies.

  • 29.
    Grigonyte, Gintare
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Baldwin, Timothy
    Automatic Detection of Multilingual Dictionaries on the Web. 2014. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Short Papers), Association for Computational Linguistics, 2014, p. 93-98. Conference paper (Refereed)
  • 30.
    Grigonyte, Gintare
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Clematide, Simon
    Institute of Computational Linguistics, University of Zurich, Switzerland.
    Rinaldi, Fabio
    Institute of Computational Linguistics, University of Zurich, Switzerland.
    How preferred are preferred terms? 2013. In: eLex 2013 / [ed] Kosem, I., Kallas, J., Gantar, P., Krek, S., Langemets, M., Tuulik, M., 2013, p. 452-459. Conference paper (Refereed)
  • 31.
    Grigonyte, Gintare
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Hammarberg, Björn
    Stockholm University, Faculty of Humanities, Department of Linguistics, General Linguistics.
    Pronunciation and Spelling: the Case of Misspellings in Swedish L2 Written Essays. 2014. In: Human Language Technologies - The Baltic Perspective, Baltic HLT 2014 / [ed] Andrius Utka, Gintarė Grigonytė, Jurgita Kapočiūtė-Dzikienė, Jurgita Vaičenonienė, Amsterdam: IOS Press, 2014, p. 95-98. Conference paper (Refereed)
    Abstract [en]

    This research presents an investigation performed on the ASU corpus. We analyse to what extent the pronunciation of intended words is reflected in spelling errors made by L2 Swedish learners. We also propose a method that helps to automatically discriminate the misspellings affected by pronunciation from other types of misspellings.

  • 32.
    Grigonyté, Gintaré
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Kvist, Maria
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences. Karolinska Institutet, Sweden.
    Velupillai, Sumithra
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Wirén, Mats
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Improving Readability of Swedish Electronic Health Records through Lexical Simplification: First Results. 2014. In: Proceedings of the 3rd Workshop on Predicting and Improving Text Readability for Target Reader Populations (PITR), Stroudsburg, USA: Association for Computational Linguistics, 2014, p. 74-83. Conference paper (Refereed)
    Abstract [en]

    This paper describes part of an ongoing effort to improve the readability of Swedish electronic health records (EHRs). An EHR contains systematic documentation of a single patient’s medical history across time, entered by healthcare professionals with the purpose of enabling safe and informed care. Linguistically, medical records exemplify a highly specialised domain, which can be superficially characterised as having telegraphic sentences involving displaced or missing words, abundant abbreviations, spelling variations including misspellings, and terminology. We report results on lexical simplification of Swedish EHRs, by which we mean detecting the unknown, out-of-dictionary words and trying to resolve them either as compounded known words, abbreviations or misspellings.

  • 33.
    Grigonyté, Gintaré
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Nilsson Björkenstam, Kristina
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Language-independent exploration of repetition and variation in longitudinal child-directed speech: A tool and resources. 2016. In: Proceedings of the joint workshop on NLP for Computer Assisted Language Learning and NLP for Language Acquisition at SLTC, Umeå, 16th November 2016 / [ed] Elena Volodina, Gintarė Grigonytė, Ildikó Pilán, Kristina Nilsson Björkenstam, Lars Borin, Linköping: Linköping University Electronic Press, 2016, p. 41-50. Conference paper (Refereed)
    Abstract [en]

    We present a language-independent tool, called Varseta, for extracting variation sets in child-directed speech. This tool is evaluated against a gold standard corpus annotated with variation sets, MINGLE-3-VS, and used to explore variation sets in 26 languages in CHILDES-26-VS, a comparable corpus derived from the CHILDES database. The tool and the resources are freely available for research.

  • 34.
    Gustafson, Joakim
    et al.
    KTH Speech, Music and Hearing.
    Bell, Linda
    KTH Speech, Music and Hearing.
    Beskow, Jonas
    KTH Speech, Music and Hearing.
    Boye, Johan
    TeliaSonera (R & D).
    Carlson, Rolf
    KTH Speech, Music and Hearing.
    Edlund, Jens
    KTH Speech, Music and Hearing.
    Granström, Björn
    KTH Speech, Music and Hearing.
    House, David
    KTH Speech, Music and Hearing.
    Wirén, Mats
    TeliaSonera (R & D).
    AdApt — A Multimodal Conversational Dialogue System in an Apartment Domain. 2000. In: Proceedings of the Sixth International Conference on Spoken Language Processing (ICSLP), Beijing, China, 2000, p. 134-137. Conference paper (Refereed)
  • 35.
    Gustafson, Joakim
    et al.
    TeliaSonera (R & D), KTH Speech, Music and Hearing.
    Bell, Linda
    TeliaSonera (R & D), KTH Speech, Music and Hearing.
    Boye, Johan
    TeliaSonera (R & D).
    Edlund, Jens
    KTH Speech, Music and Hearing.
    Wirén, Mats
    TeliaSonera (R & D).
    Constraint Manipulation and Visualization in a Multimodal Dialogue System. 2002. In: Proceedings of the ISCA Workshop on Multimodal Dialogue in Mobile Environments, Kloster Irsee, Germany, 2002. Conference paper (Refereed)
    Abstract [en]

    When interacting with spoken and multimodal dialogue systems, it is often difficult for users to understand and influence how their input is processed by the system. In this paper, we describe how these problems were addressed in the multimodal real-estate dialogue system AdApt. During the course of a dialogue, the user's constraints are translated into symbolic icons that are visualized on the screen and can be manipulated by drag-and-drop operations. Users are thus given a clear picture of how their utterances are understood, and are given a transparent means of controlling the interaction with the system.

  • 36.
    Gustafson, Joakim
    et al.
    TeliaSonera (R & D).
    Bell, Linda
    TeliaSonera (R & D).
    Boye, Johan
    TeliaSonera (R & D).
    Lindström, Anders
    TeliaSonera (R & D).
    Wirén, Mats
    TeliaSonera (R & D).
    The NICE fairy-tale game system. 2004. In: Proceedings of the 5th SIGdial Workshop on Discourse and Dialogue at HLT-NAACL 2004, Boston, 2004. Conference paper (Refereed)
    Abstract [en]

    This paper presents the NICE fairy-tale game system, in which adults and children can interact with various animated characters in a 3D world. Computer games are an interesting application for spoken and multimodal dialogue systems. Moreover, for the development of future computer games, multimodal dialogue has the potential to greatly enrich the user's experience. In this paper, we also present some requirements that have to be fulfilled to successfully integrate spoken dialogue technology with a computer game application.

  • 37.
    Gustavsson, Lisa
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Phonetics.
    Marklund, Ellen
    Stockholm University, Faculty of Humanities, Department of Linguistics, Phonetics.
    Klintfors, Eeva
    Stockholm University, Faculty of Humanities, Department of Linguistics, Phonetics.
    Lacerda, Francisco
    Stockholm University, Faculty of Humanities, Department of Linguistics, Phonetics.
    Directional hearing in a humanoid robot: Evaluation of microphones regarding HRTF and azimuthal dependence. 2006. In: Proceedings from Fonetik 2006 / [ed] Gilbert Ambrazaitis and Susanne Schötz, Lund: Department of Linguistics and Phonetics, Centre for Languages and Literature, Lund University, 2006, p. 45-49. Conference paper (Other academic)
  • 38.
    Hammarberg, Björn
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, General Linguistics.
    Grigonyté, Gintaré
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Non-Native Writers’ Errors – a Challenge to a Spell-Checker. 2014. In: 1st Nordic workshop on evaluation of spellchecking and proofing tools (NorWEST2014), 2014, p. 3. Conference paper (Refereed)
    Abstract [en]

    Spell checkers are widely used and, if they do their job properly, are also highly useful. Usually they are built on the assumption that the text to be corrected is written by a mature native speaker. However, non-native speakers are in even greater need of spell checkers than native speakers. Current spell checkers nevertheless do not take the linguistic problems of learners into account, and thus they are poor at identifying errors and supplying adequate corrections. There are a number of linguistic complexities specific to non-native learners that a spell checker would need to handle in order to be successful.

  • 39.
    Henriksson, Aron
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Ensembles of Semantic Spaces: On Combining Models of Distributional Semantics with Applications in Healthcare. 2015. Doctoral thesis, comprehensive summary (Other academic)
    Abstract [en]

    Distributional semantics allows models of linguistic meaning to be derived from observations of language use in large amounts of text. By modeling the meaning of words in semantic (vector) space on the basis of co-occurrence information, distributional semantics permits a quantitative interpretation of (relative) word meaning in an unsupervised setting, i.e., human annotations are not required. The ability to obtain inexpensive word representations in this manner helps to alleviate the bottleneck of fully supervised approaches to natural language processing, especially since models of distributional semantics are data-driven and hence agnostic to both language and domain.

    All that is required to obtain distributed word representations is a sizeable corpus; however, the composition of the semantic space is not only affected by the underlying data but also by certain model hyperparameters. While these can be optimized for a specific downstream task, there are currently limitations to the extent the many aspects of semantics can be captured in a single model. This dissertation investigates the possibility of capturing multiple aspects of lexical semantics by adopting the ensemble methodology within a distributional semantic framework to create ensembles of semantic spaces. To that end, various strategies for creating the constituent semantic spaces, as well as for combining them, are explored in a number of studies.

    The notion of semantic space ensembles is generalizable across languages and domains; however, the use of unsupervised methods is particularly valuable in low-resource settings, in particular when annotated corpora are scarce, as in the domain of Swedish healthcare. The semantic space ensembles are here empirically evaluated for tasks that have promising applications in healthcare. It is shown that semantic space ensembles – created by exploiting various corpora and data types, as well as by adjusting model hyperparameters such as the size of the context window and the strategy for handling word order within the context window – are able to outperform the use of any single constituent model on a range of tasks. The semantic space ensembles are used both directly for k-nearest neighbors retrieval and for semi-supervised machine learning. Applying semantic space ensembles to important medical problems facilitates the secondary use of healthcare data, which, despite its abundance and transformative potential, is grossly underutilized.

  • 40.
    Henriksson, Aron
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Learning multiple distributed prototypes of semantic categories for named entity recognition. 2015. In: International Journal of Data Mining and Bioinformatics, ISSN 1748-5681, Vol. 13, no 4, p. 395-411. Article in journal (Refereed)
    Abstract [en]

    The scarcity of large labelled datasets comprising clinical text that can be exploited within the paradigm of supervised machine learning creates barriers for the secondary use of data from electronic health records. It is therefore important to develop capabilities to leverage the large amounts of unlabelled data that, indeed, tend to be readily available. One technique utilises distributional semantics to create word representations in a wholly unsupervised manner and uses existing training data to learn prototypical representations of predefined semantic categories. Features describing whether a given word belongs to a certain category are then provided to the learning algorithm. It has been shown that using multiple distributional semantic models, each employing a different word order strategy, can lead to enhanced predictive performance. Here, another hyperparameter is also varied – the size of the context window – and an experimental investigation shows that this leads to further performance gains.

  • 41.
    Henriksson, Aron
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Semantic Spaces of Clinical Text: Leveraging Distributional Semantics for Natural Language Processing of Electronic Health Records. 2013. Licentiate thesis, comprehensive summary (Other academic)
    Abstract [en]

    The large amounts of clinical data generated by electronic health record systems are an underutilized resource, which, if tapped, has enormous potential to improve health care. Since the majority of this data is in the form of unstructured text, which is challenging to analyze computationally, there is a need for sophisticated clinical language processing methods. Unsupervised methods that exploit statistical properties of the data are particularly valuable due to the limited availability of annotated corpora in the clinical domain.

    Information extraction and natural language processing systems need to incorporate some knowledge of semantics. One approach exploits the distributional properties of language – more specifically, term co-occurrence information – to model the relative meaning of terms in high-dimensional vector space. Such methods have been used with success in a number of general language processing tasks; however, their application in the clinical domain has previously only been explored to a limited extent. By applying models of distributional semantics to clinical text, semantic spaces can be constructed in a completely unsupervised fashion. Semantic spaces of clinical text can then be utilized in a number of medically relevant applications.

    The application of distributional semantics in the clinical domain is here demonstrated in three use cases: (1) synonym extraction of medical terms, (2) assignment of diagnosis codes and (3) identification of adverse drug reactions. To apply distributional semantics effectively to a wide range of both general and, in particular, clinical language processing tasks, certain limitations or challenges need to be addressed, such as how to model the meaning of multiword terms and account for the function of negation: a simple means of incorporating paraphrasing and negation in a distributional semantic framework is here proposed and evaluated. The notion of ensembles of semantic spaces is also introduced; these are shown to outperform the use of a single semantic space on the synonym extraction task. This idea allows different models of distributional semantics, with different parameter configurations and induced from different corpora, to be combined. This is not least important in the clinical domain, as it allows potentially limited amounts of clinical data to be supplemented with data from other, more readily available sources. The importance of configuring the dimensionality of semantic spaces, particularly when – as is typically the case in the clinical domain – the vocabulary grows large, is also demonstrated.

  • 42.
    Henriksson, Aron
    et al.
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Kvist, Maria
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences. Karolinska Institutet, Sweden.
    Dalianis, Hercules
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Prevalence Estimation of Protected Health Information in Swedish Clinical Text. 2017. In: Studies in Health Technology and Informatics, ISSN 0926-9630, E-ISSN 1879-8365, Vol. 235, p. 216-220. Article in journal (Refereed)
    Abstract [en]

    Obscuring protected health information (PHI) in the clinical text of health records facilitates the secondary use of healthcare data in a privacy-preserving manner. Although automatic de-identification of clinical text using machine learning holds much promise, little is known about the relative prevalence of PHI in different types of clinical text and whether there is a need for domain adaptation when learning predictive models from one particular domain and applying it to another. In this study, we address these questions by training a predictive model and using it to estimate the prevalence of PHI in clinical text written (1) in different clinical specialties, (2) in different types of notes (i.e., under different headings), and (3) by persons in different professional roles. It is demonstrated that the overall PHI density is 1.57%; however, substantial differences exist across domains.

  • 43.
    Henriksson, Aron
    et al.
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Kvist, Maria
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences. Karolinska Institutet, Sweden.
    Dalianis, Hercules
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Duneld, Martin
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Identifying adverse drug event information in clinical notes with distributional semantic representations of context. 2015. In: Journal of Biomedical Informatics, ISSN 1532-0464, E-ISSN 1532-0480, Vol. 57, p. 333-349. Article in journal (Refereed)
    Abstract [en]

    For the purpose of post-marketing drug safety surveillance, which has traditionally relied on the voluntary reporting of individual cases of adverse drug events (ADEs), other sources of information are now being explored, including electronic health records (EHRs), which give us access to enormous amounts of longitudinal observations of the treatment of patients and their drug use. Adverse drug events, which can be encoded in EHRs with certain diagnosis codes, are, however, heavily underreported. It is therefore important to develop capabilities to process, by means of computational methods, the more unstructured EHR data in the form of clinical notes, where clinicians may describe and reason around suspected ADEs. In this study, we report on the creation of an annotated corpus of Swedish health records for the purpose of learning to identify information pertaining to ADEs present in clinical notes. To this end, three key tasks are tackled: recognizing relevant named entities (disorders, symptoms, drugs), labeling attributes of the recognized entities (negation, speculation, temporality), and relationships between them (indication, adverse drug event). For each of the three tasks, leveraging models of distributional semantics – i.e., unsupervised methods that exploit co-occurrence information to model, typically in vector space, the meaning of words – and, in particular, combinations of such models, is shown to improve the predictive performance. The ability to make use of such unsupervised methods is critical when faced with large amounts of sparse and high-dimensional data, especially in domains where annotated resources are scarce.

  • 44.
    Henriksson, Aron
    et al.
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Zhao, Jing
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Boström, Henrik
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Dalianis, Hercules
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Modeling Electronic Health Records in Ensembles of Semantic Spaces for Adverse Drug Event Detection. 2015. In: 2015 IEEE International Conference on Bioinformatics and Biomedicine: Proceedings / [ed] Jun (Luke) Huan et al., IEEE Computer Society, 2015, p. 343-350. Conference paper (Refereed)
    Abstract [en]

    Electronic health records (EHRs) are emerging as a potentially valuable source for pharmacovigilance; however, adverse drug events (ADEs), which can be encoded in EHRs by a set of diagnosis codes, are heavily underreported. Alerting systems, able to detect potential ADEs on the basis of patient-specific EHR data, would help to mitigate this problem. To that end, the use of machine learning has proven to be both efficient and effective; however, challenges remain in representing the heterogeneous EHR data, which moreover tends to be high-dimensional and exceedingly sparse, in a manner conducive to learning high-performing predictive models. Prior work has shown that distributional semantics – that is, natural language processing methods that, traditionally, model the meaning of words in semantic (vector) space on the basis of co-occurrence information – can be exploited to create effective representations of sequential EHR data, not only free-text in clinical notes but also various clinical events such as diagnoses, drugs and measurements. When modeling data in semantic space, an important design decision concerns the size of the context window around an object of interest, which governs the scope of co-occurrence information that is taken into account and affects the composition of the resulting semantic space. Here, we report on experiments conducted on 27 clinical datasets, demonstrating that performance can be significantly improved by modeling EHR data in ensembles of semantic spaces, consisting of multiple semantic spaces built with different context window sizes. A follow-up investigation is conducted to study the impact on predictive performance as increasingly more semantic spaces are included in the ensemble, demonstrating that accuracy tends to improve with the number of semantic spaces, albeit not monotonically so. Finally, a number of different strategies for combining the semantic spaces are explored, demonstrating the advantage of early (feature) fusion over late (classifier) fusion. Ensembles of semantic spaces allow multiple views of (sparse) data to be captured (densely) and thereby enable improved performance to be obtained on the task of detecting ADEs in EHRs.

  • 45.
    Henriksson, Aron
    et al.
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Zhao, Jing
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Boström, Henrik
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Dalianis, Hercules
    Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
    Modeling Heterogeneous Clinical Sequence Data in Semantic Space for Adverse Drug Event Detection. 2015. In: 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA) / [ed] Eric Gaussier, Longbing Cao, Patrick Gallinari, James Kwok, Gabriella Pasi, Osmar Zaiane, Institute of Electrical and Electronics Engineers (IEEE), 2015. Conference paper (Refereed)
    Abstract [en]

    The enormous amounts of data that are continuously recorded in electronic health record systems offer ample opportunities for data science applications to improve healthcare. There are, however, challenges involved in using such data for machine learning, such as high dimensionality and sparsity, as well as an inherent heterogeneity that does not allow the distinct types of clinical data to be treated in an identical manner. On the other hand, there are also similarities across data types that may be exploited, e.g., the possibility of representing some of them as sequences. Here, we apply the notions underlying distributional semantics, i.e., methods that model the meaning of words in semantic (vector) space on the basis of co-occurrence information, to four distinct types of clinical data: free-text notes, on the one hand, and clinical events, in the form of diagnosis codes, drug codes and measurements, on the other hand. Each semantic space contains continuous vector representations for every unique word and event, which can then be used to create representations of, e.g., care episodes that, in turn, can be exploited by the learning algorithm. This approach does not only reduce sparsity, but also takes into account, and explicitly models, similarities between various items, and it does so in an entirely data-driven fashion. Here, we report on a series of experiments using the random forest learning algorithm that demonstrate the effectiveness, in terms of accuracy and area under ROC curve, of the proposed representation form over the commonly used bag-of-items counterpart. The experiments are conducted on 27 real datasets that each involves the (binary) classification task of detecting a particular adverse drug event. It is also shown that combining structured and unstructured data leads to significant improvements over using only one of them.

  • 46.
    Hjelm, Hans
    Stockholm University, Faculty of Humanities, Department of Linguistics.
    Cross-language Ontology Learning: Incorporating and Exploiting Cross-language Data in the Ontology Learning Process. 2009. Doctoral thesis, monograph (Other academic)
    Abstract [en]

    An ontology is a knowledge-representation structure, where words, terms or concepts are defined by their mutual hierarchical relations. Ontologies are becoming ever more prevalent in the world of natural language processing, where we currently see a tendency towards using semantics for solving a variety of tasks, particularly tasks related to information access. Ontologies, taxonomies and thesauri (all related notions) are also used in various variants by humans, to standardize business transactions or for finding conceptual relations between terms in, e.g., the medical domain.

    The acquisition of machine-readable, domain-specific semantic knowledge is time consuming and prone to inconsistencies. The field of ontology learning therefore provides tools for automating the construction of domain ontologies (ontologies describing the entities and relations within a particular field of interest), by analyzing large quantities of domain-specific texts.

    This thesis studies three main topics within the field of ontology learning. First, we examine which sources of information are useful within an ontology learning system and how the information sources can be combined effectively. Second, we do this with a special focus on cross-language text collections, to see if we can learn more from studying several languages at once than we can from a single-language text collection. Finally, we investigate new approaches to formal and automatic evaluation of the quality of a learned ontology.

    We demonstrate how to combine information sources from different languages and use them to train automatic classifiers to recognize lexico-semantic relations. The cross-language data is shown to have a positive effect on the quality of the learned ontologies. We also give theoretical and experimental results showing that our ontology evaluation method is a good complement to, and in some aspects improves on, the evaluation measures in use today.

  • 47.
    Hjelm, Hans
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Extraction of Cross Language Term Correspondences. 2006. In: Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC'2006), 2006. Conference paper (Refereed)
  • 48.
    Hjelm, Hans
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Identifying Cross Language Term Equivalents Using Statistical Machine Translation and Distributional Association Measures. 2007. In: Proceedings of Nodalida 2007, the 16th Nordic Conference of Computational Linguistics / [ed] Nivre, Heiki-Jaan Kaalep, Kadri Muischnek and Mare Koit, 2007. Conference paper (Refereed)
    Abstract [en]

    This article presents a comparison of the accuracy of a number of different approaches for identifying cross language term equivalents (translations). The methods investigated are, on the one hand, associative measures commonly used in word-space models or in Information Retrieval and, on the other hand, a Statistical Machine Translation (SMT) approach. I have performed tests on six language pairs, using the JRC-Acquis parallel corpus as training material and Eurovoc as a gold standard. The SMT approach is shown to be more effective than the associative measures. The best results are achieved by taking a weighted average of the scores of the SMT approach and disparate associative measures.

  • 49.
    Hjelm, Hans
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Buitelaar, Paul
    Multilingual Evidence Improves Clustering-based Taxonomy Extraction. 2008. In: Proceedings of the 18th European Conference on Artificial Intelligence (ECAI 2008), 2008. Conference paper (Refereed)
  • 50.
    Hjelm, Hans
    et al.
    Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
    Schwarz, Christoph
    LiSa - Morphological Analysis for Information Retrieval. 2006. In: Proceedings of the 15th NODALIDA conference, Joensuu 2005, 2006. Conference paper (Refereed)