Generating Features for Named Entity Recognition by Learning Prototypes in Semantic Space: The Case of De-Identifying Health Records
Henriksson, Aron; Dalianis, Hercules; Kowalski, Stewart
Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
2014 (English). In: 2014 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), IEEE conference proceedings, 2014, pp. 450-457. Conference paper, published paper (refereed).
Abstract [en]

Creating sufficiently large annotated resources for supervised machine learning, and doing so for every problem and every domain, is prohibitively expensive. Techniques that leverage large amounts of unlabeled data, which are often readily available, may decrease the amount of data that needs to be annotated to obtain a certain level of performance, as well as improve performance when large annotated resources are indeed available. Here, the development of one such method is presented, where semantic features are generated by exploiting the available annotations to learn prototypical (vector) representations of each named entity class in semantic space, constructed by employing a model of distributional semantics (random indexing) over a large, unannotated, in-domain corpus. Binary features that describe whether a given word belongs to a specific named entity class are provided to the learning algorithm; the feature values are determined by calculating the (cosine) distance in semantic space to each of the learned prototype vectors and ascertaining whether they are below or above a given threshold, set to optimize Fβ-score. The proposed method is evaluated empirically in a series of experiments, where the case is health-record deidentification, a task that involves identifying protected health information (PHI) in text. It is shown that a conditional random fields model with access to the generated semantic features, in addition to a set of orthographic and syntactic features, significantly outperforms, in terms of F1-score, a baseline model without access to the semantic features. Moreover, the quality of the features is further improved by employing a number of slightly different models of distributional semantics in an ensemble. Finally, the way in which the features are generated allows one to optimize them for various Fβ-scores, giving some degree of control to trade off precision and recall. Methods that are able to improve performance on named entity recognition tasks by exploiting large amounts of unlabeled data may substantially reduce costs involved in creating annotated resources for every domain and every problem.
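The feature-generation pipeline described in the abstract can be sketched in a few lines of Python. This is an illustrative reconstruction, not the authors' implementation: the toy hyperparameters (vector dimensionality, number of non-zero elements, window size) and the fixed similarity threshold are assumptions, whereas the paper tunes the threshold per class to optimize Fβ-score.

# Illustrative sketch (not the authors' code): random indexing over an
# unannotated corpus, one prototype vector per named entity class, and
# thresholded binary features based on cosine similarity.
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
DIM, NONZERO, WINDOW = 300, 4, 2   # assumed toy hyperparameters

def index_vector():
    # Sparse ternary index vector, as used in random indexing.
    v = np.zeros(DIM)
    positions = rng.choice(DIM, NONZERO, replace=False)
    v[positions] = rng.choice([-1.0, 1.0], NONZERO)
    return v

def build_semantic_space(sentences):
    # Each word's context vector accumulates the index vectors of the
    # words co-occurring with it within a symmetric window.
    index = defaultdict(index_vector)
    context = defaultdict(lambda: np.zeros(DIM))
    for sent in sentences:
        for i, word in enumerate(sent):
            left, right = max(0, i - WINDOW), min(len(sent), i + WINDOW + 1)
            for j in range(left, right):
                if j != i:
                    context[word] += index[sent[j]]
    return context

def cosine(a, b):
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b) / norm if norm else 0.0

def learn_prototypes(space, annotated):
    # One prototype per class: the mean semantic vector of all words
    # annotated with that class.
    return {cls: np.mean([space[w] for w in words if w in space], axis=0)
            for cls, words in annotated.items()}

def semantic_features(word, prototypes, space, threshold=0.3):
    # Binary feature per class: 1 if the cosine similarity to the class
    # prototype exceeds the threshold (fixed here for brevity; tuned to
    # optimize F-beta in the paper).
    vec = space.get(word)
    if vec is None:
        return {cls: 0 for cls in prototypes}
    return {cls: int(cosine(vec, p) >= threshold)
            for cls, p in prototypes.items()}

In the paper, the resulting per-class binary indicators are supplied to a conditional random fields model alongside orthographic and syntactic features.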

Place, publisher, year, edition, pages
IEEE conference proceedings, 2014. 450-457 p.
Keyword [en]
distributional semantics, random indexing, named entity recognition, deidentification, electronic health records
National Category
Information Systems
Research subject
Computer and Systems Sciences
Identifiers
URN: urn:nbn:se:su:diva-110992
DOI: 10.1109/BIBM.2014.6999199
ISBN: 978-1-4799-5701-9/14 (print)
OAI: oai:DiVA.org:su-110992
DiVA: diva2:866455
Conference
2014 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2-5 November 2014, Belfast, UK
Funder
Swedish Foundation for Strategic Research, IIS11-0053
Available from: 2014-12-19. Created: 2014-12-19. Last updated: 2015-11-03. Bibliographically approved.
In thesis
1. Ensembles of Semantic Spaces: On Combining Models of Distributional Semantics with Applications in Healthcare
2015 (English). Doctoral thesis, comprehensive summary (Other academic).
Abstract [en]

Distributional semantics allows models of linguistic meaning to be derived from observations of language use in large amounts of text. By modeling the meaning of words in semantic (vector) space on the basis of co-occurrence information, distributional semantics permits a quantitative interpretation of (relative) word meaning in an unsupervised setting, i.e., human annotations are not required. The ability to obtain inexpensive word representations in this manner helps to alleviate the bottleneck of fully supervised approaches to natural language processing, especially since models of distributional semantics are data-driven and hence agnostic to both language and domain.

All that is required to obtain distributed word representations is a sizeable corpus; however, the composition of the semantic space is not only affected by the underlying data but also by certain model hyperparameters. While these can be optimized for a specific downstream task, there are currently limitations to the extent to which the many aspects of semantics can be captured in a single model. This dissertation investigates the possibility of capturing multiple aspects of lexical semantics by adopting the ensemble methodology within a distributional semantic framework to create ensembles of semantic spaces. To that end, various strategies for creating the constituent semantic spaces, as well as for combining them, are explored in a number of studies.

The notion of semantic space ensembles is generalizable across languages and domains; however, the use of unsupervised methods is particularly valuable in low-resource settings where annotated corpora are scarce, as in the domain of Swedish healthcare. The semantic space ensembles are here empirically evaluated for tasks that have promising applications in healthcare. It is shown that semantic space ensembles – created by exploiting various corpora and data types, as well as by adjusting model hyperparameters such as the size of the context window and the strategy for handling word order within the context window – are able to outperform the use of any single constituent model on a range of tasks. The semantic space ensembles are used both directly for k-nearest neighbors retrieval and for semi-supervised machine learning. Applying semantic space ensembles to important medical problems facilitates the secondary use of healthcare data, which, despite its abundance and transformative potential, is grossly underutilized.
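As a rough illustration of how such an ensemble can be used for k-nearest-neighbours retrieval, the sketch below averages cosine similarities across several semantic spaces before ranking; this particular combination strategy is an assumption made for illustration and not necessarily the one adopted in the included studies.

# Illustrative sketch: nearest-neighbour retrieval over an ensemble of
# semantic spaces by averaging per-space cosine similarities.
import numpy as np

def cosine(a, b):
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b) / norm if norm else 0.0

def knn_ensemble(query, spaces, k=10):
    # spaces: a list of dicts mapping word -> vector, e.g. semantic spaces
    # built with different context-window sizes, word-order strategies,
    # corpora or data types.
    scores = {}
    for space in spaces:
        if query not in space:
            continue
        q = space[query]
        for word, vec in space.items():
            if word != query:
                scores.setdefault(word, []).append(cosine(q, vec))
    # Average similarity across the spaces in which a word occurs,
    # then return the k highest-ranked neighbours.
    ranked = sorted(scores, key=lambda w: np.mean(scores[w]), reverse=True)
    return ranked[:k]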

Place, publisher, year, edition, pages
Stockholm: Department of Computer and Systems Sciences, Stockholm University, 2015. 95 p.
Series
Report Series / Department of Computer & Systems Sciences, ISSN 1101-8526 ; 15-021
Keyword
natural language processing, machine learning, distributional semantics, ensemble learning, semantic space ensembles, medical informatics, electronic health records
National Category
Computer Science; Language Technology (Computational Linguistics)
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-122465 (URN)
978-91-7649-302-1 (ISBN)
Public defence
2015-12-17, Lilla hörsalen, NOD-huset, Borgarfjordsgatan 12, Kista, 13:00 (English)
Projects
High-Performance Data Mining for Drug Effect Detection
Funder
Swedish Foundation for Strategic Research, IIS11-0053
Note

At the time of the doctoral defence, Papers 4 and 5 were unpublished conference papers.

Available from: 2015-11-25. Created: 2015-11-02. Last updated: 2015-11-13. Bibliographically approved.

Open Access in DiVA

No full text
