Building a De-identification System for Real Swedish Clinical Text Using Pseudonymised Clinical Text
Hanna Berg, Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
Hercules Dalianis, Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
2019 (English). In: Proceedings of the Tenth International Workshop on Health Text Mining and Information Analysis (LOUHI 2019) / [ed] Eben Holderness, Antonio Jimeno Yepes, Alberto Lavelli, Anne-Lyse Minard, James Pustejovsky, Fabio Rinaldi, Association for Computational Linguistics, 2019, p. 118-125. Conference paper, Published paper (Refereed)
Abstract [en]

This article presents experiments with pseudonymised Swedish clinical text used as training data to de-identify real clinical text, with the future aim of transferring non-sensitive training data to other hospitals. Conditional Random Fields (CRF) and Long Short-Term Memory (LSTM) machine learning algorithms were used to train de-identification models. The two models were trained on pseudonymised data and evaluated on real data. For benchmarking, models were also trained and evaluated on real data, and trained and evaluated on pseudonymised data. CRF performed better on some protected health information (PHI) classes, such as Date Part, First Name and Last Name, which is consistent with some reports in the literature. In contrast, performance on Location and Health Care Unit was poor, partly because of the constrained vocabulary in the pseudonymised training data. It is concluded that it is possible to train transferable models based on pseudonymised Swedish clinical data, but that even small narrative and distributional variation could negatively impact performance.
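The per-class comparison described in the abstract can be illustrated with a minimal sketch of token-level precision, recall and F1 per PHI class. The abstract does not state the paper's exact scoring scheme, so token-level matching is an assumption here; the function name `per_class_f1`, the tag spellings and the toy tag sequences are all hypothetical and are not data or results from the paper.

```python
from collections import Counter


def per_class_f1(gold, pred, outside="O"):
    """Token-level precision, recall and F1 for each PHI class.

    Assumption: one gold tag and one predicted tag per token,
    with `outside` marking tokens that are not PHI.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            if g != outside:
                tp[g] += 1
        else:
            if p != outside:
                fp[p] += 1  # predicted a PHI class that is not in the gold tag
            if g != outside:
                fn[g] += 1  # missed (or mislabelled) a gold PHI token
    scores = {}
    for label in sorted(set(tp) | set(fp) | set(fn)):
        n_pred = tp[label] + fp[label]
        n_gold = tp[label] + fn[label]
        precision = tp[label] / n_pred if n_pred else 0.0
        recall = tp[label] / n_gold if n_gold else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


# Invented toy tags for illustration only (not taken from the paper).
gold = ["First_Name", "Last_Name", "O", "Location", "O", "Health_Care_Unit"]
pred = ["First_Name", "Last_Name", "O", "O", "O", "Location"]
scores = per_class_f1(gold, pred)
for label, (p, r, f) in scores.items():
    print(f"{label}: P={p:.2f} R={r:.2f} F1={f:.2f}")
```

Running the same function on the output of a model trained on pseudonymised data and on one trained on real data, both scored against the same real test set, would give the kind of per-class comparison the abstract refers to.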

Place, publisher, year, edition, pages
Association for Computational Linguistics, 2019. p. 118-125
Keywords [en]
de-identification, electronic health records, machine learning, Swedish
National Category
Information Systems
Research subject
Computer and Systems Sciences
Identifiers
URN: urn:nbn:se:su:diva-177196; DOI: 10.18653/v1/D19-6215; ISBN: 978-1-950737-77-2 (print); OAI: oai:DiVA.org:su-177196; DiVA, id: diva2:1379946
Conference
LOUHI 2019: The Tenth International Workshop on Health Text Mining and Information Analysis, Hong Kong, China, 3 November, 2019
Available from: 2019-12-17 Created: 2019-12-17 Last updated: 2019-12-17 Bibliographically approved

By author/editor
Berg, Hanna; Dalianis, Hercules