Releasing a Swedish Clinical Corpus after Removing all Words – De-identification Experiments with Conditional Random Fields and Random Forests
Dalianis, Hercules: Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
Boström, Henrik: Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
2012 (English). In: Proceedings of the Third Workshop on Building and Evaluating Resources for Biomedical Text Mining (BioTxtM 2012), 2012, p. 45-48. Conference paper, Published paper (Refereed)
Abstract [en]

Patient records contain valuable information in the form of both structured data and free text; however, this information is sensitive, since it can reveal the identity of patients. To allow new methods and techniques to be developed and evaluated on real-world clinical data without revealing such sensitive information, researchers could be given access to de-identified records without protected health information (PHI), such as names and telephone numbers. One approach to minimizing the risk of revealing PHI when releasing text corpora from such records is to include only features of the words instead of the words themselves. Such features may include, for example, part of speech and word length, from which the sensitive information cannot be derived. To investigate what performance losses can be expected when replacing specific words with features, an experiment is presented with two state-of-the-art machine learning methods, conditional random fields and random forests, comparing their ability to support de-identification, using the Stockholm EPR PHI corpus as a benchmark. The results indicate severe performance losses when the actual words are removed, leading to the conclusion that the chosen features are not sufficient for the suggested approach to be viable.
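
The idea of releasing word features instead of words can be illustrated with a minimal sketch. It is not the feature set or code used in the paper: the function names, the particular surface features, the example sentence, and the part-of-speech tags are all assumptions made for illustration, and the tags are supplied by hand rather than produced by a tagger.

```python
from typing import Dict, List, Tuple


def token_features(token: str, pos: str) -> Dict[str, object]:
    """Describe a token by surface features only; the token text is not kept."""
    return {
        "pos": pos,                                   # part-of-speech tag (given as input)
        "length": len(token),                         # word length in characters
        "is_capitalized": token[:1].isupper(),        # first character is upper case
        "has_digit": any(ch.isdigit() for ch in token),
        "is_numeric": token.isdigit(),                # token consists of digits only
    }


def strip_words(tagged: List[Tuple[str, str]]) -> List[Dict[str, object]]:
    """Replace every (token, pos) pair with its feature description."""
    return [token_features(token, pos) for token, pos in tagged]


# Hypothetical tagged sentence; the tag names are illustrative assumptions.
sentence = [("Anna", "PM"), ("ringde", "VB"), ("0701234567", "RG")]
features = strip_words(sentence)
```

A sequence labeller such as a conditional random field, or a classifier such as a random forest, would then be trained on `features` rather than on the tokens. The abstract reports that, with the features chosen in the paper, this substitution led to severe performance losses.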

Place, publisher, year, edition, pages
2012. p. 45-48
Keywords [en]
de-identification, conditional random fields, random forests, Swedish clinical text
National Category
Information Systems
Research subject
Computer and Systems Sciences
Identifiers
URN: urn:nbn:se:su:diva-79527; OAI: oai:DiVA.org:su-79527; DiVA, id: diva2:549732
Conference
The Third Workshop on Building and Evaluating Resources for Biomedical Text Mining, 26th May 2012, Istanbul, Turkey
Available from: 2012-09-05. Created: 2012-09-05. Last updated: 2022-02-24. Bibliographically approved.

Open Access in DiVA

No full text in DiVA

Other links

http://rapidlibrary.com/source.php?file=ulcvcceew8i89on&url=http%3A%2F%2Fpeople.dsv.su.se%2F%7Ehercules%2Fpapers%2FDalianis_and_Bostrom_2012_Releasing_a_Swedish_clinical_corpus_after_removing_all_words-de-identification_experiments_with_conditional_random_fields_and_random_forests.pdf&sec=e04696ee7c89c415

Authority records

Dalianis, Hercules; Boström, Henrik
