Releasing a Swedish Clinical Corpus after Removing all Words – De-identification Experiments with Conditional Random Fields and Random Forests
2012 (English). In: Proceedings of the Third Workshop on Building and Evaluating Resources for Biomedical Text Mining (BioTxtM 2012), 2012, pp. 45-48. Conference paper (Refereed)
Patient records contain valuable information in the form of both structured data and free text; however, this information is sensitive since it can reveal the identity of patients. In order to allow new methods and techniques to be developed and evaluated on real-world clinical data without revealing such sensitive information, researchers could be given access to de-identified records without protected health information (PHI), such as names, telephone numbers, and so on. One approach to minimizing the risk of revealing PHI when releasing text corpora from such records is to include only features of the words instead of the words themselves. Such features may include parts of speech, word length, and so on, from which the sensitive information cannot be derived. In order to investigate what performance losses can be expected when replacing specific words with features, an experiment is presented with two state-of-the-art machine learning methods, conditional random fields and random forests, comparing their ability to support de-identification, using the Stockholm EPR PHI corpus as a benchmark. The results indicate severe performance losses when the actual words are removed, leading to the conclusion that the chosen features are not sufficient for the suggested approach to be viable.
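To make the idea of releasing word features instead of words concrete, the following minimal Python sketch replaces each token with a small set of surface features. The function names (`word_features`, `strip_words`) and the particular features are illustrative assumptions, not the feature set used in the paper; parts of speech are left out because they would require a tagger.

```python
def word_features(token):
    """Describe a token by surface features only (hypothetical feature set).

    The token string itself is deliberately not part of the result.
    """
    return {
        "length": len(token),
        "is_capitalized": token[:1].isupper(),
        "is_all_digits": token.isdigit(),
        "has_digit": any(ch.isdigit() for ch in token),
    }


def strip_words(tokens):
    """Replace every token in a sequence with its feature dictionary."""
    return [word_features(t) for t in tokens]


# Made-up example sentence; not taken from any clinical corpus.
sentence = ["Patient", "Anna", "called", "08-1234567", "today"]
for features in strip_words(sentence):
    print(features)
```

A sequence of such feature dictionaries could then serve as input to a sequence labeller in place of the original text. With these features, a name such as "Anna" and an ordinary sentence-initial word such as "Patient" differ only in length, which gives some intuition for the performance losses the abstract reports.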
Place, publisher, year, edition, pages
2012. 45-48 p.
Keywords: de-identification, conditional random fields, random forests, Swedish clinical text
Research subject: Computer and Systems Sciences
Identifiers: URN: urn:nbn:se:su:diva-79527; OAI: oai:DiVA.org:su-79527; DiVA: diva2:549732
The Third Workshop on Building and Evaluating Resources for Biomedical Text Mining, 26th May 2012, Istanbul, Turkey