Shades of Certainty: Annotation and Classification of Swedish Medical Records
Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
2012 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Access to information is fundamental in health care. This thesis presents research on Swedish medical records with the overall goal of building intelligent information access tools that can aid health personnel, researchers and other professions in their daily work, and, ultimately, improve health care in general.

The issue of ethics and identifiable information is addressed by creating an annotated gold standard corpus and porting an existing de-identification system to Swedish from English. The aim is to move towards making textual resources available to researchers without risking exposure of patients’ confidential information. Results for the rule-based system are not encouraging, but results for the gold standard are fairly high.

Affirmed, uncertain and negated information needs to be distinguished when building accurate information extraction tools. Annotation models are created, with the aim of building automated systems. One model distinguishes certain and uncertain sentences, and is applied on medical records from several clinical departments. In a second model, two polarities and three levels of certainty are applied on diagnostic statements from an emergency department. Overall results are promising. Differences are seen depending on clinical practice, annotation task and level of domain expertise among the annotators.

Using annotated resources for automatic classification is studied. Encouraging overall results using local context information are obtained. The fine-grained certainty levels are used for building classifiers for real-world e-health scenarios.

This thesis contributes two annotation models of certainty and one of identifiable information, applied on Swedish medical records. A deeper understanding of the language use linked to conveying certainty levels is gained. Three annotated resources that can be used for further research have been created, and implications for automated systems are presented.

Place, publisher, year, edition, pages
Stockholm: Department of Computer and Systems Sciences, Stockholm University, 2012. 78 p.
Series
Report Series / Department of Computer & Systems Sciences, ISSN 1101-8526 ; 12-002
Keyword [en]
Clinical documentation, Certainty level classification, Annotation, E-health, Corpus creation, De-identification, Speculative language, Medical Records, Swedish, Natural Language Processing, Language Technology
National Category
Information Systems, Social aspects
Research subject
Computer and Systems Sciences
Identifiers
URN: urn:nbn:se:su:diva-74828
ISBN: 978-91-7447-444-2 (print)
OAI: oai:DiVA.org:su-74828
DiVA: diva2:512263
Public defence
2012-04-27, Sal C, Forum 100, Isafjordsgatan 39, Kista, 13:00 (English)
Opponent
Supervisors
Available from: 2012-04-05 Created: 2012-03-27 Last updated: 2012-03-28 Bibliographically approved
List of papers
1. Developing a standard for de-identifying electronic patient records written in Swedish: precision, recall and F-measure in a manual and computerized annotation trial
2009 (English) In: International Journal of Medical Informatics, ISSN 1386-5056, E-ISSN 1872-8243, Vol. 78, no 12, e19-e26 p. Article in journal (Refereed) Published
Abstract [en]

Background

Electronic patient records (EPRs) contain a large amount of information written in free text. This information is considered very valuable for research but is also very sensitive since the free text parts may contain information that could reveal the identity of a patient. Therefore, methods for de-identifying EPRs are needed. The work presented here aims to perform a manual and automatic Protected Health Information (PHI)-annotation trial for EPRs written in Swedish.

Methods

This study consists of two main parts: the initial creation of a manually PHI-annotated gold standard, and the porting and evaluation of an existing de-identification software written for American English to Swedish in a preliminary automatic de-identification trial. Results are measured with precision, recall and F-measure.

Results

This study reports fairly high Inter-Annotator Agreement (IAA) results on the manually created gold standard, especially for specific tags such as names. The average IAA over all tags was 0.65 F-measure (0.84 F-measure highest pairwise agreement). For name tags the average IAA was 0.80 F-measure (0.91 F-measure highest pairwise agreement). Porting a de-identification software written for American English to Swedish directly was unfortunately non-trivial, yielding poor results.
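As an illustration of how pairwise agreement can be expressed as an F-measure, the following is a minimal sketch, not the evaluation code used in the study. It assumes that each annotation is an exact (start, end, tag) span and that one annotator is treated as the reference and the other as the response; the span representation, the function name and the tag names in the example are hypothetical.

```python
from typing import Set, Tuple

# Hypothetical representation: (start offset, end offset, tag name).
Span = Tuple[int, int, str]


def pairwise_f_measure(reference: Set[Span], response: Set[Span]) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) for exact span-and-tag matches."""
    matches = len(reference & response)
    precision = matches / len(response) if response else 0.0
    recall = matches / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Illustrative data only: tag names are assumptions, not the study's tag set.
annotator_a = {(0, 5, "First_Name"), (10, 18, "Location"), (25, 29, "Age")}
annotator_b = {(0, 5, "First_Name"), (10, 18, "Health_Care_Unit")}

precision, recall, f_measure = pairwise_f_measure(annotator_a, annotator_b)
print(f"P={precision:.2f} R={recall:.2f} F={f_measure:.2f}")
```

Swapping the two annotators swaps precision and recall but leaves the F-measure unchanged, which is one reason it is convenient as a symmetric pairwise agreement figure.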

Conclusion

Developing gold standard sets as well as automatic systems for de-identification tasks in Swedish is feasible. However, discussions and definitions of identifiable information are needed, as well as further development of both the tag sets and the annotation guidelines, in order to obtain a reliable gold standard. A completely new de-identification software needs to be developed.

Keyword
Medical informatics applications, Natural language processing, Medical record systems, Electronic patient records in Swedish, Protected health information, Ethical issues, Annotation
National Category
Computer and Information Science
Identifiers
urn:nbn:se:su:diva-33411 (URN); 10.1016/j.ijmedinf.2009.04.005 (DOI); 000272036200012 ()
Available from: 2009-12-23 Created: 2009-12-23 Last updated: 2017-12-12 Bibliographically approved
2. How Certain are Clinical Assessments?: Annotating Swedish Clinical Text for (Un)certainties, Speculations and Negations
2010 (English) In: Proceedings of the Seventh International Conference on Language Resources and Evaluation, LREC 2010 / [ed] Nicoletta Calzolari, 2010, 3071-3075 p. Conference paper, Published paper (Other academic)
Abstract [en]

Clinical texts contain a large amount of information. Some of this information is embedded in contexts where e.g. a patient status is reasoned about, which may lead to a considerable amount of statements that indicate uncertainty and speculation. We believe that distinguishing such instances from factual statements will be very beneficial for automatic information extraction. We have annotated a subset of the Stockholm Electronic Patient Record Corpus for certain and uncertain expressions as well as speculative and negation keywords, with the purpose of creating a resource for the development of automatic detection of speculative language in Swedish clinical text. We have analyzed the results from the initial annotation trial by means of pairwise Inter-Annotator Agreement (IAA) measured with F-score. Our main findings are that IAA results for certain expressions and negations are very high, but for uncertain expressions and speculative keywords results are less encouraging. These instances need to be defined in more detail. With this annotation trial, we have created an important resource that can be used to further analyze the properties of speculative language in Swedish clinical text. Our intention is to release this subset to other research groups in the future after removing identifiable information.

National Category
Information Science
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-51917 (URN); 2-9517408-6-7 (ISBN)
Conference
the Seventh International Conference on Language Resources and Evaluation, LREC 2010
Available from: 2011-01-12 Created: 2011-01-12 Last updated: 2012-03-27 Bibliographically approved
3. Towards A Better Understanding of Uncertainties and Speculations in Swedish Clinical Text – Analysis of an Initial Annotation Trial
2010 (English) In: Proceedings of the Workshop on Negation and Speculation in Natural Language Processing, University of Antwerpen, 2010, 14-22 p. Conference paper, Published paper (Other academic)
Abstract [en]

In view of the increasing need to facilitate processing the content of scientific papers, we present an annotation scheme for annotating full papers with zones of conceptualisation, reflecting the information structure and knowledge types which constitute a scientific investigation. The latter are the Core Scientific Concepts (CoreSCs) and include Hypothesis, Motivation, Goal, Object, Background, Method, Experiment, Model, Observation, Result and Conclusion. The CoreSC scheme has been used to annotate a corpus of 265 full papers in physical chemistry and biochemistry and we are currently automating the recognition of CoreSCs in papers. We discuss how the CoreSC scheme relates to other views of scientific papers and indeed how the former could be used to help identify negation and speculation in scientific texts.

Place, publisher, year, edition, pages
University of Antwerpen, 2010
National Category
Information Science
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-52029 (URN); 9789057282669 (ISBN)
Conference
Workshop on Negation and Speculation in Natural Language Processing
Available from: 2011-01-12 Created: 2011-01-12 Last updated: 2012-03-27 Bibliographically approved
4. Factuality Levels of Diagnoses in Swedish Clinical Text
2011 (English) In: User Centred Networked Health Care - Proceedings of MIE 2011 / [ed] Anne Moen, Stig Kjær Andersen, Jos Aarts, Petter Hurlen, 2011, 559-563 p. Conference paper, Published paper (Refereed)
Abstract [en]

Different levels of knowledge certainty, or factuality levels, are expressed in clinical health record documentation. This information is currently not fully exploited, as the subtleties expressed in natural language cannot easily be machine analyzed. Extracting relevant information from knowledge-intensive resources such as electronic health records can be used for improving health care in general by e.g. building automated information access systems. We present an annotation model of six factuality levels linked to diagnoses in Swedish clinical assessments from an emergency ward. Our main findings are that overall agreement is fairly high (0.7/0.58 F-measure, 0.73/0.6 Cohen's κ, Intra/Inter). These distinctions are important for knowledge models, since only approx. 50% of the diagnoses are affirmed with certainty. Moreover, our results indicate that there are patterns inherent in the diagnosis expressions themselves conveying factuality levels, showing that certainty is not only dependent on context cues.
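Cohen's κ, reported above alongside the F-measure, corrects observed agreement for the agreement expected by chance. The following is a minimal sketch of the standard two-annotator formula, κ = (p_o - p_e) / (1 - p_e); it is not the authors' code, and the function name and the two-label example data are hypothetical.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("Both annotators must label the same non-empty set of items.")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label proportions.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # Both annotators used a single identical label throughout.
    return (observed - expected) / (1 - expected)


# Illustrative labels only, not the six factuality levels of the paper.
first_pass = ["certain", "certain", "uncertain", "uncertain"]
second_pass = ["certain", "uncertain", "uncertain", "uncertain"]
print(cohens_kappa(first_pass, second_pass))
```

The same function applies to intra-annotator agreement (one annotator, two passes) and inter-annotator agreement (two annotators), since both compare two label sequences over the same items.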

Keyword
Diagnosis reasoning, factuality levels, annotation, Swedish, clinical text, electronic health records
National Category
Information Science
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-62347 (URN); 10.3233/978-1-60750-806-9-559 (DOI); 978-1-60750-805-2 (ISBN)
Conference
MIE 2011
Available from: 2011-09-15 Created: 2011-09-15 Last updated: 2012-03-27 Bibliographically approved
5. Automatic Classification of Factuality Levels: A Case Study on Swedish Diagnoses and the Impact of Local Context
2011 (English) In: The Fourth International Symposium on Languages in Biology and Medicine, Singapore, 2011. Conference paper, Published paper (Refereed)
Abstract [en]

Clinicians express different levels of knowledge certainty when reasoning about a patient’s status. Automatic extraction of relevant information is crucial in the clinical setting, which means that factuality levels need to be distinguished. We present an automatic classifier using Conditional Random Fields, which is trained and tested on a Swedish clinical corpus annotated for factuality levels at a diagnosis statement level: the Stockholm EPR Diagnosis-Factuality Corpus. The classifier obtains promising results (best overall results are 0.699 average F-measure using all classes, 0.762 F-measure using merged classes), using simple local context features. Preceding context is more useful than posterior, although best results are obtained using a window size of +/-4. Lower levels of certainty are more problematic than higher levels, which was also the case for the human annotators in creating the corpus. A manual error analysis shows that conjunctions and other higher-level features are common sources of errors.
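To make the notion of local context features concrete, here is a minimal sketch of extracting the words within a fixed window around a target token, the kind of per-token feature dictionary commonly passed to a Conditional Random Fields toolkit. It is an assumption-laden illustration, not the paper's feature set: the function name, the feature key format, the lowercasing and the English example sentence are all hypothetical.

```python
from typing import Dict, List


def window_features(tokens: List[str], index: int, before: int = 4, after: int = 4) -> Dict[str, str]:
    """Collect lowercased tokens within a window around tokens[index].

    Keys encode the relative offset, e.g. "w[-1]" for the preceding token.
    Positions that fall outside the sentence are simply omitted.
    """
    features = {"w[0]": tokens[index].lower()}
    for offset in range(-before, after + 1):
        if offset == 0:
            continue
        position = index + offset
        if 0 <= position < len(tokens):
            features[f"w[{offset:+d}]"] = tokens[position].lower()
    return features


# Hypothetical English stand-in for a clinical sentence; index 3 is the diagnosis token.
sentence = ["Patient", "probably", "has", "pneumonia", "."]
print(window_features(sentence, 3))
# Preceding context only, as a comparison to the symmetric window.
print(window_features(sentence, 3, before=4, after=0))
```

Setting `before` and `after` independently makes it straightforward to compare preceding and following context separately, as the abstract describes.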

Place, publisher, year, edition, pages
Singapore, 2011
National Category
Information Systems
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-68729 (URN)
Conference
Fourth International Symposium on Languages in Biology and Medicine, LBM 2011
Available from: 2012-01-05 Created: 2012-01-05 Last updated: 2012-03-27 Bibliographically approved
6. Fine-grained Certainty Level Annotations Used for Coarser-grained E-health Scenarios: Certainty Classification of Diagnostic Statements in Swedish Clinical Text
2012 (English) In: Proceedings of the 13th international conference on Computational Linguistics and Intelligent Text Processing - Volume Part II / [ed] Alexander Gelbukh, Berlin/Heidelberg: Springer Berlin/Heidelberg, 2012, 450-461 p. Conference paper, Published paper (Refereed)
Abstract [en]

An important task in information access methods is distinguishing factual information from speculative or negated information. Fine-grained certainty levels of diagnostic statements in Swedish clinical text are annotated in a corpus from a medical university hospital. The annotation model has two polarities (positive and negative) and three certainty levels. However, there are many e-health scenarios where such fine-grained certainty levels are not practical for information extraction. Instead, more coarse-grained groups are needed. We present three scenarios: adverse event surveillance, decision support alerts and automatic summaries, and collapse the fine-grained certainty level classifications into coarser-grained groups. We build automatic classifiers for each scenario and analyze the results quantitatively. Annotation discrepancies are analyzed qualitatively through manual corpus analysis. Our main findings are that it is feasible to use a corpus of fine-grained certainty level annotations to build classifiers for coarser-grained real-world scenarios: 0.89, 0.91 and 0.8 F-score (overall average).
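Collapsing fine-grained labels into coarser groups amounts to applying a many-to-one mapping before training or evaluation. The sketch below assumes six labels formed from two polarities and three certainty levels, as the abstract describes, but the label names, the group names and the particular grouping are hypothetical and are not the mappings defined in the paper for its three scenarios.

```python
from typing import Dict, List

# Hypothetical names for the six polarity-by-certainty labels.
FINE_LABELS = [
    "certainly_positive",
    "probably_positive",
    "possibly_positive",
    "possibly_negative",
    "probably_negative",
    "certainly_negative",
]

# Purely illustrative grouping; each real scenario would define its own.
ILLUSTRATIVE_GROUPING: Dict[str, str] = {
    "certainly_positive": "affirmed",
    "probably_positive": "affirmed",
    "possibly_positive": "uncertain",
    "possibly_negative": "uncertain",
    "probably_negative": "negated",
    "certainly_negative": "negated",
}


def collapse_labels(labels: List[str], grouping: Dict[str, str]) -> List[str]:
    """Map each fine-grained label to its coarse group; unknown labels raise KeyError."""
    return [grouping[label] for label in labels]


annotated = ["probably_positive", "possibly_negative", "certainly_negative"]
print(collapse_labels(annotated, ILLUSTRATIVE_GROUPING))
```

Keeping the grouping as data rather than code means one annotated corpus can be relabelled for several scenarios by swapping the dictionary.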

Place, publisher, year, edition, pages
Berlin/Heidelberg: Springer Berlin/Heidelberg, 2012
Series
Lecture Notes in Computer Science, ISSN 0302-9743 ; 7192
Keyword
Clinical documentation, Certainty level classification, Annotation granularity, Automatic Summary, Decision Support Alerts, Adverse Event Surveillance, E-health
National Category
Information Systems, Social aspects
Research subject
Computer Science; IT for health
Identifiers
urn:nbn:se:su:diva-74810 (URN); 10.1007/978-3-642-28601-8_38 (DOI); 978-3-642-28600-1 (ISBN)
Conference
CICLing 2012, New Delhi, India, March 11–17, 2012
Available from: 2012-03-26 Created: 2012-03-26 Last updated: 2013-02-07 Bibliographically approved

Open Access in DiVA

fulltext (760 kB), 718 downloads
File information
File name: FULLTEXT01.pdf
File size: 760 kB
Checksum (SHA-512):
3f4a031252f96b7f2733ab23ea5ccb3ece9deac8729cd3ac0ddb10e3846c6ccca4ada58f7b33a59a991c88942dbd86a1c2504c79ee472ac507459d5df4bcac61
Type: fulltext
Mimetype: application/pdf
