Extracting Clinical Findings from Swedish Health Record Text
Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences. ORCID iD: 0000-0001-6164-7762
2014 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Information contained in the free text of health records is useful for the immediate care of patients as well as for medical knowledge creation. Advances in clinical language processing have made it possible to automatically extract this information, but most research has, until recently, been conducted on clinical text written in English. In this thesis, however, information extraction from Swedish clinical corpora is explored, particularly focusing on the extraction of clinical findings. Unlike most previous studies, Clinical Finding was divided into the two more granular sub-categories Finding (symptom/result of a medical examination) and Disorder (condition with an underlying pathological process). For detecting clinical findings mentioned in Swedish health record text, a machine learning model, trained on a corpus of manually annotated text, achieved results in line with the obtained inter-annotator agreement figures. The machine learning approach clearly outperformed an approach based on vocabulary mapping, showing that Swedish medical vocabularies are not extensive enough for the purpose of high-quality information extraction from clinical text. A rule and cue vocabulary-based approach was, however, successful for negation and uncertainty classification of detected clinical findings. Methods for facilitating expansion of medical vocabulary resources are particularly important for Swedish and other languages with less extensive vocabulary resources. The possibility of using distributional semantics, in the form of Random indexing, for semi-automatic vocabulary expansion of medical vocabularies was, therefore, evaluated. Distributional semantics does not require that terms or abbreviations are explicitly defined in the text, and it is, thereby, a method suitable for clinical corpora. Random indexing was shown useful for extending vocabularies with medical terms, as well as for extracting medical synonyms and abbreviation dictionaries.

Place, publisher, year, edition, pages
Stockholm University: Department of Computer and Systems Sciences, Stockholm University, 2014. 128 p.
Series
Report Series / Department of Computer & Systems Sciences, ISSN 1101-8526 ; 15-001
Keyword [en]
Named entity recognition, Corpora development, Clinical text processing, Distributional semantics, Random indexing, Vocabulary expansion, Assertion classification, Clinical text mining, Electronic health records, Swedish
National Category
Information Systems, Social aspects
Research subject
Computer and Systems Sciences
Identifiers
URN: urn:nbn:se:su:diva-109254
ISBN: 978-91-7649-054-9 (print)
OAI: oai:DiVA.org:su-109254
DiVA: diva2:763910
Public defence
2015-01-23, Lilla hörsalen, NOD-huset, Borgarfjordsgatan 12, Kista, 13:00 (English)
Opponent
Supervisors
Available from: 2014-12-29 Created: 2014-11-17 Last updated: 2014-11-21 Bibliographically approved
List of papers
1. Rule-based Entity Recognition and Coverage of SNOMED CT in Swedish Clinical Text
2012 (English). In: LREC 2012 8th ELRA Conference on Language Resources and Evaluation: Proceedings, European Language Resources Association (ELRA), 2012, 1250-1257 p. Conference paper, Published paper (Refereed)
Abstract [en]

Named entity recognition of the clinical entities disorders, findings and body structures is needed for information extraction from unstructured text in health records. Clinical notes from a Swedish emergency unit were annotated and used for evaluating a rule- and terminology-based entity recognition system. This system used different preprocessing techniques for matching terms to SNOMED CT, and, one by one, four other terminologies were added. For the class body structure, the results improved with preprocessing, whereas only small improvements were shown for the classes disorder and finding. The best average results were achieved when all terminologies were used together. The entity body structure was recognised with a precision of 0.74 and a recall of 0.80, whereas lower results were achieved for disorder (precision: 0.75, recall: 0.55) and for finding (precision: 0.57, recall: 0.30). The proportion of entities containing abbreviations was higher for false negatives than for correctly recognised entities, and no entities containing more than two tokens were recognised by the system. Low recall for disorders and findings shows both that additional methods are needed for entity recognition and that there are many expressions in clinical text that are not included in SNOMED CT.
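
The abstract above reports precision and recall per entity class but no combined score. As a reading aid only, the sketch below derives the balanced F-score (harmonic mean of precision and recall) from the reported pairs. The function name `f_score` and the dictionary layout are illustrative assumptions, and the derived values are not figures reported by the paper.

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when nothing was found
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# (precision, recall) pairs as stated in the abstract above.
reported = {
    "body structure": (0.74, 0.80),
    "disorder": (0.75, 0.55),
    "finding": (0.57, 0.30),
}

# Derived F1 per class; these are computed here, not taken from the paper.
f1_by_class = {name: f_score(p, r) for name, (p, r) in reported.items()}

for name, value in f1_by_class.items():
    print(f"{name}: F1 = {value:.2f}")
```

The derived scores (roughly 0.77, 0.63 and 0.39) make the abstract's point visible in a single number per class: the low recall for disorder and finding pulls their F-scores well below that of body structure.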

Place, publisher, year, edition, pages
European Language Resources Association (ELRA), 2012
Keyword
Electronic patient records, Swedish, SNOMED CT, named entity recognition
National Category
Information Systems
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-82257 (URN)
000323927701056 ()
978-2-9517408-7-7 (ISBN)
Conference
8th International Conference on Language Resources and Evaluation (LREC), Istanbul, Turkey, 23-25 May, 2012
Available from: 2012-11-12 Created: 2012-11-12 Last updated: 2014-11-19 Bibliographically approved
2. Vocabulary Expansion by Semantic Extraction of Medical Terms
2013 (English). In: Proceedings of the 5th International Symposium on Languages in Biology and Medicine, 2013, 63-68 p. Conference paper, Published paper (Refereed)
Abstract [en]

Automatic methods for vocabulary expansion are valuable in supporting the development of terminological resources. Here, we evaluate two methods based on distributional semantics for extracting terms that belong to a certain semantic category. In a list of 1000 terms extracted from a corpus of Swedish medical text, the best method obtains a recall of 0.53 and 0.88, respectively, for identifying 90 terms that are known to belong to the semantic categories Medical Finding and Pharmaceutical Drug.
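
The evaluation measure used above, recall within a ranked candidate list of fixed length, can be sketched as follows. This is a minimal illustration under assumptions: the function name `recall_at_k` and the toy Swedish terms are hypothetical and are not data from the paper.

```python
def recall_at_k(ranked_candidates, known_terms, k):
    """Fraction of the known terms found among the top-k ranked candidates."""
    known = set(known_terms)
    if not known:
        raise ValueError("known_terms must not be empty")
    top_k = set(ranked_candidates[:k])
    return len(top_k & known) / len(known)


# Hypothetical toy data: a ranked candidate list produced by some term
# extraction method, and a gold-standard set of terms for one category.
ranked = ["feber", "hosta", "stol", "yrsel", "bord"]
gold = {"feber", "yrsel", "illamående"}

print(recall_at_k(ranked, gold, k=3))  # 1 of 3 gold terms in the top 3
print(recall_at_k(ranked, gold, k=5))  # 2 of 3 gold terms in the top 5
```

In the setting described in the abstract, `k` would be 1000 and the gold set would hold the 90 known terms of a semantic category.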

National Category
Information Systems
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-98599 (URN)
978-4-9907802-0-3 (ISBN)
Conference
The 5th International Symposium on Languages in Biology and Medicine (LBM 2013), Tokyo, Japan, 12 - 13 December, 2013
Available from: 2014-01-08 Created: 2014-01-08 Last updated: 2014-11-19 Bibliographically approved
3. Synonym extraction and abbreviation expansion with ensembles of semantic spaces
2014 (English). In: Journal of Biomedical Semantics, ISSN 2041-1480, E-ISSN 2041-1480, Vol. 5, no 6. Article in journal (Refereed) Published
Abstract [en]

Background: Terminologies that account for variation in language use by linking synonyms and abbreviations to their corresponding concept are important enablers of high-quality information extraction from medical texts. Due to the use of specialized sub-languages in the medical domain, manual construction of semantic resources that accurately reflect language use is both costly and challenging, often resulting in low coverage. Although models of distributional semantics applied to large corpora provide a potential means of supporting development of such resources, their ability to isolate synonymy from other semantic relations is limited. Their application in the clinical domain has also only recently begun to be explored. Combining distributional models and applying them to different types of corpora may lead to enhanced performance on the tasks of automatically extracting synonyms and abbreviation-expansion pairs.
Results: A combination of two distributional models – Random Indexing and Random Permutation – employed in conjunction with a single corpus outperforms using either of the models in isolation. Furthermore, combining semantic spaces induced from different types of corpora – a corpus of clinical text and a corpus of medical journal articles – further improves results, outperforming a combination of semantic spaces induced from a single source, as well as a single semantic space induced from the conjoint corpus. A combination strategy that simply sums the cosine similarity scores of candidate terms is generally the most profitable out of the ones explored. Finally, applying simple post-processing filtering rules yields substantial performance gains on the tasks of extracting abbreviation-expansion pairs, but not synonyms. The best results, measured as recall in a list of ten candidate terms, for the three tasks are: 0.39 for abbreviations to long forms, 0.33 for long forms to abbreviations, and 0.47 for synonyms.
Conclusions: This study demonstrates that ensembles of semantic spaces can yield improved performance on the tasks of automatically extracting synonyms and abbreviation-expansion pairs. This notion, which merits further exploration, allows different distributional models – with different model parameters – and different types of corpora to be combined, potentially allowing enhanced performance to be obtained on a wide range of natural language processing tasks.
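
The combination strategy singled out in the results, summing the cosine similarity scores that each semantic space assigns to a candidate term, can be illustrated with a small sketch. Everything here is an assumption made for illustration: semantic spaces are represented as plain dictionaries from term to vector, the names `cosine` and `rank_by_summed_cosine` are hypothetical, and the toy vectors are not derived from any corpus. The paper's actual spaces are built with Random Indexing and Random Permutation, which this sketch does not implement.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_summed_cosine(query, spaces, top_n=10):
    """Rank candidate terms by the sum of their cosine scores across spaces."""
    scores = {}
    for space in spaces:
        if query not in space:
            continue  # this space has no vector for the query term
        query_vec = space[query]
        for term, vec in space.items():
            if term != query:
                scores[term] = scores.get(term, 0.0) + cosine(query_vec, vec)
    # Highest summed score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]


# Hypothetical toy spaces standing in for a clinical and a journal corpus.
clinical_space = {"q": [1.0, 0.0], "a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
journal_space = {"q": [0.0, 1.0], "a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}

print(rank_by_summed_cosine("q", [clinical_space, journal_space]))
```

In this toy case, term `c` is only moderately similar to the query in each space but ranks first once the scores are summed, whereas `a` and `b` each score highly in just one space. This is the kind of effect an ensemble can exploit.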

Keyword
distributional semantics, random indexing, semantic space, ensemble methods, synonym extraction, abbreviation expansion
National Category
Information Systems
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-108651 (URN)
10.1186/2041-1480-5-6 (DOI)
000343707900002 ()
Available from: 2014-10-31 Created: 2014-10-31 Last updated: 2017-12-05 Bibliographically approved
4. Automatic recognition of disorders, findings, pharmaceuticals and body structures from clinical text: An annotation and machine learning study
2014 (English). In: Journal of Biomedical Informatics, ISSN 1532-0464, E-ISSN 1532-0480, Vol. 49, 148-158 p. Article in journal (Refereed) Published
Abstract [en]

Automatic recognition of clinical entities in the narrative text of health records is useful for constructing applications for documentation of patient care, as well as for secondary usage in the form of medical knowledge extraction. There are a number of named entity recognition studies on English clinical text, but less work has been carried out on clinical text in other languages. This study was performed on Swedish health records, and focused on four entities that are highly relevant for constructing a patient overview and for medical hypothesis generation, namely the entities: Disorder, Finding, Pharmaceutical Drug and Body Structure. The study had two aims: to explore how well named entity recognition methods previously applied to English clinical text perform on similar texts written in Swedish; and to evaluate whether it is meaningful to divide the more general category Medical Problem, which has been used in a number of previous studies, into the two more granular entities, Disorder and Finding. Clinical notes from a Swedish internal medicine emergency unit were annotated for the four selected entity categories, and the inter-annotator agreement between two pairs of annotators was measured, resulting in an average F-score of 0.79 for Disorder, 0.66 for Finding, 0.90 for Pharmaceutical Drug and 0.80 for Body Structure. A subset of the developed corpus was thereafter used for finding suitable features for training a conditional random fields model. Finally, a new model was trained on this subset, using the best features and settings, and its ability to generalise to held-out data was evaluated. This final model obtained an F-score of 0.81 for Disorder, 0.69 for Finding, 0.88 for Pharmaceutical Drug, 0.85 for Body Structure and 0.78 for the combined category Disorder + Finding. 
The obtained results, which are in line with or slightly lower than those for similar studies on English clinical text, many of them conducted using a larger training data set, show that the approaches used for English are also suitable for Swedish clinical text. However, a small proportion of the errors made by the model are less likely to occur in English text, showing that results might be improved by further tailoring the system to clinical Swedish. The entity recognition results for the individual entities Disorder and Finding show that it is meaningful to separate the general category Medical Problem into these two more granular entity types, e.g. for knowledge mining of co-morbidity relations and disorder-finding relations.

Keyword
Named entity recognition, Corpora development, Clinical text processing, Disorder, Finding, Swedish
National Category
Information Systems
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-106433 (URN)
10.1016/j.jbi.2014.01.012 (DOI)
000337772200015 ()
Note

AuthorCount:4;

Available from: 2014-08-06 Created: 2014-08-04 Last updated: 2017-12-05 Bibliographically approved
5. Negation detection in Swedish clinical text: An adaption of NegEx to Swedish
2011 (English). In: Journal of Biomedical Semantics, ISSN 2041-1480, E-ISSN 2041-1480, Vol. 2, no S3, 1-12 p. Article in journal (Refereed) Published
Abstract [en]

Background: Most methods for negation detection in clinical text have been developed for English text, and there is a need for evaluating the feasibility of adapting these methods to other languages. A Swedish adaption of the English rule-based negation detection system NegEx, which detects negations through the use of trigger phrases, was therefore evaluated.
Results: The Swedish adaption of NegEx showed a precision of 75.2% and a recall of 81.9%, when evaluated on 558 manually classified sentences containing negation triggers, and a negative predictive value of 96.5% when evaluated on 342 sentences not containing negation triggers.
Conclusions: The precision was significantly lower for the Swedish adaptation than published results for the English version, but since many negated propositions were identified through a limited set of trigger phrases, it could nevertheless be concluded that the same trigger phrase approach is possible in a Swedish context, even though it needs to be further developed.
Availability: The triggers used for the evaluation of the Swedish adaption of NegEx are available at http://people.dsv.su.se/~mariask/resources/triggers.txt and can be used together with the original NegEx program for negation detection in Swedish clinical text.
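
The trigger-phrase idea described above can be illustrated with a deliberately simplified sketch. It is not the NegEx program or its Swedish adaptation: it handles only single-word triggers that precede the finding within a fixed token window, and the trigger list, the window size and the function names are all assumptions chosen for illustration. The triggers are not taken from the published trigger file.

```python
import re

# Illustrative Swedish negation words (an assumption, not the published list).
TRIGGERS = ("ingen", "inga", "inget", "inte", "ej")
WINDOW = 5  # assumed number of tokens before the finding to inspect


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def is_negated(sentence, finding, triggers=TRIGGERS, window=WINDOW):
    """Return True if a trigger occurs within `window` tokens before `finding`."""
    tokens = tokenize(sentence)
    target = tokenize(finding)
    n = len(target)
    if n == 0:
        return False
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == target:
            preceding = tokens[max(0, i - window):i]
            if any(tok in triggers for tok in preceding):
                return True
    return False


print(is_negated("Patienten har ingen feber.", "feber"))
print(is_negated("Patienten har feber.", "feber"))
```

Because this sketch has no notion of scope-terminating words, a sentence such as "Ingen hosta, men feber." is wrongly flagged as negating "feber". Limitations of this kind are one reason a trigger-based system can show lower precision than recall.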

Keyword
Negation detection, NLP, Medical informatics
National Category
Information Systems
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-62353 (URN)
10.1186/2041-1480-2-S3-S3 (DOI)
Conference
Second Louhi Workshop on Text and Data Mining of Health Documents, Los Angeles, CA, USA, 05 June 2010
Available from: 2011-09-15 Created: 2011-09-15 Last updated: 2017-12-08 Bibliographically approved
6. Cue-based assertion classification for Swedish clinical text-Developing a lexicon for pyConTextSwe
2014 (English). In: Artificial Intelligence in Medicine, ISSN 0933-3657, E-ISSN 1873-2860, Vol. 61, no 3, 137-144 p. Article in journal (Refereed) Published
Abstract [en]

Objective: The ability of a cue-based system to accurately assert whether a disorder is affirmed, negated, or uncertain is dependent, in part, on its cue lexicon. In this paper, we continue our study of porting an assertion system (pyConTextNLP) from English to Swedish (pyConTextSwe) by creating an optimized assertion lexicon for clinical Swedish.
Methods and material: We integrated cues from four external lexicons, along with generated inflections and combinations. We used subsets of a clinical corpus in Swedish. We applied four assertion classes (definite existence, probable existence, probable negated existence and definite negated existence) and two binary classes (existence yes/no and uncertainty yes/no) to pyConTextSwe. We compared pyConTextSwe's performance with and without the added cues on a development set, and improved the lexicon further after an error analysis. On a separate evaluation set, we calculated the system's final performance.
Results: Following integration steps, we added 454 cues to pyConTextSwe. The optimized lexicon developed after an error analysis resulted in statistically significant improvements on the development set (83% F-score, overall). The system's final F-scores on an evaluation set were 81% (overall). For the individual assertion classes, F-score results were 88% (definite existence), 81% (probable existence), 55% (probable negated existence), and 63% (definite negated existence). For the binary classifications existence yes/no and uncertainty yes/no, final system performance was 97%/87% and 78%/86% F-score, respectively.
Conclusions: We have successfully ported pyConTextNLP to Swedish (pyConTextSwe). We have created an extensive and useful assertion lexicon for Swedish clinical text, which could form a valuable resource for similar studies, and which is publicly available.

Keyword
Assertion classification, Clinical text mining, Dictionaries, Medical Language Processing, Information extraction, Electronic health records
National Category
Information Systems
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-107440 (URN)
10.1016/j.artmed.2014.01.001 (DOI)
000340233700003 ()
Note

AuthorCount:7;

Available from: 2014-09-17 Created: 2014-09-15 Last updated: 2017-12-05 Bibliographically approved

Open Access in DiVA

M. Skeppstedt Dissertation (1256 kB), 332 downloads
File information
File name: FULLTEXT01.pdf
File size: 1256 kB
Checksum SHA-512:
4f126fc5b142393e156321924048e2d04c40d37a45ed2fbfaee6bbe9f52f46dd4921d37a76dcc18bbdc813fc95b2530f362736e2650f73104135ae3e613cbbcb
Type: fulltext
Mimetype: application/pdf

By author/editor
Skeppstedt, Maria