Natural language processing and machine learning to enable automatic extraction and classification of patients' smoking status from electronic medical records
Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences. Karolinska Institutet, Sweden.
Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
Number of Authors: 4
2020 (English)
In: Upsala Journal of Medical Sciences, ISSN 0300-9734, E-ISSN 2000-1967, Vol. 125, no 4, p. 316-324
Article in journal (Refereed) Published
Abstract [en]

Background: The electronic medical record (EMR) offers unique possibilities for clinical research, but some important patient attributes are not readily available due to its unstructured properties. We applied text mining using machine learning to enable automatic classification of unstructured information on smoking status from Swedish EMR data.

Methods: Data on patients' smoking status from EMRs were used to develop 32 different predictive models. The models were trained using Weka on a database of 85,000 classified sentences, varying sentence frequency, classifier type, tokenization, and attribute selection. They were evaluated using F-score and accuracy on out-of-sample test data comprising 8,500 sentences. An error weight matrix was used to select the best model: a weight was assigned to each type of misclassification and applied to the model confusion matrices. The best performing model was then compared to a rule-based method.
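The selection step described in the Methods can be sketched as follows. This is a minimal illustration, not the authors' implementation: the three class labels, the confusion-matrix counts, and the weights are all invented for the example, and macro-averaging of the F-score is an assumption (the abstract does not say which averaging was used).

```python
# Hypothetical labels, counts, and weights; none of these come from the paper.
LABELS = ["current", "former", "never"]

def accuracy(cm):
    """Fraction of sentences on the diagonal of a confusion matrix."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total

def macro_f_score(cm):
    """Unweighted mean of per-class F1 (averaging scheme is an assumption)."""
    n = len(cm)
    scores = []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp  # predicted k, actually other
        fn = sum(cm[k]) - tp                       # actually k, predicted other
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n

def weighted_error(cm, weights):
    """Sum of confusion-matrix cells multiplied by their misclassification weight."""
    n = len(cm)
    return sum(cm[i][j] * weights[i][j] for i in range(n) for j in range(n))

# Rows are true classes, columns are predicted classes.
cm_a = [[90, 8, 2], [5, 90, 5], [1, 4, 95]]
cm_b = [[92, 2, 6], [3, 93, 4], [6, 3, 91]]

# Zero on the diagonal; confusing "current" with "never" is weighted most heavily.
weights = [[0, 1, 3], [1, 0, 2], [3, 2, 0]]

candidates = {"A": cm_a, "B": cm_b}
best = min(candidates, key=lambda name: weighted_error(candidates[name], weights))

for name, cm in candidates.items():
    print(name, round(accuracy(cm), 4), round(macro_f_score(cm), 4),
          weighted_error(cm, weights))
print("selected:", best)
```

In this invented example model B has slightly higher accuracy (276 of 300 correct versus 275), yet model A is selected because its errors fall on the less heavily weighted cells (weighted error 40 versus 55). That is the kind of distinction a weight matrix can make and plain accuracy cannot.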

Results: The best performing model was based on the Support Vector Machine (SVM) Sequential Minimal Optimization (SMO) classifier using a combination of unigrams and bigrams as tokens. Sentence frequency and attribute selection did not improve model performance. SMO achieved 98.14% accuracy and an F-score of 0.981, versus 79.32% and 0.756 for the rule-based model.
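To make "a combination of unigrams and bigrams as tokens" concrete, the sketch below builds such a token list for one sentence. It is illustrative only: the study used Weka on Swedish text, whereas this uses plain Python, naive whitespace splitting, and an invented English sentence; the function names are hypothetical.

```python
def tokenize(sentence):
    """Lowercase and split on whitespace (a simplification of real tokenization)."""
    return sentence.lower().split()

def unigrams_and_bigrams(sentence):
    """Return single words followed by adjacent word pairs."""
    words = tokenize(sentence)
    bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
    return words + bigrams

# Invented example sentence, not taken from the study data.
tokens = unigrams_and_bigrams("Patient stopped smoking")
print(tokens)
# ['patient', 'stopped', 'smoking', 'patient stopped', 'stopped smoking']
```

A bigram such as "stopped smoking" carries information that the unigram "smoking" alone does not, which is one plausible reason the combined token set could help a classifier separate former from current smokers.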

Conclusion: A model using machine-learning algorithms to automatically classify patients' smoking status was successfully developed. Such algorithms may enable automatic assessment of smoking status and other unstructured data directly from EMRs without manual classification of complete case notes.

Place, publisher, year, edition, pages
2020. Vol. 125, no 4, p. 316-324
Keywords [en]
Clinical informatics, electronic medical records, machine learning, natural language processing, smoking, text mining
National Category
Computer and Information Sciences
Identifiers
URN: urn:nbn:se:su:diva-184406
DOI: 10.1080/03009734.2020.1792010
ISI: 000550968900001
PubMedID: 32696698
OAI: oai:DiVA.org:su-184406
DiVA, id: diva2:1473913
Available from: 2020-10-07 Created: 2020-10-07 Last updated: 2022-02-25 Bibliographically approved

Authority records

Dalianis, Hercules
