Automatic Extraction and Classification of Patients’ Smoking Status from Free Text Using Natural Language Processing
Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
2016 (English). In: Value in Health, ISSN 1098-3015, E-ISSN 1524-4733, Vol. 19, no 7, A373. Article in journal, Meeting abstract (Refereed). Published
Abstract [en]

Objectives

Smoking is a leading cause of death worldwide and may introduce confounding in research based on real world data (RWD). Information on smoking is often documented in free text fields in Electronic Medical Records (EMRs), but structured RWD on smoking is sparse. The objective was to develop a machine learning algorithm for automatic classification of smoking status (smoker, ex-smoker, non-smoker and unknown status) in EMRs, and to validate its predictive accuracy against a rule-based method.

Methods

Thirty-two predictive models were trained with the Weka machine learning suite on a database of 85,000 classified sentences, varying sentence frequency, classifier type, tokenization and attribute selection. The models were evaluated using F-Score and Accuracy on out-of-sample test data comprising 8,500 sentences. An error weight matrix was used to select the best model: a weight was assigned to each type of misclassification and applied to the models' confusion matrices.
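The selection step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the weight values, the two confusion matrices, and the names `WEIGHTS`, `MODELS`, `weighted_error`, `accuracy` and `select_best` are all hypothetical, since the abstract does not report the actual weights or matrices.

```python
# Class labels as named in the abstract.
LABELS = ["smoker", "ex-smoker", "non-smoker", "unknown"]

# Hypothetical weights: WEIGHTS[i][j] is the cost of predicting LABELS[j]
# when the true class is LABELS[i]. Correct predictions cost nothing.
WEIGHTS = [
    [0, 1, 3, 2],
    [1, 0, 2, 2],
    [3, 2, 0, 1],
    [2, 2, 1, 0],
]


def weighted_error(confusion, weights=WEIGHTS):
    """Sum the confusion counts multiplied element-wise by the weights."""
    return sum(
        count * weight
        for conf_row, weight_row in zip(confusion, weights)
        for count, weight in zip(conf_row, weight_row)
    )


def accuracy(confusion):
    """Fraction of sentences on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total


def select_best(models):
    """Return the name of the model with the lowest weighted error."""
    return min(models, key=lambda name: weighted_error(models[name]))


# Made-up confusion matrices (rows: true class, columns: predicted class).
MODELS = {
    "model_a": [[95, 0, 5, 0], [0, 95, 5, 0], [5, 0, 95, 0], [0, 0, 0, 100]],
    "model_b": [[94, 6, 0, 0], [6, 94, 0, 0], [0, 0, 94, 6], [0, 0, 6, 94]],
}

best = select_best(MODELS)
```

In this toy data `model_a` has the higher plain accuracy, but its errors fall in the heavily weighted smoker/non-smoker cells, so the weighted criterion prefers `model_b`. That is the kind of trade-off an error weight matrix is meant to expose.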

Results

The best performing model was based on the Support Vector Machine (SVM) Sequential Minimal Optimization (SMO) classifier using a polynomial kernel with parameter C equal to 6 and a combination of unigrams and bigrams as tokens. Sentence frequency and attribute selection did not improve model performance. SMO achieved 98.25% accuracy and 0.982 F-Score versus 79.32% and 0.756, respectively, for the rule-based model.
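To make the feature representation concrete, the sketch below builds combined unigram and bigram tokens from a sentence and compares two sentences with a polynomial kernel. It is plain Python rather than Weka, and it rests on assumptions: the whitespace tokenizer, the example sentences, and the kernel `degree` and `offset` values are placeholders, because the abstract does not state them. The parameter C = 6 is the SVM complexity parameter and does not appear in the kernel itself.

```python
from collections import Counter


def unigram_bigram_tokens(sentence):
    """Return lowercase unigrams followed by adjacent-word bigrams."""
    words = sentence.lower().split()
    bigrams = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return words + bigrams


def featurize(sentence):
    """Bag-of-tokens count vector, stored sparsely as a Counter."""
    return Counter(unigram_bigram_tokens(sentence))


def polynomial_kernel(x, y, degree=2, offset=1.0):
    """Polynomial kernel (x . y + offset) ** degree on sparse count vectors.

    degree and offset are assumed values for illustration only.
    """
    dot = sum(count * y[token] for token, count in x.items())
    return (dot + offset) ** degree


a = featurize("Patient quit smoking")
b = featurize("Patient never smoked")
similarity = polynomial_kernel(a, b)
```

Bigrams such as "quit smoking" or "never smoked" carry negation and temporal cues that single words lose, which is one plausible reason the combined token set worked well here.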

Conclusions

A model using machine learning algorithms to automatically classify patients' smoking status was successfully developed. This algorithm would enable automatic assessment of smoking status directly from EMRs, obviating the need to extract complete case notes and classify them manually.

Place, publisher, year, edition, pages
2016. Vol. 19, no 7, A373
National Category
Information Systems
Research subject
Computer and Systems Sciences
Identifiers
URN: urn:nbn:se:su:diva-136579
DOI: 10.1016/j.jval.2016.09.158
OAI: oai:DiVA.org:su-136579
DiVA: diva2:1055443
Available from: 2016-12-12. Created: 2016-12-12. Last updated: 2017-02-09. Bibliographically approved

By author/editor
Dalianis, Hercules
By organisation
Department of Computer and Systems Sciences