2015 (English). In: 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA) / [ed] Eric Gaussier, Longbing Cao, Patrick Gallinari, James Kwok, Gabriella Pasi, Osmar Zaiane, IEEE, 2015. Conference paper, Published paper (Refereed)
Abstract [en]
The enormous amounts of data that are continuously recorded in electronic health record systems offer ample opportunities for data science applications to improve healthcare. There are, however, challenges involved in using such data for machine learning, such as high dimensionality and sparsity, as well as an inherent heterogeneity that does not allow the distinct types of clinical data to be treated in an identical manner. At the same time, there are similarities across data types that may be exploited, e.g., the possibility of representing some of them as sequences. Here, we apply the notions underlying distributional semantics, i.e., methods that model the meaning of words in semantic (vector) space on the basis of co-occurrence information, to four distinct types of clinical data: free-text notes, on the one hand, and clinical events, in the form of diagnosis codes, drug codes and measurements, on the other. Each semantic space contains continuous vector representations for every unique word and event, which can then be used to create representations of, e.g., care episodes that, in turn, can be exploited by the learning algorithm. This approach not only reduces sparsity, but also takes into account, and explicitly models, similarities between various items, and it does so in an entirely data-driven fashion. We report on a series of experiments using the random forest learning algorithm that demonstrate the effectiveness, in terms of accuracy and area under the ROC curve, of the proposed representation form over the commonly used bag-of-items counterpart. The experiments are conducted on 27 real datasets, each of which involves the (binary) classification task of detecting a particular adverse drug event. It is also shown that combining structured and unstructured data leads to significant improvements over using only one of them.
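As a rough illustration of the idea in the abstract, the sketch below builds vectors for clinical event codes from co-occurrence counts and averages them into a care-episode representation. It is not the authors' implementation: the abstract does not say which distributional model or composition method was used, so the plain co-occurrence counts, the window size, the averaging step, and the function names `cooccurrence_vectors`, `episode_vector` and `cosine` are all assumptions, and the code sequences are invented toy data.

```python
from math import sqrt


def cooccurrence_vectors(sequences, window=2):
    """Map each item to its co-occurrence counts with every vocabulary item.

    Assumption: raw counts within a symmetric window stand in for whatever
    distributional model the paper actually used.
    """
    vocab = sorted({item for seq in sequences for item in seq})
    index = {item: i for i, item in enumerate(vocab)}
    vectors = {item: [0.0] * len(vocab) for item in vocab}
    for seq in sequences:
        for pos, item in enumerate(seq):
            start = max(0, pos - window)
            stop = min(len(seq), pos + window + 1)
            for ctx_pos in range(start, stop):
                if ctx_pos != pos:
                    vectors[item][index[seq[ctx_pos]]] += 1.0
    return vocab, vectors


def episode_vector(episode, vectors, dim):
    """Average the vectors of the known items in an episode (assumed composition)."""
    known = [vectors[item] for item in episode if item in vectors]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Invented toy sequences of diagnosis-like codes, one list per patient history.
sequences = [
    ["I10", "E11", "N18"],
    ["I10", "E11", "I50"],
    ["J45", "J20", "J06"],
    ["J45", "J06", "J20"],
]
vocab, vectors = cooccurrence_vectors(sequences, window=2)

# Two episodes with no items in common: a bag-of-items view sees no overlap,
# while the averaged vectors can still be compared.
episode_a = episode_vector(["I10", "N18"], vectors, len(vocab))
episode_b = episode_vector(["E11", "I50"], vectors, len(vocab))
episode_c = episode_vector(["J45", "J20"], vectors, len(vocab))

print(cosine(episode_a, episode_b))
print(cosine(episode_a, episode_c))
```

In this toy data, `N18` and `I50` never appear in the same sequence, yet they receive similar vectors because they share contexts. That is the kind of similarity a bag-of-items representation cannot express. Episode vectors like these could then serve as input features for a learner such as a random forest.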
Place, publisher, year, edition, pages
IEEE, 2015
Keywords
distributional semantics, semantic space ensembles, heterogeneous data, electronic health records, adverse drug events, predictive modeling
National Category
Computer Sciences; Natural Language Processing
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-122462 (URN); 10.1109/DSAA.2015.7344867 (DOI); 978-1-4673-8272-4 (ISBN)
Conference
IEEE International Conference on Data Science and Advanced Analytics (DSAA), Paris, France, October 19-21, 2015
Projects
High-Performance Data Mining for Drug Effect Detection
Funder
Swedish Foundation for Strategic Research, IIS11-0053
2015-11-02, 2015-11-02, 2025-02-01. Bibliographically approved