Zhao, Jing
Publications (10 of 18)
Zhao, J., Papapetrou, P., Asker, L. & Boström, H. (2020). Corrigendum to ‘Learning from heterogeneous temporal data in electronic health records’. [J. Biomed. Inform. 65 (2017) 105–119]. Journal of Biomedical Informatics, 101, Article ID 103352.
2020 (English). In: Journal of Biomedical Informatics, ISSN 1532-0464, E-ISSN 1532-0480, Vol. 101, article id 103352. Article in journal (Other academic). Published
National Category
Computer Sciences
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-178462 (URN); 10.1016/j.jbi.2019.103352 (DOI)
Note

Refers to:

Jing Zhao, Panagiotis Papapetrou, Lars Asker, Henrik Boström

Learning from heterogeneous temporal data in electronic health records

Journal of Biomedical Informatics, Volume 65, January 2017, Pages 105-119

Available from: 2020-01-29 Created: 2020-01-29 Last updated: 2022-02-26. Bibliographically approved
Zhao, J., Papapetrou, P., Asker, L. & Boström, H. (2017). Learning from heterogeneous temporal data in electronic health records. Journal of Biomedical Informatics, 65, 105-119
2017 (English). In: Journal of Biomedical Informatics, ISSN 1532-0464, E-ISSN 1532-0480, Vol. 65, p. 105-119. Article in journal (Refereed). Published
Abstract [en]

Electronic health records contain large amounts of longitudinal data that are valuable for biomedical informatics research. The application of machine learning is a promising alternative to manual analysis of such data. However, the complex structure of the data, which includes clinical events that are unevenly distributed over time, poses a challenge for standard learning algorithms. Some approaches to modeling temporal data rely on extracting single values from time series; however, this leads to the loss of potentially valuable sequential information. How to better account for the temporality of clinical data, hence, remains an important research question. In this study, novel representations of temporal data in electronic health records are explored. These representations retain the sequential information, and are directly compatible with standard machine learning algorithms. The explored methods are based on symbolic sequence representations of time series data, which are utilized in a number of different ways. An empirical investigation, using 19 datasets comprising clinical measurements observed over time from a real database of electronic health records, shows that using a distance measure to random subsequences leads to substantial improvements in predictive performance compared to using the original sequences or clustering the sequences. Evidence is moreover provided on the quality of the symbolic sequence representation by comparing it to sequences that are generated using domain knowledge by clinical experts. The proposed method creates representations that better account for the temporality of clinical events, which is often key to prediction tasks in the biomedical domain.
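
The idea of describing each symbolic sequence by its distance to a set of random subsequences can be sketched as follows. This is only an illustration under stated assumptions: the letters stand for discretized measurement values, the distance is a sliding-window Hamming distance chosen for simplicity, and the function names are hypothetical. The abstract does not specify the distance measure or the sampling procedure actually used in the paper.

```python
import random


def min_window_distance(sequence, pattern):
    """Smallest Hamming distance between `pattern` and any window of
    `sequence` that has the same length as `pattern` (assumed measure)."""
    m = len(pattern)
    if len(sequence) < m:
        # Shorter sequence: compare the overlap and count the missing
        # tail positions as mismatches.
        overlap = sum(a != b for a, b in zip(sequence, pattern))
        return overlap + (m - len(sequence))
    return min(
        sum(a != b for a, b in zip(sequence[i:i + m], pattern))
        for i in range(len(sequence) - m + 1)
    )


def sample_subsequences(sequences, k, length, seed=0):
    """Draw k random contiguous subsequences of the given length."""
    rng = random.Random(seed)
    candidates = [s for s in sequences if len(s) >= length]
    if not candidates:
        raise ValueError("no sequence is long enough")
    patterns = []
    for _ in range(k):
        s = rng.choice(candidates)
        start = rng.randrange(len(s) - length + 1)
        patterns.append(s[start:start + length])
    return patterns


def to_feature_vectors(sequences, patterns):
    """One row per sequence, one column per random subsequence."""
    return [[min_window_distance(s, p) for p in patterns] for s in sequences]


# Toy symbolic sequences of unequal length (made-up data).
sequences = ["aabbc", "ccba", "abcabc", "b"]
patterns = sample_subsequences(sequences, k=3, length=2)
features = to_feature_vectors(sequences, patterns)
```

Each patient's variable-length sequence becomes a fixed-length numeric row, which is what makes the representation usable with standard learning algorithms.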

Keywords
random subsequence, time series classification, electronic health records, data mining, machine learning
National Category
Information Systems
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-137481 (URN); 10.1016/j.jbi.2016.11.006 (DOI); 000406235200008 ()
Available from: 2017-01-08 Created: 2017-01-08 Last updated: 2022-03-23. Bibliographically approved
Zhao, J. (2017). Learning Predictive Models from Electronic Health Records. (Doctoral dissertation). Stockholm: Department of Computer and Systems Sciences, Stockholm University
2017 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

The ongoing digitization of healthcare, which has been much accelerated by the widespread adoption of electronic health records, generates unprecedented amounts of clinical data in a readily computable form. This, in turn, affords great opportunities for making meaningful secondary use of clinical data in the endeavor to improve healthcare, as well as to support epidemiology and medical research. To that end, there is a need for techniques capable of effectively and efficiently analyzing large amounts of clinical data. While machine learning provides the necessary tools, learning effective predictive models from electronic health records comes with many challenges due to the complexity of the data. Electronic health records contain heterogeneous and longitudinal data that jointly provides a rich perspective of patient trajectories in the healthcare process. The diverse characteristics of the data need to be properly accounted for when learning predictive models from clinical data. However, how best to represent healthcare data for predictive modeling has been insufficiently studied. This thesis addresses several of the technical challenges involved in learning effective predictive models from electronic health records.

Methods are developed to address the challenges of (i) representing heterogeneous types of data, (ii) leveraging the concept hierarchy of clinical codes, and (iii) modeling the temporality of clinical events. The proposed methods are evaluated empirically in the context of detecting adverse drug events in electronic health records. Various representations of each type of data that account for its unique characteristics are investigated and it is shown that combining multiple representations yields improved predictive performance. It is also demonstrated how the information embedded in the concept hierarchy of clinical codes can be exploited, both for creating enriched feature spaces and for decomposing the predictive task. Moreover, incorporating temporal information leads to more effective predictive models by distinguishing between event occurrences in the patient history. Both single-point representations, using pre-assigned or learned temporal weights, and multivariate time series representations are shown to be more informative than representations in which temporality is ignored. Effective methods for representing heterogeneous and longitudinal data are key for enhancing and truly enabling meaningful secondary use of electronic health records through large-scale analysis of clinical data.

Place, publisher, year, edition, pages
Stockholm: Department of Computer and Systems Sciences, Stockholm University, 2017. p. 82
Series
Report Series / Department of Computer & Systems Sciences, ISSN 1101-8526 ; 17-001
Keywords
Data Science, Machine Learning, Predictive Modeling, Data Representation, Health Informatics, Electronic Health Records
National Category
Computer Sciences
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-137936 (URN); 978-91-7649-682-4 (ISBN); 978-91-7649-683-1 (ISBN)
Public defence
2017-03-02, Lilla hörsalen, NOD-huset, Borgarfjordsgatan 12, Kista, 13:00 (English)
Available from: 2017-02-07 Created: 2017-01-13 Last updated: 2022-02-28. Bibliographically approved
Henriksson, A., Zhao, J., Dalianis, H. & Boström, H. (2016). Ensembles of randomized trees using diverse distributed representations of clinical events. Paper presented at IEEE International Conference on Bioinformatics and Biomedicine 2015, Washington, DC, USA, 9-12 November 2015. BMC Medical Informatics and Decision Making, 16, 85-95, Article ID 69.
2016 (English). In: BMC Medical Informatics and Decision Making, E-ISSN 1472-6947, Vol. 16, p. 85-95, article id 69. Article in journal (Refereed). Published
Abstract [en]

Background: Learning deep representations of clinical events based on their distributions in electronic health records has been shown to allow for subsequent training of higher-performing predictive models compared to the use of shallow, count-based representations. The predictive performance may be further improved by utilizing multiple representations of the same events, which can be obtained by, for instance, manipulating the representation learning procedure. The question, however, remains how to make best use of a set of diverse representations of clinical events – modeled in an ensemble of semantic spaces – for the purpose of predictive modeling.

Methods: Three different ways of exploiting a set of (ten) distributed representations of four types of clinical events – diagnosis codes, drug codes, measurements, and words in clinical notes – are investigated in a series of experiments using ensembles of randomized trees. Here, the semantic space ensembles are obtained by varying the context window size in the representation learning procedure. The proposed method trains a forest wherein each tree is built from a bootstrap replicate of the training set whose entire original feature set is represented in a randomly selected set of semantic spaces – corresponding to the considered data types – of a given context window size.

Results: The proposed method significantly outperforms concatenating the multiple representations of the bagged dataset; it also significantly outperforms representing, for each decision tree, only a subset of the features in a randomly selected set of semantic spaces. A follow-up analysis indicates that the proposed method exhibits less diversity while significantly improving average tree performance. It is also shown that the size of the semantic space ensemble has a significant impact on predictive performance and that performance tends to improve as the size increases.

Conclusions: The strategy for utilizing a set of diverse distributed representations of clinical events when constructing ensembles of randomized trees has a significant impact on predictive performance. The most successful strategy – significantly outperforming the considered alternatives – involves randomly sampling distributed representations of the clinical events when building each decision tree in the forest.

Keywords
Random forest, Distributional semantics, Heterogeneous data, Electronic health records, Pharmacovigilance, Adverse drug events
National Category
Information Systems
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-136587 (URN); 10.1186/s12911-016-0309-0 (DOI); 000405363700001 ()
Conference
IEEE International Conference on Bioinformatics and Biomedicine 2015, Washington, DC, USA, 9-12 November 2015
Available from: 2016-12-12 Created: 2016-12-12 Last updated: 2023-07-24. Bibliographically approved
Zhao, J. & Henriksson, A. (2016). Learning temporal weights of clinical events using variable importance. Paper presented at IEEE International Conference on Bioinformatics and Biomedicine 2015, Washington, DC, USA, 9–12 November 2015. BMC Medical Informatics and Decision Making, 16(Suppl. 2), 111-121, Article ID 71.
2016 (English). In: BMC Medical Informatics and Decision Making, E-ISSN 1472-6947, Vol. 16, no Suppl. 2, p. 111-121, article id 71. Article in journal (Refereed). Published
Abstract [en]

Background: Longitudinal data sources, such as electronic health records (EHRs), are very valuable for monitoring adverse drug events (ADEs). However, ADEs are heavily under-reported in EHRs. Using machine learning algorithms to automatically detect patients that should have had ADEs reported in their health records is an efficient and effective solution. One of the challenges to that end is how to take into account the temporality of clinical events, which are time stamped in EHRs, and to provide these as features for machine learning algorithms to exploit. Previous research on this topic suggests that representing EHR data as a bag of temporally weighted clinical events is promising; however, the weights were in that case pre-assigned according to their time stamps, which is limited and potentially less accurate. This study therefore focuses on how to learn weights that effectively take into account the temporality and importance of clinical events for ADE detection.

Methods: Variable importance obtained from the random forest learning algorithm is used for extracting temporal weights. Two strategies are proposed for applying the learned weights: weighted aggregation and weighted sampling. The first strategy aggregates the weighted clinical events from different time windows to form new features; the second strategy retains the original features but samples them by using their weights as probabilities when building each tree in the forest. The predictive performance of random forest models using the learned weights with the two strategies is compared to using pre-assigned weights. In addition, to assess the sensitivity of the weight-learning procedure, weights from different granularity levels are evaluated and compared.

Results: In the weighted sampling strategy, using learned weights significantly improves the predictive performance, in comparison to using pre-assigned weights; however, there is no significant difference between them in the weighted aggregation strategy. Moreover, the granularity of the weight learning procedure has a significant impact on the former, but not on the latter.

Conclusions: Learning temporal weights is significantly beneficial in terms of predictive performance with the weighted sampling strategy. Moreover, weighted aggregation generally diminishes the impact of temporal weighting of the clinical events, irrespective of whether the weights are pre-assigned or learned.
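
The two strategies for applying temporal weights might look roughly like the sketch below. It assumes the weights are already available (learning them from random forest variable importance is not shown), and the event names, window labels, and function names are hypothetical. Sampling without replacement via Efraimidis-Spirakis keys is a substitute chosen here for brevity; the abstract only states that the weights are used as probabilities when building each tree.

```python
import random


def weighted_aggregation(counts_by_window, window_weights):
    """Collapse per-window event counts into one weighted value per event."""
    return {
        event: sum(window_weights[w] * c for w, c in per_window.items())
        for event, per_window in counts_by_window.items()
    }


def weighted_sampling(feature_weights, k, rng):
    """Pick up to k distinct features, favouring those with larger weights.

    Uses Efraimidis-Spirakis keys (u ** (1 / w)); features with zero
    weight are never picked.
    """
    keys = {
        name: rng.random() ** (1.0 / w)
        for name, w in feature_weights.items()
        if w > 0
    }
    return sorted(keys, key=keys.get, reverse=True)[:k]


# Made-up counts of two clinical events in two time windows.
counts_by_window = {
    "drug_A": {"0-30d": 2, "31-90d": 1},
    "diag_B": {"31-90d": 4},
}
window_weights = {"0-30d": 1.0, "31-90d": 0.5}
aggregated = weighted_aggregation(counts_by_window, window_weights)

# Original (event, window) features kept separate and sampled per tree.
feature_weights = {
    "drug_A@0-30d": 0.6,
    "drug_A@31-90d": 0.3,
    "diag_B@31-90d": 0.1,
    "unused": 0.0,
}
chosen = weighted_sampling(feature_weights, k=2, rng=random.Random(0))
```

Aggregation produces one feature per event and discards the window structure, whereas sampling keeps every (event, window) feature and lets the weights decide how often each one is offered to a tree.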

Keywords
Learning weights, Temporality, Adverse drug events, Electronic health records, Machine learning, Random forest, Pharmacovigilance
National Category
Information Systems
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-136593 (URN); 10.1186/s12911-016-0311-6 (DOI); 000405363700003 ()
Conference
IEEE International Conference on Bioinformatics and Biomedicine 2015, Washington, DC, USA, 9–12 November 2015
Available from: 2016-12-12 Created: 2016-12-12 Last updated: 2023-07-24. Bibliographically approved
Zhao, J., Henriksson, A. & Boström, H. (2015). Cascading Adverse Drug Event Detection in Electronic Health Records. In: 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA): Proceedings. Paper presented at 2015 IEEE International Conference on Data Science and Advanced Analytics, Paris, France, 19-21 October, 2015. IEEE Computer Society
2015 (English). In: 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA): Proceedings, IEEE Computer Society, 2015. Conference paper, Published paper (Refereed)
Abstract [en]

The ability to detect adverse drug events (ADEs) in electronic health records (EHRs) is useful in many medical applications, such as alerting systems that indicate when an ADE-specific diagnosis code should be assigned. Automating the detection of ADEs can be attempted by applying machine learning to existing, labeled EHR data. How to do this in an effective manner is, however, an open question. The issues addressed in this study concern the granularity of the classification task: (1) If we wish to predict the occurrence of ADE, is it advantageous to conflate the various ADE class labels prior to learning, or should they be merged post prediction? (2) If we wish to predict a family of ADEs or even a specific ADE, can the predictive performance be enhanced by dividing the classification task into a cascading scheme: predicting first, on a coarse level, whether there is an ADE or not, and, in the former case, followed by a more specific prediction on which family the ADE belongs to, and then finally a prediction on the specific ADE within that particular family? In this study, we conduct a series of experiments using a real, clinical dataset comprising healthcare episodes that have been assigned one of eight ADE-related diagnosis codes and a set of randomly extracted episodes that have not been assigned any ADE code. It is shown that, when distinguishing between ADEs and non-ADEs, merging the various ADE labels prior to learning leads to significantly higher predictive performance in terms of accuracy and area under ROC curve. A cascade of random forests is moreover constructed to determine either the family of ADEs or the specific class label; here, the performance is indeed enhanced compared to directly employing a one-step prediction. This study concludes that, if predictive performance is of primary importance, the cascading scheme should be the recommended approach over employing a one-step prediction for detecting ADEs in EHRs.
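
The cascading scheme described above (ADE or not, then family, then specific ADE) can be outlined as below. This is a sketch only: the paper uses random forests at each step, whereas the three predictors here are placeholder rules, and all labels, field names, and function names are hypothetical.

```python
def cascade_predict(episode, is_ade, predict_family, predict_specific):
    """Three-step cascade: ADE vs. non-ADE, then ADE family, then specific ADE.

    `is_ade` and `predict_family` are callables; `predict_specific` maps
    each family label to a callable that picks a class within that family.
    """
    if not is_ade(episode):
        return ("non-ADE", None, None)
    family = predict_family(episode)
    specific = predict_specific[family](episode)
    return ("ADE", family, specific)


# Placeholder predictors standing in for trained models (assumptions).
def is_ade(episode):
    return episode["risk"] > 0.5


def predict_family(episode):
    return "family_1" if episode["drug"] == "x" else "family_2"


predict_specific = {
    "family_1": lambda episode: "ade_1a",
    "family_2": lambda episode: "ade_2a",
}

low_risk = cascade_predict(
    {"risk": 0.1, "drug": "x"}, is_ade, predict_family, predict_specific
)
high_risk = cascade_predict(
    {"risk": 0.9, "drug": "y"}, is_ade, predict_family, predict_specific
)
```

A one-step alternative would train a single multi-class model over all labels; the cascade instead lets each stage specialize on a smaller decision.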

Place, publisher, year, edition, pages
IEEE Computer Society, 2015
Keywords
electronic health records, adverse drug events, predictive modeling, cascading
National Category
Information Systems
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-122795 (URN); 10.1109/DSAA.2015.7344869 (DOI); 978-1-4673-8272-4 (ISBN); 978-1-4673-8273-1 (ISBN)
Conference
2015 IEEE International Conference on Data Science and Advanced Analytics, Paris, France, 19-21 October, 2015
Available from: 2015-11-11 Created: 2015-11-10 Last updated: 2022-02-23. Bibliographically approved
Henelius, A., Puolamäki, K., Karlsson, I., Zhao, J., Asker, L., Boström, H. & Papapetrou, P. (2015). GoldenEye++: a Closer Look into the Black Box. In: Alexander Gammerman, Vladimir Vovk, Harris Papadopoulos (Ed.), Statistical Learning and Data Sciences: Proceedings. Paper presented at Third International Symposium, SLDS 2015, Egham, UK, April 20-23, 2015 (pp. 96-105). Springer
2015 (English). In: Statistical Learning and Data Sciences: Proceedings / [ed] Alexander Gammerman, Vladimir Vovk, Harris Papadopoulos, Springer, 2015, p. 96-105. Conference paper, Published paper (Refereed)
Abstract [en]

Models with high predictive performance are often opaque, i.e., they do not allow for direct interpretation, and are hence of limited value when the goal is to understand the reasoning behind predictions. A recently proposed algorithm, GoldenEye, allows detection of groups of interacting variables exploited by a model. We employed this technique in conjunction with random forests generated from data obtained from electronic patient records for the task of detecting adverse drug events (ADEs). We propose a refined version of the GoldenEye algorithm, called GoldenEye++, utilizing a more sensitive grouping metric. An empirical investigation comparing the two algorithms on 27 datasets related to detecting ADEs shows that the new version of the algorithm in several cases finds groups of medically relevant interacting attributes, corresponding to prescribed drugs, undetected by the previous version. This suggests that the GoldenEye++ algorithm can be a useful tool for finding novel (adverse) drug interactions.

Place, publisher, year, edition, pages
Springer, 2015
Series
Lecture Notes in Computer Science, ISSN 0302-9743, E-ISSN 1611-3349 ; 9047
Keywords
Classifiers, Randomization, Adverse drug events
National Category
Information Systems
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-122825 (URN); 10.1007/978-3-319-17091-6_5 (DOI); 000361990900005 (); 978-3-319-17090-9 (ISBN); 978-3-319-17091-6 (ISBN)
Conference
Third International Symposium, SLDS 2015, Egham, UK, April 20-23, 2015
Available from: 2015-11-11 Created: 2015-11-10 Last updated: 2022-02-23. Bibliographically approved
Zhao, J., Henriksson, A., Kvist, M., Asker, L. & Boström, H. (2015). Handling Temporality of Clinical Events for Drug Safety Surveillance. AMIA Annual Symposium Proceedings, 2015, 1371-1380
2015 (English). In: AMIA Annual Symposium Proceedings, ISSN 1559-4076, Vol. 2015, p. 1371-1380. Article in journal (Refereed). Published
Abstract [en]

Using longitudinal data in electronic health records (EHRs) for post-marketing adverse drug event (ADE) detection allows for monitoring patients throughout their medical history. Machine learning methods have been shown to be efficient and effective in screening health records and detecting ADEs. How best to exploit historical data, as encoded by clinical events in EHRs, is, however, not very well understood. In this study, three strategies for handling temporality of clinical events are proposed and evaluated using an EHR database from Stockholm, Sweden. The random forest learning algorithm is applied to predict fourteen ADEs using clinical events collected from different lengths of patient history. The results show that, in general, including longer patient history leads to improved predictive performance, and that assigning weights to events according to time distance from the ADE yields the biggest improvement.

Keywords
drug safety surveillance, pharmacovigilance, adverse drug events, electronic health records, temporality, predictive modeling
National Category
Information Systems
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-123950 (URN)
Available from: 2015-12-09 Created: 2015-12-09 Last updated: 2022-02-23. Bibliographically approved
Henriksson, A., Zhao, J., Boström, H. & Dalianis, H. (2015). Modeling Electronic Health Records in Ensembles of Semantic Spaces for Adverse Drug Event Detection. In: Jun (Luke) Huan et al. (Ed.), 2015 IEEE International Conference on Bioinformatics and Biomedicine: Proceedings. Paper presented at IEEE BIBM, International Conference on Bioinformatics and Biomedicine, U.S.A, Washington, D.C., 09-12 November 2015 (pp. 343-350). IEEE Computer Society
2015 (English). In: 2015 IEEE International Conference on Bioinformatics and Biomedicine: Proceedings / [ed] Jun (Luke) Huan et al., IEEE Computer Society, 2015, p. 343-350. Conference paper, Published paper (Refereed)
Abstract [en]

Electronic health records (EHRs) are emerging as a potentially valuable source for pharmacovigilance; however, adverse drug events (ADEs), which can be encoded in EHRs by a set of diagnosis codes, are heavily underreported. Alerting systems, able to detect potential ADEs on the basis of patient-specific EHR data, would help to mitigate this problem. To that end, the use of machine learning has proven to be both efficient and effective; however, challenges remain in representing the heterogeneous EHR data, which moreover tends to be high-dimensional and exceedingly sparse, in a manner conducive to learning high-performing predictive models. Prior work has shown that distributional semantics – that is, natural language processing methods that, traditionally, model the meaning of words in semantic (vector) space on the basis of co-occurrence information – can be exploited to create effective representations of sequential EHR data, not only free-text in clinical notes but also various clinical events such as diagnoses, drugs and measurements. When modeling data in semantic space, an important design decision concerns the size of the context window around an object of interest, which governs the scope of co-occurrence information that is taken into account and affects the composition of the resulting semantic space. Here, we report on experiments conducted on 27 clinical datasets, demonstrating that performance can be significantly improved by modeling EHR data in ensembles of semantic spaces, consisting of multiple semantic spaces built with different context window sizes. A follow-up investigation is conducted to study the impact on predictive performance as increasingly more semantic spaces are included in the ensemble, demonstrating that accuracy tends to improve with the number of semantic spaces, albeit not monotonically so. Finally, a number of different strategies for combining the semantic spaces are explored, demonstrating the advantage of early (feature) fusion over late (classifier) fusion. Ensembles of semantic spaces allow multiple views of (sparse) data to be captured (densely) and thereby enable improved performance to be obtained on the task of detecting ADEs in EHRs.

Place, publisher, year, edition, pages
IEEE Computer Society, 2015
Keywords
distributional semantics, semantic space ensembles, ensemble models, electronic health records, adverse drug events, predictive modeling, information fusion
National Category
Natural Language Processing; Computer Sciences
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-122463 (URN); 10.1109/BIBM.2015.7359705 (DOI)
Conference
IEEE BIBM, International Conference on Bioinformatics and Biomedicine, U.S.A, Washington, D.C., 09-12 November 2015
Projects
High-Performance Data Mining for Drug Effect Detection
Funder
Swedish Foundation for Strategic Research, IIS11-0053
Available from: 2015-11-02 Created: 2015-11-02 Last updated: 2025-02-01. Bibliographically approved
Henriksson, A., Zhao, J., Boström, H. & Dalianis, H. (2015). Modeling Heterogeneous Clinical Sequence Data in Semantic Space for Adverse Drug Event Detection. In: Eric Gaussier, Longbing Cao, Patrick Gallinari, James Kwok, Gabriella Pasi, Osmar Zaiane (Ed.), 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA). Paper presented at IEEE International Conference on Data Science and Advanced Analytics (DSAA), Paris, France, October 19-21, 2015. IEEE
2015 (English). In: 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA) / [ed] Eric Gaussier, Longbing Cao, Patrick Gallinari, James Kwok, Gabriella Pasi, Osmar Zaiane, IEEE, 2015. Conference paper, Published paper (Refereed)
Abstract [en]

The enormous amounts of data that are continuously recorded in electronic health record systems offer ample opportunities for data science applications to improve healthcare. There are, however, challenges involved in using such data for machine learning, such as high dimensionality and sparsity, as well as an inherent heterogeneity that does not allow the distinct types of clinical data to be treated in an identical manner. On the other hand, there are also similarities across data types that may be exploited, e.g., the possibility of representing some of them as sequences. Here, we apply the notions underlying distributional semantics, i.e., methods that model the meaning of words in semantic (vector) space on the basis of co-occurrence information, to four distinct types of clinical data: free-text notes, on the one hand, and clinical events, in the form of diagnosis codes, drug codes and measurements, on the other hand. Each semantic space contains continuous vector representations for every unique word and event, which can then be used to create representations of, e.g., care episodes that, in turn, can be exploited by the learning algorithm. This approach not only reduces sparsity, but also takes into account, and explicitly models, similarities between various items, and it does so in an entirely data-driven fashion. Here, we report on a series of experiments using the random forest learning algorithm that demonstrate the effectiveness, in terms of accuracy and area under ROC curve, of the proposed representation form over the commonly used bag-of-items counterpart. The experiments are conducted on 27 real datasets, each of which involves the (binary) classification task of detecting a particular adverse drug event. It is also shown that combining structured and unstructured data leads to significant improvements over using only one of them.

Place, publisher, year, edition, pages
IEEE, 2015
Keywords
distributional semantics, semantic space ensembles, heterogeneous data, electronic health records, adverse drug events, predictive modeling
National Category
Computer Sciences; Natural Language Processing
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-122462 (URN); 10.1109/DSAA.2015.7344867 (DOI); 978-1-4673-8272-4 (ISBN)
Conference
IEEE International Conference on Data Science and Advanced Analytics (DSAA), Paris, France, October 19-21, 2015
Projects
High-Performance Data Mining for Drug Effect Detection
Funder
Swedish Foundation for Strategic Research, IIS11-0053
Available from: 2015-11-02 Created: 2015-11-02 Last updated: 2025-02-01. Bibliographically approved