Publications (10 of 106)
Ngo, P., Tejedor, M., Olsen Svenning, T., Chomutare, T., Budrionis, A. & Dalianis, H. (2024). Deidentifying a Norwegian clinical corpus - An effort to create a privacy-preserving Norwegian large clinical language model. In: Proceedings of the CALD-pseudo Workshop at the 18th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2024. Paper presented at the 18th Conference of the European Chapter of the Association for Computational Linguistics (EACL), 17-22 March 2024, St. Julians, Malta (pp. 37-43). Association for Computational Linguistics
2024 (English). In: Proceedings of the CALD-pseudo Workshop at the 18th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2024, Association for Computational Linguistics, 2024, p. 37-43. Conference paper, Published paper (Refereed)
Abstract [en]

This study discusses the methods and challenges of deidentifying and pseudonymizing Norwegian clinical text for research purposes. The results of the NorDeid tool for deidentification and pseudonymization on different types of protected health information were evaluated and discussed, as well as the extension of its functionality with regular expressions to identify specific types of sensitive information. This research used a clinical corpus of adult patients treated in a gastro-surgical department in Norway, which contains approximately nine million clinical notes. The study also highlights the challenges posed by the unique language and clinical terminology of Norway and emphasizes the importance of protecting privacy and the need for customized approaches to meet legal and research requirements.

Place, publisher, year, edition, pages
Association for Computational Linguistics, 2024
National Category
Natural Language Processing
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-231309 (URN)
Conference
The 18th Conference of the European Chapter of the Association for Computational Linguistics (EACL), 17-22 March 2024, St. Julians, Malta.
Available from: 2024-06-18 Created: 2024-06-18 Last updated: 2025-02-07. Bibliographically approved
Lamproudis, A., Mora, S., Olsen Svenning, T., Torsvik, T., Chomutare, T., Dinh Ngo, P. & Dalianis, H. (2024). De-identifying Norwegian Clinical Text using Resources from Swedish and Danish. In: AMIA Symposium, 2023. Paper presented at AMIA 2023 Annual Symposium, New Orleans, USA, November 11-15, 2023 (pp. 456-464). American Medical Informatics Association (AMIA)
2024 (English). In: AMIA Symposium, 2023, American Medical Informatics Association (AMIA), 2024, p. 456-464. Conference paper, Published paper (Refereed)
Abstract [en]

The lack of relevant annotated datasets represents one key limitation in the application of Natural Language Processing techniques to a broad range of tasks, among them Protected Health Information (PHI) identification in Norwegian clinical text. In this work, the possibility of exploiting resources from Swedish, a language very closely related to Norwegian, is explored. The Swedish dataset is annotated with PHI. Different processing and text augmentation techniques are evaluated, along with their impact on the final performance of the model. The augmentation techniques, such as injection and generation of both Norwegian and Scandinavian Named Entities into the Swedish training corpus, were shown to increase performance in the de-identification task for both Danish and Norwegian text. This trend was also confirmed by the evaluation of model performance on a sample of Norwegian gastro-surgical clinical text.

Place, publisher, year, edition, pages
American Medical Informatics Association (AMIA), 2024
Series
AMIA Annual Symposium proceedings, ISSN 1559-4076, E-ISSN 1942-597X
National Category
Natural Language Processing
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-225839 (URN); 38222432 (PubMedID); 2-s2.0-85182546946 (Scopus ID)
Conference
AMIA 2023 Annual Symposium, New Orleans, USA, November 11-15, 2023
Available from: 2024-01-23 Created: 2024-01-23 Last updated: 2025-02-24. Bibliographically approved
Vakili, T., Henriksson, A. & Dalianis, H. (2024). End-to-End Pseudonymization of Fine-Tuned Clinical BERT Models: Privacy Preservation with Maintained Data Utility. BMC Medical Informatics and Decision Making, Article ID 162.
2024 (English). In: BMC Medical Informatics and Decision Making, E-ISSN 1472-6947, article id 162. Article in journal (Refereed) Published
Abstract [en]

Many state-of-the-art results in natural language processing (NLP) rely on large pre-trained language models (PLMs). These models consist of large amounts of parameters that are tuned using vast amounts of training data. These factors cause the models to memorize parts of their training data, making them vulnerable to various privacy attacks. This is cause for concern, especially when these models are applied in the clinical domain, where data are very sensitive.

One privacy-preserving technique that aims to mitigate these problems is training data pseudonymization. This technique automatically identifies and replaces sensitive entities with realistic but non-sensitive surrogates. Pseudonymization has yielded promising results in previous studies. However, no previous study has applied pseudonymization to both the pre-training data of PLMs and the fine-tuning data used to solve clinical NLP tasks.

This study evaluates the predictive performance effects of end-to-end pseudonymization of clinical BERT models on five clinical NLP tasks compared to pre-training and fine-tuning on unaltered sensitive data. A large number of statistical tests are performed, revealing minimal harm to performance when using pseudonymized fine-tuning data. The results also find no deterioration from end-to-end pseudonymization of pre-training and fine-tuning data. These results demonstrate that pseudonymizing training data to reduce privacy risks can be done without harming data utility for training PLMs.

Keywords
Natural language processing, language models, BERT, electronic health records, clinical text, de-identification, pseudonymization, privacy preservation, Swedish
National Category
Natural Language Processing
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-232099 (URN); 10.1186/s12911-024-02546-8 (DOI); 38915012 (PubMedID); 2-s2.0-85196757461 (Scopus ID)
Available from: 2024-07-24 Created: 2024-07-24 Last updated: 2025-02-07. Bibliographically approved
Chomutare, T., Lamproudis, A., Budrionis, A., Svenning, T. O., Hind, L. I., Ngo, P. D., . . . Dalianis, H. (2024). Improving Quality of ICD-10 (International Statistical Classification of Diseases, Tenth Revision) Coding Using AI: Protocol for a Crossover Randomized Controlled Trial. JMIR Research Protocols, 13, Article ID e54593.
2024 (English). In: JMIR Research Protocols, E-ISSN 1929-0748, Vol. 13, article id e54593. Article in journal (Refereed) Published
Abstract [en]

Background: Computer-assisted clinical coding (CAC) tools are designed to help clinical coders assign standardized codes, such as the ICD-10 (International Statistical Classification of Diseases, Tenth Revision), to clinical texts, such as discharge summaries. Maintaining the integrity of these standardized codes is important both for the functioning of health systems and for ensuring data used for secondary purposes are of high quality. Clinical coding is an error-prone, cumbersome task, and the complexity of modern classification systems such as the ICD-11 (International Classification of Diseases, Eleventh Revision) presents significant barriers to implementation. To date, there have only been a few user studies; therefore, our understanding is still limited regarding the role CAC systems can play in reducing the burden of coding and improving the overall quality of coding.

Objective: The objective of the user study is to generate both qualitative and quantitative data for measuring the usefulness of a CAC system, Easy-ICD, that was developed for recommending ICD-10 codes. Specifically, our goal is to assess whether our tool can reduce the burden on clinical coders and also improve coding quality.

Methods: The user study is based on a crossover randomized controlled trial study design, where we measure the performance of clinical coders when they use our CAC tool versus when they do not. Performance is measured by the time it takes them to assign codes to both simple and complex clinical texts as well as the coding quality, that is, the accuracy of code assignment.

Results: We expect the study to provide us with a measurement of the effectiveness of the CAC system compared to manual coding processes, both in terms of time use and coding quality. Positive outcomes from this study will imply that CAC tools hold the potential to reduce the burden on health care staff and will have major implications for the adoption of artificial intelligence-based CAC innovations to improve coding practice. Results are expected to be published in summer 2024.

Conclusions: The planned user study promises a greater understanding of the impact CAC systems might have on clinical coding in real-life settings, especially with regard to coding time and quality. Further, the study may add new insights on how to meaningfully exploit current clinical text mining capabilities, with a view to reducing the burden on clinical coders, thus lowering the barriers and paving a more sustainable path to the adoption of modern coding systems, such as the new ICD-11.

Keywords
International Classification of Diseases, Tenth Revision, ICD-10, International Classification of Diseases, Eleventh Revision, ICD-11, Easy-ICD, clinical coding, artificial intelligence, machine learning, deep learning
National Category
Health Sciences
Identifiers
urn:nbn:se:su:diva-227758 (URN); 10.2196/54593 (DOI); 001186220500001 (); 38470476 (PubMedID); 2-s2.0-85188075439 (Scopus ID)
Available from: 2024-03-26 Created: 2024-03-26 Last updated: 2024-03-26. Bibliographically approved
Ahrenberg, L., Ainiala, T., Aldrin, E., Holdt, Š. A., Caines, A., Dalianis, H., . . . Vu, X.-S. (2024). Introduction. In: Elena Volodina, David Alfter, Simon Dobnik, Therese Lindström Tiedemann, Ricardo Muñoz Sánchez, Maria Irena Szawerna, Xuan-Son Vu (Ed.), Proceedings of the Workshop on Computational Approaches to Language Data Pseudonymization (CALD-pseudo 2024). Paper presented at Workshop on Computational Approaches to Language Data Pseudonymization (CALD-pseudo 2024), March 2024, St. Julian’s, Malta (pp. ii-iii).
2024 (English). In: Proceedings of the Workshop on Computational Approaches to Language Data Pseudonymization (CALD-pseudo 2024) / [ed] Elena Volodina, David Alfter, Simon Dobnik, Therese Lindström Tiedemann, Ricardo Muñoz Sánchez, Maria Irena Szawerna, Xuan-Son Vu, 2024, p. ii-iii. Conference paper (Refereed)
National Category
Information Systems, Social aspects
Identifiers
urn:nbn:se:su:diva-236177 (URN); 2-s2.0-85190584439 (Scopus ID)
Conference
Workshop on Computational Approaches to Language Data Pseudonymization (CALD-pseudo 2024), March 2024, St. Julian’s, Malta.
Available from: 2024-12-03 Created: 2024-12-03 Last updated: 2024-12-03. Bibliographically approved
Lamproudis, A., Olsen Svenning, T., Torsvik, T., Chomutare, T., Budrionis, A., Dinh Ngo, P., . . . Dalianis, H. (2024). Using a Large Open Clinical Corpus for Improved ICD-10 Diagnosis Coding. In: AMIA Symposium, 2023. Paper presented at AMIA 2023 Annual Symposium, New Orleans, USA, November 11-15, 2023 (pp. 465-473). American Medical Informatics Association (AMIA)
2024 (English). In: AMIA Symposium, 2023, American Medical Informatics Association (AMIA), 2024, p. 465-473. Conference paper, Published paper (Refereed)
Abstract [en]

With the recent advances in natural language processing and deep learning, the development of tools that can assist medical coders in ICD-10 diagnosis coding and increase their efficiency in coding discharge summaries is significantly more viable than before. To that end, one important component in the development of these models is the datasets used to train them. In this study, such datasets are presented, and it is shown that one of them can be used to develop a BERT-based language model that can consistently perform well in assigning ICD-10 codes to discharge summaries written in Swedish. Most importantly, it can be used in a coding support setup where a tool can recommend potential codes to the coders. This reduces the range of potential codes to consider and, in turn, reduces the workload of the coder. Moreover, the de-identified and pseudonymised dataset is open to use for academic users.

Place, publisher, year, edition, pages
American Medical Informatics Association (AMIA), 2024
Series
AMIA Annual Symposium proceedings, ISSN 1559-4076, E-ISSN 1942-597X
National Category
Natural Language Processing
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-225848 (URN); 38222373 (PubMedID); 2-s2.0-85182543990 (Scopus ID)
Conference
AMIA 2023 Annual Symposium, New Orleans, USA, November 11-15, 2023
Available from: 2024-01-23 Created: 2024-01-23 Last updated: 2025-02-24. Bibliographically approved
Berg, N. & Dalianis, H. (2024). Using BART to Automatically Generate Discharge Summaries from Swedish Clinical Text. In: Dina Demner-Fushman; Sophia Ananiadou; Paul Thompson; Brian Ondov (Ed.), Proceedings of the First Workshop on Patient-Oriented Language Processing (CL4Health) @ LREC-COLING 2024. Paper presented at LREC-COLING 2024, Patient-oriented language processing, 20 May 2024, Torino, Italy (pp. 246-252). Association for Computational Linguistics
2024 (English). In: Proceedings of the First Workshop on Patient-Oriented Language Processing (CL4Health) @ LREC-COLING 2024 / [ed] Dina Demner-Fushman; Sophia Ananiadou; Paul Thompson; Brian Ondov, Association for Computational Linguistics, 2024, p. 246-252. Conference paper, Published paper (Refereed)
Abstract [en]

Documentation is a regular part of contemporary healthcare practices and one such documentation task is the creation of a discharge summary, which summarizes a care episode. However, manually writing discharge summaries is a time-consuming task, and research has shown that discharge summaries are often lacking in quality in various respects. To alleviate this problem, text summarization methods could be applied to text from electronic health records, such as patient notes, to automatically create a discharge summary. Previous research has been conducted on this topic on text in various languages and with various methods, but no such research has been conducted on Swedish text. In this paper, four data sets extracted from a Swedish clinical corpus were used to fine-tune four BART language models to perform the task of summarizing Swedish patient notes into a discharge summary. Out of these models, the best performing model was manually evaluated by a senior, now retired, nurse and clinical coder. The evaluation results show that the best performing model produces discharge summaries of overall low quality. This is possibly due to issues in the data extracted from the Health Bank research infrastructure, which warrants further work on this topic.

Place, publisher, year, edition, pages
Association for Computational Linguistics, 2024
Keywords
Patient Discharge Summaries, text summarization, clinical text, Natural Language Processing, Transformer, BART, synthetic text, negative results
National Category
Natural Language Processing
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-231321 (URN)
Conference
LREC-COLING 2024, Patient-oriented language processing, 20 May 2024, Torino, Italy.
Available from: 2024-06-18 Created: 2024-06-18 Last updated: 2025-02-07. Bibliographically approved
Vakili, T., Hullmann, T., Henriksson, A. & Dalianis, H. (2024). When Is a Name Sensitive? Eponyms in Clinical Text and Implications for De-Identification. In: Elena Volodina, David Alfter, Simon Dobnik, Therese Lindström Tiedemann, Ricardo Muñoz Sánchez, Maria Irena Szawerna, Xuan-Son Vu (Ed.), Proceedings of the Workshop on Computational Approaches to Language Data Pseudonymization (CALD-pseudo 2024). Paper presented at Workshop on Computational Approaches to Language Data Pseudonymization (CALD-pseudo 2024), St. Julian’s, Malta, March 2024 (pp. 76-80). Association for Computational Linguistics (ACL)
2024 (English). In: Proceedings of the Workshop on Computational Approaches to Language Data Pseudonymization (CALD-pseudo 2024) / [ed] Elena Volodina, David Alfter, Simon Dobnik, Therese Lindström Tiedemann, Ricardo Muñoz Sánchez, Maria Irena Szawerna, Xuan-Son Vu, Association for Computational Linguistics (ACL), 2024, p. 76-80. Conference paper, Published paper (Refereed)
Abstract [en]

Clinical data, in the form of electronic health records, are rich resources that can be tapped using natural language processing. At the same time, they contain very sensitive information that must be protected. One strategy is to remove or obscure data using automatic de-identification. However, the detection of sensitive data can yield false positives. This is especially true for tokens that are similar in form to sensitive entities, such as eponyms. These names tend to refer to medical procedures or diagnoses rather than specific persons. Previous research has shown that automatic de-identification systems often misclassify eponyms as names, leading to a loss of valuable medical information. In this study, we estimate the prevalence of eponyms in a real Swedish clinical corpus. Furthermore, we demonstrate that modern transformer-based de-identification systems are more accurate in distinguishing between names and eponyms than previous approaches.

Place, publisher, year, edition, pages
Association for Computational Linguistics (ACL), 2024
National Category
Information Systems
Identifiers
urn:nbn:se:su:diva-236176 (URN); 2-s2.0-85190604145 (Scopus ID)
Conference
Workshop on Computational Approaches to Language Data Pseudonymization (CALD-pseudo 2024), St. Julian’s, Malta, March 2024.
Available from: 2024-12-12 Created: 2024-12-12 Last updated: 2024-12-12. Bibliographically approved
Valik, J. K., Ward, L., Tanushi, H., Johansson, A. F., Färnert, A., Mogensen, M. L., . . . Nauclér, P. (2023). Predicting sepsis onset using a machine learned causal probabilistic network algorithm based on electronic health records data. Scientific Reports, 13(1), Article ID 11760.
2023 (English). In: Scientific Reports, E-ISSN 2045-2322, Vol. 13, no 1, article id 11760. Article in journal (Refereed) Published
Abstract [en]

Sepsis is a leading cause of mortality and early identification improves survival. With increasing digitalization of health care data, automated sepsis prediction models hold promise to aid in prompt recognition. Most previous studies have focused on the intensive care unit (ICU) setting. Yet only a small proportion of sepsis develops in the ICU and there is an apparent clinical benefit in identifying patients earlier in the disease trajectory. In this cohort of 82,852 hospital admissions and 8038 sepsis episodes classified according to the Sepsis-3 criteria, we demonstrate that a machine learned score can predict sepsis onset within 48 h using sparse routine electronic health record data outside the ICU. Our score was based on a causal probabilistic network model, SepsisFinder, which has similarities with clinical reasoning. A prediction was generated hourly on all admissions, provided that a new variable was registered. Compared to the National Early Warning Score (NEWS2), which is an established method to identify sepsis, the SepsisFinder triggered earlier and had a higher area under receiver operating characteristic curve (AUROC) (0.950 vs. 0.872), as well as area under precision-recall curve (APR) (0.189 vs. 0.149). A machine learning comparator based on a gradient-boosting decision tree model had similar AUROC (0.949) and higher APR (0.239) than SepsisFinder but triggered later than both NEWS2 and SepsisFinder. The precision of SepsisFinder increased if screening was restricted to the earlier admission period and in episodes with bloodstream infection. Furthermore, the SepsisFinder signaled a median of 5.5 h prior to antibiotic administration. Identifying a high-risk population with this method could be used to tailor clinical interventions and improve patient care.

National Category
Infectious Medicine; General Practice
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-221264 (URN); 10.1038/s41598-023-38858-4 (DOI); 001034058100024 (); 37474597 (PubMedID); 2-s2.0-85165415402 (Scopus ID)
Available from: 2023-09-26 Created: 2023-09-26 Last updated: 2023-10-04. Bibliographically approved
Vakili, T. & Dalianis, H. (2023). Using Membership Inference Attacks to Evaluate Privacy-Preserving Language Modeling Fails for Pseudonymizing Data. In: 24th Nordic Conference on Computational Linguistics (NoDaLiDa). Paper presented at Nordic Conference on Computational Linguistics (pp. 318-323).
2023 (English). In: 24th Nordic Conference on Computational Linguistics (NoDaLiDa), 2023, p. 318-323. Conference paper, Published paper (Refereed)
Abstract [en]

Large pre-trained language models dominate the current state-of-the-art for many natural language processing applications, including the field of clinical NLP. Several studies have found that these can be susceptible to privacy attacks that are unacceptable in the clinical domain where personally identifiable information (PII) must not be exposed.

However, there is no consensus regarding how to quantify the privacy risks of different models. One prominent suggestion is to quantify these risks using membership inference attacks. In this study, we show that a state-of-the-art membership inference attack on a clinical BERT model fails to detect the privacy benefits from pseudonymizing data. This suggests that such attacks may be inadequate for evaluating token-level privacy preservation of PIIs.

Series
Northern European Association for Language Technology (NEALT), ISSN 1736-8197, E-ISSN 1736-6305 ; 52
National Category
Natural Language Processing
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-216681 (URN)
Conference
Nordic Conference on Computational Linguistics
Available from: 2023-04-24 Created: 2023-04-24 Last updated: 2025-02-07. Bibliographically approved
Identifiers
ORCID iD: orcid.org/0000-0003-0165-9926