Lamproudis, Anastasios
Publications (7 of 7)
Lamproudis, A. & Henriksson, A. (2023). On the Impact of the Vocabulary for Domain-Adaptive Pretraining of Clinical Language Models. In: Ana Cecília A. Roque; Denis Gracanin; Ronny Lorenz; Athanasios Tsanas; Nathalie Bier; Ana Fred; Hugo Gamboa (Ed.), Biomedical Engineering Systems and Technologies: 15th International Joint Conference, BIOSTEC 2022, Virtual Event, February 9–11, 2022, Revised Selected Papers (pp. 315-332). Springer Nature
On the Impact of the Vocabulary for Domain-Adaptive Pretraining of Clinical Language Models
2023 (English). In: Biomedical Engineering Systems and Technologies: 15th International Joint Conference, BIOSTEC 2022, Virtual Event, February 9–11, 2022, Revised Selected Papers / [ed] Ana Cecília A. Roque; Denis Gracanin; Ronny Lorenz; Athanasios Tsanas; Nathalie Bier; Ana Fred; Hugo Gamboa, Springer Nature, 2023, p. 315-332. Chapter in book (Refereed)
Abstract [en]

Pretrained language models tailored to the target domain may improve predictive performance on downstream tasks. Such domain-specific language models are typically developed by pretraining on in-domain data, either from scratch or by continuing to pretrain an existing generic language model. Here, we focus on the latter situation and study the impact of the vocabulary for domain-adaptive pretraining of clinical language models. In particular, we investigate the impact of (i) adapting the vocabulary to the target domain, (ii) using different vocabulary sizes, and (iii) creating initial representations for clinical terms not present in the general-domain vocabulary based on subword averaging. The results confirm the benefits of adapting the vocabulary of the language model to the target domain; however, downstream performance is not particularly sensitive to the choice of vocabulary size, and the benefits of subword averaging are reduced after a modest amount of domain-adaptive pretraining.
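The subword-averaging idea in point (iii) can be illustrated with a short sketch: when a clinical term is missing from the generic vocabulary, its new embedding is initialized as the mean of the embeddings of the generic subwords it decomposes into, while shared tokens are copied as-is. The model and tokenizer names below are placeholder assumptions, not the exact resources used in the paper.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

generic_name = "KB/bert-base-swedish-cased"            # assumed generic Swedish BERT
model = AutoModelForMaskedLM.from_pretrained(generic_name)
generic_tok = AutoTokenizer.from_pretrained(generic_name)
clinical_tok = AutoTokenizer.from_pretrained("path/to/clinical-tokenizer")  # hypothetical in-domain vocabulary

old_emb = model.get_input_embeddings().weight.data
new_emb = torch.zeros(len(clinical_tok), old_emb.size(1))

for token, new_id in clinical_tok.get_vocab().items():
    old_id = generic_tok.convert_tokens_to_ids(token)
    if old_id != generic_tok.unk_token_id:
        # token also exists in the generic vocabulary: copy its embedding
        new_emb[new_id] = old_emb[old_id]
    else:
        # unseen clinical term: average the embeddings of its generic subwords
        pieces = generic_tok.tokenize(token.replace("##", ""))
        ids = generic_tok.convert_tokens_to_ids(pieces)
        new_emb[new_id] = old_emb[ids].mean(dim=0)

model.resize_token_embeddings(len(clinical_tok))
model.get_input_embeddings().weight.data.copy_(new_emb)
```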

Place, publisher, year, edition, pages
Springer Nature, 2023
Series
Communications in Computer and Information Science, ISSN 1865-0929, E-ISSN 1865-0937 ; 1814
Keywords
Natural language processing, Clinical language models, Domain-adaptive pretraining, Clinical text
National Category
Natural Language Processing
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-224979 (URN); 10.1007/978-3-031-38854-5_16 (DOI); 2-s2.0-85172242060 (Scopus ID); 978-3-031-38853-8 (ISBN)
Available from: 2024-01-03 Created: 2024-01-03 Last updated: 2025-02-07. Bibliographically approved
Vakili, T., Lamproudis, A., Henriksson, A. & Dalianis, H. (2022). Downstream Task Performance of BERT Models Pre-Trained Using Automatically De-Identified Clinical Data. In: Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022): . Paper presented at Conference on Language Resources and Evaluation (LREC 2022), Marseille, France, 21-23 June 2022 (pp. 4245-4252). European Language Resources Association
Downstream Task Performance of BERT Models Pre-Trained Using Automatically De-Identified Clinical Data
2022 (English). In: Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022), European Language Resources Association, 2022, p. 4245-4252. Conference paper, Published paper (Refereed)
Abstract [en]

Automatic de-identification is a cost-effective and straightforward way of removing large amounts of personally identifiable information from large and sensitive corpora. However, these systems also introduce errors into datasets due to their imperfect precision. These corruptions of the data may negatively impact the utility of the de-identified dataset. This paper de-identifies a very large clinical corpus in Swedish either by removing entire sentences containing sensitive data or by replacing sensitive words with realistic surrogates. These two datasets are used to perform domain adaptation of a general Swedish BERT model. The impact of the de-identification techniques is assessed by training and evaluating the models using six clinical downstream tasks. The results are then compared to a similar BERT model domain-adapted using an unaltered version of the clinical corpus. The results show that using an automatically de-identified corpus for domain adaptation does not negatively impact downstream performance. We argue that automatic de-identification is an efficient way of reducing the privacy risks of domain-adapted models and that the models created in this paper should be safe to distribute to other academic researchers.
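As a rough illustration of the surrogate-replacement variant described above, the sketch below swaps detected PHI spans for realistic placeholders before the text is used for pretraining. The PHI tagger output and surrogate lists are hypothetical stand-ins, not the de-identification system evaluated in the paper.

```python
import random

# Illustrative surrogate lists keyed by an assumed PHI label set
SURROGATES = {
    "NAME": ["Anna Andersson", "Erik Nilsson", "Maria Larsson"],
    "DATE": ["2015-03-12", "2018-11-02"],
}

def pseudonymize(text, phi_spans):
    """phi_spans: list of (start, end, label) produced by an NER-based de-identifier."""
    out, prev = [], 0
    for start, end, label in sorted(phi_spans):
        out.append(text[prev:start])
        out.append(random.choice(SURROGATES.get(label, ["[REDACTED]"])))
        prev = end
    out.append(text[prev:])
    return "".join(out)

note = "Pat. Sven Svensson inlagd 2021-04-03."
spans = [(5, 18, "NAME"), (26, 36, "DATE")]   # hypothetical tagger output
print(pseudonymize(note, spans))
```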

Place, publisher, year, edition, pages
European Language Resources Association, 2022
Keywords
Privacy-preserving machine learning, pseudonymization, de-identification, Swedish clinical text, pre-trained language models, BERT, downstream tasks, NER, multi-label classification
National Category
Natural Language Processing
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-207395 (URN)
Conference
Conference on Language Resources and Evaluation (LREC 2022), Marseille, France, 21-23 June 2022
Available from: 2022-07-15 Created: 2022-07-15 Last updated: 2025-02-07
Lamproudis, A., Henriksson, A. & Dalianis, H. (2022). Evaluating Pretraining Strategies for Clinical BERT Models. In: Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022): . Paper presented at Conference on Language Resources and Evaluation (LREC 2022), 21-23 June 2022, Marseille, France. (pp. 410-416). European Language Resources Association
Evaluating Pretraining Strategies for Clinical BERT Models
2022 (English). In: Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022), European Language Resources Association, 2022, p. 410-416. Conference paper, Published paper (Refereed)
Abstract [en]

Research suggests that using generic language models in specialized domains may be sub-optimal due to significant domain differences. As a result, various strategies for developing domain-specific language models have been proposed, including techniques for adapting an existing generic language model to the target domain, e.g. through various forms of vocabulary modifications and continued domain-adaptive pretraining with in-domain data. Here, an empirical investigation is carried out in which various strategies for adapting a generic language model to the clinical domain are compared to pretraining a pure clinical language model. Three clinical language models for Swedish, pretrained for up to ten epochs, are fine-tuned and evaluated on several downstream tasks in the clinical domain. A comparison of the language models’ downstream performance over the training epochs is conducted. The results show that the domain-specific language models outperform a general-domain language model, although there is little difference in performance between the various clinical language models. However, compared to pretraining a pure clinical language model with only in-domain data, leveraging and adapting an existing general-domain language model requires fewer epochs of pretraining with in-domain data.
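A minimal sketch of the continued, domain-adaptive pretraining setup compared in the paper might look as follows: masked language modelling on in-domain clinical text, starting from a generic Swedish checkpoint. The checkpoint name, file paths and hyperparameters are assumptions for illustration only.

```python
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import load_dataset

name = "KB/bert-base-swedish-cased"            # assumed generic starting checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForMaskedLM.from_pretrained(name)

# one line of clinical text per row in an assumed local file
ds = load_dataset("text", data_files={"train": "clinical_notes.txt"})["train"]
ds = ds.map(lambda b: tok(b["text"], truncation=True, max_length=128),
            batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(tokenizer=tok, mlm_probability=0.15)
args = TrainingArguments(output_dir="clinical-bert", num_train_epochs=1,
                         per_device_train_batch_size=16)
Trainer(model=model, args=args, train_dataset=ds, data_collator=collator).train()
```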

Place, publisher, year, edition, pages
European Language Resources Association, 2022
Keywords
language models, domain-adaptive pretraining, Swedish clinical text
National Category
Natural Language Processing
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-207397 (URN)
Conference
Conference on Language Resources and Evaluation (LREC 2022), 21-23 June 2022, Marseille, France.
Available from: 2022-07-15 Created: 2022-07-15 Last updated: 2025-02-07. Bibliographically approved
Lamproudis, A., Henriksson, A., Karlsson Valik, J. & Nauclér, P. (2022). Improving the Timeliness of Early Prediction Models for Sepsis through Utility Optimization. In: 2022 IEEE 34th International Conference on Tools with Artificial Intelligence (ICTAI): . Paper presented at International Conference on Tools with Artificial Intelligence (ICTAI), 31 October – 2 November 2022, Macao, China. (pp. 1062-1069).
Improving the Timeliness of Early Prediction Models for Sepsis through Utility Optimization
2022 (English). In: 2022 IEEE 34th International Conference on Tools with Artificial Intelligence (ICTAI), 2022, p. 1062-1069. Conference paper, Published paper (Refereed)
Abstract [en]

Early prediction of sepsis can facilitate early intervention and lead to improved clinical outcomes. However, for early prediction models to be clinically useful, and also to reduce alarm fatigue, detection of sepsis needs to be timely with respect to onset, being neither too late nor too early. In this paper, we propose a utility-based loss function for training early prediction models, where utility is defined by a function according to when the predictions are made and in relation to onset as well as to specified early, optimal and late time points. Two versions of the utility-based loss function are evaluated and compared to a cross-entropy loss baseline. Experimental results, using real clinical data from electronic health records, show that incorporating the utility-based loss function leads to superior multimodal early prediction models, detecting sepsis both more accurately and more timely. We argue that improving the timeliness of early prediction models is important for increasing their utility and acceptance in a clinical setting.
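One way to realize a utility-based loss of the kind described above is to weight a standard per-time-step cross-entropy by a piecewise function of the time to sepsis onset, defined by early, optimal and late bounds. The sketch below is an illustrative reconstruction under assumed bounds and weights, not the exact utility function from the paper.

```python
import torch
import torch.nn.functional as F

def utility_weight(t_to_onset, early=12.0, optimal=6.0, late=0.0):
    """Hours until onset -> weight in [0, 1]; assumed bounds, for illustration only."""
    if t_to_onset > early:                    # far too early: alarms have little utility
        return 0.1
    if t_to_onset > optimal:                  # ramp up towards the optimal point
        return 0.1 + 0.9 * (early - t_to_onset) / (early - optimal)
    if t_to_onset >= late:                    # optimal window before onset
        return 1.0
    return 0.3                                # after onset: late detection, reduced utility

def utility_loss(logits, labels, hours_to_onset):
    """Cross-entropy per time step, reweighted by the utility of predicting at that step."""
    ce = F.cross_entropy(logits, labels, reduction="none")
    w = torch.tensor([utility_weight(t) for t in hours_to_onset],
                     dtype=ce.dtype, device=ce.device)
    return (w * ce).mean()
```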

Series
Proceedings - International Conference on Tools with Artificial Intelligence (ICTAI), ISSN 1082-3409, E-ISSN 2375-0197
Keywords
Early prediction, sepsis, electronic health records, multimodal learning
National Category
Natural Language Processing
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-216829 (URN); 10.1109/ICTAI56018.2022.00162 (DOI)
Conference
International Conference on Tools with Artificial Intelligence (ICTAI), 31 October – 2 November 2022, Macao, China.
Available from: 2023-05-02 Created: 2023-05-02 Last updated: 2025-02-07. Bibliographically approved
Lamproudis, A., Henriksson, A. & Dalianis, H. (2022). Vocabulary Modifications for Domain-adaptive Pretraining of Clinical Language Models. In: Nathalie Bier; Ana Fred; Hugo Gamboa (Ed.), Proceedings of the 15th International Joint Conference on Biomedical Engineering Systems and Technologies - HEALTHINF: . Paper presented at The 15th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2022), 9 - 11 February, 2022, Online (pp. 180-188). SciTePress
Vocabulary Modifications for Domain-adaptive Pretraining of Clinical Language Models
2022 (English). In: Proceedings of the 15th International Joint Conference on Biomedical Engineering Systems and Technologies - HEALTHINF / [ed] Nathalie Bier; Ana Fred; Hugo Gamboa, SciTePress, 2022, p. 180-188. Conference paper, Published paper (Refereed)
Abstract [en]

Research has shown that using generic language models – specifically, BERT models – in specialized domains may be sub-optimal due to domain differences in language use and vocabulary. There are several techniques for developing domain-specific language models that leverage the use of existing generic language models, including continued and domain-adaptive pretraining with in-domain data. Here, we investigate a strategy based on using a domain-specific vocabulary, while leveraging a generic language model for initialization. The results demonstrate that domain-adaptive pretraining, in combination with a domain-specific vocabulary – as opposed to a general-domain vocabulary – yields improvements on two downstream clinical NLP tasks for Swedish. The results highlight the value of domain-adaptive pretraining when developing specialized language models and indicate that it is beneficial to adapt the vocabulary of the language model to the target domain prior to continued, domain-adaptive pretraining of a generic language model.

Place, publisher, year, edition, pages
SciTePress, 2022
Series
Biostec, ISSN 2184-349X, E-ISSN 2184-4305
Keywords
Natural Language Processing, Language Models, Domain-adaptive Pretraining, Clinical Text, Swedish
National Category
Information Systems
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-207403 (URN); 10.5220/0010893800003123 (DOI); 978-989-758-552-4 (ISBN)
Conference
The 15th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2022), 9 - 11 February, 2022, Online
Available from: 2022-07-15 Created: 2022-07-15 Last updated: 2022-08-23. Bibliographically approved
Lamproudis, A., Henriksson, A. & Dalianis, H. (2021). Developing a Clinical Language Model for Swedish: Continued Pretraining of Generic BERT with In-Domain Data. In: Galia Angelova; Maria Kunilovskaya; Ruslan Mitkov; Ivelina Nikolova-Koleva (Ed.), INTERNATIONAL CONFERENCE RECENT ADVANCES IN NATURAL LANGUAGE PROCESSING 2021: Deep Learning for Natural Language Processing Methods and Applications: PROCEEDINGS. Paper presented at International Conference Recent Advances in Natural Language Processing (RANLP'21), online, September 1-3, 2021 (pp. 790-797). Shoumen: INCOMA Ltd.
Developing a Clinical Language Model for Swedish: Continued Pretraining of Generic BERT with In-Domain Data
2021 (English). In: INTERNATIONAL CONFERENCE RECENT ADVANCES IN NATURAL LANGUAGE PROCESSING 2021: Deep Learning for Natural Language Processing Methods and Applications: PROCEEDINGS / [ed] Galia Angelova; Maria Kunilovskaya; Ruslan Mitkov; Ivelina Nikolova-Koleva, Shoumen: INCOMA Ltd., 2021, p. 790-797. Conference paper, Published paper (Refereed)
Abstract [en]

The use of pretrained language models, finetuned to perform a specific downstream task, has become widespread in NLP. Using a generic language model in specialized domains may, however, be sub-optimal due to differences in language use and vocabulary. In this paper, it is investigated whether an existing, generic language model for Swedish can be improved for the clinical domain through continued pretraining with clinical text.

The generic and domain-specific language models are fine-tuned and evaluated on three representative clinical NLP tasks: (i) identifying protected health information, (ii) assigning ICD-10 diagnosis codes to discharge summaries, and (iii) sentence-level uncertainty prediction. The results show that continued pretraining on in-domain data leads to improved performance on all three downstream tasks, indicating that there is a potential added value of domain-specific language models for clinical NLP.
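As an illustration of the first downstream task, the sketch below fine-tunes a BERT checkpoint for token-level identification of protected health information. The checkpoint name, PHI label set and example data are assumptions, not the study's actual setup.

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

labels = ["O", "B-NAME", "I-NAME", "B-DATE", "I-DATE"]          # assumed PHI tag set
name = "KB/bert-base-swedish-cased"                              # generic or clinical checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForTokenClassification.from_pretrained(name, num_labels=len(labels))

words = ["Patienten", "Sven", "Svensson", "inlagd", "2021-04-03"]
word_labels = [0, 1, 2, 0, 3]                                    # one label id per word

enc = tok(words, is_split_into_words=True, return_tensors="pt")
# align word-level labels to subword tokens; special tokens get -100 (ignored by the loss)
token_labels = [-100 if i is None else word_labels[i] for i in enc.word_ids()]
out = model(**enc, labels=torch.tensor([token_labels]))
out.loss.backward()                                              # gradient for one training step
```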

Place, publisher, year, edition, pages
Shoumen: INCOMA Ltd., 2021
Series
International Conference Recent Advances in Natural Language Processing, ISSN 1313-8502, E-ISSN 2603-2813
Keywords
natural language processing, language models, clinical text
National Category
Computer and Information Sciences
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-200467 (URN); 10.26615/978-954-452-072-4_090 (DOI); 978-954-452-072-4 (ISBN)
Conference
International Conference Recent Advances in Natural Language Processing (RANLP'21), online, September 1-3, 2021
Available from: 2022-01-05 Created: 2022-01-05 Last updated: 2022-01-28. Bibliographically approved
Remmer, S., Lamproudis, A. & Dalianis, H. (2021). Multi-label Diagnosis Classification of Swedish Discharge Summaries – ICD-10 Code Assignment Using KB-BERT. In: Galia Angelova; Maria Kunilovskaya; Ruslan Mitkov; Ivelina Nikolova-Koleva (Ed.), INTERNATIONAL CONFERENCE RECENT ADVANCES IN NATURAL LANGUAGE PROCESSING 2021: Deep Learning for Natural Language Processing Methods and Applications: PROCEEDINGS. Paper presented at International Conference Recent Advances in Natural Language Processing (RANLP'21), online, September 1-3, 2021 (pp. 1158-1166). Shoumen: INCOMA Ltd.
Multi-label Diagnosis Classification of Swedish Discharge Summaries – ICD-10 Code Assignment Using KB-BERT
2021 (English). In: INTERNATIONAL CONFERENCE RECENT ADVANCES IN NATURAL LANGUAGE PROCESSING 2021: Deep Learning for Natural Language Processing Methods and Applications: PROCEEDINGS / [ed] Galia Angelova; Maria Kunilovskaya; Ruslan Mitkov; Ivelina Nikolova-Koleva, Shoumen: INCOMA Ltd., 2021, p. 1158-1166. Conference paper, Published paper (Refereed)
Abstract [en]

The International Classification of Diseases (ICD) is a system for systematically recording patients’ diagnoses. Clinicians or professional coders assign ICD codes to patients’ medical records to facilitate funding, research, and administration. In most health facilities, clinical coding is a manual, time-demanding task that is prone to errors. A tool that automatically assigns ICD codes to free-text clinical notes could save time and reduce erroneous coding. While many previous studies have focused on ICD coding, research on Swedish patient records is scarce. This study explored different approaches to pairing Swedish clinical notes with ICD codes. KB-BERT, a BERT model pre-trained on Swedish text, was compared to the traditional supervised learning models Support Vector Machines, Decision Trees, and K-nearest Neighbours used as the baseline. When considering ICD codes grouped into ten blocks, the KB-BERT was superior to the baseline models, obtaining an F1-micro of 0.80 and an F1-macro of 0.58. When considering the 263 full ICD codes, the KB-BERT was outperformed by all baseline models at an F1-micro and F1-macro of zero. Wilcoxon signed-rank tests showed that the performance differences between the KB-BERT and the baseline models were statistically significant.
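A minimal sketch of the KB-BERT setup for the ten-block case is shown below: a sequence classifier with one sigmoid output per ICD-10 block, trained with a binary cross-entropy loss and thresholded per block. The model name, label assignment and threshold are illustrative assumptions rather than the paper's exact pipeline.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

NUM_BLOCKS = 10
name = "KB/bert-base-swedish-cased"                        # assumed Swedish KB-BERT checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(
    name, num_labels=NUM_BLOCKS, problem_type="multi_label_classification")

summary = "Patienten vårdad för pneumoni och hjärtsvikt."  # toy discharge summary
labels = torch.zeros(1, NUM_BLOCKS)
labels[0, [3, 4]] = 1.0                                    # two assumed ICD-10 blocks present

batch = tok(summary, return_tensors="pt", truncation=True)
out = model(**batch, labels=labels)        # BCE-with-logits loss via the multi-label problem type
probs = torch.sigmoid(out.logits)
pred = (probs > 0.5).int()                 # threshold each block independently
```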

Place, publisher, year, edition, pages
Shoumen: INCOMA Ltd., 2021
Series
International Conference Recent Advances in Natural Language Processing, ISSN 1313-8502, E-ISSN 2603-2813
National Category
Information Systems
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-200500 (URN); 10.26615/978-954-452-072-4_130 (DOI); 978-954-452-072-4 (ISBN)
Conference
International Conference Recent Advances in Natural Language Processing (RANLP'21), online, September 1-3, 2021
Available from: 2022-01-06 Created: 2022-01-06 Last updated: 2022-01-28. Bibliographically approved