Publications (10 of 79)
Vakili, T., Henriksson, A. & Dalianis, H. (2025). Data-Constrained Synthesis of Training Data for De-Identification. In: Wanxiang Che; Joyce Nabende; Ekaterina Shutova; Mohammad Taher Pilehvar (Ed.), Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Paper presented at The 63rd Annual Meeting of the Association for Computational Linguistics, 27 July-1 August, 2025, Vienna, Austria. (pp. 27414-27427). Association for Computational Linguistics
Data-Constrained Synthesis of Training Data for De-Identification
2025 (English). In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) / [ed] Wanxiang Che; Joyce Nabende; Ekaterina Shutova; Mohammad Taher Pilehvar, Association for Computational Linguistics, 2025, p. 27414-27427. Conference paper, Published paper (Refereed)
Abstract [en]

Many sensitive domains — such as the clinical domain — lack widely available datasets due to privacy risks. The increasing generative capabilities of large language models (LLMs) have made synthetic datasets a viable path forward. In this study, we domain-adapt LLMs to the clinical domain and generate synthetic clinical texts that are machine-annotated with tags for personally identifiable information using capable encoder-based NER models. The synthetic corpora are then used to train synthetic NER models. The results show that training NER models using synthetic corpora incurs only a small drop in predictive performance. The limits of this process are investigated in a systematic ablation study — using both Swedish and Spanish data. Our analysis shows that smaller datasets can be sufficient for domain-adapting LLMs for data synthesis. Instead, the effectiveness of this process is almost entirely contingent on the performance of the machine-annotating NER models trained using the original data.
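The generate-then-annotate pipeline sketched in the abstract can be illustrated with stubbed stages; `generate_text` and `machine_annotate` below stand in for the domain-adapted LLM and the encoder-based NER model, and all names and toy data are illustrative rather than the paper's actual code:

```python
def synthesize_corpus(generate_text, machine_annotate, n_docs: int):
    """Generate synthetic clinical texts and machine-annotate them with
    PII tags, yielding (text, annotations) training pairs."""
    corpus = []
    for _ in range(n_docs):
        text = generate_text()          # domain-adapted LLM (assumed)
        tags = machine_annotate(text)   # encoder-based NER model (assumed)
        corpus.append((text, tags))
    return corpus

# Stubs in place of real models:
corpus = synthesize_corpus(
    generate_text=lambda: "Pat. John Doe seen 2009-03-01.",
    machine_annotate=lambda t: [("John Doe", "NAME"), ("2009-03-01", "DATE")],
    n_docs=2,
)
print(len(corpus))  # -> 2
```

The resulting (text, annotations) pairs would then serve as training data for the synthetic NER models, which is why the quality of the machine-annotating model dominates the outcome.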

Place, publisher, year, edition, pages
Association for Computational Linguistics, 2025
Series
Association for Computational Linguistics (ACL). Annual Meeting Conference Proceedings, ISSN 0736-587X
National Category
Natural Language Processing
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-246981 (URN)10.18653/v1/2025.acl-long.1329 (DOI)979-8-89176-251-0 (ISBN)
Conference
The 63rd Annual Meeting of the Association for Computational Linguistics, 27 July-1 August, 2025, Vienna, Austria.
Available from: 2025-09-15 Created: 2025-09-15 Last updated: 2025-11-27. Bibliographically approved
Nikmehr, G., Bilbao-Jayo, A., Henriksson, A. & Almeida, A. (2025). Detecting Suicidal Ideation on Social Media Using Large Language Models with Zero-Shot Prompting. In: Proceedings of the 11th International Conference on Information and Communication Technologies for Ageing Well and e-Health ICT4AWE - Volume 1. Paper presented at 11th International Conference on Information and Communication Technologies for Ageing Well and e-Health ICT4AWE, Porto, Portugal, 2025 (pp. 259-267). Science and Technology Publications, Lda
Detecting Suicidal Ideation on Social Media Using Large Language Models with Zero-Shot Prompting
2025 (English). In: Proceedings of the 11th International Conference on Information and Communication Technologies for Ageing Well and e-Health ICT4AWE - Volume 1, Science and Technology Publications, Lda, 2025, p. 259-267. Conference paper, Published paper (Refereed)
Abstract [en]

Detecting suicidal ideation in social media posts using Natural Language Processing (NLP) and Machine Learning has become an essential approach for early intervention and providing support to at-risk individuals. The role of data is critical in this process, as the accuracy of NLP models largely depends on the quality and quantity of labeled data available for training. Traditional methods, such as keyword-based approaches and models reliant on manually annotated datasets, face limitations due to the complex and time-consuming nature of data labeling. This shortage of high-quality labeled data creates a significant bottleneck, limiting model fine-tuning. With the recent emergence of Large Language Models (LLMs) in various NLP applications, we utilize their strengths to classify posts expressing suicidal ideation. Specifically, we apply zero-shot prompting with LLMs, enabling effective classification even in data-scarce environments without needing extensive fine-tuning, thus reducing the dependence on large annotated datasets. Our findings suggest that zero-shot LLMs can match or exceed the performance of traditional approaches like fine-tuned RoBERTa in identifying suicidal ideation. Although no single LLM outperforms consistently across all tasks, their adaptability and effectiveness underscore their potential to detect suicidal thoughts without requiring manually labeled data.
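As a rough illustration of the zero-shot setup described above, the sketch below formats a single-label prompt and parses the answer; `complete` stands in for any text-in, text-out LLM call, and the prompt wording and label set are assumptions, not the paper's exact prompts:

```python
def build_prompt(post: str) -> str:
    """Format a zero-shot classification prompt (wording is illustrative)."""
    return (
        "Classify the following social media post as 'suicidal ideation' "
        "or 'not suicidal ideation'. Answer with the label only.\n\n"
        f"Post: {post}\nLabel:"
    )

def classify(post: str, complete) -> str:
    """`complete` is any text-in/text-out LLM call (assumed, not a real API).
    No fine-tuning or labeled examples are needed: the task is stated in
    the prompt and the model's free-text answer is parsed into a label."""
    answer = complete(build_prompt(post)).strip().lower()
    return "suicidal" if "not" not in answer else "not suicidal"

# Usage with a stub in place of a real model:
stub = lambda prompt: " not suicidal ideation"
print(classify("Had a great day at the park!", stub))  # -> not suicidal
```

Because the label space is fixed in the prompt, the same function works across models, which is how the study compares several LLMs without per-model training.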

Place, publisher, year, edition, pages
Science and Technology Publications, Lda, 2025
Series
International Conference on Information and Communication Technologies for Ageing Well and e-Health, ICT4AWE - Proceedings, E-ISSN 2184-4984
Keywords
Large Language Models, Natural Language Processing, Prompting, Suicidal Ideation Detection
National Category
Other Computer and Information Science
Identifiers
urn:nbn:se:su:diva-243460 (URN)10.5220/0013283400003938 (DOI)2-s2.0-105003532350 (Scopus ID)
Conference
11th International Conference on Information and Communication Technologies for Ageing Well and e-Health ICT4AWE, Porto, Portugal, 2025
Available from: 2025-05-26 Created: 2025-05-26 Last updated: 2025-05-26. Bibliographically approved
Randl, K. R., Pavlopoulos, I., Henriksson, A. & Lindgren, T. (2025). Evaluating the Reliability of Self-Explanations in Large Language Models. In: Dino Pedreschi; Anna Monreale; Riccardo Guidotti; Roberto Pellungrini; Francesca Naretto (Ed.), Discovery Science: 27th International Conference, DS 2024, Pisa, Italy, October 14–16, 2024, Proceedings, Part I. Paper presented at Discovery Science, 27th International Conference, DS 2024, 14-16 October 2024, Pisa, Italy. (pp. 36-51). Springer Publishing Company
Evaluating the Reliability of Self-Explanations in Large Language Models
2025 (English). In: Discovery Science: 27th International Conference, DS 2024, Pisa, Italy, October 14–16, 2024, Proceedings, Part I / [ed] Dino Pedreschi; Anna Monreale; Riccardo Guidotti; Roberto Pellungrini; Francesca Naretto, Springer Publishing Company, 2025, p. 36-51. Conference paper, Published paper (Refereed)
Abstract [en]

This paper investigates the reliability of explanations generated by large language models (LLMs) when prompted to explain their previous output. We evaluate two kinds of such self-explanations, extractive and counterfactual, using three state-of-the-art LLMs (2B to 8B parameters) on two different classification tasks (objective and subjective).

Our findings reveal that, while these self-explanations can correlate with human judgement, they do not fully and accurately follow the model's decision process, indicating a gap between perceived and actual model reasoning.

We show that this gap can be bridged because prompting LLMs for counterfactual explanations can produce faithful, informative, and easy-to-verify results. These counterfactuals offer a promising alternative to traditional explainability methods (e.g. SHAP, LIME), provided that prompts are tailored to specific tasks and checked for validity.

Place, publisher, year, edition, pages
Springer Publishing Company, 2025
Series
Lecture Notes in Computer Science (LNCS), ISSN 0302-9743, E-ISSN 1611-3349 ; 15243
Keywords
Large Language Models, Self-Explanations, Counterfactuals
National Category
Computer Sciences
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-239126 (URN)10.1007/978-3-031-78977-9_3 (DOI)2-s2.0-85218499264 (Scopus ID)978-3-031-78976-2 (ISBN)978-3-031-78977-9 (ISBN)
Conference
Discovery Science, 27th International Conference, DS 2024, 14-16 October 2024, Pisa, Italy.
Available from: 2025-02-06 Created: 2025-02-06 Last updated: 2025-04-09. Bibliographically approved
Bakagianni, J., Randl, K. R., Rocchietti, G., Rulli, C., Nardini, F. M., Henriksson, A., . . . Pavlopoulos, I. (2025). FoodSafeSum: Enabling Natural Language Processing Applications for Food Safety Document Summarization and Analysis. In: Christos Christodoulopoulos; Tanmoy Chakraborty; Carolyn Rose; Violet Peng (Ed.), Findings of the Association for Computational Linguistics: EMNLP 2025. Paper presented at Conference on Empirical Methods in Natural Language Processing (EMNLP), November 2025, Suzhou, China. (pp. 16786-16804). Association for Computational Linguistics
FoodSafeSum: Enabling Natural Language Processing Applications for Food Safety Document Summarization and Analysis
2025 (English). In: Findings of the Association for Computational Linguistics: EMNLP 2025 / [ed] Christos Christodoulopoulos; Tanmoy Chakraborty; Carolyn Rose; Violet Peng, Association for Computational Linguistics, 2025, p. 16786-16804. Conference paper, Published paper (Refereed)
Abstract [en]

Food safety demands timely detection, regulation, and public communication, yet the lack of structured datasets hinders Natural Language Processing (NLP) research. We present and release a new dataset of human-written and Large Language Model (LLM)-generated summaries of food safety documents, plus food safety related metadata. We evaluate its utility on three NLP tasks directly reflecting food safety practices: multilabel classification for organizing documents into domain-specific categories; document retrieval for accessing regulatory and scientific evidence; and question answering via retrieval-augmented generation that improves factual accuracy. We show that LLM summaries perform comparably or better than human ones across tasks. We also demonstrate clustering of summaries for event tracking and compliance monitoring. This dataset enables NLP applications that support core food safety practices, including the organization of regulatory and scientific evidence, monitoring of compliance issues, and communication of risks to the public.

Place, publisher, year, edition, pages
Association for Computational Linguistics, 2025
National Category
Natural Language Processing
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-250605 (URN)10.18653/v1/2025.findings-emnlp.911 (DOI)979-8-89176-335-7 (ISBN)
Conference
Conference on Empirical Methods in Natural Language Processing (EMNLP), November 2025, Suzhou, China.
Available from: 2025-12-18 Created: 2025-12-18 Last updated: 2025-12-19. Bibliographically approved
Kopacheva, E., Henriksson, A., Dalianis, H., Hammar, T. & Lincke, A. (2025). Identifying Adverse Drug Events in Clinical Text Using Fine-Tuned Clinical Language Models: Machine Learning Study. JMIR Formative Research, 9, Article ID e71949.
Identifying Adverse Drug Events in Clinical Text Using Fine-Tuned Clinical Language Models: Machine Learning Study
2025 (English). In: JMIR Formative Research, E-ISSN 2561-326X, Vol. 9, article id e71949. Article in journal (Refereed), Published
Abstract [en]

Background: Medications are essential for health care but can cause adverse drug events (ADEs), which are harmful and sometimes fatal. Detecting ADEs is a challenging task because they are often not documented in the structured data of electronic health records (EHRs). There is a need for automatically extracting ADE-related information from clinical notes, as manual review is labor-intensive and time-consuming.

Objective: This study aims to fine-tune the pretrained clinical language model, Swedish Deidentified Clinical Bidirectional Encoder Representations from Transformers (SweDeClin-BERT), for medical named entity recognition (NER) and relation extraction (RE) tasks, and to implement an integrated NER-RE approach to more effectively identify ADEs in clinical notes from clinical units in Sweden. The performance of this approach is compared with our previous machine learning method, which used conditional random fields (CRFs) and random forest (RF).

Methods: A subset of clinical notes from the Stockholm EPR (Electronic Patient Record) Corpus, dated 2009‐2010, containing suspected ADEs based on International Classification of Diseases, 10th Revision (ICD-10) codes in the A.1 and A.2 categories was randomly sampled. These notes were annotated by a physician with ADE-related entities and relations following the ADE annotation guidelines. We fine-tuned the SweDeClin-BERT model for the NER and RE tasks and implemented an integrated NER-RE pipeline to extract entities and relationships from clinical notes. The models were evaluated using 395 clinical notes from clinical units in Sweden. The NER-RE pipeline was then applied to classify the clinical notes as containing or not containing ADEs. In addition, we conducted an error analysis to better understand the model’s behavior and to identify potential areas for improvement.

Results: In total, 62% of notes contained an explicit description of an ADE, indicating that an ADE-related ICD-10 code alone does not ensure detailed event documentation. The fine-tuned SweDeClin-BERT model achieved an F1-score of 0.845 for the NER task and 0.81 for the RE task, outperforming the baseline models (CRFs for NER and random forests for RE). In particular, the RE task showed a 53% improvement in macro-average F1-score compared to the baseline. The integrated NER-RE pipeline achieved an overall F1-score of 0.81.

Conclusions: Using a domain-specific language model like SweDeClin-BERT for detecting ADEs in clinical notes demonstrates improved classification performance (0.77 in strict and 0.81 in relaxed mode) compared to conventional machine learning models like CRFs and RF. The proposed fine-tuned ADE model requires further refinement and evaluation on annotated clinical notes from another hospital to evaluate the model’s generalizability. In addition, the annotation guidelines should be revised, as there is an overlap of words between the Finding and Disorder entity categories, which were not consistently distinguished by the annotators. Furthermore, future work should address the handling of compound words and split entities to better capture context in the Swedish language.

Keywords
adverse drug events, BERT, domain-specific language models, electronic health records, SweDeClin-BERT
National Category
Medical Informatics
Identifiers
urn:nbn:se:su:diva-247450 (URN)10.2196/71949 (DOI)2-s2.0-105015483860 (Scopus ID)
Available from: 2025-09-26 Created: 2025-09-26 Last updated: 2025-09-26. Bibliographically approved
Randl, K. R., Pavlopoulos, J., Henriksson, A. & Lindgren, T. (2025). Mind the gap: from plausible to valid self-explanations in large language models. Machine Learning, 114(10), Article ID 220.
Mind the gap: from plausible to valid self-explanations in large language models
2025 (English). In: Machine Learning, ISSN 0885-6125, E-ISSN 1573-0565, Vol. 114, no 10, article id 220. Article in journal (Refereed), Published
Abstract [en]

This paper investigates the reliability of explanations generated by large language models (LLMs) when prompted to explain their previous output. We evaluate two kinds of such self-explanations (SE)—extractive and counterfactual—using state-of-the-art LLMs (1B to 70B parameters) on three different classification tasks (both objective and subjective). In line with Agarwal et al. (Faithfulness versus plausibility: On the (Un)reliability of explanations from large language models. 2024. https://doi.org/10.48550/arXiv.2402.04614), our findings indicate a gap between perceived and actual model reasoning: while SE largely correlate with human judgment (i.e. are plausible), they do not fully and accurately follow the model’s decision process (i.e. are not faithful). Additionally, we show that counterfactual SE are not even necessarily valid in the sense of actually changing the LLM’s prediction. Our results suggest that extractive SE provide the LLM’s “guess” at an explanation based on training data. Conversely, counterfactual SE can help understand the LLM’s reasoning: We show that the issue of validity can be resolved by sampling counterfactual candidates at high temperature—followed by a validity check—and introducing a formula to estimate the number of tries needed to generate valid explanations. This simple method produces plausible and valid explanations that offer a 16 times faster alternative to SHAP on average in our experiments.
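The sample-and-check procedure described above can be sketched as follows. Under the simplifying assumption that each high-temperature candidate is valid independently with probability p, the number of tries is geometric with mean 1/p; the paper's actual estimation formula may differ from this sketch:

```python
def expected_tries(p_valid: float) -> float:
    """Expected number of samples until a valid counterfactual, assuming
    i.i.d. candidates: tries ~ Geometric(p), so E[tries] = 1 / p."""
    return 1.0 / p_valid

def sample_until_valid(generate, is_valid, max_tries: int = 50):
    """Resample candidates (e.g. at high temperature) until one passes
    the validity check, i.e. actually flips the model's prediction."""
    for _ in range(max_tries):
        candidate = generate()
        if is_valid(candidate):
            return candidate
    return None  # no valid counterfactual within the budget

# With a 25% per-sample validity rate we expect about 4 tries:
print(expected_tries(0.25))  # -> 4.0
```

The validity check is what distinguishes this from plain prompting: a counterfactual is only accepted after verifying that the edited input changes the model's output.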

Keywords
Attention-based explainability, Counterfactuals, Gradient-based explainability, Interpretability, Large language models (LLMs), Self-explanations
National Category
Natural Language Processing
Identifiers
urn:nbn:se:su:diva-246656 (URN)10.1007/s10994-025-06838-6 (DOI)001563123000001 ()2-s2.0-105014633582 (Scopus ID)
Available from: 2025-09-09 Created: 2025-09-09 Last updated: 2025-10-06. Bibliographically approved
Randl, K. R., Pavlopoulos, I., Henriksson, A., Lindgren, T. & Bakagianni, J. (2025). SemEval-2025 Task 9: The Food Hazard Detection Challenge. In: Sara Rosenthal; Aiala Rosá; Debanjan Ghosh; Marcos Zampieri (Ed.), Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025). Paper presented at The 19th International Workshop on Semantic Evaluation, July 2025, Vienna, Austria. (pp. 2523-2534). Association for Computational Linguistics
SemEval-2025 Task 9: The Food Hazard Detection Challenge
2025 (English). In: Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025) / [ed] Sara Rosenthal; Aiala Rosá; Debanjan Ghosh; Marcos Zampieri, Association for Computational Linguistics, 2025, p. 2523-2534. Conference paper, Published paper (Refereed)
Abstract [en]

In this challenge, we explored text-based food hazard prediction with long tail distributed classes. The task was divided into two subtasks: (1) predicting whether a web text implies one of ten food-hazard categories and identifying the associated food category, and (2) providing a more fine-grained classification by assigning a specific label to both the hazard and the product. Our findings highlight that large language model-generated synthetic data can be highly effective for oversampling long-tail distributions. Furthermore, we find that fine-tuned encoder-only, encoder-decoder, and decoder-only systems achieve comparable maximum performance across both subtasks. During this challenge, we gradually released (under CC BY-NC-SA 4.0) a novel set of 6,644 manually labeled food-incident reports.

Place, publisher, year, edition, pages
Association for Computational Linguistics, 2025
National Category
Computer Sciences
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-247395 (URN)979-8-89176-273-2 (ISBN)
Conference
The 19th International Workshop on Semantic Evaluation, July 2025, Vienna, Austria.
Available from: 2025-09-24 Created: 2025-09-24 Last updated: 2025-09-24. Bibliographically approved
Vakili, T., Hansson, M. & Henriksson, A. (2025). SweClinEval: A Benchmark for Swedish Clinical Natural Language Processing. In: Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025). Paper presented at The Joint Nordic Conference on Computational Linguistics and Baltic Conference on Human Language Technologies, 2-5 March 2025, Tallinn, Estonia. (pp. 767-775).
SweClinEval: A Benchmark for Swedish Clinical Natural Language Processing
2025 (English). In: Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025), 2025, p. 767-775. Conference paper, Published paper (Refereed)
Abstract [en]

The lack of benchmarks in certain domains and for certain languages makes it difficult to track progress regarding the state-of-the-art of NLP in those areas, potentially impeding progress in important, specialized domains. Here, we introduce the first Swedish benchmark for clinical NLP: SweClinEval. The first iteration of the benchmark consists of six clinical NLP tasks, encompassing both document-level classification and named entity recognition tasks, with real clinical data. We evaluate nine different encoder models, both Swedish and multilingual. The results show that domain-adapted models outperform generic models on sequence-level classification tasks, while certain larger generic models outperform the clinical models on named entity recognition tasks. We describe how the benchmark can be managed despite limited possibilities to share sensitive clinical data, and discuss plans for extending the benchmark in future iterations.

Series
NEALT Proceedings Series, ISSN 1736-8197, E-ISSN 1736-6305
National Category
Natural Language Processing
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-240589 (URN)978-9908-53-109-0 (ISBN)
Conference
The Joint Nordic Conference on Computational Linguistics and Baltic Conference on Human Language Technologies, 2-5 March 2025, Tallinn, Estonia.
Available from: 2025-03-10 Created: 2025-03-10 Last updated: 2025-11-27. Bibliographically approved
van der Werff, S. D., van Rooden, S. D., Henriksson, A., Behnke, M., Aghdassi, S. J. S., van Mourik, M. S. M. & Nauclér, P. (2025). The future of healthcare-associated infection surveillance: Automated surveillance and using the potential of artificial intelligence. Journal of Internal Medicine, 298(2), 54-77
The future of healthcare-associated infection surveillance: Automated surveillance and using the potential of artificial intelligence
2025 (English). In: Journal of Internal Medicine, ISSN 0954-6820, E-ISSN 1365-2796, Vol. 298, no 2, p. 54-77. Article in journal (Refereed), Published
Abstract [en]

Healthcare-associated infections (HAI) are common adverse events and surveillance is considered a core component of effective HAI reduction programs. Recently, efforts have focused on automating the traditional manual surveillance process by utilizing data from electronic health record (EHR) systems. Using EHR data for automated surveillance, algorithms have been developed to identify patients with ventilator-associated pneumonia and bloodstream, surgical site infections, urinary tract, and Clostridioides difficile infections (sensitivity 54.2%–100%, specificity 63.5%–100%). Methods based on natural language processing have been applied to extract information from unstructured clinical information. Further developments in artificial intelligence (AI), such as large language models, are expected to support and improve a variety of aspects within the surveillance process; for example, more precise identification of patients with HAI. However, AI-based methods have been applied less frequently in automated surveillance and more frequently for early prediction, particularly for sepsis. Despite heterogeneity in settings, populations, sepsis definitions, and model designs, AI models have shown promising results, with moderate to very good performance (accuracy 61–99%) and predicted sepsis within 0–40 hours before onset. AI-based prediction models that can detect patients at risk of developing different HAI should be explored further. The continuous evolution of AI and automation will transform HAI surveillance and prediction, offering more objective and timely infection rates and predictions. The implementation of AI-supported automated surveillance and prediction systems for HAI in daily practice remains scarce. The successful development and implementation of these systems demand requirements related to technical capabilities, governance, practical and regulatory considerations, and quality monitoring.

Keywords
artificial intelligence, automated surveillance, early prediction, healthcare-associated infections, sepsis
National Category
Natural Language Processing
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-247397 (URN)10.1111/joim.20100 (DOI)001502822600001 ()40469046 (PubMedID)2-s2.0-105007237006 (Scopus ID)
Available from: 2025-09-24 Created: 2025-09-24 Last updated: 2025-09-24. Bibliographically approved
Randl, K. R., Pavlopoulos, I., Henriksson, A. & Lindgren, T. (2024). CICLe: Conformal In-Context Learning for Large-scale Multi-Class Food Risk Classification. In: Lun-Wei Ku; Andre Martins; Vivek Srikumar (Ed.), Findings of the Association for Computational Linguistics: ACL 2024. Paper presented at The 62nd Annual Meeting of the Association for Computational Linguistics, August 11-16 2024, Bangkok, Thailand. (pp. 7695-7715). Association for Computational Linguistics
CICLe: Conformal In-Context Learning for Large-scale Multi-Class Food Risk Classification
2024 (English). In: Findings of the Association for Computational Linguistics: ACL 2024 / [ed] Lun-Wei Ku; Andre Martins; Vivek Srikumar, Association for Computational Linguistics, 2024, p. 7695-7715. Conference paper, Published paper (Refereed)
Abstract [en]

Contaminated or adulterated food poses a substantial risk to human health. Given sets of labeled web texts for training, Machine Learning and Natural Language Processing can be applied to automatically detect such risks. We publish a dataset of 7,546 short texts describing public food recall announcements. Each text is manually labeled, on two granularity levels (coarse and fine), for food products and hazards that the recall corresponds to. We describe the dataset and benchmark naive, traditional, and Transformer models. Based on our analysis, Logistic Regression based on a tf-idf representation outperforms RoBERTa and XLM-R on classes with low support. Finally, we discuss different prompting strategies and present an LLM-in-the-loop framework, based on Conformal Prediction, which boosts the performance of the base classifier while reducing energy consumption compared to normal prompting.
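The LLM-in-the-loop idea can be sketched with a standard split-conformal prediction set over the base classifier's probabilities; the threshold `qhat`, the label names, and the routing rule below are illustrative assumptions, not the paper's exact design:

```python
def prediction_set(probs: dict, qhat: float) -> set:
    """Split-conformal prediction set: keep every label whose softmax
    score is at least 1 - qhat (qhat calibrated on held-out data)."""
    return {label for label, p in probs.items() if p >= 1.0 - qhat}

def cicle_route(probs: dict, qhat: float, ask_llm):
    """Route an example: trust the cheap base classifier when the
    conformal set is a singleton; otherwise prompt the LLM to choose
    among only the remaining candidate labels."""
    labels = prediction_set(probs, qhat)
    if len(labels) == 1:
        return labels.pop()
    return ask_llm(sorted(labels))

# Confident example: the set collapses to one label, no LLM call needed.
probs = {"biological": 0.70, "chemical": 0.25, "allergen": 0.05}
print(cicle_route(probs, qhat=0.4, ask_llm=lambda cands: cands[0]))  # -> biological
```

Only ambiguous cases trigger an LLM prompt, and that prompt lists just the conformal candidates rather than all classes, which is what reduces energy consumption relative to prompting on every example.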

Place, publisher, year, edition, pages
Association for Computational Linguistics, 2024
Keywords
In-Context-Learning, Prompting, Text Classification, Food-Risk, Conformal Prediction
National Category
Computer Sciences
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-237876 (URN)10.18653/v1/2024.findings-acl.459 (DOI)979-8-89176-099-8 (ISBN)
Conference
The 62nd Annual Meeting of the Association for Computational Linguistics, August 11-16 2024, Bangkok, Thailand.
Available from: 2025-01-14 Created: 2025-01-14 Last updated: 2025-01-15. Bibliographically approved
Identifiers
ORCID iD: orcid.org/0000-0001-9731-1048
