Publications (10 of 16)
Vakili, T., Henriksson, A. & Dalianis, H. (2025). Data-Constrained Synthesis of Training Data for De-Identification. In: Wanxiang Che; Joyce Nabende; Ekaterina Shutova; Mohammad Taher Pilehvar (Eds.), Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Paper presented at The 63rd Annual Meeting of the Association for Computational Linguistics, 27 July-1 August, 2025, Vienna, Austria (pp. 27414-27427). Association for Computational Linguistics
Data-Constrained Synthesis of Training Data for De-Identification
2025 (English) In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) / [ed] Wanxiang Che; Joyce Nabende; Ekaterina Shutova; Mohammad Taher Pilehvar, Association for Computational Linguistics, 2025, p. 27414-27427. Conference paper, Published paper (Refereed)
Abstract [en]

Many sensitive domains — such as the clinical domain — lack widely available datasets due to privacy risks. The increasing generative capabilities of large language models (LLMs) have made synthetic datasets a viable path forward. In this study, we domain-adapt LLMs to the clinical domain and generate synthetic clinical texts that are machine-annotated with tags for personally identifiable information using capable encoder-based NER models. The synthetic corpora are then used to train synthetic NER models. The results show that training NER models using synthetic corpora incurs only a small drop in predictive performance. The limits of this process are investigated in a systematic ablation study — using both Swedish and Spanish data. Our analysis shows that smaller datasets can be sufficient for domain-adapting LLMs for data synthesis. Instead, the effectiveness of this process is almost entirely contingent on the performance of the machine-annotating NER models trained using the original data.
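The generate-then-annotate pipeline summarized in this abstract (a domain-adapted LLM produces synthetic clinical text, an encoder-based NER model machine-annotates it for PII, and a new NER model is trained on the result) can be illustrated roughly as below. This is a minimal sketch assuming Hugging Face-style pipelines; the checkpoint names and prompts are placeholders, not the models or data used in the paper.

# Rough sketch of the generate-then-machine-annotate loop; checkpoint names are placeholders.
from transformers import pipeline

generator = pipeline("text-generation", model="my-domain-adapted-clinical-llm")  # hypothetical LLM
annotator = pipeline("token-classification", model="my-pii-ner-model",           # hypothetical NER model
                     aggregation_strategy="simple")

synthetic_corpus = []
for prompt in ["Patient admitted with", "Follow-up visit regarding"]:
    text = generator(prompt, max_new_tokens=128, do_sample=True)[0]["generated_text"]
    pii_spans = annotator(text)  # machine-annotated PII spans (names, dates, locations, ...)
    synthetic_corpus.append({"text": text, "pii_spans": pii_spans})

# synthetic_corpus can then be converted to token-level IOB tags and used to
# fine-tune a fresh NER model, mirroring the synthetic NER models in the study.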

Place, publisher, year, edition, pages
Association for Computational Linguistics, 2025
Series
Association for Computational Linguistics (ACL). Annual Meeting Conference Proceedings, ISSN 0736-587X
National Category
Natural Language Processing
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-246981 (URN); 10.18653/v1/2025.acl-long.1329 (DOI); 979-8-89176-251-0 (ISBN)
Conference
The 63rd Annual Meeting of the Association for Computational Linguistics, 27 July-1 August, 2025, Vienna, Austria.
Available from: 2025-09-15. Created: 2025-09-15. Last updated: 2025-11-27. Bibliographically approved
Vakili, T., Hansson, M. & Henriksson, A. (2025). SweClinEval: A Benchmark for Swedish Clinical Natural Language Processing. In: Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025). Paper presented at The Joint Nordic Conference on Computational Linguistics and Baltic Conference on Human Language Technologies, 2-5 March 2025, Tallinn, Estonia (pp. 767-775).
SweClinEval: A Benchmark for Swedish Clinical Natural Language Processing
2025 (English) In: Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025), 2025, p. 767-775. Conference paper, Published paper (Refereed)
Abstract [en]

The lack of benchmarks in certain domains and for certain languages makes it difficult to track progress regarding the state-of-the-art of NLP in those areas, potentially impeding progress in important, specialized domains. Here, we introduce the first Swedish benchmark for clinical NLP: SweClinEval. The first iteration of the benchmark consists of six clinical NLP tasks, encompassing both document-level classification and named entity recognition tasks, with real clinical data. We evaluate nine different encoder models, both Swedish and multilingual. The results show that domain-adapted models outperform generic models on sequence-level classification tasks, while certain larger generic models outperform the clinical models on named entity recognition tasks. We describe how the benchmark can be managed despite limited possibilities to share sensitive clinical data, and discuss plans for extending the benchmark in future iterations.

Series
NEALT Proceedings Series, ISSN 1736-8197, E-ISSN 1736-6305
National Category
Natural Language Processing
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-240589 (URN); 978-9908-53-109-0 (ISBN)
Conference
The Joint Nordic Conference on Computational Linguistics and Baltic Conference on Human Language Technologies, 2-5 March 2025, Tallinn, Estonia.
Available from: 2025-03-10. Created: 2025-03-10. Last updated: 2025-11-27. Bibliographically approved
Aracena, C., Miranda, L., Vakili, T., Villena, F., Quiroga, T., Núñez-Torres, F., . . . Dunstan, J. (2024). A Privacy-Preserving Corpus for Occupational Health in Spanish: Evaluation for NER and Classification Tasks. In: Proceedings of the 6th Clinical Natural Language Processing Workshop. Paper presented at ClinicalNLP@NAACL-HLT 2024, the 6th Clinical Natural Language Processing Workshop at NAACL 2024, 21 June 2024, Mexico City, Mexico (pp. 111-121). Association for Computational Linguistics
A Privacy-Preserving Corpus for Occupational Health in Spanish: Evaluation for NER and Classification Tasks
2024 (English) In: Proceedings of the 6th Clinical Natural Language Processing Workshop, Association for Computational Linguistics, 2024, p. 111-121. Conference paper, Published paper (Refereed)
Abstract [en]

Annotated corpora are essential to reliable natural language processing. Although they are expensive to create, they are indispensable for building and evaluating systems. This study introduces a new corpus of 2,869 medical and admission reports collected by an occupational insurance and health provider. The corpus has been carefully annotated for personally identifiable information (PII) and is shared with this information masked. Two annotators adhered to annotation guidelines during the annotation process, and a referee later resolved annotation conflicts in a consolidation process to build a gold-standard subcorpus. The inter-annotator agreement values, measured in F1, range between 0.86 and 0.93 depending on the selected subcorpus. The value of the corpus is demonstrated by evaluating its use for NER of PII and for a classification task. The evaluations find that fine-tuned models and GPT-3.5 reach F1 scores of 0.911 and 0.720 in NER of PII, respectively. In the case of the insurance coverage classification task, using the original or the de-identified corpus results in similar performance. The annotated data are released in de-identified form.
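As a generic illustration of how inter-annotator agreement can be reported in F1 (treating one annotator's spans as the reference and the other's as predictions, with exact span matching), a minimal sketch follows. It is not the evaluation code used for this corpus, and the spans are invented.

# Minimal sketch: pairwise F1 over exactly matching entity spans (doc, start, end, label).
def span_f1(reference, prediction):
    ref, pred = set(reference), set(prediction)
    tp = len(ref & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

annotator_a = {("doc1", 10, 18, "NAME"), ("doc1", 25, 35, "DATE")}
annotator_b = {("doc1", 10, 18, "NAME"), ("doc1", 40, 44, "ID")}
print(round(span_f1(annotator_a, annotator_b), 2))  # 0.5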

Place, publisher, year, edition, pages
Association for Computational Linguistics, 2024
National Category
Natural Language Processing
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-232089 (URN)
Conference
ClinicalNLP@NAACL-HLT 2024, the 6th Clinical Natural Language Processing Workshop at NAACL 2024, 21 June 2024, Mexico City, Mexico.
Available from: 2024-07-24. Created: 2024-07-24. Last updated: 2025-02-07. Bibliographically approved
Dunstan, J., Vakili, T., Miranda, L., Villena, F., Aracena, C., Quiroga, T., . . . Rocco, V. (2024). A Pseudonymized Corpus of Occupational Health Narratives for Clinical Entity Recognition in Spanish. BMC Medical Informatics and Decision Making (24), Article ID 204.
A Pseudonymized Corpus of Occupational Health Narratives for Clinical Entity Recognition in Spanish
2024 (English) In: BMC Medical Informatics and Decision Making, E-ISSN 1472-6947, no 24, article id 204. Article in journal (Refereed), Published
Abstract [en]

Despite the high creation cost, annotated corpora are indispensable for robust natural language processing systems. In the clinical field, in addition to annotating medical entities, corpus creators must also remove personally identifiable information (PII). This has become increasingly important in the era of large language models where unwanted memorization can occur. This paper presents a corpus annotated to anonymize personally identifiable information in 1,787 anamneses of work-related accidents and diseases in Spanish. Additionally, we applied a previously released model for Named Entity Recognition (NER) trained on referrals from primary care physicians to identify diseases, body parts, and medications in this work-related text. We analyzed the differences between the models and the gold standard curated by a physician in detail. Moreover, we compared the performance of the NER model on the original narratives, in narratives where personal information has been masked, and in texts where the personal data is replaced by another similar surrogate value (pseudonymization). Within this publication, we share the annotation guidelines and the annotated corpus.

Keywords
Natural language processing, Privacy, Named entity recognition, Corpus annotation
National Category
Natural Language Processing
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-232090 (URN); 10.1186/s12911-024-02609-w (DOI); 001275573100002 (ISI); 39049027 (PubMedID); 2-s2.0-85199343231 (Scopus ID)
Available from: 2024-07-24. Created: 2024-07-24. Last updated: 2025-02-07. Bibliographically approved
Vakili, T., Henriksson, A. & Dalianis, H. (2024). End-to-End Pseudonymization of Fine-Tuned Clinical BERT Models: Privacy Preservation with Maintained Data Utility. BMC Medical Informatics and Decision Making, Article ID 162.
End-to-End Pseudonymization of Fine-Tuned Clinical BERT Models: Privacy Preservation with Maintained Data Utility
2024 (English) In: BMC Medical Informatics and Decision Making, E-ISSN 1472-6947, article id 162. Article in journal (Refereed), Published
Abstract [en]

Many state-of-the-art results in natural language processing (NLP) rely on large pre-trained language models (PLMs). These models consist of a large number of parameters that are tuned using vast amounts of training data. These factors cause the models to memorize parts of their training data, making them vulnerable to various privacy attacks. This is cause for concern, especially when these models are applied in the clinical domain, where data are very sensitive.

One privacy-preserving technique that aims to mitigate these problems is training data pseudonymization. This technique automatically identifies and replaces sensitive entities with realistic but non-sensitive surrogates. Pseudonymization has yielded promising results in previous studies. However, no previous study has applied pseudonymization to both the pre-training data of PLMs and the fine-tuning data used to solve clinical NLP tasks.
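A minimal sketch of the surrogate-replacement step described above is given below. In practice the PII spans come from an automatic de-identification model; here the spans and the surrogate lists are invented placeholders, not the resources used in the study.

# Sketch of replacing detected PII spans with realistic but non-sensitive surrogates.
# Spans would normally come from an automatic de-identifier; surrogate lists are placeholders.
import random

SURROGATES = {
    "FIRST_NAME": ["Anna", "Erik", "Maria"],
    "CITY": ["Uppsala", "Lund", "Umeå"],
}

def pseudonymize(text, spans):
    # spans: (start, end, label); replace right-to-left so earlier offsets stay valid
    for start, end, label in sorted(spans, key=lambda s: s[0], reverse=True):
        surrogate = random.choice(SURROGATES.get(label, ["[REMOVED]"]))
        text = text[:start] + surrogate + text[end:]
    return text

print(pseudonymize("Karin was seen at the clinic in Kiruna.",
                   [(0, 5, "FIRST_NAME"), (32, 38, "CITY")]))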

This study evaluates the predictive performance effects of end-to-end pseudonymization of clinical BERT models on five clinical NLP tasks compared to pre-training and fine-tuning on unaltered sensitive data. A large number of statistical tests are performed, revealing minimal harm to performance when using pseudonymized fine-tuning data. The results also find no deterioration from end-to-end pseudonymization of pre-training and fine-tuning data. These results demonstrate that pseudonymizing training data to reduce privacy risks can be done without harming data utility for training PLMs.

Keywords
Natural language processing, language models, BERT, electronic health records, clinical text, de-identification, pseudonymization, privacy preservation, Swedish
National Category
Natural Language Processing
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-232099 (URN); 10.1186/s12911-024-02546-8 (DOI); 38915012 (PubMedID); 2-s2.0-85196757461 (Scopus ID)
Available from: 2024-07-24. Created: 2024-07-24. Last updated: 2025-11-27. Bibliographically approved
Ahrenberg, L., Ainiala, T., Aldrin, E., Holdt, Š. A., Caines, A., Dalianis, H., . . . Vu, X.-S. (2024). Introduction. In: Elena Volodina, David Alfter, Simon Dobnik, Therese Lindström Tiedemann, Ricardo Muñoz Sánchez, Maria Irena Szawerna, Xuan-Son Vu (Eds.), Proceedings of the Workshop on Computational Approaches to Language Data Pseudonymization (CALD-pseudo 2024). Paper presented at the Workshop on Computational Approaches to Language Data Pseudonymization (CALD-pseudo 2024), March 2024, St. Julian’s, Malta (pp. ii-iii).
Introduction
2024 (English) In: Proceedings of the Workshop on Computational Approaches to Language Data Pseudonymization (CALD-pseudo 2024) / [ed] Elena Volodina, David Alfter, Simon Dobnik, Therese Lindström Tiedemann, Ricardo Muñoz Sánchez, Maria Irena Szawerna, Xuan-Son Vu, 2024, p. ii-iii. Conference paper (Refereed)
National Category
Information Systems, Social aspects
Identifiers
urn:nbn:se:su:diva-236177 (URN); 2-s2.0-85190584439 (Scopus ID)
Conference
Workshop on Computational Approaches to Language Data Pseudonymization (CALD-pseudo 2024), March 2024, St. Julian’s, Malta.
Available from: 2024-12-03. Created: 2024-12-03. Last updated: 2024-12-03. Bibliographically approved
Lamproudis, A., Olsen Svenning, T., Torsvik, T., Chomutare, T., Budrionis, A., Dinh Ngo, P., . . . Dalianis, H. (2024). Using a Large Open Clinical Corpus for Improved ICD-10 Diagnosis Coding. In: AMIA Symposium, 2023. Paper presented at AMIA 2023 Annual Symposium, New Orleans, USA, November 11-15, 2023 (pp. 465-473). American Medical Informatics Association (AMIA)
Using a Large Open Clinical Corpus for Improved ICD-10 Diagnosis Coding
2024 (English) In: AMIA Symposium, 2023, American Medical Informatics Association (AMIA), 2024, p. 465-473. Conference paper, Published paper (Refereed)
Abstract [en]

With the recent advances in natural language processing and deep learning, the development of tools that can assist medical coders in ICD-10 diagnosis coding and increase their efficiency in coding discharge summaries is significantly more viable than before. To that end, one important component in the development of these models is the datasets used to train them. In this study, such datasets are presented, and it is shown that one of them can be used to develop a BERT-based language model that can consistently perform well in assigning ICD-10 codes to discharge summaries written in Swedish. Most importantly, it can be used in a coding support setup where a tool can recommend potential codes to the coders. This reduces the range of potential codes to consider and, in turn, reduces the workload of the coder. Moreover, the de-identified and pseudonymized dataset is open for use by academic users.
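In a coding-support setup like the one described, a fine-tuned model typically scores every candidate ICD-10 code for a discharge summary and only the highest-ranked codes are shown to the coder. The sketch below illustrates that top-k step with a multi-label classification head; the checkpoint name is a placeholder and not the model trained in the study.

# Sketch of top-k ICD-10 code recommendation from a multi-label BERT classifier.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

CHECKPOINT = "my-swedish-clinical-bert-icd10"  # hypothetical fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(
    CHECKPOINT, problem_type="multi_label_classification")

def recommend_codes(discharge_summary, k=5):
    inputs = tokenizer(discharge_summary, truncation=True, return_tensors="pt")
    with torch.no_grad():
        probs = torch.sigmoid(model(**inputs).logits)[0]  # one probability per ICD-10 code
    top = torch.topk(probs, k)
    return [(model.config.id2label[i.item()], round(p.item(), 3))
            for p, i in zip(top.values, top.indices)]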

Place, publisher, year, edition, pages
American Medical Informatics Association (AMIA), 2024
Series
AMIA Annual Symposium proceedings, ISSN 1559-4076, E-ISSN 1942-597X
National Category
Natural Language Processing
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-225848 (URN); 38222373 (PubMedID); 2-s2.0-85182543990 (Scopus ID)
Conference
AMIA 2023 Annual Symposium, New Orleans, USA, November 11-15, 2023
Available from: 2024-01-23. Created: 2024-01-23. Last updated: 2025-02-24. Bibliographically approved
Vakili, T., Hullmann, T., Henriksson, A. & Dalianis, H. (2024). When Is a Name Sensitive? Eponyms in Clinical Text and Implications for De-Identification. In: Elena Volodina, David Alfter, Simon Dobnik, Therese Lindström Tiedemann, Ricardo Muñoz Sánchez, Maria Irena Szawerna, Xuan-Son Vu (Eds.), Proceedings of the Workshop on Computational Approaches to Language Data Pseudonymization (CALD-pseudo 2024). Paper presented at the Workshop on Computational Approaches to Language Data Pseudonymization (CALD-pseudo 2024), St. Julian’s, Malta, March 2024 (pp. 76-80). Association for Computational Linguistics (ACL)
When Is a Name Sensitive? Eponyms in Clinical Text and Implications for De-Identification
2024 (English) In: Proceedings of the Workshop on Computational Approaches to Language Data Pseudonymization (CALD-pseudo 2024) / [ed] Elena Volodina, David Alfter, Simon Dobnik, Therese Lindström Tiedemann, Ricardo Muñoz Sánchez, Maria Irena Szawerna, Xuan-Son Vu, Association for Computational Linguistics (ACL), 2024, p. 76-80. Conference paper, Published paper (Refereed)
Abstract [en]

Clinical data, in the form of electronic health records, are rich resources that can be tapped using natural language processing. At the same time, they contain very sensitive information that must be protected. One strategy is to remove or obscure data using automatic de-identification. However, the detection of sensitive data can yield false positives. This is especially true for tokens that are similar in form to sensitive entities, such as eponyms. These names tend to refer to medical procedures or diagnoses rather than specific persons. Previous research has shown that automatic de-identification systems often misclassify eponyms as names, leading to a loss of valuable medical information. In this study, we estimate the prevalence of eponyms in a real Swedish clinical corpus. Furthermore, we demonstrate that modern transformer-based de-identification systems are more accurate in distinguishing between names and eponyms than previous approaches.
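The failure mode discussed here, an eponym such as "Wernicke" in "Wernicke encephalopathy" being flagged as a person name, can be surfaced for review with a simple check against an eponym list, as in the hypothetical sketch below. The de-identification checkpoint, label names, and eponym list are illustrative placeholders, not the systems evaluated in the paper.

# Sketch: route NER name hits that match a known-eponym list to review instead of removal.
from transformers import pipeline

deidentifier = pipeline("token-classification", model="my-swedish-deid-model",  # hypothetical
                        aggregation_strategy="simple")
KNOWN_EPONYMS = {"wernicke", "parkinson", "alzheimer", "crohn"}

def review_name_hits(text):
    for entity in deidentifier(text):
        # label names depend on the model's tag set; these are assumed for illustration
        if entity["entity_group"] in {"FIRST_NAME", "LAST_NAME"}:
            if entity["word"].lower().strip() in KNOWN_EPONYMS:
                yield entity["word"], "possible eponym: keep and review"
            else:
                yield entity["word"], "treat as sensitive: remove or replace"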

Place, publisher, year, edition, pages
Association for Computational Linguistics (ACL), 2024
National Category
Information Systems
Identifiers
urn:nbn:se:su:diva-236176 (URN); 2-s2.0-85190604145 (Scopus ID)
Conference
Workshop on Computational Approaches to Language Data Pseudonymization (CALD-pseudo 2024), St. Julian’s, Malta, March 2024.
Available from: 2024-12-12. Created: 2024-12-12. Last updated: 2024-12-12. Bibliographically approved
Vakili, T. (2023). Attacking and Defending the Privacy of Clinical Language Models. (Licentiate dissertation). Stockholm: Department of Computer and Systems Sciences, Stockholm University
Attacking and Defending the Privacy of Clinical Language Models
2023 (English) Licentiate thesis, comprehensive summary (Other academic)
Abstract [en]

The state-of-the-art methods in natural language processing (NLP) increasingly rely on large pre-trained transformer models. The strength of the models stems from their large number of parameters and the enormous amounts of data used to train them. The datasets are of a scale that makes it difficult, if not impossible, to audit them manually. When unwieldy amounts of potentially sensitive data are used to train large machine learning models, a difficult problem arises: the unintended memorization of the training data.

All datasets—including those based on publicly available data—can contain sensitive information about individuals. When models unintentionally memorize these sensitive data, they become vulnerable to different types of privacy attacks. Very few datasets for NLP can be guaranteed to be free from sensitive data. Thus, to varying degrees, most NLP models are susceptible to privacy leakage. This susceptibility is especially concerning in clinical NLP, where the data typically consist of electronic health records. Unintentionally leaking publicly available data can be problematic, but leaking data from electronic health records is never acceptable from a privacy perspective. At the same time, clinical NLP has great potential to improve the quality and efficiency of healthcare.

This licentiate thesis investigates how these privacy risks can be mitigated using automatic de-identification. This is done by exploring the privacy risks of pre-training using clinical data and then evaluating the impact on the model accuracy of decreasing these risks. A BERT model pre-trained using clinical data is subjected to a training data extraction attack. The same model is also used to evaluate a membership inference attack that has been proposed to quantify the privacy risks associated with masked language models. Then, the impact of automatic de-identification on the performance of BERT models is evaluated for both pre-training and fine-tuning data.

The results show that extracting training data from BERT models is non-trivial and suggest that the risks can be further decreased by automatically de-identifying the training data. Automatic de-identification is found to preserve the utility of the data used for pre-training and fine-tuning BERT models, resulting in no reduction in performance compared to models trained using unaltered data. However, we also find that the current state-of-the-art membership inference attacks are unable to quantify the privacy benefits of automatic de-identification. The results show that automatic de-identification reduces the privacy risks of using sensitive data for NLP without harming the utility of the data, but that these privacy benefits may be difficult to quantify.

Abstract [sv]

Den språkteknologiska forskningen blir alltmer beroende av stora förtränade transformermodeller. Dessa kraftfulla språkmodeller utgörs av ett stort antal parametrar som tränas genom att bearbeta enorma datamängder. Träningsdatan är typiskt av en sådan omfattning att det är svårt – om inte omöjligt – att granska dem manuellt. När otympliga mängder av potentiellt känsliga data används för att träna stora språkmodeller uppstår ett svårhanterligt fenomen: oavsiktlig memorering.

Väldigt få datakällor är helt fria från känsliga personuppgifter. Eftersom stora språkmodeller visat sig memorera detaljer om sina träningsdata gör det dem sårbara för integritetsröjande attacker. Denna sårbarhet är särskilt oroväckande inom klinisk språkteknologi, där data typiskt utgörs av elektroniska patientjournaler. Det är problematiskt att röja personuppgifter även om de är offentliga, men att läcka information från en individs patientjournaler är en oacceptabel integritetskränkning. Samtidigt så har klinisk språkteknologi stor potential att både förbättra kvalitén och öka effektiviteten inom sjukvården.

Denna licentiatavhandling undersöker hur de nyss nämnda integritetsriskerna kan minskas med hjälp av automatisk avidentifiering. Detta undersöks genom att först utforska riskerna med att förträna språkmodeller med kliniska träningsdata och sedan jämföra hur modellernas tillförlitlighet och prestanda påverkas av att dessa risker minskas. En BERT-modell som förtränats med kliniska data utsätts för en attack som syftar till att extrahera träningsdata. Samma modell används också för att utvärdera en föreslagen metod för att kvantifiera integritetsrisker hos maskade språkmodeller och som baseras på modellernas mottaglighet för medlemskapsinferensattacker. Därefter utvärderas hur användbara automatiskt avidentifierade data är för att förträna BERT-modeller och för att träna dem att lösa specifika språkteknologiska problem.

Resultaten visar att det är icke-trivialt att extrahera träningsdata ur språkmodeller. Samtidigt kan de risker som ändå finns minskas genom att automatiskt avidentifiera modellernas träningsdata. Därtill visar resultaten att språkmodeller tränade med automatiskt avidentifierade data fungerar lika väl som de som tränats med känsliga data. Detta gäller både vid förträning och vid träning för specifika problem. Samtidigt visar experimenten med medlemskapsinferens att nuvarande metoder inte fångar integritetsfördelarna av att automatiskt avidentifiera träningsdata. Sammanfattningsvis visar denna avhandling att automatisk avidentifiering kan användas för att minska de integritetsrisker som kommer av att använda känsliga data samtidigt som deras användbarhet bibehålls. Än saknas dock vedertagna metoder för att kvantifiera dessa integritetsvinster.

Place, publisher, year, edition, pages
Stockholm: Department of Computer and Systems Sciences, Stockholm University, 2023
Series
Report Series / Department of Computer & Systems Sciences, ISSN 1101-8526 ; 23-004
National Category
Natural Language Processing; Computer Sciences
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-216693 (URN)
Presentation
2023-05-15, M20, Borgarfjordsgatan 12, Kista, 10:00 (English)
Available from: 2023-04-25. Created: 2023-04-24. Last updated: 2025-02-01. Bibliographically approved
Vakili, T. & Dalianis, H. (2023). Using Membership Inference Attacks to Evaluate Privacy-Preserving Language Modeling Fails for Pseudonymizing Data. In: 24th Nordic Conference on Computational Linguistics (NoDaLiDa). Paper presented at the 24th Nordic Conference on Computational Linguistics (NoDaLiDa), 2023 (pp. 318-323).
Using Membership Inference Attacks to Evaluate Privacy-Preserving Language Modeling Fails for Pseudonymizing Data
2023 (English) In: 24th Nordic Conference on Computational Linguistics (NoDaLiDa), 2023, p. 318-323. Conference paper, Published paper (Refereed)
Abstract [en]

Large pre-trained language models dominate the current state-of-the-art for many natural language processing applications, including the field of clinical NLP. Several studies have found that these can be susceptible to privacy attacks that are unacceptable in the clinical domain where personally identifiable information (PII) must not be exposed.

However, there is no consensus regarding how to quantify the privacy risks of different models. One prominent suggestion is to quantify these risks using membership inference attacks. In this study, we show that a state-of-the-art membership inference attack on a clinical BERT model fails to detect the privacy benefits from pseudonymizing data. This suggests that such attacks may be inadequate for evaluating token-level privacy preservation of PIIs.
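One common way to instantiate a membership inference attack against a masked language model is to threshold a per-example score, such as the masked-language-modeling loss, and guess "member" for low-loss examples. The sketch below shows that loss-threshold variant in simplified form; it is a generic illustration, not the specific state-of-the-art attack evaluated in the paper, and the checkpoint name is a placeholder.

# Sketch of a loss-threshold membership inference test on a masked language model.
# Special tokens and the unlikely no-mask edge case are ignored for brevity.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

CHECKPOINT = "my-clinical-bert"  # hypothetical clinical BERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForMaskedLM.from_pretrained(CHECKPOINT).eval()

def mlm_loss(text, mask_prob=0.15):
    encoded = tokenizer(text, truncation=True, return_tensors="pt")
    labels = encoded["input_ids"].clone()
    mask = torch.rand(labels.shape) < mask_prob
    inputs = encoded["input_ids"].masked_fill(mask, tokenizer.mask_token_id)
    labels[~mask] = -100  # only masked positions contribute to the loss
    with torch.no_grad():
        out = model(input_ids=inputs, attention_mask=encoded["attention_mask"], labels=labels)
    return out.loss.item()

def guess_member(text, threshold=2.0):
    return mlm_loss(text) < threshold  # low loss is taken as evidence of training-set membership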

Series
Northern European Association for Language Technology (NEALT), ISSN 1736-8197, E-ISSN 1736-6305 ; 52
National Category
Natural Language Processing
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-216681 (URN)
Conference
The 24th Nordic Conference on Computational Linguistics (NoDaLiDa), 2023.
Available from: 2023-04-24. Created: 2023-04-24. Last updated: 2025-11-27. Bibliographically approved
Identifiers
ORCID iD: orcid.org/0000-0001-8988-8226
