Change search
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf
A Privacy-Preserving Corpus for Occupational Health in Spanish: Evaluation for NER and Classification Tasks
Universidad de Chile.
Pontificia Universidad Católica de Chile.
Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.ORCID iD: 0000-0001-8988-8226
Universidad de Chile.
Show others and affiliations
Number of Authors: 82024 (English)In: Proceedings of the 6th Clinical Natural Language Processing Workshop, Association for Computational Linguistics , 2024, p. 111-121Conference paper, Published paper (Refereed)
Abstract [en]

Annotated corpora are essential to reliable natural language processing. While they are expensive to create, they are essential for building and evaluating systems. This study introduces a new corpus of 2,869 medical and admission reports collected by an occupational insurance and health provider. The corpus has been carefully annotated for personally identifiable information (PII) and is shared, masking this information. Two annotators adhered to annotation guidelines during the annotation process, and a referee later resolved annotation conflicts in a consolidation process to build a gold standard subcorpus. The inter-annotator agreement values, measured in F1, range between 0.86 and 0.93 depending on the selected subcorpus. The value of the corpus is demonstrated by evaluating its use for NER of PII and a classification task. The evaluations find that fine-tuned models and GPT-3.5 reach F1 of 0.911 and 0.720 in NER of PII, respectively. In the case of the insurance coverage classification task, using the original or de-identified corpus results in similar performance. The annotated data are released in de-identified form.

Place, publisher, year, edition, pages
Association for Computational Linguistics , 2024. p. 111-121
National Category
Natural Language Processing
Research subject
Computer and Systems Sciences
Identifiers
URN: urn:nbn:se:su:diva-232089OAI: oai:DiVA.org:su-232089DiVA, id: diva2:1885701
Conference
ClinicalNLP@NAACL-HLT 2024, The 6th Clinical Natural Language Processing Workshop NAACL 2024. 21 June 2024, Mexico City, Mexico.
Available from: 2024-07-24 Created: 2024-07-24 Last updated: 2025-02-07Bibliographically approved

Open Access in DiVA

No full text in DiVA

Other links

Länk till publikationen

Authority records

Vakili, Thomas

Search in DiVA

By author/editor
Vakili, Thomas
By organisation
Department of Computer and Systems Sciences
Natural Language Processing

Search outside of DiVA

GoogleGoogle Scholar

urn-nbn

Altmetric score

urn-nbn
Total: 82 hits
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf