On the Impact of the Vocabulary for Domain-Adaptive Pretraining of Clinical Language Models
Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences. ORCID iD: 0000-0001-9731-1048
2023 (English). In: Biomedical Engineering Systems and Technologies: 15th International Joint Conference, BIOSTEC 2022, Virtual Event, February 9–11, 2022, Revised Selected Papers / [ed] Ana Cecília A. Roque; Denis Gracanin; Ronny Lorenz; Athanasios Tsanas; Nathalie Bier; Ana Fred; Hugo Gamboa. Springer Nature, 2023, p. 315-332. Chapter in book (Refereed)
Abstract [en]

Pretrained language models tailored to the target domain may improve predictive performance on downstream tasks. Such domain-specific language models are typically developed by pretraining on in-domain data, either from scratch or by continuing to pretrain an existing generic language model. Here, we focus on the latter situation and study the impact of the vocabulary for domain-adaptive pretraining of clinical language models. In particular, we investigate the impact of (i) adapting the vocabulary to the target domain, (ii) using different vocabulary sizes, and (iii) creating initial representations for clinical terms not present in the general-domain vocabulary based on subword averaging. The results confirm the benefits of adapting the vocabulary of the language model to the target domain; however, downstream performance is not particularly sensitive to the choice of vocabulary size, and the benefits of subword averaging are reduced after a modest amount of domain-adaptive pretraining.
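
A minimal sketch of the subword-averaging initialization described in the abstract, assuming a Hugging Face transformers setup; the base model name and the example clinical terms are illustrative placeholders, not the configuration used in the paper:

```python
# Illustrative sketch (assumptions: generic BERT-style model, made-up clinical terms).
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "bert-base-uncased"           # assumption: any generic pretrained model
new_terms = ["hyperlipidemia", "dyspnea"]  # assumption: clinical terms missing from the vocabulary

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Record how each new term is split into existing subwords *before* extending
# the vocabulary, so their embeddings can be averaged afterwards.
subword_ids = {
    term: tokenizer.convert_tokens_to_ids(tokenizer.tokenize(term))
    for term in new_terms
}

# Add the new terms as whole tokens and grow the embedding matrix to match.
tokenizer.add_tokens(new_terms)
model.resize_token_embeddings(len(tokenizer))

# Initialize each new token's embedding as the mean of its subword embeddings.
embeddings = model.get_input_embeddings().weight
with torch.no_grad():
    for term, ids in subword_ids.items():
        new_id = tokenizer.convert_tokens_to_ids(term)
        embeddings[new_id] = embeddings[ids].mean(dim=0)
```

Domain-adaptive pretraining on in-domain clinical text would then continue from these initialized representations.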

Place, publisher, year, edition, pages
Springer Nature , 2023. p. 315-332
Series
Communications in Computer and Information Science, ISSN 1865-0929, E-ISSN 1865-0937 ; 1814
Keywords [en]
Natural language processing, Clinical language models, Domain-adaptive pretraining, Clinical text
National Category
Natural Language Processing
Research subject
Computer and Systems Sciences
Identifiers
URN: urn:nbn:se:su:diva-224979
DOI: 10.1007/978-3-031-38854-5_16
Scopus ID: 2-s2.0-85172242060
ISBN: 978-3-031-38853-8 (electronic)
OAI: oai:DiVA.org:su-224979
DiVA, id: diva2:1823858
Available from: 2024-01-03. Created: 2024-01-03. Last updated: 2025-02-07. Bibliographically approved.

Open Access in DiVA

No full text in DiVA

Other links

Publisher's full text | Scopus

Authority records

Lamproudis, Anastasios; Henriksson, Aron
