Optimizing the Dimensionality of Clinical Term Spaces for Improved Diagnosis Coding Support
2013 (English). In: Proceedings of the 4th International Louhi Workshop on Health Document Text Mining and Information Analysis (Louhi 2013) / [ed] Hanna Suominen, NICTA, 2013. Conference paper (Refereed)
In natural language processing, dimensionality reduction is a common technique for reducing complexity that simultaneously addresses the sparseness property of language. It is also used as a means to capture some latent structure in text, such as the underlying semantics. Dimensionality reduction is an important property of the word space model, not least in random indexing, where the dimensionality is a predefined model parameter. In this paper, we demonstrate the importance of dimensionality optimization and discuss correlations between dimensionality and the size of the vocabulary. This is of particular importance in the clinical domain, where the level of noise in the text leads to a large vocabulary; optimizing the dimensionality may also mitigate the effect of exploding vocabulary sizes when modeling multiword terms as single tokens. A system that automatically assigns diagnosis codes to patient record entries is shown to improve by up to 18 percentage points when the dimensionality is manually optimized.
Place, publisher, year, edition, pages
NICTA, 2013.
Keywords
distributional semantics, random indexing, semantic space, dimensionality reduction, multiword terms, diagnosis codes
Research subject
Computer and Systems Sciences
Identifiers
URN: urn:nbn:se:su:diva-97228
OAI: oai:DiVA.org:su-97228
DiVA: diva2:676272
4th International Louhi Workshop on Health Document Text Mining and Information Analysis, Sydney, NSW, Australia, 11-12 February 2013