Automatic Segmentation of Swedish Medical Words with Greek and Latin Morphemes: A Computational Morphological Analysis
Stockholm University, Faculty of Humanities, Department of Linguistics, Computational Linguistics.
2015 (English). Independent thesis, Basic level (degree of Bachelor), 10 credits / 15 HE credits. Student thesis.
Abstract [en]

The growth of raw text data online has increased the need, within natural language processing (NLP), for artificial systems capable of processing raw data efficiently and at low cost. A well-developed morphological analysis is an important cornerstone of NLP, in particular when word look-up is an important stage of processing. Morphological analysis has many advantages: it reduces the number of word forms that must be stored computationally, and it is both cost-efficient and time-efficient. NLP is relevant to the field of medicine, especially in automatic text analysis, which is a relatively young field for Swedish medical texts. Much of the stored information is highly unstructured and disorganized.

Using raw corpora, this paper aims to contribute to automatic morphological segmentation by experimenting with state-of-the-art tools for unsupervised and semi-supervised word segmentation of Swedish words in medical texts. The results show that a reasonable segmentation depends more on a high number of word types than on a particular type of corpus. The results also show that semi-supervised word segmentation, that is, the use of annotated training data, greatly increases performance.

Abstract [sv]

Rå textdata online har ökat behovet för artificiella system som klarar av att processa rå data effektivt och till en låg kostnad inom språkteknologi (NLP). En välutvecklad morfologisk analys är en viktig hörnsten inom NLP, speciellt när ordprocessning är ett viktigt steg. Morfologisk analys har många fördelar, bland annat reducerar den antalet ordformer som ska lagras teknologiskt, samt så är det kostnadseffektivt och tidseffektivt. NLP är av relevans för det medicinska ämnet, speciellt inom textanalys som är ett relativt ungt område inom svenska medicinska texter. Mycket av den lagrade informationen är väldigt ostrukturerat och oorganiserat.

Genom att använda råa korpusar ämnar denna uppsats att bidra till automatisk morfologisk segmentering genom att experimentera med de för närvarande bästa verktygen för oövervakad och semi-övervakad ordsegmentering av svenska ord i medicinska texter. Resultaten visar att en acceptabel segmentering beror mer på ett högt antal ordtyper, och inte en speciell sorts korpus. Resultaten visar också att semi-övervakad ordsegmentering, dvs. annoterad träningsdata, ökar prestandan markant.

Place, publisher, year, edition, pages
2015. 43 p.
Keyword [en]
automatic word segmentation, Swedish medical word segmentation, morpheme segmentation, morphology induction, morphological analysis, unsupervised learning, natural language processing
Keyword [sv]
automatisk ordsegmentering, svensk medicinsk ordsegmentering, morfemsegmentering, morfeminduktion, morfologisk analys, oövervakad inlärning, språkteknologi
National Category
General Language Studies and Linguistics
Identifiers
URN: urn:nbn:se:su:diva-121650
OAI: oai:DiVA.org:su-121650
DiVA: diva2:860557
Available from: 2015-10-14
Created: 2015-10-12
Last updated: 2015-10-14
Bibliographically approved

Open Access in DiVA

Automatic Segmentation of Swedish Medical Words with Greek and Latin Morphemes - A Computational Morphological Analysis (1425 kB), 121 downloads
File information
File name: FULLTEXT01.pdf
File size: 1425 kB
Checksum (SHA-512):
afca8b83f48f9e66cc9383b9cdeec6f03e30898c103c416572f58902f000d02987ff3509a0e44135714912e70695ed8adca3513aa74ec327ae5372029eadb826
Type: fulltext
Mimetype: application/pdf

Author: Lindström, Mathias

Total: 213 hits