Text Retrieval in Restricted Domains by Pairwise Term Co-occurrence
Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences. ORCID iD: 0000-0002-2803-5139
Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences. ORCID iD: 0000-0001-9731-1048
Number of authors: 2. 2024 (English). In: Complex Systems Informatics and Modeling Quarterly, E-ISSN 2255-9922, Vol. 41, pp. 80-111, article id 227. Article in journal (Refereed). Published.
Abstract [en]

Text similarity calculation by text embeddings requires fine-tuning of the language model by a large amount of labeled data, which may not be available for small text collections in their specific knowledge domains, in particular, in public organizations. As an alternative to machine learning, this research proposes pairwise term co-occurrence within plain-text matching, i.e., the query and the document share co-occurrences of two terms in a text span. In the entire document, the co-occurrences form the context that affects a term. This is analogous to a contextual word embedding, except our context affects the importance, not the meaning, of the term. Pairwise term co-occurrence has been applied in three text similarity calculation methods: term-pair-based text similarity, BM25 with term weights enhanced by pairwise term co-occurrence, and likewise enhanced cosine similarity. The three methods were evaluated for retrieval of four text types – email messages, web articles, fill-in forms, and brochures from a public organization – by having the first three as queries. Pairwise term co-occurrence performed on par with or better than BERT sentence embeddings without fine-tuning the BERT language model. With some text types, pairwise term co-occurrence outperformed bag-of-words matching by as much as 29.44 (MAP) and 31.71 (P@1) percentage points. Pairwise term co-occurrence can fill a niche by improving text similarity calculation where supervised machine learning is difficult to carry out.
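The core idea described above can be sketched in code. The snippet below is an illustrative approximation only: the span width, tokenization, and the overlap score are assumptions for demonstration, not the paper's actual parameters or scoring functions (which include BM25 and cosine variants not shown here).

```python
def term_pairs(tokens, window=5):
    """Collect unordered term pairs that co-occur within a short text span.

    `window` is an assumed span width; the paper's actual span size may differ.
    """
    pairs = set()
    for i in range(len(tokens)):
        # Pair each token with the next (window - 1) tokens following it.
        for j in range(i + 1, min(i + window, len(tokens))):
            if tokens[i] != tokens[j]:
                pairs.add(frozenset((tokens[i], tokens[j])))
    return pairs


def pair_similarity(query, document, window=5):
    """Fraction of the query's term pairs that also co-occur in the document.

    A simple overlap ratio, used here only to illustrate the matching idea.
    """
    q = term_pairs(query.lower().split(), window)
    d = term_pairs(document.lower().split(), window)
    return len(q & d) / len(q) if q else 0.0
```

For example, the query "open data portal" shares all three of its term pairs with the document "the city open data portal launched", so the overlap score is 1.0, while a document sharing no pairs scores 0.0.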

Place, publisher, year, edition, pages
2024. Vol. 41, pp. 80-111, article id 227
Keywords [en]
Term Co-occurrence, Text Similarity, Text Matching, Term Weights, Document Retrieval, BM25, Embeddings
National subject category
Natural Language Processing and Computational Linguistics
Research subject
Computer and Systems Sciences
Identifiers
URN: urn:nbn:se:su:diva-237544
DOI: 10.7250/csimq.2024-41.05
Scopus ID: 2-s2.0-85216475133
OAI: oai:DiVA.org:su-237544
DiVA id: diva2:1924915
Available from: 2025-01-07. Created: 2025-01-07. Last updated: 2025-02-25. Bibliographically approved.

Open Access in DiVA

Full text is not available in DiVA

Other links

Publisher's full text (Scopus)

Person

Sneiders, Eriks; Henriksson, Aron
