TED-MDB Lexicons: Tr-EnConnLex, Pt-EnConnLex
Stockholm University, Faculty of Humanities, Department of Linguistics. ORCID iD: 0000-0002-7020-8275
2020 (English) In: The First Workshop on Computational Approaches to Discourse, 2020. Conference paper, Published paper (Refereed)
Abstract [en]

In this work, we present two new bilingual discourse connective lexicons, namely, for Turkish-English and European Portuguese-English, created automatically using the existing discourse relation-aligned TED-MDB corpus. In their current form, the Pt-En lexicon includes 95 entries, whereas the Tr-En lexicon contains 133 entries. The lexicons constitute the first step of a larger project of developing a multilingual discourse connective lexicon.

Place, publisher, year, edition, pages
2020.
National Category
Natural Language Processing
Identifiers
URN: urn:nbn:se:su:diva-194334
DOI: 10.18653/v1/2020.codi-1.15
OAI: oai:DiVA.org:su-194334
DiVA, id: diva2:1569016
Conference
The 2020 Conference on Empirical Methods in Natural Language Processing, November 16-20, 2020
Available from: 2021-06-18 Created: 2021-06-18 Last updated: 2025-02-07 Bibliographically approved
In thesis
1. Contributions to Shallow Discourse Parsing: To English and beyond
2022 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Discourse is a coherent set of sentences where the sequential reading of the sentences yields a sense of accumulation and readers can easily follow why one sentence follows another. A text that lacks coherence will most certainly fail to communicate its intended message and leave the reader puzzled as to why the sentences are presented together. However, formally accounting for the differences between a coherent and a non-coherent text remains a challenge. Various theories propose that the semantic links that are inferred between sentences/clauses, known as discourse relations, are the building blocks of the discourse that can be connected to one another in various ways to form the discourse structure. This dissertation focuses on the former problem of discovering such discourse relations without aiming to arrive at any structure, a task known as shallow discourse parsing (SDP). Unfortunately, so far, SDP has been almost exclusively performed on the available gold annotations in English, leading to only limited insight into how the existing models would perform in a low-resource scenario potentially involving any non-English language. The main objective of the current dissertation is to address these shortcomings and help extend SDP to non-English territory. This aim is pursued through three different threads: (i) investigation of what kind of supervision is minimally required to perform SDP, (ii) construction of multilingual resources annotated at the discourse level, (iii) extension of well-known means to (SDP-wise) low-resource languages. An additional aim is to explore the feasibility of SDP as a probing task to evaluate the discourse-level understanding abilities of modern language models.

The dissertation is based on six papers grouped in three themes. The first two papers perform different subtasks of SDP through relatively understudied means. Paper I presents a simplified method to perform explicit discourse relation labeling without any feature engineering, whereas Paper II shows how implicit discourse relation recognition benefits from large amounts of unlabeled text through a novel method for distant supervision. The third and fourth papers describe two novel multilingual discourse resources, TED-MDB (Paper III) and three bilingual discourse connective lexicons (Paper IV). Notably, TED-MDB is the first parallel corpus annotated for PDTB-style discourse relations covering six non-English languages. Finally, the last two studies deal directly with multilingual discourse parsing: Paper V reports the first results in cross-lingual implicit discourse relation recognition, and Paper VI proposes a multilingual benchmark including certain discourse-level tasks that have not been explored in this context before. Overall, the dissertation allows for a more detailed understanding of what is required to extend shallow discourse parsing beyond English. The conventional aspects of traditional supervised approaches are replaced by less knowledge-intensive alternatives which, nevertheless, achieve state-of-the-art performance in their respective settings. Moreover, thanks to the introduction of TED-MDB, cross-lingual SDP is explored in a zero-shot setting for the first time. In sum, the proposed methodologies and the constructed resources are among the earliest steps towards building high-performance multilingual, or non-English monolingual, shallow discourse parsers.

Place, publisher, year, edition, pages
Stockholm: Department of Linguistics, Stockholm University, 2022. p. 130
Keywords
discourse, discourse relations, shallow discourse parsing, transfer learning, multilinguality, low-resource nlp
National Category
Natural Language Processing
Research subject
Linguistics
Identifiers
urn:nbn:se:su:diva-201508 (URN)
978-91-7911-778-8 (ISBN)
978-91-7911-779-5 (ISBN)
Public defence
2022-03-15, online via Zoom, public link is available at the department website, Stockholm, 15:00 (English)
Opponent
Supervisors
Available from: 2022-02-18 Created: 2022-01-28 Last updated: 2025-02-07 Bibliographically approved

Open Access in DiVA

No full text in DiVA

Authority records

Kurfali, Murathan
