Publications (10 of 23)
Hörberg, T., Kurfali, M. & Olofsson, J. K. (2025). Chemosensory vocabulary in wine, perfume and food product reviews: Insights from language modeling. Food Quality and Preference, 124, Article ID 105357.
2025 (English). In: Food Quality and Preference, ISSN 0950-3293, E-ISSN 1873-6343, Vol. 124, article id 105357. Article in journal (Refereed). Published.
Abstract [en]

Chemosensory sensations are often hard to describe and quantify. Language models may facilitate a systematic understanding of sensory descriptions. We accessed consumer and expert reviews of wine, perfume, and food products (English language; about 68 million words in total) and analyzed their sensory descriptions. Using a novel data-driven method based on natural language data, we compared the three chemosensory vocabularies (wine, perfume, food) with respect to their vocabulary overlap and semantic properties, and explored their semantic spaces. The three vocabularies primarily differ with respect to domain specificity, concreteness, descriptor type preference and degree of gustatory vs. olfactory association. Wine vocabulary primarily distinguishes between white wine and red wine flavors and qualities. Food vocabulary separates drinkable and edible food products and ingredients, on the one hand, and savory and non-savory products, on the other. A salient distinction in all three vocabularies is between concrete and abstract/evaluative terms. Valence also plays a role in the semantic spaces of all three vocabularies, but valence is less prominent here than in general olfactory vocabulary. Our method allows a systematic comparison of sensory descriptors in the three product domains and provides a data-driven approach to derive sensory lexicons that can be applied by sensory scientists.
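The cross-vocabulary comparison rests on standard distributional measures. As a rough illustration of two of its ingredients, vocabulary overlap and similarity in an embedding space, consider the sketch below; the descriptor sets and vectors are invented for the example, not drawn from the paper's data.

```python
from math import sqrt

def jaccard(a: set, b: set) -> float:
    """Vocabulary overlap between two descriptor sets."""
    return len(a & b) / len(a | b)

def cosine(u, v) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v)))

# Hypothetical descriptor sets; the paper derives its vocabularies
# from ~68 million words of wine, perfume, and food reviews.
wine = {"oaky", "tannic", "fruity", "sweet"}
perfume = {"musky", "floral", "fruity", "sweet"}
print(jaccard(wine, perfume))  # 0.3333...
```

In the paper, descriptors are compared via embeddings learned from the review corpora; the toy sets above only show the shape of the overlap computation.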

Keywords
Consumer reviews, Cross-domain comparison, Machine learning, Natural language processing, Semantic analysis, Sensory vocabulary
National Category
Comparative Language Studies and Linguistics; Food Science
Identifiers
urn:nbn:se:su:diva-241541 (URN), 10.1016/j.foodqual.2024.105357 (DOI), 001354909000001, 2-s2.0-85208399146 (Scopus ID)
Available from: 2025-04-01 Created: 2025-04-01 Last updated: 2025-04-01. Bibliographically approved.
Masciolini, A., Caines, A., De Clercq, O., Kruijsbergen, J., Kurfalı, M., Sánchez, R. M., . . . Zesch, T. (2025). Towards better language representation in Natural Language Processing: A multilingual dataset for text-level Grammatical Error Correction. International Journal of Learner Corpus Research
2025 (English). In: International Journal of Learner Corpus Research, ISSN 2215-1478, E-ISSN 2215-1486. Article in journal (Refereed). Epub ahead of print.
Abstract [en]

This paper introduces MultiGEC, a dataset for multilingual Grammatical Error Correction (GEC) in twelve European languages: Czech, English, Estonian, German, Greek, Icelandic, Italian, Latvian, Russian, Slovene, Swedish and Ukrainian. MultiGEC distinguishes itself from previous GEC datasets in that it covers several underrepresented languages, which we argue should be included in resources used to train models for Natural Language Processing tasks which, like GEC itself, have implications for Learner Corpus Research and Second Language Acquisition. Aside from multilingualism, the novelty of the MultiGEC dataset is that it consists of full texts (typically learner essays) rather than individual sentences, making it possible to train systems that take a broader context into account. The dataset was built for MultiGEC-2025, the first shared task in multilingual text-level GEC, but it remains accessible after the task's competitive phase, serving as a resource for training new error correction systems and performing cross-lingual GEC studies.

Keywords
grammatical error correction, learner corpora, Matthew effect, MultiGEC shared task, multilingual corpora
National Category
Natural Language Processing
Identifiers
urn:nbn:se:su:diva-243066 (URN), 10.1075/ijlcr.24033.mas (DOI), 2-s2.0-105003035015 (Scopus ID)
Available from: 2025-05-09 Created: 2025-05-09 Last updated: 2025-05-09
Erolcan Er, M., Kurfali, M. & Zeyrek, D. (2024). Lightweight Connective Detection Using Gradient Boosting. In: ISA 2024: 20th Joint ACL - ISO Workshop on Interoperable Semantic Annotation at LREC-COLING 2024, Workshop Proceedings: (pp. 53-59). European Language Resources Association
2024 (English). In: ISA 2024: 20th Joint ACL - ISO Workshop on Interoperable Semantic Annotation at LREC-COLING 2024, Workshop Proceedings, European Language Resources Association, 2024, p. 53-59. Conference paper, Published paper (Refereed).
Abstract [en]

In this work, we introduce a lightweight discourse connective detection system. Employing gradient boosting trained on simple, low-complexity features, the proposed approach sidesteps the computational demands of current approaches that rely on deep neural networks. Despite its simplicity, our approach achieves competitive results while offering significant speed gains, even on CPU. Furthermore, its stable performance across two unrelated languages suggests that the system is robust in multilingual scenarios. The model is designed to support the annotation of discourse relations, particularly in low-resource settings, while minimizing performance loss.
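To give a flavor of what "simple, low-complexity features" for token-level connective detection can look like, here is an illustrative feature extractor; the paper's exact feature set is not reproduced, and everything below is an assumption made for the example.

```python
def token_features(tokens, i):
    """Cheap surface features for one token: an illustrative guess at the
    kind of low-complexity, language-agnostic features the paper describes."""
    w = tokens[i]
    return {
        "word": w.lower(),
        "is_title": w.istitle(),
        "is_first": i == 0,
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>",
    }

print(token_features(["However", ",", "it", "rained"], 0))
# {'word': 'however', 'is_title': True, 'is_first': True, 'prev': '<s>', 'next': ','}
```

Each token's feature dict would then be vectorized (e.g. one-hot encoded) and fed to a gradient-boosting classifier such as scikit-learn's GradientBoostingClassifier or XGBoost; such models train and predict quickly on CPU, which is the efficiency argument the abstract makes.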

Place, publisher, year, edition, pages
European Language Resources Association, 2024
Series
ISA 2024: 20th Joint ACL - ISO Workshop on Interoperable Semantic Annotation at LREC-COLING 2024, Workshop Proceedings
Keywords
Discourse Connectives, Gradient Boosting, linguistically-informed features
National Category
Embedded Systems
Identifiers
urn:nbn:se:su:diva-236097 (URN), 2-s2.0-85195188126 (Scopus ID), 9782493814326 (ISBN)
Available from: 2024-12-02 Created: 2024-12-02 Last updated: 2024-12-02. Bibliographically approved.
Östling, R. & Kurfali, M. (2023). Language Embeddings Sometimes Contain Typological Generalizations. Computational linguistics - Association for Computational Linguistics (Print), 49(4), 1003-1051
2023 (English). In: Computational linguistics - Association for Computational Linguistics (Print), ISSN 0891-2017, E-ISSN 1530-9312, Vol. 49, no 4, p. 1003-1051. Article in journal (Refereed). Published.
Abstract [en]

To what extent can neural network models learn generalizations about language structure, and how do we find out what they have learned? We explore these questions by training neural models for a range of natural language processing tasks on a massively multilingual dataset of Bible translations in 1,295 languages. The learned language representations are then compared to existing typological databases as well as to a novel set of quantitative syntactic and morphological features obtained through annotation projection. We conclude that some generalizations are surprisingly close to traditional features from linguistic typology, but that most of our models, as well as those of previous work, do not appear to have made linguistically meaningful generalizations. Careful attention to details in the evaluation turns out to be essential to avoid false positives. Furthermore, to encourage continued work in this field, we release several resources covering most or all of the languages in our data: (1) multiple sets of language representations, (2) multilingual word embeddings, (3) projected and predicted syntactic and morphological features, (4) software to provide linguistically sound evaluations of language representations.

Keywords
computational typology, language models, multilingual neural models, multilingual NLP, linguistic typology
National Category
General Language Studies and Linguistics; Natural Language Processing
Research subject
Computational Linguistics
Identifiers
urn:nbn:se:su:diva-226219 (URN), 10.1162/coli_a_00491 (DOI), 001152974700001, 2-s2.0-85175520799 (Scopus ID)
Funder
Swedish Research Council, 2019-04129
Available from: 2024-02-02 Created: 2024-02-02 Last updated: 2025-02-01. Bibliographically approved.
Buchanan, E. M., Jernsäther, T., Koptjevskaja-Tamm, M., Kurfalı, M., Nilsonne, G., Olofsson, J. K. & Primbs, M. A. (2023). The Psychological Science Accelerator’s COVID-19 rapid-response dataset. Scientific Data, 10, Article ID 87.
2023 (English). In: Scientific Data, E-ISSN 2052-4463, Vol. 10, article id 87. Article in journal (Refereed). Published.
Abstract [en]

In response to the COVID-19 pandemic, the Psychological Science Accelerator coordinated three large-scale psychological studies to examine the effects of loss-gain framing, cognitive reappraisals, and autonomy framing manipulations on behavioral intentions and affective measures. The data collected (April to October 2020) included specific measures for each experimental study, a general questionnaire examining health prevention behaviors and COVID-19 experience, geographical and cultural context characterization, and demographic information for each participant. Each participant started the study with the same general questions and then was randomized to complete either one longer experiment or two shorter experiments. Data were provided by 73,223 participants with varying completion rates. Participants completed the survey from 111 geopolitical regions in 44 unique languages/dialects. The anonymized dataset described here is provided in both raw and processed formats to facilitate re-use and further analyses. The dataset offers secondary analytic opportunities to explore coping, framing, and self-determination across a diverse, global sample obtained at the onset of the COVID-19 pandemic, which can be merged with other time-sampled or geographic data. 

Place, publisher, year, edition, pages
Springer Nature, 2023
Keywords
Covid-19, Psychological Science Accelerator, loss-gain framing, cognitive reappraisals, autonomy framing manipulations, affective measures, geopolitical
National Category
Psychology (excluding Applied Psychology)
Research subject
Psychology
Identifiers
urn:nbn:se:su:diva-220588 (URN), 10.1038/s41597-022-01811-7 (DOI), 000981838600002, 36774440 (PubMedID), 2-s2.0-85147834966 (Scopus ID)
Available from: 2023-08-31 Created: 2023-08-31 Last updated: 2024-01-11. Bibliographically approved.
Kutlu, F., Zeyrek, D. & Kurfali, M. (2023). Toward a shallow discourse parser for Turkish. Natural Language Engineering
2023 (English). In: Natural Language Engineering, ISSN 1351-3249, E-ISSN 1469-8110. Article in journal (Refereed). Epub ahead of print.
Abstract [en]

One of the most interesting aspects of natural language is how texts cohere, which involves the pragmatic or semantic relations that hold between clauses (addition, cause-effect, conditional, similarity), referred to as discourse relations. Identifying and classifying discourse relations is an imperative challenge for tasks such as text summarization, dialogue systems, and machine translation that need information above the clause level. Despite the recent interest in discourse relations in well-known languages such as English, data and experiments are still needed for typologically different and less-resourced languages. We report the most comprehensive investigation of shallow discourse parsing in Turkish, focusing on two main sub-tasks: identification of discourse relation realization types and sense classification of explicit and implicit relations. The work is based on fine-tuning a pre-trained language model (BERT) as an encoder and classifying the encoded data with neural network-based classifiers. We first identify the discourse relation realization type that holds in a given text, if any, and then move on to sense classification of the identified explicit and implicit relations. In addition to in-domain experiments on a held-out test set from the Turkish Discourse Bank (TDB 1.2), we also report the out-of-domain performance of our models on the Turkish part of the TED Multilingual Discourse Bank in order to evaluate their generalization abilities. Finally, we explore the effect of multilingual data aggregation on the classification of relation realization type through a cross-lingual experiment. The results suggest that our models perform relatively well despite the limited size of the TDB 1.2 and that there are language-specific aspects of detecting the types of discourse relation realization. We believe the findings are important both for understanding how modern language models perform in a typologically different language and in the low-resource scenario, given that the TDB 1.2 is one twentieth the size of the Penn Discourse TreeBank in terms of total relations.
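The two-stage design described in the abstract, first identify the realization type, then route the relation to the appropriate sense classifier, can be sketched as a simple dispatch pipeline. The stand-in classifiers below are trivial placeholders for the paper's BERT-encoder-plus-neural-classifier models; all names and heuristics here are illustrative only.

```python
def parse_relation(text, type_clf, explicit_sense_clf, implicit_sense_clf):
    """Two-stage shallow parsing: identify the realization type,
    then classify the sense of explicit and implicit relations.
    Each classifier is injected as a callable."""
    rel_type = type_clf(text)
    if rel_type == "Explicit":
        return rel_type, explicit_sense_clf(text)
    if rel_type == "Implicit":
        return rel_type, implicit_sense_clf(text)
    return rel_type, None  # other realization types are not sense-classified here

# Toy stand-in for the type classifier ("ama" is "but" in Turkish).
type_clf = lambda t: "Explicit" if " ama " in t else "Implicit"
print(parse_relation("Geldi ama kalmadı.", type_clf,
                     lambda t: "Comparison.Contrast", lambda t: "Expansion"))
# ('Explicit', 'Comparison.Contrast')
```

In the actual system each stage is a fine-tuned BERT encoder with a neural classifier head; the sketch only shows how the two sub-tasks compose.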

Keywords
Discourse relation, Classification, Pre-trained language model, Encoding, Cross-lingual transfer learning
National Category
Computer and Information Sciences; Languages and Literature
Identifiers
urn:nbn:se:su:diva-221313 (URN), 10.1017/S1351324923000359 (DOI), 001046011700001, 2-s2.0-85171751589 (Scopus ID)
Available from: 2023-09-19 Created: 2023-09-19 Last updated: 2023-10-06
Kurfalı, M. (2022). Contributions to Shallow Discourse Parsing: To English and beyond. (Doctoral dissertation). Stockholm: Department of Linguistics, Stockholm University
2022 (English). Doctoral thesis, comprehensive summary (Other academic).
Abstract [en]

Discourse is a coherent set of sentences where the sequential reading of the sentences yields a sense of accumulation and readers can easily follow why one sentence follows another. A text that lacks coherence will most certainly fail to communicate its intended message and leave the reader puzzled as to why the sentences are presented together. However, formally accounting for the differences between a coherent and a non-coherent text still remains a challenge. Various theories propose that the semantic links that are inferred between sentences/clauses, known as discourse relations, are the building blocks of discourse and can be connected to one another in various ways to form the discourse structure. This dissertation focuses on the former problem of discovering such discourse relations without aiming to arrive at any structure, a task known as shallow discourse parsing (SDP). Unfortunately, so far, SDP has been performed almost exclusively on the available gold annotations in English, providing only limited insight into how existing models would perform in a low-resource scenario potentially involving any non-English language. The main objective of the current dissertation is to address these shortcomings and help extend SDP to non-English territory. This aim is pursued through three threads: (i) investigating what kind of supervision is minimally required to perform SDP, (ii) constructing multilingual resources annotated at the discourse level, and (iii) extending well-known methods to (SDP-wise) low-resource languages. An additional aim is to explore the feasibility of SDP as a probing task for evaluating the discourse-level understanding abilities of modern language models.

The dissertation is based on six papers grouped into three themes. The first two papers perform different subtasks of SDP through relatively understudied means. Paper I presents a simplified method to perform explicit discourse relation labeling without any feature engineering, whereas Paper II shows how implicit discourse relation recognition benefits from large amounts of unlabeled text through a novel method for distant supervision. The third and fourth papers describe two novel multilingual discourse resources, TED-MDB (Paper III) and three bilingual discourse connective lexicons (Paper IV). Notably, TED-MDB is the first parallel corpus annotated for PDTB-style discourse relations covering six non-English languages. Finally, the last two studies directly deal with multilingual discourse parsing: Paper V reports the first results in cross-lingual implicit discourse relation recognition, and Paper VI proposes a multilingual benchmark including certain discourse-level tasks that have not been explored in this context before. Overall, the dissertation allows for a more detailed understanding of what is required to extend shallow discourse parsing beyond English. The conventional aspects of traditional supervised approaches are replaced in favor of less knowledge-intensive alternatives which, nevertheless, achieve state-of-the-art performance in their respective settings. Moreover, thanks to the introduction of TED-MDB, cross-lingual SDP is explored in a zero-shot setting for the first time. In sum, the proposed methodologies and the constructed resources are among the earliest steps towards building high-performance multilingual, or non-English monolingual, shallow discourse parsers.

Place, publisher, year, edition, pages
Stockholm: Department of Linguistics, Stockholm University, 2022. p. 130
Keywords
discourse, discourse relations, shallow discourse parsing, transfer learning, multilinguality, low-resource nlp
National Category
Natural Language Processing
Research subject
Linguistics
Identifiers
urn:nbn:se:su:diva-201508 (URN), 978-91-7911-778-8 (ISBN), 978-91-7911-779-5 (ISBN)
Public defence
2022-03-15, online via Zoom, public link is available at the department website, Stockholm, 15:00 (English)
Available from: 2022-02-18 Created: 2022-01-28 Last updated: 2025-02-07. Bibliographically approved.
Özer, S., Kurfalı, M., Zeyrek, D., Mendes, A. & Valūnaitė Oleškevičienė, G. (2022). Linking discourse-level information and the induction of bilingual discourse connective lexicons. Semantic Web, 13(6), 1081-1102
2022 (English). In: Semantic Web, ISSN 1570-0844, E-ISSN 2210-4968, Vol. 13, no 6, p. 1081-1102. Article in journal (Refereed). Published.
Abstract [en]

The single biggest obstacle to comprehensive cross-lingual discourse analysis is the scarcity of multilingual resources. Existing resources are overwhelmingly monolingual, compelling researchers to infer discourse-level information in the target languages through error-prone automatic means. The current paper aims to provide more direct insight into cross-lingual variation in discourse structures by linking the annotated relations of the TED-Multilingual Discourse Bank, which consists of six TED talks independently annotated in seven different languages. It is shown that the linguistic labels over the relations annotated in the texts of these languages can be automatically linked with English with high accuracy, as verified against the relations of three diverse languages that were semi-automatically linked with relations over English texts. The resulting corpus has great potential to reveal divergences in local discourse relations, as well as to lead to new resources, as exemplified by the induction of bilingual discourse connective lexicons.

Keywords
Discourse relations, discourse connectives, discourse connective lexicons, linking discourse relations, parallel corpus
National Category
Languages and Literature
Identifiers
urn:nbn:se:su:diva-210633 (URN), 10.3233/SW-223011 (DOI), 000862910800007
Available from: 2022-10-26 Created: 2022-10-26 Last updated: 2022-10-26. Bibliographically approved.
Wang, K., Miller, J. K., Grzech, K., Nilsonne, G., Kurfalı, M., Koptjevskaja-Tamm, M., . . . Moshontz, H. (2021). A multi-country test of brief reappraisal interventions on emotions during the COVID-19 pandemic. Nature Human Behaviour, 5(8), 1089-1110
2021 (English). In: Nature Human Behaviour, E-ISSN 2397-3374, Vol. 5, no 8, p. 1089-1110. Article in journal (Refereed). Published.
Abstract [en]

The COVID-19 pandemic has increased negative emotions and decreased positive emotions globally. Left unchecked, these emotional changes might have a wide array of adverse impacts. To reduce negative emotions and increase positive emotions, we tested the effectiveness of reappraisal, an emotion-regulation strategy that modifies how one thinks about a situation. Participants from 87 countries and regions (n = 21,644) were randomly assigned to one of two brief reappraisal interventions (reconstrual or repurposing) or one of two control conditions (active or passive). Results revealed that both reappraisal interventions (versus both control conditions) consistently reduced negative emotions and increased positive emotions across different measures. Reconstrual and repurposing interventions had similar effects. Importantly, planned exploratory analyses indicated that reappraisal interventions did not reduce intentions to practice preventive health behaviours. The findings demonstrate the viability of creating scalable, low-cost interventions for use around the world.

National Category
Public Health, Global Health and Social Medicine
Research subject
Psychology
Identifiers
urn:nbn:se:su:diva-196884 (URN), 10.1038/s41562-021-01173-x (DOI), 000680374200002, 34341554 (PubMedID), 2-s2.0-85111795195 (Scopus ID)
Available from: 2021-09-21 Created: 2021-09-21 Last updated: 2025-02-20. Bibliographically approved.
Kurfali, M. & Östling, R. (2021). Let’s be explicit about that: Distant supervision for implicit discourse relation classification via connective prediction. Paper presented at The Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, Bangkok, Thailand, August 1-6, 2021.
2021 (English). Conference paper, Oral presentation with published abstract (Refereed).
Abstract [en]

In implicit discourse relation classification, we want to predict the relation between adjacent sentences in the absence of any overt discourse connectives. This is challenging even for humans, leading to a shortage of annotated data, a fact that makes the task even more difficult for supervised machine learning approaches. In the current study, we perform implicit discourse relation classification without relying on any labeled implicit relations. We sidestep the lack of data through explicitation of implicit relations, reducing the task to two sub-problems: language modeling and explicit discourse relation classification, a much easier problem. Our experimental results show that this method can even marginally outperform the state of the art, in spite of being much simpler than alternative models of comparable performance. Moreover, zero-shot experiments on a completely different domain suggest that the achieved performance is robust across domains. This indicates that recent advances in language modeling have made language models sufficiently good at capturing inter-sentence relations without the help of explicit discourse markers.
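The reduction the abstract describes, turning implicit relation classification into connective prediction plus explicit classification, can be sketched as below. The connective-to-sense mapping is a small illustrative subset of a PDTB-style inventory, and the keyword-based `toy_lm` is a stand-in for a real language model that predicts the most plausible connective between the two arguments.

```python
# Illustrative PDTB-style connective -> sense mapping (not the paper's full inventory).
CONNECTIVE_SENSE = {
    "because": "Contingency.Cause",
    "however": "Comparison.Contrast",
    "then": "Temporal.Asynchronous",
    "for example": "Expansion.Instantiation",
}

def classify_implicit(arg1, arg2, predict_connective):
    """Reduce implicit relation classification to connective prediction:
    ask a model to fill in the connective, then map it to a sense."""
    conn = predict_connective(f"{arg1} <mask> {arg2}")
    # Hypothetical fallback sense for connectives outside the mapping.
    return CONNECTIVE_SENSE.get(conn, "Expansion.Conjunction")

# Toy stand-in for the language model: keyword heuristics, not real prediction.
def toy_lm(text):
    return "because" if "rain" in text else "however"

print(classify_implicit("The match was cancelled.", "It was raining.", toy_lm))
# Contingency.Cause
```

In the actual method the connective is predicted by a language model trained on explicit relations, which is what makes the approach a form of distant supervision: no labeled implicit relations are needed.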

National Category
Natural Language Processing
Research subject
Computational Linguistics
Identifiers
urn:nbn:se:su:diva-201395 (URN)
Conference
The Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, Bangkok, Thailand, August 1-6, 2021
Available from: 2022-01-25 Created: 2022-01-25 Last updated: 2025-02-07. Bibliographically approved.
Identifiers
ORCID iD: orcid.org/0000-0002-7020-8275
