Evaluating the Reliability of Self-Explanations in Large Language Models
Randl, Korbinian Robert (Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences. ORCID iD: 0000-0002-7938-2747)
Pavlopoulos, Ioannis (Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences. ORCID iD: 0000-0001-9188-7425)
Henriksson, Aron (Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences. ORCID iD: 0000-0001-9731-1048)
Lindgren, Tony (Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences. ORCID iD: 0000-0001-7713-1381)
Number of Authors: 4
2025 (English)
In: Discovery Science: 27th International Conference, DS 2024, Pisa, Italy, October 14–16, 2024, Proceedings, Part I / [ed] Dino Pedreschi; Anna Monreale; Riccardo Guidotti; Roberto Pellungrini; Francesca Naretto, Springer Publishing Company, 2025, p. 36-51
Conference paper, Published paper (Refereed)
Abstract [en]

This paper investigates the reliability of explanations generated by large language models (LLMs) when prompted to explain their previous output. We evaluate two kinds of such self-explanations, extractive and counterfactual, using three state-of-the-art LLMs (2B to 8B parameters) on two different classification tasks (objective and subjective).

Our findings reveal that, while these self-explanations can correlate with human judgement, they do not fully and accurately follow the model's decision process, indicating a gap between perceived and actual model reasoning.

We show that this gap can be bridged: prompting LLMs for counterfactual explanations can produce faithful, informative, and easy-to-verify results. These counterfactuals offer a promising alternative to traditional explainability methods (e.g., SHAP, LIME), provided that prompts are tailored to specific tasks and checked for validity.
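The validity check mentioned in the abstract can be sketched in code. The following is an illustrative example, not the authors' implementation: a counterfactual edit counts as valid only if re-classifying the edited input actually flips the predicted label. The `classify` function here is a toy keyword-based stand-in for an LLM classifier, and all names are hypothetical.

```python
# Illustrative sketch of counterfactual validity checking (not the paper's code).
# A counterfactual explanation proposes an edited input that should change
# the model's prediction; it is "valid" only if the label really flips.

def classify(text: str) -> str:
    """Toy sentiment classifier standing in for an LLM-based classifier."""
    negative_cues = {"bad", "awful", "boring"}
    words = set(text.lower().split())
    return "negative" if words & negative_cues else "positive"

def is_valid_counterfactual(original: str, counterfactual: str) -> bool:
    """A counterfactual is valid only if it changes the predicted label."""
    return classify(original) != classify(counterfactual)

# A model-proposed edit that flips the label is valid...
assert is_valid_counterfactual("the movie was boring",
                               "the movie was thrilling")
# ...while an edit that leaves the label unchanged is not.
assert not is_valid_counterfactual("the movie was boring",
                                   "the film was boring")
```

In practice the same check applies unchanged when `classify` wraps a real LLM: generate the counterfactual by prompting, then re-run classification on the edited text before accepting the explanation.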

Place, publisher, year, edition, pages
Springer Publishing Company, 2025. p. 36-51
Series
Lecture Notes in Computer Science (LNCS), ISSN 0302-9743, E-ISSN 1611-3349 ; 15243
Keywords [en]
Large Language Models, Self-Explanations, Counterfactuals
National Category
Computer Sciences
Research subject
Computer and Systems Sciences
Identifiers
URN: urn:nbn:se:su:diva-239126
DOI: 10.1007/978-3-031-78977-9_3
Scopus ID: 2-s2.0-85218499264
ISBN: 978-3-031-78976-2 (print)
ISBN: 978-3-031-78977-9 (electronic)
OAI: oai:DiVA.org:su-239126
DiVA, id: diva2:1935151
Conference
Discovery Science, 27th International Conference, DS 2024, 14-16 October 2024, Pisa, Italy.
Available from: 2025-02-06 Created: 2025-02-06 Last updated: 2025-04-09
Bibliographically approved

Open Access in DiVA

No full text in DiVA

Other links

Publisher's full text
Scopus

Authority records

Randl, Korbinian Robert; Pavlopoulos, Ioannis; Henriksson, Aron; Lindgren, Tony

