Mind the gap: from plausible to valid self-explanations in large language models
Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences. ORCID iD: 0000-0002-7938-2747
Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences. ORCID iD: 0000-0001-9731-1048
Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences. ORCID iD: 0000-0001-7713-1381
Number of Authors: 4. 2025 (English). In: Machine Learning, ISSN 0885-6125, E-ISSN 1573-0565, Vol. 114, no 10, article id 220. Article in journal (Refereed). Published.
Abstract [en]

This paper investigates the reliability of explanations generated by large language models (LLMs) when prompted to explain their previous output. We evaluate two kinds of such self-explanations (SE), extractive and counterfactual, using state-of-the-art LLMs (1B to 70B parameters) on three different classification tasks (both objective and subjective). In line with Agarwal et al. (Faithfulness versus plausibility: On the (Un)reliability of explanations from large language models. 2024. https://doi.org/10.48550/arXiv.2402.04614), our findings indicate a gap between perceived and actual model reasoning: while SE largely correlate with human judgment (i.e. are plausible), they do not fully and accurately follow the model’s decision process (i.e. are not faithful). Additionally, we show that counterfactual SE are not even necessarily valid in the sense of actually changing the LLM’s prediction. Our results suggest that extractive SE provide the LLM’s “guess” at an explanation based on training data. Conversely, counterfactual SE can help understand the LLM’s reasoning: we show that the issue of validity can be resolved by sampling counterfactual candidates at high temperature, followed by a validity check, and we introduce a formula to estimate the number of tries needed to generate valid explanations. This simple method produces plausible and valid explanations and is, on average in our experiments, 16 times faster than SHAP.
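The abstract describes the method only at a high level: sample counterfactual candidates at high temperature, keep those that actually flip the model's prediction, and estimate how many tries this takes. The paper's exact formula is not reproduced in this record; the sketch below illustrates the general idea under a simple assumption that each candidate is independently valid with probability p (so the number of tries is geometrically distributed). The callables `generate_counterfactual` and `classify` are hypothetical placeholders for LLM prompting and classification, not functions from the paper.

```python
import math


def expected_tries(p_valid: float) -> float:
    """Expected number of samples until a valid counterfactual, assuming each
    candidate is independently valid with probability p_valid (a geometric-
    distribution assumption, not necessarily the paper's exact formula)."""
    return 1.0 / p_valid


def tries_for_confidence(p_valid: float, confidence: float = 0.95) -> int:
    """Smallest n such that at least one of n independent candidates is valid
    with the requested confidence: 1 - (1 - p_valid)**n >= confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_valid))


def sample_valid_counterfactual(text, label, generate_counterfactual, classify,
                                max_tries: int = 20):
    """Rejection-sampling loop: draw high-temperature counterfactual candidates
    and return the first one that changes the model's prediction (the validity
    check). Returns None if no valid candidate is found within the budget."""
    for _ in range(max_tries):
        candidate = generate_counterfactual(text, temperature=1.0)  # high temperature
        if classify(candidate) != label:  # validity check: prediction flipped
            return candidate
    return None
```

Under this assumption, a per-example validity rate of p = 0.5 would imply roughly 2 tries on average and about 5 tries to reach 95% confidence of obtaining a valid counterfactual.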

Place, publisher, year, edition, pages
2025. Vol. 114, no 10, article id 220
Keywords [en]
Attention-based explainability, Counterfactuals, Gradient-based explainability, Interpretability, Large language models (LLMs), Self-explanations
National Category
Natural Language Processing
Identifiers
URN: urn:nbn:se:su:diva-246656
DOI: 10.1007/s10994-025-06838-6
ISI: 001563123000001
Scopus ID: 2-s2.0-105014633582
OAI: oai:DiVA.org:su-246656
DiVA, id: diva2:1996328
Available from: 2025-09-09. Created: 2025-09-09. Last updated: 2025-10-06. Bibliographically approved.

Open Access in DiVA

No full text in DiVA

Other links

Publisher's full text
Scopus

Authority records

Randl, Korbinian Robert; Henriksson, Aron; Lindgren, Tony
