Policy Control with Delayed, Aggregate, and Anonymous Feedback
Chaliane Junior, Guilherme Dinis (Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences)
Magnússon, Sindri (Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences. ORCID iD: 0000-0002-6617-8683)
Hollmén, Jaakko (Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences. ORCID iD: 0000-0002-1912-712x)
Number of authors: 3
2024 (English)
In: Machine Learning and Knowledge Discovery in Databases. Research Track: European Conference, ECML PKDD 2024, Vilnius, Lithuania, September 9–13, 2024, Proceedings, Part VI / [ed] Albert Bifet, Jesse Davis, Tomas Krilavičius, Meelis Kull, Eirini Ntoutsi, Indrė Žliobaitė, Springer Nature, 2024, pp. 389-406
Conference paper, Published paper (Refereed)
Abstract [en]

Reinforcement learning algorithms depend on observing rewards for the actions taken. The assumption of fully observable rewards, however, can be infeasible in certain scenarios, due to either cost or the nature of the problem. Of specific interest here is the challenge of learning a policy when rewards are delayed, aggregated, and anonymous (DAAF), a problem that has been addressed in the bandits literature and, to the best of our knowledge, to a lesser extent in the more general reinforcement learning (RL) setting. We introduce a novel formulation that mirrors scenarios encountered in real-world applications, characterized by intermittent and aggregated reward observations. To address these constraints, we develop four new algorithms: the first employs least squares for true reward estimation; the second and third adapt Q-learning and SARSA, respectively, to our setting; and the fourth leverages a policy-with-options framework. Through a thorough and methodical experimental analysis, we compare these methodologies, demonstrating that three of them can approximate policies nearly as effectively as those derived from complete-information scenarios, with only minimal performance degradation due to the informational constraints.
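To make the least-squares reward-estimation idea from the abstract concrete, here is a minimal sketch (not the authors' implementation). It assumes aggregate, anonymous feedback arrives as the sum of rewards over a segment of steps, and that per-segment visit counts of each state-action pair are logged; per-pair mean rewards can then be recovered as a linear least-squares solution. The function name and toy data are hypothetical.

```python
import numpy as np

def estimate_rewards(counts, aggregate_rewards):
    """Least-squares estimate of per-(state, action) mean rewards.

    counts: (num_segments, num_pairs) visit counts of each state-action
        pair within each feedback segment.
    aggregate_rewards: (num_segments,) observed reward sum per segment.
    Solves counts @ r ~= aggregate_rewards in the least-squares sense.
    """
    r_hat, *_ = np.linalg.lstsq(counts, aggregate_rewards, rcond=None)
    return r_hat

# Toy example: 3 segments, 2 state-action pairs, true mean rewards (1.0, -0.5).
counts = np.array([[4.0, 1.0], [2.0, 3.0], [5.0, 0.0]])
true_r = np.array([1.0, -0.5])
agg = counts @ true_r                  # noiseless aggregate, anonymous feedback
print(estimate_rewards(counts, agg))   # -> approximately [ 1.  -0.5]
```

With noisy feedback the same solver returns the maximum-likelihood estimate under Gaussian noise, provided the count matrix has full column rank, i.e. the segments jointly cover all state-action pairs.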

Place, publisher, year, edition, pages
Springer Nature, 2024. pp. 389-406
Series
Lecture Notes in Computer Science (LNCS), ISSN 0302-9743, E-ISSN 1611-3349
Research subject
Computer and Systems Sciences
Identifiers
URN: urn:nbn:se:su:diva-237094
DOI: 10.1007/978-3-031-70365-2_23
ISI: 001330395900023
Scopus ID: 2-s2.0-85203879812
ISBN: 978-3-031-70364-5 (print)
ISBN: 978-3-031-70365-2 (digital)
OAI: oai:DiVA.org:su-237094
DiVA, id: diva2:1920194
Conference
Machine Learning and Knowledge Discovery in Databases. Research Track: European Conference, ECML PKDD 2024, Vilnius, Lithuania, September 9–13, 2024.
Available from: 2024-12-10 Created: 2024-12-10 Last updated: 2025-02-06 Bibliographically approved

Open Access in DiVA

Full text not available in DiVA

Other links

Publisher's full text
Scopus
