Matches in SemOpenAlex for { <https://semopenalex.org/work/W4301369075> ?p ?o ?g. }
Showing items 1 to 71 of 71 with 100 items per page.
- W4301369075 abstract "Policy optimization methods are one of the most widely used classes of Reinforcement Learning (RL) algorithms. Yet, so far, such methods have been mostly analyzed from an optimization perspective, without addressing the problem of exploration, or by making strong assumptions on the interaction with the environment. In this paper we consider model-based RL in the tabular finite-horizon MDP setting with unknown transitions and bandit feedback. For this setting, we propose an optimistic trust region policy optimization (TRPO) algorithm for which we establish $\tilde O(\sqrt{S^2 A H^4 K})$ regret for stochastic rewards. Furthermore, we prove $\tilde O(\sqrt{S^2 A H^4}\, K^{2/3})$ regret for adversarial rewards. Interestingly, this result matches previous bounds derived for the bandit feedback case, yet with known transitions. To the best of our knowledge, the two results are the first sub-linear regret bounds obtained for policy optimization algorithms with unknown transitions and bandit feedback." @default.
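To illustrate why both bounds in the abstract count as sub-linear, the sketch below evaluates the two regret orders with logarithmic factors and constants dropped. This is an assumption: the $\tilde O$ notation hides those terms, so the numbers only indicate scaling in the number of episodes K, not actual regret values. The function names and the toy sizes for S (states), A (actions) and H (horizon) are hypothetical and not taken from the paper.

```python
import math


def stochastic_regret_bound(S, A, H, K):
    """Order of the stochastic-reward bound sqrt(S^2 A H^4 K).

    Assumption: log factors and constants hidden by O-tilde are dropped.
    """
    return math.sqrt(S**2 * A * H**4 * K)


def adversarial_regret_bound(S, A, H, K):
    """Order of the adversarial-reward bound sqrt(S^2 A H^4) * K^(2/3).

    Assumption: log factors and constants hidden by O-tilde are dropped.
    """
    return math.sqrt(S**2 * A * H**4) * K ** (2 / 3)


def average_regret(bound_fn, S, A, H, K):
    """Bound divided by the number of episodes K.

    A regret bound is sub-linear in K exactly when this ratio tends to 0.
    """
    return bound_fn(S, A, H, K) / K


if __name__ == "__main__":
    S, A, H = 10, 4, 5  # hypothetical toy MDP sizes, for illustration only
    for K in (10**3, 10**6, 10**9):
        stoch = average_regret(stochastic_regret_bound, S, A, H, K)
        adv = average_regret(adversarial_regret_bound, S, A, H, K)
        print(f"K={K:>10}  avg stochastic={stoch:.4g}  avg adversarial={adv:.4g}")
```

The per-episode average shrinks like $K^{-1/2}$ in the stochastic case and like $K^{-1/3}$ in the adversarial case, so both vanish as K grows, with the adversarial bound decaying more slowly.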
- W4301369075 created "2022-10-05" @default.
- W4301369075 creator A5036260775 @default.
- W4301369075 creator A5049062714 @default.
- W4301369075 creator A5090839043 @default.
- W4301369075 creator A5090891199 @default.
- W4301369075 date "2020-02-19" @default.
- W4301369075 modified "2023-09-27" @default.
- W4301369075 title "Optimistic Policy Optimization with Bandit Feedback" @default.
- W4301369075 doi "https://doi.org/10.48550/arxiv.2002.08243" @default.
- W4301369075 hasPublicationYear "2020" @default.
- W4301369075 type Work @default.
- W4301369075 citedByCount "0" @default.
- W4301369075 crossrefType "posted-content" @default.
- W4301369075 hasAuthorship W4301369075A5036260775 @default.
- W4301369075 hasAuthorship W4301369075A5049062714 @default.
- W4301369075 hasAuthorship W4301369075A5090839043 @default.
- W4301369075 hasAuthorship W4301369075A5090891199 @default.
- W4301369075 hasBestOaLocation W43013690751 @default.
- W4301369075 hasConcept C114614502 @default.
- W4301369075 hasConcept C119857082 @default.
- W4301369075 hasConcept C126255220 @default.
- W4301369075 hasConcept C12713177 @default.
- W4301369075 hasConcept C137836250 @default.
- W4301369075 hasConcept C154945302 @default.
- W4301369075 hasConcept C159176650 @default.
- W4301369075 hasConcept C178635117 @default.
- W4301369075 hasConcept C2524010 @default.
- W4301369075 hasConcept C28761237 @default.
- W4301369075 hasConcept C33923547 @default.
- W4301369075 hasConcept C36686422 @default.
- W4301369075 hasConcept C37736160 @default.
- W4301369075 hasConcept C38652104 @default.
- W4301369075 hasConcept C41008148 @default.
- W4301369075 hasConcept C50817715 @default.
- W4301369075 hasConcept C89109886 @default.
- W4301369075 hasConcept C97541855 @default.
- W4301369075 hasConceptScore W4301369075C114614502 @default.
- W4301369075 hasConceptScore W4301369075C119857082 @default.
- W4301369075 hasConceptScore W4301369075C126255220 @default.
- W4301369075 hasConceptScore W4301369075C12713177 @default.
- W4301369075 hasConceptScore W4301369075C137836250 @default.
- W4301369075 hasConceptScore W4301369075C154945302 @default.
- W4301369075 hasConceptScore W4301369075C159176650 @default.
- W4301369075 hasConceptScore W4301369075C178635117 @default.
- W4301369075 hasConceptScore W4301369075C2524010 @default.
- W4301369075 hasConceptScore W4301369075C28761237 @default.
- W4301369075 hasConceptScore W4301369075C33923547 @default.
- W4301369075 hasConceptScore W4301369075C36686422 @default.
- W4301369075 hasConceptScore W4301369075C37736160 @default.
- W4301369075 hasConceptScore W4301369075C38652104 @default.
- W4301369075 hasConceptScore W4301369075C41008148 @default.
- W4301369075 hasConceptScore W4301369075C50817715 @default.
- W4301369075 hasConceptScore W4301369075C89109886 @default.
- W4301369075 hasConceptScore W4301369075C97541855 @default.
- W4301369075 hasLocation W43013690751 @default.
- W4301369075 hasOpenAccess W4301369075 @default.
- W4301369075 hasPrimaryLocation W43013690751 @default.
- W4301369075 hasRelatedWork W2963139197 @default.
- W4301369075 hasRelatedWork W2985982678 @default.
- W4301369075 hasRelatedWork W2995459009 @default.
- W4301369075 hasRelatedWork W3007034372 @default.
- W4301369075 hasRelatedWork W3013223143 @default.
- W4301369075 hasRelatedWork W3034871777 @default.
- W4301369075 hasRelatedWork W3035388736 @default.
- W4301369075 hasRelatedWork W3188206594 @default.
- W4301369075 hasRelatedWork W4301369075 @default.
- W4301369075 hasRelatedWork W4376653367 @default.
- W4301369075 isParatext "false" @default.
- W4301369075 isRetracted "false" @default.
- W4301369075 workType "article" @default.