Matches in SemOpenAlex for { <https://semopenalex.org/work/W134602879> ?p ?o ?g. }
- W134602879 endingPage "514" @default.
- W134602879 startingPage "509" @default.
- W134602879 abstract "This paper provides new techniques for abstracting the state space of a Markov Decision Process (MDP). These techniques extend one of the recent minimization models, known as -reduction, to construct a partition space that has a smaller number of states than the original MDP. As a result, learning policies on the partition space should be faster than on the original state space. The technique presented here extends reduction to SMDPs by executing a policy instead of a single action, and grouping all states which have a small difference in transition probabilities and reward function under a given policy. When the reward structure is not known, a two-phase method for state aggregation is introduced and a theorem in this paper shows the solvability of tasks using the two-phase method partitions. These partitions can be further refined when the complete structure of reward is available. Simulations of different state spaces show that the policies in both MDP and this representation achieve similar results and the total learning time in partition space in presented approach is much smaller than the total amount of time spent on learning on the original state space. Introduction Markov decision processes (MDPs) are useful ways to model stochastic environments, as there are well established algorithms to solve these models. Even though these algorithms find an optimal solution for the model, they suffer from the high time complexity when the number of decision points is large(Parr 1998; Dietterich 2000). To address increasingly complex problems a number of approaches have been used to design state space representations in order to increase the efficiency of learning (Dean Thomas; Kaelbling & Nicholson 1995; Dean & Robert 1997). Here particular features are hand-designed based on the task domain and the capabilities of the learning agent. In autonomous systems, however, this is generally a difficult task since it is hard to anticipate which parts of the underlying physical state are important for the given decision making problem. Moreover, in hierarchical learning approaches the required information might change over time as increasingly competent actions become available. The same can be observed in biological systems where information about all muscle Copyright c © 2004, American Association for Artificial Intelligence (www.aaai.org). All rights reserved. fibers is initially instrumental to generate strategies for coordinated movement. However, as such strategies become established and ready to be used, this low-level information does no longer have to be consciously taken into account. The methods presented here build on the -reduction technique developed by Dean et al.(Givan & Thomas 1995) to derive representations in the form of state space partitions that ensure that the utility of a policy learned in the reduced state space is within a fixed bound of the optimal policy. The presented methods here extend the -reduction technique by including policies as actions and thus using it to find approximate SMDP reductions. Furthermore it derives partitions for individual actions and composes them into representations for any given subset of the action space. This is further extended by permitting the definition of two-phase partitioning that is initially reward independent and can later be refined once the reward function is known. 
In particular, the techniques described in the following subsections extend ε-reduction (Thomas Dean & Leach 1997) by introducing the following methods:
• Temporal abstraction
• Action-dependent decomposition
• Two-phase decomposition

Formalism

A Markov decision process (MDP) is a 4-tuple (S, A, P, R), where S is the set of states, A is the set of actions available in each state, P is a transition probability function that assigns a value 0 ≤ p ≤ 1 to each state-action-state triple, and R is the reward function. The transition function is a map P : S × A × S → [0, 1], usually denoted P(s′|s, a), which is the probability that executing action a in state s will lead to state s′. Similarly, the reward function is a map R : S × A → ℝ, and R(s, a) denotes the reward gained by executing action a in state s. Any policy defines a value function, and the Bellman equation (Bellman 1957; Puterman 1994) connects the value of each state to the values of the other states by:

V^π(s) = R(s, π(s)) + γ Σ_{s′} P(s′|s, π(s)) V^π(s′)

Previous Work

State space reduction methods use the basic concepts of an MDP, such as the transition probabilities and the reward function, to represent a large class of states with a single state of the abstract space. The most important issues that show the generated abstraction is a valid approximate MDP are:
1. The difference between the transition functions and reward functions of the two models has to be small.
2. For each policy on the original state space there must exist a policy in the abstract model, and if a state s′ is not reachable from state s in the abstract model, then there should not exist a policy that leads from s to s′ in the original state space.

SMDPs

One approach to treating temporal abstraction is to use the theory of semi-Markov decision processes (SMDPs). The actions in SMDPs take a variable amount of time and are intended to model temporally extended actions, represented as sequences of primary actions.

Policies: A policy (option) in SMDPs is a triple o_i = (I_i, π_i, β_i) (Boutilier & Hanks 1995), where I_i is an initiation set, π_i : S × A → [0, 1] is a primary policy, and β_i : S → [0, 1] is a termination condition. When a policy o_i is executed, actions are chosen according to π_i until the policy terminates stochastically according to β_i. The initiation set and termination condition of a policy limit the range over which the policy needs to be defined and determine its termination. Given any set of multi-step actions, we consider policies over those actions. In this case we need to generalize the definition of the value function. The value of a state s under an SMDP policy π is defined as (Boutilier & Goldszmidt 1994):" @default.
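As a rough illustration of two ingredients described in the abstract above, the following Python sketch performs Bellman policy evaluation on a toy MDP and then greedily groups states whose rewards and transition probabilities under a fixed policy differ by at most ε. It is a minimal, simplified stand-in, not the paper's ε-reduction algorithm; the toy MDP, the fixed policy, and the eps threshold are arbitrary illustrative choices.

```python
# Minimal illustrative sketch (not the paper's algorithm): Bellman policy
# evaluation on a toy MDP, followed by a naive epsilon-style grouping of
# states whose rewards and transition rows under the policy are close.
import numpy as np

n_states, n_actions = 6, 2
rng = np.random.default_rng(0)

# Hypothetical tabular MDP: P[a, s, s'] are transition probabilities, R[s, a] rewards.
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))
gamma = 0.9
policy = rng.integers(0, n_actions, size=n_states)  # a fixed deterministic policy pi(s)

# Bellman policy evaluation:
#   V^pi(s) = R(s, pi(s)) + gamma * sum_{s'} P(s'|s, pi(s)) V^pi(s')
P_pi = np.array([P[policy[s], s] for s in range(n_states)])  # rows P(.|s, pi(s))
R_pi = np.array([R[s, policy[s]] for s in range(n_states)])  # rewards under pi
V_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

# Greedy epsilon grouping: a state joins an existing block if its reward and
# transition row under pi differ from the block's representative by at most eps.
eps = 0.15  # illustrative threshold, playing the role of epsilon in epsilon-reduction
blocks = []  # each block is a list of state indices; the first element is its representative
for s in range(n_states):
    for block in blocks:
        rep = block[0]
        if (abs(R_pi[s] - R_pi[rep]) <= eps
                and np.max(np.abs(P_pi[s] - P_pi[rep])) <= eps):
            block.append(s)
            break
    else:
        blocks.append([s])

print("V^pi:", np.round(V_pi, 3))
print("epsilon-blocks:", blocks)
```

The grouping above compares each state only against a block representative, which is a deliberate simplification; the paper's construction additionally handles temporally extended policies (SMDP options) and a two-phase, initially reward-independent aggregation.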
- W134602879 created "2016-06-24" @default.
- W134602879 creator A5044329397 @default.
- W134602879 creator A5047174917 @default.
- W134602879 date "2004-01-01" @default.
- W134602879 modified "2023-09-23" @default.
- W134602879 title "State space reduction for hierarchical reinforcement learning" @default.
- W134602879 cites W1566007944 @default.
- W134602879 cites W1574877594 @default.
- W134602879 cites W1585385982 @default.
- W134602879 cites W1586162706 @default.
- W134602879 cites W1650504995 @default.
- W134602879 cites W1688218840 @default.
- W134602879 cites W1896074376 @default.
- W134602879 cites W1993711637 @default.
- W134602879 cites W2034448800 @default.
- W134602879 cites W2048679005 @default.
- W134602879 cites W2097089247 @default.
- W134602879 cites W2100969003 @default.
- W134602879 cites W2103369961 @default.
- W134602879 cites W2107008379 @default.
- W134602879 cites W2110268278 @default.
- W134602879 cites W2111471791 @default.
- W134602879 cites W2122520221 @default.
- W134602879 cites W2125838338 @default.
- W134602879 cites W2126954844 @default.
- W134602879 cites W2129660483 @default.
- W134602879 cites W2131775048 @default.
- W134602879 cites W2132875213 @default.
- W134602879 cites W2149098830 @default.
- W134602879 cites W2151831732 @default.
- W134602879 cites W2160135234 @default.
- W134602879 cites W2168171912 @default.
- W134602879 cites W2334782222 @default.
- W134602879 cites W2341171179 @default.
- W134602879 cites W199720190 @default.
- W134602879 hasPublicationYear "2004" @default.
- W134602879 type Work @default.
- W134602879 sameAs 134602879 @default.
- W134602879 citedByCount "10" @default.
- W134602879 countsByYear W1346028792012 @default.
- W134602879 countsByYear W1346028792020 @default.
- W134602879 crossrefType "proceedings-article" @default.
- W134602879 hasAuthorship W134602879A5044329397 @default.
- W134602879 hasAuthorship W134602879A5047174917 @default.
- W134602879 hasConcept C105795698 @default.
- W134602879 hasConcept C111335779 @default.
- W134602879 hasConcept C111919701 @default.
- W134602879 hasConcept C11413529 @default.
- W134602879 hasConcept C127413603 @default.
- W134602879 hasConcept C154945302 @default.
- W134602879 hasConcept C2524010 @default.
- W134602879 hasConcept C2778572836 @default.
- W134602879 hasConcept C33923547 @default.
- W134602879 hasConcept C41008148 @default.
- W134602879 hasConcept C48103436 @default.
- W134602879 hasConcept C66938386 @default.
- W134602879 hasConcept C67203356 @default.
- W134602879 hasConcept C72434380 @default.
- W134602879 hasConcept C97541855 @default.
- W134602879 hasConceptScore W134602879C105795698 @default.
- W134602879 hasConceptScore W134602879C111335779 @default.
- W134602879 hasConceptScore W134602879C111919701 @default.
- W134602879 hasConceptScore W134602879C11413529 @default.
- W134602879 hasConceptScore W134602879C127413603 @default.
- W134602879 hasConceptScore W134602879C154945302 @default.
- W134602879 hasConceptScore W134602879C2524010 @default.
- W134602879 hasConceptScore W134602879C2778572836 @default.
- W134602879 hasConceptScore W134602879C33923547 @default.
- W134602879 hasConceptScore W134602879C41008148 @default.
- W134602879 hasConceptScore W134602879C48103436 @default.
- W134602879 hasConceptScore W134602879C66938386 @default.
- W134602879 hasConceptScore W134602879C67203356 @default.
- W134602879 hasConceptScore W134602879C72434380 @default.
- W134602879 hasConceptScore W134602879C97541855 @default.
- W134602879 hasLocation W1346028791 @default.
- W134602879 hasOpenAccess W134602879 @default.
- W134602879 hasPrimaryLocation W1346028791 @default.
- W134602879 hasRelatedWork W1553182805 @default.
- W134602879 hasRelatedWork W1557517019 @default.
- W134602879 hasRelatedWork W1561485809 @default.
- W134602879 hasRelatedWork W1586162706 @default.
- W134602879 hasRelatedWork W1592847719 @default.
- W134602879 hasRelatedWork W2109910161 @default.
- W134602879 hasRelatedWork W2121517924 @default.
- W134602879 hasRelatedWork W2121863487 @default.
- W134602879 hasRelatedWork W2612281180 @default.
- W134602879 hasRelatedWork W2612329773 @default.
- W134602879 hasRelatedWork W2912960479 @default.
- W134602879 hasRelatedWork W2950912239 @default.
- W134602879 hasRelatedWork W2982138249 @default.
- W134602879 hasRelatedWork W3022519106 @default.
- W134602879 hasRelatedWork W3033762800 @default.
- W134602879 hasRelatedWork W3035599863 @default.
- W134602879 hasRelatedWork W3202097587 @default.
- W134602879 hasRelatedWork W3211380950 @default.
- W134602879 hasRelatedWork W3212134242 @default.
- W134602879 hasRelatedWork W52822972 @default.