Thanks to artificial intelligence, the BionicSoftHand quickly learns on its own to rotate a cube so that a desired side faces upwards, using the method of reinforcement learning. Instead of being given a concrete action to imitate, the hand is given only the objective, which it tries to achieve through trial and error. Based on the feedback it receives, it gradually optimises its actions until it solves the task successfully.
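The trial-and-error loop described above can be illustrated with a minimal tabular Q-learning sketch. Everything below — the corridor task, reward scheme, and hyper-parameters — is invented for the example and is not Festo's actual training setup:

```python
import random

def q_learning(episodes=500, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    """Tabular Q-learning on a 4-state corridor: move right to reach the goal."""
    rng = random.Random(seed)
    n_states, goal = 4, 3
    q = [[0.0, 0.0] for _ in range(n_states)]   # q[s][a]; a=0 is left, a=1 is right
    for _ in range(episodes):
        s = 0
        while s != goal:
            # epsilon-greedy trial and error: mostly exploit, sometimes explore
            a = rng.randrange(2) if rng.random() < eps else max((0, 1), key=lambda a: q[s][a])
            s2 = min(s + 1, goal) if a == 1 else max(s - 1, 0)
            r = 1.0 if s2 == goal else 0.0      # feedback only on reaching the goal
            q[s][a] += alpha * (r + gamma * max(q[s2]) * (s2 != goal) - q[s][a])
            s = s2
    return q

q = q_learning()
# After training, the greedy action in every non-goal state is "right".
assert all(q[s][1] > q[s][0] for s in range(3))
```

The agent is never shown the correct movement; it discovers it purely from the reward signal, which is the essence of the approach described above.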
The training takes place in a virtual environment using a simulation — a so-called digital twin. With the help of data from a depth camera and artificial-intelligence algorithms, the controller learns the desired motion strategy in simulation, and the result is then transferred to the real SoftHand.
In this way, the learned knowledge can also be shared with other robot hands around the world. The fingers of the robot hand consist of flexible bellows structures with air chambers, enclosed in a special 3D textile cover knitted from both elastic and high-strength fibres. The textile determines exactly at which points the structure expands, thereby generating force, and where expansion is prevented. This makes the hand light, flexible, adaptable and sensitive, yet capable of exerting strong forces.
In addition, position sensors are installed, as well as tactile force sensors that register pressure. To keep the tubing effort for the BionicSoftHand as low as possible, the developers designed a small, digitally controlled valve terminal that is mounted directly below the hand. This means that the tubes for controlling the gripper fingers do not have to be routed through the entire robot arm, and the BionicSoftHand can be quickly and easily connected and operated with just one supply-air tube and one exhaust-air tube.
With the proportional piezo valves used, the movements of the gripper fingers can be precisely controlled. The BionicSoftHand can be mounted on the BionicSoftArm — a lightweight pneumatic robot — for working together with humans.

Adapting bias by gradient descent: An incremental version of delta-bar-delta.
Gain adaptation beats least squares?
Machines that Learn and Mimic the Brain. Reprinted in Stethoscope Quarterly, Spring.
Reinforcement learning architectures.
Introduction: The challenge of reinforcement learning.
Sanger, T. Iterative construction of sparse polynomial approximations. Advances in Neural Information Processing Systems 4, pp.
Gluck, M. Adaptation of cue-specific learning rates in network models of human category learning, Proceedings of the Fourteenth Annual Conference of the Cognitive Science Society, pp.
Planning by incremental dynamic programming.
Dyna, an integrated architecture for learning, planning and reacting.
Integrated modeling and control based on reinforcement learning and dynamic programming. Touretzky, Ed.
The second part of the paper presents Dyna, a class of architectures based on reinforcement learning but which go beyond trial-and-error learning. Dyna architectures include a learned internal model of the world.
By intermixing conventional trial and error with hypothetical trial and error using the world model, Dyna systems can plan and learn optimal behavior very rapidly. Results are shown for simple Dyna systems that learn from trial and error while they simultaneously learn a world model and use it to plan optimal action sequences. We also show that Dyna architectures are easy to adapt for use in changing environments.
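The intermixing of real and hypothetical experience can be sketched with a minimal Dyna-Q loop. The deterministic corridor task and all hyper-parameters below are assumptions for illustration, not the paper's experimental setup:

```python
import random

def dyna_q(episodes=30, planning_steps=10, alpha=0.5, gamma=0.95, eps=0.1, seed=1):
    """Dyna-Q on a 6-state corridor: each real step also trains a world model,
    which is then replayed for extra 'hypothetical' planning updates."""
    rng = random.Random(seed)
    n_states, goal = 6, 5
    q = [[0.0, 0.0] for _ in range(n_states)]
    model = {}                                   # (s, a) -> (r, s'); deterministic world
    for _ in range(episodes):
        s = 0
        while s != goal:
            a = rng.randrange(2) if rng.random() < eps else max((0, 1), key=lambda a: q[s][a])
            s2 = min(s + 1, goal) if a == 1 else max(s - 1, 0)
            r = 1.0 if s2 == goal else 0.0
            # learn from real experience
            q[s][a] += alpha * (r + gamma * max(q[s2]) * (s2 != goal) - q[s][a])
            model[(s, a)] = (r, s2)              # update the learned world model
            for _ in range(planning_steps):      # hypothetical trial and error
                ps, pa = rng.choice(list(model))
                pr, ps2 = model[(ps, pa)]
                q[ps][pa] += alpha * (pr + gamma * max(q[ps2]) * (ps2 != goal) - q[ps][pa])
            s = s2
    return q
```

The planning loop replays modeled transitions between real steps, which is why Dyna systems can converge on optimal behavior with far fewer real interactions.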
Reinforcement learning is direct adaptive optimal control, Proceedings of the American Control Conference, pp.
First results with Dyna, an integrated architecture for learning, planning, and reacting.
Time-derivative models of Pavlovian reinforcement. Gabriel and J. Moore, Eds.
Barto, A. Sequential decision problems and neural networks.
Whitehead, S.
Franklin, J.
Rodriguez, Ed. SPIE, pp.
Anderson, C. Artificial intelligence as a control problem: Comments on the relationship between machine learning and intelligent control. Appeared in Machine Learning in a Dynamic World.
Implementation details of the TD(λ) procedure for the case of vector predictions and backpropagation.
Learning to predict by the methods of temporal differences. Machine Learning 3, with erratum. Scan of paper as published, with erratum added. Digitally remastered with missing figure in place.
Convergence theory for a new kind of prediction learning, Proceedings of the Workshop on Computational Learning Theory, pp.
Selfridge, O. Lee, Ed.
A temporal-difference model of classical conditioning, Proceedings of the Ninth Annual Conference of the Cognitive Science Society, pp.
Two problems with backpropagation and other steepest-descent learning procedures for networks, Proceedings of the Eighth Annual Conference of the Cognitive Science Society, pp.
Mehra and B.
Learning distributed, searchable, internal models, Proceedings of the Distributed Artificial Intelligence Workshop, pp.
The learning of world models by connectionist networks, Proceedings of the Seventh Annual Conference of the Cognitive Science Society, pp.
Neural problem solving. Anderson, Eds. Lawrence Erlbaum. The results here were obtained using a computer simulation of the pole-balancing problem. A movie will be shown of the performance of the system under the various requirements and tasks.
Moore, J. Connectionist learning in real time: Sutton-Barto adaptive element and classical conditioning of the nictitating membrane response, Proceedings of the Seventh Annual Conference of the Cognitive Science Society, pp.
Temporal credit assignment in reinforcement learning.
The algorithms considered include some from learning automata theory, mathematical learning theory, early "cybernetic" approaches to learning, Samuel's checker-playing program, Michie and Chambers's "Boxes" system, and a number of new algorithms. The tasks were selected to involve, first in isolation and then in combination, the issues of misleading generalizations, delayed reinforcement, unbalanced reinforcement, and secondary reinforcement. The tasks range from simple, abstract "two-armed bandit" tasks to a physically realistic pole-balancing task.
The results indicate several areas where the algorithms presented here perform substantially better than those previously studied. An unbalanced distribution of reinforcement, misleading generalizations, and delayed reinforcement can greatly retard learning and in some cases even make it counterproductive. Performance can be substantially improved in the presence of these common problems through the use of mechanisms of reinforcement comparison and secondary reinforcement. We present a new algorithm similar to the "learning-by-generalization" algorithm used for altering the static evaluation function in Samuel's checker-playing program.
Simulation experiments indicate that the new algorithm performs better than a version of Samuel's algorithm suitably modified for reinforcement learning tasks. Theoretical analysis in terms of an "ideal reinforcement signal" sheds light on the relationship between these two algorithms and other temporal credit-assignment algorithms.
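Reinforcement comparison, mentioned above, reinforces an action by the difference between the received reward and a learned reference level. A minimal sketch on a two-armed bandit follows; the pay-off probabilities, step-sizes, and softmax selection are illustrative assumptions rather than the thesis's exact algorithm:

```python
import math, random

def reinforcement_comparison(steps=5000, alpha=0.1, beta=0.1, seed=2):
    """Reinforcement comparison on a two-armed bandit: action preferences are
    moved by (reward - reference), where the reference tracks average reward."""
    rng = random.Random(seed)
    p_success = [0.2, 0.8]                 # arm 1 pays off more often
    pref = [0.0, 0.0]                      # action preferences
    r_bar = 0.0                            # reference (average) reward
    for _ in range(steps):
        # softmax action selection from the preferences
        z = [math.exp(p) for p in pref]
        a = 0 if rng.random() < z[0] / (z[0] + z[1]) else 1
        r = 1.0 if rng.random() < p_success[a] else 0.0
        pref[a] += alpha * (r - r_bar)     # reinforce relative to the reference
        r_bar += beta * (r - r_bar)        # move the reference toward recent reward
    return pref
```

Comparing reward against a reference level, rather than using the raw reward, is what lets the learner cope with an unbalanced distribution of reinforcement.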
A theory of salience change dependent on the relationship between discrepancies on successive trials on which the stimulus is present.
Synthesis of nonlinear control surfaces by a layered associative network, Biological Cybernetics 43.
Adaptation of learning rate parameters.
Barto and R. Toward a modern theory of adaptive networks: Expectation and prediction, Psychological Review 88. Translated into Spanish by G. Ruiz, to appear in the journal Estudios de Psicologia.
An adaptive network that constructs and uses an internal model of its world, Cognition and Brain Theory 4.
Goal seeking components for adaptive intelligence: An initial assessment. Appendix C is available separately.
Associative search network: A reinforcement learning associative memory, Biological Cybernetics 40.
Landmark learning: An illustration of associative search, Biological Cybernetics 42.
A unified theory of expectation in classical and instrumental conditioning. Bachelor's thesis, Stanford University.
Learning theory support for a single channel theory of the brain.
ABSTRACT: To estimate the value functions of policies from exploratory data, most model-free off-policy algorithms rely on importance sampling, where the use of importance sampling ratios often leads to estimates with severe variance. It is thus desirable to learn off-policy without using the ratios.
However, such an algorithm does not exist for multi-step learning with function approximation. In this paper, we introduce the first such algorithm based on temporal-difference (TD) learning updates. We show that an explicit use of importance sampling ratios can be eliminated by varying the amount of bootstrapping in TD updates in an action-dependent manner. Our new algorithm achieves stability using a two-timescale gradient-based TD update.
A prior algorithm based on a lookup-table representation, called Tree Backup, can also be retrieved using action-dependent bootstrapping, becoming a special case of our algorithm. In two challenging off-policy tasks, we demonstrate that our algorithm is stable, effectively avoids the large-variance issue, and can perform substantially better than its state-of-the-art counterpart. ABSTRACT: Unifying seemingly disparate algorithmic ideas to produce better-performing algorithms has been a longstanding goal in reinforcement learning. Currently, there are a multitude of algorithms that can be used to perform TD control, including Sarsa, Q-learning, and Expected Sarsa.
These methods are often studied in the one-step case, but they can be extended across multiple time steps to achieve better performance. Each of these algorithms is seemingly distinct, and no one of them dominates the others for all problems. The mixture can also be varied dynamically, which can result in even greater performance. ABSTRACT: We consider off-policy temporal-difference (TD) learning in discounted Markov decision processes, where the goal is to evaluate a policy in a model-free way by using observations of a state process generated without executing the policy.
These results not only lead immediately to a characterization of the convergence behavior of least-squares-based implementations of our scheme, but also prepare the ground for further analysis of gradient-based implementations. In such a continuous domain, we also propose four off-policy IPI methods — two are the ideal PI forms that use advantage and Q-functions, respectively, and the other two are natural extensions of the existing off-policy IPI schemes to our general RL framework.
Compared to the IPI methods in optimal control, the proposed IPI schemes can be applied to more general situations and do not require an initial stabilizing policy to run; they are also strongly relevant to the RL algorithms in CTS, such as advantage updating, Q-learning, and value-gradient-based (VGB) greedy policy improvement. Our on-policy IPI is basically model-based but can be made partially model-free; each off-policy method is also either partially or completely model-free.
The mathematical properties of the IPI methods—admissibility, monotone improvement, and convergence towards the optimal solution—are all rigorously proven, together with the equivalence of on- and off-policy IPI. Finally, the IPI methods are simulated with an inverted-pendulum model to support the theory and verify the performance.
These examples were chosen to illustrate a diversity of application types, the engineering needed to build applications, and most importantly, the impressive results that these methods are able to achieve. We suggest one advantage of this particular type of memory is the ability to easily assign credit to a specific state when remembered information is found to be useful.
Inspired by this idea, and the increasing popularity of external memory mechanisms to handle long-term dependencies in deep learning systems, we propose a novel algorithm which uses a reservoir sampling procedure to maintain an external memory consisting of a fixed number of past states. The algorithm allows a deep reinforcement learning agent to learn online to preferentially remember those states which are found to be useful to recall later on. Critically, this method allows for efficient online computation of gradient estimates with respect to the write process of the external memory.
Thus, unlike most prior mechanisms for external memory, it is feasible to use in an online reinforcement learning setting.
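The reservoir sampling procedure underlying such a memory can be sketched as the classic Algorithm R, which keeps a uniform random sample of fixed size from a stream of unknown length. This is the generic procedure only, not the paper's learned write process:

```python
import random

def reservoir_sample(stream, k, rng):
    """Keep a uniform random sample of k items from a stream of unknown length.
    Item t (1-indexed) is admitted with probability k/t, evicting a random slot."""
    reservoir = []
    for t, item in enumerate(stream, start=1):
        if t <= k:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = rng.randrange(t)        # uniform in [0, t)
            if j < k:
                reservoir[j] = item     # admit the item, evicting slot j
    return reservoir
```

Every item in the stream ends up in the reservoir with probability exactly k/n, which is what makes the memory's contents an unbiased sample of past states.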
ABSTRACT: This work presents an overarching perspective on the role that machine intelligence can play in enhancing human abilities, especially those that have been diminished due to injury or illness. As a primary contribution, we develop the hypothesis that assistive devices, and specifically artificial arms and hands, can and should be viewed as agents in order for us to most effectively improve their collaboration with their human users. We believe that increased agency will enable more powerful interactions between human users and next generation prosthetic devices, especially when the sensorimotor space of the prosthetic technology greatly exceeds the conventional control and communication channels available to a prosthetic user.
We then introduce the idea of communicative capital as a way of thinking about the communication resources developed by a human and a machine during their ongoing interaction. Using this schema of agency and capacity, we examine the benefits and disadvantages of increasing the agency of a prosthetic limb. To do so, we present an analysis of examples from the literature where building communicative capital has enabled a progression of fruitful, task-directed interactions between prostheses and their human users.
We then describe further work that is needed to concretely evaluate the hypothesis that prostheses are best thought of as agents. The agent-based viewpoint developed in this article significantly extends current thinking on how best to support the natural, functional use of increasingly complex prosthetic enhancements, and opens the door for more powerful interactions between humans and their assistive technologies. Experience replay introduces a new hyper-parameter, the memory buffer size, which needs careful tuning.
Unfortunately, the importance of this new hyper-parameter has long been underestimated in the community. In this paper we present a systematic empirical study of experience replay under various function representations. We show that a large replay buffer can significantly hurt performance.
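For reference, a generic fixed-capacity replay buffer looks like the following; the `capacity` argument is the buffer-size hyper-parameter discussed above. This is a plain FIFO buffer sketch, not the paper's proposed remedy:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity experience replay: old transitions are evicted FIFO,
    so `capacity` controls how stale (off-policy) the replayed data can be."""
    def __init__(self, capacity, seed=0):
        self.buffer = deque(maxlen=capacity)   # deque drops the oldest item itself
        self.rng = random.Random(seed)

    def add(self, transition):                 # transition = (s, a, r, s2, done)
        self.buffer.append(transition)

    def sample(self, batch_size):
        """Uniform minibatch over whatever the buffer currently holds."""
        return self.rng.sample(list(self.buffer), min(batch_size, len(self.buffer)))

buf = ReplayBuffer(capacity=100)
for i in range(1000):
    buf.add((i, 0, 0.0, i + 1, False))
# Only the most recent 100 transitions remain; sampling is uniform over them.
batch = buf.sample(32)
```

A larger capacity keeps older, more off-policy transitions in the sampling pool, which is one intuition for why an oversized buffer can hurt.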
Moreover, we propose a simple O(1) method to remedy the negative influence of a large replay buffer, and showcase its utility both in a simple grid world and in challenging domains like Atari games. Frequently used models of the interaction between an agent and its environment, such as Markov decision processes (MDPs) or semi-Markov decision processes (SMDPs), do not capture the fact that, in an asynchronous environment, the state of the environment may change during computation performed by the agent.
In an asynchronous environment, minimizing reaction time—the time it takes for an agent to react to an observation—also minimizes the time in which the state of the environment may change following observation. In many environments, the reaction time of an agent directly impacts task performance by permitting the environment to transition into either an undesirable terminal state or a state where performing the chosen action is inappropriate.
We propose a class of reactive reinforcement learning algorithms that address this problem of asynchronous environments by acting immediately after observing new state information. We compare a reactive SARSA learning algorithm with the conventional SARSA learning algorithm on two asynchronous robotic tasks (emergency stopping and impact prevention), and show that the reactive RL algorithm reduces the reaction time of the agent by approximately the duration of the algorithm's learning update.
This will be a prerequisite for machines to learn from experience and respond to circumstances effectively. In this paper we show that the performance of existing step-size adaptation methods is strongly dependent on the choice of their meta-step-size parameter, and that this parameter cannot be set reliably in a problem-independent way.
This new class of reactive algorithms may facilitate safer control and faster decision making without any change to standard learning guarantees. The performance of TD methods often depends on well-chosen step-sizes, yet few algorithms have been developed for setting the step-size automatically for TD learning. An important limitation of current methods is that they adapt a single step-size shared by all the weights of the learning system.
A vector step-size enables greater optimization by specifying parameters on a per-feature basis. Furthermore, adapting parameters at different rates has the added benefit of being a simple form of representation learning. We demonstrate that TIDBD is able to find appropriate step-sizes in both stationary and non-stationary prediction tasks, outperforming ordinary TD methods and TD methods with scalar step-size adaptation; we demonstrate that it can differentiate between features which are relevant and irrelevant for a given task, performing representation learning; and we show on a real-world robot prediction task that TIDBD is able to outperform ordinary TD methods and TD methods augmented with AlphaBound and RMSprop.
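The flavour of per-weight step-size adaptation can be sketched in the style of IDBD applied to linear TD(0). This is a simplification — TIDBD proper also handles eligibility traces — and the hyper-parameters below are illustrative assumptions:

```python
import math

def tidbd_td0(experience, theta=0.01, beta_init=math.log(0.05), gamma=0.9):
    """Per-weight step-size adaptation for linear TD(0), in the style of
    IDBD/TIDBD: each weight w[i] carries its own log step-size beta[i],
    adapted by a meta-gradient trace h[i] of recent correlated updates."""
    n = len(experience[0][0])
    w, beta, h = [0.0] * n, [beta_init] * n, [0.0] * n
    for x, r, x2 in experience:          # (features, reward, next features)
        v = sum(wi * xi for wi, xi in zip(w, x))
        v2 = sum(wi * xi for wi, xi in zip(w, x2))
        delta = r + gamma * v2 - v       # TD error
        for i in range(n):
            beta[i] += theta * delta * x[i] * h[i]   # meta-gradient step on the log step-size
            alpha = math.exp(beta[i])                # per-weight step-size
            w[i] += alpha * delta * x[i]             # semi-gradient TD(0) update
            h[i] = h[i] * max(0.0, 1.0 - alpha * x[i] * x[i]) + alpha * delta * x[i]
    return w, [math.exp(b) for b in beta]
```

Because each `beta[i]` adapts independently, step-sizes for relevant features can grow while those for noisy, irrelevant features shrink — the simple form of representation learning mentioned above.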
Our goal is to train these networks in an incremental manner, without the computationally expensive experience replay. We propose reducing such interference with two efficient input transformation methods that are geometric in nature and match well the geometric properties of ReLU gates. The first is tile coding, a classic binary encoding scheme originally designed for local generalization based on the topological structure of the input space. The second, EmECS, is a new method we introduce; it is based on geometric properties of convex sets and topological embedding of the input space into the boundary of a convex set.
We discuss the behavior of the network when it operates on the transformed inputs. We also compare it experimentally with neural nets that do not use these input transformations, and with the classic combination of tile coding and a linear function approximator; on several online reinforcement learning tasks, we show that a neural net with tile coding or EmECS can achieve not only faster learning but also more accurate approximations. Our results strongly suggest that geometric input transformations of this type can be effective for interference reduction, and take us a step closer to fully incremental reinforcement learning with neural nets.
They address a bias-variance trade-off between reliance on current estimates, which could be poor, and incorporating longer sampled reward sequences into the updates. Especially in the off-policy setting, where the agent aims to learn about a policy different from the one generating its behaviour, the variance in the updates can cause learning to diverge as the number of sampled rewards used in the estimates increases.
In this paper, we introduce per-decision control variates for multi-step TD algorithms, and compare them to existing methods. Our results show that including the control variates can greatly improve performance on both on- and off-policy multi-step temporal difference learning tasks. The third case, that of an expectation model, is particularly appealing because the expectation is compact and deterministic; this is the case most commonly used, but often in a way that is not sound for non-linear models such as those obtained with deep learning.
In this paper we introduce the first model-based reinforcement learning (MBRL) algorithm that is sound for non-linear expectation models and stochastic environments. Key to our algorithm, based on the Dyna architecture, is that the model is never iterated to produce a trajectory, but only used to generate single expected transitions to which a Bellman backup with a linear approximate value function is applied. In our results, we also consider the extension of the Dyna architecture to partial observability. We show the effectiveness of our algorithm by comparing it with model-free methods on partially observable navigation tasks.
ABSTRACT: Temporal-difference (TD) learning is an important approach in reinforcement learning, as it combines ideas from dynamic programming and Monte Carlo methods in a way that allows for online and incremental model-free learning. A key idea of TD learning is that it learns predictive knowledge about the environment in the form of value functions, from which it can derive its behavior to address long-term sequential decision making problems.
In this paper, we introduce an alternative view on the discount rate, with insight from digital signal processing, to include complex-valued discounting. Our results show that setting the discount rate to appropriately chosen complex numbers allows for online and incremental estimation of the Discrete Fourier Transform (DFT) of a signal of interest with TD learning. We thereby extend the types of knowledge representable by value functions, which we show are particularly useful for identifying periodic effects in the reward sequence.
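The idea can be illustrated with tabular TD(0) on a deterministic ring of states and a damped complex discount. The damping (|gamma| < 1) is an assumption added here to keep the toy example stable; the learned value is then a complex-discounted sum of future rewards, a damped DFT-like statistic of the reward signal:

```python
import cmath

def complex_td0(rewards, gamma, alpha=0.2, sweeps=2000):
    """Tabular TD(0) on a deterministic ring of len(rewards) states with a
    complex-valued discount gamma (|gamma| < 1). The fixed point at state s
    is the complex-discounted sum of the rewards that follow s."""
    n = len(rewards)
    v = [0j] * n
    for _ in range(sweeps):
        for s in range(n):
            s2 = (s + 1) % n
            r = rewards[s2]                       # reward received on entering s2
            v[s] += alpha * (r + gamma * v[s2] - v[s])
    return v

rewards = [1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0, 0.0]   # period-4 reward signal
n = len(rewards)
gamma = 0.9 * cmath.exp(-2j * cmath.pi / n)            # damped complex "frequency"
v = complex_td0(rewards, gamma)
# v[0] equals sum_k gamma**k * rewards[(1 + k) % n], so its phase and magnitude
# carry frequency information about the periodic reward, in the spirit of a DFT.
```

With |gamma| pushed toward 1, the complex-discounted sum approaches the corresponding DFT coefficient of the reward cycle, which is the periodic-effect detection described above.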
ABSTRACT: Temporal-difference (TD) learning methods are widely used in reinforcement learning to estimate the expected return for each state, without a model, because of their significant advantages in computational and data efficiency. For many applications involving risk mitigation, it would also be useful to estimate the variance of the return by TD methods. In this paper, we describe a way of doing this that is substantially simpler than the approaches proposed by Tamar, Di Castro, and Mannor, or those proposed by White and White. We show that two TD learners operating in series can learn expectation and variance estimates.
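A minimal sketch of the two-learners-in-series idea, on a toy one-step episodic task: the first learner estimates the expected return, and the second treats the first learner's squared TD error as its reward, yielding a variance estimate. (With discounting, the second learner would use a discount of gamma squared; the construction below is a simplified illustration, not the paper's full method.)

```python
import random

def learn_value_and_variance(episodes=20000, alpha=0.01, seed=5):
    """Two TD learners in series: the first estimates the expected return v;
    the second uses the first's squared TD error as its reward, so it
    converges to the variance of the return."""
    rng = random.Random(seed)
    v, var = 0.0, 0.0
    for _ in range(episodes):
        r = rng.choice((-1.0, 1.0))            # return of this one-step episode
        delta = r - v                          # TD error of the value learner
        v += alpha * delta                     # first learner: expected return
        var += alpha * (delta * delta - var)   # second learner: variance of return
    return v, var
```

Here the true return is ±1 with equal probability, so the expectation converges near 0 and the variance near 1.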
With these two modifications, the variance learning problem becomes a conventional TD learning problem to which standard theoretical results can be applied. Our formal results are limited to the table lookup case, for which our method is still novel, but the extension to function approximation is immediate, and we provide some empirical results for the linear function approximation case. Our experimental results show that our direct method behaves just as well as a comparable indirect method, but is generally more robust. Eligibility traces, the usual way of handling multi-step returns, work well with linear function approximators.
However, this was limited to action-value methods. In this paper, we extend this approach to handle n-step returns, generalize this approach to policy gradient methods and empirically study the effect of such delayed updates in control tasks. Specifically, we introduce two novel forward actor-critic methods and empirically investigate our proposed methods with the conventional actor-critic method on mountain car and pole-balancing tasks. From our experiments, we observe that forward actor-critic dramatically outperforms the conventional actor-critic in these standard control tasks.
Notably, this forward actor-critic method has produced a new class of multi-step RL algorithms without eligibility traces. ABSTRACT: We consider off-policy temporal-difference (TD) learning in discounted Markov decision processes, where the goal is to evaluate a policy in a model-free way by using observations of a state process generated without executing the policy.
The performance of a learning system depends on the type of representation used for representing the data. Typically, these representations are hand-engineered using domain knowledge. More recently, the trend is to learn these representations through stochastic gradient descent in multi-layer neural networks, which is called backprop. Learning the representations directly from the incoming data stream reduces the human labour involved in designing a learning system. More importantly, it allows a learning system to scale to difficult tasks.
In this paper, we introduce a new incremental learning algorithm called crossprop, which learns the incoming weights of hidden units based on the meta-gradient descent approach that was previously introduced by Sutton and Schraudolph for learning step-sizes. The final update equation introduces an additional memory parameter for each of these weights and generalizes the backprop update equation.
From our experiments, we show that crossprop learns and reuses its feature representation while tackling new and unseen tasks, whereas backprop relearns a new feature representation. In addition to learning from the current trial, the new model supposes that animals store and replay previous trials, learning from the replayed trials using the same learning rule.
This simple idea provides a unified explanation for diverse phenomena that have proved challenging to earlier associative models, including spontaneous recovery, latent inhibition, retrospective revaluation, and trial spacing effects. For example, spontaneous recovery is explained by supposing that the animal replays its previous trials during the interval between extinction and test. These include earlier acquisition trials as well as recent extinction trials, and thus there is a gradual re-acquisition of the conditioned response.
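The minimal replay model can be sketched with a delta rule over stored trials; the parameters and the acquisition/extinction schedule below are illustrative assumptions:

```python
import random

def conditioning(replay_k, alpha=0.05, n_acq=100, n_ext=100, seed=6):
    """Delta-rule (Rescorla-Wagner-style) conditioning with trial replay:
    every experienced trial is stored, and after each real trial the model
    replays replay_k randomly sampled stored trials with the same rule."""
    rng = random.Random(seed)
    w = 0.0                                     # associative strength CS -> US
    memory = []                                 # stored trials (US magnitude)
    trials = [1.0] * n_acq + [0.0] * n_ext      # acquisition, then extinction
    for us in trials:
        w += alpha * (us - w)                   # learn from the real trial
        memory.append(us)
        for _ in range(replay_k):               # learn again from remembered trials
            w += alpha * (rng.choice(memory) - w)
    return w

w_plain = conditioning(replay_k=0)
w_replay = conditioning(replay_k=5)
# Without replay, extinction drives responding to zero; with replay, remembered
# acquisition trials keep responding alive - a minimal analogue of recovery.
assert w_plain < 0.05 and w_replay > 0.2
```

Because the memory mixes acquisition and extinction trials, the replayed experience pulls the associative strength back up, matching the gradual re-acquisition described above.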
We present simulation results for the simplest version of this replay idea, where the trial memory is assumed empty at the beginning of an experiment, all experienced trials are stored and none removed, and sampling from the memory is performed at random. Even this minimal replay model is able to explain the challenging phenomena, illustrating the explanatory power of an associative model enhanced by learning from remembered as well as real experiences. To accomplish this, machines need to learn about their human users' intentions and adapt to their preferences.
In most current research, a user has conveyed preferences to a machine using explicit corrective or instructive feedback; explicit feedback imposes a cognitive load on the user and is expensive in terms of human effort. The primary objective of the current work is to demonstrate that a learning agent can reduce the amount of explicit feedback required for adapting to the user's task preferences by learning to perceive a value of its behavior directly from the human user, particularly from the user's facial expressions — we call this face valuing.
We empirically evaluate face valuing on a grip selection task. Our preliminary results suggest that an agent can quickly adapt to a user's changing preferences with minimal explicit feedback by learning a value function that maps facial features extracted from a camera image to expected future reward. We believe that an agent learning to perceive a value from the body language of its human user is complementary to existing interactive machine learning approaches and will help in creating successful human-machine interactive applications. ABSTRACT: In this paper we introduce the idea of improving the performance of parametric temporal-difference (TD) learning algorithms by selectively emphasizing or de-emphasizing their updates on different time steps.
Our treatment includes general state-dependent discounting and bootstrapping functions, and a way of specifying varying degrees of interest in accurately valuing different states. Machine learning, and in particular learned predictions about user intent, could help to reduce the time and cognitive load required by amputees while operating their prosthetic device. Objectives: The goal of this study was to compare two switching-based methods of controlling a myoelectric arm: non-adaptive or conventional control and adaptive control involving real-time prediction learning. Study Design: Case series study.
Methods: We compared non-adaptive and adaptive control in two different experiments. In the first, one amputee and one non-amputee subject controlled a robotic arm to perform a simple task; in the second, three able-bodied subjects controlled a robotic arm to perform a more complex task.
For both tasks, we calculated the mean time and total number of switches between robotic arm functions over three trials. Results: Adaptive control significantly decreased the number of switches and total switching time for both tasks compared with the conventional control method. Conclusion: Real-time prediction learning was successfully used to improve the control interface of a myoelectric robotic arm during uninterrupted use by an amputee subject and able-bodied subjects.
Clinical Relevance: Adaptive control using real-time prediction learning has the potential to help decrease both the time and the cognitive load required by amputees in real-world functional situations when using myoelectric prostheses. Conventional algorithms wait until observing actual outcomes before performing the computations to update their predictions. If predictions are made at a high rate or span over a large amount of time, substantial computation can be required to store all relevant observations and to update all predictions when the outcome is finally observed.
We show that the exact same predictions can be learned in a much more computationally congenial way, with uniform per-step computation that does not depend on the span of the predictions. We apply this idea to various settings of increasing generality, repeatedly adding desired properties and each time deriving an equivalent span-independent algorithm for the conventional algorithm that satisfies these desiderata. Interestingly, along the way several known algorithmic constructs emerge spontaneously from our derivations, including dutch eligibility traces, temporal difference errors, and averaging.
Each step, we make sure that the derived algorithm subsumes the previous algorithms, thereby retaining their properties.
Ultimately we arrive at a single general temporal-difference algorithm that is applicable to the full setting of reinforcement learning. ABSTRACT: In this article we develop the perspective that assistive devices, and specifically artificial arms and hands, may be beneficially viewed as goal-seeking agents. We further suggest that taking this perspective enables more powerful interactions between human users and next generation prosthetic devices, especially when the sensorimotor space of the prosthetic technology greatly exceeds the conventional myoelectric control and communication channels available to a prosthetic user.
Using this schema, we present a brief analysis of three examples from the literature where agency or goal-seeking behaviour by a prosthesis has enabled a progression of fruitful, task-directed interactions between a prosthetic assistant and a human director. While preliminary, the agent-based viewpoint developed in this article extends current thinking on how best to support the natural, functional use of increasingly complex prosthetic enhancements.
Their appeal comes from their good performance, low computational cost, and simple interpretation, given by their forward view. Algorithmically, these true online methods make only two small changes to the update rules of the regular methods, and the extra computational cost is negligible in most cases. However, they follow the ideas underlying the forward view much more closely. In particular, they maintain an exact equivalence with the forward view at all times, whereas the traditional versions only approximate it for small step-sizes. We hypothesize that these true online methods not only have better theoretical properties, but also dominate the regular methods empirically.
In this article, we put this hypothesis to the test by performing an extensive empirical comparison. We use linear function approximation with tabular, binary, and non-binary features. Our results suggest that the true online methods indeed dominate the regular methods. An additional advantage is that no choice between traces has to be made for the true online methods.
Besides the empirical results, we provide an in-depth analysis of the theory behind true online temporal-difference learning. In addition, we show that new true online temporal-difference methods can be derived by making changes to the online forward view and then rewriting the update equations.
Our domains include a challenging one-state and two-state example, random Markov reward processes, and a real-world myoelectric prosthetic arm. We assess the algorithms along three dimensions: computational cost, learning speed, and ease of use.
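To illustrate the "two small changes" that the true online methods make, here is a minimal sketch of true online TD(λ) with linear function approximation, following the standard published update equations; the episode/transition format and the two-state test chain are our own illustrative assumptions:

```python
import numpy as np

def true_online_td_lambda(episodes, n_features, alpha=0.1, gamma=1.0, lam=0.9):
    """True online TD(lambda) with linear function approximation.

    Each episode is a list of (x, r, x_next) transitions, where x and
    x_next are feature vectors and x_next is None at termination.
    """
    w = np.zeros(n_features)
    for episode in episodes:
        z = np.zeros(n_features)   # dutch eligibility trace
        v_old = 0.0                # previous state's value under the old weights
        for x, r, x_next in episode:
            v = w @ x
            v_next = 0.0 if x_next is None else w @ x_next
            delta = r + gamma * v_next - v
            # dutch-trace update: the term subtracted from x keeps an exact
            # equivalence with the online forward view at all times
            z = gamma * lam * z + (1.0 - alpha * gamma * lam * (z @ x)) * x
            # the (v - v_old) correction is the second change vs. regular TD(lambda)
            w = w + alpha * (delta + v - v_old) * z - alpha * (v - v_old) * x
            v_old = v_next
    return w
```

On a deterministic two-state chain with rewards 1 then 2 and γ = 1, the learned weights converge to the true values (3, 2), matching the forward-view λ-return targets.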
ABSTRACT: Emphatic algorithms are temporal-difference learning algorithms that change their effective state distribution by selectively emphasizing and de-emphasizing their updates on different time steps. Recent works by Sutton, Mahmood, and White and by Yu show that by varying the emphasis in a particular way these algorithms become stable and convergent under off-policy training with linear function approximation. This paper serves as a unified summary of the available results from both works. Additionally, we empirically demonstrate the benefits of emphatic algorithms, which allow flexible learning with state-dependent discounting, bootstrapping, and a user-specified allocation of function-approximation resources.
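The emphasis weighting described above can be sketched as follows. This is a minimal single-trace implementation assuming a fixed discount; the transition format and the cyclic on-policy test problem are our own illustrative choices, not the papers' experiments:

```python
import numpy as np

def emphatic_td(transitions, n_features, alpha=0.01, gamma=0.9, lam=0.0):
    """Emphatic TD(lambda), a minimal sketch for policy evaluation.

    Each transition is (x, r, x_next, rho, interest): feature vector,
    reward, next features, importance-sampling ratio pi(a|s)/mu(a|s),
    and the user-specified interest i(s) in the current state.
    """
    w = np.zeros(n_features)
    e = np.zeros(n_features)
    F = 0.0          # follow-on trace
    rho_prev = 1.0
    for x, r, x_next, rho, interest in transitions:
        F = rho_prev * gamma * F + interest    # accumulate follow-on weight
        M = lam * interest + (1.0 - lam) * F   # emphasis for this step
        e = rho * (gamma * lam * e + M * x)    # emphatic eligibility trace
        delta = r + gamma * (w @ x_next) - (w @ x)
        w = w + alpha * delta * e
        rho_prev = rho
    return w
```

In the on-policy special case (all ratios and interests equal to 1), the updates reduce to TD(λ) with a growing emphasis weight, and on a deterministic two-state cycle with γ = 0.9 the weights converge to the exact values 1/0.19 and 0.9/0.19.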
Weighted importance sampling (WIS) is generally considered superior to ordinary importance sampling but, when combined with function approximation, it has hitherto required computational complexity that is O(n²) or more in the number of features. In this paper we introduce new off-policy learning algorithms that obtain the benefits of WIS with O(n) computational complexity. Our algorithms maintain for each component of the parameter vector a measure of the extent to which that component has been used in previous examples.
This measure is used to determine component-wise step sizes, merging the ideas of stochastic gradient descent and sample averages. We present our main WIS-based algorithm first in an intuitive acausal form (the forward view) and then derive a causal algorithm using eligibility traces that is equivalent but more efficient (the backward view). In three small experiments, our algorithms performed significantly better than prior O(n) algorithms for off-policy policy evaluation.

ABSTRACT: In reinforcement learning, the notions of experience replay, and of planning as learning from replayed experience, have long been used to find good policies with minimal training data.
Replay can be seen either as model-based reinforcement learning, where the store of past experiences serves as the model, or as a way to avoid a conventional model of the environment altogether. In this paper, we look more deeply at how replay blurs the line between model-based and model-free methods. First, we show for the first time an exact equivalence between the sequence of value functions found by a model-based policy-evaluation method and by a model-free method with replay.
Second, we present a general replay method that can mimic a spectrum of methods ranging from the explicitly model-free TD(0) to the explicitly model-based linear Dyna. Finally, we use insights gained from these relationships to design a new model-based reinforcement learning algorithm for linear function approximation.

ABSTRACT: The present experiment tested whether or not the time course of a conditioned eyeblink response, particularly its duration, would expand and contract as the magnitude of the conditioned response (CR) changed massively during acquisition, extinction, and reacquisition.
The CR duration remained largely constant throughout the experiment, while CR onset and peak time occurred slightly later during extinction. The results suggest that computational models can account for these findings by using two layers of plasticity conforming to the sequence of synapses in the cerebellar pathways that mediate eyeblink conditioning.

However, its most effective variant, weighted importance sampling, does not carry over easily to function approximation and, because of this, it is not utilized in existing off-policy learning algorithms.
In this paper, we take two steps toward bridging this gap. First, we show that weighted importance sampling can be viewed as a special case of weighting the error of individual training samples, and that this weighting has theoretical and empirical benefits similar to those of weighted importance sampling.

ABSTRACT: We consider the problem of learning models of options for real-time abstract planning, in the setting where reward functions can be specified at any time and their expected returns must be efficiently computed.
We introduce a new model for an option that is independent of any reward function, called the universal option model (UOM). We prove that the UOM of an option can construct a traditional option model given a reward function, and also supports efficient computation of the option-conditional return.
We extend the UOM to linear function approximation, and we show it gives the TD solution of option returns and value functions of policies over options. We provide a stochastic approximation algorithm for incrementally learning UOMs from data and prove its consistency. We demonstrate our method in two domains. The first domain is a real-time strategy game, where the controller must select the best game unit to accomplish dynamically-specified tasks. Our experiments show that UOMs are substantially more efficient than previously known methods in evaluating option returns and policies over options.
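To convey the reward-independence at the heart of this idea, here is a tabular Monte Carlo sketch, not the paper's incremental stochastic-approximation algorithm: an occupancy model learned once answers return queries for any reward vector supplied later. The trajectory format and the toy option are our own illustrative assumptions:

```python
import numpy as np

def learn_occupancy_model(trajectories, n_states, alpha=0.1, gamma=0.9):
    """Tabular sketch of a reward-independent option model.

    U[s] estimates the expected discounted state-visitation counts when
    the option is run from state s to termination. The option-conditional
    return for ANY later-specified reward vector r is then simply U @ r.
    """
    U = np.zeros((n_states, n_states))
    for traj in trajectories:   # traj: states visited until the option terminates
        for i, s in enumerate(traj):
            target = np.zeros(n_states)
            for k, s_later in enumerate(traj[i:]):
                target[s_later] += gamma ** k   # discounted visit indicator
            U[s] += alpha * (target - U[s])     # Monte Carlo averaging update
    return U
```

A reward function specified at planning time is then evaluated with a single matrix-vector product, `returns = U @ r`, without re-running or re-learning the option model.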
However, their algorithm is restricted to on-policy learning. In the more general case of off-policy learning, in which the policy whose outcome is predicted and the policy used to generate data may be different, their algorithm cannot be applied. One reason for this is that the algorithm bootstraps and thus is subject to instability problems when function approximation is used. To address these limitations, we generalize their equivalence result and use this generalization to construct the first online algorithm to be exactly equivalent to an off-policy forward view. In the general theorem that allows us to derive this new algorithm, we encounter a new general eligibility-trace update.
ABSTRACT: Q-learning, the most popular of reinforcement learning algorithms, has always included an extension to eligibility traces to enable more rapid learning and improved asymptotic performance on non-Markov problems. Its appeal comes from its equivalence to a clear and conceptually simple forward view, and the fact that it can be implemented online in an inexpensive manner. In a sense this is unavoidable for the conventional forward view, as it itself presumes that the estimates are unchanging during an episode.
Our algorithm uses a new form of eligibility trace similar to but different from conventional accumulating and replacing traces.

In this paper we present results with a robot that learns to next in real time, making thousands of predictions about sensory input signals at timescales from 0.1 to 8 seconds. Our predictions are formulated as a generalization of the value functions commonly used in reinforcement learning, where now an arbitrary function of the sensory input signals is used as a pseudo reward, and the discount rate determines the timescale.
This approach is sufficiently computationally efficient to be used for real-time learning on the robot and sufficiently data efficient to achieve substantial accuracy within 30 minutes. Moreover, a single tile-coded feature representation suffices to accurately predict many different signals over a significant range of timescales.
We also extend nexting beyond simple timescales by letting the discount rate be a function of the state and show that nexting predictions of this more general form can also be learned with substantial accuracy. General nexting provides a simple yet powerful mechanism for a robot to acquire predictive knowledge of the dynamics of its environment.

ABSTRACT: We introduce a new method for robot control that combines prediction learning with a fixed, crafted response: the robot learns to make a temporally-extended prediction during its normal operation, and the prediction is used to select actions as part of a fixed behavioral response.
Our method for robot control combines a fixed response with online prediction learning, thereby producing an adaptive behavior. This method is different from standard non-adaptive control methods and also from adaptive reward-maximizing control methods.
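The combination of a learned temporally extended prediction with a fixed response can be sketched with a single general value function trained by TD(0). The two-state signal stream, the threshold rule, and the action names are our own illustrative assumptions, not the method's actual robot setup:

```python
import numpy as np

def gvf_td0(stream, n_features, alpha=0.1, gamma=0.5):
    """Learn a general value function (GVF) prediction of a sensor signal
    with TD(0). The cumulant is an arbitrary function of sensory input
    (a pseudo reward), and gamma sets the prediction timescale,
    roughly 1/(1 - gamma) steps.
    """
    w = np.zeros(n_features)
    for x, cumulant, x_next in stream:
        delta = cumulant + gamma * (w @ x_next) - (w @ x)
        w = w + alpha * delta * x
    return w

def fixed_response(prediction, threshold=1.0):
    """Fixed, crafted response: the learned prediction, not the raw
    signal, triggers the behavior when it crosses a threshold."""
    return "withdraw" if prediction > threshold else "continue"
```

On an alternating two-state signal stream with γ = 0.5, the predictions converge to 2/3 and 4/3, so the fixed response fires only in the state whose prediction exceeds the threshold, before the signal itself arrives.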