Publications of Ronald J. Williams Available for Downloading


Abstracts for these papers are included in the second half of this page.

Journal Articles

Peng, J. and Williams, R. J. (1996). Incremental multi-step Q-learning. Machine Learning, 22, 283-290.

Peng, J. and Williams, R. J. (1993). Efficient learning and planning within the Dyna framework. Adaptive Behavior, 2, 437-454.

Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8, 229-256.

Williams, R. J. and Peng, J. (1991). Function optimization using connectionist reinforcement learning algorithms. Connection Science, 3, 241-268. [Figures not included]

Williams, R. J. and Peng, J. (1990). An efficient gradient-based algorithm for on-line training of recurrent network trajectories. Neural Computation, 2, 490-501.

Williams, R. J. and Zipser, D. (1989). A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1, 270-280.

Conference and Workshop Presentations

Al-Ansari, M. A. and Williams, R. J. (1999). Efficient, globally-optimized reinforcement learning with the parti-game algorithm. Advances in Neural Information Processing Systems 11.

Al-Ansari, M. A. and Williams, R. J. (1998). Modifying the parti-game algorithm for increased robustness, higher efficiency, and better policies. Proceedings of the Tenth Yale Workshop on Adaptive and Learning Systems, June, New Haven, CT, 204-209.

Peng, J. and Williams, R. J. (1994). Incremental multi-step Q-learning. Proceedings of the Eleventh International Conference on Machine Learning, July, New Brunswick, NJ, 226-232.

Williams, R. J. and Baird, L. C., III (1994). Tight performance bounds on greedy policies based on imperfect value functions. Proceedings of the Eighth Yale Workshop on Adaptive and Learning Systems, June, New Haven, CT, 108-113.

Williams, R. J. (1992). Training recurrent networks using the extended Kalman filter. Proceedings of the International Joint Conference on Neural Networks, June, Baltimore, MD, Vol. IV, 241-246.

Williams, R. J. and Baird, L. C., III (1990). A mathematical analysis of actor-critic architectures for learning optimal controls through incremental dynamic programming. Proceedings of the Sixth Yale Workshop on Adaptive and Learning Systems, August, New Haven, CT, 96-101.

Williams, R. J. and Peng, J. (1989). Reinforcement learning algorithms as function optimizers. Proceedings of the International Joint Conference on Neural Networks, Washington, DC, Vol. II, 89-95.

Book Chapters

Williams, R. J. and Zipser, D. (1995). Gradient-based learning algorithms for recurrent networks and their computational complexity. In: Y. Chauvin and D. E. Rumelhart (Eds.) Back-propagation: Theory, Architectures and Applications, Hillsdale, NJ: Erlbaum. [Figures not included]

Miller, S. and Williams, R. J. (1995). Temporal difference learning: A chemical process control application. In: A. F. Murray (Ed.) Applications of Artificial Neural Networks, Norwell, MA: Kluwer.

Williams, R. J. (1990). Adaptive state representation and estimation using recurrent connectionist networks. In: W. T. Miller, R. S. Sutton, and P. J. Werbos (Eds.) Neural Networks for Control, Cambridge: MIT Press/Bradford Books.

Technical Reports

Al-Ansari, M. A. and Williams, R. J. (1998). Modifying the parti-game algorithm for increased robustness, higher efficiency, and better policies. Technical Report NU-CCS-98-13. Boston: Northeastern University, College of Computer Science.

Williams, R. J. and Baird, L. C., III (1993). Tight performance bounds on greedy policies based on imperfect value functions. Technical Report NU-CCS-93-14. Boston: Northeastern University, College of Computer Science.

Williams, R. J. and Baird, L. C., III (1993). Analysis of some incremental variants of policy iteration: First steps toward understanding actor-critic learning systems. Technical Report NU-CCS-93-11. Boston: Northeastern University, College of Computer Science.

Williams, R. J. (1992). Some observations on the use of the extended Kalman filter as a recurrent network learning algorithm. Technical Report NU-CCS-92-1. Boston: Northeastern University, College of Computer Science.

Williams, R. J. and Zipser, D. (1990). Gradient-based learning algorithms for recurrent connectionist networks. Technical Report NU-CCS-90-9. Boston: Northeastern University, College of Computer Science. [Figures not available]

Greene, R. L. and Williams, R. J. (1989). An approach to using rule-like training data in connectionist networks. Technical Report NU-CCS-89-30. Boston: Northeastern University, College of Computer Science. [Figures not available]


Abstracts of Papers Available for Downloading


A titles-only list of these papers appears in the first half of this page.

Journal Articles

Peng, J. and Williams, R. J. (1996). Incremental multi-step Q-learning. Machine Learning, 22, 283-290.

ABSTRACT: This paper presents a novel incremental algorithm that combines Q-learning, a well-known dynamic programming-based reinforcement learning method, with the TD($\lambda$) return estimation process, which is typically used in actor-critic learning, another dynamic programming-based reinforcement learning method. The parameter $\lambda$ is used to distribute credit throughout sequences of actions, leading to faster learning and also helping to alleviate the non-Markovian effect of coarse state-space quantization. The resulting algorithm, Q($\lambda$)-learning, thus combines some of the best features of the Q-learning and actor-critic learning paradigms. The behavior of this algorithm has been demonstrated through computer simulations.
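
For orientation, the sketch below shows generic tabular Q-learning augmented with an accumulate-and-decay eligibility trace, which is the mechanism by which the $\lambda$ parameter spreads credit over recent state-action pairs. It is a simplified stand-in rather than the exact Q($\lambda$) update derived in the paper, and the environment, reward, and constants are placeholder assumptions.

    import numpy as np

    # Toy tabular Q-learning with an eligibility trace. The dynamics and reward
    # are placeholders; the trace handling is the generic accumulate-and-decay
    # scheme, not necessarily the exact Peng & Williams Q(lambda) rule.
    rng = np.random.default_rng(0)
    n_states, n_actions = 10, 2
    gamma, lam, alpha, epsilon = 0.95, 0.8, 0.1, 0.1

    Q = np.zeros((n_states, n_actions))
    E = np.zeros_like(Q)                      # eligibility traces
    s = rng.integers(n_states)

    for t in range(10000):
        a = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(Q[s]))
        s_next = rng.integers(n_states)       # placeholder dynamics
        r = float(s_next == 0)                # placeholder reward
        delta = r + gamma * np.max(Q[s_next]) - Q[s, a]
        E[s, a] += 1.0                        # mark the visited pair as eligible
        Q += alpha * delta * E                # credit all recently visited pairs
        E *= gamma * lam                      # decay traces
        s = s_next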

Peng, J. and Williams, R. J. (1993). Efficient learning and planning within the Dyna framework. Adaptive Behavior, 2, 437-454.

ABSTRACT: Sutton's Dyna framework provides a novel and computationally appealing way to integrate learning, planning, and reacting in autonomous agents. Examined here is a class of strategies designed to enhance the learning and planning power of Dyna systems by increasing their computational efficiency. The benefit of using these strategies is demonstrated on some simple abstract learning tasks.
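
As a rough illustration of the Dyna idea itself (learn a model from real experience and use it to generate additional planning backups), here is a minimal tabular Dyna-Q sketch. The dynamics, reward, and uniform replay below are placeholder assumptions; the strategies examined in the paper concern making such planning backups more efficient than the uniform sampling shown here.

    import numpy as np

    # Minimal tabular Dyna-Q sketch: each real step updates Q and a learned
    # (deterministic) model, and the model then supplies extra simulated backups.
    rng = np.random.default_rng(0)
    n_states, n_actions = 20, 2
    gamma, alpha, epsilon, n_planning = 0.95, 0.1, 0.1, 10

    Q = np.zeros((n_states, n_actions))
    model = {}                                # (s, a) -> (r, s_next)
    s = rng.integers(n_states)

    for t in range(5000):
        a = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(Q[s]))
        s_next = (s + 1) % n_states if a == 1 else rng.integers(n_states)  # placeholder dynamics
        r = float(s_next == 0)                                             # placeholder reward
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])       # direct learning
        model[(s, a)] = (r, s_next)                                        # model learning
        for _ in range(n_planning):                                        # planning backups
            (ps, pa), (pr, pnext) = list(model.items())[rng.integers(len(model))]
            Q[ps, pa] += alpha * (pr + gamma * np.max(Q[pnext]) - Q[ps, pa])
        s = s_next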

Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8, 229-256.

ABSTRACT: This article presents a general class of associative reinforcement learning algorithms for connectionist networks containing stochastic units. These algorithms, called REINFORCE algorithms, are shown to make weight adjustments in a direction that lies along the gradient of expected reinforcement in both immediate-reinforcement tasks and certain limited forms of delayed-reinforcement tasks, and they do this without explicitly computing gradient estimates or even storing information from which such estimates could be computed. Specific examples of such algorithms are presented, some of which bear a close relationship to certain existing algorithms while others are novel but potentially interesting in their own right. Also given are results that show how such algorithms can be naturally integrated with backpropagation. We close with a brief discussion of a number of additional issues surrounding the use of such algorithms, including what is known about their limiting behaviors as well as further considerations that might be used to help develop similar but potentially more powerful reinforcement learning algorithms.
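
The REINFORCE weight update presented in the paper has the form $\Delta w_{ij} = \alpha_{ij} (r - b_{ij}) e_{ij}$, where $e_{ij} = \partial \ln g_i / \partial w_{ij}$ is the characteristic eligibility of the weight. As one concrete instance, the sketch below applies this rule to a single Bernoulli-logistic unit, whose characteristic eligibility works out to $(y - p)\,x_j$, on a toy immediate-reinforcement task; the task, baseline, and learning rate are placeholder assumptions.

    import numpy as np

    # REINFORCE for a single Bernoulli-logistic unit on a toy immediate-reinforcement
    # task (reward 1 when the sampled output copies the first input bit).
    rng = np.random.default_rng(0)
    n_in, alpha, baseline = 3, 0.1, 0.0
    w = np.zeros(n_in + 1)                               # weights plus bias

    for t in range(5000):
        x = np.append(rng.integers(0, 2, n_in), 1.0)     # random binary input, plus bias input
        p = 1.0 / (1.0 + np.exp(-w @ x))                 # firing probability
        y = float(rng.random() < p)                      # stochastic binary output
        r = float(y == x[0])                             # toy reward signal
        w += alpha * (r - baseline) * (y - p) * x        # characteristic eligibility: (y - p) * x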

Williams, R. J. and Peng, J. (1991). Function optimization using connectionist reinforcement learning algorithms. Connection Science, 3, 241-268. [Figures not included]

ABSTRACT: Any nonassociative reinforcement learning algorithm can be viewed as a method for performing function optimization through (possibly noise-corrupted) sampling of function values. We describe the results of simulations in which the optima of several deterministic functions studied by Ackley (1987) were sought using variants of REINFORCE algorithms (Williams, 1987; 1988). Some of the algorithms used here incorporated additional heuristic features resembling certain aspects of some of the algorithms used in Ackley's studies. Differing levels of performance were achieved by the various algorithms investigated, but a number of them performed at a level comparable to the best found in Ackley's studies on a number of the tasks, in spite of their simplicity. One of these variants, called REINFORCE/MENT, represents a novel but principled approach to reinforcement learning in nontrivial networks which incorporates an entropy maximization strategy. This was found to perform especially well on more hierarchically organized tasks.

Williams, R. J. and Peng, J. (1990). An efficient gradient-based algorithm for on-line training of recurrent network trajectories. Neural Computation, 2, 490-501.

ABSTRACT: A novel variant of the familiar backpropagation-through-time approach to training recurrent networks is described. This algorithm is intended to be used on arbitrary recurrent networks that run continually without ever being reset to an initial state, and it is specifically designed for computationally efficient computer implementation. This algorithm can be viewed as a cross between epochwise backpropagation through time, which is not appropriate for continually running networks, and the widely used on-line gradient approximation technique of truncated backpropagation through time.
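
The scheduling idea can be sketched roughly as follows: run the network continually, retain only the last h steps of history, and every h' steps perform one backward pass of depth h, injecting error only for the h' most recent steps. The toy input and target streams, network size, and step sizes in the sketch below are placeholder assumptions, and the details of the algorithm actually derived in the paper differ.

    import numpy as np

    # Rough sketch of a truncated, periodically scheduled BPTT pass for a small
    # tanh recurrent network driven by placeholder input/target streams.
    rng = np.random.default_rng(0)
    n_in, n_hid, h, h_prime, lr = 2, 8, 12, 4, 0.01
    W_in = 0.2 * rng.standard_normal((n_hid, n_in))
    W_rec = 0.2 * rng.standard_normal((n_hid, n_hid))
    W_out = 0.2 * rng.standard_normal((1, n_hid))

    x_hist, e_hist, s_hist = [], [], [np.zeros(n_hid)]

    for t in range(2000):
        x = rng.standard_normal(n_in)                 # placeholder input stream
        target = np.array([np.sin(0.1 * t)])          # placeholder target stream
        s = np.tanh(W_in @ x + W_rec @ s_hist[-1])
        y = W_out @ s
        x_hist.append(x); e_hist.append(y - target); s_hist.append(s)
        x_hist, e_hist, s_hist = x_hist[-h:], e_hist[-h:], s_hist[-(h + 1):]

        if (t + 1) % h_prime == 0:                    # backward pass every h_prime steps
            dW_in, dW_rec, dW_out = (np.zeros_like(W_in), np.zeros_like(W_rec),
                                     np.zeros_like(W_out))
            delta = np.zeros(n_hid)
            for k in range(len(e_hist) - 1, -1, -1):  # back through the stored window
                d = delta.copy()
                if k >= len(e_hist) - h_prime:        # inject error only at recent steps
                    d += (W_out.T @ e_hist[k]).ravel()
                    dW_out += np.outer(e_hist[k], s_hist[k + 1])
                d *= 1.0 - s_hist[k + 1] ** 2         # tanh derivative
                dW_in += np.outer(d, x_hist[k])
                dW_rec += np.outer(d, s_hist[k])
                delta = W_rec.T @ d
            W_in -= lr * dW_in; W_rec -= lr * dW_rec; W_out -= lr * dW_out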

Williams, R. J. and Zipser, D. (1989). A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1, 270-280.

ABSTRACT: The exact form of a gradient-following learning algorithm for completely recurrent networks running in continually sampled time is derived and used as the basis for practical algorithms for temporal supervised learning tasks. These algorithms have: (1) the advantage that they do not require a precisely defined training interval, operating while the network runs; and (2) the disadvantage that they require nonlocal communication in the network being trained and are computationally expensive. These algorithms are shown to allow networks having recurrent connections to learn complex tasks requiring the retention of information over time periods having either fixed or indefinite length.
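
In the notation now commonly used for this algorithm (real-time recurrent learning, RTRL), the key quantities are the sensitivities $p^k_{ij}(t) = \partial y_k(t) / \partial w_{ij}$, updated forward in time alongside the network. A standard rendering of the recurrence (following common secondary presentations rather than quoting the paper) is

$$p^k_{ij}(t+1) = f_k'\bigl(s_k(t)\bigr) \Bigl[ \sum_{l} w_{kl}\, p^l_{ij}(t) + \delta_{ik}\, z_j(t) \Bigr], \qquad p^k_{ij}(t_0) = 0,$$

with on-line weight update $\Delta w_{ij}(t) = \alpha \sum_k e_k(t)\, p^k_{ij}(t)$, where $z_j(t)$ ranges over the external inputs and unit outputs, $s_k(t)$ is the net input to unit $k$, $e_k(t)$ the output error, and $\delta_{ik}$ the Kronecker delta.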

Conference and Workshop Presentations

Al-Ansari, M. A. and Williams, R. J. (1999). Efficient, globally-optimized reinforcement learning with the parti-game algorithm. Advances in Neural Information Processing Systems 11.

ABSTRACT: Parti-game is a reinforcement learning (RL) algorithm that has a lot of promise in overcoming the curse of dimensionality that can plague RL algorithms when applied to high-dimensional problems. In this paper we introduce modifications to the algorithm that further improve its performance and robustness. In addition, while parti-game solutions can be improved locally by standard local path-improvement techniques, we introduce an add-on algorithm in the same spirit as parti-game that instead tries to improve solutions in a non-local manner.

Al-Ansari, M. A. and Williams, R. J. (1998). Modifying the parti-game algorithm for increased robustness, higher efficiency, and better policies. Proceedings of the Tenth Yale Workshop on Adaptive and Learning Systems, June, New Haven, CT, 204-209.

ABSTRACT: Parti-game is a reinforcement learning (RL) algorithm that has a lot of promise in overcoming the curse of dimensionality that can plague RL algorithms when applied to high-dimensional problems. In this paper we introduce modifications to the algorithm that further improve its performance and robustness. In addition, while parti-game solutions can be improved locally by standard local path-improvement techniques, we introduce an add-on algorithm in the same spirit as parti-game that instead tries to improve solutions in a non-local manner.

Peng, J. and Williams, R. J. (1994). Incremental multi-step Q-learning. Proceedings of the Eleventh International Conference on Machine Learning, July, New Brunswick, NJ, 226-232.

ABSTRACT: This paper presents a novel incremental algorithm that combines Q-learning, a well-known dynamic programming-based reinforcement learning method, with the TD($\lambda$) return estimation process, which is typically used in actor-critic learning, another dynamic programming-based reinforcement learning method. The parameter $\lambda$ is used to distribute credit throughout sequences of actions, leading to faster learning and also helping to alleviate the non-Markovian effect of coarse state-space quantization. The resulting algorithm, Q($\lambda$)-learning, thus combines some of the best features of the Q-learning and actor-critic learning paradigms. The behavior of this algorithm is demonstrated through computer simulations of the standard benchmark control problem of learning to balance a pole on a cart.

Williams, R. J. and Baird, L. C., III (1994). Tight performance bounds on greedy policies based on imperfect value functions. Proceedings of the Eighth Yale Workshop on Adaptive and Learning Systems, June, New Haven, CT, 108-113.

ABSTRACT: Consider a given value function on states of a Markov decision problem, as might result from applying a reinforcement learning algorithm. Unless this value function equals the corresponding optimal value function, at some states there will be a discrepancy, which it is natural to call the Bellman residual, between what the value function specifies at that state and what is obtained by a one-step lookahead along the seemingly best action at that state using the given value function to evaluate all succeeding states. This paper derives a tight bound on how far from optimal the discounted return for a greedy policy based on the given value function will be as a function of the maximum norm magnitude of this Bellman residual. A corresponding result is also obtained for value functions defined on state-action pairs, as are used in Q-learning. One significant application of these results is to problems where a function approximator is used to learn a value function, with training of the approximator based on trying to minimize the Bellman residual across states or state-action pairs. When control is based on the use of the resulting value function, this result provides a link between how well the objectives of function approximator training are met and the quality of the resulting control.
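
For orientation, the bound is usually quoted in the following form (this is a summary, not the paper's exact statement; consult the paper for the precise assumptions). With discount factor $\gamma < 1$, optimal value function $V^*$, and a value function whose Bellman residual is at most $\epsilon$ in maximum norm, any policy $\pi$ greedy with respect to that value function satisfies

$$\| V^* - V^{\pi} \|_{\infty} \;\le\; \frac{2\gamma\epsilon}{1-\gamma}.$$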

Williams, R. J. (1992). Training recurrent networks using the extended Kalman filter. Proceedings of the International Joint Conference on Neural Networks, June, Baltimore, MD, Vol. IV, 241-246.

ABSTRACT: The extended Kalman filter (EKF) can be used as an on-line algorithm to determine the weights in a recurrent network given target outputs as it runs. This paper notes some relationships between the EKF as applied to recurrent net learning and some simpler techniques that are more widely used. In particular, making certain simplifications to the EKF gives rise to an algorithm essentially identical to the real-time recurrent learning (RTRL) algorithm. Since the EKF involves adjusting unit activity in the network, it also provides a principled generalization of the teacher forcing technique. Preliminary simulation experiments on simple finite-state Boolean tasks indicate that the EKF can provide substantial speed-up in number of time steps required for training on such problems when compared with simpler on-line gradient algorithms. The computational requirements of the EKF are steep, but turn out to scale with network size at the same rate as RTRL. These observations are intended to provide insights that may lead to recurrent net training techniques that allow better control over the tradeoff between computational cost and convergence time.
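
For reference, the generic EKF update for estimating a weight vector $w_t$ from targets $d_t$, with network output $y_t$, output Jacobian $H_t = \partial y_t / \partial w$, error covariance $P_t$, and measurement-noise covariance $R_t$, is

$$K_t = P_t H_t^{\top} \bigl( H_t P_t H_t^{\top} + R_t \bigr)^{-1}, \qquad w_{t+1} = w_t + K_t (d_t - y_t), \qquad P_{t+1} = P_t - K_t H_t P_t,$$

often with a small process-noise term added to $P_{t+1}$. This is the standard form only; the formulation in the paper also folds the unit activities into the estimated state, which is what yields the generalization of teacher forcing noted in the abstract.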

Williams, R. J. and Baird, L. C., III (1990). A mathematical analysis of actor-critic architectures for learning optimal controls through incremental dynamic programming. Proceedings of the Sixth Yale Workshop on Adaptive and Learning Systems, August, New Haven, CT, 96-101.

ABSTRACT: Combining elements of the theory of dynamic programming with features appropriate for on-line learning has led to an approach Watkins has called incremental dynamic programming. Here we adopt this incremental dynamic programming point of view and obtain some preliminary mathematical results relevant to understanding the capabilities and limitations of actor-critic learning systems. Examples of such systems are Samuel's learning checker player, Holland's bucket brigade algorithm, Witten's adaptive controller, and the adaptive heuristic critic algorithm of Barto, Sutton, and Anderson. Particular emphasis here is on the effect of complete asynchrony in the updating of the actor and the critic across individual states or state-action pairs. The main results are that, while convergence to optimal performance is not guaranteed in general, there are a number of situations in which such convergence is assured.

Williams, R. J. and Peng, J. (1989). Reinforcement learning algorithms as function optimizers. Proceedings of the International Joint Conference on Neural Networks, Washington, DC, Vol. II, 89-95.

ABSTRACT: Any nonassociative reinforcement learning algorithm can be viewed as a method for performing function optimization through (possibly noise-corrupted) sampling of function values. We describe the results of simulations in which the optima of several deterministic functions studied by Ackley were sought using variants of REINFORCE algorithms. Results obtained for certain of these algorithms compare favorably to the best results found by Ackley.

Book Chapters

Williams, R. J. and Zipser, D. (1995). Gradient-based learning algorithms for recurrent networks and their computational complexity. In: Y. Chauvin and D. E. Rumelhart (Eds.) Back-propagation: Theory, Architectures and Applications, Hillsdale, NJ: Erlbaum. Also appeared as: Gradient-based learning algorithms for recurrent connectionist networks, Technical Report NU-CCS-90-9. Boston: Northeastern University, College of Computer Science. [Figures not included]

ABSTRACT: None

Miller, S. and Williams, R. J. (1995). Temporal difference learning: A chemical process control application. In: A. F. Murray (Ed.) Applications of Artificial Neural Networks, Norwell, MA: Kluwer.

ABSTRACT: None

Williams, R. J. (1990). Adaptive state representation and estimation using recurrent connectionist networks. In: W. T. Miller, R. S. Sutton, and P. J. Werbos (Eds.) Neural Networks for Control, Cambridge: MIT Press/Bradford Books.

ABSTRACT: None

Technical Reports

Al-Ansari, M. A. and Williams, R. J. (1998). Modifying the parti-game algorithm for increased robustness, higher efficiency, and better policies. Technical Report NU-CCS-98-13. Boston: Northeastern University, College of Computer Science.

ABSTRACT: Parti-game is a reinforcement learning (RL) algorithm that has a lot of promise in overcoming the curse of dimensionality that can plague RL algorithms when applied to high-dimensional problems. In this paper we introduce modifications to the algorithm that further improve its performance and robustness. In addition, while parti-game solutions can be improved locally by standard local path-improvement techniques, we introduce an add-on algorithm in the same spirit as parti-game that instead tries to improve solutions in a non-local manner.

Williams, R. J. and Baird, L. C., III (1993). Tight performance bounds on greedy policies based on imperfect value functions. Technical Report NU-CCS-93-14. Boston: Northeastern University, College of Computer Science.

ABSTRACT: Consider a given value function on states of a Markov decision problem, as might result from applying a reinforcement learning algorithm. Unless this value function equals the corresponding optimal value function, at some states there will be a discrepancy, which it is natural to call the Bellman residual, between what the value function specifies at that state and what is obtained by a one-step lookahead along the seemingly best action at that state using the given value function to evaluate all succeeding states. This paper derives a tight bound on how far from optimal the discounted return for a greedy policy based on the given value function will be as a function of the maximum norm magnitude of this Bellman residual. A corresponding result is also obtained for value functions defined on state-action pairs, as are used in Q-learning. One significant application of these results is to problems where a function approximator is used to learn a value function, with training of the approximator based on trying to minimize the Bellman residual across states or state-action pairs. When control is based on the use of the resulting value function, this result provides a link between how well the objectives of function approximator training are met and the quality of the resulting control.

Williams, R. J. and Baird, L. C., III (1993). Analysis of some incremental variants of policy iteration: First steps toward understanding actor-critic learning systems. Technical Report NU-CCS-93-11. Boston: Northeastern University, College of Computer Science.

ABSTRACT: This paper studies algorithms based on an incremental dynamic programming abstraction of one of the key issues in understanding the behavior of actor-critic learning systems. The prime example of such a learning system is the ASE/ACE architecture introduced by Barto, Sutton, and Anderson (1983). Also related are Witten's adaptive controller (1977) and Holland's bucket brigade algorithm (1986). The key feature of such a system is the presence of separate adaptive components for action selection and state evaluation, and the key issue focused on here is the extent to which their joint adaptation is guaranteed to lead to optimal behavior in the limit. In the incremental dynamic programming point of view taken here, these questions are formulated in terms of the use of separate data structures for the current best choice of policy and current best estimate of state values, with separate operations used to update each at individual states. Particular emphasis here is on the effect of complete asynchrony in the updating of these data structures across states. The main results are that, while convergence to optimal performance is not guaranteed in general, there are a number of situations in which such convergence is assured. Since the algorithms investigated represent a certain idealized abstraction of actor-critic learning systems, these results are not directly applicable to current versions of such learning systems but may be viewed instead as providing a useful first step toward more complete understanding of such systems. Another useful perspective on the algorithms analyzed here is that they represent a broad class of asynchronous dynamic programming procedures based on policy iteration.
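
As a caricature of the setting analyzed (separate, asynchronously updated data structures for the policy and for the state values, each revised at individual states), the toy sketch below interleaves single-state value backups under the current policy with single-state greedy policy updates on a randomly generated MDP. The MDP, the update schedule, and all constants are placeholder assumptions, not the paper's examples.

    import numpy as np

    # Asynchronous incremental policy iteration on a random MDP: at each step,
    # pick a state and either back up its value under the current policy
    # ("critic"-style) or make the policy greedy there ("actor"-style).
    rng = np.random.default_rng(1)
    n_states, n_actions, gamma = 5, 2, 0.9
    P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # transition probabilities
    R = rng.standard_normal((n_states, n_actions))                    # expected rewards

    V = np.zeros(n_states)
    pi = np.zeros(n_states, dtype=int)

    for _ in range(20000):
        s = rng.integers(n_states)
        if rng.random() < 0.5:
            V[s] = R[s, pi[s]] + gamma * P[s, pi[s]] @ V     # value backup at s
        else:
            pi[s] = int(np.argmax(R[s] + gamma * P[s] @ V))  # greedy policy update at s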

Williams, R. J. (1992). Some observations on the use of the extended Kalman filter as a recurrent network learning algorithm. Technical Report NU-CCS-92-1. Boston: Northeastern University, College of Computer Science.

ABSTRACT: The extended Kalman filter (EKF) can be used as an on-line algorithm to determine the weights in a recurrent network given target outputs as it runs. This involves forming an augmented network state vector consisting of all unit activities and weights. This report notes some relationships between the EKF as applied to recurrent net learning and some simpler techniques that are more widely used. In particular, it is shown that making certain simplifications to the EKF gives rise to an algorithm essentially identical to the real-time recurrent learning (RTRL) algorithm. That is, the resulting algorithm both maintains the RTRL data structure and prescribes identical weight changes. In addition, because the EKF also involves adjusting unit activity in the network, it provides a principled generalization of the useful "teacher forcing" technique. Very preliminary experiments on simple finite-state Boolean tasks indicate that the EKF works well for these, generally giving substantial speed-up in the number of time steps required for convergence to a solution when compared with the behavior of simpler on-line gradient algorithms like RTRL or truncated backpropagation through time (BPTT). The computational requirements of the EKF for this application are also compared with those of RTRL and BPTT. Its storage requirements scale as the 4th power of the number of units in the network, which exceeds the 3rd power scaling of the storage requirements of RTRL. Interestingly, however, the computation can be organized so that the number of arithmetic operations per time step scales as the 4th power of the number of units, which is no worse (except by a constant factor) than RTRL. Finally, some speculation is offered that the EKF and related algorithms may help provide a basis for both gaining a deeper understanding of existing recurrent network learning techniques and creating more computationally attractive algorithms that nevertheless retain the advantages of the EKF, most notably its incrementality and superior convergence speed.

Williams, R. J. and Zipser, D. (1990). Gradient-based learning algorithms for recurrent connectionist networks. Technical Report NU-CCS-90-9. Boston: Northeastern University, College of Computer Science. [Figures not available]

ABSTRACT: Recurrent connectionist networks are important because they can perform temporally extended tasks, giving them considerable power beyond the static mappings performed by the now-familiar multilayer feedforward networks. This ability to perform highly nonlinear dynamic mappings makes these networks particularly interesting to study and potentially quite useful in tasks which have an important temporal component not easily handled through the use of simple tapped delay lines. Some examples are tasks involving recognition or generation of sequential patterns and sensorimotor control. This report examines a number of learning procedures for adjusting the weights in recurrent networks in order to train such networks to produce desired temporal behaviors from input-output stream examples. The procedures are all based on the computation of the gradient of performance error with respect to network weights, and a number of strategies for computing the necessary gradient information are described. Included here are approaches which are familiar and have been first described elsewhere, along with several novel approaches. One particular purpose of this report is to provide uniform and detailed descriptions and derivations of the various techniques in order to emphasize how they relate to one another. Another important contribution of this report is a detailed analysis of the computational requirements of the various approaches discussed.

Greene, R. L. and Williams, R. J. (1989). An approach to using rule-like training data in connectionist networks. Technical Report NU-CCS-89-30. Boston: Northeastern University, College of Computer Science. [Figures not available]

ABSTRACT: This report proposes a technique for training multilayer networks using data which is more rule-like than standard approaches allow. In particular, this technique is designed to handle training patterns in which some components of the input vector are unspecified. These training patterns are interpreted as representing entire sets of individual training instances, each of which is to be mapped to the desired output for that pattern. Training a network using such a pattern corresponds to adding a rule to a rule-based system. Results of some experiments using this technique are also discussed, and an interesting further generalization is described. Overall, the approach can be viewed as a means of creating networks that integrates some intensional aspects into the purely extensional mode of specification characteristic of standard supervised learning algorithms like backpropagation.
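
One simple way to realize the "set of instances" interpretation described above (offered only as an illustration, not as the report's specific technique) is to expand each rule-like pattern into concrete training instances by filling in the unspecified input components, as in the sketch below; the resulting instances could then be fed to an ordinary supervised learner such as backpropagation.

    import numpy as np

    def expand(pattern, n_samples, rng):
        # Turn a rule-like pattern (None marks an unspecified input component)
        # into concrete instances by sampling the unspecified components.
        x, y = pattern
        out = []
        for _ in range(n_samples):
            xi = [rng.integers(0, 2) if v is None else v for v in x]
            out.append((np.array(xi, dtype=float), np.array(y, dtype=float)))
        return out

    rng = np.random.default_rng(0)
    # "If the first input is 1 and the third is 0, the output should be 1."
    rule = ([1, None, 0, None], [1])
    instances = expand(rule, n_samples=4, rng=rng)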