This appendix shows how to apply the work of A. A. J. Marley on commutative learning operators to reinforcement learning. Marley works with a slightly more general set of axioms than the one I use here, which is tailored toward the basic model of reinforcement learning.
Consider the sequences of propensities $Q_n^1, Q_n^2, \ldots$, one for each act. They arise from sequences of choice probabilities that at every stage satisfy Luce's choice axiom. Let $X$ be the range of values the random variables $Q_n^i$ can take on. In many applications, $X$ will just be the set of nonnegative real numbers. Let $O$ be the set of outcomes, which can often be identified with a subset of the reals.
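As a concrete illustration, Luce's choice rule obtains choice probabilities from propensities by normalization: each act is chosen with probability proportional to its propensity. A minimal sketch (the function name is mine, not Marley's):

```python
def luce_choice_probabilities(propensities):
    """Luce's choice rule: each act is chosen with probability
    proportional to its propensity."""
    total = sum(propensities)
    return [q / total for q in propensities]

# Example: propensities 1 and 3 yield choice probabilities 0.25 and 0.75.
probs = luce_choice_probabilities([1.0, 3.0])
```

Because the probabilities are ratios of propensities, the resulting choice probabilities automatically satisfy Luce's choice axiom: the odds between any two acts are independent of which other acts are available.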
A learning operator $L$ maps pairs of propensities and outcomes in $X \times O$ to $X$. If $x$ is an alternative's present propensity, then $L(x, a)$ is its new propensity if choosing that alternative has led to outcome $a$. This gives rise to a family of learning operators; for each $a \in O$, $L_a = L(\cdot, a)$ can be viewed as an operator from $X$ to $X$. We assume that there is a unit element $e \in O$ with $L(x, e) = x$ for all $x \in X$.
The triple $(X, O, L)$ is called an abstract family. An abstract family is quasi-additive if there exists a strictly increasing function $f$ on $X$ and a function $r$ on $O$ such that for each $x$ in $X$ and each $a$ in $O$,
$$L(x, a) = f^{-1}\bigl(f(x) + r(a)\bigr).$$
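A quasi-additive operator is additive after rescaling by $f$. The following sketch (under the reconstruction above, with hypothetical helper names) shows two familiar special cases: the identity $f$ gives additive updating of the basic Roth-Erev kind, while $f = \log$ gives multiplicative updating.

```python
import math

def quasi_additive_operator(f, f_inv, r):
    """Build a learning operator L(x, a) = f^{-1}(f(x) + r(a)) from a
    strictly increasing f and a reinforcement function r on outcomes."""
    return lambda x, a: f_inv(f(x) + r(a))

# With f the identity, L(x, a) = x + r(a): additive updating.
L_add = quasi_additive_operator(lambda x: x, lambda y: y, lambda a: a)

# With f = log, L(x, a) = x * exp(r(a)): multiplicative updating.
L_mult = quasi_additive_operator(math.log, math.exp, lambda a: a)
```

Note that operators of this form commute: applying $L_a$ and then $L_b$ adds $r(a) + r(b)$ on the $f$-scale, which is the same as applying $L_b$ and then $L_a$.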
We say that the process given by the sequences of propensities and the sequence of choices of acts is a reinforcement learning process if there exists an abstract family $(X, O, L)$ such that for all $n$
$$Q_{n+1}^i = L(Q_n^i, a)$$
whenever act $i$ is chosen at stage $n$ and leads to outcome $a$, and
$$Q_{n+1}^i = L(Q_n^i, e) = Q_n^i$$
whenever act $i$ is not chosen at stage $n$, where $e$ is the unit element of $O$. If such a family is quasi-additive, then
$$f(Q_{n+1}^i) = f(Q_n^i) + r(a)$$
whenever act $i$ is chosen at stage $n$ and leads to outcome $a$.
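Putting the pieces together, one stage of a reinforcement learning process with additive updating can be sketched as follows. The payoff function and parameter values are illustrative assumptions, not part of the formal definition:

```python
import random

def reinforcement_learning_step(propensities, payoff, rng=random):
    """One stage: choose an act by Luce's rule, then reinforce the
    chosen act by its payoff. Unchosen acts keep their propensity
    (the unit element e leaves propensities fixed)."""
    # Sample an act with probability proportional to its propensity.
    i = rng.choices(range(len(propensities)), weights=propensities)[0]
    new = list(propensities)
    new[i] = new[i] + payoff(i)   # L(x, a) = x + a for the chosen act
    return new

# Illustrative run: act 0 always pays 1, act 1 always pays 2.
rng = random.Random(0)
q = [1.0, 1.0]
for _ in range(100):
    q = reinforcement_learning_step(q, lambda i: [1.0, 2.0][i], rng)
```

Since the act with the higher payoff gains propensity faster, its choice probability under Luce's rule tends to grow over time.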
Some of Marley's principles tell us when an abstract family fits the learning process we are interested in. Let's start by introducing the relevant concepts.
An abstract family is strictly monotonic if for all $x, y$ in $X$ and each $a$ in $O$, $x < y$ if and only if $L(x, a) < L(y, a)$.
Strict monotonicity says that learning is stable; an outcome has the same effect on how propensities are ordered across all propensity levels.
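For the additive operator $L(x, a) = x + a$ strict monotonicity is immediate, since adding the same amount to two propensities preserves their order. A quick numerical check on sample points (illustrative, not a proof; the helper name is mine):

```python
def is_strictly_monotonic(L, xs, outcomes):
    """Check, on sample points, that x < y holds exactly when
    L(x, a) < L(y, a), for each sampled outcome a."""
    return all(
        (x < y) == (L(x, a) < L(y, a))
        for a in outcomes
        for x in xs
        for y in xs
    )

# Additive updating preserves the ordering of propensities.
additive = lambda x, a: x + a
result = is_strictly_monotonic(additive, [0.5, 1.0, 2.0], [0.0, 1.0, 3.0])
```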