5 The Stochastic Approximation Algorithm: 5.1 Stochastic Processes - Some Basic Concepts
5.1.2 Convergence of Random Variables
A random sequence may converge to a random variable, say to X. There are several useful
notions of convergence:
2. Convergence in probability:
\[ X_n \xrightarrow{p} X \quad \text{if} \quad \lim_{n\to\infty} P(|X_n - X| > \epsilon) = 0, \quad \forall \epsilon > 0 . \]
4. Convergence in Distribution:
\[ X_n \xrightarrow{\text{Dist}} X \quad \text{if} \quad Ef(X_n) \to Ef(X) \]
for every bounded and continuous function $f$.
5.1.3 Sigma-algebras and information
Sigma algebras (or $\sigma$-algebras) are part of the mathematical structure of probability theory.
They also have a convenient interpretation as information sets, which we shall find useful.
\[ E(Z|X) \triangleq E(Z|\mathcal{F}^X) , \qquad E(Z|X_1, \ldots, X_n) \triangleq E(Z|\mathcal{F}_n) . \]
Note that $\mathcal{F}_{n+1} \supseteq \mathcal{F}_n$: more RVs carry more information, leading $\mathcal{F}_{n+1}$ to be finer, or more detailed.
5.1.4 Martingales
b. Each RV $X_k$ is $\mathcal{F}_k$-measurable.
Note that
(a) Property (a) is roughly equivalent to:
$\mathcal{F}_k$ represents (the information in) some RVs $(Y_0, \ldots, Y_k)$,
and (b) then means: $X_k$ is a function of $(Y_0, \ldots, Y_k)$.
A particular case is $\mathcal{F}_n = \sigma\{X_1, \ldots, X_n\}$ (a self-martingale).
The central property is (c), which says that the conditional mean of $X_{k+1}$ equals $X_k$.
This is obviously stronger than $E(X_{k+1}) = E(X_k)$.
The definition sometimes also requires that $E|X_n| < \infty$; we shall assume this below.
Examples:
a. The simplest example of a martingale is
\[ X_k = \sum_{\ell=0}^{k} \xi_\ell , \]
where $(\xi_\ell)$ are zero-mean independent RVs.
Martingale Inequalities
2. Positive Martingale Convergence: If $(X_k, \mathcal{F}_k)$ is a positive martingale (namely $X_n \ge 0$), then $X_k$ converges (a.s.) to some RV $X_\infty$.
For example, the conclusion holds if the sequence $(\xi_k)$ is bounded, namely $|\xi_k| \le C$ for some $C > 0$ (independent of $k$).
Note:
More generally, for any sequence $(Y_k)$ and filtration $(\mathcal{F}_k)$, where $Y_k$ is measurable on $\mathcal{F}_k$, the following is a martingale difference:
\[ \delta_k \triangleq Y_k - E(Y_k | \mathcal{F}_{k-1}) . \]
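As a numerical sanity check (a sketch with illustrative choices, not part of the text), one can build a concrete martingale-difference sequence this way and watch its running average vanish:

```python
import numpy as np

# Numerical sketch of the martingale-difference construction
# delta_k = Y_k - E(Y_k | F_{k-1}).  Here Y_k = Z_k**2 for i.i.d.
# standard normals Z_k, so E(Y_k | F_{k-1}) = E(Z_k**2) = 1 and
# delta_k = Z_k**2 - 1.  (All names below are illustrative only.)
rng = np.random.default_rng(0)
Z = rng.standard_normal(100_000)
delta = Z**2 - 1.0          # a martingale-difference sequence

# For a bounded-variance martingale difference the partial averages
# vanish a.s.; empirically the running mean should be near zero.
running_mean = np.cumsum(delta) / np.arange(1, len(delta) + 1)
print(abs(running_mean[-1]))  # small
```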
5.2 The Basic SA Algorithm
\[ \theta_{n+1} = \theta_n + \gamma_n Y_n = \theta_n + \gamma_n [h(\theta_n) + \omega_n] , \qquad n \ge 0 . \]
Obviously, with zero noise ($\omega_n \equiv 0$) the stationary points of the algorithm coincide with the solutions of $h(\theta) = 0$. Under appropriate conditions (on $\gamma_n$, $h$ and $\omega_n$) the algorithm can indeed be shown to converge to a solution of $h(\theta) = 0$.
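A minimal simulation sketch of this iteration, for the illustrative choice $h(\theta) = \mu - \theta$ (whose root is $\theta = \mu$) with i.i.d. Gaussian noise; all names and constants below are assumptions for the demo, not part of the text:

```python
import numpy as np

# Basic SA iteration:
#   theta_{n+1} = theta_n + gamma_n * (h(theta_n) + omega_n)
# with h(theta) = mu - theta and zero-mean Gaussian noise omega_n.
rng = np.random.default_rng(1)
mu = 2.0
theta = 0.0
for n in range(200_000):
    gamma_n = 1.0 / (n + 1)           # standard decreasing gain
    omega_n = rng.standard_normal()   # martingale-difference noise
    theta += gamma_n * ((mu - theta) + omega_n)
print(theta)  # close to mu = 2.0
```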
References:
Some examples of the SA algorithm:
a. Average of an i.i.d. sequence: Let $(Z_n)_{n \ge 0}$ be an i.i.d. sequence with mean $\mu = E(Z_0)$ and finite variance. We wish to estimate the mean $\mu$.
The iterative algorithm
\[ \theta_{n+1} = \theta_n + \frac{1}{n+1} [Z_n - \theta_n] \]
gives
\[ \theta_n = \frac{1}{n} \sum_{k=0}^{n-1} Z_k \longrightarrow \mu \quad \text{(w.p. 1), by the SLLN.} \]
This is a SA iteration, with $\gamma_n = \frac{1}{n+1}$ and $Y_n = Z_n - \theta_n$. Writing $Z_n = \mu + \omega_n$ ($Z_n$ is considered a noisy measurement of $\mu$, with zero-mean noise $\omega_n$), we can identify $h(\theta) = \mu - \theta$.
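The identity between this recursion and the empirical mean can be checked numerically (a sketch; the data distribution and initial guess are illustrative):

```python
import numpy as np

# The SA recursion theta_{n+1} = theta_n + (1/(n+1)) * (Z_n - theta_n)
# reproduces the empirical mean exactly: after n steps, theta_n equals
# the average of Z_0, ..., Z_{n-1}, regardless of theta_0 (whose weight
# vanishes at the first step).
rng = np.random.default_rng(2)
Z = rng.exponential(scale=3.0, size=5000)   # samples with mean 3.0
theta = 10.0                                # arbitrary initial guess
for n, z in enumerate(Z):
    theta += (z - theta) / (n + 1)
print(theta, Z.mean())  # the two agree to machine precision
```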
\[ \theta_{n+1} = \theta_n + \gamma_n [\nabla f(\theta_n) + \omega_n] , \]
where the gradient is estimated by finite differences:
\[ \frac{\partial f(\theta)}{\partial \theta_i} \approx \frac{f(\theta + e_i \delta_i) - f(\theta - e_i \delta_i)}{2 \delta_i} , \]
where $e_i$ is the $i$-th unit vector. This scheme is known as the Kiefer-Wolfowitz Procedure.
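A one-dimensional sketch of the procedure, minimizing a quadratic from noisy function evaluations; the gain and difference-width sequences below are illustrative choices, not prescribed by the text:

```python
import numpy as np

# Kiefer-Wolfowitz in one dimension: minimize f(theta) = (theta - 1)**2
# using only noisy evaluations of f and a two-sided finite difference.
rng = np.random.default_rng(3)
f = lambda th: (th - 1.0) ** 2

def noisy_f(th):
    return f(th) + 0.1 * rng.standard_normal()  # noisy measurement of f

theta = 5.0
for n in range(1, 20_001):
    gamma_n = 1.0 / n          # gain sequence (illustrative)
    delta_n = n ** (-1.0 / 4)  # finite-difference width (illustrative)
    # two-sided finite-difference estimate of f'(theta)
    grad = (noisy_f(theta + delta_n) - noisy_f(theta - delta_n)) / (2 * delta_n)
    theta -= gamma_n * grad    # descend the estimated gradient
print(theta)  # near the minimizer 1.0
```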
Some variants of the SA algorithm
This is the form used in the Bertsekas & Tsitsiklis (1996) monograph.
Note that in the average estimation problem (example a above) we get $H(\theta) = \mu$, hence $Z_n = H(\theta_n) + \omega_n$.
where each component of $\theta$ is updated with a different gain sequence $\{\gamma_n(i)\}$. These gain sequences are typically required to be of comparable magnitude.
Moreover, the gain sequences may be allowed to be stochastic, namely to depend on the entire history of the process up to the time of update. For example, in the TD(0) algorithm $\theta$ corresponds to the estimated value function $\hat V = (\hat V(s),\ s \in S)$, and we can define $\gamma_n(s) = 1/N_n(s)$, where $N_n(s)$ is the number of visits to state $s$ up to time $n$.
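A sketch of such state-dependent stochastic gains (the target means and the visit process below are illustrative, not the TD(0) algorithm itself):

```python
import numpy as np

# Per-component gains gamma_n(s) = 1/N_n(s): each component V[s] is
# updated only when "state" s is visited, with gain equal to the
# reciprocal of its own visit count.  Each V[s] then tracks the
# empirical average of the noisy targets it has seen.
rng = np.random.default_rng(4)
n_states = 3
target_mean = np.array([1.0, -2.0, 0.5])   # illustrative targets
V = np.zeros(n_states)        # estimated values, one per state
N = np.zeros(n_states)        # visit counters N_n(s)
for _ in range(60_000):
    s = rng.integers(n_states)                    # visited state
    y = target_mean[s] + rng.standard_normal()    # noisy target sample
    N[s] += 1
    V[s] += (y - V[s]) / N[s]                     # gain 1/N_n(s)
print(V)  # each entry near its target mean
```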
Projections: It is often known that the required parameter $\theta$ lies in some set $B \subset \mathbb{R}^d$. In that case we could use the projected iterates:
\[ \theta_{n+1} = \mathrm{Proj}_B [\theta_n + \gamma_n Y_n] \]
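For a box-shaped $B$, the projection is a coordinatewise clip; a sketch with illustrative $h$, noise, and box (one target coordinate is placed outside $B$ on purpose, so the iterate pins to the boundary):

```python
import numpy as np

# Projected SA: theta_{n+1} = Proj_B[theta_n + gamma_n * Y_n], with
# B = [lo, hi]^2 and Proj_B a coordinatewise clip.  h(theta) = mu - theta;
# note mu[1] = 2.0 lies outside B, so that coordinate sticks to hi.
rng = np.random.default_rng(5)
lo, hi = -1.0, 1.0
mu = np.array([0.5, 2.0])
theta = np.zeros(2)
for n in range(50_000):
    Y = (mu - theta) + 0.5 * rng.standard_normal(2)   # h(theta) + noise
    theta = np.clip(theta + Y / (n + 1), lo, hi)      # Proj_B[...]
print(theta)  # stays in B; second coordinate pinned near the boundary 1.0
```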
5.3 Assumptions
Gain assumptions
To obtain convergence, the gain sequence needs to decrease to zero. The following assumption is standard.
A common example is
\[ \gamma_n = \frac{1}{n^a} , \quad \text{with } \tfrac{1}{2} < a \le 1 . \]
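Assuming the standard summability conditions $\sum_n \gamma_n = \infty$ and $\sum_n \gamma_n^2 < \infty$ (the usual form of the gain assumption), the stated range of $a$ can be verified directly:

```latex
\gamma_n = \frac{1}{n^a} : \qquad
\sum_{n=1}^{\infty} \frac{1}{n^{a}} = \infty \iff a \le 1 ,
\qquad
\sum_{n=1}^{\infty} \frac{1}{n^{2a}} < \infty \iff 2a > 1 ,
```

so exactly the range $\tfrac{1}{2} < a \le 1$ satisfies both conditions.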
Noise Assumptions
Let
\[ \mathcal{F}_{n-1} = \sigma\{\theta_0, \gamma_0, \omega_0, \ldots, \omega_{n-1};\ \theta_1, \ldots, \theta_n\} \]
denote the ($\sigma$-algebra generated by) the history sequence up to step $n$. Note that $\theta_n$ is measurable on $\mathcal{F}_{n-1}$ by definition of the latter.
Assumption N1
(a) The noise sequence $\{\omega_n\}$ is a martingale difference sequence relative to the filtration $\{\mathcal{F}_n\}$, namely
\[ E(\omega_n | \mathcal{F}_{n-1}) = 0 \quad \text{(a.s.).} \]
Example: Let $\omega_n \sim N(0, \sigma_n^2)$, where $\sigma_n$ may depend on $\theta_n$, namely $\sigma_n = f(\theta_n)$. Formally,
\[ E(\omega_n | \mathcal{F}_{n-1}) = 0 , \qquad E(\omega_n^2 | \mathcal{F}_{n-1}) = f(\theta_n)^2 . \]
The bounded-variance requirement then holds provided $f(\theta)^2 \le C$ for some $C < \infty$. It then follows by the martingale difference convergence theorem that
\[ \lim_{n\to\infty} \frac{1}{n} \sum_{k=1}^{n} \omega_k = 0 \quad \text{(a.s.).} \]
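A simulation sketch of this example (the function $f$ and the $\theta$-dynamics below are illustrative; $f$ is bounded, as the variance condition requires):

```python
import numpy as np

# State-dependent Gaussian noise: omega_n ~ N(0, f(theta_n)^2) with a
# bounded standard deviation f(theta) <= 2.  The running average
# (1/n) * sum_k omega_k should tend to 0, as the martingale-difference
# convergence result asserts.
rng = np.random.default_rng(6)
f = lambda th: 1.0 + np.abs(np.sin(th))   # bounded: f(theta) <= 2
theta, total = 0.0, 0.0
n_steps = 100_000
for n in range(1, n_steps + 1):
    omega = f(theta) * rng.standard_normal()   # E(omega | past) = 0
    total += omega
    theta += ((1.0 - theta) + omega) / n       # some SA dynamics
print(abs(total / n_steps))  # running average of the noise: near 0
```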
Markov Noise: The SA algorithm may converge under more general noise assumptions, which are sometimes useful. For example, for each fixed $\theta$, $\omega_n$ may be a Markov chain such that its long-term average is zero (but $E(\omega_n | \mathcal{F}_{n-1}) \neq 0$). We shall not go into that generality here.
5.4 The ODE Method
The asymptotic behavior of the SA algorithm is closely related to the solutions of a certain
ODE (Ordinary Differential Equation), namely
\[ \frac{d}{dt} \theta(t) = h(\theta(t)) , \quad \text{or} \quad \dot\theta = h(\theta) . \]
Define $t_n = \sum_{k=0}^{n-1} \gamma_k$ (with $t_0 = 0$), set
\[ \theta(t_n) = \theta_n , \]
and use linear interpolation in between the $t_n$'s.
[Figure: the piecewise-linear interpolation $\theta(t)$, passing through the iterates $\theta_0, \theta_1, \theta_2, \theta_3$ at the times $t_0, t_1, t_2, t_3$.]
Now:
\[ \theta(t + \Delta t) = \theta(t) + \sum_{k \in K(t, \Delta t)} \gamma_k [h(\theta_k) + \omega_k] , \]
where $K(t, \Delta t) = \{k : t \le t_k < t + \Delta t\}$.
For $t$ large, $\gamma_k$ becomes small and the summation is over many terms; thus the noise term is approximately averaged out: $\sum_{k \in K(t, \Delta t)} \gamma_k \omega_k \approx 0$.
We thus obtain:
\[ \theta(t + \Delta t) \simeq \theta(t) + \Delta t \cdot h(\theta(t)) . \]
For $\Delta t \to 0$, this reduces to the ODE:
\[ \dot\theta(t) = h(\theta(t)) . \]
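The ODE picture can be illustrated numerically: for $h(\theta) = \mu - \theta$ the ODE solves to $\theta(t) = \mu + (\theta_0 - \mu)e^{-t}$, and the interpolated iterates end up close to this solution on the time scale $t_n = \sum_{k<n} \gamma_k$ (a sketch; all constants are illustrative):

```python
import numpy as np

# SA iterates for h(theta) = mu - theta, tracked on the ODE time scale
# t_n = gamma_1 + ... + gamma_n.  The exact ODE solution is
# theta(t) = mu + (theta_0 - mu) * exp(-t); both it and the iterates
# approach mu for large t.
rng = np.random.default_rng(7)
mu, theta0 = 1.0, 5.0
theta, t = theta0, 0.0
for n in range(1, 50_001):
    gamma = 1.0 / n
    theta += gamma * ((mu - theta) + 0.3 * rng.standard_normal())
    t += gamma
ode_value = mu + (theta0 - mu) * np.exp(-t)   # exact ODE solution at time t
print(theta, ode_value)  # both near mu for large t
```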
To conclude:
5.5 Some Convergence Results
Remarks:
1. More generally, even if the ODE is not globally stable, $\theta_n$ can be shown to converge to an invariant set of the ODE (e.g., a limit cycle).
2. Corresponding results exist for the asynchronous versions, under suitable assumptions
on the relative gains.
3. A major assumption in the last result is the boundedness of $(\theta_n)$. In general this assumption has to be verified independently. However, there exist several results that rely on further properties of $h$ to deduce boundedness, and hence convergence.

The following convergence result from B.&T. (1996) relies on contraction properties of $H$, and applies to the asynchronous case. It will directly apply to some of our learning algorithms. We start with a few definitions.
\[ \| H(\theta_1) - H(\theta_2) \| \le \alpha \, \| \theta_1 - \theta_2 \| \]
Recall that the max-norm is given by $\|\theta\|_\infty = \max_i |\theta(i)|$. The weighted max-norm, with a weight vector $w$, $w(i) > 0$, is given by
\[ \|\theta\|_w = \max_i \left\{ \frac{|\theta(i)|}{w(i)} \right\} . \]
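As a concrete sketch (with $w \equiv 1$, so the weighted max-norm reduces to the plain max-norm): an affine map $H(\theta) = r + \beta P \theta$, with $P$ a stochastic matrix and $0 < \beta < 1$, satisfies the contraction inequality with modulus $\beta$; the choices of $r$, $P$, $\beta$ below are illustrative:

```python
import numpy as np

# H(theta) = r + beta * P @ theta with a row-stochastic P is a
# beta-contraction in the max-norm, since ||P x||_inf <= ||x||_inf.
beta = 0.9
P = np.array([[0.5, 0.5],
              [0.2, 0.8]])          # rows are probability vectors
r = np.array([1.0, -1.0])
H = lambda th: r + beta * P @ th

max_norm = lambda x: np.max(np.abs(x))
rng = np.random.default_rng(8)
ok = True
for _ in range(1000):
    th1, th2 = rng.standard_normal(2), rng.standard_normal(2)
    ok &= max_norm(H(th1) - H(th2)) <= beta * max_norm(th1 - th2) + 1e-12
print(ok)  # the contraction inequality holds for every sampled pair
```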
Here we can no longer expect w.p. 1 convergence results. What can be expected is a statement of the form: for $\gamma$ small enough, we have for all $\epsilon > 0$