
Learning in Complex Systems Spring 2011

Lecture Notes Nahum Shimkin

5 The Stochastic Approximation Algorithm

5.1 Stochastic Processes: Some Basic Concepts

5.1.1 Random Variables and Random Sequences

Let $(\Omega, \mathcal{F}, P)$ be a probability space, namely:

- $\Omega$ is the sample space.

- $\mathcal{F}$ is the event space. Its elements are subsets of $\Omega$, and it is required to be a $\sigma$-algebra (it includes $\emptyset$ and $\Omega$; it includes all countable unions of its members; it includes all complements of its members).

- $P$ is the probability measure (assigns a probability in $[0,1]$ to each element of $\mathcal{F}$, with the usual properties: $P(\Omega) = 1$, countably additive).

A random variable (RV) $X$ on $(\Omega, \mathcal{F})$ is a function $X : \Omega \to \mathbb{R}$, with values $X(\omega)$. It is required to be measurable on $\mathcal{F}$, namely, all sets of the form $\{\omega : X(\omega) \le a\}$ are events in $\mathcal{F}$.

A vector-valued RV is a vector of RVs. Equivalently, it is a function $X : \Omega \to \mathbb{R}^d$, with a similar measurability requirement.

A random sequence, or a discrete-time stochastic process, is a sequence $(X_n)_{n \ge 0}$ of $\mathbb{R}^d$-valued RVs, which are all defined on the same probability space.

5.1.2 Convergence of Random Variables

A random sequence may converge to a random variable, say to X. There are several useful
notions of convergence:

1. Almost sure convergence (or: convergence with probability 1):
$$X_n \xrightarrow{a.s.} X \quad \text{if} \quad P\{\lim_{n\to\infty} X_n = X\} = 1.$$

2. Convergence in probability:
$$X_n \xrightarrow{p} X \quad \text{if} \quad \lim_{n\to\infty} P(|X_n - X| > \epsilon) = 0, \quad \forall \epsilon > 0.$$

3. Mean-squares convergence (convergence in $L^2$):
$$X_n \xrightarrow{L^2} X \quad \text{if} \quad E|X_n - X|^2 \to 0.$$

4. Convergence in distribution:
$$X_n \xrightarrow{Dist} X \ (\text{or } X_n \Rightarrow X) \quad \text{if} \quad Ef(X_n) \to Ef(X)$$
for every bounded and continuous function $f$.

The following relations hold:

a. Basic implications: (a.s. or $L^2$) $\implies$ p $\implies$ Dist.

b. Almost sure convergence is equivalent to
$$\lim_{n\to\infty} P\{\sup_{k \ge n} |X_k - X| > \epsilon\} = 0, \quad \forall \epsilon > 0.$$

c. A useful sufficient condition for a.s. convergence:
$$\sum_{n=0}^{\infty} P(|X_n - X| > \epsilon) < \infty.$$

5.1.3 Sigma-algebras and information

Sigma algebras (or $\sigma$-algebras) are part of the mathematical structure of probability theory. They also have a convenient interpretation as information sets, which we shall find useful.

Define $\mathcal{F}_X \triangleq \sigma\{X\}$, the $\sigma$-algebra generated by the RV $X$. This is the smallest $\sigma$-algebra that contains all sets of the form $\{X \le a\} \triangleq \{\omega \in \Omega : X(\omega) \le a\}$.

We can interpret $\sigma\{X\}$ as carrying all the information in $X$. Accordingly, we identify
$$E(Z|X) \triangleq E(Z|\mathcal{F}_X).$$

Also, "$Z$ is measurable on $\sigma\{X\}$" is equivalent to: $Z = f(X)$ (with the additional technical requirement that $f$ is a Borel measurable function).

We can similarly define $\mathcal{F}_n = \sigma\{X_1, \ldots, X_n\}$, etc. Thus,
$$E(Z|X_1, \ldots, X_n) \triangleq E(Z|\mathcal{F}_n).$$

Note that $\mathcal{F}_{n+1} \supseteq \mathcal{F}_n$: more RVs carry more information, making $\mathcal{F}_{n+1}$ finer, or more detailed.

5.1.4 Martingales

A sequence $(X_k, \mathcal{F}_k)_{k \ge 0}$ on a given probability space $(\Omega, \mathcal{F}, P)$ is a martingale if:

a. $(\mathcal{F}_k)$ is a filtration, namely an increasing sequence of $\sigma$-algebras in $\mathcal{F}$.

b. Each RV $X_k$ is $\mathcal{F}_k$-measurable.

c. $E(X_{k+1}|\mathcal{F}_k) = X_k$ ($P$-a.s.).

Note that property (a) is roughly equivalent to: $\mathcal{F}_k$ represents (the information in) some RVs $(Y_0, \ldots, Y_k)$; property (b) then means that $X_k$ is a function of $(Y_0, \ldots, Y_k)$.

A particular case is $\mathcal{F}_n = \sigma\{X_1, \ldots, X_n\}$ (a self-martingale).

The central property is (c), which says that the conditional mean of $X_{k+1}$ equals $X_k$. This is obviously stronger than $E(X_{k+1}) = E(X_k)$.

The definition sometimes also requires that $E|X_n| < \infty$; we shall assume that below.

Replacing (c) by $E(X_{k+1}|\mathcal{F}_k) \ge X_k$ gives a submartingale, while $E(X_{k+1}|\mathcal{F}_k) \le X_k$ corresponds to a supermartingale.

Examples:

a. The simplest example of a martingale is
$$X_k = \sum_{\ell=0}^{k} \epsilon_\ell,$$
with $\{\epsilon_k\}$ a sequence of 0-mean independent RVs, and $\mathcal{F}_k = \sigma(\epsilon_0, \ldots, \epsilon_k)$. (A simulation sketch of this example follows below.)

b. $X_k = E(X|\mathcal{F}_k)$, where $(\mathcal{F}_k)$ is a given filtration and $X$ a fixed RV.
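As an illustration of example (a), here is a minimal numpy sketch; the uniform noise distribution and the sample sizes are arbitrary choices, not part of the notes. It builds paths of $X_k = \sum_{\ell \le k} \epsilon_\ell$ and checks the implied property $E(X_{k+1}) = E(X_k)$ empirically:

```python
import numpy as np

rng = np.random.default_rng(0)

# Example (a): X_k = eps_0 + ... + eps_k with zero-mean i.i.d. increments.
# Each path is a martingale: E(X_{k+1} | F_k) = X_k, since the next
# increment has zero mean given the past.
n_paths, n_steps = 10_000, 100
eps = rng.uniform(-1.0, 1.0, size=(n_paths, n_steps))   # zero-mean noise
X = np.cumsum(eps, axis=1)

# Empirical check of the weaker consequence E(X_{k+1}) = E(X_k) = 0:
print(np.abs(X.mean(axis=0)).max())   # near 0 for every k
```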

Martingales play an important role in the convergence analysis of stochastic processes. We quote a few basic theorems (see, for example: A.N. Shiryaev, Probability, Springer, 1996).

Martingale Inequalities

Let $(X_k, \mathcal{F}_k)_{k \ge 0}$ be a martingale. Then for every $\lambda > 0$ and $p \ge 1$,
$$P\left(\max_{k \le n} |X_k| \ge \lambda\right) \le \frac{E|X_n|^p}{\lambda^p},$$
and for $p > 1$,
$$E\left[\left(\max_{k \le n} |X_k|\right)^p\right] \le \left(\frac{p}{p-1}\right)^p E(|X_n|^p).$$

Martingale Convergence Theorems

1. Convergence with bounded moments: Consider a martingale $(X_k, \mathcal{F}_k)_{k \ge 0}$. Assume that
$$E|X_k|^q \le C \quad \text{for some } C < \infty,\ q \ge 1 \text{ and all } k.$$
Then $\{X_k\}$ converges (a.s.) to a RV $X_\infty$ (which is finite w.p. 1).

2. Positive martingale convergence: If $(X_k, \mathcal{F}_k)$ is a positive martingale (namely $X_n \ge 0$), then $X_k$ converges (a.s.) to some RV $X_\infty$.

Martingale Difference Convergence

The sequence $(\epsilon_k, \mathcal{F}_k)$ is a martingale difference sequence if property (c) is replaced by $E(\epsilon_{k+1}|\mathcal{F}_k) = 0$. In this case we have:

3. Suppose that for some $0 < q \le 2$,
$$\sum_{k=1}^{\infty} \frac{1}{k^q}\, E(|\epsilon_k|^q \,|\, \mathcal{F}_{k-1}) < \infty \quad \text{(a.s.)}.$$
Then $\lim_{n\to\infty} \frac{1}{n} \sum_{k=1}^{n} \epsilon_k = 0$ (a.s.).

For example, the conclusion holds if the sequence $(\epsilon_k)$ is bounded, namely $|\epsilon_k| \le C$ for some $C > 0$ (independent of $k$).

Note:

- It is trivially seen that $(\epsilon_n \triangleq X_n - X_{n-1})$ is a martingale difference sequence if $(X_n)$ is a martingale.

- More generally, for any sequence $(Y_k)$ and filtration $(\mathcal{F}_k)$, where $Y_k$ is measurable on $\mathcal{F}_k$, the following is a martingale difference:
$$\epsilon_k \triangleq Y_k - E(Y_k|\mathcal{F}_{k-1}).$$

The conditions of the last theorem hold for this $\epsilon_k$ if either:
(i) $|Y_k| \le M$ for all $k$, for some constant $M < \infty$; or
(ii) more generally, $E(|Y_k|^q \,|\, \mathcal{F}_{k-1}) \le M$ (a.s.) for some $q > 1$ and a finite RV $M$.
In that case we have
$$\frac{1}{n}\sum_{k=1}^{n} \epsilon_k \equiv \frac{1}{n}\sum_{k=1}^{n} \left(Y_k - E(Y_k|\mathcal{F}_{k-1})\right) \to 0 \quad \text{(a.s.)}.$$
A numerical illustration of this averaging effect is sketched below.
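To make the averaging statement concrete, here is a small simulation sketch, assuming one particular bounded martingale difference construction (the choice of $g$ and the sign variables is illustrative, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(1)

# A bounded martingale difference sequence: eps_k = Z_k * g(eps_{k-1}),
# where Z_k = +-1 is i.i.d. and independent of the past, so that
# E(eps_k | F_{k-1}) = g(eps_{k-1}) * E(Z_k) = 0, and |eps_k| <= 2.
n = 200_000
Z = rng.choice([-1.0, 1.0], size=n)
eps = np.empty(n)
prev = 0.0
for k in range(n):
    eps[k] = Z[k] * (1.0 + np.cos(prev))   # g(x) = 1 + cos(x), in [0, 2]
    prev = eps[k]

# The running averages (1/n) * sum_k eps_k tend to 0 (a.s.):
avg = np.cumsum(eps) / np.arange(1, n + 1)
print(avg[[99, 9_999, n - 1]])  # shrinking toward 0
```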

5.2 The Basic SA Algorithm

The stochastic approximation (SA) algorithm essentially solves a system of (nonlinear) equations of the form
$$h(\theta) = 0$$
based on noisy measurements of $h(\theta)$.

More specifically, we consider a (continuous) function $h : \mathbb{R}^d \to \mathbb{R}^d$, with $d \ge 1$, which depends on a set of parameters $\theta \in \mathbb{R}^d$. Suppose that $h$ is unknown. However, for each $\theta$ we can measure $Y = h(\theta) + \omega$, where $\omega$ is some 0-mean noise. The classical SA algorithm (Robbins-Monro, 1951) is of the form
$$\theta_{n+1} = \theta_n + \alpha_n Y_n = \theta_n + \alpha_n [h(\theta_n) + \omega_n], \quad n \ge 0.$$
Here $\alpha_n$ is the algorithm's step size, or gain.

Obviously, with zero noise ($\omega_n \equiv 0$) the stationary points of the algorithm coincide with the solutions of $h(\theta) = 0$. Under appropriate conditions (on $\alpha_n$, $h$ and $\omega_n$) the algorithm can indeed be shown to converge to a solution of $h(\theta) = 0$.
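As a concrete illustration, the following is a minimal simulation sketch of the iteration above; the particular $h$, the Gaussian noise, and the gain $\alpha_n = 1/(n+1)$ are illustrative choices, not prescribed by the notes:

```python
import numpy as np

rng = np.random.default_rng(2)

def h(theta):
    # Illustrative h with a unique root at theta* = 2.0.
    return 2.0 - theta

theta = 0.0                          # theta_0
for n in range(100_000):
    alpha = 1.0 / (n + 1)            # gain alpha_n
    omega = rng.normal(0.0, 1.0)     # zero-mean measurement noise omega_n
    Y = h(theta) + omega             # noisy measurement Y_n = h(theta_n) + omega_n
    theta = theta + alpha * Y        # theta_{n+1} = theta_n + alpha_n * Y_n

print(theta)  # close to the root theta* = 2.0
```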

References:

- H. Kushner and G. Yin, Stochastic Approximation Algorithms and Applications, Springer, 1997.

- V. Borkar, Stochastic Approximation: A Dynamical Systems Viewpoint, Hindustan, 2008.

- J. Spall, Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control, Wiley, 2003.

Some examples of the SA algorithm:

a. Average of an i.i.d. sequence: Let $(Z_n)_{n \ge 0}$ be an i.i.d. sequence with mean $\mu = E(Z_0)$ and finite variance. We wish to estimate the mean. The iterative algorithm
$$\theta_{n+1} = \theta_n + \frac{1}{n+1}[Z_n - \theta_n]$$
gives
$$\theta_n = \frac{1}{n}\sum_{k=0}^{n-1} Z_k \to \mu \quad \text{(w.p. 1), by the SLLN.}$$
This is an SA iteration, with $\alpha_n = \frac{1}{n+1}$ and $Y_n = Z_n - \theta_n$. Writing $Z_n = \mu + \omega_n$ ($Z_n$ is considered a noisy measurement of $\mu$, with zero-mean noise $\omega_n$), we can identify $h(\theta) = \mu - \theta$.

b. Function minimization: Suppose we wish to minimize a (convex) function $f(\theta)$. Denoting $h(\theta) = -\nabla f(\theta) \triangleq -\frac{\partial f}{\partial \theta}$, we need to solve $h(\theta) = 0$. The basic iteration here is
$$\theta_{n+1} = \theta_n + \alpha_n[-\nabla f(\theta_n) + \omega_n].$$
This is a noisy gradient descent algorithm.

When $\nabla f$ is not computable, it may be approximated by finite differences of the form
$$\frac{\partial f(\theta)}{\partial \theta_i} \approx \frac{f(\theta + e_i \delta_i) - f(\theta - e_i \delta_i)}{2\delta_i},$$
where $e_i$ is the $i$-th unit vector and $\delta_i > 0$. This scheme is known as the Kiefer-Wolfowitz procedure (a simulation sketch follows).
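Here is a sketch of the Kiefer-Wolfowitz procedure on a noisy quadratic; the objective, the gain sequence, and the difference widths $\delta_i$ are illustrative assumptions, not part of the notes:

```python
import numpy as np

rng = np.random.default_rng(3)

def f_noisy(theta):
    # Noisy evaluation of f(theta) = ||theta - c||^2, minimized at c.
    c = np.array([1.0, -2.0])
    return np.sum((theta - c) ** 2) + rng.normal(0.0, 0.1)

d = 2
theta = np.zeros(d)
for n in range(1, 20_000):
    alpha = 1.0 / n               # gain sequence
    delta = 1.0 / n ** 0.25       # finite-difference width, shrinking slowly
    grad = np.empty(d)
    for i in range(d):
        e = np.eye(d)[i]          # i-th unit vector
        # Two-sided finite-difference estimate of the i-th partial derivative:
        grad[i] = (f_noisy(theta + delta * e)
                   - f_noisy(theta - delta * e)) / (2 * delta)
    theta = theta - alpha * grad  # noisy gradient descent step

print(theta)  # close to the minimizer (1, -2)
```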

Some variants of the SA algorithm

A fixed-point formulation: Let $h(\theta) = H(\theta) - \theta$. Then $h(\theta) = 0$ is equivalent to the fixed-point equation $H(\theta) = \theta$, and the algorithm is
$$\theta_{n+1} = \theta_n + \alpha_n[H(\theta_n) - \theta_n + \omega_n] = (1 - \alpha_n)\theta_n + \alpha_n[H(\theta_n) + \omega_n].$$
This is the form used in the Bertsekas & Tsitsiklis (1996) monograph. Note that in the average estimation problem (example a above) we get $H(\theta) = \mu$, hence $Z_n = H(\theta_n) + \omega_n$.

Asynchronous updates: Different components of $\theta$ may be updated at different times and rates. A general form of the algorithm is:
$$\theta_{n+1}(i) = \theta_n(i) + \alpha_n(i)\, Y_n(i), \quad i = 1, \ldots, d,$$
where each component of $\theta$ is updated with a different gain sequence $\{\alpha_n(i)\}$. These gain sequences are typically required to be of comparable magnitude.

Moreover, the gain sequences may be allowed to be stochastic, namely depend on the entire history of the process up to the time of update. For example, in the TD(0) algorithm $\theta$ corresponds to the estimated value function $\hat{V} = (\hat{V}(s), s \in S)$, and we can define $\alpha_n(s) = 1/N_n(s)$, where $N_n(s)$ is the number of visits to state $s$ up to time $n$.

Projections: It is often known that the required parameter $\theta^*$ lies in some set $B \subset \mathbb{R}^d$. In that case we could use the projected iterates:
$$\theta_{n+1} = \text{Proj}_B[\theta_n + \alpha_n Y_n],$$
where $\text{Proj}_B$ is some projection onto $B$.

The simplest case is of course when $B$ is a box, so that the components of $\theta$ are simply truncated at their minimal and maximal values. If $B$ is a bounded set then the sequence of estimates $\{\theta_n\}$ is guaranteed to be bounded in this algorithm. This is very helpful for convergence analysis. (A sketch of the projected variant follows.)
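A minimal sketch of the projected variant with a box $B$; the box bounds and the scalar problem data are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)

lo, hi = -5.0, 5.0                    # the box B = [lo, hi]^d

def proj_box(theta):
    # Euclidean projection onto a box: component-wise truncation.
    return np.clip(theta, lo, hi)

def h(theta):
    return 1.5 - theta                # illustrative h, root theta* = 1.5 in B

theta = np.array([4.0])
for n in range(50_000):
    alpha = 1.0 / (n + 1)
    Y = h(theta) + rng.normal(0.0, 1.0, size=theta.shape)
    theta = proj_box(theta + alpha * Y)   # projected iterate stays in B

print(theta)  # close to 1.5, and bounded throughout by construction
```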

5.3 Assumptions

Gain assumptions

To obtain convergence, the gain sequence needs to decrease to zero. The following assump-
tion is standard.

Assumption G1: $\alpha_n \ge 0$, and

(i) $\sum_{n=1}^{\infty} \alpha_n = \infty$,

(ii) $\sum_{n=1}^{\infty} \alpha_n^2 < \infty$.

A common example is $\alpha_n = \frac{1}{n^a}$, with $\frac{1}{2} < a \le 1$.

Noise Assumptions

In general the noise sequence $\{\omega_n\}$ is required to be zero-mean, so that it will average out.

Since we want to allow dependence of $\omega_n$ on $\theta_n$, the sequence $\{\omega_n\}$ cannot be assumed independent. The assumption below allows $\{\omega_n\}$ to be a martingale difference sequence.

Let
$$\mathcal{F}_{n-1} = \sigma\{\theta_0, \alpha_0, \omega_0, \ldots, \omega_{n-1};\ \theta_n, \alpha_n\}$$
denote the ($\sigma$-algebra generated by the) history sequence up to step $n$. Note that $\omega_n$ is measurable on $\mathcal{F}_n$ by definition of the latter.

Assumption N1:

(a) The noise sequence $\{\omega_n\}$ is a martingale difference sequence relative to the filtration $\{\mathcal{F}_n\}$, namely
$$E(\omega_n|\mathcal{F}_{n-1}) = 0 \quad \text{(a.s.)}.$$

(b) For some finite constants $A, B$ and some norm $\|\cdot\|$ on $\mathbb{R}^d$,
$$E(\|\omega_n\|^2 \,|\, \mathcal{F}_{n-1}) \le A + B\|\theta_n\|^2 \quad \text{(a.s.)}, \quad \forall n \ge 1.$$

Example: Let $\omega_n \sim N(0, \sigma_n^2)$, where $\sigma_n$ may depend on $\theta_n$, namely $\sigma_n = f(\theta_n)$. Formally,
$$E(\omega_n|\mathcal{F}_{n-1}) = 0, \qquad E(\omega_n^2|\mathcal{F}_{n-1}) = f(\theta_n)^2,$$
and we require that $f(\theta)^2 \le A + B\theta^2$.
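A small sketch of such state-dependent noise, with an illustrative $f$ chosen so that the bound $f(\theta)^2 \le A + B\theta^2$ holds (the specific $f$, $A$, $B$ are assumptions, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(5)

A, B = 1.0, 1.0

def noise(theta):
    # omega | theta ~ N(0, f(theta)^2) with f(theta) = sqrt(1 + theta^2),
    # so E(omega | theta) = 0 (part (a) of N1) and
    # E(omega^2 | theta) = 1 + theta^2 = A + B * theta^2 (part (b)).
    f = np.sqrt(1.0 + theta ** 2)
    return f * rng.normal(0.0, 1.0)

theta = 3.0
samples = np.array([noise(theta) for _ in range(100_000)])
print(samples.mean())         # near 0: zero conditional mean
print((samples ** 2).mean())  # near 1 + theta^2 = 10: the N1(b) bound
```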

Note: When $\{\theta_n\}$ is known to be bounded, then (b) reduces to
$$E(\|\omega_n\|^2 \,|\, \mathcal{F}_{n-1}) \le C \quad \text{(a.s.)}, \quad \forall n,$$
for some $C < \infty$. It then follows by the martingale difference convergence theorem that
$$\lim_{n\to\infty} \frac{1}{n} \sum_{k=1}^{n} \omega_k = 0 \quad \text{(a.s.)}.$$
However, it is often the case that $\theta_n$ is not known to be bounded a priori.

Markov noise: The SA algorithm may converge under more general noise assumptions, which are sometimes useful. For example, for each fixed $\theta$, $\omega_n$ may be a Markov chain whose long-term average is zero (but $E(\omega_n|\mathcal{F}_{n-1}) \ne 0$). We shall not go into that generality here.

5.4 The ODE Method

The asymptotic behavior of the SA algorithm is closely related to the solutions of a certain ODE (ordinary differential equation), namely
$$\frac{d}{dt}\theta(t) = h(\theta(t)), \quad \text{or} \quad \dot{\theta} = h(\theta).$$

Given $\{\theta_n, \alpha_n\}$, we define a continuous-time process $\theta(t)$ as follows. Let
$$t_n = \sum_{k=0}^{n-1} \alpha_k.$$
Define $\theta(t_n) = \theta_n$, and use linear interpolation in between the $t_n$'s.

Thus, the time axis $t$ is rescaled according to the gains $\{\alpha_n\}$.


[Figure: the discrete iterates $\theta_0, \theta_1, \theta_2, \theta_3, \ldots$ plotted against the index $n$, and the interpolated process $\theta(t)$ plotted against the rescaled time axis $t$, with $\theta(t_n) = \theta_n$ at the times $t_0, t_1, t_2, t_3, \ldots$]

Note that over a fixed interval $\Delta t$, the total gain is approximately constant:
$$\sum_{k \in K(t, \Delta t)} \alpha_k \simeq \Delta t,$$
where $K(t, \Delta t) = \{k : t \le t_k < t + \Delta t\}$.

Now:
$$\theta(t + \Delta t) = \theta(t) + \sum_{k \in K(t, \Delta t)} \alpha_k [h(\theta_k) + \omega_k].$$

- For $t$ large, $\alpha_k$ becomes small and the summation is over many terms; thus the noise term is approximately averaged out: $\sum_k \alpha_k \omega_k \simeq 0$.

- For $\Delta t$ small, $\theta_k$ is approximately constant over $K(t, \Delta t)$: $h(\theta_k) \simeq h(\theta(t))$.

We thus obtain:
$$\theta(t + \Delta t) \simeq \theta(t) + \Delta t \cdot h(\theta(t)).$$
For $\Delta t \to 0$, this reduces to the ODE:
$$\dot{\theta}(t) = h(\theta(t)).$$

To conclude:

- As $n \to \infty$, we expect that the estimates $\{\theta_n\}$ will follow a trajectory of the ODE $\dot{\theta} = h(\theta)$ (under the above time normalization; see the numerical sketch below).

- Note that the stationary point(s) of the ODE are given by $\{\theta^* : h(\theta^*) = 0\}$.

- An obvious requirement for $\theta_n \to \theta^*$ is $\theta(t) \to \theta^*$ (for any $\theta(0)$). That is, $\theta^*$ is a globally asymptotically stable equilibrium of the ODE. This may thus be viewed as a necessary condition for convergence of $\theta_n$. It is also sufficient under additional assumptions on $h$ (continuity, smoothness), and boundedness of $\{\theta_n\}$.
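The time-rescaling picture can be checked numerically: run the SA iterates, accumulate the rescaled time $t_n = \sum_{k<n} \alpha_k$, and compare the endpoint with an Euler solution of $\dot{\theta} = h(\theta)$ over the same horizon. The linear $h$ and the noise level below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(6)

def h(theta):
    # Illustrative right-hand side: the ODE theta_dot = -theta has a
    # globally asymptotically stable equilibrium at theta* = 0.
    return -theta

# SA iterates, tracked on the rescaled time axis t_n = sum_{k<n} alpha_k.
theta_sa, t = 5.0, 0.0
for n in range(20_000):
    alpha = 1.0 / (n + 1)
    theta_sa += alpha * (h(theta_sa) + rng.normal(0.0, 0.5))
    t += alpha

# Euler integration of the ODE over the same (rescaled) horizon.
theta_ode, dt = 5.0, 1e-3
for _ in range(int(t / dt)):
    theta_ode += dt * h(theta_ode)

print(theta_sa, theta_ode)  # both end up near the equilibrium 0
```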

5.5 Some Convergence Results

A typical convergence result for the (synchronous) SA algorithm is the following:

Theorem 1. Assume G1, N1, and furthermore:

(i) $h$ is Lipschitz continuous.

(ii) The ODE $\dot{\theta} = h(\theta)$ has a unique equilibrium point $\theta^*$, which is globally asymptotically stable.

(iii) The sequence $(\theta_n)$ is bounded (with probability 1).

Then $\theta_n \to \theta^*$ (w.p. 1), for any initial conditions $\theta_0$.

Remarks:

1. More generally, even if the ODE is not globally stable, $\theta_n$ can be shown to converge to an invariant set of the ODE (e.g., a limit cycle).

2. Corresponding results exist for the asynchronous versions, under suitable assumptions on the relative gains.

3. A major assumption in the last result is the boundedness of $(\theta_n)$. In general this assumption has to be verified independently. However, there exist several results that rely on further properties of $h$ to deduce boundedness, and hence convergence.

The following convergence result from B.&T. (1996) relies on contraction properties of $H$, and applies to the asynchronous case. It will directly apply to some of our learning algorithms. We start with a few definitions.

Let $H(\theta) = h(\theta) + \theta$, so that $h(\theta) = H(\theta) - \theta$.

Recall that $H(\theta)$ is a contraction operator w.r.t. a norm $\|\cdot\|$ if
$$\|H(\theta_1) - H(\theta_2)\| \le \alpha \|\theta_1 - \theta_2\|$$
for some $\alpha < 1$ and all $\theta_1, \theta_2$.

$H(\theta)$ is a pseudo-contraction if the same holds for a fixed $\theta_2 = \theta^*$. It easily follows then that $\theta^*$ is the unique fixed point of $H$.

Recall that the max-norm is given by $\|\theta\|_\infty = \max_i |\theta(i)|$. The weighted max-norm, with a weight vector $w$, $w(i) > 0$, is given by
$$\|\theta\|_w = \max_i \left\{ \frac{|\theta(i)|}{w(i)} \right\}.$$

Theorem 2 (Prop. 4.4 in B.&T.). Let
$$\theta_{n+1}(i) = \theta_n(i) + \alpha_n(i)\,[H(\theta_n) - \theta_n + \omega_n]_i, \quad i = 1, \ldots, d.$$

Assume N1, and:

(a) Gain assumption: the $\alpha_n(i)$ are non-negative, measurable on the past, and satisfy
$$\sum_n \alpha_n(i) = \infty, \quad \sum_n \alpha_n(i)^2 < \infty \quad \text{(w.p. 1)}.$$

(b) $H$ is a pseudo-contraction w.r.t. some weighted max-norm.

Then $\theta_n \to \theta^*$ (w.p. 1), where $\theta^*$ is the unique fixed point of $H$.
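A sketch of the setting of Theorem 2: an asynchronous iteration where $H$ is affine and a contraction in the max-norm (in particular, a pseudo-contraction w.r.t. the weighted max-norm with $w \equiv 1$). The matrix, vector, and gain choices are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(7)

# H(theta) = A @ theta + b with max row sum ||A||_inf = 0.5 < 1, so H is
# a max-norm contraction; its unique fixed point solves (I - A) theta = b.
A = np.array([[0.2, 0.3],
              [0.1, 0.4]])
b = np.array([1.0, -1.0])
theta_star = np.linalg.solve(np.eye(2) - A, b)

theta = np.zeros(2)
visits = np.zeros(2, dtype=int)
for n in range(100_000):
    i = rng.integers(2)              # asynchronous: one component per step
    visits[i] += 1
    alpha = 1.0 / visits[i]          # component-wise gain sequence
    omega = rng.normal(0.0, 1.0)     # martingale difference noise
    theta[i] += alpha * ((A @ theta + b)[i] - theta[i] + omega)

print(theta, theta_star)  # the asynchronous iterates approach the fixed point
```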

Remark on Constant Gain Algorithms

As noted before, in practice it is often desirable to keep a non-diminishing gain. A typical case is $\alpha_n(i) \in [\underline{\alpha}, \bar{\alpha}]$.

Here we can no longer expect w.p. 1 convergence results. What can be expected is a statement of the form: for $\bar{\alpha}$ small enough, we have for all $\delta > 0$
$$\limsup_{n\to\infty} P(\|\theta_n - \theta^*\| > \delta) \le b(\delta)\,\bar{\alpha},$$
with $b(\delta) < \infty$.

This is related to convergence in probability, or weak convergence. We shall not give a detailed account here.
