Gluon Tutorials: Deep Learning - The Straight Dope

Deep Learning - The Straight Dope
Release 0.1
MXNet Community
Oct 12, 2018

CRASH COURSE
i
ii
Deep Learning - The Straight Dope, Release 0.1
This repo contains an incremental sequence of notebooks designed to teach deep learning, Apache MXNet
(incubating), and the gluon interface. Our goal is to leverage the strengths of Jupyter notebooks to present
prose, graphics, equations, and code together in one place. If we’re successful, the result will be a resource
that could be simultaneously a book, course material, a prop for live tutorials, and a resource for plagiarising
(with our blessing) useful code. To our knowledge there’s no source out there that teaches either (1) the full
breadth of concepts in modern deep learning or (2) interleaves an engaging textbook with runnable code.
We’ll find out by the end of this venture whether or not that void exists for a good reason.
Another unique aspect of this book is its authorship process. We are developing this resource fully in the
public view and are making it available for free in its entirety. While the book has a few primary authors
to set the tone and shape the content, we welcome contributions from the community and hope to coauthor
chapters and entire sections with experts and community members. Already we’ve received contributions
spanning typo corrections through full working examples.
CRASH COURSE 1
2 CRASH COURSE
CHAPTER
ONE
HOW TO CONTRIBUTE
To clone or contribute, visit Deep Learning - The Straight Dope on Github.
3
4 Chapter 1. How to contribute

CHAPTER
TWO
DEPENDENCIES
To run these notebooks, a recent version of MXNet is required. The easiest way is to install the nightly build
MXNet through pip. E.g.:
$ pip install mxnet --pre --user
More detailed instructions are available here
5
6 Chapter 2. Dependencies
CHAPTER
THREE
PART 1: DEEP LEARNING FUNDAMENTALS
3.1 Preface
If you’re a reasonable person, you might ask, “what is mxnet-the-straight-dope?” You might also ask, “why
does it have such an ostentatious name?” Speaking to the former question, mxnet-the-straight-dope is an
attempt to create a new kind of educational resource for deep learning. Our goal is to leverage the strengths
of Jupyter notebooks to present prose, graphics, equations, and (importantly) code together in one place. If
we’re successful, the result will be a resource that could be simultaneously a book, course material, a prop
for live tutorials, and a resource for plagiarising (with our blessing) useful code. To our knowledge, few
available resources aim to teach either (1) the full breadth of concepts in modern machine learning or (2)
interleave an engaging textbook with runnable code. We’ll find out by the end of this venture whether or not
that void exists for a good reason.
Regarding the name, we are cognizant that the machine learning community and the ecosystem in which we
operate have lurched into an absurd place. In the early 2000s, comparatively few tasks in machine learning
had been conquered, but we felt that we understood how and why those models worked (with some caveats).
By contrast, today’s machine learning systems are extremely powerful and actually work for a growing list
of tasks, but huge open questions remain as to precisely why they are so effective.
This new world offers enormous opportunity, but has also given rise to considerable buffoonery. Research
preprints like the arXiv are flooded by clickbait, AI startups have sometimes received overly optimistic
valuations, and the blogosphere is flooded with thought leadership pieces written by marketers bereft of any
technical knowledge. Amid the chaos, easy money, and lax standards, we believe it’s important not to take
our models or the environment in which they are worshipped too seriously. Also, in order to both explain,
visualize, and code the full breadth of models that we aim to address, it’s important that the authors do not
get bored while writing.
3.1.1 Organization
At present, we’re aiming for the following format: aside from a few (optional) notebooks providing a crash
course in the basic mathematical background, each subsequent notebook will both:
1. Introduce a reasonable number (perhaps one) of new concepts
2. Provide a single self-contained working example, using a real dataset
This presents an organizational challenge. Some models might logically be grouped together in a single
notebook. And some ideas might be best taught by executing several models in succession. On the other
hand, there’s a big advantage to adhering to a policy of 1 working example, 1 notebook: This makes it as
7
easy as possible for you to start your own research projects by plagiarising our code. Just copy a single
notebook and start modifying it.
We will interleave the runnable code with background material as needed. In general, we will often err on
the side of making tools available before explaining them fully (and we will follow up by explaining the
background later). For instance, we might use stochastic gradient descent before fully explaining why it is
useful or why it works. This helps to give practitioners the necessary ammunition to solve problems quickly,
at the expense of requiring the reader to trust us with some decisions, at least in the short term. Throughout,
we’ll be working with the MXNet library, which has the rare property of being flexible enough for research
while being fast enough for production. Our more advanced chapters will mostly rely on MXNet’s new high-
level imperative interface gluon. Note that this is not the same as mxnet.module, an older, symbolic
interface supported by MXNet.
This book will teach deep learning concepts from scratch. Sometimes, we’ll want to delve into fine details
about the models that are hidden from the user by gluon’s advanced features. This comes up especially
in the basic tutorials, where we’ll want you to understand everything that happens in a given layer. In
these cases, we’ll generally present two versions of the example: one where we implement everything from
scratch, relying only on NDArray and automatic differentiation, and another where we show how to do
things succinctly with gluon. Once we’ve taught you how a layer works, we can just use the gluon
version in subsequent tutorials.
3.1.2 Learning by doing

Many textbooks teach a series of topics, each in exhaustive detail. For example, Chris Bishop’s excellent
textbook, Pattern Recognition and Machine Learning, teaches each topic so thoroughly, that getting to the
chapter on linear regression requires a non-trivial amount of work. When I (Zack) was first learning machine
learning, this actually limited the book’s usefulness as an introductory text. When I rediscovered it a couple
years later, I loved it precisely for its thoroughness, and I hope you check it out after working through this
material! But perhaps the traditional textbook aproach is not the easiest way to get started in the first place.
Instead, in this book, we’ll teach most concepts just in time. For the fundamental preliminaries like linear
algebra and probability, we’ll provide a brief crash course from the outset, but we want you to taste the
satisfaction of training your first model before worrying about exotic probability distributions.
3.1.3 Next steps

If you’re ready to get started, head over to the introduction or go straight to our basic primer on NDArray,
MXNet’s workhorse data structure.
For whinges or inquiries, open an issue on GitHub.
3.2 Introduction
Before we could begin writing, the authors of this book, like much of the work force, had to become
caffeinated. We hopped in the car and started driving. Having an Android, Alex called out “Okay Google”,
awakening the phone’s voice recognition system. Then Mu commanded “directions to Blue Bottle coffee
shop”. The phone quickly displayed the transcription of his command. It also recognized that we were
asking for directions and launched the Maps application to fulfill our request. Once launched, the Maps app
identified a number of routes. Next to each route, the phone displayed a predicted transit time. While we
8 Chapter 3. Part 1: Deep Learning Fundamentals

fabricated this story for pedagogical convenience, it demonstrates that in the span of just a few seconds, our
everyday interactions with a smartphone can engage several machine learning models.
If you’ve never worked with machine learning before, you might be wondering what the hell we’re talking
about. You might ask, “isn’t that just programming?” or “what does machine learning even mean?” First, to
be clear, we implement all machine learning algorithms by writing computer programs. Indeed, we use the
same languages and hardware as other fields of computer science, but not all computer programs involve
machine learning. In response to the second question, precisely defining a field of study as vast as machine
learning is hard. It’s a bit like answering, “what is math?”. But we’ll try to give you enough intuition to get
started.
3.2.1 A motivating example

Most of the computer programs we interact with every day can be coded up from first principles. When you
add an item to your shopping cart, you trigger an e-commerce application to store an entry in a shopping
cart database table, associating your user ID with the product’s ID. We can write such a program from first
principles, launch without ever having seen a real customer. When it’s this easy to write an application you
should not be using machine learning.
Fortunately (for the community of ML scientists), however, for many problems, solutions aren’t so easy.
Returning to our fake story about going to get coffee, imagine just writing a program to respond to a wake
word like “Alexa”, “Okay, Google” or “Siri”. Try coding it up in a room by yourself with nothing but
a computer and a code editor. How would you write such a program from first principles? Think about
it. . . the problem is hard. Every second, the microphone will collect roughly 44,000 samples. What rule
could map reliably from a snippet of raw audio to confident predictions {yes, no} on whether the snippet
contains the wake word? If you’re stuck, don’t worry. We don’t know how to write such a program from
scratch either. That’s why we use machine learning.
Here’s the trick. Often, even when we don’t know how to tell a computer explicitly how to map from inputs
to outputs, we are nonetheless capable of performing the cognitive feat ourselves. In other words, even
if you don’t know how to program a computer to recognize the word “Alexa”, you yourself are able to
recognize the word “Alexa”. Armed with this ability, we can collect a huge data set containing examples of
audio and label those that do and that do not contain the wake word. In the machine learning approach, we
do not design a system explicitly to recognize wake words right away. Instead, we define a flexible program
with a number of parameters. These are knobs that we can tune to change the behavior of the program. We
call this program a model. Generally, our model is just a machine that transforms its input into some output.
In this case, the model receives as input a snippet of audio, and it generates as output an answer {yes,
3.2. Introduction 9
no}, which we hope reflects whether (or not) the snippet contains the wake word.
If we choose the right kind of model, then there should exist one setting of the knobs such that the model
fires yes every time it hears the word “Alexa”. There should also be another setting of the knobs that might
fire yes on the word “Apricot”. We expect that the same model should apply to “Alexa” recognition and
“Apricot” recognition because these are similar tasks. However, we might need a different model to deal
with fundamentally different inputs or outputs. For example, we might choose a different sort of machine to
map from images to captions, or from English sentences to Chinese sentences.
As you might guess, if we just set the knobs randomly, the model will probably recognize neither “Alexa”,
“Apricot”, nor any other English word. Generally, in deep learning, the learning refers precisely to updating
the model’s behavior (by twisting the knobs) over the course of a training period.
The training process usually looks like this:
1. Start off with a randomly initialized model that can’t do anything useful.
2. Grab some of your labeled data (e.g. audio snippets and corresponding {yes,no} labels)
3. Tweak the knobs so the model sucks less with respect to those examples
4. Repeat until the model is awesome.
To summarize, rather than code up a wake word recognizer, we code up a program that can learn to recognize
wake words, if we present it with a large labeled dataset. You can think of this act of determining a program’s
behavior by presenting it with a dataset as programming with data.
We can ‘program’ a cat detector by providing our machine learning system with many examples of cats and
dogs, such as the images below:

cat cat dog dog
This way the detector will eventually learn to emit a very large positive number if it’s a cat, a very large
negative number if it’s a dog, and something closer to zero if it isn’t sure, but this is just barely scratching
the surface of what machine learning can do.
3.2.2 The dizzying versatility of machine learning

This is the core idea behind machine learning: Rather than code programs with fixed behavior, we design
programs with the ability to improve as they acquire more experience. This basic idea can take many forms.
Machine learning can address many different application domains, involve many different types of models,
and update them according to many different learning algorithms. In this particular case, we described an
instance of supervised learning applied to a problem in automated speech recognition.
Machine Learning is a versatile set of tools that lets you work with data in many different situations where
simple rule-based systems would fail or might be very difficult to build. Due to its versatility, machine
learning can be quite confusing to newcomers. For example, machine learning techniques are already widely
used in applications as diverse as search engines, self driving cars, machine translation, medical diagnosis,
spam filtering, game playing (chess, go), face recognition, data matching, calculating insurance premiums,
and adding filters to photos.
Despite the superficial differences between these problems many of them share a common structure and are
addressable with deep learning tools. They’re mostly similar because they are problems where we wouldn’t
be able to program their behavior directly in code, but we can program them with data. Often times the most
direct language for communicating these kinds of programs is math. In this book, we’ll introduce a minimal
amount of mathematical notation, but unlike other books on machine learning and neural networks, we’ll
always keep the conversation grounded in real examples and real code.
3.2.3 Basics of machine learning

When we considered the task of recognizing wake-words, we put together a dataset consisting of snippets
and labels. We then described (albeit abstractly) how you might train a machine learning model to predict
the label given a snippet. This set-up, predicting labels from examples, is just one flavor of ML and it’s
called supervised learning. Even within deep learning, there are many other approaches, and we’ll discuss
each in subsequent sections. To get going with machine learning, we need four things:
1. Data
2. A model of how to transform the data
3. A loss function to measure how well we’re doing
4. An algorithm to tweak the model parameters such that the loss function is minimized
3.2. Introduction 11
Data
Generally, the more data we have, the easier our job becomes. When we have more data, we can train more
powerful models. Data is at the heart of the resurgence of deep learning and many of most exciting models
in deep learning don’t work without large data sets. Here are some examples of the kinds of data machine
learning practitioners often engage with:
• Images: Pictures taken by smartphones or harvested from the web, satellite images, photographs of
medical conditions, ultrasounds, and radiologic images like CT scans and MRIs, etc.
• Text: Emails, high school essays, tweets, news articles, doctor’s notes, books, and corpora of trans-
lated sentences, etc.
• Audio: Voice commands sent to smart devices like Amazon Echo, or iPhone or Android phones,
audio books, phone calls, music recordings, etc.
• Video: Television programs and movies, YouTube videos, cell phone footage, home surveillance,
multi-camera tracking, etc.
• Structured data: Webpages, electronic medical records, car rental records, electricity bills, etc.
Models
Usually the data looks quite different from what we want to accomplish with it. For example, we might have
photos of people and want to know whether they appear to be happy. We might desire a model capable of
ingesting a high-resolution image and outputting a happiness score. While some simple problems might be
addressable with simple models, we’re asking a lot in this case. To do its job, our happiness detector needs
to transform hundreds of thousands of low-level features (pixel values) into something quite abstract on the
other end (happiness scores). Choosing the right model is hard, and different models are better suited to
different datasets. In this book, we’ll be focusing mostly on deep neural networks. These models consist
of many successive transformations of the data that are chained together top to bottom, thus the name deep
learning. On our way to discussing deep nets, we’ll also discuss some simpler, shallower models.
Loss functions
To assess how well we’re doing we need to compare the output from the model with the truth. Loss functions
give us a way of measuring how bad our output is. For example, say we trained a model to infer a patient’s
heart rate from images. If the model predicted that a patient’s heart rate was 100bpm, when the ground truth
was actually 60bpm, we need a way to communicate to the model that it’s doing a lousy job.
Similarly if the model was assigning scores to emails indicating the probability that they are spam, we’d
need a way of telling the model when its predictions are bad. Typically the learning part of machine learning
consists of minimizing this loss function. Usually, models have many parameters. The best values of these
parameters is what we need to ‘learn’, typically by minimizing the loss incurred on a training data of
observed data. Unfortunately, doing well on the training data doesn’t guarantee that we will do well on
(unseen) test data, so we’ll want to keep track of two quantities.
• Training Error: This is the error on the dataset used to train our model by minimizing the loss on
the training set. This is equivalent to doing well on all the practice exams that a student might use
to prepare for the real exam. The results are encouraging, but by no means guarantee success on the
final exam.

• Test Error: This is the error incurred on an unseen test set. This can deviate quite a bit from the
training error. This condition, when a model fails to generalize to unseen data, is called overfitting. In
real-life terms, this is the equivalent of screwing up the real exam despite doing well on the practice
exams.
Optimization algorithms
Finally, to minimize the loss, we’ll need some way of taking the model and its loss functions, and searching
for a set of parameters that minimizes the loss. The most popular optimization algorithms for work on neural
networks follow an approach called gradient descent. In short, they look to see, for each parameter which
way the training set loss would move if you jiggled the parameter a little bit. They then update the parameter
in the direction that reduces the loss.
In the following sections, we will discuss a few types of machine learning in some more detail. We begin
with a list of objectives, i.e. a list of things that machine learning can do. Note that the objectives are
complemented with a set of techniques of how to accomplish them, i.e. training, types of data, etc. The list
below is really only sufficient to whet the readers’ appetite and to give us a common language when we talk
about problems. We will introduce a larger number of such problems as we go along.
3.2.4 Supervised learning

Supervised learning addresses the task of predicting targets given input data. The targets, also commonly
called labels are generally denoted y. The input data points, also commonly called examples or instances,
are typically denoted 𝑥. The goal is to produce a model 𝑓𝜃 that maps an input 𝑥 to a prediction 𝑓𝜃 (𝑥)
To ground this description in a concrete example, if we were working in healthcare, then we might want to
predict whether or not a patient would have a heart attack. This observation, heart attack or no heart attack,
would be our label 𝑦. The input data 𝑥 might be vital signs such as heart rate, diastolic and systolic blood
pressure, etc.
The supervision comes into play because for choosing the parameters 𝜃, we (the supervisors) provide the
model with a collection of labeled examples (𝑥𝑖 , 𝑦𝑖 ), where each example 𝑥𝑖 is matched up against it’s
correct label.
In probabilistic terms, we typically are interested estimating the conditional probability 𝑃 (𝑦|𝑥). While
it’s just one among several approaches to machine learning, supervised learning accounts for the majority
of machine learning in practice. Partly, that’s because many important tasks can be described crisply as
estimating the probability of some unknown given some available evidence:
• Predict cancer vs not cancer, given a CT image.
• Predict the correct translation in French, given a sentence in English.
• Predict the price of a stock next month based on this month’s financial reporting data.
Even with the simple description “predict targets from inputs” supervised learning can take a great many
forms and require a great many modeling decisions, depending on the type, size, and the number of inputs
and outputs. For example, we use different models to process sequences (like strings of text or time series
data) and for processing fixed-length vector representations. We’ll visit many of these problems in depth
throughout the first 9 parts of this book.
Put plainly, the learning process looks something like this. Grab a big pile of example inputs, selecting them
randomly. Acquire the ground truth labels for each. Together, these inputs and corresponding labels (the
desired outputs) comprise the training set. We feed the training dataset into a supervised learning algorithm.
So here the supervised learning algorithm is a function that takes as input a dataset, and outputs another
function, the learned model. Then, given a learned model, we can take a new previously unseen input, and
predict the corresponding label.
Regression
Perhaps the simplest supervised learning task to wrap your head around is Regression. Consider, for ex-
ample a set of data harvested from a database of home sales. We might construct a table, where each row
corresponds to a different house, and each column corresponds to some relevant attribute, such as the square
footage of a house, the number of bedrooms, the number of bathrooms, and the number of minutes (walking)
to the center of town. Formally, we call one row in this dataset a feature vector, and the object (e.g. a house)
it’s associated with an example.
If you live in New York or San Francisco, and you are not the CEO of Amazon, Google, Microsoft, or
Facebook, the (sq. footage, no. of bedrooms, no. of bathrooms, walking distance) feature vector for your
home might look something like: [100, 0, .5, 60]. However, if you live in Pittsburgh, it might look more like
[3000, 4, 3, 10]. Feature vectors like this are essential for all the classic machine learning problems. We’ll
typically denote the feature vector for any one example xi and the set of feature vectors for all our examples
𝑋.
What makes a problem regression is actually the outputs. Say that you’re in the market for a new home,
you might want to estimate the fair market value of a house, given some features like these. The target
value, the price of sale, is a real number. We denote any individual target 𝑦𝑖 (corresponding to example xi )
and the set of all targets y (corresponding to all examples X). When our targets take on arbitrary real values
in some range, we call this a regression problem. The goal of our model is to produce predictions (guesses
of the price, in our example) that closely approximate the actual target values.

We denote these predictions 𝑦ˆ𝑖 and if the notation seems unfamiliar, then just ignore it for now. We’ll
unpack it more thoroughly in the subsequent chapters.
Lots of practical problems are well-described regression problems. Predicting the rating that a user will
assign to a movie is a regression problem, and if you designed a great algorithm to accomplish this feat
in 2009, you might have won the $1 million Netflix prize. Predicting the length of stay for patients in the
hospital is also a regression problem. A good rule of thumb is that any How much? or How many? problem
should suggest regression. * “How many hours will this surgery take?”. . . regression * “How many dogs
are in this photo?” . . . regression. However, if you can easily pose your problem as “Is this a ___?”, then
it’s likely, classification, a different fundamental problem type that we’ll cover next.
Even if you’ve never worked with machine learning before, you’ve probably worked through a regression
problem informally. Imagine, for example, that you had your drains repaired and that your contractor, spent
𝑥1 = 3 hours removing gunk from your sewage pipes. Then she sent you a bill of 𝑦1 = $350. Now imagine
that your friend hired the same contractor for 𝑥2 = 2 hours and that she received a bill of 𝑦2 = $250. If
someone then asked you how much to expect on their upcoming gunk-removal invoice you might make
some reasonable assumptions, such as more hours worked costs more dollars. You might also assume that
there’s some base charge and that the contractor then charges per hour. If these assumptions held, then given
these two data points, you could already identify the contractor’s pricing structure: $100 per hour plus $50
to show up at your house. If you followed that much then you already understand the high-level idea behind
linear regression.
In this case, we could produce the parameters that exactly matched the contractor’s prices. Sometimes that’s
not possible, e.g., if some of the variance owes to some factors besides your two features. In these cases,
we’ll try to learn models that minimize the distance between our predictions and the observed∑︀ values. In most
of our chapters, we’ll focus on∑︀one of two very common losses, the L1 loss where 𝑙(𝑦, 𝑦 ′ ) = 𝑖 |𝑦𝑖 −𝑦𝑖′ | and
the L2 loss where 𝑙(𝑦, 𝑦 ′ ) = 𝑖 (𝑦𝑖 − 𝑦𝑖′ )2 . As we will see later, the 𝐿2 loss corresponds to the assumption
that our data was corrupted by Gaussian noise, whereas the 𝐿1 loss corresponds to an assumption of noise
from a Laplace distribution.
Classification
While regression models are great for addressing how many? questions, lots of problems don’t bend com-
fortably to this template. For example, a bank wants to add check scanning to their mobile app. This would
involve the customer snapping a photo of a check with their smartphone’s camera and the machine learning
model would need to be able to automatically understand text seen in the image. It would also need to
understand hand-written text to be even more robust. This kind of system is referred to as optical character
recognition (OCR), and the kind of problem it solves is called a classification. It’s treated with a distinct set
of algorithms than those that are used for regression.
In classification, we want to look at a feature vector, like the pixel values in an image, and then predict which
category (formally called classes), among some set of options, an example belongs. For hand-written digits,
we might have 10 classes, corresponding to the digits 0 through 9. The simplest form of classification is
when there are only two classes, a problem which we call binary classification. For example, our dataset 𝑋
could consist of images of animals and our labels 𝑌 might be the classes {cat, dog}. While in regression,
we sought a regressor to output a real value 𝑦ˆ, in classification, we seek a classifier, whose output 𝑦ˆ is the
predicted class assignment.
For reasons that we’ll get into as the book gets more technical, it’s pretty hard to optimize a model that can
only output a hard categorical assignment, e.g. either cat or dog. It’s a lot easier instead to express the model
in the language of probabilities. Given an example 𝑥, the model assigns a probability 𝑦ˆ𝑘 to each label 𝑘.
Because these are probabilities, they need to be positive numbers and add up to 1. This means that we only
need 𝐾 − 1 numbers to give the probabilities of 𝐾 categories. This is easy to see for binary classification. If
there’s a 0.6 (60%) probability that an unfair coin comes up heads, then there’s a 0.4 (40%) probability that
it comes up tails. Returning to our animal classification example, a classifier might see an image and output
the probability that the image is a cat Pr(𝑦 = cat | 𝑥) = 0.9. We can interpret this number by saying that
the classifier is 90% sure that the image depicts a cat. The magnitude of the probability for the predicted
class is one notion of confidence. It’s not the only notion of confidence and we’ll discuss different notions
of uncertainty in more advanced chapters.
When we have more than two possible classes, we call the problem multiclass classification. Common ex-
amples include hand-written character recognition [0, 1, 2, 3 ... 9, a, b, c, ...]. While
we attacked regression problems by trying to minimize the L1 or L2 loss functions, the common loss func-
tion for classification problems is called cross-entropy. In MXNet Gluon, the corresponding loss function
can be found here.
Note that the most likely class is not necessarily the one that you’re going to use for your decision. Assume
that you find this beautiful mushroom in your backyard:
Death cap - do not eat!
Now, assume that you built a classifier and trained it to predict if a mushroom is poisonous based on a
photograph. Say our poison-detection classifier outputs Pr(𝑦 = deathcap | image) = 0.2. In other words,
the classifier is 80% confident that our mushroom is not a death cap. Still, you’d have to be a fool to eat it.
That’s because the certain benefit of a delicious dinner isn’t worth a 20% chance of dying from it. In other
words, the effect of the uncertain risk by far outweighs the benefit. Let’s look at this in math. Basically,
we need to compute the expected risk that we incur, i.e. we need to multiply the probability of the outcome
with the benefit (or harm) associated with it:
𝐿(action | 𝑥) = E𝑦∼𝑝(𝑦|𝑥) [loss(action, 𝑦)]
Hence, the loss 𝐿 incurred by eating the mushroom is 𝐿(𝑎 = eat | 𝑥) = 0.2 * ∞ + 0.8 * 0 = ∞, whereas
the cost of discarding it is 𝐿(𝑎 = discard | 𝑥) = 0.2 * 0 + 0.8 * 1 = 0.8.
We got lucky: as any mycologist would tell us, the above actually is a death cap. Classification can get much
more complicated than just binary, multiclass, of even multi-label classification. For instance, there are some

variants of classification for addressing hierarchies. Hierarchies assume that there exist some relationships
among the many classes. So not all errors are equal - we prefer to misclassify to a related class than to a
distant class. Usually, this is referred to as hierarchical classification. One early example is due to Linnaeus,
who organized the animals in a hierarchy..
In the case of animal classification, it might not be so bad to mistake a poodle for a schnauzer, but our
model would pay a huge penalty if it confused a poodle for a dinosaur. What hierarchy is relevant might
depend on how you plan to use the model. For example, rattle snakes and garter snakes might be close on
the phylogenetic tree, but mistaking a rattler for a garter could be deadly.
Tagging
Some classification problems don’t fit neatly into the binary or multiclass classification setups. For example,
we could train a normal binary classifier to distinguish cats from dogs. Given the current state of computer
vision, we can do this easily, with off-the-shelf tools. Nonetheless, no matter how accurate our model gets,
we might find ourselves in trouble when the classifier encounters an image like this:
As you can see, there’s a cat in the picture. There is also a dog, a tire, some grass, a door, concrete, rust,
individual grass leaves, etc. Depending on what we want to do with our model ultimately, treating this as a
binary classification problem might not make a lot of sense. Instead, we might want to give the model the
option of saying the image depicts a cat and a dog, or neither a cat nor a dog.
The problem of learning to predict classes that are not mutually exclusive is called multi-label classifica-
tion. Auto-tagging problems are typically best described as multi-label classification problems. Think of

the tags people might apply to posts on a tech blog, e.g., “machine learning”, “technology”, “gadgets”, “pro-
gramming languages”, “linux”, “cloud computing”, “AWS”. A typical article might have 5-10 tags applied
because these concepts are correlated. Posts about “cloud computing” are likely to mention “AWS” and
posts about “machine learning” could also deal with “programming languages”.
We also have to deal with this kind of problem when dealing with the biomedical literature, where correctly
tagging articles is important because it allows researchers to do exhaustive reviews of the literature. At the
National Library of Medicine, a number of professional annotators go over each article that gets indexed
in PubMed to associate each with the relevant terms from MeSH, a collection of roughly 28k tags. This is
a time-consuming process and the annotators typically have a one year lag between archiving and tagging.
Machine learning can be used here to provide provisional tags until each article can have a proper manual
review. Indeed, for several years, the BioASQ organization has hosted a competition to do precisely this.
Search and ranking

Sometimes we don’t just want to assign each example to a bucket or to a real value. In the field of information
retrieval, we want to impose a ranking on a set of items. Take web search for example, the goal is less to
determine whether a particular page is relevant for a query, but rather, which one of the plethora of search
results should be displayed for the user. We really care about the ordering of the relevant search results and
our learning algorithm needs to produce ordered subsets of elements from a larger set. In other words, if we
are asked to produce the first 5 letters from the alphabet, there is a difference between returning A B C D
E and C A B E D. Even if the result set is the same, the ordering within the set matters nonetheless.
One possible solution to this problem is to score every element in the set of possible sets along with a
corresponding relevance score and then to retrieve the top-rated elements. PageRank is an early example of
such a relevance score. One of the peculiarities is that it didn’t depend on the actual query. Instead, it simply
helped to order the results that contained the query terms. Nowadays search engines use machine learning
and behavioral models to obtain query-dependent relevance scores. There are entire conferences devoted to
this subject.
Recommender systems
Recommender systems are another problem setting that is related to search and ranking. The problems
are similar insofar as the goal is to display a set of relevant items to the user. The main difference is the
emphasis on personalization to specific users in the context of recommender systems. For instance, for
movie recommendations, the results page for a SciFi fan and the results page for a connoisseur of Woody
Allen comedies might differ significantly.
Such problems occur, e.g. for movie, product or music recommendation. In some cases, customers will
provide explicit details about how much they liked the product (e.g. Amazon product reviews). In some
other cases, they might simply provide feedback if they are dissatisfied with the result (skipping titles on a
playlist). Generally, such systems strive to estimate some score 𝑦𝑖𝑗 , such as an estimated rating or probability
of purchase, given a user 𝑢𝑖 and product 𝑝𝑗 .
Given such a model, then for any given user, we could retrieve the set of objects with the largest scores 𝑦𝑖𝑗
are then used as a recommendation. Production systems are considerably more advanced and take detailed
user activity and item characteristics into account when computing such scores. The following image is an
example of deep learning books recommended by Amazon based on personalization algorithms tuned to the
author’s preferences.
Sequence Learning
So far we’ve looked at problems where we have some fixed number of inputs and produce a fixed number of
outputs. Before we considered predicting home prices from a fixed set of features: square footage, number
of bedrooms, number of bathrooms, walking time to downtown. We also discussed mapping from an image
(of fixed dimension), to the predicted probabilities that it belongs to each of a fixed number of classes, or
taking a user ID and a product ID, and predicting a star rating. In these cases, once we feed our fixed-length
input into the model to generate an output, the model immediately forgets what it just saw.
This might be fine if our inputs truly all have the same dimensions and if successive inputs truly have nothing
to do with each other. But how would we deal with video snippets? In this case, each snippet might consist
of a different number of frames. And our guess of what’s going on in each frame might be much stronger if
we take into account the previous or succeeding frames. Same goes for language. One popular deep learning
problem is machine translation: the task of ingesting sentences in some source language and predicting their
translation in another language.
These problems also occur in medicine. We might want a model to monitor patients in the intensive care
unit and to fire off alerts if their risk of death in the next 24 hours exceeds some threshold. We definitely
wouldn’t want this model to throw away everything it knows about the patient history each hour, and just
make its predictions based on the most recent measurements.
These problems are among the more exciting applications of machine learning and they are instances of
sequence learning. They require a model to either ingest sequences of inputs or to emit sequences of
outputs (or both!). These latter problems are sometimes referred to as seq2seq problems. Language
translation is a seq2seq problem. Transcribing text from spoken speech is also a seq2seq problem.
While it is impossible to consider all types of sequence transformations, a number of special cases are worth

mentioning:
Tagging and Parsing
This involves annotating a text sequence with attributes. In other words, the number of inputs and outputs is
essentially the same. For instance, we might want to know where the verbs and subjects are. Alternatively,
we might want to know which words are the named entities. In general, the goal is to decompose and
annotate text based on structural and grammatical assumptions to get some annotation. This sounds more
complex than it actually is. Below is a very simple example of annotating a sentence with tags indicating
which words refer to named entities.
Tom
Ent
Automatic Speech Recognition
With speech recognition, the input sequence 𝑥 is the sound of a speaker, and the output 𝑦 is the textual
transcript of what the speaker said. The challenge is that there are many more audio frames (sound is
typically sampled at 8kHz or 16kHz) than text, i.e. there is no 1:1 correspondence between audio and text,
since thousands of samples correspond to a single spoken word. These are seq2seq problems where the
output is much shorter than the input.
----D----e----e-----p------- L----ea------r------ni-----ng---
Text to Speech
Text to Speech (TTS) is the inverse of speech recognition. In other words, the input 𝑥 is text and the output
𝑦 is an audio file. In this case, the output is much longer than the input. While it is easy for humans to
recognize a bad audio file, this isn’t quite so trivial for computers.
Machine Translation
Unlike the case of speech of recognition, where corresponding inputs and outputs occur in the same order
(after alignment), in machine translation, order inversion can be vital. In other words, while we are still
converting one sequence into another, neither the number of inputs and outputs nor the order of correspond-
ing data points are assumed to be the same. Consider the following illustrative example of the obnoxious
tendency of Germans (Alex writing here) to place the verbs at the end of sentences.
German Haben Sie sich schon dieses grossartige Lehrwerk angeschaut?

English Did you already check out this excellent tutorial?
Wrong alignmen t Did you yourself already this excellent tutorial looked-at?
A number of related problems exist. For instance, determining the order in which a user reads a webpage
is a two-dimensional layout analysis problem. Likewise, for dialogue problems, we need to take world-
knowledge and prior state into account. This is an active area of research.
3.2.5 Unsupervised learning

All the examples so far were related to Supervised Learning, i.e. situations where we feed the model a bunch
of examples and a bunch of corresponding target values. You could think of supervised learning as having
an extremely specialized job and an extremely anal boss. The boss stands over your shoulder and tells you
exactly what to do in every situation until you learn to map from situations to actions. Working for such a
boss sounds pretty lame. On the other hand, it’s easy to please this boss. You just recognize the pattern as
quickly as possible and imitate their actions.
In a completely opposite way, it could be frustrating to work for a boss who has no idea what they want you
to do. However, if you plan to be a data scientist, you had better get used to it. The boss might just hand you
a giant dump of data and tell you to do some data science with it! This sounds vague because it is. We call
this class of problems unsupervised learning, and the type and number of questions we could ask is limited
only by our creativity. We will address a number of unsupervised learning techniques in later chapters. To
whet your appetite for now, we describe a few of the questions you might ask:
• Can we find a small number of prototypes that accurately summarize the data? Given a set of pho-
tos, can we group them into landscape photos, pictures of dogs, babies, cats, mountain peaks, etc.?
Likewise, given a collection of users’ browsing activity, can we group them into users with similar
behavior? This problem is typically known as clustering.
• Can we find a small number of parameters that accurately capture the relevant properties of the data?
The trajectories of a ball are quite well described by velocity, diameter, and mass of the ball. Tailors
have developed a small number of parameters that describe human body shape fairly accurately for
the purpose of fitting clothes. These problems are referred to as subspace estimation problems. If
the dependence is linear, it is called principal component analysis.
• Is there a representation of (arbitrarily structured) objects in Euclidean space (i.e. the space of vectors
in R𝑛 ) such that symbolic properties can be well matched? This is called representation learning
and it is used to describe entities and their relations, such as Rome - Italy + France = Paris.
• Is there a description of the root causes of much of the data that we observe? For instance, if we
have demographic data about house prices, pollution, crime, location, education, salaries, etc., can
we discover how they are related simply based on empirical data? The field of directed graphical
models and causality deals with this.
• An important and exciting recent development is generative adversarial networks. They are basi-
cally a procedural way of synthesizing data. The underlying statistical mechanisms are tests to check
whether real and fake data are the same. We will devote a few notebooks to them.
3.2.6 Interacting with an environment

So far, we haven’t discussed where data actually comes from, or what actually happens when a machine
learning model generates an output. That’s because supervised learning and unsupervised learning do not
address these issues in a very sophisticated way. In either case, we grab a big pile of data up front, then
do our pattern recognition without ever interacting with the environment again. Because all of the learning

takes place after the algorithm is disconnected from the environment, this is called offline learning. For
supervised learning, the process looks like this:
This simplicity of offline learning has its charms. The upside is we can worry about pattern recognition
in isolation without these other problems to deal with, but the downside is that the problem formulation
is quite limiting. If you are more ambitious, or if you grew up reading Asimov’s Robot Series, then you
might imagine artificially intelligent bots capable not only of making predictions, but of taking actions in
the world. We want to think about intelligent agents, not just predictive models. That means we need to
think about choosing actions, not just making predictions. Moreover, unlike predictions, actions actually
impact the environment. If we want to train an intelligent agent, we must account for the way its actions
might impact the future observations of the agent.
Considering the interaction with an environment that opens a whole set of new modeling questions. Does
the environment:
• remember what we did previously?

• want to help us, e.g. a user reading text into a speech recognizer?
• want to beat us, i.e. an adversarial setting like spam filtering (against spammers) or playing a game
(vs an opponent)?
• not care (as in most cases)?
• have shifting dynamics (steady vs shifting over time)?
This last question raises the problem of covariate shift, (when training and test data are different). It’s a
problem that most of us have experienced when taking exams written by a lecturer, while the homeworks
were composed by his TAs. We’ll briefly describe reinforcement learning, and adversarial learning, two
settings that explicitly consider interaction with an environment.
Reinforcement learning
If you’re interested in using machine learning to develop an agent that interacts with an environment and
takes actions, then you’re probably going to wind up focusing on reinforcement learning (RL). This might
include applications to robotics, to dialogue systems, and even to developing AI for video games. Deep re-
inforcement learning (DRL), which applies deep neural networks to RL problems, has surged in popularity.
The breakthrough deep Q-network that beat humans at Atari games using only the visual input , and the
AlphaGo program that dethroned the world champion at the board game Go are two prominent examples.
Reinforcement learning gives a very general statement of a problem, in which an agent interacts with an
environment over a series of time steps. At each time step 𝑡, the agent receives some observation 𝑜𝑡 from
the environment, and must choose an action 𝑎𝑡 which is then transmitted back to the environment. Finally,
the agent receives a reward 𝑟𝑡 from the environment. The agent then receives a subseqeunt observation,
and chooses a subsequent action, and so on. The behavior of an RL agent is governed by a policy. In
short, a policy is just a function that maps from observations (of the environment) to actions. The goal of
reinforcement learning is to produce a good policy.
It’s hard to overstate the generality of the RL framework. For example, we can cast any supervised learning
problem as an RL problem. Say we had a classification problem. We could create an RL agent with one

action corresponding to each class. We could then create an environment which gave a reward that was
exactly equal to the loss function from the original supervised problem.
That being said, RL can also address many problems that supervised learning cannot. For example, in
supervised learning we always expect that the training input comes associated with the correct label. But in
RL, we don’t assume that for each observation, the environment tells us the optimal action. In general, we
just get some reward. Moreover, the environment may not even tell us which actions led to the reward.
Consider for example the game of chess. The only real reward signal comes at the end of the game when we
either win, which we might assign a reward of 1, or when we lose, which we could assign a reward of -1. So
reinforcement learners must deal with the credit assignment problem. The same goes for an employee who
gets a promotion on October 11. That promotion likely reflects a large number of well-chosen actions over
the previous year. Getting more promotions in the future requires figuring out what actions along the way
led to the promotion.
Reinforcement learners may also have to deal with the problem of partial observability. That is, the current
observation might not tell you everything about your current state. Say a cleaning robot found itself trapped
in one of many identical closets in a house. Inferring the precise location (and thus state) of the robot might
require considering its previous observerations before entering the closet.
Finally, at any given point, reinforcement learners might know of one good policy, but there might be many
other better policies that the agent has never tried. The reinforcement learner must constantly choose whether
to exploit the best currently-known strategy as a policy, or to explore the space of strategies, potentially
giving up some short-run reward in exchange for knowledge.
MDPs, bandits, and friends

The general reinforcement learning problem is a very general setting. Actions affect subsequent observa-
tions. Rewards are only observed corresponding to the chosen actions. The environment may be either fully
or partially observed. Accounting for all this complexity at once may ask too much of researchers. Moreover
not every practical problem exhibits all this complexity. As a result, researchers have studied a number of
special cases of reinforcement learning problems.
When the environment is fully observed, we call the RL problem a Markov Decision Process (MDP). When
the state does not depend on the previous actions, we call the problem a contextual bandit problem. When
there is no state, just a set of available actions with initially unknown rewards, this problem is the classic
multi-armed bandit problem.
3.2.7 When not to use machine learning

Let’s take a closer look at the idea of programming data by considering an interaction that Joel Grus reported
experiencing in a job interview. The interviewer asked him to code up Fizz Buzz. This is a children’s game
where the players count from 1 to 100 and will say ‘fizz’ whenever the number is divisible by 3, ‘buzz’
whenever it is divisible by 5, and ‘fizzbuzz’ whenever it satisfies both criteria. Otherwise, they will just state
the number. It looks like this:
1 2 fizz 4 buzz fizz 7 8 fizz buzz 11 fizz 13 14 fizzbuzz 16 ...
The conventional way to solve such a task is quite simple.

In [1]: res = []
for i in range(1, 101):
if i % 15 == 0:
res.append('fizzbuzz')
elif i % 3 == 0:
res.append('fizz')
elif i % 5 == 0:
res.append('buzz')
else:
res.append(str(i))
print(' '.join(res))
1 2 fizz 4 buzz fizz 7 8 fizz buzz 11 fizz 13 14 fizzbuzz 16 17 fizz 19 buzz fizz 22 23 fiz
This isn’t very exciting if you’re a good programmer. Joel proceeded to ‘implement’ this problem in Machine
Learning instead. For that to succeed, he needed a number of pieces:
• Data X [1, 2, 3, 4, ...] and labels Y ['fizz', 'buzz', 'fizzbuzz',
identity]
• Training data, i.e. examples of what the system is supposed to do. Such as [(2, 2), (6,
fizz), (15, fizzbuzz), (23, 23), (40, buzz)]
• Features that map the data into something that the computer can handle more easily, e.g. x -> [(x
% 3), (x % 5), (x % 15)]. This is optional but helps a lot if you have it.
Armed with this, Joel wrote a classifier in TensorFlow (code). The interviewer was nonplussed . . . and the
classifier didn’t have perfect accuracy.
Quite obviously, this is silly. Why would you go through the trouble of replacing a few lines of Python
with something much more complicated and error prone? However, there are many cases where a simple
Python script simply does not exist, yet a 3-year-old child will solve the problem perfectly. Fortunately, this
is precisely where machine learning comes to the rescue.
3.2.8 Conclusion
Machine Learning is vast. We cannot possibly cover it all. On the other hand, neural networks are simple
and only require elementary mathematics. So let’s get started.
3.2.9 Next
Manipulate data the MXNet way with NDArray
3.3 Manipulate data the MXNet way with ndarray

It’s impossible to get anything done if we can’t manipulate data. Generally, there are two important things
we need to do with: (i) acquire it! and (ii) process it once it’s inside the computer. There’s no point in
trying to acquire data if we don’t even know how to store it, so let’s get our hands dirty first by playing with
synthetic data.
We’ll start by introducing NDArrays, MXNet’s primary tool for storing and transforming data. If you’ve
worked with NumPy before, you’ll notice that NDArrays are, by design, similar to NumPy’s multi-
dimensional array. However, they confer a few key advantages. First, NDArrays support asynchronous

computation on CPU, GPU, and distributed cloud architectures. Second, they provide support for automatic
differentiation. These properties make NDArray an ideal library for machine learning, both for researchers
and engineers launching production systems.
3.3.1 Getting started

In this chapter, we’ll get you going with the basic functionality. Don’t worry if you don’t understand any
of the basic math, like element-wise operations or normal distributions. In the next two chapters we’ll take
another pass at NDArray, teaching you both the math you’ll need and how to realize it in code.
To get started, let’s import mxnet. We’ll also import ndarray from mxnet for convenience. We’ll make
a habit of setting a random seed so that you always get the same results that we do.
In [1]: import mxnet as mx
from mxnet import nd
mx.random.seed(1)
Next, let’s see how to create an NDArray, without any values initialized. Specifically, we’ll create a 2D array
(also called a matrix) with 3 rows and 4 columns.
In [2]: x = nd.empty((3, 4))
print(x)
[[ 0.00000000e+00 0.00000000e+00 2.26995938e-20 4.57734143e-41]

[ 1.38654559e-38 0.00000000e+00 1.07958838e-15 4.57720130e-41]
[ 6.48255647e-37 0.00000000e+00 4.70016266e-18 4.57734143e-41]]
<NDArray 3x4 @cpu(0)>
The empty method just grabs some memory and hands us back a matrix without setting the values of any of
its entries. This means that the entries can have any form of values, including very big ones! But typically,
we’ll want our matrices initialized. Commonly, we want a matrix of all zeros.
In [3]: x = nd.zeros((3, 5))
x
Out[3]:
[[ 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0.]]
Similarly, ndarray has a function to create a matrix of all ones.

In [4]: x = nd.ones((3, 4))
x
Out[4]:
[[ 1. 1. 1. 1.]
[ 1. 1. 1. 1.]
[ 1. 1. 1. 1.]]
Often, we’ll want to create arrays whose values are sampled randomly. This is especially common when we
intend to use the array as a parameter in a neural network. In this snippet, we initialize with values drawn
from a standard normal distribution with zero mean and unit variance.
3.3. Manipulate data the MXNet way with ndarray 27

In [5]: y = nd.random_normal(0, 1, shape=(3, 4))

y
Out[5]:
[[ 0.11287736 -1.30644417 -0.10713575 -2.63099265]
[-0.05735848 0.31348416 -0.57651091 -1.11059952]
[ 0.57960719 -0.22899596 1.04484284 0.81243682]]
As in NumPy, the dimensions of each NDArray are accessible via the .shape attribute.
In [6]: y.shape
Out[6]: (3, 4)
We can also query its size, which is equal to the product of the components of the shape. Together with the
precision of the stored values, this tells us how much memory the array occupies.
In [7]: y.size
Out[7]: 12
3.3.2 Operations
NDArray supports a large number of standard mathematical operations. Such as element-wise addition:
In [8]: x + y
Out[8]:
[[ 1.11287737 -0.30644417 0.89286423 -1.63099265]
[ 0.9426415 1.31348419 0.42348909 -0.11059952]
[ 1.57960725 0.77100402 2.04484272 1.81243682]]
Multiplication:
In [9]: x * y
Out[9]:
[[ 0.11287736 -1.30644417 -0.10713575 -2.63099265]
[-0.05735848 0.31348416 -0.57651091 -1.11059952]
[ 0.57960719 -0.22899596 1.04484284 0.81243682]]
And exponentiation:
In [10]: nd.exp(y)
Out[10]:
[[ 1.11949468 0.27078119 0.8984037 0.07200695]
[ 0.94425553 1.36818385 0.56185532 0.32936144]
[ 1.78533697 0.79533172 2.84295177 2.25339246]]
We can also grab a matrix’s transpose to compute a proper matrix-matrix product.

In [11]: nd.dot(x, y.T)
Out[11]:
[[-3.93169522 -1.43098474 2.20789099]
[-3.93169522 -1.43098474 2.20789099]

[-3.93169522 -1.43098474 2.20789099]]

We’ll explain these operations and present even more operators in the linear algebra chapter. But for now,
we’ll stick with the mechanics of working with NDArrays.
3.3.3 In-place operations

In the previous example, every time we ran an operation, we allocated new memory to host its results. For
example, if we write y = x + y, we will dereference the matrix that y used to point to and instead point it
at the newly allocated memory. In the following example we demonstrate this with Python’s id() function,
which gives us the exact address of the referenced object in memory. After running y = y + x, we’ll find
that id(y) points to a different location. That’s because Python first evaluates y + x, allocating new
memory for the result and then subsequently redirects y to point at this new location in memory.
In [12]: print('id(y):', id(y))
y = y + x
print('id(y):', id(y))
id(y): 140291459787296
id(y): 140295515324600
This might be undesirable for two reasons. First, we don’t want to run around allocating memory unnec-
essarily all the time. In machine learning, we might have hundreds of megabytes of paramaters and update
all of them multiple times per second. Typically, we’ll want to perform these updates in place. Second, we
might point at the same parameters from multiple variables. If we don’t update in place, this could cause a
memory leak, and could cause us to inadvertently reference stale parameters.
Fortunately, performing in-place operations in MXNet is easy. We can assign the result of an operation to a
previously allocated array with slice notation, e.g., y[:] = <expression>.
In [13]: print('id(y):', id(y))
y[:] = x + y
print('id(y):', id(y))
id(y): 140295515324600
id(y): 140295515324600
While this syntacically nice, x+y here will still allocate a temporary buffer to store the result before copying
it to y[:]. To make even better use of memory, we can directly invoke the underlying ndarray operation,
in this case elemwise_add, avoiding temporary buffers. We do this by specifying the out keyword
argument, which every ndarray operator supports:
In [15]: nd.elemwise_add(x, y, out=y)
Out[15]:
[[ 3.11287737 1.69355583 2.89286423 0.36900735]
[ 2.9426415 3.31348419 2.42348909 1.88940048]
[ 3.57960725 2.77100396 4.04484272 3.81243682]]
If we’re not planning to re-use x, then we can assign the result to x itself. There are two ways to do this in
MXNet. 1. By using slice notation x[:] = x op y 2. By using the op-equals operators like +=
In [16]: print('id(x):', id(x))
x += y

x
print('id(x):', id(x))
id(x): 140291459564992
id(x): 140291459564992
3.3.4 Slicing
MXNet NDArrays support slicing in all the ridiculous ways you might imagine accessing your data. Here’s
an example of reading the second and third rows from x.
In [17]: x[1:3]
Out[17]:
[[ 3.9426415 4.31348419 3.42348909 2.88940048]
[ 4.57960701 3.77100396 5.04484272 4.81243706]]
Now let’s try writing to a specific element.

In [18]: x[1,2] = 9.0
x
Out[18]:
[[ 4.11287737 2.69355583 3.89286423 1.36900735]
[ 3.9426415 4.31348419 9. 2.88940048]
[ 4.57960701 3.77100396 5.04484272 4.81243706]]
Multi-dimensional slicing is also supported.

In [19]: x[1:2,1:3]
Out[19]:
[[ 4.31348419 9. ]]
In [20]: x[1:2,1:3] = 5.0
x
Out[20]:
[[ 4.11287737 2.69355583 3.89286423 1.36900735]
[ 3.9426415 5. 5. 2.88940048]
[ 4.57960701 3.77100396 5.04484272 4.81243706]]
3.3.5 Broadcasting
You might wonder, what happens if you add a vector y to a matrix X? These operations, where we compose
a low dimensional array y with a high-dimensional array X invoke a functionality called broadcasting. Here,
the low-dimensional array is duplicated along any axis with dimension 1 to match the shape of the high
dimensional array. Consider the following example.
In [21]: x = nd.ones(shape=(3,3))
print('x = ', x)
y = nd.arange(3)
print('y = ', y)
print('x + y = ', x + y)

x =
[[ 1. 1. 1.]
[ 1. 1. 1.]
[ 1. 1. 1.]]
y =
[ 0. 1. 2.]
<NDArray 3 @cpu(0)>
x + y =
[[ 1. 2. 3.]
[ 1. 2. 3.]
[ 1. 2. 3.]]
While y is initially of shape (3), MXNet infers its shape to be (1,3), and then broadcasts along the rows to
form a (3,3) matrix). You might wonder, why did MXNet choose to interpret y as a (1,3) matrix and not
(3,1). That’s because broadcasting prefers to duplicate along the left most axis. We can alter this behavior
by explicitly giving y a 2D shape.
In [22]: y = y.reshape((3,1))
print('y = ', y)
print('x + y = ', x+y)
y =
[[ 0.]
[ 1.]
[ 2.]]
x + y =
[[ 1. 1. 1.]
[ 2. 2. 2.]
[ 3. 3. 3.]]
3.3.6 Converting from MXNet NDArray to NumPy

Converting MXNet NDArrays to and from NumPy is easy. The converted arrays do not share memory.
In [23]: a = x.asnumpy()
type(a)
Out[23]: numpy.ndarray
In [24]: y = nd.array(a)
y
Out[24]:
[[ 1. 1. 1.]
[ 1. 1. 1.]
[ 1. 1. 1.]]
3.3.7 Managing context

You might have noticed that MXNet NDArray looks almost identical to NumPy. But there are a few cru-
cial differences. One of the key features that differentiates MXNet from NumPy is its support for diverse

hardware devices.
In MXNet, every array has a context. One context could be the CPU. Other contexts might be various GPUs.
Things can get even hairier when we deploy jobs across multiple servers. By assigning arrays to contexts
intelligently, we can minimize the time spent transferring data between devices. For example, when training
neural networks on a server with a GPU, we typically prefer for the model’s parameters to live on the GPU.
To start, let’s try initializing an array on the first GPU.
In [25]: z = nd.ones(shape=(3,3), ctx=mx.gpu(0))
z
Out[25]:
[[ 1. 1. 1.]
[ 1. 1. 1.]
[ 1. 1. 1.]]
<NDArray 3x3 @gpu(0)>
Given an NDArray on a given context, we can copy it to another context by using the copyto() method.
In [26]: x_gpu = x.copyto(mx.gpu(0))
print(x_gpu)
[[ 1. 1. 1.]
[ 1. 1. 1.]
[ 1. 1. 1.]]
The result of an operator will have the same context as the inputs.
In [27]: x_gpu + z
Out[27]:
[[ 2. 2. 2.]
[ 2. 2. 2.]
[ 2. 2. 2.]]
If we ever want to check the context of an NDArray programmaticaly, we can just call its .context
attribute.
In [28]: print(x_gpu.context)
print(z.context)
gpu(0)
gpu(0)
In order to perform an operation on two ndarrays x1 and x2, we need them both to live on the same context.
And if they don’t already, we may need to explicitly copy data from one context to another. You might think
that’s annoying. After all, we just demonstrated that MXNet knows where each NDArray lives. So why
can’t MXNet just automatically copy x1 to x2.context and then add them?
In short, people use MXNet to do machine learning because they expect it to be fast. But transferring
variables between different contexts is slow. So we want you to be 100% certain that you want to do
something slow before we let you do it. If MXNet just did the copy automatically without crashing then
you might not realize that you had written some slow code. We don’t want you to spend your entire life on
StackOverflow, so we make some mistakes impossible.

3.3.8 Watch out!

Imagine that your variable z already lives on your second GPU (gpu(0)). What happens if we call z.
copyto(gpu(0))? It will make a copy and allocate new memory, even though that variable already lives
on the desired device!
There are times where depending on the environment our code is running in, two variables may already live
on the same device. So we only want to make a copy if the variables currently lives on different contexts. In
these cases, we can call as_in_context(). If the variable is already the specified context then this is a
no-op.
In [29]: print('id(z):', id(z))
z = z.copyto(mx.gpu(0))
print('id(z):', id(z))
z = z.as_in_context(mx.gpu(0))
print('id(z):', id(z))
print(z)
id(z): 140291459785224
id(z): 140291460485072
id(z): 140291460485072
[[ 1. 1. 1.]
[ 1. 1. 1.]
[ 1. 1. 1.]]
3.3.9 Next
Linear algebra
3.4 Linear algebra

Now that you can store and manipulate data, let’s briefly review the subset of basic linear algebra that
you’ll need to understand most of the models. We’ll introduce all the basic concepts, the corresponding
3.4. Linear algebra 33

mathematical notation, and their realization in code all in one place. If you’re already confident in your
basic linear algebra, feel free to skim or skip this chapter.
In [2]: from mxnet import nd
3.4.1 Scalars
If you never studied linear algebra or machine learning, you’re probably used to working with one number
at a time. And know how to do basic things like add them together or multiply them. For example, in Palo
Alto, the temperature is 52 degrees Fahrenheit. Formally, we call these values 𝑠𝑐𝑎𝑙𝑎𝑟𝑠. If you wanted to
convert this value to Celsius (using metric system’s more sensible unit of temperature measurement), you’d
evaluate the expression 𝑐 = (𝑓 − 32) * 5/9 setting 𝑓 to 52. In this equation, each of the terms 32, 5, and 9
is a scalar value. The placeholders 𝑐 and 𝑓 that we use are called variables and they stand in for unknown
scalar values.
In mathematical notation, we represent scalars with ordinary lower cased letters (𝑥, 𝑦, 𝑧). We also denote
the space of all scalars as ℛ. For expedience, we’re going to punt a bit on what precisely a space is, but for
now, remember that if you want to say that 𝑥 is a scalar, you can simply say 𝑥 ∈ ℛ. The symbol ∈ can be
pronounced “in” and just denotes membership in a set.
In MXNet, we work with scalars by creating NDArrays with just one element. In this snippet, we instantiate
two scalars and perform some familiar arithmetic operations with them.
In [3]: ##########################
# Instantiate two scalars
##########################
x = nd.array([3.0])
y = nd.array([2.0])
##########################
# Add them
##########################
print('x + y = ', x + y)
##########################
# Multiply them
##########################
print('x * y = ', x * y)
##########################
# Divide x by y
##########################
print('x / y = ', x / y)
##########################
# Raise x to the power y.
##########################
print('x ** y = ', nd.power(x,y))
x + y =
[ 5.]
<NDArray 1 @cpu(0)>
x * y =
[ 6.]

<NDArray 1 @cpu(0)>
x / y =
[ 1.5]
<NDArray 1 @cpu(0)>
x ** y =
[ 9.]
<NDArray 1 @cpu(0)>
We can convert any NDArray to a Python float by calling its asscalar method
In [4]: x.asscalar()
Out[4]: 3.0
3.4.2 Vectors
You can think of a vector as simply a list of numbers, for example [1.0,3.0,4.0,2.0]. Each of the
numbers in the vector consists of a single scalar value. We call these values the entries or components of the
vector. Often, we’re interested in vectors whose values hold some real-world significance. For example, if
we’re studying the risk that loans default, we might associate each applicant with a vector whose components
correspond to their income, length of employment, number of previous defaults, etc. If we were studying
the risk of heart attack in hospital patients, we might represent each patient with a vector whose components
capture their most recent vital signs, cholesterol levels, minutes of exercise per day, etc. In math notation,
we’ll usually denote vectors as bold-faced, lower-cased letters (u, v, w). In MXNet, we work with vectors
via 1D NDArrays with an arbitrary number of components.
In [5]: u = nd.arange(4)
print('u = ', u)
u =
[ 0. 1. 2. 3.]
<NDArray 4 @cpu(0)>
We can refer to any element of a vector by using a subscript. For example, we can refer to the 4th element
of u by 𝑢4 . Note that the element 𝑢4 is a scalar, so we don’t bold-face the font when referring to it. In code,
we access any element 𝑖 by indexing into the NDArray.
In [6]: u[3]
Out[6]:
[ 3.]
<NDArray 1 @cpu(0)>
3.4.3 Length, dimensionality, and, shape

A vector is just an array of numbers. And just as every array has a length, so does every vector. In math
notation, if we want to say that a vector 𝑥 consists of 𝑛 real-valued scalars, we can express this as x ∈ ℛ𝑛 .
The length of a vector is commonly called its 𝑑𝑖𝑚𝑒𝑛𝑠𝑖𝑜𝑛. As with an ordinary Python array, we can access
the length of an NDArray by calling Python’s in-built len() function.
In [7]: len(u)
Out[7]: 4

We can also access a vector’s length via its .shape attribute. The shape is a tuple that lists the dimension-
ality of the NDArray along each of its axes. Because a vector can only be indexed along one axis, its shape
has just one element.
In [8]: u.shape
Out[8]: (4,)
Note that the word dimension is overloaded and this tends to confuse people. Some use the dimensionality
of a vector to refer to its length (the number of components). However some use the word dimensionality to
refer to the number of axes that an array has. In this sense, a scalar would have 0 dimensions and a vector
would have 1 dimension. To avoid confusion, when we say *2D* array or *3D* array, we mean an
array with 2 or 3 axes repespectively. But if we say *:math:‘n‘-dimensional* vector, we mean a vector
of length :math:‘n‘.
In [ ]: a = 2
x = nd.array([1,2,3])
y = nd.array([10,20,30])
print(a * x)
print(a * x + y)
3.4.4 Matrices
Just as vectors generalize scalars from order 0 to order 1, matrices generalize vectors from 1𝐷 to 2𝐷.
Matrices, which we’ll denote with capital letters (𝐴, 𝐵, 𝐶), are represented in code as arrays with 2 axes.
Visually, we can draw a matrix as a table, where each entry 𝑎𝑖𝑗 belongs to the 𝑖-th row and 𝑗-th column.
⎛ ⎞
𝑎11 𝑎12 · · · 𝑎1𝑚
⎜ 𝑎21 𝑎22 · · · 𝑎2𝑚 ⎟
𝐴=⎜ .
⎜ ⎟
.. .. .. ⎟
⎝ .. . . . ⎠
𝑎𝑛1 𝑎𝑛2 · · · 𝑎𝑛𝑚
We can create a matrix with 𝑛 rows and 𝑚 columns in MXNet by specifying a shape with two components
(n,m) when calling any of our favorite functions for instantiating an ndarray such as ones, or zeros.
In [10]: A = nd.zeros((5,4))
A
Out[10]:
[[ 0. 0. 0. 0.]
[ 0. 0. 0. 0.]
[ 0. 0. 0. 0.]
[ 0. 0. 0. 0.]
[ 0. 0. 0. 0.]]
We can also reshape any 1D array into a 2D ndarray by calling ndarray’s reshape method and passing in
the desired shape. Note that the product of shape components n * m must be equal to the length of the
original vector.
In [12]: x = nd.arange(20)
A = x.reshape((5, 4))
A

Out[12]:
[[ 0. 1. 2. 3.]
[ 4. 5. 6. 7.]
[ 8. 9. 10. 11.]
[ 12. 13. 14. 15.]
[ 16. 17. 18. 19.]]
Matrices are useful data structures: they allow us to organize data that has different modalities of variation.
For example, returning to the example of medical data, rows in our matrix might correspond to different
patients, while columns might correspond to different attributes.
We can access the scalar elements 𝑎𝑖𝑗 of a matrix 𝐴 by specifying the indices for the row (𝑖) and column (𝑗)
respectively. Let’s grab the element 𝑎2,3 from the random matrix we initialized above.
In [13]: print('A[2, 3] = ', A[2, 3])
A[2, 3] =
[ 11.]
<NDArray 1 @cpu(0)>
We can also grab the vectors corresponding to an entire row a𝑖,: or a column a:,𝑗 .
In [14]: print('row 2', A[2, :])
print('column 3', A[:, 3])
row 2
[ 8. 9. 10. 11.]
<NDArray 4 @cpu(0)>
column 3
[ 3. 7. 11. 15. 19.]
<NDArray 5 @cpu(0)>
We can transpose the matrix through T. That is, if 𝐵 = 𝐴𝑇 , then 𝑏𝑖𝑗 = 𝑎𝑗𝑖 for any 𝑖 and 𝑗.
In [15]: A.T
Out[15]:
[[ 0. 4. 8. 12. 16.]
[ 1. 5. 9. 13. 17.]
[ 2. 6. 10. 14. 18.]
[ 3. 7. 11. 15. 19.]]
3.4.5 Tensors
Just as vectors generalize scalars, and matrices generalize vectors, we can actually build data structures
with even more axes. Tensors give us a generic way of discussing arrays with an arbitrary number of axes.
Vectors, for example, are first-order tensors, and matrices are second-order tensors.
Using tensors will become more important when we start working with images, which arrive as 3D data
structures, with axes corresponding to the height, width, and the three (RGB) color channels. But in this
chapter, we’re going to skip past and make sure you know the basics.
In [16]: X = nd.arange(24).reshape((2, 3, 4))
print('X.shape =', X.shape)
print('X =', X)

X.shape = (2, 3, 4)
X =
[[[ 0. 1. 2. 3.]
[ 4. 5. 6. 7.]
[ 8. 9. 10. 11.]]
[[ 12. 13. 14. 15.]

[ 16. 17. 18. 19.]
[ 20. 21. 22. 23.]]]
<NDArray 2x3x4 @cpu(0)>
3.4.6 Element-wise operations

Oftentimes, we want to apply functions to arrays. Some of the simplest and most useful functions are
the element-wise functions. These operate by performing a single scalar operation on the corresponding
elements of two arrays. We can create an element-wise function from any function that maps from the
scalars to the scalars. In math notations we would denote such a function as 𝑓 : ℛ → ℛ. Given any two
vectors u and v of the same shape, and the function f, we can produce a vector c = 𝐹 (u, v) by setting
𝑐𝑖 ← 𝑓 (𝑢𝑖 , 𝑣𝑖 ) for all 𝑖. Here, we produced the vector-valued 𝐹 : ℛ𝑑 → ℛ𝑑 by lifting the scalar function
to an element-wise vector operation. In MXNet, the common standard arithmetic operators (+,-,/,*,**) have
all been lifted to element-wise operations for identically-shaped tensors of arbitrary shape.
In [17]: u = nd.array([1, 2, 4, 8])
v = nd.ones_like(u) * 2
print('v =', v)
print('u + v', u + v)
print('u - v', u - v)
print('u * v', u * v)
print('u / v', u / v)
v =
[ 2. 2. 2. 2.]
<NDArray 4 @cpu(0)>
u + v
[ 3. 4. 6. 10.]
<NDArray 4 @cpu(0)>
u - v
[-1. 0. 2. 6.]
<NDArray 4 @cpu(0)>
u * v
[ 2. 4. 8. 16.]
<NDArray 4 @cpu(0)>
u / v
[ 0.5 1. 2. 4. ]
<NDArray 4 @cpu(0)>
We can call element-wise operations on any two tensors of the same shape, including matrices.
In [18]: B = nd.ones_like(A) * 3
print('B =', B)
print('A + B =', A + B)
print('A * B =', A * B)
B =
[[ 3. 3. 3. 3.]

[ 3. 3. 3. 3.]
[ 3. 3. 3. 3.]
[ 3. 3. 3. 3.]
[ 3. 3. 3. 3.]]
A + B =
[[ 3. 4. 5. 6.]
[ 7. 8. 9. 10.]
[ 11. 12. 13. 14.]
[ 15. 16. 17. 18.]
[ 19. 20. 21. 22.]]
A * B =
[[ 0. 3. 6. 9.]
[ 12. 15. 18. 21.]
[ 24. 27. 30. 33.]
[ 36. 39. 42. 45.]
[ 48. 51. 54. 57.]]
3.4.7 Basic properties of tensor arithmetic

Scalars, vectors, matrices, and tensors of any order have some nice properties that we’ll often rely on. For
example, as you might have noticed from the definition of an element-wise operation, given operands with
the same shape, the result of any element-wise operation is a tensor of that same shape. Another convenient
property is that for all tensors, multiplication by a scalar produces a tensor of the same shape. In math, given
two tensors 𝑋 and 𝑌 with the same shape, 𝛼𝑋 + 𝑌 has the same shape. (numerical mathematicians call
this the AXPY operation).
In [19]: a = 2
x = nd.ones(3)
y = nd.zeros(3)
print(x.shape)
print(y.shape)
print((a * x).shape)
print((a * x + y).shape)
(3,)
(3,)
(3,)
(3,)
Shape is not the the only property preserved under addition and multiplication by a scalar. These operations
also preserve membership in a vector space. But we’ll postpone this discussion for the second half of this
chapter because it’s not critical to getting your first models up and running.
3.4.8 Sums and means

The next more sophisticated thing we can do with arbitrary
∑︀ tensors is to calculate the sum of their elements.
In mathematical notation, we express
∑︀ sums using the symbol. To express the sum of the elements in a
vector u of length 𝑑, we can write 𝑑𝑖=1 𝑢𝑖 . In code, we can just call nd.sum().
In [ ]: nd.sum(u)

We can similarly express sums over the elements ∑︀

of tensors
∑︀𝑛 of arbitrary shape. For example, the sum of the
elements of an 𝑚 × 𝑛 matrix 𝐴 could be written 𝑚 𝑖=1 𝑗=1 𝑎𝑖𝑗 .
In [ ]: nd.sum(A)
A related quantity is the mean, which is also called the average. We calculate the mean by dividing the sum
by∑︀
the total number of elements. With mathematical notation, we could write the average over a vector u as
1 𝑑 1 ∑︀𝑚 ∑︀𝑛
𝑑 𝑖=1 𝑢𝑖 and the average over a matrix 𝐴 as 𝑛·𝑚 𝑖=1 𝑗=1 𝑎𝑖𝑗 . In code, we could just call nd.mean()
on tensors of arbitrary shape:
In [ ]: print(nd.mean(A))
print(nd.sum(A) / A.size)
3.4.9 Dot products

One of the most fundamental operations is the dot product. Given two vectors u and v, the dot product u𝑇 v
𝑑
is a sum over the products of the corresponding elements: u𝑇 v = 𝑖=1 𝑢𝑖 · 𝑣𝑖 .
∑︀
In [ ]: nd.dot(u, v)
Note that we can express the dot product of two vectors nd.dot(u, v) equivalently by performing an
element-wise multiplication and then a sum:
In [ ]: nd.sum(u * v)
Dot products are useful in a wide range of contexts. For example, given a set of weights w, the weighted
sum of some ∑︀values 𝑢 could be expressed as the dot product u𝑇 w. When the weights are non-negative and
sum to one ( 𝑑𝑖=1 𝑤𝑖 = 1), the dot product expresses a weighted average. When two vectors each have
length one (we’ll discuss what length means below in the section on norms), dot products can also capture
the cosine of the angle between them.
3.4.10 Matrix-vector products

Now that we know how to calculate dot products we can begin to understand matrix-vector products. Let’s
start off by visualizing a matrix 𝐴 and a column vector x.
⎛ ⎞ ⎛ ⎞
𝑎11 𝑎12 · · · 𝑎1𝑚 𝑥1
⎜ 𝑎21 𝑎22 · · · 𝑎2𝑚 ⎟ ⎜ 𝑥2 ⎟
𝐴=⎜ . .. ⎟ , x = ⎜ .. ⎟
⎜ ⎟ ⎜ ⎟
. .. . .
⎝ . . . . ⎠ ⎝ . ⎠
𝑎𝑛1 𝑎𝑛2 · · · 𝑎𝑛𝑚 𝑥𝑚
We can visualize the matrix in terms of its row vectors

· · · a𝑇1
⎛ ⎞
...
⎜· · · a𝑇 · · ·⎟
2
𝐴=⎜ ⎟,
⎜ ⎟
..
⎝ . ⎠
· · · a𝑇𝑛 ···
where each a𝑇𝑖 ∈ R𝑚 is a row vector representing the 𝑖-th row of the matrix 𝐴.
Then the matrix vector product y = 𝐴x is simply a column vector y ∈ R𝑛 where each entry 𝑦𝑖 is the dot

product a𝑇𝑖 x.
· · · a𝑇1
⎛ ⎞⎛ ⎞ ⎛ 𝑇 ⎞
... 𝑥1 a1 x
⎜· · · a𝑇 · · ·⎟ ⎜ 𝑥2 ⎟ ⎜a𝑇2 x⎟
⎟ ⎜ ⎟ ⎜
2
𝐴x = ⎜ ⎟ ⎜ .. ⎟ = ⎜ .. ⎟
⎜ ⎟
..
⎝ . ⎠⎝ . ⎠ ⎝ . ⎠
· · · a𝑇𝑛 ··· 𝑥𝑚 a𝑇𝑛 x
So you can think of multiplication by a matrix 𝐴 ∈ R𝑚×𝑛 as a transformation that projects vectors from
R𝑚 to R𝑛 .
These transformations turn out to be quite useful. For example, we can represent rotations as multiplications
by a square matrix. As we’ll see in subsequent chapters, we can also use matrix-vector products to describe
the calculations of each layer in a neural network.
Expressing matrix-vector products in code with ndarray, we use the same nd.dot() function as for
dot products. When we call nd.dot(A, x) with a matrix A and a vector x, MXNet knows to perform a
matrix-vector product. Note that the column dimension of A must be the same as the dimension of x.
In [ ]: nd.dot(A, u)
3.4.11 Matrix-matrix multiplication

If you’ve gotten the hang of dot products and matrix-vector multiplication, then matrix-matrix multiplica-
tions should be pretty straightforward.
Say we have two matrices, 𝐴 ∈ R𝑛×𝑘 and 𝐵 ∈ R𝑘×𝑚 :
⎛ ⎞ ⎛ ⎞
𝑎11 𝑎12 · · · 𝑎1𝑘 𝑏11 𝑏12 · · · 𝑏1𝑚
⎜ 𝑎21 𝑎22 · · · 𝑎2𝑘 ⎟ ⎜𝑏21 𝑏22 · · · 𝑏2𝑚 ⎟
𝐴=⎜ . , 𝐵=⎜ .
⎜ ⎟ ⎜ ⎟
. .
. . . .
. ⎟ . .. . . .. ⎟
⎝ . . . . ⎠ ⎝ . . . . ⎠
𝑎𝑛1 𝑎𝑛2 · · · 𝑎𝑛𝑘 𝑏𝑘1 𝑏𝑘2 · · · 𝑏𝑘𝑚
To produce the matrix product 𝐶 = 𝐴𝐵, it’s easiest to think of 𝐴 in terms of its row vectors and 𝐵 in terms
of its column vectors:
· · · a𝑇1 ...
⎛ ⎞
.. .. ..
⎛ ⎞
⎜· · · a𝑇 · · ·⎟
2 ⎜ . . . ⎟
𝐴=⎜ , 𝐵 = b b · · · b 𝑚⎠ .
⎜ ⎟
.. ⎟ ⎝ 1
⎜ 2
⎟
⎝ . ⎠ .. .. ..
· · · a𝑇𝑛 · · · . . .
Note here that each row vector a𝑇𝑖 lies in R𝑘 and that each column vector b𝑗 also lies in R𝑘 .
Then to produce the matrix product 𝐶 ∈ R𝑛×𝑚 we simply compute each entry 𝑐𝑖𝑗 as the dot product a𝑇𝑖 b𝑗 .
· · · a𝑇1 ... ⎛ . ⎞ ⎛a𝑇 b a𝑇 b · · · a𝑇1 b𝑚

⎛ ⎞ ⎞
. .. .. 1 1 1 2
⎜· · · a𝑇 · · ·⎟ ⎜ . . . ⎟ ⎜a𝑇 b a𝑇 b · · · a𝑇2 b𝑚 ⎟
2 ⎜ 2 1 2 2
𝐶 = 𝐴𝐵 = ⎜ ⎟ ⎝b1 b2 · · · b𝑚 ⎟ =⎜ .
⎜ ⎟⎜ ⎟
.. . .. .. .. ⎟
⎝ . ⎠ . . .
⎠ ⎝ . . . . ⎠
𝑇
.. .. .. 𝑇 𝑇 𝑇
··· a 𝑛 ··· a b a b 𝑛 1 𝑛 2 · · · a𝑛 b𝑚
You can think of the matrix-matrix multiplication 𝐴𝐵 as simply performing 𝑚 matrix-vector products and
stitching the results together to form an 𝑛 × 𝑚 matrix. Just as with ordinary dot products and matrix-vector
products, we can compute matrix-matrix products in MXNet by using nd.dot().

In [20]: A = nd.ones(shape=(3, 4))

B = nd.ones(shape=(4, 5))
nd.dot(A, B)
Out[20]:
[[ 4. 4. 4. 4. 4.]
[ 4. 4. 4. 4. 4.]
[ 4. 4. 4. 4. 4.]]
3.4.12 Norms
Before we can start implementing models, there’s one last concept we’re going to introduce. Some of the
most useful operators in linear algebra are norms. Informally, they tell us how big a vector or matrix is. We
represent norms with the notation ‖ · ‖. The · in this expression is just a placeholder. For example, we would
represent the norm of a vector x or matrix 𝐴 as ‖x‖ or ‖𝐴‖, respectively.
All norms must satisfy a handful of properties: 1. ‖𝛼𝐴‖ = |𝛼|‖𝐴‖ 2. ‖𝐴 + 𝐵‖ ≤ ‖𝐴‖ + ‖𝐵‖ 3. ‖𝐴‖ ≥ 0
4. If ∀𝑖, 𝑗, 𝑎𝑖𝑗 = 0, then ‖𝐴‖ = 0
To put it in words, the first rule says that if we scale all the components of a matrix or vector by a constant
factor 𝛼, its norm also scales by the absolute value of the same constant factor. The second rule is the
familiar triangle inequality. The third rule simple says that the norm must be non-negative. That makes
sense, in most contexts the smallest size for anything is 0. The final rule basically says that the smallest
norm is achieved by a matrix or vector consisting of all zeros. It’s possible to define a norm that gives zero
norm to nonzero matrices, but you can’t give nonzero norm to zero matrices. That’s a mouthful, but if you
digest it then you probably have grepped the important concepts here.
If you remember Euclidean distances (think Pythagoras’ theorem) from grade school, then non-negativity
and the triangle inequality might ring a bell. You might notice that norms sound a lot like measures of
distance.
√︀
In fact, the Euclidean distance 𝑥21 + · · · + 𝑥2𝑛 is a norm. Specifically it’s the ℓ2 -norm. An analogous
√︁∑︀
2
computation, performed over the entries of a matrix, e.g. 𝑖,𝑗 𝑎𝑖𝑗 , is called the Frobenius norm. More
often, in machine learning we work with the squared ℓ2 norm (notated ℓ22 ). We also commonly work with
the ℓ1 norm. The ℓ1 norm is simply the sum of the absolute values. It has the convenient property of placing
less emphasis on outliers.
To calculate the ℓ2 norm, we can just call nd.norm().
In [ ]: nd.norm(u)
To calculate the L1-norm we can simply perform the absolute value and then sum over the elements.
In [ ]: nd.sum(nd.abs(u))
3.4.13 Norms and objectives

While we don’t want to get too far ahead of ourselves, we do want you to anticipate why these concepts
are useful. In machine learning we’re often trying to solve optimization problems: Maximize the probability
assigned to observed data. Minimize the distance between predictions and the ground-truth observations.
Assign vector representations to items (like words, products, or news articles) such that the distance between
similar items is minimized, and the distance between dissimilar items is maximized. Oftentimes, these

objectives, perhaps the most important component of a machine learning algorithm (besides the data itself),
are expressed as norms.
3.5 Intermediate linear algebra

If you’ve made it this far, and understand everything that we’ve covered, then honestly, you are ready to
begin modeling. If you’re feeling antsy, this is a perfectly reasonable place to move on. You already know
nearly all of the linear algebra required to implement a number of many practically useful models and you
can always circle back when you want to learn more.
But there’s a lot more to linear algebra, even as concerns machine learning. At some point, if you plan to
make a career of machine learning, you’ll need to know more than we’ve covered so far. In the rest of this
chapter, we introduce some useful, more advanced concepts.
3.5.1 Basic vector properties

Vectors are useful beyond being data structures to carry numbers. In addition to reading and writing values
to the components of a vector, and performing some useful mathematical operations, we can analyze vectors
in some interesting ways.
One important concept is the notion of a vector space. Here are the conditions that make a vector space:
• Additive axioms (we assume that x,y,z are all vectors): 𝑥 + 𝑦 = 𝑦 + 𝑥 and (𝑥 + 𝑦) + 𝑧 = 𝑥 + (𝑦 + 𝑧)
and 0 + 𝑥 = 𝑥 + 0 = 𝑥 and (−𝑥) + 𝑥 = 𝑥 + (−𝑥) = 0.
• Multiplicative axioms (we assume that x is a vector and a, b are scalars): 0 · 𝑥 = 0 and 1 · 𝑥 = 𝑥 and
(𝑎𝑏)𝑥 = 𝑎(𝑏𝑥).
• Distributive axioms (we assume that x and y are vectors and a, b are scalars): 𝑎(𝑥 + 𝑦) = 𝑎𝑥 + 𝑎𝑦
and (𝑎 + 𝑏)𝑥 = 𝑎𝑥 + 𝑏𝑥.
3.5.2 Special matrices

There are a number of special matrices that we will use throughout this tutorial. Let’s look at them in a bit
of detail:
• Symmetric Matrix These are matrices where the entries below and above the diagonal are the same.
In other words, we have that 𝑀 ⊤ = 𝑀 . An example of such matrices are those that describe pairwise
distances, i.e. 𝑀𝑖𝑗 = ‖𝑥𝑖 − 𝑥𝑗 ‖. Likewise, the Facebook friendship graph can be written as a
symmetric matrix where 𝑀𝑖𝑗 = 1 if 𝑖 and 𝑗 are friends and 𝑀𝑖𝑗 = 0 if they are not. Note that the
Twitter graph is asymmetric - 𝑀𝑖𝑗 = 1, i.e. 𝑖 following 𝑗 does not imply that 𝑀𝑗𝑖 = 1, i.e. 𝑗 following
𝑖.
• Antisymmetric Matrix These matrices satisfy 𝑀 ⊤ = −𝑀 . Note that any arbitrary matrix can
always be decomposed into a symmetric and into an antisymmetric matrix by using 𝑀 = 21 (𝑀 +
𝑀 ⊤ ) + 21 (𝑀 − 𝑀 ⊤ ).
• Diagonally Dominant Matrix These are matrices where the off-diagonal
∑︀ elements are ∑︀
small relative
to the main diagonal elements. In particular we have that 𝑀𝑖𝑖 ≥ 𝑗̸=𝑖 𝑀𝑖𝑗 and 𝑀𝑖𝑖 ≥ 𝑗̸=𝑖 𝑀𝑗𝑖 . If
a matrix has this property, we can often approximate 𝑀 by its diagonal. This is often expressed as
diag(𝑀 ).
3.5. Intermediate linear algebra 43

• Positive Definite Matrix These are matrices that have the nice property where 𝑥⊤ 𝑀 𝑥 > 0 whenever
𝑥 ̸= 0. Intuitively, they are a generalization of the squared norm of a vector ‖𝑥‖2 = 𝑥⊤ 𝑥. It is easy
to check that whenever 𝑀 = 𝐴⊤ 𝐴, this holds since there 𝑥⊤ 𝑀 𝑥 = 𝑥⊤ 𝐴⊤ 𝐴𝑥 = ‖𝐴𝑥‖2 . There is
a somewhat more profound theorem which states that all positive definite matrices can be written in
this form.
3.5.3 Conclusions
In just a few pages (or one Jupyter notebook) we’ve taught you all the linear algebra you’ll need to un-
derstand a good chunk of neural networks. Of course there’s a lot more to linear algebra. And a lot of
that math is useful for machine learning. For example, matrices can be decomposed into factors, and these
decompositions can reveal low-dimensional structure in real-world datasets. There are entire subfields of
machine learning that focus on using matrix decompositions and their generalizations to high-order tensors
to discover structure in datasets and solve prediction problems. But this book focuses on deep learning. And
we believe you’ll be much more inclined to learn more mathematics once you’ve gotten your hands dirty
deploying useful machine learning models on real datasets. So while we reserve the right to introduce more
math much later on, we’ll wrap up this chapter here.
If you’re eager to learn more about linear algebra, here are some of our favorite resources on the topic *
For a solid primer on basics, check out Gilbert Strang’s book Introduction to Linear Algebra * Zico Kolter’s
Linear Algebra Reivew and Reference
3.5.4 Next
Probability and statistics
3.6 Probability and statistics

In some form or another, machine learning is all about making predictions. We might want to predict the
probability of a patient suffering a heart attack in the next year, given their clinical history. In anomaly
detection, we might want to assess how likely a set of readings from an airplane’s jet engine would be, were
it operating normally. In reinforcement learning, we want an agent to act intelligently in an environment.
This means we need to think about the probability of getting a high reward under each of the available
action. And when we build recommender systems we also need to think about probability. For example,
if we hypothetically worked for a large online bookseller, we might want to estimate the probability that a
particular user would buy a particular book, if prompted. For this we need to use the language of probability
and statistics. Entire courses, majors, theses, careers, and even departments, are devoted to probability. So
our goal here isn’t to teach the whole subject. Instead we hope to get you off the ground, to teach you just
enough that you know everything necessary to start building your first machine learning models and to have
enough of a flavor for the subject that you can begin to explore it on your own if you wish.
We’ve talked a lot about probabilities so far without articulating what precisely they are or giving a concrete
example. Let’s get more serious by considering the problem of distinguishing cats and dogs based on
photographs. This might sound simpler but it’s actually a formidable challenge. To start with, the difficulty
of the problem may depend on the resolution of the image.

20px 40px 80px 160px 320px
While it’s easy for humans to recognize cats and dogs at 320 pixel resolution, it becomes challenging at 40
pixels and next to impossible at 20 pixels. In other words, our ability to tell cats and dogs apart at a large
distance (and thus low resolution) might approach uninformed guessing. Probability gives us a formal way
of reasoning about our level of certainty. If we are completely sure that the image depicts a cat, we say that
the probability that the corresponding label 𝑙 is cat, denoted 𝑃 (𝑙 = cat) equals 1.0. If we had no evidence
to suggest that 𝑙 = cat or that 𝑙 = dog, then we might say that the two possibilities were equally 𝑙𝑖𝑘𝑒𝑙𝑦
expressing this as 𝑃 (𝑙 = cat) = 0.5. If we were reasonably confident, but not sure that the image depicted
a cat, we might assign a probability .5 < 𝑃 (𝑙 = cat) < 1.0.
Now consider a second case: given some weather monitoring data, we want to predict the probability that
it will rain in Taipei tomorrow. If it’s summertime, the rain might come with probability .5 In both cases,
we have some value of interest. And in both cases we are uncertain about the outcome. But there’s a key
difference between the two cases. In this first case, the image is in fact either a dog or a cat, we just don’t
know which. In the second case, the outcome may actually be a random event, if you believe in such things
(and most physicists do). So probability is a flexible language for reasoning about our level of certainty, and
it can be applied effectively in a broad set of contexts.
3.6.1 Basic probability theory

Say that we cast a die and want to know what the chance is of seeing a 1 rather than another digit. If the die
is fair, all six outcomes 𝒳 = {1, . . . , 6} are equally likely to occur, hence we would see a 1 in 1 out of 6
3.6. Probability and statistics 45

cases. Formally we state that 1 occurs with probability 16 .

For a real die that we receive from a factory, we might not know those proportions and we would need to
check whether it is tainted. The only way to investigate the die is by casting it many times and recording the
outcomes. For each cast of the die, we’ll observe a value {1, 2, . . . , 6}. Given these outcomes, we want to
investigate the probability of observing each outcome.
One natural approach for each value is to take the individual count for that value and to divide it by the total
number of tosses. This gives us an estimate of the probability of a given event. The law of large numbers
tell us that as the number of tosses grows this estimate will draw closer and closer to the true underlying
probability. Before going into the details of what’s going here, let’s try it out.
To start, let’s import the necessary packages:
Next, we’ll want to be able to cast the die. In statistics we call this process of drawing examples from
probability distributions sampling. The distribution which assigns probabilities to a number of discrete
choices is called the multinomial distribution. We’ll give a more formal definition of distribution later, but
at a high level, think of it as just an assignment of probabilities to events. In MXNet, we can sample from
the multinomial distribution via the aptly named nd.sample_multinomial function. The function can
be called in many ways, but we’ll focus on the simplest. To draw a single sample, we simply give pass in a
vector of probabilities.
In [2]: probabilities = nd.ones(6) / 6
nd.sample_multinomial(probabilities)
Out[2]:
[3]
<NDArray 1 @cpu(0)>
If you run this line (nd.sample_multinomial(probabilities)) a bunch of times, you’ll find that
you get out random values each time. As with estimating the fairness of a die, we often want to generate
many samples from the same distribution. It would be really slow to do this with a Python for loop, so
sample_multinomial supports drawing multiple samples at once, returning an array of independent
samples in any shape we might desire.
In [3]: print(nd.sample_multinomial(probabilities, shape=(10)))
print(nd.sample_multinomial(probabilities, shape=(5,10)))
[3 4 5 3 5 3 5 2 3 3]
<NDArray 10 @cpu(0)>
[[2 2 1 5 0 5 1 2 2 4]
[4 3 2 3 2 5 5 0 2 0]
[3 0 2 4 5 4 0 5 5 5]
[2 4 4 2 3 4 4 0 4 3]
[3 0 3 5 4 3 0 2 2 1]]
Now that we know how to sample rolls of a die, we can simulate 1000 rolls.
In [4]: rolls = nd.sample_multinomial(probabilities, shape=(1000))

We can then go through and count, after each of the 1000 rolls, how many times each number was rolled.
In [5]: counts = nd.zeros((6,1000))
totals = nd.zeros(6)
for i, roll in enumerate(rolls):
totals[int(roll.asscalar())] += 1
counts[:, i] = totals
To start, we can inspect the final tally at the end of 1000 rolls.
In [6]: totals / 1000
Out[6]:
[ 0.167 0.168 0.175 0.15899999 0.15800001 0.17299999]
<NDArray 6 @cpu(0)>
As you can see, the lowest estimated probability for any of the numbers is about .15 and the highest estimated
probability is 0.188. Because we generated the data from a fair die, we know that each number actually has
probability of 1/6, roughly .167, so these estimates are pretty good. We can also visualize how these
probabilities converge over time towards reasonable estimates.
To start let’s take a look at the counts array which has shape (6, 1000). For each time step (out of
1000), counts, says how many times each of the numbers has shown up. So we can normalize each 𝑗-th
column of the counts vector by the number of tosses to give the current estimated probabilities at that
time. The counts object looks like this:
In [7]: counts
Out[7]:
[[ 0. 0. 0. ..., 165. 166. 167.]
[ 1. 1. 1. ..., 168. 168. 168.]
[ 0. 0. 0. ..., 175. 175. 175.]
[ 0. 0. 0. ..., 159. 159. 159.]
[ 0. 1. 2. ..., 158. 158. 158.]
[ 0. 0. 0. ..., 173. 173. 173.]]
Normalizing by the number of tosses, we get:

In [8]: x = nd.arange(1000).reshape((1,1000)) + 1
estimates = counts / x
print(estimates[:,0])
[ 0. 1. 0. 0. 0. 0.]
<NDArray 6 @cpu(0)>
[ 0. 0.5 0. 0. 0.5 0. ]
<NDArray 6 @cpu(0)>
[ 0.1980198 0.15841584 0.17821783 0.18811882 0.12871288 0.14851485]

<NDArray 6 @cpu(0)>
As you can see, after the first toss of the die, we get the extreme estimate that one of the numbers will be
rolled with probability 1.0 and that the others have probability 0. After 100 rolls, things already look a bit

more reasonable. We can visualize this convergence by using the plotting package matplotlib. If you
don’t have it installed, now would be a good time to install it.
In [9]: %matplotlib inline
from matplotlib import pyplot as plt

plt.plot(estimates[0, :].asnumpy(), label="Estimated P(die=1)")
plt.axhline(y=0.16666, color='black', linestyle='dashed')
plt.legend()
plt.show()
Each solid curve corresponds to one of the six values of the die and gives our estimated probability that
the die turns up that value as assessed after each of the 1000 turns. The dashed black line gives the true
underlying probability. As we get more data, the solid curves converge towards the true answer.
In our example of casting a die, we introduced the notion of a random variable. A random variable,
which we denote here as 𝑋 can be pretty much any quantity and is not determistic. Random variables
could take one value among a set of possibilites. We denote sets with brackets, e.g., {cat, dog, rabbit}.
The items contained in the set are called elements, and we can say that an element 𝑥 is in the set S, by
writing 𝑥 ∈ 𝑆. The symbol ∈ is read as “in” and denotes membership. For instance, we could truthfully
say dog ∈ {cat, dog, rabbit}. When dealing with the rolls of die, we are concerned with a variable 𝑋 ∈
{1, 2, 3, 4, 5, 6}.
Note that there is a subtle difference between discrete random variables, like the sides of a dice, and con-
tinuous ones, like the weight and the height of a person. There’s little point in asking whether two people
have exactly the same height. If we take precise enough measurements you’ll find that no two people on
the planet have the exact same height. In fact, if we take a fine enough measurement, you will not have

the same height when you wake up and when you go to sleep. So there’s no purpose in asking about the
probability that some one is 2.00139278291028719210196740527486202 meters tall. The probability is 0.
It makes more sense in this case to ask whether someone’s height falls into a given interval, say between
1.99 and 2.01 meters. In these cases we quantify the likelihood that we see a value as a density. The height
of exactly 2.0 meters has no probability, but nonzero density. Between any two different heights we have
nonzero probability.
There are a few important axioms of probability that you’ll want to remember:
• For any event 𝑧, the probability is never negative, i.e. Pr(𝑍 = 𝑧) ≥ 0.
• For any two events 𝑍 = 𝑧 and 𝑋 = 𝑥 the union is no more likely than the sum of the individual
events, i.e. Pr(𝑍 = 𝑧 ∪ 𝑋 = 𝑥) ≤ Pr(𝑍 = 𝑧) + Pr(𝑋 = 𝑥).
• For any random variable, the probabilities of all the values it can take must sum to 1 𝑛𝑖=1 𝑃 (𝑍 =
∑︀
𝑧𝑖 ) = 1.
• For any two mutually exclusive events 𝑍 = 𝑧 and 𝑋 = 𝑥, the probability that either happens is equal
to the sum of their individual probabilities that Pr(𝑍 = 𝑧 ∪ 𝑋 = 𝑥) = Pr(𝑍 = 𝑧) + Pr(𝑋 = 𝑧).
3.6.2 Dealing with multiple random variables

Very often, we’ll want consider more than one random variable at a time. For instance, we may want
to model the relationship between diseases and symptoms. Given a disease and symptom, say ‘flu’ and
‘cough’, either may or may not occur in a patient with some probability. While we hope that the probability
of both would be close to zero, we may want to estimate these probabilities and their relationships to each
other so that we may apply our inferences to effect better medical care.
As a more complicated example, images contain millions of pixels, thus millions of random variables. And
in many cases images will come with a label, identifying objects in the image. We can also think of the
label as a random variable. We can even get crazy and think of all the metadata as random variables such
as location, time, aperture, focal length, ISO, focus distance, camera type, etc. All of these are random
variables that occur jointly. When we deal with multiple random variables, there are several quantities of
interest. The first is called the joint distribution Pr(𝐴, 𝐵). Given any elements 𝑎 and 𝑏, the joint distribution
lets us answer, what is the probability that 𝐴 = 𝑎 and 𝐵 = 𝑏 simulataneously? It might be clear that for any
values 𝑎 and 𝑏, Pr(𝐴, 𝐵) ≤ Pr(𝐴 = 𝑎).
This has to be the case, since for 𝐴 and 𝐵 to happen, 𝐴 has to happen and 𝐵 also has to happen (and vice
versa). Thus 𝐴, 𝐵 cannot be more likely than 𝐴 or 𝐵 individually. This brings us to an interesting ratio:
0 ≤ Pr(𝐴,𝐵)
Pr(𝐴) ≤ 1. We call this a conditional probability and denote it by Pr(𝐵|𝐴), the probability that 𝐵
happens, provided that 𝐴 has happened.
Using the definition of conditional probabilities, we can derive one of the most useful and celebrated
equations in statistics - Bayes’ theorem. It goes as follows: By construction, we have that Pr(𝐴, 𝐵) =
Pr(𝐵|𝐴) Pr(𝐴). By symmetry, this also holds for Pr(𝐴, 𝐵) = Pr(𝐴|𝐵) Pr(𝐵). Solving for one of the
conditional variables we get:
Pr(𝐵|𝐴) Pr(𝐴)
Pr(𝐴|𝐵) =
Pr(𝐵)
This is very useful if we want to infer one thing from another, say cause and effect but we only know the
properties in the reverse direction. One important operation that we need to make this work is marginaliza-
tion, i.e., the operation of determining Pr(𝐴) and Pr(𝐵) from Pr(𝐴, 𝐵). We can see that the probability of

seeing 𝐴 amounts to accounting for all possible choices of 𝐵 and aggregating the joint probabilities over all
of them, i.e.
∑︁ ∑︁
Pr(𝐴) = Pr(𝐴, 𝐵 ′ ) and Pr(𝐵) = Pr(𝐴′ , 𝐵)
𝐵′ 𝐴′
A really useful property to check is for dependence and independence. Independence is when the oc-
currence of one event does not influence the occurrence of the other. In this case Pr(𝐵|𝐴) = Pr(𝐵).
Statisticians typically use 𝐴 ⊥
⊥ 𝐵 to express this. From Bayes Theorem it follows immediately that also
Pr(𝐴|𝐵) = Pr(𝐴). In all other cases we call 𝐴 and 𝐵 dependent. For instance, two successive rolls of a
dice are independent. On the other hand, the position of a light switch and the brightness in the room are not
(they are not perfectly deterministic, though, since we could always have a broken lightbulb, power failure,
or a broken switch).
Let’s put our skills to the test. Assume that a doctor administers an AIDS test to a patient. This test is fairly
accurate and fails only with 1% probability if the patient is healthy by reporting him as diseased, and that it
never fails to detect HIV if the patient actually has it. We use 𝐷 to indicate the diagnosis and 𝐻 to denote
the HIV status. Written as a table the outcome Pr(𝐷|𝐻) looks as follows:
Patient is HIV positive Patient is HIV negative

Test positive 1 0.01
Test negative 0 0.99
Note that the column sums are all one (but the row sums aren’t), since the conditional probability needs to
sum up to 1, just like the probability. Let us work out the probability of the patient having AIDS if the test
comes back positive. Obviously this is going to depend on how common the disease is, since it affects the
number of false alarms. Assume that the population is quite healthy, e.g. Pr(HIV positive) = 0.0015. To
apply Bayes Theorem we need to determine
Pr(Test positive) = Pr(𝐷 = 1|𝐻 = 0) Pr(𝐻 = 0) + Pr(𝐷 = 1|𝐻 = 1) Pr(𝐻 = 1) = 0.01 · 0.9985 + 1 · 0.0015 = 0.011
Hence we get Pr(𝐻 = 1|𝐷 = 1) = Pr(𝐷=1|𝐻=1) Pr(𝐻=1)

Pr(𝐷=1)
1·0.0015
= 0.011485 = 0.131, in other words, there’s only
a 13.1% chance that the patient actually has AIDS, despite using a test that is 99% accurate! As we can see,
statistics can be quite counterintuitive.
3.6.3 Conditional independence

What should a patient do upon receiving such terrifying news? Likely, he/she would ask the physician to
administer another test to get clarity. The second test has different characteristics (it isn’t as good as the first
one).
Patient is HIV positive Patient is HIV negative

Test positive 0.98 0.03
Test negative 0.02 0.97
Unfortunately, the second test comes back positive, too. Let us work out the requisite probabilities to invoke
Bayes’ Theorem.
• Pr(𝐷1 = 1 and 𝐷2 = 1|𝐻 = 0) = 0.01 · 0.03 = 0.0003

• Pr(𝐷1 = 1 and 𝐷2 = 1|𝐻 = 1) = 1 · 0.98 = 0.98

• Pr(𝐷1 = 1 and 𝐷2 = 1) = 0.0001 · 0.9985 + 0.98 · 0.0015 = 0.00176955
0.98·0.0015
• Pr(𝐻 = 1|𝐷1 = 1 and 𝐷2 = 1) = 0.00176955 = 0.831
That is, the second test allowed us to gain much higher confidence that not all is well. Despite the second test
being considerably less accurate than the first one, it still improved our estimate quite a bit. Why couldn’t
we just run the first test a second time? After all, the first test was more accurate. The reason is that we
needed a second test that confirmed independently of the first test that things were dire, indeed. In other
words, we made the tacit assumption that Pr(𝐷1 , 𝐷2 |𝐻) = Pr(𝐷1 |𝐻) Pr(𝐷2 |𝐻). Statisticians call such
random variables conditionally independent. This is expressed as 𝐷1 ⊥⊥ 𝐷2 |𝐻.
3.6.4 Naive Bayes classification

Conditional independence is useful when dealing with data, since it simplifies a lot of equations. A popular
algorithm is the Naive Bayes Classifier. The key assumption in it is that the attributes are all independent of
each other, given the labels. In other words, we have:
∏︁
𝑝(𝑥|𝑦) = 𝑝(𝑥𝑖 |𝑦)
𝑖
∏︀
𝑖 𝑝(𝑥 |𝑦)𝑝(𝑦)
Using Bayes Theorem this leads to the classifier 𝑝(𝑦|𝑥) = 𝑖 𝑝(𝑥) . Unfortunately, this is still in-
∑︀
it, since we know that 𝑦 𝑝(𝑦|𝑥) = 1,
tractable, since we don’t know 𝑝(𝑥). Fortunately, we don’t need∏︀
hence we can always recover the normalization from 𝑝(𝑦|𝑥) ∝ 𝑖 𝑝(𝑥𝑖 |𝑦)𝑝(𝑦). After all that math, it’s
time for some code to show how to use a Naive Bayes classifier for distinguishing digits on the MNIST
classification dataset.
The problem is that we don’t actually know 𝑝(𝑦) and 𝑝(𝑥𝑖 |𝑦). So we need to estimate it given some training
data first. This is what is called training the model. In the case of 10 possible classes we simply compute
𝑛𝑦 , i.e. the number of occurrences of class 𝑦 and then divide it by the total number of occurrences. E.g.
if we have a total of 60,000 pictures of digits and digit 4 occurs 5800 times, we estimate its probability as
5800
60000 . Likewise, to get an idea of 𝑝(𝑥𝑖 |𝑦) we count how many times pixel 𝑖 is set for digit 𝑦 and then divide
it by the number of occurrences of digit 𝑦. This is the probability that that very pixel will be switched on.
In [10]: import numpy as np
# we go over one observation at a time (speed doesn't matter here)

def transform(data, label):
return (nd.floor(data/128)).astype(np.float32), label.astype(np.float32)
mnist_train = mx.gluon.data.vision.MNIST(train=True, transform=transform)
mnist_test = mx.gluon.data.vision.MNIST(train=False, transform=transform)
# Initialize the count statistics for p(y) and p(x_i|y)

# We initialize all numbers with a count of 1 to ensure that we don't get a
# division by zero. Statisticians call this Laplace smoothing.
ycount = nd.ones(shape=(10))
xcount = nd.ones(shape=(784, 10))
# Aggregate count statistics of how frequently a pixel is on (or off) for

# zeros and ones.
for data, label in mnist_train:
x = data.reshape((784,))

y = int(label)
ycount[y] += 1
xcount[:, y] += x
# normalize the probabilities p(x_i|y) (divide per pixel counts by total

# count)
for i in range(10):
xcount[:, i] = xcount[:, i]/ycount[i]
# likewise, compute the probability p(y)

py = ycount / nd.sum(ycount)
Now that we computed per-pixel counts of occurrence for all pixels, it’s time to see how our model behaves.
Time to plot it. We show the estimated probabilities of observing a switched-on pixel. These are some mean
looking digits.
In [11]: import matplotlib.pyplot as plt
fig, figarr = plt.subplots(1, 10, figsize=(15, 15))
for i in range(10):
figarr[i].imshow(xcount[:, i].reshape((28, 28)).asnumpy(), cmap='hot')
figarr[i].axes.get_xaxis().set_visible(False)
figarr[i].axes.get_yaxis().set_visible(False)
plt.show()
print(py)
[ 0.09871688 0.11236461 0.09930012 0.10218297 0.09736711 0.09035161

0.09863356 0.10441593 0.09751708 0.09915014]
Now we can compute the likelihoods of an image, given the model. This is statistican speak for 𝑝(𝑥|𝑦),
i.e. how likely it is to see a particular image under certain conditions (such as the label). Since this is
computationally awkward (we might have to multiply many small numbers if many pixels have a small
probability
∏︀ ∑︀off computing its logarithm instead. That is, instead of 𝑝(𝑥|𝑦) =
of occurring), we are better
𝑖 𝑝(𝑥 𝑖 |𝑦) we compute log 𝑝(𝑥|𝑦) = 𝑖 log 𝑝(𝑥𝑖 |𝑦).
∑︁ ∑︁
𝑙𝑦 := log 𝑝(𝑥𝑖 |𝑦) = 𝑥𝑖 log 𝑝(𝑥𝑖 = 1|𝑦) + (1 − 𝑥𝑖 ) log (1 − 𝑝(𝑥𝑖 = 1|𝑦))
𝑖 𝑖
To avoid recomputing logarithms all the time, we precompute them for all pixels.
In [12]: logxcount = nd.log(xcount)
logxcountneg = nd.log(1-xcount)
logpy = nd.log(py)
fig, figarr = plt.subplots(2, 10, figsize=(15, 3))
# show 10 images
ctr = 0

for data, label in mnist_test:

x = data.reshape((784,))
y = int(label)
# we need to incorporate the prior probability p(y) since p(y|x) is

# proportional to p(x|y) p(y)
logpx = logpy.copy()
for i in range(10):
# compute the log probability for a digit
logpx[i] += nd.dot(logxcount[:, i], x) + nd.dot(logxcountneg[:, i], 1-x)
# normalize to prevent overflow or underflow by subtracting the largest
# value
logpx -= nd.max(logpx)
# and compute the softmax using logpx
px = nd.exp(logpx).asnumpy()
px /= np.sum(px)
# bar chart and image of digit

figarr[1, ctr].bar(range(10), px)
figarr[1, ctr].axes.get_yaxis().set_visible(False)
figarr[0, ctr].imshow(x.reshape((28, 28)).asnumpy(), cmap='hot')
figarr[0, ctr].axes.get_xaxis().set_visible(False)
figarr[0, ctr].axes.get_yaxis().set_visible(False)
ctr += 1
if ctr == 10:
break
plt.show()
As we can see, this classifier is both incompetent and overly confident of its incorrect estimates. That is,
even if it is horribly wrong, it generates probabilities close to 1 or 0. Not a classifier we should use very
much nowadays any longer. While Naive Bayes classifiers used to be popular in the 80s and 90s, e.g. for
spam filtering, their heydays are over. The poor performance is due to the incorrect statistical assumptions
that we made in our model: we assumed that each and every pixel are independently generated, depending
only on the label. This is clearly not how humans write digits, and this wrong assumption led to the downfall
of our overly naive (Bayes) classifier.
3.6.5 Sampling
Random numbers are just one form of random variables, and since computers are particularly good with
numbers, pretty much everything else in code ultimately gets converted to numbers anyway. One of the
basic tools needed to generate random numbers is to sample from a distribution. Let’s start with what
happens when we use a random number generator.

In [13]: import random

for i in range(10):
print(random.random())
0.970844720223
0.11442244666
0.476145849846
0.154138063676
0.925771401913
0.347466944833
0.288795056587
0.855051122608
0.32666729925
0.932922304219
Uniform Distribution
These are some pretty random numbers. As we can see, their range is between 0 and 1, and they are evenly
distributed. That is, there is (actually, should be, since this is not a real random number generator) no
interval in which numbers are more likely than in any other. In other words, the chances of any of these
numbers to fall into the interval, say [0.2, 0.3) are as high as in the interval [.593264, .693264). The way
they are generated internally is to produce a random integer first, and then divide it by its maximum range.
If we want to have integers directly, try the following instead. It generates random numbers between 0 and
100.
In [14]: for i in range(10):
print(random.randint(1, 100))
75
23
34
85
99
66
13
42
19
14
What if we wanted to check that randint is actually really uniform. Intuitively the best strategy would be
to run it, say 1 million times, count how many times it generates each one of the values and to ensure that
the result is uniform.
In [15]: import math
counts = np.zeros(100)
fig, axes = plt.subplots(2, 3, figsize=(15, 8), sharex=True)
axes = axes.reshape(6)
# mangle subplots such that we can index them in a linear fashion rather than
# a 2d grid
for i in range(1, 1000001):

counts[random.randint(0, 99)] += 1
if i in [10, 100, 1000, 10000, 100000, 1000000]:

axes[int(math.log10(i))-1].bar(np.arange(1, 101), counts)

plt.show()
What we can see from the above figures is that the initial number of counts looks very uneven. If we sample
fewer than 100 draws from a distribution over 100 outcomes this is pretty much expected. But even for 1000
samples there is a significant variability between the draws. What we are really aiming for is a situation
where the probability of drawing a number 𝑥 is given by 𝑝(𝑥).
The categorical distribution

Quite obviously, drawing from a uniform distribution over a set of 100 outcomes is quite simple. But what if
we have nonuniform probabilities? Let’s start with a simple case, a biased coin which comes up heads with
probability 0.35 and tails with probability 0.65. A simple way to sample from that is to generate a uniform
random variable over [0, 1] and if the number is less than 0.35, we output heads and otherwise we generate
tails. Let’s try this out.
In [16]: # number of samples
n = 1000000
y = np.random.uniform(0, 1, n)
x = np.arange(1, n+1)
# count number of occurrences and divide by the number of total draws
p0 = np.cumsum(y < 0.35) / x
p1 = np.cumsum(y >= 0.35) / x
plt.figure(figsize=(15, 8))
plt.semilogx(x, p0)
plt.semilogx(x, p1)
plt.show()

As we can see, on average this sampler will generate 35% zeros and 65% ones. Now what if we have
more than two possible outcomes? We can simply generalize this idea as follows. Given any probability
distribution, e.g. 𝑝 = [0.1, 0.2, 0.05, 0.3, 0.25, 0.1] we can compute its cumulative distribution (python’s
cumsum will do this for you) 𝐹 = [0.1, 0.3, 0.35, 0.65, 0.9, 1]. Once we have this we draw a random
variable 𝑥 from the uniform distribution 𝑈 [0, 1] and then find the interval where 𝐹 [𝑖 − 1] ≤ 𝑥 < 𝐹 [𝑖]. We
then return 𝑖 as the sample. By construction, the chances of hitting interval [𝐹 [𝑖 − 1], 𝐹 [𝑖]) has probability
𝑝(𝑖).
Note that there are many more efficient algorithms for sampling than the one above. For instance, binary
search over 𝐹 will run in 𝑂(log 𝑛) time for 𝑛 random variables. There are even more clever algorithms,
such as the Alias Method to sample in constant time, after 𝑂(𝑛) preprocessing.
The Normal distribution

√1 exp − 12 𝑥2 . Let’s plot it
(︀ )︀
The Normal distribution (aka the Gaussian distribution) is given by 𝑝(𝑥) = 2𝜋
to get a feel for it.
In [17]: x = np.arange(-10, 10, 0.01)
p = (1/math.sqrt(2 * math.pi)) * np.exp(-0.5 * x**2)
plt.plot(x, p)
plt.show()

Sampling from this distribution is a lot less trivial. First off, the support is infinite, that is, for any 𝑥 the
density 𝑝(𝑥) is positive. Secondly, the density is nonuniform. There are many tricks for sampling from it -
the key idea in all algorithms is to stratify 𝑝(𝑥) in such a way as to map it to the uniform distribution 𝑈 [0, 1].
One way to do this is with the probability integral transform.
∫︀ 𝑥
Denote by 𝐹 (𝑥) = −∞ 𝑝(𝑧)𝑑𝑧 the cumulative distribution function (CDF) of 𝑝. This is in a way the
continuous version of the cumulative sum that we used previously. In the same way we can now define the
inverse map 𝐹 −1 (𝜉), where 𝜉 is drawn uniformly. Unlike previously where we needed to find the correct
interval for the vector 𝐹 (i.e. for the piecewise constant function), we now invert the function 𝐹 (𝑥).
In practice, this is slightly more tricky since inverting the CDF is hard in the case of a Gaussian. It turns
out that the twodimensional integral is much easier to deal with, thus yielding two normal random variables
than one, albeit at the price of two uniformly distributed ones. For now, suffice it to say that there are built-in
algorithms to address this.
The normal distribution has yet another desirable property. In a way all distributions converge to it, if we
only average over a sufficiently large number of draws from any other distribution. To understand this in a
bit more detail, we need to introduce three important things: expected values, means and variances.
∫︀ expected value E𝑥∼𝑝(𝑥) [𝑓 (𝑥)] of a function 𝑓 under a distribution 𝑝 is given by the integral
• The
𝑥 𝑝(𝑥)𝑓 (𝑥)𝑑𝑥. That is, we average over all possible outcomes, as given by 𝑝.
• A particularly important expected value is that for the function 𝑓 (𝑥) = 𝑥, i.e. 𝜇 := E𝑥∼𝑝(𝑥) [𝑥]. It
provides us with some idea about the typical values of 𝑥.
• Another important quantity is the variance, i.e. the typical deviation from the mean 𝜎 2 :=
E𝑥∼𝑝(𝑥) [(𝑥 − 𝜇)2 ]. Simple math shows (check it as an exercise) that 𝜎 2 = E𝑥∼𝑝(𝑥) [𝑥2 ] − E2𝑥∼𝑝(𝑥) [𝑥].
The above allows us to change both mean and variance of random variables. Quite obviously for some
random variable 𝑥 with mean 𝜇, the random variable 𝑥 + 𝑐 has mean 𝜇 + 𝑐. Moreover, 𝛾𝑥 has the variance
𝛾 2 𝜎 2 . Applying this
(︀ to 1the normal2 )︀distribution we see that one with mean 𝜇 and variance 𝜎 2 has the form
1 1
𝑝(𝑥) = √ 2 exp − 2𝜎2 (𝑥 − 𝜇) . Note the scaling factor 𝜎 - it arises from the fact that if we stretch the
2𝜎 𝜋

distribution by 𝜎, we need to lower it by 𝜎1 to retain the same probability mass (i.e. the weight under the
distribution always needs to integrate out to 1).
Now we are ready to state one of the most fundamental theorems in statistics, the Central Limit Theorem. It
states that for sufficiently well-behaved random variables, in particular random variables with well-defined
mean and variance, the sum tends toward a normal distribution. To get some idea, let’s repeat the experiment
described in the beginning, but now using random variables with integer values of {0, 1, 2}.
In [18]: # generate 10 random sequences of 10,000 random normal variables N(0,1)
tmp = np.random.uniform(size=(10000,10))
x = 1.0 * (tmp > 0.3) + 1.0 * (tmp > 0.8)
mean = 1 * 0.5 + 2 * 0.2
variance = 1 * 0.5 + 4 * 0.2 - mean**2
print('mean {}, variance {}'.format(mean, variance))
# cumulative sum and normalization
y = np.arange(1,10001).reshape(10000,1)
z = np.cumsum(x,axis=0) / y
plt.figure(figsize=(10,5))
for i in range(10):
plt.semilogx(y,z[:,i])
plt.semilogx(y,(variance**0.5) * np.power(y,-0.5) + mean,'r')

plt.semilogx(y,-(variance**0.5) * np.power(y,-0.5) + mean,'r')
plt.show()
mean 0.9, variance 0.49
This looks very similar to the initial example, at least in the limit of averages of large numbers of variables.
This is confirmed by theory. Denote by mean and variance of a random variable the quantities
𝜇[𝑝] := E𝑥∼𝑝(𝑥) [𝑥] and 𝜎 2 [𝑝] := E𝑥∼𝑝(𝑥) [(𝑥 − 𝜇[𝑝])2 ]
√1
∑︀𝑛 𝑥𝑖 −𝜇
Then we have that lim𝑛→∞ 𝑛 𝑖=1 𝜎 → 𝒩 (0, 1). In other words, regardless of what we started out

with, we will always converge to a Gaussian. This is one of the reasons why Gaussians are so popular in
statistics.
More distributions
Many more useful distributions exist. We recommend consulting a statistics book or looking some of them
up on Wikipedia for further detail.
• Binomial Distribution It is used to describe the distribution over multiple draws from the same
distribution, e.g. the number of heads when tossing a biased (︀coin
)︀ (i.e. a coin with probability 𝜋 of
𝑛 𝑥 𝑛−𝑥
returning heads) 10 times. The probability is given by 𝑝(𝑥) = 𝑥 𝜋 (1 − 𝜋) .
• Multinomial Distribution Obviously we can have more than two outcomes, ∏︀𝑘 e.g. when rolling a dice
𝑥𝑖
multiple times. In this case the distribution is given by 𝑝(𝑥) = ∏︀𝑘 𝑛! 𝜋
𝑖=1 𝑖 .
𝑖=1 𝑥𝑖 !
• Poisson Distribution It is used to model the occurrence of point events that happen with a given rate,
e.g. the number of raindrops arriving within a given amount of time in an area (weird fact - the number
of Prussian soldiers being killed by horses kicking them followed that distribution). Given a rate 𝜆,
1 𝑥 −𝜆
the number of occurrences is given by 𝑝(𝑥) = 𝑥! 𝜆 𝑒 .
• Beta, Dirichlet, Gamma, and Wishart Distributions They are what statisticians call conjugate to
the Binomial, Multinomial, Poisson and Gaussian respectively. Without going into detail, these distri-
butions are often used as priors for coefficients of the latter set of distributions, e.g. a Beta distribution
as a prior for modeling the probability for binomial outcomes.
3.6.6 Next
Autograd
3.7 Automatic differentiation with autograd

In machine learning, we train models to get better and better as a function of experience. Usually, getting
better means minimizing a loss function, i.e. a score that answers “how bad is our model?” With neural
networks, we choose loss functions to be differentiable with respect to our parameters. Put simply, this
means that for each of the model’s parameters, we can determine how much increasing or decreasing it
might affect the loss. While the calculations are straightforward, for complex models, working it out by
hand can be a pain.
MXNet’s autograd package expedites this work by automatically calculating derivatives. And while most
other libraries require that we compile a symbolic graph to take automatic derivatives, mxnet.autograd,
like PyTorch, allows you to take derivatives while writing ordinary imperative code. Every time you make
pass through your model, autograd builds a graph on the fly, through which it can immediately backprop-
agate gradients.
Let’s go through it step by step. For this tutorial, we’ll only need to import mxnet.ndarray, and mxnet.
autograd.
from mxnet import nd, autograd
mx.random.seed(1)
3.7. Automatic differentiation with autograd 59

3.7.1 Attaching gradients

As a toy example, Let’s say that we are interested in differentiating a function f = 2 * (x ** 2) with
respect to parameter x. We can start by assigning an initial value of x.
In [2]: x = nd.array([[1, 2], [3, 4]])
Once we compute the gradient of f with respect to x, we’ll need a place to store it. In MXNet, we can tell
an NDArray that we plan to store a gradient by invoking its attach_grad() method.
In [3]: x.attach_grad()
Now we’re going to define the function f and MXNet will generate a computation graph on the fly. It’s as if
MXNet turned on a recording device and captured the exact path by which each variable was generated.
Note that building the computation graph requires a nontrivial amount of computation. So MXNet will only
build the graph when explicitly told to do so. We can instruct MXNet to start recording by placing code
inside a with autograd.record(): block.
In [4]: with autograd.record():
y = x * 2
z = y * x
Let’s backprop by calling z.backward(). When z has more than one entry, z.backward() is equiv-
alent to mx.nd.sum(z).backward().
In [5]: z.backward()
Now, let’s see if this is the expected output. Remember that y = x * 2, and z = x * y, so z should be
equal to 2 * x * x. After, doing backprop with z.backward(), we expect to get back gradient dz/dx
as follows: dy/dx = 2, dz/dx = 4 * x. So, if everything went according to plan, x.grad should consist of
an NDArray with the values [[4, 8],[12, 16]].
In [6]: print(x.grad)
[[ 4. 8.]
[ 12. 16.]]
3.7.2 Head gradients and the chain rule

Caution: This part is tricky, but not necessary to understanding subsequent sections.
Sometimes when we call the backward method on an NDArray, e.g. y.backward(), where y is a function
of x we are just interested in the derivative of y with respect to x. Mathematicians write this as 𝑑𝑦(𝑥) 𝑑𝑥 .
At other times, we may be interested in the gradient of z with respect to x, where z is a function of
𝑑
y, which in turn, is a function of x. That is, we are interested in 𝑑𝑥 𝑧(𝑦(𝑥)). Recall that by the chain rule
𝑑 𝑑𝑧(𝑦) 𝑑𝑦(𝑥) 𝑑𝑧
𝑑𝑥 𝑧(𝑦(𝑥)) = 𝑑𝑦 𝑑𝑥 . So, when y is part of a larger function z, and we want x.grad to store 𝑑𝑥 , we can
𝑑𝑧
pass in the head gradient 𝑑𝑦 as an input to backward(). The default argument is nd.ones_like(y).
See Wikipedia for more details.
In [7]: with autograd.record():
y = x * 2
z = y * x

head_gradient = nd.array([[10, 1.], [.1, .01]])

z.backward(head_gradient)
print(x.grad)
[[ 40. 8. ]
[ 1.20000005 0.16 ]]
Now that we know the basics, we can do some wild things with autograd, including building differentiable
functions using Pythonic control flow.
In [8]: a = nd.random_normal(shape=3)
a.attach_grad()
with autograd.record():
b = a * 2
while (nd.norm(b) < 1000).asscalar():
b = b * 2
if (mx.nd.sum(b) > 0).asscalar():

c = b
else:
c = 100 * b
In [9]: head_gradient = nd.array([0.01, 1.0, .1])
c.backward(head_gradient)
In [10]: print(a.grad)
[ 2048. 204800. 20480.]

<NDArray 3 @cpu(0)>
3.7.3 Next
Chapter 1 Problem Set
In [ ]:
3.8 Linear regression from scratch

Powerful ML libraries can eliminate repetitive work, but if you rely too much on abstractions, you might
never learn how neural networks really work under the hood. So for this first example, let’s get our hands
dirty and build everything from scratch, relying only on autograd and NDArray. First, we’ll import the same
dependencies as in the autograd chapter. We’ll also import the powerful gluon package but in this chapter,
we’ll only be using it for data loading.
In [21]: from __future__ import print_function
import mxnet as mx
from mxnet import nd, autograd, gluon
mx.random.seed(1)
3.8. Linear regression from scratch 61

3.8.1 Set the context

We’ll also want to specify the contexts where computation should happen. This tutorial is so simple that
you could probably run it on a calculator watch. But, to develop good habits we’re going to specify two
contexts: one for data and one for our models.
In [23]: data_ctx = mx.cpu()
model_ctx = mx.cpu()
3.8.2 Linear regression

To get our feet wet, we’ll start off by looking at the problem of regression. This is the task of predicting a
real valued target 𝑦 given a data point 𝑥. In linear regression, the simplest and still perhaps the most useful
approach, we assume that prediction can be expressed as a linear combination of the input features (thus
giving the name linear regression):
𝑦ˆ = 𝑤1 · 𝑥1 + ... + 𝑤𝑑 · 𝑥𝑑 + 𝑏
Given a collection of data points 𝑋, and corresponding target values 𝑦, we’ll try to find the weight vector
𝑤 and bias term 𝑏 (also called an offset or intercept) that approximately associate data points 𝑥𝑖 with their
corresponding labels y_i. Using slightly more advanced math notation, we can express the predictions 𝑦 ^
corresponding to a collection of datapoints 𝑋 via the matrix-vector product:
𝑦
^ = 𝑋𝑤 + 𝑏
Before we can get going, we will need two more things

• Some way to measure the quality of the current model
• Some way to manipulate the model to improve its quality
Square loss
In order to say whether we’ve done a good job, we need some way to measure the quality of a model.
Generally, we will define a loss function that says how far are our predictions from the correct answers. For
the classical case of linear regression, we usually focus on the squared error. Specifically, our loss will be
the sum, over all examples, of the squared error (𝑦𝑖 − 𝑦ˆ)2 ) on each:
𝑛
∑︁
ℓ(𝑦, 𝑦ˆ) = 𝑦𝑖 − 𝑦𝑖 )2 .
(ˆ
𝑖=1
For one-dimensional data, we can easily visualize the relationship between our single feature and the target
variable. It’s also easy to visualize a linear predictor and it’s error on each example. Note that squared loss
heavily penalizes outliers. For the visualized predictor below, the lone outlier would contribute most of the
loss.
Manipulating the model

For us to minimize the error, we need some mechanism to alter the model. We do this by choosing values
of the parameters 𝑤 and 𝑏. This is the only job of the learning algorithm. Take training data (𝑋, 𝑦) and the
functional form of the model 𝑦ˆ = 𝑋𝑤 + 𝑏. Learning then consists of choosing the best possible 𝑤 and 𝑏
based on the available evidence.

Historical note
You might reasonably point out that linear regression is a classical statistical model. According to Wikipedia,
Legendre first developed the method of least squares regression in 1805, which was shortly thereafter re-
discovered by Gauss in 1809. Presumably, Legendre, who had Tweeted about the paper several times, was
peeved that Gauss failed to cite his arXiv preprint.
Matters of provenance aside, you might wonder - if Legendre and Gauss worked on linear regression, does
that mean there were the original deep learning researchers? And if linear regression doesn’t wholly belong
to deep learning, then why are we presenting a linear model as the first example in a tutorial series on neural
networks? Well it turns out that we can express linear regression as the simplest possible (useful) neural
network. A neural network is just a collection of nodes (aka neurons) connected by directed edges. In most
networks, we arrange the nodes into layers with each feeding its output into the layer above. To calculate the
value of any node, we first perform a weighted sum of the inputs (according to weights w) and then apply an
activation function. For linear regression, we only have two layers, one corresponding to the input (depicted
in orange) and a one-node layer (depicted in green) corresponding to the ouput. For the output node the
activation function is just the identity function.
While you certainly don’t have to view linear regression through the lens of deep learning, you can (and we
will!). To ground the concepts that we just discussed in code, let’s actually code up a neural network for
linear regression from scratch.
To get going, we will generate a simple synthetic dataset by sampling random data points X[i] and cor-
responding labels y[i] in the following manner. Our inputs will each be sampled from a random normal
distribution with mean 0 and variance 1. Our features will be independent. Another way of saying this is
that they will have diagonal covariance. The labels will be generated accoding to the true labeling func-
tion y[i] = 2 * X[i][0]- 3.4 * X[i][1] + 4.2 + noise where the noise is drawn from a
random gaussian with mean 0 and variance .01. We could express the labeling function in mathematical

Fig. 3.1: Legendre

notation as:
𝑦 = 𝑋 · 𝑤 + 𝑏 + 𝜂, for 𝜂 ∼ 𝒩 (0, 𝜎 2 )
In [25]: num_inputs = 2
num_outputs = 1
num_examples = 10000
def real_fn(X):
return 2 * X[:, 0] - 3.4 * X[:, 1] + 4.2
X = nd.random_normal(shape=(num_examples, num_inputs), ctx=data_ctx)

noise = .1 * nd.random_normal(shape=(num_examples,), ctx=data_ctx)
y = real_fn(X) + noise
Notice that each row in X consists of a 2-dimensional data point and that each row in Y consists of a 1-
dimensional target value.
In [27]: print(X[0])
print(y[0])
[-1.22338355 2.39233518]
<NDArray 2 @cpu(0)>
[-6.09602737]
<NDArray 1 @cpu(0)>
Note that because our synthetic features X live on data_ctx and because our noise also lives on
data_ctx, the labels y, produced by combining X and noise in real_fn also live on data_ctx.
We can confirm that for any randomly chosen point, a linear combination with the (known) optimal param-
eters produces a prediction that is indeed close to the target value
In [28]: print(2 * X[0, 0] - 3.4 * X[0, 1] + 4.2)
[-6.38070679]
<NDArray 1 @cpu(0)>
We can visualize the correspondence between our second feature (X[:, 1]) and the target values Y by
generating a scatter plot with the Python plotting package matplotlib. Make sure that matplotlib
is installed. Otherwise, you may install it by running pip2 install matplotlib (for Python 2) or
pip3 install matplotlib (for Python 3) on your command line.
In order to plot with matplotlib we’ll just need to convert X and y into NumPy arrays by using the
.asnumpy() function.
plt.scatter(X[:, 1].asnumpy(),y.asnumpy())
plt.show()

3.8.3 Data iterators

Once we start working with neural networks, we’re going to need to iterate through our data points quickly.
We’ll also want to be able to grab batches of k data points at a time, to shuffle our data. In MXNet, data
iterators give us a nice set of utilities for fetching and manipulating data. In particular, we’ll work with the
simple DataLoader class, that provides an intuitive way to use an ArrayDataset for training models.
We can load X and y into an ArrayDataset, by calling gluon.data.ArrayDataset(X, y). It’s ok
for X to be a multi-dimensional input array (say, of images) and y to be just a one-dimensional array of
labels. The one requirement is that they have equal lengths along the first axis, i.e., len(X) == len(y).
Given an ArrayDataset, we can create a DataLoader which will grab random batches of data from an
ArrayDataset. We’ll want to specify two arguments. First, we’ll need to say the batch_size, i.e.,
how many examples we want to grab at a time. Second, we’ll want to specify whether or not to shuffle the
data between iterations through the dataset.
In [30]: batch_size = 4
train_data = gluon.data.DataLoader(gluon.data.ArrayDataset(X, y),
batch_size=batch_size, shuffle=True)
Once we’ve initialized our DataLoader (train_data), we can easily fetch batches by iterating over
train_data just as if it were a Python list. You can use your favorite iterating techniques like fore-
ach loops: for data, label in train_data or enumerations: for i, (data, label) in
enumerate(train_data). First, let’s just grab one batch and break out of the loop.
In [31]: for i, (data, label) in enumerate(train_data):
print(data, label)
break
[[-0.14732301 -1.32803488]
[-0.56128627 0.48301753]

[ 0.75564283 -0.12659997]
[-0.96057719 -0.96254188]]
[ 8.25711536 1.30587864 6.15542459 5.48825312]
<NDArray 4 @cpu(0)>
If we run that same code again you’ll notice that we get a different batch. That’s because we instructed the
DataLoader that shuffle=True.
In [32]: for i, (data, label) in enumerate(train_data):
print(data, label)
break
[[-0.59027743 -1.52694809]
[-0.00750104 2.68466949]
[ 1.50308061 0.54902577]
[ 1.69129586 0.32308948]]
[ 8.28844357 -5.07566643 5.3666563 6.52408457]
<NDArray 4 @cpu(0)>
Finally, if we actually pass over the entire dataset, and count the number of batches, we’ll find that there are
2500 batches. We expect this because our dataset has 10,000 examples and we configured the DataLoader
with a batch size of 4.
In [33]: counter = 0
for i, (data, label) in enumerate(train_data):
pass
print(i+1)
2500
3.8.4 Model parameters

Now let’s allocate some memory for our parameters and set their initial values. We’ll want to initialize these
parameters on the model_ctx.
In [34]: w = nd.random_normal(shape=(num_inputs, num_outputs), ctx=model_ctx)
b = nd.random_normal(shape=num_outputs, ctx=model_ctx)
params = [w, b]
In the succeeding cells, we’re going to update these parameters to better fit our data. This will involve
taking the gradient (a multi-dimensional derivative) of some loss function with respect to the parameters.
We’ll update each parameter in the direction that reduces the loss. But first, let’s just allocate some memory
for each gradient.
In [35]: for param in params:
param.attach_grad()
3.8.5 Neural networks

Next we’ll want to define our model. In this case, we’ll be working with linear models, the simplest possible
useful neural network. To calculate the output of the linear model, we simply multiply a given input with
the model’s weights (w), and add the offset b.

In [36]: def net(X):

return mx.nd.dot(X, w) + b
Ok, that was easy.
3.8.6 Loss function

Train a model means making it better and better over the course of a period of training. But in order for this
goal to make any sense at all, we first need to define what better means in the first place. In this case, we’ll
use the squared distance between our prediction and the true value.
In [37]: def square_loss(yhat, y):
return nd.mean((yhat - y) ** 2)
3.8.7 Optimizer
It turns out that linear regression actually has a closed-form solution. However, most interesting models that
we’ll care about cannot be solved analytically. So we’ll solve this problem by stochastic gradient descent.
At each step, we’ll estimate the gradient of the loss with respect to our weights, using one batch randomly
drawn from our dataset. Then, we’ll update our parameters a small amount in the direction that reduces the
loss. The size of the step is determined by the learning rate lr.
In [38]: def SGD(params, lr):
for param in params:
param[:] = param - lr * param.grad
3.8.8 Execute training loop

Now that we have all the pieces, we just need to wire them together by writing a training loop. First we’ll
define epochs, the number of passes to make over the dataset. Then for each pass, we’ll iterate through
train_data, grabbing batches of examples and their corresponding labels.
For each batch, we’ll go through the following ritual:
• Generate predictions (yhat) and the loss (loss) by executing a forward pass through the network.
• Calculate gradients by making a backwards pass through the network (loss.backward()).
• Update the model parameters by invoking our SGD optimizer.
In [39]: epochs = 10
learning_rate = .0001
num_batches = num_examples/batch_size
for e in range(epochs):
cumulative_loss = 0
# inner loop
data = data.as_in_context(model_ctx)
label = label.as_in_context(model_ctx).reshape((-1, 1))
output = net(data)
loss = square_loss(output, label)
loss.backward()
SGD(params, learning_rate)

cumulative_loss += loss.asscalar()
print(cumulative_loss / num_batches)
24.6606138554
9.09776815639
3.36058844271
1.24549788469
0.465710770596
0.178157229481
0.0721970594548
0.0331197250206
0.0186954441286
0.0133724625537
3.8.9 Visualizing our training progess

In the succeeding chapters, we’ll introduce more realistic data, fancier models, more complicated loss func-
tions, and more. But the core ideas are the same and the training loop will look remarkably familiar. Because
these tutorials are self-contained, you’ll get to know this ritual quite well. In addition to updating our model,
we’ll often want to do some bookkeeping. Among other things, we might want to keep track of training
progress and visualize it graphically. We demonstrate one slighly more sophisticated training loop below.
In [41]: ############################################
# Re-initialize parameters because they
# were already trained in the first loop
############################################
w[:] = nd.random_normal(shape=(num_inputs, num_outputs), ctx=model_ctx)
b[:] = nd.random_normal(shape=num_outputs, ctx=model_ctx)
############################################
# Script to plot the losses over time
############################################
def plot(losses, X, sample_size=100):
xs = list(range(len(losses)))
f, (fg1, fg2) = plt.subplots(1, 2)
fg1.set_title('Loss during training')
fg1.plot(xs, losses, '-r')
fg2.set_title('Estimated vs real function')
fg2.plot(X[:sample_size, 1].asnumpy(),
net(X[:sample_size, :]).asnumpy(), 'or', label='Estimated')
fg2.plot(X[:sample_size, 1].asnumpy(),
real_fn(X[:sample_size, :]).asnumpy(), '*g', label='Real')
fg2.legend()
plt.show()
losses = []
plot(losses, X)
cumulative_loss = 0

label = label.as_in_context(model_ctx).reshape((-1, 1))

output = net(data)
loss.backward()
cumulative_loss += loss.asscalar()
print("Epoch %s, batch %s. Mean loss: %s" % (e, i, cumulative_loss/num_batches

losses.append(cumulative_loss/num_batches)
plot(losses, X)
Epoch 0, batch 2499. Mean loss: 16.9325145943


3.8.10 Conclusion
You’ve seen that using just mxnet.ndarray and mxnet.autograd, we can build statistical models from scratch.
In the following tutorials, we’ll build on this foundation, introducing the basic ideas behind modern neural
networks and demonstrating the powerful abstractions in MXNet’s gluon package for building complex
models with little code.
3.8.11 Next
Linear regression with gluon
3.9 Linear regression with gluon

Now that we’ve implemented a whole neural network from scratch, using nothing but mx.ndarray and
mxnet.autograd, let’s see how we can make the same model while doing a lot less work.
Again, let’s import some packages, this time adding mxnet.gluon to the list of dependencies.
import mxnet as mx

We’ll also want to set a context to tell gluon where to do most of the computation.
3.9. Linear regression with gluon 71

3.9.2 Build the dataset

Again we’ll look at the problem of linear regression and stick with the same synthetic data.
In [34]: num_inputs = 2
num_outputs = 1
def real_fn(X):
return 2 * X[:, 0] - 3.4 * X[:, 1] + 4.2
X = nd.random_normal(shape=(num_examples, num_inputs))
noise = 0.01 * nd.random_normal(shape=(num_examples,))
y = real_fn(X) + noise
3.9.3 Load the data iterator

We’ll stick with the DataLoader for handling our data batching.
train_data = gluon.data.DataLoader(gluon.data.ArrayDataset(X, y),
3.9.4 Define the model

When we implemented things from scratch, we had to individually allocate parameters and then compose
them together as a model. While it’s good to know how to do things from scratch, with gluon, we can just
compose a network from predefined layers. For a linear model, the appropriate layer is called Dense. It’s
called a dense layer because every node in the input is connected to every node in the subsequent layer. That
description seems excessive because we only have one (non-input) layer here, and that layer only contains
one node! But in subsequent chapters we’ll typically work with networks that have multiple outputs, so we
might as well start thinking in terms of layers of nodes. Because a linear model consists of just a single
Dense layer, we can instantiate it with one line.
As in the previous notebook, we have an input dimension of 2 and an output dimension of 1. the most direct
way to instantiate a Dense layer with these dimensions is to specify the number of inputs and the number
of outputs.
In [36]: net = gluon.nn.Dense(1, in_units=2)
That’s it! We’ve already got a neural network. Like our hand-crafted model in the previous notebook, this
model has a weight matrix and bias vector.
In [37]: print(net.weight)
print(net.bias)
Out[37]: Parameter dense4_weight (shape=(1, 2), dtype=None)
Parameter dense4_bias (shape=(1,), dtype=None)
Here, net.weight and net.bias are not actually NDArrays. They are instances of the Parameter
class. We use Parameter instead of directly accessing NDAarrays for several reasons. For example, they
provide convenient abstractions for initializing values. Unlike NDArrays, Parameters can be associated with
multiple contexts simultaneously. This will come in handy in future chapters when we start thinking about
distributed learning across multiple GPUs.

In gluon, all neural networks are made out of Blocks (gluon.Block). Blocks are just units that take
inputs and generate outputs. Blocks also contain parameters that we can update. Here, our network consists
of only one layer, so it’s convenient to access our parameters directly. When our networks consist of 10s of
layers, this won’t be so fun. No matter how complex our network, we can grab all its parameters by calling
collect_params() as follows:
In [38]: net.collect_params()
Out[38]: dense4_ (
Parameter dense4_weight (shape=(1, 2), dtype=None)
Parameter dense4_bias (shape=(1,), dtype=None)
)
The returned object is a gluon.parameter.ParameterDict. This is a convenient abstraction for

retrieving and manipulating groups of Parameter objects. Most often, we’ll want to retrieve all of the pa-
rameters in a neural network:
In [39]: type(net.collect_params())
Out[39]: mxnet.gluon.parameter.ParameterDict
3.9.5 Initialize parameters

Once we initialize our Parameters, we can access their underlying data and context(s), and we can also
feed data through the neural network to generate output. However, we can’t get going just yet. If we try
invoking your model by calling net(nd.array([[0,1]])), we’ll confront the following hideous error
message:
RuntimeError: Parameter dense1_weight has not been initialized...
That’s because we haven’t yet told gluon what the initial values for our parameters should be! We ini-
tialize parameters by calling the .initialize() method of a ParameterDict. We’ll need to pass in two
arguments.
• An initializer, many of which live in the mx.init module.
• A context where the parameters should live. In this case we’ll pass in the model_ctx. Most often
this will either be a GPU or a list of GPUs.
MXNet provides a variety of common initializers in mxnet.init. To keep things consistent with the
model we built by hand, we’ll initialize each parameter by sampling from a standard normal distribution,
using mx.init.Normal(sigma=1.).
In [40]: net.collect_params().initialize(mx.init.Normal(sigma=1.), ctx=model_ctx)
3.9.6 Deferred Initialization

When we call initialize, gluon associates each parameter with an initializer. However, the actual ini-
tialization is deferred until we make a first forward pass. In other words, the parameters are only initialized
when they’re needed. If we try to call net.weight.data() we’ll get the following error:
DeferredInitializationError: Parameter dense2_weight has not
been initialized yet because initialization was deferred. Actual
initialization happens during the first forward pass. Please pass one
batch of data through the network before accessing Parameters.

Passing data through a gluon model is easy. We just sample a batch of the appropriate shape and call net
just as if it were a function. This will invoke net’s forward() method.
In [41]: example_data = nd.array([[4,7]])
net(example_data)
Out[41]:
[[-1.33219385]]
Now that net is initialized, we can access each of its parameters.

In [42]: print(net.weight.data())
print(net.bias.data())
[[-0.25217363 -0.04621419]]
[ 0.]
<NDArray 1 @cpu(0)>
3.9.7 Shape inference

Recall that previously, we instantiated our network with gluon.nn.Dense(1, in_units=2). One
slick feature that we can take advantage of in gluon is shape inference on parameters. Because our param-
eters never come into action until we pass data through the network, we don’t actually have to declare the
input dimension (in_units). Let’s try this again, but letting gluon do more of the work:
In [43]: net = gluon.nn.Dense(1)
net.collect_params().initialize(mx.init.Normal(sigma=1.), ctx=model_ctx)
We’ll elaborate on this and more of gluon’s internal workings in subsequent chapters.
3.9.8 Define loss

Instead of writing our own loss function we’re just going to access squared error by instantiating gluon.
loss.L2Loss. Just like layers, and whole networks, a loss in gluon is just a Block.
In [44]: square_loss = gluon.loss.L2Loss()
3.9.9 Optimizer
Instead of writing stochastic gradient descent from scratch every time, we can instantiate a gluon.
Trainer, passing it a dictionary of parameters. Note that the SGD optimizer in gluon also has a few
bells and whistles that you can turn on at will, including momentum and clipping (both are switched off by
default). These modifications can help to converge faster and we’ll discuss them later when we go over a
variety of optimization algorithms in detail.
In [45]: trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.0001})

You might have noticed that it was a bit more concise to express our model in gluon. For example, we
didn’t have to individually allocate parameters, define our loss function, or implement stochastic gradient

descent. The benefits of relying on gluon’s abstractions will grow substantially once we start working with
much more complex models. But once we have all the basic pieces in place, the training loop itself is quite
similar to what we would do if implementing everything from scratch.
To refresh your memory. For some number of epochs, we’ll make a complete pass over the dataset
(train_data), grabbing one mini-batch of inputs and the corresponding ground-truth labels at a time.
Then, for each batch, we’ll go through the following ritual. So that this process becomes maximally ritual-
istic, we’ll repeat it verbatim:
• Generate predictions (yhat) and the loss (loss) by executing a forward pass through the network.
• Calculate gradients by making a backwards pass through the network via loss.backward().
• Update the model parameters by invoking our SGD optimizer (note that we need not tell trainer.
step about which parameters but rather just the amount of data, since we already performed that in
the initialization of trainer).
loss_sequence = []
num_batches = num_examples / batch_size
cumulative_loss = 0
# inner loop
label = label.as_in_context(model_ctx)
output = net(data)
loss.backward()
trainer.step(batch_size)
cumulative_loss += nd.mean(loss).asscalar()
print("Epoch %s, loss: %s" % (e, cumulative_loss / num_examples))
loss_sequence.append(cumulative_loss)
Epoch 0, loss: 3.44980202263

Epoch 1, loss: 2.10364257665
Epoch 2, loss: 1.28279426137
Epoch 3, loss: 0.782256319318
Epoch 4, loss: 0.477034088909
Epoch 5, loss: 0.290909814427
Epoch 6, loss: 0.177411796283
Epoch 7, loss: 0.108197494675
Epoch 8, loss: 0.0659899789031
Epoch 9, loss: 0.040249745576
3.9.11 Visualizing the learning curve

Now let’s check how quickly SGD learns the linear regression model by plotting the learning curve.
In [47]: # plot the convergence of the estimated loss function
%matplotlib inline

import matplotlib
import matplotlib.pyplot as plt
plt.figure(num=None,figsize=(8, 6))
plt.plot(loss_sequence)
# Adding some bells and whistles to the plot

plt.grid(True, which="both")
plt.xlabel('epoch',fontsize=14)
plt.ylabel('average loss',fontsize=14)
Out[47]: <matplotlib.text.Text at 0x7efc87a7f0f0>
As we can see, the loss function converges quickly to the optimal solution.
3.9.12 Getting the learned model parameters

As an additional sanity check, since we generated the data from a Gaussian linear regression model, we want
to make sure that the learner managed to recover the model parameters, which were set to weight 2, −3.4
with an offset of 4.2.
In [48]: params = net.collect_params() # this returns a ParameterDict
print('The type of "params" is a ',type(params))

# A ParameterDict is a dictionary of Parameter class objects

# therefore, here is how we can read off the parameters from it.
for param in params.values():

print(param.name,param.data())
The type of "params" is a <class 'mxnet.gluon.parameter.ParameterDict'>
dense5_weight
[[ 1.7913872 -3.10427046]]
dense5_bias
[ 3.85259581]
<NDArray 1 @cpu(0)>
3.9.13 Conclusion
As you can see, even for a simple example like linear regression, gluon can help you to write quick and
clean code. Next, we’ll repeat this exercise for multi-layer perceptrons, extending these lessons to deep
neural networks and (comparatively) real datasets.
3.9.14 Next
Binary classification with logistic regression
3.10 Binary classification with logistic regression

Over the last two tutorials we worked through how to implement a linear regression model, both *from
scratch* and using Gluon to automate most of the repetitive work like allocating and initializing parameters,
defining loss functions, and implementing optimizers.
Regression is the hammer we reach for when we want to answer how much? or how many? questions.
If you want to predict the number of dollars (the price) at which a house will be sold, or the number of
wins a baseball team might have, or the number of days that a patient will remain hospitalized before being
discharged, then you’re probably looking for a regression model.
Based on our experience, in industry, we’re more often interested in making categorical assignments. Does
this email belong in the spam folder or the inbox? How likely is this custromer to sign up for subscription
service? When we’re interested in either assigning datapoints to categories or assessing the probability that
a category applies, we call this task classification.
The simplest kind of classification problem is binary classification, when there are only two categories, so
let’s start there. Let’s call our two categories the positive class 𝑦𝑖 = 1 and the negative class 𝑦𝑖 = 0. Even
with just two categories, and even confining ourselves to linear models, there are many ways we might
approach the problem. For example, we might try to draw a line that best separates the points.
A whole family of algorithms called support vector machines pursue this approach. The main idea here is
choose a line that maximizes the margin to the closest data points on either side of the decision boundary.
In these approaches, only the points closest to the decision boundary (the support vectors) actually influence
the choice of the linear separator.
3.10. Binary classification with logistic regression 77

chapter02_supervised-learning/../img/linear-separato
With neural networks, we usually approach the problem differently. Instead of just trying to separate the
points, we train a probabilistic classifier which estimates, for each data point, the conditional probability
that it belongs to the positive class.
Recall that in linear regression, we made predictions of the form
𝑦ˆ = 𝑤𝑇 𝑥 + 𝑏.
We are interested in asking the question “what is the probability that example :math:‘x‘ belongs to the
positive class?” A regular linear model is a poor choice here because it can output values greater than 1 or
less than 0. To coerce reasonable answers from our model, we’re going to modify it slightly, by running the
linear function through a sigmoid activation function 𝜎:
𝑦ˆ = 𝜎(𝑤𝑇 𝑥 + 𝑏).
The sigmoid function 𝜎, sometimes called a squashing function or a logistic function - thus the name logistic
regression - maps a real-valued input to the range 0 to 1. Specifically, it has the functional form:
1
𝜎(𝑧) =
1 + 𝑒−𝑧
Let’s get our imports out of the way and visualize the logistic function using mxnet and matplotlib.
In [ ]: import mxnet as mx
def logistic(z):
return 1. / (1. + nd.exp(-z))
x = nd.arange(-5, 5, .1)
y = logistic(x)
plt.plot(x.asnumpy(),y.asnumpy())
plt.show()
Because the sigmoid outputs a value between 0 and 1, it’s more reasonable to think of it as a probability.
Note that an input of 0 gives a value of .5. So in the common case, where we want to predict positive
whenever the probability is greater than .5 and negative whenever the probability is less than .5, we can just
look at the sign of 𝑤𝑇 𝑥 + 𝑏.
3.10.1 Binary cross-entropy loss

Now that we’ve got a model that outputs probabilities, we need to choose a loss function. When we wanted
to predict how much, we used squared error (𝑦 − 𝑦ˆ)2 , as our measure our model’s performance.

Since now we’re thinking about outputing probabilities, one natural objective is to say that we should choose
the weights that give the actual labels in the training data the highest probability.
max 𝑃𝜃 ((𝑦1 , ..., 𝑦𝑛 )|𝑥1 , ..., 𝑥𝑛 )

𝜃
Because each example is independent of the others, and each label depends only on the features of the
corresponding examples, we can rewrite the above as
max 𝑃𝜃 (𝑦1 |𝑥1 )𝑃𝜃 (𝑦2 |𝑥2 )...𝑃 (𝑦𝑛 |𝑥𝑛 )

𝜃
This function is a product over the examples, but in general, because we want to train by stochastic gradient
descent, it’s a lot easier to work with a loss function that breaks down as a sum over the training examples.
max log 𝑃𝜃 (𝑦1 |𝑥1 ) + ... + log 𝑃 (𝑦𝑛 |𝑥𝑛 )

𝜃
Because we typically express our objective as a loss we can just flip the sign, giving us the negative log
probability:
(︃ 𝑛 )︃
∑︁
min − log 𝑃𝜃 (𝑦𝑖 |𝑥𝑖 )
𝜃
𝑖=1
If we interpret 𝑦ˆ𝑖 as the probability that the 𝑖-th example belongs to the positive class (i.e 𝑦𝑖 = 1), then 1 − 𝑦ˆ𝑖
is the probability that the 𝑖-th example belongs to the negative class (i.e 𝑦𝑖 = 0). This is equivalent to saying
{︃
𝑦ˆ𝑖 , if 𝑦𝑖 = 1
𝑃𝜃 (𝑦𝑖 |𝑥𝑖 ) =
1 − 𝑦ˆ𝑖 , if 𝑦𝑖 = 0
which can be written in a more compact form
𝑃𝜃 (𝑦𝑖 |𝑥𝑖 ) = 𝑦ˆ𝑖𝑦𝑖 (1 − 𝑦ˆ𝑖 )1−𝑦𝑖
Thus we can express our learning objective as:

𝑛
∑︁
^) = −
ℓ(𝑦, 𝑦 𝑦𝑖 log 𝑦ˆ𝑖 + (1 − 𝑦𝑖 ) log(1 − 𝑦ˆ𝑖 ).
𝑖=1
If you’re learning machine learning for the first time, that might have been too much information too quickly.
Let’s take a look at this loss function and break down what’s going on more slowly. The loss function consists
of two terms, 𝑦𝑖 log 𝑦ˆ𝑖 and (1 − 𝑦𝑖 ) log(1 − 𝑦ˆ𝑖 ). Because 𝑦𝑖 only takes values 0 or 1, for a given data point,
one of these terms disappears. When 𝑦𝑖 is 1, this loss says that we should maximize log 𝑦ˆ𝑖 , giving higher
probability to the correct answer. When 𝑦𝑖 is 0, this loss function takes value log(1 − 𝑦ˆ𝑖 ). That says that we
should maximize the value 1 − 𝑦ˆ which we already know is the probability assigned to 𝑥𝑖 belonging to the
negative class.
Note that this loss function is commonly called log loss and is also commonly referred to as binary cross
entropy. It is a special case of negative log likelihood. And it is a special case of cross-entropy, which can
apply to the multi-class (> 2) setting.
While for linear regression, we demonstrated a completely different implementation from scratch and with
‘‘gluon‘‘, here we’re going to demonstrate how we can mix and match the two. We’ll use gluon for our
modeling, but we’ll write our loss function from scratch.

3.10.2 Data
As usual, we’ll want to work out these concepts using a real dataset. This time around, we’ll use the
Adult dataset taken from the UCI repository. The dataset was constructed by Barry Becker from 1994
census data. In its original form, the dataset contained 14 features, including age, education, occupation,
sex, native-country, among others. In this version, hosted by National Taiwan University, the data have
been re-processed to 123 binary features each representing quantiles among the original features. The label
is a binary indicator indicating whether the person corresponding to each row made more (𝑦𝑖 = 1) or less
(𝑦𝑖 = 0) than $50,000 of income in 1994. The dataset we’re working with contains 30,956 training examples
and 1,605 examples set aside for testing. We can read the datasets into main memory like so:
In [ ]: data_ctx = mx.cpu()
# Change this to `mx.gpu(0) if you would like to train on an NVIDIA GPU
with open("../data/adult/a1a.train") as f:
train_raw = f.read()
with open("../data/adult/a1a.test") as f:
test_raw = f.read()
The data consists of lines like the following:

-1 4:1 6:1 15:1 21:1 35:1 40:1 57:1 63:1 67:1 73:1 74:1 77:1 80:1 83:1
\n
The first entry in each row is the value of the label. The following tokens are the indices of the non-zero
features. The number 1 here is redundant. But we don’t always have control over where our data comes
from, so we might as well get used to mucking around with weird file formats. Let’s write a simple script to
process our dataset.
In [ ]: def process_data(raw_data):
train_lines = raw_data.splitlines()
num_examples = len(train_lines)
num_features = 123
X = nd.zeros((num_examples, num_features), ctx=data_ctx)
Y = nd.zeros((num_examples, 1), ctx=data_ctx)
for i, line in enumerate(train_lines):
tokens = line.split()
label = (int(tokens[0]) + 1) / 2 # Change label from {-1,1} to {0,1}
Y[i] = label
for token in tokens[1:]:
index = int(token[:-2]) - 1
X[i, index] = 1
return X, Y
In [ ]: Xtrain, Ytrain = process_data(train_raw)
Xtest, Ytest = process_data(test_raw)
We can now verify that our data arrays have the right shapes.
In [ ]: print(Xtrain.shape)
print(Ytrain.shape)
print(Xtest.shape)
print(Ytest.shape)

We can also check the fraction of positive examples in our training and test sets. This will give us one
nice (necessary but insufficient) sanity check that our training and test data really are drawn from the same
distribution.
In [ ]: print(nd.sum(Ytrain)/len(Ytrain))
print(nd.sum(Ytest)/len(Ytest))
3.10.3 Instantiate a dataloader

In [ ]: batch_size = 64
train_data = gluon.data.DataLoader(gluon.data.ArrayDataset(Xtrain, Ytrain),

test_data = gluon.data.DataLoader(gluon.data.ArrayDataset(Xtest, Ytest),


In [ ]: net = gluon.nn.Dense(1)
net.collect_params().initialize(mx.init.Normal(sigma=1.), ctx=model_ctx)
3.10.5 Instantiate an optimizer

In [ ]: trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.01})
3.10.6 Define log loss

In [ ]: def log_loss(output, y):
yhat = logistic(output)
return - nd.nansum( y * nd.log(yhat) + (1-y) * nd.log(1-yhat))
In [ ]: epochs = 30
loss_sequence = []
num_examples = len(Xtrain)
cumulative_loss = 0
output = net(data)
loss = log_loss(output, label)
loss.backward()
cumulative_loss += nd.sum(loss).asscalar()
print("Epoch %s, loss: %s" % (e, cumulative_loss ))
loss_sequence.append(cumulative_loss)
3.10.7 Visualize the learning curve

In [ ]: # plot the convergence of the estimated loss function
%matplotlib inline

import matplotlib
plt.figure(num=None,figsize=(8, 6))
plt.plot(loss_sequence)
# Adding some bells and whistles to the plot

plt.grid(True, which="both")
plt.xlabel('epoch',fontsize=14)
plt.ylabel('average loss',fontsize=14)
3.10.8 Calculating accuracy

While the negative log likelihood gives us a sense of how well the predicted probabilities agree with the
observed labels, it’s not the only way to assess the performance of our classifiers. For example, at the end of
the day, we’ll often want to apply a threshold to the predicted probabilities in order to make hard predictions.
For example, if we were building a spam filter, we’ll need to either send the email to the spam folder or to
the inbox. In these cases, we might not care about negative log likelihood, but instead we want know how
many errors our classifier makes. Let’s code up a simple script that calculates the accuracy of our classifier.
In [ ]: num_correct = 0.0
num_total = len(Xtest)
for i, (data, label) in enumerate(test_data):
output = net(data)
prediction = (nd.sign(output) + 1) / 2
num_correct += nd.sum(prediction == label)
print("Accuracy: %0.3f (%s/%s)" % (num_correct.asscalar()/num_total, num_correct.as
This isn’t too bad! A naive classifier would predict that nobody had an income greater than $50k (the
majority class). This classifier would achieve an accuracy of roughly 75%. By contrast, our classifier gets
an accuracy of .84 (results may vary a small amount on each run owing to random initializations and random
sampling of the batches).
By now you should have some feeling for the two most fundamental tasks in supervised learning: regression
and classification. In the following chapters we’ll go deeper into these problems, exploring more complex
models, loss functions, optimizers, and training schemes. We’ll also look at more interesting datasets. And
finally, in the following chapters we’ll also look more advanced problems where we want, for example, to
predict more structured objects.
3.10.9 Next:
Softmax regression from scratch
3.11 Multiclass logistic regression from scratch

If you’ve made it through our tutorials on linear regression from scratch, then you’re past the hardest part.
You already know how to load and manipulate data, build computation graphs on the fly, and take derivatives.

You also know how to define a loss function, construct a model, and write your own optimizer. Nearly
all neural networks that we’ll build in the real world consist of these same fundamental parts. The main
differences will be the type and scale of the data and the complexity of the models. And every year or two,
a new hipster optimizer comes around, but at their core they’re all subtle variations of stochastic gradient
descent.
In the previous chapter, we introduced logistic regression, a classic algorithm for performing binary classi-
fication. We implemented a model
𝑦ˆ = 𝜎(𝑥𝑤𝑇 + 𝑏)
𝑤ℎ𝑒𝑟𝑒 : 𝑚𝑎𝑡ℎ : ‘𝜎‘𝑖𝑠𝑡ℎ𝑒𝑠𝑖𝑔𝑚𝑜𝑖𝑑𝑠𝑞𝑢𝑎𝑠ℎ𝑖𝑛𝑔𝑓 𝑢𝑛𝑐𝑡𝑖𝑜𝑛.
This activation function on the final layer was crucial because it forced our outputs to take values in the
range [0,1]. That allowed us to interpret these outputs as probabilties. We then updated our parameters to
give the true labels (which take values either 1 or 0) the highest probability. In that tutorial, we looked at
predicting whether or not an individual’s income exceeded $50k based on features available in 1994 census
data.
Binary classification is quite useful. We can use it to predict spam vs. not spam or cancer vs not cancer.
But not every problem fits the mold of binary classification. Sometimes we encounter a problem where each
example could belong to one of 𝑘 classes. For example, a photograph might depict a cat or a dog or a zebra
or . . . (you get the point). Given 𝑘 classes, the most naive way to solve a multiclass classification problem
is to train 𝑘 different binary classifiers 𝑓𝑖 (𝑥). We could then predict that an example 𝑥 belongs to the class
𝑖 for which the probability that the label applies is highest:
max 𝑓𝑖 (𝑥)
𝑖
There’s a smarter way to go about this. We could force the output layer to be a discrete probability distri-
bution over the 𝑘 classes. To be a valid probability distribution, we’ll want the output 𝑦ˆ to (i) contain only
non-negative values, and (ii) sum to 1. We accomplish this by using the softmax function. Given an input
vector 𝑧, softmax does two things. First, it exponentiates (elementwise) 𝑒𝑧 , forcing all values to be strictly
positive. Then it normalizes so that all values sum to 1. Following the softmax operation computes the
following
𝑒𝑧
softmax(𝑧) = ∑︀𝑘
𝑧𝑖
𝑖=1 𝑒
Because now we have 𝑘 outputs and not 1 we’ll need weights connecting each of our inputs to each of our
outputs. Graphically, the network looks something like this:
We can represent these weights one for each input node, output node pair in a matrix 𝑊 . We generate the
linear mapping from inputs to outputs via a matrix-vector product 𝑥𝑊 + 𝑏. Note that the bias term is now
a vector, with one component for each output node. The whole model, including the activation function can
be written:
𝑦ˆ = softmax(𝑥𝑊 + 𝑏)
This model is sometimes called multiclass logistic regression. Other common names for it include softmax
regression and multinomial regression. For these concepts to sink in, let’s actually implement softmax re-
3.11. Multiclass logistic regression from scratch 83

gression, and pick a slightly more interesting dataset this time. We’re going to classify images of handwritten
digits like these:
3.11.1 About batch training

In the above, we used plain lowercase letters for scalar variables, bolded lowercase letters for row vectors,
and uppercase letters for matrices. Assume we have 𝑑 inputs and 𝑘 outputs. Let’s note the shapes of the
various variables explicitly as follows:
𝑧 = 𝑥 𝑊 + 𝑏
1×𝑘 1×𝑑 𝑑×𝑘 1×𝑘
Often we would one-hot encode the output label, for example 𝑦ˆ = 5 would be 𝑦 ^𝑜𝑛𝑒−ℎ𝑜𝑡 =
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0] when one-hot encoded for a 10-class classfication problem. So 𝑦ˆ = softmax(𝑧)
becomes
^𝑜𝑛𝑒−ℎ𝑜𝑡 = softmax𝑜𝑛𝑒−ℎ𝑜𝑡 ( 𝑧 )
𝑦
1×𝑘 1×𝑘
When we input a batch of 𝑚 training examples, we would have matrix 𝑋 that is the vertical stacking of
𝑚×𝑑
individual training examples 𝑥𝑖 , due to the choice of using row vectors.
⎡ ⎤ ⎡ ⎤
𝑥1 𝑥11 𝑥12 𝑥13 . . . 𝑥1𝑑
⎢ 𝑥2 ⎥ ⎢ 𝑥21 𝑥22 𝑥23 . . . 𝑥2𝑑 ⎥
𝑋=⎢ . ⎥=⎢ .
⎢ ⎥ ⎢ ⎥
.. .. .. .. ⎥
⎣ .. ⎦ ⎣ .. . . . . ⎦
𝑥𝑚 𝑥𝑚1 𝑥𝑚2 𝑥𝑚3 . . . 𝑥𝑚𝑑
Under this batch training situation, 𝑦

^𝑜𝑛𝑒−ℎ𝑜𝑡 = softmax(𝑧) turns into
𝑌 = softmax(𝑍) = softmax(𝑋𝑊 + 𝐵)

where matrix 𝐵 is formed by having 𝑚 copies of 𝑏 as follows

𝑚×𝑘
⎡ ⎤ ⎡ ⎤
𝑏 𝑏1 𝑏2 𝑏3 . . . 𝑏𝑘
⎢𝑏⎥ ⎢𝑏1 𝑏2 𝑏3 . . . 𝑏𝑘 ⎥
𝐵 = ⎢.⎥ = ⎢ .
⎢ ⎥ ⎢ ⎥
.. .. . . .⎥
⎣ .. ⎦ ⎣ .. . . . .. ⎦
𝑏 𝑏1 𝑏2 𝑏3 . . . 𝑏𝑘
In actual implementation we can often get away with using 𝑏 directly instead of 𝐵 in the equation for 𝑍
above, due to broadcasting.
Each row of matrix 𝑍 corresponds to one training example. The softmax function operates on each row
𝑚×𝑘
of matrix 𝑍 and returns a matrix 𝑌 , each row of which corresponds to the one-hot encoded prediction of
𝑚×𝑘
one training example.
3.11.2 Imports
To start, let’s import the usual libraries.
In [ ]: from __future__ import print_function
import numpy as np
import mxnet as mx
mx.random.seed(1)
3.11.3 Set Context

We’ll also want to set the compute context where our data will typically live and where we’ll be doing
our modeling. Feel free to go ahead and change model_ctx to mx.gpu(0) if you’re running on an
appropriately endowed machine.
In [ ]: data_ctx = mx.cpu()
# model_ctx = mx.gpu()
3.11.4 The MNIST dataset

This time we’re going to work with real data, each a 28 by 28 centrally cropped black & white photograph
of a handwritten digit. Our task will be come up with a model that can associate each image with the digit
(0-9) that it depicts.
To start, we’ll use MXNet’s utility for grabbing a copy of this dataset. The datasets accept a transform
callback that can preprocess each item. Here we cast data and label to floats and normalize data to range [0,
1]:
In [ ]: def transform(data, label):
return data.astype(np.float32)/255, label.astype(np.float32)
mnist_train = gluon.data.vision.MNIST(train=True, transform=transform)
mnist_test = gluon.data.vision.MNIST(train=False, transform=transform)
There are two parts of the dataset for training and testing. Each part has N items and each item is a tuple of
an image and a label:

In [ ]: image, label = mnist_train[0]

print(image.shape, label)
Note that each image has been formatted as a 3-tuple (height, width, channel). For color images, the channel
would have 3 dimensions (red, green and blue).
3.11.5 Record the data and label shapes

Generally, we don’t want our model code to care too much about the exact shape of our input data. This
way we could switch in a different dataset without changing the code that follows. Let’s define variables to
hold the number of inputs and outputs.
In [ ]: num_inputs = 784
num_outputs = 10
Machine learning libraries generally expect to find images in (batch, channel, height, width) format. How-
ever, most libraries for visualization prefer (height, width, channel). Let’s transpose our image into the
expected shape. In this case, matplotlib expects either (height, width) or (height, width, channel) with RGB
channels, so let’s broadcast our single channel to 3.
In [ ]: im = mx.nd.tile(image, (1,1,3))
print(im.shape)
Now we can visualize our image and make sure that our data and labels line up.
In [ ]: import matplotlib.pyplot as plt
plt.imshow(im.asnumpy())
plt.show()
Ok, that’s a beautiful five.
3.11.6 Load the data iterator

Now let’s load these images into a data iterator so we don’t have to do the heavy lifting.
train_data = mx.gluon.data.DataLoader(mnist_train, batch_size, shuffle=True)
We’re also going to want to load up an iterator with test data. After we train on the training dataset we’re
going to want to test our model on the test data. Otherwise, for all we know, our model could be doing
something stupid (or treacherous?) like memorizing the training examples and regurgitating the labels on
command.
In [ ]: test_data = mx.gluon.data.DataLoader(mnist_test, batch_size, shuffle=False)
3.11.7 Allocate model parameters

Now we’re going to define our model. For this example, we’re going to ignore the multimodal structure of
our data and just flatten each image into a single 1D vector with 28x28 = 784 components. Because our task
is multiclass classification, we want to assign a probability to each of the classes 𝑃 (𝑌 = 𝑐 | 𝑋) given the
input 𝑋. In order to do this we’re going to need one vector of 784 weights for each class, connecting each
feature to the corresponding output. Because there are 10 classes, we can collect these weights together in a
784 by 10 matrix.

We’ll also want to allocate one offset for each of the outputs. We call these offsets the bias term and collect
them in the 10-dimensional array b.
In [ ]: W = nd.random_normal(shape=(num_inputs, num_outputs),ctx=model_ctx)
b = nd.random_normal(shape=num_outputs,ctx=model_ctx)
params = [W, b]
As before, we need to let MXNet know that we’ll be expecting gradients corresponding to each of these
parameters during training.
In [ ]: for param in params:
param.attach_grad()
3.11.8 Multiclass logistic regression

In the linear regression tutorial, we performed regression, so we had just one output 𝑦ˆ and tried to push this
value as close as possible to the true target 𝑦. Here, instead of regression, we are performing classification,
where we want to assign each input 𝑋 to one of 𝐿 classes.
The basic modeling idea is that we’re going to linearly map our input 𝑋 onto 10 different real valued
outputs y_linear. Then, before outputting these values, we’ll want to normalize them so that they are
non-negative and sum to 1. This normalization allows us to interpret the output 𝑦ˆ as a valid probability
distribution.
In [ ]: def softmax(y_linear):
exp = nd.exp(y_linear-nd.max(y_linear, axis=1).reshape((-1,1)))
norms = nd.sum(exp, axis=1).reshape((-1,1))
return exp / norms
In [ ]: sample_y_linear = nd.random_normal(shape=(2,10))
sample_yhat = softmax(sample_y_linear)
print(sample_yhat)
Let’s confirm that indeed all of our rows sum to 1.

In [ ]: print(nd.sum(sample_yhat, axis=1))
But for small rounding errors, the function works as expected.

Now we’re ready to define our model
In [ ]: def net(X):
y_linear = nd.dot(X, W) + b
yhat = softmax(y_linear)
return yhat
3.11.10 The cross-entropy loss function

Before we can start training, we’re going to need to define a loss function that makes sense when our
prediction is a probability distribution.

The relevant loss function here is called cross-entropy and it may be the most common loss function you’ll
find in all of deep learning. That’s because at the moment, classification problems tend to be far more
abundant than regression problems.
The basic idea is that we’re going to take a target Y that has been formatted as a one-hot vector, meaning
one value corresponding to the correct label is set to 1 and the others are set to 0, e.g. [0, 1, 0, 0, 0,
0, 0, 0, 0, 0].
The basic idea of cross-entropy loss is that we only care about how much probability the prediction assigned
to the correct label. In other words, for true label 2, we only care about the component of yhat corresponding
to 2. Cross-entropy attempts to maximize the log-likelihood given to the correct labels.
In [ ]: def cross_entropy(yhat, y):
return - nd.sum(y * nd.log(yhat+1e-6))
3.11.11 Optimizer
For this example we’ll be using the same stochastic gradient descent (SGD) optimizer as last time.
In [ ]: def SGD(params, lr):
3.11.12 Write evaluation loop to calculate accuracy

While cross-entropy is nice, differentiable loss function, it’s not the way humans usually evaluate perfor-
mance on multiple choice tasks. More commonly we look at accuracy, the number of correct answers
divided by the total number of questions. Let’s write an evaluation loop that will take a data iterator and a
network, returning the model’s accuracy averaged over the entire dataset.
In [ ]: def evaluate_accuracy(data_iterator, net):
numerator = 0.
denominator = 0.
for i, (data, label) in enumerate(data_iterator):
data = data.as_in_context(model_ctx).reshape((-1,784))
label_one_hot = nd.one_hot(label, 10)
output = net(data)
predictions = nd.argmax(output, axis=1)
numerator += nd.sum(predictions == label)
denominator += data.shape[0]
return (numerator / denominator).asscalar()
Because we initialized our model randomly, and because roughly one tenth of all examples belong to each
of the ten classes, we should have an accuracy in the ball park of .10.
In [ ]: evaluate_accuracy(test_data, net)

In [ ]: epochs = 5
cumulative_loss = 0


output = net(data)
loss = cross_entropy(output, label_one_hot)
loss.backward()
test_accuracy = evaluate_accuracy(test_data, net)

train_accuracy = evaluate_accuracy(train_data, net)
print("Epoch %s. Loss: %s, Train_acc %s, Test_acc %s" % (e, cumulative_loss/num
3.11.14 Using the model for prediction

Let’s make it more intuitive by picking 10 random data points from the test set and use the trained model
for predictions.
In [ ]: # Define the function to do prediction
def model_predict(net,data):
output = net(data)
return nd.argmax(output, axis=1)
# let's sample 10 random data points from the test set

sample_data = mx.gluon.data.DataLoader(mnist_test, 10, shuffle=True)
for i, (data, label) in enumerate(sample_data):
print(data.shape)
im = nd.transpose(data,(1,0,2,3))
im = nd.reshape(im,(28,10*28,1))
imtiles = nd.tile(im, (1,1,3))
plt.imshow(imtiles.asnumpy())
plt.show()
pred=model_predict(net,data.reshape((-1,784)))
print('model predictions are:', pred)
break
3.11.15 Conclusion
Jeepers. We can get nearly 90% accuracy at this task just by training a linear model for a few seconds! You
might reasonably conclude that this problem is too easy to be taken seriously by experts.
But until recently, many papers (Google Scholar says 13,800) were published using results obtained on this
data. Even this year, I reviewed a paper whose primary achievement was an (imagined) improvement in
performance. While MNIST can be a nice toy dataset for testing new ideas, we don’t recommend writing
papers with it.
3.11.16 Next
Softmax regression with gluon

3.12 Multiclass logistic regression with gluon

Now that we’ve built a logistic regression model from scratch, let’s make this more efficient with gluon.
If you completed the corresponding chapters on linear regression, you might be tempted rest your eyes a
little in this one. We’ll be using gluon in a rather similar way and since the interface is reasonably well
designed, you won’t have to do much work. To keep you awake we’ll introduce a few subtle tricks.
Let’s start by importing the standard packages.
import mxnet as mx
from mxnet import gluon
import numpy as np

Now, let’s set the context. In the linear regression tutorial we did all of our computation on the cpu (mx.
cpu()) just to keep things simple. When you’ve got 2-dimensional data and scalar labels, a smartwatch
can probably handle the job. Already, in this tutorial we’ll be working with a considerably larger dataset.
If you happen to be running this code on a server with a GPU and installed the GPU-enabled version of
MXNet (or remembered to build MXNet with CUDA=1), you might want to substitute the following line for
its commented-out counterpart.
# model_ctx = mx.gpu()
3.12.2 The MNIST Dataset

We won’t suck up too much wind describing the MNIST dataset for a second time. If you’re unfamiliar with
the dataset and are reading these chapters out of sequence, take a look at the data section in the previous
chapter on softmax regression from scratch.
We’ll load up data iterators corresponding to the training and test splits of MNIST dataset.
num_inputs = 784
num_outputs = 10
train_data = mx.gluon.data.DataLoader(mx.gluon.data.vision.MNIST(train=True, trans
batch_size, shuffle=True)
test_data = mx.gluon.data.DataLoader(mx.gluon.data.vision.MNIST(train=False, trans
batch_size, shuffle=False)
We’re also going to want to load up an iterator with test data. After we train on the training dataset we’re
going to want to test our model on the test data. Otherwise, for all we know, our model could be doing
something stupid (or treacherous?) like memorizing the training examples and regurgitating the labels on
command.

3.12.3 Multiclass Logistic Regression

Now we’re going to define our model. Remember from our tutorial on linear regression with ‘‘gluon‘
<./P02-C02-linear-regression-gluon>‘__ that we add Dense layers by calling net.add(gluon.nn.
Dense(num_outputs)). This leaves the parameter shapes under-specified, but gluon will infer the
desired shapes the first time we pass real data through the network.
In [55]: net = gluon.nn.Dense(num_outputs)
3.12.4 Parameter initialization

As before, we’re going to register an initializer for our parameters. Remember that gluon doesn’t even
know what shape the parameters have because we never specified the input dimension. The parameters will
get initialized during the first call to the forward method.
In [56]: net.collect_params().initialize(mx.init.Normal(sigma=1.), ctx=model_ctx)
3.12.5 Softmax Cross Entropy Loss

Note, we didn’t have to include the softmax layer because MXNet’s has an efficient function that simulta-
neously computes the softmax activation and cross-entropy loss. However, if ever need to get the output
probabilities,
In [57]: softmax_cross_entropy = gluon.loss.SoftmaxCrossEntropyLoss()
3.12.6 Optimizer
And let’s instantiate an optimizer to make our updates
3.12.7 Evaluation Metric

This time, let’s simplify the evaluation code by relying on MXNet’s built-in metric package.
In [59]: def evaluate_accuracy(data_iterator, net):
acc = mx.metric.Accuracy()
output = net(data)
acc.update(preds=predictions, labels=label)
return acc.get()[1]
Because we initialized our model randomly, and because roughly one tenth of all examples belong to each
of the ten classes, we should have an accuracy in the ball park of .10.
In [60]: evaluate_accuracy(test_data, net)
Out[60]: 0.1154

moving_loss = 0.
3.12. Multiclass logistic regression with gluon 91

cumulative_loss = 0
output = net(data)
loss = softmax_cross_entropy(output, label)
loss.backward()

print("Epoch %s. Loss: %s, Train_acc %s, Test_acc %s" % (e, cumulative_loss/nu
Epoch 0. Loss: 0.000342435105642, Train_acc 0.793733333333, Test_acc 0.809
3.12.9 Visualize predictions

output = net(data.as_in_context(model_ctx))

sample_data = mx.gluon.data.DataLoader(mx.gluon.data.vision.MNIST(train=False, tra
10, shuffle=True)
print(data.shape)
plt.show()
break
(10, 28, 28, 1)

model predictions are:

[ 9. 9. 0. 4. 7. 6. 8. 2. 7. 3.]
3.12.10 Next
Overfitting and regularization from scratch
3.13 Overfitting and regularization

In the last tutorial, we introduced the task of multiclass classification. We showed how you can tackle this
problem with a linear model called logistic regression. Owing to some amount of randomness, you might get
slightly different results, but when I ran the notebook, the model achieved 88.1% accuracy on the training
data and actually did slightly (but not significantly) better on the test data than on the training data.
Not every algorithm that performs well on training data will also perform well on test data. Take, for
example, a trivial algorithm that memorizes its inputs and stores the associated labels. This model would
have 100% accuracy on training data but would have no way of making any prediction at all on previously
unseen data.
The goal of supervised learning is to produce models that generalize to previously unseen data. When a
model achieves low error on training data but performs much worse on test data, we say that the model has
overfit. This means that the model has caught on to idiosyncratic features of the training data (e.g. one “2”
happened to have a white pixel in the top-right corner), but hasn’t really picked up on general patterns.
We can express this more formally. The quantity we really care about is the test error 𝑒. Because this
quantity reflects the error of our model when generalized to previously unseen data, we commonly call it the
generalization error. When we have simple models and abundant data, we expect the generalization error
to resemble the training error. When we work with more complex models and fewer examples, we expect
the training error to go down but the generalization gap to grow. Fixing the size of the dataset, the following
graph should give you some intuition about what we generally expect to see.
What precisely constitutes model complexity is a complex matter. Many factors govern whether a model will
generalize well. For example a model with more parameters might be considered more complex. A model
whose parameters can take a wider range of values might be more complex. Often with neural networks, we
think of a model that takes more training steps as more complex, and one subject to early stopping as less
complex.
It can be difficult to compare the complexity among members of very different model classes (say decision
trees versus neural networks). Researchers in the field of statistical learning theory have developed a large
body of mathematical analysis that formulizes the notion of model complexity and provides guarantees on
the generalization error for simple classes of models. We won’t get into this theory but may delve deeper in
a future chapter. For now a simple rule of thumb is quite useful: A model that can readily explain arbitrary
3.13. Overfitting and regularization 93

facts is what statisticians view as complex, whereas one that has only a limited expressive power but still
manages to explain the data well is probably closer to the truth. In philosophy this is closely related to
Popper’s criterion of falsifiability of a scientific theory: a theory is good if it fits data and if there are specific
tests which can be used to disprove it. This is important since all statistical estimation is post hoc, i.e. we
estimate after we observe the facts, hence vulnerable to the associated fallacy. Ok, enough of philosophy,
let’s get to more tangible issues.
To give you some intuition in this chapter, we’ll focus on a few factors that tend to influence the generaliz-
ability of a model class:
1. The number of tunable parameters. When the number of tunable parameters, sometimes denoted
as the number of degrees of freedom, is large, models tend to be more susceptible to overfitting.
2. The values taken by the parameters. When weights can take a wider range of values, models can
be more susceptible to over fitting.
3. The number of training examples. It’s trivially easy to overfit a dataset containing only one or two
examples even if your model is simple. But overfitting a dataset with millions of examples requires
an extremely flexible model.
When classifying handwritten digits before, we didn’t overfit because our 60,000 training examples far out
numbered the 784 × 10 = 7, 840 weights plus 10 bias terms, which gave us far fewer parameters than
training examples. Let’s see how things can go wrong. We begin with our import ritual.
import mxnet as mx
import mxnet.ndarray as nd
from mxnet import autograd

import numpy as np
ctx = mx.cpu()
mx.random.seed(1)
# for plotting purposes

%matplotlib inline
import matplotlib
3.13.1 Load the MNIST dataset

In [ ]: mnist = mx.test_utils.get_mnist()
num_examples = 1000
batch_size = 64
train_data = mx.gluon.data.DataLoader(
mx.gluon.data.ArrayDataset(mnist["train_data"][:num_examples],
mnist["train_label"][:num_examples].astype(np.float3
test_data = mx.gluon.data.DataLoader(
mx.gluon.data.ArrayDataset(mnist["test_data"][:num_examples],
mnist["test_label"][:num_examples].astype(np.float32
3.13.2 Allocate model parameters and define model

We pick a simple linear model 𝑓 (𝑥) = 𝑊 𝑥 + 𝑏 with subsequent softmax, i.e. 𝑝(𝑦|𝑥) ∝ exp(𝑓 (𝑥)𝑦 ). This
is about as simple as it gets.
In [ ]: W = nd.random_normal(shape=(784,10))
b = nd.random_normal(shape=10)
params = [W, b]

param.attach_grad()
def net(X):
y_linear = nd.dot(X, W) + b
yhat = nd.softmax(y_linear, axis=1)
return yhat
3.13.3 Define loss function and optimizer

A sensible thing to do is to minimize the negative log-likelihood of the data, i.e. − log 𝑝(𝑦|𝑥). Statisticians
have proven that this is actually the most efficient estimator, i.e. the one that makes the most use of the data
provided. This is why it is so popular.
return - nd.sum(y * nd.log(yhat), axis=0, exclude=True)
def SGD(params, lr):



Ultimately we want to recognize digits. This is a bit different from knowing the probability of a digit - when
given an image we need to decide what digit we are seeing, regardless of how uncertain we are. Hence we
measure the number of actual misclassifications.
For diagnosis purposes, it is always a good idea to calculate the average loss function.
numerator = 0.
denominator = 0.
loss_avg = 0.
data = data.as_in_context(ctx).reshape((-1,784))
label = label.as_in_context(ctx)
output = net(data)
loss_avg = loss_avg*i/(i+1) + nd.mean(loss).asscalar()/(i+1)
return (numerator / denominator).asscalar(), loss_avg
3.13.5 Write a utility function to plot the learning curves

Just to visualize how loss functions and accuracy changes over the number of iterations.
In [ ]: def plot_learningcurves(loss_tr,loss_ts, acc_tr,acc_ts):
xs = list(range(len(loss_tr)))
f = plt.figure(figsize=(12,6))
fg1 = f.add_subplot(121)
fg1.set_xlabel('epoch',fontsize=14)
fg1.set_title('Comparing loss functions')
fg1.semilogy(xs, loss_tr)
fg1.semilogy(xs, loss_ts)
fg1.grid(True,which="both")
fg1.legend(['training loss', 'testing loss'],fontsize=14)
fg2.set_title('Comparing accuracy')
fg2.plot(xs, acc_tr)
fg2.plot(xs, acc_ts)
fg2.legend(['training accuracy', 'testing accuracy'],fontsize=14)


We now train the model until there is no further improvement. Our approach is actually a bit naive since
we will keep the learning rate unchanged but it fits the purpose (we want to keep the code simple and avoid
confusing anyone with further tricks for adjusting learning rate schedules).
In [ ]: epochs = 1000
moving_loss = 0.
niter=0
loss_seq_train = []
loss_seq_test = []
acc_seq_train = []
acc_seq_test = []
output = net(data)
loss.backward()
SGD(params, .001)
##########################
# Keep a moving average of the losses
##########################
niter +=1
moving_loss = .99 * moving_loss + .01 * nd.mean(loss).asscalar()
est_loss = moving_loss/(1-0.99**niter)
test_accuracy, test_loss = evaluate_accuracy(test_data, net)

train_accuracy, train_loss = evaluate_accuracy(train_data, net)
# save them for later

loss_seq_train.append(train_loss)
loss_seq_test.append(test_loss)
acc_seq_train.append(train_accuracy)
acc_seq_test.append(test_accuracy)
if e % 100 == 99:
print("Completed epoch %s. Train Loss: %s, Test Loss %s, Train_acc %s, Test
(e+1, train_loss, test_loss, train_accuracy, test_accuracy))
## Plotting the learning curves

plot_learningcurves(loss_seq_train,loss_seq_test,acc_seq_train,acc_seq_test)

3.13.7 What Happened?

By the 700th epoch, our model achieves 100% accuracy on the training data. However, it only classifies
75% of the test examples accurately. This is a clear case of overfitting. At a high level, there’s a reason this
went wrong. Because we have 7450 parameters and only 1000 data points, there are actually many settings
of the parameters that could produce 100% accuracy on training data.
To get some intuition imagine that we wanted to fit a dataset with 2 dimensional data and 2 data points. Our
model has three degrees of freedom, and thus for any dataset can find an arbitrary number of separators that
will perfectly classify our training points. Note below that we can produce completely orthogonal separators
that both classify our training data perfectly. Even if it seems preposterous that they could both describe our
training data well.
3.13.8 Regularization
Now that we’ve characterized the problem of overfitting, we can begin talking about some solutions. Broadly
speaking the family of techniques geared towards mitigating overfitting are referred to as regularization. The
core idea is this: when a model is overfitting, its training error is substantially lower than its test error. We’re
already doing as well as we possibly can on the training data, but our test data performance leaves something
to be desired. Typically, regularization techniques attempt to trade off our training performance in exchange
for lowering our test error.
There are several straightforward techniques we might employ. Given the intuition from the previous chart,
we might attempt to make our model less complex. One way to do this would be to lower the number
of free parameters. For example, we could throw away some subset of our input features (and thus the
corresponding parameters) that we thought were least informative.

Another approach is to limit the values that our weights might take. One common approach is to force
the weights to take small values. [give more intuition with example of polynomial curve fitting] We can
accomplish this by changing our optimization objective to penalize the value of our weights. The most
popular regularizer is the ℓ22 norm. For linear models, ℓ22 regularization has the additional benefit that it
makes the solution unique, even when our model is overparameterized.
∑︁
𝑦 − 𝑦)2 + 𝜆‖w‖22
(ˆ
𝑖
Here, ‖w‖ is the ℓ22 norm and 𝜆 is a hyper-parameter that determines how aggressively we want to push the
weights towards 0. In code, we can express the ℓ22 penalty succinctly:
In [ ]: def l2_penalty(params):
penalty = nd.zeros(shape=1)
penalty = penalty + nd.sum(param ** 2)
return penalty
3.13.9 Re-initializing the parameters

Just for good measure to ensure that the results in the second training run don’t depend on the first one.
param[:] = nd.random_normal(shape=param.shape)
3.13.10 Training L2-regularized logistic regression

In [ ]: epochs = 1000
moving_loss = 0.

l2_strength = .1
niter=0
loss_seq_train = []
loss_seq_test = []
acc_seq_train = []
acc_seq_test = []
output = net(data)
loss = nd.sum(cross_entropy(output, label_one_hot)) + l2_strength * l2_
loss.backward()
SGD(params, .001)
##########################
##########################
niter +=1
test_accuracy, test_loss = evaluate_accuracy(test_data, net)

train_accuracy, train_loss = evaluate_accuracy(train_data, net)

if e % 100 == 99:

3.13.11 Analysis
By adding 𝐿2 regularization we were able to increase the performance on test data from 75% accuracy to
83% accuracy. That’s a 32% reduction in error. In a lot of applications, this big an improvement can make
the difference between a viable product and useless system. Note that L2 regularization is just one of many
ways of controlling capacity. Basically we assumed that small weight values are good. But there are many
more ways to constrain the values of the weights:

• We could require
∑︀ that the total sum of the weights is small. That is what 𝐿1 regularization does via
the penalty 𝑖 |𝑤𝑖 |.
• We could require that the largest weight is not too large. This is what 𝐿∞ regularization does via the
penalty max𝑖 |𝑤𝑖 |.
• We could require that the number of nonzero
∑︀ weights is small, i.e. that the weight vectors are sparse.
This is what the 𝐿0 penalty does, i.e. 𝑖 𝐼{𝑤𝑖 ̸= 0}. This penalty is quite difficult to deal with
explicitly since it is nonsmooth. There is a lot of research that shows how to solve this problem
approximately using an 𝐿1 penalty.
From left to right: 𝐿2 regularization, which constrains the parameters to a ball, 𝐿1 regularization, which
constrains the parameters to a diamond (for lack of a better name, this is often referred to as an 𝐿1 -ball), and
𝐿∞ regularization, which constrains the parameters to a hypercube.
All of this raises the question of why regularization is any good. After all, choice is good and giving
our model more flexibility ought to be better (e.g. there are plenty of papers which show improvements
on ImageNet using deeper networks). What is happening is somewhat more subtle. Allowing for many
different parameter values allows our model to cherry pick a combination that is just right for all the training
data it sees, without really learning the underlying mechanism. Since our observations are likely noisy,
this means that we are trying to approximate the errors at least as much as we’re learning what the relation
between data and labels actually is. There is an entire field of statistics devoted to this issue - Statistical
Learning Theory. For now, a few simple rules of thumb suffice:
• Fewer parameters tend to be better than more parameters.
• Better engineering for a specific problem that takes the actual problem into account will lead to better
models, due to the prior knowledge that data scientists have about the problem at hand.
• 𝐿2 is easier to optimize for than 𝐿1 . In particular, many optimizers will not work well out of the box
for 𝐿1 . Using the latter requires something called proximal operators.
• Dropout and other methods to make the model robust to perturbations in the data often work better
than off-the-shelf 𝐿2 regularization.
We conclude with an XKCD Cartoon which captures the entire situation more succinctly than the proceeding
paragraph.
3.13.12 Next
Overfitting and regularization with gluon


3.14 Overfitting and regularization (with gluon)

Now that we’ve built a regularized logistic regression model from scratch, let’s make this more efficient with
gluon. We recommend that you read that section for a description as to why regularization is a good idea.
As always, we begin by loading libraries and some data.
[REFINED DRAFT - RELEASE STAGE: CATFOOD]
import mxnet as mx
import numpy as np
ctx = mx.cpu()
# for plotting purposes

%matplotlib inline
import matplotlib
3.14.1 The MNIST Dataset

num_examples = 1000
batch_size = 64
train_data = mx.gluon.data.DataLoader(
mx.gluon.data.ArrayDataset(mnist["train_data"][:num_examples],
mnist["train_label"][:num_examples].astype(np.float3
test_data = mx.gluon.data.DataLoader(
mx.gluon.data.ArrayDataset(mnist["test_data"][:num_examples],
mnist["test_label"][:num_examples].astype(np.float32
3.14.2 Multiclass Logistic Regression

In [ ]: net = gluon.nn.Sequential()
with net.name_scope():
net.add(gluon.nn.Dense(10))

In [ ]: net.collect_params().initialize(mx.init.Xavier(magnitude=2.24), ctx=ctx)
3.14.4 Softmax Cross Entropy Loss

In [ ]: loss = gluon.loss.SoftmaxCrossEntropyLoss()
3.14. Overfitting and regularization (with gluon) 103

3.14.5 Optimizer
By default gluon tries to keep the coefficients from diverging by using a weight decay penalty. So, to get
the real overfitting experience we need to switch it off. We do this by passing 'wd': 0.0' when we
instantiate the trainer.
In [ ]: trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.01, 'wd':

In [ ]: def evaluate_accuracy(data_iterator, net, loss_fun):
loss_avg = 0.
output = net(data)
loss = loss_fun(output, label)
loss_avg = loss_avg*i/(i+1) + nd.mean(loss).asscalar()/(i+1)
return acc.get()[1], loss_avg
def plot_learningcurves(loss_tr,loss_ts, acc_tr,acc_ts):

xs = list(range(len(loss_tr)))
f = plt.figure(figsize=(12,6))
fg1.set_title('Comparing loss functions')
fg1.semilogy(xs, loss_tr)
fg1.semilogy(xs, loss_ts)
fg1.legend(['training loss', 'testing loss'],fontsize=14)
fg2.set_title('Comparing accuracy')
fg2.plot(xs, acc_tr)
fg2.plot(xs, acc_ts)
fg2.legend(['training accuracy', 'testing accuracy'],fontsize=14)
plt.show()

In [ ]: epochs = 700
moving_loss = 0.
niter=0
loss_seq_train = []
loss_seq_test = []

acc_seq_train = []
acc_seq_test = []
output = net(data)
cross_entropy = loss(output, label)
cross_entropy.backward()
trainer.step(data.shape[0])
##########################
##########################
niter +=1
moving_loss = .99 * moving_loss + .01 * nd.mean(cross_entropy).asscalar()
test_accuracy, test_loss = evaluate_accuracy(test_data, net, loss)

train_accuracy, train_loss = evaluate_accuracy(train_data, net, loss)

if e % 20 == 0:

3.14.8 Regularization
Now let’s see what this mysterious weight decay is all about. We begin with a bit of math. When we add an
L2 penalty to the weights we are effectively adding 𝜆2 ‖𝑤‖2 to the loss. Hence, every time we compute the
gradient it gets an additional 𝜆𝑤 term that is added to 𝑔𝑡 , since this is the very derivative of the L2 penalty.
As a result we end up taking a descent step not in the direction −𝜂𝑔𝑡 but rather in the direction −𝜂(𝑔𝑡 + 𝜆𝑤).
This effectively shrinks 𝑤 at each step by 𝜂𝜆𝑤, thus the name weight decay. To make this work in practice
we just need to set the weight decay to something nonzero.
In [ ]: net.collect_params().initialize(mx.init.Xavier(magnitude=2.24), ctx=ctx, force_rein
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.01, 'wd':
moving_loss = 0.
niter=0
loss_seq_train = []
loss_seq_test = []
3.14. Overfitting and regularization (with gluon) 105

acc_seq_train = []
acc_seq_test = []
output = net(data)
##########################
##########################
niter +=1
moving_loss = .99 * moving_loss + .01 * nd.mean(cross_entropy).asscalar()
test_accuracy, test_loss = evaluate_accuracy(test_data, net,loss)

train_accuracy, train_loss = evaluate_accuracy(train_data, net, loss)

if e % 20 == 0:

As we can see, the test accuracy improves a bit. Note that the amount by which it improves actually depends
on the amount of weight decay. We recommend that you try and experiment with different extents of weight
decay. For instance, a larger weight decay (e.g. 0.01) will lead to inferior performance, one that’s larger still
(0.1) will lead to terrible results. This is one of the reasons why tuning parameters is quite so important in
getting good experimental results in practice.
3.14.9 Next
Learning environments
3.15 The Perceptron

We just employed an optimization method - stochastic gradient descent, without really thinking twice about
why it should work at all. It’s probably worth while to pause and see whether we can gain some intuition
about why this should actually work at all. We start with considering the E. Coli of machine learning

algorithms - the Perceptron. After that, we’ll give a simple convergence proof for SGD. This chapter is not
really needed for practitioners but will help to understand why the algorithms that we use are working at all.
import numpy as np
mx.random.seed(1)
3.15.1 A Separable Classification Problem

The Perceptron algorithm aims to solve the following problem: given some classification problem of data
𝑥 ∈ R𝑑 and labels 𝑦 ∈ {±1}, can we find a linear function 𝑓 (𝑥) = 𝑤⊤ 𝑥 + 𝑏 such that 𝑓 (𝑥) > 0 whenever
𝑦 = 1 and 𝑓 (𝑥) < 0 for 𝑦 = −1. Obviously not all classification problems fall into this category but it’s
a very good baseline for what can be solved easily. It’s also the kind of problems computers could solve in
the 1960s. The easiest way to ensure that we have such a problem is to fake it by generating such data. We
are going to make the problem a bit more interesting by specifying how well the data is separated.
In [2]: # generate fake data that is linearly separable with a margin epsilon given the dat
def getfake(samples, dimensions, epsilon):
wfake = nd.random_normal(shape=(dimensions)) # fake weight vector for separat
bfake = nd.random_normal(shape=(1)) # fake bias
wfake = wfake / nd.norm(wfake) # rescale to unit length
# making some linearly separable data, simply by chosing the labels accordingly
X = nd.zeros(shape=(samples, dimensions))
Y = nd.zeros(shape=(samples))
i = 0
while (i < samples):
tmp = nd.random_normal(shape=(1,dimensions))
margin = nd.dot(tmp, wfake) + bfake
if (nd.norm(tmp).asscalar() < 3) & (abs(margin.asscalar()) > epsilon):
X[i,:] = tmp[0]
Y[i] = 1 if margin.asscalar() > 0 else -1
i += 1
return X, Y
# plot the data with colors chosen according to the labels

def plotdata(X,Y):
for (x,y) in zip(X,Y):
if (y.asscalar() == 1):
plt.scatter(x[0].asscalar(), x[1].asscalar(), color='r')
else:
plt.scatter(x[0].asscalar(), x[1].asscalar(), color='b')
# plot contour plots on a [-3,3] x [-3,3] grid

def plotscore(w,d):
xgrid = np.arange(-3, 3, 0.02)
ygrid = np.arange(-3, 3, 0.02)
xx, yy = np.meshgrid(xgrid, ygrid)
zz = nd.zeros(shape=(xgrid.size, ygrid.size, 2))
zz[:,:,0] = nd.array(xx)
zz[:,:,1] = nd.array(yy)
3.15. The Perceptron 107

vv = nd.dot(zz,w) + d
CS = plt.contour(xgrid,ygrid,vv.asnumpy())
plt.clabel(CS, inline=1, fontsize=10)
X, Y = getfake(50, 2, 0.3)
plotdata(X,Y)
plt.show()
Now we are going to use the simplest possible algorithm to learn parameters. It’s inspired by the Hebbian
Learning Rule which suggests that positive events should be reinforced and negative ones diminished. The
analysis of the algorithm is due to Rosenblatt and we will give a detailed proof of it after illustrating how it
works. In a nutshell, after initializing parameters 𝑤 = 0 and 𝑏 = 0 it updates them by 𝑦𝑥 and 𝑦 respectively
to ensure that they are properly aligned with the data. Let’s see how well it works:
In [3]: def perceptron(w,b,x,y):
if (y * (nd.dot(w,x) + b)).asscalar() <= 0:
w += y * x
b += y
return 1
else:
return 0
w = nd.zeros(shape=(2))
b = nd.zeros(shape=(1))
res = perceptron(w,b,x,y)
if (res == 1):
print('Encountered an error and updated parameters')
print('data {}, label {}'.format(x.asnumpy(),y.asscalar()))
print('weight {}, bias {}'.format(w.asnumpy(),b.asscalar()))
plotscore(w,b)
plotdata(X,Y)

plt.scatter(x[0].asscalar(), x[1].asscalar(), color='g')

plt.show()
Encountered an error and updated parameters
data [ 0.57595438 -0.95017916], label -1.0
weight [-0.57595438 0.95017916], bias -1.0

data [-0.3469252 0.03751944], label 1.0
weight [-0.92287958 0.98769861], bias 0.0


data [-1.80471897 -2.04010558], label 1.0
weight [-2.72759867 -1.05240703], bias 1.0

data [ 0.60334933 -1.08074296], label -1.0
weight [-3.33094788 0.02833593], bias 0.0
As we can see, the model has learned something - all the red dots are positive and all the blue dots correspond
to a negative value. Moreover, we saw that the values for 𝑤⊤ 𝑥 + 𝑏 became more extreme as values over the
grid. Did we just get lucky in terms of classification or is there any math behind it? Obviously there is, and

there’s actually a nice theorem to go with this. It’s the perceptron convergence theorem.
3.15.2 The Perceptron Convergence Theorem

Theorem Given data 𝑥𝑖 with ‖𝑥𝑖 ‖ ≤ 𝑅 and labels 𝑦𝑖 ∈ {±1} for which there exists some pair of parameters
(𝑤* , 𝑏* ) such that 𝑦𝑖 ((𝑤* )⊤ 𝑥𝑖 + 𝑏) ≥ 𝜖 for all data, and for which ‖𝑤* ‖ ≤ 1 and 𝑏2 ≤ 1, then the perceptron
algorithm converges after at most 2(𝑅2 + 1)/𝜖2 iterations.
The cool thing is that this theorem is independent of the dimensionality of the data. Moreover, it is inde-
pendent of the number of observations. Lastly, looking at the algorithm itself, we see that we only need to
store the mistakes that the algorithm made - for the data that was classified correctly no update on (𝑤, 𝑏)
happened. As a first step, let’s check how accurate the theorem is.
In [4]: Eps = np.arange(0.025, 0.45, 0.025)
Err = np.zeros(shape=(Eps.size))
for j in range(10):
for (i,epsilon) in enumerate(Eps):
X, Y = getfake(1000, 2, epsilon)

Err[i] += perceptron(w,b,x,y)
Err = Err / 10.0

plt.plot(Eps, Err, label='average number of updates for training')
plt.legend()
plt.show()
As we can see, the number of errors (and with it, updates) decreases inversely with the width of the margin.
Let’s see whether we can put this into equations. The first thing to consider is the size of the inner product
between (𝑤, 𝑏) and (𝑤* , 𝑏* ), the parameter that solves the classification problem with margin 𝜖. Note that
we do not need explicit knowledge of (𝑤* , 𝑏* ) for this, just know about its existence. For convenience, we

will index 𝑤 and 𝑏 by 𝑡, the number of updates on the parameters. Moreover, whenever convenient we will
treat (𝑤, 𝑏) as a new vector with an extra dimension and with the appropriate terms such as norms ‖(𝑤, 𝑏)‖
and inner products.
First off, 𝑤0⊤ 𝑤* + 𝑏0 𝑏* = 0 by construction. Second, by the update rule we have that
𝑡𝑜
(︁ )︁
(𝑤𝑡+1 , 𝑏𝑡+1 )⊤ (𝑤* , 𝑏* ) =(𝑤𝑡 , 𝑏𝑡 )⊤ (𝑤* , 𝑏* ) + 𝑦𝑡 𝑥⊤ *
𝑡 𝑤 +𝑏
*
≥(𝑤𝑡 , 𝑏𝑡 )⊤ (𝑤* , 𝑏* ) + 𝜖
≥(𝑡 + 1)𝜖(3.1)
(𝑤𝑡 , 𝑏𝑡 )⊤ (𝑤* , 𝑏* ) + 𝜖 ≥
Here the first equality follows from the definition of the weight updates. The next inequality follows from the
fact that (𝑤* , 𝑏* ) separate the problem with margin at least 𝜖, and the last inequality is simply a consequence
of iterating this inequality 𝑡 + 1 times. Growing alignment between the ‘ideal’ and the actual weight vectors
is great, but only if the actual weight vectors don’t grow too rapidly. So we need a bound on their length:
𝑡𝑜
‖(𝑤𝑡+1 , 𝑏𝑡+1 )‖2 ≤‖(𝑤𝑡 , 𝑏𝑡 )‖2 + 2𝑦𝑡 𝑥⊤

𝑡 𝑤𝑡 + 2𝑦𝑡 𝑏𝑡 + ‖(𝑥𝑡 , 1)‖
2
(︁ )︁
=‖(𝑤𝑡 , 𝑏𝑡 )‖2 + 2𝑦𝑡 𝑥⊤ 𝑡 𝑤𝑡 + 𝑏𝑡 + ‖(𝑥𝑡 , 1)‖
2
≤‖(𝑤𝑡 , 𝑏𝑡 )‖2 + 𝑅2 + 1
≤(𝑡 + 1)(𝑅2 + 1)(3.1)
(︁ )︁
‖(𝑤𝑡 , 𝑏𝑡 )‖2 + 2𝑦𝑡 𝑥⊤𝑡 𝑤 𝑡 + 𝑏𝑡 + ‖(𝑥𝑡 , 1)‖2 ≤
(𝑡 + 1)(𝑅2 + 1)

Now let’s combine both inequalities. By Cauchy-Schwartz,

√ i.e. ‖𝑎‖ · ‖𝑏‖ ≥ 𝑎⊤ 𝑏 and the first inequality
⊤ * *
𝑡𝜖 ≤ (𝑤𝑡 , 𝑏𝑡 ) (𝑤 , 𝑏 ) ≤ ‖(𝑤𝑡 , 𝑏𝑡 )‖ 2. Using the second inequality we furthermore get
we have that √︀
‖(𝑤𝑡 , 𝑏𝑡 )‖ ≤ 𝑡(𝑅2 + 1). Combined this yields
√︀
𝑡𝜖 ≤ 2𝑡(𝑅2 + 1)
This is a strange equation - we have a linear term on the left and a sublinear term on the right. So this
inequality clearly cannot hold indefinitely for large 𝑡. The only logical conclusion is that there must never
be updates beyond when the inequality is no longer satisfied. We have 𝑡 ≤ 2(𝑅2 + 1)/𝜖2 , which proves our
claim.
Note - sometimes the perceptron convergence theorem is written without bias 𝑏. In this case a lot of things
get simplified both in the proof and in the bound, since we can do away with the constant terms. Without
going through details, the theorem becomes 𝑡 ≤ 𝑅2 /𝜖2 .
Note - the perceptron convergence proof crucially relied on the fact that the data is actually separable. If
this is not the case, the perceptron algorithm will diverge. It will simply keep on trying to get things right
by updating (𝑤, 𝑏). And since it has no safeguard to keep the parameters bounded, the solution will get
worse. This sounds like an ‘academic’ concern, alas it is not. The avatar in the computer game [Black and
White](https://en.wikipedia.org/wiki/Black_%26_White_(video_game%29) used the perceptron to adjust
the avatar. Due to the poorly implemented update rule the game quickly became unplayable after a few
hours (as one of the authors can confirm).
3.15.3 Stochastic Gradient Descent

The perceptron algorithm also can be viewed as a stochastic gradient descent algorithm, albeit with a rather
strange loss function: max(0, −𝑦𝑓 (𝑥)). This is commonly called the hinge loss. As can be checked quite
easily, its gradient is 0 whenever 𝑦𝑓 (𝑥) > 0, i.e. whenever 𝑥 is classified correctly, and gradient −𝑦 for
incorrect classification. For a linear function, this leads exactly to the updates that we have (with the minor
difference that we consider 𝑓 (𝑥) = 0 as an example of incorrect classification). To get some intuition, let’s
plot the loss function.
In [5]: f = np.arange(-5,5,0.1)
zero = np.zeros(shape=(f.size))
lplus = np.max(np.array([f,zero]), axis=0)
lminus = np.max(np.array([-f,zero]), axis=0)
plt.plot(f,lplus, label='max(0,f(x))')
plt.plot(f,lminus, label='max(0,-f(x))')
plt.legend()
plt.show()

More generally, a stochastic gradient descent algorithm uses the following template:
initialize w
loop over data and labels (x,y):
compute f(x)
compute loss gradient g = partial_w l(y, f(x))
w = w - eta g
Here the learning rate 𝜂 may well change as we iterate over the data. Moreover, we may traverse the data
in nonlinear order (e.g. we might reshuffle the data), depending on the specific choices of the algorithm.
The issue is that as we go over the data, sometimes the gradient might point us into the right direction and
sometimes it might not. Intuitively, on average things should get better. But to be really sure, there’s only
one way to find out - we need to prove it. We pick a simple and elegant (albeit a bit restrictive) proof of
Nesterov and Vial.
The situation we consider are convex losses. This is a bit restrictive in the age of deep networks but still
quite instructive (in addition to that, nonconvex convergence proofs are a lot messier). For recap - a convex
function 𝑓 (𝑥) satisfies 𝑓 (𝜆𝑥 + (1 − 𝜆)𝑥′ ) ≤ 𝜆𝑓 (𝑥) + (1 − 𝜆)𝑓 (𝑥′ ), that is, the linear interpolant between
function values is larger than the function values in between. Likewise, a convex set 𝑆 is a set where for
any points 𝑥, 𝑥′ ∈ 𝑆 the line [𝑥, 𝑥′ ] is in the set, i.e. 𝜆𝑥 + (1 − 𝜆)𝑥′ ∈ 𝑆 for all 𝜆 ∈ [0, 1]. Now assume that
𝑤* is the minimizer of the expected loss that we are trying to minimize, e.g.
𝑚
* 1 ∑︁
𝑤 = argmin𝑤 𝑅(𝑤) where 𝑅(𝑤) = 𝑙(𝑦𝑖 , 𝑓 (𝑥𝑖 , 𝑤))
𝑚
𝑖=1
Let’s assume that we actually know that 𝑤* is contained in some set convex set 𝑆, e.g. a ball of radius 𝑅
around the origin. This is convenient since we want to make sure that during optimization our parameter 𝑤
doesn’t accidentally diverge. We can ensure that, e.g. by shrinking it back to such a ball whenever needed.
Secondly, assume that we have an upper bound on the magnitude of the gradient 𝑔𝑖 := 𝜕𝑤 𝑙(𝑦𝑖 , 𝑓 (𝑥𝑖 , 𝑤))
for all 𝑖 by some constant 𝐿 (it’s called so since this is often referred to as the Lipschitz constant). Again,

this is super useful since we don’t want 𝑤 to diverge while we’re optimizing. In practice, many algorithms
employ e.g. gradient clipping to force our gradients to be well behaved, by shrinking the gradients back to
something tractable.
Third, to get rid of variance in the parameter 𝑤𝑡 that is obtained during the optimization,
∑︀ we
∑︀use the weighted
average over the entire optimization process as our solution, i.e. we use 𝑤 ¯ := 𝑡 𝜂𝑡 𝑤𝑡 / 𝑡 𝜂𝑡 .
Let’s look at the distance 𝑟𝑡 := ‖𝑤𝑡 − 𝑤* ‖, i.e. the distance between the optimal solution vector 𝑤* and
what we currently have. It is bounded as follows:
𝑡𝑜
‖𝑤𝑡+1 − 𝑤* ‖2 =‖𝑤𝑡 − 𝑤* ‖2 + 𝜂𝑡2 ‖𝑔𝑡 ‖2 − 2𝜂𝑡 𝑔𝑡⊤ (𝑤𝑡 − 𝑤* )

≤‖𝑤𝑡 − 𝑤* ‖2 + 𝜂𝑡2 𝐿2 − 2𝜂𝑡 𝑔𝑡⊤ (𝑤𝑡 − 𝑤* )(3.1)
‖𝑤𝑡 − 𝑤* ‖2 + 𝜂𝑡2 𝐿2 − 2𝜂𝑡 𝑔𝑡⊤ (𝑤𝑡 − 𝑤* )
Next we use convexity of 𝑅(𝑤). We know that 𝑅(𝑤* ) ≥ 𝑅(𝑤𝑡 ) + 𝜕𝑤 𝑅(𝑤𝑡 )⊤ (𝑤* ∑︀ − 𝑤𝑡 ) and moreover
∑︀ that
𝑇
the average of function values is larger than the function value of the average, i.e. 𝑡=1 𝜂𝑡 𝑅(𝑤𝑡 )/ 𝑡 𝜂𝑡 ≥
𝑅(𝑤).
¯ The first inequality allows us to bound the expected decrease in distance to optimality via
E[𝑟𝑡+1 − 𝑟𝑡 ] ≤ 𝜂𝑡2 𝐿2 − 2𝜂𝑡 E[𝑔𝑡⊤ (𝑤𝑡 − 𝑤* )] ≤ 𝜂𝑡2 𝐿2 − 2𝜂𝑡 E[𝑅[𝑤𝑡 ] − 𝑅[𝑤* ]]
Summing over 𝑡 and using the facts that 𝑟𝑇 ≥ 0 and that 𝑤 is contained inside a ball of radius 𝑅 yields:
𝑇
∑︁ ∑︁
2
−𝑅 ≤ 𝐿 2
𝜂𝑡2 − 2 𝜂𝑡 E[𝑅[𝑤𝑡 ] − 𝑅[𝑤* ]]
𝑡=1 𝑡
∑︀
Rearranging terms, using convexity of 𝑅 the second time, and dividing by 𝑡 𝜂𝑡 yields a bound on how far
we are likely to stray from the best possible solution:
𝑅2 + 𝐿2 𝑇𝑡=1 𝜂𝑡2
∑︀
*
¯ − 𝑅[𝑤 ] ≤
E[𝑅[𝑤]]
2 𝑇𝑡=1 𝜂𝑡
∑︀
Depending on how we choose 𝜂𝑡 we will get different bounds. For instance, if we make 𝜂 constant, i.e. if√we
2 2 2
√ we get the bounds (𝑅 + 𝐿 𝜂 𝑇 )/(2𝜂𝑇 ). This is minimized for 𝜂 = 𝑅/𝐿 𝑇 ,
use a constant learning rate,
yielding a bound of 𝑅𝐿/ 𝑇 . A few things are interesting in this context:

• If we are potentially far away from the optimal solution, we should use a large learning rate (the O(R)
dependency).
• If the gradients are potentially large, we should use a smaller learning rate (the O(1/L) dependency).
• If we have a long time to converge, we should use a smaller learning rate, but not too small.
• Large gradients and a large degree of uncertainty as to how far we are away from the optimal solution
lead to poor convergence.
• More optimization steps make things better.
None of these insights are terribly surprising, albeit useful to keep in mind when we use SGD in the wild.
And this was the very point of going through
√ this somewhat tedious proof. Furthermore, if we use a de-
√ rate, e.g. 𝜂𝑡 = 𝑂(1/ 𝑡), then our bounds are somewhat less tight, and we get a bound
creasing learning
of 𝑂(log 𝑇 / 𝑇 ) bound on how far away from optimality we might be. The key difference is that for the
decreasing learning rate we need not know when to stop. In other words, we get an anytime algorithm that
provides a good result at any time, albeit not as good as what we could expect if we knew how much time
to optimize we have right from the beginning.
3.15.4 Next
Environment
3.16 Environment
So far we did not worry very much about where the data came from and how the models that we build get
deployed. Not caring about it can be problematic. Many failed machine learning deployments can be traced
back to this situation. This chapter is meant to help with detecting such situations early and points out how to
mitigate them. Depending on the case this might be rather simple (ask for the ‘right’ data) or really difficult
(implement a reinforcement learning system).
3.16.1 Covariate Shift

At its heart is a problem that is easy to understand but also equally easy to miss. Consider being given the
challenge of distinguishing cats and dogs. Our training data consists of images of the following kind:
cat cat dog dog
At test time we are asked to classify the following images:

cat cat dog dog
Obviously this is unlikely to work well. The training set consists of photos, while the test set contains only
cartoons. The colors aren’t even accurate. Training on a dataset that looks substantially different from the
test set without some plan for how to adapt to the new domain is a bad idea. Unfortunately, this is a very
common pitfall. Statisticians call this Covariate Shift, i.e. the situation where the distribution over the
covariates (aka training data) is shifted on test data relative to the training case. Mathematically speaking,
we are referring the case where 𝑝(𝑥) changes but 𝑝(𝑦|𝑥) remains unchanged.
3.16.2 Concept Shift

A related problem is that of concept shift. This is the situation where the the labels change. This sounds
weird - after all, a cat is a cat is a cat. Well, cats maybe but not soft drinks. There is considerable concept
shift throughout the USA, even for such a simple term:
If we were to build a machine translation system, the distribution 𝑝(𝑦|𝑥) would be different, e.g. depending
on our location. This problem can be quite tricky to spot. A saving grace is that quite often the 𝑝(𝑦|𝑥) only
shifts gradually (e.g. the click-through rate for NOKIA phone ads). Before we go into further details, let us
discuss a number of situations where covariate and concept shift are not quite as blatantly obvious.
3.16. Environment 117

3.16.3 Examples
Medical Diagnostics
Imagine you want to design some algorithm to detect cancer. You get data of healthy and sick people;
you train your algorithm; it works fine, giving you high accuracy and you conclude that you’re ready for a
successful career in medical diagnostics. Not so fast . . .
Many things could go wrong. In particular, the distributions that you work with for training and those in the
wild might differ considerably. This happened to an unfortunate startup I had the opportunity to consult for
many years ago. They were developing a blood test for a disease that affects mainly older men and they’d
managed to obtain a fair amount of blood samples from patients. It is considerably more difficult, though,
to obtain blood samples from healthy men (mainly for ethical reasons). To compensate for that, they asked
a large number of students on campus to donate blood and they performed their test. Then they asked me
whether I could help them build a classifier to detect the disease. I told them that it would be very easy to
distinguish between both datasets with probably near perfect accuracy. After all, the test subjects differed
in age, hormone level, physical activity, diet, alcohol consumption, and many more factors unrelated to the
disease. This was unlikely to be the case with real patients: Their sampling procedure had caused an extreme
case of covariate shift that couldn’t be corrected by conventional means. In other words, training and test
data were so different that nothing useful could be done and they had wasted significant amounts of money.
Self Driving Cars

A company wanted to build a machine learning system for self-driving cars. One of the key components
is a roadside detector. Since real annotated data is expensive to get, they had the (smart and questionable)
idea to use synthetic data from a game rendering engine as additional training data. This worked really well
on ‘test data’ drawn from the rendering engine. Alas, inside a real car it was a disaster. As it turned out,
the roadside had been rendered with a very simplistic texture. More importantly, all the roadside had been
rendered with the same texture and the roadside detector learned about this ‘feature’ very quickly.
A similar thing happened to the US Army when they first tried to detect tanks in the forest. They took aerial
photographs of the forest without tanks, then drove the tanks into the forest and took another set of pictures.
The so-trained classifier worked ‘perfectly’. Unfortunately, all it had learned was to distinguish trees with
shadows from trees without shadows - the first set of pictures was taken in the early morning, the second
one at noon.
Nonstationary distributions
A much more subtle situation is where the distribution changes slowly and the model is not updated ade-
quately. Here are a number of typical cases:
• We train a computational advertising model and then fail to update it frequently (e.g. we forget to
incorporate that an obscure new device called an iPad was just launched).
• We build a spam filter. It works well at detecting all spam that we’ve seen so far. But then the
spammers wisen up and craft new messages that look quite unlike anything we’ve seen before.
• We build a product recommendation system. It works well for the winter. But then it keeps on
recommending Santa hats after Christmas.

More Anecdotes
• We build a classifier for “Not suitable/safe for work” (NSFW) images. To make our life easy, we
scrape a few seedy Subreddits. Unfortunately the accuracy on real life data is lacking (the pictures
posted on Reddit are mostly ‘remarkable’ in some way, e.g. being taken by skilled photographers,
whereas most real NSFW images are fairly unremarkable . . . ). Quite unsurprisingly the accuracy is
not very high on real data.
• We build a face detector. It works well on all benchmarks. Unfortunately it fails on test data - the
offending examples are close-ups where the face fills the entire image (no such data was in the training
set).
• We build a web search engine for the USA market and want to deploy it in the UK.
In short, there are many cases where training and test distribution 𝑝(𝑥) are different. In some cases, we
get lucky and the models work despite the covariate shift. We now discuss principled solution strategies.
Warning - this will require some math and statistics.
3.16.4 Covariate Shift Correction

Assume that we want to estimate some dependency 𝑝(𝑦|𝑥) for which we have labeled data (𝑥𝑖 , 𝑦𝑖 ). Alas,
the observations 𝑥𝑖 are drawn from some distribution 𝑞(𝑥) rather than the ‘proper’ distribution 𝑝(𝑥). To
make progress, we need to reflect about what exactly is happening during training: we iterate over training
data and associated labels {(𝑥1 , 𝑦1 ), . . . (𝑦𝑚 , 𝑦𝑚 )} and update the weight vectors of the model after every
minibatch. Depending on the situation we also apply some penalty to the parameters, e.g. 𝐿2 regularization.
In other words, we want to solve
𝑚
1 ∑︁ 𝜆
minimize 𝑙(𝑥𝑖 , 𝑦𝑖 , 𝑓 (𝑥𝑖 )) + ‖𝑤‖22
𝑤 𝑚 2
𝑖=1
Statisticians call the first term an empirical average, that is an average computed over the data drawn from
𝑝(𝑥)𝑝(𝑦|𝑥). If the data is drawn from the ‘wrong’ distribution 𝑞, we can correct for that by using the
following simple identity:
∫︁ ∫︁ [︂ ]︂
𝑝(𝑥) 𝑝(𝑥)
E𝑥∼𝑝(𝑥) [𝑓 (𝑥)] = 𝑓 (𝑥)𝑝(𝑥)𝑑𝑥 = 𝑓 (𝑥) 𝑞(𝑥)𝑑𝑥 = E𝑥∼𝑞(𝑥) 𝑓 (𝑥)
𝑞(𝑥) 𝑞(𝑥)
In other words, we need to re-weight each instance by the ratio of probabilities that it would have been
drawn from the correct distribution 𝛽(𝑥) := 𝑝(𝑥)/𝑞(𝑥). Alas, we do not know that ratio, so before we can do
anything useful we need to estimate it. Many methods are available, e.g. some rather fancy operator theoretic
ones which try to recalibrate the expectation operator directly using a minimum-norm or a maximum entropy
principle. Note that for any such approach, we need samples drawn from both distributions - the ‘true’ 𝑝, e.g.
by access to training data, and the one used for generating the training set 𝑞 (the latter is trivially available).
In this case there exists a very effective approach that will give almost as good results: logistic regression.
This is all that is needed to compute estimate probability ratios. We learn a classifier to distinguish be-
tween data drawn from 𝑝(𝑥) and data drawn from 𝑞(𝑥). If it is impossible to distinguish between the two
distributions then it means that the associated instances are equally likely to come from either one of the
two distributions. On the other hand, any instances that can be well discriminated should be significantly
over/underweighted accordingly. For simplicity’s sake assume that we have an equal number of instances
from both distributions, denoted by 𝑥𝑖 ∼ 𝑝(𝑥) and 𝑥𝑖 ∼ 𝑞(𝑥) respectively. Now denote by 𝑧𝑖 labels which

are 1 for data drawn from 𝑝 and -1 for data drawn from 𝑞. Then the probability in a mixed dataset is given
by
𝑝(𝑥) 𝑝(𝑧 = 1|𝑥) 𝑝(𝑥)

𝑝(𝑧 = 1|𝑥) = and hence =
𝑝(𝑥) + 𝑞(𝑥) 𝑝(𝑧 = −1|𝑥) 𝑞(𝑥)
1
Hence, if we use a logistic regression approach where 𝑝(𝑧 = 1|𝑥) = 1+exp(𝑓 (𝑥) it follows (after some
simple algebra) that 𝛽(𝑥) = exp(𝑓 (𝑥)). In summary, we need to solve two problems: first one to distinguish
between data drawn from both distributions, and then a reweighted minimization problem where we weigh
terms by 𝛽, e.g. via the head gradients. Here’s a prototypical algorithm for that purpose:
CovariateShiftCorrector(X, Z)
X: Training dataset (without labels)
Z: Test dataset (without labels)
generate training set with {(x_i, -1) ... (z_j, 1)}

train binary classifier using logistic regression to get function f
weigh data using beta_i = exp(f(x_i)) or
beta_i = min(exp(f(x_i)), c)
use weights beta_i for training on X with labels Y
Generative Adversarial Networks use the very idea described above to engineer a data generator such
that it cannot be distinguished from a reference dataset. For this, we use one network, say 𝑓 to distinguish
real and fake data and a second network 𝑔 that tries to fool the discriminator 𝑓 into accepting fake data as
real. We will discuss this in much more detail later.
3.16.5 Concept Shift Correction

Concept shift is much harder to fix in a principled manner. For instance, in a situation where suddenly the
problem changes from distinguishing cats from dogs to one of distinguishing white from black animals, it
will be unreasonable to assume that we can do much better than just training from scratch using the new
labels. Fortunately, in practice, such extreme shifts almost never happen. Instead, what usually happens is
that the task keeps on changing slowly. To make things more concrete, here are some examples:
• In computational advertising, new products are launched, old products become less popular. This
means that the distribution over ads and their popularity changes gradually and any click-through rate
predictor needs to change gradually with it.
• Traffic cameras lenses degrade gradually due to environmental wear, affecting image quality progres-
sively.
• News content changes gradually (i.e. most of the news remains unchanged but new stories appear).
In such cases, we can use the same approach that we used for training networks to make them adapt to the
change in the data. In other words, we use the existing network weights and simply perform a few update
steps with the new data rather than training from scratch.
3.16.6 A Taxonomy of Learning Problems

Armed with knowledge about how to deal with changes in 𝑝(𝑥) and in 𝑝(𝑦|𝑥), let us consider a number of
problems that we can solve using machine learning.

• Batch Learning. Here we have access to training data and labels {(𝑥1 , 𝑦1 ), . . . (𝑥𝑛 , 𝑦𝑛 )}, which we
use to train a network 𝑓 (𝑥, 𝑤). Later on, we deploy this network to score new data (𝑥, 𝑦) drawn from
the same distribution. This is the default assumption for any of the problems that we discuss here.
For instance, we might train a cat detector based on lots of pictures of cats and dogs. Once we trained
it, we ship it as part of a smart catdoor computer vision system that lets only cats in. This is then
installed in a customer’s home and is never updated again (barring extreme circumstances).
• Online Learning. Now imagine that the data (𝑥𝑖 , 𝑦𝑖 ) arrives one sample at a time. More specifically,
assume that we first observe 𝑥𝑖 , then we need to come up with an estimate 𝑓 (𝑥𝑖 , 𝑤) and only once
we’ve done this, we observe 𝑦𝑖 and with it, we receive a reward (or incur a loss), given our decision.
Many real problems fall into this category. E.g. we need to predict tomorrow’s stock price, this allows
us to trade based on that estimate and at the end of the day we find out whether our estimate allowed
us to make a profit. In other words, we have the following cycle where we are continuously improving
our model given new observations.
model 𝑓𝑡 −→ data 𝑥𝑡 −→ estimate 𝑓𝑡 (𝑥𝑡 ) −→ observation 𝑦𝑡 −→ loss 𝑙(𝑦𝑡 , 𝑓𝑡 (𝑥𝑡 )) −→ model 𝑓𝑡+1
• Bandits. They are a special case of the problem above. While in most learning problems we have a
continuously parametrized function 𝑓 where we want to learn its parameters (e.g. a deep network), in
a bandit problem we only have a finite number of arms that we can pull (i.e. a finite number of actions
that we can take). It is not very surprising that for this simpler problem stronger theoretical guarantees
in terms of optimality can be obtained. We list it mainly since this problem is often (confusingly)
treated as if it were a distinct learning setting.
• Control (and nonadversarial Reinforcement Learning). In many cases the environment remembers
what we did. Not necessarily in an adversarial manner but it’ll just remember and the response will
depend on what happened before. E.g. a coffee boiler controller will observe different temperatures
depending on whether it was heating the boiler previously. PID (proportional integral derivative)
controller algorithms are a popular choice there. Likewise, a user’s behavior on a news site will
depend on what we showed him previously (e.g. he will read most news only once). Many such
algorithms form a model of the environment in which they act such as to make their decisions appear
less random (i.e. to reduce variance).
• Reinforcement Learning. In the more general case of an environment with memory, we may en-
counter situations where the environment is trying to cooperate with us (cooperative games, in partic-
ular for non-zero-sum games), or others where the environment will try to win. Chess, Go, Backgam-
mon or StarCraft are some of the cases. Likewise, we might want to build a good controller for
autonomous cars. The other cars are likely to respond to the autonomous car’s driving style in non-
trivial ways, e.g. trying to avoid it, trying to cause an accident, trying to cooperate with it, etc.
One key distinction between the different situations above is that the same strategy that might have worked
throughout in the case of a stationary environment, might not work throughout when the environment can
adapt. For instance, an arbitrage opportunity discovered by a trader is likely to disappear once he starts
exploiting it. The speed and manner at which the environment changes determines to a large extent the
type of algorithms that we can bring to bear. For instance, if we know that things may only change slowly,
we can force any estimate to change only slowly, too. If we know that the environment might change
instantaneously, but only very infrequently, we can make allowances for that. These types of knowledge are
crucial for the aspiring data scientist to deal with concept shift, i.e. when the problem that he is trying to
solve changes over time.

3.17 Multilayer perceptrons from scratch

In the previous chapters we showed how you could implement multiclass logistic regression (also called
softmax regression) for classifiying images of handwritten digits into the 10 possible categories (from scratch
and with gluon). This is where things start to get fun. We understand how to wrangle data, coerce our outputs
into a valid probability distribution, how to apply an appropriate loss function, and how to optimize over our
parameters. Now that we’ve covered these preliminaries, we can extend our toolbox to include deep neural
networks.
Recall that before, we mapped our inputs directly onto our outputs through a single linear transformation.
𝑦ˆ = softmax(𝑊 𝑥 + 𝑏)
Graphically, we could depict the model like this, where the orange nodes
represent inputs and the teal nodes on the top represent the output:
If our labels really were related to our input data by an approximately linear function, then this approach
might be adequate. But linearity is a strong assumption. Linearity means that given an output of interest, for
each input, increasing the value of the input should either drive the value of the output up or drive it down,
irrespective of the value of the other inputs.
Imagine the case of classifying cats and dogs based on black and white images. That’s like saying that for
each pixel, increasing its value either increases the probability that it depicts a dog or decreases it. That’s
not reasonable. After all, the world contains both black dogs and black cats, and both white dogs and white
cats.
Teasing out what is depicted in an image generally requires allowing more complex relationships between
our inputs and outputs, considering the possibility that our pattern might be characterized by interactions
among the many features. In these cases, linear models will have low accuracy. We can model a more
general class of functions by incorporating one or more hidden layers. The easiest way to do this is to stack
a bunch of layers of neurons on top of each other. Each layer feeds into the layer above it, until we generate
an output. This architecture is commonly called a “multilayer perceptron”. With an MLP, we’re going to
stack a bunch of layers on top of each other.
ℎ1 = 𝜑(𝑊1 𝑥 + 𝑏1 )
ℎ2 = 𝜑(𝑊2 ℎ1 + 𝑏2 )

...
ℎ𝑛 = 𝜑(𝑊𝑛 ℎ𝑛−1 + 𝑏𝑛 )
Note that each layer requires its own set of parameters. For each hidden layer, we calculate its value by first
applying a linear function to the activations of the layer below, and then applying an element-wise nonlinear
activation function. Here, we’ve denoted the activation function for the hidden layers as 𝜑. Finally, given
the topmost hidden layer, we’ll generate an output. Because we’re still focusing on multiclass classification,
we’ll stick with the softmax activation in the output layer.
𝑦ˆ = softmax(𝑊𝑦 ℎ𝑛 + 𝑏𝑦 )
Graphically, a multilayer perceptron could be depicted like this:
Multilayer perceptrons can account for complex interactions in the inputs because the hidden neurons de-
pend on the values of each of the inputs. It’s easy to design a hidden node that that does arbitrary compu-
tation, such as, for instance, logical operations on its inputs. And it’s even widely known that multilayer
perceptrons are universal approximators. That means that even for a single-hidden-layer neural network,
with enough nodes, and the right set of weights, it could model any function at all! Actually learning that
function is the hard part. And it turns out that we can approximate functions much more compactly if we use
deeper (vs wider) neural networks. We’ll get more into the math in a subsequent chapter, but for now let’s
actually build an MLP. In this example, we’ll implement a multilayer perceptron with two hidden layers and
one output layer.
3.17. Multilayer perceptrons from scratch 123

3.17.1 Imports
import mxnet as mx
import numpy as np
3.17.2 Set contexts

In [ ]: ctx = mx.gpu() if mx.test_utils.list_gpus() else mx.cpu()
data_ctx = ctx
model_ctx = ctx
3.17.3 Load MNIST data

Let’s go ahead and grab our data.
In [ ]: num_inputs = 784
num_outputs = 10
batch_size = 64
train_data = gluon.data.DataLoader(mx.gluon.data.vision.MNIST(train=True, transform
test_data = gluon.data.DataLoader(mx.gluon.data.vision.MNIST(train=False, transform
3.17.4 Allocate parameters

In [ ]: #######################
# Set some constants so it's easy to modify the network later
#######################
num_hidden = 256
weight_scale = .01
#######################
# Allocate parameters for the first hidden layer
#######################
W1 = nd.random_normal(shape=(num_inputs, num_hidden), scale=weight_scale, ctx=model
b1 = nd.random_normal(shape=num_hidden, scale=weight_scale, ctx=model_ctx)
#######################
# Allocate parameters for the second hidden layer
#######################
W2 = nd.random_normal(shape=(num_hidden, num_hidden), scale=weight_scale, ctx=model
b2 = nd.random_normal(shape=num_hidden, scale=weight_scale, ctx=model_ctx)
#######################
# Allocate parameters for the output layer
#######################
W3 = nd.random_normal(shape=(num_hidden, num_outputs), scale=weight_scale, ctx=mode
b3 = nd.random_normal(shape=num_outputs, scale=weight_scale, ctx=model_ctx)
params = [W1, b1, W2, b2, W3, b3]

Again, let’s allocate space for each parameter’s gradients.

param.attach_grad()
3.17.5 Activation functions

If we compose a multi-layer network but use only linear operations, then our entire network will still be a
linear function. That’s because $ = X ·𝑊 _1 · 𝑊 _2 · 𝑊 _2 = 𝑋 · 𝑊 _4$𝑓 𝑜𝑟𝑊4 = 𝑊1 · 𝑊2 · 𝑊 3. To give
our model the capacity to capture nonlinear functions, we’ll need to interleave our linear operations with
activation functions. In this case, we’ll use the rectified linear unit (ReLU):
In [ ]: def relu(X):
return nd.maximum(X, nd.zeros_like(X))
3.17.6 Softmax output

As with multiclass logistic regression, we’ll want the outputs to constitute a valid probability distribution.
We’ll use the same softmax activation function on our output to make sure that our outputs sum to one and
are non-negative.
exp = nd.exp(y_linear-nd.max(y_linear))
partition = nd.nansum(exp, axis=0, exclude=True).reshape((-1, 1))
return exp / partition
3.17.7 The softmax cross-entropy loss function

In the previous example, we calculated our model’s output and then ran this output through the cross-entropy
loss function:
return - nd.nansum(y * nd.log(yhat), axis=0, exclude=True)
Mathematically, that’s a perfectly reasonable thing to do. However, computationally, things can get hairy.
We’ll revisit the issue at length in a chapter more dedicated to implementation and less interested in statistical
modeling. But we’re going to make a change here so we want to give you the gist of why.
𝑧𝑗
Recall that the softmax function calculates 𝑦ˆ𝑗 = ∑︀𝑛𝑒 𝑒𝑧𝑖 , where 𝑦ˆ𝑗 is the j-th element of the input yhat
𝑖=1
variable in function cross_entropy and 𝑧𝑗 is the j-th element of the input y_linear variable in func-
tion softmax
If some of the 𝑧𝑖 are very large (i.e. very positive), 𝑒𝑧𝑖 might be larger than the largest number we can
have for certain types of float (i.e. overflow). This would make the denominator (and/or numerator)
inf and we get zero, or inf, or nan for 𝑦ˆ𝑗 . In any case, we won’t get a well-defined return value for
cross_entropy. This is the reason we subtract max(𝑧𝑖 ) from all 𝑧𝑖 first in softmax function. You can
verify that this shifting in 𝑧𝑖 will not change the return value of softmax.
After the above subtraction/ normalization step, it is possible that 𝑧𝑗 is very negative. Thus, 𝑒𝑧𝑗 will be
very close to zero and might be rounded to zero due to finite precision (i.e underflow), which makes 𝑦ˆ𝑗
zero and we get -inf for log(ˆ𝑦𝑗 ). A few steps down the road in backpropagation, we starts to get horrific
not-a-number (nan) results printed to screen.

Our salvation is that even though we’re computing these exponential functions, we ultimately plan to take
their log in the cross-entropy functions. It turns out that by combining these two operators softmax and
cross_entropy together, we can elude the numerical stability issues that might otherwise plague us
during backpropagation. As shown in the equation below, we avoided calculating 𝑒𝑧𝑗 but directly used 𝑧𝑗
due to 𝑙𝑜𝑔(𝑒𝑥𝑝(·)).
(︃ 𝑛 )︃ (︃ 𝑛 )︃
𝑒𝑧𝑗
(︂ )︂ ∑︁ ∑︁
𝑦𝑗 ) = log ∑︀𝑛 𝑧 = log(𝑒𝑧𝑗 ) − log
log(ˆ 𝑒𝑧𝑖 = 𝑧𝑗 − log 𝑒 𝑧𝑖
𝑖=1 𝑒
𝑖
𝑖=1 𝑖=1
We’ll want to keep the conventional softmax function handy in case we ever want to evaluate the probabili-
ties output by our model. But instead of passing softmax probabilities into our new loss function, we’ll just
pass our yhat_linear and compute the softmax and its log all at once inside the softmax_cross_entropy
loss function, which does smart things like the log-sum-exp trick (see on Wikipedia).
In [ ]: def softmax_cross_entropy(yhat_linear, y):
return - nd.nansum(y * nd.log_softmax(yhat_linear), axis=0, exclude=True)

In [ ]: def net(X):
#######################
# Compute the first hidden layer
#######################
h1_linear = nd.dot(X, W1) + b1
h1 = relu(h1_linear)
#######################
# Compute the second hidden layer
#######################
h2_linear = nd.dot(h1, W2) + b2
#######################
# Compute the output layer.
# We will omit the softmax function here
# because it will be applied
# in the softmax_cross_entropy loss
#######################
yhat_linear = nd.dot(h2, W3) + b3
return yhat_linear
3.17.9 Optimizer
3.17.10 Evaluation metric

numerator = 0.

denominator = 0.
data = data.as_in_context(model_ctx).reshape((-1, 784))
output = net(data)
3.17.11 Execute the training loop

In [ ]: epochs = 10
smoothing_constant = .01
cumulative_loss = 0
output = net(data)
loss = softmax_cross_entropy(output, label_one_hot)
loss.backward()

print("Epoch %s. Loss: %s, Train_acc %s, Test_acc %s" %
(e, cumulative_loss/num_examples, train_accuracy, test_accuracy))
3.17.12 Using the model for prediction

Let’s pick a few random data points from the test set to visualize algonside our predictions. We already know
quantitatively that the model is more accurate, but visualizing results is a good practice that can (1) help us
to sanity check that our code is actually working and (2) provide intuition about what kinds of mistakes our
model tends to make.
In [ ]: %matplotlib inline
# Define the function to do prediction

output = net(data)
samples = 10
mnist_test = mx.gluon.data.vision.MNIST(train=False, transform=transform)


sample_data = mx.gluon.data.DataLoader(mnist_test, samples, shuffle=True)
plt.show()
print('true labels :', label)
break
3.17.13 Conclusion
Nice! With just two hidden layers containing 256 hidden nodes, respectively, we can achieve over 95%
accuracy on this task.
3.17.14 Next
Multilayer perceptrons with gluon
3.18 Multilayer perceptrons in gluon

Building a multilayer perceptron to classify MNIST images with gluon is not much harder than
implementing softmax regression with ‘‘gluon‘ <../chapter02_supervised-learning/softmax-regression-
gluon.ipynb>‘__, like we did in Chapter 2. In that chapter, our entire neural network consisted of one
Dense layer (net = gluon.nn.Dense(num_outputs)).
In this chapter, we’re going to show you how to compose multiple layers together into a neural network.
There are two main ways to do this in Gluon and we’ll walk through both. The first is to define a custom
Block. In Gluon, everything is a Block! Layers, losses, whole networks, they’re all blocks! So naturally,
that’s a flexible way to do nearly anything you want.
We’ll also make use of gluon.nn.Sequential. Sequential gives us a special way of rapidly building
networks when follow a common design pattern: they look like a stack of pancakes. Many networks follow
this pattern: a bunch of layers, one stacked on top of another, where the output of each layer is the input to
the next layer. Sequential just takes a list of layers (we pass them in by calling net.add(<Layer goes
here!>). The following unnecessary picture should give you an intuitive sense of when to (and not to)
use sequential.
chapter03_deep-neural-networks/../img/sequential-not

3.18.1 Imports
First we’ll import the necessary bits.
import numpy as np
import mxnet as mx
We’ll also want to set the contexts for our data and our models.
In [ ]: ctx = mx.gpu() if mx.test_utils.list_gpus() else mx.cpu()
data_ctx = ctx
model_ctx = ctx

num_inputs = 784
num_outputs = 10
train_data = mx.gluon.data.DataLoader(mx.gluon.data.vision.MNIST(train=True, transf
test_data = mx.gluon.data.DataLoader(mx.gluon.data.vision.MNIST(train=False, transf
3.18.3 Define the model with gluon.Block

Now instead of having one gluon.nn.Dense layer, we’ll want to compose several together. First let’s go
through the most fundamental way of doing this. Then we’ll introduce some shortcuts. In gluon a Block
has one main job - define a forward method that takes some NDArray input x and generates an NDArray
output. Because the output and input are related to each other via NDArray operations, MXNet can take
derivatives through the block automatically. A Block can just do something simple like apply an activation
function. But it can also combine a bunch of other Blocks together in creative ways. In this case, we’ll just
want to instantiate three Dense layers. The forward can then invoke the layers in turn to generate its output.
In [ ]: class MLP(gluon.Block):
def __init__(self, **kwargs):
super(MLP, self).__init__(**kwargs)
with self.name_scope():
self.dense0 = gluon.nn.Dense(64)
def forward(self, x):

x = nd.relu(self.dense0(x))
x = self.dense2(x)
return x
We can now instantiate a multilayer perceptron using our MLP class. And just as with any other block, we
can grab its parameters with collect_params and initialize them.
3.18. Multilayer perceptrons in gluon 129

In [ ]: net = MLP()
net.collect_params().initialize(mx.init.Normal(sigma=.01), ctx=model_ctx)
And we can synthesize some gibberish data just to demonstrate one forward pass through the network.
In [ ]: data = nd.ones((1,784))
net(data.as_in_context(model_ctx))
Because we’re working with an imperative framework and not a symbolic framework, debugging Gluon
Blocks is easy. If we want to see what’s going on at each layer of the neural network, we can just plug in a
bunch of Python print statements.
In [ ]: class MLP(gluon.Block):
self.dense0 = gluon.nn.Dense(64, activation="relu")
self.dense1 = gluon.nn.Dense(64, activation="relu")

x = self.dense0(x)
print("Hidden Representation 1: %s" % x)
x = self.dense1(x)
print("Hidden Representation 2: %s" % x)
x = self.dense2(x)
print("Network output: %s" % x)
return x
net = MLP()
net.collect_params().initialize(mx.init.Normal(sigma=.01), ctx=model_ctx)
net(data.as_in_context(model_ctx))
3.19 Faster modeling with gluon.nn.Sequential

MLPs, like many deep neural networks follow a pretty boring architecture. Just take a list of the layers,
chain them together, and return the output. There’s no reason why we have to actually define a new class
every time we want to do this. Gluon’s Sequential class provides a nice way of rapidly implementing
this standard network architecture. We just
• Instantiate a Sequential (let’s call it net)
• Add a bunch of layers to it using net.add(...)
Sequential assumes that the layers arrive bottom to top (with input at the very bottom). We could implement
the same architecture as shown above using sequential in just 6 lines.
In [ ]: num_hidden = 64
net = gluon.nn.Sequential()
net.add(gluon.nn.Dense(num_hidden, activation="relu"))
net.add(gluon.nn.Dense(num_outputs))


In [ ]: net.collect_params().initialize(mx.init.Normal(sigma=.1), ctx=model_ctx)
3.19.2 Softmax cross-entropy loss

In [ ]: softmax_cross_entropy = gluon.loss.SoftmaxCrossEntropyLoss()
3.19.3 Optimizer
In [ ]: trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': .01})

output = net(data)
return acc.get()[1]
3.19.5 Training loop

In [ ]: epochs = 10
cumulative_loss = 0
output = net(data)
loss.backward()

(e, cumulative_loss/num_examples, train_accuracy, test_accuracy))
3.19.6 Conclusion
In this chapter, we showed two ways to build multilayer perceptrons with Gluon. We demonstrated how to
subclass gluon.Block, and define your own forward passes. We also showed how you might debug your
network by lacing your forward pass with print statements. Finally, we showed how you could define and
instantiate an equivalent network with just 6 lines of code by using gluon.nn.Sequential. Now that
you understand the basics, you’re ready to leap ahead. If you’re following the book in order, then the next
3.19. Faster modeling with gluon.nn.Sequential 131

stop will be dropout regularization. Other possible choices would be to start leanring about convolutional
neural networks which are especialy handy for working with images, or recurrent neural networks, which
are especially useful for natural language processing.
3.19.7 Next
Dropout regularization from scratch
3.20 Dropout regularization from scratch

If you’re reading the tutorials in sequence, then you might remember from Part 2 that machine learning
models can be susceptible to overfitting. To recap: in machine learning, our goal is to discover general
patterns. For example, we might want to learn an association between genetic markers and the development
of dementia in adulthood. Our hope would be to uncover a pattern that could be applied successfully to
assess risk for the entire population.
However, when we train models, we don’t have access to the entire population (or current or potential
humans). Instead, we can access only a small, finite sample. Even in a large hospital system, we might get
hundreds of thousands of medical records. Given such a finite sample size, it’s possible to uncover spurious
associations that don’t hold up for unseen data.
Let’s consider an extreme pathological case. Imagine that you want to learn to predict which people will
repay their loans. A lender hires you as a data scientist to investigate the case and gives you complete files
on 100 applicants, of which 5 defaulted on their loans within 3 years. The files might include hundreds
of features including income, occupation, credit score, length of employment etcetera. Imagine that they
additionally give you video footage of their interview with a lending agent. That might seem like a lot of
data!
Now suppose that after generating an enormous set of features, you discover that of the 5 applicants who
defaults, all 5 were wearing blue shirts during their interviews, while only 40% of general population wore
blue shirts. There’s a good chance that any model you train would pick up on this signal and use it as an
important part of its learned pattern.
Even if defaulters are no more likely to wear blue shirts, there’s a 1% chance that we’ll observe all five
defaulters wearing blue shirts. And keeping the sample size low while we have hundreds or thousands of
features, we may observe a large number of spurious correlations. Given trillions of training examples, these
false associations might disappear. But we seldom have that luxury.
The phenomena of fitting our training distribution more closely than the real distribution is called overfit-
ting, and the techniques used to combat overfitting are called regularization. In the previous chapter, we
introduced one classical approach to regularize statistical models. We penalized the size (the ℓ2 norm) of the
weights, coercing them to take smaller values. In probabilistic terms we might say this imposes a Gaussian
prior on the value of the weights. But in more intuitive, functional terms, we can say this encourages the
model to spread out its weights among many features and not to depend too much on a small number of
potentially spurious associations.

3.20.1 With great flexibility comes overfitting liability

Given many more features than examples, linear models can overfit. But when there are many more exam-
ples than features, linear models can usually be counted on not to overfit. Unfortunately this propensity to
generalize well comes at a cost. For every feature, a linear model has to assign it either positive or negative
weight. Linear models can’t take into account nuanced interactions between features. In more formal texts,
you’ll see this phenomena discussed as the bias-variance tradeoff. Linear models have high bias, (they can
only represent a small class of functions), but low variance (they give similar results across different random
samples of the data). [point to more formal discussion of generalization when chapter exists]
Deep neural networks, however, occupy the opposite end of the bias-variance spectrum. Neural networks
are so flexible because they aren’t confined to looking at each feature individually. Instead, they can learn
complex interactions among groups of features. For example, they might infer that “Nigeria” and “Western
Union” appearing together in an email indicates spam but that “Nigeria” without “Western Union” does not
connote spam.
Even for a small number of features, deep neural networks are capable of overfitting. As one demonstration
of the incredible flexibility of neural networks, researchers showed that neural networks perfectly classify
randomly labeled data. Let’s think about what means. If the labels are assigned uniformly at random, and
there are 10 classes, then no classifier can get better than 10% accuracy on holdout data. Yet even in these
situations, when there is no true pattern to be learned, neural networks can perfectly fit the training labels.
3.20.2 Dropping out activations

In 2012, Professor Geoffrey Hinton and his students including Nitish Srivastava introduced a new idea for
how to regularize neural network models. The intuition goes something like this. When a neural network
overfits badly to training data, each layer depends too heavily on the exact configuration of features in the
previous layer.
To prevent the neural network from depending too much on any exact activation pathway, Hinton and Sri-
vastava proposed randomly dropping out (i.e. setting to 0) the hidden nodes in every layer with probability
.5. Given a network with 𝑛 nodes we are sampling uniformly at random from the 2𝑛 networks in which a
subset of the nodes are turned off.
One intuition here is that because the nodes to drop out are chosen randomly on every pass, the representa-
tions in each layer can’t depend on the exact values taken by nodes in the previous layer.
3.20.3 Making predictions with dropout models

However, when it comes time to make predictions, we want to use the full representational power of our
model. In other words, we don’t want to drop out activations at test time. One principled way to justify the
use of all nodes simultaneously, despite not training in this fashion, is that it’s a form of model averaging.
At each layer we average the representations of all of the 2𝑛 dropout networks. Because each node has a .5
probability of being on during training, its vote is scaled by .5 when we use all nodes at prediction time
import mxnet as mx
import numpy as np
mx.random.seed(1)
ctx = mx.gpu() if mx.test_utils.list_gpus() else mx.cpu()
3.20. Dropout regularization from scratch 133



Let’s go ahead and grab our data.
[SWITCH TO CIFAR TO GET BETTER FEEL FOR REGULARIZATION]
batch_size = 64
train_data = gluon.data.DataLoader(gluon.data.vision.MNIST(train=True, transform=tr
test_data = gluon.data.DataLoader(gluon.data.vision.MNIST(train=False, transform=tr
In [ ]: W1 = nd.random_normal(shape=(784,256), ctx=ctx) *.01
b1 = nd.random_normal(shape=256, ctx=ctx) * .01
W2 = nd.random_normal(shape=(256,128), ctx=ctx) *.01

b2 = nd.random_normal(shape=128, ctx=ctx) * .01
W3 = nd.random_normal(shape=(128,10), ctx=ctx) *.01

b3 = nd.random_normal(shape=10, ctx=ctx) *.01
params = [W1, b1, W2, b2, W3, b3]
Again, let’s allocate space for gradients.

param.attach_grad()

If we compose a multi-layer network but use only linear operations, then our entire network will still be a
linear function. That’s because $ = X ·𝑊 _1 · 𝑊 _2 · 𝑊 _2 = 𝑋 · 𝑊 _4$𝑓 𝑜𝑟𝑊4 = 𝑊1 · 𝑊2 · 𝑊 3. To give
our model the capacity to capture nonlinear functions, we’ll need to interleave our linear operations with
activation functions. In this case, we’ll use the rectified linear unit (ReLU):
return nd.maximum(X, 0)
3.20.6 Dropout
In [ ]: def dropout(X, drop_probability):
keep_probability = 1 - drop_probability
mask = nd.random_uniform(0, 1.0, X.shape, ctx=X.context) < keep_probability
#############################
# Avoid division by 0 when scaling
#############################
if keep_probability > 0.0:
scale = (1/keep_probability)
else:
scale = 0.0
return mask * X * scale

In [ ]: A = nd.arange(20).reshape((5,4))
dropout(A, 0.0)
In [ ]: dropout(A, 0.5)
In [ ]: dropout(A, 1.0)

partition = nd.nansum(exp, axis=0, exclude=True).reshape((-1,1))


In [ ]: def net(X, drop_prob=0.0):
#######################
# Compute the first hidden layer
#######################
h1_linear = nd.dot(X, W1) + b1
h1 = dropout(h1, drop_prob)
#######################
# Compute the second hidden layer
#######################
h2 = dropout(h2, drop_prob)
#######################
# Compute the output layer.
# We will omit the softmax function here
# because it will be applied
# in the softmax_cross_entropy loss
#######################
return yhat_linear
3.20.10 Optimizer


numerator = 0.
denominator = 0.
output = net(data)

In [ ]: epochs = 10
moving_loss = 0.
################################
# Drop out 50% of hidden activations on the forward pass
################################
output = net(data, drop_prob=.5)
loss.backward()
##########################
##########################
if i == 0:
moving_loss = nd.mean(loss).asscalar()
else:

print("Epoch %s. Loss: %s, Train_acc %s, Test_acc %s" % (e, moving_loss, train_
3.20.13 Conclusion
Nice. With just two hidden layers containing 256 and 128 hidden nodes, respectively, we can achieve over
95% accuracy on this task.
3.20.14 Next
Dropout regularization with gluon

3.21 Dropout regularization with gluon

In the previous chapter, we introduced Dropout regularization, implementing the algorithm from scratch. As
a reminder, Dropout is a regularization technique that zeroes out some fraction of the nodes during training.
Then at test time, we use all of the nodes, but scale down their values, essentially averaging the various
dropped out nets. If you’re approaching this chapter out of sequence, and aren’t sure how Dropout works,
it’s best to take a look at the implementation by hand since gluon will manage the low-level details for us.
Dropout is a special kind of layer because it behaves differently when training and predicting. We’ve already
seen how gluon can keep track of when to record vs not record the computation graph. Since this is a
gluon implementation chapter, let’s get into the thick of things by importing our dependencies and some
toy data.
import mxnet as mx
import numpy as np

num_inputs = 784
num_outputs = 10

Now we can add Dropout following each of our hidden layers. Setting the dropout probability to .6 would
mean that 60% of activations are dropped (set to zero) out and 40% are kept.
###########################
# Adding first hidden layer
###########################
###########################
# Adding dropout with rate .5 to the first hidden layer
###########################
net.add(gluon.nn.Dropout(.5))
###########################
# Adding second hidden layer
###########################

###########################
# Adding dropout with rate .5 to the second hidden layer
###########################
net.add(gluon.nn.Dropout(.5))
###########################
# Adding the output layer
###########################

Now that we’ve got an MLP with dropout layers, let’s register an initializer so we can play with some data.
3.21.4 Train mode and predict mode

Let’s grab some data and pass it through the network. To see what effect dropout is having on our predictions,
it’s instructive to pass the same example through our net multiple times.
In [ ]: for x, _ in train_data:
x = x.as_in_context(ctx)
break
print(net(x[0:1]))
print(net(x[0:1]))
Note that we got the exact same answer on both forward passes through the net! That’s because by, default,
mxnet assumes that we are in predict mode. We can explicitly invoke this scope by placing code within a
with autograd.predict_mode(): block.
In [ ]: with autograd.predict_mode():
print(net(x[0:1]))
print(net(x[0:1]))
Unless something’s gone horribly wrong, you should see the same result as before. We can also run the code
in train mode. This tells MXNet to run our Blocks as they would run during training.
In [ ]: with autograd.train_mode():
print(net(x[0:1]))
print(net(x[0:1]))
3.21.5 Accessing is_training() status

You might wonder, how precisely do the Blocks determine whether they should run in train mode or
predict mode? Basically, autograd maintains a Boolean state that can be accessed via autograd.
is_training(). By default this value is False in the global scope. This way if someone just wants
to make predictions and doesn’t know anything about training models, everything will just work. When we
enter a train_mode() block, we create a scope in which is_training() returns True.
In [ ]: with autograd.predict_mode():
print(autograd.is_training())
3.21. Dropout regularization with gluon 139

with autograd.train_mode():
print(autograd.is_training())
3.21.6 Integration with autograd.record

When we train neural network models, we nearly always enter record() blocks. The purpose of
record() is to build the computational graph. And the purpose of train is to indicate that we are
training our model. These two are highly correlated but should not be confused. For example, when we
generate adversarial examples (a topic we’ll investigate later) we may want to record, but for the model to
behave as in predict mode. On the other hand, sometimes, even when we’re not recording, we still want to
evaluate the model’s training behavior.
A problem then arises. Since record() and train_mode() are distinct, how do we avoid having to
declare two scopes every time we train the model?
In [ ]: ##########################
# Writing this every time could get cumbersome
##########################
with autograd.train_mode():
yhat = net(x)
To make our lives a little easier, record() takes one argument, train_mode, which has a default value of
True. So when we turn on autograd, this by default turns on train_mode (with autograd.record()
is equivalent to with autograd.record(train_mode=True):). To change this default behav-
ior (as when generating adversarial examples), we can optionally call record via (with autograd.
record(train_mode=False):).

3.21.8 Optimizer

data = data.as_in_context(ctx).reshape((-1, 784))
output = net(data)
return acc.get()[1]

In [ ]: epochs = 10

output = net(data)
loss.backward()
##########################
##########################
curr_loss = nd.mean(loss).asscalar()
moving_loss = (curr_loss if ((i == 0) and (e == 0))
else (1 - smoothing_constant) * moving_loss + (smoothing_con

(e, moving_loss, train_accuracy, test_accuracy))
3.21.11 Conclusion
Now let’s take a look at how to build convolutional neural networks.
3.21.12 Next
Introduction to ‘‘gluon.Block‘ and gluon.nn.Sequential <../chapter03_deep-neural-
networks/plumbing.ipynb>‘__
3.22 Plumbing: A look under the hood of gluon

In the previous tutorials, we taught you about linear regression and softmax regression. We explained how
these models work in principle, showed you how to implement them from scratch, and presented a compact
implementation using gluon. We explained how to do things in gluon but didn’t really explain how they
work. We relied on nn.Sequential, syntactically convenient shorthand for nn.Block but didn’t peek
under the hood. And while each notebook presented a working, trained model, we didn’t show you how
to inspect its parameters, save and load models, etc. In this chapter, we’ll take a break from modeling to
explore the gory details of mxnet.gluon.
3.22.1 Load up the data

First, let’s get the preliminaries out of the way.
import mxnet as mx
import numpy as np
from mxnet.gluon import nn, Block
3.22. Plumbing: A look under the hood of gluon 141

###########################
# Specify the context we'll be using
###########################
###########################
# Load up our dataset
###########################
batch_size = 64
3.22.2 Composing networks with gluon.Block

Now you might remember that up until now, we’ve defined neural networks (for example, a multilayer
perceptron) like this:
In [ ]: net1 = gluon.nn.Sequential()
with net1.name_scope():
net1.add(gluon.nn.Dense(128, activation="relu"))
net1.add(gluon.nn.Dense(10))
This is a convenient shorthand that allows us to express a neural network compactly. When we want to build
simple networks, this saves us a lot of time. But both (i) to understand how nn.Sequential works, and
(ii) to compose more complex architectures, you’ll want to understand gluon.Block.
Let’s take a look at how the same model would be expressed with gluon.Block.
In [ ]: class MLP(Block):
self.dense0 = nn.Dense(128)

return self.dense2(x)
Now that we’ve defined a class for MLPs, we can go ahead and instantiate one:
In [ ]: net2 = MLP()
And initialize its parameters:

In [ ]: net2.initialize(ctx=ctx)

At this point we can pass data through the network by calling it like a function, just as we have in the
previous tutorials.
In [ ]: for data, _ in train_data:
data = data.as_in_context(ctx)
break
net2(data[0:1])
3.22.3 Calling Block as a function

Notice that MLP is a class and thus its instantiation, net2, is an object. If you’re a casual Python user, you
might be surprised to see that we can call an object as a function. This is a syntactic convenience owing to
Python’s __call__ method. Basically, gluon.Block.__call__(x) is defined so that net(data)
behaves identically to net.forward(data). Since passing data through models is so fundamental and
common, you’ll be glad to save these 8 characters many times per day.
3.22.4 So what is a Block?

In gluon, a Block is a generic component in a neural network. The entire network is a Block, each layer
is a Block, and we can even have repeating sequences of layers that form an intermediate Block.
This might sounds confusing, so let’s make it simple. Each neural network has to do the following things: 1.
Store parameters 2. Accept inputs 3. Produce outputs (the forward pass) 4. Take derivatives (the backward
pass)
This can be said for the network as a whole, but it can also be said of each individual layer. A single fully-
connected layer is parameterized by a weight matrix and a bias vector, produces outputs from inputs, and,
given the derivative of some objective with respect to its outputs, can calculate the derivative with respect to
its inputs.
Fortunately, MXNet can take derivatives automatically. So we only have to define the forward pass
(forward(self, x)). Then, using mxnet.autograd, gluon can handle the backward pass. This
is quite a powerful interface. For example we could define the forward pass for some component to take
multiple inputs and combine them in arbitrary ways. We can even compose the forward() function such
that it throws together a different architecture on the fly depending on some conditions that we could specify
in Python. As long as the result is an NDArray, we’re in the clear.
3.22.5 What’s the deal with name_scope()?

The next thing you might have noticed is that we added all of our layers inside a with net1.
name_scope(): block. This coerces gluon to give each parameter an appropriate name, indicating
which model it belongs to, e.g. sequential8_dense2_weight. Keeping these names straight makes
our lives much easier once we start writing more complex code where we might be working with multiple
models and saving and loading the parameters of each. It helps us to make sure that we associate each
weight with the right model.
3.22.6 Demystifying nn.Sequential

So Sequential is basically a way of throwing together a Block on the fly. Let’s revisit the Sequential
version of our multilayer perceptron.
3.22. Plumbing: A look under the hood of gluon 143

net1.add(gluon.nn.Dense(10))
In just 5 lines and 183 characters, we defined a multilayer perceptron with three fully-connected layers, each
parametrized by weight matrix and bias term. We also specified the ReLU activation function for the hidden
layers.
Sequential itself subclasses Block and maintains a list of _children. Then, every time we call net1.
add(...) our net simply registers a new child. We can actually pass in an arbitrary Block, even layers
that we write ourselves.
When we call forward on a Sequential, it executes the following code:

for block in self._children:
x = block(x)
return x
Basically, it calls each child on the output of the previous one, returning the final output at the end of the
chain.
3.22.7 Shape inference

One of the first things you might notice is that for each layer, we only specified the number of nodes output,
we never specified how many input nodes! You might wonder, how does gluon know that the first weight
matrix should be 784 × 128 and not 42 × 128. In fact it doesn’t. We can see this by accessing the network’s
parameters.
In [ ]: print(net1.collect_params())
Take a look at the shapes of the weight matrices: (128,0), (64, 0), (10, 0). What does it mean to have zero
dimension in a matrix? This is gluon’s way of marking that the shape of these matrices is not yet known.
The shape will be inferred on the fly once the network is provided with some input.
So when we initialize our parameters, you might wonder, what precisely is happening?
In [ ]: net1.collect_params().initialize(mx.init.Xavier(magnitude=2.24), ctx=ctx)
In this situation, gluon is not actually initializing any parameters! Instead, it’s making a note of which
initializer to associate with each parameter, even though its shape is not yet known. The parameters are
instantiated and the initializer is called once we provide the network with some input.
In [ ]: net1(data)
print(net1.collect_params())
This shape inference can be extremely useful at times. For example, when working with convnets, it can be
quite a pain to calculate the shape of various hidden layers. It depends on both the kernel size, the number of
filters, the stride, and the precise padding scheme used which can vary in subtle ways from library to library.

3.22.8 Specifying shape manually

If we want to specify the shape manually, that’s always an option. We accomplish this by using the
in_units argument when adding each layer.
net2.add(gluon.nn.Dense(128, in_units=784, activation="relu"))
net2.add(gluon.nn.Dense(64, in_units=128, activation="relu"))
net2.add(gluon.nn.Dense(10, in_units=64))
Note that the parameters from this network can be initialized before we see any real data.
print(net2.collect_params())
3.22.9 Next
Writing custom layers with ‘‘gluon.Block‘ <../chapter03_deep-neural-networks/custom-layer.ipynb>‘__
3.23 Designing a custom layer with gluon

Now that we’ve peeled back some of the syntactic sugar conferred by nn.Sequential() and given you
a feeling for how gluon works under the hood, you might feel more comfortable when writing your high-
level code. But the real reason to get to know gluon more intimately is so that we can mess around with it
and write our own Blocks.
Up until now, we’ve presented two versions of each tutorial. One from scratch and one in gluon. Empow-
ered with such independence, you might be wondering, “if I wanted to write my own layer, why wouldn’t I
just do it from scratch?”
In reality, writing every model completely from scratch can be cumbersome. Just like there’s only so many
times a developer can code up a blog from scratch without hating life, there’s only so many times that you’ll
want to write out a convolutional layer, or define the stochastic gradient descent updates. Even in a pure
research environment, we usually want to customize one part of the model. For example, we might want to
implement a new layer, but still rely on other common layers, loss functions, optimizers, etc. In some cases
it might be nontrivial to compute the gradient efficiently and the automatic differentiation subsystem might
need some help: When was the last time you performed backprop through a log-determinant, a Cholesky
factorization, or a matrix exponential? In other cases things might not be numerically very stable when
calculated straightforwardly (e.g. taking logs of exponentials of some arguments).
By hacking gluon, we can get the desired flexibility in one part of our model, without screwing up every-
thing else that makes our life easy.
import mxnet as mx
import numpy as np
from mxnet.gluon import nn, Block
mx.random.seed(1)
###########################
3.23. Designing a custom layer with gluon 145

# Speficy the context we'll be using

###########################
###########################
# Load up our dataset
###########################
batch_size = 64
3.23.1 Defining a (toy) custom layer

To start, let’s pretend that we want to use gluon for its optimizer, serialization, etc, but that we need a new
layer. Specifically, we want a layer that centers its input about 0 by subtracting its mean. We’ll go ahead
and define the simplest possible Block. Remember from the last tutorial that in gluon a layer is called a
Block (after all, we might compose multiple blocks into a larger block, etc.).
In [ ]: class CenteredLayer(Block):
super(CenteredLayer, self).__init__(**kwargs)

return x - nd.mean(x)
That’s it. We can just instantiate this block and make a forward pass. Note that this layer doesn’t actually
care what its input or output dimensions are. So we can just feed in an arbitrary array and should expect
appropriately transformed output. Whenever we are happy with whatever the automatic differentiation gen-
erates, this is all we need.
In [ ]: net = CenteredLayer()
net(nd.array([1,2,3,4,5]))
We can also incorporate this layer into a more complicated network, such as by using nn.Sequential().
In [ ]: net2 = nn.Sequential()
net2.add(nn.Dense(128))
net2.add(nn.Dense(10))
net2.add(CenteredLayer())
This network contains Blocks (Dense) that contain parameters and thus require initialization
Now we can pass some data through it, say the first image from our MNIST dataset.
break
output = net2(data[0:1])
print(output)
And we can verify that as expected, the resulting vector has mean 0.

In [ ]: nd.mean(output)
There’s a good chance you’ll see something other than 0. When I ran this code, I got 2.68220894e-08.
That’s roughly .000000027. This is due to the fact that MXNet often uses low precision arithmetics. For
deep learning research, this is often a compromise that we make. In exchange for giving up a few significant
digits, we get tremendous speedups on modern hardware. And it turns out that most deep learning algorithms
don’t suffer too much from the loss of precision.
3.23.2 Custom layers with parameters

While CenteredLayer should give you some sense of how to implement a custom layer, it’s missing a
few important pieces. Most importantly, CenteredLayer doesn’t care about the dimensions of its input
or output, and it doesn’t contain any trainable parameters. Since you already know how to implement a
fully-connected layer from scratch, let’s learn how to make parametric Block by implementing MyDense,
our own version of a fully-connected (Dense) layer.
3.23.3 Parameters
Before we can add parameters to our custom Block, we should get to know how gluon deals with param-
eters generally. Instead of working with NDArrays directly, each Block is associated with some number
(as few as zero) of Parameter (groups).
At a high level, you can think of a Parameter as a wrapper on an NDArray. However, the Parameter
can be instantiated before the corresponding NDArray is. For example, when we instantiate a Block but
the shapes of each parameter still need to be inferred, the Parameter will wait for the shape to be inferred
before allocating memory.
To get a hands-on feel for mxnet.Parameter, let’s just instantiate one outside of a Block:
In [ ]: my_param = gluon.Parameter("exciting_parameter_yay", grad_req='write', shape=(5,5))
print(my_param)
Here we’ve instantiated a parameter, giving it the name “exciting_parameter_yay”. We’ve also specified
that we’ll want to capture gradients for this Parameter. Under the hood, that lets gluon know that it has
to call .attach_grad() on the underlying NDArray. We also specified the shape. Now that we have a
Parameter, we can initialize its values via .initialize() and extract its data by calling .data().
In [ ]: my_param.initialize(mx.init.Xavier(magnitude=2.24), ctx=ctx)
print(my_param.data())
For data parallelism, a Parameter can also be initialized on multiple contexts. The Parameter will then keep
a copy of its value on each context. Keep in mind that you need to maintain consistency among the copies
when updating the Parameter (usually gluon.Trainer does this for you).
Note that you need at least two GPUs to run this section.
In [ ]: if len(mx.test_utils.list_gpus()) >= 2:
my_param = gluon.Parameter("exciting_parameter_yay", grad_req='write', shape=(5
my_param.initialize(mx.init.Xavier(magnitude=2.24), ctx=[mx.gpu(0), mx.gpu(1)])
print(my_param.data(mx.gpu(0)), my_param.data(mx.gpu(1)))

3.23.4 Parameter dictionaries (introducing ParameterDict)

Rather than directly store references to each of its Parameters, Blocks typicaly contain a parameter
dictionary (ParameterDict). In practice, we’ll rarely instantiate our own ParameterDict. That’s
because whenever we call the Block constructor it’s generated automatically. For pedagogical purposes,
we’ll do it from scratch this one time.
In [ ]: pd = gluon.ParameterDict(prefix="block1_")
MXNet’s ParameterDict does a few cool things for us. First, we can instantiate a new Parameter by
calling pd.get()
In [ ]: pd.get("exciting_parameter_yay", grad_req='write', shape=(5,5))
Note that the new parameter is (i) contained in the ParameterDict and (ii) appends the prefix to its name.
This naming convention helps us to know which parameters belong to which Block or sub-Block. It’s
especially useful when we want to write parameters to disc (i.e. serialize), or read them from disc (i.e.
deserialize).
Like a regular Python dictionary, we can get the names of all parameters with .keys() and can access
parameters with:
In [ ]: pd["block1_exciting_parameter_yay"]
3.23.5 Craft a bespoke fully-connected gluon layer

Now that we know how parameters work, we’re ready to create our very own fully-connected layer. We’ll
use the familiar relu activation from previous tutorials.
Now we can define our Block.

In [ ]: class MyDense(Block):
####################
# We add arguments to our constructor (__init__)
# to indicate the number of input units (``in_units``)
# and output units (``units``)
####################
def __init__(self, units, in_units=0, **kwargs):
super(MyDense, self).__init__(**kwargs)
self.units = units
self._in_units = in_units
#################
# We add the required parameters to the ``Block``'s ParameterDict ,
# indicating the desired shape
#################
self.weight = self.params.get(
'weight', init=mx.init.Xavier(magnitude=2.24),
shape=(in_units, units))
self.bias = self.params.get('bias', shape=(units,))
#################
# Now we just have to write the forward pass.

# We could rely upong the FullyConnected primitive in NDArray,

# but it's better to get our hands dirty and write it out
# so you'll know how to compose arbitrary functions
#################
with x.context:
linear = nd.dot(x, self.weight.data()) + self.bias.data()
activation = relu(linear)
return activation
Recall that every Block can be run just as if it were an entire network. In fact, linear models are nothing
more than neural networks consisting of a single layer as a network.
So let’s go ahead and run some data through our bespoke layer. We’ll want to first instantiate the layer and
initialize its parameters.
In [ ]: dense = MyDense(20, in_units=10)
dense.collect_params().initialize(ctx=ctx)
In [ ]: dense.params
Now we can run through some dummy data.

In [ ]: dense(nd.ones(shape=(2,10)).as_in_context(ctx))
3.23.6 Using our layer to build an MLP

While it’s a good sanity check to run some data though the layer, the real proof that it works will be if we
can compose a network entirely out of MyDense layers and achieve respectable accuracy on a real task. So
we’ll revisit the MNIST digit classification task, and use the familiar nn.Sequential() syntax to build
our net.
In [ ]: net = gluon.nn.Sequential()
net.add(MyDense(128, in_units=784))
3.23.7 Initialize Parameters

In [ ]: net.collect_params().initialize(ctx=ctx)
3.23.8 Instantiate a loss

In [ ]: loss = gluon.loss.SoftmaxCrossEntropyLoss()
3.23.9 Optimizer

In [ ]: metric = mx.metric.Accuracy()
def evaluate_accuracy(data_iterator, net):

numerator = 0.

denominator = 0.

output = net(data)
metric.update([label], [output])
return metric.get()[1]

In [ ]: epochs = 2 # Low number for testing, set higher when you run!
moving_loss = 0.
output = net(data)

print("Epoch %s. Train_acc %s, Test_acc %s" % (e, train_accuracy, test_accuracy
3.23.12 Conclusion
It works! There’s a lot of other cool things you can do. In more advanced chapters, we’ll show how you
can make a layer that takes in multiple inputs, or one that cleverly calls down to MXNet’s symbolic API to
squeeze out extra performance without screwing up your convenient imperative workflow.
3.23.13 Next
Serialization: saving your models and parameters for later re-use
3.24 Serialization - saving, loading and checkpointing

At this point we’ve already covered quite a lot of ground. We know how to manipulate data and labels.
We know how to construct flexible models capable of expressing plausible hypotheses. We know how to fit
those models to our dataset. We know of loss functions to use for classification and for regression, and we
know how to minimize those losses with respect to our models’ parameters. We even know how to write our
own neural network layers in gluon.
But even with all this knowledge, we’re not ready to build a real machine learning system. That’s because

we haven’t yet covered how to save and load models. In reality, we often train a model on one device and
then want to run it to make predictions on many devices simultaneously. In order for our models to persist
beyond the execution of a single Python script, we need mechanisms to save and load NDArrays, gluon
Parameters, and models themselves.
import mxnet as mx
3.24.1 Saving and loading NDArrays

To start, let’s show how you can save and load a list of NDArrays for future use. Note that while it’s possible
to use a general Python serialization package like Pickle, it’s not optimized for use with NDArrays and
will be unnecessarily slow. We prefer to use ndarray.save and ndarray.load.
In [ ]: X = nd.ones((100, 100))
Y = nd.zeros((100, 100))
import os
dir_name = 'checkpoints'
if not os.path.exists(dir_name):
os.makedirs(dir_name)
filename = os.path.join(dir_name, "test1.params")

nd.save(filename, [X, Y])
It’s just as easy to load a saved NDArray.

In [ ]: A, B = nd.load(filename)
print(A)
print(B)
We can also save a dictionary where the keys are strings and the values are NDArrays.
In [ ]: mydict = {"X": X, "Y": Y}
filename = os.path.join(dir_name, "test2.params")
nd.save(filename, mydict)
In [ ]: C = nd.load(filename)
print(C)
3.24.2 Saving and loading the parameters of gluon models

Recall from our first look at the plumbing behind ‘‘gluon‘ blocks <P03.5-C01-plumbing.ipynb%5D>‘__ that
gluon wraps the NDArrays corresponding to model parameters in Parameter objects. We’ll often want
to store and load an entire model’s parameters without having to individually extract or load the NDarrays
from the Parameters via ParameterDicts in each block.
Fortunately, gluon blocks make our lives very easy by providing a .save_parameters() and .
load_parameters() methods. To see them in work, let’s just spin up a simple MLP.
num_outputs = 1
3.24. Serialization - saving, loading and checkpointing 151

Now, let’s initialize the parameters by attaching an initializer and actually passing in a datapoint to induce
shape inference.
In [ ]: net.collect_params().initialize(mx.init.Normal(sigma=1.), ctx=ctx)
net(nd.ones((1, 100), ctx=ctx))
So this randomly initialized model maps a 100-dimensional vector of all ones to the number 362.53 (that’s
the number on my machine–your mileage may vary). Let’s save the parameters, instantiate a new network,
load them in and make sure that we get the same result.
In [ ]: filename = os.path.join(dir_name, "testnet.params")
net.save_parameters(filename)
net2 = gluon.nn.Sequential()
net2.add(gluon.nn.Dense(num_hidden, activation="relu"))
net2.add(gluon.nn.Dense(num_hidden, activation="relu"))
net2.add(gluon.nn.Dense(num_outputs))
net2.load_parameters(filename, ctx=ctx)
net2(nd.ones((1, 100), ctx=ctx))
Great! Now we’re ready to save our work. The practice of saving models is sometimes called checkpointing
and it’s especially important for a number of reasons. 1. We can preserve and syndicate models that are
trained once. 2. Some models perform best (as determined on validation data) at some epoch in the middle
of training. If we checkpoint the model after each epoch, we can later select the best epoch. 3. We might
want to ask questions about our trained model that we didn’t think of when we first wrote the scripts for
our experiments. Having the parameters lying around allows us to examine our past work without having
to train from scratch. 4. Sometimes people might want to run our models who don’t know how to execute
training themselves or can’t access a suitable dataset for training. Checkpointing gives us a way to share our
work with others.
3.24.3 Next
Convolutional neural networks from scratch
3.25 Convolutional neural networks from scratch

Now let’s take a look at convolutional neural networks (CNNs), the models people really use for classifying
images.
import mxnet as mx
import numpy as np
ctx = mx.cpu()
# ctx = mx.gpu()
mx.random.seed(1)

3.25.1 MNIST data (last one, we promise!)

num_inputs = 784
num_outputs = 10
return nd.transpose(data.astype(np.float32), (2,0,1))/255, label.astype(np.floa
3.25.2 Convolutional neural networks (CNNs)

In the previous example, we connected the nodes of our neural networks in what seems like the simplest
possible way. Every node in each layer was connected to every node in the subsequent layers.
This can require a lot of parameters! If our input were a 256x256 color image (still quite small for a
photograph), and our network had 1,000 nodes in the first hidden layer, then our first weight matrix would
require (256x256x3)x1000 parameters. That’s nearly 200 million. Moreover the hidden layer would ignore
all the spatial structure in the input image even though we know the local structure represents a powerful
source of prior knowledge.
Convolutional neural networks incorporate convolutional layers. These layers associate each of their nodes
with a small window, called a receptive field, in the previous layer, instead of connecting to the full layer.
3.25. Convolutional neural networks from scratch 153

This allows us to first learn local features via transformations that are applied in the same way for the top
right corner as for the bottom left. Then we collect all this local information to predict global qualities of
the image (like whether or not it depicts a dog).
(Image credit: Stanford cs231n http://cs231n.github.io/assets/cnn/depthcol.jpeg)

In short, there are two new concepts you need to grok here. First, we’ll be introducting convolutional layers.
Second, we’ll be interleaving them with pooling layers.
3.25.3 Parameters
Each node in a convolutional layer is associated with a 3D block (height x width x channel) in the input
tensor. Moreover, the convolutional layer itself has multiple output channels. So the layer is parameterized
by a 4 dimensional weight tensor, commonly called a convolutional kernel.
The output tensor is produced by sliding the kernel across the input image skipping locations according to a
pre-defined stride (but we’ll just assume that to be 1 in this tutorial). Let’s initialize some such kernels from
scratch.
In [ ]: #######################
# Set the scale for weight initialization and choose
# the number of hidden units in the fully-connected layer
#######################
weight_scale = .01
num_fc = 128
num_filter_conv_layer1 = 20
num_filter_conv_layer2 = 50

W1 = nd.random_normal(shape=(num_filter_conv_layer1, 1, 3,3), scale=weight_scale, c

b1 = nd.random_normal(shape=num_filter_conv_layer1, scale=weight_scale, ctx=ctx)
W2 = nd.random_normal(shape=(num_filter_conv_layer2, num_filter_conv_layer1, 5, 5),

scale=weight_scale, ctx=ctx)
b2 = nd.random_normal(shape=num_filter_conv_layer2, scale=weight_scale, ctx=ctx)
W3 = nd.random_normal(shape=(800, num_fc), scale=weight_scale, ctx=ctx)

b3 = nd.random_normal(shape=num_fc, scale=weight_scale, ctx=ctx)
W4 = nd.random_normal(shape=(num_fc, num_outputs), scale=weight_scale, ctx=ctx)

b4 = nd.random_normal(shape=num_outputs, scale=weight_scale, ctx=ctx)
params = [W1, b1, W2, b2, W3, b3, W4, b4]
And assign space for gradients

param.attach_grad()
3.25.4 Convolving with MXNet’s NDArrray

To write a convolution when using raw MXNet, we use the function nd.Convolution(). This function
takes a few important arguments: inputs (data), a 4D weight matrix (weight), a bias (bias), the shape
of the kernel (kernel), and a number of filters (num_filter).
break
conv = nd.Convolution(data=data, weight=W1, bias=b1, kernel=(3,3), num_filter=num_f
print(conv.shape)
Note the shape. The number of examples (64) remains unchanged. The number of channels (also called
filters) has increased to 20. And because the (3,3) kernel can only be applied in 26 different heights and
widths (without the kernel busting over the image border), our output is 26,26. There are some weird
padding tricks we can use when we want the input and output to have the same height and width dimensions,
but we won’t get into that now.
3.25.5 Average pooling

The other new component of this model is the pooling layer. Pooling gives us a way to downsample in
the spatial dimensions. Early convnets typically used average pooling, but max pooling tends to give better
results.
In [ ]: pool = nd.Pooling(data=conv, pool_type="max", kernel=(2,2), stride=(2,2))
print(pool.shape)
Note that the batch and channel components of the shape are unchanged but that the height and width have
been downsampled from (26,26) to (13,13).
3.25.6 Activation function

return nd.maximum(X,nd.zeros_like(X))


partition = nd.sum(exp, axis=0, exclude=True).reshape((-1,1))


In [ ]: def net(X, debug=False):
########################
# Define the computation of the first convolutional layer
########################
h1_conv = nd.Convolution(data=X, weight=W1, bias=b1, kernel=(3,3),
num_filter=num_filter_conv_layer1)
h1_activation = relu(h1_conv)
h1 = nd.Pooling(data=h1_activation, pool_type="avg", kernel=(2,2), stride=(2,2)
if debug:
print("h1 shape: %s" % (np.array(h1.shape)))
########################
# Define the computation of the second convolutional layer
########################
h2_conv = nd.Convolution(data=h1, weight=W2, bias=b2, kernel=(5,5),
num_filter=num_filter_conv_layer2)
h2_activation = relu(h2_conv)
if debug:
########################
# Flattening h2 so that we can feed it into a fully-connected layer
########################
h2 = nd.flatten(h2)
if debug:
print("Flat h2 shape: %s" % (np.array(h2.shape)))
########################
# Define the computation of the third (fully-connected) layer
########################
if debug:
########################
# Define the computation of the output layer
########################


if debug:
print("yhat_linear shape: %s" % (np.array(yhat_linear.shape)))
return yhat_linear
3.25.10 Test run

We can now print out the shapes of the activations at each layer by using the debug flag.
In [ ]: output = net(data, debug=True)
3.25.11 Optimizer

numerator = 0.
denominator = 0.
output = net(data)
3.25.13 The training loop

In [ ]: epochs = 1
learning_rate = .01
label_one_hot = nd.one_hot(label, num_outputs)
output = net(data)
loss.backward()
##########################
##########################


3.25.14 Conclusion
Contained in this example are nearly all the important ideas you’ll need to start attacking problems in
computer vision. While state-of-the-art vision systems incorporate a few more bells and whistles, they’re
all built on this foundation. Believe it or not, if you knew just the content in this tutorial 5 years ago,
you could probably have sold a startup to a Fortune 500 company for millions of dollars. Fortunately (or
unfortunately?), the world has gotten marginally more sophisticated, so we’ll have to come up with some
more sophisticated tutorials to follow.
3.25.15 Next
Convolutional neural networks with gluon
3.26 Convolutional Neural Networks in gluon

Now let’s see how succinctly we can express a convolutional neural network using gluon. You might be
relieved to find out that this too requires hardly any more code than logistic regression.
import numpy as np
import mxnet as mx
mx.random.seed(1)

In [ ]: # ctx = mx.gpu()
ctx = mx.cpu()
3.26.2 Grab the MNIST dataset

num_inputs = 784
num_outputs = 10

3.26.3 Define a convolutional neural network

Again, a few lines here is all we need in order to change the model. Let’s add a couple of convolutional
layers using gluon.nn.
In [ ]: num_fc = 512
net.add(gluon.nn.Conv2D(channels=20, kernel_size=5, activation='relu'))
net.add(gluon.nn.MaxPool2D(pool_size=2, strides=2))
# The Flatten layer collapses all axis, except the first one, into one axis.
net.add(gluon.nn.Flatten())
net.add(gluon.nn.Dense(num_fc, activation="relu"))

3.26.5 Softmax cross-entropy Loss

3.26.6 Optimizer

output = net(data)
return acc.get()[1]
3.26.8 Training Loop

In [ ]: epochs = 1
output = net(data)
loss.backward()
3.26. Convolutional Neural Networks in gluon 159

##########################
##########################
else (1 - smoothing_constant) * moving_loss + smoothing_cons

3.26.9 Conclusion
You might notice that by using gluon, we get code that runs much faster whether on CPU or GPU. That’s
largely because gluon can call down to highly optimized layers that have been written in C++.
3.26.10 Next
Deep convolutional networks (AlexNet)
3.27 Deep convolutional neural networks

In the previous chapters, you got a sense of how to classify images with convolutional neural network
(CNNs). Specifically, we implemented a CNN with two convolutional layers interleaved with pooling layers,
a singly fully-connected hidden layer, and a softmax output layer. That architecture loosely resembles
a neural network affectionately named LeNet, in honor Yann LeCun, an early pioneer of convolutional
neural networks and the first to reduce them to practice in 1989 by training them with gradient descent (i.e.
backpropagation). At the time, this was a fairly novel idea. A cadre of researchers interested in biologically-
inspired learning models had taken to investigating artificial simulations of neurons as learning models.
However, as remains true to this day, few researchers believed that real brains learn by gradient descent. The
community of neural networks researchers had explored many other learning rules. LeCun demonstrated
that CNNs trained by gradient descent, could get state-of-the-art results on the task of recognizing hand-
written digits. These groundbreaking results put CNNs on the map.
However, in the intervening years, neural networks were superseded by numerous other methods. Neural
networks were considered slow to train, and there wasn’t wide consensus on whether it was possible to train
very deep neural networks from a random initialization of the weights. Moreover, training networks with
many channels, layers, and parameters required excessive computation relative to the resources available
decades ago. While it was possible to train a LeNet for MNIST digit classification and get good scores,
neural networks fell out of favor on larger, real-world datasets.
Instead researchers precomputed features based on a mixture of elbow grease, knowledge of optics, and
black magic. A typical pattern was this: 1. Grab a cool dataset 2. Preprocess it with giant bag of predeter-
mined feature functions 3. Dump the representations into a simple linear model to do the machine learning
part.
This was the state of affairs in computer vision up until 2012, just before deep learning began to change the

world of applied machine learning. One of us (Zack) entered graduate school in 2013. A friend in graduate
school summarized the state of affairs thus:
If you spoke to machine learning researchers, they believed that machine learning was both important and
beautiful. Elegant theories proved the properties of various classifiers. The field of machine learning was
thriving, rigorous and eminently useful. However, if you spoke to a computer vision researcher, you’d hear a
very different story. The dirty truth of image recognition, they’d tell you, is that the really important aspects
of the ML for CV pipeline were data and features. A slightly cleaner dataset, or a slightly better hand-tuned
feature mattered a lot to the final accuracy. However, the specific choice of classifier was little more than an
afterthought. At the end of the day you could throw your features in a logistic regression model, a support
vector machine, or any other classifier of choice, and they would all perform roughly the same.
3.27.1 Learning the representations

Another way to cast the state of affairs is that the most important part of the pipeline was the representation.
And up until 2012, this part was done mechanically, based on some hard-fought intuition. In fact, engineer-
ing a new set of feature functions, improving results, and writing up the method was a prominent genre of
paper.
Another group of researchers had other plans. They believed that features themselves ought to be learned.
Moreover they believed that to be reasonably complex, the features ought to be hierarchically composed.
These researchers, including Yann LeCun, Geoff Hinton, Yoshua Bengio, Andrew Ng, Shun-ichi Amari, and
Juergen Schmidhuber believed that by jointly training many layers of a neural network, they might come to
learn hierarchical representations of data. In the case of an image, the lowest layers might come to detect
edges, colors, and textures.
3.27. Deep convolutional neural networks 161

Higher layers might build upon these representations to represent larger structures, like eyes, noses, blades of
grass, and features. Yet higher layers might represent whole objects like people, airplanes, dogs, or frisbees.
And ultimately, before the classification layer, the final hidden state might represent a compact representa-
tion of the image that summarized the contents in a space where data belonging to different categories would
be linearly separable.
3.27.2 Missing ingredient 1: data

Despite the sustained interest of a committed group of researchers in learning deep representations of visual
data, for a long time these ambitions were frustrated. The failure to make progress was owed to a few
factors. First, while this wasn’t yet known, supervised deep models with many representation require large
amounts of labeled training data in order to outperform classical approaches. However, given the limited
storage capacity of computers and the comparatively tighter research budgets in the 1990s and prior, most
research relied on tiny datasets. For example, many credible research papers relied on a small set of corpora
hosted by UCI, many of which contained hundreds or a few thousand images.
This changed in a big way when Fei-Fei Li presented the ImageNet database in 2009. The ImageNet
dataset dwarfed all previous research datasets. It contained one million images: one thousand each from one
thousand distinct classes.
This dataset pushed both computer vision and machine learning research into a new regime where the
previous best methods would no longer dominate.
3.27.3 Missing ingredient 2: hardware

Deep Learning has a voracious need for computation. This is one of the main reasons why in the 90s and
early 2000s algorithms based on convex optimization were the preferred way of solving problems. After all,
convex algorithms have fast rates of convergence, global minima, and efficient algorithms can be found.
The game changer was the availability of GPUs. They had long been tuned for graphics processing in
computer games. In particular, they were optimized for high throughput 4x4 matrix-vector products, since

these are needed for many computer graphics tasks. Fortunately, the math required for that is very similar
to convolutional layers in deep networks. Furthermore, around that time, NVIDIA and ATI had begun
optimizing GPUs for general compute operations, going as far as renaming them GPGPU (General Purpose
GPUs).
To provide some intuition, consider the cores of a modern microprocessor. Each of the cores is quite pow-
erful, running at a high clock frequency, it has quite advanced and large caches (up to several MB of L3).
Each core is very good at executing a very wide range of code, with branch predictors, a deep pipeline and
lots of other things that make it great at executing regular programs. This apparent strength, however, is
also its Achilles’ heel: general purpose cores are very expensive to build. They require lots of chip area, a
sophisticated support structure (memory interfaces, caching logic between cores, high speed interconnects,
etc.), and they’re comparatively bad at any single task. Modern laptops have up to 4 cores, and even high
end servers rarely exceed 64 cores, simply because it is not cost effective.
Compare that with GPUs. They consist of 100-1000 small processing elements (the details differ somewhat
betwen NVIDIA, ATI, ARM and other chip vendors), often grouped into larger groups (NVIDIA calls them
warps). While each core is relatively weak, running at sub-1GHz clock frequency, it is the total number
of such cores that makes GPUs orders of magnitude faster than CPUs. For instance, NVIDIA’s latest Volta
generation offers up to 120 TFlops per chip for specialized instructions (and up to 24 TFlops for more general
purpose ones), while floating point performance of CPUs has not exceeded 1 TFlop to date. The reason for
why this is possible is actually quite simple: firstly, power consumption tends to grow quadratically with
clock frequency. Hence, for the power budget of a CPU core that runs 4x faster (a typical number) you
can use 16 GPU cores at 1/4 the speed, which yields 16 x 1/4 = 4x the performance. Furthermore GPU
cores are much simpler (in fact, for a long time they weren’t even able to execute general purpose code),
which makes them more energy efficient. Lastly, many operations in deep learning require high memory
bandwidth. Again, GPUs shine here with buses that are at least 10x as wide as many CPUs.
Back to 2012. A major breakthrough came when Alex Krizhevsky and Ilya Sutskever implemented a deep
convolutional neural network that could run on GPU hardware. They realized that the computational bot-
tlenecks in CNNs (convolutions and matrix multiplications) are all operations that could be parallelized in
hardware. Using two NIVIDA GTX 580s with 3GB of memory (depicted below) they implemented fast
convolutions. The code cuda-convnet was good enough that for several years it was the industry standard
and powered the first couple years of the deep learning boom.
3.27.4 AlexNet
In 2012, using their cuda-convnet implementation on an eight-layer CNN, Khrizhevsky, Sutskever and Hin-
ton won the ImageNet challenge on image recognition by a wide margin. Their model, introduced in this
paper, is very similar to the LeNet architecture from 1995.
In the rest of the chapter we’re going to implement a similar model to the one that they designed. Due
to memory constraints on the GPU they did some wacky things to make the model fit. For example, they
designed a dual-stream architecture in which half of the nodes live on each GPU. The two streams, and thus
the two GPUs only communicate at certain layers. This limits the amount of overhead for keeping the two
GPUs in sync with each other. Fortunately, distributed deep learning has advanced a long way in the last few
years, so we won’t be needing those features (except for very unusual architectures). In later sections, we’ll
go into greater depth on how you can speed up your networks by training on many GPUs (in AWS you can
get up to 16 on a single machine with 12GB each), and how you can train on many machine simultaneously.
As usual, we’ll start by importing the same dependencies as in the past gluon tutorials:


import mxnet as mx
import numpy as np
mx.random.seed(1)
In [ ]: # ctx = mx.gpu()
ctx = mx.cpu()
3.27.5 Load up a dataset

Now let’s load up a dataset. This time we’re going to use gluon’s new vision package, and import the
CIFAR dataset. Cifar is a much smaller color dataset, roughly the dimensions of ImageNet. It contains
50,000 training and 10,000 test images. The images belong in equal quantities to 10 categories. While this
dataset is considerably smaller than the 1M image, 1k category, 256x256 ImageNet dataset, we’ll use it here
to demonstrate the model because we don’t want to assume that you have a license for the ImageNet dataset
or a machine that can store it comfortably. To give you some sense for the proportions of working with
ImageNet data, we’ll upsample the images to 224x224 (the size used in the original AlexNet).
In [ ]: def transformer(data, label):
data = mx.image.imresize(data, 224, 224)
data = mx.nd.transpose(data, (2,0,1))
data = data.astype(np.float32)
return data, label
train_data = gluon.data.DataLoader(
gluon.data.vision.CIFAR10('./data', train=True, transform=transformer),
batch_size=batch_size, shuffle=True, last_batch='discard')
test_data = gluon.data.DataLoader(
gluon.data.vision.CIFAR10('./data', train=False, transform=transformer),
batch_size=batch_size, shuffle=False, last_batch='discard')
In [ ]: for d, l in train_data:
break
In [ ]: print(d.shape, l.shape)
In [ ]: d.dtype
3.27.6 The AlexNet architecture

This model has some notable features. First, in contrast to the relatively tiny LeNet, AlexNet contains 8
layers of transformations, five convolutional layers followed by two fully connected hidden layers and an
output layer.
The convolutional kernels in the first convolutional layer are reasonably large at 11 × 11, in the second they
are 5 × 5 and thereafter they are 3 × 3. Moreover, the first, second, and fifth convolutional layers are each
followed by overlapping pooling operations with pool size 3 × 3 and stride (2 × 2).
Following the convolutional layers, the original AlexNet had fully-connected layers with 4096 nodes each.
Using gluon.nn.Sequential(), we can define the entire AlexNet architecture in just 14 lines of code.

Besides the specific architectural choices and the data preparation, we can recycle all of the code we’d used
for LeNet verbatim.
[right now relying on a different data pipeline (the new gluon.vision). Sync this with the other chapter
soon and commit to one data pipeline.]
[add dropout once we are 100% final on API]
In [ ]: alex_net = gluon.nn.Sequential()
with alex_net.name_scope():
# First convolutional layer
alex_net.add(gluon.nn.Conv2D(channels=96, kernel_size=11, strides=(4,4), activa
alex_net.add(gluon.nn.MaxPool2D(pool_size=3, strides=2))
# Second convolutional layer
alex_net.add(gluon.nn.Conv2D(channels=192, kernel_size=5, activation='relu'))
alex_net.add(gluon.nn.MaxPool2D(pool_size=3, strides=(2,2)))
# Third convolutional layer
# Fourth convolutional layer
# Fifth convolutional layer
alex_net.add(gluon.nn.MaxPool2D(pool_size=3, strides=2))
# Flatten and apply fullly connected layers
alex_net.add(gluon.nn.Flatten())
alex_net.add(gluon.nn.Dense(4096, activation="relu"))
alex_net.add(gluon.nn.Dense(4096, activation="relu"))
alex_net.add(gluon.nn.Dense(10))

In [ ]: alex_net.collect_params().initialize(mx.init.Xavier(magnitude=2.24), ctx=ctx)
3.27.8 Optimizer
In [ ]: trainer = gluon.Trainer(alex_net.collect_params(), 'sgd', {'learning_rate': .001})

3.27.10 Evaluation loop

for d, l in data_iterator:
data = d.as_in_context(ctx)
label = l.as_in_context(ctx)
output = net(data)
return acc.get()[1]


In [ ]: ###########################
# Only one epoch so tests can run quickly, increase this variable to actually run
###########################
epochs = 1
for i, (d, l) in enumerate(train_data):
output = alex_net(data)
loss.backward()
##########################
##########################
test_accuracy = evaluate_accuracy(test_data, alex_net)

train_accuracy = evaluate_accuracy(train_data, alex_net)
3.27.12 Next
Very deep convolutional neural nets with repeating blocks
3.28 Very deep networks with repeating elements

As we already noticed in AlexNet, the number of layers in networks keeps on increasing. This means that
it becomes extremely tedious to write code that piles on one layer after the other manually. Fortunately,
programming languages have a wonderful fix for this: subroutines and loops. This way we can express
networks as code. Just like we would use a for loop to count from 1 to 10, we’ll use code to combine layers.
The first network that had this structure was VGG.
3.28.1 VGG
We begin with the usual import ritual
import mxnet as mx
3.28. Very deep networks with repeating elements 167

import numpy as np
mx.random.seed(1)
In [ ]: ctx = mx.gpu()
3.28.2 Load up a dataset



3.28.3 The VGG architecture

A key aspect of VGG was to use many convolutional blocks with relatively narrow kernels, followed by a
max-pooling step and to repeat this block multiple times. What is pretty neat about the code below is that we
use functions to return network blocks. These are then combined to larger networks (e.g. in vgg_stack)
and this allows us to construct VGG from components. What is particularly useful here is that we can use
it to reparameterize the architecture simply by changing a few lines rather than adding and removing many
lines of network definitions.
In [ ]: from mxnet.gluon import nn
def vgg_block(num_convs, channels):

out = nn.Sequential()
for _ in range(num_convs):
out.add(nn.Conv2D(channels=channels, kernel_size=3,
padding=1, activation='relu'))
out.add(nn.MaxPool2D(pool_size=2, strides=2))
return out
def vgg_stack(architecture):
out = nn.Sequential()
for (num_convs, channels) in architecture:
out.add(vgg_block(num_convs, channels))
return out
num_outputs = 10
architecture = ((1,64), (1,128), (2,256), (2,512))
net = nn.Sequential()
net.add(vgg_stack(architecture))
net.add(nn.Flatten())
net.add(nn.Dense(512, activation="relu"))
net.add(nn.Dropout(.5))
net.add(nn.Dropout(.5))
net.add(nn.Dense(num_outputs))


3.28.5 Optimizer

3.28.7 Evaluation loop

for d, l in data_iterator:
output = net(data)
return acc.get()[1]

In [ ]: ###########################
# Only one epoch so tests can run quickly, increase this variable to actually run
###########################
epochs = 1
for i, (d, l) in enumerate(train_data):
output = net(data)
loss.backward()
##########################
##########################
else (1 - smoothing_constant) * moving_loss + smoothing_cons
if i > 0 and i % 200 == 0:

print('Batch %d. Loss: %f' % (i, moving_loss))
3.28. Very deep networks with repeating elements 169


3.28.9 Next
Batch normalization from scratch
3.29 Batch Normalization from scratch

When you train a linear model, you update the weights in order to optimize some objective. And for the
linear model, the distribution of the inputs stays the same throughout training. So all we have to worry about
is how to map from these well-behaved inputs to some appropriate outputs. But if we focus on some layer
in the middle of a deep neural network, for example the third, things look a bit different. After each training
iteration, we update the weights in all the layers, including the first and the second. That means that over the
course of training, as the weights for the first two layers are learned, the inputs to the third layer might look
dramatically different than they did at the beginning. For starters, they might take values on a scale orders
of magnitudes different from when we started training. And this shift in feature scale might have serious
implications, say for the ideal learning rate at each time.
To explain, let us consider the Taylor’s expansion for the objective function 𝑓 with respect to the updated
parameter w, such as 𝑓 (w − 𝜂∇𝑓 (w)). Coefficients of those higher-order terms with respect to the learning
rate 𝜂 may be so large in scale (usually due to many layers) that these terms cannot be ignored. However,
the effect of common lower-order optimization algorithms, such as gradient descent, in iteratively reducing
the objective function is based on an important assumption: all those higher-order terms with respect to the
learning rate in the aforementioned Taylor’s expansion are ignored.
Motivated by this sort of intuition, Sergey Ioffe and Christian Szegedy proposed Batch Normalization, a
technique that normalizes the mean and variance of each of the features at every level of representation dur-
ing training. The technique involves normalization of the features across the examples in each mini-batch.
While competing explanations for the technique’s effect abound, its success is hard to deny. Empirically it
appears to stabilize the gradient (less exploding or vanishing values) and batch-normalized models appear
to overfit less. In fact, batch-normalized models seldom even use dropout. In this notebooks, we’ll explain
how it works.
3.29.1 Import dependencies and grab the MNIST dataset

We’ll get going by importing the typical packages and grabbing the MNIST data.
import mxnet as mx
import numpy as np
mx.random.seed(1)
ctx = mx.gpu()

num_inputs = 784

num_outputs = 10
3.29.3 Batch Normalization layer

The layer, unlike Dropout, is usually used before the activation layer (according to the authors’ original
paper), instead of after activation layer.
The basic idea is doing the normalization then applying a linear scale and shift to the mini-batch:
For input mini-batch 𝐵 = {𝑥1,...,𝑚 }, we want to learn the parameter 𝛾 and 𝛽. The output of the layer is
{𝑦𝑖 = 𝐵𝑁𝛾,𝛽 (𝑥𝑖 )}, where:
𝑚
1 ∑︁
𝜇𝐵 ← 𝑥𝑖
𝑚
𝑖=1
𝑚
2 1 ∑︁
𝜎𝐵 ← (𝑥𝑖 − 𝜇𝐵 )2
𝑚
𝑖=1
𝑥 𝑖 − 𝜇𝐵
𝑥ˆ𝑖 ← √︁
2 +𝜖
𝜎𝐵
𝑦𝑖 ← 𝛾 𝑥ˆ𝑖 + 𝛽 ≡ BN𝛾,𝛽 (𝑥𝑖 )
• formulas taken from Ioffe, Sergey, and Christian Szegedy. “Batch normalization: Accelerating deep
network training by reducing internal covariate shift.” International Conference on Machine Learning.
2015.
With gluon, this is all actually implemented for us, but we’ll do it this one time by ourselves, using the
formulas from the original paper so you know how it works, and perhaps you can improve upon it!
Pay attention that, when it comes to (2D) CNN, we normalize batch_size * height * width over
each channel. So that gamma and beta have the lengths the same as channel_count. In our imple-
mentation, we need to manually reshape gamma and beta so that they could (be automatically broadcast
and) multipy the matrices in the desired way.
In [ ]: def pure_batch_norm(X, gamma, beta, eps = 1e-5):
if len(X.shape) not in (2, 4):
raise ValueError('only supports dense or 2dconv')
# dense
if len(X.shape) == 2:
# mini-batch mean
mean = nd.mean(X, axis=0)
# mini-batch variance
variance = nd.mean((X - mean) ** 2, axis=0)
# normalize
X_hat = (X - mean) * 1.0 / nd.sqrt(variance + eps)
3.29. Batch Normalization from scratch 171

# scale and shift

out = gamma * X_hat + beta
# 2d conv
elif len(X.shape) == 4:
# extract the dimensions
N, C, H, W = X.shape
# mini-batch mean
mean = nd.mean(X, axis=(0, 2, 3))
variance = nd.mean((X - mean.reshape((1, C, 1, 1))) ** 2, axis=(0, 2, 3))
# normalize
X_hat = (X - mean.reshape((1, C, 1, 1))) * 1.0 / nd.sqrt(variance.reshape((
# scale and shift
out = gamma.reshape((1, C, 1, 1)) * X_hat + beta.reshape((1, C, 1, 1))
return out
Let’s do some sanity checks. We expect each column of the input matrix to be normalized.
In [ ]: A = nd.array([1,7,5,4,6,10], ctx=ctx).reshape((3,2))
A
In [ ]: pure_batch_norm(A,
gamma = nd.array([1,1], ctx=ctx),
beta=nd.array([0,0], ctx=ctx))
In [ ]: ga = nd.array([1,1], ctx=ctx)
be = nd.array([0,0], ctx=ctx)
B = nd.array([1,6,5,7,4,3,2,5,6,3,2,4,5,3,2,5,6], ctx=ctx).reshape((2,2,2,2))
B
In [ ]: pure_batch_norm(B, ga, be)
Our tests seem to support that we’ve done everything correctly. Note that for batch normalization, imple-
menting backward pass is a little bit tricky. Fortunately, you won’t have to worry about that here, because
the MXNet’s autograd package can handle differentiation for us automatically.
Besides that, in the testing process, we want to use the mean and variance of the complete dataset, instead
of those of mini batches. In the implementation, we use moving statistics as a trade off, because we don’t
want to or don’t have the ability to compute the statistics of the complete dataset (in the second loop).
Then here comes another concern: we need to maintain the moving statistics along with multiple runs of
the BN. It’s an engineering issue rather than a deep/machine learning issue. On the one hand, the moving
statistics are similar to gamma and beta; on the other hand, they are not updated by the gradient backwards.
In this quick-and-dirty implementation, we use the global dictionary variables to store the statistics, in which
each key is the name of the layer (scope_name), and the value is the statistics. (Attention: always be
very careful if you have to use global variables!) Moreover, we have another parameter is_training to
indicate whether we are doing training or testing.
Now we are ready to define our complete batch_norm():
In [ ]: def batch_norm(X,
gamma,
beta,

momentum = 0.9,
eps = 1e-5,
scope_name = '',
is_training = True,
debug = False):
"""compute the batch norm """
global _BN_MOVING_MEANS, _BN_MOVING_VARS
#########################
# the usual batch norm transformation
#########################
if len(X.shape) not in (2, 4):

raise ValueError('the input data shape should be one of:\n' +
'dense: (batch size, # of features)\n' +
'2d conv: (batch size, # of features, height, width)'
)
# dense
if len(X.shape) == 2:
# mini-batch mean
mean = nd.mean(X, axis=0)
variance = nd.mean((X - mean) ** 2, axis=0)
# normalize
if is_training:
# while training, we normalize the data using its mean and variance
X_hat = (X - mean) * 1.0 / nd.sqrt(variance + eps)
else:
# while testing, we normalize the data using the pre-computed mean and
X_hat = (X - _BN_MOVING_MEANS[scope_name]) *1.0 / nd.sqrt(_BN_MOVING_VA
# scale and shift
out = gamma * X_hat + beta
# 2d conv
elif len(X.shape) == 4:
# extract the dimensions
N, C, H, W = X.shape
# mini-batch mean
mean = nd.mean(X, axis=(0,2,3))
variance = nd.mean((X - mean.reshape((1, C, 1, 1))) ** 2, axis=(0, 2, 3))
# normalize
X_hat = (X - mean.reshape((1, C, 1, 1))) * 1.0 / nd.sqrt(variance.reshape((
if is_training:
# while training, we normalize the data using its mean and variance
X_hat = (X - mean.reshape((1, C, 1, 1))) * 1.0 / nd.sqrt(variance.resha
else:
# while testing, we normalize the data using the pre-computed mean and
X_hat = (X - _BN_MOVING_MEANS[scope_name].reshape((1, C, 1, 1))) * 1.0
/ nd.sqrt(_BN_MOVING_VARS[scope_name].reshape((1, C, 1, 1)) + eps)
# scale and shift
out = gamma.reshape((1, C, 1, 1)) * X_hat + beta.reshape((1, C, 1, 1))

#########################
# to keep the moving statistics
#########################
# init the attributes

try: # to access them
_BN_MOVING_MEANS, _BN_MOVING_VARS
except: # error, create them
_BN_MOVING_MEANS, _BN_MOVING_VARS = {}, {}
# store the moving statistics by their scope_names, inplace

if scope_name not in _BN_MOVING_MEANS:
_BN_MOVING_MEANS[scope_name] = mean
else:
_BN_MOVING_MEANS[scope_name] = _BN_MOVING_MEANS[scope_name] * momentum + me
if scope_name not in _BN_MOVING_VARS:
_BN_MOVING_VARS[scope_name] = variance
else:
_BN_MOVING_VARS[scope_name] = _BN_MOVING_VARS[scope_name] * momentum + vari
#########################
# debug info
#########################
if debug:
print('== info start ==')
print('scope_name = {}'.format(scope_name))
print('mean = {}'.format(mean))
print('var = {}'.format(variance))
print('_BN_MOVING_MEANS = {}'.format(_BN_MOVING_MEANS[scope_name]))
print('_BN_MOVING_VARS = {}'.format(_BN_MOVING_VARS[scope_name]))
print('output = {}'.format(out))
print('== info end ==')
#########################
# return
#########################
return out
3.29.4 Parameters and gradients

In [ ]: #######################
# Set the scale for weight initialization and choose
# the number of hidden units in the fully-connected layer
#######################
weight_scale = .01
num_fc = 128
W1 = nd.random_normal(shape=(20, 1, 3,3), scale=weight_scale, ctx=ctx)

b1 = nd.random_normal(shape=20, scale=weight_scale, ctx=ctx)
gamma1 = nd.random_normal(shape=20, loc=1, scale=weight_scale, ctx=ctx)

beta1 = nd.random_normal(shape=20, scale=weight_scale, ctx=ctx)
W2 = nd.random_normal(shape=(50, 20, 5, 5), scale=weight_scale, ctx=ctx)

gamma2 = nd.random_normal(shape=50, loc=1, scale=weight_scale, ctx=ctx)

beta2 = nd.random_normal(shape=50, scale=weight_scale, ctx=ctx)
W3 = nd.random_normal(shape=(800, num_fc), scale=weight_scale, ctx=ctx)

b3 = nd.random_normal(shape=num_fc, scale=weight_scale, ctx=ctx)
gamma3 = nd.random_normal(shape=num_fc, loc=1, scale=weight_scale, ctx=ctx)

beta3 = nd.random_normal(shape=num_fc, scale=weight_scale, ctx=ctx)
W4 = nd.random_normal(shape=(num_fc, num_outputs), scale=weight_scale, ctx=ctx)

params = [W1, b1, gamma1, beta1, W2, b2, gamma2, beta2, W3, b3, gamma3, beta3, W4,
param.attach_grad()


partition = nd.nansum(exp, axis=0, exclude=True).reshape((-1,1))


We insert the BN layer right after each linear layer.
In [ ]: def net(X, is_training = True, debug=False):
########################
# Define the computation of the first convolutional layer
########################
h1_conv = nd.Convolution(data=X, weight=W1, bias=b1, kernel=(3,3), num_filter=2
h1_normed = batch_norm(h1_conv, gamma1, beta1, scope_name='bn1', is_training=is
h1_activation = relu(h1_normed)
if debug:
########################
# Define the computation of the second convolutional layer
########################
h2_conv = nd.Convolution(data=h1, weight=W2, bias=b2, kernel=(5,5), num_filter=

h2_normed = batch_norm(h2_conv, gamma2, beta2, scope_name='bn2', is_training=is

h2_activation = relu(h2_normed)
if debug:
########################
# Flattening h2 so that we can feed it into a fully-connected layer
########################
h2 = nd.flatten(h2)
if debug:
print("Flat h2 shape: %s" % (np.array(h2.shape)))
########################
# Define the computation of the third (fully-connected) layer
########################
h3_normed = batch_norm(h3_linear, gamma3, beta3, scope_name='bn3', is_training=
h3 = relu(h3_normed)
if debug:
########################
# Define the computation of the output layer
########################
if debug:
print("yhat_linear shape: %s" % (np.array(yhat_linear.shape)))
return yhat_linear
3.29.9 Test run

Can data be passed into the net()?
break
In [ ]: output = net(data, is_training=True, debug=True)
3.29.10 Optimizer

numerator = 0.
denominator = 0.

output = net(data, is_training=False) # attention here!

Note: you may want to use a gpu to run the code below. (And remember to set the ctx = mx.gpu()
accordingly in the very beginning of this article.)
In [ ]: epochs = 1
moving_loss = 0.
label_one_hot = nd.one_hot(label, num_outputs)
# we are in training process,
# so we normalize the data using batch mean and variance
output = net(data, is_training=True)
loss.backward()
##########################
##########################
if i == 0:
else:

3.29.13 Next
Batch normalization with gluon
3.30 Batch Normalization in gluon

In the preceding section, we implemented batch normalization ourselves using NDArray and autograd. As
with most commonly used neural network layers, Gluon has batch normalization predefined, so this section
is going to be straightforward.
3.30. Batch Normalization in gluon 177


import mxnet as mx
import numpy as np
mx.random.seed(1)
ctx = mx.cpu()

num_inputs = 784
num_outputs = 10
3.30.2 Define a CNN with Batch Normalization

To add batchnormalization to a gluon model defined with Sequential, we only need to add a few lines.
Specifically, we just insert BatchNorm layers before the applying the ReLU activations.
In [ ]: num_fc = 512
net.add(gluon.nn.Conv2D(channels=20, kernel_size=5))
net.add(gluon.nn.BatchNorm(axis=1, center=True, scale=True))
net.add(gluon.nn.Activation(activation='relu'))
net.add(gluon.nn.Conv2D(channels=50, kernel_size=5))
# The Flatten layer collapses all axis, except the first one, into one axis.
net.add(gluon.nn.Dense(num_fc))

3.30.4 Softmax cross-entropy Loss


3.30.5 Optimizer

output = net(data)
return acc.get()[1]
3.30.7 Training Loop

In [ ]: epochs = 1
output = net(data)
loss.backward()
##########################
##########################

3.30.8 Next
Introduction to recurrent neural networks
3.31 Recurrent Neural Networks (RNNs) for Language Modeling

In previous tutorials, we worked with feedforward neural networks. They’re called feedforward networks
because each layer feeds into the next layer in a chain connecting the inputs to the outputs.
3.31. Recurrent Neural Networks (RNNs) for Language Modeling 179


At each iteration 𝑡, we feed in a new example 𝑥𝑡 , by setting the values of the input nodes (orange). We then
feed the activation forward by successively calculating the activations of each higher layer in the network.
Finally, we read the outputs from the topmost layer.
So when we feed the next example 𝑥𝑡+1 , we overwrite all of the previous activations. If consecutive inputs
to our network have no special relationship to each other (say, images uploaded by unrelated users), then
this is perfectly acceptable behavior. But what if our inputs exhibit a sequential relationship?
Say for example that you want to predict the next character in a string of text. We might decide to feed each
character into the neural network with the goal of predicting the succeeding character.
In the above example, the neural network forgets the previous context every time you feed a new input. How
is the neural network supposed to know that “e” is followed by a space? It’s hard to see why that should be
so probable if you didn’t know that the “e” was the final letter in the word “Time”.
Recurrent neural networks provide a slick way to incorporate sequential structure. At each time step 𝑡, each
hidden layer ℎ𝑡 (typically) will receive input from both the current input 𝑥𝑡 and from that same hidden layer
at the previous time step ℎ𝑡−1
Now, when our net is trying to predict what comes after the “e” in time, it has access to its previous beliefs,
and by extension, the entire history of inputs. Zooming back in to see how the nodes in a basic RNN are
connected, you’ll see that each node in the hidden layer is connected to each node at the hidden layer at the
next time step:
Even though the neural network contains loops (the hidden layer is connected to itself), because this con-
nection spans a time step our network is still technically a feedforward network. Thus we can still train by

backpropagration just as we normally would with an MLP. Typically the loss function will be an average of
the losses at each time step.
In this tutorial, we’re going to roll up our sleeves and write a simple RNN in MXNet using nothing but
mxnet.ndarray and mxnet.autograd. In practice, unless you’re trying to develop fundamentally
new recurrent layers, you’ll want to use the prebuilt layers that call down to extremely optimized primitives.
You’ll also want to rely on some pre-built batching code because batching sequences can be a pain. But we
think in general, if you’re going to work with this stuff, and have a modicum of self respect, you’ll want to
implement from scratch and understand how it works at a reasonably low level.
Let’s go ahead and import our dependencies and specify our context. If you’ve been following along without
a GPU until now, this might be where you’ll want to get your hands on some faster hardware. GPU instances
are available by the hour through Amazon Web Services. A single GPU via a p2 instance (NVIDIA K80s)
or even an older g2 instance will be perfectly adequate for this tutorial.
import mxnet as mx
import numpy as np
mx.random.seed(1)
ctx = mx.gpu(0)
3.31.1 Dataset: “The Time Machine”

Now mess with some data. I grabbed a copy of the Time Machine, mostly because it’s available freely
thanks to the good people at Project Gutenberg and a lot of people are tired of seeing RNNs generate
Shakespeare. In case you prefer torturing Shakespeare to torturing H.G. Wells, I’ve also included Andrej
Karpathy’s tinyshakespeare.txt in the data folder. Let’s get started by reading in the data.

In [2]: with open("../data/nlp/timemachine.txt") as f:

time_machine = f.read()
And you’ll probably want to get a taste for what the text looks like.
In [3]: print(time_machine[0:500])
Project Gutenberg's The Time Machine, by H. G. (Herbert George) Wells
This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever. You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.net
Title: The Time Machine
Author: H. G. (Herbert George) Wells
Release Date: October 2, 2004 [EBook #35]

[Last updated: October 3, 2014]
Language: English
*** START OF THIS PR
3.31.2 Tidying up
I went through and discovered that the last 38083 characters consist entirely of legalese from the Gutenberg
gang. So let’s chop that off lest our language model learn to generate such boring drivel.
In [4]: print(time_machine[-38075:-37500])
time_machine = time_machine[:-38083]
End of Project Gutenberg's The Time Machine, by H. G. (Herbert George) Wells
*** END OF THIS PROJECT GUTENBERG EBOOK THE TIME MACHINE ***
***** This file should be named 35.txt or 35.zip *****

This and all associated files of various formats will be found in:
http://www.gutenberg.net/3/35/
Updated editions will replace the previous one--the old editions

will be renamed.
Creating the works from public domain print editions means that no
one owns a United States copyright in these works, so the Foundation
(and you!) c

3.31.3 Numerical representations of characters

When we create numerical representations of characters, we’ll use one-hot representations. A one-hot is a
vector that takes value 1 in the index corresponding to a character, and 0 elsewhere. Because this vector is
as long as the vocab, let’s get a definitive list of characters in this dataset so that our representation is not
longer than necessary.
In [5]: character_list = list(set(time_machine))
vocab_size = len(character_list)
print(character_list)
print("Length of vocab: %s" % vocab_size)
['H', ';', 'D', 'k', '_', 'c', ' ', '0', ',', 'V', '"', 'Y', 'C', 'l', "'", 'e', '[', 'E',
Length of vocab: 77
We’ll often want to access the index corresponding to each character quickly so let’s store this as a dictionary.
In [6]: character_dict = {}
for e, char in enumerate(character_list):
character_dict[char] = e
print(character_dict)
{'H': 0, ']': 44, ';': 1, 'J': 65, 'Q': 50, 'D': 2, '_': 4, 'a': 43, ' ': 6, '0': 7, 'V': 9
In [7]: time_numerical = [character_dict[char] for char in time_machine]
In [8]: #########################
# Check that the length is right
#########################
print(len(time_numerical))
#########################
# Check that the format looks right
#########################
print(time_numerical[:20])
#########################
# Convert back to text
#########################
print("".join([character_list[idx] for idx in time_numerical[:39]]))
179533
[61, 23, 69, 21, 15, 5, 41, 6, 62, 20, 41, 15, 27, 67, 15, 23, 55, 14, 71, 6]
Project Gutenberg's The Time Machine, b
3.31.4 One-hot representations

We can use NDArray’s one_hot() operation to render a one-hot representation of each character. But frack
it, since this is the from scratch tutorial, let’s write this ourselves.
In [9]: def one_hots(numerical_list, vocab_size=vocab_size):
result = nd.zeros((len(numerical_list), vocab_size), ctx=ctx)
for i, idx in enumerate(numerical_list):
result[i, idx] = 1.0
return result
In [10]: print(one_hots(time_numerical[:2]))

[[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0.]]
That looks about right. Now let’s write a function to convert our one-hots back to readable text.
In [11]: def textify(embedding):
result = ""
indices = nd.argmax(embedding, axis=1).asnumpy()
for idx in indices:
result += character_list[int(idx)]
return result
In [12]: textify(one_hots(time_numerical[0:40]))
Out[12]: "Project Gutenberg's The Time Machine, by"
3.31.5 Preparing the data for training

Great, it’s not the most efficient implementation, but we know how it works. So we’re already doing better
than the majority of people with job titles in machine learning. Now, let’s chop up our dataset into sequences
that we could feed into our model.
You might think we could just feed in the entire dataset as one gigantic input and backpropagate across
the entire sequence. When you try to backpropagate across thousands of steps a few things go wrong: (1)
The time it takes to compute a single gradient update will be unreasonably long (2) The gradient across
thousands of recurrent steps has a tendency to either blow up, causing NaN errors due to losing precision,
or to vanish.
Thus we’re going to look at feeding in our data in reasonably short sequences. Note that this home-brew
version is pretty slow; if you’re still running on a CPU, this is the right time to make dinner.
In [13]: seq_length = 64
# -1 here so we have enough characters for labels later
num_samples = (len(time_numerical) - 1) // seq_length
dataset = one_hots(time_numerical[:seq_length*num_samples]).reshape((num_samples,
textify(dataset[0])
Out[13]: "Project Gutenberg's The Time Machine, by H. G. (Herbert George) "
Now that we’ve chopped our dataset into sequences of length seq_length, at every time step, our input is
a single one-hot vector. This means that our computation of the hidden layer would consist of matrix-vector
multiplications, which are not especially efficient on GPU. To take advantage of the available computing
resources, we’ll want to feed through a batch of sequences at the same time. The following code may look
tricky but it’s just some plumbing to make the data look like this.

chapter05_recurrent-neural-networks/img/recurrent-ba
In [15]: print('# of sequences in dataset: ', len(dataset))

num_batches = len(dataset) // batch_size
print('# of batches: ', num_batches)
train_data = dataset[:num_batches*batch_size].reshape((batch_size, num_batches, se
# swap batch_size and seq_length axis to make later access easier
train_data = nd.swapaxes(train_data, 0, 1)
print('Shape of data set: ', train_data.shape)
# of sequences in dataset: 2805
# of batches: 87
Shape of data set: (87, 64, 32, 77)
Let’s sanity check that everything went the way we hope. For each data_row, the second sequence should
follow the first:
In [16]: for i in range(3):
print("***Batch %s:***\n %s \n %s \n\n" % (i, textify(train_data[i, :, 0]), te
***Batch 0:***
Project Gutenberg's The Time Machine, by H. G. (Herbert George)
vement of the barometer. Yesterday it was so high, yesterday nig
***Batch 1:***
Wells
This eBook is for the use of anyone anywhere at no cost a

ht
it fell, then this morning it rose again, and so gently upwar
***Batch 2:***
nd with
almost no restrictions whatsoever. You may copy it, giv
d to
here. Surely the mercury did not trace this line in any of
3.31.6 Preparing our labels

Now let’s repurpose the same batching code to create our label batches
In [17]: labels = one_hots(time_numerical[1:seq_length*num_samples+1])
train_label = labels.reshape((batch_size, num_batches, seq_length, vocab_size))
train_label = nd.swapaxes(train_label, 0, 1)

print(train_label.shape)
(87, 64, 32, 77)
3.31.7 A final sanity check

Remember that our target at every time step is to predict the next character in the sequence. So our labels
should look just like our inputs but offset by one character. Let’s look at corresponding inputs and outputs
to make sure everything lined up as expected.
In [18]: print(textify(train_data[10, :, 3]))
print(textify(train_label[10, :, 3]))
te, but the twisted crystalline bars lay unfinished upon the
ben
e, but the twisted crystalline bars lay unfinished upon the
benc
3.31.8 Recurrent neural networks
chapter05_recurrent-neural-networks/img/simple-rnn.p
Recall that the update for an ordinary hidden layer in a neural network with activation function 𝜑 is given by
ℎ = 𝜑(𝑥𝑊 + 𝑏)
To make this a recurrent neural network, we’re simply going to add a weight sum of the previous hidden
state ℎ𝑡−1 :
ℎ𝑡 = 𝜑(𝑥𝑡 𝑊𝑥ℎ + ℎ𝑡−1 𝑊ℎℎ + 𝑏ℎ )
Then at every time set 𝑡, we’ll calculate the output as:
𝑦
^𝑡 = softmax𝑜𝑛𝑒−ℎ𝑜𝑡 (ℎ𝑡 𝑊ℎ𝑦 + 𝑏𝑦 )

In [19]: num_inputs = vocab_size
num_hidden = 256
num_outputs = vocab_size
########################
# Weights connecting the inputs to the hidden layer
########################
Wxh = nd.random_normal(shape=(num_inputs,num_hidden), ctx=ctx) * .01
########################

# Recurrent weights connecting the hidden layer across time steps

########################
Whh = nd.random_normal(shape=(num_hidden,num_hidden), ctx=ctx) * .01
########################
# Bias vector for hidden layer
########################
bh = nd.random_normal(shape=num_hidden, ctx=ctx) * .01
########################
# Weights to the output nodes
########################
Why = nd.random_normal(shape=(num_hidden,num_outputs), ctx=ctx) * .01
by = nd.random_normal(shape=num_outputs, ctx=ctx) * .01
# NOTE: to keep notation consistent,

# we should really use capital letters
# for hidden layers and outputs,
# since we are doing batchwise computations]
3.31.10 Attach the gradients

In [20]: params = [Wxh, Whh, bh, Why, by]

param.attach_grad()
3.31.11 Softmax Activation

In [21]: def softmax(y_linear, temperature=1.0):
lin = (y_linear-nd.max(y_linear, axis=1).reshape((-1,1))) / temperature # shif
exp = nd.exp(lin)
partition =nd.sum(exp, axis=1).reshape((-1,1))
In [22]: ####################
# With a temperature of 1 (always 1 during training), we get back some set of prob
####################
softmax(nd.array([[1, -1], [-1, 1]]), temperature=1.0)
Out[22]:
[[ 0.88079703 0.11920292]
[ 0.11920292 0.88079703]]
In [23]: ####################
# If we set a high temperature, we can get more entropic (*noisier*) probabilities
####################
softmax(nd.array([[1,-1],[-1,1]]), temperature=1000.0)
Out[23]:
[[ 0.50049996 0.49949998]
[ 0.49949998 0.50049996]]

In [24]: ####################
# Often we want to sample with low temperatures to produce sharp probabilities
####################
softmax(nd.array([[10,-10],[-10,10]]), temperature=.1)
Out[24]:
[[ 1. 0.]
[ 0. 1.]]

In [25]: def simple_rnn(inputs, state, temperature=1.0):
outputs = []
h = state
for X in inputs:
h_linear = nd.dot(X, Wxh) + nd.dot(h, Whh) + bh
h = nd.tanh(h_linear)
yhat_linear = nd.dot(h, Why) + by
yhat = softmax(yhat_linear, temperature=temperature)
outputs.append(yhat)
return (outputs, h)
3.31.13 Cross-entropy loss function

At every time step our task is to predict the next character, given the string up to that point. This is the
familiar multi-task classification that we introduced for handwritten digit classification. Accordingly, we’ll
rely on the same loss function, cross-entropy.
In [26]: # def cross_entropy(yhat, y):
# return - nd.sum(y * nd.log(yhat))
def cross_entropy(yhat, y):

return - nd.mean(nd.sum(y * nd.log(yhat), axis=0, exclude=True))
In [27]: cross_entropy(nd.array([[.2,.5,.3], [.2,.5,.3]]), nd.array([[1.,0,0], [0, 1.,0]]))
Out[27]:
[ 1.15129256]
<NDArray 1 @cpu(0)>
3.31.14 Averaging the loss over the sequence

Because the unfolded RNN has multiple outputs (one at every time step) we can calculate a loss at every
time step. The weights corresponding to the net at time step 𝑡 influence both the loss at time step 𝑡 and the
loss at time step 𝑡 + 1. To combine our losses into a single global loss, we’ll take the average of the losses
at each time step.
In [28]: def average_ce_loss(outputs, labels):
assert(len(outputs) == len(labels))
total_loss = 0.
for (output, label) in zip(outputs,labels):
total_loss = total_loss + cross_entropy(output, label)
return total_loss / len(outputs)

3.31.15 Optimizer
3.31.16 Generating text by sampling

We have now defined a model that takes a sequence of real inputs from our training data and tries to predict
the next character at every time step. You might wonder, what can we do with this model? Why should I
care about predicting the next character in a sequence of text?
This capability is exciting because given such a model, we can now generate strings of plausible text. The
generation procedure goes as follows. Say our string begins with the character “T”. We can feed the letter
“T” and get a conditional probability distribution over the next character 𝑃 (𝑥2 |𝑥1 = "T"). We can the
sample from this distribution, e.g. producing an “i”, and then assign 𝑥2 = "i", feeding this to the network at
the next time step.
[Add a nice graphic to illustrate sampling]
In [30]: def sample(prefix, num_chars, temperature=1.0):
#####################################
# Initialize the string that we'll return to the supplied prefix
#####################################
string = prefix
#####################################
# Prepare the prefix as a sequence of one-hots for ingestion by RNN
#####################################
prefix_numerical = [character_dict[char] for char in prefix]
input_sequence = one_hots(prefix_numerical)
#####################################
# Set the initial state of the hidden representation ($h_0$) to the zero vecto
#####################################
sample_state = nd.zeros(shape=(1, num_hidden), ctx=ctx)
#####################################
# For num_chars iterations,
# 1) feed in the current input
# 2) sample next character from from output distribution
# 3) add sampled character to the decoded string
# 4) prepare the sampled character as a one_hot (to be the next input)
#####################################
for i in range(num_chars):
outputs, sample_state = simple_rnn(input_sequence, sample_state, temperatu
choice = np.random.choice(vocab_size, p=outputs[-1][0].asnumpy())
string += character_list[choice]
input_sequence = one_hots([choice])
return string
In [ ]: epochs = 2000
moving_loss = 0.
learning_rate = .5

# state = nd.zeros(shape=(batch_size, num_hidden), ctx=ctx)

############################
# Attenuate the learning rate by a factor of 2 every 100 epochs.
############################
if ((e+1) % 100 == 0):
learning_rate = learning_rate / 2.0
state = nd.zeros(shape=(batch_size, num_hidden), ctx=ctx)
for i in range(num_batches):
data_one_hot = train_data[i]
label_one_hot = train_label[i]
outputs, state = simple_rnn(data_one_hot, state)
loss = average_ce_loss(outputs, label_one_hot)
loss.backward()
##########################
##########################
if (i == 0) and (e == 0):
moving_loss = np.mean(loss.asnumpy()[0])
else:
moving_loss = .99 * moving_loss + .01 * np.mean(loss.asnumpy()[0])
print("Epoch %s. Loss: %s" % (e, moving_loss))

print(sample("The Time Ma", 1024, temperature=.1))
print(sample("The Medical Man rose, came to the lamp,", 1024, temperature=.1))
3.31.17 Conclusions
Once you start running this code, it will spit out a sample at the end of each epoch. I’ll leave this output cell
blank so you don’t see megabytes of text, but here are some patterns that I observed when I ran this code.
The network seems to first work out patterns with no sequential relationship and then slowly incorporates
longer and longer windows of context. After just 1 epoch, my RNN generated this:
e e e ee e eee e e ee e e ee e e ee
˓→ e e ee e e e e e e e e
˓→e ee e aee e e ee e e ee ee e ee
˓→e e e e e ete e e e e e e ee n eee
˓→ee e eeee e e e e e e ee e e e e
˓→e e eee ee e e e e e e ee
˓→ee e e e e e e e e t e ee e eee e e e
˓→ee e e e e eee e e e eeeee
˓→ e eeee e e ee ee ee a e e eee ee e
˓→e e e aee e e e e eee e
˓→ e e e e e e e e e e e e
˓→ee e ee e e e e e e e
˓→ e e e e ee e e ee n e ee e e
˓→ e e e e t ee ee ee eee et e
˓→ e e e ee e e e e e e e e
˓→ e e"

It’s learned that spaces and “e”s (to my knowledge, there’s no aesthetically pleasing way to spell the plural
form of the letter “e”) are the most common characters.
A little bit later on it spits out strings like:
the the the the the the the the the the the the the the the the the the the the the the the the the the
the the
At this point it’s learned that after the space usually comes a nonspace character, and perhaps that “t” is the
most common character to immediately follow a space, “h” to follow a “t” and “e” to follow “th”. However
it doesn’t appear to be looking far enough back to realize that the word “the” should be very unlikely
immediately after the word “the”. . .
By the 175th epoch, the model appears to be putting together a fairly large vocabulary although it puts words
together in ways that might be charitably described as “creative”.
the little people had been as I store of the sungher had leartered along the realing of the stars of
the little past and stared at the thing that I had the sun had to the stars of the sunghed a stirnt a
moment the sun had come and fart as the stars of the sunghed a stirnt a moment the sun had to
the was completely and of the little people had been as I stood and all amations of the staring
and some of the really
In subsequent tutorials we’ll explore sophisticated techniques for evaluating and improving language mod-
els. We’ll also take a look at some related but more complicate problems like language translations and
image captioning.
3.31.18 Next
LSTM recurrent neural networks from scratch
3.32 Long short-term memory (LSTM) RNNs

import mxnet as mx
import numpy as np
mx.random.seed(1)
ctx = mx.gpu(0)



character_dict = {}
time_numerical = [character_dict[char] for char in time_machine]

return result
result = ""
for idx in indices:
return result

seq_length = 64
dataset = one_hots(time_numerical[:seq_length*num_samples]).reshape((num_samples, s
train_data = dataset[:num_batches*batch_size].reshape((num_batches, batch_size, seq

train_label = labels.reshape((num_batches, batch_size, seq_length, vocab_size))
3.32.6 Long short-term memory (LSTM) RNNs

An LSTM block has mechanisms to enable “memorizing” information for an extended number of time
steps. We use the LSTM block with the following transformations that map inputs to outputs across blocks
at consecutive layers and consecutive time steps:
𝑔𝑡 = tanh(𝑋𝑡 𝑊𝑥𝑔 + ℎ𝑡−1 𝑊ℎ𝑔 + 𝑏𝑔 ),
3.32. Long short-term memory (LSTM) RNNs 193

𝑖𝑡 = 𝜎(𝑋𝑡 𝑊𝑥𝑖 + ℎ𝑡−1 𝑊ℎ𝑖 + 𝑏𝑖 ),
𝑓𝑡 = 𝜎(𝑋𝑡 𝑊𝑥𝑓 + ℎ𝑡−1 𝑊ℎ𝑓 + 𝑏𝑓 ),
𝑜𝑡 = 𝜎(𝑋𝑡 𝑊𝑥𝑜 + ℎ𝑡−1 𝑊ℎ𝑜 + 𝑏𝑜 ),
𝑐𝑡 = 𝑓𝑡 ⊙ 𝑐𝑡−1 + 𝑖𝑡 ⊙ 𝑔𝑡 ,
ℎ𝑡 = 𝑜𝑡 ⊙ tanh(𝑐𝑡 ),
where ⊙ is an element-wise multiplication operator, and for all = [𝑥1 , 𝑥2 , . . . , 𝑥𝑘 ]⊤ ∈𝑘 the two activation
functions:
[︂ ]︂⊤
1 1
𝜎() = ,..., ] ,
1 + exp(−𝑥1 ) 1 + exp(−𝑥𝑘 )
[︂ ]︂⊤
1 − exp(−2𝑥1 ) 1 − exp(−2𝑥𝑘 )
tanh() = ,..., .
1 + exp(−2𝑥1 ) 1 + exp(−2𝑥𝑘 )
In the transformations above, the memory cell 𝑐𝑡 stores the “long-term” memory in the vector form. In other
words, the information accumulatively captured and encoded until time step 𝑡 is stored in 𝑐𝑡 and is only
passed along the same layer over different time steps.
Given the inputs 𝑐𝑡 and ℎ𝑡 , the input gate 𝑖𝑡 and forget gate 𝑓𝑡 will help the memory cell to decide how to
overwrite or keep the memory information. The output gate 𝑜𝑡 further lets the LSTM block decide how to
retrieve the memory information to generate the current state ℎ𝑡 that is passed to both the next layer of the
current time step and the next time step of the current layer. Such decisions are made using the hidden-layer
parameters 𝑊 and 𝑏 with different subscripts: these parameters will be inferred during the training phase by
gluon.

num_hidden = 256
########################
########################
Wxg = nd.random_normal(shape=(num_inputs,num_hidden), ctx=ctx) * .01
Wxi = nd.random_normal(shape=(num_inputs,num_hidden), ctx=ctx) * .01
Wxf = nd.random_normal(shape=(num_inputs,num_hidden), ctx=ctx) * .01
Wxo = nd.random_normal(shape=(num_inputs,num_hidden), ctx=ctx) * .01
########################
########################
Whg = nd.random_normal(shape=(num_hidden,num_hidden), ctx=ctx)* .01
Whi = nd.random_normal(shape=(num_hidden,num_hidden), ctx=ctx)* .01
Whf = nd.random_normal(shape=(num_hidden,num_hidden), ctx=ctx)* .01
Who = nd.random_normal(shape=(num_hidden,num_hidden), ctx=ctx)* .01
########################

########################
bg = nd.random_normal(shape=num_hidden, ctx=ctx) * .01
bi = nd.random_normal(shape=num_hidden, ctx=ctx) * .01
bf = nd.random_normal(shape=num_hidden, ctx=ctx) * .01
bo = nd.random_normal(shape=num_hidden, ctx=ctx) * .01
########################
########################

In [9]: params = [Wxg, Wxi, Wxf, Wxo, Whg, Whi, Whf, Who, bg, bi, bf, bo, Why, by]

param.attach_grad()

lin = (y_linear-nd.max(y_linear)) / temperature
exp = nd.exp(lin)

In [11]: def lstm_rnn(inputs, h, c, temperature=1.0):
outputs = []
for X in inputs:
g = nd.tanh(nd.dot(X, Wxg) + nd.dot(h, Whg) + bg)
i = nd.sigmoid(nd.dot(X, Wxi) + nd.dot(h, Whi) + bi)
f = nd.sigmoid(nd.dot(X, Wxf) + nd.dot(h, Whf) + bf)
o = nd.sigmoid(nd.dot(X, Wxo) + nd.dot(h, Who) + bo)
#######################
#
#######################
c = f * c + i * g
h = o * nd.tanh(c)
#######################
#
#######################
return (outputs, h, c)

In [12]: def cross_entropy(yhat, y):
3.32. Long short-term memory (LSTM) RNNs 195


total_loss = 0.
3.32.13 Optimizer

#####################################
#####################################
string = prefix
#####################################
#####################################
#####################################
#####################################
h = nd.zeros(shape=(1, num_hidden), ctx=ctx)
c = nd.zeros(shape=(1, num_hidden), ctx=ctx)
#####################################
#####################################
outputs, h, c = lstm_rnn(input_sequence, h, c, temperature=temperature)
return string
In [ ]: epochs = 2000
moving_loss = 0.
learning_rate = 2.0

############################
############################
if ((e+1) % 100 == 0):
h = nd.zeros(shape=(batch_size, num_hidden), ctx=ctx)
c = nd.zeros(shape=(batch_size, num_hidden), ctx=ctx)
outputs, h, c = lstm_rnn(data_one_hot, h, c)
loss.backward()
##########################
##########################
if (i == 0) and (e == 0):
else:

3.32.15 Conclusions
3.32.16 Next
Gated recurrent units (GRU) RNNs from scratch
3.33 Gated recurrent unit (GRU) RNNs

This chapter requires some exposition. The GRU updates are fully implemented and the code appears to
work properly.
import mxnet as mx
import numpy as np
mx.random.seed(1)
ctx = mx.gpu(0)

3.33. Gated recurrent unit (GRU) RNNs 197


character_dict = {}
time_numerical = [character_dict[char] for char in time_machine]

return result
result = ""
for idx in indices:
return result

seq_length = 64
dataset = one_hots(time_numerical[:seq_length*num_samples]).reshape((num_samples, s
train_data = dataset[:num_batches*batch_size].reshape((num_batches, batch_size, seq

train_label = labels.reshape((num_batches, batch_size, seq_length, vocab_size))
3.33.6 Gated recurrent units (GRU) RNNs

Similar to LSTM blocks, the GRU also has mechanisms to enable “memorizing” information for an extended
number of time steps. However, it does so in a more expedient way:
• We no longer keep a separate memory cell 𝑐𝑡 . Instead, ℎ𝑡−1 is added to a “new content” version of
itself to give ℎ𝑡 .
• The “new content” version is given by 𝑔𝑡 = tanh(𝑋𝑡 𝑊𝑥ℎ + (𝑟𝑡 ⊙ ℎ𝑡−1 )𝑊ℎℎ + 𝑏ℎ ), and is analogous
to 𝑔𝑡 in the LSTM tutorial.
• Here, there is a reset gate 𝑟𝑡 which moderates the impact of ℎ𝑡−1 on the “new content” version.

• The input gate 𝑖𝑡 and forget gate 𝑓𝑡 are replaced by an single update gate 𝑧𝑡 , which weighs the old and
new content via 𝑧𝑡 and (1 − 𝑧𝑡 ) respectively.
• There is no output gate 𝑜𝑡 ; the weighted sum is what becomes ℎ𝑡 .
We use the GRU block with the following transformations that map inputs to outputs across blocks at
consecutive layers and consecutive time steps:
𝑧𝑡 = 𝜎(𝑋𝑡 𝑊𝑥𝑧 + ℎ𝑡−1 𝑊ℎ𝑧 + 𝑏𝑧 ),
𝑟𝑡 = 𝜎(𝑋𝑡 𝑊𝑥𝑟 + ℎ𝑡−1 𝑊ℎ𝑟 + 𝑏𝑟 ),

𝑔𝑡 = tanh(𝑋𝑡 𝑊𝑥ℎ + (𝑟𝑡 ⊙ ℎ𝑡−1 )𝑊ℎℎ + 𝑏ℎ ),
ℎ𝑡 = 𝑧𝑡 ⊙ ℎ𝑡−1 + (1 − 𝑧𝑡 ) ⊙ 𝑔𝑡 ,
where 𝜎 and tanh are as before in the LSTM case.
Empirically, GRUs have similar performance to LSTMs, while requiring less parameters and forgoing an
internal time state. Intuitively, GRUs have enough gates/state for long-term retention, but not too much,
so that training and convergence remain fast and convex. See the work of Chung et al. [2014] (https:
//arxiv.org/abs/1412.3555).

num_hidden = 256
########################
########################
Wxz = nd.random_normal(shape=(num_inputs,num_hidden), ctx=ctx) * .01
Wxr = nd.random_normal(shape=(num_inputs,num_hidden), ctx=ctx) * .01
Wxh = nd.random_normal(shape=(num_inputs,num_hidden), ctx=ctx) * .01
########################
########################
Whz = nd.random_normal(shape=(num_hidden,num_hidden), ctx=ctx)* .01
Whr = nd.random_normal(shape=(num_hidden,num_hidden), ctx=ctx)* .01
Whh = nd.random_normal(shape=(num_hidden,num_hidden), ctx=ctx)* .01
########################
########################
bz = nd.random_normal(shape=num_hidden, ctx=ctx) * .01
br = nd.random_normal(shape=num_hidden, ctx=ctx) * .01
bh = nd.random_normal(shape=num_hidden, ctx=ctx) * .01
########################
########################


In [9]: params = [Wxz, Wxr, Wxh, Whz, Whr, Whh, bz, br, bh, Why, by]

param.attach_grad()

lin = (y_linear-nd.max(y_linear)) / temperature
exp = nd.exp(lin)

In [11]: def gru_rnn(inputs, h, temperature=1.0):
outputs = []
for X in inputs:
z = nd.sigmoid(nd.dot(X, Wxz) + nd.dot(h, Whz) + bz)
r = nd.sigmoid(nd.dot(X, Wxr) + nd.dot(h, Whr) + br)
g = nd.tanh(nd.dot(X, Wxh) + nd.dot(r * h, Whh) + bh)
h = z * h + (1 - z) * g

return (outputs, h)

In [12]: def cross_entropy(yhat, y):

total_loss = nd.array([0.], ctx=ctx)
3.33.13 Optimizer

#####################################


#####################################
string = prefix
#####################################
#####################################
#####################################
#####################################
h = nd.zeros(shape=(1, num_hidden), ctx=ctx)
c = nd.zeros(shape=(1, num_hidden), ctx=ctx)
#####################################
#####################################
outputs, h = gru_rnn(input_sequence, h, temperature=temperature)
return string
In [ ]: epochs = 2000
moving_loss = 0.
learning_rate = 2.0

############################
############################
if ((e+1) % 100 == 0):
h = nd.zeros(shape=(batch_size, num_hidden), ctx=ctx)
outputs, h = gru_rnn(data_one_hot, h)
loss.backward()
##########################

##########################
if (i == 0) and (e == 0):
else:

3.33.15 Conclusions
[Placeholder]
3.33.16 Next
Simple, LSTM, and GRU RNNs with gluon
3.34 Recurrent Neural Networks with gluon

With gluon, now we can train the recurrent neural networks (RNNs) more neatly, such as the long short-
term memory (LSTM) and the gated recurrent unit (GRU). To demonstrate the end-to-end RNN training and
prediction pipeline, we take a classic problem in language modeling as a case study. Specifically, we will
show how to predict the distribution of the next word given a sequence of previous words.
3.34.1 Import packages

To begin with, we need to make the following necessary imports.
In [ ]: import math
import os
import time
import numpy as np
import mxnet as mx
from mxnet import gluon, autograd
from mxnet.gluon import nn, rnn
3.34.2 Define classes for indexing words of the input document

In a language modeling problem, we define the following classes to facilitate the routine procedures for load-
ing document data. In the following, the Dictionary class is for word indexing: words in the documents
can be converted from the string format to the integer format.
In this example, we use consecutive integers to index words of the input document.
In [ ]: class Dictionary(object):
def __init__(self):
self.word2idx = {}
self.idx2word = []
def add_word(self, word):

if word not in self.word2idx:

self.idx2word.append(word)
self.word2idx[word] = len(self.idx2word) - 1
return self.word2idx[word]
def __len__(self):
return len(self.idx2word)
The Dictionary class is used by the Corpus class to index the words of the input document.
In [ ]: class Corpus(object):
def __init__(self, path):
self.dictionary = Dictionary()
self.train = self.tokenize(path + 'train.txt')
self.valid = self.tokenize(path + 'valid.txt')
self.test = self.tokenize(path + 'test.txt')
def tokenize(self, path):

"""Tokenizes a text file."""
assert os.path.exists(path)
# Add words to the dictionary
with open(path, 'r') as f:
tokens = 0
for line in f:
words = line.split() + ['<eos>']
tokens += len(words)
for word in words:
self.dictionary.add_word(word)
# Tokenize file content

with open(path, 'r') as f:
ids = np.zeros((tokens,), dtype='int32')
token = 0
for line in f:
words = line.split() + ['<eos>']
for word in words:
ids[token] = self.dictionary.word2idx[word]
token += 1
return mx.nd.array(ids, dtype='int32')
3.34.3 Provide an exposition of different RNN models with gluon

Based on the gluon.Block class, we can make different RNN models available with the following single
RNNModel class.
Users can select their preferred RNN model or compare different RNN models by configuring the argument
of the constructor of RNNModel. We will show an example following the definition of the RNNModel
class.
In [ ]: class RNNModel(gluon.Block):
"""A model with an encoder, recurrent layer, and a decoder."""
def __init__(self, mode, vocab_size, num_embed, num_hidden,

num_layers, dropout=0.5, tie_weights=False, **kwargs):
3.34. Recurrent Neural Networks with gluon 203

super(RNNModel, self).__init__(**kwargs)
self.drop = nn.Dropout(dropout)
self.encoder = nn.Embedding(vocab_size, num_embed,
weight_initializer = mx.init.Uniform(0.1))
if mode == 'rnn_relu':
self.rnn = rnn.RNN(num_hidden, num_layers, activation='relu', dropo
input_size=num_embed)
elif mode == 'rnn_tanh':
self.rnn = rnn.RNN(num_hidden, num_layers, dropout=dropout,
elif mode == 'lstm':
self.rnn = rnn.LSTM(num_hidden, num_layers, dropout=dropout,
elif mode == 'gru':
self.rnn = rnn.GRU(num_hidden, num_layers, dropout=dropout,
else:
raise ValueError("Invalid mode %s. Options are rnn_relu, "
"rnn_tanh, lstm, and gru"%mode)
if tie_weights:
self.decoder = nn.Dense(vocab_size, in_units = num_hidden,
params = self.encoder.params)
else:
self.decoder = nn.Dense(vocab_size, in_units = num_hidden)
self.num_hidden = num_hidden
def forward(self, inputs, hidden):

emb = self.drop(self.encoder(inputs))
output, hidden = self.rnn(emb, hidden)
output = self.drop(output)
decoded = self.decoder(output.reshape((-1, self.num_hidden)))
return decoded, hidden
def begin_state(self, *args, **kwargs):

return self.rnn.begin_state(*args, **kwargs)
3.34.4 Select an RNN model and configure parameters

For demonstration purposes, we provide an arbitrary selection of the parameter values. In practice, some
parameters should be more fine tuned based on the validation data set.
For instance, to obtain a better performance, as reflected in a lower loss or perplexity, one can set
args_epochs to a larger value.
In this demonstration, LSTM is the chosen type of RNN. For other RNN options, one can replace the
'lstm' string to 'rnn_relu', 'rnn_tanh', or 'gru' as provided by the aforementioned gluon.
Block class.
In [1]: args_data = '../data/nlp/ptb.'
args_model = 'rnn_relu'
args_emsize = 100
args_nhid = 100
args_nlayers = 2

args_lr = 1.0
args_clip = 0.2
args_epochs = 1
args_batch_size = 32
args_bptt = 5
args_dropout = 0.2
args_tied = True
args_cuda = 'store_true'
args_log_interval = 500
args_save = 'model.param'
3.34.5 Load data as batches

We load the document data by leveraging the aforementioned Corpus class.
To speed up the subsequent data flow in the RNN model, we pre-process the loaded data as batches. This
procedure is defined in the following batchify function.
In [ ]: context = mx.gpu() # this notebook takes too long on cpu
corpus = Corpus(args_data)
def batchify(data, batch_size):

"""Reshape data into (num_example, batch_size)"""
nbatch = data.shape[0] // batch_size
data = data[:nbatch * batch_size]
data = data.reshape((batch_size, nbatch)).T
return data
train_data = batchify(corpus.train, args_batch_size).as_in_context(context)

val_data = batchify(corpus.valid, args_batch_size).as_in_context(context)
test_data = batchify(corpus.test, args_batch_size).as_in_context(context)
3.34.6 Build the model

We go on to build the model, initialize model parameters, and configure the optimization algorithms for
training the RNN model.
In [ ]: ntokens = len(corpus.dictionary)
model = RNNModel(args_model, ntokens, args_emsize, args_nhid,

args_nlayers, args_dropout, args_tied)
model.collect_params().initialize(mx.init.Xavier(), ctx=context)
trainer = gluon.Trainer(model.collect_params(), 'sgd',
{'learning_rate': args_lr, 'momentum': 0, 'wd': 0})
loss = gluon.loss.SoftmaxCrossEntropyLoss()
3.34.7 Train the model and evaluate on validation and testing data sets
Now we can define functions for training and evaluating the model. The following are two helper functions
that will be used during model training and evaluation.
In [ ]: def get_batch(source, i):
seq_len = min(args_bptt, source.shape[0] - 1 - i)
data = source[i : i + seq_len]
target = source[i + 1 : i + 1 + seq_len]

return data, target.reshape((-1,))
def detach(hidden):
if isinstance(hidden, (tuple, list)):
hidden = [i.detach() for i in hidden]
else:
hidden = hidden.detach()
return hidden
The following is the function for model evaluation. It returns the loss of the model prediction. We will
discuss the details of the loss measure shortly.
In [ ]: def eval(data_source):
total_L = 0.0
ntotal = 0
hidden = model.begin_state(func = mx.nd.zeros, batch_size = args_batch_size, ct
for i in range(0, data_source.shape[0] - 1, args_bptt):
data, target = get_batch(data_source, i)
output, hidden = model(data, hidden)
L = loss(output, target)
total_L += mx.nd.sum(L).asscalar()
ntotal += L.size
return total_L / ntotal
Now we are ready to define the function for training the model. We can monitor the model performance on
the training, validation, and testing data sets over iterations.
In [ ]: def train():
best_val = float("Inf")
for epoch in range(args_epochs):
total_L = 0.0
start_time = time.time()
hidden = model.begin_state(func = mx.nd.zeros, batch_size = args_batch_size
for ibatch, i in enumerate(range(0, train_data.shape[0] - 1, args_bptt)):
data, target = get_batch(train_data, i)
hidden = detach(hidden)
output, hidden = model(data, hidden)
L = loss(output, target)
L.backward()
grads = [i.grad(context) for i in model.collect_params().values()]

# Here gradient is for the whole batch.
# So we multiply max_norm by batch_size and bptt size to balance it.
gluon.utils.clip_global_norm(grads, args_clip * args_bptt * args_batch_
trainer.step(args_batch_size)
total_L += mx.nd.sum(L).asscalar()
if ibatch % args_log_interval == 0 and ibatch > 0:

cur_L = total_L / args_bptt / args_batch_size / args_log_interval
print('[Epoch %d Batch %d] loss %.2f, perplexity %.2f' % (
epoch + 1, ibatch, cur_L, math.exp(cur_L)))
total_L = 0.0

val_L = eval(val_data)
print('[Epoch %d] time cost %.2fs, validation loss %.2f, validation perplex
epoch + 1, time.time() - start_time, val_L, math.exp(val_L)))
if val_L < best_val:

best_val = val_L
test_L = eval(test_data)
model.save_parameters(args_save)
print('test loss %.2f, test perplexity %.2f' % (test_L, math.exp(test_L
else:
args_lr = args_lr * 0.25
trainer._init_optimizer('sgd',
{'learning_rate': args_lr,
'momentum': 0,
'wd': 0})
model.load_parameters(args_save, context)
Recall that the RNN model training is based on maximization likelihood of observations. For evaluation
purposes, we have used the following two measures:
• Loss: the loss function is defined as the average negative log likelihood of the target words (ground
truth) under prediction:
𝑁
1 ∑︁
loss = − log 𝑝target𝑖 ,
𝑁
𝑖=1
where 𝑁 is the number of predictions and 𝑝target𝑖 the predicted likelihood of the 𝑖-th target word.
• Perplexity: the average per-word perplexity is exp(loss).
To orient the reader using concrete examples, let us illustrate the idea of the perplexity measure as follows.
• Consider the perfect scenario where the model always predicts the likelihood of the target word as 1.
In this case, for every 𝑖 we have 𝑝target𝑖 = 1. As a result, the perplexity of the perfect model is 1.
• Consider a baseline scenario where the model always predicts the likelihood of the target word ran-
domly at uniform among the given word set 𝑊 . In this case, for every 𝑖 we have 𝑝target𝑖 = 1/|𝑊 |. As
a result, the perplexity of a uniformly random prediction model is always |𝑊 |.
• Consider the worst-case scenario where the model always predicts the likelihood of the target word
as 0. In this case, for every 𝑖 we have 𝑝target𝑖 = 0. As a result, the perplexity of the worst model is
positive infinity.
Therefore, a model with a lower perplexity that is closer to 1 is generally more effective. Any effective
model has to achieve a perplexity lower than the cardinality of the target set.
Now we are ready to train the model and evaluate the model performance on validation and testing data sets.
In [ ]: train()
model.load_parameters(args_save, context)
test_L = eval(test_data)
print('Best test loss %.2f, test perplexity %.2f'%(test_L, math.exp(test_L)))

3.34.8 Next
Introduction to optimization
3.35 Introduction
You might find it weird that we’re sticking a chapter on optimization here. If you’re following the tutorials
in sequence, then you’ve probably already been optimizing over the parameters of ten or more machine
learning models. You might consider yourself an old pro. In this chapter we’ll supply some depth to
complement your experience.
We need to think seriously about optimization matters for several reasons. First, we want optimizers to be
fast. Optimizing complicated models with millions of parameters can take upsettingly long. You might have
heard of researchers training deep learning models for many hours, days, or even weeks. They probably
weren’t exaggerating. Second, optimization is how we choose our parameters. So the performance (e.g.
accuracy) of our models depends entirely on the quality of the optimizer.
3.35.1 Challenges in optimization

The pre-defined loss function in the learning problem is called the objective function for optimization.
Conventionally, optimization considers a minimization problem. Any maximization problem can be trivially
converted to an equivalent minimization problem by flipping the sign fo the objective function. Optimization
is worth studying both because it’s essential to learning. It’s also worth studying because it’s an area where
progress is being made, and smart choices can lead to superior performance. In other words, even fixing all
the other modeling decisions, figuring out how to optimize the parameters is a formidable challenge. We’ll
briefly describe some of the issues that make optimization hard, especially for neural networks.

3.35.2 Local minima

An objective function 𝑓 (𝑥) may have a local minimum 𝑥, where 𝑓 (𝑥) is smaller at 𝑥 than at the neighboring
points of 𝑥. If 𝑓 (𝑥) is the smallest value that can be obtained in the entire domain of 𝑥, 𝑓 (𝑥) is a global
mininum. The following figure demonstrates examples of local and global minima for the function:
𝑓 (𝑥) = 𝑥 · cos(𝜋𝑥), −1.0 ≤ 𝑥 ≤ 2.0.

import numpy as np
def f(x):
return x * np.cos(np.pi * x)
x = np.arange(-1.0, 2.0, 0.1)

fig = plt.figure()
subplt = fig.add_subplot(111)
subplt.annotate('local minimum', xy=(-0.3, -0.2), xytext=(-0.8, -1.0),
arrowprops=dict(facecolor='black', shrink=0.05))
subplt.annotate('global minimum', xy=(1.1, -0.9), xytext=(0.7, 0.1),
plt.plot(x, f(x))
plt.show()
3.35.3 Analytic vs approximate solutions

Ideally, we’d find the optimal solution 𝑥* that globally minimizes an objective function. For instance, the
function 𝑓 (𝑥) = 𝑥2 has a global minimum solution at 𝑥* = 0. We can obtain this solution analytically.
Another way of saying this is that there exists a closed-form solution. This just means that we can analyze
the equation for the function and produce an exact solution directly. Linear regression, for example, has an

analytic solution. To refresh your memory, in linear regression we build a predictor of the form:
y
^ = 𝑋w
We ignored the intercept term 𝑏 here, but that can be handled by simply appending a column of all 1s to the
design matrix X.
And we want to solve the following minimization problem
min ℒ(y, ŷ) = ||y − 𝑋w||22

w
As a refresher, that’s just the sum of the squared differences between our predictions and the ground truth
answers.
𝑛
∑︁
(𝑦𝑖 − w𝑇 x𝑖 )2
𝑖=1
Because we know that this function is quadratic, we know that it has a single critical point where the
derivative of the loss with respect to the weights w is equal to 0. Moreover, we know that the weights that
minimize our loss constitute a critical point. So our solution corresponds to the one setting of the weights
that gives a derivative of 0. First, let’s rewrite our loss function:
^) = (y − 𝑋w)𝑇 (y − 𝑋w)
ℒ(y, y
Now, setting the derivative of our loss to 0 gives the following equation:
𝜕ℒ(y, y
^)
= −2(𝑋)𝑇 (y − 𝑋w) = 0
𝜕w
We can now simplify these equations to find the optimal setting of the parameters w:
−2𝑋 𝑇 y + 2𝑋 𝑇 𝑋w = 0 (3.1)
𝑇 𝑇
𝑋 𝑋w = 𝑋 y
(3.2)
w = (𝑋 𝑇 𝑋)−1 𝑋 𝑇
y
(3.3)
You might have noticed that we assumed that the matrix 𝑋 𝑇 𝑋 can be inverted. If you take this fact for
granted, then it should be clear that we can recover the exact optimal value w* exactly. No matter what
values the data 𝑋, y takes we can produce an exact answer by computing just one matrix multiplication, one
matrix inversion, and two matrix-vector products.
3.35.4 Numerical optimization

However, in practice and for the most interesting models, we usually can’t find such analytical solutions.
Even for logistic regression, possibly the second simplest model considered in this book, we don’t have
any exact solution. When we don’t have an analytic solution, we need to resort to a numerical solution.
A numerical solution usually involves starting with some guess of the objective-minimizing setting of all
the parameters, and successively improving the parameters in an iterative manner. The most popular opti-
mization techniques of this variety are variants of gradient descent (GD). In the next notebook, we’ll take a
deep dive into gradient descent and stochastic gradient descent (SGD). Depending on the optimizer you use,
iterative methods may take a long time to converge on a good answer.

For many problems, even if they don’t have an analytic solution, they may have only one minima. An
especially convenient class of functions are the convex functions. These are functions with a uniformly pos-
itive second derivative. They have no local minima and are especially well-suited to efficient optimization.
Unfortunately, this is a book about neural networks. And neural networks are not in general convex. More-
over, they have abundant local minima. With numerical methods, it may not be possible to find the global
minimizer of an objective function. For non-convex functions, a numerical method often halts around local
minima that are not necessarily the global minima.
3.35.5 Saddle points

Saddle points are another challenge for optimizers. Even though these points are not local minima, they are
points where the gradient is equal to zero. For high dimensional models, saddle points are typically more
numerous than local minima. We depict a saddle point example in one-dimensional space below.
In [2]: x = np.arange(-2.0, 2.0, 0.1)
fig = plt.figure()
subplt = fig.add_subplot(111)
subplt.annotate('saddle point', xy=(0, -0.2), xytext=(-0.4, -5.0),
plt.plot(x, x**3)
plt.show()
Many optimization algorithms, like Newton’s method, are designed to be attracted to critical points, includ-
ing minima and saddle points. Since saddle points are generally common in high-dimensional space, some
optimization algorithms, such as Newton’s method, may fail to train deep learning models effectively as
they may get stuck in saddle points. Another challenging scenarios for neural networks is that there may be
large, flat regions in parameters space that correspond to bad values of the objective function.

Challenges due to machine precision

Even for convex functions, where all minima are global minima, it may still be hard to find the precise
optimal solutions. For one, the accuracy of any solution can be limited by the machine precision.
In computers, numbers are represented in a discrete manner. The accuracy of a floating-point system is
characterized by a quantity called machine precision. For IEEE binary floating-point systems,
• single precision = 2−24 (about 7 decimal digits of precision)
• double precision = 2−53 (about 16 decimal digits of precision).
In fact, the precision of a solution to optimization can be worse than the machine precision. To demonstrate
that, consider a function 𝑓 : R → R, its Taylor series exansion is
𝑓 ′′ (𝑥) 2
𝑓 (𝑥 + 𝜖) = 𝑓 (𝑥) + 𝑓 ′ (𝑥)𝜖 + 𝜖 + 𝒪(𝜖3 )
2
where 𝜖 is small. Denote the global optimum solution as 𝑥* for minimizing 𝑓 (𝑥). It usually holds that
𝑓 ′ (𝑥* ) = 0 and 𝑓 ′′ (𝑥* ) ̸= 0.
Thus, for a small value 𝜖, we have
𝑓 (𝑥* + 𝜖) ≈ 𝑓 (𝑥* ) + 𝒪(𝜖2 ),
where the coefficient term of 𝒪(𝜖2 ) is 𝑓 ′′ (𝑥)/2. This means that a small change of order 𝜖 in the optimum
solution 𝑥* will change the value of 𝑓 (𝑥* ) in the order of 𝜖2 . In other words, if there is an error in the
function value, the precision of the solution value is constrained by the order of the square root of that error.
For example, if the machine precision is 10−8 , the precision of the solution value is only in the order of
10−4 , which is much worse than the machine precision.
3.35.6 Optimality isn’t everything

Although finding the precise global optimum solution to an objective function is hard, it is not always
necessary for deep learning. To start with, we care about test set performance. So we may not even want to
minimize the error on the training set to the lowest possible value. Moreover, finding a suboptimal minimum
of a great model can still be better than finding the true global minimum of a lousy model.
Many algorithms have solid theoretical guarantees of convergence to global minima, but these guarantees
often only hold for functions that are convex. In the old days, most researchers tried to avoid non-convex
optimizations due to the lack of guarantees. Doing gradient descent without a theoretical guarantee of con-
vergence was considered unprincipled. However, the practice is supported by a large body of empirical
evidence. The state of the art models in computer vision, natural language processing, and speech recog-
nition, for example, all rely on applying numerical optimizers to non-convex objective functions. Machine
learners now often have to choose between those methods that are beautiful and those that work. In the next
sections we’ll try to give you some more background on the field of optimisation and a deeper sense of the
state of the art techniques for training neural networks.
3.35.7 Next
Gradient descent and stochastic gradient descent from scratch

3.36 Gradient descent and stochastic gradient descent from scratch

In the previous tutorials, we decided which direction to move each parameter and how much to move each
parameter by taking the gradient of the loss with respect to each parameter. We also scaled each gradient
by some learning rate, although we never really explained where this number comes from. We then updated
the parameters by performing a gradient step 𝜃𝑡+1 ← 𝜂∇𝜃 ℒ𝑡 . Each update is called a gradient step and the
process is called gradient descent.
The hope is that if we just take a whole lot of gradient steps, we’ll wind up with an awesome model that gets
very low loss on our training data, and that this performance might generalize to our hold-out data. But as a
sharp reader, you might have any number of doubts. You might wonder, for instance:
• Why does gradient descent work?
• Why doesn’t the gradient descent algorithm get stuck on the way to a low loss?
• How should we choose a learning rate?
• Do all the parameters need to share the same learning rate?
• Is there anything we can do to speed up the process?
• Why does the solution of gradient descent over training data generalize well to test data?
Some answers to these questions are known. For other questions, we have some answers but only for simple
models like logistic regression that are easy to analyze. And for some of these questions, we know of
best practices that seem to work even if they’re not supported by any conclusive mathematical analysis.
Optimization is a rich area of ongoing research. In this chapter, we’ll address the parts that are most relevant
for training neural networks. To begin, let’s take a more formal look at gradient descent.
3.36.1 Gradient descent in one dimension

To get going, consider a simple scenario in which we have one parameter to manipulate. Let’s also assume
that our objective associates every value of this parameter with a value. Formally, we can say that this
objective function has the signature 𝑓 : R → R. It maps from one real number to another.
Note that the domain of 𝑓 is in one-dimensional. According to its Taylor series expansion as shown in the
introduction chapter, we have
𝑓 (𝑥 + 𝜖) ≈ 𝑓 (𝑥) + 𝑓 ′ (𝑥)𝜖.
Substituting 𝜖 with −𝜂𝑓 ′ (𝑥) where 𝜂 is a constant, we have
𝑓 (𝑥 − 𝜂𝑓 ′ (𝑥)) ≈ 𝑓 (𝑥) − 𝜂𝑓 ′ (𝑥)2 .
If 𝜂 is set as a small positive value, we obtain
𝑓 (𝑥 − 𝜂𝑓 ′ (𝑥)) ≤ 𝑓 (𝑥).
In other words, updating 𝑥 as
𝑥 := 𝑥 − 𝜂𝑓 ′ (𝑥)
3.36. Gradient descent and stochastic gradient descent from scratch 213
may reduce the value of 𝑓 (𝑥) if its current derivative value 𝑓 ′ (𝑥) ̸= 0. Since the derivative 𝑓 ′ (𝑥) is a special
case of gradient in one-dimensional domain, the above update of 𝑥 is gradient descent in one-dimensional
domain.
The positive scalar 𝜂 is called the learning rate or step size. Note that a larger learning rate increases the
chance of overshooting the global minimum and oscillating. However, if the learning rate is too small, the
convergence can be very slow. In practice, a proper learning rate is usually selected with experiments.
3.36.2 Gradient descent over multi-dimensional parameters

Consider the objective function 𝑓 : R𝑑 → R that takes any multi-dimensional vector x = [𝑥1 , 𝑥2 , . . . , 𝑥𝑑 ]⊤
as its input. The gradient of 𝑓 (x) with respect to x is defined by the vector of partial derivatives:
[︂ ]︂⊤
𝜕𝑓 (x) 𝜕𝑓 (x) 𝜕𝑓 (x)
∇x 𝑓 (x) = , ,..., .
𝜕𝑥1 𝜕𝑥2 𝜕𝑥𝑑
To keep our notation compact we may use the notation ∇𝑓 (x) and ∇x 𝑓 (x) interchangeably when there is
no ambiguity about which parameters we are optimizing over. In plain English, each element 𝜕𝑓 (x)/𝜕𝑥𝑖 of
the gradient indicates the rate of change for 𝑓 at the point x with respect to the input 𝑥𝑖 only. To measure
the rate of change of 𝑓 in any direction that is represented by a unit vector u, in multivariate calculus, we
define the directional derivative of 𝑓 at x in the direction of u as
𝑓 (x + ℎu) − 𝑓 (x)
𝐷u 𝑓 (x) = lim ,
ℎ→0 ℎ
which can be rewritten according to the chain rule as
𝐷u 𝑓 (x) = ∇𝑓 (x) · u.
Since 𝐷u 𝑓 (x) gives the rates of change of 𝑓 at the point x in all possible directions, to minimize 𝑓 , we are
interested in finding the direction where 𝑓 can be reduced fastest. Thus, we can minimize the directional
derivative 𝐷u 𝑓 (x) with respect to u. Since 𝐷u 𝑓 (x) = ‖∇𝑓 (x)‖ · ‖u‖ · cos(𝜃) = ‖∇𝑓 (x)‖ · cos(𝜃), where
𝜃 is the angle between ∇𝑓 (x) and u, the minimum value of cos(𝜃) is -1 when 𝜃 = 𝜋. Therefore, 𝐷u 𝑓 (x)
is minimized when u is at the opposite direction of the gradient ∇𝑓 (x). Now we can iteratively reduce the
value of 𝑓 with the following gradient descent update:
x := x − 𝜂∇𝑓 (x),
where the positive scalar 𝜂 is called the learning rate or step size.
3.36.3 Stochastic gradient descent

However, the gradient descent algorithm may be infeasible when the training data size is huge. Thus, a
stochastic version of the algorithm is often used instead.
To motivate the use of stochastic optimization algorithms, note that when training deep learning models, we
often consider the objective function as a sum of a finite number of functions:
𝑛
1 ∑︁
𝑓 (x) = 𝑓𝑖 (x),
𝑛
𝑖=1

where 𝑓𝑖 (x) is a loss function based on the training data instance indexed by 𝑖. It is important to highlight
that the per-iteration computational cost in gradient descent scales linearly with the training data set size 𝑛.
Hence, when 𝑛 is huge, the per-iteration computational cost of gradient descent is very high.
In view of this, stochastic gradient descent offers a lighter-weight solution. At each iteration, rather than
computing the gradient ∇𝑓 (x), stochastic gradient descent randomly samples 𝑖 at uniform and computes
∇𝑓𝑖 (x) instead. The insight is, stochastic gradient descent uses ∇𝑓𝑖 (x) as an unbiased estimator of ∇𝑓 (x)
since
𝑛
1 ∑︁
E𝑖 ∇𝑓𝑖 (x) = ∇𝑓𝑖 (x) = ∇𝑓 (x).
𝑛
𝑖=1
In a generalized case, at each iteration a mini-batch ℬ that consists of indices for training data instances may
be sampled at uniform with replacement. Similarly, we can use
1 ∑︁
∇𝑓ℬ (x) = ∇𝑓𝑖 (x)
|ℬ|
𝑖∈ℬ
to update x as
x := x − 𝜂∇𝑓ℬ (x),
where |ℬ| denotes the cardinality of the mini-batch and the positive scalar 𝜂 is the learning rate or step size.
Likewise, the mini-batch stochastic gradient ∇𝑓ℬ (x) is an unbiased estimator for the gradient ∇𝑓 (x):
Eℬ ∇𝑓ℬ (x) = ∇𝑓 (x).
This generalized stochastic algorithm is also called mini-batch stochastic gradient descent and we simply
refer to them as stochastic gradient descent (as generalized). The per-iteration computational cost is 𝒪(|ℬ|).
Thus, when the mini-batch size is small, the computational cost at each iteration is light.
There are other practical reasons that may make stochastic gradient descent more appealing than gradient
descent. If the training data set has many redundant data instances, stochastic gradients may be so close
to the true gradient ∇𝑓 (x) that a small number of iterations will find useful solutions to the optimization
problem. In fact, when the training data set is large enough, stochastic gradient descent only requires a
small number of iterations to find useful solutions such that the total computational cost is lower than that
of gradient descent even for just one iteration. Besides, stochastic gradient descent can be considered as
offering a regularization effect especially when the mini-batch size is small due to the randomness and
noise in the mini-batch sampling. Moreover, certain hardware processes mini-batches of specific sizes more
efficiently.
3.36.4 Experiments
For demonstrating the aforementioned gradient-based optimization algorithms, we use the regression prob-
lem in the linear regression chapter as a case study.
In [1]: # Mini-batch stochastic gradient descent.
def sgd(params, lr, batch_size):
param[:] = param - lr * param.grad / batch_size

from mxnet import ndarray as nd
import random
mx.random.seed(1)
random.seed(1)
# Generate data.
num_inputs = 2
num_examples = 1000
true_w = [2, -3.4]
true_b = 4.2
X = nd.random_normal(scale=1, shape=(num_examples, num_inputs))
y = true_w[0] * X[:, 0] + true_w[1] * X[:, 1] + true_b
y += .01 * nd.random_normal(scale=1, shape=y.shape)
dataset = gluon.data.ArrayDataset(X, y)
# Construct data iterator.

def data_iter(batch_size):
idx = list(range(num_examples))
random.shuffle(idx)
for batch_i, i in enumerate(range(0, num_examples, batch_size)):
j = nd.array(idx[i: min(i + batch_size, num_examples)])
yield batch_i, X.take(j), y.take(j)
# Initialize model parameters.

def init_params():
w = nd.random_normal(scale=1, shape=(num_inputs, 1))
b = nd.zeros(shape=(1,))
params = [w, b]
param.attach_grad()
return params
# Linear regression.
def net(X, w, b):
return nd.dot(X, w) + b
# Loss function.
def square_loss(yhat, y):
return (yhat - y.reshape(yhat.shape)) ** 2 / 2
import matplotlib as mpl
mpl.rcParams['figure.dpi']= 120
import numpy as np
def train(batch_size, lr, epochs, period):

assert period >= batch_size and period % batch_size == 0
w, b = init_params()
total_loss = [np.mean(square_loss(net(X, w, b), y).asnumpy())]

# Epoch starts from 1.

for epoch in range(1, epochs + 1):
# Decay learning rate.
if epoch > 2:
lr *= 0.1
for batch_i, data, label in data_iter(batch_size):
output = net(data, w, b)
loss.backward()
sgd([w, b], lr, batch_size)
if batch_i * batch_size % period == 0:
total_loss.append(
np.mean(square_loss(net(X, w, b), y).asnumpy()))
print("Batch size %d, Learning rate %f, Epoch %d, loss %.4e" %
(batch_size, lr, epoch, total_loss[-1]))
print('w:', np.reshape(w.asnumpy(), (1, -1)),
'b:', b.asnumpy()[0], '\n')
x_axis = np.linspace(0, epochs, len(total_loss), endpoint=True)
plt.semilogy(x_axis, total_loss)
plt.xlabel('epoch')
plt.ylabel('loss')
plt.show()
In [4]: train(batch_size=1, lr=0.2, epochs=3, period=10)
Batch size 1, Learning rate 0.200000, Epoch 1, loss 5.5937e-05
w: [[ 1.99949276 -3.39981604]] b: 4.19997

w: [[ 2.00893021 -3.36536145]] b: 4.19384

w: [[ 1.99998689 -3.39983392]] b: 4.20028

In [7]: train(batch_size=10, lr=5, epochs=3, period=10)

Batch size 10, Learning rate 5.000000, Epoch 1, loss nan
w: [[ nan nan]] b: nan
Batch size 10, Learning rate 0.002000, Epoch 1, loss 9.1294e+00

w: [[ 0.9720636 -1.67973936]] b: 1.42253
3.36.5 Next
Gradient descent and stochastic gradient descent with Gluon
3.37 Gradient descent and stochastic gradient descent with Gluon

import numpy as np
import random
mx.random.seed(1)
random.seed(1)
# Generate data.
num_inputs = 2
num_examples = 1000
true_w = [2, -3.4]
true_b = 4.2


square_loss = gluon.loss.L2Loss()

net.collect_params().initialize(mx.init.Normal(sigma=1), force_reinit=True)
# SGD.
trainer = gluon.Trainer(net.collect_params(), 'sgd',
{'learning_rate': lr})
data_iter = gluon.data.DataLoader(dataset, batch_size, shuffle=True)
total_loss = [np.mean(square_loss(net(X), y).asnumpy())]
if epoch > 2:
trainer.set_learning_rate(trainer.learning_rate * 0.1)
for batch_i, (data, label) in enumerate(data_iter):
output = net(data)
loss.backward()
total_loss.append(np.mean(square_loss(net(X), y).asnumpy()))
(batch_size, trainer.learning_rate, epoch, total_loss[-1]))
print('w:', np.reshape(net[0].weight.data().asnumpy(), (1, -1)),

'b:', net[0].bias.data().asnumpy()[0], '\n')
plt.xlabel('epoch')
plt.ylabel('loss')
plt.show()
w: [[ 1.99949276 -3.39981604]] b: 4.19997
3.37. Gradient descent and stochastic gradient descent with Gluon 221

w: [[ 2.00893021 -3.36536145]] b: 4.19384


w: [[ 1.99998689 -3.39983392]] b: 4.20028
In [6]: train(batch_size=10, lr=5, epochs=3, period=10)

w: [[ nan nan]] b: nan
3.37. Gradient descent and stochastic gradient descent with Gluon 223

w: [[ 0.9720636 -1.67973948]] b: 1.42253

3.37.1 Next
Momentum from scratch
3.38 Momentum from scratch

As discussed in the previous chapter, at each iteration stochastic gradient descent (SGD) finds the direction
where the objective function can be reduced fastest on a given example. Thus, gradient descent is also
known as the method of steepest descent. Essentially, SGD is a myopic algorithm. It doesn’t look very far
into the past and it doesn’t think much about the future. At each step, SGD just does whatever looks right
just at that moment.
You might wonder, can we do something smarter? It turns out that we can. One class of methods use an idea
called momentum. The idea of momentum-based optimizers is to remember the previous gradients from
recent optimization steps and to use them to help to do a better job of choosing the direction to move next,
acting less like a drunk student walking downhill and more like a rolling ball.In this chapter we’ll motivate
and explain SGD with momentum.
3.38.1 Motivating example

In order to motivate the method, let’s start by visualizing a simple quadratic objective function 𝑓 : R2 → R
taking a two-dimensional vector x = [𝑥1 , 𝑥2 ]⊤ as the input. In the following figure, each contour line
indicates points of equivalent value 𝑓 (x). The objective function is minimized in the center and the outer
rings have progressively worse values.
The red triangle indicates the starting point for our stochastic gradient descent optimizer. The lines and
arrows that follow indicate each step of SGD. You might wonder why the lines don’t just point directly
towards the center. That’s because the gradient estimates in SGD are noisy, due to the small sample size.
So the gradient steps are noisy even if they are correct on average (unbiased). As you can see, SGD wastes
too much time swinging back and forth along the direction in parallel with the 𝑥2 -axis while advancing too
slowly along the direction of the 𝑥1 -axis.
3.38.2 Curvature and Hessian matrix

Even if we just did plain old gradient descent, we’d expect our function to bounce around quite a lot. That’s
because our gradient is changing as we move around in parameter space due to the curvature of the function.
We can reason about the curvature of objective function by considering their second derivative. The second
derivative says how much the gradient changes as we move in parameter space. In one dimension, a second
derivative of a function indicates how fast the first derivative changes when the input changes. Thus, it is
often considered as a measure of the curvature of a function. It is the rate of change of the rate of change.
If you’ve never done calculus before, that might sound rather meta, but you’ll get over it.
Consider the objective function 𝑓 : R𝑑 → R that takes a multi-dimensional vector x = [𝑥1 , 𝑥2 , . . . , 𝑥𝑑 ]⊤ as
the input. Its Hessian matrix H ∈ R𝑑×𝑑 collects its second derivatives. Each entry (𝑖, 𝑗) says how much
the gradient of the objective with respect to parameter 𝑖 changes, with a small change in parameter 𝑗.
𝜕 2 𝑓 (x)
H𝑖,𝑗 =
𝜕𝑥𝑖 𝜕𝑥𝑗
3.38. Momentum from scratch 225

for all 𝑖, 𝑗 = 1, . . . , 𝑑. Since H is a real symmetric matrix, by spectral theorem, it is orthogonally diagonal-
izable as
S⊤ HS = Λ,
where S is an orthonormal eigenbasis composed of eigenvectors of H with corresponding eigenvalues in a

diagonal matrix Λ: the eigenvalue Λ𝑖,𝑖 corresponds to the eigenvector in the 𝑖th column of S. The second
derivative (curvature) of the objective function 𝑓 in any direction d (unit vector) is a quadratic form d⊤ Hd.
Specifically, if the direction d is an eigenvector of H, the curvature of 𝑓 in that direction is equal to the
corresponding eigenvalue of d. Since the curvature of the objective function in any direction is a weighted
average of all the eigenvalues of the Hessian matrix, the curvature is bounded by the minimum and maximum
eigenvalues of the Hessian matrix H. The ratio of the maximum to the minimum eigenvalue is the condition
number of the Hessian matrix H.
3.38.3 Gradient descent in ill-conditioned problems

How does the condition number of the Hessian matrix of the objective function affect the performance of
gradient descent? Let us revisit the problem in the motivating example.
Recall that gradient descent is a greedy approach that selects the steepest gradient at the current point as the
direction of advancement. At the starting point, the search by gradient descent advances more aggressively
in the direction of the 𝑥2 -axis than that of the 𝑥1 -axis.
In the plotted problem of the motivating example, the curvature in the direction of the 𝑥2 -axis is much
larger than that of the 𝑥1 -axis. Thus, gradient descent tends to overshoot the bottom of the function that is
projected to the plane in parallel with the 𝑥2 -axis. At the next iteration, if the gradient along the direction
in parallel with the 𝑥2 -axis remains larger, the search continues to advance more aggressively along the
direction in parallel with the 𝑥2 -axis and the overshooting continues to take place. As a result, gradient
descent wastes too much time swinging back and forth in parallel with the 𝑥2 -axis due to overshooting
while the advancement in the direction of the 𝑥1 -axis is too slow.

To generalize, the problem in the motivating example is an ill-conditioned problem. In an ill-conditioned

problem, the condition number of the Hessian matrix of the objective function is large. In other words, the
ratio of the largest curvature to the smallest is high.
The momentum algorithm

The aforementioned ill-conditioned problems are challenging for gradient descent. By treating gradient
descent as a special form of stochastic gradient descent, we can address the challenge with the following
momentum algorithm for stochastic gradient descent.
v := 𝛾v + 𝜂∇𝑓ℬ (x),
x := x − v,
where v is the current velocity and 𝛾 is the momentum parameter. The learning rate 𝜂 and the stochastic
gradient ∇𝑓ℬ (x) with respect to the sampled mini-batch ℬ are both defined in the previous chapter.
It is important to highlight that, the scale of advancement at each iteration now also depends on how aligned
the directions of the past gradients are. This scale is the largest when all the past gradients are perfectly
aligned to the same direction.
To better understand the momentum parameter 𝛾, let us simplify the scenario by assuming the stochastic
gradients ∇𝑓ℬ (x) are the same as g throughout the iterations. Since all the gradients are perfectly aligned
to the same direction, the momentum algorithm accelerates the advancement along the same direction of g
as
v1 := 𝜂g,
v2 := 𝛾v1 + 𝜂g = 𝜂g(𝛾 + 1),
v3 := 𝛾v2 + 𝜂g = 𝜂g(𝛾 2 + 𝛾 + 1),
...
𝜂g
vinf := .
1−𝛾
Thus, if 𝛾 = 0.99, the final velocity is 100 times faster than that of the corresponding gradient descent where
the gradient is g.
Now with the momentum algorithm, a sample search path can be improved as illustrated in the following
figure.
Experiments
For demonstrating the momentum algorithm, we still use the regression problem in the linear regression
chapter as a case study. Specifically, we investigate stochastic gradient descent with momentum.
In [1]:
def sgd_momentum(params, vs, lr, mom, batch_size):
for param, v in zip(params, vs):
v[:] = mom * v + lr * param.grad / batch_size
param[:] = param - v

import random
mx.random.seed(1)
random.seed(1)
# Generate data.
num_inputs = 2
num_examples = 1000
true_w = [2, -3.4]
true_b = 4.2

random.shuffle(idx)

def init_params():
params = [w, b]
vs = []
param.attach_grad()

#
vs.append(param.zeros_like())
return params, vs
def net(X, w, b):
# Loss function.
import numpy as np
def train(batch_size, lr, mom, epochs, period):

[w, b], vs = init_params()

if epoch > 2:
lr *= 0.1
loss.backward()
sgd_momentum([w, b], vs, lr, mom, batch_size)
total_loss.append(np.mean(square_loss(net(X, w, b), y).asnumpy()))
'b:', b.asnumpy()[0], '\n')
plt.xlabel('epoch')
plt.ylabel('loss')
plt.show()
In [4]: train(batch_size=10, lr=0.2, mom=0.9, epochs=3, period=10)
w: [[ 1.99991071 -3.39920688]] b: 4.19865

Next
Momentum with Gluon
3.39 Momentum with Gluon

import numpy as np
import random
mx.random.seed(1)
random.seed(1)
# Generate data.
num_inputs = 2
num_examples = 1000
true_w = [2, -3.4]
true_b = 4.2

def train(batch_size, lr, mom, epochs, period):

# SGD with momentum.
trainer = gluon.Trainer(net.collect_params(), 'sgd',
{'learning_rate': lr, 'momentum': mom})

if epoch > 2:
trainer.set_learning_rate(trainer.learning_rate * 0.1)
output = net(data)
loss.backward()


plt.xlabel('epoch')
plt.ylabel('loss')
plt.show()
In [3]: train(batch_size=10, lr=0.2, mom=0.9, epochs=3, period=10)
w: [[ 1.99991047 -3.39920712]] b: 4.19865
3.39. Momentum with Gluon 231

3.39.1 Next
Adagrad from scratch
3.40 Adagrad from scratch

In [1]: from mxnet import ndarray as nd
# Adagrad.
def adagrad(params, sqrs, lr, batch_size):
eps_stable = 1e-7
for param, sqr in zip(params, sqrs):
g = param.grad / batch_size
sqr[:] += nd.square(g)
div = lr * g / nd.sqrt(sqr + eps_stable)
param[:] -= div
import mxnet as mx
import random
mx.random.seed(1)
random.seed(1)
# Generate data.
num_inputs = 2
num_examples = 1000

true_w = [2, -3.4]

true_b = 4.2

random.shuffle(idx)

def init_params():
params = [w, b]
sqrs = []
param.attach_grad()
#
sqrs.append(param.zeros_like())
return params, sqrs
def net(X, w, b):
# Loss function.
import numpy as np

[w, b], sqrs = init_params()

loss.backward()
3.40. Adagrad from scratch 233

adagrad([w, b], sqrs, lr, batch_size)

'b:', b.asnumpy()[0], '\n')
plt.xlabel('epoch')
plt.ylabel('loss')
plt.show()
w: [[ 1.99946415 -3.39996123]] b: 4.19967
3.40.1 Next
Adagrad with Gluon
3.41 Adagrad with Gluon



import numpy as np
import random
mx.random.seed(1)
random.seed(1)
# Generate data.
num_inputs = 2
num_examples = 1000
true_w = [2, -3.4]
true_b = 4.2

# Adagrad.
trainer = gluon.Trainer(net.collect_params(), 'adagrad',

output = net(data)
loss.backward()


plt.xlabel('epoch')
3.41. Adagrad with Gluon 235

plt.ylabel('loss')
plt.show()
w: [[ 1.99946415 -3.39996123]] b: 4.19967
3.41.1 Next
RMSProp from scratch
3.42 RMSprop from scratch

# RMSProp.
def rmsprop(params, sqrs, lr, gamma, batch_size):
eps_stable = 1e-8
for param, sqr in zip(params, sqrs):
sqr[:] = gamma * sqr + (1. - gamma) * nd.square(g)
div = lr * g / nd.sqrt(sqr + eps_stable)
param[:] -= div
import mxnet as mx


import random
mx.random.seed(1)
random.seed(1)
# Generate data.
num_inputs = 2
num_examples = 1000
true_w = [2, -3.4]
true_b = 4.2

random.shuffle(idx)

def init_params():
params = [w, b]
sqrs = []
param.attach_grad()
return params, sqrs
def net(X, w, b):
# Loss function.
import numpy as np
def train(batch_size, lr, gamma, epochs, period):

[w, b], sqrs = init_params()
3.42. RMSprop from scratch 237


loss.backward()
rmsprop([w, b], sqrs, lr, gamma, batch_size)
'b:', b.asnumpy()[0], '\n')
plt.xlabel('epoch')
plt.ylabel('loss')
plt.show()
In [3]: train(batch_size=10, lr=0.03, gamma=0.9, epochs=3, period=10)
w: [[ 2.003901 -3.3957026]] b: 4.20971

3.42.1 Next
RMSProp with Gluon
3.43 RMSprop with Gluon

import numpy as np
import random
mx.random.seed(1)
random.seed(1)
# Generate data.
num_inputs = 2
num_examples = 1000
true_w = [2, -3.4]
true_b = 4.2
def train(batch_size, lr, gamma, epochs, period):

# RMSProp.
trainer = gluon.Trainer(net.collect_params(), 'rmsprop',
{'learning_rate': lr, 'gamma1': gamma})

output = net(data)
loss.backward()
3.43. RMSprop with Gluon 239



plt.xlabel('epoch')
plt.ylabel('loss')
plt.show()
In [3]: train(batch_size=10, lr=0.03, gamma=0.9, epochs=3, period=10)
w: [[ 2.00390077 -3.39570308]] b: 4.20971
3.43.1 Next
AdaDalta from scratch
3.44 Adadelta from scratch


# Adadalta.
def adadelta(params, sqrs, deltas, rho, batch_size):
eps_stable = 1e-5
for param, sqr, delta in zip(params, sqrs, deltas):
sqr[:] = rho * sqr + (1. - rho) * nd.square(g)
cur_delta = nd.sqrt(delta + eps_stable) / nd.sqrt(sqr + eps_stable) * g
delta[:] = rho * delta + (1. - rho) * cur_delta * cur_delta
# update weight
param[:] -= cur_delta
import mxnet as mx

import random
mx.random.seed(1)
random.seed(1)
# Generate data.
num_inputs = 2
num_examples = 1000
true_w = [2, -3.4]
true_b = 4.2

random.shuffle(idx)

def init_params():
params = [w, b]
sqrs = []
deltas = []
param.attach_grad()
#
deltas.append(param.zeros_like())
return params, sqrs, deltas
3.44. Adadelta from scratch 241

def net(X, w, b):

# Loss function.
import numpy as np
def train(batch_size, rho, epochs, period):

[w, b], sqrs, deltas = init_params()

loss.backward()
adadelta([w, b], sqrs, deltas, rho, batch_size)
print("Batch size %d, Epoch %d, loss %.4e" %
(batch_size, epoch, total_loss[-1]))
'b:', b.asnumpy()[0], '\n')
plt.xlabel('epoch')
plt.ylabel('loss')
plt.show()
In [3]: train(batch_size=10, rho=0.9999, epochs=3, period=10)
Batch size 10, Epoch 1, loss 5.2081e-05
w: [[ 1.99959445 -3.3999126 ]] b: 4.19964

3.44.1 Next
AdaDalta with Gluon
3.45 Adadelta with Gluon

import numpy as np
import random
mx.random.seed(1)
random.seed(1)
# Generate data.
num_inputs = 2
num_examples = 1000
true_w = [2, -3.4]
true_b = 4.2
3.45. Adadelta with Gluon 243

def train(batch_size, rho, epochs, period):

# AdaDelta.
trainer = gluon.Trainer(net.collect_params(), 'adadelta',
{'rho': rho})

output = net(data)
loss.backward()

print("Batch size %d, Epoch %d, loss %.4e" %
(batch_size, epoch, total_loss[-1]))

plt.xlabel('epoch')
plt.ylabel('loss')
plt.show()
In [3]: train(batch_size=10, rho=0.9999, epochs=3, period=10)
w: [[ 1.99959445 -3.3999126 ]] b: 4.19964

3.45.1 Next
Adam from scratch
3.46 Adam from scratch

In [1]: # Adam.
def adam(params, vs, sqrs, lr, batch_size, t):
beta1 = 0.9
beta2 = 0.999
eps_stable = 1e-8
for param, v, sqr in zip(params, vs, sqrs):

v[:] = beta1 * v + (1. - beta1) * g

sqr[:] = beta2 * sqr + (1. - beta2) * nd.square(g)
v_bias_corr = v / (1. - beta1 ** t)

sqr_bias_corr = sqr / (1. - beta2 ** t)
div = lr * v_bias_corr / (nd.sqrt(sqr_bias_corr) + eps_stable)

param[:] = param - div
import mxnet as mx
3.46. Adam from scratch 245

import random
mx.random.seed(1)
random.seed(1)
# Generate data.
num_inputs = 2
num_examples = 1000
true_w = [2, -3.4]
true_b = 4.2

random.shuffle(idx)

def init_params():
params = [w, b]
vs = []
sqrs = []
param.attach_grad()
vs.append(param.zeros_like())
return params, vs, sqrs
def net(X, w, b):
# Loss function.
import numpy as np

[w, b], vs, sqrs = init_params()

t = 0
loss.backward()
# Increment t before invoking adam.
t += 1
adam([w, b], vs, sqrs, lr, batch_size, t)
'b:', b.asnumpy()[0], '\n')
plt.xlabel('epoch')
plt.ylabel('loss')
plt.show()
w: [[ 1.9997046 -3.39914703]] b: 4.1986
3.46. Adam from scratch 247

3.46.1 Next
Adam with Gluon
3.47 Adam with Gluon

import numpy as np
import random
mx.random.seed(1)
random.seed(1)
# Generate data.
num_inputs = 2
num_examples = 1000
true_w = [2, -3.4]
true_b = 4.2

# Adam.
trainer = gluon.Trainer(net.collect_params(), 'adam',

output = net(data)
loss.backward()



plt.xlabel('epoch')
plt.ylabel('loss')
plt.show()
w: [[ 1.9997046 -3.39914703]] b: 4.1986
3.47.1 Next
Fast & flexible: combining imperative & symbolic nets with HybridBlocks
3.48 Fast, portable neural networks with Gluon HybridBlocks

The tutorials we saw so far adopt the imperative, or define-by-run, programming paradigm. It might not
even occur to you to give a name to this style of programming because it’s how we always write Python
3.48. Fast, portable neural networks with Gluon HybridBlocks 249

programs.
Take for example a prototypical program written below in pseudo-Python. We grab some input arrays, we
compute upon them to produce some intermediate values, and finally we produce the result that we actually
care about.
def our_function(A, B, C, D):

# Compute some intermediate values
E = basic_function1(A, B)
F = basic_function2(C, D)
# Finally, produce the thing you really care about

G = basic_function3(E, F)
return G
# Load up some data

W = some_stuff()
X = some_stuff()
Y = some_stuff()
Z = some_stuff()
result = our_function(W, X, Y, Z)
As you might expect when we compute E, we’re actually performing some numerical operation, like mul-
tiplication, and returning an array that we assign to the variable E. Same for F. And if we want to do a
similar computation many times by putting these lines in a function, each time our program will have to step
through these three lines of Python.
The advantage of this approach is it’s so natural that it might not even occur to some people that there is
another way. But the disadvantage is that it’s slow. That’s because we are constantly engaging the Python
execution environment (which is slow) even though our entire function performs the same three low-level
operations in the same sequence every time. It’s also holding on to all the intermediate values D and E until
the function returns even though we can see that they’re not needed. We might have made this program
more efficient by re-using memory from either E or F to store the result G.
There actually is a different way to do things. It’s called symbolic programming and most of the early deep
learning libraries, including Theano and Tensorflow, embraced this approach exclusively. You might have
also heard this approach referred to as declarative programming or define-then-run programming. These all
mean the exact same thing. The approach consists of three basic steps:
• Define a computation workflow, like a pass through a neural network, using placeholder data
• Compile the program into a front-end language, e.g. Python, independent format
• Invoke the compiled function, feeding it real data
Revisiting our previous pseudo-Python example, a symbolic version of the same program might look some-
thing like this:
# Create some placeholders to stand in for real data that might be supplied
˓→to the compiled function.
A = placeholder()
B = placeholder()
C = placeholder()

D = placeholder()
# Compute some intermediate values

E = symbolic_function1(A, B)
F = symbolic_function2(C, D)
# Finally, produce the thing you really care about

G = symbolic_function3(E, F)
our_function = library.compile(inputs=[A, B, C, D], outputs=[G])
# Load up some data

W = some_stuff()
X = some_stuff()
Y = some_stuff()
Z = some_stuff()
result = our_function(W, X, Y, Z)
Here, when we run the line E = symbolic_function1(A, B), no numerical computation actually
happens. Instead, the symbolic library notes the way that E is related to A and B and records this infor-
mation. We don’t do actual computation, we just make a roadmap for how to go from inputs to outputs.
Because we can draw all of the variables and operations (both inputs and intermediate values) a nodes, and
the relationships between nodes with edges, we call the resulting roadmap a computational graph. In the
symbolic approach, we first define the entire graph, and then compile it.
3.48.1 Imperative Programs Tend to be More Flexible

When you’re using an imperative-style library from Python, you are writing in Python. Nearly anything that
would be intuitive to write in Python, you could accelerate by calling down in the appropriate places to the
imperative deep learning library. On the other hand, when you write a symbolic program, you may not have
access to all the familiar Python constructs, like iteration. It’s also easy to debug an imperative program. For
one, because all the intermediate values hang around, it’s easy to introspect the program later. Imperative
programs are also much easier to debug because we can just stick print statements in between operations.
In short, from a developer’s standpoint, imperative programs are just better. They’re a joy to work with. You
don’t have the tricky indirection of working with placeholders. You can do anything that you can do with
native Python. And faster debugging, means you get to try out more ideas. But the catch is that imperative
programs are comparatively slow.
3.48.2 Symbolic Programs Tend to be More Efficient

The main reason is efficiency, both in terms of memory and speed. Let’s revisit our toy example from before.
Consider the following program:
import numpy as np
a = np.ones(10)
b = np.ones(10) * 2
c = b * a
d = c + 1
...

Assume that each cell in the array occupies 8 bytes of memory. How much memory do we need to execute
this program in the Python console? As an imperative program we need to allocate memory at each line.
That leaves us allocating 4 arrays of size 10. So we’ll need 4 * 10 * 8 = 320 bytes. On the other hand,
if we built a computation graph, and knew in advance that we only needed d, we could reuse the memory
originally allocated for intermediate values. For example, by performing computations in-place, we might
recycle the bits allocated for b to store c. And we might recycle the bits allocated for c to store d. In the end
we could cut our memory requirement in half, requiring just 2 * 10 * 8 = 160 bytes.
Symbolic programs can also perform another kind of optimization, called operation folding. Returning
to our toy example, the multiplication and addition operations can be folded into one operation. If the
computation runs on a GPU processor, one GPU kernel will be executed, instead of two. In fact, this is one
way we hand-craft operations in optimized libraries, such as CXXNet and Caffe. Operation folding improves
computation efficiency. Note, you can’t perform operation folding in imperative programs, because the
intermediate values might be referenced in the future. Operation folding is possible in symbolic programs
because we get the entire computation graph in advance, before actually doing any calculation, giving us a
clear specification of which values will be needed and which will not.
Getting the best of both worlds with MXNet Gluon’s HybridBlocks

Most libraries deal with the imperative / symbolic design problem by simply choosing a side. Theano and
those frameworks it inspired, like TensorFlow, run with the symbolic way. And because the first versions
of MXNet optimized performance, they also went symbolic. Chainer and its descendants like PyTorch are
fully imperative way. In designing MXNet Gluon, we asked the following question. Is it possible to get
all of the benefits of imperative programming but to still exploit, whenever possible, the speed and memory
efficiency of symbolic programming. In other words, a user should be able to use Gluon fully imperatively.
And if they never want their lives to be more complicated then they can get on just fine imagining that the
story ends there. But when a user needs production-level performance, it should be easy to compile the
entire compute graph, or at least to compile large subsets of it.
MXNet accomplishes this through the use of HybridBlocks. Each HybridBlock can run fully imper-
atively defining their computation with real functions acting on real inputs. But they’re also capable of
running symbolically, acting on placeholders. Gluon hides most of this under the hood so you’ll only need
to know how it works when you want to write your own layers. Given a HybridBlock whose forward compu-
tation consists of going through other HybridBlocks, you can compile that section of the network by calling
the HybridBlocks .hybridize() method.
All of MXNet’s predefined layers are HybridBlocks. This means that any network consisting entirely of
predefined MXNet layers can be compiled and run at much faster speeds by calling .hybridize().
HybridSequential
We already learned how to use Sequential to stack the layers. The regular Sequential can be built
from regular Blocks and so it too has to be a regular Block. However, when you want to build a network
using sequential and run it at crazy speeds, you can construct your network using HybridSequential
instead. The functionality is the same Sequential:
from mxnet.gluon import nn
def get_net():

# construct a MLP
net = nn.HybridSequential()
net.add(nn.Dense(2))
# initialize the parameters
net.collect_params().initialize()
return net
# forward
x = nd.random_normal(shape=(1, 512))
net = get_net()
print('=== net(x) ==={}'.format(net(x)))
=== net(x) ===
[[ 0.08827585 0.0050519 ]]
To compile and optimize the HybridSequential, we can then call its hybridize method. Only
HybridBlocks, e.g. HybridSequential, can be compiled. But you can still call hybridize
on normal Block and its HybridBlock children will be compiled instead. We will talk more about
HybridBlocks later.
In [2]: net.hybridize()
print('=== net(x) ==={}'.format(net(x)))
=== net(x) ===
[[ 0.08827585 0.0050519 ]]
Performance
To get a sense of the speedup from hybridizing, we can compare the performance before and after hybridiz-
ing by measuring in either case the time it takes to make 1000 forward passes through the network.
In [3]: from time import time
def bench(net, x):
mx.nd.waitall()
start = time()
for i in range(1000):
y = net(x)
mx.nd.waitall()
return time() - start
net = get_net()
print('Before hybridizing: %.4f sec'%(bench(net, x)))
net.hybridize()
print('After hybridizing: %.4f sec'%(bench(net, x)))
Before hybridizing: 0.4344 sec
After hybridizing: 0.2230 sec
As you can see, hybridizing gives a significant performance boost, almost 2x the speed.

Get the symbolic program

Previously, we feed net with NDArray data x, and then net(x) returned the forward results. Now if we
feed it with a Symbol placeholder, then the corresponding symbolic program will be returned.
In [4]: from mxnet import sym
x = sym.var('data')
print('=== input data holder ===')
print(x)
y = net(x)
print('\n=== the symbolic program of net===')
print(y)
y_json = y.tojson()
print('\n=== the according json definition===')
print(y_json)
=== input data holder ===
<Symbol data>
=== the symbolic program of net===

<Symbol hybridsequential1_dense2_fwd>
=== the according json definition===

{
"nodes": [
{
"op": "null",
"name": "data",
"inputs": []
},
{
"op": "null",
"name": "hybridsequential1_dense0_weight",
"attrs": {
"__dtype__": "0",
"__lr_mult__": "1.0",
"__shape__": "(256, 0)",
"__storage_type__": "0",
"__wd_mult__": "1.0"
},
"inputs": []
},
{
"op": "null",
"name": "hybridsequential1_dense0_bias",
"attrs": {
"__dtype__": "0",
"__init__": "zeros",
"__lr_mult__": "1.0",
"__shape__": "(256,)",
"__wd_mult__": "1.0"
},

"inputs": []
},
{
"op": "FullyConnected",
"name": "hybridsequential1_dense0_fwd",
"attrs": {
"flatten": "True",
"no_bias": "False",
"num_hidden": "256"
},
"inputs": [[0, 0, 0], [1, 0, 0], [2, 0, 0]]
},
{
"op": "Activation",
"name": "hybridsequential1_dense0_relu_fwd",
"attrs": {"act_type": "relu"},
"inputs": [[3, 0, 0]]
},
{
"op": "null",
"attrs": {
"__dtype__": "0",
"__lr_mult__": "1.0",
"__shape__": "(128, 0)",
"__wd_mult__": "1.0"
},
"inputs": []
},
{
"op": "null",
"attrs": {
"__dtype__": "0",
"__lr_mult__": "1.0",
"__shape__": "(128,)",
"__wd_mult__": "1.0"
},
"inputs": []
},
{
"attrs": {
"flatten": "True",
"no_bias": "False",
"num_hidden": "128"
},
"inputs": [[4, 0, 0], [5, 0, 0], [6, 0, 0]]
},
{

"op": "Activation",
"name": "hybridsequential1_dense1_relu_fwd",
"attrs": {"act_type": "relu"},
"inputs": [[7, 0, 0]]
},
{
"op": "null",
"attrs": {
"__dtype__": "0",
"__lr_mult__": "1.0",
"__shape__": "(2, 0)",
"__wd_mult__": "1.0"
},
"inputs": []
},
{
"op": "null",
"attrs": {
"__dtype__": "0",
"__lr_mult__": "1.0",
"__shape__": "(2,)",
"__wd_mult__": "1.0"
},
"inputs": []
},
{
"attrs": {
"flatten": "True",
"no_bias": "False",
"num_hidden": "2"
},
"inputs": [[8, 0, 0], [9, 0, 0], [10, 0, 0]]
}
],
"arg_nodes": [0, 1, 2, 5, 6, 9, 10],
"node_row_ptr": [
0,
1,
2,
3,
4,
5,
6,
7,
8,
9,
10,

11,
12
],
"heads": [[11, 0, 0]],
"attrs": {"mxnet_version": ["int", 10300]}
}
Now we can save both the program and parameters onto disk, so that it can be loaded later not only
in Python, but in all other supported languages, such as C++, R, and Scala, as well. For that we use
the .export(prefix, epoch) function, it will save the json symbolic representation in the format
prefix-symbol.json and the corresponding parameters as prefix-{epoch}.params.
In [5]: net.export('my_model', epoch=0)
This will create two files: - my_model-symbol.json - my_model-0000.params

Learn more about how to load back the models in MXNet various APIs in the official MXNet tutorial
HybridBlock
Now let’s dive deeper into how hybridize works. Remember that gluon networks are composed of
Blocks each of which subclass gluon.Block. With normal Blocks, we just need to define a forward
function that takes an input x and computes the result of the forward pass through the network. MXNet can
figure out the backward pass for us automatically with autograd.
To define a HybridBlock, we instead have a hybrid_forward function:
In [6]: from mxnet import gluon
class Net(gluon.HybridBlock):
super(Net, self).__init__(**kwargs)
self.fc1 = nn.Dense(256)
def hybrid_forward(self, F, x):

# F is a function space that depends on the type of x
# If x's type is NDArray, then F will be mxnet.nd
# If x's type is Symbol, then F will be mxnet.sym
print('type(x): {}, F: {}'.format(
type(x).__name__, F.__name__))
x = F.relu(self.fc1(x))
x = F.relu(self.fc2(x))
return self.fc3(x)
The hybrid_forward function takes an additional input, F, which stands for a backend. This exploits
one awesome feature of MXNet. MXNet has both a symbolic API (mxnet.symbol) and an imperative
API (mxnet.ndarray). In this book, so far, we’ve only focused on the latter. Owing to fortuitous
historical reasons, the imperative and symbolic interfaces both support roughly the same API. They have
many of same functions (currently about 90% overlap) and when they do, they support the same arguments
in the same order. When we define hybrid_forward, we pass in F. When running in imperative mode,
hybrid_forward is called with F as mxnet.ndarray and x as some ndarray input. When we compile

with hybridize, F will be mxnet.symbol and x will be some placeholder or intermediate symbolic
value. Once we call hybridize, the net is compiled, so we’ll never need to call hybrid_forward again.
Let’s demonstrate how this all works by feeding some data through the network twice. We’ll do this for
both a regular network and a hybridized net. You’ll see that in the first case, hybrid_forward is actually
called twice.
In [7]: net = Net()
net.collect_params().initialize()
x = nd.random_normal(shape=(1, 512))
print('=== 1st forward ===')
y = net(x)
print('=== 2nd forward ===')
y = net(x)
=== 1st forward ===
type(x): NDArray, F: mxnet.ndarray
=== 2nd forward ===
type(x): NDArray, F: mxnet.ndarray
Now run it again after hybridizing.

In [8]: net.hybridize()
print('=== 1st forward ===')
y = net(x)
print('=== 2nd forward ===')
y = net(x)
=== 1st forward ===
type(x): Symbol, F: mxnet.symbol
=== 2nd forward ===
It differs from the previous execution in two aspects:

1. the input data type now is Symbol even when we fed an NDArray into net, because gluon
implicitly constructed a symbolic data placeholder.
2. hybrid_forward is called once at the first time we run net(x). It is because gluon will con-
struct the symbolic program on the first forward, and then keep it for reuse later.
One main reason that the network is faster after hybridizing is because we don’t need to repeatedly invoke the
Python forward function, while keeping all computations within the highly efficient C++ backend engine.
But the potential drawback is the loss of flexibility to write the forward function. In other ways, inserting
print for debugging or control logic such as if and for into the forward function is not possible now.
Conclusion
Through HybridSequental and HybridBlock, we can convert an imperative program into a symbolic
program by calling hybridize.
Next
Training MXNet models with multiple GPUs

3.49 Training with multiple GPUs from scratch

This tutorial shows how we can increase performance by distributing training across multiple GPUs. So, as
you might expect, running this tutorial requires at least 2 GPUs. And these days multi-GPU machines are
actually quite common. The following figure depicts 4 GPUs on a single machine and connected to the CPU
through a PCIe switch.
If an NVIDIA driver is installed on our machine, then we can check how many GPUs are available by
running the command nvidia-smi.
In [1]: !nvidia-smi
Fri Oct 13 00:11:36 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66 Driver Version: 375.66 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla M60 On | 0000:00:1B.0 Off | 0 |
| N/A 34C P8 13W / 150W | 0MiB / 7613MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla M60 On | 0000:00:1C.0 Off | 0 |
+-------------------------------+----------------------+----------------------+
| 2 Tesla M60 On | 0000:00:1D.0 Off | 0 |
+-------------------------------+----------------------+----------------------+
| 3 Tesla M60 On | 0000:00:1E.0 Off | 0 |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
3.49. Training with multiple GPUs from scratch 259

|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
We want to use all of the GPUs on together for the purpose of significantly speeding up training (in terms of
wall clock). Remember that CPUs and GPUs each can have multiple cores. CPUs on a laptop might have
2 or 4 cores, and on a server might have up to 16 or 32 cores. GPUs tend to have many more cores - an
NVIDIA K80 GPU has 4992 - but run at slower clock speeds. Exploiting the parallelism across the GPU
cores is how GPUs get their speed advantage in the first place.
As compared to the single CPU or single GPU setting where all the cores are typically used by default,
parallelism across devices is a little more complicated. That’s because most layers of a neural network can
only run on a single device. So, in order to parallelize across devices, we need to do a little extra. Therefore,
we need to do some additional work to partition a workload across multiple GPUs. This can be done in a
few ways.
3.49.1 Data Parallelism

For deep learning, data parallelism is by far the most widely used approach for partitioning workloads. It
works like this: Assume that we have k GPUs. We split the examples in a data batch into k parts, and send
each part to a different GPUs which then computes the gradient that part of the batch. Finally, we collect
the gradients from each of the GPUs and sum them together before updating the weights.
The following pseudo-code shows how to train one data batch on k GPUs.
def train_batch(data, k):

split data into k parts
for i = 1, ..., k: # run in parallel
compute grad_i w.r.t. weight_i using data_i on the i-th GPU
grad = grad_1 + ... + grad_k
for i = 1, ..., k: # run in parallel
copy grad to i-th GPU
update weight_i by using grad
Next we will present how to implement this algorithm from scratch.
3.49.2 Automatic Parallelization

We first demonstrate how to run workloads in parallel. Writing parallel code in Python in non-trivial, but
fortunately, MXNet is able to automatically parallelize the workloads. Two technologies help to achieve this
goal.
First, workloads, such as nd.dot are pushed into the backend engine for lazy evaluation. That is, Python
merely pushes the workload nd.dot and returns immediately without waiting for the computation to be
finished. We keep pushing until the results need to be copied out from MXNet, such as print(x) or are
converted into numpy by x.asnumpy(). At that time, the Python thread is blocked until the results are
ready.
In [2]: from mxnet import nd
from time import time
start = time()
x = nd.random_uniform(shape=(2000,2000))

y = nd.dot(x, x)
print('=== workloads are pushed into the backend engine ===\n%f sec' % (time() - st
z = y.asnumpy()
print('=== workloads are finished ===\n%f sec' % (time() - start))
=== workloads are pushed into the backend engine ===
0.001160 sec
=== workloads are finished ===
0.174040 sec
Second, MXNet depends on a powerful scheduling algorithm that analyzes the dependencies of the pushed
workloads. This scheduler checks to see if two workloads are independent of each other. If they are, then
the engine may run them in parallel. If a workload depend on results that have not yet been computed, it
will be made to wait until its inputs are ready.
For example, if we call three operators:
a = nd.random_uniform(...)
b = nd.random_uniform(...)
c = a + b
Then the computation for a and b may run in parallel, while c cannot be computed until both a and b are
ready.
The following code shows that the engine effectively parallelizes the dot operations on two GPUs:
In [3]: from mxnet import gpu
def run(x):
"""push 10 matrix-matrix multiplications"""
return [nd.dot(x,x) for i in range(10)]
def wait(x):
"""explicitly wait until all results are ready"""
for y in x:
y.wait_to_read()
x0 = nd.random_uniform(shape=(4000, 4000), ctx=gpu(0))

x1 = x0.copyto(gpu(1))
print('=== Run on GPU 0 and 1 in sequential ===')

start = time()
wait(run(x0))
wait(run(x1))
print('time: %f sec' %(time() - start))
print('=== Run on GPU 0 and 1 in parallel ===')

start = time()
y0 = run(x0)
y1 = run(x1)
wait(y0)
wait(y1)
print('time: %f sec' %(time() - start))
=== Run on GPU 0 and 1 in sequential ===

time: 1.842752 sec

=== Run on GPU 0 and 1 in parallel ===
time: 0.396227 sec
In [4]: from mxnet import cpu
def copy(x, ctx):

"""copy data to a device"""
return [y.copyto(ctx) for y in x]
print('=== Run on GPU 0 and then copy results to CPU in sequential ===')
start = time()
y0 = run(x0)
wait(y0)
z0 = copy(y0, cpu())
wait(z0)
print(time() - start)
print('=== Run and copy in parallel ===')

start = time()
y0 = run(x0)
z0 = copy(y0, cpu())
wait(z0)
print(time() - start)
=== Run on GPU 0 and then copy results to CPU in sequential ===
0.6489872932434082
=== Run and copy in parallel ===
0.39962267875671387
3.49.3 Define model and updater

We will use the convolutional neural networks and plain SGD introduced in ‘cnn-scratch <>‘__ as an
example workload.
# initialize parameters
scale = .01
W1 = nd.random_normal(shape=(20,1,3,3))*scale
b1 = nd.zeros(shape=20)
W2 = nd.random_normal(shape=(50,20,5,5))*scale
W3 = nd.random_normal(shape=(800,128))*scale
W4 = nd.random_normal(shape=(128,10))*scale
params = [W1, b1, W2, b2, W3, b3, W4, b4]
# network and loss

def lenet(X, params):
# first conv
h1_conv = nd.Convolution(data=X, weight=params[0], bias=params[1], kernel=(3,3)
h1_activation = nd.relu(h1_conv)
h1 = nd.Pooling(data=h1_activation, pool_type="max", kernel=(2,2), stride=(2,2)
# second conv

h2_conv = nd.Convolution(data=h1, weight=params[2], bias=params[3], kernel=(5,5

h2_activation = nd.relu(h2_conv)
h2 = nd.Pooling(data=h2_activation, pool_type="max", kernel=(2,2), stride=(2,2)
h2 = nd.flatten(h2)
# first fullc
h3_linear = nd.dot(h2, params[4]) + params[5]
h3 = nd.relu(h3_linear)
# second fullc
yhat = nd.dot(h3, params[6]) + params[7]
return yhat
# plain SGD
def SGD(params, lr):
for p in params:
p[:] = p - lr * p.grad
3.49.4 Utility functions to synchronize data across GPUs

The following function copies the parameters into a particular GPU and initializes the gradients.
In [6]: def get_params(params, ctx):
new_params = [p.copyto(ctx) for p in params]
for p in new_params:
p.attach_grad()
return new_params
new_params = get_params(params, gpu(0))

print('=== copy b1 to GPU(0) ===\nweight = {}\ngrad = {}'.format(
new_params[1], new_params[1].grad))
=== copy b1 to GPU(0) ===
weight =
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0.]
<NDArray 20 @gpu(0)>
grad =
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0.]
<NDArray 20 @gpu(0)>
Given a list of data that spans multiple GPUs, we then define a function to sum the data and broadcast the
results to each GPU.
In [7]: def allreduce(data):
# sum on data[0].context, and then broadcast
for i in range(1, len(data)):
data[0][:] += data[i].copyto(data[0].context)
for i in range(1, len(data)):
data[0].copyto(data[i])
data = [nd.ones((1,2), ctx=gpu(i))*(i+1) for i in range(2)]

print("=== before allreduce ===\n {}".format(data))
allreduce(data)

print("\n=== after allreduce ===\n {}".format(data))

=== before allreduce ===
[
[[ 1. 1.]]
<NDArray 1x2 @gpu(0)>,
[[ 2. 2.]]
<NDArray 1x2 @gpu(1)>]
=== after allreduce ===

[
[[ 3. 3.]]
[[ 3. 3.]]
Given a data batch, we define a function that splits this batch and copies each part into the corresponding
GPU.
In [8]: def split_and_load(data, ctx):
n, k = data.shape[0], len(ctx)
assert (n//k)*k == n, '# examples is not divided by # devices'
idx = list(range(0, n+1, n//k))
return [data[idx[i]:idx[i+1]].as_in_context(ctx[i]) for i in range(k)]
batch = nd.arange(16).reshape((4,4))
print('=== original data ==={}'.format(batch))
ctx = [gpu(0), gpu(1)]
splitted = split_and_load(batch, ctx)
print('\n=== splitted into {} ==={}\n{}'.format(ctx, splitted[0], splitted[1]))
=== original data ===
[[ 0. 1. 2. 3.]
[ 4. 5. 6. 7.]
[ 8. 9. 10. 11.]
[ 12. 13. 14. 15.]]
=== splitted into [gpu(0), gpu(1)] ===

[[ 0. 1. 2. 3.]
[ 4. 5. 6. 7.]]
[[ 8. 9. 10. 11.]
[ 12. 13. 14. 15.]]
3.49.5 Train and inference one data batch

Now we are ready to implement how to train one data batch with data parallelism.
In [9]: from mxnet import autograd
def train_batch(batch, params, ctx, lr):
# split the data batch and load them on GPUs
data = split_and_load(batch.data[0], ctx)
label = split_and_load(batch.label[0], ctx)

# run forward on each GPU

losses = [loss(lenet(X, W), Y)
for X, Y, W in zip(data, label, params)]
# run backward on each gpu
for l in losses:
l.backward()
# aggregate gradient over GPUs
for i in range(len(params[0])):
allreduce([params[c][i].grad for c in range(len(ctx))])
# update parameters with SGD on each GPU
for p in params:
SGD(p, lr/batch.data[0].shape[0])
For inference, we simply let it run on the first GPU. We leave a data parallelism implementation as an
exercise.
In [10]: def valid_batch(batch, params, ctx):
data = batch.data[0].as_in_context(ctx[0])
pred = nd.argmax(lenet(data, params[0]), axis=1)
return nd.sum(pred == batch.label[0].as_in_context(ctx[0])).asscalar()
3.49.6 Put all things together

Define the program that trains and validates the model on MNIST.
In [11]: from mxnet.test_utils import get_mnist
from mxnet.io import NDArrayIter
def run(num_gpus, batch_size, lr):

# the list of GPUs will be used
ctx = [gpu(i) for i in range(num_gpus)]
print('Running on {}'.format(ctx))
# data iterator
mnist = get_mnist()
train_data = NDArrayIter(mnist["train_data"], mnist["train_label"], batch_size
valid_data = NDArrayIter(mnist["test_data"], mnist["test_label"], batch_size)
print('Batch size is {}'.format(batch_size))
# copy parameters to all GPUs

dev_params = [get_params(params, c) for c in ctx]
for epoch in range(5):
# train
start = time()
train_data.reset()
for batch in train_data:
train_batch(batch, dev_params, ctx, lr)
nd.waitall() # wait all computations are finished to benchmark the time
print('Epoch %d, training time = %.1f sec'%(epoch, time()-start))
# validating
valid_data.reset()
correct, num = 0.0, 0.0
for batch in valid_data:

correct += valid_batch(batch, dev_params, ctx)

num += batch.data[0].shape[0]
print(' validation accuracy = %.4f'%(correct/num))
First run on a single GPU with batch size 64.

In [12]: run(1, 64, 0.3)
Running on [gpu(0)]
Batch size is 64
Epoch 0, training time = 3.7 sec
validation accuracy = 0.9586
Running on multiple GPUs, we often want to increase the batch size so that each GPU still gets a large
enough batch size for good computation performance. (A larger batch size sometimes slows down the
convergence, we often want to increases the learning rate as well but in this case we’ll keep it same. Feel
free to try higher learning rates.)
In [13]: run(2, 128, 0.3)
Running on [gpu(0), gpu(1)]
Batch size is 128
3.49.7 Conclusion
We have shown how to implement data parallelism on a deep neural network from scratch. Thanks to the
auto-parallelism, we only need to write serial codes while the engine is able to parallelize them on multiple
GPUs.
3.49.8 Next
Training with multiple GPUs with gluon

3.50 Training on multiple GPUs with gluon

Gluon makes it easy to implement data parallel training. In this notebook, we’ll implement data parallel
training for a convolutional neural network. If you’d like a finer grained view of the concepts, you might
want to first read the previous notebook, multi gpu from scratch with gluon.
To get started, let’s first define a simple convolutional neural network and loss function.
from mxnet import nd, gluon, autograd
net = gluon.nn.Sequential(prefix='cnn_')
net.add(gluon.nn.MaxPool2D(pool_size=(2,2), strides=(2,2)))
net.add(gluon.nn.MaxPool2D(pool_size=(2,2), strides=(2,2)))
net.add(gluon.nn.Dense(128, activation="relu"))
3.50.1 Initialize on multiple devices

Gluon supports initialization of network parameters over multiple devices. We accomplish this by passing
in an array of device contexts, instead of the single contexts we’ve used in earlier notebooks. When we pass
in an array of contexts, the parameters are initialized to be identical across all of our devices.
In [2]: GPU_COUNT = 2 # increase if you have more
ctx = [mx.gpu(i) for i in range(GPU_COUNT)]
net.collect_params().initialize(ctx=ctx)
Given a batch of input data, we can split it into parts (equal to the number of contexts) by calling gluon.
utils.split_and_load(batch, ctx). The split_and_load function doesn’t just split the
data, it also loads each part onto the appropriate device context.
So now when we call the forward pass on two separate parts, each one is computed on the appropriate
corresponding device and using the version of the parameters stored there.
In [3]: from mxnet.test_utils import get_mnist
mnist = get_mnist()
batch = mnist['train_data'][0:GPU_COUNT*2, :]
data = gluon.utils.split_and_load(batch, ctx)
print(net(data[0]))
print(net(data[1]))
[[-0.01876061 -0.02165037 -0.01293943 0.03837404 -0.00821797 -0.00911531

0.00416799 -0.00729158 -0.00232711 -0.00155549]
[ 0.00441474 -0.01953595 -0.00128483 0.02768224 0.01389615 -0.01320441
-0.01166505 -0.00637776 0.0135425 -0.00611765]]
[[ -6.78736670e-03 -8.86893831e-03 -1.04004676e-02 1.72976423e-02

2.26115398e-02 -6.36630831e-03 -1.54974898e-02 -1.22633884e-02
3.50. Training on multiple GPUs with gluon 267

1.19591374e-02 -6.60043515e-05]
[ -1.17358668e-02 -2.16879714e-02 1.71219767e-03 2.49827504e-02
1.16810966e-02 -9.52543691e-03 -1.03610428e-02 5.08510228e-03
7.06662657e-03 -9.25292261e-03]]
At any time, we can access the version of the parameters stored on each device. Recall from the first Chapter
that our weights may not actually be initialized when we call initialize because the parameter shapes
may not yet be known. In these cases, initialization is deferred pending shape inference.
In [4]: weight = net.collect_params()['cnn_conv0_weight']
for c in ctx:
print('=== channel 0 of the first conv on {} ==={}'.format(
c, weight.data(ctx=c)[0]))
=== channel 0 of the first conv on gpu(0) ===
[[[ 0.04118239 0.05352169 -0.04762455]
[ 0.06035256 -0.01528978 0.04946674]
[ 0.06110793 -0.00081179 0.02191102]]]
<NDArray 1x3x3 @gpu(0)>
=== channel 0 of the first conv on gpu(1) ===
[[[ 0.04118239 0.05352169 -0.04762455]
[ 0.06035256 -0.01528978 0.04946674]
[ 0.06110793 -0.00081179 0.02191102]]]
Similarly, we can access the gradients on each of the GPUs. Because each GPU gets a different part of the
batch (a different subset of examples), the gradients on each GPU vary.
In [5]: def forward_backward(net, data, label):
losses = [loss(net(X), Y) for X, Y in zip(data, label)]
for l in losses:
l.backward()
label = gluon.utils.split_and_load(mnist['train_label'][0:4], ctx)

forward_backward(net, data, label)
for c in ctx:
print('=== grad of channel 0 of the first conv2d on {} ==={}'.format(
c, weight.grad(ctx=c)[0]))
=== grad of channel 0 of the first conv2d on gpu(0) ===
[[[-0.02078936 -0.00562428 0.01711007]
[ 0.01138539 0.0280002 0.04094725]
[ 0.00993335 0.01218192 0.02122578]]]
=== grad of channel 0 of the first conv2d on gpu(1) ===
[[[-0.02543036 -0.02789939 -0.00302115]
[-0.04816786 -0.03347274 -0.00403483]
[-0.03178394 -0.01254033 0.00855637]]]

3.50.2 Put all things together

Now we can implement the remaining functions. Most of them are the same as when we did everything
by hand; one notable difference is that if a gluon trainer recognizes multi-devices, it will automatically
aggregate the gradients and synchronize the parameters.
In [ ]: from mxnet.io import NDArrayIter
from time import time
def train_batch(batch, ctx, net, trainer):

# split the data batch and load them on GPUs
data = gluon.utils.split_and_load(batch.data[0], ctx)
label = gluon.utils.split_and_load(batch.label[0], ctx)
# compute gradient
forward_backward(net, data, label)
# update parameters
trainer.step(batch.data[0].shape[0])
def valid_batch(batch, ctx, net):

data = batch.data[0].as_in_context(ctx[0])
pred = nd.argmax(net(data), axis=1)
return nd.sum(pred == batch.label[0].as_in_context(ctx[0])).asscalar()
def run(num_gpus, batch_size, lr):

# the list of GPUs will be used
ctx = [mx.gpu(i) for i in range(num_gpus)]
print('Running on {}'.format(ctx))
# data iterator
mnist = get_mnist()
train_data = NDArrayIter(mnist["train_data"], mnist["train_label"], batch_size)
valid_data = NDArrayIter(mnist["test_data"], mnist["test_label"], batch_size)
print('Batch size is {}'.format(batch_size))
net.collect_params().initialize(force_reinit=True, ctx=ctx)
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': lr})
# train
start = time()
train_data.reset()
train_batch(batch, ctx, net, trainer)
nd.waitall() # wait until all computations are finished to benchmark the t
print('Epoch %d, training time = %.1f sec'%(epoch, time()-start))
# validating
valid_data.reset()
correct, num = 0.0, 0.0
for batch in valid_data:
correct += valid_batch(batch, ctx, net)
num += batch.data[0].shape[0]
print(' validation accuracy = %.4f'%(correct/num))
run(1, 64, .3)
3.50. Training on multiple GPUs with gluon 269

run(GPU_COUNT, 64*GPU_COUNT, .3)

Running on [gpu(0)]
Batch size is 64
Running on [gpu(0), gpu(1)]
Batch size is 128
3.50.3 Conclusion
Both parameters and trainers in gluon support multi-devices. Moving from one device to multi-devices is
straightforward.
3.50.4 Next
Distributed training with multiple machines
3.51 Distributed training with multiple machines

In the previous two tutorials, we saw that using multiple GPUs within a machine can accelerate training. The
speedup, however, is limited by the number of GPUs installed in that machine. And it’s rare to find a single
machine with more than 16 GPUs nowadays. For some truly large-scale applications, this speedup might
still be insufficient. For example, it could still take many days to train a state-of-the-art CNN on millions of
images.
In this tutorial, we’ll discuss the key concepts you’ll need in order to go from a program that does single-
machine training to one that executes distributed training across multiple machines. We depict a typical
distributed system in the following figure, where multiple machines are connected by network switches.
Note that the way we used copyto to copy data from one GPU to another in the multiple-GPU tutorial
does not work when our GPUs are sitting on different machines. To make use of the available resources here
well need a better abstraction.
3.51.1 Key-value store

MXNet provides a key-value store to synchronize data among devices. The following code initializes an
ndarray associated with the key “weight” on a key-value store.
In [1]: from mxnet import kv, nd
store = kv.create('local')
shape = (2, 3)
x = nd.random_uniform(shape=shape)

store.init('weight', x)
print('=== init "weight" ==={}'.format(x))
=== init "weight" ===
[[ 0.54881352 0.59284461 0.71518934]
[ 0.84426576 0.60276335 0.85794562]]
After initialization, we can pull the value to multiple devices.

In [2]: from mxnet import gpu
ctx = [gpu(0), gpu(1)]
y = [nd.zeros(shape, ctx=c) for c in ctx]
store.pull('weight', out=y)
print('=== pull "weight" to {} ===\n{}'.format(ctx, y))
=== pull "weight" to [gpu(0), gpu(1)] ===
[
[[ 0.54881352 0.59284461 0.71518934]
[ 0.84426576 0.60276335 0.85794562]]
[[ 0.54881352 0.59284461 0.71518934]
[ 0.84426576 0.60276335 0.85794562]]
We can also push new data value into the store. It will first sum the data on the same key and then overwrite
the current value.
In [3]: z = [nd.ones(shape, ctx=ctx[i])+i for i in range(len(ctx))]
store.push('weight', z)
print('=== push to "weight" ===\n{}'.format(z))
store.pull('weight', out=y)
print('=== pull "weight" ===\n{}'.format(y))
=== push to "weight" ===
[
[[ 1. 1. 1.]
[ 1. 1. 1.]]
3.51. Distributed training with multiple machines 271

[[ 2. 2. 2.]
[ 2. 2. 2.]]
=== pull "weight" ===
[
[[ 3. 3. 3.]
[ 3. 3. 3.]]
[[ 3. 3. 3.]
[ 3. 3. 3.]]
With push and pull we can replace the allreduce function defined in multiple-gpus-scratch by
def allreduce(data, data_name, store):

store.push(data_name, data)
store.pull(data_name, out=data)
3.51.2 Distributed key-value store

Not only can we synchronize data within a machine, with the key-value store we can facilitate inter-machine
communication. To use it, one can create a distributed kvstore by using the following command: (Note:
distributed key-value store requires MXNet to be compiled with the flag USE_DIST_KVSTORE=1, e.g.
make USE_DIST_KVSTORE=1.)
store = kv.create('dist')
Now if we run the code from the previous section on two machines at the same time, then the store will
aggregate the two ndarrays pushed from each machine, and after that, the pulled results will be:
[[ 6. 6. 6.]
[ 6. 6. 6.]]
In the distributed setting, MXNet launches three kinds of processes (each time, running python myprog.
py will create a process). One is a worker, which runs the user program, such as the code in the previous
section. The other two are the server, which maintains the data pushed into the store, and the scheduler,
which monitors the aliveness of each node.
It’s up to users which machines to run these processes on. But to simplify the process placement and
launching, MXNet provides a tool located at tools/launch.py.
Assume there are two machines, A and B. They are ssh-able, and their IPs are saved in a file named
hostfile. Then we can start one worker in each machine through:
$ mxnet_path/tools/launch.py -H hostfile -n 2 python myprog.py
It will also start a server in each machine, and the scheduler on the same machine we are currently on.
3.51.3 Using kvstore in gluon

As mentioned in our section on training with multiple GPUs from scratch, to implement data parallelism
we just need to specify

chapter07_distributed-learning/img/dist_kv.png
• how to split data

• how to synchronize gradients and weights
We already see from multiple-gpu-gluon that a gluon trainer can automatically aggregate the gradients
among different GPUs. What it really does is having a key-value store with type local within it. Therefore,
to change to multi-machine training we only need to pass a distributed key-value store, for example,
store = kv.create('dist')
trainer = gluon.Trainer(..., kvstore=store)
To split the data, however, we cannot directly copy the previous approach. One commonly used solution is
to split the whole dataset into k parts at the beginning, then let the i-th worker only read the i-th part of the
data.
We can obtain the total number of workers by reading the attribute num_workers and the rank of the
current worker from the attribute rank.
In [4]: print('total number of workers: %d'%(store.num_workers))
print('my rank among workers: %d'%(store.rank))
total number of workers: 1
my rank among workers: 0
With this information, we can manually access the proper chunk of the input data. In addition, several data
iterators provided by MXNet already support reading only part of the data. For example,
from mxnet.io import ImageRecordIter

data = ImageRecordIter(num_parts=store.num_workers, part_index=store.rank, ...
˓→)
3.51. Distributed training with multiple machines 273


CHAPTER
FOUR
PART 2: APPLICATIONS
4.1 Object Detection Using Convolutional Neural Networks

So far, when we’ve talked about making predictions based on images, we were concerned only with classi-
fication. We asked questions like is this digit a “0”, “1”, . . . , or “9?” or, does this picture depict a “cat” or
a “dog”? Object detection is a more challenging task. Here our goal is not only to say what is in the image
but also to recognize where it is in the image. As an example, consider the following image, which depicts
two dogs and a cat together with their locations.
work/mxnet/tests/nightly/straight_dope/tmp_notebook/
So object defers from image classification in a few ways. First, while a classifier outputs a single category
per image, an object detector must be able to recognize multiple objects in a single image. Technically, this
task is called multiple object detection, but most research in the area addresses the multiple object setting, so
we’ll abuse terminology just a little. Second, while classifiers need only to output probabilities over classes,
object detectors must output both probabilities of class membership and also the coordinates that identify
the location of the objects.
On this chapter we’ll demonstrate the single shot multiple box object detector (SSD), a popular model for
object detection that was first described in this paper, and is straightforward to implement in MXNet Gluon.
4.1.1 SSD: Single Shot MultiBox Detector

The SSD model predicts anchor boxes at multiple scales. The model architecture is illustrated in the follow-
ing figure.
/home/doosik/gluon/mxnet-the-straight-dope/build/_bu
We first use a body network to extract the image features, which are used as the input to the first scale
(scale 0). The class labels and the corresponding anchor boxes are predicted by class_predictor
275
and box_predictor, respectively. We then downsample the representations to the next scale (scale 1).
Again, at this new resolution, we predict both classes and anchor boxes. This downsampling and predicting
routine can be repeated in multiple times to obtain results on multiple resolution scales. Let’s walk through
the components one by one in a bit more detail.
Default anchor boxes

Since an anchor box can have arbituary shape, we sample a set of anchor boxes as the candidate. In particu-
lar, for each pixel, we sample multiple boxes centered at this pixel but have various sizes and ratios. Assume
the input size is 𝑤 × ℎ, - for size 𝑠 ∈ (0, 1], the generated box shape will be 𝑤𝑠 × ℎ𝑠 - for ratio 𝑟 > 0, the
√
generated box shape will be 𝑤 𝑟 × √ℎ𝑟
We can sample the boxes using the operator MultiBoxPrior. It accepts n sizes and m ratios to generate
n+m-1 boxes for each pixel. The first i boxes are generated from sizes[i], ratios[0] if 𝑖 ≤ 𝑛
otherwise sizes[0], ratios[i-n].
import mxnet as mx
from mxnet.contrib.ndarray import MultiBoxPrior
n = 40
# shape: batch x channel x height x weight
x = nd.random_uniform(shape=(1, 3, n, n))
y = MultiBoxPrior(x, sizes=[.5, .25, .1], ratios=[1, 2, .5])
# the first anchor box generated for pixel at (20,20)

# its format is (x_min, y_min, x_max, y_max)
boxes = y.reshape((n, n, -1, 4))
print('The first anchor box at row 21, column 21:', boxes[20, 20, 0, :])
The first anchor box at row 21, column 21:
[0.2625 0.2625 0.7625 0.7625]
<NDArray 4 @cpu(0)>
We can visualize all anchor boxes generated for one pixel on a certain size feature map.
def box_to_rect(box, color, linewidth=3):
"""convert an anchor box to a matplotlib rectangle"""
box = box.asnumpy()
return plt.Rectangle(
(box[0], box[1]), (box[2]-box[0]), (box[3]-box[1]),
fill=False, edgecolor=color, linewidth=linewidth)
colors = ['blue', 'green', 'red', 'black', 'magenta']
plt.imshow(nd.ones((n, n, 3)).asnumpy())
anchors = boxes[20, 20, :, :]
for i in range(anchors.shape[0]):
plt.gca().add_patch(box_to_rect(anchors[i,:]*n, colors[i]))
plt.show()
276 Chapter 4. Part 2: Applications

Predict classes
For each anchor box, we want to predict the associated class label. We make this prediction by using a
convolution layer. We choose a kernel of size 3 × 3 with padding size (1, 1) so that the output will have
the same width and height as the input. The confidence scores for the anchor box class labels are stored in
channels. In particular, for the i-th anchor box:
• channel i*(num_class+1) store the scores for this box contains only background
• channel i*(num_class+1)+1+j store the scores for this box contains an object from the j-th class
In [3]: from mxnet.gluon import nn
def class_predictor(num_anchors, num_classes):
"""return a layer to predict classes"""
return nn.Conv2D(num_anchors * (num_classes + 1), 3, padding=1)
cls_pred = class_predictor(5, 10)

cls_pred.initialize()
x = nd.zeros((2, 3, 20, 20))
print('Class prediction', cls_pred(x).shape)
Class prediction (2, 55, 20, 20)
Predict anchor boxes

The goal is predict how to transfer the current anchor box to the correct box. That is, assume 𝑏 is one of the
sampled default box, while 𝑌 is the ground truth, then we want to predict the delta positions ∆(𝑌, 𝑏), which
is a 4-length vector.
More specifically, the we define the delta vector as: [𝑡𝑥 , 𝑡𝑦 , 𝑡𝑤𝑖𝑑𝑡ℎ , 𝑡ℎ𝑒𝑖𝑔ℎ𝑡 ], where
• 𝑡𝑥 = (𝑌𝑥 − 𝑏𝑥 )/𝑏𝑤𝑖𝑑𝑡ℎ
4.1. Object Detection Using Convolutional Neural Networks 277

• 𝑡𝑦 = (𝑌𝑦 − 𝑏𝑦 )/𝑏ℎ𝑒𝑖𝑔ℎ𝑡
• 𝑡𝑤𝑖𝑑𝑡ℎ = (𝑌𝑤𝑖𝑑𝑡ℎ − 𝑏𝑤𝑖𝑑𝑡ℎ )/𝑏𝑤𝑖𝑑𝑡ℎ
• 𝑡ℎ𝑒𝑖𝑔ℎ𝑡 = (𝑌ℎ𝑒𝑖𝑔ℎ𝑡 − 𝑏ℎ𝑒𝑖𝑔ℎ𝑡 )/𝑏ℎ𝑒𝑖𝑔ℎ𝑡
Normalizing the deltas with box width/height tends to result in better convergence behavior.
Similar to classes, we use a convolution layer here. The only difference is that the output channel size is
now num_anchors * 4, with the predicted delta positions for the i-th box stored from channel i*4 to
i*4+3.
In [4]: def box_predictor(num_anchors):
"""return a layer to predict delta locations"""
return nn.Conv2D(num_anchors * 4, 3, padding=1)
box_pred = box_predictor(10)
box_pred.initialize()
x = nd.zeros((2, 3, 20, 20))
print('Box prediction', box_pred(x).shape)
Box prediction (2, 40, 20, 20)
Down-sample features
Each time, we downsample the features by half. This can be achieved by a simple pooling layer with pooling
size 2. We may also stack two convolution, batch normalization and ReLU blocks before the pooling layer
to make the network deeper.
In [5]: def down_sample(num_filters):
"""stack two Conv-BatchNorm-Relu blocks and then a pooling layer
to halve the feature size"""
out = nn.HybridSequential()
for _ in range(2):
out.add(nn.Conv2D(num_filters, 3, strides=1, padding=1))
out.add(nn.BatchNorm(in_channels=num_filters))
out.add(nn.Activation('relu'))
out.add(nn.MaxPool2D(2))
return out
blk = down_sample(10)
blk.initialize()
x = nd.zeros((2, 3, 20, 20))
print('Before', x.shape, 'after', blk(x).shape)
Before (2, 3, 20, 20) after (2, 10, 10, 10)
Manage preditions from multiple layers

A key property of SSD is that predictions are made at multiple layers with shrinking spatial size. Thus, we
have to handle predictions from multiple feature layers. One idea is to concatenate them along convolutional
channels, with each one predicting a corresponding value (class or box) for each default anchor. We give
class predictor as an example, and box predictor follows the same rule.
In [6]: # a certain feature map with 20x20 spatial shape
feat1 = nd.zeros((2, 8, 20, 20))

print('Feature map 1', feat1.shape)

cls_pred1 = class_predictor(5, 10)
cls_pred1.initialize()
y1 = cls_pred1(feat1)
print('Class prediction for feature map 1', y1.shape)
# down-sample
ds = down_sample(16)
ds.initialize()
feat2 = ds(feat1)
print('Feature map 2', feat2.shape)
cls_pred2 = class_predictor(3, 10)
cls_pred2.initialize()
y2 = cls_pred2(feat2)
print('Class prediction for feature map 2', y2.shape)
Feature map 1 (2, 8, 20, 20)
Class prediction for feature map 1 (2, 55, 20, 20)
Feature map 2 (2, 16, 10, 10)
Class prediction for feature map 2 (2, 33, 10, 10)
In [7]: def flatten_prediction(pred):
return nd.flatten(nd.transpose(pred, axes=(0, 2, 3, 1)))
def concat_predictions(preds):
return nd.concat(*preds, dim=1)
flat_y1 = flatten_prediction(y1)
print('Flatten class prediction 1', flat_y1.shape)
flat_y2 = flatten_prediction(y2)
print('Flatten class prediction 2', flat_y2.shape)
print('Concat class predictions', concat_predictions([flat_y1, flat_y2]).shape)
Flatten class prediction 1 (2, 22000)
Flatten class prediction 2 (2, 3300)
Concat class predictions (2, 25300)
Body network
The body network is used to extract features from the raw pixel inputs. Common choices follow the ar-
chitectures of the state-of-the-art convolution neural networks for image classification. For demonstration
purpose, we just stack several down sampling blocks to form the body network.
def body():
"""return the body network"""
out = nn.HybridSequential()
for nfilters in [16, 32, 64]:
out.add(down_sample(nfilters))
return out
bnet = body()
bnet.initialize()
x = nd.zeros((2, 3, 256, 256))
print('Body network', [y.shape for y in bnet(x)])
Body network [(64, 32, 32), (64, 32, 32)]

Create a toy SSD model

Now, let’s create a toy SSD model that takes images of resolution 256 × 256 as input.
In [9]: def toy_ssd_model(num_anchors, num_classes):
"""return SSD modules"""
downsamples = nn.Sequential()
class_preds = nn.Sequential()
box_preds = nn.Sequential()
downsamples.add(down_sample(128))
for scale in range(5):

class_preds.add(class_predictor(num_anchors, num_classes))
box_preds.add(box_predictor(num_anchors))
return body(), downsamples, class_preds, box_preds
print(toy_ssd_model(5, 2))
(HybridSequential(
(0): HybridSequential(
(0): Conv2D(None -> 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): BatchNorm(momentum=0.9, fix_gamma=False, axis=1, use_global_stats=False, eps=1e-05
(2): Activation(relu)
(6): MaxPool2D(size=(2, 2), stride=(2, 2), padding=(0, 0), ceil_mode=False)
)
)
)
), Sequential(

)
)
)
), Sequential(
), Sequential(
))
Forward
Given an input and the model, we can run the forward pass.
In [10]: def toy_ssd_forward(x, body, downsamples, class_preds, box_preds, sizes, ratios):
# extract feature with the body network
x = body(x)
# for each scale, add anchors, box and class predictions,

# then compute the input to next scale
default_anchors = []
predicted_boxes = []
predicted_classes = []
for i in range(5):
default_anchors.append(MultiBoxPrior(x, sizes=sizes[i], ratios=ratios[i]))
predicted_boxes.append(flatten_prediction(box_preds[i](x)))

predicted_classes.append(flatten_prediction(class_preds[i](x)))
if i < 3:
x = downsamples[i](x)
elif i == 3:
# simply use the pooling layer
x = nd.Pooling(x, global_pool=True, pool_type='max', kernel=(4, 4))
return default_anchors, predicted_classes, predicted_boxes
Put all things together

class ToySSD(gluon.Block):
def __init__(self, num_classes, **kwargs):
super(ToySSD, self).__init__(**kwargs)
# anchor box sizes for 4 feature scales
self.anchor_sizes = [[.2, .272], [.37, .447], [.54, .619], [.71, .79], [.8
# anchor box ratios for 4 feature scales
self.anchor_ratios = [[1, 2, .5]] * 5
self.num_classes = num_classes
self.body, self.downsamples, self.class_preds, self.box_preds = toy_ss

default_anchors, predicted_classes, predicted_boxes = toy_ssd_forward(x, s
self.class_preds, self.box_preds, self.anchor_sizes, self.anchor_ratio
# we want to concatenate anchors, class predictions, box predictions from
anchors = concat_predictions(default_anchors)
box_preds = concat_predictions(predicted_boxes)
class_preds = concat_predictions(predicted_classes)
# it is better to have class predictions reshaped for softmax computation
class_preds = nd.reshape(class_preds, shape=(0, -1, self.num_classes + 1))
return anchors, class_preds, box_preds
Outputs of ToySSD
In [12]: # instantiate a ToySSD network with 10 classes
net = ToySSD(2)
net.initialize()
x = nd.zeros((1, 3, 256, 256))
default_anchors, class_predictions, box_predictions = net(x)
print('Outputs:', 'anchors', default_anchors.shape, 'class prediction', class_pred
Outputs: anchors (1, 5444, 4) class prediction (1, 5444, 3) box prediction (1, 21776)
4.1.2 Dataset
For demonstration purposes, we’ll train our model to detect Pikachu in the wild. We generated a synthetic
toy dataset by rendering images from open-sourced 3D Pikachu models. The dataset consists of 1000
pikachus with random pose/scale/position in random background images. The exact locations are recorded
as ground-truth for training and validation.


Download dataset
In [13]: from mxnet.test_utils import download
import os.path as osp
def verified(file_path, sha1hash):
import hashlib
sha1 = hashlib.sha1()
with open(file_path, 'rb') as f:
while True:
data = f.read(1048576)
if not data:
break
sha1.update(data)
matched = sha1.hexdigest() == sha1hash
if not matched:
print('Found hash mismatch in file {}, possibly due to incomplete download
return matched
url_format = 'https://apache-mxnet.s3-accelerate.amazonaws.com/gluon/dataset/pikac
hashes = {'train.rec': 'e6bcb6ffba1ac04ff8a9b1115e650af56ee969c8',
'train.idx': 'dcf7318b2602c06428b9988470c731621716c393',
'val.rec': 'd6c33f799b4d058e82f2cb5bd9a976f69d72d520'}
for k, v in hashes.items():
fname = 'pikachu_' + k
target = osp.join('data', fname)
url = url_format.format(k)
if not osp.exists(target) or not verified(target, v):
print('Downloading', target, url)
download(url, fname=fname, dirname='data', overwrite=True)
Load dataset
In [14]: import mxnet.image as image
data_shape = 256
batch_size = 32
def get_iterators(data_shape, batch_size):
class_names = ['pikachu']
num_class = len(class_names)
train_iter = image.ImageDetIter(
batch_size=batch_size,
data_shape=(3, data_shape, data_shape),
path_imgrec='./data/pikachu_train.rec',
path_imgidx='./data/pikachu_train.idx',
shuffle=True,
mean=True,
rand_crop=1,
min_object_covered=0.95,
max_attempts=200)
val_iter = image.ImageDetIter(
data_shape=(3, data_shape, data_shape),
path_imgrec='./data/pikachu_val.rec',
shuffle=False,
mean=True)

return train_iter, val_iter, class_names, num_class
train_data, test_data, class_names, num_class = get_iterators(data_shape, batch_si

batch = train_data.next()
print(batch)
DataBatch: data shapes: [(32, 3, 256, 256)] label shapes: [(32, 1, 5)]
Illustration
Let’s display one image loaded by ImageDetIter.
img = batch.data[0][0].asnumpy() # grab the first image, convert to numpy array

img = img.transpose((1, 2, 0)) # we want channel to be the last dimension
img += np.array([123, 117, 104])
img = img.astype(np.uint8) # use uint8 (0-255)
# draw bounding boxes on image
for label in batch.label[0][0].asnumpy():
if label[0] < 0:
break
print(label)
xmin, ymin, xmax, ymax = [int(x * data_shape) for x in label[1:5]]
rect = plt.Rectangle((xmin, ymin), xmax - xmin, ymax - ymin, fill=False, edgec
plt.gca().add_patch(rect)
plt.imshow(img)
plt.show()
[0. 0.4849993 0.39879292 0.607934 0.54115665]
4.1.3 Train

Losses
Network predictions will be penalized for incorrect class predictions and wrong box deltas.
In [16]: from mxnet.contrib.ndarray import MultiBoxTarget
def training_targets(default_anchors, class_predicts, labels):
class_predicts = nd.transpose(class_predicts, axes=(0, 2, 1))
z = MultiBoxTarget(*[default_anchors, labels, class_predicts])
box_target = z[0] # box offset target for (x, y, width, height)
box_mask = z[1] # mask is used to ignore box offsets we don't want to penaliz
cls_target = z[2] # cls_target is an array of labels for all anchors boxes
return box_target, box_mask, cls_target
Pre-defined losses are provided in gluon.loss package, however, we can define losses manually.
First, we need a Focal Loss for class predictions.
In [17]: class FocalLoss(gluon.loss.Loss):
def __init__(self, axis=-1, alpha=0.25, gamma=2, batch_axis=0, **kwargs):
super(FocalLoss, self).__init__(None, batch_axis, **kwargs)
self._axis = axis
self._alpha = alpha
self._gamma = gamma
def hybrid_forward(self, F, output, label):

output = F.softmax(output)
pt = F.pick(output, label, axis=self._axis, keepdims=True)
loss = -self._alpha * ((1 - pt) ** self._gamma) * F.log(pt)
return F.mean(loss, axis=self._batch_axis, exclude=True)
# cls_loss = gluon.loss.SoftmaxCrossEntropyLoss()
cls_loss = FocalLoss()
print(cls_loss)
FocalLoss(batch_axis=0, w=None)
Next, we need a SmoothL1Loss for box predictions.

In [18]: class SmoothL1Loss(gluon.loss.Loss):
def __init__(self, batch_axis=0, **kwargs):
super(SmoothL1Loss, self).__init__(None, batch_axis, **kwargs)
def hybrid_forward(self, F, output, label, mask):

loss = F.smooth_l1((output - label) * mask, scalar=1.0)
return F.mean(loss, self._batch_axis, exclude=True)
box_loss = SmoothL1Loss()
print(box_loss)
SmoothL1Loss(batch_axis=0, w=None)
Evaluation metrics
Here, we define two metrics that we’ll use to evaluate our performance whien training. You’re already
familiar with accuracy unless you’ve been naughty and skipped straight to object detection. We use the
accuracy metric to assess the quality of the class predictions. Mean absolute error (MAE) is just the L1
distance, introduced in our linear algebra chapter. We use this to determine how close the coordinates of

the predicted bounding boxes are to the ground-truth coordinates. Because we are jointly solving both a
classification problem and a regression problem, we need an appropriate metric for each task.
In [19]: cls_metric = mx.metric.Accuracy()
box_metric = mx.metric.MAE() # measure absolute difference between prediction and
In [20]: ### Set context for training
ctx = mx.gpu() # it may takes too long to train using CPU
try:
_ = nd.zeros(1, ctx=ctx)
# pad label for cuda implementation
train_data.reshape(label_shape=(3, 5))
train_data = test_data.sync_label_shape(train_data)
except mx.base.MXNetError as err:
print('No GPU enabled, fall back to CPU, sit back and be patient...')
ctx = mx.cpu()
Initialize parameters
In [21]: net = ToySSD(num_class)
net.initialize(mx.init.Xavier(magnitude=2), ctx=ctx)
Set up trainer
In [22]: net.collect_params().reset_ctx(ctx)
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.1, 'wd':
Start training
Optionally we load pretrained model for demonstration purpose. One can set from_scratch = True
to training from scratch, which may take more than 30 mins to finish using a single capable GPU.
In [23]: epochs = 1 # set larger to get better performance
log_interval = 20
from_scratch = False # set to True to train from scratch
if from_scratch:
start_epoch = 0
else:
start_epoch = 148
pretrained = 'ssd_pretrained.params'
sha1 = 'fbb7d872d76355fff1790d864c2238decdb452bc'
url = 'https://apache-mxnet.s3-accelerate.amazonaws.com/gluon/models/ssd_pikac
if not osp.exists(pretrained) or not verified(pretrained, sha1):
print('Downloading', pretrained, url)
download(url, fname=pretrained, overwrite=True)
net.load_parameters(pretrained, ctx)
In [24]: import time
from mxnet import autograd as ag
for epoch in range(start_epoch, epochs):
# reset iterator and tick
train_data.reset()
cls_metric.reset()
box_metric.reset()
tic = time.time()

# iterate through all batch

for i, batch in enumerate(train_data):
btic = time.time()
# record gradients
with ag.record():
x = batch.data[0].as_in_context(ctx)
y = batch.label[0].as_in_context(ctx)
default_anchors, class_predictions, box_predictions = net(x)
box_target, box_mask, cls_target = training_targets(default_anchors, c
# losses
loss1 = cls_loss(class_predictions, cls_target)
loss2 = box_loss(box_predictions, box_target, box_mask)
# sum all losses
loss = loss1 + loss2
# backpropagate
loss.backward()
# apply
# update metrics
cls_metric.update([cls_target], [nd.transpose(class_predictions, (0, 2, 1)
box_metric.update([box_target], [box_predictions * box_mask])
if (i + 1) % log_interval == 0:
name1, val1 = cls_metric.get()
name2, val2 = box_metric.get()
print('[Epoch %d Batch %d] speed: %f samples/s, training: %s=%f, %s=%
%(epoch ,i, batch_size/(time.time()-btic), name1, val1, name2, v
# end of epoch logging

name1, val1 = cls_metric.get()
name2, val2 = box_metric.get()
print('[Epoch %d] training: %s=%f, %s=%f'%(epoch, name1, val1, name2, val2))
print('[Epoch %d] time cost: %f'%(epoch, time.time()-tic))
# we can save the trained parameters to disk

net.save_parameters('ssd_%d.params' % epochs)
4.1.4 Test
Testing is similar to training, except that we don’t need to compute gradients and training targets. Instead,
we take the predictions from network output, and combine them to get the real detection output.
Prepare the test data

import cv2
def preprocess(image):
"""Takes an image and apply preprocess"""
# resize to data_shape
image = cv2.resize(image, (data_shape, data_shape))
# swap BGR to RGB
image = image[:, :, (2, 1, 0)]
# convert to float before subtracting mean
image = image.astype(np.float32)

# subtract mean
image -= np.array([123, 117, 104])
# organize as [batch-channel-height-width]
image = np.transpose(image, (2, 0, 1))
image = image[np.newaxis, :]
# convert to ndarray
image = nd.array(image)
return image
image = cv2.imread('../img/pikachu.jpg')
x = preprocess(image)
print('x', x.shape)
x (1, 3, 256, 256)
Network inference
In a single line of code!
In [26]: # if pre-trained model is provided, we can load it
# net.load_parameters('ssd_%d.params' % epochs, ctx)
anchors, cls_preds, box_preds = net(x.as_in_context(ctx))
print('anchors', anchors)
print('class predictions', cls_preds)
print('box delta predictions', box_preds)
anchors
[[[-0.084375 -0.084375 0.115625 0.115625 ]
[-0.12037501 -0.12037501 0.151625 0.151625 ]
[-0.12579636 -0.05508568 0.15704636 0.08633568]
...
[ 0.01949999 0.01949999 0.9805 0.9805 ]
[-0.12225395 0.18887302 1.1222539 0.81112695]
[ 0.18887302 -0.12225395 0.81112695 1.1222539 ]]]
class predictions
[[[ 0.3136385 -1.6613694 ]
[ 1.1190383 -1.7688792 ]
[ 1.165454 -0.97607 ]
...
[-0.26088136 -1.2618818 ]
[ 0.4366543 -0.88175875]
[ 0.24387847 -0.8944956 ]]]
box delta predictions
[[-0.16194503 -0.15946479 -0.68138134 ... -0.23063782 0.09888595
-0.25365576]]
Convert predictions to real object detection results

In [27]: from mxnet.contrib.ndarray import MultiBoxDetection
# convert predictions to probabilities using softmax
cls_probs = nd.SoftmaxActivation(nd.transpose(cls_preds, (0, 2, 1)), mode='channel
# apply shifts to anchors boxes, non-maximum-suppression, etc...

output = MultiBoxDetection(*[cls_probs, box_preds, anchors], force_suppress=True,

print(output)
[[[ 0. 0.63146746 0.5177919 0.5050526 0.6725537

0.70119935]
[-1. 0.6285925 0.5252597 0.5088168 0.6633131
0.704806 ]
[-1. 0.5856898 0.5370002 0.5022707 0.6642288
0.7136973 ]
...
[-1. -1. -1. -1. -1.
-1. ]
[-1. -1. -1. -1. -1.
-1. ]
[-1. -1. -1. -1. -1.
-1. ]]]
Each row in the output corresponds to a detection box, as in format [class_id, confidence, xmin, ymin, xmax,
ymax].
Most of the detection results are -1, indicating that they either have very small confidence scores, or been
suppressed through non-maximum-suppression.
Display results
In [28]: def display(img, out, thresh=0.5):
import random
mpl.rcParams['figure.figsize'] = (10,10)
pens = dict()
plt.clf()
plt.imshow(img)
for det in out:
cid = int(det[0])
if cid < 0:
continue
score = det[1]
if score < thresh:
continue
if cid not in pens:
pens[cid] = (random.random(), random.random(), random.random())
scales = [img.shape[1], img.shape[0]] * 2
xmin, ymin, xmax, ymax = [int(p * s) for p, s in zip(det[2:6].tolist(), sc
rect = plt.Rectangle((xmin, ymin), xmax - xmin, ymax - ymin, fill=False,
edgecolor=pens[cid], linewidth=3)
plt.gca().add_patch(rect)
text = class_names[cid]
plt.gca().text(xmin, ymin-2, '{:s} {:.3f}'.format(text, score),
bbox=dict(facecolor=pens[cid], alpha=0.5),
fontsize=12, color='white')
plt.show()
display(image[:, :, (2, 1, 0)], output[0].asnumpy(), thresh=0.45)

4.1.5 Conclusion
Detection is harder than classification, since we want not only class probabilities, but also localizations of
different objects including potential small objects. Using sliding window together with a good classifier
might be an option, however, we have shown that with a properly designed convolutional neural network,
we can do single shot detection which is blazing fast and accurate!
4.2 Transfering knowledge through finetuning

In previous chapters, we demonstrated how to train a neural network to recognize the categories correspond-
ing to objects in images. We looked at toy datasets like hand-written digits, and thumbnail-sized pictures of
4.2. Transfering knowledge through finetuning 291

animals. And we talked about the ImageNet dataset, the default academic benchmark, which contains 1M
million images, 1000 each from 1000 separate classes.
The ImageNet dataset categorically changed what was possible in computer vision. It turns out some things
are possible (these days, even easy) on gigantic datasets, that simply aren’t with smaller datasets. In fact,
we don’t know of any technique that can comparably powerful model on a similar photograph dataset but
containing only, say, 10k images.
And that’s a problem. Because however impressive the results of CNNs on ImageNet may be, most people
aren’t interested in ImageNet itself. They’re interested in their own problems. Recognize people based
on pictures of their faces. Distinguish between photographs of 10 different types of coral on the ocean
floor. Usually when individuals (and not Amazon, Google, or inter-institutional big science initiatives) are
interested in solving a computer vision problem, they come to the table with modestly sized datasets. A few
hundred examples may be common and a few thousand examples may be as much as you can reasonably
ask for.
So one natural question emerges. Can we somehow use the powerful models trained on millions of examples
for one dataset, and apply them to improve performance on a new problem with a much smaller dataset?
This kind of problem (learning on source dataset, bringing knowledge to target dataset), is appropriately
called transfer learning. Fortunately, we have some effective tools for solving this problem.
For deep neural networks, the most popular approach is called finetuning and the idea is both simple and
effective:
• Train a neural network on the source task 𝑆.
• Decapitate it, replacing it’s output layer appropriate to target task 𝑇 .
• Initialize the weights on the new output layer randomly, keeping all other (pretrained) weights the
same.
• Begin training on the new dataset.
This might be clearer if we visualize the algorithm:
In this section, we’ll demonstrate fine-tuning, using the popular and compact SqueezeNet architecture. Since
we don’t want to saddle you with the burden of downloading ImageNet, or of training on ImageNet from
scratch, we’ll pull the weights of the pretrained Squeeze net from the internet. Specifically, we’ll be fine-
tuning a squeezenet-1.1 that was pre-trained on imagenet-12. Finally, we’ll fine-tune it to recognize hotdogs.
We’ll start with the obligatory ritual of importing a bunch of stuff that you’ll need later.
In [ ]: %pylab inline
pylab.rcParams['figure.figsize'] = (10, 6)
4.2.1 Settings
We’ll set a few settings up here that you can configure later to manipulate the behavior of the algorithm.
These are mostly familiar. Hybrid mode, uses the just in time compiler described in our chapter on high
performance training to make the network much faster to train. Since we’re not working with any crazy
dynamic graphs that can’t be compiled, there’s no reason not to hybridize. The batch size, number of
training epochs, weight decay, and learing rate should all be familiar by now. The positive class weight,
says how much more we should upweight the importance of positive instances (photos of hot dogs) in the

Fig. 4.1: hot dog

objective function. We use this to combat the extreme class imbalance (not surprisingly, most pictures do
not depict hot dogs).
In [ ]: # Demo mode uses the validation dataset for training, which is smaller and faster t
demo = True
log_interval = 100
# Options are imperative or hybrid. Use hybrid for better performance.

mode = 'hybrid'
# training hyperparameters
batch_size = 256
if demo:
epochs = 5
learning_rate = 0.02
wd = 0.002
else:
epochs = 40
wd = 0.002
# the class weight for hotdog class to help the imbalance problem.
positive_class_weight = 5
import logging
logging.basicConfig(level=logging.INFO)
import os
import time
from collections import OrderedDict
import skimage.io as io
import mxnet as mx
from mxnet.test_utils import download
mx.random.seed(127)
# setup the contexts; will use gpus if avaliable, otherwise cpu

gpus = mx.test_utils.list_gpus()
contexts = [mx.gpu(i) for i in gpus] if len(gpus) > 0 else [mx.cpu()]
4.2.2 Dataset
Formally, hot dog recognition is a binary classification problem. We’ll use 1 to represent the hotdog class,
and 0 for the not hotdog class. Our hot dog dataset (the target dataset which we’ll fine-tune the model
to) contains 18,141 sample images, 2091 of which are hotdogs. Because the dataset is imbalanced (e.g.
hotdog class is only 1% in mscoco dataset), sampling interesting negative samples can help to improve the
performance of our algorithm. Thus, in the negative class in the our dataset, two thirds are images from food
categories (e.g. pizza) other than hotdogs, and 30% are images from all other categories.
Files
We prepare the dataset in the format of MXRecord using im2rec tool. As of the current draft, rec files are
not yet explained in the book, but if you’re reading after November or December 2017 and you still see this

note, open an issue on GitHub and let us know to stop slacking off.
• not_hotdog_train.rec 641M (1882 positive, 10000 interesting negative, and 5000 random negative)
• not_hotdog_validation.rec 49M (209 positive, 700 interesting negative, and 350 random negative)
In [ ]: dataset_files = {'train': ('not_hotdog_train-e6ef27b4.rec', '0aad7e1f16f5fb109b719a
'validation': ('not_hotdog_validation-c0201740.rec', '723ae5f8a433
To demo the model here, we’re justgoing to use the smaller validation set. But if you’re interested in training
on the full set, set ‘demo’ to False in the settings at the beginning. Now we’re ready to download and verify
the dataset.
In [ ]: if demo:
training_dataset, training_data_hash = dataset_files['validation']
else:
training_dataset, training_data_hash = dataset_files['train']
validation_dataset, validation_data_hash = dataset_files['validation']

import hashlib
while True:
if not data:
break
sha1.update(data)
if not matched:
logging.warn('Found hash mismatch in file {}, possibly due to incomplete do
.format(file_path))
return matched
url_format = 'https://apache-mxnet.s3-accelerate.amazonaws.com/gluon/dataset/{}'
if not os.path.exists(training_dataset) or not verified(training_dataset, training_
logging.info('Downloading training dataset.')
download(url_format.format(training_dataset),
overwrite=True)
if not os.path.exists(validation_dataset) or not verified(validation_dataset, valid
logging.info('Downloading validation dataset.')
download(url_format.format(validation_dataset),
overwrite=True)
Iterators
The record files can be read using mx.io.ImageRecordIter
In [ ]: # load dataset
train_iter = mx.io.ImageRecordIter(path_imgrec=training_dataset,
min_img_size=256,
data_shape=(3, 224, 224),
rand_crop=True,
shuffle=True,

max_random_scale=1.5,
min_random_scale=0.75,
rand_mirror=True)
val_iter = mx.io.ImageRecordIter(path_imgrec=validation_dataset,
min_img_size=256,
data_shape=(3, 224, 224),
batch_size=batch_size)
4.2.3 Model
The model we are finetuning is SqueezeNet. Gluon module offers squeezenet v1.0 and v1.1 that are pre-
trained on ImageNet. This is just a convolutional neural network, with an architecture chosen to have a small
number of parameters and to require a minimal amount of computation. It’s especially popular for folks that
need to run CNNs on low-powered devices like cell phones and other internet-of-things devices.
4.2.4 Pulling the pre-trained model

Fortunately, MXNet has a model zoo that gives us convenient access to a number of popular models, both
their architectres and their pretrained parameters. Let’s download SqueezeNet right now with just a few
lines of code.
In [ ]: from mxnet.gluon import nn
from mxnet.gluon.model_zoo import vision as models
# get pretrained squeezenet

net = models.squeezenet1_1(pretrained=True, prefix='deep_dog_', ctx=contexts)
# hot dog happens to be a class in imagenet.
# we can reuse the weight for that class for better performance
# here's the index for that class for later use
imagenet_hotdog_index = 713
DeepDog net
We can now use the feature extractor part from the pretrained squeezenet to build our own network. The
model zoo, even handles the decaptiation for us. All we have to do is specify the number out of output
classes in our new task, which we do via the keyword argument classes=2.
In [ ]: deep_dog_net = models.squeezenet1_1(prefix='deep_dog_', classes=2)
deep_dog_net.collect_params().initialize(ctx=contexts)
deep_dog_net.features = net.features
print(deep_dog_net)
The network can already be used for prediction. However, since it hasn’t been finetuned yet, the network
performance could be bad.
In [ ]: from skimage.color import rgba2rgb
def classify_hotdog(net, url, contexts):

I = io.imread(url)
if I.shape[2] == 4:
I = rgba2rgb(I)
image = mx.nd.array(I).astype(np.uint8)
plt.subplot(1, 2, 1)
plt.imshow(image.asnumpy())

image = mx.image.resize_short(image, 256)

image, _ = mx.image.center_crop(image, (224, 224))
plt.subplot(1, 2, 2)
plt.imshow(image.asnumpy())
image = mx.image.color_normalize(image.astype(np.float32)/255,
mean=mx.nd.array([0.485, 0.456, 0.406]),
std=mx.nd.array([0.229, 0.224, 0.225]))
image = mx.nd.transpose(image.astype('float32'), (2,1,0))
image = mx.nd.expand_dims(image, axis=0)
out = mx.nd.SoftmaxActivation(net(image.as_in_context(contexts[0])))
print('Probabilities are: '+str(out[0].asnumpy()))
result = np.argmax(out.asnumpy())
outstring = ['Not hotdog!', 'Hotdog!']
print(outstring[result])
In [ ]: classify_hotdog(deep_dog_net, '../img/real_hotdog.jpg', contexts)
Reuse class weights

As mentioned earlier, in addition to the feature extractor, we can reuse the class weights for hot dog from the
pretrained model, since hot dog was already a class in the imagenet. To do that, we need to get the weight
from the classifier layers of the pretrained model, find the right slice, and put it into our two-class classifier.
In [ ]: # let's examine the output layer and find the last conv layer
print(net.output)
In [ ]: # the last conv layer is the second layer
pretrained_conv_params = net.output[0].params
# weights can then be found from the above parameter dict

pretrained_weight_param = pretrained_conv_params.get('weight')
pretrained_bias_param = pretrained_conv_params.get('bias')
# next, we locate the right slice that we're interested in.

hotdog_w = mx.nd.split(pretrained_weight_param.data(ctx=contexts[0]),
1000, axis=0)[imagenet_hotdog_index]
hotdog_b = mx.nd.split(pretrained_bias_param.data(ctx=contexts[0]),
1000, axis=0)[imagenet_hotdog_index]
# our classifier is for two classes. here, we reuse the hotdog class weight,
# and randomly initialize the 'not hotdog' class.
new_classifier_w = mx.nd.concat(mx.nd.random_normal(shape=hotdog_w.shape, scale=0.0
hotdog_w,
dim=0)
new_classifier_b = mx.nd.concat(mx.nd.random_normal(shape=hotdog_b.shape, scale=0.0
hotdog_b,
dim=0)
# finally, we initialize the parameter buffers and set the values.

# since classifier is a HybridSequential/Sequential, the following
# takes the zero-indexed 1-st layer of the classifier
final_conv_layer_params = deep_dog_net.output[0].params
final_conv_layer_params.get('weight').set_data(new_classifier_w)
final_conv_layer_params.get('bias').set_data(new_classifier_b)

4.2.5 Evaluation
Our task is a binary classification problem with imbalanced classes. So we’ll monitor performance both
using accuracy and F1 score, a metric favored in settings with extreme class imbalance. [Note to authors:
ensure that F1 score is explained earlier or explain it here in full]
In [ ]: # return metrics string representation
def metric_str(names, accs):
return ', '.join(['%s=%f'%(name, acc) for name, acc in zip(names, accs)])
metric = mx.metric.create(['acc', 'f1'])
The following snippet performs inferences on evaluation dataset, and updates the metrics. Once the evalua-
tion data iterator is exhausted, it returns the values of each of the metrics.
In [ ]: import mxnet.gluon as gluon
from mxnet.image import color_normalize
def evaluate(net, data_iter, ctx):

data_iter.reset()
for batch in data_iter:
data = color_normalize(batch.data[0]/255,
mean=mx.nd.array([0.485, 0.456, 0.406]).reshape((1,3
std=mx.nd.array([0.229, 0.224, 0.225]).reshape((1,3,
data = gluon.utils.split_and_load(data, ctx_list=ctx, batch_axis=0)
label = gluon.utils.split_and_load(batch.label[0], ctx_list=ctx, batch_axis
outputs = []
for x in data:
outputs.append(net(x))
metric.update(label, outputs)
out = metric.get()
metric.reset()
return out
4.2.6 Training
We now can train the model just as we would any supervised model. In this example, we set up the training
loop for multi-GPU use as described from first principles here and in the context of gluon here.
In [ ]: import mxnet.autograd as autograd
def train(net, train_iter, val_iter, epochs, ctx):

if isinstance(ctx, mx.Context):
ctx = [ctx]
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': learning
best_f1 = 0
val_names, val_accs = evaluate(net, val_iter, ctx)
logging.info('[Initial] validation: %s'%(metric_str(val_names, val_accs)))
for epoch in range(epochs):
tic = time.time()
train_iter.reset()
btic = time.time()
for i, batch in enumerate(train_iter):
# the model zoo models expect normalized images

data = color_normalize(batch.data[0]/255,
mean=mx.nd.array([0.485, 0.456, 0.406]).reshape(
std=mx.nd.array([0.229, 0.224, 0.225]).reshape((
data = gluon.utils.split_and_load(data, ctx_list=ctx, batch_axis=0)
label = gluon.utils.split_and_load(batch.label[0], ctx_list=ctx, batch_
outputs = []
Ls = []
for x, y in zip(data, label):
z = net(x)
# rescale the loss based on class to counter the imbalance prob
L = loss(z, y) * (1+y*positive_class_weight)/positive_class_wei
# store the loss and do backward after we have done forward
# on all GPUs for better speed on multiple GPUs.
Ls.append(L)
outputs.append(z)
for L in Ls:
L.backward()
trainer.step(batch.data[0].shape[0])
metric.update(label, outputs)
if log_interval and not (i+1)%log_interval:
names, accs = metric.get()
logging.info('[Epoch %d Batch %d] speed: %f samples/s, training: %s
epoch, i, batch_size/(time.time()-btic), metric_str(
btic = time.time()
names, accs = metric.get()

metric.reset()
logging.info('[Epoch %d] training: %s'%(epoch, metric_str(names, accs)))
logging.info('[Epoch %d] time cost: %f'%(epoch, time.time()-tic))
val_names, val_accs = evaluate(net, val_iter, ctx)
logging.info('[Epoch %d] validation: %s'%(epoch, metric_str(val_names, val_
if val_accs[1] > best_f1:

best_f1 = val_accs[1]
logging.info('Best validation f1 found. Checkpointing...')
net.save_parameters('deep-dog-%d.params'%(epoch))
if mode == 'hybrid':
deep_dog_net.hybridize()
if epochs > 0:
deep_dog_net.collect_params().reset_ctx(contexts)
train(deep_dog_net, train_iter, val_iter, epochs, contexts)
4.2.7 Try it out!

Once our model is trained, we can either use the deep_dog_net model in the notebook kernel, or load it
from the best checkpoint.
In [ ]: # Uncomment below line and replace the file name with the last checkpoint.
# deep_dog_net.load_parameters('deep-dog-3.params', contexts)
#
# Alternatively, you can uncomment the following lines to get the model that we fin
# with validation F1 score of 0.74.

download('https://apache-mxnet.s3-accelerate.amazonaws.com/gluon/models/deep-dog-5a
overwrite=True)
deep_dog_net.load_parameters('deep-dog-5a342a6f.params', contexts)
In [ ]: classify_hotdog(deep_dog_net, '../img/real_hotdog.jpg', contexts)
In [ ]: classify_hotdog(deep_dog_net, '../img/leg_hotdog.jpg', contexts)
In [ ]: classify_hotdog(deep_dog_net, '../img/dog_hotdog.jpg', contexts)
4.2.8 Conclusions
As you can see, given a pretrained model, we can get a great classifier, even for tasks where we simply
don’t have enough data to train from scratch. That’s because the representations necessary to perform both
tasks have a lot in common. Since they both address natural images, they both require recognizing textures,
shapes, edges, etc. Whenever you have a small enough dataset that you fear impoverishing your model,
try thinking about what larger datasets you might be able to pre-train your model on, so that you can just
perform fine-tuning on the task at hand.
4.2.9 Next
This section is still changing too fast to say for sure what will come next. Stay tuned!
4.3 Visual Question Answering in gluon

This is a notebook for implementing visual question answering in gluon.
import numpy as np
import mxnet as mx
import mxnet.ndarray as F
import mxnet.contrib.ndarray as C
import mxnet.gluon as gluon
import bisect
from IPython.core.display import display, HTML
import logging
import os
import json
from IPython.display import HTML, display
4.3.1 The VQA dataset

In the VQA dataset, for each sample, there is one image and one question. The label is the answer for the
question regarding the image. You can download the VQA1.0 dataset from VQA website.
You need to preprocess the data:
1. Extract the samples from original json files.

2. Filter the samples giving top k answers(k can be 1000, 2000. . . ). This will make the prediction easier.
4.3.2 Pretrained Models

Usually people use pretrained models to extract features from the image and question.
Image pretrained model:
VGG: A key aspect of VGG was to use many convolutional blocks with relatively narrow kernels, followed
by a max-pooling step and to repeat this block multiple times.
Resnet: It is a residual learning framework to ease the training of networks that are substantially deep. It
reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning
unreferenced functions.
Question pretrained model:
Word2Vec: The word2vec tool takes a text corpus as input and produces the word vectors as output. It
first constructs a vocabulary from the training text data and then learns vector representation of words. The
model contains 300-dimensional vectors for 3 million words and phrases.
Glove: Similar to Word2Vec, it is a word embedding dataset. It contains 100/200/300-dimensional vectors
for 2 million words.
skipthought: This is an encoder-decoder model that tries to reconstruct the surrounding sentences of an
encoded passage. Sentences that share semantic and syntactic properties are thus mapped to similar vector
representations. Different from the previous two model, this is a sentence based model.
GNMT encoder: We propose using the encoder of google neural machine translation system to extract the
question features.
We will discuss about how to extract the features here in details.

We define out model with gluon. gluon.Block is the basic building block of models. If any operator is not
defined under gluon, you can use mxnet.ndarray operators to subsititude.
4.3. Visual Question Answering in gluon 301

In [2]: # Some parameters we are going to use

batch_size = 64
ctx = mx.cpu()
compute_size = batch_size
out_dim = 10000
gpus = 1
In the first model, we will concatenate the image and question features and use multilayer perception(MLP)
to predict the answer.
In [3]: class Net1(gluon.Block):
super(Net1, self).__init__(**kwargs)
# layers created in name_scope will inherit name space
# from parent layer.
self.bn = nn.BatchNorm()
self.dropout = nn.Dropout(0.3)
self.fc1 = nn.Dense(8192,activation="relu")

x1 = F.L2Normalization(x[0])
z = F.concat(x1,x2,dim=1)
z = self.fc1(z)
z = self.bn(z)
z = self.dropout(z)
z = self.fc2(z)
return z
In the second model, instead of linearly combine the image and text features, we use count sketch to estimate
the outer product of the image and question features. It is also named as multimodel compact bilinear
pooling(MCB).
This method was proposed in Multimodal Compact Bilinear Pooling for VQA. The key idea is:
𝜓(𝑥 ⊗ 𝑦, ℎ, 𝑠) = 𝜓(𝑥, ℎ, 𝑠) ⋆ 𝜓(𝑦, ℎ, 𝑠)
where 𝜓 is the count sketch operator, 𝑥, 𝑦 are the inputs, ℎ, 𝑠 are the hash tables, ⊗ defines outer product
and ⋆ is the convolution operator. This can further be simplified by using FFT properties: convolution in
time domain equals to elementwise product in frequency domain.
One improvement we made is adding ones vectors to each features before count sketch. The intuition is:
given input vectors 𝑥, 𝑦, estimating outer product between [𝑥, 1𝑠] and [𝑦, 1𝑠] gives us information more than
just 𝑥 ⊗ 𝑦. It also contains information of 𝑥 and 𝑦.
In [4]: class Net2(gluon.Block):
super(Net2, self).__init__(**kwargs)
# layers created in name_scope will inherit name space
# from parent layer.
self.bn = nn.BatchNorm()

self.dropout = nn.Dropout(0.3)
self.fc1 = nn.Dense(8192,activation="relu")

text_ones = F.ones((batch_size/gpus, 2048),ctx = ctx)
img_ones = F.ones((batch_size/gpus, 1024),ctx = ctx)
text_data = F.Concat(x1, text_ones,dim = 1)
image_data = F.Concat(x2,img_ones,dim = 1)
# Initialize hash tables
S1 = F.array(np.random.randint(0, 2, (1,3072))*2-1,ctx = ctx)
H1 = F.array(np.random.randint(0, out_dim,(1,3072)),ctx = ctx)
S2 = F.array(np.random.randint(0, 2, (1,3072))*2-1,ctx = ctx)
H2 = F.array(np.random.randint(0, out_dim,(1,3072)),ctx = ctx)
# Count sketch
cs1 = C.count_sketch( data = image_data, s=S1, h = H1 ,name='cs1',out_dim =
cs2 = C.count_sketch( data = text_data, s=S2, h = H2 ,name='cs2',out_dim =
fft1 = C.fft(data = cs1, name='fft1', compute_size = compute_size)
fft2 = C.fft(data = cs2, name='fft2', compute_size = compute_size)
c = fft1 * fft2
ifft1 = C.ifft(data = c, name='ifft1', compute_size = compute_size)
# MLP
z = self.fc1(ifft1)
z = self.bn(z)
z = self.dropout(z)
z = self.fc2(z)
return z
We will introduce attention model in this notebook.
4.3.4 Data Iterator

The inputs of the data iterator are extracted image and question features. At each step, the data iterator will
return a data batch list: question data batch and image data batch.
We need to seperate the data batches by the length of the input data because the input questions are in
different lengths. The 𝑏𝑢𝑐𝑘𝑒𝑡𝑠 parameter defines the max length you want to keep in the data iterator. Here
since we already used pretrained model to extract the question feature, the question length is fixed as the
output of the pretrained model.
The 𝑙𝑎𝑦𝑜𝑢𝑡 parameter defines the layout of the data iterator output. “N” specify where is the data batch
dimension is.
𝑟𝑒𝑠𝑒𝑡() function is called after every epoch. 𝑛𝑒𝑥𝑡() function is call after each batch.
In [5]: class VQAtrainIter(mx.io.DataIter):
def __init__(self, img, sentences, answer, batch_size, buckets=None, invalid_la
text_name='text', img_name = 'image', label_name='softmax_label',
super(VQAtrainIter, self).__init__()
if not buckets:
buckets = [i for i, j in enumerate(np.bincount([len(s) for s in sentenc
if j >= batch_size]

buckets.sort()
ndiscard = 0
self.data = [[] for _ in buckets]
for i in range(len(sentences)):
buck = bisect.bisect_left(buckets, len(sentences[i]))
if buck == len(buckets):
ndiscard += 1
continue
buff = np.full((buckets[buck],), invalid_label, dtype=dtype)
buff[:len(sentences[i])] = sentences[i]
self.data[buck].append(buff)
self.data = [np.asarray(i, dtype=dtype) for i in self.data]

self.answer = answer
self.img = img
print("WARNING: discarded %d sentences longer than the largest bucket."%ndi
self.batch_size = batch_size
self.buckets = buckets
self.text_name = text_name
self.img_name = img_name
self.label_name = label_name
self.dtype = dtype
self.invalid_label = invalid_label
self.nd_text = []
self.nd_img = []
self.ndlabel = []
self.major_axis = layout.find('N')
self.default_bucket_key = max(buckets)
if self.major_axis == 0:
self.provide_data = [(text_name, (batch_size, self.default_bucket_key))
(img_name, (batch_size, self.default_bucket_key))]
self.provide_label = [(label_name, (batch_size, self.default_bucket_key
elif self.major_axis == 1:
self.provide_data = [(text_name, (self.default_bucket_key, batch_size))
(img_name, (self.default_bucket_key, batch_size))]
self.provide_label = [(label_name, (self.default_bucket_key, batch_size
else:
raise ValueError("Invalid layout %s: Must by NT (batch major) or TN (ti
self.idx = []
for i, buck in enumerate(self.data):
self.idx.extend([(i, j) for j in range(0, len(buck) - batch_size + 1, b
self.curr_idx = 0
self.reset()
def reset(self):
self.curr_idx = 0
self.nd_text = []
self.nd_img = []
self.ndlabel = []

for buck in self.data:

label = np.empty_like(buck.shape[0])
label = self.answer
self.nd_text.append(mx.ndarray.array(buck, dtype=self.dtype))
self.nd_img.append(mx.ndarray.array(self.img, dtype=self.dtype))
self.ndlabel.append(mx.ndarray.array(label, dtype=self.dtype))
def next(self):
if self.curr_idx == len(self.idx):
raise StopIteration
i, j = self.idx[self.curr_idx]
self.curr_idx += 1
if self.major_axis == 1:
img = self.nd_img[i][j:j + self.batch_size].T
text = self.nd_text[i][j:j + self.batch_size].T
label = self.ndlabel[i][j:j+self.batch_size]
else:
img = self.nd_img[i][j:j + self.batch_size]
text = self.nd_text[i][j:j + self.batch_size]
label = self.ndlabel[i][j:j+self.batch_size]
data = [text, img]

return mx.io.DataBatch(data, [label],
bucket_key=self.buckets[i],
provide_data=[(self.text_name, text.shape),(self.img_name,
provide_label=[(self.label_name, label.shape)])
4.3.5 Load the data

Here we will use subset of VQA dataset in this tutorial. We extract the image feature from ResNet-152,
text feature from GNMT encoder. In first two model, we have 21537 training samples and 1044 validation
samples in this tutorial. Image feature is a 2048-dim vector. Question feature is a 1048-dim vector.
In [6]: # Download the dataset
dataset_files = {'train': ('train_question.npz','train_img.npz','train_ans.npz'),
'validation': ('val_question.npz','val_img.npz','val_ans.npz'),
'test':('test_question_id.npz','test_question.npz','test_img_id.np
train_q, train_i, train_a = dataset_files['train']

val_q, val_i, val_a = dataset_files['validation']
url_format = 'https://apache-mxnet.s3-accelerate.amazonaws.com/gluon/dataset/VQA-no
if not os.path.exists(train_q):
download(url_format.format(train_q),overwrite=True)
download(url_format.format(train_i),overwrite=True)
download(url_format.format(train_a),overwrite=True)
if not os.path.exists(val_q):
logging.info('Downloading validation dataset.')
download(url_format.format(val_q),overwrite=True)
download(url_format.format(val_i),overwrite=True)
download(url_format.format(val_a),overwrite=True)

INFO:root:Downloading training dataset.

INFO:root:downloaded https://apache-mxnet.s3-accelerate.amazonaws.com/gluon/dataset/VQA-not
INFO:root:Downloading validation dataset.
In [7]: layout = 'NT'
bucket = [1024]
train_question = np.load(train_q)['x']
val_question = np.load(val_q)['x']
train_ans = np.load(train_a)['x']
val_ans = np.load(val_a)['x']
train_img = np.load(train_i)['x']
val_img = np.load(val_i)['x']
print("Total training sample:",train_ans.shape[0])

print("Total validation sample:",val_ans.shape[0])
data_train = VQAtrainIter(train_img, train_question, train_ans, batch_size, bucket

data_eva = VQAtrainIter(val_img, val_question, val_ans, batch_size, buckets = bucke
Total training sample: 21537

Total validation sample: 1044
WARNING: discarded 0 sentences longer than the largest bucket.
4.3.6 Initialize the Parameters

In [8]: net = Net1()
#net = Net2()
net.collect_params().initialize(mx.init.Xavier(), ctx=ctx)
4.3.7 Loss and Evaluation Metrics

In [9]: loss = gluon.loss.SoftmaxCrossEntropyLoss()
metric = mx.metric.Accuracy()
def evaluate_accuracy(data_iterator, net):

numerator = 0.
denominator = 0.
data_iterator.reset()
for i, batch in enumerate(data_iterator):
data1 = batch.data[0].as_in_context(ctx)
data = [data1,data2]
label = batch.label[0].as_in_context(ctx)
output = net(data)

metric.update([label], [output])
return metric.get()[1]
4.3.8 Optimizer
4.3.9 Training loop

moving_loss = 0.
best_eva = 0
data_train.reset()
for i, batch in enumerate(data_train):
label = batch.label[0].as_in_context(ctx)
output = net(data)
trainer.step(data[0].shape[0])
##########################
##########################
if i == 0:
moving_loss = np.mean(cross_entropy.asnumpy()[0])
else:
moving_loss = .99 * moving_loss + .01 * np.mean(cross_entropy.asnumpy(
#if i % 200 == 0:
# print("Epoch %s, batch %s. Moving avg of loss: %s" % (e, i, moving_lo
eva_accuracy = evaluate_accuracy(data_eva, net)
train_accuracy = evaluate_accuracy(data_train, net)
print("Epoch %s. Loss: %s, Train_acc %s, Eval_acc %s" % (e, moving_loss, train
if eva_accuracy > best_eva:
best_eva = eva_accuracy
logging.info('Best validation acc found. Checkpointing...')
net.save_parameters('vqa-mlp-%d.params'%(e))
INFO:root:Best validation acc found. Checkpointing...

Epoch 0. Loss: 3.07848375872, Train_acc 0.439319957386, Eval_acc 0.3525390625


4.3.10 Try it out!

Currently we have test data for the first two models we mentioned above. After the training loop over Net1
or Net2, we can try it on test data. Here we have 10 test samples.
In [12]: test = True
if test:
test_q_id, test_q, test_i_id, test_i, atoi,text = dataset_files['test']
if test and not os.path.exists(test_q):

logging.info('Downloading test dataset.')
download(url_format.format(test_q_id),overwrite=True)
download(url_format.format(test_q),overwrite=True)
download(url_format.format(test_i_id),overwrite=True)
download(url_format.format(test_i),overwrite=True)
download(url_format.format(atoi),overwrite=True)
download(url_format.format(text),overwrite=True)
if test:
test_question = np.load("test_question.npz")['x']
test_img = np.load("test_img.npz")['x']
test_question_id = np.load("test_question_id.npz")['x']
test_img_id = np.load("test_img_id.npz")['x']
#atoi = np.load("atoi.json")['x']
INFO:root:Downloading test dataset.

We pass the test data iterator to the trained model.

In [13]: data_test = VQAtrainIter(test_img, test_question, np.zeros((test_img.shape[0],1)),
for i, batch in enumerate(data_test):
#label = batch.label[0].as_in_context(ctx)
#label_one_hot = nd.one_hot(label, 10)
output = net(data)
output = np.argmax(output.asnumpy(), axis = 1)
In [17]: idx = np.random.randint(10)
print(idx)
question = json.load(open(text))
print("Question:", question[idx])
6
Question: Is there a boat in the water?
In [18]: image_name = 'COCO_test2015_' + str(int(test_img_id[idx])).zfill(12)+'.jpg'
if not os.path.exists(image_name):
download(url_format.format('test_images/'+image_name),overwrite=True)
from IPython.display import Image

Image(filename=image_name)
INFO:root:Downloading training dataset.


In [19]: dataset = json.load(open('atoi.json'))

ans = dataset['ix_to_ans'][str(output[idx]+1)]
print("Answer:", ans)
Answer: yes
4.4 Tree LSTM modeling for semantic relatedness

Just five years ago, many of the most successful models for doing supervised learning with text ignored
word order altogether. Some of the most successful models represented documents or sentences with the
order-invariant bag-of-words representation. Anyone thinking hard should probably have realized that these
models couldn’t dominate forever. That’s because we all know that word order actually does matter. Bag-
of-words models, which ignored word order, left some information on the table.
The recurrent neural networks that we introduced in chapter 5 model word order, by passing over the se-
quence of words in order, updating the models representation of the sentence after each word. And, with
LSTM recurrent cells and training on GPUs, even the straightforward LSTM far outpaces classical ap-
proaches, on a number of tasks, including language modeling, named entity recognition and more.
But while those models are impressive, they still may be leaving some knowledge on the table. To begin
with, we know a priori that sentence have a grammatical structure. And we already have some tools that
are very good at recovering parse trees that reflect grammatical structure of the sentences. While it may be
possible for an LSTM to learn this information implicitly, it’s often a good idea to build known information
into the structure of a neural network. Take for example convolutional neural networks. They build in the
prior knowledge that low level feature should be translation-invariant. It’s possible to come up with a fully
connected net that does the same thing, but it would require many more nodes and would be much more
susceptible to overfitting. In this case, we would like to build the grammatical tree structure of the sentences
into the architecture of an LSTM recurrent neural network. This tutorial walks through tree LSTMs, an
approach that does precisely that. The models here are based on the tree-structured LSTM by Kai Sheng
Tai, Richard Socher, and Chris Manning. Our implementation borrows from this Pytorch example.
4.4.1 Sentences involving Compositional Knowledge

This tutorial walks through training a child-sum Tree LSTM model for analyzing semantic relatedness of
sentence pairs given their dependency parse trees.
4.4.2 Preliminaries
Before getting going, you’ll probably want to note a couple preliminary details:
• Use of GPUs is preferred if one wants to run the complete training to match the state-of-the-art results.
• To show a progress meter, one should install the tqdm (“progress” in Arabic) through pip
install tqdm. One should also install the HTTP library through pip install requests.
from mxnet.gluon import Block, nn
from mxnet.gluon.parameter import Parameter
In [2]: class Tree(object):
def __init__(self, idx):
4.4. Tree LSTM modeling for semantic relatedness 311

self.children = []
self.idx = idx
def __repr__(self):
if self.children:
return '{0}: {1}'.format(self.idx, str(self.children))
else:
return str(self.idx)
In [3]: tree = Tree(0)
tree.children.append(Tree(1))
tree.children[1].children.append(Tree(4))
print(tree)
0: [1, 2: [4], 3]
4.4.3 Model
The model is based on child-sum tree LSTM. For each sentence, the tree LSTM model extracts information
following the dependency parse tree structure, and produces the sentence embedding at the root of each tree.
This embedding can be used to predict semantic similarity.
Child-sum Tree LSTM

In [4]: class ChildSumLSTMCell(Block):
def __init__(self, hidden_size,
i2h_weight_initializer=None,
hs2h_weight_initializer=None,
hc2h_weight_initializer=None,
i2h_bias_initializer='zeros',
hs2h_bias_initializer='zeros',
hc2h_bias_initializer='zeros',
input_size=0, prefix=None, params=None):
super(ChildSumLSTMCell, self).__init__(prefix=prefix, params=params)
self._hidden_size = hidden_size
self._input_size = input_size
self.i2h_weight = self.params.get('i2h_weight', shape=(4*hidden_size, i
init=i2h_weight_initializer)
self.hs2h_weight = self.params.get('hs2h_weight', shape=(3*hidden_size,
init=hs2h_weight_initializer)
self.hc2h_weight = self.params.get('hc2h_weight', shape=(hidden_size, h
init=hc2h_weight_initializer)
self.i2h_bias = self.params.get('i2h_bias', shape=(4*hidden_size,),
init=i2h_bias_initializer)
self.hs2h_bias = self.params.get('hs2h_bias', shape=(3*hidden_size,),
init=hs2h_bias_initializer)
self.hc2h_bias = self.params.get('hc2h_bias', shape=(hidden_size,),
init=hc2h_bias_initializer)
def forward(self, F, inputs, tree):

children_outputs = [self.forward(F, inputs, child)

for child in tree.children]

if children_outputs:
_, children_states = zip(*children_outputs) # unzip
else:
children_states = None
with inputs.context as ctx:

return self.node_forward(F, F.expand_dims(inputs[tree.idx], axis=0), ch
self.i2h_weight.data(ctx),
self.hs2h_weight.data(ctx),
self.hc2h_weight.data(ctx),
self.i2h_bias.data(ctx),
self.hs2h_bias.data(ctx),
self.hc2h_bias.data(ctx))
def node_forward(self, F, inputs, children_states,

i2h_weight, hs2h_weight, hc2h_weight,
i2h_bias, hs2h_bias, hc2h_bias):
# comment notation:
# N for batch size
# C for hidden state dimensions
# K for number of children.
# FC for i, f, u, o gates (N, 4*C), from input to hidden

i2h = F.FullyConnected(data=inputs, weight=i2h_weight, bias=i2h_bias,
num_hidden=self._hidden_size*4)
i2h_slices = F.split(i2h, num_outputs=4) # (N, C)*4
i2h_iuo = F.concat(*[i2h_slices[i] for i in [0, 2, 3]], dim=1) # (N, C*3)
if children_states:
# sum of children states, (N, C)
hs = F.add_n(*[state[0] for state in children_states])
# concatenation of children hidden states, (N, K, C)
hc = F.concat(*[F.expand_dims(state[0], axis=1) for state in children_s
# concatenation of children cell states, (N, K, C)
cs = F.concat(*[F.expand_dims(state[1], axis=1) for state in children_s
# calculate activation for forget gate. addition in f_act is done with

i2h_f_slice = i2h_slices[1]
f_act = i2h_f_slice + hc2h_bias + F.dot(hc, hc2h_weight) # (N, K, C)
forget_gates = F.Activation(f_act, act_type='sigmoid') # (N, K, C)
else:
# for leaf nodes, summation of children hidden states are zeros.
hs = F.zeros_like(i2h_slices[0])
# FC for i, u, o gates, from summation of children states to hidden state

hs2h_iuo = F.FullyConnected(data=hs, weight=hs2h_weight, bias=hs2h_bias,
num_hidden=self._hidden_size*3)
i2h_iuo = i2h_iuo + hs2h_iuo
iuo_act_slices = F.SliceChannel(i2h_iuo, num_outputs=3) # (N, C)*3

i_act, u_act, o_act = iuo_act_slices[0], iuo_act_slices[1], iuo_act_slices[
# calculate gate outputs

in_gate = F.Activation(i_act, act_type='sigmoid')

in_transform = F.Activation(u_act, act_type='tanh')
out_gate = F.Activation(o_act, act_type='sigmoid')
# calculate cell state and hidden state

next_c = in_gate * in_transform
if children_states:
next_c = F.sum(forget_gates * cs, axis=1) + next_c
next_h = out_gate * F.Activation(next_c, act_type='tanh')
return next_h, [next_h, next_c]
Similarity regression module

In [5]: # module for distance-angle similarity
class Similarity(nn.Block):
def __init__(self, sim_hidden_size, rnn_hidden_size, num_classes):
super(Similarity, self).__init__()
self.wh = nn.Dense(sim_hidden_size, in_units=2*rnn_hidden_size)
self.wp = nn.Dense(num_classes, in_units=sim_hidden_size)
def forward(self, F, lvec, rvec):

# lvec and rvec will be tree_lstm cell states at roots
mult_dist = F.broadcast_mul(lvec, rvec)
abs_dist = F.abs(F.add(lvec,-rvec))
vec_dist = F.concat(*[mult_dist, abs_dist],dim=1)
out = F.log_softmax(self.wp(F.sigmoid(self.wh(vec_dist))))
return out
Final model
In [6]: # putting the whole model together
class SimilarityTreeLSTM(nn.Block):
def __init__(self, sim_hidden_size, rnn_hidden_size, embed_in_size, embed_dim,
super(SimilarityTreeLSTM, self).__init__()
self.embed = nn.Embedding(embed_in_size, embed_dim)
self.childsumtreelstm = ChildSumLSTMCell(rnn_hidden_size, input_size=em
self.similarity = Similarity(sim_hidden_size, rnn_hidden_size, num_clas
def forward(self, F, l_inputs, r_inputs, l_tree, r_tree):

l_inputs = self.embed(l_inputs)
r_inputs = self.embed(r_inputs)
# get cell states at roots
lstate = self.childsumtreelstm(F, l_inputs, l_tree)[1][1]
rstate = self.childsumtreelstm(F, r_inputs, r_tree)[1][1]
output = self.similarity(F, lstate, rstate)
return output
4.4.4 Dataset classes

Vocab
In [7]: import os
import logging
import numpy as np
import random
from tqdm import tqdm
import mxnet as mx
# class for vocabulary and the word embeddings

class Vocab(object):
# constants for special tokens: padding, unknown, and beginning/end of sentence
PAD, UNK, BOS, EOS = 0, 1, 2, 3
PAD_WORD, UNK_WORD, BOS_WORD, EOS_WORD = '<blank>', '<unk>', '<s>', '</s>'
def __init__(self, filepaths=[], embedpath=None, include_unseen=False, lower=Fa
self.idx2tok = []
self.tok2idx = {}
self.lower = lower
self.include_unseen = include_unseen
self.add(Vocab.PAD_WORD)
self.add(Vocab.UNK_WORD)
self.add(Vocab.BOS_WORD)
self.add(Vocab.EOS_WORD)
self.embed = None
for filename in filepaths:

logging.info('loading %s'%filename)
with open(filename, 'r') as f:
self.load_file(f)
if embedpath is not None:
logging.info('loading %s'%embedpath)
with open(embedpath, 'r') as f:
self.load_embedding(f, reset=set([Vocab.PAD_WORD, Vocab.UNK_WORD, V
Vocab.EOS_WORD]))
@property
def size(self):
return len(self.idx2tok)
def get_index(self, key):

return self.tok2idx.get(key.lower() if self.lower else key,
Vocab.UNK)
def get_token(self, idx):

if idx < self.size:
return self.idx2tok[idx]
else:
return Vocab.UNK_WORD
def add(self, token):

token = token.lower() if self.lower else token

if token in self.tok2idx:
idx = self.tok2idx[token]
else:
idx = len(self.idx2tok)
self.idx2tok.append(token)
self.tok2idx[token] = idx
return idx
def to_indices(self, tokens, add_bos=False, add_eos=False):

vec = [BOS] if add_bos else []
vec += [self.get_index(token) for token in tokens]
if add_eos:
vec.append(EOS)
return vec
def to_tokens(self, indices, stop):

tokens = []
for i in indices:
tokens += [self.get_token(i)]
if i == stop:
break
return tokens
def load_file(self, f):

for line in f:
tokens = line.rstrip('\n').split()
for token in tokens:
self.add(token)
def load_embedding(self, f, reset=[]):

vectors = {}
for line in tqdm(f.readlines(), desc='Loading embeddings'):
tokens = line.rstrip('\n').split(' ')
word = tokens[0].lower() if self.lower else tokens[0]
if self.include_unseen:
self.add(word)
if word in self.tok2idx:
vectors[word] = [float(x) for x in tokens[1:]]
dim = len(vectors.values()[0])
def to_vector(tok):
if tok in vectors and tok not in reset:
return vectors[tok]
elif tok not in vectors:
return np.random.normal(-0.05, 0.05, size=dim)
else:
return [0.0]*dim
self.embed = mx.nd.array([vectors[tok] if tok in vectors and tok not in res
else [0.0]*dim for tok in self.idx2tok])
Data iterator
In [8]: # Iterator class for SICK dataset
class SICKDataIter(object):

def __init__(self, path, vocab, num_classes, shuffle=True):

super(SICKDataIter, self).__init__()
self.vocab = vocab
self.num_classes = num_classes
self.l_sentences = []
self.r_sentences = []
self.l_trees = []
self.r_trees = []
self.labels = []
self.size = 0
self.shuffle = shuffle
self.reset()
def reset(self):
if self.shuffle:
mask = list(range(self.size))
random.shuffle(mask)
self.l_sentences = [self.l_sentences[i] for i in mask]
self.r_sentences = [self.r_sentences[i] for i in mask]
self.l_trees = [self.l_trees[i] for i in mask]
self.r_trees = [self.r_trees[i] for i in mask]
self.labels = [self.labels[i] for i in mask]
self.index = 0
def next(self):
out = self[self.index]
self.index += 1
return out
def set_context(self, context):

self.l_sentences = [a.as_in_context(context) for a in self.l_sentences]
self.r_sentences = [a.as_in_context(context) for a in self.r_sentences]
def __len__(self):
return self.size
def __getitem__(self, index):

l_tree = self.l_trees[index]
r_tree = self.r_trees[index]
l_sent = self.l_sentences[index]
r_sent = self.r_sentences[index]
label = self.labels[index]
return (l_tree, l_sent, r_tree, r_sent, label)
4.4.5 Training with autograd

In [9]: import argparse, pickle, math, os, random
import logging
import numpy as np
import mxnet as mx

from mxnet import autograd as ag
# training settings and hyper-parameters

use_gpu = False
optimizer = 'AdaGrad'
seed = 123
batch_size = 25
training_batches_per_epoch = 10
weight_decay = 0.0001
epochs = 1
rnn_hidden_size, sim_hidden_size, num_classes = 150, 50, 5
# initialization
context = [mx.gpu(0) if use_gpu else mx.cpu()]
# seeding
mx.random.seed(seed)
np.random.seed(seed)
random.seed(seed)
# read dataset
import hashlib
while True:
if not data:
break
sha1.update(data)
if not matched:
logging.warn('Found hash mismatch in file {}, possibly due to incomplete do
.format(file_path))
return matched
data_file_name = 'tree_lstm_dataset-3d85a6c4.cPickle'
data_file_hash = '3d85a6c44a335a33edc060028f91395ab0dcf601'
if not os.path.exists(data_file_name) or not verified(data_file_name, data_file_has
download('https://apache-mxnet.s3-accelerate.amazonaws.com/gluon/dataset/%s'%da
overwrite=True)
with open('tree_lstm_dataset-3d85a6c4.cPickle', 'rb') as f:

train_iter, dev_iter, test_iter, vocab = pickle.load(f)
logging.info('==> SICK vocabulary size : %d ' % vocab.size)

logging.info('==> Size of train data : %d ' % len(train_iter))
logging.info('==> Size of dev data : %d ' % len(dev_iter))
logging.info('==> Size of test data : %d ' % len(test_iter))
# get network

net = SimilarityTreeLSTM(sim_hidden_size, rnn_hidden_size, vocab.size, vocab.embed.
# use pearson correlation and mean-square error for evaluation

metric = mx.metric.create(['pearsonr', 'mse'])
# the prediction from the network is log-probability vector of each score class
# so use the following function to convert scalar score to the vector
# e.g 4.5 -> [0, 0, 0, 0.5, 0.5]
def to_target(x):
target = np.zeros((1, num_classes))
ceil = int(math.ceil(x))
floor = int(math.floor(x))
if ceil==floor:
target[0][floor-1] = 1
else:
target[0][floor-1] = ceil - x
target[0][ceil-1] = x - floor
return mx.nd.array(target)
# and use the following to convert log-probability vector to score

def to_score(x):
levels = mx.nd.arange(1, 6, ctx=x.context)
return [mx.nd.sum(levels*mx.nd.exp(x), axis=1).reshape((-1,1))]
# when evaluating in validation mode, check and see if pearson-r is improved

# if so, checkpoint and run evaluation on test dataset
def test(ctx, data_iter, best, mode='validation', num_iter=-1):
data_iter.reset()
samples = len(data_iter)
data_iter.set_context(ctx[0])
preds = []
labels = [mx.nd.array(data_iter.labels, ctx=ctx[0]).reshape((-1,1))]
for _ in tqdm(range(samples), desc='Testing in {} mode'.format(mode)):
l_tree, l_sent, r_tree, r_sent, label = data_iter.next()
z = net(mx.nd, l_sent, r_sent, l_tree, r_tree)
preds.append(z)
preds = to_score(mx.nd.concat(*preds, dim=0))

metric.update(preds, labels)
names, values = metric.get()
metric.reset()
for name, acc in zip(names, values):
logging.info(mode+' acc: %s=%f'%(name, acc))
if name == 'pearsonr':
test_r = acc
if mode == 'validation' and num_iter >= 0:
if test_r >= best:
best = test_r
logging.info('New optimum found: {}.'.format(best))
return best
def train(epoch, ctx, train_data, dev_data):

# initialization with context

if isinstance(ctx, mx.Context):
ctx = [ctx]
net.collect_params().initialize(mx.init.Xavier(magnitude=2.24), ctx=ctx[0])
net.embed.weight.set_data(vocab.embed.as_in_context(ctx[0]))
train_data.set_context(ctx[0])
dev_data.set_context(ctx[0])
# set up trainer for optimizing the network.

trainer = gluon.Trainer(net.collect_params(), optimizer, {'learning_rate': lear
best_r = -1
Loss = gluon.loss.KLDivLoss()
for i in range(epoch):
train_data.reset()
num_samples = min(len(train_data), training_batches_per_epoch*batch_size)
# collect predictions and labels for evaluation metrics
preds = []
labels = [mx.nd.array(train_data.labels[:num_samples], ctx=ctx[0]).reshape(
for j in tqdm(range(num_samples), desc='Training epoch {}'.format(i)):
# get next batch
l_tree, l_sent, r_tree, r_sent, label = train_data.next()
# use autograd to record the forward calculation
with ag.record():
# forward calculation. the output is log probability
z = net(mx.nd, l_sent, r_sent, l_tree, r_tree)
# calculate loss
loss = Loss(z, to_target(label).as_in_context(ctx[0]))
# backward calculation for gradients.
loss.backward()
preds.append(z)
# update weight after every batch_size samples
if (j+1) % batch_size == 0:
# translate log-probability to scores, and evaluate

preds = to_score(mx.nd.concat(*preds, dim=0))
metric.update(preds, labels)
names, values = metric.get()
metric.reset()
for name, acc in zip(names, values):
logging.info('training acc at epoch %d: %s=%f'%(i, name, acc))
best_r = test(ctx, dev_data, best_r, num_iter=i)
train(epochs, context, train_iter, dev_iter)

INFO:root:==> SICK vocabulary size : 2412
INFO:root:==> Size of train data : 4500
INFO:root:==> Size of dev data : 500
INFO:root:==> Size of test data : 4927
Training epoch 0: 100%|| 250/250 [00:11<00:00, 21.48it/s]
INFO:root:training acc at epoch 0: pearsonr=0.096197
INFO:root:training acc at epoch 0: mse=1.138699
Testing in validation mode: 100%|| 500/500 [00:09<00:00, 51.57it/s]
INFO:root:validation acc: pearsonr=0.490352

INFO:root:validation acc: mse=1.237509

INFO:root:New optimum found: 0.49035187610029013.
4.4.6 Conclusion
• Gluon offers great tools for modeling in an imperative way.
4.5 Introduction to recommender systems

[Early, early draft]
This chapter introduces recommender systems (commonly called RecSys), tools that recommmend items to
users. Many of the most popular uses of recommender systems involve to suggesting products to customers.
Amazon, for example, uses recommender systems to choose which retail products to display. Recommender
systems aren’t limited to physical products. For example, the algorithms that Pandora and Spotify use to
curate playlists are recommender systems. Personalized suggestions on news websites are recommender
systems. And as of this writing, several carousels on the home page for Amazon’s Prime Videos’s contain
personalized TV and Movie recommendations.
I (Zack) have honestly no idea why Amazon wants me to watch Bubble Guppies. It’s possible that Bubble
Guppies is a masterpiece, and the recommender systems knows that my life will change upon watching it.
It’s also possible that the recommender made a mistake. For example, it might have extrapolated incor-
rectly from my affinity for the anime Death Note, thinking that I would similarly love any animated series.
And, since I’ve never rated a nickelodean series (either postiively or negatively), the system may have no
knowledge to the contrary. It’s also possible that this series is a new addition to the catalogue, and thus they
need to recommend the item to many users in ordder to develop a sense of who likes Bubble Guppies. This
problem, of sorting out how to handle a new item, is called the cold-start problem.
A recommender system doesn’t have to use any sophisticated machine learning techniques. And it doesn’t
even have to be personalized. One reasonable baseline for most applications is to suggest the most popular
items to everyone. But we have to be careful. Depending on how we define popularity, we might create
a feedback loop. The most popular items get recommended which makes them even more popular, which
makes them even more frequently recommended, etc.
For services with diverse users, however, personalization can be essential. Diapers are among the most
popular items on Amazon, but we probably shouldn’t recommend diapers to adolescents. We also probably
should not recommend anything associated with Justin Bieber to a user who isn’t an adolescent. Moreover,
we might want to personalize, not only to the user, but to the context. For example, just after I bought a
Pixel phone, I was in the market for a phone case. But I have no interested in buying a phone case one year
later.
4.5. Introduction to recommender systems 321

4.5.1 Many ways to pose the problem

While it might seem obvious, that personalization is a good strategy, it’s not immediately obvious how best
to articualate recommendation as a machine learning problem.
Discuss: * Rating prediction * Passive feedback (view/notview) * Content-based recommendation
4.5.2 Amazon review dataset

• introduce dataset
In [5]: import mxnet
import urllib
import gzip
In [10]: with gzip.open(urllib.request.urlopen("http://snap.stanford.edu/data/amazon/produc
data = [eval(l) for l in f]
In [11]: data[0]
Out[11]: {'asin': '616719923X',
'helpful': [0, 0],
'overall': 4.0,
'reviewText': 'Just another flavor of Kit Kat but the taste is unique and a bit d
'reviewTime': '06 1, 2013',
'reviewerID': 'A1VEELTKS8NLZB',
'reviewerName': 'Amazon Customer',
'summary': 'Good Taste',
'unixReviewTime': 1370044800}
4.5.3 [Do some dataset exploration]

• Look at the average rating
• Look at the number of unique users and items
• Plot a histogram of the number of ratings/reviews corresponding to each user
• “” for items
In [17]: users = [d['reviewerID'] for d in data]
In [18]: items = [d['asin'] for d in data]
In [14]: ratings = [d['overall'] for d in data]
4.5.4 Models
• Just the average
• Offset plus user and item biases
• Latent factor model / matrix factorization
In [ ]:
In [ ]:

In [ ]:
In [ ]:
In [ ]:
4.6 Linear Dynamical Systems with MXNet

In this notebook we will look at how to implement filtering in general linear dynamical systems (aka Kalman
filtering) using MXNet.
First, a short mathematical description of the problem.
A general (Gaussian) linear dynamical system is specified by two equations.
• The first, called the transition equation,
ℎ𝑡 = 𝐴ℎ𝑡−1 + 𝜖𝑡 𝜖𝑡 ∼ 𝒩 (0, Σℎ )
describes how the hidden (also called “latent”) state ℎ𝑡 ∈ R𝐻 evolves with time. In a LDS this involves
applying a linear transformation 𝐴 ∈ R𝐻×𝐻 to the previous hidden state ℎ𝑡−1 , followed by adding zero-
mean Gaussian noise.
• The second, the observation equation or emission model,
𝑣𝑡 = 𝐵ℎ𝑡 + 𝜈𝑡 𝜈𝑡 ∼ 𝒩 (0, Σ𝑣 )
descibes how the latent state ℎ𝑡 relates to the observations (“visibles”) 𝑣𝑡 ∈ R𝐷 . In particular, 𝑣𝑡 is a linear
transformation of the hidden state, 𝐵ℎ𝑡 , to which Gaussian noise is added.
Finally, we need to specify the initial state, usually by placing a Gaussian prior on ℎ0 ,
ℎ0 ∼ 𝒩 (𝜇0 , Σ0 )
The LDS is thus fully specified by the system parameters 𝐴 ∈ R𝐻×𝐻 , 𝐵 ∈ R𝐷×𝐻 , Σℎ ∈ 𝒮+
𝐻 , Σ ∈ 𝒮𝐷,
𝑣 +
𝜇0 ∈ R𝐻 , Σ0 ∈ 𝒮+𝐻 . 𝒮 denotes the space of positive definite (PD) matrices.
+
Given such a LDS specification, and a sequence of observations 𝑣0 , 𝑣1 , . . . , 𝑣𝑇 , one is typically interested in
one of the following
1. (Log-)Likelihood computation, i.e. computing the probability of the data under the model,
𝑃 (𝑣0 , 𝑣1 , . . . , 𝑣𝑇 )
2. Filtering, i.e. computing the mean and covariance of 𝑃 (ℎ𝑡 |𝑣0 , 𝑣1 , . . . , 𝑣𝑡 )
3. Smoothing, i.e. computing the mean and covariance of 𝑃 (ℎ𝑡 |𝑣0 , 𝑣1 , . . . , 𝑣𝑇 )
4. Parameter learning: find the system parameters that best describe the data, e.g. by maximizing likeli-
hood
In this notebook we will focus on the filtering problem, and will also see how to compute the log-likelihood
as a byproduct. For details on other problems, See e.g. Barber, 2012, Chapter 24.
4.6. Linear Dynamical Systems with MXNet 323

4.7 Filtering
We want to find the “filtered” distributions 𝑝(ℎ𝑡 |𝑣0:𝑡 ) where 𝑣0:𝑡 denotes {𝑣0 , · · · , 𝑣𝑡 }. Due to the closure
properties of Gaussian distributions, each of these distributions is also Gaussian 𝑝(ℎ𝑡 |𝑣0:𝑡 ) = 𝒩 (ℎ𝑡 |𝑓𝑡 , 𝐹𝑇 ).
The filtering procedure proceeds sequentially, by expressing 𝑓𝑡 and 𝐹𝑡 in terms of 𝑓𝑡−1 and 𝐹𝑡−1 . We
initialize 𝑓0 and 𝐹0 to be 0.
4.7.1 Prerequisite
To derive the formulas for filtering, here is all you need [see Bishop 2008, Appendix B]
• Conditional Gaussian equations
𝑝(𝑥) = 𝒩 (𝜇, Λ−1 )

𝑝(𝑦|𝑥) = 𝒩 (𝑦|𝐴𝑥 + 𝑏, 𝐿−1 )
The marginal distribution of 𝑦 and the conditional distribution of 𝑥 given 𝑦 are
𝑝(𝑦) = 𝒩 (𝑦|𝐴𝜇 + 𝑏, 𝐿−1 + 𝐴Λ−1 𝐴𝑇 )

(1)
𝑝(𝑥|𝑦) = 𝒩 (𝑥|Σ 𝐴𝑇 𝐿(𝑦 − 𝑏) + Λ𝜇 , Σ),
[︀ ]︀
Σ = (Λ + 𝐴𝑇 𝐿𝐴)−1 (2)
• Matrix Inversion Lemma (aka Woodbury matrix identity)
(𝐴 + 𝐵𝐷−1 𝐶)−1 = 𝐴−1 − 𝐴−1 𝐵(𝐷 + 𝐶𝐴−1 𝐵)−1 𝐶𝐴−1 (3)
4.7.2 Derivation
Now we are ready to derive the filtering equations, by Bayes Theorem
𝑝(ℎ𝑡 |𝑣0:𝑡 ) = 𝑝(ℎ𝑡 |𝑣𝑡 , 𝑣0:𝑡−1 )

∝ 𝑝(𝑣𝑡 |ℎ𝑡 , 𝑣0:𝑡−1 )𝑝(ℎ𝑡 |𝑣0:𝑡−1 )
= 𝑝(𝑣𝑡 |ℎ𝑡 )𝑝(ℎ𝑡 |𝑣0:𝑡−1 ) by Markov property
The derivation boils down to caclulate the two terms on the right hand side (you can think that the first is
𝑝(𝑦|𝑥) and the second is 𝑝(𝑥) as in the conditional Gaussian equations) and use (2) above to get the desired
formula.
The first term is directly given by the observation equation, i.e., 𝑝(𝑣𝑡 |ℎ𝑡 ) = 𝒩 (𝐵ℎ𝑡 , Σ𝑣 ), and the second
term can be calculated as follows
∫︁
𝑝(ℎ𝑡 |𝑣0:𝑡−1 ) = 𝑝(ℎ𝑡 |ℎ𝑡−1 , 𝑣0:𝑡−1 )𝑝(ℎ𝑡−1 |𝑣0:𝑡−1 )dℎ𝑡−1
∫︁
= 𝑝(ℎ𝑡 |ℎ𝑡−1 )𝑝(ℎ𝑡−1 |𝑣0:𝑡−1 )dℎ𝑡−1 by Markov property
∫︁
= 𝒩 (ℎ𝑡 |𝐴ℎ𝑡−1 , Σℎ )𝒩 (ℎ𝑡−1 |𝑓𝑡−1 , 𝐹𝑡−1 )dℎ𝑡−1
= 𝒩 (𝐴𝑓𝑡−1 , 𝐴𝐹𝑡−1 𝐴𝑇 + Σℎ ) using the marginalization equation (1)

= 𝒩 (𝜇𝑓 , Σℎℎ )

First, we calculate the covariance matrix 𝐹𝑡 ,

)︀−1
𝐹𝑡 = Σ−1 𝑇 −1
= Σℎℎ − Σℎℎ 𝐵 𝑇 (Σ𝑣 + 𝐵Σℎℎ 𝐵 𝑇 )−1 𝐵Σℎℎ = (𝐼 − 𝐾𝐵)Σℎℎ ,
(︀
ℎℎ − 𝐵 Σ𝑣 𝐵
where we have used the matrix inversion lemman and define the Kalman gain matrix as
𝐾 = Σℎℎ 𝐵 𝑇 (Σ𝑣 + 𝐵Σℎℎ 𝐵 𝑇 )−1 .
Notice that for numerical stability, the covariance matrix is normally calculated using so-called “Joseph’s
symmetrized update,”
𝐹𝑡 = (𝐼 − 𝐾𝑡 𝐵)Σℎℎ (𝐼 − 𝐾𝑡 𝐵)𝑇 + 𝐾𝑡 Σ𝑣 𝐾𝑡𝑇 ,
which consists of summing two PD matrices.

Finally, after some algebraic manipulation, we have the mean
𝑓𝑡 = 𝜇ℎ + 𝐾(𝑣 − 𝐵𝜇ℎ ).
4.7.3 LDS Foward Pass

To summarize, the iterative algorithm proceeds as follows
𝜇ℎ = 𝐴𝑓𝑡−1 𝜇𝑣 = 𝐵𝜇ℎ
Σℎℎ = 𝐴𝐹𝑡−1 𝐴𝑇 + Σℎ Σ𝑣𝑣 = 𝐵Σℎℎ 𝐵 𝑇 + Σ𝑣
𝐾𝑡 = Σℎℎ 𝐵 𝑇 Σ−1
𝑣𝑣
𝑓𝑡 = 𝜇ℎ + 𝐾(𝑣 − 𝜇𝑣 ) 𝐹𝑡 = (𝐼 − 𝐾𝑡 𝐵)Σℎℎ (𝐼 − 𝐾𝑡 𝐵)𝑇 + 𝐾𝑡 Σ𝑣 𝐾𝑡𝑇

As we can see, each step in the recursive filtering procedure involves a few matrix-matrix and matrix-vector
multiplications, as well as some matrix and vector additions and subtractions. These are standard operators
available is most deep learning frameworks (including MXNet, where matrix multiplication is available
through mx.sym.dot()). However, the update also involves the term Σ−1 𝑣𝑣 , the inverse of a 𝐷-by-𝐷
symmetric, postive semidefinite matrix (due to it being a covariance matrix). If the output dimensionality 𝐷
is 1, this is simply the scalar 1/Σ𝑣𝑣 , but in the general case we need to compute the inverse (or, preferably,
solve the corresponding linear system directly).
Luckily, operators for doing exactly that have recently been added to MXNet. In particular, we have
• mx.nd.linalg_gemm2 (Matrix-matrix product, more flexible than dot)
• mx.nd.linalg_potrf (Cholesky decomposition 𝐴 = 𝐿𝐿𝑇 for symmetric, PD matrix 𝐴)
• mx.nd.linalg_trsm (solve system of equations involving triangular matrices)
• mx.nd.linalg_sumlogdiag (compute the sum of the log of the diagonal elements of a matrix)
• mx.nd.linalg_potri (compute 𝐴−1 from a previously computed Cholesky factor 𝐿 of 𝐴 (i.e.
𝐿𝐿𝑇 = 𝐴).
4.7. Filtering 325

4.7.4 Computing the likelihood

The terms 𝜇𝑣 and Σ𝑣𝑣 computed during the filtering update correspond to the mean and covariance of the
predictive distribution 𝑃 (𝑣𝑡 |𝑣0 , 𝑣1 , . . . , 𝑣𝑡−1 ), which allows us to compute the likelihood by decomposing it
into telescoping conditional distributions,
𝑃 (𝑣0 , 𝑣1 , . . . , 𝑣𝑡 ) = 𝑃 (𝑣0 )𝑃 (𝑣1 |𝑣0 )𝑃 (𝑣2 |𝑣0 , 𝑣1 ) · · · 𝑃 (𝑣𝑡 |𝑣0 , 𝑣1 , . . . , 𝑣𝑡−1 )
and then using 𝑃 (𝑣𝑡 |𝑣0 , 𝑣1 , . . . , 𝑣𝑡−1 ) = 𝒩 (𝑣𝑡 |𝜇𝑣 , Σ𝑣𝑣 ) with parameters obtained during filtering to com-
pute each term.
from mxnet.ndarray import linalg_gemm as gemm

from mxnet.ndarray import linalg_gemm2 as gemm2
from mxnet.ndarray import linalg_potrf as potrf
from mxnet.ndarray import linalg_trsm as trsm
from mxnet.ndarray import linalg_sumlogdiag as sumlogdiag
%matplotlib inline
plt.rcParams["figure.figsize"] = (10, 5)
4.8 Generating Synthetic Dataset

This example is adapted from Example 24.3 in (Barber, 2017). The two-dimensional latent vector ℎ𝑡 is
rotated at each iteration and then is projected to produce a scalar observation. More precisely, we have
(︂ )︂
cos 𝜃 − sin 𝜃
ℎ𝑡+1 = 𝐴ℎ𝑡 + 𝜖ℎ , 𝐴 = , 𝜖ℎ ∼ 𝒩 (0, 𝛼2 · I2 )
sin 𝜃 cos 𝜃
𝑣𝑡+1 = [1, 0] · ℎ𝑡+1 + 𝜖𝑣 , , 𝜖𝑣 ∼ 𝒩 (0, 𝜎 2 ).

In [3]: alpha = 0.5
sigma = 0.5
theta = np.pi / 6
T = 50
In [4]: A = nd.array([[np.cos(theta), -np.sin(theta)],
[np.sin(theta), np.cos(theta)]])
B = nd.array([[1, 0]])
S_h = nd.array(np.square(alpha) * np.eye(2))

S_v = nd.array(np.square(sigma) * np.eye(1))
v = []
# initial state h_0
h = np.array([1, 0])
for t in range(T):
# h_t = Bh_{t-1} + \epsilon_h
h = np.random.multivariate_normal(A.asnumpy().dot(h), S_h.asnumpy())

# v_t = Ah_t + \epsilon_v

vv = np.random.normal(B.asnumpy().dot(h), S_v.asnumpy())
v.append(vv)
v = nd.array(np.array(v).reshape((T,1)))
In [5]: plt.plot(v.asnumpy());
4.8.1 LDS Forward Function (Filtering)

In [6]: def LDS_forward(v, A, B, S_h, S_v):
H = A.shape[0] # dim of latent state

D = B.shape[0] # dim of observation
T = v.shape[0] # num of observations
f_0 = nd.zeros((H,1))
F_0 = nd.zeros((H,H))
eye_h = nd.array(np.eye(H))
F_t = None
f_t = None
F_seq = []
f_seq = []
log_p_seq = []
for t in range(T):
if t == 0:
# At the first time step, use the prior
mu_h = f_0
S_hh = F_0
4.8. Generating Synthetic Dataset 327

else:
# Otherwise compute using update eqns.
mu_h = gemm2(A, f_t)
S_hh = gemm2(A, gemm2(F_t, A, transpose_b=1)) + S_h
# direct transcription of the update equations above

mu_v = gemm2(B, mu_h)
S_hh_x_B_t = gemm2(S_hh, B, transpose_b=1)
S_vv = gemm2(B, S_hh_x_B_t) + S_v
S_vh = gemm2(B, S_hh)
# use potrf to compute the Cholesky decomposition S_vv = LL^T

S_vv_chol = potrf(S_vv)
# K = S_hh X with X = B^T S_vv^{-1}

# We have X = B^T S_vv^{-1} => X S_vv = B^T => X LL^T = B^T
# We can thus obtain X by solving two linear systems involving L
K = trsm(S_vv_chol, trsm(S_vv_chol, S_hh_x_B_t, rightside=1, transpose=1),
delta = v[t] - mu_v

f_t = mu_h + gemm2(K, delta)
ImKB = eye_h - gemm2(K, B)

F_t = (gemm2(ImKB, gemm2(S_hh, ImKB, transpose_b=True))
+ gemm2(K, gemm2(S_v, K, transpose_b=True), name="Ft"))
# save filtered covariance and mean

F_seq.append(F_t)
f_seq.append(f_t)
# compute the likelihood using mu_v and L (LL^T = S_vv)

Z = trsm(S_vv_chol, trsm(S_vv_chol, delta), transpose=1)
log_p = (-0.5 * (mx.nd.reshape(gemm2(delta, Z, transpose_a=True), shape=(0,
+ D*np.log(2.0 * np.pi)) - sumlogdiag(S_vv_chol))
log_p_seq.append(log_p)
return f_seq, F_seq, log_p_seq

In [7]: f_seq, F_seq, _ = LDS_forward(v, A, B, S_h, S_v)
4.8.2 Calculate the filtered mean and variance

Given 𝑝(ℎ𝑡 |𝑣0:𝑡 ) = 𝒩 (𝜇𝑡 , Σ𝑡 ), we can compute the distribution of the reconstructed observations
𝑝(𝑣̂︀𝑡 ) = 𝒩 (𝐵𝜇𝑡 , 𝐵Σ𝑡 𝐵 𝑇 + 𝜎 2 ).
In [8]: from functools import reduce

B_np = B.asnumpy()
h_states = reduce(lambda x, y: np.hstack((x,y)), [ff.asnumpy() for ff in f_seq])
v_filtered_mean = B.asnumpy().dot(h_states).reshape((T,))
In [9]: v_filtered_var = np.sqrt(
np.array([B_np.dot(ff.asnumpy()).dot(B_np.T) + np.square(sigma) for ff in F_seq

In [10]: plt.plot(v.asnumpy(), color="r")

plt.plot(v_filtered_mean, color="b")
x = np.arange(T)
plt.fill_between(x, v_filtered_mean-v_filtered_var,
v_filtered_mean+v_filtered_var,
facecolor="blue", alpha=0.2)
plt.legend(["data", "reconstruction"]);
In the next notebook, we will use Kalman filtering as a subroutine in more complex models. In particular,
we will show how to do time series forecasting with innovative state space models (ISSMs).
4.9 Exponential Smoothing and Innovation State Space Model (ISSM)

In this notebook we will illustrate the implementation of filtering in innovation state space model (ISSM,
for short) using MXNet. Let us first briefy reivew the basic concepts.
Time series forecasting is a central problem occuring in many applications from optimal inventory manage-
ment, staff scheduling to topology planning. Given a sequence of measurements 𝑧1 , . . . , 𝑧𝑇 observed over
time, the problem here is to predict future values of the time series 𝑧𝑇 +1 , . . . , 𝑧𝑇 +𝜏 , where 𝜏 is referred as
the time horizon.
Exponential smoothing (ETS, which stands for Error, Trend, and Seasonality) is a family of very successful
forecasting methods which are based on the key property that forecasts are weighted combinations of past
observations (Hyndman et. al, 2008).
For example, in simple exponential smoothing, the foreacast 𝑧ˆ𝑇 +1 for time step 𝑇 + 1 is written as (Hynd-
man, Athanasopoulos, 2012)
𝑧ˆ𝑇 +1 = 𝑧ˆ𝑇 + 𝛼(𝑧𝑇 − 𝑧ˆ𝑇 ) = 𝛼 · 𝑧𝑇 + (1 − 𝛼) · 𝑧ˆ𝑇 ,
In words, the next step forecast is a convex combination of the most recent obseravtion and forecast. Ex-
panding the above equation, it is clear that the forecast is given by the exponentially weighted average of
4.9. Exponential Smoothing and Innovation State Space Model (ISSM) 329
past observations,
𝑧ˆ𝑇 +1 = 𝛼𝑧𝑇 + 𝛼(1 − 𝛼)𝑧𝑇 −1 + 𝛼(1 − 𝛼)2 𝑧𝑇 −2 + · · · .
Here 𝛼 > 0 is a smoothing parameter that controls the weight given to each observation. Note that the
recent observations are given more weight than the older observations. In fact the weight given to the past
observation descreases exponentially as it gets older and hence the name exponential smoothing.
General exponential smoothing methods consider the extensions of simple ETS to include time series pat-
terns such as (linear) trend, various periodic seasonal effects. All ETS methods falls under the category of
forecasting methods as the predictions are point forecasts (a single value is predicted for each future time
step). On the other hand a statistical model describes the underlying data generation process and has an
advantage that it can produce an entire probability distribuiton for each of the future time steps. Innova-
tion state space model (ISSM) is an example of such models with considerable flexibility in respresnting
commonly occurring time series patterns and underlie the exponential smoothing methods.
The idea behind ISSMs is to maintain a latent state vector 𝑙𝑡 with recent information about level, trend, and
seasonality factors. The state vector 𝑙𝑡 evolves over time adding small innvoation (i.e., the Gaussian noise)
at each time step. The observations are then a linear combination of the components of the current state.
Mathematically, ISSM is specified by two equations
• The state transition equation is given by
𝑙𝑡 = 𝐹𝑡 𝑙𝑡−1 + 𝑔𝑡 𝜖𝑡 , 𝜖𝑡 ∼ 𝒩 (0, 1).
Note that the innovation strength is controlled by 𝑔𝑡 , i.e., 𝑔𝑡 𝜖𝑡 ∼ 𝒩 (0, 𝑔𝑡2 ).
• The observation equation is given by
𝑧𝑡 = 𝑎 ⊤
𝑡 𝑙𝑡−1 + 𝑏𝑡 + 𝜈𝑡 , 𝜈𝑡 ∼ 𝒩 (0, 𝜎𝑡2 )
Note that here we allow for an additional term 𝑏𝑡 which can model any determinstic component (exogenous
variables).
This describes a fairy generic model allowing the user to encode specific time series patterns using the
coefficients 𝐹 , 𝑎𝑡 and thus are problem dependent. The innovation vector 𝑔𝑡 comes in terms of parameters
to be learned (the innovation strengths). Moreover, the initial state 𝑙0 has to be specified. We do so by
specifying a Gaussian prior distribution 𝑃 (𝑙0 ), whose parameters (means, standard deviation) are learned
from data as well.
The parameters of the ISSM are typically learned using the maximum likelihood principle. This requires
the computation of the log-likelihood of the given observations i.e., computing the probability of the data
under the model, 𝑃 (𝑧1 , . . . , 𝑧𝑇 ). Fortunately, in the previous notebook, we have learned how to compute
the log-likelihood as a byproduct of LDS filtering problem.
4.10 Filtering
We remark that ISSM is a special case of linear dynamical system except that the coefficients are allowed
to change over time. The filtering equations for ISSM can readily be obtained from the general derivation
described in LDS.

Note the change in the notation in the following equations for filtered mean (𝜇𝑡 ) and filtered variance (𝑆𝑡 )
because of the conflict of notation for the ISSM coefficient 𝐹 . Also note that the deterministic part 𝑏𝑡 needs
to be subtracted from the observations [𝑧𝑡 ].
𝜇ℎ = 𝐹𝑡 𝜇𝑡−1 𝜇𝑣 = 𝑎⊤
𝑡 𝜇ℎ
Σℎℎ = 𝐹𝑡 𝑆𝑡−1 𝐹𝑡𝑇 + 𝑔𝑡 𝑔𝑡𝑇 𝜎𝑣2 = 𝑎𝑇𝑡 Σℎℎ 𝑎𝑡 + 𝜎𝑡2

1
𝐾𝑡 = Σℎℎ 𝑎𝑡
𝜎𝑣2
𝜇𝑡 = 𝜇ℎ + 𝐾(𝑧𝑡 − 𝑏𝑡 − 𝜇𝑣 ) 𝑆𝑡 = (𝐼 − 𝐾𝑡 𝑎𝑇𝑡 )Σℎℎ (𝐼 − 𝐾𝑡 𝑎𝑇𝑡 )𝑇 + 𝜎𝑡2 𝐾𝑡 𝐾𝑡𝑇
from mxnet.ndarray import linalg_gemm2 as gemm2
4.10.1 ISSM Filtering Function

In [2]: def ISSM_filter(z, b, F, a, g, sigma, m_prior, S_prior):
H = F.shape[0] # dim of latent state

T = z.shape[0] # num of observations
eye_h = nd.array(np.eye(H))
mu_seq = []
S_seq = []
log_p_seq = []
for t in range(T):
if t == 0:
# At the first time step, use the prior
mu_h = m_prior
S_hh = S_prior
else:
# Otherwise compute using update eqns.
F_t = F[:, :, t]
g_t = g[:, t].reshape((H,1))
mu_h = gemm2(F_t, mu_t)

S_hh = gemm2(F_t, gemm2(S_t, F_t, transpose_b=1)) + \
gemm2(g_t, g_t, transpose_b=1)
a_t = a[:, t].reshape((H,1))

mu_v = gemm2(mu_h, a_t, transpose_a=1)
# Compute the Kalman gain (vector)

S_hh_x_a_t = gemm2(S_hh, a_t)
sigma_t = sigma[t]
S_vv = gemm2(a_t, S_hh_x_a_t, transpose_a=1) + nd.square(sigma_t)
kalman_gain = nd.broadcast_div(S_hh_x_a_t, S_vv)
4.10. Filtering 331

# Compute the error (delta)

delta = z[t] - b[t] - mu_v
# Filtered estimates
mu_t = mu_h + gemm2(kalman_gain, delta)
# Joseph's symmetrized update for covariance:

ImKa = nd.broadcast_sub(eye_h, gemm2(kalman_gain, a_t, transpose_b=1))
S_t = gemm2(gemm2(ImKa, S_hh), ImKa, transpose_b=1) + \
nd.broadcast_mul(gemm2(kalman_gain, kalman_gain, transpose_b=1), nd
# likelihood term
log_p = (-0.5 * (delta * delta / S_vv
+ np.log(2.0 * np.pi)
+ nd.log(S_vv))
)
mu_seq.append(mu_t)
S_seq.append(S_t)
log_p_seq.append(log_p)
return mu_seq, S_seq, log_p_seq
4.10.2 Data
We will use the 10 year US Government Bond Yields dataset to illustrate two specific instances of ISSM
models.
In [3]: import pandas as pd
import numpy as np
%matplotlib inline
plt.rcParams["figure.figsize"] = (12, 5)
In [4]: df = pd.read_csv("https://datahub.io/core/bond-yields-us-10y/r/monthly.csv", header
In [5]: df.set_index("Date")
# get the time series

ts = df.values[:,1]
# Let us normalize the time series

ts = np.array((ts - np.mean(ts)) / np.std(ts), dtype=np.double)
In [6]: plt.plot(ts);

4.10.3 Level ISSM

The simplest possible ISSM maintains a level component only. Abusing the notation and let 𝑙𝑡 denote level,
the level ISSM can be written as
𝑙𝑡 = 𝛿𝑙𝑡−1 + 𝛼𝜖𝑡 .
Or in ISSM terminology,
𝑎𝑡 = [𝛿], 𝐹𝑡 = [𝛿], 𝑔𝑡 = [𝛼], 𝛼 > 0.
The level 𝑙𝑡 ∈ R evolves over time by adding a random innovation 𝛼𝜖𝑡 ∼ 𝒩 (0, 𝛼2 ) to the previous level, so
that 𝛼 specifies the amount of level drift over time. At time 𝑡, the previous level 𝑙𝑡−1 is used in the prediction
𝑧𝑡 and then the level is updated. The damping factor 𝛿 ∈ (0, 1] allows the ‘‘damping” of the level. The
initial state prior 𝑃 (𝑙0 ) is given by 𝑙0 ∼ 𝑁 (𝜇0 , 𝜎02 ). For Level-ISSM, we learn the parameters 𝛼 > 0, 𝜇0 ,
𝜎0 > 0.
Here we will fix the parameters for the illustration of filtering. Learning of the parameters will be discussed
in another notebook.
In [7]: latent_dim = 1
T = len(ts)
# Set the coefficients of the ISSM

delta = 1.0
F = delta * nd.ones((1, 1, T))
a = delta * nd.ones((1, T))
# Set the parameters of the ISSM

alpha = 0.5
g = alpha * nd.ones((1, T))
4.10. Filtering 333

m_prior = nd.zeros((latent_dim, 1))

S_prior = nd.zeros((latent_dim, latent_dim))
sigma = 0.5 * nd.ones((T, 1))
b = nd.zeros((T, 1))
z = nd.array(ts).reshape((T, 1))
In [8]: mu_seq, S_seq, _ = ISSM_filter(z, b, F, a, g, sigma, m_prior, S_prior)
Calculate the filtered mean and variance of observations

Given 𝑝(𝑙𝑡−1 |𝑧1:𝑡 ) = 𝒩 (𝜇𝑡 , 𝑆𝑡 ), we can compute the distribution of the reconstructed observations
𝑝(𝑧̂︀𝑡 ) = 𝒩 (𝑎𝑇𝑡 𝜇𝑡 , 𝑎𝑇𝑡 𝑆𝑡 𝑎𝑡 + 𝜎𝑡2 ).
def reconstruct(mu_seq, S_seq):

a_np = a.asnumpy()
T = len(mu_seq)
sigma_np = sigma.asnumpy()
v_filtered_mean = np.array([a_np[:, t].dot(mu_t.asnumpy())

for t, mu_t in enumerate(mu_seq)]
).reshape(T, )
v_filtered_std = np.sqrt(np.array([a_np[:, t].dot(S_t.asnumpy()).dot(a_np[:, t]

np.square(sigma_np[t])
for t, S_t in enumerate(S_seq)]).reshape((T,
return v_filtered_mean, v_filtered_std

In [10]: reconst_mean, reconst_std = reconstruct(mu_seq, S_seq)
Forecast
One advantage of the ISSM model is that one can obtain the complete probability distribution for each of
the future time steps:
𝑇 𝑇 2
𝑇 +𝑡 ) = 𝒩 (𝑎𝑇 +𝑡 𝜇𝑇 +𝑡 , 𝑎𝑇 +𝑡 𝑆𝑇 +𝑡 𝑎𝑇 +𝑡 + 𝜎𝑇 +𝑡 ),
𝑝(𝑧̂︂ 𝑡>0
𝑝(𝑙𝑇 +𝑡 ) = 𝒩 (𝐹 𝜇𝑇 +𝑡−1 , 𝐹 𝑆𝑇 +𝑡−1 𝐹 + 𝑔𝑇 +𝑡 𝑔𝑇𝑇 +𝑡 )
𝑇
In [11]: def forecast(mu_last_state, S_last_state, F, a, g, sigma, horizon):
forecasts_mean = []
forecasts_std = []
mu_last_state = mu_last_state.asnumpy()
S_last_state = S_last_state.asnumpy()
F = F.asnumpy()
a = a.asnumpy()
g = g.asnumpy()
sigma = sigma.asnumpy()

for t in range(horizon):
a_t = a[:, t]
forecast_mean = a_t.dot(mu_last_state)[0]
forecast_std = a_t.dot(S_last_state).dot(a_t) + np.square(sigma[t])[0]
forecasts_mean.append(forecast_mean)
forecasts_std.append(forecast_std)
mu_last_state = F[:, :, t].dot(mu_last_state)

S_last_state = F[:, :, t].dot(S_last_state).dot(F[:, :, t].T)
return np.array(forecasts_mean), np.array(forecasts_std)

In [12]: # Let us use the same cofficients (constant over time) for the future as well
forecasts_mean, forecasts_std = forecast(mu_seq[-1],
S_seq[-1],
F, a, g, sigma, horizon=13)
Plot the reconstruction as well as the forecasts

In [13]: def plot_reconstruction_forecasts(v_filtered_mean, v_filtered_std, forecasts_mean,
plt.plot(ts, color="r")
plt.plot(v_filtered_mean, color="b")
T = len(v_filtered_mean)
x = np.arange(T)
plt.fill_between(x, v_filtered_mean-v_filtered_std,
v_filtered_mean+v_filtered_std,
facecolor="blue", alpha=0.2)
plt.plot(np.arange(T, T+len(forecasts_mean)), forecasts_mean, color="g")

plt.fill_between(np.arange(T, T+len(forecasts_mean)), forecasts_mean-forecasts
forecasts_mean+forecasts_std,
facecolor="green", alpha=0.2)
plt.legend(["data", "reconstruction", "forecasts"]);

In [14]: plot_reconstruction_forecasts(reconst_mean, reconst_std, forecasts_mean, forecasts
4.10. Filtering 335

Level Trend ISSM

We can model a piecewise linear random process by using a two-dimensional latent state 𝑙𝑡 ∈ R2 , where
one dimension represents the level (again with a slight abusing of notation, 𝑙) and the other represents the
trend (slope) 𝑏.
𝑙𝑡 = 𝛿𝑙𝑡−1 + 𝛾𝑏𝑡−1 + 𝛼 · 𝜖𝑡
𝑏𝑡 = 𝛾𝑏𝑡−1 + 𝛽 · 𝜖𝑡
In ISSM framework, such a (Damped) LevelTrend-ISSM is given by

[︂ ]︂ [︂ ]︂ [︂ ]︂
𝛿 𝛿 𝛾 𝛼
𝑎𝑡 = , 𝐹𝑡 = , 𝑔𝑡 = ,
𝛾 0 𝛾 𝛽
where 𝛼 > 0, 𝛽 > 0 and the damping factors 𝛿, 𝛾 ∈ (0, 1]. Both the level and slope components evolve over
time by adding innovations 𝛼𝜖𝑡 and 𝛽𝜖𝑡 respectively, so that 𝛽 > 0 is the innovation strength for the slope.
The level at time 𝑡 is the sum of level at 𝑡 − 1 and slope at 𝑡 − 1 (linear prediction) modulo the damping
factors for level 𝛿 and growth 𝛾.
In [15]: latent_dim = 2
T = len(ts)
# Set the coefficients of the ISSM

damp_fact = 1.0
damp_growth = 1.0
# Set the parameters of the ISSM

alpha = 0.5
beta = 0.1
g_t = nd.array([alpha, beta])
g = nd.repeat(g_t, T).reshape((latent_dim, T))

# F and a are constant over time

F_t = nd.reshape(nd.array([damp_fact, damp_growth, 0, damp_growth]), (latent_dim,
a_t = nd.array([damp_fact, damp_growth])
F = nd.repeat(F_t, T).reshape((latent_dim, latent_dim, T))
a = nd.repeat(a_t, T).reshape((latent_dim, T))
m_prior = nd.zeros((latent_dim, 1))

S_prior = nd.zeros((latent_dim, latent_dim))
sigma = 0.5 * nd.ones((T, 1))
b = nd.zeros((T, 1))
z = nd.array(ts).reshape((T, 1))
In [16]: mu_seq, S_seq, _ = ISSM_filter(z, b, F, a, g, sigma, m_prior, S_prior)
In [17]: # Let us use the same cofficients (constant over time) for the future as well
forecasts_mean, forecasts_std = forecast(mu_seq[-1],
S_seq[-1],
F, a, g, sigma, horizon=13)
4.10.4 Plot the reconstruction as well as the forecasts

In [18]: reconst_mean, reconst_std = reconstruct(mu_seq, S_seq)
plot_reconstruction_forecasts(reconst_mean, reconst_std, forecasts_mean, forecasts
4.10. Filtering 337


CHAPTER
FIVE
PART 3: ADVANCED TOPICS
5.1 Variational Autoencoders with Gluon

Recent progress in variation autoencoders (VAE) has made it one of the most popular frameworks for build-
ing deep generative models. In this notebook, we will first introduce the necessary background. Then, we
proceed to build a VAE model based on the paper Auto-Encoding Variational Bayes and apply it to MNIST
dataset for representation learning and sample generation tasks. In this implementation we use the MXNet
Gluon API, i.e. gluon.HybridBlock and autograd.
5.1.1 Introduction of Variational Autoencoders (VAE)

A Quick Recap of Expectation-Maximization (EM)
A straightforward way to approach VAE is through the construction of the well-known Expectation-
Maximization (EM) algorithm. Please refer to this tutorial or this blog as a refresher on EM. Just to quicly
recap a few key elements in EM: insteand of optimizing the log-liklihood (ℓ(𝜃)) directly with observable
data 𝑥, latent variable 𝑧, EM constructs and optimize on a lower bound ℒ(𝑞, 𝜃) often referred to as Evidence
Lower Bond (EBLO). The following equation derives from Jensen’s inequality and holds for any 𝑞(𝑧) as
long as it is a valid probability distribution.
∑︁
ℓ(𝜃) = log (𝑝𝜃 (𝑥𝑖 ))
𝑖
∑︁ (︂∫︁ )︂
= log 𝑝𝜃 (𝑥𝑖 , 𝑧)𝑑𝑧
𝑖
(︂ [︂ ]︂)︂
∑︁ 𝑝𝜃 (𝑥𝑖 , 𝑧)
= log E𝑧∼𝑄
𝑞(𝑧)
𝑖
[︂ (︂ )︂]︂
∑︁ 𝑝𝜃 (𝑥𝑖 , 𝑧)
≥ E𝑧∼𝑄 log
𝑞(𝑧)
𝑖
⏟ ⏞
𝐸𝐿𝐵𝑂:ℒ(𝑞,𝜃)
∑︁ (︂∫︁ )︂
= log 𝑝𝜃 (𝑥𝑖 , 𝑧)𝑑𝑧
𝑖
339
[︂ (︂ )︂]︂
∑︁ 𝑝𝜃 (𝑥𝑖 , 𝑧)
≥ E𝑧∼𝑄 log
𝑞(𝑧)
⏟𝑖 ⏞
𝐸𝐿𝐵𝑂:ℒ(𝑞,𝜃)
Importantly, among all choices of 𝑞(𝑧), we’d be able to maximize the ELBO ℒ(𝑞, 𝜃) with respect to 𝑞 if 𝑞 is
^𝑡−1 ^𝑡−1
chosen to be the inferred posterior, i.e. at 𝑡-th iteration 𝑞 𝑡 (𝑧) = 𝑝(𝑧|𝑥𝑖 ; 𝜃ˆ𝑡−1 ) = ∫︀ 𝑝(𝑥𝑖 |𝑧;𝜃 )𝑝(𝑧;𝜃 ) . This
𝑝(𝑥𝑖 |𝑧;𝜃^𝑡−1 )𝑝(𝑧;𝜃^𝑡−1 )𝑑𝑧
is the essesce of the E-step in EM algorithm. In M-step, we then maximize over 𝜃. The particular choice
of 𝑞(𝑧) in E-step ensures that EM would monotonically increase the ELBO ℒ(𝑞, 𝜃), thus the log-liklihood
ℓ(𝜃). The chain of improvements through E-step and M-step are illustrated below.
ℓ(𝜃𝑡−1 ) = ℒ(𝑞 𝑡 , 𝜃𝑡−1 ) ≤ ℒ(𝑞 𝑡 , 𝜃𝑡 ) ≤ ℓ(𝜃𝑡 )

𝐸−𝑠𝑡𝑒𝑝 𝑀 −𝑠𝑡𝑒𝑝 𝐽𝑒𝑛𝑠𝑒𝑛
From EM to VAE
With more complex distributions of 𝑝𝜃 (𝑥|𝑧), the integration in E-step for exact inference of the posterier
𝑝𝜃 (𝑧|𝑥) is intractable. This posterier inference problem can be addressed with variational inference meth-
ods such as mean-field approximation (where we assume factorizable 𝑞(𝑧)) or sampling based methods (e.g.
collapsed Gibbs sampling for solving Latent Dirichlet allocation). Mean-field approximation put undue con-
straints on the variational family 𝑞(𝑧), and sampling based methods could have slow convergence problems.
Moreover, both methods involves arduous derivation of update functions, that would require rederivation
even for small changes in model and thus could limit the exploration of more complex models.
Auto-Encoding Variational Bayes brought about a flexible neural-network based approach. In this frame-
work, the variational inference / variational optimization task of finding the optimal 𝑞 become a matter of
finding the best parameters of a neural network via backpropagation and stochastic gradient descent. Thus
making blackbox inference possible as well as allowing scalable to trainng for deeper and larger neural
network models. We refer to this framework as Neural Variational Inference.
Here is how it works: - Select a prior for latent variable 𝑝𝜃 (𝑧), which may or may not actually involve
parameters. - Use a neural network to parameterize the distribution 𝑝𝜃 (𝑥|𝑧). Because this part of the model
maps latent varibale (code) 𝑧 to observed data 𝑥, it is viewed as a “decoder” network. - Rather than explictly
calculating the intractable 𝑝(𝑧|𝑧), use another neural network to parameterize the distribution 𝑞𝜑 (𝑧|𝑥) as
the approximate posterior. Due to the mapping from from data 𝑥 to latent variable (code) 𝑧, this part of the
model is viewed as a “encoder” network. - The objective is still to maxmize ELBO ℒ(𝜑, 𝜃). But now instead
of separately finding the optimal 𝜑 (corresponding to 𝑞 in EM) and 𝜃 like EM, we can find the parameters 𝜃
and 𝜑 jointly via standard stochastic gradient descent.
The resulted model resembles an encoder-decoder structure, thus commonly referred to as variational auto-
encoder (VAE).
In the classic example in Auto-Encoding Variational Bayes, we have the prior 𝑝(𝑧) as a standard isotropic
Gaussian 𝒩 (0, 𝐼), and the approximate posterior 𝑞𝜑 (𝑧|𝑥) also be isotropic Gaussian 𝒩 (𝜇𝜑 (𝑥), 𝜎𝜑 (𝑥)𝐼),
where 𝜇𝜑 (𝑥) and 𝜎𝜑 (𝑥) are functions implemented as neural networks and their outputs are used as the
340 Chapter 5. Part 3: Advanced Topics

parameters for the Guassian distribution 𝑞𝜑 (𝑧|𝑥). This model configuration is often referred as Gaussian
VAE.
With this setup the training loss to minimize is the negative of ELBO and can be expressed as follows:
−ℒ(𝑥𝑖 , 𝜑, 𝜃) = −E𝑧∼𝑄𝜑 (𝑧|𝑥𝑖 ) [log 𝑝𝜃 (𝑥𝑖 |𝑧) + log 𝑝𝜃 (𝑧) − log 𝑞𝜑 (𝑧|𝑥𝑖 )]
= −E𝑧∼𝑄𝜑 (𝑧|𝑥𝑖 ) [log 𝑝𝜃 (𝑥𝑖 |𝑧)] + 𝐷𝐾𝐿 [log 𝑞𝜑 (𝑧|𝑥𝑖 )‖𝑝𝜃 (𝑧)]
𝐿
1 ∑︁
≈ [− log 𝑝𝜃 (𝑥𝑖 |𝑧𝑠 )] + 𝐷𝐾𝐿 [log 𝑞𝜑 (𝑧|𝑥𝑖 )‖𝑝𝜃 (𝑧)]
𝐿 𝑠 ⏟ ⏞
⏟ ⏞ Can be calculated analytically between Gaussians
Sampling 𝑧𝑠 ∼𝑄𝜑 (𝑧|𝑥𝑖 )
= −E𝑧∼𝑄𝜑 (𝑧|𝑥𝑖 ) [log 𝑝𝜃 (𝑥𝑖 |𝑧)] + 𝐷𝐾𝐿 [log 𝑞𝜑 (𝑧|𝑥𝑖 )‖𝑝𝜃 (𝑧)]
where the ELBO above is the same as the ELBO expression in EM but with 𝑝(𝑥, 𝑧) expanded and with
𝑞(𝑥)
𝐷𝐾𝐿 denoting the KL-divergence, i.e. 𝐷𝐾𝐿 (𝑄‖𝑃 ) = E𝑥∼𝑄 [log( 𝑝(𝑥) ]. As indicated, the first term can
be approximated by drawing 𝐿 Monte Carlo samples from the distribution 𝑞𝜑 (𝑧|𝑥) (a very feasible task of
drawing from an isotropic Gaussian distribution), while the 𝐷𝐾𝐿 has convenient analytical solutions which
is preferred over Monte Carlo samples in order to have lower variance gradient.
With sampling involved, the remaining question is how do we backpropagate through a sampling node in
the computation graph. The authors of Auto-Encoding Variational Bayes proposed Reparameterize Trick
(RT). Instead of sampling 𝑧 from 𝒩 (𝜇𝜑 (𝑥), 𝜎𝜑 (𝑥)𝐼) directly, we sample 𝜖 from fixed distribution 𝒩 (0, 𝐼)
and construct 𝑧 = 𝜇(𝑥) + 𝜎(𝑥) · 𝜖. This way the random sampling is based on 𝜖, and 𝑧 deterministically
depends on 𝜇(𝑥) and 𝜎(𝑥) allowing gradient to flow through them. RT is a generally applicable technique
for distribution that allows location-scale transformation or has analytical inverse CDFs.
5.2 Implementing a Gaussian VAE

With the theoretical side of things out of the way, let’s start implemting a VAE model!
In [1]: import time
import numpy as np
import mxnet as mx
from tqdm import tqdm, tqdm_notebook
%matplotlib inline
/Users/rding/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conv
from ._conv import register_converters as _register_converters
5.2. Implementing a Gaussian VAE 341

/Users/rding/anaconda3/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py:46: Depreca
import OpenSSL.SSL
In [2]: def gpu_exists():
try:
mx.nd.zeros((1,), ctx=mx.gpu(0))
except:
return False
return True
data_ctx = mx.cpu()
if gpu_exists():
print('Using GPU for model_ctx')
model_ctx = mx.gpu(0)
else:
print('Using CPU for model_ctx')
Using CPU for model_ctx
In [3]: mx.random.seed(1)
output_fig = False
5.2.1 Load MNIST

In [4]: mnist = mx.test_utils.get_mnist()
#print(mnist['train_data'][0].shape)
#plt.imshow(mnist['train_data'][0][0],cmap='Greys')
n_samples = 10
idx = np.random.choice(len(mnist['train_data']), n_samples)
_, axarr = plt.subplots(1, n_samples, figsize=(16,4))
for i,j in enumerate(idx):
axarr[i].imshow(mnist['train_data'][j][0], cmap='Greys')
#axarr[i].axis('off')
axarr[i].get_xaxis().set_ticks([])
axarr[i].get_yaxis().set_ticks([])
plt.show()
/Users/rding/anaconda3/lib/python3.6/site-packages/mxnet/test_utils.py:1430: DeprecationWar
label = np.fromstring(flbl.read(), dtype=np.int8)
/Users/rding/anaconda3/lib/python3.6/site-packages/mxnet/test_utils.py:1433: DeprecationWar
image = np.fromstring(fimg.read(), dtype=np.uint8).reshape(len(label), rows, cols)
In [5]: train_data = np.reshape(mnist['train_data'],(-1,28*28))

test_data = np.reshape(mnist['test_data'],(-1,28*28))
In [6]: mnist['test_label'].shape
Out[6]: (10000,)


n_batches = train_data.shape[0]/batch_size
train_iter = mx.io.NDArrayIter(data={'data': train_data}, label={'label': mnist['tr
test_iter = mx.io.NDArrayIter(data={'data': test_data}, label={'label': mnist['test
#train_iter = mx.io.NDArrayIter(data={'data': train_data}, batch_size = batch_size)
#test_iter = mx.io.NDArrayIter(data={'data': test_data}, batch_size = batch_size)
5.2.2 Define model

In [8]: class VAE(gluon.HybridBlock):
def __init__(self, n_hidden=400, n_latent=2, n_layers=1, n_output=784, batch_si
self.soft_zero = 1e-10
self.n_latent = n_latent
self.batch_size = batch_size
self.output = None
self.mu = None
# note to self: requring batch_size in model definition is sad, not sure ho
super(VAE, self).__init__(**kwargs)
# self.use_aux_logits = use_aux_logits
self.encoder = nn.HybridSequential(prefix='encoder')
for i in range(n_layers):
self.encoder.add(nn.Dense(n_hidden, activation=act_type))
self.encoder.add(nn.Dense(n_latent*2, activation=None))
self.decoder = nn.HybridSequential(prefix='decoder')
for i in range(n_layers):
self.decoder.add(nn.Dense(n_hidden, activation=act_type))
self.decoder.add(nn.Dense(n_output, activation='sigmoid'))

h = self.encoder(x)
#print(h)
mu_lv = F.split(h, axis=1, num_outputs=2)
mu = mu_lv[0]
lv = mu_lv[1]
self.mu = mu
#eps = F.random_normal(loc=0, scale=1, shape=mu.shape, ctx=model_ctx)
# this would work fine only for nd (i.e. non-hybridized block)
eps = F.random_normal(loc=0, scale=1, shape=(self.batch_size, self.n_latent
z = mu + F.exp(0.5*lv)*eps
y = self.decoder(z)
self.output = y
KL = 0.5*F.sum(1+lv-mu*mu-F.exp(lv),axis=1)
logloss = F.sum(x*F.log(y+self.soft_zero)+ (1-x)*F.log(1-y+self.soft_zero),
loss = -logloss-KL
return loss
In [9]: n_hidden=400
n_latent=2
n_layers=2 # num of dense layers in encoder and decoder respectively
n_output=784

model_prefix = 'vae_gluon_{}d{}l{}h.params'.format(n_latent, n_layers, n_hidden)
net = VAE(n_hidden=n_hidden, n_latent=n_latent, n_layers=n_layers, n_output=n_outpu
5.2.3 Model training

In [10]: net.collect_params().initialize(mx.init.Xavier(), ctx=model_ctx)
net.hybridize()
trainer = gluon.Trainer(net.collect_params(), 'adam', {'learning_rate': .001})
In [11]: n_epoch = 50
print_period = n_epoch // 10
start = time.time()
training_loss = []
validation_loss = []
for epoch in tqdm_notebook(range(n_epoch), desc='epochs'):
epoch_loss = 0
epoch_val_loss = 0
train_iter.reset()
test_iter.reset()
n_batch_train = 0
for batch in train_iter:
n_batch_train +=1
data = batch.data[0].as_in_context(model_ctx)
loss = net(data)
loss.backward()
epoch_loss += nd.mean(loss).asscalar()
n_batch_val = 0
for batch in test_iter:
n_batch_val +=1
data = batch.data[0].as_in_context(model_ctx)
loss = net(data)
epoch_val_loss += nd.mean(loss).asscalar()
epoch_loss /= n_batch_train
epoch_val_loss /= n_batch_val
training_loss.append(epoch_loss)
validation_loss.append(epoch_val_loss)
if epoch % max(print_period,1) == 0:
tqdm.write('Epoch{}, Training loss {:.2f}, Validation loss {:.2f}'.format(
end = time.time()
print('Time elapsed: {:.2f}s'.format(end - start))
A Jupyter Widget
Epoch0, Training loss 184.74, Validation loss 171.09


Time elapsed: 1035.14s

In [12]: net.save_parameters(model_prefix)
In [13]: batch_x = np.linspace(1, n_epoch, len(training_loss))
plt.plot(batch_x, -1*np.array(training_loss))
plt.plot(batch_x, -1*np.array(validation_loss))
plt.legend(['train', 'valid'])
Out[13]: <matplotlib.legend.Legend at 0x13d1f64e0>
5.2.4 Load model

In [14]: net2 = VAE(n_hidden=n_hidden, n_latent=n_latent, n_layers=n_layers, n_output=n_out
net2.load_parameters(model_prefix, ctx=model_ctx)
5.2.5 Visualizing reconstruction quality

In [15]: test_iter.reset()
test_batch = test_iter.next()
net2(test_batch.data[0].as_in_context(model_ctx))
result = net2.output.asnumpy()
original = test_batch.data[0].asnumpy()

In [16]: n_samples = 10
idx = np.random.choice(batch_size, n_samples)
_, axarr = plt.subplots(2, n_samples, figsize=(16,4))
for i,j in enumerate(idx):
axarr[0,i].imshow(original[j].reshape((28,28)), cmap='Greys')
if i==0:
axarr[0,i].set_title('original')
#axarr[0,i].axis('off')
axarr[0,i].get_xaxis().set_ticks([])
axarr[0,i].get_yaxis().set_ticks([])
axarr[1,i].imshow(result[j].reshape((28,28)), cmap='Greys')
if i==0:
axarr[1,i].set_title('reconstruction')
#axarr[1,i].axis('off')
axarr[1,i].get_xaxis().set_ticks([])
axarr[1,i].get_yaxis().set_ticks([])
plt.show()
5.2.6 Visualizing latent space (when it is 2D)

In [17]: n_batches = 10
counter = 0
results = []
labels = []
for batch in test_iter:
net2(batch.data[0].as_in_context(model_ctx))
results.append(net2.mu.asnumpy())
labels.append(batch.label[0].asnumpy())
counter +=1
if counter >= n_batches:
break
In [18]: result= np.vstack(results)
labels = np.hstack(labels)
In [24]: if result.shape[1]==2:
from scipy.special import ndtri
from scipy.stats import norm
fig, axarr = plt.subplots(1,2, figsize=(10,4))

im=axarr[0].scatter(result[:, 0], result[:, 1], c=labels, alpha=0.6, cmap='Pai
axarr[0].set_title(r'scatter plot of $\mu$')
axarr[0].axis('equal')

fig.colorbar(im, ax=axarr[0])
im=axarr[1].scatter(norm.cdf(result[:, 0]), norm.cdf(result[:, 1]), c=labels,

axarr[1].set_title(r'scatter plot of $\mu$ on norm.cdf() transformed coordinat
axarr[1].axis('equal')
fig.colorbar(im, ax=axarr[1])
plt.tight_layout()
if output_fig:
plt.savefig('2d_latent_space_for_test_samples.png')
5.2.7 Sample latent space and generate images

Random sampling
In [20]: n_samples = 10
zsamples = nd.array(np.random.randn(n_samples*n_samples, n_latent))
In [21]: images = net2.decoder(zsamples.as_in_context(model_ctx)).asnumpy()
In [22]: canvas = np.empty((28*n_samples, 28*n_samples))
for i, img in enumerate(images):
x = i // n_samples
y = i % n_samples
canvas[(n_samples-y-1)*28:(n_samples-y)*28, x*28:(x+1)*28] = img.reshape(28, 2
plt.imshow(canvas, origin="upper", cmap="Greys")
plt.axis('off')
plt.tight_layout()
if output_fig:
plt.savefig('generated_samples_with_{}D_latent_space.png'.format(n_latent))

Grid scan 2D latent space

In [23]: if n_latent==2:
n_pts = 20
idx = np.arange(0, n_pts)
x = np.linspace(norm.cdf(-3), norm.cdf(3),n_pts)
x = ndtri(x)
x_grid = np.array(np.meshgrid(*[i for i in np.matlib.repmat(x,n_latent,1)]))

id_grid = np.array(np.meshgrid(*[i for i in np.matlib.repmat(idx,n_latent,1)])
zsamples = nd.array(x_grid.reshape((n_latent, -1)).transpose())

zsamples_id = id_grid.reshape((n_latent, -1)).transpose()
images = net2.decoder(zsamples.as_in_context(model_ctx)).asnumpy()
#plot
canvas = np.empty((28*n_pts, 28*n_pts))
for i, img in enumerate(images):
x, y = zsamples_id[i]
canvas[(n_pts-y-1)*28:(n_pts-y)*28, x*28:(x+1)*28] = img.reshape(28, 28)
plt.imshow(canvas, origin="upper", cmap="Greys")
plt.axis('off')
plt.tight_layout()
if output_fig:
plt.savefig('2d_latent_space_scan_for_generation.png')

5.3 Generative Adversarial Networks

Throughout most of this book, we’ve talked about how to make predictions. In some form or another, we
used deep neural networks learned mappings from data points to labels. This kind of learning is called
discriminative learning, as in, we’d like to be able to discriminate between photos cats and photos of dogs.
Classifiers and regressors are both examples of discriminative learning. And neural networks trained by
backpropagation have upended everything we thought we knew about discriminative learning on large com-
plicated datasets. Classification accuracies on high-res images has gone from useless to human-level (with
some caveats) in just 5-6 years. We’ll spare you another spiel about all the other discriminative tasks where
deep neural networks do astoundingly well.
But there’s more to machine learning than just solving discriminative tasks. For example, given a large
dataset, without any labels, we might want to learn a model that concisely captures the characteristics of this
data. Given such a model, we could sample synthetic data points that resemble the distribution of the training
data. For example, given a large corpus of photographs of faces, we might want to be able to generate a
5.3. Generative Adversarial Networks 349

new photorealistic image that looks like it might plausibly have come from the same dataset. This kind of
learning is called generative modeling.
Until recently, we had no method that could synthesize novel photorealistic images. But the success of deep
neural networks for discriminative learning opened up new possiblities. One big trend over the last three
years has been the application of discriminative deep nets to overcome challenges in problems that we don’t
generally think of as supervised learning problems. The recurrent neural network language models are one
example of using a discriminative network (trained to predict the next character) that once trained can act as
a generative model.
In 2014, a young researcher named Ian Goodfellow introduced Generative Adversarial Networks (GANs) a
clever new way to leverage the power of discriminative models to get good generative models. GANs made
quite a splash so it’s quite likely you’ve seen the images before. For instance, using a GAN you can create
fake images of bedrooms, as done by Radford et al. in 2015 and depicted below.
At their heart, GANs rely on the idea that a data generator is good if we cannot tell fake data apart from
real data. In statistics, this is called a two-sample test - a test to answer the question whether datasets
𝑋 = {𝑥1 , . . . 𝑥𝑛 } and 𝑋 ′ = {𝑥′1 , . . . 𝑥′𝑛 } were drawn from the same distribution. The main difference
between most statistics papers and GANs is that the latter use this idea in a constructive way. In other
words, rather than just training a model to say ‘hey, these two datasets don’t look like they came from the
same distribution’, they use the two-sample test to provide training signal to a generative model. This allows
us to improve the data generator until it generates something that resembles the real data. At the very least,
it needs to fool the classifier. And if our classifier is a state of the art deep neural network.
As you can see, there are two pieces to GANs - first off, we need a device (say, a deep network but it really
could be anything, such as a game rendering engine) that might potentially be able to generate data that looks
just like the real thing. If we are dealing with images, this needs to generate images. If we’re dealing with
speech, it needs to generate audio sequences, and so on. We call this the generator network. The second
component is the discriminator network. It attempts to distinguish fake and real data from each other. Both
networks are in competition with each other. The generator network attempts to fool the discriminator
network. At that point, the discriminator network adapts to the new fake data. This information, in turn is

used to improve the generator network, and so on.

Generator * Draw some parameter 𝑧 from a source of randomness, e.g. a normal distribution 𝑧 ∼ 𝒩 (0, 1).
* Apply a function 𝑓 such that we get 𝑥′ = 𝐺(𝑢, 𝑤) * Compute the gradient with respect to 𝑤 to minimize
log 𝑝(𝑦 = fake|𝑥′ )
Discriminator * Improve the accuracy of a binary classifier 𝑓 , i.e. maximize log 𝑝(𝑦 = fake|𝑥′ ) and
log 𝑝(𝑦 = true|𝑥) for fake and real data respectively.
In short, there are two optimization problems running simultaneously, and the optimization terminates if a
stalemate has been reached. There are lots of further tricks and details on how to modify this basic setting.
For instance, we could try solving this problem in the presence of side information. This leads to cGAN,
i.e. conditional Generative Adversarial Networks. We can change the way how we detect whether real and
fake data look the same. This leads to wGAN (Wasserstein GAN), kernel-inspired GANs and lots of other
settings, or we could change how closely we look at the objects. E.g. fake images might look real at the
texture level but not so at the larger level, or vice versa.
Many of the applications are in the context of images. Since this takes too much time to solve in a Jupyter
notebook on a laptop, we’re going to content ourselves with fitting a much simpler distribution. We will
illustrate what happens if we use GANs to build the world’s most inefficient estimator of parameters for a
Gaussian. Let’s get started.
import mxnet as mx
from mxnet import gluon, autograd, nd
import numpy as np
ctx = mx.cpu()
5.3.1 Generate some ‘real’ data

Since this is going to be the world’s lamest example, we simply generate data drawn from a Gaussian. And
let’s also set a context where we’ll do most of the computation.
In [2]: X = nd.random_normal(shape=(1000, 2))
A = nd.array([[1, 2], [-0.1, 0.5]])
b = nd.array([1, 2])
X = nd.dot(X, A) + b
Y = nd.ones(shape=(1000, 1))
# and stick them into an iterator

batch_size = 4
train_data = mx.io.NDArrayIter(X, Y, batch_size, shuffle=True)
Let’s see what we got. This should be a Gaussian shifted in some rather arbitrary way with mean 𝑏 and
covariance matrix 𝐴⊤ 𝐴.
In [3]: plt.scatter(X[:,0].asnumpy(), X[:,1].asnumpy())
plt.show()
print("The covariance matrix is")
print(nd.dot(A.T, A))


The covariance matrix is
[[ 1.00999999 1.95000005]
[ 1.95000005 4.25 ]]
5.3.2 Defining the networks

Next we need to define how to fake data. Our generator network will be the simplest network possible - a
single layer linear model. This is since we’ll be driving that linear network with a Gaussian data generator.
Hence, it literally only needs to learn the parameters to fake things perfectly. For the discriminator we will
be a bit more discriminating: we will use an MLP with 3 layers to make things a bit more interesting.
The cool thing here is that we have two different networks, each of them with their own gradients, optimizers,
losses, etc. that we can optimize as we please.
In [4]: # build the generator
netG = nn.Sequential()
with netG.name_scope():
netG.add(nn.Dense(2))
# build the discriminator (with 5 and 3 hidden units respectively)

netD = nn.Sequential()
with netD.name_scope():
netD.add(nn.Dense(5, activation='tanh'))
netD.add(nn.Dense(3, activation='tanh'))
netD.add(nn.Dense(2))
# loss
# initialize the generator and the discriminator

netG.initialize(mx.init.Normal(0.02), ctx=ctx)
netD.initialize(mx.init.Normal(0.02), ctx=ctx)
# trainer for the generator and the discriminator

trainerG = gluon.Trainer(netG.collect_params(), 'adam', {'learning_rate': 0.01})
trainerD = gluon.Trainer(netD.collect_params(), 'adam', {'learning_rate': 0.05})
5.3.3 Setting up the training loop

We are going to iterate over the data a few times. To make life simpler we need a few variables
In [5]: real_label = mx.nd.ones((batch_size,), ctx=ctx)
fake_label = mx.nd.zeros((batch_size,), ctx=ctx)
metric = mx.metric.Accuracy()
# set up logging
from datetime import datetime
import os
import time
5.3.4 Training loop

In [6]: stamp = datetime.now().strftime('%Y_%m_%d-%H_%M')
tic = time.time()
train_data.reset()
for i, batch in enumerate(train_data):
############################
# (1) Update D network: maximize log(D(x)) + log(1 - D(G(z)))
###########################
# train with real_t
data = batch.data[0].as_in_context(ctx)
noise = nd.random_normal(shape=(batch_size, 2), ctx=ctx)
real_output = netD(data)
errD_real = loss(real_output, real_label)
fake = netG(noise)
fake_output = netD(fake.detach())
errD_fake = loss(fake_output, fake_label)
errD = errD_real + errD_fake
errD.backward()
trainerD.step(batch_size)
metric.update([real_label,], [real_output,])
metric.update([fake_label,], [fake_output,])
############################
# (2) Update G network: maximize log(D(G(z)))
###########################
output = netD(fake)
errG = loss(output, real_label)

errG.backward()
trainerG.step(batch_size)
name, acc = metric.get()

metric.reset()
print('\nbinary training acc at epoch %d: %s=%f' % (epoch, name, acc))
print('time: %f' % (time.time() - tic))
noise = nd.random_normal(shape=(100, 2), ctx=ctx)
fake = netG(noise)
plt.scatter(X[:,0].asnumpy(), X[:,1].asnumpy())
plt.scatter(fake[:,0].asnumpy(), fake[:,1].asnumpy())
plt.show()
binary training acc at epoch 0: accuracy=0.764500

time: 5.838877

time: 6.052228


time: 5.773329

time: 5.613472


time: 6.069607

time: 5.800509


time: 5.982538

time: 6.017519


time: 6.143714

time: 6.123487

5.3.5 Checking the outcome

Let’s now generate some fake data and check whether it looks real.
In [7]: noise = mx.nd.random_normal(shape=(100, 2), ctx=ctx)
fake = netG(noise)
plt.scatter(X[:,0].asnumpy(), X[:,1].asnumpy())
plt.scatter(fake[:,0].asnumpy(), fake[:,1].asnumpy())
plt.show()

5.3.6 Conclusion
A word of caution here - to get this to converge properly, we needed to adjust the learning rates very carefully.
And for Gaussians, the result is rather mediocre - a simple mean and covariance estimator would have
worked much better. However, whenever we don’t have a really good idea of what the distribution should
be, this is a very good way of faking it to the best of our abilities. Note that a lot depends on the power of
the discriminating network. If it is weak, the fake can be very different from the truth. E.g. in our case it
had trouble picking up anything along the axis of reduced variance. In summary, this isn’t exactly easy to
set and forget. One nice resource for dirty practioner’s knowledge is Soumith Chintala’s handy list of tricks
for how to babysit GANs.
5.4 Deep Convolutional Generative Adversarial Networks

In our introduction to generative adversarial networks (GANs), we introduced the basic ideas behind how
GANs work. We showed that they can draw samples from some simple, easy-to-sample distribution, like
a uniform or normal distribution, and transform them into samples that appear to match the distribution of
some data set. And while our example of matching a 2D Gaussian distribution got the point across, it’s not
especially exciting.
In this notebook, we’ll demonstrate how you can use GANs to generate photorealistic images. We’ll be
basing our models on the deep convolutional GANs introduced in this paper. We’ll borrow the convolutional
architecture that have proven so successful for discriminative computer vision problems and show how via
GANs, they can be leveraged to generate photorealistic images.
In this tutorial, concentrate on the LWF Face Dataset, which contains roughly 13000 images of faces. By
the end of the tutorial, you’ll know how to generate photo-realistic images of your own, given any dataset
of images. First, we’ll the the preliminaries out of the way.
import os
import tarfile
import matplotlib.image as mpimg
import mxnet as mx
from mxnet.gluon import nn, utils
import numpy as np
5.4.1 Set training parameters

In [ ]: epochs = 2 # Set low by default for tests, set higher when you actually run this co
batch_size = 64
latent_z_size = 100
use_gpu = True
ctx = mx.gpu() if use_gpu else mx.cpu()
5.4. Deep Convolutional Generative Adversarial Networks 361

lr = 0.0002
beta1 = 0.5
5.4.2 Download and preprocess the LWF Face Dataset

In [ ]: lfw_url = 'http://vis-www.cs.umass.edu/lfw/lfw-deepfunneled.tgz'
data_path = 'lfw_dataset'
if not os.path.exists(data_path):
os.makedirs(data_path)
data_file = utils.download(lfw_url)
with tarfile.open(data_file) as tar:
tar.extractall(path=data_path)
First, we resize images to size 64 × 64. Then, we normalize all pixel values to the [−1, 1] range.
In [ ]: target_wd = 64
target_ht = 64
img_list = []
def transform(data, target_wd, target_ht):

# resize to target_wd * target_ht
data = mx.image.imresize(data, target_wd, target_ht)
# transpose from (target_wd, target_ht, 3)
# to (3, target_wd, target_ht)
data = nd.transpose(data, (2,0,1))
# normalize to [-1, 1]
data = data.astype(np.float32)/127.5 - 1
# if image is greyscale, repeat 3 times to get RGB image.
if data.shape[0] == 1:
data = nd.tile(data, (3, 1, 1))
return data.reshape((1,) + data.shape)
for path, _, fnames in os.walk(data_path):

for fname in fnames:
if not fname.endswith('.jpg'):
continue
img = os.path.join(path, fname)
img_arr = mx.image.imread(img)
img_arr = transform(img_arr, target_wd, target_ht)
img_list.append(img_arr)
train_data = mx.io.NDArrayIter(data=nd.concatenate(img_list), batch_size=batch_size
Visualize 4 images:
In [ ]: def visualize(img_arr):
plt.imshow(((img_arr.asnumpy().transpose(1, 2, 0) + 1.0) * 127.5).astype(np.uin
plt.axis('off')
for i in range(4):
plt.subplot(1,4,i+1)
visualize(img_list[i + 10][0])
plt.show()


The core to the DCGAN architecture uses a standard CNN architecture on the discriminative model. For the
generator, convolutions are replaced with upconvolutions, so the representation at each layer of the generator
is actually successively larger, as it mapes from a low-dimensional latent vector onto a high-dimensional
image.
• Replace any pooling layers with strided convolutions (discriminator) and fractional-strided convolu-
tions (generator).
• Use batch normalization in both the generator and the discriminator.
• Remove fully connected hidden layers for deeper architectures.
• Use ReLU activation in generator for all layers except for the output, which uses Tanh.
• Use LeakyReLU activation in the discriminator for all layers.
In [ ]: # build the generator

nc = 3
ngf = 64
netG = nn.Sequential()
with netG.name_scope():
# input is Z, going into a convolution
netG.add(nn.Conv2DTranspose(ngf * 8, 4, 1, 0, use_bias=False))
netG.add(nn.BatchNorm())
netG.add(nn.Activation('relu'))
# state size. (ngf*8) x 4 x 4
netG.add(nn.Conv2DTranspose(ngf, 4, 2, 1, use_bias=False))
netG.add(nn.Conv2DTranspose(nc, 4, 2, 1, use_bias=False))
netG.add(nn.Activation('tanh'))
# state size. (nc) x 64 x 64
# build the discriminator

ndf = 64
netD = nn.Sequential()
with netD.name_scope():
# input is (nc) x 64 x 64
netD.add(nn.Conv2D(ndf, 4, 2, 1, use_bias=False))
netD.add(nn.LeakyReLU(0.2))
# state size. (ndf) x 32 x 32
netD.add(nn.Conv2D(ndf * 2, 4, 2, 1, use_bias=False))
netD.add(nn.BatchNorm())
netD.add(nn.Conv2D(1, 4, 1, 0, use_bias=False))
5.4.4 Setup Loss Function and Optimizer

We use binary cross-entropy as our loss function and use the Adam optimizer. We initialize the network’s
parameters by sampling from a normal distribution.
In [ ]: # loss
loss = gluon.loss.SigmoidBinaryCrossEntropyLoss()
# initialize the generator and the discriminator

netG.initialize(mx.init.Normal(0.02), ctx=ctx)
netD.initialize(mx.init.Normal(0.02), ctx=ctx)

trainerG = gluon.Trainer(netG.collect_params(), 'adam', {'learning_rate': lr, 'beta
trainerD = gluon.Trainer(netD.collect_params(), 'adam', {'learning_rate': lr, 'beta
5.4.5 Training Loop

We recommend thst you use a GPU for training this model. After a few epochs, we can see human-face-like
images are generated.
In [ ]: from datetime import datetime
import time
import logging
real_label = nd.ones((batch_size,), ctx=ctx)

fake_label = nd.zeros((batch_size,),ctx=ctx)
def facc(label, pred):

pred = pred.ravel()
label = label.ravel()
return ((pred > 0.5) == label).mean()
metric = mx.metric.CustomMetric(facc)

stamp = datetime.now().strftime('%Y_%m_%d-%H_%M')
logging.basicConfig(level=logging.DEBUG)

tic = time.time()
btic = time.time()
train_data.reset()
iter = 0
############################
# (1) Update D network: maximize log(D(x)) + log(1 - D(G(z)))
###########################
data = batch.data[0].as_in_context(ctx)
latent_z = mx.nd.random_normal(0, 1, shape=(batch_size, latent_z_size, 1, 1
# train with real image
output = netD(data).reshape((-1, 1))
errD_real = loss(output, real_label)
metric.update([real_label,], [output,])
# train with fake image

fake = netG(latent_z)
output = netD(fake.detach()).reshape((-1, 1))
errD_fake = loss(output, fake_label)
errD = errD_real + errD_fake
errD.backward()
metric.update([fake_label,], [output,])
trainerD.step(batch.data[0].shape[0])
############################
# (2) Update G network: maximize log(D(G(z)))
###########################
fake = netG(latent_z)
output = netD(fake).reshape((-1, 1))
errG = loss(output, real_label)
errG.backward()
trainerG.step(batch.data[0].shape[0])
# Print log infomation every ten batches

if iter % 10 == 0:
logging.info('speed: {} samples/s'.format(batch_size / (time.time() - b
logging.info('discriminator loss = %f, generator loss = %f, binary trai
%(nd.mean(errD).asscalar(),
nd.mean(errG).asscalar(), acc, iter, epoch))
iter = iter + 1
btic = time.time()

metric.reset()
# logging.info('\nbinary training acc at epoch %d: %s=%f' % (epoch, name, acc))
# logging.info('time: %f' % (time.time() - tic))
# Visualize one generated image for each epoch

# fake_img = fake[0]
# visualize(fake_img)
# plt.show()
5.4.6 Results
Given a trained generator, we can generate some images of faces.
In [ ]: num_image = 8
for i in range(num_image):
latent_z = mx.nd.random_normal(0, 1, shape=(1, latent_z_size, 1, 1), ctx=ctx)
img = netG(latent_z)
visualize(img[0])
plt.show()
We can also interpolate along the manifold between images by interpolating linearly between points in the
latent space and visualizing the corresponding images. We can see that small changes in the latent space
results in smooth changes in generated images.
In [ ]: num_image = 12
latent_z = mx.nd.random_normal(0, 1, shape=(1, latent_z_size, 1, 1), ctx=ctx)
step = 0.05
img = netG(latent_z)
visualize(img[0])
latent_z += 0.05
plt.show()
5.5 Pixel to Pixel Generative Adversarial Networks

Pixel to Pixel Generative Adversarial Networks applies Conditional Generative Adversarial Networks as
a general-purpose solution to image-to-image translation problems. These networks not only learn the
mapping from input image to output image, but also learn a loss function to train this mapping.
With pixel2pixel GAN, it is possible to train different type of image translation tasks with small datasets.
In this tutorial, we will train on three image translation tasks: facades with 400 images from CMP Facades
dataset, cityscapes with 2975 images from Cityscapes training set and maps with 1096 training images
scraped from Google Maps.
For harder problems such as edges2shoes and edges2handbags, it may be important to train on far larger
datasets, which takes significantly more time. You can try them with Multiple GPUs.
import os

import tarfile
import matplotlib.image as mpimg
import mxnet as mx
from mxnet.gluon import nn, utils
from mxnet.gluon.nn import Dense, Activation, Conv2D, Conv2DTranspose, \
BatchNorm, LeakyReLU, Flatten, HybridSequential, HybridBlock, Dropout
import numpy as np
5.5.1 Set Training parameters

batch_size = 10
use_gpu = True
ctx = mx.gpu() if use_gpu else mx.cpu()
lr = 0.0002
beta1 = 0.5
lambda1 = 100
pool_size = 50
5.5.2 Download and Preprocess Dataset

We first train on facades dataset. We need to crop images to input images and output images. Notice that
pixel2pixel GAN is capable to train these tasks bidirectional. You can set is-reversed=True to switch
input and output image patterns.
In [3]: dataset = 'facades'
We first resize images to size 512 * 256. Then normalize image pixel values to be between -1 and 1.
In [4]: img_wd = 256
img_ht = 256
train_img_path = '%s/train' % (dataset)
val_img_path = '%s/val' % (dataset)
def download_data(dataset):
if not os.path.exists(dataset):
url = 'https://people.eecs.berkeley.edu/~tinghuiz/projects/pix2pix/datasets
os.mkdir(dataset)
data_file = utils.download(url)
with tarfile.open(data_file) as tar:
tar.extractall(path='.')
os.remove(data_file)
def load_data(path, batch_size, is_reversed=False):

img_in_list = []
img_out_list = []
for path, _, fnames in os.walk(path):
5.5. Pixel to Pixel Generative Adversarial Networks 367

for fname in fnames:

if not fname.endswith('.jpg'):
continue
img = os.path.join(path, fname)
img_arr = mx.image.imread(img).astype(np.float32)/127.5 - 1
img_arr = mx.image.imresize(img_arr, img_wd * 2, img_ht)
# Crop input and output images
img_arr_in, img_arr_out = [mx.image.fixed_crop(img_arr, 0, 0, img_wd, i
mx.image.fixed_crop(img_arr, img_wd, 0, img_
img_arr_in, img_arr_out = [nd.transpose(img_arr_in, (2,0,1)),
nd.transpose(img_arr_out, (2,0,1))]
img_arr_in, img_arr_out = [img_arr_in.reshape((1,) + img_arr_in.shape),
img_arr_out.reshape((1,) + img_arr_out.shape
img_in_list.append(img_arr_out if is_reversed else img_arr_in)
img_out_list.append(img_arr_in if is_reversed else img_arr_out)
return mx.io.NDArrayIter(data=[nd.concat(*img_in_list, dim=0), nd.concat(*img_o

batch_size=batch_size)
download_data(dataset)
train_data = load_data(train_img_path, batch_size, is_reversed=True)
val_data = load_data(val_img_path, batch_size, is_reversed=True)
Visualize 4 images:
In [5]: def visualize(img_arr):
plt.imshow(((img_arr.asnumpy().transpose(1, 2, 0) + 1.0) * 127.5).astype(np.uin
plt.axis('off')
def preview_train_data():
img_in_list, img_out_list = train_data.next().data
for i in range(4):
visualize(img_in_list[i])
visualize(img_out_list[i])
plt.show()
preview_train_data()


Both generator and discriminator use modules of the form convolution-BatchNorm-ReLu.
The key for generator is U-net architecture adding skip connections which shut-
tle low-level infomation shared between input and output images across net.

PatchGAN – that only penalizes structure at the scale of patches is applied as disciminator architecture.
This discriminator tries to classify if each N × N patch in an image is real or fake. We run this discriminator
convolutionally across the image, averaging all responses to provide the ultimate output of netD.
In [6]: # Define Unet generator skip block
class UnetSkipUnit(HybridBlock):
def __init__(self, inner_channels, outer_channels, inner_block=None, innermost=
use_dropout=False, use_bias=False):
super(UnetSkipUnit, self).__init__()
self.outermost = outermost
en_conv = Conv2D(channels=inner_channels, kernel_size=4, strides=2, pad
in_channels=outer_channels, use_bias=use_bias)
en_relu = LeakyReLU(alpha=0.2)
en_norm = BatchNorm(momentum=0.1, in_channels=inner_channels)
de_relu = Activation(activation='relu')
de_norm = BatchNorm(momentum=0.1, in_channels=outer_channels)
if innermost:
de_conv = Conv2DTranspose(channels=outer_channels, kernel_size=4, s
in_channels=inner_channels, use_bias=use_

encoder = [en_relu, en_conv]

decoder = [de_relu, de_conv, de_norm]
model = encoder + decoder
elif outermost:
in_channels=inner_channels * 2)
encoder = [en_conv]
decoder = [de_relu, de_conv, Activation(activation='tanh')]
model = encoder + [inner_block] + decoder
else:
in_channels=inner_channels * 2, use_bias=
encoder = [en_relu, en_conv, en_norm]
decoder = [de_relu, de_conv, de_norm]
model = encoder + [inner_block] + decoder
if use_dropout:
model += [Dropout(rate=0.5)]
self.model = HybridSequential()
with self.model.name_scope():
for block in model:
self.model.add(block)

if self.outermost:
return self.model(x)
else:
return F.concat(self.model(x), x, dim=1)
# Define Unet generator

class UnetGenerator(HybridBlock):
def __init__(self, in_channels, num_downs, ngf=64, use_dropout=True):
super(UnetGenerator, self).__init__()
#Build unet generator structure

unet = UnetSkipUnit(ngf * 8, ngf * 8, innermost=True)
for _ in range(num_downs - 5):
unet = UnetSkipUnit(ngf * 8, ngf * 8, unet, use_dropout=use_dropout)
unet = UnetSkipUnit(ngf * 8, ngf * 4, unet)
unet = UnetSkipUnit(ngf, in_channels, unet, outermost=True)
self.model = unet

return self.model(x)
# Define the PatchGAN discriminator

class Discriminator(HybridBlock):
def __init__(self, in_channels, ndf=64, n_layers=3, use_sigmoid=False, use_bias
super(Discriminator, self).__init__()

self.model = HybridSequential()
kernel_size = 4
padding = int(np.ceil((kernel_size - 1)/2))
self.model.add(Conv2D(channels=ndf, kernel_size=kernel_size, strides=2,
padding=padding, in_channels=in_channels))
self.model.add(LeakyReLU(alpha=0.2))
nf_mult = 1
for n in range(1, n_layers):
nf_mult_prev = nf_mult
nf_mult = min(2 ** n, 8)
self.model.add(Conv2D(channels=ndf * nf_mult, kernel_size=kernel_si
padding=padding, in_channels=ndf * nf_mult_pr
use_bias=use_bias))
self.model.add(BatchNorm(momentum=0.1, in_channels=ndf * nf_mult))
nf_mult_prev = nf_mult
nf_mult = min(2 ** n_layers, 8)
self.model.add(Conv2D(channels=ndf * nf_mult, kernel_size=kernel_size,
padding=padding, in_channels=ndf * nf_mult_prev,
use_bias=use_bias))
self.model.add(BatchNorm(momentum=0.1, in_channels=ndf * nf_mult))
self.model.add(Conv2D(channels=1, kernel_size=kernel_size, strides=1,
padding=padding, in_channels=ndf * nf_mult))
if use_sigmoid:
self.model.add(Activation(activation='sigmoid'))

out = self.model(x)
#print(out)
return out
5.5.4 Construct networks, Initialize parameters, Setup Loss Function and Opti-
mizer
We use binary cross entropy and L1 loss as loss functions. L1 loss can be used to capture low frequencies
in images.
In [7]: def param_init(param):
if param.name.find('conv') != -1:
if param.name.find('weight') != -1:
param.initialize(init=mx.init.Normal(0.02), ctx=ctx)
else:
param.initialize(init=mx.init.Zero(), ctx=ctx)
elif param.name.find('batchnorm') != -1:
param.initialize(init=mx.init.Zero(), ctx=ctx)
# Initialize gamma from normal distribution with mean 1 and std 0.02
if param.name.find('gamma') != -1:
param.set_data(nd.random_normal(1, 0.02, param.data().shape))
def network_init(net):

for param in net.collect_params().values():

param_init(param)
def set_network():
# Pixel2pixel networks
netG = UnetGenerator(in_channels=3, num_downs=8)
netD = Discriminator(in_channels=6)
# Initialize parameters
network_init(netG)
network_init(netD)

trainerG = gluon.Trainer(netG.collect_params(), 'adam', {'learning_rate': lr, '
trainerD = gluon.Trainer(netD.collect_params(), 'adam', {'learning_rate': lr, '
return netG, netD, trainerG, trainerD
# Loss
GAN_loss = gluon.loss.SigmoidBinaryCrossEntropyLoss()
L1_loss = gluon.loss.L1Loss()
netG, netD, trainerG, trainerD = set_network()
5.5.5 Image pool for discriminator

We use history image pool to help discriminator memorize history errors instead of just comparing current
real input and fake output.
In [8]: class ImagePool():
def __init__(self, pool_size):
self.pool_size = pool_size
if self.pool_size > 0:
self.num_imgs = 0
self.images = []
def query(self, images):

if self.pool_size == 0:
return images
ret_imgs = []
for i in range(images.shape[0]):
image = nd.expand_dims(images[i], axis=0)
if self.num_imgs < self.pool_size:
self.num_imgs = self.num_imgs + 1
self.images.append(image)
ret_imgs.append(image)
else:
p = nd.random_uniform(0, 1, shape=(1,)).asscalar()
if p > 0.5:
random_id = nd.random_uniform(0, self.pool_size - 1, shape=(1,)
tmp = self.images[random_id].copy()
self.images[random_id] = image
ret_imgs.append(tmp)
else:

ret_imgs.append(image)
ret_imgs = nd.concat(*ret_imgs, dim=0)
return ret_imgs
5.5.6 Training Loop

We recommend to use gpu to boost training. After a few epochs, we can see images silimar to building
structure are generated.
In [9]: from datetime import datetime
import time
import logging
def facc(label, pred):

pred = pred.ravel()
label = label.ravel()
return ((pred > 0.5) == label).mean()
def train():
image_pool = ImagePool(pool_size)
metric = mx.metric.CustomMetric(facc)
stamp = datetime.now().strftime('%Y_%m_%d-%H_%M')
logging.basicConfig(level=logging.DEBUG)

tic = time.time()
btic = time.time()
train_data.reset()
iter = 0
############################
# (1) Update D network: maximize log(D(x, y)) + log(1 - D(x, G(x, z)))
###########################
real_in = batch.data[0].as_in_context(ctx)
real_out = batch.data[1].as_in_context(ctx)
fake_out = netG(real_in)
fake_concat = image_pool.query(nd.concat(real_in, fake_out, dim=1))
# Train with fake image
# Use image pooling to utilize history images
output = netD(fake_concat)
fake_label = nd.zeros(output.shape, ctx=ctx)
errD_fake = GAN_loss(output, fake_label)
metric.update([fake_label,], [output,])
# Train with real image

real_concat = nd.concat(real_in, real_out, dim=1)
output = netD(real_concat)
real_label = nd.ones(output.shape, ctx=ctx)
errD_real = GAN_loss(output, real_label)
errD = (errD_real + errD_fake) * 0.5
errD.backward()

metric.update([real_label,], [output,])
trainerD.step(batch.data[0].shape[0])
############################
# (2) Update G network: maximize log(D(x, G(x, z))) - lambda1 * L1(y, G
###########################
fake_out = netG(real_in)
fake_concat = nd.concat(real_in, fake_out, dim=1)
output = netD(fake_concat)
real_label = nd.ones(output.shape, ctx=ctx)
errG = GAN_loss(output, real_label) + L1_loss(real_out, fake_out) *
errG.backward()
trainerG.step(batch.data[0].shape[0])
# Print log infomation every ten batches

if iter % 10 == 0:
logging.info('speed: {} samples/s'.format(batch_size / (time.time()
logging.info('discriminator loss = %f, generator loss = %f, binary
%(nd.mean(errD).asscalar(),
nd.mean(errG).asscalar(), acc, iter, epoch))
iter = iter + 1
btic = time.time()

metric.reset()
logging.info('\nbinary training acc at epoch %d: %s=%f' % (epoch, name, acc
logging.info('time: %f' % (time.time() - tic))
# Visualize one generated image for each epoch

fake_img = fake_out[0]
visualize(fake_img)
plt.show()
train()
5.5.7 Results
Generate images with generator.
In [10]: def print_result():
num_image = 4
img_in_list, img_out_list = val_data.next().data
img_in = nd.expand_dims(img_in_list[i], axis=0)
visualize(img_in[0])
img_out = netG(img_in.as_in_context(ctx))
visualize(img_out[0])
plt.show()

print_result()
5.5.8 Other dataset experiments

Run experiments on cityscapes and maps datasets
In [11]: datasets = ['cityscapes', 'maps']
is_reversed = False
batch_size = 64
for dataset in datasets:

train_img_path = '%s/train' % (dataset)
val_img_path = '%s/val' % (dataset)
download_data(dataset)
train_data = load_data(train_img_path, batch_size, is_reversed=is_reversed)
val_data = load_data(val_img_path, batch_size, is_reversed=is_reversed)
print("Preview %s training data:" % (dataset))

preview_train_data()
netG, netD, trainerG, trainerD = set_network()

train()
print("Training result for %s" % (dataset))

print_result()
Preview cityscapes training data:

Training result for cityscapes
Preview maps training data:

Training result for maps
5.5.9 Citation
CMP Facades dataset: @INPROCEEDINGS{ Tylecek13, author = {Radim Tyle{ č }ek, Radim { Š }{‘
a}ra}, title = {Spatial Pattern Templates for Recognition of Objects with Regular Structure}, booktitle =
{Proc. GCPR}, year = {2013}, address = {Saarbrucken, Germany}, }
Cityscapes training set: @inproceedings{Cordts2016Cityscapes, title={The Cityscapes Dataset for Seman-
tic Urban Scene Understanding}, author={Cordts, Marius and Omran, Mohamed and Ramos, Sebastian
and Rehfeld, Timo and Enzweiler, Markus and Benenson, Rodrigo and Franke, Uwe and Roth, Stefan and
Schiele, Bernt}, booktitle={Proc. of the IEEE Conference on Computer Vision and Pattern Recognition
(CVPR)}, year={2016} }

5.6 Bayes by Backprop from scratch (NN, classification)

We have already learned how to implement deep neural networks and how to use them for classification
and regression tasks. In order to fight overfitting, we further introduced a concept called dropout, which
randomly turns off a certain percentage of the weights during training.
Recall the classic architecture of a MLP (shown below, without bias terms). So far, when training a neural
network, our goal was to find an optimal point estimate for the weights.
While networks trained using this approach usually perform well in regions with lots of data, they fail to
express uncertainity in regions with little or no data, leading to overconfident decisions. This drawback
motivates the application of Bayesian learning to neural networks, introducing probability distributions over
the weights. These distributions can be of various nature in theory. To make our lifes easier and to have an
intuitive understanding of the distribution at each weight, we will use a Gaussian distribution.
Unfortunately though, exact Bayesian inference on the parameters of a neural network is intractable. One
promising way of addressing this problem is presented by the “Bayes by Backprop” algorithm (introduced
by the “Weight Uncertainity in Neural Networks” paper) which derives a variational approximation to the
true posterior. This algorithm does not only make networks more “honest” with respect to their overall
uncertainity, but also automatically leads to regularization, thereby eliminating the need of using dropout in
this model.
While we will try to explain the most important concepts of this algorithm in this notebook, we also encour-
age the reader to consult the paper for deeper insights.
5.6. Bayes by Backprop from scratch (NN, classification) 379


Let’s start implementing this idea and evaluate its performance on the MNIST classification problem. We
start off with the usual set of imports.
import collections
import mxnet as mx
import numpy as np
For easy tuning and experimentation, we define a dictionary holding the hyper-parameters of our model.
In [2]: config = {
"num_hidden_layers": 2,
"num_hidden_units": 400,
"batch_size": 128,
"epochs": 10,
"learning_rate": 0.001,
"num_samples": 1,
"pi": 0.25,
"sigma_p": 1.0,
"sigma_p1": 0.75,
"sigma_p2": 0.1,
}
Also, we specify the device context for MXNet.

In [3]: ctx = mx.cpu()

We will again train and evaluate the algorithm on the MNIST data set and therefore load the data set as
follows:
In [4]: def transform(data, label):
return data.astype(np.float32)/126.0, label.astype(np.float32)
mnist = mx.test_utils.get_mnist()
num_inputs = 784
num_outputs = 10
batch_size = config['batch_size']

num_train = sum([batch_size for i in train_data])

num_batches = num_train / batch_size
In order to reproduce and compare the results from the paper, we preprocess the pixels by dividing by 126.
5.6.2 Model definition

Activation function
As with lots of past examples, we will again use the ReLU as our activation function for the hidden units of
our neural network.
In [5]: def relu(X):
return nd.maximum(X, nd.zeros_like(X))
Neural net modeling

As our model we are using a straightforward MLP and we are wiring up our network just as we are used to.
In [6]: num_layers = config['num_hidden_layers']
# define function for evaluating MLP

def net(X, layer_params):
layer_input = X
for i in range(len(layer_params) // 2 - 2):
h_linear = nd.dot(layer_input, layer_params[2*i]) + layer_params[2*i + 1]
layer_input = relu(h_linear)
# last layer without ReLU
output = nd.dot(layer_input, layer_params[-2]) + layer_params[-1]
return output
# define network weight shapes

layer_param_shapes = []
num_hidden = config['num_hidden_units']
for i in range(num_layers + 1):
if i == 0: # input layer
W_shape = (num_inputs, num_hidden)
b_shape = (num_hidden,)
elif i == num_layers: # last layer
W_shape = (num_hidden, num_outputs)
b_shape = (num_outputs,)
else: # hidden layers
W_shape = (num_hidden, num_hidden)
b_shape = (num_hidden,)
layer_param_shapes.extend([W_shape, b_shape])
5.6.3 Build objective/loss

As we briefly mentioned at the beginning of the notebook, we will use variational inference in order to make
the prediction of the posterior tractable. While we can not model the posterior 𝑃 (w | 𝒟) directly, we try
to find the parameters 𝜃 of a distribution on the weights 𝑞(w | 𝜃) (commly referred to as the variational
posterior) that minimizes the KL divergence with the true posterior. Formally this can be expressed as:
𝜃* = arg min KL[𝑞(w | 𝜃) || 𝑃 (w | 𝒟]

𝜃
𝑞(w | 𝜃)
∫︁
= arg min 𝑞(w | 𝜃) log 𝑑w
𝜃 𝑃 (w)𝑃 (𝒟 | w)
= arg min KL[𝑞(w | 𝜃) || 𝑃 (w)] − E𝑞(w | 𝜃) [log 𝑃 (𝒟 | w)]
𝜃

The resulting loss function, commonly referred to as either variational free energy or expected lower bound
(ELBO), has to be minimized and is then given as follows:
ℱ(𝒟, 𝜃) = KL[𝑞(w | 𝜃) || 𝑃 (w)] − E𝑞(w | 𝜃) [log 𝑃 (𝒟 | w)]
As one can easily see, the cost function tries to balance the complexity of the data 𝑃 (𝒟 | w) and the
simplicity of the prior 𝑃 (w).
We can approximate this exact cost through a Monte Carlo sampling procedure as follows
𝑛
∑︁
ℱ(𝒟, 𝜃) ≈ log 𝑞(w(𝑖) | 𝜃) − log 𝑃 (w(𝑖) ) − log 𝑃 (𝒟 | w(𝑖) )
𝑖=1
where w(𝑖) corresponds to the 𝑖-th Monte Carlo sample from the variational posterior. While writing this
notebook, we noticed that even taking just one sample leads to good results and we will therefore stick to
just sampling once throughout the notebook.
Since we will be working with mini-batches, the exact loss form on mini-batch 𝑖 we will be using looks as
follows:
1
ℱ(𝒟𝑖 , 𝜃) = KL[log 𝑞(w | 𝜃) || log 𝑃 (w)] − E𝑞(w | 𝜃) [log 𝑃 (𝒟𝑖 | w)]
𝑀
1
≈ (log 𝑞(w(1) | 𝜃) − log 𝑃 (w(1) )) − log 𝑃 (𝒟𝑖 | w(1) )
𝑀
where 𝑀 corresponds to the number of batches, and ℱ(𝒟, 𝜃) = 𝑀
∑︀
𝑖=1 ℱ(𝒟𝑖 , 𝜃)
Let’s now look at each of these single terms individually.
Likelihood
As with lots of past examples, we will again use the softmax to define our likelihood 𝑃 (𝒟𝑖 | w). Revisit the
MLP from scratch notebook for a detailed motivation of this function.
In [7]: def log_softmax_likelihood(yhat_linear, y):
return nd.nansum(y * nd.log_softmax(yhat_linear), axis=0, exclude=True)
Prior
Since we are introducing a Bayesian treatment for the network, we need to define a Prior over the weights.
Gaussian prior
A popular and simple prior is the Gaussian distribution. The prior over the entire weight vector simply
corresponds to the prodcut of the individual Gaussians:
∏︁
𝑃 (w) = 𝒩 (w𝑖 | 0, 𝜎𝑝2 )
𝑖

We can define the Gaussian distribution and our Gaussian prior as seen below. Note that we are ultimately
intersted in the log-prior log 𝑃 (w) and therefore compute the sum of the log-Gaussians.
∑︁
log 𝑃 (w) = log 𝒩 (w𝑖 | 0, 𝜎𝑝2 )
𝑖
In [8]: LOG2PI = np.log(2.0 * np.pi)
def log_gaussian(x, mu, sigma):

return -0.5 * LOG2PI - nd.log(sigma) - (x - mu) ** 2 / (2 * sigma ** 2)
def gaussian_prior(x):
sigma_p = nd.array([config['sigma_p']], ctx=ctx)
return nd.sum(log_gaussian(x, 0., sigma_p))
Scale mixture prior
Instead of a single Gaussian, the paper also suggests the usage of a scale mixture prior for 𝑃 (w) as an
alternative:
∏︁ (︂ )︂
2 2
𝑃 (w) = 𝜋𝒩 (w𝑖 | 0, 𝜎1 ) + (1 − 𝜋)𝒩 (w𝑖 | 0, 𝜎2 )
𝑖
where 𝜋 ∈ [0, 1], 𝜎1 > 𝜎2 and 𝜎2 ≪ 1. Again we are intersted in the log-prior log 𝑃 (w), which can be
expressed as follows:
∑︁ (︂ )︂
2 2
log 𝑃 (w) = log 𝜋𝒩 (w𝑖 | 0, 𝜎1 ) + (1 − 𝜋)𝒩 (w𝑖 | 0, 𝜎2 )
𝑖
In [9]: def gaussian(x, mu, sigma):

scaling = 1.0 / nd.sqrt(2.0 * np.pi * (sigma ** 2))
bell = nd.exp(- (x - mu) ** 2 / (2.0 * sigma ** 2))
return scaling * bell
def scale_mixture_prior(x):
sigma_p1 = nd.array([config['sigma_p1']], ctx=ctx)
sigma_p2 = nd.array([config['sigma_p2']], ctx=ctx)
pi = config['pi']
first_gaussian = pi * gaussian(x, 0., sigma_p1)

second_gaussian = (1 - pi) * gaussian(x, 0., sigma_p2)
return nd.log(first_gaussian + second_gaussian)

Variational Posterior
The last missing piece in the equation is the variational posterior. Again, we choose a Gaussian disribution
for this purpose. The variational posterior on the weights is centered on the mean vector 𝜇 and has variance
𝜎2: ∏︁
𝑞(w | 𝜃) = 𝒩 (w𝑖 | 𝜇, 𝜎 2 )
𝑖
The log-posterior log 𝑞(w | 𝜃) is given by:

∑︁
log 𝑞(w | 𝜃) = log 𝒩 (w𝑖 | 𝜇, 𝜎 2 )
𝑖
Combined Loss
After introducing the data likelihood, the prior, and the variational posterior, we are now able to build our
1
combined loss function: ℱ(𝒟𝑖 , 𝜃) = 𝑀 (log 𝑞(w | 𝜃) − log 𝑃 (w)) − log 𝑃 (𝒟𝑖 | w)
In [10]: def combined_loss(output, label_one_hot, params, mus, sigmas, log_prior, log_likel
# Calculate data likelihood

log_likelihood_sum = nd.sum(log_likelihood(output, label_one_hot))
# Calculate prior
log_prior_sum = sum([nd.sum(log_prior(param)) for param in params])
# Calculate variational posterior

log_var_posterior_sum = sum([nd.sum(log_gaussian(params[i], mus[i], sigmas[i])
# Calculate total loss

return 1.0 / num_batches * (log_var_posterior_sum - log_prior_sum) - log_likel
5.6.4 Optimizer
We use vanilla stochastic gradient descent to optimize the variational parameters. Note that this implements
the updates described in the paper, as the gradient contribution due to the reparametrization trick is au-
tomatically included by taking the gradients of the combined loss function with respect to the variational
parameters.

In order to being able to assess our model performance we define a helper function which evaluates our
accuracy on an ongoing basis.

In [12]: def evaluate_accuracy(data_iterator, net, layer_params):

numerator = 0.
denominator = 0.
output = net(data, layer_params)

We are using a Gaussian distribution for each individual weight as our variational posterior, which means
that we need to store two parameters, mean and variance, for each weight. For the variance we need to
ensure that it is non-negative, which we will do by using the softplus function to express 𝜎 in terms of an
unconstrained parameter 𝜌. While gradient descent will be performed on 𝜃 = (𝜇, 𝜌), the distribution for
each individual weight is represented as 𝑤𝑖 ∼ 𝒩 (𝑤𝑖 | 𝜇𝑖 , 𝜎𝑖 ) with 𝜎𝑖 = softplus(𝜌𝑖 ).
We initialize 𝜇 with a Gaussian around 0 (just as we would initialize standard weights of a neural network).
It is important to initialize 𝜌 (and hence 𝜎) to a small value, otherwise learning might not work properly.
In [13]: weight_scale = .1
rho_offset = -3
# initialize variational parameters; mean and variance for each weight

mus = []
rhos = []
for shape in layer_param_shapes:

mu = nd.random_normal(shape=shape, ctx=ctx, scale=weight_scale)
rho = rho_offset + nd.zeros(shape=shape, ctx=ctx)
mus.append(mu)
rhos.append(rho)
variational_params = mus + rhos
Since these are the parameters we wish to do gradient descent on, we need to allocate space for storing the
gradients.
In [14]: for param in variational_params:
param.attach_grad()
5.6.7 Main training loop

The main training loop is pretty similar to the one we used in the MLP example. The only adaptation we
need to make is to add the weight sampling which is performed during each optimization step. Generating
a set of weights, which will subsequently be used in the neural network and the loss function, is a 3-step
process:
1. Sample 𝜖 ∼ 𝒩 (0, I𝑑 )
In [15]: def sample_epsilons(param_shapes):
epsilons = [nd.random_normal(shape=shape, loc=0., scale=1.0, ctx=ctx) for shap

return epsilons
2. Transform 𝜌 to a postive vector via the softplus function: 𝜎 = softplus(𝜌) = log(1 + exp(𝜌))
In [16]: def softplus(x):
return nd.log(1. + nd.exp(x))
def transform_rhos(rhos):
return [softplus(rho) for rho in rhos]
3. Compute w: w = 𝜇 + 𝜎 ∘ 𝜖, where the ∘ operator represents the element-wise multiplication. This is

the “reparametrization trick” for separating the randomness from the parameters of 𝑞.
In [17]: def transform_gaussian_samples(mus, sigmas, epsilons):
samples = []
for j in range(len(mus)):
samples.append(mus[j] + sigmas[j] * epsilons[j])
return samples
Complete loop
The complete training loop is given below.
In [18]: epochs = config['epochs']
learning_rate = config['learning_rate']
train_acc = []
test_acc = []
# sample epsilons from standard normal
epsilons = sample_epsilons(layer_param_shapes)
# compute softplus for variance

sigmas = transform_rhos(rhos)
# obtain a sample from q(w|theta) by transforming the epsilons

layer_params = transform_gaussian_samples(mus, sigmas, epsilons)
# forward-propagate the batch

output = net(data, layer_params)
# calculate the loss

loss = combined_loss(output, label_one_hot, layer_params, mus, sigmas,
# backpropagate for gradient calculation

loss.backward()
# apply stochastic gradient descent to variational parameters

SGD(variational_params, learning_rate)
# calculate moving loss for monitoring convergence

else (1 - smoothing_constant) * moving_loss + (smoothing_co
test_accuracy = evaluate_accuracy(test_data, net, mus)

train_accuracy = evaluate_accuracy(train_data, net, mus)
train_acc.append(np.asscalar(train_accuracy))
test_acc.append(np.asscalar(test_accuracy))
plt.plot(train_acc)
plt.plot(test_acc)
plt.show()
For demonstration purposes, we can now take a look at one particular weight by plotting its distribution.
In [19]: def show_weight_dist(mean, variance):

sigma = nd.sqrt(variance)
x = np.linspace(mean.asscalar() - 4*sigma.asscalar(), mean.asscalar() + 4*sigm
plt.plot(x, gaussian(nd.array(x, ctx=ctx), mean, sigma).asnumpy())
plt.show()
mu = mus[0][0][0]
var = softplus(rhos[0][0][0]) ** 2
show_weight_dist(mu, var)
Great! We have obtained a fully functional Bayesian neural network. However, the number of weights now
is twice as high as for traditional neural networks. As we will see in the final section of this notebook, we
are able to drastically reduce the number of weights our network uses for prediction with weight pruning.
5.6.8 Weight pruning

To measure the degree of redundancy present in the trained network and to reduce the model’s parameter
count, we now want to examine the effect of setting some of the weights to 0 and evaluate the test accuracy
afterwards. We can achieve this by ordering the weights according to their signal-to-noise-ratio, |𝜇𝜎𝑖𝑖 | , and
setting a certain percentage of the weights with the lowest ratios to 0.
We can calculate the signal-to-noise-ratio as follows:
In [20]: def signal_to_noise_ratio(mus, sigmas):
sign_to_noise = []
sign_to_noise.extend([nd.abs(mus[j]) / sigmas[j]])
return sign_to_noise
We further introduce a few helper methods which turn our list of weights into a single vector containing all
weights. This will make our subsequent actions easier.
In [21]: def vectorize_matrices_in_vector(vec):

for i in range(0, (num_layers + 1) * 2, 2):

if i == 0:
vec[i] = nd.reshape(vec[i], num_inputs * num_hidden)
elif i == num_layers * 2:
vec[i] = nd.reshape(vec[i], num_hidden * num_outputs)
else:
vec[i] = nd.reshape(vec[i], num_hidden * num_hidden)
return vec
def concact_vectors_in_vector(vec):
concat_vec = vec[0]
for i in range(1, len(vec)):
concat_vec = nd.concat(concat_vec, vec[i], dim=0)
return concat_vec
def transform_vector_structure(vec):
vec = vectorize_matrices_in_vector(vec)
vec = concact_vectors_in_vector(vec)
return vec
In addition, we also have a helper method which transforms the pruned weight vector back to the original
layered structure.
import operator
def prod(iterable):
return reduce(operator.mul, iterable, 1)
def restore_weight_structure(vec):
pruned_weights = []
index = 0
for shape in layer_param_shapes:

incr = prod(shape)
pruned_weights.extend([nd.reshape(vec[index : index + incr], shape)])
index += incr
return pruned_weights
The actual pruning of the vector happens in the following function. Note that this function accepts an
ordered list of percentages to evaluate the performance at different pruning rates. In this setting, pruning at
each iteration means extracting the index of the lowest signal-to-noise-ratio weight and setting the weight at
this index to 0.
In [23]: def prune_weights(sign_to_noise_vec, prediction_vector, percentages):
pruning_indices = nd.argsort(sign_to_noise_vec, axis=0)
for percentage in percentages:

prediction_vector = mus_copy_vec.copy()
pruning_indices_percent = pruning_indices[0:int(len(pruning_indices)*perce

for pr_ind in pruning_indices_percent:

prediction_vector[int(pr_ind.asscalar())] = 0
pruned_weights = restore_weight_structure(prediction_vector)
test_accuracy = evaluate_accuracy(test_data, net, pruned_weights)
print("%s --> %s" % (percentage, test_accuracy))
Putting the above functions together:

In [24]: sign_to_noise = signal_to_noise_ratio(mus, sigmas)
sign_to_noise_vec = transform_vector_structure(sign_to_noise)
mus_copy = mus.copy()
mus_copy_vec = transform_vector_structure(mus_copy)
prune_weights(sign_to_noise_vec, mus_copy_vec, [0.1, 0.25, 0.5, 0.75, 0.95, 0.99,

0.1 --> 0.9777
0.25 --> 0.9779
0.5 --> 0.9756
0.75 --> 0.9602
0.95 --> 0.7259
0.99 --> 0.3753
1.0 --> 0.098
Depending on the number of units used in the original network and the number of training epochs, the highest
achievable pruning percentages (without significantly reducing the predictive performance) can vary. The
paper, for example, reports almost no change in the test accuracy when pruning 95% of the weights in
a 2x1200 unit Bayesian neural network, which creates a significantly sparser network, leading to faster
predictions and reduced memory requirements.
5.6.9 Conclusion
We have taken a look at an efficient Bayesian treatment for neural networks using variational inference via
the “Bayes by Backprop” algorithm (introduced by the “Weight Uncertainity in Neural Networks” paper).
We have implemented a stochastic version of the variational lower bound and optimized it in order to find
an approximation to the posterior distribution over the weights of a MLP network on the MNIST data set.
As a result, we achieve regularization on the network’s parameters and can quantify our uncertainty about
the weights accurately. Finally, we saw that it is possible to significantly reduce the number of weights in
the neural network after training while still keeping a high accuracy on the test set.
We also note that, given this model implementation, we were able to reproduce the paper’s results on the
MNIST data set, achieving a comparable test accuracy for all documented instances of the MNIST classifi-
cation problem.
5.7 Bayes by Backprop with gluon (NN, classification)

After discussing Bayes by Backprop from scratch in a previous notebook, we can now look at the corre-
sponding implementation as gluon components.
We start off with the usual set of imports.
5.7. Bayes by Backprop with gluon (NN, classification) 391


import collections
import mxnet as mx
import numpy as np
For easy tuning and experimentation, we define a dictionary holding the hyper-parameters of our model.
In [ ]: config = {
"num_hidden_layers": 2,
"num_hidden_units": 400,
"batch_size": 128,
"epochs": 10,
"learning_rate": 0.001,
"num_samples": 1,
"pi": 0.25,
"sigma_p": 1.0,
"sigma_p1": 0.75,
"sigma_p2": 0.01,
}
Also, we specify the device context for MXNet.

In [ ]: ctx = mx.cpu()

We will again train and evaluate the algorithm on the MNIST data set and therefore load the data set as
follows:
In [ ]: def transform(data, label):
return data.astype(np.float32)/126.0, label.astype(np.float32)
mnist = mx.test_utils.get_mnist()
num_inputs = 784
num_outputs = 10
batch_size = config['batch_size']

num_train = sum([batch_size for i in train_data])

num_batches = num_train / batch_size
In order to reproduce and compare the results from the paper, we preprocess the pixels by dividing by 126.
5.7.2 Model definition

Neural net modeling
As our model we are using a straightforward MLP and we are wiring up our network just as we are used to
in gluon. Note that we are not using any special layers during the definition of our network, as we believe

that Bayes by Backprop should be thought of as a training method, rather than a special architecture.
In [ ]: num_layers = config['num_hidden_layers']
num_hidden = config['num_hidden_units']
for i in range(num_layers):
5.7.3 Build objective/loss

Again, we define our loss function as described in Bayes by Backprop from scratch. Note that we are
bundling all of this functionality as part of a gluon.loss.Loss subclass, where the loss computation is
performed in the hybrid_forward function.
In [ ]: class BBBLoss(gluon.loss.Loss):
def __init__(self, log_prior="gaussian", log_likelihood="softmax_cross_entropy"
sigma_p1=1.0, sigma_p2=0.1, pi=0.5, weight=None, batch_axis=0, **k
super(BBBLoss, self).__init__(weight, batch_axis, **kwargs)
self.log_prior = log_prior
self.log_likelihood = log_likelihood
self.sigma_p1 = sigma_p1
self.sigma_p2 = sigma_p2
self.pi = pi
def log_softmax_likelihood(self, yhat_linear, y):

return nd.nansum(y * nd.log_softmax(yhat_linear), axis=0, exclude=True)
def log_gaussian(self, x, mu, sigma):

return -0.5 * np.log(2.0 * np.pi) - nd.log(sigma) - (x - mu) ** 2 / (2 * si
def gaussian_prior(self, x):

sigma_p = nd.array([self.sigma_p1], ctx=ctx)
return nd.sum(self.log_gaussian(x, 0., sigma_p))
def gaussian(self, x, mu, sigma):

def scale_mixture_prior(self, x):

sigma_p1 = nd.array([self.sigma_p1], ctx=ctx)
sigma_p2 = nd.array([self.sigma_p2], ctx=ctx)
pi = self.pi
first_gaussian = pi * self.gaussian(x, 0., sigma_p1)

second_gaussian = (1 - pi) * self.gaussian(x, 0., sigma_p2)
return nd.log(first_gaussian + second_gaussian)
def hybrid_forward(self, F, output, label, params, mus, sigmas, sample_weight=N

log_likelihood_sum = nd.sum(self.log_softmax_likelihood(output, label))

prior = None
if self.log_prior == "gaussian":
prior = self.gaussian_prior
elif self.log_prior == "scale_mixture":
prior = self.scale_mixture_prior
log_prior_sum = sum([nd.sum(prior(param)) for param in params])
log_var_posterior_sum = sum([nd.sum(self.log_gaussian(params[i], mus[i], si
return 1.0 / num_batches * (log_var_posterior_sum - log_prior_sum) - log_li
bbb_loss = BBBLoss(log_prior="scale_mixture", sigma_p1=config['sigma_p1'], sigma_p2

First, we need to initialize all the network’s parameters, which are only point estimates of the weights at this
point. We will soon see, how we can still train the network in a Bayesian fashion, without interfering with
the network’s architecture.
Then we have to forward-propagate a single data set entry once to set up all network parameters (weights
and biases) with the desired initliaizer specified above.
In [ ]: for i, (data, label) in enumerate(train_data):
net(data)
break
In [ ]: weight_scale = .1
rho_offset = -3
# initialize variational parameters; mean and variance for each weight

mus = []
rhos = []
shapes = list(map(lambda x: x.shape, net.collect_params().values()))
for shape in shapes:

mu = gluon.Parameter('mu', shape=shape, init=mx.init.Normal(weight_scale))
rho = gluon.Parameter('rho',shape=shape, init=mx.init.Constant(rho_offset))
mu.initialize(ctx=ctx)
rho.initialize(ctx=ctx)
mus.append(mu)
rhos.append(rho)
variational_params = mus + rhos
raw_mus = list(map(lambda x: x.data(ctx), mus))

raw_rhos = list(map(lambda x: x.data(ctx), rhos))
5.7.5 Optimizer
Now, we still have to choose the optimizer we wish to use for training. This time, we are using the adam
optimizer.

In [ ]: trainer = gluon.Trainer(variational_params, 'adam', {'learning_rate': config['learn
5.7.6 Main training loop

Sampling
Recall the 3-step process for the variational parameters:
1. Sample 𝜖 ∼ 𝒩 (0, I𝑑 )
In [ ]: def sample_epsilons(param_shapes):
epsilons = [nd.random_normal(shape=shape, loc=0., scale=1.0, ctx=ctx) for shape
return epsilons
2. Transform 𝜌 to a positive vector via the softplus function: 𝜎 = softplus(𝜌) = log(1 + exp(𝜌))
In [ ]: def softplus(x):
return nd.log(1. + nd.exp(x))
def transform_rhos(rhos):
return [softplus(rho) for rho in rhos]
3. Compute w: w = 𝜇 + 𝜎 ∘ 𝜖, where the ∘ operator represents the element-wise multiplication. This is

the “reparametrization trick” for separating the randomness from the parameters of 𝑞.
In [ ]: def transform_gaussian_samples(mus, sigmas, epsilons):
samples = []
samples.append(mus[j] + sigmas[j] * epsilons[j])
return samples
Putting these three steps together we get:

In [ ]: def generate_weight_sample(layer_param_shapes, mus, rhos):
# sample epsilons from standard normal
epsilons = sample_epsilons(layer_param_shapes)
# compute softplus for variance

sigmas = transform_rhos(rhos)
# obtain a sample from q(w|theta) by transforming the epsilons

layer_params = transform_gaussian_samples(mus, sigmas, epsilons)
return layer_params, sigmas
Evaluation metric
In order to being able to assess our model performance we define a helper function which evaluates our
accuracy on an ongoing basis.
In [ ]: def evaluate_accuracy(data_iterator, net, layer_params):
numerator = 0.
denominator = 0.

for l_param, param in zip(layer_params, net.collect_params().values()):

param._data[0] = l_param
output = net(data)
Complete loop
The complete training loop is given below.
In [ ]: epochs = config['epochs']
learning_rate = config['learning_rate']
train_acc = []
test_acc = []
# generate sample
layer_params, sigmas = generate_weight_sample(shapes, raw_mus, raw_rhos
# overwrite network parameters with sampled parameters

for sample, param in zip(layer_params, net.collect_params().values()):
param._data[0] = sample
# forward-propagate the batch

output = net(data)
# calculate the loss

loss = bbb_loss(output, label_one_hot, layer_params, raw_mus, sigmas)
# backpropagate for gradient calculation

loss.backward()
# calculate moving loss for monitoring convergence

test_accuracy = evaluate_accuracy(test_data, net, raw_mus)

train_accuracy = evaluate_accuracy(train_data, net, raw_mus)
train_acc.append(np.asscalar(train_accuracy))
test_acc.append(np.asscalar(test_accuracy))

plt.plot(train_acc)
plt.plot(test_acc)
plt.show()
For demonstration purposes, we can now take a look at one particular weight by plotting its distribution.
In [ ]: def gaussian(x, mu, sigma):
def show_weight_dist(mean, variance):

sigma = nd.sqrt(variance)
x = np.linspace(mean.asscalar() - 4*sigma.asscalar(), mean.asscalar() + 4*sigma
plt.plot(x, gaussian(nd.array(x, ctx=ctx), mean, sigma).asnumpy())
plt.show()
mu = raw_mus[0][0][0]
var = softplus(raw_rhos[0][0][0]) ** 2
show_weight_dist(mu, var)
5.7.7 Weight pruning

To measure the degree of redundancy present in the trained network and to reduce the model’s parameter
count, we now want to examine the effect of setting some of the weights to 0 and evaluate the test accuracy
afterwards. We can achieve this by ordering the weights according to their signal-to-noise-ratio, |𝜇𝜎𝑖𝑖 | , and
setting a certain percentage of the weights with the lowest ratios to 0.
We can calculate the signal-to-noise-ratio as follows:
In [ ]: def signal_to_noise_ratio(mus, sigmas):
sign_to_noise = []
sign_to_noise.extend([nd.abs(mus[j]) / sigmas[j]])
return sign_to_noise
We further introduce a few helper methods which turn our list of weights into a single vector containing all
weights. This will make our subsequent actions easier.
In [ ]: def vectorize_matrices_in_vector(vec):
for i in range(0, (num_layers + 1) * 2, 2):
if i == 0:
vec[i] = nd.reshape(vec[i], num_inputs * num_hidden)
elif i == num_layers * 2:
vec[i] = nd.reshape(vec[i], num_hidden * num_outputs)
else:
vec[i] = nd.reshape(vec[i], num_hidden * num_hidden)
return vec
def concact_vectors_in_vector(vec):

concat_vec = vec[0]
for i in range(1, len(vec)):
concat_vec = nd.concat(concat_vec, vec[i], dim=0)
return concat_vec
def transform_vector_structure(vec):
vec = vectorize_matrices_in_vector(vec)
vec = concact_vectors_in_vector(vec)
return vec
In addition, we also have a helper method which transforms the pruned weight vector back to the original
layered structure.
In [ ]: from functools import reduce
import operator
def prod(iterable):
return reduce(operator.mul, iterable, 1)
def restore_weight_structure(vec):
pruned_weights = []
index = 0
for shape in shapes:

incr = prod(shape)
pruned_weights.extend([nd.reshape(vec[index : index + incr], shape)])
index += incr
return pruned_weights
The actual pruning of the vector happens in the following function. Note that this function accepts an
ordered list of percentages to evaluate the performance at different pruning rates. In this setting, pruning at
each iteration means extracting the index of the lowest signal-to-noise-ratio weight and setting the weight at
this index to 0.
In [ ]: def prune_weights(sign_to_noise_vec, prediction_vector, percentages):
pruning_indices = nd.argsort(sign_to_noise_vec, axis=0)
for percentage in percentages:

prediction_vector = mus_copy_vec.copy()
pruning_indices_percent = pruning_indices[0:int(len(pruning_indices)*percen
for pr_ind in pruning_indices_percent:
prediction_vector[int(pr_ind.asscalar())] = 0
pruned_weights = restore_weight_structure(prediction_vector)
test_accuracy = evaluate_accuracy(test_data, net, pruned_weights)
print("%s --> %s" % (percentage, test_accuracy))
Putting the above function together:

In [ ]: sign_to_noise = signal_to_noise_ratio(raw_mus, sigmas)
sign_to_noise_vec = transform_vector_structure(sign_to_noise)

mus_copy = raw_mus.copy()
mus_copy_vec = transform_vector_structure(mus_copy)
prune_weights(sign_to_noise_vec, mus_copy_vec, [0.1, 0.25, 0.5, 0.75, 0.95, 0.98, 1
Depending on the number of units used in the original network, the highest achievable pruning percent-
ages (without significantly reducing the predictive performance) can vary. The paper, for example, reports
almost no change in the test accuracy when pruning 95% of the weights in a 1200 unit Bayesian neural
network, which creates a significantly sparser network, leading to faster predictions and reduced memory
requirements.
5.7.8 Conclusion
We have taken a look at an efficient Bayesian treatment for neural networks using variational inference via
the “Bayes by Backprop” algorithm (introduced by the “Weight Uncertainity in Neural Networks” paper).
We have implemented a stochastic version of the variational lower bound and optimized it in order to find
an approximation to the posterior distribution over the weights of a MLP network on the MNIST data set.
As a result, we achieve regularization on the network’s parameters and can quantify our uncertainty about
the weights accurately. Finally, we saw that it is possible to significantly reduce the number of weights in
the neural network after training while still keeping a high accuracy on the test set.
We also note that, given this model implementation, we were able to reproduce the paper’s results on the
MNIST data set, achieving a comparable test accuracy for all documented instances of the MNIST classifi-
cation problem.

Gluon Tutorials: Deep Learning - The Straight Dope

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Gluon Tutorials: Deep Learning - The Straight Dope

Uploaded by

Copyright:

Available Formats

Deep Learning - The Straight Dope

Oct 12, 2018

To clone or contribute, visit Deep Learning - The Straight Dope on Github.

4 Chapter 1. How to contribute

$ pip install mxnet --pre --user

More detailed instructions are available here

PART 1: DEEP LEARNING FUNDAMENTALS

3.1.2 Learning by doing

3.1.3 Next steps

8 Chapter 3. Part 1: Deep Learning Fundamentals

3.2.1 A motivating example

10 Chapter 3. Part 1: Deep Learning Fundamentals

cat cat dog dog

3.2.2 The dizzying versatility of machine learning

3.2.3 Basics of machine learning

12 Chapter 3. Part 1: Deep Learning Fundamentals

3.2.4 Supervised learning

14 Chapter 3. Part 1: Deep Learning Fundamentals

Death cap - do not eat!

𝐿(action | 𝑥) = E𝑦∼𝑝(𝑦|𝑥) [loss(action, 𝑦)]

16 Chapter 3. Part 1: Deep Learning Fundamentals

18 Chapter 3. Part 1: Deep Learning Fundamentals

Search and ranking

20 Chapter 3. Part 1: Deep Learning Fundamentals

Tagging and Parsing

Automatic Speech Recognition

German Haben Sie sich schon dieses grossartige Lehrwerk angeschaut?

3.2.5 Unsupervised learning

3.2.6 Interacting with an environment

22 Chapter 3. Part 1: Deep Learning Fundamentals

• remember what we did previously?

24 Chapter 3. Part 1: Deep Learning Fundamentals

MDPs, bandits, and friends

3.2.7 When not to use machine learning

1 2 fizz 4 buzz fizz 7 8 fizz buzz 11 fizz 13 14 fizzbuzz 16 ...

The conventional way to solve such a task is quite simple.

3.3 Manipulate data the MXNet way with ndarray

26 Chapter 3. Part 1: Deep Learning Fundamentals

3.3.1 Getting started

[[ 0.00000000e+00 0.00000000e+00 2.26995938e-20 4.57734143e-41]

Similarly, ndarray has a function to create a matrix of all ones.

3.3. Manipulate data the MXNet way with ndarray 27

In [5]: y = nd.random_normal(0, 1, shape=(3, 4))

We can also grab a matrix’s transpose to compute a proper matrix-matrix product.

28 Chapter 3. Part 1: Deep Learning Fundamentals

[-3.93169522 -1.43098474 2.20789099]]

3.3.3 In-place operations

3.3. Manipulate data the MXNet way with ndarray 29

Now let’s try writing to a specific element.

Multi-dimensional slicing is also supported.

30 Chapter 3. Part 1: Deep Learning Fundamentals

3.3.6 Converting from MXNet NDArray to NumPy

3.3.7 Managing context

3.3. Manipulate data the MXNet way with ndarray 31

32 Chapter 3. Part 1: Deep Learning Fundamentals

3.3.8 Watch out!

3.4 Linear algebra

3.4. Linear algebra 33

34 Chapter 3. Part 1: Deep Learning Fundamentals

3.4.3 Length, dimensionality, and, shape

3.4. Linear algebra 35

36 Chapter 3. Part 1: Deep Learning Fundamentals

3.4. Linear algebra 37

[[ 12. 13. 14. 15.]