Gluon Tutorials: Deep Learning - The Straight Dope
Gluon Tutorials: Deep Learning - The Straight Dope
Release 0.1
MXNet Community
i
ii
Deep Learning - The Straight Dope, Release 0.1
This repo contains an incremental sequence of notebooks designed to teach deep learning, Apache MXNet
(incubating), and the gluon interface. Our goal is to leverage the strengths of Jupyter notebooks to present
prose, graphics, equations, and code together in one place. If we’re successful, the result will be a resource
that could be simultaneously a book, course material, a prop for live tutorials, and a resource for plagiarising
(with our blessing) useful code. To our knowledge there’s no source out there that teaches either (1) the full
breadth of concepts in modern deep learning or (2) interleaves an engaging textbook with runnable code.
We’ll find out by the end of this venture whether or not that void exists for a good reason.
Another unique aspect of this book is its authorship process. We are developing this resource fully in the
public view and are making it available for free in its entirety. While the book has a few primary authors
to set the tone and shape the content, we welcome contributions from the community and hope to coauthor
chapters and entire sections with experts and community members. Already we’ve received contributions
spanning typo corrections through full working examples.
CRASH COURSE 1
Deep Learning - The Straight Dope, Release 0.1
2 CRASH COURSE
CHAPTER
ONE
HOW TO CONTRIBUTE
3
Deep Learning - The Straight Dope, Release 0.1
TWO
DEPENDENCIES
To run these notebooks, a recent version of MXNet is required. The easiest way is to install the nightly build
MXNet through pip. E.g.:
5
Deep Learning - The Straight Dope, Release 0.1
6 Chapter 2. Dependencies
CHAPTER
THREE
3.1 Preface
If you’re a reasonable person, you might ask, “what is mxnet-the-straight-dope?” You might also ask, “why
does it have such an ostentatious name?” Speaking to the former question, mxnet-the-straight-dope is an
attempt to create a new kind of educational resource for deep learning. Our goal is to leverage the strengths
of Jupyter notebooks to present prose, graphics, equations, and (importantly) code together in one place. If
we’re successful, the result will be a resource that could be simultaneously a book, course material, a prop
for live tutorials, and a resource for plagiarising (with our blessing) useful code. To our knowledge, few
available resources aim to teach either (1) the full breadth of concepts in modern machine learning or (2)
interleave an engaging textbook with runnable code. We’ll find out by the end of this venture whether or not
that void exists for a good reason.
Regarding the name, we are cognizant that the machine learning community and the ecosystem in which we
operate have lurched into an absurd place. In the early 2000s, comparatively few tasks in machine learning
had been conquered, but we felt that we understood how and why those models worked (with some caveats).
By contrast, today’s machine learning systems are extremely powerful and actually work for a growing list
of tasks, but huge open questions remain as to precisely why they are so effective.
This new world offers enormous opportunity, but has also given rise to considerable buffoonery. Research
preprints like the arXiv are flooded by clickbait, AI startups have sometimes received overly optimistic
valuations, and the blogosphere is flooded with thought leadership pieces written by marketers bereft of any
technical knowledge. Amid the chaos, easy money, and lax standards, we believe it’s important not to take
our models or the environment in which they are worshipped too seriously. Also, in order to both explain,
visualize, and code the full breadth of models that we aim to address, it’s important that the authors do not
get bored while writing.
3.1.1 Organization
At present, we’re aiming for the following format: aside from a few (optional) notebooks providing a crash
course in the basic mathematical background, each subsequent notebook will both:
1. Introduce a reasonable number (perhaps one) of new concepts
2. Provide a single self-contained working example, using a real dataset
This presents an organizational challenge. Some models might logically be grouped together in a single
notebook. And some ideas might be best taught by executing several models in succession. On the other
hand, there’s a big advantage to adhering to a policy of 1 working example, 1 notebook: This makes it as
7
Deep Learning - The Straight Dope, Release 0.1
easy as possible for you to start your own research projects by plagiarising our code. Just copy a single
notebook and start modifying it.
We will interleave the runnable code with background material as needed. In general, we will often err on
the side of making tools available before explaining them fully (and we will follow up by explaining the
background later). For instance, we might use stochastic gradient descent before fully explaining why it is
useful or why it works. This helps to give practitioners the necessary ammunition to solve problems quickly,
at the expense of requiring the reader to trust us with some decisions, at least in the short term. Throughout,
we’ll be working with the MXNet library, which has the rare property of being flexible enough for research
while being fast enough for production. Our more advanced chapters will mostly rely on MXNet’s new high-
level imperative interface gluon. Note that this is not the same as mxnet.module, an older, symbolic
interface supported by MXNet.
This book will teach deep learning concepts from scratch. Sometimes, we’ll want to delve into fine details
about the models that are hidden from the user by gluon’s advanced features. This comes up especially
in the basic tutorials, where we’ll want you to understand everything that happens in a given layer. In
these cases, we’ll generally present two versions of the example: one where we implement everything from
scratch, relying only on NDArray and automatic differentiation, and another where we show how to do
things succinctly with gluon. Once we’ve taught you how a layer works, we can just use the gluon
version in subsequent tutorials.
3.2 Introduction
Before we could begin writing, the authors of this book, like much of the work force, had to become
caffeinated. We hopped in the car and started driving. Having an Android, Alex called out “Okay Google”,
awakening the phone’s voice recognition system. Then Mu commanded “directions to Blue Bottle coffee
shop”. The phone quickly displayed the transcription of his command. It also recognized that we were
asking for directions and launched the Maps application to fulfill our request. Once launched, the Maps app
identified a number of routes. Next to each route, the phone displayed a predicted transit time. While we
fabricated this story for pedagogical convenience, it demonstrates that in the span of just a few seconds, our
everyday interactions with a smartphone can engage several machine learning models.
If you’ve never worked with machine learning before, you might be wondering what the hell we’re talking
about. You might ask, “isn’t that just programming?” or “what does machine learning even mean?” First, to
be clear, we implement all machine learning algorithms by writing computer programs. Indeed, we use the
same languages and hardware as other fields of computer science, but not all computer programs involve
machine learning. In response to the second question, precisely defining a field of study as vast as machine
learning is hard. It’s a bit like answering, “what is math?”. But we’ll try to give you enough intuition to get
started.
Here’s the trick. Often, even when we don’t know how to tell a computer explicitly how to map from inputs
to outputs, we are nonetheless capable of performing the cognitive feat ourselves. In other words, even
if you don’t know how to program a computer to recognize the word “Alexa”, you yourself are able to
recognize the word “Alexa”. Armed with this ability, we can collect a huge data set containing examples of
audio and label those that do and that do not contain the wake word. In the machine learning approach, we
do not design a system explicitly to recognize wake words right away. Instead, we define a flexible program
with a number of parameters. These are knobs that we can tune to change the behavior of the program. We
call this program a model. Generally, our model is just a machine that transforms its input into some output.
In this case, the model receives as input a snippet of audio, and it generates as output an answer {yes,
3.2. Introduction 9
Deep Learning - The Straight Dope, Release 0.1
no}, which we hope reflects whether (or not) the snippet contains the wake word.
If we choose the right kind of model, then there should exist one setting of the knobs such that the model
fires yes every time it hears the word “Alexa”. There should also be another setting of the knobs that might
fire yes on the word “Apricot”. We expect that the same model should apply to “Alexa” recognition and
“Apricot” recognition because these are similar tasks. However, we might need a different model to deal
with fundamentally different inputs or outputs. For example, we might choose a different sort of machine to
map from images to captions, or from English sentences to Chinese sentences.
As you might guess, if we just set the knobs randomly, the model will probably recognize neither “Alexa”,
“Apricot”, nor any other English word. Generally, in deep learning, the learning refers precisely to updating
the model’s behavior (by twisting the knobs) over the course of a training period.
The training process usually looks like this:
1. Start off with a randomly initialized model that can’t do anything useful.
2. Grab some of your labeled data (e.g. audio snippets and corresponding {yes,no} labels)
3. Tweak the knobs so the model sucks less with respect to those examples
4. Repeat until the model is awesome.
To summarize, rather than code up a wake word recognizer, we code up a program that can learn to recognize
wake words, if we present it with a large labeled dataset. You can think of this act of determining a program’s
behavior by presenting it with a dataset as programming with data.
We can ‘program’ a cat detector by providing our machine learning system with many examples of cats and
dogs, such as the images below:
This way the detector will eventually learn to emit a very large positive number if it’s a cat, a very large
negative number if it’s a dog, and something closer to zero if it isn’t sure, but this is just barely scratching
the surface of what machine learning can do.
3.2. Introduction 11
Deep Learning - The Straight Dope, Release 0.1
Data
Generally, the more data we have, the easier our job becomes. When we have more data, we can train more
powerful models. Data is at the heart of the resurgence of deep learning and many of most exciting models
in deep learning don’t work without large data sets. Here are some examples of the kinds of data machine
learning practitioners often engage with:
• Images: Pictures taken by smartphones or harvested from the web, satellite images, photographs of
medical conditions, ultrasounds, and radiologic images like CT scans and MRIs, etc.
• Text: Emails, high school essays, tweets, news articles, doctor’s notes, books, and corpora of trans-
lated sentences, etc.
• Audio: Voice commands sent to smart devices like Amazon Echo, or iPhone or Android phones,
audio books, phone calls, music recordings, etc.
• Video: Television programs and movies, YouTube videos, cell phone footage, home surveillance,
multi-camera tracking, etc.
• Structured data: Webpages, electronic medical records, car rental records, electricity bills, etc.
Models
Usually the data looks quite different from what we want to accomplish with it. For example, we might have
photos of people and want to know whether they appear to be happy. We might desire a model capable of
ingesting a high-resolution image and outputting a happiness score. While some simple problems might be
addressable with simple models, we’re asking a lot in this case. To do its job, our happiness detector needs
to transform hundreds of thousands of low-level features (pixel values) into something quite abstract on the
other end (happiness scores). Choosing the right model is hard, and different models are better suited to
different datasets. In this book, we’ll be focusing mostly on deep neural networks. These models consist
of many successive transformations of the data that are chained together top to bottom, thus the name deep
learning. On our way to discussing deep nets, we’ll also discuss some simpler, shallower models.
Loss functions
To assess how well we’re doing we need to compare the output from the model with the truth. Loss functions
give us a way of measuring how bad our output is. For example, say we trained a model to infer a patient’s
heart rate from images. If the model predicted that a patient’s heart rate was 100bpm, when the ground truth
was actually 60bpm, we need a way to communicate to the model that it’s doing a lousy job.
Similarly if the model was assigning scores to emails indicating the probability that they are spam, we’d
need a way of telling the model when its predictions are bad. Typically the learning part of machine learning
consists of minimizing this loss function. Usually, models have many parameters. The best values of these
parameters is what we need to ‘learn’, typically by minimizing the loss incurred on a training data of
observed data. Unfortunately, doing well on the training data doesn’t guarantee that we will do well on
(unseen) test data, so we’ll want to keep track of two quantities.
• Training Error: This is the error on the dataset used to train our model by minimizing the loss on
the training set. This is equivalent to doing well on all the practice exams that a student might use
to prepare for the real exam. The results are encouraging, but by no means guarantee success on the
final exam.
• Test Error: This is the error incurred on an unseen test set. This can deviate quite a bit from the
training error. This condition, when a model fails to generalize to unseen data, is called overfitting. In
real-life terms, this is the equivalent of screwing up the real exam despite doing well on the practice
exams.
Optimization algorithms
Finally, to minimize the loss, we’ll need some way of taking the model and its loss functions, and searching
for a set of parameters that minimizes the loss. The most popular optimization algorithms for work on neural
networks follow an approach called gradient descent. In short, they look to see, for each parameter which
way the training set loss would move if you jiggled the parameter a little bit. They then update the parameter
in the direction that reduces the loss.
In the following sections, we will discuss a few types of machine learning in some more detail. We begin
with a list of objectives, i.e. a list of things that machine learning can do. Note that the objectives are
complemented with a set of techniques of how to accomplish them, i.e. training, types of data, etc. The list
below is really only sufficient to whet the readers’ appetite and to give us a common language when we talk
about problems. We will introduce a larger number of such problems as we go along.
3.2. Introduction 13
Deep Learning - The Straight Dope, Release 0.1
desired outputs) comprise the training set. We feed the training dataset into a supervised learning algorithm.
So here the supervised learning algorithm is a function that takes as input a dataset, and outputs another
function, the learned model. Then, given a learned model, we can take a new previously unseen input, and
predict the corresponding label.
Regression
Perhaps the simplest supervised learning task to wrap your head around is Regression. Consider, for ex-
ample a set of data harvested from a database of home sales. We might construct a table, where each row
corresponds to a different house, and each column corresponds to some relevant attribute, such as the square
footage of a house, the number of bedrooms, the number of bathrooms, and the number of minutes (walking)
to the center of town. Formally, we call one row in this dataset a feature vector, and the object (e.g. a house)
it’s associated with an example.
If you live in New York or San Francisco, and you are not the CEO of Amazon, Google, Microsoft, or
Facebook, the (sq. footage, no. of bedrooms, no. of bathrooms, walking distance) feature vector for your
home might look something like: [100, 0, .5, 60]. However, if you live in Pittsburgh, it might look more like
[3000, 4, 3, 10]. Feature vectors like this are essential for all the classic machine learning problems. We’ll
typically denote the feature vector for any one example xi and the set of feature vectors for all our examples
𝑋.
What makes a problem regression is actually the outputs. Say that you’re in the market for a new home,
you might want to estimate the fair market value of a house, given some features like these. The target
value, the price of sale, is a real number. We denote any individual target 𝑦𝑖 (corresponding to example xi )
and the set of all targets y (corresponding to all examples X). When our targets take on arbitrary real values
in some range, we call this a regression problem. The goal of our model is to produce predictions (guesses
of the price, in our example) that closely approximate the actual target values.
We denote these predictions 𝑦ˆ𝑖 and if the notation seems unfamiliar, then just ignore it for now. We’ll
unpack it more thoroughly in the subsequent chapters.
Lots of practical problems are well-described regression problems. Predicting the rating that a user will
assign to a movie is a regression problem, and if you designed a great algorithm to accomplish this feat
in 2009, you might have won the $1 million Netflix prize. Predicting the length of stay for patients in the
hospital is also a regression problem. A good rule of thumb is that any How much? or How many? problem
should suggest regression. * “How many hours will this surgery take?”. . . regression * “How many dogs
are in this photo?” . . . regression. However, if you can easily pose your problem as “Is this a ___?”, then
it’s likely, classification, a different fundamental problem type that we’ll cover next.
Even if you’ve never worked with machine learning before, you’ve probably worked through a regression
problem informally. Imagine, for example, that you had your drains repaired and that your contractor, spent
𝑥1 = 3 hours removing gunk from your sewage pipes. Then she sent you a bill of 𝑦1 = $350. Now imagine
that your friend hired the same contractor for 𝑥2 = 2 hours and that she received a bill of 𝑦2 = $250. If
someone then asked you how much to expect on their upcoming gunk-removal invoice you might make
some reasonable assumptions, such as more hours worked costs more dollars. You might also assume that
there’s some base charge and that the contractor then charges per hour. If these assumptions held, then given
these two data points, you could already identify the contractor’s pricing structure: $100 per hour plus $50
to show up at your house. If you followed that much then you already understand the high-level idea behind
linear regression.
In this case, we could produce the parameters that exactly matched the contractor’s prices. Sometimes that’s
not possible, e.g., if some of the variance owes to some factors besides your two features. In these cases,
we’ll try to learn models that minimize the distance between our predictions and the observed∑︀ values. In most
of our chapters, we’ll focus on∑︀one of two very common losses, the L1 loss where 𝑙(𝑦, 𝑦 ′ ) = 𝑖 |𝑦𝑖 −𝑦𝑖′ | and
the L2 loss where 𝑙(𝑦, 𝑦 ′ ) = 𝑖 (𝑦𝑖 − 𝑦𝑖′ )2 . As we will see later, the 𝐿2 loss corresponds to the assumption
that our data was corrupted by Gaussian noise, whereas the 𝐿1 loss corresponds to an assumption of noise
from a Laplace distribution.
Classification
While regression models are great for addressing how many? questions, lots of problems don’t bend com-
fortably to this template. For example, a bank wants to add check scanning to their mobile app. This would
involve the customer snapping a photo of a check with their smartphone’s camera and the machine learning
model would need to be able to automatically understand text seen in the image. It would also need to
understand hand-written text to be even more robust. This kind of system is referred to as optical character
recognition (OCR), and the kind of problem it solves is called a classification. It’s treated with a distinct set
of algorithms than those that are used for regression.
In classification, we want to look at a feature vector, like the pixel values in an image, and then predict which
category (formally called classes), among some set of options, an example belongs. For hand-written digits,
we might have 10 classes, corresponding to the digits 0 through 9. The simplest form of classification is
when there are only two classes, a problem which we call binary classification. For example, our dataset 𝑋
could consist of images of animals and our labels 𝑌 might be the classes {cat, dog}. While in regression,
we sought a regressor to output a real value 𝑦ˆ, in classification, we seek a classifier, whose output 𝑦ˆ is the
predicted class assignment.
For reasons that we’ll get into as the book gets more technical, it’s pretty hard to optimize a model that can
3.2. Introduction 15
Deep Learning - The Straight Dope, Release 0.1
only output a hard categorical assignment, e.g. either cat or dog. It’s a lot easier instead to express the model
in the language of probabilities. Given an example 𝑥, the model assigns a probability 𝑦ˆ𝑘 to each label 𝑘.
Because these are probabilities, they need to be positive numbers and add up to 1. This means that we only
need 𝐾 − 1 numbers to give the probabilities of 𝐾 categories. This is easy to see for binary classification. If
there’s a 0.6 (60%) probability that an unfair coin comes up heads, then there’s a 0.4 (40%) probability that
it comes up tails. Returning to our animal classification example, a classifier might see an image and output
the probability that the image is a cat Pr(𝑦 = cat | 𝑥) = 0.9. We can interpret this number by saying that
the classifier is 90% sure that the image depicts a cat. The magnitude of the probability for the predicted
class is one notion of confidence. It’s not the only notion of confidence and we’ll discuss different notions
of uncertainty in more advanced chapters.
When we have more than two possible classes, we call the problem multiclass classification. Common ex-
amples include hand-written character recognition [0, 1, 2, 3 ... 9, a, b, c, ...]. While
we attacked regression problems by trying to minimize the L1 or L2 loss functions, the common loss func-
tion for classification problems is called cross-entropy. In MXNet Gluon, the corresponding loss function
can be found here.
Note that the most likely class is not necessarily the one that you’re going to use for your decision. Assume
that you find this beautiful mushroom in your backyard:
Now, assume that you built a classifier and trained it to predict if a mushroom is poisonous based on a
photograph. Say our poison-detection classifier outputs Pr(𝑦 = deathcap | image) = 0.2. In other words,
the classifier is 80% confident that our mushroom is not a death cap. Still, you’d have to be a fool to eat it.
That’s because the certain benefit of a delicious dinner isn’t worth a 20% chance of dying from it. In other
words, the effect of the uncertain risk by far outweighs the benefit. Let’s look at this in math. Basically,
we need to compute the expected risk that we incur, i.e. we need to multiply the probability of the outcome
with the benefit (or harm) associated with it:
Hence, the loss 𝐿 incurred by eating the mushroom is 𝐿(𝑎 = eat | 𝑥) = 0.2 * ∞ + 0.8 * 0 = ∞, whereas
the cost of discarding it is 𝐿(𝑎 = discard | 𝑥) = 0.2 * 0 + 0.8 * 1 = 0.8.
We got lucky: as any mycologist would tell us, the above actually is a death cap. Classification can get much
more complicated than just binary, multiclass, of even multi-label classification. For instance, there are some
variants of classification for addressing hierarchies. Hierarchies assume that there exist some relationships
among the many classes. So not all errors are equal - we prefer to misclassify to a related class than to a
distant class. Usually, this is referred to as hierarchical classification. One early example is due to Linnaeus,
who organized the animals in a hierarchy..
In the case of animal classification, it might not be so bad to mistake a poodle for a schnauzer, but our
model would pay a huge penalty if it confused a poodle for a dinosaur. What hierarchy is relevant might
depend on how you plan to use the model. For example, rattle snakes and garter snakes might be close on
the phylogenetic tree, but mistaking a rattler for a garter could be deadly.
Tagging
Some classification problems don’t fit neatly into the binary or multiclass classification setups. For example,
we could train a normal binary classifier to distinguish cats from dogs. Given the current state of computer
vision, we can do this easily, with off-the-shelf tools. Nonetheless, no matter how accurate our model gets,
we might find ourselves in trouble when the classifier encounters an image like this:
As you can see, there’s a cat in the picture. There is also a dog, a tire, some grass, a door, concrete, rust,
individual grass leaves, etc. Depending on what we want to do with our model ultimately, treating this as a
binary classification problem might not make a lot of sense. Instead, we might want to give the model the
option of saying the image depicts a cat and a dog, or neither a cat nor a dog.
The problem of learning to predict classes that are not mutually exclusive is called multi-label classifica-
tion. Auto-tagging problems are typically best described as multi-label classification problems. Think of
3.2. Introduction 17
Deep Learning - The Straight Dope, Release 0.1
the tags people might apply to posts on a tech blog, e.g., “machine learning”, “technology”, “gadgets”, “pro-
gramming languages”, “linux”, “cloud computing”, “AWS”. A typical article might have 5-10 tags applied
because these concepts are correlated. Posts about “cloud computing” are likely to mention “AWS” and
posts about “machine learning” could also deal with “programming languages”.
We also have to deal with this kind of problem when dealing with the biomedical literature, where correctly
tagging articles is important because it allows researchers to do exhaustive reviews of the literature. At the
National Library of Medicine, a number of professional annotators go over each article that gets indexed
in PubMed to associate each with the relevant terms from MeSH, a collection of roughly 28k tags. This is
a time-consuming process and the annotators typically have a one year lag between archiving and tagging.
Machine learning can be used here to provide provisional tags until each article can have a proper manual
review. Indeed, for several years, the BioASQ organization has hosted a competition to do precisely this.
Recommender systems
Recommender systems are another problem setting that is related to search and ranking. The problems
are similar insofar as the goal is to display a set of relevant items to the user. The main difference is the
emphasis on personalization to specific users in the context of recommender systems. For instance, for
movie recommendations, the results page for a SciFi fan and the results page for a connoisseur of Woody
Allen comedies might differ significantly.
Such problems occur, e.g. for movie, product or music recommendation. In some cases, customers will
provide explicit details about how much they liked the product (e.g. Amazon product reviews). In some
other cases, they might simply provide feedback if they are dissatisfied with the result (skipping titles on a
playlist). Generally, such systems strive to estimate some score 𝑦𝑖𝑗 , such as an estimated rating or probability
of purchase, given a user 𝑢𝑖 and product 𝑝𝑗 .
Given such a model, then for any given user, we could retrieve the set of objects with the largest scores 𝑦𝑖𝑗
are then used as a recommendation. Production systems are considerably more advanced and take detailed
user activity and item characteristics into account when computing such scores. The following image is an
example of deep learning books recommended by Amazon based on personalization algorithms tuned to the
author’s preferences.
3.2. Introduction 19
Deep Learning - The Straight Dope, Release 0.1
Sequence Learning
So far we’ve looked at problems where we have some fixed number of inputs and produce a fixed number of
outputs. Before we considered predicting home prices from a fixed set of features: square footage, number
of bedrooms, number of bathrooms, walking time to downtown. We also discussed mapping from an image
(of fixed dimension), to the predicted probabilities that it belongs to each of a fixed number of classes, or
taking a user ID and a product ID, and predicting a star rating. In these cases, once we feed our fixed-length
input into the model to generate an output, the model immediately forgets what it just saw.
This might be fine if our inputs truly all have the same dimensions and if successive inputs truly have nothing
to do with each other. But how would we deal with video snippets? In this case, each snippet might consist
of a different number of frames. And our guess of what’s going on in each frame might be much stronger if
we take into account the previous or succeeding frames. Same goes for language. One popular deep learning
problem is machine translation: the task of ingesting sentences in some source language and predicting their
translation in another language.
These problems also occur in medicine. We might want a model to monitor patients in the intensive care
unit and to fire off alerts if their risk of death in the next 24 hours exceeds some threshold. We definitely
wouldn’t want this model to throw away everything it knows about the patient history each hour, and just
make its predictions based on the most recent measurements.
These problems are among the more exciting applications of machine learning and they are instances of
sequence learning. They require a model to either ingest sequences of inputs or to emit sequences of
outputs (or both!). These latter problems are sometimes referred to as seq2seq problems. Language
translation is a seq2seq problem. Transcribing text from spoken speech is also a seq2seq problem.
While it is impossible to consider all types of sequence transformations, a number of special cases are worth
mentioning:
This involves annotating a text sequence with attributes. In other words, the number of inputs and outputs is
essentially the same. For instance, we might want to know where the verbs and subjects are. Alternatively,
we might want to know which words are the named entities. In general, the goal is to decompose and
annotate text based on structural and grammatical assumptions to get some annotation. This sounds more
complex than it actually is. Below is a very simple example of annotating a sentence with tags indicating
which words refer to named entities.
Tom
Ent
With speech recognition, the input sequence 𝑥 is the sound of a speaker, and the output 𝑦 is the textual
transcript of what the speaker said. The challenge is that there are many more audio frames (sound is
typically sampled at 8kHz or 16kHz) than text, i.e. there is no 1:1 correspondence between audio and text,
since thousands of samples correspond to a single spoken word. These are seq2seq problems where the
output is much shorter than the input.
----D----e----e-----p------- L----ea------r------ni-----ng---
Text to Speech
Text to Speech (TTS) is the inverse of speech recognition. In other words, the input 𝑥 is text and the output
𝑦 is an audio file. In this case, the output is much longer than the input. While it is easy for humans to
recognize a bad audio file, this isn’t quite so trivial for computers.
Machine Translation
Unlike the case of speech of recognition, where corresponding inputs and outputs occur in the same order
(after alignment), in machine translation, order inversion can be vital. In other words, while we are still
converting one sequence into another, neither the number of inputs and outputs nor the order of correspond-
ing data points are assumed to be the same. Consider the following illustrative example of the obnoxious
tendency of Germans (Alex writing here) to place the verbs at the end of sentences.
3.2. Introduction 21
Deep Learning - The Straight Dope, Release 0.1
A number of related problems exist. For instance, determining the order in which a user reads a webpage
is a two-dimensional layout analysis problem. Likewise, for dialogue problems, we need to take world-
knowledge and prior state into account. This is an active area of research.
takes place after the algorithm is disconnected from the environment, this is called offline learning. For
supervised learning, the process looks like this:
This simplicity of offline learning has its charms. The upside is we can worry about pattern recognition
in isolation without these other problems to deal with, but the downside is that the problem formulation
is quite limiting. If you are more ambitious, or if you grew up reading Asimov’s Robot Series, then you
might imagine artificially intelligent bots capable not only of making predictions, but of taking actions in
the world. We want to think about intelligent agents, not just predictive models. That means we need to
think about choosing actions, not just making predictions. Moreover, unlike predictions, actions actually
impact the environment. If we want to train an intelligent agent, we must account for the way its actions
might impact the future observations of the agent.
Considering the interaction with an environment that opens a whole set of new modeling questions. Does
the environment:
3.2. Introduction 23
Deep Learning - The Straight Dope, Release 0.1
Reinforcement learning
If you’re interested in using machine learning to develop an agent that interacts with an environment and
takes actions, then you’re probably going to wind up focusing on reinforcement learning (RL). This might
include applications to robotics, to dialogue systems, and even to developing AI for video games. Deep re-
inforcement learning (DRL), which applies deep neural networks to RL problems, has surged in popularity.
The breakthrough deep Q-network that beat humans at Atari games using only the visual input , and the
AlphaGo program that dethroned the world champion at the board game Go are two prominent examples.
Reinforcement learning gives a very general statement of a problem, in which an agent interacts with an
environment over a series of time steps. At each time step 𝑡, the agent receives some observation 𝑜𝑡 from
the environment, and must choose an action 𝑎𝑡 which is then transmitted back to the environment. Finally,
the agent receives a reward 𝑟𝑡 from the environment. The agent then receives a subseqeunt observation,
and chooses a subsequent action, and so on. The behavior of an RL agent is governed by a policy. In
short, a policy is just a function that maps from observations (of the environment) to actions. The goal of
reinforcement learning is to produce a good policy.
It’s hard to overstate the generality of the RL framework. For example, we can cast any supervised learning
problem as an RL problem. Say we had a classification problem. We could create an RL agent with one
action corresponding to each class. We could then create an environment which gave a reward that was
exactly equal to the loss function from the original supervised problem.
That being said, RL can also address many problems that supervised learning cannot. For example, in
supervised learning we always expect that the training input comes associated with the correct label. But in
RL, we don’t assume that for each observation, the environment tells us the optimal action. In general, we
just get some reward. Moreover, the environment may not even tell us which actions led to the reward.
Consider for example the game of chess. The only real reward signal comes at the end of the game when we
either win, which we might assign a reward of 1, or when we lose, which we could assign a reward of -1. So
reinforcement learners must deal with the credit assignment problem. The same goes for an employee who
gets a promotion on October 11. That promotion likely reflects a large number of well-chosen actions over
the previous year. Getting more promotions in the future requires figuring out what actions along the way
led to the promotion.
Reinforcement learners may also have to deal with the problem of partial observability. That is, the current
observation might not tell you everything about your current state. Say a cleaning robot found itself trapped
in one of many identical closets in a house. Inferring the precise location (and thus state) of the robot might
require considering its previous observerations before entering the closet.
Finally, at any given point, reinforcement learners might know of one good policy, but there might be many
other better policies that the agent has never tried. The reinforcement learner must constantly choose whether
to exploit the best currently-known strategy as a policy, or to explore the space of strategies, potentially
giving up some short-run reward in exchange for knowledge.
3.2. Introduction 25
Deep Learning - The Straight Dope, Release 0.1
if i % 15 == 0:
res.append('fizzbuzz')
elif i % 3 == 0:
res.append('fizz')
elif i % 5 == 0:
res.append('buzz')
else:
res.append(str(i))
print(' '.join(res))
1 2 fizz 4 buzz fizz 7 8 fizz buzz 11 fizz 13 14 fizzbuzz 16 17 fizz 19 buzz fizz 22 23 fiz
This isn’t very exciting if you’re a good programmer. Joel proceeded to ‘implement’ this problem in Machine
Learning instead. For that to succeed, he needed a number of pieces:
• Data X [1, 2, 3, 4, ...] and labels Y ['fizz', 'buzz', 'fizzbuzz',
identity]
• Training data, i.e. examples of what the system is supposed to do. Such as [(2, 2), (6,
fizz), (15, fizzbuzz), (23, 23), (40, buzz)]
• Features that map the data into something that the computer can handle more easily, e.g. x -> [(x
% 3), (x % 5), (x % 15)]. This is optional but helps a lot if you have it.
Armed with this, Joel wrote a classifier in TensorFlow (code). The interviewer was nonplussed . . . and the
classifier didn’t have perfect accuracy.
Quite obviously, this is silly. Why would you go through the trouble of replacing a few lines of Python
with something much more complicated and error prone? However, there are many cases where a simple
Python script simply does not exist, yet a 3-year-old child will solve the problem perfectly. Fortunately, this
is precisely where machine learning comes to the rescue.
3.2.8 Conclusion
Machine Learning is vast. We cannot possibly cover it all. On the other hand, neural networks are simple
and only require elementary mathematics. So let’s get started.
3.2.9 Next
Manipulate data the MXNet way with NDArray
For whinges or inquiries, open an issue on GitHub.
computation on CPU, GPU, and distributed cloud architectures. Second, they provide support for automatic
differentiation. These properties make NDArray an ideal library for machine learning, both for researchers
and engineers launching production systems.
Next, let’s see how to create an NDArray, without any values initialized. Specifically, we’ll create a 2D array
(also called a matrix) with 3 rows and 4 columns.
In [2]: x = nd.empty((3, 4))
print(x)
The empty method just grabs some memory and hands us back a matrix without setting the values of any of
its entries. This means that the entries can have any form of values, including very big ones! But typically,
we’ll want our matrices initialized. Commonly, we want a matrix of all zeros.
In [3]: x = nd.zeros((3, 5))
x
Out[3]:
[[ 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0.]]
<NDArray 3x5 @cpu(0)>
Often, we’ll want to create arrays whose values are sampled randomly. This is especially common when we
intend to use the array as a parameter in a neural network. In this snippet, we initialize with values drawn
from a standard normal distribution with zero mean and unit variance.
As in NumPy, the dimensions of each NDArray are accessible via the .shape attribute.
In [6]: y.shape
Out[6]: (3, 4)
We can also query its size, which is equal to the product of the components of the shape. Together with the
precision of the stored values, this tells us how much memory the array occupies.
In [7]: y.size
Out[7]: 12
3.3.2 Operations
NDArray supports a large number of standard mathematical operations. Such as element-wise addition:
In [8]: x + y
Out[8]:
[[ 1.11287737 -0.30644417 0.89286423 -1.63099265]
[ 0.9426415 1.31348419 0.42348909 -0.11059952]
[ 1.57960725 0.77100402 2.04484272 1.81243682]]
<NDArray 3x4 @cpu(0)>
Multiplication:
In [9]: x * y
Out[9]:
[[ 0.11287736 -1.30644417 -0.10713575 -2.63099265]
[-0.05735848 0.31348416 -0.57651091 -1.11059952]
[ 0.57960719 -0.22899596 1.04484284 0.81243682]]
<NDArray 3x4 @cpu(0)>
And exponentiation:
In [10]: nd.exp(y)
Out[10]:
[[ 1.11949468 0.27078119 0.8984037 0.07200695]
[ 0.94425553 1.36818385 0.56185532 0.32936144]
[ 1.78533697 0.79533172 2.84295177 2.25339246]]
<NDArray 3x4 @cpu(0)>
We’ll explain these operations and present even more operators in the linear algebra chapter. But for now,
we’ll stick with the mechanics of working with NDArrays.
This might be undesirable for two reasons. First, we don’t want to run around allocating memory unnec-
essarily all the time. In machine learning, we might have hundreds of megabytes of paramaters and update
all of them multiple times per second. Typically, we’ll want to perform these updates in place. Second, we
might point at the same parameters from multiple variables. If we don’t update in place, this could cause a
memory leak, and could cause us to inadvertently reference stale parameters.
Fortunately, performing in-place operations in MXNet is easy. We can assign the result of an operation to a
previously allocated array with slice notation, e.g., y[:] = <expression>.
In [13]: print('id(y):', id(y))
y[:] = x + y
print('id(y):', id(y))
id(y): 140295515324600
id(y): 140295515324600
While this syntacically nice, x+y here will still allocate a temporary buffer to store the result before copying
it to y[:]. To make even better use of memory, we can directly invoke the underlying ndarray operation,
in this case elemwise_add, avoiding temporary buffers. We do this by specifying the out keyword
argument, which every ndarray operator supports:
In [15]: nd.elemwise_add(x, y, out=y)
Out[15]:
[[ 3.11287737 1.69355583 2.89286423 0.36900735]
[ 2.9426415 3.31348419 2.42348909 1.88940048]
[ 3.57960725 2.77100396 4.04484272 3.81243682]]
<NDArray 3x4 @cpu(0)>
If we’re not planning to re-use x, then we can assign the result to x itself. There are two ways to do this in
MXNet. 1. By using slice notation x[:] = x op y 2. By using the op-equals operators like +=
In [16]: print('id(x):', id(x))
x += y
x
print('id(x):', id(x))
id(x): 140291459564992
id(x): 140291459564992
3.3.4 Slicing
MXNet NDArrays support slicing in all the ridiculous ways you might imagine accessing your data. Here’s
an example of reading the second and third rows from x.
In [17]: x[1:3]
Out[17]:
[[ 3.9426415 4.31348419 3.42348909 2.88940048]
[ 4.57960701 3.77100396 5.04484272 4.81243706]]
<NDArray 2x4 @cpu(0)>
3.3.5 Broadcasting
You might wonder, what happens if you add a vector y to a matrix X? These operations, where we compose
a low dimensional array y with a high-dimensional array X invoke a functionality called broadcasting. Here,
the low-dimensional array is duplicated along any axis with dimension 1 to match the shape of the high
dimensional array. Consider the following example.
In [21]: x = nd.ones(shape=(3,3))
print('x = ', x)
y = nd.arange(3)
print('y = ', y)
print('x + y = ', x + y)
x =
[[ 1. 1. 1.]
[ 1. 1. 1.]
[ 1. 1. 1.]]
<NDArray 3x3 @cpu(0)>
y =
[ 0. 1. 2.]
<NDArray 3 @cpu(0)>
x + y =
[[ 1. 2. 3.]
[ 1. 2. 3.]
[ 1. 2. 3.]]
<NDArray 3x3 @cpu(0)>
While y is initially of shape (3), MXNet infers its shape to be (1,3), and then broadcasts along the rows to
form a (3,3) matrix). You might wonder, why did MXNet choose to interpret y as a (1,3) matrix and not
(3,1). That’s because broadcasting prefers to duplicate along the left most axis. We can alter this behavior
by explicitly giving y a 2D shape.
In [22]: y = y.reshape((3,1))
print('y = ', y)
print('x + y = ', x+y)
y =
[[ 0.]
[ 1.]
[ 2.]]
<NDArray 3x1 @cpu(0)>
x + y =
[[ 1. 1. 1.]
[ 2. 2. 2.]
[ 3. 3. 3.]]
<NDArray 3x3 @cpu(0)>
hardware devices.
In MXNet, every array has a context. One context could be the CPU. Other contexts might be various GPUs.
Things can get even hairier when we deploy jobs across multiple servers. By assigning arrays to contexts
intelligently, we can minimize the time spent transferring data between devices. For example, when training
neural networks on a server with a GPU, we typically prefer for the model’s parameters to live on the GPU.
To start, let’s try initializing an array on the first GPU.
In [25]: z = nd.ones(shape=(3,3), ctx=mx.gpu(0))
z
Out[25]:
[[ 1. 1. 1.]
[ 1. 1. 1.]
[ 1. 1. 1.]]
<NDArray 3x3 @gpu(0)>
Given an NDArray on a given context, we can copy it to another context by using the copyto() method.
In [26]: x_gpu = x.copyto(mx.gpu(0))
print(x_gpu)
[[ 1. 1. 1.]
[ 1. 1. 1.]
[ 1. 1. 1.]]
<NDArray 3x3 @gpu(0)>
The result of an operator will have the same context as the inputs.
In [27]: x_gpu + z
Out[27]:
[[ 2. 2. 2.]
[ 2. 2. 2.]
[ 2. 2. 2.]]
<NDArray 3x3 @gpu(0)>
If we ever want to check the context of an NDArray programmaticaly, we can just call its .context
attribute.
In [28]: print(x_gpu.context)
print(z.context)
gpu(0)
gpu(0)
In order to perform an operation on two ndarrays x1 and x2, we need them both to live on the same context.
And if they don’t already, we may need to explicitly copy data from one context to another. You might think
that’s annoying. After all, we just demonstrated that MXNet knows where each NDArray lives. So why
can’t MXNet just automatically copy x1 to x2.context and then add them?
In short, people use MXNet to do machine learning because they expect it to be fast. But transferring
variables between different contexts is slow. So we want you to be 100% certain that you want to do
something slow before we let you do it. If MXNet just did the copy automatically without crashing then
you might not realize that you had written some slow code. We don’t want you to spend your entire life on
StackOverflow, so we make some mistakes impossible.
[[ 1. 1. 1.]
[ 1. 1. 1.]
[ 1. 1. 1.]]
<NDArray 3x3 @gpu(0)>
3.3.9 Next
Linear algebra
For whinges or inquiries, open an issue on GitHub.
mathematical notation, and their realization in code all in one place. If you’re already confident in your
basic linear algebra, feel free to skim or skip this chapter.
In [2]: from mxnet import nd
3.4.1 Scalars
If you never studied linear algebra or machine learning, you’re probably used to working with one number
at a time. And know how to do basic things like add them together or multiply them. For example, in Palo
Alto, the temperature is 52 degrees Fahrenheit. Formally, we call these values 𝑠𝑐𝑎𝑙𝑎𝑟𝑠. If you wanted to
convert this value to Celsius (using metric system’s more sensible unit of temperature measurement), you’d
evaluate the expression 𝑐 = (𝑓 − 32) * 5/9 setting 𝑓 to 52. In this equation, each of the terms 32, 5, and 9
is a scalar value. The placeholders 𝑐 and 𝑓 that we use are called variables and they stand in for unknown
scalar values.
In mathematical notation, we represent scalars with ordinary lower cased letters (𝑥, 𝑦, 𝑧). We also denote
the space of all scalars as ℛ. For expedience, we’re going to punt a bit on what precisely a space is, but for
now, remember that if you want to say that 𝑥 is a scalar, you can simply say 𝑥 ∈ ℛ. The symbol ∈ can be
pronounced “in” and just denotes membership in a set.
In MXNet, we work with scalars by creating NDArrays with just one element. In this snippet, we instantiate
two scalars and perform some familiar arithmetic operations with them.
In [3]: ##########################
# Instantiate two scalars
##########################
x = nd.array([3.0])
y = nd.array([2.0])
##########################
# Add them
##########################
print('x + y = ', x + y)
##########################
# Multiply them
##########################
print('x * y = ', x * y)
##########################
# Divide x by y
##########################
print('x / y = ', x / y)
##########################
# Raise x to the power y.
##########################
print('x ** y = ', nd.power(x,y))
x + y =
[ 5.]
<NDArray 1 @cpu(0)>
x * y =
[ 6.]
<NDArray 1 @cpu(0)>
x / y =
[ 1.5]
<NDArray 1 @cpu(0)>
x ** y =
[ 9.]
<NDArray 1 @cpu(0)>
We can convert any NDArray to a Python float by calling its asscalar method
In [4]: x.asscalar()
Out[4]: 3.0
3.4.2 Vectors
You can think of a vector as simply a list of numbers, for example [1.0,3.0,4.0,2.0]. Each of the
numbers in the vector consists of a single scalar value. We call these values the entries or components of the
vector. Often, we’re interested in vectors whose values hold some real-world significance. For example, if
we’re studying the risk that loans default, we might associate each applicant with a vector whose components
correspond to their income, length of employment, number of previous defaults, etc. If we were studying
the risk of heart attack in hospital patients, we might represent each patient with a vector whose components
capture their most recent vital signs, cholesterol levels, minutes of exercise per day, etc. In math notation,
we’ll usually denote vectors as bold-faced, lower-cased letters (u, v, w). In MXNet, we work with vectors
via 1D NDArrays with an arbitrary number of components.
In [5]: u = nd.arange(4)
print('u = ', u)
u =
[ 0. 1. 2. 3.]
<NDArray 4 @cpu(0)>
We can refer to any element of a vector by using a subscript. For example, we can refer to the 4th element
of u by 𝑢4 . Note that the element 𝑢4 is a scalar, so we don’t bold-face the font when referring to it. In code,
we access any element 𝑖 by indexing into the NDArray.
In [6]: u[3]
Out[6]:
[ 3.]
<NDArray 1 @cpu(0)>
We can also access a vector’s length via its .shape attribute. The shape is a tuple that lists the dimension-
ality of the NDArray along each of its axes. Because a vector can only be indexed along one axis, its shape
has just one element.
In [8]: u.shape
Out[8]: (4,)
Note that the word dimension is overloaded and this tends to confuse people. Some use the dimensionality
of a vector to refer to its length (the number of components). However some use the word dimensionality to
refer to the number of axes that an array has. In this sense, a scalar would have 0 dimensions and a vector
would have 1 dimension. To avoid confusion, when we say *2D* array or *3D* array, we mean an
array with 2 or 3 axes repespectively. But if we say *:math:‘n‘-dimensional* vector, we mean a vector
of length :math:‘n‘.
In [ ]: a = 2
x = nd.array([1,2,3])
y = nd.array([10,20,30])
print(a * x)
print(a * x + y)
3.4.4 Matrices
Just as vectors generalize scalars from order 0 to order 1, matrices generalize vectors from 1𝐷 to 2𝐷.
Matrices, which we’ll denote with capital letters (𝐴, 𝐵, 𝐶), are represented in code as arrays with 2 axes.
Visually, we can draw a matrix as a table, where each entry 𝑎𝑖𝑗 belongs to the 𝑖-th row and 𝑗-th column.
⎛ ⎞
𝑎11 𝑎12 · · · 𝑎1𝑚
⎜ 𝑎21 𝑎22 · · · 𝑎2𝑚 ⎟
𝐴=⎜ .
⎜ ⎟
.. .. .. ⎟
⎝ .. . . . ⎠
𝑎𝑛1 𝑎𝑛2 · · · 𝑎𝑛𝑚
We can create a matrix with 𝑛 rows and 𝑚 columns in MXNet by specifying a shape with two components
(n,m) when calling any of our favorite functions for instantiating an ndarray such as ones, or zeros.
In [10]: A = nd.zeros((5,4))
A
Out[10]:
[[ 0. 0. 0. 0.]
[ 0. 0. 0. 0.]
[ 0. 0. 0. 0.]
[ 0. 0. 0. 0.]
[ 0. 0. 0. 0.]]
<NDArray 5x4 @cpu(0)>
We can also reshape any 1D array into a 2D ndarray by calling ndarray’s reshape method and passing in
the desired shape. Note that the product of shape components n * m must be equal to the length of the
original vector.
In [12]: x = nd.arange(20)
A = x.reshape((5, 4))
A
Out[12]:
[[ 0. 1. 2. 3.]
[ 4. 5. 6. 7.]
[ 8. 9. 10. 11.]
[ 12. 13. 14. 15.]
[ 16. 17. 18. 19.]]
<NDArray 5x4 @cpu(0)>
Matrices are useful data structures: they allow us to organize data that has different modalities of variation.
For example, returning to the example of medical data, rows in our matrix might correspond to different
patients, while columns might correspond to different attributes.
We can access the scalar elements 𝑎𝑖𝑗 of a matrix 𝐴 by specifying the indices for the row (𝑖) and column (𝑗)
respectively. Let’s grab the element 𝑎2,3 from the random matrix we initialized above.
In [13]: print('A[2, 3] = ', A[2, 3])
A[2, 3] =
[ 11.]
<NDArray 1 @cpu(0)>
We can also grab the vectors corresponding to an entire row a𝑖,: or a column a:,𝑗 .
In [14]: print('row 2', A[2, :])
print('column 3', A[:, 3])
row 2
[ 8. 9. 10. 11.]
<NDArray 4 @cpu(0)>
column 3
[ 3. 7. 11. 15. 19.]
<NDArray 5 @cpu(0)>
We can transpose the matrix through T. That is, if 𝐵 = 𝐴𝑇 , then 𝑏𝑖𝑗 = 𝑎𝑗𝑖 for any 𝑖 and 𝑗.
In [15]: A.T
Out[15]:
[[ 0. 4. 8. 12. 16.]
[ 1. 5. 9. 13. 17.]
[ 2. 6. 10. 14. 18.]
[ 3. 7. 11. 15. 19.]]
<NDArray 4x5 @cpu(0)>
3.4.5 Tensors
Just as vectors generalize scalars, and matrices generalize vectors, we can actually build data structures
with even more axes. Tensors give us a generic way of discussing arrays with an arbitrary number of axes.
Vectors, for example, are first-order tensors, and matrices are second-order tensors.
Using tensors will become more important when we start working with images, which arrive as 3D data
structures, with axes corresponding to the height, width, and the three (RGB) color channels. But in this
chapter, we’re going to skip past and make sure you know the basics.
In [16]: X = nd.arange(24).reshape((2, 3, 4))
print('X.shape =', X.shape)
print('X =', X)
X.shape = (2, 3, 4)
X =
[[[ 0. 1. 2. 3.]
[ 4. 5. 6. 7.]
[ 8. 9. 10. 11.]]
We can call element-wise operations on any two tensors of the same shape, including matrices.
In [18]: B = nd.ones_like(A) * 3
print('B =', B)
print('A + B =', A + B)
print('A * B =', A * B)
B =
[[ 3. 3. 3. 3.]
[ 3. 3. 3. 3.]
[ 3. 3. 3. 3.]
[ 3. 3. 3. 3.]
[ 3. 3. 3. 3.]]
<NDArray 5x4 @cpu(0)>
A + B =
[[ 3. 4. 5. 6.]
[ 7. 8. 9. 10.]
[ 11. 12. 13. 14.]
[ 15. 16. 17. 18.]
[ 19. 20. 21. 22.]]
<NDArray 5x4 @cpu(0)>
A * B =
[[ 0. 3. 6. 9.]
[ 12. 15. 18. 21.]
[ 24. 27. 30. 33.]
[ 36. 39. 42. 45.]
[ 48. 51. 54. 57.]]
<NDArray 5x4 @cpu(0)>
Shape is not the the only property preserved under addition and multiplication by a scalar. These operations
also preserve membership in a vector space. But we’ll postpone this discussion for the second half of this
chapter because it’s not critical to getting your first models up and running.
A related quantity is the mean, which is also called the average. We calculate the mean by dividing the sum
by∑︀
the total number of elements. With mathematical notation, we could write the average over a vector u as
1 𝑑 1 ∑︀𝑚 ∑︀𝑛
𝑑 𝑖=1 𝑢𝑖 and the average over a matrix 𝐴 as 𝑛·𝑚 𝑖=1 𝑗=1 𝑎𝑖𝑗 . In code, we could just call nd.mean()
on tensors of arbitrary shape:
In [ ]: print(nd.mean(A))
print(nd.sum(A) / A.size)
In [ ]: nd.dot(u, v)
Note that we can express the dot product of two vectors nd.dot(u, v) equivalently by performing an
element-wise multiplication and then a sum:
In [ ]: nd.sum(u * v)
Dot products are useful in a wide range of contexts. For example, given a set of weights w, the weighted
sum of some ∑︀values 𝑢 could be expressed as the dot product u𝑇 w. When the weights are non-negative and
sum to one ( 𝑑𝑖=1 𝑤𝑖 = 1), the dot product expresses a weighted average. When two vectors each have
length one (we’ll discuss what length means below in the section on norms), dot products can also capture
the cosine of the angle between them.
where each a𝑇𝑖 ∈ R𝑚 is a row vector representing the 𝑖-th row of the matrix 𝐴.
Then the matrix vector product y = 𝐴x is simply a column vector y ∈ R𝑛 where each entry 𝑦𝑖 is the dot
product a𝑇𝑖 x.
· · · a𝑇1
⎛ ⎞⎛ ⎞ ⎛ 𝑇 ⎞
... 𝑥1 a1 x
⎜· · · a𝑇 · · ·⎟ ⎜ 𝑥2 ⎟ ⎜a𝑇2 x⎟
⎟ ⎜ ⎟ ⎜
2
𝐴x = ⎜ ⎟ ⎜ .. ⎟ = ⎜ .. ⎟
⎜ ⎟
..
⎝ . ⎠⎝ . ⎠ ⎝ . ⎠
· · · a𝑇𝑛 ··· 𝑥𝑚 a𝑇𝑛 x
So you can think of multiplication by a matrix 𝐴 ∈ R𝑚×𝑛 as a transformation that projects vectors from
R𝑚 to R𝑛 .
These transformations turn out to be quite useful. For example, we can represent rotations as multiplications
by a square matrix. As we’ll see in subsequent chapters, we can also use matrix-vector products to describe
the calculations of each layer in a neural network.
Expressing matrix-vector products in code with ndarray, we use the same nd.dot() function as for
dot products. When we call nd.dot(A, x) with a matrix A and a vector x, MXNet knows to perform a
matrix-vector product. Note that the column dimension of A must be the same as the dimension of x.
In [ ]: nd.dot(A, u)
To produce the matrix product 𝐶 = 𝐴𝐵, it’s easiest to think of 𝐴 in terms of its row vectors and 𝐵 in terms
of its column vectors:
· · · a𝑇1 ...
⎛ ⎞
.. .. ..
⎛ ⎞
⎜· · · a𝑇 · · ·⎟
2 ⎜ . . . ⎟
𝐴=⎜ , 𝐵 = b b · · · b 𝑚⎠ .
⎜ ⎟
.. ⎟ ⎝ 1
⎜ 2
⎟
⎝ . ⎠ .. .. ..
· · · a𝑇𝑛 · · · . . .
Note here that each row vector a𝑇𝑖 lies in R𝑘 and that each column vector b𝑗 also lies in R𝑘 .
Then to produce the matrix product 𝐶 ∈ R𝑛×𝑚 we simply compute each entry 𝑐𝑖𝑗 as the dot product a𝑇𝑖 b𝑗 .
You can think of the matrix-matrix multiplication 𝐴𝐵 as simply performing 𝑚 matrix-vector products and
stitching the results together to form an 𝑛 × 𝑚 matrix. Just as with ordinary dot products and matrix-vector
products, we can compute matrix-matrix products in MXNet by using nd.dot().
3.4.12 Norms
Before we can start implementing models, there’s one last concept we’re going to introduce. Some of the
most useful operators in linear algebra are norms. Informally, they tell us how big a vector or matrix is. We
represent norms with the notation ‖ · ‖. The · in this expression is just a placeholder. For example, we would
represent the norm of a vector x or matrix 𝐴 as ‖x‖ or ‖𝐴‖, respectively.
All norms must satisfy a handful of properties: 1. ‖𝛼𝐴‖ = |𝛼|‖𝐴‖ 2. ‖𝐴 + 𝐵‖ ≤ ‖𝐴‖ + ‖𝐵‖ 3. ‖𝐴‖ ≥ 0
4. If ∀𝑖, 𝑗, 𝑎𝑖𝑗 = 0, then ‖𝐴‖ = 0
To put it in words, the first rule says that if we scale all the components of a matrix or vector by a constant
factor 𝛼, its norm also scales by the absolute value of the same constant factor. The second rule is the
familiar triangle inequality. The third rule simple says that the norm must be non-negative. That makes
sense, in most contexts the smallest size for anything is 0. The final rule basically says that the smallest
norm is achieved by a matrix or vector consisting of all zeros. It’s possible to define a norm that gives zero
norm to nonzero matrices, but you can’t give nonzero norm to zero matrices. That’s a mouthful, but if you
digest it then you probably have grepped the important concepts here.
If you remember Euclidean distances (think Pythagoras’ theorem) from grade school, then non-negativity
and the triangle inequality might ring a bell. You might notice that norms sound a lot like measures of
distance.
√︀
In fact, the Euclidean distance 𝑥21 + · · · + 𝑥2𝑛 is a norm. Specifically it’s the ℓ2 -norm. An analogous
√︁∑︀
2
computation, performed over the entries of a matrix, e.g. 𝑖,𝑗 𝑎𝑖𝑗 , is called the Frobenius norm. More
often, in machine learning we work with the squared ℓ2 norm (notated ℓ22 ). We also commonly work with
the ℓ1 norm. The ℓ1 norm is simply the sum of the absolute values. It has the convenient property of placing
less emphasis on outliers.
To calculate the ℓ2 norm, we can just call nd.norm().
In [ ]: nd.norm(u)
To calculate the L1-norm we can simply perform the absolute value and then sum over the elements.
In [ ]: nd.sum(nd.abs(u))
objectives, perhaps the most important component of a machine learning algorithm (besides the data itself),
are expressed as norms.
• Positive Definite Matrix These are matrices that have the nice property where 𝑥⊤ 𝑀 𝑥 > 0 whenever
𝑥 ̸= 0. Intuitively, they are a generalization of the squared norm of a vector ‖𝑥‖2 = 𝑥⊤ 𝑥. It is easy
to check that whenever 𝑀 = 𝐴⊤ 𝐴, this holds since there 𝑥⊤ 𝑀 𝑥 = 𝑥⊤ 𝐴⊤ 𝐴𝑥 = ‖𝐴𝑥‖2 . There is
a somewhat more profound theorem which states that all positive definite matrices can be written in
this form.
3.5.3 Conclusions
In just a few pages (or one Jupyter notebook) we’ve taught you all the linear algebra you’ll need to un-
derstand a good chunk of neural networks. Of course there’s a lot more to linear algebra. And a lot of
that math is useful for machine learning. For example, matrices can be decomposed into factors, and these
decompositions can reveal low-dimensional structure in real-world datasets. There are entire subfields of
machine learning that focus on using matrix decompositions and their generalizations to high-order tensors
to discover structure in datasets and solve prediction problems. But this book focuses on deep learning. And
we believe you’ll be much more inclined to learn more mathematics once you’ve gotten your hands dirty
deploying useful machine learning models on real datasets. So while we reserve the right to introduce more
math much later on, we’ll wrap up this chapter here.
If you’re eager to learn more about linear algebra, here are some of our favorite resources on the topic *
For a solid primer on basics, check out Gilbert Strang’s book Introduction to Linear Algebra * Zico Kolter’s
Linear Algebra Reivew and Reference
3.5.4 Next
Probability and statistics
For whinges or inquiries, open an issue on GitHub.
While it’s easy for humans to recognize cats and dogs at 320 pixel resolution, it becomes challenging at 40
pixels and next to impossible at 20 pixels. In other words, our ability to tell cats and dogs apart at a large
distance (and thus low resolution) might approach uninformed guessing. Probability gives us a formal way
of reasoning about our level of certainty. If we are completely sure that the image depicts a cat, we say that
the probability that the corresponding label 𝑙 is cat, denoted 𝑃 (𝑙 = cat) equals 1.0. If we had no evidence
to suggest that 𝑙 = cat or that 𝑙 = dog, then we might say that the two possibilities were equally 𝑙𝑖𝑘𝑒𝑙𝑦
expressing this as 𝑃 (𝑙 = cat) = 0.5. If we were reasonably confident, but not sure that the image depicted
a cat, we might assign a probability .5 < 𝑃 (𝑙 = cat) < 1.0.
Now consider a second case: given some weather monitoring data, we want to predict the probability that
it will rain in Taipei tomorrow. If it’s summertime, the rain might come with probability .5 In both cases,
we have some value of interest. And in both cases we are uncertain about the outcome. But there’s a key
difference between the two cases. In this first case, the image is in fact either a dog or a cat, we just don’t
know which. In the second case, the outcome may actually be a random event, if you believe in such things
(and most physicists do). So probability is a flexible language for reasoning about our level of certainty, and
it can be applied effectively in a broad set of contexts.
Next, we’ll want to be able to cast the die. In statistics we call this process of drawing examples from
probability distributions sampling. The distribution which assigns probabilities to a number of discrete
choices is called the multinomial distribution. We’ll give a more formal definition of distribution later, but
at a high level, think of it as just an assignment of probabilities to events. In MXNet, we can sample from
the multinomial distribution via the aptly named nd.sample_multinomial function. The function can
be called in many ways, but we’ll focus on the simplest. To draw a single sample, we simply give pass in a
vector of probabilities.
In [2]: probabilities = nd.ones(6) / 6
nd.sample_multinomial(probabilities)
Out[2]:
[3]
<NDArray 1 @cpu(0)>
If you run this line (nd.sample_multinomial(probabilities)) a bunch of times, you’ll find that
you get out random values each time. As with estimating the fairness of a die, we often want to generate
many samples from the same distribution. It would be really slow to do this with a Python for loop, so
sample_multinomial supports drawing multiple samples at once, returning an array of independent
samples in any shape we might desire.
In [3]: print(nd.sample_multinomial(probabilities, shape=(10)))
print(nd.sample_multinomial(probabilities, shape=(5,10)))
[3 4 5 3 5 3 5 2 3 3]
<NDArray 10 @cpu(0)>
[[2 2 1 5 0 5 1 2 2 4]
[4 3 2 3 2 5 5 0 2 0]
[3 0 2 4 5 4 0 5 5 5]
[2 4 4 2 3 4 4 0 4 3]
[3 0 3 5 4 3 0 2 2 1]]
<NDArray 5x10 @cpu(0)>
Now that we know how to sample rolls of a die, we can simulate 1000 rolls.
In [4]: rolls = nd.sample_multinomial(probabilities, shape=(1000))
We can then go through and count, after each of the 1000 rolls, how many times each number was rolled.
In [5]: counts = nd.zeros((6,1000))
totals = nd.zeros(6)
for i, roll in enumerate(rolls):
totals[int(roll.asscalar())] += 1
counts[:, i] = totals
To start, we can inspect the final tally at the end of 1000 rolls.
In [6]: totals / 1000
Out[6]:
[ 0.167 0.168 0.175 0.15899999 0.15800001 0.17299999]
<NDArray 6 @cpu(0)>
As you can see, the lowest estimated probability for any of the numbers is about .15 and the highest estimated
probability is 0.188. Because we generated the data from a fair die, we know that each number actually has
probability of 1/6, roughly .167, so these estimates are pretty good. We can also visualize how these
probabilities converge over time towards reasonable estimates.
To start let’s take a look at the counts array which has shape (6, 1000). For each time step (out of
1000), counts, says how many times each of the numbers has shown up. So we can normalize each 𝑗-th
column of the counts vector by the number of tosses to give the current estimated probabilities at that
time. The counts object looks like this:
In [7]: counts
Out[7]:
[[ 0. 0. 0. ..., 165. 166. 167.]
[ 1. 1. 1. ..., 168. 168. 168.]
[ 0. 0. 0. ..., 175. 175. 175.]
[ 0. 0. 0. ..., 159. 159. 159.]
[ 0. 1. 2. ..., 158. 158. 158.]
[ 0. 0. 0. ..., 173. 173. 173.]]
<NDArray 6x1000 @cpu(0)>
[ 0. 1. 0. 0. 0. 0.]
<NDArray 6 @cpu(0)>
[ 0. 0.5 0. 0. 0.5 0. ]
<NDArray 6 @cpu(0)>
As you can see, after the first toss of the die, we get the extreme estimate that one of the numbers will be
rolled with probability 1.0 and that the others have probability 0. After 100 rolls, things already look a bit
more reasonable. We can visualize this convergence by using the plotting package matplotlib. If you
don’t have it installed, now would be a good time to install it.
In [9]: %matplotlib inline
Each solid curve corresponds to one of the six values of the die and gives our estimated probability that
the die turns up that value as assessed after each of the 1000 turns. The dashed black line gives the true
underlying probability. As we get more data, the solid curves converge towards the true answer.
In our example of casting a die, we introduced the notion of a random variable. A random variable,
which we denote here as 𝑋 can be pretty much any quantity and is not determistic. Random variables
could take one value among a set of possibilites. We denote sets with brackets, e.g., {cat, dog, rabbit}.
The items contained in the set are called elements, and we can say that an element 𝑥 is in the set S, by
writing 𝑥 ∈ 𝑆. The symbol ∈ is read as “in” and denotes membership. For instance, we could truthfully
say dog ∈ {cat, dog, rabbit}. When dealing with the rolls of die, we are concerned with a variable 𝑋 ∈
{1, 2, 3, 4, 5, 6}.
Note that there is a subtle difference between discrete random variables, like the sides of a dice, and con-
tinuous ones, like the weight and the height of a person. There’s little point in asking whether two people
have exactly the same height. If we take precise enough measurements you’ll find that no two people on
the planet have the exact same height. In fact, if we take a fine enough measurement, you will not have
the same height when you wake up and when you go to sleep. So there’s no purpose in asking about the
probability that some one is 2.00139278291028719210196740527486202 meters tall. The probability is 0.
It makes more sense in this case to ask whether someone’s height falls into a given interval, say between
1.99 and 2.01 meters. In these cases we quantify the likelihood that we see a value as a density. The height
of exactly 2.0 meters has no probability, but nonzero density. Between any two different heights we have
nonzero probability.
There are a few important axioms of probability that you’ll want to remember:
• For any event 𝑧, the probability is never negative, i.e. Pr(𝑍 = 𝑧) ≥ 0.
• For any two events 𝑍 = 𝑧 and 𝑋 = 𝑥 the union is no more likely than the sum of the individual
events, i.e. Pr(𝑍 = 𝑧 ∪ 𝑋 = 𝑥) ≤ Pr(𝑍 = 𝑧) + Pr(𝑋 = 𝑥).
• For any random variable, the probabilities of all the values it can take must sum to 1 𝑛𝑖=1 𝑃 (𝑍 =
∑︀
𝑧𝑖 ) = 1.
• For any two mutually exclusive events 𝑍 = 𝑧 and 𝑋 = 𝑥, the probability that either happens is equal
to the sum of their individual probabilities that Pr(𝑍 = 𝑧 ∪ 𝑋 = 𝑥) = Pr(𝑍 = 𝑧) + Pr(𝑋 = 𝑧).
seeing 𝐴 amounts to accounting for all possible choices of 𝐵 and aggregating the joint probabilities over all
of them, i.e.
∑︁ ∑︁
Pr(𝐴) = Pr(𝐴, 𝐵 ′ ) and Pr(𝐵) = Pr(𝐴′ , 𝐵)
𝐵′ 𝐴′
A really useful property to check is for dependence and independence. Independence is when the oc-
currence of one event does not influence the occurrence of the other. In this case Pr(𝐵|𝐴) = Pr(𝐵).
Statisticians typically use 𝐴 ⊥
⊥ 𝐵 to express this. From Bayes Theorem it follows immediately that also
Pr(𝐴|𝐵) = Pr(𝐴). In all other cases we call 𝐴 and 𝐵 dependent. For instance, two successive rolls of a
dice are independent. On the other hand, the position of a light switch and the brightness in the room are not
(they are not perfectly deterministic, though, since we could always have a broken lightbulb, power failure,
or a broken switch).
Let’s put our skills to the test. Assume that a doctor administers an AIDS test to a patient. This test is fairly
accurate and fails only with 1% probability if the patient is healthy by reporting him as diseased, and that it
never fails to detect HIV if the patient actually has it. We use 𝐷 to indicate the diagnosis and 𝐻 to denote
the HIV status. Written as a table the outcome Pr(𝐷|𝐻) looks as follows:
Note that the column sums are all one (but the row sums aren’t), since the conditional probability needs to
sum up to 1, just like the probability. Let us work out the probability of the patient having AIDS if the test
comes back positive. Obviously this is going to depend on how common the disease is, since it affects the
number of false alarms. Assume that the population is quite healthy, e.g. Pr(HIV positive) = 0.0015. To
apply Bayes Theorem we need to determine
Pr(Test positive) = Pr(𝐷 = 1|𝐻 = 0) Pr(𝐻 = 0) + Pr(𝐷 = 1|𝐻 = 1) Pr(𝐻 = 1) = 0.01 · 0.9985 + 1 · 0.0015 = 0.011
Unfortunately, the second test comes back positive, too. Let us work out the requisite probabilities to invoke
Bayes’ Theorem.
• Pr(𝐷1 = 1 and 𝐷2 = 1|𝐻 = 0) = 0.01 · 0.03 = 0.0003
y = int(label)
ycount[y] += 1
xcount[:, y] += x
Now that we computed per-pixel counts of occurrence for all pixels, it’s time to see how our model behaves.
Time to plot it. We show the estimated probabilities of observing a switched-on pixel. These are some mean
looking digits.
In [11]: import matplotlib.pyplot as plt
fig, figarr = plt.subplots(1, 10, figsize=(15, 15))
for i in range(10):
figarr[i].imshow(xcount[:, i].reshape((28, 28)).asnumpy(), cmap='hot')
figarr[i].axes.get_xaxis().set_visible(False)
figarr[i].axes.get_yaxis().set_visible(False)
plt.show()
print(py)
Now we can compute the likelihoods of an image, given the model. This is statistican speak for 𝑝(𝑥|𝑦),
i.e. how likely it is to see a particular image under certain conditions (such as the label). Since this is
computationally awkward (we might have to multiply many small numbers if many pixels have a small
probability
∏︀ ∑︀off computing its logarithm instead. That is, instead of 𝑝(𝑥|𝑦) =
of occurring), we are better
𝑖 𝑝(𝑥 𝑖 |𝑦) we compute log 𝑝(𝑥|𝑦) = 𝑖 log 𝑝(𝑥𝑖 |𝑦).
∑︁ ∑︁
𝑙𝑦 := log 𝑝(𝑥𝑖 |𝑦) = 𝑥𝑖 log 𝑝(𝑥𝑖 = 1|𝑦) + (1 − 𝑥𝑖 ) log (1 − 𝑝(𝑥𝑖 = 1|𝑦))
𝑖 𝑖
To avoid recomputing logarithms all the time, we precompute them for all pixels.
In [12]: logxcount = nd.log(xcount)
logxcountneg = nd.log(1-xcount)
logpy = nd.log(py)
# show 10 images
ctr = 0
plt.show()
As we can see, this classifier is both incompetent and overly confident of its incorrect estimates. That is,
even if it is horribly wrong, it generates probabilities close to 1 or 0. Not a classifier we should use very
much nowadays any longer. While Naive Bayes classifiers used to be popular in the 80s and 90s, e.g. for
spam filtering, their heydays are over. The poor performance is due to the incorrect statistical assumptions
that we made in our model: we assumed that each and every pixel are independently generated, depending
only on the label. This is clearly not how humans write digits, and this wrong assumption led to the downfall
of our overly naive (Bayes) classifier.
3.6.5 Sampling
Random numbers are just one form of random variables, and since computers are particularly good with
numbers, pretty much everything else in code ultimately gets converted to numbers anyway. One of the
basic tools needed to generate random numbers is to sample from a distribution. Let’s start with what
happens when we use a random number generator.
Uniform Distribution
These are some pretty random numbers. As we can see, their range is between 0 and 1, and they are evenly
distributed. That is, there is (actually, should be, since this is not a real random number generator) no
interval in which numbers are more likely than in any other. In other words, the chances of any of these
numbers to fall into the interval, say [0.2, 0.3) are as high as in the interval [.593264, .693264). The way
they are generated internally is to produce a random integer first, and then divide it by its maximum range.
If we want to have integers directly, try the following instead. It generates random numbers between 0 and
100.
In [14]: for i in range(10):
print(random.randint(1, 100))
75
23
34
85
99
66
13
42
19
14
What if we wanted to check that randint is actually really uniform. Intuitively the best strategy would be
to run it, say 1 million times, count how many times it generates each one of the values and to ensure that
the result is uniform.
In [15]: import math
counts = np.zeros(100)
fig, axes = plt.subplots(2, 3, figsize=(15, 8), sharex=True)
axes = axes.reshape(6)
# mangle subplots such that we can index them in a linear fashion rather than
# a 2d grid
What we can see from the above figures is that the initial number of counts looks very uneven. If we sample
fewer than 100 draws from a distribution over 100 outcomes this is pretty much expected. But even for 1000
samples there is a significant variability between the draws. What we are really aiming for is a situation
where the probability of drawing a number 𝑥 is given by 𝑝(𝑥).
plt.figure(figsize=(15, 8))
plt.semilogx(x, p0)
plt.semilogx(x, p1)
plt.show()
As we can see, on average this sampler will generate 35% zeros and 65% ones. Now what if we have
more than two possible outcomes? We can simply generalize this idea as follows. Given any probability
distribution, e.g. 𝑝 = [0.1, 0.2, 0.05, 0.3, 0.25, 0.1] we can compute its cumulative distribution (python’s
cumsum will do this for you) 𝐹 = [0.1, 0.3, 0.35, 0.65, 0.9, 1]. Once we have this we draw a random
variable 𝑥 from the uniform distribution 𝑈 [0, 1] and then find the interval where 𝐹 [𝑖 − 1] ≤ 𝑥 < 𝐹 [𝑖]. We
then return 𝑖 as the sample. By construction, the chances of hitting interval [𝐹 [𝑖 − 1], 𝐹 [𝑖]) has probability
𝑝(𝑖).
Note that there are many more efficient algorithms for sampling than the one above. For instance, binary
search over 𝐹 will run in 𝑂(log 𝑛) time for 𝑛 random variables. There are even more clever algorithms,
such as the Alias Method to sample in constant time, after 𝑂(𝑛) preprocessing.
Sampling from this distribution is a lot less trivial. First off, the support is infinite, that is, for any 𝑥 the
density 𝑝(𝑥) is positive. Secondly, the density is nonuniform. There are many tricks for sampling from it -
the key idea in all algorithms is to stratify 𝑝(𝑥) in such a way as to map it to the uniform distribution 𝑈 [0, 1].
One way to do this is with the probability integral transform.
∫︀ 𝑥
Denote by 𝐹 (𝑥) = −∞ 𝑝(𝑧)𝑑𝑧 the cumulative distribution function (CDF) of 𝑝. This is in a way the
continuous version of the cumulative sum that we used previously. In the same way we can now define the
inverse map 𝐹 −1 (𝜉), where 𝜉 is drawn uniformly. Unlike previously where we needed to find the correct
interval for the vector 𝐹 (i.e. for the piecewise constant function), we now invert the function 𝐹 (𝑥).
In practice, this is slightly more tricky since inverting the CDF is hard in the case of a Gaussian. It turns
out that the twodimensional integral is much easier to deal with, thus yielding two normal random variables
than one, albeit at the price of two uniformly distributed ones. For now, suffice it to say that there are built-in
algorithms to address this.
The normal distribution has yet another desirable property. In a way all distributions converge to it, if we
only average over a sufficiently large number of draws from any other distribution. To understand this in a
bit more detail, we need to introduce three important things: expected values, means and variances.
∫︀ expected value E𝑥∼𝑝(𝑥) [𝑓 (𝑥)] of a function 𝑓 under a distribution 𝑝 is given by the integral
• The
𝑥 𝑝(𝑥)𝑓 (𝑥)𝑑𝑥. That is, we average over all possible outcomes, as given by 𝑝.
• A particularly important expected value is that for the function 𝑓 (𝑥) = 𝑥, i.e. 𝜇 := E𝑥∼𝑝(𝑥) [𝑥]. It
provides us with some idea about the typical values of 𝑥.
• Another important quantity is the variance, i.e. the typical deviation from the mean 𝜎 2 :=
E𝑥∼𝑝(𝑥) [(𝑥 − 𝜇)2 ]. Simple math shows (check it as an exercise) that 𝜎 2 = E𝑥∼𝑝(𝑥) [𝑥2 ] − E2𝑥∼𝑝(𝑥) [𝑥].
The above allows us to change both mean and variance of random variables. Quite obviously for some
random variable 𝑥 with mean 𝜇, the random variable 𝑥 + 𝑐 has mean 𝜇 + 𝑐. Moreover, 𝛾𝑥 has the variance
𝛾 2 𝜎 2 . Applying this
(︀ to 1the normal2 )︀distribution we see that one with mean 𝜇 and variance 𝜎 2 has the form
1 1
𝑝(𝑥) = √ 2 exp − 2𝜎2 (𝑥 − 𝜇) . Note the scaling factor 𝜎 - it arises from the fact that if we stretch the
2𝜎 𝜋
distribution by 𝜎, we need to lower it by 𝜎1 to retain the same probability mass (i.e. the weight under the
distribution always needs to integrate out to 1).
Now we are ready to state one of the most fundamental theorems in statistics, the Central Limit Theorem. It
states that for sufficiently well-behaved random variables, in particular random variables with well-defined
mean and variance, the sum tends toward a normal distribution. To get some idea, let’s repeat the experiment
described in the beginning, but now using random variables with integer values of {0, 1, 2}.
In [18]: # generate 10 random sequences of 10,000 random normal variables N(0,1)
tmp = np.random.uniform(size=(10000,10))
x = 1.0 * (tmp > 0.3) + 1.0 * (tmp > 0.8)
mean = 1 * 0.5 + 2 * 0.2
variance = 1 * 0.5 + 4 * 0.2 - mean**2
print('mean {}, variance {}'.format(mean, variance))
# cumulative sum and normalization
y = np.arange(1,10001).reshape(10000,1)
z = np.cumsum(x,axis=0) / y
plt.figure(figsize=(10,5))
for i in range(10):
plt.semilogx(y,z[:,i])
This looks very similar to the initial example, at least in the limit of averages of large numbers of variables.
This is confirmed by theory. Denote by mean and variance of a random variable the quantities
√1
∑︀𝑛 𝑥𝑖 −𝜇
Then we have that lim𝑛→∞ 𝑛 𝑖=1 𝜎 → 𝒩 (0, 1). In other words, regardless of what we started out
with, we will always converge to a Gaussian. This is one of the reasons why Gaussians are so popular in
statistics.
More distributions
Many more useful distributions exist. We recommend consulting a statistics book or looking some of them
up on Wikipedia for further detail.
• Binomial Distribution It is used to describe the distribution over multiple draws from the same
distribution, e.g. the number of heads when tossing a biased (︀coin
)︀ (i.e. a coin with probability 𝜋 of
𝑛 𝑥 𝑛−𝑥
returning heads) 10 times. The probability is given by 𝑝(𝑥) = 𝑥 𝜋 (1 − 𝜋) .
• Multinomial Distribution Obviously we can have more than two outcomes, ∏︀𝑘 e.g. when rolling a dice
𝑥𝑖
multiple times. In this case the distribution is given by 𝑝(𝑥) = ∏︀𝑘 𝑛! 𝜋
𝑖=1 𝑖 .
𝑖=1 𝑥𝑖 !
• Poisson Distribution It is used to model the occurrence of point events that happen with a given rate,
e.g. the number of raindrops arriving within a given amount of time in an area (weird fact - the number
of Prussian soldiers being killed by horses kicking them followed that distribution). Given a rate 𝜆,
1 𝑥 −𝜆
the number of occurrences is given by 𝑝(𝑥) = 𝑥! 𝜆 𝑒 .
• Beta, Dirichlet, Gamma, and Wishart Distributions They are what statisticians call conjugate to
the Binomial, Multinomial, Poisson and Gaussian respectively. Without going into detail, these distri-
butions are often used as priors for coefficients of the latter set of distributions, e.g. a Beta distribution
as a prior for modeling the probability for binomial outcomes.
3.6.6 Next
Autograd
For whinges or inquiries, open an issue on GitHub.
Once we compute the gradient of f with respect to x, we’ll need a place to store it. In MXNet, we can tell
an NDArray that we plan to store a gradient by invoking its attach_grad() method.
In [3]: x.attach_grad()
Now we’re going to define the function f and MXNet will generate a computation graph on the fly. It’s as if
MXNet turned on a recording device and captured the exact path by which each variable was generated.
Note that building the computation graph requires a nontrivial amount of computation. So MXNet will only
build the graph when explicitly told to do so. We can instruct MXNet to start recording by placing code
inside a with autograd.record(): block.
In [4]: with autograd.record():
y = x * 2
z = y * x
Let’s backprop by calling z.backward(). When z has more than one entry, z.backward() is equiv-
alent to mx.nd.sum(z).backward().
In [5]: z.backward()
Now, let’s see if this is the expected output. Remember that y = x * 2, and z = x * y, so z should be
equal to 2 * x * x. After, doing backprop with z.backward(), we expect to get back gradient dz/dx
as follows: dy/dx = 2, dz/dx = 4 * x. So, if everything went according to plan, x.grad should consist of
an NDArray with the values [[4, 8],[12, 16]].
In [6]: print(x.grad)
[[ 4. 8.]
[ 12. 16.]]
<NDArray 2x2 @cpu(0)>
[[ 40. 8. ]
[ 1.20000005 0.16 ]]
<NDArray 2x2 @cpu(0)>
Now that we know the basics, we can do some wild things with autograd, including building differentiable
functions using Pythonic control flow.
In [8]: a = nd.random_normal(shape=3)
a.attach_grad()
with autograd.record():
b = a * 2
while (nd.norm(b) < 1000).asscalar():
b = b * 2
3.7.3 Next
Chapter 1 Problem Set
For whinges or inquiries, open an issue on GitHub.
In [ ]:
𝑦ˆ = 𝑤1 · 𝑥1 + ... + 𝑤𝑑 · 𝑥𝑑 + 𝑏
Given a collection of data points 𝑋, and corresponding target values 𝑦, we’ll try to find the weight vector
𝑤 and bias term 𝑏 (also called an offset or intercept) that approximately associate data points 𝑥𝑖 with their
corresponding labels y_i. Using slightly more advanced math notation, we can express the predictions 𝑦 ^
corresponding to a collection of datapoints 𝑋 via the matrix-vector product:
𝑦
^ = 𝑋𝑤 + 𝑏
Square loss
In order to say whether we’ve done a good job, we need some way to measure the quality of a model.
Generally, we will define a loss function that says how far are our predictions from the correct answers. For
the classical case of linear regression, we usually focus on the squared error. Specifically, our loss will be
the sum, over all examples, of the squared error (𝑦𝑖 − 𝑦ˆ)2 ) on each:
𝑛
∑︁
ℓ(𝑦, 𝑦ˆ) = 𝑦𝑖 − 𝑦𝑖 )2 .
(ˆ
𝑖=1
For one-dimensional data, we can easily visualize the relationship between our single feature and the target
variable. It’s also easy to visualize a linear predictor and it’s error on each example. Note that squared loss
heavily penalizes outliers. For the visualized predictor below, the lone outlier would contribute most of the
loss.
Historical note
You might reasonably point out that linear regression is a classical statistical model. According to Wikipedia,
Legendre first developed the method of least squares regression in 1805, which was shortly thereafter re-
discovered by Gauss in 1809. Presumably, Legendre, who had Tweeted about the paper several times, was
peeved that Gauss failed to cite his arXiv preprint.
Matters of provenance aside, you might wonder - if Legendre and Gauss worked on linear regression, does
that mean there were the original deep learning researchers? And if linear regression doesn’t wholly belong
to deep learning, then why are we presenting a linear model as the first example in a tutorial series on neural
networks? Well it turns out that we can express linear regression as the simplest possible (useful) neural
network. A neural network is just a collection of nodes (aka neurons) connected by directed edges. In most
networks, we arrange the nodes into layers with each feeding its output into the layer above. To calculate the
value of any node, we first perform a weighted sum of the inputs (according to weights w) and then apply an
activation function. For linear regression, we only have two layers, one corresponding to the input (depicted
in orange) and a one-node layer (depicted in green) corresponding to the ouput. For the output node the
activation function is just the identity function.
While you certainly don’t have to view linear regression through the lens of deep learning, you can (and we
will!). To ground the concepts that we just discussed in code, let’s actually code up a neural network for
linear regression from scratch.
To get going, we will generate a simple synthetic dataset by sampling random data points X[i] and cor-
responding labels y[i] in the following manner. Our inputs will each be sampled from a random normal
distribution with mean 0 and variance 1. Our features will be independent. Another way of saying this is
that they will have diagonal covariance. The labels will be generated accoding to the true labeling func-
tion y[i] = 2 * X[i][0]- 3.4 * X[i][1] + 4.2 + noise where the noise is drawn from a
random gaussian with mean 0 and variance .01. We could express the labeling function in mathematical
notation as:
𝑦 = 𝑋 · 𝑤 + 𝑏 + 𝜂, for 𝜂 ∼ 𝒩 (0, 𝜎 2 )
In [25]: num_inputs = 2
num_outputs = 1
num_examples = 10000
def real_fn(X):
return 2 * X[:, 0] - 3.4 * X[:, 1] + 4.2
Notice that each row in X consists of a 2-dimensional data point and that each row in Y consists of a 1-
dimensional target value.
In [27]: print(X[0])
print(y[0])
[-1.22338355 2.39233518]
<NDArray 2 @cpu(0)>
[-6.09602737]
<NDArray 1 @cpu(0)>
Note that because our synthetic features X live on data_ctx and because our noise also lives on
data_ctx, the labels y, produced by combining X and noise in real_fn also live on data_ctx.
We can confirm that for any randomly chosen point, a linear combination with the (known) optimal param-
eters produces a prediction that is indeed close to the target value
In [28]: print(2 * X[0, 0] - 3.4 * X[0, 1] + 4.2)
[-6.38070679]
<NDArray 1 @cpu(0)>
We can visualize the correspondence between our second feature (X[:, 1]) and the target values Y by
generating a scatter plot with the Python plotting package matplotlib. Make sure that matplotlib
is installed. Otherwise, you may install it by running pip2 install matplotlib (for Python 2) or
pip3 install matplotlib (for Python 3) on your command line.
In order to plot with matplotlib we’ll just need to convert X and y into NumPy arrays by using the
.asnumpy() function.
In [29]: import matplotlib.pyplot as plt
plt.scatter(X[:, 1].asnumpy(),y.asnumpy())
plt.show()
Once we’ve initialized our DataLoader (train_data), we can easily fetch batches by iterating over
train_data just as if it were a Python list. You can use your favorite iterating techniques like fore-
ach loops: for data, label in train_data or enumerations: for i, (data, label) in
enumerate(train_data). First, let’s just grab one batch and break out of the loop.
In [31]: for i, (data, label) in enumerate(train_data):
print(data, label)
break
[[-0.14732301 -1.32803488]
[-0.56128627 0.48301753]
[ 0.75564283 -0.12659997]
[-0.96057719 -0.96254188]]
<NDArray 4x2 @cpu(0)>
[ 8.25711536 1.30587864 6.15542459 5.48825312]
<NDArray 4 @cpu(0)>
If we run that same code again you’ll notice that we get a different batch. That’s because we instructed the
DataLoader that shuffle=True.
In [32]: for i, (data, label) in enumerate(train_data):
print(data, label)
break
[[-0.59027743 -1.52694809]
[-0.00750104 2.68466949]
[ 1.50308061 0.54902577]
[ 1.69129586 0.32308948]]
<NDArray 4x2 @cpu(0)>
[ 8.28844357 -5.07566643 5.3666563 6.52408457]
<NDArray 4 @cpu(0)>
Finally, if we actually pass over the entire dataset, and count the number of batches, we’ll find that there are
2500 batches. We expect this because our dataset has 10,000 examples and we configured the DataLoader
with a batch size of 4.
In [33]: counter = 0
for i, (data, label) in enumerate(train_data):
pass
print(i+1)
2500
In the succeeding cells, we’re going to update these parameters to better fit our data. This will involve
taking the gradient (a multi-dimensional derivative) of some loss function with respect to the parameters.
We’ll update each parameter in the direction that reduces the loss. But first, let’s just allocate some memory
for each gradient.
In [35]: for param in params:
param.attach_grad()
3.8.7 Optimizer
It turns out that linear regression actually has a closed-form solution. However, most interesting models that
we’ll care about cannot be solved analytically. So we’ll solve this problem by stochastic gradient descent.
At each step, we’ll estimate the gradient of the loss with respect to our weights, using one batch randomly
drawn from our dataset. Then, we’ll update our parameters a small amount in the direction that reduces the
loss. The size of the step is determined by the learning rate lr.
In [38]: def SGD(params, lr):
for param in params:
param[:] = param - lr * param.grad
for e in range(epochs):
cumulative_loss = 0
# inner loop
for i, (data, label) in enumerate(train_data):
data = data.as_in_context(model_ctx)
label = label.as_in_context(model_ctx).reshape((-1, 1))
with autograd.record():
output = net(data)
loss = square_loss(output, label)
loss.backward()
SGD(params, learning_rate)
cumulative_loss += loss.asscalar()
print(cumulative_loss / num_batches)
24.6606138554
9.09776815639
3.36058844271
1.24549788469
0.465710770596
0.178157229481
0.0721970594548
0.0331197250206
0.0186954441286
0.0133724625537
############################################
# Script to plot the losses over time
############################################
def plot(losses, X, sample_size=100):
xs = list(range(len(losses)))
f, (fg1, fg2) = plt.subplots(1, 2)
fg1.set_title('Loss during training')
fg1.plot(xs, losses, '-r')
fg2.set_title('Estimated vs real function')
fg2.plot(X[:sample_size, 1].asnumpy(),
net(X[:sample_size, :]).asnumpy(), 'or', label='Estimated')
fg2.plot(X[:sample_size, 1].asnumpy(),
real_fn(X[:sample_size, :]).asnumpy(), '*g', label='Real')
fg2.legend()
plt.show()
learning_rate = .0001
losses = []
plot(losses, X)
for e in range(epochs):
cumulative_loss = 0
for i, (data, label) in enumerate(train_data):
data = data.as_in_context(model_ctx)
plot(losses, X)
3.8.10 Conclusion
You’ve seen that using just mxnet.ndarray and mxnet.autograd, we can build statistical models from scratch.
In the following tutorials, we’ll build on this foundation, introducing the basic ideas behind modern neural
networks and demonstrating the powerful abstractions in MXNet’s gluon package for building complex
models with little code.
3.8.11 Next
Linear regression with gluon
For whinges or inquiries, open an issue on GitHub.
def real_fn(X):
return 2 * X[:, 0] - 3.4 * X[:, 1] + 4.2
X = nd.random_normal(shape=(num_examples, num_inputs))
noise = 0.01 * nd.random_normal(shape=(num_examples,))
y = real_fn(X) + noise
That’s it! We’ve already got a neural network. Like our hand-crafted model in the previous notebook, this
model has a weight matrix and bias vector.
In [37]: print(net.weight)
print(net.bias)
Out[37]: Parameter dense4_weight (shape=(1, 2), dtype=None)
Parameter dense4_bias (shape=(1,), dtype=None)
Here, net.weight and net.bias are not actually NDArrays. They are instances of the Parameter
class. We use Parameter instead of directly accessing NDAarrays for several reasons. For example, they
provide convenient abstractions for initializing values. Unlike NDArrays, Parameters can be associated with
multiple contexts simultaneously. This will come in handy in future chapters when we start thinking about
distributed learning across multiple GPUs.
In gluon, all neural networks are made out of Blocks (gluon.Block). Blocks are just units that take
inputs and generate outputs. Blocks also contain parameters that we can update. Here, our network consists
of only one layer, so it’s convenient to access our parameters directly. When our networks consist of 10s of
layers, this won’t be so fun. No matter how complex our network, we can grab all its parameters by calling
collect_params() as follows:
In [38]: net.collect_params()
Out[38]: dense4_ (
Parameter dense4_weight (shape=(1, 2), dtype=None)
Parameter dense4_bias (shape=(1,), dtype=None)
)
Passing data through a gluon model is easy. We just sample a batch of the appropriate shape and call net
just as if it were a function. This will invoke net’s forward() method.
In [41]: example_data = nd.array([[4,7]])
net(example_data)
Out[41]:
[[-1.33219385]]
<NDArray 1x1 @cpu(0)>
[[-0.25217363 -0.04621419]]
<NDArray 1x2 @cpu(0)>
[ 0.]
<NDArray 1 @cpu(0)>
We’ll elaborate on this and more of gluon’s internal workings in subsequent chapters.
3.9.9 Optimizer
Instead of writing stochastic gradient descent from scratch every time, we can instantiate a gluon.
Trainer, passing it a dictionary of parameters. Note that the SGD optimizer in gluon also has a few
bells and whistles that you can turn on at will, including momentum and clipping (both are switched off by
default). These modifications can help to converge faster and we’ll discuss them later when we go over a
variety of optimization algorithms in detail.
In [45]: trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.0001})
descent. The benefits of relying on gluon’s abstractions will grow substantially once we start working with
much more complex models. But once we have all the basic pieces in place, the training loop itself is quite
similar to what we would do if implementing everything from scratch.
To refresh your memory. For some number of epochs, we’ll make a complete pass over the dataset
(train_data), grabbing one mini-batch of inputs and the corresponding ground-truth labels at a time.
Then, for each batch, we’ll go through the following ritual. So that this process becomes maximally ritual-
istic, we’ll repeat it verbatim:
• Generate predictions (yhat) and the loss (loss) by executing a forward pass through the network.
• Calculate gradients by making a backwards pass through the network via loss.backward().
• Update the model parameters by invoking our SGD optimizer (note that we need not tell trainer.
step about which parameters but rather just the amount of data, since we already performed that in
the initialization of trainer).
In [46]: epochs = 10
loss_sequence = []
num_batches = num_examples / batch_size
for e in range(epochs):
cumulative_loss = 0
# inner loop
for i, (data, label) in enumerate(train_data):
data = data.as_in_context(model_ctx)
label = label.as_in_context(model_ctx)
with autograd.record():
output = net(data)
loss = square_loss(output, label)
loss.backward()
trainer.step(batch_size)
cumulative_loss += nd.mean(loss).asscalar()
print("Epoch %s, loss: %s" % (e, cumulative_loss / num_examples))
loss_sequence.append(cumulative_loss)
import matplotlib
import matplotlib.pyplot as plt
plt.figure(num=None,figsize=(8, 6))
plt.plot(loss_sequence)
As we can see, the loss function converges quickly to the optimal solution.
3.9.13 Conclusion
As you can see, even for a simple example like linear regression, gluon can help you to write quick and
clean code. Next, we’ll repeat this exercise for multi-layer perceptrons, extending these lessons to deep
neural networks and (comparatively) real datasets.
3.9.14 Next
Binary classification with logistic regression
For whinges or inquiries, open an issue on GitHub.
chapter02_supervised-learning/../img/linear-separato
With neural networks, we usually approach the problem differently. Instead of just trying to separate the
points, we train a probabilistic classifier which estimates, for each data point, the conditional probability
that it belongs to the positive class.
Recall that in linear regression, we made predictions of the form
𝑦ˆ = 𝑤𝑇 𝑥 + 𝑏.
We are interested in asking the question “what is the probability that example :math:‘x‘ belongs to the
positive class?” A regular linear model is a poor choice here because it can output values greater than 1 or
less than 0. To coerce reasonable answers from our model, we’re going to modify it slightly, by running the
linear function through a sigmoid activation function 𝜎:
𝑦ˆ = 𝜎(𝑤𝑇 𝑥 + 𝑏).
The sigmoid function 𝜎, sometimes called a squashing function or a logistic function - thus the name logistic
regression - maps a real-valued input to the range 0 to 1. Specifically, it has the functional form:
1
𝜎(𝑧) =
1 + 𝑒−𝑧
Let’s get our imports out of the way and visualize the logistic function using mxnet and matplotlib.
In [ ]: import mxnet as mx
from mxnet import nd, autograd, gluon
import matplotlib.pyplot as plt
def logistic(z):
return 1. / (1. + nd.exp(-z))
x = nd.arange(-5, 5, .1)
y = logistic(x)
plt.plot(x.asnumpy(),y.asnumpy())
plt.show()
Because the sigmoid outputs a value between 0 and 1, it’s more reasonable to think of it as a probability.
Note that an input of 0 gives a value of .5. So in the common case, where we want to predict positive
whenever the probability is greater than .5 and negative whenever the probability is less than .5, we can just
look at the sign of 𝑤𝑇 𝑥 + 𝑏.
Since now we’re thinking about outputing probabilities, one natural objective is to say that we should choose
the weights that give the actual labels in the training data the highest probability.
Because each example is independent of the others, and each label depends only on the features of the
corresponding examples, we can rewrite the above as
This function is a product over the examples, but in general, because we want to train by stochastic gradient
descent, it’s a lot easier to work with a loss function that breaks down as a sum over the training examples.
Because we typically express our objective as a loss we can just flip the sign, giving us the negative log
probability:
(︃ 𝑛 )︃
∑︁
min − log 𝑃𝜃 (𝑦𝑖 |𝑥𝑖 )
𝜃
𝑖=1
If we interpret 𝑦ˆ𝑖 as the probability that the 𝑖-th example belongs to the positive class (i.e 𝑦𝑖 = 1), then 1 − 𝑦ˆ𝑖
is the probability that the 𝑖-th example belongs to the negative class (i.e 𝑦𝑖 = 0). This is equivalent to saying
{︃
𝑦ˆ𝑖 , if 𝑦𝑖 = 1
𝑃𝜃 (𝑦𝑖 |𝑥𝑖 ) =
1 − 𝑦ˆ𝑖 , if 𝑦𝑖 = 0
If you’re learning machine learning for the first time, that might have been too much information too quickly.
Let’s take a look at this loss function and break down what’s going on more slowly. The loss function consists
of two terms, 𝑦𝑖 log 𝑦ˆ𝑖 and (1 − 𝑦𝑖 ) log(1 − 𝑦ˆ𝑖 ). Because 𝑦𝑖 only takes values 0 or 1, for a given data point,
one of these terms disappears. When 𝑦𝑖 is 1, this loss says that we should maximize log 𝑦ˆ𝑖 , giving higher
probability to the correct answer. When 𝑦𝑖 is 0, this loss function takes value log(1 − 𝑦ˆ𝑖 ). That says that we
should maximize the value 1 − 𝑦ˆ which we already know is the probability assigned to 𝑥𝑖 belonging to the
negative class.
Note that this loss function is commonly called log loss and is also commonly referred to as binary cross
entropy. It is a special case of negative log likelihood. And it is a special case of cross-entropy, which can
apply to the multi-class (> 2) setting.
While for linear regression, we demonstrated a completely different implementation from scratch and with
‘‘gluon‘‘, here we’re going to demonstrate how we can mix and match the two. We’ll use gluon for our
modeling, but we’ll write our loss function from scratch.
3.10.2 Data
As usual, we’ll want to work out these concepts using a real dataset. This time around, we’ll use the
Adult dataset taken from the UCI repository. The dataset was constructed by Barry Becker from 1994
census data. In its original form, the dataset contained 14 features, including age, education, occupation,
sex, native-country, among others. In this version, hosted by National Taiwan University, the data have
been re-processed to 123 binary features each representing quantiles among the original features. The label
is a binary indicator indicating whether the person corresponding to each row made more (𝑦𝑖 = 1) or less
(𝑦𝑖 = 0) than $50,000 of income in 1994. The dataset we’re working with contains 30,956 training examples
and 1,605 examples set aside for testing. We can read the datasets into main memory like so:
In [ ]: data_ctx = mx.cpu()
# Change this to `mx.gpu(0) if you would like to train on an NVIDIA GPU
model_ctx = mx.cpu()
with open("../data/adult/a1a.train") as f:
train_raw = f.read()
with open("../data/adult/a1a.test") as f:
test_raw = f.read()
We can now verify that our data arrays have the right shapes.
In [ ]: print(Xtrain.shape)
print(Ytrain.shape)
print(Xtest.shape)
print(Ytest.shape)
We can also check the fraction of positive examples in our training and test sets. This will give us one
nice (necessary but insufficient) sanity check that our training and test data really are drawn from the same
distribution.
In [ ]: print(nd.sum(Ytrain)/len(Ytrain))
print(nd.sum(Ytest)/len(Ytest))
for e in range(epochs):
cumulative_loss = 0
for i, (data, label) in enumerate(train_data):
data = data.as_in_context(model_ctx)
label = label.as_in_context(model_ctx)
with autograd.record():
output = net(data)
loss = log_loss(output, label)
loss.backward()
trainer.step(batch_size)
cumulative_loss += nd.sum(loss).asscalar()
print("Epoch %s, loss: %s" % (e, cumulative_loss ))
loss_sequence.append(cumulative_loss)
import matplotlib
import matplotlib.pyplot as plt
plt.figure(num=None,figsize=(8, 6))
plt.plot(loss_sequence)
This isn’t too bad! A naive classifier would predict that nobody had an income greater than $50k (the
majority class). This classifier would achieve an accuracy of roughly 75%. By contrast, our classifier gets
an accuracy of .84 (results may vary a small amount on each run owing to random initializations and random
sampling of the batches).
By now you should have some feeling for the two most fundamental tasks in supervised learning: regression
and classification. In the following chapters we’ll go deeper into these problems, exploring more complex
models, loss functions, optimizers, and training schemes. We’ll also look at more interesting datasets. And
finally, in the following chapters we’ll also look more advanced problems where we want, for example, to
predict more structured objects.
3.10.9 Next:
Softmax regression from scratch
For whinges or inquiries, open an issue on GitHub.
You also know how to define a loss function, construct a model, and write your own optimizer. Nearly
all neural networks that we’ll build in the real world consist of these same fundamental parts. The main
differences will be the type and scale of the data and the complexity of the models. And every year or two,
a new hipster optimizer comes around, but at their core they’re all subtle variations of stochastic gradient
descent.
In the previous chapter, we introduced logistic regression, a classic algorithm for performing binary classi-
fication. We implemented a model
𝑦ˆ = 𝜎(𝑥𝑤𝑇 + 𝑏)
𝑤ℎ𝑒𝑟𝑒 : 𝑚𝑎𝑡ℎ : ‘𝜎‘𝑖𝑠𝑡ℎ𝑒𝑠𝑖𝑔𝑚𝑜𝑖𝑑𝑠𝑞𝑢𝑎𝑠ℎ𝑖𝑛𝑔𝑓 𝑢𝑛𝑐𝑡𝑖𝑜𝑛.
This activation function on the final layer was crucial because it forced our outputs to take values in the
range [0,1]. That allowed us to interpret these outputs as probabilties. We then updated our parameters to
give the true labels (which take values either 1 or 0) the highest probability. In that tutorial, we looked at
predicting whether or not an individual’s income exceeded $50k based on features available in 1994 census
data.
Binary classification is quite useful. We can use it to predict spam vs. not spam or cancer vs not cancer.
But not every problem fits the mold of binary classification. Sometimes we encounter a problem where each
example could belong to one of 𝑘 classes. For example, a photograph might depict a cat or a dog or a zebra
or . . . (you get the point). Given 𝑘 classes, the most naive way to solve a multiclass classification problem
is to train 𝑘 different binary classifiers 𝑓𝑖 (𝑥). We could then predict that an example 𝑥 belongs to the class
𝑖 for which the probability that the label applies is highest:
max 𝑓𝑖 (𝑥)
𝑖
There’s a smarter way to go about this. We could force the output layer to be a discrete probability distri-
bution over the 𝑘 classes. To be a valid probability distribution, we’ll want the output 𝑦ˆ to (i) contain only
non-negative values, and (ii) sum to 1. We accomplish this by using the softmax function. Given an input
vector 𝑧, softmax does two things. First, it exponentiates (elementwise) 𝑒𝑧 , forcing all values to be strictly
positive. Then it normalizes so that all values sum to 1. Following the softmax operation computes the
following
𝑒𝑧
softmax(𝑧) = ∑︀𝑘
𝑧𝑖
𝑖=1 𝑒
Because now we have 𝑘 outputs and not 1 we’ll need weights connecting each of our inputs to each of our
outputs. Graphically, the network looks something like this:
We can represent these weights one for each input node, output node pair in a matrix 𝑊 . We generate the
linear mapping from inputs to outputs via a matrix-vector product 𝑥𝑊 + 𝑏. Note that the bias term is now
a vector, with one component for each output node. The whole model, including the activation function can
be written:
𝑦ˆ = softmax(𝑥𝑊 + 𝑏)
This model is sometimes called multiclass logistic regression. Other common names for it include softmax
regression and multinomial regression. For these concepts to sink in, let’s actually implement softmax re-
gression, and pick a slightly more interesting dataset this time. We’re going to classify images of handwritten
𝑧 = 𝑥 𝑊 + 𝑏
1×𝑘 1×𝑑 𝑑×𝑘 1×𝑘
Often we would one-hot encode the output label, for example 𝑦ˆ = 5 would be 𝑦 ^𝑜𝑛𝑒−ℎ𝑜𝑡 =
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0] when one-hot encoded for a 10-class classfication problem. So 𝑦ˆ = softmax(𝑧)
becomes
^𝑜𝑛𝑒−ℎ𝑜𝑡 = softmax𝑜𝑛𝑒−ℎ𝑜𝑡 ( 𝑧 )
𝑦
1×𝑘 1×𝑘
When we input a batch of 𝑚 training examples, we would have matrix 𝑋 that is the vertical stacking of
𝑚×𝑑
individual training examples 𝑥𝑖 , due to the choice of using row vectors.
⎡ ⎤ ⎡ ⎤
𝑥1 𝑥11 𝑥12 𝑥13 . . . 𝑥1𝑑
⎢ 𝑥2 ⎥ ⎢ 𝑥21 𝑥22 𝑥23 . . . 𝑥2𝑑 ⎥
𝑋=⎢ . ⎥=⎢ .
⎢ ⎥ ⎢ ⎥
.. .. .. .. ⎥
⎣ .. ⎦ ⎣ .. . . . . ⎦
𝑥𝑚 𝑥𝑚1 𝑥𝑚2 𝑥𝑚3 . . . 𝑥𝑚𝑑
𝑌 = softmax(𝑍) = softmax(𝑋𝑊 + 𝐵)
In actual implementation we can often get away with using 𝑏 directly instead of 𝐵 in the equation for 𝑍
above, due to broadcasting.
Each row of matrix 𝑍 corresponds to one training example. The softmax function operates on each row
𝑚×𝑘
of matrix 𝑍 and returns a matrix 𝑌 , each row of which corresponds to the one-hot encoded prediction of
𝑚×𝑘
one training example.
3.11.2 Imports
To start, let’s import the usual libraries.
In [ ]: from __future__ import print_function
import numpy as np
import mxnet as mx
from mxnet import nd, autograd, gluon
mx.random.seed(1)
There are two parts of the dataset for training and testing. Each part has N items and each item is a tuple of
an image and a label:
Note that each image has been formatted as a 3-tuple (height, width, channel). For color images, the channel
would have 3 dimensions (red, green and blue).
Machine learning libraries generally expect to find images in (batch, channel, height, width) format. How-
ever, most libraries for visualization prefer (height, width, channel). Let’s transpose our image into the
expected shape. In this case, matplotlib expects either (height, width) or (height, width, channel) with RGB
channels, so let’s broadcast our single channel to 3.
In [ ]: im = mx.nd.tile(image, (1,1,3))
print(im.shape)
Now we can visualize our image and make sure that our data and labels line up.
In [ ]: import matplotlib.pyplot as plt
plt.imshow(im.asnumpy())
plt.show()
We’re also going to want to load up an iterator with test data. After we train on the training dataset we’re
going to want to test our model on the test data. Otherwise, for all we know, our model could be doing
something stupid (or treacherous?) like memorizing the training examples and regurgitating the labels on
command.
In [ ]: test_data = mx.gluon.data.DataLoader(mnist_test, batch_size, shuffle=False)
We’ll also want to allocate one offset for each of the outputs. We call these offsets the bias term and collect
them in the 10-dimensional array b.
In [ ]: W = nd.random_normal(shape=(num_inputs, num_outputs),ctx=model_ctx)
b = nd.random_normal(shape=num_outputs,ctx=model_ctx)
params = [W, b]
As before, we need to let MXNet know that we’ll be expecting gradients corresponding to each of these
parameters during training.
In [ ]: for param in params:
param.attach_grad()
The relevant loss function here is called cross-entropy and it may be the most common loss function you’ll
find in all of deep learning. That’s because at the moment, classification problems tend to be far more
abundant than regression problems.
The basic idea is that we’re going to take a target Y that has been formatted as a one-hot vector, meaning
one value corresponding to the correct label is set to 1 and the others are set to 0, e.g. [0, 1, 0, 0, 0,
0, 0, 0, 0, 0].
The basic idea of cross-entropy loss is that we only care about how much probability the prediction assigned
to the correct label. In other words, for true label 2, we only care about the component of yhat corresponding
to 2. Cross-entropy attempts to maximize the log-likelihood given to the correct labels.
In [ ]: def cross_entropy(yhat, y):
return - nd.sum(y * nd.log(yhat+1e-6))
3.11.11 Optimizer
For this example we’ll be using the same stochastic gradient descent (SGD) optimizer as last time.
In [ ]: def SGD(params, lr):
for param in params:
param[:] = param - lr * param.grad
Because we initialized our model randomly, and because roughly one tenth of all examples belong to each
of the ten classes, we should have an accuracy in the ball park of .10.
In [ ]: evaluate_accuracy(test_data, net)
for e in range(epochs):
cumulative_loss = 0
plt.imshow(imtiles.asnumpy())
plt.show()
pred=model_predict(net,data.reshape((-1,784)))
print('model predictions are:', pred)
break
3.11.15 Conclusion
Jeepers. We can get nearly 90% accuracy at this task just by training a linear model for a few seconds! You
might reasonably conclude that this problem is too easy to be taken seriously by experts.
But until recently, many papers (Google Scholar says 13,800) were published using results obtained on this
data. Even this year, I reviewed a paper whose primary achievement was an (imagined) improvement in
performance. While MNIST can be a nice toy dataset for testing new ideas, we don’t recommend writing
papers with it.
3.11.16 Next
Softmax regression with gluon
We’re also going to want to load up an iterator with test data. After we train on the training dataset we’re
going to want to test our model on the test data. Otherwise, for all we know, our model could be doing
something stupid (or treacherous?) like memorizing the training examples and regurgitating the labels on
command.
3.12.6 Optimizer
And let’s instantiate an optimizer to make our updates
In [58]: trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.1})
Because we initialized our model randomly, and because roughly one tenth of all examples belong to each
of the ten classes, we should have an accuracy in the ball park of .10.
In [60]: evaluate_accuracy(test_data, net)
Out[60]: 0.1154
for e in range(epochs):
cumulative_loss = 0
for i, (data, label) in enumerate(train_data):
data = data.as_in_context(model_ctx).reshape((-1,784))
label = label.as_in_context(model_ctx)
with autograd.record():
output = net(data)
loss = softmax_cross_entropy(output, label)
loss.backward()
trainer.step(batch_size)
cumulative_loss += nd.sum(loss).asscalar()
def model_predict(net,data):
output = net(data.as_in_context(model_ctx))
return nd.argmax(output, axis=1)
plt.imshow(imtiles.asnumpy())
plt.show()
pred=model_predict(net,data.reshape((-1,784)))
print('model predictions are:', pred)
break
3.12.10 Next
Overfitting and regularization from scratch
For whinges or inquiries, open an issue on GitHub.
facts is what statisticians view as complex, whereas one that has only a limited expressive power but still
manages to explain the data well is probably closer to the truth. In philosophy this is closely related to
Popper’s criterion of falsifiability of a scientific theory: a theory is good if it fits data and if there are specific
tests which can be used to disprove it. This is important since all statistical estimation is post hoc, i.e. we
estimate after we observe the facts, hence vulnerable to the associated fallacy. Ok, enough of philosophy,
let’s get to more tangible issues.
To give you some intuition in this chapter, we’ll focus on a few factors that tend to influence the generaliz-
ability of a model class:
1. The number of tunable parameters. When the number of tunable parameters, sometimes denoted
as the number of degrees of freedom, is large, models tend to be more susceptible to overfitting.
2. The values taken by the parameters. When weights can take a wider range of values, models can
be more susceptible to over fitting.
3. The number of training examples. It’s trivially easy to overfit a dataset containing only one or two
examples even if your model is simple. But overfitting a dataset with millions of examples requires
an extremely flexible model.
When classifying handwritten digits before, we didn’t overfit because our 60,000 training examples far out
numbered the 784 × 10 = 7, 840 weights plus 10 bias terms, which gave us far fewer parameters than
training examples. Let’s see how things can go wrong. We begin with our import ritual.
In [ ]: from __future__ import print_function
import mxnet as mx
import mxnet.ndarray as nd
from mxnet import autograd
import numpy as np
ctx = mx.cpu()
mx.random.seed(1)
params = [W, b]
def net(X):
y_linear = nd.dot(X, W) + b
yhat = nd.softmax(y_linear, axis=1)
return yhat
f = plt.figure(figsize=(12,6))
fg1 = f.add_subplot(121)
fg2 = f.add_subplot(122)
fg1.set_xlabel('epoch',fontsize=14)
fg1.set_title('Comparing loss functions')
fg1.semilogy(xs, loss_tr)
fg1.semilogy(xs, loss_ts)
fg1.grid(True,which="both")
fg2.set_title('Comparing accuracy')
fg1.set_xlabel('epoch',fontsize=14)
fg2.plot(xs, acc_tr)
fg2.plot(xs, acc_ts)
fg2.grid(True,which="both")
fg2.legend(['training accuracy', 'testing accuracy'],fontsize=14)
loss_seq_train = []
loss_seq_test = []
acc_seq_train = []
acc_seq_test = []
for e in range(epochs):
for i, (data, label) in enumerate(train_data):
data = data.as_in_context(ctx).reshape((-1,784))
label = label.as_in_context(ctx)
label_one_hot = nd.one_hot(label, 10)
with autograd.record():
output = net(data)
loss = cross_entropy(output, label_one_hot)
loss.backward()
SGD(params, .001)
##########################
# Keep a moving average of the losses
##########################
niter +=1
moving_loss = .99 * moving_loss + .01 * nd.mean(loss).asscalar()
est_loss = moving_loss/(1-0.99**niter)
if e % 100 == 99:
print("Completed epoch %s. Train Loss: %s, Test Loss %s, Train_acc %s, Test
(e+1, train_loss, test_loss, train_accuracy, test_accuracy))
3.13.8 Regularization
Now that we’ve characterized the problem of overfitting, we can begin talking about some solutions. Broadly
speaking the family of techniques geared towards mitigating overfitting are referred to as regularization. The
core idea is this: when a model is overfitting, its training error is substantially lower than its test error. We’re
already doing as well as we possibly can on the training data, but our test data performance leaves something
to be desired. Typically, regularization techniques attempt to trade off our training performance in exchange
for lowering our test error.
There are several straightforward techniques we might employ. Given the intuition from the previous chart,
we might attempt to make our model less complex. One way to do this would be to lower the number
of free parameters. For example, we could throw away some subset of our input features (and thus the
corresponding parameters) that we thought were least informative.
Another approach is to limit the values that our weights might take. One common approach is to force
the weights to take small values. [give more intuition with example of polynomial curve fitting] We can
accomplish this by changing our optimization objective to penalize the value of our weights. The most
popular regularizer is the ℓ22 norm. For linear models, ℓ22 regularization has the additional benefit that it
makes the solution unique, even when our model is overparameterized.
∑︁
𝑦 − 𝑦)2 + 𝜆‖w‖22
(ˆ
𝑖
Here, ‖w‖ is the ℓ22 norm and 𝜆 is a hyper-parameter that determines how aggressively we want to push the
weights towards 0. In code, we can express the ℓ22 penalty succinctly:
In [ ]: def l2_penalty(params):
penalty = nd.zeros(shape=1)
for param in params:
penalty = penalty + nd.sum(param ** 2)
return penalty
l2_strength = .1
niter=0
loss_seq_train = []
loss_seq_test = []
acc_seq_train = []
acc_seq_test = []
for e in range(epochs):
for i, (data, label) in enumerate(train_data):
data = data.as_in_context(ctx).reshape((-1,784))
label = label.as_in_context(ctx)
label_one_hot = nd.one_hot(label, 10)
with autograd.record():
output = net(data)
loss = nd.sum(cross_entropy(output, label_one_hot)) + l2_strength * l2_
loss.backward()
SGD(params, .001)
##########################
# Keep a moving average of the losses
##########################
niter +=1
moving_loss = .99 * moving_loss + .01 * nd.mean(loss).asscalar()
est_loss = moving_loss/(1-0.99**niter)
if e % 100 == 99:
print("Completed epoch %s. Train Loss: %s, Test Loss %s, Train_acc %s, Test
(e+1, train_loss, test_loss, train_accuracy, test_accuracy))
3.13.11 Analysis
By adding 𝐿2 regularization we were able to increase the performance on test data from 75% accuracy to
83% accuracy. That’s a 32% reduction in error. In a lot of applications, this big an improvement can make
the difference between a viable product and useless system. Note that L2 regularization is just one of many
ways of controlling capacity. Basically we assumed that small weight values are good. But there are many
more ways to constrain the values of the weights:
• We could require
∑︀ that the total sum of the weights is small. That is what 𝐿1 regularization does via
the penalty 𝑖 |𝑤𝑖 |.
• We could require that the largest weight is not too large. This is what 𝐿∞ regularization does via the
penalty max𝑖 |𝑤𝑖 |.
• We could require that the number of nonzero
∑︀ weights is small, i.e. that the weight vectors are sparse.
This is what the 𝐿0 penalty does, i.e. 𝑖 𝐼{𝑤𝑖 ̸= 0}. This penalty is quite difficult to deal with
explicitly since it is nonsmooth. There is a lot of research that shows how to solve this problem
approximately using an 𝐿1 penalty.
From left to right: 𝐿2 regularization, which constrains the parameters to a ball, 𝐿1 regularization, which
constrains the parameters to a diamond (for lack of a better name, this is often referred to as an 𝐿1 -ball), and
𝐿∞ regularization, which constrains the parameters to a hypercube.
All of this raises the question of why regularization is any good. After all, choice is good and giving
our model more flexibility ought to be better (e.g. there are plenty of papers which show improvements
on ImageNet using deeper networks). What is happening is somewhat more subtle. Allowing for many
different parameter values allows our model to cherry pick a combination that is just right for all the training
data it sees, without really learning the underlying mechanism. Since our observations are likely noisy,
this means that we are trying to approximate the errors at least as much as we’re learning what the relation
between data and labels actually is. There is an entire field of statistics devoted to this issue - Statistical
Learning Theory. For now, a few simple rules of thumb suffice:
• Fewer parameters tend to be better than more parameters.
• Better engineering for a specific problem that takes the actual problem into account will lead to better
models, due to the prior knowledge that data scientists have about the problem at hand.
• 𝐿2 is easier to optimize for than 𝐿1 . In particular, many optimizers will not work well out of the box
for 𝐿1 . Using the latter requires something called proximal operators.
• Dropout and other methods to make the model robust to perturbations in the data often work better
than off-the-shelf 𝐿2 regularization.
We conclude with an XKCD Cartoon which captures the entire situation more succinctly than the proceeding
paragraph.
3.13.12 Next
Overfitting and regularization with gluon
3.14.5 Optimizer
By default gluon tries to keep the coefficients from diverging by using a weight decay penalty. So, to get
the real overfitting experience we need to switch it off. We do this by passing 'wd': 0.0' when we
instantiate the trainer.
In [ ]: trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.01, 'wd':
f = plt.figure(figsize=(12,6))
fg1 = f.add_subplot(121)
fg2 = f.add_subplot(122)
fg1.set_xlabel('epoch',fontsize=14)
fg1.set_title('Comparing loss functions')
fg1.semilogy(xs, loss_tr)
fg1.semilogy(xs, loss_ts)
fg1.grid(True,which="both")
fg2.set_title('Comparing accuracy')
fg1.set_xlabel('epoch',fontsize=14)
fg2.plot(xs, acc_tr)
fg2.plot(xs, acc_ts)
fg2.grid(True,which="both")
fg2.legend(['training accuracy', 'testing accuracy'],fontsize=14)
plt.show()
loss_seq_train = []
loss_seq_test = []
acc_seq_train = []
acc_seq_test = []
for e in range(epochs):
for i, (data, label) in enumerate(train_data):
data = data.as_in_context(ctx).reshape((-1,784))
label = label.as_in_context(ctx)
with autograd.record():
output = net(data)
cross_entropy = loss(output, label)
cross_entropy.backward()
trainer.step(data.shape[0])
##########################
# Keep a moving average of the losses
##########################
niter +=1
moving_loss = .99 * moving_loss + .01 * nd.mean(cross_entropy).asscalar()
est_loss = moving_loss/(1-0.99**niter)
if e % 20 == 0:
print("Completed epoch %s. Train Loss: %s, Test Loss %s, Train_acc %s, Test
(e+1, train_loss, test_loss, train_accuracy, test_accuracy))
3.14.8 Regularization
Now let’s see what this mysterious weight decay is all about. We begin with a bit of math. When we add an
L2 penalty to the weights we are effectively adding 𝜆2 ‖𝑤‖2 to the loss. Hence, every time we compute the
gradient it gets an additional 𝜆𝑤 term that is added to 𝑔𝑡 , since this is the very derivative of the L2 penalty.
As a result we end up taking a descent step not in the direction −𝜂𝑔𝑡 but rather in the direction −𝜂(𝑔𝑡 + 𝜆𝑤).
This effectively shrinks 𝑤 at each step by 𝜂𝜆𝑤, thus the name weight decay. To make this work in practice
we just need to set the weight decay to something nonzero.
In [ ]: net.collect_params().initialize(mx.init.Xavier(magnitude=2.24), ctx=ctx, force_rein
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.01, 'wd':
moving_loss = 0.
niter=0
loss_seq_train = []
loss_seq_test = []
acc_seq_train = []
acc_seq_test = []
for e in range(epochs):
for i, (data, label) in enumerate(train_data):
data = data.as_in_context(ctx).reshape((-1,784))
label = label.as_in_context(ctx)
with autograd.record():
output = net(data)
cross_entropy = loss(output, label)
cross_entropy.backward()
trainer.step(data.shape[0])
##########################
# Keep a moving average of the losses
##########################
niter +=1
moving_loss = .99 * moving_loss + .01 * nd.mean(cross_entropy).asscalar()
est_loss = moving_loss/(1-0.99**niter)
if e % 20 == 0:
print("Completed epoch %s. Train Loss: %s, Test Loss %s, Train_acc %s, Test
(e+1, train_loss, test_loss, train_accuracy, test_accuracy))
As we can see, the test accuracy improves a bit. Note that the amount by which it improves actually depends
on the amount of weight decay. We recommend that you try and experiment with different extents of weight
decay. For instance, a larger weight decay (e.g. 0.01) will lead to inferior performance, one that’s larger still
(0.1) will lead to terrible results. This is one of the reasons why tuning parameters is quite so important in
getting good experimental results in practice.
3.14.9 Next
Learning environments
For whinges or inquiries, open an issue on GitHub.
algorithms - the Perceptron. After that, we’ll give a simple convergence proof for SGD. This chapter is not
really needed for practitioners but will help to understand why the algorithms that we use are working at all.
In [1]: import mxnet as mx
from mxnet import nd, autograd
import matplotlib.pyplot as plt
import numpy as np
mx.random.seed(1)
# making some linearly separable data, simply by chosing the labels accordingly
X = nd.zeros(shape=(samples, dimensions))
Y = nd.zeros(shape=(samples))
i = 0
while (i < samples):
tmp = nd.random_normal(shape=(1,dimensions))
margin = nd.dot(tmp, wfake) + bfake
if (nd.norm(tmp).asscalar() < 3) & (abs(margin.asscalar()) > epsilon):
X[i,:] = tmp[0]
Y[i] = 1 if margin.asscalar() > 0 else -1
i += 1
return X, Y
vv = nd.dot(zz,w) + d
CS = plt.contour(xgrid,ygrid,vv.asnumpy())
plt.clabel(CS, inline=1, fontsize=10)
X, Y = getfake(50, 2, 0.3)
plotdata(X,Y)
plt.show()
Now we are going to use the simplest possible algorithm to learn parameters. It’s inspired by the Hebbian
Learning Rule which suggests that positive events should be reinforced and negative ones diminished. The
analysis of the algorithm is due to Rosenblatt and we will give a detailed proof of it after illustrating how it
works. In a nutshell, after initializing parameters 𝑤 = 0 and 𝑏 = 0 it updates them by 𝑦𝑥 and 𝑦 respectively
to ensure that they are properly aligned with the data. Let’s see how well it works:
In [3]: def perceptron(w,b,x,y):
if (y * (nd.dot(w,x) + b)).asscalar() <= 0:
w += y * x
b += y
return 1
else:
return 0
w = nd.zeros(shape=(2))
b = nd.zeros(shape=(1))
for (x,y) in zip(X,Y):
res = perceptron(w,b,x,y)
if (res == 1):
print('Encountered an error and updated parameters')
print('data {}, label {}'.format(x.asnumpy(),y.asscalar()))
print('weight {}, bias {}'.format(w.asnumpy(),b.asscalar()))
plotscore(w,b)
plotdata(X,Y)
As we can see, the model has learned something - all the red dots are positive and all the blue dots correspond
to a negative value. Moreover, we saw that the values for 𝑤⊤ 𝑥 + 𝑏 became more extreme as values over the
grid. Did we just get lucky in terms of classification or is there any math behind it? Obviously there is, and
there’s actually a nice theorem to go with this. It’s the perceptron convergence theorem.
for j in range(10):
for (i,epsilon) in enumerate(Eps):
X, Y = getfake(1000, 2, epsilon)
As we can see, the number of errors (and with it, updates) decreases inversely with the width of the margin.
Let’s see whether we can put this into equations. The first thing to consider is the size of the inner product
between (𝑤, 𝑏) and (𝑤* , 𝑏* ), the parameter that solves the classification problem with margin 𝜖. Note that
we do not need explicit knowledge of (𝑤* , 𝑏* ) for this, just know about its existence. For convenience, we
will index 𝑤 and 𝑏 by 𝑡, the number of updates on the parameters. Moreover, whenever convenient we will
treat (𝑤, 𝑏) as a new vector with an extra dimension and with the appropriate terms such as norms ‖(𝑤, 𝑏)‖
and inner products.
First off, 𝑤0⊤ 𝑤* + 𝑏0 𝑏* = 0 by construction. Second, by the update rule we have that
𝑡𝑜
(︁ )︁
(𝑤𝑡+1 , 𝑏𝑡+1 )⊤ (𝑤* , 𝑏* ) =(𝑤𝑡 , 𝑏𝑡 )⊤ (𝑤* , 𝑏* ) + 𝑦𝑡 𝑥⊤ *
𝑡 𝑤 +𝑏
*
≥(𝑤𝑡 , 𝑏𝑡 )⊤ (𝑤* , 𝑏* ) + 𝜖
≥(𝑡 + 1)𝜖(3.1)
(𝑤𝑡 , 𝑏𝑡 )⊤ (𝑤* , 𝑏* ) + 𝜖 ≥
Here the first equality follows from the definition of the weight updates. The next inequality follows from the
fact that (𝑤* , 𝑏* ) separate the problem with margin at least 𝜖, and the last inequality is simply a consequence
of iterating this inequality 𝑡 + 1 times. Growing alignment between the ‘ideal’ and the actual weight vectors
is great, but only if the actual weight vectors don’t grow too rapidly. So we need a bound on their length:
𝑡𝑜
≤‖(𝑤𝑡 , 𝑏𝑡 )‖2 + 𝑅2 + 1
≤(𝑡 + 1)(𝑅2 + 1)(3.1)
(︁ )︁
‖(𝑤𝑡 , 𝑏𝑡 )‖2 + 2𝑦𝑡 𝑥⊤𝑡 𝑤 𝑡 + 𝑏𝑡 + ‖(𝑥𝑡 , 1)‖2 ≤
(𝑡 + 1)(𝑅2 + 1)
This is a strange equation - we have a linear term on the left and a sublinear term on the right. So this
inequality clearly cannot hold indefinitely for large 𝑡. The only logical conclusion is that there must never
be updates beyond when the inequality is no longer satisfied. We have 𝑡 ≤ 2(𝑅2 + 1)/𝜖2 , which proves our
claim.
Note - sometimes the perceptron convergence theorem is written without bias 𝑏. In this case a lot of things
get simplified both in the proof and in the bound, since we can do away with the constant terms. Without
going through details, the theorem becomes 𝑡 ≤ 𝑅2 /𝜖2 .
Note - the perceptron convergence proof crucially relied on the fact that the data is actually separable. If
this is not the case, the perceptron algorithm will diverge. It will simply keep on trying to get things right
by updating (𝑤, 𝑏). And since it has no safeguard to keep the parameters bounded, the solution will get
worse. This sounds like an ‘academic’ concern, alas it is not. The avatar in the computer game [Black and
White](https://en.wikipedia.org/wiki/Black_%26_White_(video_game%29) used the perceptron to adjust
the avatar. Due to the poorly implemented update rule the game quickly became unplayable after a few
hours (as one of the authors can confirm).
More generally, a stochastic gradient descent algorithm uses the following template:
initialize w
loop over data and labels (x,y):
compute f(x)
compute loss gradient g = partial_w l(y, f(x))
w = w - eta g
Here the learning rate 𝜂 may well change as we iterate over the data. Moreover, we may traverse the data
in nonlinear order (e.g. we might reshuffle the data), depending on the specific choices of the algorithm.
The issue is that as we go over the data, sometimes the gradient might point us into the right direction and
sometimes it might not. Intuitively, on average things should get better. But to be really sure, there’s only
one way to find out - we need to prove it. We pick a simple and elegant (albeit a bit restrictive) proof of
Nesterov and Vial.
The situation we consider are convex losses. This is a bit restrictive in the age of deep networks but still
quite instructive (in addition to that, nonconvex convergence proofs are a lot messier). For recap - a convex
function 𝑓 (𝑥) satisfies 𝑓 (𝜆𝑥 + (1 − 𝜆)𝑥′ ) ≤ 𝜆𝑓 (𝑥) + (1 − 𝜆)𝑓 (𝑥′ ), that is, the linear interpolant between
function values is larger than the function values in between. Likewise, a convex set 𝑆 is a set where for
any points 𝑥, 𝑥′ ∈ 𝑆 the line [𝑥, 𝑥′ ] is in the set, i.e. 𝜆𝑥 + (1 − 𝜆)𝑥′ ∈ 𝑆 for all 𝜆 ∈ [0, 1]. Now assume that
𝑤* is the minimizer of the expected loss that we are trying to minimize, e.g.
𝑚
* 1 ∑︁
𝑤 = argmin𝑤 𝑅(𝑤) where 𝑅(𝑤) = 𝑙(𝑦𝑖 , 𝑓 (𝑥𝑖 , 𝑤))
𝑚
𝑖=1
Let’s assume that we actually know that 𝑤* is contained in some set convex set 𝑆, e.g. a ball of radius 𝑅
around the origin. This is convenient since we want to make sure that during optimization our parameter 𝑤
doesn’t accidentally diverge. We can ensure that, e.g. by shrinking it back to such a ball whenever needed.
Secondly, assume that we have an upper bound on the magnitude of the gradient 𝑔𝑖 := 𝜕𝑤 𝑙(𝑦𝑖 , 𝑓 (𝑥𝑖 , 𝑤))
for all 𝑖 by some constant 𝐿 (it’s called so since this is often referred to as the Lipschitz constant). Again,
this is super useful since we don’t want 𝑤 to diverge while we’re optimizing. In practice, many algorithms
employ e.g. gradient clipping to force our gradients to be well behaved, by shrinking the gradients back to
something tractable.
Third, to get rid of variance in the parameter 𝑤𝑡 that is obtained during the optimization,
∑︀ we
∑︀use the weighted
average over the entire optimization process as our solution, i.e. we use 𝑤 ¯ := 𝑡 𝜂𝑡 𝑤𝑡 / 𝑡 𝜂𝑡 .
Let’s look at the distance 𝑟𝑡 := ‖𝑤𝑡 − 𝑤* ‖, i.e. the distance between the optimal solution vector 𝑤* and
what we currently have. It is bounded as follows:
𝑡𝑜
Next we use convexity of 𝑅(𝑤). We know that 𝑅(𝑤* ) ≥ 𝑅(𝑤𝑡 ) + 𝜕𝑤 𝑅(𝑤𝑡 )⊤ (𝑤* ∑︀ − 𝑤𝑡 ) and moreover
∑︀ that
𝑇
the average of function values is larger than the function value of the average, i.e. 𝑡=1 𝜂𝑡 𝑅(𝑤𝑡 )/ 𝑡 𝜂𝑡 ≥
𝑅(𝑤).
¯ The first inequality allows us to bound the expected decrease in distance to optimality via
Summing over 𝑡 and using the facts that 𝑟𝑇 ≥ 0 and that 𝑤 is contained inside a ball of radius 𝑅 yields:
𝑇
∑︁ ∑︁
2
−𝑅 ≤ 𝐿 2
𝜂𝑡2 − 2 𝜂𝑡 E[𝑅[𝑤𝑡 ] − 𝑅[𝑤* ]]
𝑡=1 𝑡
∑︀
Rearranging terms, using convexity of 𝑅 the second time, and dividing by 𝑡 𝜂𝑡 yields a bound on how far
we are likely to stray from the best possible solution:
𝑅2 + 𝐿2 𝑇𝑡=1 𝜂𝑡2
∑︀
*
¯ − 𝑅[𝑤 ] ≤
E[𝑅[𝑤]]
2 𝑇𝑡=1 𝜂𝑡
∑︀
Depending on how we choose 𝜂𝑡 we will get different bounds. For instance, if we make 𝜂 constant, i.e. if√we
2 2 2
√ we get the bounds (𝑅 + 𝐿 𝜂 𝑇 )/(2𝜂𝑇 ). This is minimized for 𝜂 = 𝑅/𝐿 𝑇 ,
use a constant learning rate,
yielding a bound of 𝑅𝐿/ 𝑇 . A few things are interesting in this context:
• If we are potentially far away from the optimal solution, we should use a large learning rate (the O(R)
dependency).
• If the gradients are potentially large, we should use a smaller learning rate (the O(1/L) dependency).
• If we have a long time to converge, we should use a smaller learning rate, but not too small.
• Large gradients and a large degree of uncertainty as to how far we are away from the optimal solution
lead to poor convergence.
• More optimization steps make things better.
None of these insights are terribly surprising, albeit useful to keep in mind when we use SGD in the wild.
And this was the very point of going through
√ this somewhat tedious proof. Furthermore, if we use a de-
√ rate, e.g. 𝜂𝑡 = 𝑂(1/ 𝑡), then our bounds are somewhat less tight, and we get a bound
creasing learning
of 𝑂(log 𝑇 / 𝑇 ) bound on how far away from optimality we might be. The key difference is that for the
decreasing learning rate we need not know when to stop. In other words, we get an anytime algorithm that
provides a good result at any time, albeit not as good as what we could expect if we knew how much time
to optimize we have right from the beginning.
3.15.4 Next
Environment
For whinges or inquiries, open an issue on GitHub.
3.16 Environment
So far we did not worry very much about where the data came from and how the models that we build get
deployed. Not caring about it can be problematic. Many failed machine learning deployments can be traced
back to this situation. This chapter is meant to help with detecting such situations early and points out how to
mitigate them. Depending on the case this might be rather simple (ask for the ‘right’ data) or really difficult
(implement a reinforcement learning system).
Obviously this is unlikely to work well. The training set consists of photos, while the test set contains only
cartoons. The colors aren’t even accurate. Training on a dataset that looks substantially different from the
test set without some plan for how to adapt to the new domain is a bad idea. Unfortunately, this is a very
common pitfall. Statisticians call this Covariate Shift, i.e. the situation where the distribution over the
covariates (aka training data) is shifted on test data relative to the training case. Mathematically speaking,
we are referring the case where 𝑝(𝑥) changes but 𝑝(𝑦|𝑥) remains unchanged.
If we were to build a machine translation system, the distribution 𝑝(𝑦|𝑥) would be different, e.g. depending
on our location. This problem can be quite tricky to spot. A saving grace is that quite often the 𝑝(𝑦|𝑥) only
shifts gradually (e.g. the click-through rate for NOKIA phone ads). Before we go into further details, let us
discuss a number of situations where covariate and concept shift are not quite as blatantly obvious.
3.16.3 Examples
Medical Diagnostics
Imagine you want to design some algorithm to detect cancer. You get data of healthy and sick people;
you train your algorithm; it works fine, giving you high accuracy and you conclude that you’re ready for a
successful career in medical diagnostics. Not so fast . . .
Many things could go wrong. In particular, the distributions that you work with for training and those in the
wild might differ considerably. This happened to an unfortunate startup I had the opportunity to consult for
many years ago. They were developing a blood test for a disease that affects mainly older men and they’d
managed to obtain a fair amount of blood samples from patients. It is considerably more difficult, though,
to obtain blood samples from healthy men (mainly for ethical reasons). To compensate for that, they asked
a large number of students on campus to donate blood and they performed their test. Then they asked me
whether I could help them build a classifier to detect the disease. I told them that it would be very easy to
distinguish between both datasets with probably near perfect accuracy. After all, the test subjects differed
in age, hormone level, physical activity, diet, alcohol consumption, and many more factors unrelated to the
disease. This was unlikely to be the case with real patients: Their sampling procedure had caused an extreme
case of covariate shift that couldn’t be corrected by conventional means. In other words, training and test
data were so different that nothing useful could be done and they had wasted significant amounts of money.
Nonstationary distributions
A much more subtle situation is where the distribution changes slowly and the model is not updated ade-
quately. Here are a number of typical cases:
• We train a computational advertising model and then fail to update it frequently (e.g. we forget to
incorporate that an obscure new device called an iPad was just launched).
• We build a spam filter. It works well at detecting all spam that we’ve seen so far. But then the
spammers wisen up and craft new messages that look quite unlike anything we’ve seen before.
• We build a product recommendation system. It works well for the winter. But then it keeps on
recommending Santa hats after Christmas.
More Anecdotes
• We build a classifier for “Not suitable/safe for work” (NSFW) images. To make our life easy, we
scrape a few seedy Subreddits. Unfortunately the accuracy on real life data is lacking (the pictures
posted on Reddit are mostly ‘remarkable’ in some way, e.g. being taken by skilled photographers,
whereas most real NSFW images are fairly unremarkable . . . ). Quite unsurprisingly the accuracy is
not very high on real data.
• We build a face detector. It works well on all benchmarks. Unfortunately it fails on test data - the
offending examples are close-ups where the face fills the entire image (no such data was in the training
set).
• We build a web search engine for the USA market and want to deploy it in the UK.
In short, there are many cases where training and test distribution 𝑝(𝑥) are different. In some cases, we
get lucky and the models work despite the covariate shift. We now discuss principled solution strategies.
Warning - this will require some math and statistics.
Statisticians call the first term an empirical average, that is an average computed over the data drawn from
𝑝(𝑥)𝑝(𝑦|𝑥). If the data is drawn from the ‘wrong’ distribution 𝑞, we can correct for that by using the
following simple identity:
∫︁ ∫︁ [︂ ]︂
𝑝(𝑥) 𝑝(𝑥)
E𝑥∼𝑝(𝑥) [𝑓 (𝑥)] = 𝑓 (𝑥)𝑝(𝑥)𝑑𝑥 = 𝑓 (𝑥) 𝑞(𝑥)𝑑𝑥 = E𝑥∼𝑞(𝑥) 𝑓 (𝑥)
𝑞(𝑥) 𝑞(𝑥)
In other words, we need to re-weight each instance by the ratio of probabilities that it would have been
drawn from the correct distribution 𝛽(𝑥) := 𝑝(𝑥)/𝑞(𝑥). Alas, we do not know that ratio, so before we can do
anything useful we need to estimate it. Many methods are available, e.g. some rather fancy operator theoretic
ones which try to recalibrate the expectation operator directly using a minimum-norm or a maximum entropy
principle. Note that for any such approach, we need samples drawn from both distributions - the ‘true’ 𝑝, e.g.
by access to training data, and the one used for generating the training set 𝑞 (the latter is trivially available).
In this case there exists a very effective approach that will give almost as good results: logistic regression.
This is all that is needed to compute estimate probability ratios. We learn a classifier to distinguish be-
tween data drawn from 𝑝(𝑥) and data drawn from 𝑞(𝑥). If it is impossible to distinguish between the two
distributions then it means that the associated instances are equally likely to come from either one of the
two distributions. On the other hand, any instances that can be well discriminated should be significantly
over/underweighted accordingly. For simplicity’s sake assume that we have an equal number of instances
from both distributions, denoted by 𝑥𝑖 ∼ 𝑝(𝑥) and 𝑥𝑖 ∼ 𝑞(𝑥) respectively. Now denote by 𝑧𝑖 labels which
are 1 for data drawn from 𝑝 and -1 for data drawn from 𝑞. Then the probability in a mixed dataset is given
by
CovariateShiftCorrector(X, Z)
X: Training dataset (without labels)
Z: Test dataset (without labels)
Generative Adversarial Networks use the very idea described above to engineer a data generator such
that it cannot be distinguished from a reference dataset. For this, we use one network, say 𝑓 to distinguish
real and fake data and a second network 𝑔 that tries to fool the discriminator 𝑓 into accepting fake data as
real. We will discuss this in much more detail later.
• Batch Learning. Here we have access to training data and labels {(𝑥1 , 𝑦1 ), . . . (𝑥𝑛 , 𝑦𝑛 )}, which we
use to train a network 𝑓 (𝑥, 𝑤). Later on, we deploy this network to score new data (𝑥, 𝑦) drawn from
the same distribution. This is the default assumption for any of the problems that we discuss here.
For instance, we might train a cat detector based on lots of pictures of cats and dogs. Once we trained
it, we ship it as part of a smart catdoor computer vision system that lets only cats in. This is then
installed in a customer’s home and is never updated again (barring extreme circumstances).
• Online Learning. Now imagine that the data (𝑥𝑖 , 𝑦𝑖 ) arrives one sample at a time. More specifically,
assume that we first observe 𝑥𝑖 , then we need to come up with an estimate 𝑓 (𝑥𝑖 , 𝑤) and only once
we’ve done this, we observe 𝑦𝑖 and with it, we receive a reward (or incur a loss), given our decision.
Many real problems fall into this category. E.g. we need to predict tomorrow’s stock price, this allows
us to trade based on that estimate and at the end of the day we find out whether our estimate allowed
us to make a profit. In other words, we have the following cycle where we are continuously improving
our model given new observations.
model 𝑓𝑡 −→ data 𝑥𝑡 −→ estimate 𝑓𝑡 (𝑥𝑡 ) −→ observation 𝑦𝑡 −→ loss 𝑙(𝑦𝑡 , 𝑓𝑡 (𝑥𝑡 )) −→ model 𝑓𝑡+1
• Bandits. They are a special case of the problem above. While in most learning problems we have a
continuously parametrized function 𝑓 where we want to learn its parameters (e.g. a deep network), in
a bandit problem we only have a finite number of arms that we can pull (i.e. a finite number of actions
that we can take). It is not very surprising that for this simpler problem stronger theoretical guarantees
in terms of optimality can be obtained. We list it mainly since this problem is often (confusingly)
treated as if it were a distinct learning setting.
• Control (and nonadversarial Reinforcement Learning). In many cases the environment remembers
what we did. Not necessarily in an adversarial manner but it’ll just remember and the response will
depend on what happened before. E.g. a coffee boiler controller will observe different temperatures
depending on whether it was heating the boiler previously. PID (proportional integral derivative)
controller algorithms are a popular choice there. Likewise, a user’s behavior on a news site will
depend on what we showed him previously (e.g. he will read most news only once). Many such
algorithms form a model of the environment in which they act such as to make their decisions appear
less random (i.e. to reduce variance).
• Reinforcement Learning. In the more general case of an environment with memory, we may en-
counter situations where the environment is trying to cooperate with us (cooperative games, in partic-
ular for non-zero-sum games), or others where the environment will try to win. Chess, Go, Backgam-
mon or StarCraft are some of the cases. Likewise, we might want to build a good controller for
autonomous cars. The other cars are likely to respond to the autonomous car’s driving style in non-
trivial ways, e.g. trying to avoid it, trying to cause an accident, trying to cooperate with it, etc.
One key distinction between the different situations above is that the same strategy that might have worked
throughout in the case of a stationary environment, might not work throughout when the environment can
adapt. For instance, an arbitrage opportunity discovered by a trader is likely to disappear once he starts
exploiting it. The speed and manner at which the environment changes determines to a large extent the
type of algorithms that we can bring to bear. For instance, if we know that things may only change slowly,
we can force any estimate to change only slowly, too. If we know that the environment might change
instantaneously, but only very infrequently, we can make allowances for that. These types of knowledge are
crucial for the aspiring data scientist to deal with concept shift, i.e. when the problem that he is trying to
solve changes over time.
For whinges or inquiries, open an issue on GitHub.
𝑦ˆ = softmax(𝑊 𝑥 + 𝑏)
Graphically, we could depict the model like this, where the orange nodes
represent inputs and the teal nodes on the top represent the output:
If our labels really were related to our input data by an approximately linear function, then this approach
might be adequate. But linearity is a strong assumption. Linearity means that given an output of interest, for
each input, increasing the value of the input should either drive the value of the output up or drive it down,
irrespective of the value of the other inputs.
Imagine the case of classifying cats and dogs based on black and white images. That’s like saying that for
each pixel, increasing its value either increases the probability that it depicts a dog or decreases it. That’s
not reasonable. After all, the world contains both black dogs and black cats, and both white dogs and white
cats.
Teasing out what is depicted in an image generally requires allowing more complex relationships between
our inputs and outputs, considering the possibility that our pattern might be characterized by interactions
among the many features. In these cases, linear models will have low accuracy. We can model a more
general class of functions by incorporating one or more hidden layers. The easiest way to do this is to stack
a bunch of layers of neurons on top of each other. Each layer feeds into the layer above it, until we generate
an output. This architecture is commonly called a “multilayer perceptron”. With an MLP, we’re going to
stack a bunch of layers on top of each other.
ℎ1 = 𝜑(𝑊1 𝑥 + 𝑏1 )
ℎ2 = 𝜑(𝑊2 ℎ1 + 𝑏2 )
...
ℎ𝑛 = 𝜑(𝑊𝑛 ℎ𝑛−1 + 𝑏𝑛 )
Note that each layer requires its own set of parameters. For each hidden layer, we calculate its value by first
applying a linear function to the activations of the layer below, and then applying an element-wise nonlinear
activation function. Here, we’ve denoted the activation function for the hidden layers as 𝜑. Finally, given
the topmost hidden layer, we’ll generate an output. Because we’re still focusing on multiclass classification,
we’ll stick with the softmax activation in the output layer.
𝑦ˆ = softmax(𝑊𝑦 ℎ𝑛 + 𝑏𝑦 )
Multilayer perceptrons can account for complex interactions in the inputs because the hidden neurons de-
pend on the values of each of the inputs. It’s easy to design a hidden node that that does arbitrary compu-
tation, such as, for instance, logical operations on its inputs. And it’s even widely known that multilayer
perceptrons are universal approximators. That means that even for a single-hidden-layer neural network,
with enough nodes, and the right set of weights, it could model any function at all! Actually learning that
function is the hard part. And it turns out that we can approximate functions much more compactly if we use
deeper (vs wider) neural networks. We’ll get more into the math in a subsequent chapter, but for now let’s
actually build an MLP. In this example, we’ll implement a multilayer perceptron with two hidden layers and
one output layer.
3.17.1 Imports
In [ ]: from __future__ import print_function
import mxnet as mx
import numpy as np
from mxnet import nd, autograd, gluon
#######################
# Allocate parameters for the first hidden layer
#######################
W1 = nd.random_normal(shape=(num_inputs, num_hidden), scale=weight_scale, ctx=model
b1 = nd.random_normal(shape=num_hidden, scale=weight_scale, ctx=model_ctx)
#######################
# Allocate parameters for the second hidden layer
#######################
W2 = nd.random_normal(shape=(num_hidden, num_hidden), scale=weight_scale, ctx=model
b2 = nd.random_normal(shape=num_hidden, scale=weight_scale, ctx=model_ctx)
#######################
# Allocate parameters for the output layer
#######################
W3 = nd.random_normal(shape=(num_hidden, num_outputs), scale=weight_scale, ctx=mode
b3 = nd.random_normal(shape=num_outputs, scale=weight_scale, ctx=model_ctx)
Mathematically, that’s a perfectly reasonable thing to do. However, computationally, things can get hairy.
We’ll revisit the issue at length in a chapter more dedicated to implementation and less interested in statistical
modeling. But we’re going to make a change here so we want to give you the gist of why.
𝑧𝑗
Recall that the softmax function calculates 𝑦ˆ𝑗 = ∑︀𝑛𝑒 𝑒𝑧𝑖 , where 𝑦ˆ𝑗 is the j-th element of the input yhat
𝑖=1
variable in function cross_entropy and 𝑧𝑗 is the j-th element of the input y_linear variable in func-
tion softmax
If some of the 𝑧𝑖 are very large (i.e. very positive), 𝑒𝑧𝑖 might be larger than the largest number we can
have for certain types of float (i.e. overflow). This would make the denominator (and/or numerator)
inf and we get zero, or inf, or nan for 𝑦ˆ𝑗 . In any case, we won’t get a well-defined return value for
cross_entropy. This is the reason we subtract max(𝑧𝑖 ) from all 𝑧𝑖 first in softmax function. You can
verify that this shifting in 𝑧𝑖 will not change the return value of softmax.
After the above subtraction/ normalization step, it is possible that 𝑧𝑗 is very negative. Thus, 𝑒𝑧𝑗 will be
very close to zero and might be rounded to zero due to finite precision (i.e underflow), which makes 𝑦ˆ𝑗
zero and we get -inf for log(ˆ𝑦𝑗 ). A few steps down the road in backpropagation, we starts to get horrific
not-a-number (nan) results printed to screen.
Our salvation is that even though we’re computing these exponential functions, we ultimately plan to take
their log in the cross-entropy functions. It turns out that by combining these two operators softmax and
cross_entropy together, we can elude the numerical stability issues that might otherwise plague us
during backpropagation. As shown in the equation below, we avoided calculating 𝑒𝑧𝑗 but directly used 𝑧𝑗
due to 𝑙𝑜𝑔(𝑒𝑥𝑝(·)).
(︃ 𝑛 )︃ (︃ 𝑛 )︃
𝑒𝑧𝑗
(︂ )︂ ∑︁ ∑︁
𝑦𝑗 ) = log ∑︀𝑛 𝑧 = log(𝑒𝑧𝑗 ) − log
log(ˆ 𝑒𝑧𝑖 = 𝑧𝑗 − log 𝑒 𝑧𝑖
𝑖=1 𝑒
𝑖
𝑖=1 𝑖=1
We’ll want to keep the conventional softmax function handy in case we ever want to evaluate the probabili-
ties output by our model. But instead of passing softmax probabilities into our new loss function, we’ll just
pass our yhat_linear and compute the softmax and its log all at once inside the softmax_cross_entropy
loss function, which does smart things like the log-sum-exp trick (see on Wikipedia).
In [ ]: def softmax_cross_entropy(yhat_linear, y):
return - nd.nansum(y * nd.log_softmax(yhat_linear), axis=0, exclude=True)
#######################
# Compute the second hidden layer
#######################
h2_linear = nd.dot(h1, W2) + b2
h2 = relu(h2_linear)
#######################
# Compute the output layer.
# We will omit the softmax function here
# because it will be applied
# in the softmax_cross_entropy loss
#######################
yhat_linear = nd.dot(h2, W3) + b3
return yhat_linear
3.17.9 Optimizer
In [ ]: def SGD(params, lr):
for param in params:
param[:] = param - lr * param.grad
denominator = 0.
for i, (data, label) in enumerate(data_iterator):
data = data.as_in_context(model_ctx).reshape((-1, 784))
label = label.as_in_context(model_ctx)
output = net(data)
predictions = nd.argmax(output, axis=1)
numerator += nd.sum(predictions == label)
denominator += data.shape[0]
return (numerator / denominator).asscalar()
for e in range(epochs):
cumulative_loss = 0
for i, (data, label) in enumerate(train_data):
data = data.as_in_context(model_ctx).reshape((-1, 784))
label = label.as_in_context(model_ctx)
label_one_hot = nd.one_hot(label, 10)
with autograd.record():
output = net(data)
loss = softmax_cross_entropy(output, label_one_hot)
loss.backward()
SGD(params, learning_rate)
cumulative_loss += nd.sum(loss).asscalar()
samples = 10
plt.imshow(imtiles.asnumpy())
plt.show()
pred=model_predict(net,data.reshape((-1,784)))
print('model predictions are:', pred)
print('true labels :', label)
break
3.17.13 Conclusion
Nice! With just two hidden layers containing 256 hidden nodes, respectively, we can achieve over 95%
accuracy on this task.
3.17.14 Next
Multilayer perceptrons with gluon
For whinges or inquiries, open an issue on GitHub.
chapter03_deep-neural-networks/../img/sequential-not
3.18.1 Imports
First we’ll import the necessary bits.
In [ ]: from __future__ import print_function
import numpy as np
import mxnet as mx
from mxnet import nd, autograd, gluon
We’ll also want to set the contexts for our data and our models.
In [ ]: ctx = mx.gpu() if mx.test_utils.list_gpus() else mx.cpu()
data_ctx = ctx
model_ctx = ctx
We can now instantiate a multilayer perceptron using our MLP class. And just as with any other block, we
can grab its parameters with collect_params and initialize them.
In [ ]: net = MLP()
net.collect_params().initialize(mx.init.Normal(sigma=.01), ctx=model_ctx)
And we can synthesize some gibberish data just to demonstrate one forward pass through the network.
In [ ]: data = nd.ones((1,784))
net(data.as_in_context(model_ctx))
Because we’re working with an imperative framework and not a symbolic framework, debugging Gluon
Blocks is easy. If we want to see what’s going on at each layer of the neural network, we can just plug in a
bunch of Python print statements.
In [ ]: class MLP(gluon.Block):
def __init__(self, **kwargs):
super(MLP, self).__init__(**kwargs)
with self.name_scope():
self.dense0 = gluon.nn.Dense(64, activation="relu")
self.dense1 = gluon.nn.Dense(64, activation="relu")
self.dense2 = gluon.nn.Dense(10)
net = MLP()
net.collect_params().initialize(mx.init.Normal(sigma=.01), ctx=model_ctx)
net(data.as_in_context(model_ctx))
3.19.3 Optimizer
In [ ]: trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': .01})
for e in range(epochs):
cumulative_loss = 0
for i, (data, label) in enumerate(train_data):
data = data.as_in_context(model_ctx).reshape((-1, 784))
label = label.as_in_context(model_ctx)
with autograd.record():
output = net(data)
loss = softmax_cross_entropy(output, label)
loss.backward()
trainer.step(data.shape[0])
cumulative_loss += nd.sum(loss).asscalar()
3.19.6 Conclusion
In this chapter, we showed two ways to build multilayer perceptrons with Gluon. We demonstrated how to
subclass gluon.Block, and define your own forward passes. We also showed how you might debug your
network by lacing your forward pass with print statements. Finally, we showed how you could define and
instantiate an equivalent network with just 6 lines of code by using gluon.nn.Sequential. Now that
you understand the basics, you’re ready to leap ahead. If you’re following the book in order, then the next
stop will be dropout regularization. Other possible choices would be to start leanring about convolutional
neural networks which are especialy handy for working with images, or recurrent neural networks, which
are especially useful for natural language processing.
3.19.7 Next
Dropout regularization from scratch
For whinges or inquiries, open an issue on GitHub.
3.20.6 Dropout
In [ ]: def dropout(X, drop_probability):
keep_probability = 1 - drop_probability
mask = nd.random_uniform(0, 1.0, X.shape, ctx=X.context) < keep_probability
#############################
# Avoid division by 0 when scaling
#############################
if keep_probability > 0.0:
scale = (1/keep_probability)
else:
scale = 0.0
return mask * X * scale
In [ ]: A = nd.arange(20).reshape((5,4))
dropout(A, 0.0)
In [ ]: dropout(A, 0.5)
In [ ]: dropout(A, 1.0)
#######################
# Compute the second hidden layer
#######################
h2_linear = nd.dot(h1, W2) + b2
h2 = relu(h2_linear)
h2 = dropout(h2, drop_prob)
#######################
# Compute the output layer.
# We will omit the softmax function here
# because it will be applied
# in the softmax_cross_entropy loss
#######################
yhat_linear = nd.dot(h2, W3) + b3
return yhat_linear
3.20.10 Optimizer
In [ ]: def SGD(params, lr):
for param in params:
param[:] = param - lr * param.grad
for e in range(epochs):
for i, (data, label) in enumerate(train_data):
data = data.as_in_context(ctx).reshape((-1,784))
label = label.as_in_context(ctx)
label_one_hot = nd.one_hot(label, 10)
with autograd.record():
################################
# Drop out 50% of hidden activations on the forward pass
################################
output = net(data, drop_prob=.5)
loss = softmax_cross_entropy(output, label_one_hot)
loss.backward()
SGD(params, learning_rate)
##########################
# Keep a moving average of the losses
##########################
if i == 0:
moving_loss = nd.mean(loss).asscalar()
else:
moving_loss = .99 * moving_loss + .01 * nd.mean(loss).asscalar()
3.20.13 Conclusion
Nice. With just two hidden layers containing 256 and 128 hidden nodes, respectively, we can achieve over
95% accuracy on this task.
3.20.14 Next
Dropout regularization with gluon
###########################
# Adding second hidden layer
###########################
net.add(gluon.nn.Dense(num_hidden, activation="relu"))
###########################
# Adding dropout with rate .5 to the second hidden layer
###########################
net.add(gluon.nn.Dropout(.5))
###########################
# Adding the output layer
###########################
net.add(gluon.nn.Dense(num_outputs))
Note that we got the exact same answer on both forward passes through the net! That’s because by, default,
mxnet assumes that we are in predict mode. We can explicitly invoke this scope by placing code within a
with autograd.predict_mode(): block.
In [ ]: with autograd.predict_mode():
print(net(x[0:1]))
print(net(x[0:1]))
Unless something’s gone horribly wrong, you should see the same result as before. We can also run the code
in train mode. This tells MXNet to run our Blocks as they would run during training.
In [ ]: with autograd.train_mode():
print(net(x[0:1]))
print(net(x[0:1]))
with autograd.train_mode():
print(autograd.is_training())
To make our lives a little easier, record() takes one argument, train_mode, which has a default value of
True. So when we turn on autograd, this by default turns on train_mode (with autograd.record()
is equivalent to with autograd.record(train_mode=True):). To change this default behav-
ior (as when generating adversarial examples), we can optionally call record via (with autograd.
record(train_mode=False):).
3.21.8 Optimizer
In [ ]: trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': .1})
for e in range(epochs):
for i, (data, label) in enumerate(train_data):
data = data.as_in_context(ctx).reshape((-1, 784))
label = label.as_in_context(ctx)
with autograd.record():
output = net(data)
loss = softmax_cross_entropy(output, label)
loss.backward()
trainer.step(data.shape[0])
##########################
# Keep a moving average of the losses
##########################
curr_loss = nd.mean(loss).asscalar()
moving_loss = (curr_loss if ((i == 0) and (e == 0))
else (1 - smoothing_constant) * moving_loss + (smoothing_con
3.21.11 Conclusion
Now let’s take a look at how to build convolutional neural networks.
3.21.12 Next
Introduction to ‘‘gluon.Block‘ and gluon.nn.Sequential <../chapter03_deep-neural-
networks/plumbing.ipynb>‘__
For whinges or inquiries, open an issue on GitHub.
###########################
# Specify the context we'll be using
###########################
ctx = mx.gpu() if mx.test_utils.list_gpus() else mx.cpu()
###########################
# Load up our dataset
###########################
batch_size = 64
def transform(data, label):
return data.astype(np.float32)/255, label.astype(np.float32)
train_data = mx.gluon.data.DataLoader(mx.gluon.data.vision.MNIST(train=True, transf
batch_size, shuffle=True)
test_data = mx.gluon.data.DataLoader(mx.gluon.data.vision.MNIST(train=False, transf
batch_size, shuffle=False)
This is a convenient shorthand that allows us to express a neural network compactly. When we want to build
simple networks, this saves us a lot of time. But both (i) to understand how nn.Sequential works, and
(ii) to compose more complex architectures, you’ll want to understand gluon.Block.
Let’s take a look at how the same model would be expressed with gluon.Block.
In [ ]: class MLP(Block):
def __init__(self, **kwargs):
super(MLP, self).__init__(**kwargs)
with self.name_scope():
self.dense0 = nn.Dense(128)
self.dense1 = nn.Dense(64)
self.dense2 = nn.Dense(10)
Now that we’ve defined a class for MLPs, we can go ahead and instantiate one:
In [ ]: net2 = MLP()
At this point we can pass data through the network by calling it like a function, just as we have in the
previous tutorials.
In [ ]: for data, _ in train_data:
data = data.as_in_context(ctx)
break
net2(data[0:1])
In [ ]: net1 = gluon.nn.Sequential()
with net1.name_scope():
net1.add(gluon.nn.Dense(128, activation="relu"))
net1.add(gluon.nn.Dense(64, activation="relu"))
net1.add(gluon.nn.Dense(10))
In just 5 lines and 183 characters, we defined a multilayer perceptron with three fully-connected layers, each
parametrized by weight matrix and bias term. We also specified the ReLU activation function for the hidden
layers.
Sequential itself subclasses Block and maintains a list of _children. Then, every time we call net1.
add(...) our net simply registers a new child. We can actually pass in an arbitrary Block, even layers
that we write ourselves.
When we call forward on a Sequential, it executes the following code:
Basically, it calls each child on the output of the previous one, returning the final output at the end of the
chain.
Take a look at the shapes of the weight matrices: (128,0), (64, 0), (10, 0). What does it mean to have zero
dimension in a matrix? This is gluon’s way of marking that the shape of these matrices is not yet known.
The shape will be inferred on the fly once the network is provided with some input.
So when we initialize our parameters, you might wonder, what precisely is happening?
In [ ]: net1.collect_params().initialize(mx.init.Xavier(magnitude=2.24), ctx=ctx)
In this situation, gluon is not actually initializing any parameters! Instead, it’s making a note of which
initializer to associate with each parameter, even though its shape is not yet known. The parameters are
instantiated and the initializer is called once we provide the network with some input.
In [ ]: net1(data)
print(net1.collect_params())
This shape inference can be extremely useful at times. For example, when working with convnets, it can be
quite a pain to calculate the shape of various hidden layers. It depends on both the kernel size, the number of
filters, the stride, and the precise padding scheme used which can vary in subtle ways from library to library.
Note that the parameters from this network can be initialized before we see any real data.
In [ ]: net2.collect_params().initialize(mx.init.Xavier(magnitude=2.24), ctx=ctx)
print(net2.collect_params())
3.22.9 Next
Writing custom layers with ‘‘gluon.Block‘ <../chapter03_deep-neural-networks/custom-layer.ipynb>‘__
For whinges or inquiries, open an issue on GitHub.
###########################
###########################
# Load up our dataset
###########################
batch_size = 64
def transform(data, label):
return data.astype(np.float32)/255, label.astype(np.float32)
train_data = mx.gluon.data.DataLoader(mx.gluon.data.vision.MNIST(train=True, transf
batch_size, shuffle=True)
test_data = mx.gluon.data.DataLoader(mx.gluon.data.vision.MNIST(train=False, transf
batch_size, shuffle=False)
That’s it. We can just instantiate this block and make a forward pass. Note that this layer doesn’t actually
care what its input or output dimensions are. So we can just feed in an arbitrary array and should expect
appropriately transformed output. Whenever we are happy with whatever the automatic differentiation gen-
erates, this is all we need.
In [ ]: net = CenteredLayer()
net(nd.array([1,2,3,4,5]))
We can also incorporate this layer into a more complicated network, such as by using nn.Sequential().
In [ ]: net2 = nn.Sequential()
net2.add(nn.Dense(128))
net2.add(nn.Dense(10))
net2.add(CenteredLayer())
This network contains Blocks (Dense) that contain parameters and thus require initialization
In [ ]: net2.collect_params().initialize(mx.init.Xavier(magnitude=2.24), ctx=ctx)
Now we can pass some data through it, say the first image from our MNIST dataset.
In [ ]: for data, _ in train_data:
data = data.as_in_context(ctx)
break
output = net2(data[0:1])
print(output)
And we can verify that as expected, the resulting vector has mean 0.
In [ ]: nd.mean(output)
There’s a good chance you’ll see something other than 0. When I ran this code, I got 2.68220894e-08.
That’s roughly .000000027. This is due to the fact that MXNet often uses low precision arithmetics. For
deep learning research, this is often a compromise that we make. In exchange for giving up a few significant
digits, we get tremendous speedups on modern hardware. And it turns out that most deep learning algorithms
don’t suffer too much from the loss of precision.
3.23.3 Parameters
Before we can add parameters to our custom Block, we should get to know how gluon deals with param-
eters generally. Instead of working with NDArrays directly, each Block is associated with some number
(as few as zero) of Parameter (groups).
At a high level, you can think of a Parameter as a wrapper on an NDArray. However, the Parameter
can be instantiated before the corresponding NDArray is. For example, when we instantiate a Block but
the shapes of each parameter still need to be inferred, the Parameter will wait for the shape to be inferred
before allocating memory.
To get a hands-on feel for mxnet.Parameter, let’s just instantiate one outside of a Block:
In [ ]: my_param = gluon.Parameter("exciting_parameter_yay", grad_req='write', shape=(5,5))
print(my_param)
Here we’ve instantiated a parameter, giving it the name “exciting_parameter_yay”. We’ve also specified
that we’ll want to capture gradients for this Parameter. Under the hood, that lets gluon know that it has
to call .attach_grad() on the underlying NDArray. We also specified the shape. Now that we have a
Parameter, we can initialize its values via .initialize() and extract its data by calling .data().
In [ ]: my_param.initialize(mx.init.Xavier(magnitude=2.24), ctx=ctx)
print(my_param.data())
For data parallelism, a Parameter can also be initialized on multiple contexts. The Parameter will then keep
a copy of its value on each context. Keep in mind that you need to maintain consistency among the copies
when updating the Parameter (usually gluon.Trainer does this for you).
Note that you need at least two GPUs to run this section.
In [ ]: if len(mx.test_utils.list_gpus()) >= 2:
my_param = gluon.Parameter("exciting_parameter_yay", grad_req='write', shape=(5
my_param.initialize(mx.init.Xavier(magnitude=2.24), ctx=[mx.gpu(0), mx.gpu(1)])
print(my_param.data(mx.gpu(0)), my_param.data(mx.gpu(1)))
MXNet’s ParameterDict does a few cool things for us. First, we can instantiate a new Parameter by
calling pd.get()
In [ ]: pd.get("exciting_parameter_yay", grad_req='write', shape=(5,5))
Note that the new parameter is (i) contained in the ParameterDict and (ii) appends the prefix to its name.
This naming convention helps us to know which parameters belong to which Block or sub-Block. It’s
especially useful when we want to write parameters to disc (i.e. serialize), or read them from disc (i.e.
deserialize).
Like a regular Python dictionary, we can get the names of all parameters with .keys() and can access
parameters with:
In [ ]: pd["block1_exciting_parameter_yay"]
#################
# Now we just have to write the forward pass.
Recall that every Block can be run just as if it were an entire network. In fact, linear models are nothing
more than neural networks consisting of a single layer as a network.
So let’s go ahead and run some data through our bespoke layer. We’ll want to first instantiate the layer and
initialize its parameters.
In [ ]: dense = MyDense(20, in_units=10)
dense.collect_params().initialize(ctx=ctx)
In [ ]: dense.params
3.23.9 Optimizer
In [ ]: trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': .1})
denominator = 0.
metric.update([label], [output])
return metric.get()[1]
for e in range(epochs):
for i, (data, label) in enumerate(train_data):
data = data.as_in_context(ctx).reshape((-1,784))
label = label.as_in_context(ctx)
with autograd.record():
output = net(data)
cross_entropy = loss(output, label)
cross_entropy.backward()
trainer.step(data.shape[0])
3.23.12 Conclusion
It works! There’s a lot of other cool things you can do. In more advanced chapters, we’ll show how you
can make a layer that takes in multiple inputs, or one that cleverly calls down to MXNet’s symbolic API to
squeeze out extra performance without screwing up your convenient imperative workflow.
3.23.13 Next
Serialization: saving your models and parameters for later re-use
For whinges or inquiries, open an issue on GitHub.
we haven’t yet covered how to save and load models. In reality, we often train a model on one device and
then want to run it to make predictions on many devices simultaneously. In order for our models to persist
beyond the execution of a single Python script, we need mechanisms to save and load NDArrays, gluon
Parameters, and models themselves.
In [ ]: from __future__ import print_function
import mxnet as mx
from mxnet import nd, autograd
from mxnet import gluon
ctx = mx.gpu() if mx.test_utils.list_gpus() else mx.cpu()
dir_name = 'checkpoints'
if not os.path.exists(dir_name):
os.makedirs(dir_name)
We can also save a dictionary where the keys are strings and the values are NDArrays.
In [ ]: mydict = {"X": X, "Y": Y}
filename = os.path.join(dir_name, "test2.params")
nd.save(filename, mydict)
In [ ]: C = nd.load(filename)
print(C)
with net.name_scope():
net.add(gluon.nn.Dense(num_hidden, activation="relu"))
net.add(gluon.nn.Dense(num_hidden, activation="relu"))
net.add(gluon.nn.Dense(num_outputs))
Now, let’s initialize the parameters by attaching an initializer and actually passing in a datapoint to induce
shape inference.
In [ ]: net.collect_params().initialize(mx.init.Normal(sigma=1.), ctx=ctx)
net(nd.ones((1, 100), ctx=ctx))
So this randomly initialized model maps a 100-dimensional vector of all ones to the number 362.53 (that’s
the number on my machine–your mileage may vary). Let’s save the parameters, instantiate a new network,
load them in and make sure that we get the same result.
In [ ]: filename = os.path.join(dir_name, "testnet.params")
net.save_parameters(filename)
net2 = gluon.nn.Sequential()
with net2.name_scope():
net2.add(gluon.nn.Dense(num_hidden, activation="relu"))
net2.add(gluon.nn.Dense(num_hidden, activation="relu"))
net2.add(gluon.nn.Dense(num_outputs))
net2.load_parameters(filename, ctx=ctx)
net2(nd.ones((1, 100), ctx=ctx))
Great! Now we’re ready to save our work. The practice of saving models is sometimes called checkpointing
and it’s especially important for a number of reasons. 1. We can preserve and syndicate models that are
trained once. 2. Some models perform best (as determined on validation data) at some epoch in the middle
of training. If we checkpoint the model after each epoch, we can later select the best epoch. 3. We might
want to ask questions about our trained model that we didn’t think of when we first wrote the scripts for
our experiments. Having the parameters lying around allows us to examine our past work without having
to train from scratch. 4. Sometimes people might want to run our models who don’t know how to execute
training themselves or can’t access a suitable dataset for training. Checkpointing gives us a way to share our
work with others.
3.24.3 Next
Convolutional neural networks from scratch
For whinges or inquiries, open an issue on GitHub.
This can require a lot of parameters! If our input were a 256x256 color image (still quite small for a
photograph), and our network had 1,000 nodes in the first hidden layer, then our first weight matrix would
require (256x256x3)x1000 parameters. That’s nearly 200 million. Moreover the hidden layer would ignore
all the spatial structure in the input image even though we know the local structure represents a powerful
source of prior knowledge.
Convolutional neural networks incorporate convolutional layers. These layers associate each of their nodes
with a small window, called a receptive field, in the previous layer, instead of connecting to the full layer.
This allows us to first learn local features via transformations that are applied in the same way for the top
right corner as for the bottom left. Then we collect all this local information to predict global qualities of
the image (like whether or not it depicts a dog).
3.25.3 Parameters
Each node in a convolutional layer is associated with a 3D block (height x width x channel) in the input
tensor. Moreover, the convolutional layer itself has multiple output channels. So the layer is parameterized
by a 4 dimensional weight tensor, commonly called a convolutional kernel.
The output tensor is produced by sliding the kernel across the input image skipping locations according to a
pre-defined stride (but we’ll just assume that to be 1 in this tutorial). Let’s initialize some such kernels from
scratch.
In [ ]: #######################
# Set the scale for weight initialization and choose
# the number of hidden units in the fully-connected layer
#######################
weight_scale = .01
num_fc = 128
num_filter_conv_layer1 = 20
num_filter_conv_layer2 = 50
Note the shape. The number of examples (64) remains unchanged. The number of channels (also called
filters) has increased to 20. And because the (3,3) kernel can only be applied in 26 different heights and
widths (without the kernel busting over the image border), our output is 26,26. There are some weird
padding tricks we can use when we want the input and output to have the same height and width dimensions,
but we won’t get into that now.
Note that the batch and channel components of the shape are unchanged but that the height and width have
been downsampled from (26,26) to (13,13).
########################
# Define the computation of the second convolutional layer
########################
h2_conv = nd.Convolution(data=h1, weight=W2, bias=b2, kernel=(5,5),
num_filter=num_filter_conv_layer2)
h2_activation = relu(h2_conv)
h2 = nd.Pooling(data=h2_activation, pool_type="avg", kernel=(2,2), stride=(2,2)
if debug:
print("h2 shape: %s" % (np.array(h2.shape)))
########################
# Flattening h2 so that we can feed it into a fully-connected layer
########################
h2 = nd.flatten(h2)
if debug:
print("Flat h2 shape: %s" % (np.array(h2.shape)))
########################
# Define the computation of the third (fully-connected) layer
########################
h3_linear = nd.dot(h2, W3) + b3
h3 = relu(h3_linear)
if debug:
print("h3 shape: %s" % (np.array(h3.shape)))
########################
# Define the computation of the output layer
########################
return yhat_linear
3.25.11 Optimizer
In [ ]: def SGD(params, lr):
for param in params:
param[:] = param - lr * param.grad
for e in range(epochs):
for i, (data, label) in enumerate(train_data):
data = data.as_in_context(ctx)
label = label.as_in_context(ctx)
label_one_hot = nd.one_hot(label, num_outputs)
with autograd.record():
output = net(data)
loss = softmax_cross_entropy(output, label_one_hot)
loss.backward()
SGD(params, learning_rate)
##########################
# Keep a moving average of the losses
##########################
curr_loss = nd.mean(loss).asscalar()
moving_loss = (curr_loss if ((i == 0) and (e == 0))
else (1 - smoothing_constant) * moving_loss + (smoothing_con
3.25.14 Conclusion
Contained in this example are nearly all the important ideas you’ll need to start attacking problems in
computer vision. While state-of-the-art vision systems incorporate a few more bells and whistles, they’re
all built on this foundation. Believe it or not, if you knew just the content in this tutorial 5 years ago,
you could probably have sold a startup to a Fortune 500 company for millions of dollars. Fortunately (or
unfortunately?), the world has gotten marginally more sophisticated, so we’ll have to come up with some
more sophisticated tutorials to follow.
3.25.15 Next
Convolutional neural networks with gluon
For whinges or inquiries, open an issue on GitHub.
3.26.6 Optimizer
In [ ]: trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': .1})
for e in range(epochs):
for i, (data, label) in enumerate(train_data):
data = data.as_in_context(ctx)
label = label.as_in_context(ctx)
with autograd.record():
output = net(data)
loss = softmax_cross_entropy(output, label)
loss.backward()
trainer.step(data.shape[0])
##########################
# Keep a moving average of the losses
##########################
curr_loss = nd.mean(loss).asscalar()
moving_loss = (curr_loss if ((i == 0) and (e == 0))
else (1 - smoothing_constant) * moving_loss + smoothing_cons
3.26.9 Conclusion
You might notice that by using gluon, we get code that runs much faster whether on CPU or GPU. That’s
largely because gluon can call down to highly optimized layers that have been written in C++.
3.26.10 Next
Deep convolutional networks (AlexNet)
For whinges or inquiries, open an issue on GitHub.
world of applied machine learning. One of us (Zack) entered graduate school in 2013. A friend in graduate
school summarized the state of affairs thus:
If you spoke to machine learning researchers, they believed that machine learning was both important and
beautiful. Elegant theories proved the properties of various classifiers. The field of machine learning was
thriving, rigorous and eminently useful. However, if you spoke to a computer vision researcher, you’d hear a
very different story. The dirty truth of image recognition, they’d tell you, is that the really important aspects
of the ML for CV pipeline were data and features. A slightly cleaner dataset, or a slightly better hand-tuned
feature mattered a lot to the final accuracy. However, the specific choice of classifier was little more than an
afterthought. At the end of the day you could throw your features in a logistic regression model, a support
vector machine, or any other classifier of choice, and they would all perform roughly the same.
Higher layers might build upon these representations to represent larger structures, like eyes, noses, blades of
grass, and features. Yet higher layers might represent whole objects like people, airplanes, dogs, or frisbees.
And ultimately, before the classification layer, the final hidden state might represent a compact representa-
tion of the image that summarized the contents in a space where data belonging to different categories would
be linearly separable.
This dataset pushed both computer vision and machine learning research into a new regime where the
previous best methods would no longer dominate.
these are needed for many computer graphics tasks. Fortunately, the math required for that is very similar
to convolutional layers in deep networks. Furthermore, around that time, NVIDIA and ATI had begun
optimizing GPUs for general compute operations, going as far as renaming them GPGPU (General Purpose
GPUs).
To provide some intuition, consider the cores of a modern microprocessor. Each of the cores is quite pow-
erful, running at a high clock frequency, it has quite advanced and large caches (up to several MB of L3).
Each core is very good at executing a very wide range of code, with branch predictors, a deep pipeline and
lots of other things that make it great at executing regular programs. This apparent strength, however, is
also its Achilles’ heel: general purpose cores are very expensive to build. They require lots of chip area, a
sophisticated support structure (memory interfaces, caching logic between cores, high speed interconnects,
etc.), and they’re comparatively bad at any single task. Modern laptops have up to 4 cores, and even high
end servers rarely exceed 64 cores, simply because it is not cost effective.
Compare that with GPUs. They consist of 100-1000 small processing elements (the details differ somewhat
betwen NVIDIA, ATI, ARM and other chip vendors), often grouped into larger groups (NVIDIA calls them
warps). While each core is relatively weak, running at sub-1GHz clock frequency, it is the total number
of such cores that makes GPUs orders of magnitude faster than CPUs. For instance, NVIDIA’s latest Volta
generation offers up to 120 TFlops per chip for specialized instructions (and up to 24 TFlops for more general
purpose ones), while floating point performance of CPUs has not exceeded 1 TFlop to date. The reason for
why this is possible is actually quite simple: firstly, power consumption tends to grow quadratically with
clock frequency. Hence, for the power budget of a CPU core that runs 4x faster (a typical number) you
can use 16 GPU cores at 1/4 the speed, which yields 16 x 1/4 = 4x the performance. Furthermore GPU
cores are much simpler (in fact, for a long time they weren’t even able to execute general purpose code),
which makes them more energy efficient. Lastly, many operations in deep learning require high memory
bandwidth. Again, GPUs shine here with buses that are at least 10x as wide as many CPUs.
Back to 2012. A major breakthrough came when Alex Krizhevsky and Ilya Sutskever implemented a deep
convolutional neural network that could run on GPU hardware. They realized that the computational bot-
tlenecks in CNNs (convolutions and matrix multiplications) are all operations that could be parallelized in
hardware. Using two NIVIDA GTX 580s with 3GB of memory (depicted below) they implemented fast
convolutions. The code cuda-convnet was good enough that for several years it was the industry standard
and powered the first couple years of the deep learning boom.
3.27.4 AlexNet
In 2012, using their cuda-convnet implementation on an eight-layer CNN, Khrizhevsky, Sutskever and Hin-
ton won the ImageNet challenge on image recognition by a wide margin. Their model, introduced in this
paper, is very similar to the LeNet architecture from 1995.
In the rest of the chapter we’re going to implement a similar model to the one that they designed. Due
to memory constraints on the GPU they did some wacky things to make the model fit. For example, they
designed a dual-stream architecture in which half of the nodes live on each GPU. The two streams, and thus
the two GPUs only communicate at certain layers. This limits the amount of overhead for keeping the two
GPUs in sync with each other. Fortunately, distributed deep learning has advanced a long way in the last few
years, so we won’t be needing those features (except for very unusual architectures). In later sections, we’ll
go into greater depth on how you can speed up your networks by training on many GPUs (in AWS you can
get up to 16 on a single machine with 12GB each), and how you can train on many machine simultaneously.
As usual, we’ll start by importing the same dependencies as in the past gluon tutorials:
In [ ]: from __future__ import print_function
import mxnet as mx
from mxnet import nd, autograd
from mxnet import gluon
import numpy as np
mx.random.seed(1)
In [ ]: # ctx = mx.gpu()
ctx = mx.cpu()
In [ ]: batch_size = 64
train_data = gluon.data.DataLoader(
gluon.data.vision.CIFAR10('./data', train=True, transform=transformer),
batch_size=batch_size, shuffle=True, last_batch='discard')
test_data = gluon.data.DataLoader(
gluon.data.vision.CIFAR10('./data', train=False, transform=transformer),
batch_size=batch_size, shuffle=False, last_batch='discard')
In [ ]: for d, l in train_data:
break
In [ ]: print(d.shape, l.shape)
In [ ]: d.dtype
Besides the specific architectural choices and the data preparation, we can recycle all of the code we’d used
for LeNet verbatim.
[right now relying on a different data pipeline (the new gluon.vision). Sync this with the other chapter
soon and commit to one data pipeline.]
[add dropout once we are 100% final on API]
In [ ]: alex_net = gluon.nn.Sequential()
with alex_net.name_scope():
# First convolutional layer
alex_net.add(gluon.nn.Conv2D(channels=96, kernel_size=11, strides=(4,4), activa
alex_net.add(gluon.nn.MaxPool2D(pool_size=3, strides=2))
# Second convolutional layer
alex_net.add(gluon.nn.Conv2D(channels=192, kernel_size=5, activation='relu'))
alex_net.add(gluon.nn.MaxPool2D(pool_size=3, strides=(2,2)))
# Third convolutional layer
alex_net.add(gluon.nn.Conv2D(channels=384, kernel_size=3, activation='relu'))
# Fourth convolutional layer
alex_net.add(gluon.nn.Conv2D(channels=384, kernel_size=3, activation='relu'))
# Fifth convolutional layer
alex_net.add(gluon.nn.Conv2D(channels=256, kernel_size=3, activation='relu'))
alex_net.add(gluon.nn.MaxPool2D(pool_size=3, strides=2))
# Flatten and apply fullly connected layers
alex_net.add(gluon.nn.Flatten())
alex_net.add(gluon.nn.Dense(4096, activation="relu"))
alex_net.add(gluon.nn.Dense(4096, activation="relu"))
alex_net.add(gluon.nn.Dense(10))
3.27.8 Optimizer
In [ ]: trainer = gluon.Trainer(alex_net.collect_params(), 'sgd', {'learning_rate': .001})
for e in range(epochs):
for i, (d, l) in enumerate(train_data):
data = d.as_in_context(ctx)
label = l.as_in_context(ctx)
with autograd.record():
output = alex_net(data)
loss = softmax_cross_entropy(output, label)
loss.backward()
trainer.step(data.shape[0])
##########################
# Keep a moving average of the losses
##########################
curr_loss = nd.mean(loss).asscalar()
moving_loss = (curr_loss if ((i == 0) and (e == 0))
else (1 - smoothing_constant) * moving_loss + (smoothing_con
3.27.12 Next
Very deep convolutional neural nets with repeating blocks
For whinges or inquiries, open an issue on GitHub.
3.28.1 VGG
We begin with the usual import ritual
In [ ]: from __future__ import print_function
import mxnet as mx
from mxnet import nd, autograd
from mxnet import gluon
import numpy as np
mx.random.seed(1)
In [ ]: ctx = mx.gpu()
def vgg_stack(architecture):
out = nn.Sequential()
for (num_convs, channels) in architecture:
out.add(vgg_block(num_convs, channels))
return out
num_outputs = 10
architecture = ((1,64), (1,128), (2,256), (2,512))
net = nn.Sequential()
with net.name_scope():
net.add(vgg_stack(architecture))
net.add(nn.Flatten())
net.add(nn.Dense(512, activation="relu"))
net.add(nn.Dropout(.5))
net.add(nn.Dense(512, activation="relu"))
net.add(nn.Dropout(.5))
net.add(nn.Dense(num_outputs))
3.28.5 Optimizer
In [ ]: trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': .05})
for e in range(epochs):
for i, (d, l) in enumerate(train_data):
data = d.as_in_context(ctx)
label = l.as_in_context(ctx)
with autograd.record():
output = net(data)
loss = softmax_cross_entropy(output, label)
loss.backward()
trainer.step(data.shape[0])
##########################
# Keep a moving average of the losses
##########################
curr_loss = nd.mean(loss).asscalar()
moving_loss = (curr_loss if ((i == 0) and (e == 0))
else (1 - smoothing_constant) * moving_loss + smoothing_cons
3.28.9 Next
Batch normalization from scratch
For whinges or inquiries, open an issue on GitHub.
num_outputs = 10
def transform(data, label):
return nd.transpose(data.astype(np.float32), (2,0,1))/255, label.astype(np.floa
train_data = mx.gluon.data.DataLoader(mx.gluon.data.vision.MNIST(train=True, transf
batch_size, shuffle=True)
test_data = mx.gluon.data.DataLoader(mx.gluon.data.vision.MNIST(train=False, transf
batch_size, shuffle=False)
𝑚
2 1 ∑︁
𝜎𝐵 ← (𝑥𝑖 − 𝜇𝐵 )2
𝑚
𝑖=1
𝑥 𝑖 − 𝜇𝐵
𝑥ˆ𝑖 ← √︁
2 +𝜖
𝜎𝐵
• formulas taken from Ioffe, Sergey, and Christian Szegedy. “Batch normalization: Accelerating deep
network training by reducing internal covariate shift.” International Conference on Machine Learning.
2015.
With gluon, this is all actually implemented for us, but we’ll do it this one time by ourselves, using the
formulas from the original paper so you know how it works, and perhaps you can improve upon it!
Pay attention that, when it comes to (2D) CNN, we normalize batch_size * height * width over
each channel. So that gamma and beta have the lengths the same as channel_count. In our imple-
mentation, we need to manually reshape gamma and beta so that they could (be automatically broadcast
and) multipy the matrices in the desired way.
In [ ]: def pure_batch_norm(X, gamma, beta, eps = 1e-5):
if len(X.shape) not in (2, 4):
raise ValueError('only supports dense or 2dconv')
# dense
if len(X.shape) == 2:
# mini-batch mean
mean = nd.mean(X, axis=0)
# mini-batch variance
variance = nd.mean((X - mean) ** 2, axis=0)
# normalize
X_hat = (X - mean) * 1.0 / nd.sqrt(variance + eps)
# 2d conv
elif len(X.shape) == 4:
# extract the dimensions
N, C, H, W = X.shape
# mini-batch mean
mean = nd.mean(X, axis=(0, 2, 3))
# mini-batch variance
variance = nd.mean((X - mean.reshape((1, C, 1, 1))) ** 2, axis=(0, 2, 3))
# normalize
X_hat = (X - mean.reshape((1, C, 1, 1))) * 1.0 / nd.sqrt(variance.reshape((
# scale and shift
out = gamma.reshape((1, C, 1, 1)) * X_hat + beta.reshape((1, C, 1, 1))
return out
Let’s do some sanity checks. We expect each column of the input matrix to be normalized.
In [ ]: A = nd.array([1,7,5,4,6,10], ctx=ctx).reshape((3,2))
A
In [ ]: pure_batch_norm(A,
gamma = nd.array([1,1], ctx=ctx),
beta=nd.array([0,0], ctx=ctx))
In [ ]: ga = nd.array([1,1], ctx=ctx)
be = nd.array([0,0], ctx=ctx)
B = nd.array([1,6,5,7,4,3,2,5,6,3,2,4,5,3,2,5,6], ctx=ctx).reshape((2,2,2,2))
B
In [ ]: pure_batch_norm(B, ga, be)
Our tests seem to support that we’ve done everything correctly. Note that for batch normalization, imple-
menting backward pass is a little bit tricky. Fortunately, you won’t have to worry about that here, because
the MXNet’s autograd package can handle differentiation for us automatically.
Besides that, in the testing process, we want to use the mean and variance of the complete dataset, instead
of those of mini batches. In the implementation, we use moving statistics as a trade off, because we don’t
want to or don’t have the ability to compute the statistics of the complete dataset (in the second loop).
Then here comes another concern: we need to maintain the moving statistics along with multiple runs of
the BN. It’s an engineering issue rather than a deep/machine learning issue. On the one hand, the moving
statistics are similar to gamma and beta; on the other hand, they are not updated by the gradient backwards.
In this quick-and-dirty implementation, we use the global dictionary variables to store the statistics, in which
each key is the name of the layer (scope_name), and the value is the statistics. (Attention: always be
very careful if you have to use global variables!) Moreover, we have another parameter is_training to
indicate whether we are doing training or testing.
Now we are ready to define our complete batch_norm():
In [ ]: def batch_norm(X,
gamma,
beta,
momentum = 0.9,
eps = 1e-5,
scope_name = '',
is_training = True,
debug = False):
"""compute the batch norm """
global _BN_MOVING_MEANS, _BN_MOVING_VARS
#########################
# the usual batch norm transformation
#########################
# dense
if len(X.shape) == 2:
# mini-batch mean
mean = nd.mean(X, axis=0)
# mini-batch variance
variance = nd.mean((X - mean) ** 2, axis=0)
# normalize
if is_training:
# while training, we normalize the data using its mean and variance
X_hat = (X - mean) * 1.0 / nd.sqrt(variance + eps)
else:
# while testing, we normalize the data using the pre-computed mean and
X_hat = (X - _BN_MOVING_MEANS[scope_name]) *1.0 / nd.sqrt(_BN_MOVING_VA
# scale and shift
out = gamma * X_hat + beta
# 2d conv
elif len(X.shape) == 4:
# extract the dimensions
N, C, H, W = X.shape
# mini-batch mean
mean = nd.mean(X, axis=(0,2,3))
# mini-batch variance
variance = nd.mean((X - mean.reshape((1, C, 1, 1))) ** 2, axis=(0, 2, 3))
# normalize
X_hat = (X - mean.reshape((1, C, 1, 1))) * 1.0 / nd.sqrt(variance.reshape((
if is_training:
# while training, we normalize the data using its mean and variance
X_hat = (X - mean.reshape((1, C, 1, 1))) * 1.0 / nd.sqrt(variance.resha
else:
# while testing, we normalize the data using the pre-computed mean and
X_hat = (X - _BN_MOVING_MEANS[scope_name].reshape((1, C, 1, 1))) * 1.0
/ nd.sqrt(_BN_MOVING_VARS[scope_name].reshape((1, C, 1, 1)) + eps)
# scale and shift
out = gamma.reshape((1, C, 1, 1)) * X_hat + beta.reshape((1, C, 1, 1))
#########################
# to keep the moving statistics
#########################
#########################
# debug info
#########################
if debug:
print('== info start ==')
print('scope_name = {}'.format(scope_name))
print('mean = {}'.format(mean))
print('var = {}'.format(variance))
print('_BN_MOVING_MEANS = {}'.format(_BN_MOVING_MEANS[scope_name]))
print('_BN_MOVING_VARS = {}'.format(_BN_MOVING_VARS[scope_name]))
print('output = {}'.format(out))
print('== info end ==')
#########################
# return
#########################
return out
params = [W1, b1, gamma1, beta1, W2, b2, gamma2, beta2, W3, b3, gamma3, beta3, W4,
In [ ]: for param in params:
param.attach_grad()
########################
# Define the computation of the second convolutional layer
########################
h2_conv = nd.Convolution(data=h1, weight=W2, bias=b2, kernel=(5,5), num_filter=
########################
# Flattening h2 so that we can feed it into a fully-connected layer
########################
h2 = nd.flatten(h2)
if debug:
print("Flat h2 shape: %s" % (np.array(h2.shape)))
########################
# Define the computation of the third (fully-connected) layer
########################
h3_linear = nd.dot(h2, W3) + b3
h3_normed = batch_norm(h3_linear, gamma3, beta3, scope_name='bn3', is_training=
h3 = relu(h3_normed)
if debug:
print("h3 shape: %s" % (np.array(h3.shape)))
########################
# Define the computation of the output layer
########################
yhat_linear = nd.dot(h3, W4) + b4
if debug:
print("yhat_linear shape: %s" % (np.array(yhat_linear.shape)))
return yhat_linear
3.29.10 Optimizer
In [ ]: def SGD(params, lr):
for param in params:
param[:] = param - lr * param.grad
label = label.as_in_context(ctx)
label_one_hot = nd.one_hot(label, 10)
output = net(data, is_training=False) # attention here!
predictions = nd.argmax(output, axis=1)
numerator += nd.sum(predictions == label)
denominator += data.shape[0]
return (numerator / denominator).asscalar()
for e in range(epochs):
for i, (data, label) in enumerate(train_data):
data = data.as_in_context(ctx)
label = label.as_in_context(ctx)
label_one_hot = nd.one_hot(label, num_outputs)
with autograd.record():
# we are in training process,
# so we normalize the data using batch mean and variance
output = net(data, is_training=True)
loss = softmax_cross_entropy(output, label_one_hot)
loss.backward()
SGD(params, learning_rate)
##########################
# Keep a moving average of the losses
##########################
if i == 0:
moving_loss = nd.mean(loss).asscalar()
else:
moving_loss = .99 * moving_loss + .01 * nd.mean(loss).asscalar()
3.29.13 Next
Batch normalization with gluon
For whinges or inquiries, open an issue on GitHub.
net.add(gluon.nn.Conv2D(channels=50, kernel_size=5))
net.add(gluon.nn.BatchNorm(axis=1, center=True, scale=True))
net.add(gluon.nn.Activation(activation='relu'))
net.add(gluon.nn.MaxPool2D(pool_size=2, strides=2))
# The Flatten layer collapses all axis, except the first one, into one axis.
net.add(gluon.nn.Flatten())
net.add(gluon.nn.Dense(num_fc))
net.add(gluon.nn.BatchNorm(axis=1, center=True, scale=True))
net.add(gluon.nn.Activation(activation='relu'))
net.add(gluon.nn.Dense(num_outputs))
3.30.5 Optimizer
In [ ]: trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': .1})
for e in range(epochs):
for i, (data, label) in enumerate(train_data):
data = data.as_in_context(ctx)
label = label.as_in_context(ctx)
with autograd.record():
output = net(data)
loss = softmax_cross_entropy(output, label)
loss.backward()
trainer.step(data.shape[0])
##########################
# Keep a moving average of the losses
##########################
curr_loss = nd.mean(loss).asscalar()
moving_loss = (curr_loss if ((i == 0) and (e == 0))
else (1 - smoothing_constant) * moving_loss + (smoothing_con
3.30.8 Next
Introduction to recurrent neural networks
For whinges or inquiries, open an issue on GitHub.
At each iteration 𝑡, we feed in a new example 𝑥𝑡 , by setting the values of the input nodes (orange). We then
feed the activation forward by successively calculating the activations of each higher layer in the network.
Finally, we read the outputs from the topmost layer.
So when we feed the next example 𝑥𝑡+1 , we overwrite all of the previous activations. If consecutive inputs
to our network have no special relationship to each other (say, images uploaded by unrelated users), then
this is perfectly acceptable behavior. But what if our inputs exhibit a sequential relationship?
Say for example that you want to predict the next character in a string of text. We might decide to feed each
character into the neural network with the goal of predicting the succeeding character.
In the above example, the neural network forgets the previous context every time you feed a new input. How
is the neural network supposed to know that “e” is followed by a space? It’s hard to see why that should be
so probable if you didn’t know that the “e” was the final letter in the word “Time”.
Recurrent neural networks provide a slick way to incorporate sequential structure. At each time step 𝑡, each
hidden layer ℎ𝑡 (typically) will receive input from both the current input 𝑥𝑡 and from that same hidden layer
at the previous time step ℎ𝑡−1
Now, when our net is trying to predict what comes after the “e” in time, it has access to its previous beliefs,
and by extension, the entire history of inputs. Zooming back in to see how the nodes in a basic RNN are
connected, you’ll see that each node in the hidden layer is connected to each node at the hidden layer at the
next time step:
Even though the neural network contains loops (the hidden layer is connected to itself), because this con-
nection spans a time step our network is still technically a feedforward network. Thus we can still train by
backpropagration just as we normally would with an MLP. Typically the loss function will be an average of
the losses at each time step.
In this tutorial, we’re going to roll up our sleeves and write a simple RNN in MXNet using nothing but
mxnet.ndarray and mxnet.autograd. In practice, unless you’re trying to develop fundamentally
new recurrent layers, you’ll want to use the prebuilt layers that call down to extremely optimized primitives.
You’ll also want to rely on some pre-built batching code because batching sequences can be a pain. But we
think in general, if you’re going to work with this stuff, and have a modicum of self respect, you’ll want to
implement from scratch and understand how it works at a reasonably low level.
Let’s go ahead and import our dependencies and specify our context. If you’ve been following along without
a GPU until now, this might be where you’ll want to get your hands on some faster hardware. GPU instances
are available by the hour through Amazon Web Services. A single GPU via a p2 instance (NVIDIA K80s)
or even an older g2 instance will be perfectly adequate for this tutorial.
In [1]: from __future__ import print_function
import mxnet as mx
from mxnet import nd, autograd
import numpy as np
mx.random.seed(1)
ctx = mx.gpu(0)
And you’ll probably want to get a taste for what the text looks like.
In [3]: print(time_machine[0:500])
Project Gutenberg's The Time Machine, by H. G. (Herbert George) Wells
This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever. You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.net
Language: English
3.31.2 Tidying up
I went through and discovered that the last 38083 characters consist entirely of legalese from the Gutenberg
gang. So let’s chop that off lest our language model learn to generate such boring drivel.
In [4]: print(time_machine[-38075:-37500])
time_machine = time_machine[:-38083]
End of Project Gutenberg's The Time Machine, by H. G. (Herbert George) Wells
*** END OF THIS PROJECT GUTENBERG EBOOK THE TIME MACHINE ***
Creating the works from public domain print editions means that no
one owns a United States copyright in these works, so the Foundation
(and you!) c
We’ll often want to access the index corresponding to each character quickly so let’s store this as a dictionary.
In [6]: character_dict = {}
for e, char in enumerate(character_list):
character_dict[char] = e
print(character_dict)
{'H': 0, ']': 44, ';': 1, 'J': 65, 'Q': 50, 'D': 2, '_': 4, 'a': 43, ' ': 6, '0': 7, 'V': 9
In [7]: time_numerical = [character_dict[char] for char in time_machine]
In [8]: #########################
# Check that the length is right
#########################
print(len(time_numerical))
#########################
# Check that the format looks right
#########################
print(time_numerical[:20])
#########################
# Convert back to text
#########################
print("".join([character_list[idx] for idx in time_numerical[:39]]))
179533
[61, 23, 69, 21, 15, 5, 41, 6, 62, 20, 41, 15, 27, 67, 15, 23, 55, 14, 71, 6]
Project Gutenberg's The Time Machine, b
[[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0.]]
<NDArray 2x77 @gpu(0)>
That looks about right. Now let’s write a function to convert our one-hots back to readable text.
In [11]: def textify(embedding):
result = ""
indices = nd.argmax(embedding, axis=1).asnumpy()
for idx in indices:
result += character_list[int(idx)]
return result
In [12]: textify(one_hots(time_numerical[0:40]))
Out[12]: "Project Gutenberg's The Time Machine, by"
Now that we’ve chopped our dataset into sequences of length seq_length, at every time step, our input is
a single one-hot vector. This means that our computation of the hidden layer would consist of matrix-vector
multiplications, which are not especially efficient on GPU. To take advantage of the available computing
resources, we’ll want to feed through a batch of sequences at the same time. The following code may look
tricky but it’s just some plumbing to make the data look like this.
In [14]: batch_size = 32
chapter05_recurrent-neural-networks/img/recurrent-ba
Let’s sanity check that everything went the way we hope. For each data_row, the second sequence should
follow the first:
In [16]: for i in range(3):
print("***Batch %s:***\n %s \n %s \n\n" % (i, textify(train_data[i, :, 0]), te
***Batch 0:***
Project Gutenberg's The Time Machine, by H. G. (Herbert George)
vement of the barometer. Yesterday it was so high, yesterday nig
***Batch 1:***
Wells
***Batch 2:***
nd with
almost no restrictions whatsoever. You may copy it, giv
d to
here. Surely the mercury did not trace this line in any of
train_label = nd.swapaxes(train_label, 1, 2)
print(train_label.shape)
(87, 64, 32, 77)
chapter05_recurrent-neural-networks/img/simple-rnn.p
Recall that the update for an ordinary hidden layer in a neural network with activation function 𝜑 is given by
ℎ = 𝜑(𝑥𝑊 + 𝑏)
To make this a recurrent neural network, we’re simply going to add a weight sum of the previous hidden
state ℎ𝑡−1 :
𝑦
^𝑡 = softmax𝑜𝑛𝑒−ℎ𝑜𝑡 (ℎ𝑡 𝑊ℎ𝑦 + 𝑏𝑦 )
########################
# Weights connecting the inputs to the hidden layer
########################
Wxh = nd.random_normal(shape=(num_inputs,num_hidden), ctx=ctx) * .01
########################
########################
# Bias vector for hidden layer
########################
bh = nd.random_normal(shape=num_hidden, ctx=ctx) * .01
########################
# Weights to the output nodes
########################
Why = nd.random_normal(shape=(num_hidden,num_outputs), ctx=ctx) * .01
by = nd.random_normal(shape=num_outputs, ctx=ctx) * .01
In [24]: ####################
# Often we want to sample with low temperatures to produce sharp probabilities
####################
softmax(nd.array([[10,-10],[-10,10]]), temperature=.1)
Out[24]:
[[ 1. 0.]
[ 0. 1.]]
<NDArray 2x2 @cpu(0)>
3.31.15 Optimizer
In [29]: def SGD(params, lr):
for param in params:
param[:] = param - lr * param.grad
#####################################
# Prepare the prefix as a sequence of one-hots for ingestion by RNN
#####################################
prefix_numerical = [character_dict[char] for char in prefix]
input_sequence = one_hots(prefix_numerical)
#####################################
# Set the initial state of the hidden representation ($h_0$) to the zero vecto
#####################################
sample_state = nd.zeros(shape=(1, num_hidden), ctx=ctx)
#####################################
# For num_chars iterations,
# 1) feed in the current input
# 2) sample next character from from output distribution
# 3) add sampled character to the decoded string
# 4) prepare the sampled character as a one_hot (to be the next input)
#####################################
for i in range(num_chars):
outputs, sample_state = simple_rnn(input_sequence, sample_state, temperatu
choice = np.random.choice(vocab_size, p=outputs[-1][0].asnumpy())
string += character_list[choice]
input_sequence = one_hots([choice])
return string
In [ ]: epochs = 2000
moving_loss = 0.
learning_rate = .5
##########################
# Keep a moving average of the losses
##########################
if (i == 0) and (e == 0):
moving_loss = np.mean(loss.asnumpy()[0])
else:
moving_loss = .99 * moving_loss + .01 * np.mean(loss.asnumpy()[0])
3.31.17 Conclusions
Once you start running this code, it will spit out a sample at the end of each epoch. I’ll leave this output cell
blank so you don’t see megabytes of text, but here are some patterns that I observed when I ran this code.
The network seems to first work out patterns with no sequential relationship and then slowly incorporates
longer and longer windows of context. After just 1 epoch, my RNN generated this:
e e e ee e eee e e ee e e ee e e ee
˓→ e e ee e e e e e e e e
˓→e ee e aee e e ee e e ee ee e ee
˓→e e e e e ete e e e e e e ee n eee
˓→ee e eeee e e e e e e ee e e e e
˓→e e eee ee e e e e e e ee
˓→ee e e e e e e e e t e ee e eee e e e
˓→ee e e e e eee e e e eeeee
˓→ e eeee e e ee ee ee a e e eee ee e
˓→e e e aee e e e e eee e
˓→ e e e e e e e e e e e e
˓→ee e ee e e e e e e e
˓→ e e e e ee e e ee n e ee e e
˓→ e e e e t ee ee ee eee et e
˓→ e e e ee e e e e e e e e
˓→ e e"
It’s learned that spaces and “e”s (to my knowledge, there’s no aesthetically pleasing way to spell the plural
form of the letter “e”) are the most common characters.
A little bit later on it spits out strings like:
the the the the the the the the the the the the the the the the the the the the the the the the the the
the the the the the the the the the the the the the the the the the the the the the the the the the the
the the the the the the the the the the the the the the the the the the the the the the the the the the
the the the the the the the the the the the the the the the the the the the the the the the the the the
the the the the the the the the the the the the the the the the the the the the the the the the the the
the the the the the the the the the the the the the the the the the the the the the the the the the the
the the the the the the the the the the the the the the the the the the the the the the the the the the
the the the the the the the the the the the the the the the the the the the the the the the the the the
the the the the the the the the the the the the the the the the the the the the the the the the the the
the the
At this point it’s learned that after the space usually comes a nonspace character, and perhaps that “t” is the
most common character to immediately follow a space, “h” to follow a “t” and “e” to follow “th”. However
it doesn’t appear to be looking far enough back to realize that the word “the” should be very unlikely
immediately after the word “the”. . .
By the 175th epoch, the model appears to be putting together a fairly large vocabulary although it puts words
together in ways that might be charitably described as “creative”.
the little people had been as I store of the sungher had leartered along the realing of the stars of
the little past and stared at the thing that I had the sun had to the stars of the sunghed a stirnt a
moment the sun had come and fart as the stars of the sunghed a stirnt a moment the sun had to
the was completely and of the little people had been as I stood and all amations of the staring
and some of the really
In subsequent tutorials we’ll explore sophisticated techniques for evaluating and improving language mod-
els. We’ll also take a look at some related but more complicate problems like language translations and
image captioning.
3.31.18 Next
LSTM recurrent neural networks from scratch
For whinges or inquiries, open an issue on GitHub.
𝑐𝑡 = 𝑓𝑡 ⊙ 𝑐𝑡−1 + 𝑖𝑡 ⊙ 𝑔𝑡 ,
ℎ𝑡 = 𝑜𝑡 ⊙ tanh(𝑐𝑡 ),
where ⊙ is an element-wise multiplication operator, and for all = [𝑥1 , 𝑥2 , . . . , 𝑥𝑘 ]⊤ ∈𝑘 the two activation
functions:
[︂ ]︂⊤
1 1
𝜎() = ,..., ] ,
1 + exp(−𝑥1 ) 1 + exp(−𝑥𝑘 )
[︂ ]︂⊤
1 − exp(−2𝑥1 ) 1 − exp(−2𝑥𝑘 )
tanh() = ,..., .
1 + exp(−2𝑥1 ) 1 + exp(−2𝑥𝑘 )
In the transformations above, the memory cell 𝑐𝑡 stores the “long-term” memory in the vector form. In other
words, the information accumulatively captured and encoded until time step 𝑡 is stored in 𝑐𝑡 and is only
passed along the same layer over different time steps.
Given the inputs 𝑐𝑡 and ℎ𝑡 , the input gate 𝑖𝑡 and forget gate 𝑓𝑡 will help the memory cell to decide how to
overwrite or keep the memory information. The output gate 𝑜𝑡 further lets the LSTM block decide how to
retrieve the memory information to generate the current state ℎ𝑡 that is passed to both the next layer of the
current time step and the next time step of the current layer. Such decisions are made using the hidden-layer
parameters 𝑊 and 𝑏 with different subscripts: these parameters will be inferred during the training phase by
gluon.
########################
# Weights connecting the inputs to the hidden layer
########################
Wxg = nd.random_normal(shape=(num_inputs,num_hidden), ctx=ctx) * .01
Wxi = nd.random_normal(shape=(num_inputs,num_hidden), ctx=ctx) * .01
Wxf = nd.random_normal(shape=(num_inputs,num_hidden), ctx=ctx) * .01
Wxo = nd.random_normal(shape=(num_inputs,num_hidden), ctx=ctx) * .01
########################
# Recurrent weights connecting the hidden layer across time steps
########################
Whg = nd.random_normal(shape=(num_hidden,num_hidden), ctx=ctx)* .01
Whi = nd.random_normal(shape=(num_hidden,num_hidden), ctx=ctx)* .01
Whf = nd.random_normal(shape=(num_hidden,num_hidden), ctx=ctx)* .01
Who = nd.random_normal(shape=(num_hidden,num_hidden), ctx=ctx)* .01
########################
# Bias vector for hidden layer
########################
bg = nd.random_normal(shape=num_hidden, ctx=ctx) * .01
bi = nd.random_normal(shape=num_hidden, ctx=ctx) * .01
bf = nd.random_normal(shape=num_hidden, ctx=ctx) * .01
bo = nd.random_normal(shape=num_hidden, ctx=ctx) * .01
########################
# Weights to the output nodes
########################
Why = nd.random_normal(shape=(num_hidden,num_outputs), ctx=ctx) * .01
by = nd.random_normal(shape=num_outputs, ctx=ctx) * .01
3.32.13 Optimizer
In [14]: def SGD(params, lr):
for param in params:
param[:] = param - lr * param.grad
#####################################
# Prepare the prefix as a sequence of one-hots for ingestion by RNN
#####################################
prefix_numerical = [character_dict[char] for char in prefix]
input_sequence = one_hots(prefix_numerical)
#####################################
# Set the initial state of the hidden representation ($h_0$) to the zero vecto
#####################################
h = nd.zeros(shape=(1, num_hidden), ctx=ctx)
c = nd.zeros(shape=(1, num_hidden), ctx=ctx)
#####################################
# For num_chars iterations,
# 1) feed in the current input
# 2) sample next character from from output distribution
# 3) add sampled character to the decoded string
# 4) prepare the sampled character as a one_hot (to be the next input)
#####################################
for i in range(num_chars):
outputs, h, c = lstm_rnn(input_sequence, h, c, temperature=temperature)
choice = np.random.choice(vocab_size, p=outputs[-1][0].asnumpy())
string += character_list[choice]
input_sequence = one_hots([choice])
return string
In [ ]: epochs = 2000
moving_loss = 0.
learning_rate = 2.0
for e in range(epochs):
############################
# Attenuate the learning rate by a factor of 2 every 100 epochs.
############################
if ((e+1) % 100 == 0):
learning_rate = learning_rate / 2.0
h = nd.zeros(shape=(batch_size, num_hidden), ctx=ctx)
c = nd.zeros(shape=(batch_size, num_hidden), ctx=ctx)
for i in range(num_batches):
data_one_hot = train_data[i]
label_one_hot = train_label[i]
with autograd.record():
outputs, h, c = lstm_rnn(data_one_hot, h, c)
loss = average_ce_loss(outputs, label_one_hot)
loss.backward()
SGD(params, learning_rate)
##########################
# Keep a moving average of the losses
##########################
if (i == 0) and (e == 0):
moving_loss = nd.mean(loss).asscalar()
else:
moving_loss = .99 * moving_loss + .01 * nd.mean(loss).asscalar()
3.32.15 Conclusions
3.32.16 Next
Gated recurrent units (GRU) RNNs from scratch
For whinges or inquiries, open an issue on GitHub.
• The input gate 𝑖𝑡 and forget gate 𝑓𝑡 are replaced by an single update gate 𝑧𝑡 , which weighs the old and
new content via 𝑧𝑡 and (1 − 𝑧𝑡 ) respectively.
• There is no output gate 𝑜𝑡 ; the weighted sum is what becomes ℎ𝑡 .
We use the GRU block with the following transformations that map inputs to outputs across blocks at
consecutive layers and consecutive time steps:
########################
# Weights connecting the inputs to the hidden layer
########################
Wxz = nd.random_normal(shape=(num_inputs,num_hidden), ctx=ctx) * .01
Wxr = nd.random_normal(shape=(num_inputs,num_hidden), ctx=ctx) * .01
Wxh = nd.random_normal(shape=(num_inputs,num_hidden), ctx=ctx) * .01
########################
# Recurrent weights connecting the hidden layer across time steps
########################
Whz = nd.random_normal(shape=(num_hidden,num_hidden), ctx=ctx)* .01
Whr = nd.random_normal(shape=(num_hidden,num_hidden), ctx=ctx)* .01
Whh = nd.random_normal(shape=(num_hidden,num_hidden), ctx=ctx)* .01
########################
# Bias vector for hidden layer
########################
bz = nd.random_normal(shape=num_hidden, ctx=ctx) * .01
br = nd.random_normal(shape=num_hidden, ctx=ctx) * .01
bh = nd.random_normal(shape=num_hidden, ctx=ctx) * .01
########################
# Weights to the output nodes
########################
Why = nd.random_normal(shape=(num_hidden,num_outputs), ctx=ctx) * .01
by = nd.random_normal(shape=num_outputs, ctx=ctx) * .01
3.33.13 Optimizer
In [14]: def SGD(params, lr):
for param in params:
param[:] = param - lr * param.grad
#####################################
# Prepare the prefix as a sequence of one-hots for ingestion by RNN
#####################################
prefix_numerical = [character_dict[char] for char in prefix]
input_sequence = one_hots(prefix_numerical)
#####################################
# Set the initial state of the hidden representation ($h_0$) to the zero vecto
#####################################
h = nd.zeros(shape=(1, num_hidden), ctx=ctx)
c = nd.zeros(shape=(1, num_hidden), ctx=ctx)
#####################################
# For num_chars iterations,
# 1) feed in the current input
# 2) sample next character from from output distribution
# 3) add sampled character to the decoded string
# 4) prepare the sampled character as a one_hot (to be the next input)
#####################################
for i in range(num_chars):
outputs, h = gru_rnn(input_sequence, h, temperature=temperature)
choice = np.random.choice(vocab_size, p=outputs[-1][0].asnumpy())
string += character_list[choice]
input_sequence = one_hots([choice])
return string
In [ ]: epochs = 2000
moving_loss = 0.
learning_rate = 2.0
##########################
# Keep a moving average of the losses
##########################
if (i == 0) and (e == 0):
moving_loss = nd.mean(loss).asscalar()
else:
moving_loss = .99 * moving_loss + .01 * nd.mean(loss).asscalar()
3.33.15 Conclusions
[Placeholder]
3.33.16 Next
Simple, LSTM, and GRU RNNs with gluon
For whinges or inquiries, open an issue on GitHub.
def __len__(self):
return len(self.idx2word)
The Dictionary class is used by the Corpus class to index the words of the input document.
In [ ]: class Corpus(object):
def __init__(self, path):
self.dictionary = Dictionary()
self.train = self.tokenize(path + 'train.txt')
self.valid = self.tokenize(path + 'valid.txt')
self.test = self.tokenize(path + 'test.txt')
super(RNNModel, self).__init__(**kwargs)
with self.name_scope():
self.drop = nn.Dropout(dropout)
self.encoder = nn.Embedding(vocab_size, num_embed,
weight_initializer = mx.init.Uniform(0.1))
if mode == 'rnn_relu':
self.rnn = rnn.RNN(num_hidden, num_layers, activation='relu', dropo
input_size=num_embed)
elif mode == 'rnn_tanh':
self.rnn = rnn.RNN(num_hidden, num_layers, dropout=dropout,
input_size=num_embed)
elif mode == 'lstm':
self.rnn = rnn.LSTM(num_hidden, num_layers, dropout=dropout,
input_size=num_embed)
elif mode == 'gru':
self.rnn = rnn.GRU(num_hidden, num_layers, dropout=dropout,
input_size=num_embed)
else:
raise ValueError("Invalid mode %s. Options are rnn_relu, "
"rnn_tanh, lstm, and gru"%mode)
if tie_weights:
self.decoder = nn.Dense(vocab_size, in_units = num_hidden,
params = self.encoder.params)
else:
self.decoder = nn.Dense(vocab_size, in_units = num_hidden)
self.num_hidden = num_hidden
args_lr = 1.0
args_clip = 0.2
args_epochs = 1
args_batch_size = 32
args_bptt = 5
args_dropout = 0.2
args_tied = True
args_cuda = 'store_true'
args_log_interval = 500
args_save = 'model.param'
3.34.7 Train the model and evaluate on validation and testing data sets
Now we can define functions for training and evaluating the model. The following are two helper functions
that will be used during model training and evaluation.
In [ ]: def get_batch(source, i):
seq_len = min(args_bptt, source.shape[0] - 1 - i)
data = source[i : i + seq_len]
target = source[i + 1 : i + 1 + seq_len]
def detach(hidden):
if isinstance(hidden, (tuple, list)):
hidden = [i.detach() for i in hidden]
else:
hidden = hidden.detach()
return hidden
The following is the function for model evaluation. It returns the loss of the model prediction. We will
discuss the details of the loss measure shortly.
In [ ]: def eval(data_source):
total_L = 0.0
ntotal = 0
hidden = model.begin_state(func = mx.nd.zeros, batch_size = args_batch_size, ct
for i in range(0, data_source.shape[0] - 1, args_bptt):
data, target = get_batch(data_source, i)
output, hidden = model(data, hidden)
L = loss(output, target)
total_L += mx.nd.sum(L).asscalar()
ntotal += L.size
return total_L / ntotal
Now we are ready to define the function for training the model. We can monitor the model performance on
the training, validation, and testing data sets over iterations.
In [ ]: def train():
best_val = float("Inf")
for epoch in range(args_epochs):
total_L = 0.0
start_time = time.time()
hidden = model.begin_state(func = mx.nd.zeros, batch_size = args_batch_size
for ibatch, i in enumerate(range(0, train_data.shape[0] - 1, args_bptt)):
data, target = get_batch(train_data, i)
hidden = detach(hidden)
with autograd.record():
output, hidden = model(data, hidden)
L = loss(output, target)
L.backward()
trainer.step(args_batch_size)
total_L += mx.nd.sum(L).asscalar()
val_L = eval(val_data)
print('[Epoch %d] time cost %.2fs, validation loss %.2f, validation perplex
epoch + 1, time.time() - start_time, val_L, math.exp(val_L)))
Recall that the RNN model training is based on maximization likelihood of observations. For evaluation
purposes, we have used the following two measures:
• Loss: the loss function is defined as the average negative log likelihood of the target words (ground
truth) under prediction:
𝑁
1 ∑︁
loss = − log 𝑝target𝑖 ,
𝑁
𝑖=1
where 𝑁 is the number of predictions and 𝑝target𝑖 the predicted likelihood of the 𝑖-th target word.
• Perplexity: the average per-word perplexity is exp(loss).
To orient the reader using concrete examples, let us illustrate the idea of the perplexity measure as follows.
• Consider the perfect scenario where the model always predicts the likelihood of the target word as 1.
In this case, for every 𝑖 we have 𝑝target𝑖 = 1. As a result, the perplexity of the perfect model is 1.
• Consider a baseline scenario where the model always predicts the likelihood of the target word ran-
domly at uniform among the given word set 𝑊 . In this case, for every 𝑖 we have 𝑝target𝑖 = 1/|𝑊 |. As
a result, the perplexity of a uniformly random prediction model is always |𝑊 |.
• Consider the worst-case scenario where the model always predicts the likelihood of the target word
as 0. In this case, for every 𝑖 we have 𝑝target𝑖 = 0. As a result, the perplexity of the worst model is
positive infinity.
Therefore, a model with a lower perplexity that is closer to 1 is generally more effective. Any effective
model has to achieve a perplexity lower than the cardinality of the target set.
Now we are ready to train the model and evaluate the model performance on validation and testing data sets.
In [ ]: train()
model.load_parameters(args_save, context)
test_L = eval(test_data)
print('Best test loss %.2f, test perplexity %.2f'%(test_L, math.exp(test_L)))
3.34.8 Next
Introduction to optimization
For whinges or inquiries, open an issue on GitHub.
3.35 Introduction
You might find it weird that we’re sticking a chapter on optimization here. If you’re following the tutorials
in sequence, then you’ve probably already been optimizing over the parameters of ten or more machine
learning models. You might consider yourself an old pro. In this chapter we’ll supply some depth to
complement your experience.
We need to think seriously about optimization matters for several reasons. First, we want optimizers to be
fast. Optimizing complicated models with millions of parameters can take upsettingly long. You might have
heard of researchers training deep learning models for many hours, days, or even weeks. They probably
weren’t exaggerating. Second, optimization is how we choose our parameters. So the performance (e.g.
accuracy) of our models depends entirely on the quality of the optimizer.
def f(x):
return x * np.cos(np.pi * x)
analytic solution. To refresh your memory, in linear regression we build a predictor of the form:
y
^ = 𝑋w
We ignored the intercept term 𝑏 here, but that can be handled by simply appending a column of all 1s to the
design matrix X.
And we want to solve the following minimization problem
As a refresher, that’s just the sum of the squared differences between our predictions and the ground truth
answers.
𝑛
∑︁
(𝑦𝑖 − w𝑇 x𝑖 )2
𝑖=1
Because we know that this function is quadratic, we know that it has a single critical point where the
derivative of the loss with respect to the weights w is equal to 0. Moreover, we know that the weights that
minimize our loss constitute a critical point. So our solution corresponds to the one setting of the weights
that gives a derivative of 0. First, let’s rewrite our loss function:
^) = (y − 𝑋w)𝑇 (y − 𝑋w)
ℒ(y, y
Now, setting the derivative of our loss to 0 gives the following equation:
𝜕ℒ(y, y
^)
= −2(𝑋)𝑇 (y − 𝑋w) = 0
𝜕w
We can now simplify these equations to find the optimal setting of the parameters w:
−2𝑋 𝑇 y + 2𝑋 𝑇 𝑋w = 0 (3.1)
𝑇 𝑇
𝑋 𝑋w = 𝑋 y
(3.2)
w = (𝑋 𝑇 𝑋)−1 𝑋 𝑇
y
(3.3)
You might have noticed that we assumed that the matrix 𝑋 𝑇 𝑋 can be inverted. If you take this fact for
granted, then it should be clear that we can recover the exact optimal value w* exactly. No matter what
values the data 𝑋, y takes we can produce an exact answer by computing just one matrix multiplication, one
matrix inversion, and two matrix-vector products.
For many problems, even if they don’t have an analytic solution, they may have only one minima. An
especially convenient class of functions are the convex functions. These are functions with a uniformly pos-
itive second derivative. They have no local minima and are especially well-suited to efficient optimization.
Unfortunately, this is a book about neural networks. And neural networks are not in general convex. More-
over, they have abundant local minima. With numerical methods, it may not be possible to find the global
minimizer of an objective function. For non-convex functions, a numerical method often halts around local
minima that are not necessarily the global minima.
Many optimization algorithms, like Newton’s method, are designed to be attracted to critical points, includ-
ing minima and saddle points. Since saddle points are generally common in high-dimensional space, some
optimization algorithms, such as Newton’s method, may fail to train deep learning models effectively as
they may get stuck in saddle points. Another challenging scenarios for neural networks is that there may be
large, flat regions in parameters space that correspond to bad values of the objective function.
where the coefficient term of 𝒪(𝜖2 ) is 𝑓 ′′ (𝑥)/2. This means that a small change of order 𝜖 in the optimum
solution 𝑥* will change the value of 𝑓 (𝑥* ) in the order of 𝜖2 . In other words, if there is an error in the
function value, the precision of the solution value is constrained by the order of the square root of that error.
For example, if the machine precision is 10−8 , the precision of the solution value is only in the order of
10−4 , which is much worse than the machine precision.
3.35.7 Next
Gradient descent and stochastic gradient descent from scratch
For whinges or inquiries, open an issue on GitHub.
𝑓 (𝑥 + 𝜖) ≈ 𝑓 (𝑥) + 𝑓 ′ (𝑥)𝜖.
𝑓 (𝑥 − 𝜂𝑓 ′ (𝑥)) ≤ 𝑓 (𝑥).
𝑥 := 𝑥 − 𝜂𝑓 ′ (𝑥)
3.36. Gradient descent and stochastic gradient descent from scratch 213
Deep Learning - The Straight Dope, Release 0.1
may reduce the value of 𝑓 (𝑥) if its current derivative value 𝑓 ′ (𝑥) ̸= 0. Since the derivative 𝑓 ′ (𝑥) is a special
case of gradient in one-dimensional domain, the above update of 𝑥 is gradient descent in one-dimensional
domain.
The positive scalar 𝜂 is called the learning rate or step size. Note that a larger learning rate increases the
chance of overshooting the global minimum and oscillating. However, if the learning rate is too small, the
convergence can be very slow. In practice, a proper learning rate is usually selected with experiments.
To keep our notation compact we may use the notation ∇𝑓 (x) and ∇x 𝑓 (x) interchangeably when there is
no ambiguity about which parameters we are optimizing over. In plain English, each element 𝜕𝑓 (x)/𝜕𝑥𝑖 of
the gradient indicates the rate of change for 𝑓 at the point x with respect to the input 𝑥𝑖 only. To measure
the rate of change of 𝑓 in any direction that is represented by a unit vector u, in multivariate calculus, we
define the directional derivative of 𝑓 at x in the direction of u as
𝑓 (x + ℎu) − 𝑓 (x)
𝐷u 𝑓 (x) = lim ,
ℎ→0 ℎ
which can be rewritten according to the chain rule as
𝐷u 𝑓 (x) = ∇𝑓 (x) · u.
Since 𝐷u 𝑓 (x) gives the rates of change of 𝑓 at the point x in all possible directions, to minimize 𝑓 , we are
interested in finding the direction where 𝑓 can be reduced fastest. Thus, we can minimize the directional
derivative 𝐷u 𝑓 (x) with respect to u. Since 𝐷u 𝑓 (x) = ‖∇𝑓 (x)‖ · ‖u‖ · cos(𝜃) = ‖∇𝑓 (x)‖ · cos(𝜃), where
𝜃 is the angle between ∇𝑓 (x) and u, the minimum value of cos(𝜃) is -1 when 𝜃 = 𝜋. Therefore, 𝐷u 𝑓 (x)
is minimized when u is at the opposite direction of the gradient ∇𝑓 (x). Now we can iteratively reduce the
value of 𝑓 with the following gradient descent update:
x := x − 𝜂∇𝑓 (x),
where the positive scalar 𝜂 is called the learning rate or step size.
where 𝑓𝑖 (x) is a loss function based on the training data instance indexed by 𝑖. It is important to highlight
that the per-iteration computational cost in gradient descent scales linearly with the training data set size 𝑛.
Hence, when 𝑛 is huge, the per-iteration computational cost of gradient descent is very high.
In view of this, stochastic gradient descent offers a lighter-weight solution. At each iteration, rather than
computing the gradient ∇𝑓 (x), stochastic gradient descent randomly samples 𝑖 at uniform and computes
∇𝑓𝑖 (x) instead. The insight is, stochastic gradient descent uses ∇𝑓𝑖 (x) as an unbiased estimator of ∇𝑓 (x)
since
𝑛
1 ∑︁
E𝑖 ∇𝑓𝑖 (x) = ∇𝑓𝑖 (x) = ∇𝑓 (x).
𝑛
𝑖=1
In a generalized case, at each iteration a mini-batch ℬ that consists of indices for training data instances may
be sampled at uniform with replacement. Similarly, we can use
1 ∑︁
∇𝑓ℬ (x) = ∇𝑓𝑖 (x)
|ℬ|
𝑖∈ℬ
to update x as
x := x − 𝜂∇𝑓ℬ (x),
where |ℬ| denotes the cardinality of the mini-batch and the positive scalar 𝜂 is the learning rate or step size.
Likewise, the mini-batch stochastic gradient ∇𝑓ℬ (x) is an unbiased estimator for the gradient ∇𝑓 (x):
This generalized stochastic algorithm is also called mini-batch stochastic gradient descent and we simply
refer to them as stochastic gradient descent (as generalized). The per-iteration computational cost is 𝒪(|ℬ|).
Thus, when the mini-batch size is small, the computational cost at each iteration is light.
There are other practical reasons that may make stochastic gradient descent more appealing than gradient
descent. If the training data set has many redundant data instances, stochastic gradients may be so close
to the true gradient ∇𝑓 (x) that a small number of iterations will find useful solutions to the optimization
problem. In fact, when the training data set is large enough, stochastic gradient descent only requires a
small number of iterations to find useful solutions such that the total computational cost is lower than that
of gradient descent even for just one iteration. Besides, stochastic gradient descent can be considered as
offering a regularization effect especially when the mini-batch size is small due to the randomness and
noise in the mini-batch sampling. Moreover, certain hardware processes mini-batches of specific sizes more
efficiently.
3.36.4 Experiments
For demonstrating the aforementioned gradient-based optimization algorithms, we use the regression prob-
lem in the linear regression chapter as a case study.
In [1]: # Mini-batch stochastic gradient descent.
def sgd(params, lr, batch_size):
for param in params:
param[:] = param - lr * param.grad / batch_size
3.36. Gradient descent and stochastic gradient descent from scratch 215
Deep Learning - The Straight Dope, Release 0.1
mx.random.seed(1)
random.seed(1)
# Generate data.
num_inputs = 2
num_examples = 1000
true_w = [2, -3.4]
true_b = 4.2
X = nd.random_normal(scale=1, shape=(num_examples, num_inputs))
y = true_w[0] * X[:, 0] + true_w[1] * X[:, 1] + true_b
y += .01 * nd.random_normal(scale=1, shape=y.shape)
dataset = gluon.data.ArrayDataset(X, y)
# Linear regression.
def net(X, w, b):
return nd.dot(X, w) + b
# Loss function.
def square_loss(yhat, y):
return (yhat - y.reshape(yhat.shape)) ** 2 / 2
In [3]: %matplotlib inline
import matplotlib as mpl
mpl.rcParams['figure.dpi']= 120
import matplotlib.pyplot as plt
import numpy as np
3.36. Gradient descent and stochastic gradient descent from scratch 217
Deep Learning - The Straight Dope, Release 0.1
3.36. Gradient descent and stochastic gradient descent from scratch 219
Deep Learning - The Straight Dope, Release 0.1
3.36.5 Next
Gradient descent and stochastic gradient descent with Gluon
For whinges or inquiries, open an issue on GitHub.
mx.random.seed(1)
random.seed(1)
# Generate data.
num_inputs = 2
num_examples = 1000
true_w = [2, -3.4]
true_b = 4.2
X = nd.random_normal(scale=1, shape=(num_examples, num_inputs))
net = gluon.nn.Sequential()
net.add(gluon.nn.Dense(1))
square_loss = gluon.loss.L2Loss()
In [2]: %matplotlib inline
import matplotlib as mpl
mpl.rcParams['figure.dpi']= 120
import matplotlib.pyplot as plt
3.37. Gradient descent and stochastic gradient descent with Gluon 221
Deep Learning - The Straight Dope, Release 0.1
3.37. Gradient descent and stochastic gradient descent with Gluon 223
Deep Learning - The Straight Dope, Release 0.1
3.37.1 Next
Momentum from scratch
For whinges or inquiries, open an issue on GitHub.
𝜕 2 𝑓 (x)
H𝑖,𝑗 =
𝜕𝑥𝑖 𝜕𝑥𝑗
for all 𝑖, 𝑗 = 1, . . . , 𝑑. Since H is a real symmetric matrix, by spectral theorem, it is orthogonally diagonal-
izable as
S⊤ HS = Λ,
v := 𝛾v + 𝜂∇𝑓ℬ (x),
x := x − v,
where v is the current velocity and 𝛾 is the momentum parameter. The learning rate 𝜂 and the stochastic
gradient ∇𝑓ℬ (x) with respect to the sampled mini-batch ℬ are both defined in the previous chapter.
It is important to highlight that, the scale of advancement at each iteration now also depends on how aligned
the directions of the past gradients are. This scale is the largest when all the past gradients are perfectly
aligned to the same direction.
To better understand the momentum parameter 𝛾, let us simplify the scenario by assuming the stochastic
gradients ∇𝑓ℬ (x) are the same as g throughout the iterations. Since all the gradients are perfectly aligned
to the same direction, the momentum algorithm accelerates the advancement along the same direction of g
as
v1 := 𝜂g,
v2 := 𝛾v1 + 𝜂g = 𝜂g(𝛾 + 1),
v3 := 𝛾v2 + 𝜂g = 𝜂g(𝛾 2 + 𝛾 + 1),
...
𝜂g
vinf := .
1−𝛾
Thus, if 𝛾 = 0.99, the final velocity is 100 times faster than that of the corresponding gradient descent where
the gradient is g.
Now with the momentum algorithm, a sample search path can be improved as illustrated in the following
figure.
Experiments
For demonstrating the momentum algorithm, we still use the regression problem in the linear regression
chapter as a case study. Specifically, we investigate stochastic gradient descent with momentum.
In [1]:
def sgd_momentum(params, vs, lr, mom, batch_size):
for param, v in zip(params, vs):
v[:] = mom * v + lr * param.grad / batch_size
param[:] = param - v
In [2]: import mxnet as mx
from mxnet import autograd
from mxnet import ndarray as nd
from mxnet import gluon
import random
mx.random.seed(1)
random.seed(1)
# Generate data.
num_inputs = 2
num_examples = 1000
true_w = [2, -3.4]
true_b = 4.2
X = nd.random_normal(scale=1, shape=(num_examples, num_inputs))
y = true_w[0] * X[:, 0] + true_w[1] * X[:, 1] + true_b
y += .01 * nd.random_normal(scale=1, shape=y.shape)
dataset = gluon.data.ArrayDataset(X, y)
#
vs.append(param.zeros_like())
return params, vs
# Linear regression.
def net(X, w, b):
return nd.dot(X, w) + b
# Loss function.
def square_loss(yhat, y):
return (yhat - y.reshape(yhat.shape)) ** 2 / 2
In [3]: %matplotlib inline
import matplotlib as mpl
mpl.rcParams['figure.dpi']= 120
import matplotlib.pyplot as plt
import numpy as np
Next
Momentum with Gluon
For whinges or inquiries, open an issue on GitHub.
mx.random.seed(1)
random.seed(1)
# Generate data.
num_inputs = 2
num_examples = 1000
true_w = [2, -3.4]
true_b = 4.2
X = nd.random_normal(scale=1, shape=(num_examples, num_inputs))
y = true_w[0] * X[:, 0] + true_w[1] * X[:, 1] + true_b
y += .01 * nd.random_normal(scale=1, shape=y.shape)
dataset = gluon.data.ArrayDataset(X, y)
net = gluon.nn.Sequential()
net.add(gluon.nn.Dense(1))
square_loss = gluon.loss.L2Loss()
In [2]: %matplotlib inline
import matplotlib as mpl
mpl.rcParams['figure.dpi']= 120
import matplotlib.pyplot as plt
3.39.1 Next
Adagrad from scratch
For whinges or inquiries, open an issue on GitHub.
# Adagrad.
def adagrad(params, sqrs, lr, batch_size):
eps_stable = 1e-7
for param, sqr in zip(params, sqrs):
g = param.grad / batch_size
sqr[:] += nd.square(g)
div = lr * g / nd.sqrt(sqr + eps_stable)
param[:] -= div
import mxnet as mx
from mxnet import autograd
from mxnet import gluon
import random
mx.random.seed(1)
random.seed(1)
# Generate data.
num_inputs = 2
num_examples = 1000
# Linear regression.
def net(X, w, b):
return nd.dot(X, w) + b
# Loss function.
def square_loss(yhat, y):
return (yhat - y.reshape(yhat.shape)) ** 2 / 2
In [2]: %matplotlib inline
import matplotlib as mpl
mpl.rcParams['figure.dpi']= 120
import matplotlib.pyplot as plt
import numpy as np
3.40.1 Next
Adagrad with Gluon
For whinges or inquiries, open an issue on GitHub.
mx.random.seed(1)
random.seed(1)
# Generate data.
num_inputs = 2
num_examples = 1000
true_w = [2, -3.4]
true_b = 4.2
X = nd.random_normal(scale=1, shape=(num_examples, num_inputs))
y = true_w[0] * X[:, 0] + true_w[1] * X[:, 1] + true_b
y += .01 * nd.random_normal(scale=1, shape=y.shape)
dataset = gluon.data.ArrayDataset(X, y)
net = gluon.nn.Sequential()
net.add(gluon.nn.Dense(1))
square_loss = gluon.loss.L2Loss()
In [2]: %matplotlib inline
import matplotlib as mpl
mpl.rcParams['figure.dpi']= 120
import matplotlib.pyplot as plt
plt.ylabel('loss')
plt.show()
In [3]: train(batch_size=10, lr=0.9, epochs=3, period=10)
Batch size 10, Learning rate 0.900000, Epoch 1, loss 5.3231e-05
Batch size 10, Learning rate 0.900000, Epoch 2, loss 4.9388e-05
Batch size 10, Learning rate 0.900000, Epoch 3, loss 4.9256e-05
w: [[ 1.99946415 -3.39996123]] b: 4.19967
3.41.1 Next
RMSProp from scratch
For whinges or inquiries, open an issue on GitHub.
# RMSProp.
def rmsprop(params, sqrs, lr, gamma, batch_size):
eps_stable = 1e-8
for param, sqr in zip(params, sqrs):
g = param.grad / batch_size
sqr[:] = gamma * sqr + (1. - gamma) * nd.square(g)
div = lr * g / nd.sqrt(sqr + eps_stable)
param[:] -= div
import mxnet as mx
mx.random.seed(1)
random.seed(1)
# Generate data.
num_inputs = 2
num_examples = 1000
true_w = [2, -3.4]
true_b = 4.2
X = nd.random_normal(scale=1, shape=(num_examples, num_inputs))
y = true_w[0] * X[:, 0] + true_w[1] * X[:, 1] + true_b
y += .01 * nd.random_normal(scale=1, shape=y.shape)
dataset = gluon.data.ArrayDataset(X, y)
# Linear regression.
def net(X, w, b):
return nd.dot(X, w) + b
# Loss function.
def square_loss(yhat, y):
return (yhat - y.reshape(yhat.shape)) ** 2 / 2
In [2]: %matplotlib inline
import matplotlib as mpl
mpl.rcParams['figure.dpi']= 120
import matplotlib.pyplot as plt
import numpy as np
3.42.1 Next
RMSProp with Gluon
For whinges or inquiries, open an issue on GitHub.
mx.random.seed(1)
random.seed(1)
# Generate data.
num_inputs = 2
num_examples = 1000
true_w = [2, -3.4]
true_b = 4.2
X = nd.random_normal(scale=1, shape=(num_examples, num_inputs))
y = true_w[0] * X[:, 0] + true_w[1] * X[:, 1] + true_b
y += .01 * nd.random_normal(scale=1, shape=y.shape)
dataset = gluon.data.ArrayDataset(X, y)
net = gluon.nn.Sequential()
net.add(gluon.nn.Dense(1))
square_loss = gluon.loss.L2Loss()
In [2]: %matplotlib inline
import matplotlib as mpl
mpl.rcParams['figure.dpi']= 120
import matplotlib.pyplot as plt
3.43.1 Next
AdaDalta from scratch
For whinges or inquiries, open an issue on GitHub.
# Adadalta.
def adadelta(params, sqrs, deltas, rho, batch_size):
eps_stable = 1e-5
for param, sqr, delta in zip(params, sqrs, deltas):
g = param.grad / batch_size
sqr[:] = rho * sqr + (1. - rho) * nd.square(g)
cur_delta = nd.sqrt(delta + eps_stable) / nd.sqrt(sqr + eps_stable) * g
delta[:] = rho * delta + (1. - rho) * cur_delta * cur_delta
# update weight
param[:] -= cur_delta
import mxnet as mx
from mxnet import autograd
mx.random.seed(1)
random.seed(1)
# Generate data.
num_inputs = 2
num_examples = 1000
true_w = [2, -3.4]
true_b = 4.2
X = nd.random_normal(scale=1, shape=(num_examples, num_inputs))
y = true_w[0] * X[:, 0] + true_w[1] * X[:, 1] + true_b
y += .01 * nd.random_normal(scale=1, shape=y.shape)
dataset = gluon.data.ArrayDataset(X, y)
# Linear regression.
# Loss function.
def square_loss(yhat, y):
return (yhat - y.reshape(yhat.shape)) ** 2 / 2
In [2]: %matplotlib inline
import matplotlib as mpl
mpl.rcParams['figure.dpi']= 120
import matplotlib.pyplot as plt
import numpy as np
3.44.1 Next
AdaDalta with Gluon
For whinges or inquiries, open an issue on GitHub.
mx.random.seed(1)
random.seed(1)
# Generate data.
num_inputs = 2
num_examples = 1000
true_w = [2, -3.4]
true_b = 4.2
X = nd.random_normal(scale=1, shape=(num_examples, num_inputs))
y = true_w[0] * X[:, 0] + true_w[1] * X[:, 1] + true_b
y += .01 * nd.random_normal(scale=1, shape=y.shape)
dataset = gluon.data.ArrayDataset(X, y)
net = gluon.nn.Sequential()
net.add(gluon.nn.Dense(1))
square_loss = gluon.loss.L2Loss()
In [2]: %matplotlib inline
import matplotlib as mpl
mpl.rcParams['figure.dpi']= 120
import matplotlib.pyplot as plt
3.45.1 Next
Adam from scratch
For whinges or inquiries, open an issue on GitHub.
import mxnet as mx
from mxnet import autograd
from mxnet import ndarray as nd
from mxnet import gluon
import random
mx.random.seed(1)
random.seed(1)
# Generate data.
num_inputs = 2
num_examples = 1000
true_w = [2, -3.4]
true_b = 4.2
X = nd.random_normal(scale=1, shape=(num_examples, num_inputs))
y = true_w[0] * X[:, 0] + true_w[1] * X[:, 1] + true_b
y += .01 * nd.random_normal(scale=1, shape=y.shape)
dataset = gluon.data.ArrayDataset(X, y)
# Linear regression.
def net(X, w, b):
return nd.dot(X, w) + b
# Loss function.
def square_loss(yhat, y):
return (yhat - y.reshape(yhat.shape)) ** 2 / 2
In [2]: %matplotlib inline
import matplotlib as mpl
mpl.rcParams['figure.dpi']= 120
import matplotlib.pyplot as plt
import numpy as np
t = 0
# Epoch starts from 1.
for epoch in range(1, epochs + 1):
for batch_i, data, label in data_iter(batch_size):
with autograd.record():
output = net(data, w, b)
loss = square_loss(output, label)
loss.backward()
# Increment t before invoking adam.
t += 1
adam([w, b], vs, sqrs, lr, batch_size, t)
if batch_i * batch_size % period == 0:
total_loss.append(np.mean(square_loss(net(X, w, b), y).asnumpy()))
print("Batch size %d, Learning rate %f, Epoch %d, loss %.4e" %
(batch_size, lr, epoch, total_loss[-1]))
print('w:', np.reshape(w.asnumpy(), (1, -1)),
'b:', b.asnumpy()[0], '\n')
x_axis = np.linspace(0, epochs, len(total_loss), endpoint=True)
plt.semilogy(x_axis, total_loss)
plt.xlabel('epoch')
plt.ylabel('loss')
plt.show()
In [3]: train(batch_size=10, lr=0.1, epochs=3, period=10)
Batch size 10, Learning rate 0.100000, Epoch 1, loss 6.7040e-04
Batch size 10, Learning rate 0.100000, Epoch 2, loss 5.0751e-05
Batch size 10, Learning rate 0.100000, Epoch 3, loss 5.0725e-05
w: [[ 1.9997046 -3.39914703]] b: 4.1986
3.46.1 Next
Adam with Gluon
For whinges or inquiries, open an issue on GitHub.
mx.random.seed(1)
random.seed(1)
# Generate data.
num_inputs = 2
num_examples = 1000
true_w = [2, -3.4]
true_b = 4.2
X = nd.random_normal(scale=1, shape=(num_examples, num_inputs))
y = true_w[0] * X[:, 0] + true_w[1] * X[:, 1] + true_b
y += .01 * nd.random_normal(scale=1, shape=y.shape)
dataset = gluon.data.ArrayDataset(X, y)
net = gluon.nn.Sequential()
net.add(gluon.nn.Dense(1))
square_loss = gluon.loss.L2Loss()
In [2]: %matplotlib inline
import matplotlib as mpl
mpl.rcParams['figure.dpi']= 120
import matplotlib.pyplot as plt
3.47.1 Next
Fast & flexible: combining imperative & symbolic nets with HybridBlocks
For whinges or inquiries, open an issue on GitHub.
programs.
Take for example a prototypical program written below in pseudo-Python. We grab some input arrays, we
compute upon them to produce some intermediate values, and finally we produce the result that we actually
care about.
result = our_function(W, X, Y, Z)
As you might expect when we compute E, we’re actually performing some numerical operation, like mul-
tiplication, and returning an array that we assign to the variable E. Same for F. And if we want to do a
similar computation many times by putting these lines in a function, each time our program will have to step
through these three lines of Python.
The advantage of this approach is it’s so natural that it might not even occur to some people that there is
another way. But the disadvantage is that it’s slow. That’s because we are constantly engaging the Python
execution environment (which is slow) even though our entire function performs the same three low-level
operations in the same sequence every time. It’s also holding on to all the intermediate values D and E until
the function returns even though we can see that they’re not needed. We might have made this program
more efficient by re-using memory from either E or F to store the result G.
There actually is a different way to do things. It’s called symbolic programming and most of the early deep
learning libraries, including Theano and Tensorflow, embraced this approach exclusively. You might have
also heard this approach referred to as declarative programming or define-then-run programming. These all
mean the exact same thing. The approach consists of three basic steps:
• Define a computation workflow, like a pass through a neural network, using placeholder data
• Compile the program into a front-end language, e.g. Python, independent format
• Invoke the compiled function, feeding it real data
Revisiting our previous pseudo-Python example, a symbolic version of the same program might look some-
thing like this:
# Create some placeholders to stand in for real data that might be supplied
˓→to the compiled function.
A = placeholder()
B = placeholder()
C = placeholder()
D = placeholder()
result = our_function(W, X, Y, Z)
Here, when we run the line E = symbolic_function1(A, B), no numerical computation actually
happens. Instead, the symbolic library notes the way that E is related to A and B and records this infor-
mation. We don’t do actual computation, we just make a roadmap for how to go from inputs to outputs.
Because we can draw all of the variables and operations (both inputs and intermediate values) a nodes, and
the relationships between nodes with edges, we call the resulting roadmap a computational graph. In the
symbolic approach, we first define the entire graph, and then compile it.
import numpy as np
a = np.ones(10)
b = np.ones(10) * 2
c = b * a
d = c + 1
...
Assume that each cell in the array occupies 8 bytes of memory. How much memory do we need to execute
this program in the Python console? As an imperative program we need to allocate memory at each line.
That leaves us allocating 4 arrays of size 10. So we’ll need 4 * 10 * 8 = 320 bytes. On the other hand,
if we built a computation graph, and knew in advance that we only needed d, we could reuse the memory
originally allocated for intermediate values. For example, by performing computations in-place, we might
recycle the bits allocated for b to store c. And we might recycle the bits allocated for c to store d. In the end
we could cut our memory requirement in half, requiring just 2 * 10 * 8 = 160 bytes.
Symbolic programs can also perform another kind of optimization, called operation folding. Returning
to our toy example, the multiplication and addition operations can be folded into one operation. If the
computation runs on a GPU processor, one GPU kernel will be executed, instead of two. In fact, this is one
way we hand-craft operations in optimized libraries, such as CXXNet and Caffe. Operation folding improves
computation efficiency. Note, you can’t perform operation folding in imperative programs, because the
intermediate values might be referenced in the future. Operation folding is possible in symbolic programs
because we get the entire computation graph in advance, before actually doing any calculation, giving us a
clear specification of which values will be needed and which will not.
HybridSequential
We already learned how to use Sequential to stack the layers. The regular Sequential can be built
from regular Blocks and so it too has to be a regular Block. However, when you want to build a network
using sequential and run it at crazy speeds, you can construct your network using HybridSequential
instead. The functionality is the same Sequential:
In [1]: import mxnet as mx
from mxnet.gluon import nn
from mxnet import nd
def get_net():
# construct a MLP
net = nn.HybridSequential()
with net.name_scope():
net.add(nn.Dense(256, activation="relu"))
net.add(nn.Dense(128, activation="relu"))
net.add(nn.Dense(2))
# initialize the parameters
net.collect_params().initialize()
return net
# forward
x = nd.random_normal(shape=(1, 512))
net = get_net()
print('=== net(x) ==={}'.format(net(x)))
=== net(x) ===
[[ 0.08827585 0.0050519 ]]
<NDArray 1x2 @cpu(0)>
To compile and optimize the HybridSequential, we can then call its hybridize method. Only
HybridBlocks, e.g. HybridSequential, can be compiled. But you can still call hybridize
on normal Block and its HybridBlock children will be compiled instead. We will talk more about
HybridBlocks later.
In [2]: net.hybridize()
print('=== net(x) ==={}'.format(net(x)))
=== net(x) ===
[[ 0.08827585 0.0050519 ]]
<NDArray 1x2 @cpu(0)>
Performance
To get a sense of the speedup from hybridizing, we can compare the performance before and after hybridiz-
ing by measuring in either case the time it takes to make 1000 forward passes through the network.
In [3]: from time import time
def bench(net, x):
mx.nd.waitall()
start = time()
for i in range(1000):
y = net(x)
mx.nd.waitall()
return time() - start
net = get_net()
print('Before hybridizing: %.4f sec'%(bench(net, x)))
net.hybridize()
print('After hybridizing: %.4f sec'%(bench(net, x)))
Before hybridizing: 0.4344 sec
After hybridizing: 0.2230 sec
As you can see, hybridizing gives a significant performance boost, almost 2x the speed.
y = net(x)
print('\n=== the symbolic program of net===')
print(y)
y_json = y.tojson()
print('\n=== the according json definition===')
print(y_json)
=== input data holder ===
<Symbol data>
"inputs": []
},
{
"op": "FullyConnected",
"name": "hybridsequential1_dense0_fwd",
"attrs": {
"flatten": "True",
"no_bias": "False",
"num_hidden": "256"
},
"inputs": [[0, 0, 0], [1, 0, 0], [2, 0, 0]]
},
{
"op": "Activation",
"name": "hybridsequential1_dense0_relu_fwd",
"attrs": {"act_type": "relu"},
"inputs": [[3, 0, 0]]
},
{
"op": "null",
"name": "hybridsequential1_dense1_weight",
"attrs": {
"__dtype__": "0",
"__lr_mult__": "1.0",
"__shape__": "(128, 0)",
"__storage_type__": "0",
"__wd_mult__": "1.0"
},
"inputs": []
},
{
"op": "null",
"name": "hybridsequential1_dense1_bias",
"attrs": {
"__dtype__": "0",
"__init__": "zeros",
"__lr_mult__": "1.0",
"__shape__": "(128,)",
"__storage_type__": "0",
"__wd_mult__": "1.0"
},
"inputs": []
},
{
"op": "FullyConnected",
"name": "hybridsequential1_dense1_fwd",
"attrs": {
"flatten": "True",
"no_bias": "False",
"num_hidden": "128"
},
"inputs": [[4, 0, 0], [5, 0, 0], [6, 0, 0]]
},
{
"op": "Activation",
"name": "hybridsequential1_dense1_relu_fwd",
"attrs": {"act_type": "relu"},
"inputs": [[7, 0, 0]]
},
{
"op": "null",
"name": "hybridsequential1_dense2_weight",
"attrs": {
"__dtype__": "0",
"__lr_mult__": "1.0",
"__shape__": "(2, 0)",
"__storage_type__": "0",
"__wd_mult__": "1.0"
},
"inputs": []
},
{
"op": "null",
"name": "hybridsequential1_dense2_bias",
"attrs": {
"__dtype__": "0",
"__init__": "zeros",
"__lr_mult__": "1.0",
"__shape__": "(2,)",
"__storage_type__": "0",
"__wd_mult__": "1.0"
},
"inputs": []
},
{
"op": "FullyConnected",
"name": "hybridsequential1_dense2_fwd",
"attrs": {
"flatten": "True",
"no_bias": "False",
"num_hidden": "2"
},
"inputs": [[8, 0, 0], [9, 0, 0], [10, 0, 0]]
}
],
"arg_nodes": [0, 1, 2, 5, 6, 9, 10],
"node_row_ptr": [
0,
1,
2,
3,
4,
5,
6,
7,
8,
9,
10,
11,
12
],
"heads": [[11, 0, 0]],
"attrs": {"mxnet_version": ["int", 10300]}
}
Now we can save both the program and parameters onto disk, so that it can be loaded later not only
in Python, but in all other supported languages, such as C++, R, and Scala, as well. For that we use
the .export(prefix, epoch) function, it will save the json symbolic representation in the format
prefix-symbol.json and the corresponding parameters as prefix-{epoch}.params.
In [5]: net.export('my_model', epoch=0)
HybridBlock
Now let’s dive deeper into how hybridize works. Remember that gluon networks are composed of
Blocks each of which subclass gluon.Block. With normal Blocks, we just need to define a forward
function that takes an input x and computes the result of the forward pass through the network. MXNet can
figure out the backward pass for us automatically with autograd.
To define a HybridBlock, we instead have a hybrid_forward function:
In [6]: from mxnet import gluon
class Net(gluon.HybridBlock):
def __init__(self, **kwargs):
super(Net, self).__init__(**kwargs)
with self.name_scope():
self.fc1 = nn.Dense(256)
self.fc2 = nn.Dense(128)
self.fc3 = nn.Dense(2)
The hybrid_forward function takes an additional input, F, which stands for a backend. This exploits
one awesome feature of MXNet. MXNet has both a symbolic API (mxnet.symbol) and an imperative
API (mxnet.ndarray). In this book, so far, we’ve only focused on the latter. Owing to fortuitous
historical reasons, the imperative and symbolic interfaces both support roughly the same API. They have
many of same functions (currently about 90% overlap) and when they do, they support the same arguments
in the same order. When we define hybrid_forward, we pass in F. When running in imperative mode,
hybrid_forward is called with F as mxnet.ndarray and x as some ndarray input. When we compile
with hybridize, F will be mxnet.symbol and x will be some placeholder or intermediate symbolic
value. Once we call hybridize, the net is compiled, so we’ll never need to call hybrid_forward again.
Let’s demonstrate how this all works by feeding some data through the network twice. We’ll do this for
both a regular network and a hybridized net. You’ll see that in the first case, hybrid_forward is actually
called twice.
In [7]: net = Net()
net.collect_params().initialize()
x = nd.random_normal(shape=(1, 512))
print('=== 1st forward ===')
y = net(x)
print('=== 2nd forward ===')
y = net(x)
=== 1st forward ===
type(x): NDArray, F: mxnet.ndarray
=== 2nd forward ===
type(x): NDArray, F: mxnet.ndarray
Conclusion
Through HybridSequental and HybridBlock, we can convert an imperative program into a symbolic
program by calling hybridize.
Next
Training MXNet models with multiple GPUs
For whinges or inquiries, open an issue on GitHub.
If an NVIDIA driver is installed on our machine, then we can check how many GPUs are available by
running the command nvidia-smi.
In [1]: !nvidia-smi
Fri Oct 13 00:11:36 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66 Driver Version: 375.66 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla M60 On | 0000:00:1B.0 Off | 0 |
| N/A 34C P8 13W / 150W | 0MiB / 7613MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla M60 On | 0000:00:1C.0 Off | 0 |
| N/A 29C P8 15W / 150W | 0MiB / 7613MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla M60 On | 0000:00:1D.0 Off | 0 |
| N/A 33C P8 13W / 150W | 0MiB / 7613MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla M60 On | 0000:00:1E.0 Off | 0 |
| N/A 31C P8 14W / 150W | 0MiB / 7613MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
We want to use all of the GPUs on together for the purpose of significantly speeding up training (in terms of
wall clock). Remember that CPUs and GPUs each can have multiple cores. CPUs on a laptop might have
2 or 4 cores, and on a server might have up to 16 or 32 cores. GPUs tend to have many more cores - an
NVIDIA K80 GPU has 4992 - but run at slower clock speeds. Exploiting the parallelism across the GPU
cores is how GPUs get their speed advantage in the first place.
As compared to the single CPU or single GPU setting where all the cores are typically used by default,
parallelism across devices is a little more complicated. That’s because most layers of a neural network can
only run on a single device. So, in order to parallelize across devices, we need to do a little extra. Therefore,
we need to do some additional work to partition a workload across multiple GPUs. This can be done in a
few ways.
start = time()
x = nd.random_uniform(shape=(2000,2000))
y = nd.dot(x, x)
print('=== workloads are pushed into the backend engine ===\n%f sec' % (time() - st
z = y.asnumpy()
print('=== workloads are finished ===\n%f sec' % (time() - start))
=== workloads are pushed into the backend engine ===
0.001160 sec
=== workloads are finished ===
0.174040 sec
Second, MXNet depends on a powerful scheduling algorithm that analyzes the dependencies of the pushed
workloads. This scheduler checks to see if two workloads are independent of each other. If they are, then
the engine may run them in parallel. If a workload depend on results that have not yet been computed, it
will be made to wait until its inputs are ready.
For example, if we call three operators:
a = nd.random_uniform(...)
b = nd.random_uniform(...)
c = a + b
Then the computation for a and b may run in parallel, while c cannot be computed until both a and b are
ready.
The following code shows that the engine effectively parallelizes the dot operations on two GPUs:
In [3]: from mxnet import gpu
def run(x):
"""push 10 matrix-matrix multiplications"""
return [nd.dot(x,x) for i in range(10)]
def wait(x):
"""explicitly wait until all results are ready"""
for y in x:
y.wait_to_read()
print('=== Run on GPU 0 and then copy results to CPU in sequential ===')
start = time()
y0 = run(x0)
wait(y0)
z0 = copy(y0, cpu())
wait(z0)
print(time() - start)
loss = gluon.loss.SoftmaxCrossEntropyLoss()
# plain SGD
def SGD(params, lr):
for p in params:
p[:] = p - lr * p.grad
Given a list of data that spans multiple GPUs, we then define a function to sum the data and broadcast the
results to each GPU.
In [7]: def allreduce(data):
# sum on data[0].context, and then broadcast
for i in range(1, len(data)):
data[0][:] += data[i].copyto(data[0].context)
for i in range(1, len(data)):
data[0].copyto(data[i])
Given a data batch, we define a function that splits this batch and copies each part into the corresponding
GPU.
In [8]: def split_and_load(data, ctx):
n, k = data.shape[0], len(ctx)
assert (n//k)*k == n, '# examples is not divided by # devices'
idx = list(range(0, n+1, n//k))
return [data[idx[i]:idx[i+1]].as_in_context(ctx[i]) for i in range(k)]
batch = nd.arange(16).reshape((4,4))
print('=== original data ==={}'.format(batch))
ctx = [gpu(0), gpu(1)]
splitted = split_and_load(batch, ctx)
print('\n=== splitted into {} ==={}\n{}'.format(ctx, splitted[0], splitted[1]))
=== original data ===
[[ 0. 1. 2. 3.]
[ 4. 5. 6. 7.]
[ 8. 9. 10. 11.]
[ 12. 13. 14. 15.]]
<NDArray 4x4 @cpu(0)>
[[ 8. 9. 10. 11.]
[ 12. 13. 14. 15.]]
<NDArray 2x4 @gpu(1)>
For inference, we simply let it run on the first GPU. We leave a data parallelism implementation as an
exercise.
In [10]: def valid_batch(batch, params, ctx):
data = batch.data[0].as_in_context(ctx[0])
pred = nd.argmax(lenet(data, params[0]), axis=1)
return nd.sum(pred == batch.label[0].as_in_context(ctx[0])).asscalar()
# data iterator
mnist = get_mnist()
train_data = NDArrayIter(mnist["train_data"], mnist["train_label"], batch_size
valid_data = NDArrayIter(mnist["test_data"], mnist["test_label"], batch_size)
print('Batch size is {}'.format(batch_size))
# validating
valid_data.reset()
correct, num = 0.0, 0.0
for batch in valid_data:
Running on multiple GPUs, we often want to increase the batch size so that each GPU still gets a large
enough batch size for good computation performance. (A larger batch size sometimes slows down the
convergence, we often want to increases the learning rate as well but in this case we’ll keep it same. Feel
free to try higher learning rates.)
In [13]: run(2, 128, 0.3)
Running on [gpu(0), gpu(1)]
Batch size is 128
Epoch 0, training time = 3.9 sec
validation accuracy = 0.8873
Epoch 1, training time = 3.4 sec
validation accuracy = 0.9477
Epoch 2, training time = 3.3 sec
validation accuracy = 0.9614
Epoch 3, training time = 3.1 sec
validation accuracy = 0.9798
Epoch 4, training time = 2.8 sec
validation accuracy = 0.9824
3.49.7 Conclusion
We have shown how to implement data parallelism on a deep neural network from scratch. Thanks to the
auto-parallelism, we only need to write serial codes while the engine is able to parallelize them on multiple
GPUs.
3.49.8 Next
Training with multiple GPUs with gluon
For whinges or inquiries, open an issue on GitHub.
loss = gluon.loss.SoftmaxCrossEntropyLoss()
Given a batch of input data, we can split it into parts (equal to the number of contexts) by calling gluon.
utils.split_and_load(batch, ctx). The split_and_load function doesn’t just split the
data, it also loads each part onto the appropriate device context.
So now when we call the forward pass on two separate parts, each one is computed on the appropriate
corresponding device and using the version of the parameters stored there.
In [3]: from mxnet.test_utils import get_mnist
mnist = get_mnist()
batch = mnist['train_data'][0:GPU_COUNT*2, :]
data = gluon.utils.split_and_load(batch, ctx)
print(net(data[0]))
print(net(data[1]))
1.19591374e-02 -6.60043515e-05]
[ -1.17358668e-02 -2.16879714e-02 1.71219767e-03 2.49827504e-02
1.16810966e-02 -9.52543691e-03 -1.03610428e-02 5.08510228e-03
7.06662657e-03 -9.25292261e-03]]
<NDArray 2x10 @gpu(1)>
At any time, we can access the version of the parameters stored on each device. Recall from the first Chapter
that our weights may not actually be initialized when we call initialize because the parameter shapes
may not yet be known. In these cases, initialization is deferred pending shape inference.
In [4]: weight = net.collect_params()['cnn_conv0_weight']
for c in ctx:
print('=== channel 0 of the first conv on {} ==={}'.format(
c, weight.data(ctx=c)[0]))
=== channel 0 of the first conv on gpu(0) ===
[[[ 0.04118239 0.05352169 -0.04762455]
[ 0.06035256 -0.01528978 0.04946674]
[ 0.06110793 -0.00081179 0.02191102]]]
<NDArray 1x3x3 @gpu(0)>
=== channel 0 of the first conv on gpu(1) ===
[[[ 0.04118239 0.05352169 -0.04762455]
[ 0.06035256 -0.01528978 0.04946674]
[ 0.06110793 -0.00081179 0.02191102]]]
<NDArray 1x3x3 @gpu(1)>
Similarly, we can access the gradients on each of the GPUs. Because each GPU gets a different part of the
batch (a different subset of examples), the gradients on each GPU vary.
In [5]: def forward_backward(net, data, label):
with autograd.record():
losses = [loss(net(X), Y) for X, Y in zip(data, label)]
for l in losses:
l.backward()
# data iterator
mnist = get_mnist()
train_data = NDArrayIter(mnist["train_data"], mnist["train_label"], batch_size)
valid_data = NDArrayIter(mnist["test_data"], mnist["test_label"], batch_size)
print('Batch size is {}'.format(batch_size))
net.collect_params().initialize(force_reinit=True, ctx=ctx)
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': lr})
for epoch in range(5):
# train
start = time()
train_data.reset()
for batch in train_data:
train_batch(batch, ctx, net, trainer)
nd.waitall() # wait until all computations are finished to benchmark the t
print('Epoch %d, training time = %.1f sec'%(epoch, time()-start))
# validating
valid_data.reset()
correct, num = 0.0, 0.0
for batch in valid_data:
correct += valid_batch(batch, ctx, net)
num += batch.data[0].shape[0]
print(' validation accuracy = %.4f'%(correct/num))
3.50.3 Conclusion
Both parameters and trainers in gluon support multi-devices. Moving from one device to multi-devices is
straightforward.
3.50.4 Next
Distributed training with multiple machines
For whinges or inquiries, open an issue on GitHub.
store.init('weight', x)
print('=== init "weight" ==={}'.format(x))
=== init "weight" ===
[[ 0.54881352 0.59284461 0.71518934]
[ 0.84426576 0.60276335 0.85794562]]
<NDArray 2x3 @cpu(0)>
We can also push new data value into the store. It will first sum the data on the same key and then overwrite
the current value.
In [3]: z = [nd.ones(shape, ctx=ctx[i])+i for i in range(len(ctx))]
store.push('weight', z)
print('=== push to "weight" ===\n{}'.format(z))
store.pull('weight', out=y)
print('=== pull "weight" ===\n{}'.format(y))
=== push to "weight" ===
[
[[ 1. 1. 1.]
[ 1. 1. 1.]]
<NDArray 2x3 @gpu(0)>,
[[ 2. 2. 2.]
[ 2. 2. 2.]]
<NDArray 2x3 @gpu(1)>]
=== pull "weight" ===
[
[[ 3. 3. 3.]
[ 3. 3. 3.]]
<NDArray 2x3 @gpu(0)>,
[[ 3. 3. 3.]
[ 3. 3. 3.]]
<NDArray 2x3 @gpu(1)>]
With push and pull we can replace the allreduce function defined in multiple-gpus-scratch by
store = kv.create('dist')
Now if we run the code from the previous section on two machines at the same time, then the store will
aggregate the two ndarrays pushed from each machine, and after that, the pulled results will be:
[[ 6. 6. 6.]
[ 6. 6. 6.]]
In the distributed setting, MXNet launches three kinds of processes (each time, running python myprog.
py will create a process). One is a worker, which runs the user program, such as the code in the previous
section. The other two are the server, which maintains the data pushed into the store, and the scheduler,
which monitors the aliveness of each node.
It’s up to users which machines to run these processes on. But to simplify the process placement and
launching, MXNet provides a tool located at tools/launch.py.
Assume there are two machines, A and B. They are ssh-able, and their IPs are saved in a file named
hostfile. Then we can start one worker in each machine through:
It will also start a server in each machine, and the scheduler on the same machine we are currently on.
chapter07_distributed-learning/img/dist_kv.png
store = kv.create('dist')
trainer = gluon.Trainer(..., kvstore=store)
To split the data, however, we cannot directly copy the previous approach. One commonly used solution is
to split the whole dataset into k parts at the beginning, then let the i-th worker only read the i-th part of the
data.
We can obtain the total number of workers by reading the attribute num_workers and the rank of the
current worker from the attribute rank.
In [4]: print('total number of workers: %d'%(store.num_workers))
print('my rank among workers: %d'%(store.rank))
total number of workers: 1
my rank among workers: 0
With this information, we can manually access the proper chunk of the input data. In addition, several data
iterators provided by MXNet already support reading only part of the data. For example,
FOUR
PART 2: APPLICATIONS
work/mxnet/tests/nightly/straight_dope/tmp_notebook/
So object defers from image classification in a few ways. First, while a classifier outputs a single category
per image, an object detector must be able to recognize multiple objects in a single image. Technically, this
task is called multiple object detection, but most research in the area addresses the multiple object setting, so
we’ll abuse terminology just a little. Second, while classifiers need only to output probabilities over classes,
object detectors must output both probabilities of class membership and also the coordinates that identify
the location of the objects.
On this chapter we’ll demonstrate the single shot multiple box object detector (SSD), a popular model for
object detection that was first described in this paper, and is straightforward to implement in MXNet Gluon.
/home/doosik/gluon/mxnet-the-straight-dope/build/_bu
We first use a body network to extract the image features, which are used as the input to the first scale
(scale 0). The class labels and the corresponding anchor boxes are predicted by class_predictor
275
Deep Learning - The Straight Dope, Release 0.1
and box_predictor, respectively. We then downsample the representations to the next scale (scale 1).
Again, at this new resolution, we predict both classes and anchor boxes. This downsampling and predicting
routine can be repeated in multiple times to obtain results on multiple resolution scales. Let’s walk through
the components one by one in a bit more detail.
n = 40
# shape: batch x channel x height x weight
x = nd.random_uniform(shape=(1, 3, n, n))
We can visualize all anchor boxes generated for one pixel on a certain size feature map.
In [2]: import matplotlib.pyplot as plt
def box_to_rect(box, color, linewidth=3):
"""convert an anchor box to a matplotlib rectangle"""
box = box.asnumpy()
return plt.Rectangle(
(box[0], box[1]), (box[2]-box[0]), (box[3]-box[1]),
fill=False, edgecolor=color, linewidth=linewidth)
colors = ['blue', 'green', 'red', 'black', 'magenta']
plt.imshow(nd.ones((n, n, 3)).asnumpy())
anchors = boxes[20, 20, :, :]
for i in range(anchors.shape[0]):
plt.gca().add_patch(box_to_rect(anchors[i,:]*n, colors[i]))
plt.show()
Predict classes
For each anchor box, we want to predict the associated class label. We make this prediction by using a
convolution layer. We choose a kernel of size 3 × 3 with padding size (1, 1) so that the output will have
the same width and height as the input. The confidence scores for the anchor box class labels are stored in
channels. In particular, for the i-th anchor box:
• channel i*(num_class+1) store the scores for this box contains only background
• channel i*(num_class+1)+1+j store the scores for this box contains an object from the j-th class
In [3]: from mxnet.gluon import nn
def class_predictor(num_anchors, num_classes):
"""return a layer to predict classes"""
return nn.Conv2D(num_anchors * (num_classes + 1), 3, padding=1)
• 𝑡𝑦 = (𝑌𝑦 − 𝑏𝑦 )/𝑏ℎ𝑒𝑖𝑔ℎ𝑡
• 𝑡𝑤𝑖𝑑𝑡ℎ = (𝑌𝑤𝑖𝑑𝑡ℎ − 𝑏𝑤𝑖𝑑𝑡ℎ )/𝑏𝑤𝑖𝑑𝑡ℎ
• 𝑡ℎ𝑒𝑖𝑔ℎ𝑡 = (𝑌ℎ𝑒𝑖𝑔ℎ𝑡 − 𝑏ℎ𝑒𝑖𝑔ℎ𝑡 )/𝑏ℎ𝑒𝑖𝑔ℎ𝑡
Normalizing the deltas with box width/height tends to result in better convergence behavior.
Similar to classes, we use a convolution layer here. The only difference is that the output channel size is
now num_anchors * 4, with the predicted delta positions for the i-th box stored from channel i*4 to
i*4+3.
In [4]: def box_predictor(num_anchors):
"""return a layer to predict delta locations"""
return nn.Conv2D(num_anchors * 4, 3, padding=1)
box_pred = box_predictor(10)
box_pred.initialize()
x = nd.zeros((2, 3, 20, 20))
print('Box prediction', box_pred(x).shape)
Box prediction (2, 40, 20, 20)
Down-sample features
Each time, we downsample the features by half. This can be achieved by a simple pooling layer with pooling
size 2. We may also stack two convolution, batch normalization and ReLU blocks before the pooling layer
to make the network deeper.
In [5]: def down_sample(num_filters):
"""stack two Conv-BatchNorm-Relu blocks and then a pooling layer
to halve the feature size"""
out = nn.HybridSequential()
for _ in range(2):
out.add(nn.Conv2D(num_filters, 3, strides=1, padding=1))
out.add(nn.BatchNorm(in_channels=num_filters))
out.add(nn.Activation('relu'))
out.add(nn.MaxPool2D(2))
return out
blk = down_sample(10)
blk.initialize()
x = nd.zeros((2, 3, 20, 20))
print('Before', x.shape, 'after', blk(x).shape)
Before (2, 3, 20, 20) after (2, 10, 10, 10)
def concat_predictions(preds):
return nd.concat(*preds, dim=1)
flat_y1 = flatten_prediction(y1)
print('Flatten class prediction 1', flat_y1.shape)
flat_y2 = flatten_prediction(y2)
print('Flatten class prediction 2', flat_y2.shape)
print('Concat class predictions', concat_predictions([flat_y1, flat_y2]).shape)
Flatten class prediction 1 (2, 22000)
Flatten class prediction 2 (2, 3300)
Concat class predictions (2, 25300)
Body network
The body network is used to extract features from the raw pixel inputs. Common choices follow the ar-
chitectures of the state-of-the-art convolution neural networks for image classification. For demonstration
purpose, we just stack several down sampling blocks to form the body network.
In [8]: from mxnet import gluon
def body():
"""return the body network"""
out = nn.HybridSequential()
for nfilters in [16, 32, 64]:
out.add(down_sample(nfilters))
return out
bnet = body()
bnet.initialize()
x = nd.zeros((2, 3, 256, 256))
print('Body network', [y.shape for y in bnet(x)])
Body network [(64, 32, 32), (64, 32, 32)]
downsamples.add(down_sample(128))
downsamples.add(down_sample(128))
downsamples.add(down_sample(128))
print(toy_ssd_model(5, 2))
(HybridSequential(
(0): HybridSequential(
(0): Conv2D(None -> 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): BatchNorm(momentum=0.9, fix_gamma=False, axis=1, use_global_stats=False, eps=1e-05
(2): Activation(relu)
(3): Conv2D(None -> 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(4): BatchNorm(momentum=0.9, fix_gamma=False, axis=1, use_global_stats=False, eps=1e-05
(5): Activation(relu)
(6): MaxPool2D(size=(2, 2), stride=(2, 2), padding=(0, 0), ceil_mode=False)
)
(1): HybridSequential(
(0): Conv2D(None -> 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): BatchNorm(momentum=0.9, fix_gamma=False, axis=1, use_global_stats=False, eps=1e-05
(2): Activation(relu)
(3): Conv2D(None -> 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(4): BatchNorm(momentum=0.9, fix_gamma=False, axis=1, use_global_stats=False, eps=1e-05
(5): Activation(relu)
(6): MaxPool2D(size=(2, 2), stride=(2, 2), padding=(0, 0), ceil_mode=False)
)
(2): HybridSequential(
(0): Conv2D(None -> 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): BatchNorm(momentum=0.9, fix_gamma=False, axis=1, use_global_stats=False, eps=1e-05
(2): Activation(relu)
(3): Conv2D(None -> 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(4): BatchNorm(momentum=0.9, fix_gamma=False, axis=1, use_global_stats=False, eps=1e-05
(5): Activation(relu)
(6): MaxPool2D(size=(2, 2), stride=(2, 2), padding=(0, 0), ceil_mode=False)
)
), Sequential(
(0): HybridSequential(
(0): Conv2D(None -> 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): BatchNorm(momentum=0.9, fix_gamma=False, axis=1, use_global_stats=False, eps=1e-05
(2): Activation(relu)
(3): Conv2D(None -> 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(4): BatchNorm(momentum=0.9, fix_gamma=False, axis=1, use_global_stats=False, eps=1e-05
(5): Activation(relu)
(6): MaxPool2D(size=(2, 2), stride=(2, 2), padding=(0, 0), ceil_mode=False)
)
(1): HybridSequential(
(0): Conv2D(None -> 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): BatchNorm(momentum=0.9, fix_gamma=False, axis=1, use_global_stats=False, eps=1e-05
(2): Activation(relu)
(3): Conv2D(None -> 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(4): BatchNorm(momentum=0.9, fix_gamma=False, axis=1, use_global_stats=False, eps=1e-05
(5): Activation(relu)
(6): MaxPool2D(size=(2, 2), stride=(2, 2), padding=(0, 0), ceil_mode=False)
)
(2): HybridSequential(
(0): Conv2D(None -> 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): BatchNorm(momentum=0.9, fix_gamma=False, axis=1, use_global_stats=False, eps=1e-05
(2): Activation(relu)
(3): Conv2D(None -> 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(4): BatchNorm(momentum=0.9, fix_gamma=False, axis=1, use_global_stats=False, eps=1e-05
(5): Activation(relu)
(6): MaxPool2D(size=(2, 2), stride=(2, 2), padding=(0, 0), ceil_mode=False)
)
), Sequential(
(0): Conv2D(None -> 15, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): Conv2D(None -> 15, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(2): Conv2D(None -> 15, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(3): Conv2D(None -> 15, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(4): Conv2D(None -> 15, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
), Sequential(
(0): Conv2D(None -> 20, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): Conv2D(None -> 20, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(2): Conv2D(None -> 20, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(3): Conv2D(None -> 20, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(4): Conv2D(None -> 20, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
))
Forward
Given an input and the model, we can run the forward pass.
In [10]: def toy_ssd_forward(x, body, downsamples, class_preds, box_preds, sizes, ratios):
# extract feature with the body network
x = body(x)
for i in range(5):
default_anchors.append(MultiBoxPrior(x, sizes=sizes[i], ratios=ratios[i]))
predicted_boxes.append(flatten_prediction(box_preds[i](x)))
predicted_classes.append(flatten_prediction(class_preds[i](x)))
if i < 3:
x = downsamples[i](x)
elif i == 3:
# simply use the pooling layer
x = nd.Pooling(x, global_pool=True, pool_type='max', kernel=(4, 4))
with self.name_scope():
self.body, self.downsamples, self.class_preds, self.box_preds = toy_ss
Outputs of ToySSD
In [12]: # instantiate a ToySSD network with 10 classes
net = ToySSD(2)
net.initialize()
x = nd.zeros((1, 3, 256, 256))
default_anchors, class_predictions, box_predictions = net(x)
print('Outputs:', 'anchors', default_anchors.shape, 'class prediction', class_pred
Outputs: anchors (1, 5444, 4) class prediction (1, 5444, 3) box prediction (1, 21776)
4.1.2 Dataset
For demonstration purposes, we’ll train our model to detect Pikachu in the wild. We generated a synthetic
toy dataset by rendering images from open-sourced 3D Pikachu models. The dataset consists of 1000
pikachus with random pose/scale/position in random background images. The exact locations are recorded
as ground-truth for training and validation.
Download dataset
In [13]: from mxnet.test_utils import download
import os.path as osp
def verified(file_path, sha1hash):
import hashlib
sha1 = hashlib.sha1()
with open(file_path, 'rb') as f:
while True:
data = f.read(1048576)
if not data:
break
sha1.update(data)
matched = sha1.hexdigest() == sha1hash
if not matched:
print('Found hash mismatch in file {}, possibly due to incomplete download
return matched
url_format = 'https://apache-mxnet.s3-accelerate.amazonaws.com/gluon/dataset/pikac
hashes = {'train.rec': 'e6bcb6ffba1ac04ff8a9b1115e650af56ee969c8',
'train.idx': 'dcf7318b2602c06428b9988470c731621716c393',
'val.rec': 'd6c33f799b4d058e82f2cb5bd9a976f69d72d520'}
for k, v in hashes.items():
fname = 'pikachu_' + k
target = osp.join('data', fname)
url = url_format.format(k)
if not osp.exists(target) or not verified(target, v):
print('Downloading', target, url)
download(url, fname=fname, dirname='data', overwrite=True)
Load dataset
In [14]: import mxnet.image as image
data_shape = 256
batch_size = 32
def get_iterators(data_shape, batch_size):
class_names = ['pikachu']
num_class = len(class_names)
train_iter = image.ImageDetIter(
batch_size=batch_size,
data_shape=(3, data_shape, data_shape),
path_imgrec='./data/pikachu_train.rec',
path_imgidx='./data/pikachu_train.idx',
shuffle=True,
mean=True,
rand_crop=1,
min_object_covered=0.95,
max_attempts=200)
val_iter = image.ImageDetIter(
batch_size=batch_size,
data_shape=(3, data_shape, data_shape),
path_imgrec='./data/pikachu_val.rec',
shuffle=False,
mean=True)
Illustration
Let’s display one image loaded by ImageDetIter.
In [15]: import numpy as np
4.1.3 Train
Losses
Network predictions will be penalized for incorrect class predictions and wrong box deltas.
In [16]: from mxnet.contrib.ndarray import MultiBoxTarget
def training_targets(default_anchors, class_predicts, labels):
class_predicts = nd.transpose(class_predicts, axes=(0, 2, 1))
z = MultiBoxTarget(*[default_anchors, labels, class_predicts])
box_target = z[0] # box offset target for (x, y, width, height)
box_mask = z[1] # mask is used to ignore box offsets we don't want to penaliz
cls_target = z[2] # cls_target is an array of labels for all anchors boxes
return box_target, box_mask, cls_target
Pre-defined losses are provided in gluon.loss package, however, we can define losses manually.
First, we need a Focal Loss for class predictions.
In [17]: class FocalLoss(gluon.loss.Loss):
def __init__(self, axis=-1, alpha=0.25, gamma=2, batch_axis=0, **kwargs):
super(FocalLoss, self).__init__(None, batch_axis, **kwargs)
self._axis = axis
self._alpha = alpha
self._gamma = gamma
# cls_loss = gluon.loss.SoftmaxCrossEntropyLoss()
cls_loss = FocalLoss()
print(cls_loss)
FocalLoss(batch_axis=0, w=None)
box_loss = SmoothL1Loss()
print(box_loss)
SmoothL1Loss(batch_axis=0, w=None)
Evaluation metrics
Here, we define two metrics that we’ll use to evaluate our performance whien training. You’re already
familiar with accuracy unless you’ve been naughty and skipped straight to object detection. We use the
accuracy metric to assess the quality of the class predictions. Mean absolute error (MAE) is just the L1
distance, introduced in our linear algebra chapter. We use this to determine how close the coordinates of
the predicted bounding boxes are to the ground-truth coordinates. Because we are jointly solving both a
classification problem and a regression problem, we need an appropriate metric for each task.
In [19]: cls_metric = mx.metric.Accuracy()
box_metric = mx.metric.MAE() # measure absolute difference between prediction and
In [20]: ### Set context for training
ctx = mx.gpu() # it may takes too long to train using CPU
try:
_ = nd.zeros(1, ctx=ctx)
# pad label for cuda implementation
train_data.reshape(label_shape=(3, 5))
train_data = test_data.sync_label_shape(train_data)
except mx.base.MXNetError as err:
print('No GPU enabled, fall back to CPU, sit back and be patient...')
ctx = mx.cpu()
Initialize parameters
In [21]: net = ToySSD(num_class)
net.initialize(mx.init.Xavier(magnitude=2), ctx=ctx)
Set up trainer
In [22]: net.collect_params().reset_ctx(ctx)
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.1, 'wd':
Start training
Optionally we load pretrained model for demonstration purpose. One can set from_scratch = True
to training from scratch, which may take more than 30 mins to finish using a single capable GPU.
In [23]: epochs = 1 # set larger to get better performance
log_interval = 20
from_scratch = False # set to True to train from scratch
if from_scratch:
start_epoch = 0
else:
start_epoch = 148
pretrained = 'ssd_pretrained.params'
sha1 = 'fbb7d872d76355fff1790d864c2238decdb452bc'
url = 'https://apache-mxnet.s3-accelerate.amazonaws.com/gluon/models/ssd_pikac
if not osp.exists(pretrained) or not verified(pretrained, sha1):
print('Downloading', pretrained, url)
download(url, fname=pretrained, overwrite=True)
net.load_parameters(pretrained, ctx)
In [24]: import time
from mxnet import autograd as ag
for epoch in range(start_epoch, epochs):
# reset iterator and tick
train_data.reset()
cls_metric.reset()
box_metric.reset()
tic = time.time()
4.1.4 Test
Testing is similar to training, except that we don’t need to compute gradients and training targets. Instead,
we take the predictions from network output, and combine them to get the real detection output.
def preprocess(image):
"""Takes an image and apply preprocess"""
# resize to data_shape
image = cv2.resize(image, (data_shape, data_shape))
# swap BGR to RGB
image = image[:, :, (2, 1, 0)]
# convert to float before subtracting mean
image = image.astype(np.float32)
# subtract mean
image -= np.array([123, 117, 104])
# organize as [batch-channel-height-width]
image = np.transpose(image, (2, 0, 1))
image = image[np.newaxis, :]
# convert to ndarray
image = nd.array(image)
return image
image = cv2.imread('../img/pikachu.jpg')
x = preprocess(image)
print('x', x.shape)
x (1, 3, 256, 256)
Network inference
In a single line of code!
In [26]: # if pre-trained model is provided, we can load it
# net.load_parameters('ssd_%d.params' % epochs, ctx)
anchors, cls_preds, box_preds = net(x.as_in_context(ctx))
print('anchors', anchors)
print('class predictions', cls_preds)
print('box delta predictions', box_preds)
anchors
[[[-0.084375 -0.084375 0.115625 0.115625 ]
[-0.12037501 -0.12037501 0.151625 0.151625 ]
[-0.12579636 -0.05508568 0.15704636 0.08633568]
...
[ 0.01949999 0.01949999 0.9805 0.9805 ]
[-0.12225395 0.18887302 1.1222539 0.81112695]
[ 0.18887302 -0.12225395 0.81112695 1.1222539 ]]]
<NDArray 1x5444x4 @gpu(0)>
class predictions
[[[ 0.3136385 -1.6613694 ]
[ 1.1190383 -1.7688792 ]
[ 1.165454 -0.97607 ]
...
[-0.26088136 -1.2618818 ]
[ 0.4366543 -0.88175875]
[ 0.24387847 -0.8944956 ]]]
<NDArray 1x5444x2 @gpu(0)>
box delta predictions
[[-0.16194503 -0.15946479 -0.68138134 ... -0.23063782 0.09888595
-0.25365576]]
<NDArray 1x21776 @gpu(0)>
Each row in the output corresponds to a detection box, as in format [class_id, confidence, xmin, ymin, xmax,
ymax].
Most of the detection results are -1, indicating that they either have very small confidence scores, or been
suppressed through non-maximum-suppression.
Display results
In [28]: def display(img, out, thresh=0.5):
import random
import matplotlib as mpl
mpl.rcParams['figure.figsize'] = (10,10)
pens = dict()
plt.clf()
plt.imshow(img)
for det in out:
cid = int(det[0])
if cid < 0:
continue
score = det[1]
if score < thresh:
continue
if cid not in pens:
pens[cid] = (random.random(), random.random(), random.random())
scales = [img.shape[1], img.shape[0]] * 2
xmin, ymin, xmax, ymax = [int(p * s) for p, s in zip(det[2:6].tolist(), sc
rect = plt.Rectangle((xmin, ymin), xmax - xmin, ymax - ymin, fill=False,
edgecolor=pens[cid], linewidth=3)
plt.gca().add_patch(rect)
text = class_names[cid]
plt.gca().text(xmin, ymin-2, '{:s} {:.3f}'.format(text, score),
bbox=dict(facecolor=pens[cid], alpha=0.5),
fontsize=12, color='white')
plt.show()
4.1.5 Conclusion
Detection is harder than classification, since we want not only class probabilities, but also localizations of
different objects including potential small objects. Using sliding window together with a good classifier
might be an option, however, we have shown that with a properly designed convolutional neural network,
we can do single shot detection which is blazing fast and accurate!
For whinges or inquiries, open an issue on GitHub.
animals. And we talked about the ImageNet dataset, the default academic benchmark, which contains 1M
million images, 1000 each from 1000 separate classes.
The ImageNet dataset categorically changed what was possible in computer vision. It turns out some things
are possible (these days, even easy) on gigantic datasets, that simply aren’t with smaller datasets. In fact,
we don’t know of any technique that can comparably powerful model on a similar photograph dataset but
containing only, say, 10k images.
And that’s a problem. Because however impressive the results of CNNs on ImageNet may be, most people
aren’t interested in ImageNet itself. They’re interested in their own problems. Recognize people based
on pictures of their faces. Distinguish between photographs of 10 different types of coral on the ocean
floor. Usually when individuals (and not Amazon, Google, or inter-institutional big science initiatives) are
interested in solving a computer vision problem, they come to the table with modestly sized datasets. A few
hundred examples may be common and a few thousand examples may be as much as you can reasonably
ask for.
So one natural question emerges. Can we somehow use the powerful models trained on millions of examples
for one dataset, and apply them to improve performance on a new problem with a much smaller dataset?
This kind of problem (learning on source dataset, bringing knowledge to target dataset), is appropriately
called transfer learning. Fortunately, we have some effective tools for solving this problem.
For deep neural networks, the most popular approach is called finetuning and the idea is both simple and
effective:
• Train a neural network on the source task 𝑆.
• Decapitate it, replacing it’s output layer appropriate to target task 𝑇 .
• Initialize the weights on the new output layer randomly, keeping all other (pretrained) weights the
same.
• Begin training on the new dataset.
This might be clearer if we visualize the algorithm:
In this section, we’ll demonstrate fine-tuning, using the popular and compact SqueezeNet architecture. Since
we don’t want to saddle you with the burden of downloading ImageNet, or of training on ImageNet from
scratch, we’ll pull the weights of the pretrained Squeeze net from the internet. Specifically, we’ll be fine-
tuning a squeezenet-1.1 that was pre-trained on imagenet-12. Finally, we’ll fine-tune it to recognize hotdogs.
We’ll start with the obligatory ritual of importing a bunch of stuff that you’ll need later.
In [ ]: %pylab inline
pylab.rcParams['figure.figsize'] = (10, 6)
4.2.1 Settings
We’ll set a few settings up here that you can configure later to manipulate the behavior of the algorithm.
These are mostly familiar. Hybrid mode, uses the just in time compiler described in our chapter on high
performance training to make the network much faster to train. Since we’re not working with any crazy
dynamic graphs that can’t be compiled, there’s no reason not to hybridize. The batch size, number of
training epochs, weight decay, and learing rate should all be familiar by now. The positive class weight,
says how much more we should upweight the importance of positive instances (photos of hot dogs) in the
objective function. We use this to combat the extreme class imbalance (not surprisingly, most pictures do
not depict hot dogs).
In [ ]: # Demo mode uses the validation dataset for training, which is smaller and faster t
demo = True
log_interval = 100
# training hyperparameters
batch_size = 256
if demo:
epochs = 5
learning_rate = 0.02
wd = 0.002
else:
epochs = 40
learning_rate = 0.05
wd = 0.002
# the class weight for hotdog class to help the imbalance problem.
positive_class_weight = 5
In [ ]: from __future__ import print_function
import logging
logging.basicConfig(level=logging.INFO)
import os
import time
from collections import OrderedDict
import skimage.io as io
import mxnet as mx
from mxnet.test_utils import download
mx.random.seed(127)
4.2.2 Dataset
Formally, hot dog recognition is a binary classification problem. We’ll use 1 to represent the hotdog class,
and 0 for the not hotdog class. Our hot dog dataset (the target dataset which we’ll fine-tune the model
to) contains 18,141 sample images, 2091 of which are hotdogs. Because the dataset is imbalanced (e.g.
hotdog class is only 1% in mscoco dataset), sampling interesting negative samples can help to improve the
performance of our algorithm. Thus, in the negative class in the our dataset, two thirds are images from food
categories (e.g. pizza) other than hotdogs, and 30% are images from all other categories.
Files
We prepare the dataset in the format of MXRecord using im2rec tool. As of the current draft, rec files are
not yet explained in the book, but if you’re reading after November or December 2017 and you still see this
note, open an issue on GitHub and let us know to stop slacking off.
• not_hotdog_train.rec 641M (1882 positive, 10000 interesting negative, and 5000 random negative)
• not_hotdog_validation.rec 49M (209 positive, 700 interesting negative, and 350 random negative)
In [ ]: dataset_files = {'train': ('not_hotdog_train-e6ef27b4.rec', '0aad7e1f16f5fb109b719a
'validation': ('not_hotdog_validation-c0201740.rec', '723ae5f8a433
To demo the model here, we’re justgoing to use the smaller validation set. But if you’re interested in training
on the full set, set ‘demo’ to False in the settings at the beginning. Now we’re ready to download and verify
the dataset.
In [ ]: if demo:
training_dataset, training_data_hash = dataset_files['validation']
else:
training_dataset, training_data_hash = dataset_files['train']
url_format = 'https://apache-mxnet.s3-accelerate.amazonaws.com/gluon/dataset/{}'
if not os.path.exists(training_dataset) or not verified(training_dataset, training_
logging.info('Downloading training dataset.')
download(url_format.format(training_dataset),
overwrite=True)
if not os.path.exists(validation_dataset) or not verified(validation_dataset, valid
logging.info('Downloading validation dataset.')
download(url_format.format(validation_dataset),
overwrite=True)
Iterators
The record files can be read using mx.io.ImageRecordIter
In [ ]: # load dataset
train_iter = mx.io.ImageRecordIter(path_imgrec=training_dataset,
min_img_size=256,
data_shape=(3, 224, 224),
rand_crop=True,
shuffle=True,
batch_size=batch_size,
max_random_scale=1.5,
min_random_scale=0.75,
rand_mirror=True)
val_iter = mx.io.ImageRecordIter(path_imgrec=validation_dataset,
min_img_size=256,
data_shape=(3, 224, 224),
batch_size=batch_size)
4.2.3 Model
The model we are finetuning is SqueezeNet. Gluon module offers squeezenet v1.0 and v1.1 that are pre-
trained on ImageNet. This is just a convolutional neural network, with an architecture chosen to have a small
number of parameters and to require a minimal amount of computation. It’s especially popular for folks that
need to run CNNs on low-powered devices like cell phones and other internet-of-things devices.
DeepDog net
We can now use the feature extractor part from the pretrained squeezenet to build our own network. The
model zoo, even handles the decaptiation for us. All we have to do is specify the number out of output
classes in our new task, which we do via the keyword argument classes=2.
In [ ]: deep_dog_net = models.squeezenet1_1(prefix='deep_dog_', classes=2)
deep_dog_net.collect_params().initialize(ctx=contexts)
deep_dog_net.features = net.features
print(deep_dog_net)
The network can already be used for prediction. However, since it hasn’t been finetuned yet, the network
performance could be bad.
In [ ]: from skimage.color import rgba2rgb
# our classifier is for two classes. here, we reuse the hotdog class weight,
# and randomly initialize the 'not hotdog' class.
new_classifier_w = mx.nd.concat(mx.nd.random_normal(shape=hotdog_w.shape, scale=0.0
hotdog_w,
dim=0)
new_classifier_b = mx.nd.concat(mx.nd.random_normal(shape=hotdog_b.shape, scale=0.0
hotdog_b,
dim=0)
4.2.5 Evaluation
Our task is a binary classification problem with imbalanced classes. So we’ll monitor performance both
using accuracy and F1 score, a metric favored in settings with extreme class imbalance. [Note to authors:
ensure that F1 score is explained earlier or explain it here in full]
In [ ]: # return metrics string representation
def metric_str(names, accs):
return ', '.join(['%s=%f'%(name, acc) for name, acc in zip(names, accs)])
metric = mx.metric.create(['acc', 'f1'])
The following snippet performs inferences on evaluation dataset, and updates the metrics. Once the evalua-
tion data iterator is exhausted, it returns the values of each of the metrics.
In [ ]: import mxnet.gluon as gluon
from mxnet.image import color_normalize
4.2.6 Training
We now can train the model just as we would any supervised model. In this example, we set up the training
loop for multi-GPU use as described from first principles here and in the context of gluon here.
In [ ]: import mxnet.autograd as autograd
best_f1 = 0
val_names, val_accs = evaluate(net, val_iter, ctx)
logging.info('[Initial] validation: %s'%(metric_str(val_names, val_accs)))
for epoch in range(epochs):
tic = time.time()
train_iter.reset()
btic = time.time()
for i, batch in enumerate(train_iter):
# the model zoo models expect normalized images
data = color_normalize(batch.data[0]/255,
mean=mx.nd.array([0.485, 0.456, 0.406]).reshape(
std=mx.nd.array([0.229, 0.224, 0.225]).reshape((
data = gluon.utils.split_and_load(data, ctx_list=ctx, batch_axis=0)
label = gluon.utils.split_and_load(batch.label[0], ctx_list=ctx, batch_
outputs = []
Ls = []
with autograd.record():
for x, y in zip(data, label):
z = net(x)
# rescale the loss based on class to counter the imbalance prob
L = loss(z, y) * (1+y*positive_class_weight)/positive_class_wei
# store the loss and do backward after we have done forward
# on all GPUs for better speed on multiple GPUs.
Ls.append(L)
outputs.append(z)
for L in Ls:
L.backward()
trainer.step(batch.data[0].shape[0])
metric.update(label, outputs)
if log_interval and not (i+1)%log_interval:
names, accs = metric.get()
logging.info('[Epoch %d Batch %d] speed: %f samples/s, training: %s
epoch, i, batch_size/(time.time()-btic), metric_str(
btic = time.time()
if mode == 'hybrid':
deep_dog_net.hybridize()
if epochs > 0:
deep_dog_net.collect_params().reset_ctx(contexts)
train(deep_dog_net, train_iter, val_iter, epochs, contexts)
download('https://apache-mxnet.s3-accelerate.amazonaws.com/gluon/models/deep-dog-5a
overwrite=True)
deep_dog_net.load_parameters('deep-dog-5a342a6f.params', contexts)
In [ ]: classify_hotdog(deep_dog_net, '../img/real_hotdog.jpg', contexts)
In [ ]: classify_hotdog(deep_dog_net, '../img/leg_hotdog.jpg', contexts)
In [ ]: classify_hotdog(deep_dog_net, '../img/dog_hotdog.jpg', contexts)
4.2.8 Conclusions
As you can see, given a pretrained model, we can get a great classifier, even for tasks where we simply
don’t have enough data to train from scratch. That’s because the representations necessary to perform both
tasks have a lot in common. Since they both address natural images, they both require recognizing textures,
shapes, edges, etc. Whenever you have a small enough dataset that you fear impoverishing your model,
try thinking about what larger datasets you might be able to pre-train your model on, so that you can just
perform fine-tuning on the task at hand.
4.2.9 Next
This section is still changing too fast to say for sure what will come next. Stay tuned!
For whinges or inquiries, open an issue on GitHub.
2. Filter the samples giving top k answers(k can be 1000, 2000. . . ). This will make the prediction easier.
In the first model, we will concatenate the image and question features and use multilayer perception(MLP)
to predict the answer.
In [3]: class Net1(gluon.Block):
def __init__(self, **kwargs):
super(Net1, self).__init__(**kwargs)
with self.name_scope():
# layers created in name_scope will inherit name space
# from parent layer.
self.bn = nn.BatchNorm()
self.dropout = nn.Dropout(0.3)
self.fc1 = nn.Dense(8192,activation="relu")
self.fc2 = nn.Dense(1000)
In the second model, instead of linearly combine the image and text features, we use count sketch to estimate
the outer product of the image and question features. It is also named as multimodel compact bilinear
pooling(MCB).
This method was proposed in Multimodal Compact Bilinear Pooling for VQA. The key idea is:
𝜓(𝑥 ⊗ 𝑦, ℎ, 𝑠) = 𝜓(𝑥, ℎ, 𝑠) ⋆ 𝜓(𝑦, ℎ, 𝑠)
where 𝜓 is the count sketch operator, 𝑥, 𝑦 are the inputs, ℎ, 𝑠 are the hash tables, ⊗ defines outer product
and ⋆ is the convolution operator. This can further be simplified by using FFT properties: convolution in
time domain equals to elementwise product in frequency domain.
One improvement we made is adding ones vectors to each features before count sketch. The intuition is:
given input vectors 𝑥, 𝑦, estimating outer product between [𝑥, 1𝑠] and [𝑦, 1𝑠] gives us information more than
just 𝑥 ⊗ 𝑦. It also contains information of 𝑥 and 𝑦.
In [4]: class Net2(gluon.Block):
def __init__(self, **kwargs):
super(Net2, self).__init__(**kwargs)
with self.name_scope():
# layers created in name_scope will inherit name space
# from parent layer.
self.bn = nn.BatchNorm()
self.dropout = nn.Dropout(0.3)
self.fc1 = nn.Dense(8192,activation="relu")
self.fc2 = nn.Dense(1000)
buckets.sort()
ndiscard = 0
self.data = [[] for _ in buckets]
for i in range(len(sentences)):
buck = bisect.bisect_left(buckets, len(sentences[i]))
if buck == len(buckets):
ndiscard += 1
continue
buff = np.full((buckets[buck],), invalid_label, dtype=dtype)
buff[:len(sentences[i])] = sentences[i]
self.data[buck].append(buff)
self.batch_size = batch_size
self.buckets = buckets
self.text_name = text_name
self.img_name = img_name
self.label_name = label_name
self.dtype = dtype
self.invalid_label = invalid_label
self.nd_text = []
self.nd_img = []
self.ndlabel = []
self.major_axis = layout.find('N')
self.default_bucket_key = max(buckets)
if self.major_axis == 0:
self.provide_data = [(text_name, (batch_size, self.default_bucket_key))
(img_name, (batch_size, self.default_bucket_key))]
self.provide_label = [(label_name, (batch_size, self.default_bucket_key
elif self.major_axis == 1:
self.provide_data = [(text_name, (self.default_bucket_key, batch_size))
(img_name, (self.default_bucket_key, batch_size))]
self.provide_label = [(label_name, (self.default_bucket_key, batch_size
else:
raise ValueError("Invalid layout %s: Must by NT (batch major) or TN (ti
self.idx = []
for i, buck in enumerate(self.data):
self.idx.extend([(i, j) for j in range(0, len(buck) - batch_size + 1, b
self.curr_idx = 0
self.reset()
def reset(self):
self.curr_idx = 0
self.nd_text = []
self.nd_img = []
self.ndlabel = []
def next(self):
if self.curr_idx == len(self.idx):
raise StopIteration
i, j = self.idx[self.curr_idx]
self.curr_idx += 1
if self.major_axis == 1:
img = self.nd_img[i][j:j + self.batch_size].T
text = self.nd_text[i][j:j + self.batch_size].T
label = self.ndlabel[i][j:j+self.batch_size]
else:
img = self.nd_img[i][j:j + self.batch_size]
text = self.nd_text[i][j:j + self.batch_size]
label = self.ndlabel[i][j:j+self.batch_size]
url_format = 'https://apache-mxnet.s3-accelerate.amazonaws.com/gluon/dataset/VQA-no
if not os.path.exists(train_q):
logging.info('Downloading training dataset.')
download(url_format.format(train_q),overwrite=True)
download(url_format.format(train_i),overwrite=True)
download(url_format.format(train_a),overwrite=True)
if not os.path.exists(val_q):
logging.info('Downloading validation dataset.')
download(url_format.format(val_q),overwrite=True)
download(url_format.format(val_i),overwrite=True)
download(url_format.format(val_a),overwrite=True)
train_question = np.load(train_q)['x']
val_question = np.load(val_q)['x']
train_ans = np.load(train_a)['x']
val_ans = np.load(val_a)['x']
train_img = np.load(train_i)['x']
val_img = np.load(val_i)['x']
metric = mx.metric.Accuracy()
data_iterator.reset()
for i, batch in enumerate(data_iterator):
with autograd.record():
data1 = batch.data[0].as_in_context(ctx)
data2 = batch.data[1].as_in_context(ctx)
data = [data1,data2]
label = batch.label[0].as_in_context(ctx)
output = net(data)
metric.update([label], [output])
return metric.get()[1]
4.3.8 Optimizer
In [10]: trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.01})
##########################
# Keep a moving average of the losses
##########################
if i == 0:
moving_loss = np.mean(cross_entropy.asnumpy()[0])
else:
moving_loss = .99 * moving_loss + .01 * np.mean(cross_entropy.asnumpy(
#if i % 200 == 0:
# print("Epoch %s, batch %s. Moving avg of loss: %s" % (e, i, moving_lo
eva_accuracy = evaluate_accuracy(data_eva, net)
train_accuracy = evaluate_accuracy(data_train, net)
print("Epoch %s. Loss: %s, Train_acc %s, Eval_acc %s" % (e, moving_loss, train
if eva_accuracy > best_eva:
best_eva = eva_accuracy
logging.info('Best validation acc found. Checkpointing...')
net.save_parameters('vqa-mlp-%d.params'%(e))
if test:
test_question = np.load("test_question.npz")['x']
test_img = np.load("test_img.npz")['x']
test_question_id = np.load("test_question_id.npz")['x']
test_img_id = np.load("test_img_id.npz")['x']
#atoi = np.load("atoi.json")['x']
INFO:root:Downloading test dataset.
INFO:root:downloaded https://apache-mxnet.s3-accelerate.amazonaws.com/gluon/dataset/VQA-not
INFO:root:downloaded https://apache-mxnet.s3-accelerate.amazonaws.com/gluon/dataset/VQA-not
INFO:root:downloaded https://apache-mxnet.s3-accelerate.amazonaws.com/gluon/dataset/VQA-not
INFO:root:downloaded https://apache-mxnet.s3-accelerate.amazonaws.com/gluon/dataset/VQA-not
INFO:root:downloaded https://apache-mxnet.s3-accelerate.amazonaws.com/gluon/dataset/VQA-not
INFO:root:downloaded https://apache-mxnet.s3-accelerate.amazonaws.com/gluon/dataset/VQA-not
4.4.2 Preliminaries
Before getting going, you’ll probably want to note a couple preliminary details:
• Use of GPUs is preferred if one wants to run the complete training to match the state-of-the-art results.
• To show a progress meter, one should install the tqdm (“progress” in Arabic) through pip
install tqdm. One should also install the HTTP library through pip install requests.
In [1]: import mxnet as mx
from mxnet.gluon import Block, nn
from mxnet.gluon.parameter import Parameter
In [2]: class Tree(object):
def __init__(self, idx):
self.children = []
self.idx = idx
def __repr__(self):
if self.children:
return '{0}: {1}'.format(self.idx, str(self.children))
else:
return str(self.idx)
In [3]: tree = Tree(0)
tree.children.append(Tree(1))
tree.children.append(Tree(2))
tree.children.append(Tree(3))
tree.children[1].children.append(Tree(4))
print(tree)
0: [1, 2: [4], 3]
4.4.3 Model
The model is based on child-sum tree LSTM. For each sentence, the tree LSTM model extracts information
following the dependency parse tree structure, and produces the sentence embedding at the root of each tree.
This embedding can be used to predict semantic similarity.
if children_states:
# sum of children states, (N, C)
hs = F.add_n(*[state[0] for state in children_states])
# concatenation of children hidden states, (N, K, C)
hc = F.concat(*[F.expand_dims(state[0], axis=1) for state in children_s
# concatenation of children cell states, (N, K, C)
cs = F.concat(*[F.expand_dims(state[1], axis=1) for state in children_s
Final model
In [6]: # putting the whole model together
class SimilarityTreeLSTM(nn.Block):
def __init__(self, sim_hidden_size, rnn_hidden_size, embed_in_size, embed_dim,
super(SimilarityTreeLSTM, self).__init__()
with self.name_scope():
self.embed = nn.Embedding(embed_in_size, embed_dim)
self.childsumtreelstm = ChildSumLSTMCell(rnn_hidden_size, input_size=em
self.similarity = Similarity(sim_hidden_size, rnn_hidden_size, num_clas
Vocab
In [7]: import os
import logging
logging.basicConfig(level=logging.INFO)
import numpy as np
import random
from tqdm import tqdm
import mxnet as mx
self.add(Vocab.PAD_WORD)
self.add(Vocab.UNK_WORD)
self.add(Vocab.BOS_WORD)
self.add(Vocab.EOS_WORD)
self.embed = None
@property
def size(self):
return len(self.idx2tok)
Data iterator
In [8]: # Iterator class for SICK dataset
class SICKDataIter(object):
def reset(self):
if self.shuffle:
mask = list(range(self.size))
random.shuffle(mask)
self.l_sentences = [self.l_sentences[i] for i in mask]
self.r_sentences = [self.r_sentences[i] for i in mask]
self.l_trees = [self.l_trees[i] for i in mask]
self.r_trees = [self.r_trees[i] for i in mask]
self.labels = [self.labels[i] for i in mask]
self.index = 0
def next(self):
out = self[self.index]
self.index += 1
return out
def __len__(self):
return self.size
import mxnet as mx
from mxnet import gluon
from mxnet.gluon import nn
# initialization
context = [mx.gpu(0) if use_gpu else mx.cpu()]
# seeding
mx.random.seed(seed)
np.random.seed(seed)
random.seed(seed)
# read dataset
def verified(file_path, sha1hash):
import hashlib
sha1 = hashlib.sha1()
with open(file_path, 'rb') as f:
while True:
data = f.read(1048576)
if not data:
break
sha1.update(data)
matched = sha1.hexdigest() == sha1hash
if not matched:
logging.warn('Found hash mismatch in file {}, possibly due to incomplete do
.format(file_path))
return matched
data_file_name = 'tree_lstm_dataset-3d85a6c4.cPickle'
data_file_hash = '3d85a6c44a335a33edc060028f91395ab0dcf601'
if not os.path.exists(data_file_name) or not verified(data_file_name, data_file_has
from mxnet.test_utils import download
download('https://apache-mxnet.s3-accelerate.amazonaws.com/gluon/dataset/%s'%da
overwrite=True)
# get network
# the prediction from the network is log-probability vector of each score class
# so use the following function to convert scalar score to the vector
# e.g 4.5 -> [0, 0, 0, 0.5, 0.5]
def to_target(x):
target = np.zeros((1, num_classes))
ceil = int(math.ceil(x))
floor = int(math.floor(x))
if ceil==floor:
target[0][floor-1] = 1
else:
target[0][floor-1] = ceil - x
target[0][ceil-1] = x - floor
return mx.nd.array(target)
if isinstance(ctx, mx.Context):
ctx = [ctx]
net.collect_params().initialize(mx.init.Xavier(magnitude=2.24), ctx=ctx[0])
net.embed.weight.set_data(vocab.embed.as_in_context(ctx[0]))
train_data.set_context(ctx[0])
dev_data.set_context(ctx[0])
best_r = -1
Loss = gluon.loss.KLDivLoss()
for i in range(epoch):
train_data.reset()
num_samples = min(len(train_data), training_batches_per_epoch*batch_size)
# collect predictions and labels for evaluation metrics
preds = []
labels = [mx.nd.array(train_data.labels[:num_samples], ctx=ctx[0]).reshape(
for j in tqdm(range(num_samples), desc='Training epoch {}'.format(i)):
# get next batch
l_tree, l_sent, r_tree, r_sent, label = train_data.next()
# use autograd to record the forward calculation
with ag.record():
# forward calculation. the output is log probability
z = net(mx.nd, l_sent, r_sent, l_tree, r_tree)
# calculate loss
loss = Loss(z, to_target(label).as_in_context(ctx[0]))
# backward calculation for gradients.
loss.backward()
preds.append(z)
# update weight after every batch_size samples
if (j+1) % batch_size == 0:
trainer.step(batch_size)
4.4.6 Conclusion
• Gluon offers great tools for modeling in an imperative way.
I (Zack) have honestly no idea why Amazon wants me to watch Bubble Guppies. It’s possible that Bubble
Guppies is a masterpiece, and the recommender systems knows that my life will change upon watching it.
It’s also possible that the recommender made a mistake. For example, it might have extrapolated incor-
rectly from my affinity for the anime Death Note, thinking that I would similarly love any animated series.
And, since I’ve never rated a nickelodean series (either postiively or negatively), the system may have no
knowledge to the contrary. It’s also possible that this series is a new addition to the catalogue, and thus they
need to recommend the item to many users in ordder to develop a sense of who likes Bubble Guppies. This
problem, of sorting out how to handle a new item, is called the cold-start problem.
A recommender system doesn’t have to use any sophisticated machine learning techniques. And it doesn’t
even have to be personalized. One reasonable baseline for most applications is to suggest the most popular
items to everyone. But we have to be careful. Depending on how we define popularity, we might create
a feedback loop. The most popular items get recommended which makes them even more popular, which
makes them even more frequently recommended, etc.
For services with diverse users, however, personalization can be essential. Diapers are among the most
popular items on Amazon, but we probably shouldn’t recommend diapers to adolescents. We also probably
should not recommend anything associated with Justin Bieber to a user who isn’t an adolescent. Moreover,
we might want to personalize, not only to the user, but to the context. For example, just after I bought a
Pixel phone, I was in the market for a phone case. But I have no interested in buying a phone case one year
later.
In [11]: data[0]
Out[11]: {'asin': '616719923X',
'helpful': [0, 0],
'overall': 4.0,
'reviewText': 'Just another flavor of Kit Kat but the taste is unique and a bit d
'reviewTime': '06 1, 2013',
'reviewerID': 'A1VEELTKS8NLZB',
'reviewerName': 'Amazon Customer',
'summary': 'Good Taste',
'unixReviewTime': 1370044800}
4.5.4 Models
• Just the average
• Offset plus user and item biases
• Latent factor model / matrix factorization
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
ℎ0 ∼ 𝒩 (𝜇0 , Σ0 )
The LDS is thus fully specified by the system parameters 𝐴 ∈ R𝐻×𝐻 , 𝐵 ∈ R𝐷×𝐻 , Σℎ ∈ 𝒮+
𝐻 , Σ ∈ 𝒮𝐷,
𝑣 +
𝜇0 ∈ R𝐻 , Σ0 ∈ 𝒮+𝐻 . 𝒮 denotes the space of positive definite (PD) matrices.
+
Given such a LDS specification, and a sequence of observations 𝑣0 , 𝑣1 , . . . , 𝑣𝑇 , one is typically interested in
one of the following
1. (Log-)Likelihood computation, i.e. computing the probability of the data under the model,
𝑃 (𝑣0 , 𝑣1 , . . . , 𝑣𝑇 )
2. Filtering, i.e. computing the mean and covariance of 𝑃 (ℎ𝑡 |𝑣0 , 𝑣1 , . . . , 𝑣𝑡 )
3. Smoothing, i.e. computing the mean and covariance of 𝑃 (ℎ𝑡 |𝑣0 , 𝑣1 , . . . , 𝑣𝑇 )
4. Parameter learning: find the system parameters that best describe the data, e.g. by maximizing likeli-
hood
In this notebook we will focus on the filtering problem, and will also see how to compute the log-likelihood
as a byproduct. For details on other problems, See e.g. Barber, 2012, Chapter 24.
4.7 Filtering
We want to find the “filtered” distributions 𝑝(ℎ𝑡 |𝑣0:𝑡 ) where 𝑣0:𝑡 denotes {𝑣0 , · · · , 𝑣𝑡 }. Due to the closure
properties of Gaussian distributions, each of these distributions is also Gaussian 𝑝(ℎ𝑡 |𝑣0:𝑡 ) = 𝒩 (ℎ𝑡 |𝑓𝑡 , 𝐹𝑇 ).
The filtering procedure proceeds sequentially, by expressing 𝑓𝑡 and 𝐹𝑡 in terms of 𝑓𝑡−1 and 𝐹𝑡−1 . We
initialize 𝑓0 and 𝐹0 to be 0.
4.7.1 Prerequisite
To derive the formulas for filtering, here is all you need [see Bishop 2008, Appendix B]
• Conditional Gaussian equations
Σ = (Λ + 𝐴𝑇 𝐿𝐴)−1 (2)
4.7.2 Derivation
Now we are ready to derive the filtering equations, by Bayes Theorem
The derivation boils down to caclulate the two terms on the right hand side (you can think that the first is
𝑝(𝑦|𝑥) and the second is 𝑝(𝑥) as in the conditional Gaussian equations) and use (2) above to get the desired
formula.
The first term is directly given by the observation equation, i.e., 𝑝(𝑣𝑡 |ℎ𝑡 ) = 𝒩 (𝐵ℎ𝑡 , Σ𝑣 ), and the second
term can be calculated as follows
∫︁
𝑝(ℎ𝑡 |𝑣0:𝑡−1 ) = 𝑝(ℎ𝑡 |ℎ𝑡−1 , 𝑣0:𝑡−1 )𝑝(ℎ𝑡−1 |𝑣0:𝑡−1 )dℎ𝑡−1
∫︁
= 𝑝(ℎ𝑡 |ℎ𝑡−1 )𝑝(ℎ𝑡−1 |𝑣0:𝑡−1 )dℎ𝑡−1 by Markov property
∫︁
= 𝒩 (ℎ𝑡 |𝐴ℎ𝑡−1 , Σℎ )𝒩 (ℎ𝑡−1 |𝑓𝑡−1 , 𝐹𝑡−1 )dℎ𝑡−1
where we have used the matrix inversion lemman and define the Kalman gain matrix as
Notice that for numerical stability, the covariance matrix is normally calculated using so-called “Joseph’s
symmetrized update,”
𝑓𝑡 = 𝜇ℎ + 𝐾(𝑣 − 𝐵𝜇ℎ ).
𝜇ℎ = 𝐴𝑓𝑡−1 𝜇𝑣 = 𝐵𝜇ℎ
𝐾𝑡 = Σℎℎ 𝐵 𝑇 Σ−1
𝑣𝑣
and then using 𝑃 (𝑣𝑡 |𝑣0 , 𝑣1 , . . . , 𝑣𝑡−1 ) = 𝒩 (𝑣𝑡 |𝜇𝑣 , Σ𝑣𝑣 ) with parameters obtained during filtering to com-
pute each term.
In [1]: import mxnet as mx
import mxnet.ndarray as nd
In [2]: import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams["figure.figsize"] = (10, 5)
v = []
# initial state h_0
h = np.array([1, 0])
for t in range(T):
# h_t = Bh_{t-1} + \epsilon_h
h = np.random.multivariate_normal(A.asnumpy().dot(h), S_h.asnumpy())
v.append(vv)
v = nd.array(np.array(v).reshape((T,1)))
In [5]: plt.plot(v.asnumpy());
f_0 = nd.zeros((H,1))
F_0 = nd.zeros((H,H))
eye_h = nd.array(np.eye(H))
F_t = None
f_t = None
F_seq = []
f_seq = []
log_p_seq = []
for t in range(T):
if t == 0:
# At the first time step, use the prior
mu_h = f_0
S_hh = F_0
else:
# Otherwise compute using update eqns.
mu_h = gemm2(A, f_t)
S_hh = gemm2(A, gemm2(F_t, A, transpose_b=1)) + S_h
In the next notebook, we will use Kalman filtering as a subroutine in more complex models. In particular,
we will show how to do time series forecasting with innovative state space models (ISSMs).
In words, the next step forecast is a convex combination of the most recent obseravtion and forecast. Ex-
panding the above equation, it is clear that the forecast is given by the exponentially weighted average of
4.9. Exponential Smoothing and Innovation State Space Model (ISSM) 329
Deep Learning - The Straight Dope, Release 0.1
past observations,
Here 𝛼 > 0 is a smoothing parameter that controls the weight given to each observation. Note that the
recent observations are given more weight than the older observations. In fact the weight given to the past
observation descreases exponentially as it gets older and hence the name exponential smoothing.
General exponential smoothing methods consider the extensions of simple ETS to include time series pat-
terns such as (linear) trend, various periodic seasonal effects. All ETS methods falls under the category of
forecasting methods as the predictions are point forecasts (a single value is predicted for each future time
step). On the other hand a statistical model describes the underlying data generation process and has an
advantage that it can produce an entire probability distribuiton for each of the future time steps. Innova-
tion state space model (ISSM) is an example of such models with considerable flexibility in respresnting
commonly occurring time series patterns and underlie the exponential smoothing methods.
The idea behind ISSMs is to maintain a latent state vector 𝑙𝑡 with recent information about level, trend, and
seasonality factors. The state vector 𝑙𝑡 evolves over time adding small innvoation (i.e., the Gaussian noise)
at each time step. The observations are then a linear combination of the components of the current state.
Mathematically, ISSM is specified by two equations
• The state transition equation is given by
𝑙𝑡 = 𝐹𝑡 𝑙𝑡−1 + 𝑔𝑡 𝜖𝑡 , 𝜖𝑡 ∼ 𝒩 (0, 1).
Note that the innovation strength is controlled by 𝑔𝑡 , i.e., 𝑔𝑡 𝜖𝑡 ∼ 𝒩 (0, 𝑔𝑡2 ).
• The observation equation is given by
𝑧𝑡 = 𝑎 ⊤
𝑡 𝑙𝑡−1 + 𝑏𝑡 + 𝜈𝑡 , 𝜈𝑡 ∼ 𝒩 (0, 𝜎𝑡2 )
Note that here we allow for an additional term 𝑏𝑡 which can model any determinstic component (exogenous
variables).
This describes a fairy generic model allowing the user to encode specific time series patterns using the
coefficients 𝐹 , 𝑎𝑡 and thus are problem dependent. The innovation vector 𝑔𝑡 comes in terms of parameters
to be learned (the innovation strengths). Moreover, the initial state 𝑙0 has to be specified. We do so by
specifying a Gaussian prior distribution 𝑃 (𝑙0 ), whose parameters (means, standard deviation) are learned
from data as well.
The parameters of the ISSM are typically learned using the maximum likelihood principle. This requires
the computation of the log-likelihood of the given observations i.e., computing the probability of the data
under the model, 𝑃 (𝑧1 , . . . , 𝑧𝑇 ). Fortunately, in the previous notebook, we have learned how to compute
the log-likelihood as a byproduct of LDS filtering problem.
4.10 Filtering
We remark that ISSM is a special case of linear dynamical system except that the coefficients are allowed
to change over time. The filtering equations for ISSM can readily be obtained from the general derivation
described in LDS.
Note the change in the notation in the following equations for filtered mean (𝜇𝑡 ) and filtered variance (𝑆𝑡 )
because of the conflict of notation for the ISSM coefficient 𝐹 . Also note that the deterministic part 𝑏𝑡 needs
to be subtracted from the observations [𝑧𝑡 ].
𝜇ℎ = 𝐹𝑡 𝜇𝑡−1 𝜇𝑣 = 𝑎⊤
𝑡 𝜇ℎ
eye_h = nd.array(np.eye(H))
mu_seq = []
S_seq = []
log_p_seq = []
for t in range(T):
if t == 0:
# At the first time step, use the prior
mu_h = m_prior
S_hh = S_prior
else:
# Otherwise compute using update eqns.
F_t = F[:, :, t]
g_t = g[:, t].reshape((H,1))
sigma_t = sigma[t]
S_vv = gemm2(a_t, S_hh_x_a_t, transpose_a=1) + nd.square(sigma_t)
kalman_gain = nd.broadcast_div(S_hh_x_a_t, S_vv)
# Filtered estimates
mu_t = mu_h + gemm2(kalman_gain, delta)
# likelihood term
log_p = (-0.5 * (delta * delta / S_vv
+ np.log(2.0 * np.pi)
+ nd.log(S_vv))
)
mu_seq.append(mu_t)
S_seq.append(S_t)
log_p_seq.append(log_p)
4.10.2 Data
We will use the 10 year US Government Bond Yields dataset to illustrate two specific instances of ISSM
models.
In [3]: import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (12, 5)
In [4]: df = pd.read_csv("https://datahub.io/core/bond-yields-us-10y/r/monthly.csv", header
In [5]: df.set_index("Date")
𝑙𝑡 = 𝛿𝑙𝑡−1 + 𝛼𝜖𝑡 .
Or in ISSM terminology,
The level 𝑙𝑡 ∈ R evolves over time by adding a random innovation 𝛼𝜖𝑡 ∼ 𝒩 (0, 𝛼2 ) to the previous level, so
that 𝛼 specifies the amount of level drift over time. At time 𝑡, the previous level 𝑙𝑡−1 is used in the prediction
𝑧𝑡 and then the level is updated. The damping factor 𝛿 ∈ (0, 1] allows the ‘‘damping” of the level. The
initial state prior 𝑃 (𝑙0 ) is given by 𝑙0 ∼ 𝑁 (𝜇0 , 𝜎02 ). For Level-ISSM, we learn the parameters 𝛼 > 0, 𝜇0 ,
𝜎0 > 0.
Here we will fix the parameters for the illustration of filtering. Learning of the parameters will be discussed
in another notebook.
In [7]: latent_dim = 1
T = len(ts)
Forecast
One advantage of the ISSM model is that one can obtain the complete probability distribution for each of
the future time steps:
𝑇 𝑇 2
𝑇 +𝑡 ) = 𝒩 (𝑎𝑇 +𝑡 𝜇𝑇 +𝑡 , 𝑎𝑇 +𝑡 𝑆𝑇 +𝑡 𝑎𝑇 +𝑡 + 𝜎𝑇 +𝑡 ),
𝑝(𝑧̂︂ 𝑡>0
𝑝(𝑙𝑇 +𝑡 ) = 𝒩 (𝐹 𝜇𝑇 +𝑡−1 , 𝐹 𝑆𝑇 +𝑡−1 𝐹 + 𝑔𝑇 +𝑡 𝑔𝑇𝑇 +𝑡 )
𝑇
forecasts_mean = []
forecasts_std = []
mu_last_state = mu_last_state.asnumpy()
S_last_state = S_last_state.asnumpy()
F = F.asnumpy()
a = a.asnumpy()
g = g.asnumpy()
sigma = sigma.asnumpy()
for t in range(horizon):
a_t = a[:, t]
forecast_mean = a_t.dot(mu_last_state)[0]
forecast_std = a_t.dot(S_last_state).dot(a_t) + np.square(sigma[t])[0]
forecasts_mean.append(forecast_mean)
forecasts_std.append(forecast_std)
plt.plot(ts, color="r")
plt.plot(v_filtered_mean, color="b")
T = len(v_filtered_mean)
x = np.arange(T)
plt.fill_between(x, v_filtered_mean-v_filtered_std,
v_filtered_mean+v_filtered_std,
facecolor="blue", alpha=0.2)
𝑙𝑡 = 𝛿𝑙𝑡−1 + 𝛾𝑏𝑡−1 + 𝛼 · 𝜖𝑡
𝑏𝑡 = 𝛾𝑏𝑡−1 + 𝛽 · 𝜖𝑡
where 𝛼 > 0, 𝛽 > 0 and the damping factors 𝛿, 𝛾 ∈ (0, 1]. Both the level and slope components evolve over
time by adding innovations 𝛼𝜖𝑡 and 𝛽𝜖𝑡 respectively, so that 𝛽 > 0 is the innovation strength for the slope.
The level at time 𝑡 is the sum of level at 𝑡 − 1 and slope at 𝑡 − 1 (linear prediction) modulo the damping
factors for level 𝛿 and growth 𝛾.
In [15]: latent_dim = 2
T = len(ts)
FIVE
∑︁
ℓ(𝜃) = log (𝑝𝜃 (𝑥𝑖 ))
𝑖
∑︁ (︂∫︁ )︂
= log 𝑝𝜃 (𝑥𝑖 , 𝑧)𝑑𝑧
𝑖
(︂ [︂ ]︂)︂
∑︁ 𝑝𝜃 (𝑥𝑖 , 𝑧)
= log E𝑧∼𝑄
𝑞(𝑧)
𝑖
[︂ (︂ )︂]︂
∑︁ 𝑝𝜃 (𝑥𝑖 , 𝑧)
≥ E𝑧∼𝑄 log
𝑞(𝑧)
𝑖
⏟ ⏞
𝐸𝐿𝐵𝑂:ℒ(𝑞,𝜃)
∑︁ (︂∫︁ )︂
= log 𝑝𝜃 (𝑥𝑖 , 𝑧)𝑑𝑧
𝑖
339
Deep Learning - The Straight Dope, Release 0.1
[︂ (︂ )︂]︂
∑︁ 𝑝𝜃 (𝑥𝑖 , 𝑧)
≥ E𝑧∼𝑄 log
𝑞(𝑧)
⏟𝑖 ⏞
𝐸𝐿𝐵𝑂:ℒ(𝑞,𝜃)
Importantly, among all choices of 𝑞(𝑧), we’d be able to maximize the ELBO ℒ(𝑞, 𝜃) with respect to 𝑞 if 𝑞 is
^𝑡−1 ^𝑡−1
chosen to be the inferred posterior, i.e. at 𝑡-th iteration 𝑞 𝑡 (𝑧) = 𝑝(𝑧|𝑥𝑖 ; 𝜃ˆ𝑡−1 ) = ∫︀ 𝑝(𝑥𝑖 |𝑧;𝜃 )𝑝(𝑧;𝜃 ) . This
𝑝(𝑥𝑖 |𝑧;𝜃^𝑡−1 )𝑝(𝑧;𝜃^𝑡−1 )𝑑𝑧
is the essesce of the E-step in EM algorithm. In M-step, we then maximize over 𝜃. The particular choice
of 𝑞(𝑧) in E-step ensures that EM would monotonically increase the ELBO ℒ(𝑞, 𝜃), thus the log-liklihood
ℓ(𝜃). The chain of improvements through E-step and M-step are illustrated below.
From EM to VAE
With more complex distributions of 𝑝𝜃 (𝑥|𝑧), the integration in E-step for exact inference of the posterier
𝑝𝜃 (𝑧|𝑥) is intractable. This posterier inference problem can be addressed with variational inference meth-
ods such as mean-field approximation (where we assume factorizable 𝑞(𝑧)) or sampling based methods (e.g.
collapsed Gibbs sampling for solving Latent Dirichlet allocation). Mean-field approximation put undue con-
straints on the variational family 𝑞(𝑧), and sampling based methods could have slow convergence problems.
Moreover, both methods involves arduous derivation of update functions, that would require rederivation
even for small changes in model and thus could limit the exploration of more complex models.
Auto-Encoding Variational Bayes brought about a flexible neural-network based approach. In this frame-
work, the variational inference / variational optimization task of finding the optimal 𝑞 become a matter of
finding the best parameters of a neural network via backpropagation and stochastic gradient descent. Thus
making blackbox inference possible as well as allowing scalable to trainng for deeper and larger neural
network models. We refer to this framework as Neural Variational Inference.
Here is how it works: - Select a prior for latent variable 𝑝𝜃 (𝑧), which may or may not actually involve
parameters. - Use a neural network to parameterize the distribution 𝑝𝜃 (𝑥|𝑧). Because this part of the model
maps latent varibale (code) 𝑧 to observed data 𝑥, it is viewed as a “decoder” network. - Rather than explictly
calculating the intractable 𝑝(𝑧|𝑧), use another neural network to parameterize the distribution 𝑞𝜑 (𝑧|𝑥) as
the approximate posterior. Due to the mapping from from data 𝑥 to latent variable (code) 𝑧, this part of the
model is viewed as a “encoder” network. - The objective is still to maxmize ELBO ℒ(𝜑, 𝜃). But now instead
of separately finding the optimal 𝜑 (corresponding to 𝑞 in EM) and 𝜃 like EM, we can find the parameters 𝜃
and 𝜑 jointly via standard stochastic gradient descent.
The resulted model resembles an encoder-decoder structure, thus commonly referred to as variational auto-
encoder (VAE).
In the classic example in Auto-Encoding Variational Bayes, we have the prior 𝑝(𝑧) as a standard isotropic
Gaussian 𝒩 (0, 𝐼), and the approximate posterior 𝑞𝜑 (𝑧|𝑥) also be isotropic Gaussian 𝒩 (𝜇𝜑 (𝑥), 𝜎𝜑 (𝑥)𝐼),
where 𝜇𝜑 (𝑥) and 𝜎𝜑 (𝑥) are functions implemented as neural networks and their outputs are used as the
parameters for the Guassian distribution 𝑞𝜑 (𝑧|𝑥). This model configuration is often referred as Gaussian
VAE.
With this setup the training loss to minimize is the negative of ELBO and can be expressed as follows:
−ℒ(𝑥𝑖 , 𝜑, 𝜃) = −E𝑧∼𝑄𝜑 (𝑧|𝑥𝑖 ) [log 𝑝𝜃 (𝑥𝑖 |𝑧) + log 𝑝𝜃 (𝑧) − log 𝑞𝜑 (𝑧|𝑥𝑖 )]
= −E𝑧∼𝑄𝜑 (𝑧|𝑥𝑖 ) [log 𝑝𝜃 (𝑥𝑖 |𝑧)] + 𝐷𝐾𝐿 [log 𝑞𝜑 (𝑧|𝑥𝑖 )‖𝑝𝜃 (𝑧)]
𝐿
1 ∑︁
≈ [− log 𝑝𝜃 (𝑥𝑖 |𝑧𝑠 )] + 𝐷𝐾𝐿 [log 𝑞𝜑 (𝑧|𝑥𝑖 )‖𝑝𝜃 (𝑧)]
𝐿 𝑠 ⏟ ⏞
⏟ ⏞ Can be calculated analytically between Gaussians
Sampling 𝑧𝑠 ∼𝑄𝜑 (𝑧|𝑥𝑖 )
= −E𝑧∼𝑄𝜑 (𝑧|𝑥𝑖 ) [log 𝑝𝜃 (𝑥𝑖 |𝑧)] + 𝐷𝐾𝐿 [log 𝑞𝜑 (𝑧|𝑥𝑖 )‖𝑝𝜃 (𝑧)]
where the ELBO above is the same as the ELBO expression in EM but with 𝑝(𝑥, 𝑧) expanded and with
𝑞(𝑥)
𝐷𝐾𝐿 denoting the KL-divergence, i.e. 𝐷𝐾𝐿 (𝑄‖𝑃 ) = E𝑥∼𝑄 [log( 𝑝(𝑥) ]. As indicated, the first term can
be approximated by drawing 𝐿 Monte Carlo samples from the distribution 𝑞𝜑 (𝑧|𝑥) (a very feasible task of
drawing from an isotropic Gaussian distribution), while the 𝐷𝐾𝐿 has convenient analytical solutions which
is preferred over Monte Carlo samples in order to have lower variance gradient.
With sampling involved, the remaining question is how do we backpropagate through a sampling node in
the computation graph. The authors of Auto-Encoding Variational Bayes proposed Reparameterize Trick
(RT). Instead of sampling 𝑧 from 𝒩 (𝜇𝜑 (𝑥), 𝜎𝜑 (𝑥)𝐼) directly, we sample 𝜖 from fixed distribution 𝒩 (0, 𝐼)
and construct 𝑧 = 𝜇(𝑥) + 𝜎(𝑥) · 𝜖. This way the random sampling is based on 𝜖, and 𝑧 deterministically
depends on 𝜇(𝑥) and 𝜎(𝑥) allowing gradient to flow through them. RT is a generally applicable technique
for distribution that allows location-scale transformation or has analytical inverse CDFs.
/Users/rding/anaconda3/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py:46: Depreca
import OpenSSL.SSL
In [2]: def gpu_exists():
try:
mx.nd.zeros((1,), ctx=mx.gpu(0))
except:
return False
return True
data_ctx = mx.cpu()
if gpu_exists():
print('Using GPU for model_ctx')
model_ctx = mx.gpu(0)
else:
print('Using CPU for model_ctx')
model_ctx = mx.cpu()
Using CPU for model_ctx
In [3]: mx.random.seed(1)
output_fig = False
n_samples = 10
idx = np.random.choice(len(mnist['train_data']), n_samples)
_, axarr = plt.subplots(1, n_samples, figsize=(16,4))
for i,j in enumerate(idx):
axarr[i].imshow(mnist['train_data'][j][0], cmap='Greys')
#axarr[i].axis('off')
axarr[i].get_xaxis().set_ticks([])
axarr[i].get_yaxis().set_ticks([])
plt.show()
/Users/rding/anaconda3/lib/python3.6/site-packages/mxnet/test_utils.py:1430: DeprecationWar
label = np.fromstring(flbl.read(), dtype=np.int8)
/Users/rding/anaconda3/lib/python3.6/site-packages/mxnet/test_utils.py:1433: DeprecationWar
image = np.fromstring(fimg.read(), dtype=np.uint8).reshape(len(label), rows, cols)
self.decoder = nn.HybridSequential(prefix='decoder')
for i in range(n_layers):
self.decoder.add(nn.Dense(n_hidden, activation=act_type))
self.decoder.add(nn.Dense(n_output, activation='sigmoid'))
KL = 0.5*F.sum(1+lv-mu*mu-F.exp(lv),axis=1)
logloss = F.sum(x*F.log(y+self.soft_zero)+ (1-x)*F.log(1-y+self.soft_zero),
loss = -logloss-KL
return loss
In [9]: n_hidden=400
n_latent=2
n_layers=2 # num of dense layers in encoder and decoder respectively
n_output=784
training_loss = []
validation_loss = []
for epoch in tqdm_notebook(range(n_epoch), desc='epochs'):
epoch_loss = 0
epoch_val_loss = 0
train_iter.reset()
test_iter.reset()
n_batch_train = 0
for batch in train_iter:
n_batch_train +=1
data = batch.data[0].as_in_context(model_ctx)
with autograd.record():
loss = net(data)
loss.backward()
trainer.step(data.shape[0])
epoch_loss += nd.mean(loss).asscalar()
n_batch_val = 0
for batch in test_iter:
n_batch_val +=1
data = batch.data[0].as_in_context(model_ctx)
loss = net(data)
epoch_val_loss += nd.mean(loss).asscalar()
epoch_loss /= n_batch_train
epoch_val_loss /= n_batch_val
training_loss.append(epoch_loss)
validation_loss.append(epoch_val_loss)
if epoch % max(print_period,1) == 0:
tqdm.write('Epoch{}, Training loss {:.2f}, Validation loss {:.2f}'.format(
end = time.time()
print('Time elapsed: {:.2f}s'.format(end - start))
A Jupyter Widget
Epoch0, Training loss 184.74, Validation loss 171.09
In [16]: n_samples = 10
idx = np.random.choice(batch_size, n_samples)
_, axarr = plt.subplots(2, n_samples, figsize=(16,4))
for i,j in enumerate(idx):
axarr[0,i].imshow(original[j].reshape((28,28)), cmap='Greys')
if i==0:
axarr[0,i].set_title('original')
#axarr[0,i].axis('off')
axarr[0,i].get_xaxis().set_ticks([])
axarr[0,i].get_yaxis().set_ticks([])
axarr[1,i].imshow(result[j].reshape((28,28)), cmap='Greys')
if i==0:
axarr[1,i].set_title('reconstruction')
#axarr[1,i].axis('off')
axarr[1,i].get_xaxis().set_ticks([])
axarr[1,i].get_yaxis().set_ticks([])
plt.show()
fig.colorbar(im, ax=axarr[0])
x = np.linspace(norm.cdf(-3), norm.cdf(3),n_pts)
x = ndtri(x)
images = net2.decoder(zsamples.as_in_context(model_ctx)).asnumpy()
#plot
canvas = np.empty((28*n_pts, 28*n_pts))
for i, img in enumerate(images):
x, y = zsamples_id[i]
canvas[(n_pts-y-1)*28:(n_pts-y)*28, x*28:(x+1)*28] = img.reshape(28, 28)
plt.figure(figsize=(6, 6))
plt.imshow(canvas, origin="upper", cmap="Greys")
plt.axis('off')
plt.tight_layout()
if output_fig:
plt.savefig('2d_latent_space_scan_for_generation.png')
new photorealistic image that looks like it might plausibly have come from the same dataset. This kind of
learning is called generative modeling.
Until recently, we had no method that could synthesize novel photorealistic images. But the success of deep
neural networks for discriminative learning opened up new possiblities. One big trend over the last three
years has been the application of discriminative deep nets to overcome challenges in problems that we don’t
generally think of as supervised learning problems. The recurrent neural network language models are one
example of using a discriminative network (trained to predict the next character) that once trained can act as
a generative model.
In 2014, a young researcher named Ian Goodfellow introduced Generative Adversarial Networks (GANs) a
clever new way to leverage the power of discriminative models to get good generative models. GANs made
quite a splash so it’s quite likely you’ve seen the images before. For instance, using a GAN you can create
fake images of bedrooms, as done by Radford et al. in 2015 and depicted below.
At their heart, GANs rely on the idea that a data generator is good if we cannot tell fake data apart from
real data. In statistics, this is called a two-sample test - a test to answer the question whether datasets
𝑋 = {𝑥1 , . . . 𝑥𝑛 } and 𝑋 ′ = {𝑥′1 , . . . 𝑥′𝑛 } were drawn from the same distribution. The main difference
between most statistics papers and GANs is that the latter use this idea in a constructive way. In other
words, rather than just training a model to say ‘hey, these two datasets don’t look like they came from the
same distribution’, they use the two-sample test to provide training signal to a generative model. This allows
us to improve the data generator until it generates something that resembles the real data. At the very least,
it needs to fool the classifier. And if our classifier is a state of the art deep neural network.
As you can see, there are two pieces to GANs - first off, we need a device (say, a deep network but it really
could be anything, such as a game rendering engine) that might potentially be able to generate data that looks
just like the real thing. If we are dealing with images, this needs to generate images. If we’re dealing with
speech, it needs to generate audio sequences, and so on. We call this the generator network. The second
component is the discriminator network. It attempts to distinguish fake and real data from each other. Both
networks are in competition with each other. The generator network attempts to fool the discriminator
network. At that point, the discriminator network adapts to the new fake data. This information, in turn is
ctx = mx.cpu()
Let’s see what we got. This should be a Gaussian shifted in some rather arbitrary way with mean 𝑏 and
covariance matrix 𝐴⊤ 𝐴.
In [3]: plt.scatter(X[:,0].asnumpy(), X[:,1].asnumpy())
plt.show()
print("The covariance matrix is")
print(nd.dot(A.T, A))
[[ 1.00999999 1.95000005]
[ 1.95000005 4.25 ]]
<NDArray 2x2 @cpu(0)>
# loss
loss = gluon.loss.SoftmaxCrossEntropyLoss()
netG.initialize(mx.init.Normal(0.02), ctx=ctx)
netD.initialize(mx.init.Normal(0.02), ctx=ctx)
# set up logging
from datetime import datetime
import os
import time
with autograd.record():
real_output = netD(data)
errD_real = loss(real_output, real_label)
fake = netG(noise)
fake_output = netD(fake.detach())
errD_fake = loss(fake_output, fake_label)
errD = errD_real + errD_fake
errD.backward()
trainerD.step(batch_size)
metric.update([real_label,], [real_output,])
metric.update([fake_label,], [fake_output,])
############################
# (2) Update G network: maximize log(D(G(z)))
###########################
with autograd.record():
output = netD(fake)
errG = loss(output, real_label)
errG.backward()
trainerG.step(batch_size)
plt.scatter(X[:,0].asnumpy(), X[:,1].asnumpy())
plt.scatter(fake[:,0].asnumpy(), fake[:,1].asnumpy())
plt.show()
5.3.6 Conclusion
A word of caution here - to get this to converge properly, we needed to adjust the learning rates very carefully.
And for Gaussians, the result is rather mediocre - a simple mean and covariance estimator would have
worked much better. However, whenever we don’t have a really good idea of what the distribution should
be, this is a very good way of faking it to the best of our abilities. Note that a lot depends on the power of
the discriminating network. If it is weak, the fake can be very different from the truth. E.g. in our case it
had trouble picking up anything along the axis of reduced variance. In summary, this isn’t exactly easy to
set and forget. One nice resource for dirty practioner’s knowledge is Soumith Chintala’s handy list of tricks
for how to babysit GANs.
For whinges or inquiries, open an issue on GitHub.
import mxnet as mx
from mxnet import gluon
from mxnet import ndarray as nd
from mxnet.gluon import nn, utils
from mxnet import autograd
import numpy as np
use_gpu = True
ctx = mx.gpu() if use_gpu else mx.cpu()
lr = 0.0002
beta1 = 0.5
First, we resize images to size 64 × 64. Then, we normalize all pixel values to the [−1, 1] range.
In [ ]: target_wd = 64
target_ht = 64
img_list = []
Visualize 4 images:
In [ ]: def visualize(img_arr):
plt.imshow(((img_arr.asnumpy().transpose(1, 2, 0) + 1.0) * 127.5).astype(np.uin
plt.axis('off')
for i in range(4):
plt.subplot(1,4,i+1)
visualize(img_list[i + 10][0])
plt.show()
ndf = 64
netD = nn.Sequential()
with netD.name_scope():
# input is (nc) x 64 x 64
netD.add(nn.Conv2D(ndf, 4, 2, 1, use_bias=False))
netD.add(nn.LeakyReLU(0.2))
# state size. (ndf) x 32 x 32
netD.add(nn.Conv2D(ndf * 2, 4, 2, 1, use_bias=False))
netD.add(nn.BatchNorm())
netD.add(nn.LeakyReLU(0.2))
# state size. (ndf) x 16 x 16
netD.add(nn.Conv2D(ndf * 4, 4, 2, 1, use_bias=False))
netD.add(nn.BatchNorm())
netD.add(nn.LeakyReLU(0.2))
# state size. (ndf) x 8 x 8
netD.add(nn.Conv2D(ndf * 8, 4, 2, 1, use_bias=False))
netD.add(nn.BatchNorm())
netD.add(nn.LeakyReLU(0.2))
# state size. (ndf) x 4 x 4
netD.add(nn.Conv2D(1, 4, 1, 0, use_bias=False))
stamp = datetime.now().strftime('%Y_%m_%d-%H_%M')
logging.basicConfig(level=logging.DEBUG)
with autograd.record():
# train with real image
output = netD(data).reshape((-1, 1))
errD_real = loss(output, real_label)
metric.update([real_label,], [output,])
trainerD.step(batch.data[0].shape[0])
############################
# (2) Update G network: maximize log(D(G(z)))
###########################
with autograd.record():
fake = netG(latent_z)
output = netD(fake).reshape((-1, 1))
errG = loss(output, real_label)
errG.backward()
trainerG.step(batch.data[0].shape[0])
metric.reset()
# logging.info('\nbinary training acc at epoch %d: %s=%f' % (epoch, name, acc))
# logging.info('time: %f' % (time.time() - tic))
5.4.6 Results
Given a trained generator, we can generate some images of faces.
In [ ]: num_image = 8
for i in range(num_image):
latent_z = mx.nd.random_normal(0, 1, shape=(1, latent_z_size, 1, 1), ctx=ctx)
img = netG(latent_z)
plt.subplot(2,4,i+1)
visualize(img[0])
plt.show()
We can also interpolate along the manifold between images by interpolating linearly between points in the
latent space and visualizing the corresponding images. We can see that small changes in the latent space
results in smooth changes in generated images.
In [ ]: num_image = 12
latent_z = mx.nd.random_normal(0, 1, shape=(1, latent_z_size, 1, 1), ctx=ctx)
step = 0.05
for i in range(num_image):
img = netG(latent_z)
plt.subplot(3,4,i+1)
visualize(img[0])
latent_z += 0.05
plt.show()
import tarfile
import matplotlib.image as mpimg
from matplotlib import pyplot as plt
import mxnet as mx
from mxnet import gluon
from mxnet import ndarray as nd
from mxnet.gluon import nn, utils
from mxnet.gluon.nn import Dense, Activation, Conv2D, Conv2DTranspose, \
BatchNorm, LeakyReLU, Flatten, HybridSequential, HybridBlock, Dropout
from mxnet import autograd
import numpy as np
use_gpu = True
ctx = mx.gpu() if use_gpu else mx.cpu()
lr = 0.0002
beta1 = 0.5
lambda1 = 100
pool_size = 50
We first resize images to size 512 * 256. Then normalize image pixel values to be between -1 and 1.
In [4]: img_wd = 256
img_ht = 256
train_img_path = '%s/train' % (dataset)
val_img_path = '%s/val' % (dataset)
def download_data(dataset):
if not os.path.exists(dataset):
url = 'https://people.eecs.berkeley.edu/~tinghuiz/projects/pix2pix/datasets
os.mkdir(dataset)
data_file = utils.download(url)
with tarfile.open(data_file) as tar:
tar.extractall(path='.')
os.remove(data_file)
download_data(dataset)
train_data = load_data(train_img_path, batch_size, is_reversed=True)
val_data = load_data(val_img_path, batch_size, is_reversed=True)
Visualize 4 images:
In [5]: def visualize(img_arr):
plt.imshow(((img_arr.asnumpy().transpose(1, 2, 0) + 1.0) * 127.5).astype(np.uin
plt.axis('off')
def preview_train_data():
img_in_list, img_out_list = train_data.next().data
for i in range(4):
plt.subplot(2,4,i+1)
visualize(img_in_list[i])
plt.subplot(2,4,i+5)
visualize(img_out_list[i])
plt.show()
preview_train_data()
PatchGAN – that only penalizes structure at the scale of patches is applied as disciminator architecture.
This discriminator tries to classify if each N × N patch in an image is real or fake. We run this discriminator
convolutionally across the image, averaging all responses to provide the ultimate output of netD.
In [6]: # Define Unet generator skip block
class UnetSkipUnit(HybridBlock):
def __init__(self, inner_channels, outer_channels, inner_block=None, innermost=
use_dropout=False, use_bias=False):
super(UnetSkipUnit, self).__init__()
with self.name_scope():
self.outermost = outermost
en_conv = Conv2D(channels=inner_channels, kernel_size=4, strides=2, pad
in_channels=outer_channels, use_bias=use_bias)
en_relu = LeakyReLU(alpha=0.2)
en_norm = BatchNorm(momentum=0.1, in_channels=inner_channels)
de_relu = Activation(activation='relu')
de_norm = BatchNorm(momentum=0.1, in_channels=outer_channels)
if innermost:
de_conv = Conv2DTranspose(channels=outer_channels, kernel_size=4, s
in_channels=inner_channels, use_bias=use_
self.model = HybridSequential()
with self.model.name_scope():
for block in model:
self.model.add(block)
with self.name_scope():
self.model = unet
with self.name_scope():
self.model = HybridSequential()
kernel_size = 4
padding = int(np.ceil((kernel_size - 1)/2))
self.model.add(Conv2D(channels=ndf, kernel_size=kernel_size, strides=2,
padding=padding, in_channels=in_channels))
self.model.add(LeakyReLU(alpha=0.2))
nf_mult = 1
for n in range(1, n_layers):
nf_mult_prev = nf_mult
nf_mult = min(2 ** n, 8)
self.model.add(Conv2D(channels=ndf * nf_mult, kernel_size=kernel_si
padding=padding, in_channels=ndf * nf_mult_pr
use_bias=use_bias))
self.model.add(BatchNorm(momentum=0.1, in_channels=ndf * nf_mult))
self.model.add(LeakyReLU(alpha=0.2))
nf_mult_prev = nf_mult
nf_mult = min(2 ** n_layers, 8)
self.model.add(Conv2D(channels=ndf * nf_mult, kernel_size=kernel_size,
padding=padding, in_channels=ndf * nf_mult_prev,
use_bias=use_bias))
self.model.add(BatchNorm(momentum=0.1, in_channels=ndf * nf_mult))
self.model.add(LeakyReLU(alpha=0.2))
self.model.add(Conv2D(channels=1, kernel_size=kernel_size, strides=1,
padding=padding, in_channels=ndf * nf_mult))
if use_sigmoid:
self.model.add(Activation(activation='sigmoid'))
5.5.4 Construct networks, Initialize parameters, Setup Loss Function and Opti-
mizer
We use binary cross entropy and L1 loss as loss functions. L1 loss can be used to capture low frequencies
in images.
In [7]: def param_init(param):
if param.name.find('conv') != -1:
if param.name.find('weight') != -1:
param.initialize(init=mx.init.Normal(0.02), ctx=ctx)
else:
param.initialize(init=mx.init.Zero(), ctx=ctx)
elif param.name.find('batchnorm') != -1:
param.initialize(init=mx.init.Zero(), ctx=ctx)
# Initialize gamma from normal distribution with mean 1 and std 0.02
if param.name.find('gamma') != -1:
param.set_data(nd.random_normal(1, 0.02, param.data().shape))
def network_init(net):
def set_network():
# Pixel2pixel networks
netG = UnetGenerator(in_channels=3, num_downs=8)
netD = Discriminator(in_channels=6)
# Initialize parameters
network_init(netG)
network_init(netD)
# Loss
GAN_loss = gluon.loss.SigmoidBinaryCrossEntropyLoss()
L1_loss = gluon.loss.L1Loss()
ret_imgs.append(image)
ret_imgs = nd.concat(*ret_imgs, dim=0)
return ret_imgs
def train():
image_pool = ImagePool(pool_size)
metric = mx.metric.CustomMetric(facc)
stamp = datetime.now().strftime('%Y_%m_%d-%H_%M')
logging.basicConfig(level=logging.DEBUG)
fake_out = netG(real_in)
fake_concat = image_pool.query(nd.concat(real_in, fake_out, dim=1))
with autograd.record():
# Train with fake image
# Use image pooling to utilize history images
output = netD(fake_concat)
fake_label = nd.zeros(output.shape, ctx=ctx)
errD_fake = GAN_loss(output, fake_label)
metric.update([fake_label,], [output,])
metric.update([real_label,], [output,])
trainerD.step(batch.data[0].shape[0])
############################
# (2) Update G network: maximize log(D(x, G(x, z))) - lambda1 * L1(y, G
###########################
with autograd.record():
fake_out = netG(real_in)
fake_concat = nd.concat(real_in, fake_out, dim=1)
output = netD(fake_concat)
real_label = nd.ones(output.shape, ctx=ctx)
errG = GAN_loss(output, real_label) + L1_loss(real_out, fake_out) *
errG.backward()
trainerG.step(batch.data[0].shape[0])
train()
5.5.7 Results
Generate images with generator.
In [10]: def print_result():
num_image = 4
img_in_list, img_out_list = val_data.next().data
for i in range(num_image):
img_in = nd.expand_dims(img_in_list[i], axis=0)
plt.subplot(2,4,i+1)
visualize(img_in[0])
img_out = netG(img_in.as_in_context(ctx))
plt.subplot(2,4,i+5)
visualize(img_out[0])
plt.show()
print_result()
5.5.9 Citation
CMP Facades dataset: @INPROCEEDINGS{ Tylecek13, author = {Radim Tyle{ č }ek, Radim { Š }{‘
a}ra}, title = {Spatial Pattern Templates for Recognition of Objects with Regular Structure}, booktitle =
{Proc. GCPR}, year = {2013}, address = {Saarbrucken, Germany}, }
Cityscapes training set: @inproceedings{Cordts2016Cityscapes, title={The Cityscapes Dataset for Seman-
tic Urban Scene Understanding}, author={Cordts, Marius and Omran, Mohamed and Ramos, Sebastian
and Rehfeld, Timo and Enzweiler, Markus and Benenson, Rodrigo and Franke, Uwe and Roth, Stefan and
Schiele, Bernt}, booktitle={Proc. of the IEEE Conference on Computer Vision and Pattern Recognition
(CVPR)}, year={2016} }
While networks trained using this approach usually perform well in regions with lots of data, they fail to
express uncertainity in regions with little or no data, leading to overconfident decisions. This drawback
motivates the application of Bayesian learning to neural networks, introducing probability distributions over
the weights. These distributions can be of various nature in theory. To make our lifes easier and to have an
intuitive understanding of the distribution at each weight, we will use a Gaussian distribution.
Unfortunately though, exact Bayesian inference on the parameters of a neural network is intractable. One
promising way of addressing this problem is presented by the “Bayes by Backprop” algorithm (introduced
by the “Weight Uncertainity in Neural Networks” paper) which derives a variational approximation to the
true posterior. This algorithm does not only make networks more “honest” with respect to their overall
uncertainity, but also automatically leads to regularization, thereby eliminating the need of using dropout in
this model.
While we will try to explain the most important concepts of this algorithm in this notebook, we also encour-
age the reader to consult the paper for deeper insights.
Let’s start implementing this idea and evaluate its performance on the MNIST classification problem. We
start off with the usual set of imports.
In [1]: from __future__ import print_function
import collections
import mxnet as mx
import numpy as np
from mxnet import nd, autograd
from matplotlib import pyplot as plt
For easy tuning and experimentation, we define a dictionary holding the hyper-parameters of our model.
In [2]: config = {
"num_hidden_layers": 2,
"num_hidden_units": 400,
"batch_size": 128,
"epochs": 10,
"learning_rate": 0.001,
"num_samples": 1,
"pi": 0.25,
"sigma_p": 1.0,
"sigma_p1": 0.75,
"sigma_p2": 0.1,
}
mnist = mx.test_utils.get_mnist()
num_inputs = 784
num_outputs = 10
batch_size = config['batch_size']
In order to reproduce and compare the results from the paper, we preprocess the pixels by dividing by 126.
Activation function
As with lots of past examples, we will again use the ReLU as our activation function for the hidden units of
our neural network.
In [5]: def relu(X):
return nd.maximum(X, nd.zeros_like(X))
The resulting loss function, commonly referred to as either variational free energy or expected lower bound
(ELBO), has to be minimized and is then given as follows:
As one can easily see, the cost function tries to balance the complexity of the data 𝑃 (𝒟 | w) and the
simplicity of the prior 𝑃 (w).
We can approximate this exact cost through a Monte Carlo sampling procedure as follows
𝑛
∑︁
ℱ(𝒟, 𝜃) ≈ log 𝑞(w(𝑖) | 𝜃) − log 𝑃 (w(𝑖) ) − log 𝑃 (𝒟 | w(𝑖) )
𝑖=1
where w(𝑖) corresponds to the 𝑖-th Monte Carlo sample from the variational posterior. While writing this
notebook, we noticed that even taking just one sample leads to good results and we will therefore stick to
just sampling once throughout the notebook.
Since we will be working with mini-batches, the exact loss form on mini-batch 𝑖 we will be using looks as
follows:
1
ℱ(𝒟𝑖 , 𝜃) = KL[log 𝑞(w | 𝜃) || log 𝑃 (w)] − E𝑞(w | 𝜃) [log 𝑃 (𝒟𝑖 | w)]
𝑀
1
≈ (log 𝑞(w(1) | 𝜃) − log 𝑃 (w(1) )) − log 𝑃 (𝒟𝑖 | w(1) )
𝑀
where 𝑀 corresponds to the number of batches, and ℱ(𝒟, 𝜃) = 𝑀
∑︀
𝑖=1 ℱ(𝒟𝑖 , 𝜃)
Likelihood
As with lots of past examples, we will again use the softmax to define our likelihood 𝑃 (𝒟𝑖 | w). Revisit the
MLP from scratch notebook for a detailed motivation of this function.
In [7]: def log_softmax_likelihood(yhat_linear, y):
return nd.nansum(y * nd.log_softmax(yhat_linear), axis=0, exclude=True)
Prior
Since we are introducing a Bayesian treatment for the network, we need to define a Prior over the weights.
Gaussian prior
A popular and simple prior is the Gaussian distribution. The prior over the entire weight vector simply
corresponds to the prodcut of the individual Gaussians:
∏︁
𝑃 (w) = 𝒩 (w𝑖 | 0, 𝜎𝑝2 )
𝑖
We can define the Gaussian distribution and our Gaussian prior as seen below. Note that we are ultimately
intersted in the log-prior log 𝑃 (w) and therefore compute the sum of the log-Gaussians.
∑︁
log 𝑃 (w) = log 𝒩 (w𝑖 | 0, 𝜎𝑝2 )
𝑖
def gaussian_prior(x):
sigma_p = nd.array([config['sigma_p']], ctx=ctx)
Instead of a single Gaussian, the paper also suggests the usage of a scale mixture prior for 𝑃 (w) as an
alternative:
∏︁ (︂ )︂
2 2
𝑃 (w) = 𝜋𝒩 (w𝑖 | 0, 𝜎1 ) + (1 − 𝜋)𝒩 (w𝑖 | 0, 𝜎2 )
𝑖
where 𝜋 ∈ [0, 1], 𝜎1 > 𝜎2 and 𝜎2 ≪ 1. Again we are intersted in the log-prior log 𝑃 (w), which can be
expressed as follows:
∑︁ (︂ )︂
2 2
log 𝑃 (w) = log 𝜋𝒩 (w𝑖 | 0, 𝜎1 ) + (1 − 𝜋)𝒩 (w𝑖 | 0, 𝜎2 )
𝑖
def scale_mixture_prior(x):
sigma_p1 = nd.array([config['sigma_p1']], ctx=ctx)
sigma_p2 = nd.array([config['sigma_p2']], ctx=ctx)
pi = config['pi']
Variational Posterior
The last missing piece in the equation is the variational posterior. Again, we choose a Gaussian disribution
for this purpose. The variational posterior on the weights is centered on the mean vector 𝜇 and has variance
𝜎2: ∏︁
𝑞(w | 𝜃) = 𝒩 (w𝑖 | 𝜇, 𝜎 2 )
𝑖
Combined Loss
After introducing the data likelihood, the prior, and the variational posterior, we are now able to build our
1
combined loss function: ℱ(𝒟𝑖 , 𝜃) = 𝑀 (log 𝑞(w | 𝜃) − log 𝑃 (w)) − log 𝑃 (𝒟𝑖 | w)
In [10]: def combined_loss(output, label_one_hot, params, mus, sigmas, log_prior, log_likel
# Calculate prior
log_prior_sum = sum([nd.sum(log_prior(param)) for param in params])
5.6.4 Optimizer
We use vanilla stochastic gradient descent to optimize the variational parameters. Note that this implements
the updates described in the paper, as the gradient contribution due to the reparametrization trick is au-
tomatically included by taking the gradients of the combined loss function with respect to the variational
parameters.
In [11]: def SGD(params, lr):
for param in params:
param[:] = param - lr * param.grad
Since these are the parameters we wish to do gradient descent on, we need to allocate space for storing the
gradients.
In [14]: for param in variational_params:
param.attach_grad()
return epsilons
2. Transform 𝜌 to a postive vector via the softplus function: 𝜎 = softplus(𝜌) = log(1 + exp(𝜌))
In [16]: def softplus(x):
return nd.log(1. + nd.exp(x))
def transform_rhos(rhos):
return [softplus(rho) for rho in rhos]
Complete loop
The complete training loop is given below.
In [18]: epochs = config['epochs']
learning_rate = config['learning_rate']
smoothing_constant = .01
train_acc = []
test_acc = []
for e in range(epochs):
for i, (data, label) in enumerate(train_data):
data = data.as_in_context(ctx).reshape((-1, 784))
label = label.as_in_context(ctx)
label_one_hot = nd.one_hot(label, 10)
with autograd.record():
# sample epsilons from standard normal
epsilons = sample_epsilons(layer_param_shapes)
SGD(variational_params, learning_rate)
plt.plot(train_acc)
plt.plot(test_acc)
plt.show()
Epoch 0. Loss: 2626.47417991, Train_acc 0.945617, Test_acc 0.9455
Epoch 1. Loss: 2606.28165139, Train_acc 0.962783, Test_acc 0.9593
Epoch 2. Loss: 2600.2452303, Train_acc 0.969783, Test_acc 0.9641
Epoch 3. Loss: 2595.75639899, Train_acc 0.9753, Test_acc 0.9684
Epoch 4. Loss: 2592.98582057, Train_acc 0.978633, Test_acc 0.9723
Epoch 5. Loss: 2590.05895182, Train_acc 0.980483, Test_acc 0.9733
Epoch 6. Loss: 2588.57918775, Train_acc 0.9823, Test_acc 0.9756
Epoch 7. Loss: 2586.00932367, Train_acc 0.984, Test_acc 0.9749
Epoch 8. Loss: 2585.4614887, Train_acc 0.985883, Test_acc 0.9765
Epoch 9. Loss: 2582.92995846, Train_acc 0.9878, Test_acc 0.9775
For demonstration purposes, we can now take a look at one particular weight by plotting its distribution.
In [19]: def show_weight_dist(mean, variance):
sigma = nd.sqrt(variance)
x = np.linspace(mean.asscalar() - 4*sigma.asscalar(), mean.asscalar() + 4*sigm
plt.plot(x, gaussian(nd.array(x, ctx=ctx), mean, sigma).asnumpy())
plt.show()
mu = mus[0][0][0]
var = softplus(rhos[0][0][0]) ** 2
show_weight_dist(mu, var)
Great! We have obtained a fully functional Bayesian neural network. However, the number of weights now
is twice as high as for traditional neural networks. As we will see in the final section of this notebook, we
are able to drastically reduce the number of weights our network uses for prediction with weight pruning.
We further introduce a few helper methods which turn our list of weights into a single vector containing all
weights. This will make our subsequent actions easier.
In [21]: def vectorize_matrices_in_vector(vec):
return vec
def concact_vectors_in_vector(vec):
concat_vec = vec[0]
for i in range(1, len(vec)):
concat_vec = nd.concat(concat_vec, vec[i], dim=0)
return concat_vec
def transform_vector_structure(vec):
vec = vectorize_matrices_in_vector(vec)
vec = concact_vectors_in_vector(vec)
return vec
In addition, we also have a helper method which transforms the pruned weight vector back to the original
layered structure.
In [22]: from functools import reduce
import operator
def prod(iterable):
return reduce(operator.mul, iterable, 1)
def restore_weight_structure(vec):
pruned_weights = []
index = 0
return pruned_weights
The actual pruning of the vector happens in the following function. Note that this function accepts an
ordered list of percentages to evaluate the performance at different pruning rates. In this setting, pruning at
each iteration means extracting the index of the lowest signal-to-noise-ratio weight and setting the weight at
this index to 0.
In [23]: def prune_weights(sign_to_noise_vec, prediction_vector, percentages):
pruning_indices = nd.argsort(sign_to_noise_vec, axis=0)
mus_copy = mus.copy()
mus_copy_vec = transform_vector_structure(mus_copy)
Depending on the number of units used in the original network and the number of training epochs, the highest
achievable pruning percentages (without significantly reducing the predictive performance) can vary. The
paper, for example, reports almost no change in the test accuracy when pruning 95% of the weights in
a 2x1200 unit Bayesian neural network, which creates a significantly sparser network, leading to faster
predictions and reduced memory requirements.
5.6.9 Conclusion
We have taken a look at an efficient Bayesian treatment for neural networks using variational inference via
the “Bayes by Backprop” algorithm (introduced by the “Weight Uncertainity in Neural Networks” paper).
We have implemented a stochastic version of the variational lower bound and optimized it in order to find
an approximation to the posterior distribution over the weights of a MLP network on the MNIST data set.
As a result, we achieve regularization on the network’s parameters and can quantify our uncertainty about
the weights accurately. Finally, we saw that it is possible to significantly reduce the number of weights in
the neural network after training while still keeping a high accuracy on the test set.
We also note that, given this model implementation, we were able to reproduce the paper’s results on the
MNIST data set, achieving a comparable test accuracy for all documented instances of the MNIST classifi-
cation problem.
For whinges or inquiries, open an issue on GitHub.
For easy tuning and experimentation, we define a dictionary holding the hyper-parameters of our model.
In [ ]: config = {
"num_hidden_layers": 2,
"num_hidden_units": 400,
"batch_size": 128,
"epochs": 10,
"learning_rate": 0.001,
"num_samples": 1,
"pi": 0.25,
"sigma_p": 1.0,
"sigma_p1": 0.75,
"sigma_p2": 0.01,
}
mnist = mx.test_utils.get_mnist()
num_inputs = 784
num_outputs = 10
batch_size = config['batch_size']
In order to reproduce and compare the results from the paper, we preprocess the pixels by dividing by 126.
that Bayes by Backprop should be thought of as a training method, rather than a special architecture.
In [ ]: num_layers = config['num_hidden_layers']
num_hidden = config['num_hidden_units']
net = gluon.nn.Sequential()
with net.name_scope():
for i in range(num_layers):
net.add(gluon.nn.Dense(num_hidden, activation="relu"))
net.add(gluon.nn.Dense(num_outputs))
Then we have to forward-propagate a single data set entry once to set up all network parameters (weights
and biases) with the desired initliaizer specified above.
In [ ]: for i, (data, label) in enumerate(train_data):
data = data.as_in_context(ctx).reshape((-1, 784))
net(data)
break
In [ ]: weight_scale = .1
rho_offset = -3
5.7.5 Optimizer
Now, we still have to choose the optimizer we wish to use for training. This time, we are using the adam
optimizer.
2. Transform 𝜌 to a positive vector via the softplus function: 𝜎 = softplus(𝜌) = log(1 + exp(𝜌))
In [ ]: def softplus(x):
return nd.log(1. + nd.exp(x))
def transform_rhos(rhos):
return [softplus(rho) for rho in rhos]
Evaluation metric
In order to being able to assess our model performance we define a helper function which evaluates our
accuracy on an ongoing basis.
In [ ]: def evaluate_accuracy(data_iterator, net, layer_params):
numerator = 0.
denominator = 0.
for i, (data, label) in enumerate(data_iterator):
data = data.as_in_context(ctx).reshape((-1, 784))
label = label.as_in_context(ctx)
output = net(data)
predictions = nd.argmax(output, axis=1)
numerator += nd.sum(predictions == label)
denominator += data.shape[0]
return (numerator / denominator).asscalar()
Complete loop
The complete training loop is given below.
In [ ]: epochs = config['epochs']
learning_rate = config['learning_rate']
smoothing_constant = .01
train_acc = []
test_acc = []
for e in range(epochs):
for i, (data, label) in enumerate(train_data):
data = data.as_in_context(ctx).reshape((-1, 784))
label = label.as_in_context(ctx)
label_one_hot = nd.one_hot(label, 10)
with autograd.record():
# generate sample
layer_params, sigmas = generate_weight_sample(shapes, raw_mus, raw_rhos
trainer.step(data.shape[0])
plt.plot(train_acc)
plt.plot(test_acc)
plt.show()
For demonstration purposes, we can now take a look at one particular weight by plotting its distribution.
In [ ]: def gaussian(x, mu, sigma):
scaling = 1.0 / nd.sqrt(2.0 * np.pi * (sigma ** 2))
bell = nd.exp(- (x - mu) ** 2 / (2.0 * sigma ** 2))
mu = raw_mus[0][0][0]
var = softplus(raw_rhos[0][0][0]) ** 2
show_weight_dist(mu, var)
We further introduce a few helper methods which turn our list of weights into a single vector containing all
weights. This will make our subsequent actions easier.
In [ ]: def vectorize_matrices_in_vector(vec):
for i in range(0, (num_layers + 1) * 2, 2):
if i == 0:
vec[i] = nd.reshape(vec[i], num_inputs * num_hidden)
elif i == num_layers * 2:
vec[i] = nd.reshape(vec[i], num_hidden * num_outputs)
else:
vec[i] = nd.reshape(vec[i], num_hidden * num_hidden)
return vec
def concact_vectors_in_vector(vec):
concat_vec = vec[0]
for i in range(1, len(vec)):
concat_vec = nd.concat(concat_vec, vec[i], dim=0)
return concat_vec
def transform_vector_structure(vec):
vec = vectorize_matrices_in_vector(vec)
vec = concact_vectors_in_vector(vec)
return vec
In addition, we also have a helper method which transforms the pruned weight vector back to the original
layered structure.
In [ ]: from functools import reduce
import operator
def prod(iterable):
return reduce(operator.mul, iterable, 1)
def restore_weight_structure(vec):
pruned_weights = []
index = 0
return pruned_weights
The actual pruning of the vector happens in the following function. Note that this function accepts an
ordered list of percentages to evaluate the performance at different pruning rates. In this setting, pruning at
each iteration means extracting the index of the lowest signal-to-noise-ratio weight and setting the weight at
this index to 0.
In [ ]: def prune_weights(sign_to_noise_vec, prediction_vector, percentages):
pruning_indices = nd.argsort(sign_to_noise_vec, axis=0)
mus_copy = raw_mus.copy()
mus_copy_vec = transform_vector_structure(mus_copy)
Depending on the number of units used in the original network, the highest achievable pruning percent-
ages (without significantly reducing the predictive performance) can vary. The paper, for example, reports
almost no change in the test accuracy when pruning 95% of the weights in a 1200 unit Bayesian neural
network, which creates a significantly sparser network, leading to faster predictions and reduced memory
requirements.
5.7.8 Conclusion
We have taken a look at an efficient Bayesian treatment for neural networks using variational inference via
the “Bayes by Backprop” algorithm (introduced by the “Weight Uncertainity in Neural Networks” paper).
We have implemented a stochastic version of the variational lower bound and optimized it in order to find
an approximation to the posterior distribution over the weights of a MLP network on the MNIST data set.
As a result, we achieve regularization on the network’s parameters and can quantify our uncertainty about
the weights accurately. Finally, we saw that it is possible to significantly reduce the number of weights in
the neural network after training while still keeping a high accuracy on the test set.
We also note that, given this model implementation, we were able to reproduce the paper’s results on the
MNIST data set, achieving a comparable test accuracy for all documented instances of the MNIST classifi-
cation problem.
For whinges or inquiries, open an issue on GitHub.