[Data Literacy] Plotting/Graphing

Plotting/Graphing

Defining and Exploring Common Distributions

Overview

Purpose

This module introduces students to some common plots and graphs and discusses when to use each depending on the data and the context. It also defines and explores various distributions with analysis of their properties and examples from the real world.

Lessons


Table of Contents

Overview

Purpose

Lessons

Table of Contents

Introduction to Plotting and Graphing

Plot vs. Graph

Why Do We Plot?

Common Plots

Scatter

Line

Map

Bar

Histogram

1-D vs. 2-D vs. N-D

Exercise: Exploring Plots and Graphs

Outliers

Where Do Outliers Come From?

How Do We Handle Outliers?

Advanced Topic: Introduction to Distributions

Normal Distribution

Definition

Properties

When is Data Normal?

Exercise: Is It Normal?

Exceptions to the Rule

Exponential

Other Distributions (Optional)

Bernoulli

Binomial


Introduction to Plotting and Graphing

Plot vs. Graph

Although “plot” and “graph” are often used interchangeably, there is a technical difference between the two. A plot is a visualization of a data set composed of finitely many points, while a graph is a visualization of a function, which can have infinitely many points. Usually, when we are analyzing data, we use plots to visualize the dataset. Later, when we talk about distributions, we will be talking about functions and, therefore, using graphs to visualize them.

“Plot” and “graph” are also verbs that are used to describe making the plot or graph e.g. “plotting” a dataset or “graphing” a function. For the rest of this section, we will focus on plots.

Why Do We Plot?

It is often difficult for humans to stare at a dataset and observe trends or patterns immediately. However, there are certain characteristics that the human eye notices quickly and easily, such as size, color, and relative positioning. Therefore, plots allow us to take advantage of this fact to make analyzing a dataset easier.

Once we plot a dataset, it becomes easier to spot trends, determine centrality, find outliers, etc. For example, the scatter plot below plots the prevalence of coronary heart disease and difference in projected max temperature relative to 2006 for every county in the United States.

This figure plots the max temperature difference relative to 2006 and coronary heart disease prevalence for every county in the United States. We can observe two outliers in the upper right that would otherwise be difficult to spot by just looking at the numerical data. Try clicking “Source” below to determine which counties the outliers represent and what they have in common.

[Source]

In general, a good plot will utilize the data’s inherent structure and what humans can easily observe to depict the trend or pattern. By doing so, we can spot trends, outliers, relationships, centrality, etc. much more easily than by looking at the raw data.

However, not all plots are equally appropriate for every data set. The next section will discuss some common plots and when to use them.

Common Plots

Scatter

A scatter plot involves plotting two variables on two respective axes, like the plot we saw in the previous section. Scatter plots are most useful for detecting relationships between two variables or, in other words, answering the question “if one variable goes up, does the other variable usually go up, down, or stay the same?”

The figure above plots median gross rent (for rental units that have their rent paid in cash) and the median income for each zip code in New York for each year between 2016 and 2020. We can see that median rent and median income have a relationship: when one goes up, the other usually goes up as well.

[Source]

It is important to note that when two variables appear to have a relationship, it is not necessarily a causal relationship. The strongest conclusion we can draw is that when one variable increases, the other also tends to increase or decrease. However, we cannot definitively say that the first variable caused the second to go up or down since it could have been the second variable that caused the first to change or a third variable that caused both to change.

For example, when there is more sunshine, there are more people who are sunburnt. By looking at the data, we can see a relationship between sunshine and sunburns but we cannot determine what caused the changes without more complicated analysis or prior/expert knowledge.

Similarly, when there is more sunshine, there are also more ice cream trucks in the streets. Looking at a scatter plot of ice cream trucks and sunburns would tell us that the two variables are correlated (when one goes up, so does the other). However, we know that ice cream trucks do not cause sunburns and sunburns do not lead to more ice cream trucks. Scatter plots rarely show the whole story and they almost never demonstrate causality.

Later modules will go into this problem in more depth but for now, it is sufficient to remember that scatter plots only show a correlation not a causation (here’s a good video that explains this issue in greater detail). 

Additionally, with insufficient data, it is easy to (wrongly!) assume that there is a relationship between two variables that are actually completely unrelated. For example, the bar chart below has just two data points: one for the 80s and one for the 90s. Using just these two data points, it seems as if seat belt use (in automobiles) and astronaut deaths are negatively correlated but if more data were added (e.g. in years when seat belt use and astronaut deaths both decreased), that relationship would disappear.

A bar chart that shows seat belt use in automobiles and number of astronaut deaths for the 80s and 90s.

[Source]

To summarize, scatter plots are useful for spotting a relationship between two variables but it is important to remember that this relationship is not necessarily causal and may not exist when the number of datapoints is small.

Line

A line graph plots the values of two variables on separate axes with lines connecting data points.

The line graph above plots the number of players for each card game (on the x-axis) on Wednesdays and Sundays. We can see that Spades is the most popular game and Poker is the least popular. Drawing lines in this graph tells us that the x-axis is sorted in order of increasing popularity on Wednesdays. It also tells us that on Sundays, Hearts becomes less popular than Cribbage.

[Source]

This line graph above plots the pressure of nitrogen at varying temperatures. We can observe a correlation between the two variables.

[Source]

Line graphs are often used to represent trends such as a consistent increase or decrease or oscillation. However, they can also be used to spot more complicated patterns, especially when multiple variables are plotted on the same figure, or to spot abnormalities (i.e. the variable was decreasing but then increased sharply in year X, such as in the figure below).

The most popular use of line graphs involves analyzing data over time by plotting time on the x-axis (see example below). There will be an entire module later that covers analyzing time-series data (i.e. data collected over time) which will use line graphs amongst other tools.

This line graph plots the US unemployment rate for every month between January 1948 and May 2022. We can observe several patterns from this graph: unemployment tends to (1) oscillate and (2) experience large spikes during significant events such as the 2008 recession or the 2020 pandemic. We can see that the 2020 spike was an abnormality and most spikes are not as steep and do not return to the previous value as quickly.

[Source]

Map

Map plots involve labeling a map with data values, usually to demonstrate a spatial trend or pattern or to present the information in a familiar format.

This map plot shows that the number of tornado events between 2005 and 2021 is higher in the American Midwest and gets higher towards the south.

[Source]

This map plot shows the median age in 2020 for each county in Texas.

[Source]

This map plot from the New York Times shows the total number of Covid cases for each country in the world by using dot sizes to represent magnitude (yellow dots indicate missing data).

[Source]

As we’ve seen, there are many options for labeling the map with the data values, such as using a color scale, dot sizes, using the numbers themselves, or some other format. Map plots can even handle non-numerical data, as shown below. Later sections will discuss this in more detail.

The map above shows the most popular Halloween candy in every state.

[Source]

Additionally, since a map may include locations that do not have any data, it is important to recognize when data is missing and indicate the missing data on the map.

Bar

A bar graph is a plot that displays categorical data using bars of different heights. It uses size differences to demonstrate differences in magnitude and is a good tool to compare two or more values for any variable across categories.

The bar graph above plots the poverty rate for six cities near San Francisco, CA. The scale on the left y-axis indicates what values the heights correspond to. We can easily see that Oakland, CA has the highest poverty rate out of the six cities.

[Source]

Generally, bar graphs are good for comparing numerical values across different categories.

Histogram

A histogram is a plot that groups data into consecutive numerical ranges, each represented by a vertical bar. It looks similar to a bar graph, but the key difference is that bar graphs have categories on the x-axis (e.g. cities) while histograms have numerical ranges.

Histograms are useful for grouping continuous data into chunks to spot trends or patterns. A common use case is frequency histograms, where the height of each bar represents the frequency or the count of a variable for all numbers in that range.

The histogram above plots income for New York and the United States using numerical ranges of $10,000 increments.

[Source: DataUSA]
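Binning values into fixed-width ranges, as the income histogram above does, can be sketched with the standard library. The incomes below are hypothetical, chosen only to show the mechanics of $10,000 bins.

```python
from collections import Counter

# Hypothetical household incomes in dollars -- illustrative only.
incomes = [12_000, 18_000, 25_000, 31_000, 34_000, 47_000, 52_000, 58_000]
bin_size = 10_000

# Map each income to the lower edge of its $10,000 bin and count per bin.
counts = Counter((x // bin_size) * bin_size for x in incomes)
for lower in sorted(counts):
    bar = "#" * counts[lower]
    print(f"${lower:>6,}-${lower + bin_size - 1:,}: {bar}")
```

Changing `bin_size` changes the shape of the histogram, a point the exercise later in this module explores.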

1-D vs. 2-D vs. N-D

So far, we’ve mainly seen how to plot one or two variables at a time. What do you do if you have multiple variables? In that case, the best way to visualize that data depends on the type of analysis you’d like to do.

The first type of analysis, which we will call “side-by-side” comparison, involves the comparison of two or more variables to each other. For this kind of analysis, it is usually sufficient to use a bar chart, histogram, or line graph with multiple bars or lines (see below for examples). It is also possible to plot several variables on a map plot by using a color scale with two axes (see below), but this is significantly more complicated and can often be hard to design well.

A bar graph that has multiple variables plotted side by side.

[Source]

A line graph that has multiple variables plotted side by side.

[Source]

This map plot has a biaxial color scale to plot both median income and the prevalence of binge drinking for each county across the US.

[Source]

The second type of analysis, which we will call “joint” comparison, involves observing the relationship between three or more variables. This type of analysis is more complicated and there are several options for how to handle additional variables.

The most natural option is to add a third axis to a scatter plot to be able to equally visualize the relationship between all three variables. However, this option can only be used for analysis of 3 variables at a time, relies on user interaction to rotate the graph, and cannot be used in print media like textbooks or newspapers.

A 3-dimensional scatter plot of three variables.

[Source]

Another option is to use a color scale to color in dots on a scatter plot, where the color is determined by the value of the third variable. However, this is often inaccessible for those who are colorblind so the colors should be carefully chosen with those conditions in mind.

A more accessible option is to scale the size of the dot using the values of the third variable. However, this can be confusing sometimes since it can be hard to properly interpret the relative scale. If the radius is proportional to some variable, doubling the radius will quadruple the area of the dot so it appears to be four times larger instead of just two times larger. Care must be taken to indicate how the scaling works exactly.
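One way to avoid that distortion is to scale the radius by the square root of the value, so the dot’s area (what the eye actually compares) is proportional to the value. A small sketch of the arithmetic:

```python
import math

def dot_radius(value, base=1.0):
    # Scale the radius by sqrt(value) so the dot's AREA, not its radius,
    # is proportional to the value being plotted.
    return base * math.sqrt(value)

def dot_area(radius):
    return math.pi * radius ** 2

# Doubling the value now doubles the perceived area instead of quadrupling it.
ratio = dot_area(dot_radius(4)) / dot_area(dot_radius(2))
print(round(ratio, 6))
```

If the radius itself were proportional to the value, the same doubling would produce a 4x area ratio, which is exactly the misreading described above.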

An obvious solution to both of these problems (misunderstanding of scaling and inaccessibility) is to just label the dots with the values of the third variable. However, this can make it difficult to spot trends easily.

In general, evaluating relationships for three or more variables at a time can be difficult and while options exist for visualizing them, each option has its own drawback. Combining options is a good way to cover all bases (e.g. using both color and labels for the third variable).

What if we’d like to analyze more than three variables? In that case, we can combine the above options and use one option per variable. This method is still not ideal since each option has its own issues, but it makes it possible to analyze beyond three dimensions.

Exercise: Exploring Plots and Graphs

In this exercise, we will practice interpreting different types of plots and graphs across several variables.

In a Google search bar, search “number of african american women in cambridge”. A graphic like the one shown below should pop up:


Click “Explore more →” at the bottom of the graphic. Then, look through the graphs and tables to answer the questions below.

  1. In what year did the unemployment rate increase the most?
  2. Which city shown on the page has the greatest difference in median income between genders? What is this difference, approximately?
  3. Has the poverty rate in Cambridge been (a) steadily increasing, (b) steadily decreasing, or (c) none of the above?
  4. Which category of highest level of education achieved (i.e. educational attainment) is most common in Somerville, MA?
  5. During which years did Cambridge experience the greatest decrease in violent crime?

Outliers

Formally, an outlier is a data point that is abnormally far from the other values. However, the definition of “abnormal” depends on the context, as we’ll see later.

Where Do Outliers Come From?

There are a couple of reasons for seeing an outlier in your data. A common reason is that the variable you are measuring could have a wide range of values where the extremes are uncommon (e.g. a height of seven feet would be rare and an outlier in most data sets). Other reasons include an error when measuring or collecting the data or an insufficient amount of data. For example, if you roll a die 5 times and you get a 1, 2, 1, 1, and a 6, it appears that 6 is an outlier, but we know that 6 is just as likely as any other number on the die.

How Do We Handle Outliers?

The significance of outliers depends on the context and the type of analysis. For example, if the goal is to measure central tendency, outliers are less important as long as they can be handled properly. However, if the goal is to measure variability or spread (e.g. income inequality), outliers are very relevant to the analysis.

Therefore, when measuring central tendency, it can be helpful to omit outliers or weigh them less during analysis. When measuring spread, it is important to verify that the outlier does not come from a data collection error or an insufficient data issue.
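One common rule of thumb for “abnormally far” (a convention, not the only valid definition) flags points more than 1.5 interquartile ranges outside the quartiles. A minimal sketch using Python’s standard library:

```python
import statistics

def iqr_outliers(data):
    # The 1.5-IQR rule: flag points more than 1.5 interquartile ranges
    # below the first quartile or above the third quartile.
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < low or x > high]

heights = [65, 67, 68, 68, 69, 70, 71, 72, 84]  # inches; 84 in = 7 ft
print(iqr_outliers(heights))  # the 7-foot height is flagged
```

Whether you then drop, down-weight, or keep the flagged points depends on the analysis goal, as described above.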

Advanced Topic: Introduction to Distributions

A distribution of a variable is a function, call it f(x), whose inputs are possible values a variable could take on and whose outputs are how often they occur. More specifically, frequency distributions indicate how often values occurred based on collected data while probability distributions indicate how likely a value is to occur based on a theoretical model of the variable.

For example, the probability distribution for a fair die is f(x) = 1/6 when x is 1, 2, 3, 4, 5, or 6 and f(x) = 0 for all other values of x. If you roll the die 5 times and you get 4, 4, 3, 2, 5, then the frequency distribution is f(2) = 1, f(3) = 1, f(4) = 2, f(5) = 1, and f(x) = 0 for all other values of x.

In general, there exists a distribution for any variable but it is impossible to determine a variable’s exact distribution without getting data points for the entire population. However, by sampling data points and plotting the frequency of the values, we can approximate a variable’s probability distribution using the frequency distribution more accurately and precisely as the number of datapoints grows.
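The die example above can be sketched directly: counting the observed rolls gives the frequency distribution, while the probability distribution comes from the theoretical model of a fair die.

```python
from collections import Counter
from fractions import Fraction

rolls = [4, 4, 3, 2, 5]  # the five rolls from the example above

# Frequency distribution: how often each value actually occurred.
freq = Counter(rolls)

# Probability distribution: how likely each value is under a fair-die model.
prob = {x: Fraction(1, 6) for x in range(1, 7)}

print(freq[4], freq[3], freq[6])  # 6 was never rolled, so its count is 0
```

With only 5 rolls, the frequency distribution is a rough approximation of the probability distribution; it improves as more rolls are collected.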

Statisticians like working with distributions because they provide a single function that gives you the likelihood of seeing certain values. After estimating the distribution, we can use it to predict the likelihoods of values we’ve seen before (or even values we’ve never seen before!).

When talking about distributions, it can be helpful to visualize the distributions with a graph since all distributions are essentially functions. There are two main types of graphs: (1) frequency/probability graphs and (2) cumulative frequency/probability graphs.

Frequency graphs answer the question “how many data points did we see with a value of x” and probability graphs answer the question “what is the probability of the variable taking on a value of x”. Since this is exactly what the function f(x) represents, frequency/probability graphs are simply a graph of f(x).

A probability graph for the die roll distribution. We can see that f(x) = 1/6 when x is 1, 2, 3, 4, 5, or 6 and f(x) = 0 for all other values of x.

[Source]

Meanwhile, cumulative frequency graphs answer the question “how many data points did we see with value less than or equal to x” and cumulative probability graphs answer the question “what is the probability that the variable takes on a value less than or equal to x”. Let’s call this function F(x). Calculating F(x) involves summing up f(y) for all values of y that are less than or equal to x.

For example, we know from earlier that the probability distribution for a die is f(x) = 1/6 when x is 1, 2, 3, 4, 5, or 6 and f(x) = 0 for all other values of x. Given f(x), we can calculate the cumulative probability function F(x) as shown below:

When x < 1, f(y) = 0 for every y ≤ x, so F(x) = 0.

When 1 ≤ x < 2, the only outcome with y ≤ x is 1, so F(x) = 1/6.

When 2 ≤ x < 3, the outcomes with y ≤ x are 1 and 2, so F(x) = 2/6. The same pattern continues for each step: when k ≤ x < k + 1 (for k = 1, 2, 3, 4, 5), F(x) = k/6.

When x ≥ 6, every outcome satisfies y ≤ x, so F(x) = 6/6 = 1.

Since we have calculated F(x) for all values of x, we can now graph F(x):

A cumulative probability graph for the die roll distribution, as calculated above.

[Source]
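The cumulative function F(x) for the die can be computed exactly by summing f(y) over all outcomes y ≤ x, mirroring the calculation above:

```python
from fractions import Fraction

def f(x):
    # Probability distribution for a fair die.
    return Fraction(1, 6) if x in (1, 2, 3, 4, 5, 6) else Fraction(0)

def F(x):
    # Cumulative probability: sum f(y) over all outcomes y <= x.
    return sum(f(y) for y in range(1, 7) if y <= x)

print(F(0), F(1), F(3.5), F(6))
```

Note that F(x) stays flat between outcomes and jumps by 1/6 at each one, which is exactly the staircase shape in the graph.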

Normal Distribution

Definition

The normal distribution is a function that is determined by the mean and standard deviation, defined below:

f(x) = (1 / (σ√(2π))) · e^(−(x − μ)² / (2σ²))

where μ is the mean and σ is the standard deviation. This distribution is very commonly used to model real world variables and has convenient properties (discussed later), making it one of the most well-known distributions. When we think that a variable has a normal distribution, we say that it is “normally distributed” or “normal”, i.e. the probability that the variable has a value of x is f(x) (defined above).

The standard normal distribution has a mean of 0 and a standard deviation of 1. The figure below is a probability graph of the standard normal distribution.

The standard normal distribution has a mean of 0 and a standard deviation of 1.

[Source]
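The definition above translates directly into code. This sketch hand-implements f(x) for illustration (Python’s `statistics.NormalDist` provides the same function built in):

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    # f(x) = (1 / (sigma * sqrt(2*pi))) * exp(-(x - mu)^2 / (2 * sigma^2))
    coeff = 1 / (sigma * math.sqrt(2 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

print(round(normal_pdf(0.0), 4))            # height of the peak at the mean
print(normal_pdf(1.0) == normal_pdf(-1.0))  # symmetric about the mean
```

The two printed checks preview the properties discussed below: the function peaks at the mean and is perfectly symmetric around it.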

Although the figure above is the general shape that most people are familiar with for the normal distribution, the graph can look vastly different depending on the values for the mean and standard deviation.

With smaller values of the standard deviation, the central peak becomes skinnier, taller, and steeper. With larger values of the standard deviation, the central peak becomes wider, shorter, and less steep. See figure below for an illustration.

With larger values of the mean, the function moves to the right and with smaller values of the mean, the function moves to the left (but its shape remains the same in both cases).

The functions above are all instances of a normal distribution with a mean of 0. The black line has a standard deviation of 0.4, the green line has a standard deviation of 1, and the purple line has a standard deviation of 4.

[Source]

The cumulative probability graph for the normal distribution. Although it is difficult to see in this image, this function never equals 0 or 1 since the normal distribution has no minimum or maximum. You can click on Source to explore this graph further.

[Source]

Properties

Although the graph of the normal distribution can change depending on the values of the mean and the standard deviation, there are some properties of the normal distribution that are always true regardless of the mean and standard deviation.

Symmetry: the normal distribution is symmetric about the mean. This means that, given the mean μ, f(μ + x) = f(μ − x), or in simpler terms, values that are an equal distance away from the mean are equally likely to occur. When the mean is 0, this simplifies to f(x) = f(−x).

Symmetry is helpful because, if we know a variable is normally distributed, we only need to map out the exact probabilities for half of the values since the other half will be symmetrical. Symmetric distributions also have no skew (defined in module 2) and therefore the mean is a useful measure of central tendency for such distributions.

Mean = Median = Mode: since the normal distribution is symmetric, the mean is equal to the median. Additionally, the normal distribution peaks at the mean/median so the most likely value (i.e. the mode) occurs at the mean/median. Therefore, for all normal variables, the mean, median, and mode are the same number. When this occurs, we know that this value is a good measure of central tendency.

Infinite Domain: the normal distribution has a nonzero probability for all values of x, i.e. f(x) > 0 for all x. However, as the values of x get farther from the mean μ, f(x) decreases and approaches 0. When a distribution appears to “taper off” like the normal distribution does on both sides, we call that part of the graph the “tail”. The normal distribution has two tails, the “upper tail” and the “lower tail”, and there is no set definition for where those tails begin.

Tails are often important in cases where the worst case value, however unlikely, is much more important than the average value. For example, in high frequency trading, latency (the amount of time it takes for the program to complete a task or respond) is a crucial variable and often has long tails where searches or actions can take up to 1,000x longer than usual. Even though these events are rare, when they occur they can lead to disastrous losses, and therefore latency teams focus on the tails instead of the more probable values.

The upper tail (shaded in green) and the lower tail (shaded in purple) for a normal distribution with a mean of 0 and a standard deviation of 4. Here the tails have been set to start 2 standard deviations away from the mean but alternate definitions could be accepted.

[Source]

When is Data Normal?

Earlier we mentioned that a lot of real world variables are normally distributed. For example, height is a popular example of a normal variable (see below for a frequency distribution of height). In fact, a lot of physical attributes that are difficult to change (e.g. birth weight, shoe size, etc.) are normally distributed.

A frequency distribution where the y-axis represents g(x) = the percentage of the population that has a height of x. We can see that, when we separate the data into two categories, men and women, both categories are roughly normally distributed with a mean of 161 cm for women and 174 cm for men.

[Source]

Another common example is the Galton board, where a board is constructed that has interleaved rows of pegs that eventually lead to a set number of bins at the bottom. Beads or balls are then dropped from the center at the top and either fall left or right as they bump into the pegs. The final distribution of beads/balls resembles the normal distribution (see here for a video demo).

After splitting the data into two categories, men and women, we can see that the data is roughly normal.

[Source]

Modeling a variable as normal simplifies a lot of analysis and gives us a symmetric function that has been extensively analyzed. However, in order for a variable to be normal, it must satisfy the properties above. For example, it must be roughly symmetric with very close values for the mean, median, and mode. Data collected on the variable should have long tails on both ends and be roughly the same shape as the normal (singular peak in the center). Most significantly, it must have an infinite domain.

Exercise: Is It Normal?

It can be hard to determine when something is truly normal. Take the frequency histogram below for an unnamed variable that has a mean of 40 and a standard deviation of 20. We’ve zoomed in a bit so that we only see the frequency counts for the values between 0 and 60, and we are using a bin size of 20.

  1. Can you conclude that the variable is normal (yes/no/needs more information)? If yes, why yes? If no, why no? And if you need more information, what information do you need?

Now let’s change the bin size from 20 to 5 and zoom in so that we are only looking at the frequency counts for values between 0 and 35.

  2. Does your answer from question (1) still hold? If your answer changed, why did it change?

Let’s zoom out and look at the graph again.

  3. Does your answer from question (2) still hold? If your answer changed, why did it change?

Now, let’s consider a few scenarios for what the variable could be:

  4. Say the variable is the daily change in the price of some stock. Could the variable still be normally distributed?
  5. Say the variable is the number of chicken nuggets ordered at Kurger Bing. Could the variable still be normally distributed? (Hint: is there an upper or lower limit to how many nuggets you can order?)

We can see from the exercise that frequency histograms can be misleading with an improper bin size, and picking the right bin size can shed more light on the data. Additionally, some variables cannot be normal because they have minimums and/or maximums, making their domains finite.

Exceptions to the Rule

However, we still say that height is normally distributed, even though we cannot have negative heights. Why is that?

In the exercise, the mean was 40 and the standard deviation was 20. If the variable were normal, then it would take on a negative value roughly 2.3% of the time, or for roughly 1 in every 43 data points. When the variable is the number of chicken nuggets ordered, we know that a negative value is impossible, so it occurs 0% of the time, not 2.3%. Therefore, it cannot be normally distributed.

In contrast, the average adult male height is around 70 inches with a standard deviation of 4 inches. If the variable were normal, then it would only take on a negative value when it falls more than 17.5 standard deviations below the mean, which happens an astronomically small fraction of the time (far less than 1 in 10⁶⁰). This is considered so improbable that many statisticians consider it equivalent to impossible and approximate it as 0. Therefore, it is still okay to model height as a normal variable.

The difference between these two examples lies in the magnitude of the standard deviation relative to the distance between the mean and the upper or lower bound. In both examples, the lower bound was 0.

In the exercise, the standard deviation was 20 relative to a distance of 40 − 0 = 40, making it half of the distance and leaving it unlikely but still quite possible for a negative value to occur.

Meanwhile, for adult male height, a standard deviation of 4 relative to a distance of 70 - 0 = 70 is fairly small, making a negative number incredibly improbable to the point of virtual impossibility.
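The probabilities in the two examples above can be checked numerically with the normal cumulative function, which Python exposes through the error function (`statistics.NormalDist().cdf` is equivalent):

```python
import math

def normal_cdf(x, mu, sigma):
    # P(X <= x) for a normal variable, written in terms of the error function.
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

# Exercise variable: mean 40, standard deviation 20 -> about a 2.3% chance.
print(round(normal_cdf(0, 40, 20), 3))

# Adult male height: mean 70 in, standard deviation 4 in -> so improbable
# that the result rounds to 0 in floating point arithmetic.
print(normal_cdf(0, 70, 4))
```

The first value matches the “1 in 43” figure; the second is small enough that a computer cannot distinguish it from zero.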

Therefore, since the normal distribution extends infinitely on both sides, not all real world data can be normal. However, when the standard deviation is sufficiently small, we can still model the data as normal because all of the negative values have an incredibly small probability of occurring.

Exponential

The exponential distribution is a function that decreases exponentially based on a given rate parameter λ:

f(x) = λe^(−λx) for x ≥ 0, and f(x) = 0 for x < 0

The exponential distribution is another popular distribution that occurs in real life but it is less common than the normal distribution so we will keep our discussion of this function brief.

The exponential distribution f(x) = λe^(−λx), where λ is the rate parameter.

[Source]

The cumulative probability graph F(x) = 1 − e^(−λx) for the exponential distribution seen above.

[Source]
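Both functions are simple to compute directly; the rate parameter λ below is an arbitrary choice for illustration.

```python
import math

lam = 1.5  # hypothetical rate parameter, chosen only for illustration

def exp_pdf(x):
    # f(x) = lam * exp(-lam * x) for x >= 0, and 0 otherwise.
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

def exp_cdf(x):
    # F(x) = 1 - exp(-lam * x) for x >= 0.
    return 1 - math.exp(-lam * x) if x >= 0 else 0.0

print(exp_pdf(0.0))            # the peak is at x = 0 and equals lam
print(round(exp_cdf(3.0), 3))  # most of the probability lies below x = 3
```

Larger values of λ make the function start higher and decay faster; smaller values stretch the upper tail out further.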

The main properties of the exponential distribution are as follows: it is defined only for x ≥ 0, its peak is at x = 0 (where f(0) = λ), it decreases toward 0 as x grows (giving it a single long upper tail), and it is “memoryless”, meaning the probability of waiting at least some additional amount of time does not depend on how long you have already waited.

Some real world examples of exponentially distributed variables include waiting times between independent random events, such as the time between earthquakes, the time between radioactive decays, or the time between calls arriving at a call center.

The plot above shows data points for the frequency of each time interval between earthquakes and graphs an exponential distribution in black to demonstrate the similarity in distribution.

[Source]

The exponential distribution is a useful model because many real world quantities decay exponentially, and it is an example of a distribution that has a “longer tail” yet is not normal and is still present in the real world.

Other Distributions (Optional)

Bernoulli

The Bernoulli distribution is a simple distribution that models the probability of a Bernoulli trial, or an experiment that has just two outcomes. For example, the results of a coin flip can be modeled with a Bernoulli distribution using the function f(Heads) = 0.5, f(Tails) = 0.5. Oftentimes, one outcome is called a “success” and/or assigned a value of 1 and the other outcome is called a “failure” and/or assigned a value of 0.

Binomial

The binomial distribution represents the probability of x successes given n repetitions of a Bernoulli trial, where the probability of success for each trial is p. The probability of exactly x successes is f(x) = C(n, x) · p^x · (1 − p)^(n − x), where C(n, x) counts the number of ways to choose which x of the n trials are successes.
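The binomial probability can be computed directly with Python’s `math.comb`; for example, the chance of exactly 2 heads in 4 fair coin flips:

```python
import math
from fractions import Fraction

def binomial_pmf(x, n, p):
    # P(exactly x successes in n Bernoulli trials, each with success prob. p).
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

print(binomial_pmf(2, 4, Fraction(1, 2)))  # 6 ways out of 16 equally likely
```

Summing the function over all possible values of x (0 through n) always gives 1, as with any probability distribution.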