Analyzing Baseball Data With R-Chapman and Hall - CRC (2013)
Analyzing
Baseball Data
with R
Max Marchi
Jim Albert
Chapman & Hall/CRC
The R Series
Series Editors
John M. Chambers, Department of Statistics, Stanford University, Stanford, California, USA
Torsten Hothorn, Division of Biostatistics, University of Zurich, Switzerland
Customer and Business Analytics: Applied Data Mining for Business Decision
Making Using R, Daniel S. Putler and Robert E. Krider
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
© 2014 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business
This book contains information obtained from authentic and highly regarded sources. Reasonable
efforts have been made to publish reliable data and information, but the author and publisher cannot
assume responsibility for the validity of all materials or the consequences of their use. The authors and
publishers have attempted to trace the copyright holders of all material reproduced in this publication
and apologize to copyright holders if permission to publish in this form has not been obtained. If any
copyright material has not been acknowledged please write and let us know so we may rectify in any
future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or
hereafter invented, including photocopying, microfilming, and recording, or in any information
storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access
www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222
Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that
provides licenses and registration for a variety of users. For organizations that have been granted a
photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are
used only for identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
Contents
Preface
1.1 Introduction
1.2 The Lahman Database: Season-by-Season Data
1.2.1 Bonds, Aaron, and Ruth home run trajectories
1.2.2 Obtaining the database
1.2.3 The Master table
1.2.4 The Batting table
1.2.5 The Pitching table
1.2.6 The Fielding table
1.2.7 The Teams table
1.2.8 Baseball questions
1.3 Retrosheet Game-by-Game Data
1.3.1 The 1998 McGwire and Sosa home run race
1.3.2 Retrosheet
1.3.3 Game logs
1.3.4 Obtaining the game logs from Retrosheet
1.3.5 Game log example
1.3.6 Baseball questions
1.4 Retrosheet Play-by-Play Data
1.4.1 Event files
1.4.2 Event example
1.4.3 Baseball questions
1.5 Pitch-by-Pitch Data
1.5.1 MLBAM Gameday and PITCHf/x
1.5.2 PITCHf/x Example
1.5.3 Baseball questions
1.6 Summary
1.7 Further Reading
1.8 Exercises
2 Introduction to R
2.1 Introduction
2.2 Installing R and RStudio
2.3 Vectors
2.3.1 Career of Warren Spahn
2.3.2 Vectors: defining and calculations
2.3.3 Vector functions
2.3.4 Vector index and logical variables
2.4 Objects and Containers in R
2.4.1 Character data and matrices
2.4.2 Factors
2.4.3 Lists
2.5 Collection of R Commands
2.5.1 R scripts
2.5.2 R functions
2.6 Reading and Writing Data in R
2.6.1 Importing data from a file
2.6.2 Saving datasets
2.7 Data Frames
2.7.1 Introduction
2.7.2 Manipulations with data frames
2.7.3 Merging and selecting from data frames
2.8 Packages
2.9 Splitting, Applying, and Combining Data
2.9.1 Using sapply
2.9.2 Using ddply in the plyr package
2.10 Getting Help
2.11 Further Reading
2.12 Exercises
3 Traditional Graphics
3.1 Introduction
3.2 Factor Variable
3.2.1 A bar graph
3.2.2 Add axes labels and a title
3.2.3 Other graphs of a factor
3.3 Saving Graphs
3.4 Dot plots
3.5 Numeric Variable: Stripchart and Histogram
3.6 Two Numeric Variables
3.6.1 Scatterplot
3.6.2 Building a graph, step-by-step
3.7 A Numeric Variable and a Factor Variable
4.1 Introduction
4.2 The Teams Table in Lahman’s Database
4.3 Linear Regression
4.4 The Pythagorean Formula for Winning Percentage
4.5 The Exponent in the Pythagorean Formula
4.6 Good and Bad Predictions by the Pythagorean Formula
4.7 How Many Runs for a Win?
4.8 Further Reading
4.9 Exercises
9 Simulation
Bibliography
Index
Preface
Baseball has always had a fascination with statistics. Schwarz (2005) documents
the quantitative measurements of teams and players since the beginning
of professional baseball history in the 19th century. Since the foundation
of the Society for American Baseball Research (SABR) in 1971, an explosion of new
measures has been developed for understanding the offensive and defensive contributions
of players. One can learn much about the current developments in sabermetrics
by viewing articles at websites such as www.baseballprospectus.com,
www.hardballtimes.com, and www.fangraphs.com.
The quantity and detail of baseball data have grown remarkably
since the birth of the Internet. The first data were collected for players and
teams for individual seasons; this is the type of data displayed
on the back of a Topps baseball card. The volunteer-run Project
Scoresheet organized the collection of play-by-play game data, and this
type of data is currently freely available from the Retrosheet organization at
www.retrosheet.org/. Since 2006, the PITCHf/x system has measured the
speeds and trajectories of every pitched ball, and newer systems
collect the speeds and locations of batted balls and the locations and
movements of fielders.
The ready availability of these large baseball datasets has led to challenges
for the baseball enthusiast interested in answering baseball questions with
these data. It can be problematic to download and organize the data. Stan-
dard statistical software packages may be well-suited for working with small
datasets of a specific format, but they are less helpful in merging datasets of
different types or performing particular types of analyses, say contour graphs
of pitch locations, that are helpful for PITCHf/x data.
Fortunately, a new open-source statistical computing environment, R, has
become increasingly popular in the statistics and computer science
communities. R is a system for statistical computation and graphics, and it is
a computer language designed for typical and possibly specialized statistical
and graphical applications. The software runs on Unix, Windows, and
Macintosh platforms and can be downloaded from www.r-project.org.
The public availability of baseball data and the open-source R software make
an attractive marriage. R provides a large range of tools for importing,
arranging, and organizing large datasets. By the use of built-in functions and
collections of packages from the R user community, one can perform various
data and graphical analyses, and communicate this work easily to other
baseball enthusiasts over the Internet. One of us recently asked a number of MLB
team analytics groups about their use of R and here are some responses:
It is clear that R is a major tool for the analytical work of MLB teams.
The purpose of this book is to introduce R to sabermetricians, baseball
enthusiasts, and students interested in exploring baseball data. Chapter 1 pro-
vides an overview of the publicly available baseball datasets and Chapter 2
gives a gentle introduction to the type of data structures and exploratory and
data management capabilities of R. One of the strongest features of R is its
graphics capabilities – Chapter 3 provides an overview of the traditional graph-
ics functions available in the base package and Chapter 6 introduces more
sophisticated graphical displays available through the lattice and ggplot2
packages.
The remainder of the book illustrates the use of R in exploring a number
of popular topics in sabermetrics. Two fundamental ideas in sabermetrics are
the relationship between runs and wins, and the measurement of the value of
baseball events by runs. Chapter 4 explores the famous Pythagorean formula
derived by Bill James and Chapters 5 and 7 describe the value of plays and
pitch sequences using run expectancy. It is fascinating to explore career per-
formance trajectories of ballplayers and Chapter 8 illustrates the use of R to
fit quadratic models to player trajectories. Chapter 9 illustrates the use of R
simulation functions to simulate a game of baseball by a Markov chain model,
and simulate a season of baseball competition. Baseball fans are interested in
streaky patterns of performance of teams and players and Chapter 10 explores
methods of describing and understanding the significance of streaky patterns
of hitting. Given the large size of baseball datasets, it may be more convenient
to work with a database and Chapter 11 illustrates the application of several
R packages to interface with a MySQL database. Chapter 12 describes the
usefulness of several R packages for exploring fielding statistics. The data files
available through Retrosheet and MLBAM Gameday and PITCHf/x are relatively
sophisticated and the appendix material provides detailed descriptions
on downloading and reading this data into R.
The reader is encouraged to work on the book datasets and try out
the presented R code as the chapters are read. All of the data files and
R code sections used in the book are available at the GitHub repository
at github.com/maxtoki/baseball_R. In addition, there is a book blog at
baseballwithr.wordpress.com where the authors will provide advice on us-
ing R in sabermetrics research and keep the reader informed on new develop-
ments in R software and baseball datasets.
The authors are very grateful for the efforts of our editor, John Kimmel,
who played an important role in our collaboration and provided us with timely
reviews that led to significant improvements of the manuscript. We wish to
thank Anne and Ramona for encouragement and inspiration. Although the
two of us live thousands of miles apart, we share a passion for statistics,
baseball, and the knowledge that one can learn about the game through the
exploration of data.
1
The Baseball Datasets
1.1 Introduction
Baseball’s marriage with numbers goes back to the origins of the sport. The
pioneers of the game in the 1840s had not yet decided the ultimate distance
between the pitcher’s rubber and home plate, nor the number of balls needed
to be awarded a base, when the first box scores and the first stats appeared
in newspapers.
This chapter introduces three sources of freely available data: the Lahman
database, Retrosheet data, and PITCHf/x data. Baseball records from these
sources have a growing level of detail, from seasonal stats available since the
1871 season, to box score data for individual games, to play-by-play accounts
covering most games since 1945, to extremely detailed pitch-by-pitch data
recorded for nearly all the pitches thrown in MLB parks since 2008. Examples
throughout this book will predominantly use subsets of data coming from
these three sources.
[Figure: career home runs (vertical axis) against age (horizontal axis) for Bonds (762), Aaron (755), and Ruth (714).]
FIGURE 1.1
Career home runs by age for the top three home run hitters in baseball history.
declined in the final years of his career, hitting 20, 12, and 10 home runs from
1974 to 1976.
Barry Bonds had a relatively late major league debut as he did not come to
an agreement with the team that first drafted him, and was not in the career
home run race until after his 35th birthday. Towards the end of his career,
Bonds put together an impressive run of season home run counts (49, 73, 46, 45,
and 45), closing in on Babe’s 714 mark. Then, after missing most
of the 2005 season because of injuries, he completed the chase to the record
with two solid seasons (26 and 28 homers) when he was 42 and 43 years old.
To compare sluggers, a researcher needs season-to-season batting data in-
cluding age and home run count for Bonds, Aaron, and Ruth. One needs this
data for a wide range of seasons, as Ruth’s career began in 1914 and Bonds’
career ended in 2007.
For many years database journalist and author Sean Lahman has been
making available at his website (www.seanlahman.com/) a database (Lahman, 2012)
containing pitching, hitting, and fielding statistics for the entire history of
professional baseball from 1871 to the current season. The data are available in
several formats, including a set of comma-separated value (csv) tables, which will
be used in this book.
1. Access www.seanlahman.com/baseball-archive/statistics/.
2. Below the Limited Use License section, there is the section for downloading
the most recent version of the database. At the time of this writing, the
section is named Download 2012 Version. Click on the 2012 Version -
comma-delimited version.
3. Save the file to a directory of your choice.
After the compressed file has been downloaded and extracted, one has a
total of 24 files with a csv extension, listed in Table 1.1. In addition, a “readme
2012.txt” text file is included which gives a thorough description of the Lahman
database. One is encouraged to read the documentation provided in the
“readme 2012.txt” file to learn about the contents of these files. Here we give
a general description of the variables in the data tables most relevant for the
studies described in this book.
in the Hall of Fame (therefore having an entry in the Master table) are baseball pioneer
Henry Chadwick and career Negro Leaguer Josh Gibson.
TABLE 1.1
Files in the Lahman database.
File                      Description
AllStarFull.csv           Players’ appearances in All-Star Games
Appearances.csv           Seasonal players’ appearances by position
AwardsManagers.csv        Recipients of the Manager of the Year Award
AwardsPlayers.csv         Player recipients of the various Awards
AwardsShareManagers.csv   Voting results for the Manager of the Year Award
AwardsSharePlayers.csv    Voting results for the various Awards for players
Batting.csv               Seasonal batting statistics
BattingPost.csv           Seasonal batting statistics for post-season
Fielding.csv              Seasonal fielding statistics
FieldingOF.csv            Seasonal appearances at the three outfield positions
FieldingPost.csv          Seasonal fielding data for post-season
HallOfFame.csv            Voting results for the Hall of Fame
Managers.csv              Seasonal data for managers
ManagersHalf.csv          Seasonal split data for managers
Master.csv                Biographical information for individuals appearing in the database
Pitching.csv              Seasonal pitching statistics
PitchingPost.csv          Seasonal pitching statistics for post-season
Salaries.csv              Seasonal salaries for players
Schools.csv               List of college teams
SchoolsPlayers.csv        Information on schools attended by players
SeriesPost.csv            Outcomes of post-season series
Teams.csv                 Seasonal stats for teams
TeamsFranchises.csv       Timelines of Franchises
TeamsHalf.csv             Seasonal split stats for teams
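Once the csv files are extracted, any of the tables in Table 1.1 can be read into R with read.csv(). The following is a minimal sketch rather than the book's own code: the helper name read_lahman and the folder name "lahman" are hypothetical, and the demonstration writes a one-row stand-in for Teams.csv (a few values of the 1927 Yankees record shown later in this chapter) so that it runs without the download.

```r
# Hypothetical helper (not part of the Lahman distribution): read one table
# from the folder where the compressed file was extracted.
read_lahman <- function(table, dir = "lahman") {
  path <- file.path(dir, paste0(table, ".csv"))
  if (!file.exists(path)) {
    stop("File not found: ", path)
  }
  read.csv(path, stringsAsFactors = FALSE)
}

# Stand-in file so the sketch runs without the real database:
# a few columns of the 1927 Yankees row of Teams.csv.
demo_dir <- file.path(tempdir(), "lahman_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(
  c("yearID,lgID,teamID,W,L", "1927,AL,NYA,110,44"),
  file.path(demo_dir, "Teams.csv")
)

teams <- read_lahman("Teams", dir = demo_dir)
str(teams)
```

With the real database, one would point dir at the extraction folder and call, for example, read_lahman("Batting").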
Master.csv file which gives information about the first player in the database,
Hank Aaron. For clarity, we place Aaron’s information in a table format in
Table 1.2.
lahmanID,playerID,managerID,hofID,birthYear,birthMonth,birthDay,
birthCountry,birthState,birthCity,deathYear,deathMonth,deathDay,
deathCountry,deathState,deathCity,nameFirst,nameLast,nameNote,
nameGiven,nameNick,weight,height,bats,throws,debut,finalGame,
college,lahman40ID,lahman45ID,retroID,holtzID,bbrefID
1,aaronha01,,aaronha01h,1934,2,5,USA,AL,Mobile,,,,,,,Hank,Aaron,
,Henry Louis,"Hammer,Hammerin’ Hank,Bad Henry",180,72,R,R,
4/13/1954,10/3/1976,,aaronha01,aaronha01,aaroh101,aaronha01,
aaronha01
From this information, we learn some details about Aaron’s life. Hank
Aaron was born February 5, 1934, in Mobile, Alabama; his full name is Henry
Louis Aaron; and his nicknames were Hammer, Hammerin’ Hank, and Bad
Henry. Aaron weighed 180 pounds and was 72 inches tall, he threw and batted
right-handed, and he played in the big leagues from 4/13/1954 to 10/3/1976.
There is a series of blank columns (the consecutive commas between “Mobile”
and “Hank”) corresponding to death information, which is obviously
unavailable for a living person. Finally there are the various identifying codes
for the player. The value of playerID, aaronha01, is the identifying code for
Hank Aaron in every table in the Lahman database. The value of the variable
retroID, aaroh101, is the player id specific to the Retrosheet files to be
described in Section 1.3.
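The details above can be recovered in R from the csv record itself. The sketch below is illustrative only: it keeps a subset of the Master.csv columns to stay short, writes the nickname with a plain apostrophe, and the object names (master_text, master, nicknames, and so on) are our own choices rather than anything defined by the database.

```r
# A subset of the Master.csv header and of Aaron's record, as a text string.
# The quoted nickname field contains commas, which read.csv() handles.
master_text <- paste(
  "playerID,birthYear,birthMonth,birthDay,nameFirst,nameLast,nameNick,weight,height,bats,throws,debut,finalGame,retroID",
  "aaronha01,1934,2,5,Hank,Aaron,\"Hammer,Hammerin' Hank,Bad Henry\",180,72,R,R,4/13/1954,10/3/1976,aaroh101",
  sep = "\n"
)
master <- read.csv(text = master_text, stringsAsFactors = FALSE)

# Split the nickname field into its three entries.
nicknames <- strsplit(master$nameNick, ",")[[1]]

# Build proper dates from the separate birth columns and the m/d/Y strings.
birth <- as.Date(ISOdate(master$birthYear, master$birthMonth, master$birthDay))
debut <- as.Date(master$debut, format = "%m/%d/%Y")
final_game <- as.Date(master$finalGame, format = "%m/%d/%Y")

# Height is stored in inches.
height_ft <- master$height / 12

nicknames
c(birth = format(birth), debut = format(debut), final_game = format(final_game))
```

Converting debut and finalGame to the Date class makes it possible to subtract them or to compute a player's age at debut.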
TABLE 1.2
First line of the Master.csv file.
easily understood by those familiar with baseball box scores. If one has doubts
about the meaning of one particular column name, the “readme 2012.txt” file
provided with the database gives the variable descriptions.
An excerpt of the Batting.csv file for Babe Ruth is conveniently formatted
in Table 1.3. This table shows his batting statistics for his early seasons
as a Boston Red Sox pitcher, his years with the Yankees when he became a
great home run slugger, and the twilight of his career with the
Boston Braves.
Only count statistics such as the count of at-bats and count of hits are
reported in the batting table. Derived statistics such as a batting average
need to be computed from these count statistics. For example, a researcher
who wants to know Ruth’s batting average for the 1919 season has to calculate
it following paragraph 10.21(b) of the Official Baseball Rules (Triumph Books,
2012), which instructs one to “divide the number of safe hits by the total times at
bat.” The relevant columns are H and AB, and the desired result is 139 / 432 =
.322. Some statistics are missing for Babe Ruth because they were not recorded
in the 1920s. For example, the counts of intentional walks (IBB) are blank for
all of Ruth’s seasons.
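The batting average calculation above is a one-line computation in R. The function name batting_average is our own, and returning NA for zero at-bats is an assumption of this sketch rather than a rule from the database.

```r
# Batting average: safe hits divided by total times at bat.
# Returns NA when there are no at-bats (an assumption of this sketch).
batting_average <- function(H, AB) {
  ifelse(AB > 0, H / AB, NA_real_)
}

# Babe Ruth, 1919 season: H = 139, AB = 432.
ruth_1919 <- batting_average(H = 139, AB = 432)
round(ruth_1919, 3)  # 0.322
```

Because the function is vectorized, it can be applied directly to the H and AB columns of a data frame read from Batting.csv.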
TABLE 1.4
Pitching statistics for Babe Ruth, taken from the Pitching.csv file. A few extra columns, not reported here, are available
in the Pitching.csv file.
playerID yearID stint teamID lgID W L G GS CG SHO SV IPouts H ER HR BB SO BAOpp ERA IBB WP HBP BK BFP GF R
1 ruthba01 1914 1 BOS AL 2 1 4 3 1 0 0 69 21 10 1 7 3 0.23 3.91 0 0 0 100 0 12
2 ruthba01 1915 1 BOS AL 18 8 32 28 16 1 0 653 166 59 3 85 112 0.21 2.44 9 6 1 895 3 80
3 ruthba01 1916 1 BOS AL 23 12 44 41 23 9 1 971 230 63 0 118 170 0.20 1.75 3 8 1 1300 3 83
4 ruthba01 1917 1 BOS AL 24 13 41 38 35 6 2 979 244 73 2 108 128 0.21 2.01 5 11 0 1313 2 91
5 ruthba01 1918 1 BOS AL 13 7 20 19 18 1 0 499 125 41 1 49 40 0.21 2.22 3 2 1 660 0 51
6 ruthba01 1919 1 BOS AL 9 5 17 15 12 0 1 400 148 44 2 58 30 0.29 2.97 5 2 1 591 2 59
7 ruthba01 1920 1 NYA AL 1 0 1 1 0 0 0 12 3 2 0 2 0 0.20 4.50 0 0 0 17 0 4
8 ruthba01 1921 1 NYA AL 2 0 2 1 0 0 0 27 14 9 1 9 2 0.35 9.00 0 0 0 49 1 10
9 ruthba01 1930 1 NYA AL 1 0 1 1 1 0 0 27 11 3 0 2 3 0.30 3.00 0 0 0 39 0 3
10 ruthba01 1933 1 NYA AL 1 0 1 1 1 0 0 27 12 5 0 3 0 0.30 5.00 0 0 0 42 0 5
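The ERA column of Table 1.4 can be reproduced from the count columns. The sketch below assumes the usual definition of earned run average (nine times earned runs divided by innings pitched) and that IPouts counts outs recorded, so that innings pitched equal IPouts / 3. Only three of Ruth's seasons are typed in, and the data frame name ruth is illustrative.

```r
# Three of Ruth's seasons from Table 1.4 (yearID, IPouts, ER).
ruth <- data.frame(
  yearID = c(1914, 1916, 1919),
  IPouts = c(69, 971, 400),
  ER     = c(10, 63, 44)
)

# Innings pitched: three outs per inning.
ruth$IP <- ruth$IPouts / 3

# Earned run average: earned runs allowed per nine innings.
ruth$ERA <- round(9 * ruth$ER / ruth$IP, 2)

ruth
```

The computed values (3.91, 1.75, and 2.97) agree with the ERA column reported in the table for those seasons.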
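TABLE 1.5
Fielding statistics for Babe Ruth, taken from the Fielding.csv file. Columns
featuring statistics relevant only to catchers are not reported. The GS and InnOuts
columns are blank for these seasons, so each row lists G, PO, A, E, and DP after POS.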
playerID yearID stint teamID lgID POS G GS InnOuts PO A E DP
1 ruthba01 1914 1 BOS AL P 4 0 7 0 0
2 ruthba01 1915 1 BOS AL P 32 17 63 2 3
3 ruthba01 1916 1 BOS AL P 44 24 83 3 6
4 ruthba01 1917 1 BOS AL P 41 19 101 2 4
5 ruthba01 1918 1 BOS AL 1B 13 130 6 5 8
6 ruthba01 1918 1 BOS AL OF 59 121 8 7 3
7 ruthba01 1918 1 BOS AL P 20 19 58 6 5
8 ruthba01 1919 1 BOS AL 1B 5 35 4 1 4
9 ruthba01 1919 1 BOS AL OF 111 222 14 1 6
10 ruthba01 1919 1 BOS AL P 17 13 35 2 1
11 ruthba01 1920 1 NYA AL 1B 2 10 0 1 1
12 ruthba01 1920 1 NYA AL OF 141 259 21 19 3
13 ruthba01 1920 1 NYA AL P 1 1 0 0 0
14 ruthba01 1921 1 NYA AL 1B 2 8 0 0 0
15 ruthba01 1921 1 NYA AL OF 152 348 17 13 6
16 ruthba01 1921 1 NYA AL P 2 1 2 0 0
17 ruthba01 1922 1 NYA AL 1B 1 0 0 0 0
18 ruthba01 1922 1 NYA AL OF 110 226 14 9 3
19 ruthba01 1923 1 NYA AL 1B 4 41 1 1 2
20 ruthba01 1923 1 NYA AL OF 148 378 20 11 2
21 ruthba01 1924 1 NYA AL OF 152 340 18 14 4
22 ruthba01 1925 1 NYA AL OF 98 207 15 6 3
23 ruthba01 1926 1 NYA AL 1B 2 10 0 0 2
24 ruthba01 1926 1 NYA AL OF 149 308 11 7 5
25 ruthba01 1927 1 NYA AL OF 151 328 14 13 4
26 ruthba01 1928 1 NYA AL OF 154 304 9 8 0
27 ruthba01 1929 1 NYA AL OF 133 240 5 4 2
28 ruthba01 1930 1 NYA AL OF 144 266 10 10 0
29 ruthba01 1930 1 NYA AL P 1 0 4 0 2
30 ruthba01 1931 1 NYA AL 1B 1 5 0 0 0
31 ruthba01 1931 1 NYA AL OF 142 237 5 7 2
32 ruthba01 1932 1 NYA AL 1B 1 3 0 0 0
33 ruthba01 1932 1 NYA AL OF 128 209 10 9 1
34 ruthba01 1933 1 NYA AL 1B 1 6 0 1 0
35 ruthba01 1933 1 NYA AL OF 132 215 9 7 4
36 ruthba01 1933 1 NYA AL P 1 1 1 0 0
37 ruthba01 1934 1 NYA AL OF 111 197 3 8 0
38 ruthba01 1935 1 BSN NL OF 26 39 1 2 0
three-character code (teamID). The column name in the Teams.csv file helps
in recognizing clubs by their full name.
To illustrate the teams dataset, we extract the data for one of the greatest
teams in baseball history, the 1927 New York Yankees.
yearID lgID teamID franchID divID Rank G Ghome W L DivWin WCWin
1927 AL NYA NYY <NA> 1 155 77 110 44 <NA> <NA>
LgWin WSWin R AB H X2B X3B HR BB SO SB CS HBP SF RA ER ERA
Y Y 975 5347 1644 291 103 158 635 605 90 64 NA NA 599 494 3.2
CG SHO SV IPouts HA HRA BBA SOA E DP FP name
82 11 20 4167 1403 42 409 431 196 123 0.96 New York Yankees
park attendance BPF PPF teamIDBR teamIDlahman45
820 Yankee Stadium I 1164015 98 94 NYY NYA
We see the 1927 Yankees finished the season with a 110-44 record and won
the World Series. The “Bronx Bombers” hit 158 home runs, stole 90 bases,
and had a total home attendance of 1,164,015.
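As with the batting table, derived team statistics have to be computed from the counts. A minimal sketch using a few values from the 1927 Yankees record above follows; the data frame name and the derived quantities chosen here are our own illustration.

```r
# Selected columns of the 1927 Yankees row shown above.
yankees_1927 <- data.frame(
  yearID = 1927, teamID = "NYA",
  G = 155, Ghome = 77, W = 110, L = 44,
  R = 975, RA = 599, attendance = 1164015,
  stringsAsFactors = FALSE
)

# Winning percentage: wins divided by decided games.
win_pct <- with(yankees_1927, W / (W + L))

# Run differential: runs scored minus runs allowed.
run_diff <- with(yankees_1927, R - RA)

# Average attendance per home game.
att_per_home_game <- with(yankees_1927, attendance / Ghome)

round(win_pct, 3)  # 0.714
run_diff           # 376
```

Note that G (155) exceeds W + L (154), so the winning percentage here is based on decided games only.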
A From 1900 to 1909 pitchers completed 79% of the games they started; from
2000 to 2010 that share had dropped to 3.5%.
Data for this answer can be found in the Pitching table.
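A share of this kind is the ratio of complete games (CG) to games started (GS), summed over the seasons of interest. The sketch below shows the calculation on Ruth's 1914 to 1919 rows of Table 1.4 only, not on the full Pitching table, so its result is not the league-wide figure quoted above; the function name complete_game_rate is hypothetical.

```r
# Share of starts that were completed, over a range of seasons.
# 'pitching' needs the columns yearID, GS and CG.
complete_game_rate <- function(pitching, first_year, last_year) {
  rows <- pitching$yearID >= first_year & pitching$yearID <= last_year
  sum(pitching$CG[rows]) / sum(pitching$GS[rows])
}

# Ruth's Boston seasons from Table 1.4.
ruth <- data.frame(
  yearID = 1914:1919,
  GS = c(3, 28, 41, 38, 19, 15),
  CG = c(1, 16, 23, 35, 18, 12)
)

ruth_rate <- complete_game_rate(ruth, 1914, 1919)
round(100 * ruth_rate, 1)  # percent of starts completed
```

Applied to the whole Pitching table, the same function could be called once per decade to obtain the comparison in the answer.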
1.3.2 Retrosheet
Retrosheet is a volunteer organization, founded in 1989 by University of
Delaware professor David Smith, that aims to collect play-by-play accounts of
every game played in Major League Baseball history. Through the labor of love
of many volunteers who have unearthed old newspaper accounts, scanned
microfilms, and manually entered data into computers, the Retrosheet website
(www.retrosheet.org) contains game-by-game summaries going back to the dawn of
Major League Baseball in the nineteenth century. The Retrosheet site also has
play-by-play data of most of the games played since the 1945 season and continues
to add games for previous seasons. This data is introduced in Section 1.4.
[Figure: home runs in the season (vertical axis) by date, April to October (horizontal axis), for McGwire (70) and Sosa (66).]
FIGURE 1.2
Seasonal home runs for Mark McGwire and Sammy Sosa during the 1998 race.
A Since 1980, July has been the month with the most home runs per game
(1.97), while September has had the lowest frequency (1.84). In the same
6 Some other team statistics, such as Stolen Bases and Caught Stealings, omitted in Table
TABLE 1.6
Excerpt of information available in the Game Logs. Sample from Cal
Ripken’s “Iron Man” game (Sept. 6, 1995).
date 19950906
dayofweek Wed
visitorteam CAL
hometeam BAL
visitorrunsscored 2
homerunsscore 4
daynight N
parkid BAL12
attendence 46272
duration 215
visitorlinescore 100000010
homelinescore 10020010x
homeab 34
homeh 9
homehr 4
homerbi 4
homebb 1
homek 8
homegdp 0
homelob 7
homepo 27
homea 8
homee 0
umpirehname Larry Barnett
umpire1bname Greg Kosc
umpire2bname Dan Morrison
umpire3bname Al Clark
visitormanagername Marcel Lachemann
homemanagername Phil Regan
homestartingpitchername Mike Mussina
homebatting1name Brady Anderson
homebatting1position 8
homebatting2name Manny Alexander
homebatting2position 4
homebatting3name Rafael Palmeiro
homebatting3position 3
homebatting4name Bobby Bonilla
homebatting4position 9
homebatting5name Cal Ripken
homebatting5position 6
homebatting6name Harold Baines
homebatting6position 10
homebatting7name Chris Hoiles
homebatting7position 2
homebatting8name Jeff Huson
homebatting8position 5
homebatting9name Mark Smith
homebatting9position 7
time frame, 2.71 home runs per game have been hit in Coors Field (home
of the Colorado Rockies), and 1.14 in the Astrodome (the former home of
the Houston Astros).
Q Do runs happen more frequently when some umpires are be-
hind the plate? What is the difference between the most pitcher-
friendly and the most hitter-friendly umpires?
A Among umpires with more than 400 games called since 1980, teams scored
the highest number of runs (10.0 per game combined) when Chuck
Meriwether was behind the plate and the lowest (7.8) when Doug Harvey was
in charge.
Q How many extra people attend ballgames during the weekend?
What’s the average attendance by day of the week?
A Close to 33,000 people attend games played on Saturdays (data from 1980
to 2011) and 31,000 on Sundays. The average goes down to 29,000 on
Fridays, 25,000 on Thursdays and Mondays, and 24,000 on Tuesdays and
Wednesdays.
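Questions such as these start from individual game log fields like the ones in Table 1.6. The sketch below, which uses only values from that table, converts the date field to an R date and checks the line scores against the run totals. It assumes that each inning of a line score is a single character and that "x" marks an unplayed half inning, which holds for this game but would need extra handling for innings with ten or more runs; the helper name runs_by_inning is our own.

```r
# A few fields from the game log record in Table 1.6.
game <- list(
  date = "19950906",
  visitorrunsscored = 2,
  homerunsscore = 4,
  visitorlinescore = "100000010",
  homelinescore = "10020010x"
)

# The date is stored as yyyymmdd text.
game_date <- as.Date(game$date, format = "%Y%m%d")
weekday_index <- as.POSIXlt(game_date)$wday  # 0 = Sunday, ..., 3 = Wednesday

# Split a line score into runs per inning, dropping an unplayed half inning.
runs_by_inning <- function(linescore) {
  innings <- strsplit(linescore, "")[[1]]
  as.integer(innings[innings != "x"])
}

visitor_runs <- sum(runs_by_inning(game$visitorlinescore))
home_runs <- sum(runs_by_inning(game$homelinescore))

c(visitor = visitor_runs, home = home_runs)
```

The totals recovered from the line scores (2 and 4) match the visitorrunsscored and homerunsscore fields, and the weekday index agrees with the dayofweek value of Wed.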
TABLE 1.7
Excerpt of information available in Retrosheet Event files. Sample from Jeter’s
“Flip Play” (Oct. 13, 2001)
 1. GAME_ID         OAK200110130
 2. YEAR_ID         2001
 3. AWAY_TEAM_ID    NYA
 4. INN_CT          7
 5. BAT_HOME_ID     1
 6. OUTS_CT         2
 7. BALLS_CT        2
 8. STRIKES_CT      2
 9. PITCH_SEQ_TX    CSBBFX
10. AWAY_SCORE_CT   1
11. HOME_SCORE_CT   0
12. BAT_ID          longt002
13. BAT_HAND_CD     L
14. PIT_ID          mussm001
15. PIT_HAND_CD     R
16. POS2_FLD_ID     posaj001
17. POS3_FLD_ID     martt002
18. POS4_FLD_ID     soria001
19. POS5_FLD_ID     bross001
20. POS6_FLD_ID     jeted001
21. POS7_FLD_ID     knobc001
22. POS8_FLD_ID     willb002
23. POS9_FLD_ID     spens001
24. BASE1_RUN_ID    giamj002
25. BASE2_RUN_ID    NA
26. BASE3_RUN_ID    NA
27. EVENT_TX        D9/9S.1XH(962)
28. BAT_FLD_CD      7
29. BAT_LINEUP_ID   7
A Mark McGwire hit 37 home runs in 313 plate appearances with runners
on base, Sammy Sosa 29 in 367. Once walks (both intentional and
unintentional) and hit by pitches are removed, the number of opportunities
becomes 223 for McGwire and 317 for Sosa.
Q How many intentional walks in unusual situations (e.g., empty
bases or bases loaded) was Barry Bonds issued in his 73 HR
campaign?
A During his record 2001 season, Barry Bonds was passed intentionally only
35 times. Of those free passes one came with a runner on first and two
with runners on first and second. When he was awarded 120 intentional
walks in 2004, 19 came with nobody on, 11 with a runner on first, and 3
with runners on first and third. He was once walked intentionally with the
bases loaded in 1998.
Q What is the Major league batting average when the ball/strike
count is 0-2? What about on 2-0?
A In 2011, hitters compiled a .253 batting average on plate appearances
where they fell behind 0-2. Conversely they hit .479 after going ahead 2-0.
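[Figure: pitch locations, with horizontal location (ft. from center of plate, catcher's view) from −1 to 1 on the horizontal axis and vertical location (ft. from ground) from 2 to 4 on the vertical axis; points are marked as breaking or fastball, and the rulebook strike zone is outlined.]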
FIGURE 1.3
Pitch type and location for Jose Bautista’s 54 home runs of the 2010 season.
FoxTrax hockey puck. Two cameras installed in each MLB park record the
flight of the baseball between the pitcher’s mound and home plate, and ad-
vanced software calculates the position, the velocity and the acceleration of
the ball, giving sufficient information to estimate the full trajectory of the ball
in its mound-to-plate trip. Raw PITCHf/x data can be accessed by anyone
from the MLB.com website; however, its format (XML) might not be easy to
manage for the average reader. The data used for the examples in this book
are available in a format suitable for quick use inside R. In Appendix B we will
show how the XML package allows one to manage data in XML format in R and
point to some free online resources that can be used to download PITCHf/x
data.
in Table 1.8. The outcome of the pitch (variable des) is recorded by a stringer,
while most of the remaining information is either captured by the Sportvision
system or calculated from the captured data.
Each pitch is assigned an identifier (sv_id) that is actually a time stamp:
Humber’s final pitch was recorded on April 21, 2012, at 15:25:37. The key
information Sportvision obtains through its camera system is recorded in lines
11 through 19 of the table. Those nine parameters give the position (variables
x0, y0, z0), velocity (variables vx0, vy0, vz0), and acceleration (variables ax,
ay, az) components of the pitch at release point. With these nine parameters
the full trajectory of the pitch from release to home plate can be estimated. (In
fact, Sportvision actually estimates the parameters somewhere in the middle
of the ball’s flight, then derives the parameters at release point.)
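To make the role of the nine parameters concrete, the following sketch treats each coordinate as motion under constant acceleration and solves for the moment the ball reaches the front of home plate. The function name pitch_at_plate and the crossing distance y = 17/12 feet (the front edge of the plate) are assumptions made for this illustration and are not stated in the text; the inputs are the values listed in Table 1.8.

```r
# Location of a pitch at the front of home plate, assuming constant
# acceleration in each direction (a simplification used in this sketch).
pitch_at_plate <- function(x0, y0, z0, vx0, vy0, vz0, ax, ay, az,
                           y_plate = 17 / 12) {
  # Solve y0 + vy0 * t + 0.5 * ay * t^2 = y_plate for the flight time t.
  # vy0 is negative (the ball moves toward the plate) and ay is positive
  # (drag slows the ball), so the smaller root is the crossing time.
  disc <- vy0^2 - 2 * ay * (y0 - y_plate)
  t <- (-vy0 - sqrt(disc)) / ay
  c(t = t,
    px = x0 + vx0 * t + 0.5 * ax * t^2,
    pz = z0 + vz0 * t + 0.5 * az * t^2)
}

# Values from Table 1.8 (final pitch of Humber's perfect game)
humber <- pitch_at_plate(x0 = -1.58, y0 = 50.0, z0 = 5.746,
                         vx0 = 9.228, vy0 = -124.71, vz0 = -5.311,
                         ax = 0.483, ay = 25.576, az = -29.254)
round(humber, 2)
```

Under these assumptions the computed px and pz should agree closely with the values recorded in the table (2.211 and 1.17).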
While the nine parameters just mentioned are sufficient for learning about
the trajectory of the pitch, they are difficult to understand by casual fans who
follow the game on MLBAM Gameday (see footnote 12). Other more descriptive
quantities are calculated starting from those nine parameters. The one measure
familiar to baseball fans is the pitch speed at release, which for Humber’s final
pitch is calculated at 85.3 mph (variable start_speed). PITCHf/x also provides
the speed of the ball as it crosses the plate, 79.1 mph in this case (variable
end_speed). Another two important values are the variables px and pz; they
represent the horizontal and vertical location of the pitches, respectively, and
can be combined with the batter’s strike zone upper and lower limits (sz_top
and sz_bot) to infer whether the pitch crossed the strike zone.
Let’s focus on the location of this particular pitch. The horizontal reference
point is the middle of the plate, with positive values indicating pitches passing
on the right side of it from the umpire’s viewpoint. In this case the ball crossed
the plate 2.21 feet to the right of its midpoint. Since the plate is 17 inches
wide, it was way out of the strike zone. The pitch was also too low to be a
strike, as the vertical point at which it crossed the plate is listed at 1.17 feet,
while the hitter’s lower limit of the strike zone is 1.74 (see footnote 13). Luckily
for Humber, since otherwise a walk would have ruined the perfect game, the home
plate umpire controversially declared that Brendan Ryan had swung the bat for
strike three.
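The reasoning in this paragraph can be written as a short check in R. This is only a sketch: the function name in_strike_zone is ours, and it tests the center of the ball against the edges of the plate, ignoring the radius of the ball, which is a simplifying assumption.

```r
# Is the center of the ball inside the zone? The plate is 17 inches
# wide, so half its width in feet is (17 / 2) / 12.
in_strike_zone <- function(px, pz, sz_top, sz_bot) {
  half_plate <- (17 / 2) / 12
  abs(px) <= half_plate & pz >= sz_bot & pz <= sz_top
}

# Humber's final pitch (values from Table 1.8)
in_strike_zone(px = 2.211, pz = 1.17, sz_top = 3.73, sz_bot = 1.74)
```

Because the comparisons are vectorized, the same function can be applied to whole columns of a PITCHf/x data frame.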
Other interesting quantities about a pitch are available with PITCHf/x,
including the horizontal and vertical movement (variables pfx_x and pfx_z)
of the pitch trajectory, the spin direction, and its rate (variables spin_dir
and spin_rate; see footnote 14). MLBAM has devised a complex algorithm which processes
12 Philip Humber’s perfect game can be relived, pitch by pitch, at
mlb.mlb.com/mlb/gameday/index.jsp?gid=2012_04_21_chamlb_seamlb_1&mode=gameday.
13 The batter’s strike zone boundaries are recorded by the human stringer at the beginning
of the at-bat, and thus are less precise than the pitch location coordinates recorded by the
advanced system.
14 Detailed explanations for the PITCHf/x fields have been provided by Mike Fast at
fastballs.wordpress.com/2007/08/02/glossary-of-the-gameday-pitch-fields/.
Prof. Alan Nathan provides a collection of PITCHf/x references at
webusers.npl.illinois.edu/~a-nathan/pob/pitchtracker.html.
TABLE 1.8
Excerpt of information available from PITCHf/x. Sample from the final pitch
of Phil Humber’s perfect game (Apr. 21, 2012)
1. des Swinging Strike (Blocked)
2. sv_id 120421_152537
3. start_speed 85.3
4. end_speed 79.1
5. sz_top 3.73
6. sz_bot 1.74
7. pfx_x 0.31
8. pfx_z 1.81
9. px 2.211
10. pz 1.17
11. x0 -1.58
12. y0 50.0
13. z0 5.746
14. vx0 9.228
15. vy0 -124.71
16. vz0 -5.311
17. ax 0.483
18. ay 25.576
19. az -29.254
20. break_y 23.8
21. break_angle -4.1
22. break_length 7.8
23. pitch_type SL
24. spin_dir 170.609
25. spin_rate 344.307
the information captured by Sportvision and marks the pitch with a label
familiar to baseball fans. In this case the algorithm recognizes the pitch as a
slider (variable pitch_type).
A Nine of the fastest ten pitches recorded by PITCHf/x from 2008 to 2011
have been thrown by Aroldis Chapman, the highest figure being a 105.1
mph pitch thrown on September 24, 2010, in San Diego. Neftali Feliz is
the other pitcher making the top ten list with a 103.4 fastball delivered in
Kansas City on September 1, 2010.
Q What are the chances of a successful steal when the pitcher
throws a fastball compared to when a curve is delivered?
A From 2008 to 2010 baserunners were successful 73% of the time at stealing
second base on a fastball. The success rate increases to 85% when the pitch
is a curveball.
1.6 Summary
When choosing among the three main sources of baseball data (Lahman, Ret-
rosheet, and PITCHf/x), one always has to consider the trade-off between the
level of detail and the seasons covered by the source. With Lahman’s database,
for example, one can explore the evolution of the game since its beginnings
in the nineteenth century. However, only the basic season count statistics
are available from this source. For example, simple information such as
Babe Ruth’s batting splits by pitcher’s handedness cannot be retrieved from
Lahman’s files.
Retrosheet is steadily adding past seasons to its play-by-play database,
allowing researchers to perform studies to validate or reject common beliefs
about players of past decades. Over the years, for example, analysis
of play-by-play data has confirmed the huge defensive value of players like
Brooks Robinson and Mark Belanger, and has substantiated the greatness of
Roberto Clemente’s throwing arm.
PITCHf/x has been available only since 2008 and, contrary to Retrosheet,
there is no way to compile data for games of the past. This means we will
never be able to compare the velocity of Aroldis Chapman’s fastball to that
of Nolan Ryan or Bob Feller. However, studies performed since its inception
have provided an enhanced understanding of the game, enabling researchers
to explore issues like pitch sequencing, batter discipline, pitcher fatigue, and
the catcher’s ability to block bad pitches.
1.8 Exercises
1. (Which Datafile?)
This chapter has given an overview of the Lahman database, the Ret-
rosheet game logs, the Retrosheet play-by-play files, and the PITCHf/x
database. Describe the relevant data among these four databases that can
be used to answer the following baseball questions.
(a) How has the rate of walks (per team for nine innings) changed over
the history of baseball?
(b) What fraction of baseball games in 1968 were shutouts? Compare this
fraction with the fraction of shutouts in the 2012 baseball season.
(c) What percentage of first pitches are strikes? If the count is 2-0, what
fraction of the pitches are strikes?
(d) Is it easier to steal second base or third base? (Compare the fraction
of successful steals of second base with the fraction of successful steals
of third base.)
(a) Gibson started 34 games for the Cardinals in 1968. What fraction of
these games were completed by Gibson?
(b) What was Gibson’s ratio of strikeouts to walks this season?
(a) What was the time in hours and minutes of this particular game?
(b) Why is the attendance value in this record equal to zero?
(c) How many extra base hits did the Phillies have in this game? (We
know that the Mets had no extra base hits this game.)
(d) What was the Phillies’ on-base percentage in this game?
Based on the records, write a short paragraph that describes the play-by-
play events of this particular inning.
5. (PITCHf/x Record of Several Pitches)
R. A. Dickey is one of the current pitchers who predominantly throws
a knuckleball. The following gives some PITCHf/x variables for the first
knuckleball and the first fastball that Dickey threw for a game against the
Kansas City Royals on April 13, 2013.
start_speed end_speed pfx_x pfx_z px pz sz_bot sz_top
73 66.3 -0.64 -7.58 -0.047 2.475 1.5 3.35
Describe the differences between the knuckleball and the fastball in terms
of pitch speed, movement (horizontal and vertical directions), and location
in the strike zone. Based on this data, why is it so difficult for a batter
to make contact with a knuckleball?
2
Introduction to R
CONTENTS
2.1 Introduction
2.2 Installing R and RStudio
2.3 Vectors
    2.3.1 Career of Warren Spahn
    2.3.2 Vectors: defining and calculations
    2.3.3 Vector functions
    2.3.4 Vector index and logical variables
2.4 Objects and Containers in R
    2.4.1 Character data and matrices
    2.4.2 Factors
    2.4.3 Lists
2.5 Collection of R Commands
    2.5.1 R scripts
    2.5.2 R functions
2.6 Reading and Writing Data in R
    2.6.1 Importing data from a file
    2.6.2 Saving datasets
2.7 Data Frames
    2.7.1 Introduction
    2.7.2 Manipulations with data frames
    2.7.3 Merging and selecting from data frames
2.8 Packages
2.9 Splitting, Applying, and Combining Data
    2.9.1 Using sapply
    2.9.2 Using ddply in the plyr package
2.10 Getting Help
2.11 Further Reading
2.12 Exercises
2.1 Introduction
In this chapter, we provide a general introduction to the R statistical system.
We describe the process of installing R and the program RStudio that provides
an attractive interface to the R system. Pitching data from the legend Warren
Spahn is used to motivate manipulations with vectors, a basic data struc-
ture. We describe different data types such as characters, factors, and lists,
and different “containers” for holding these different data types. The process
of executing collections of R commands by means of scripts and functions is
discussed, and methods for importing and exporting datasets from R are de-
scribed. A fundamental data structure in R is a data frame and we introduce
defining a data frame, performing manipulations, merging data frames, and
performing operations on a data frame split by values of a variable. We con-
clude the chapter by describing how to install and load R packages and how
one gets help using resources from the R system and the RStudio interface.
installed in the system and the Help tab will display documentation for R
functions and datasets.
FIGURE 2.1
The opening screen of the RStudio interface to R.
2.3 Vectors
2.3.1 Career of Warren Spahn
One of the authors collected the 1965 Warren Spahn baseball card pictured
in Figure 2.2. The back of the card, shown in Figure 2.3, displays many of
the standard pitching statistics for the seasons preceding Spahn’s final 1965
season. We use data from Spahn’s season statistics to illustrate some basic
components of the R system.
FIGURE 2.2
Front of the 1965 Warren Spahn Topps card.
FIGURE 2.3
Back of the 1965 Warren Spahn Topps card.
We can display these winning percentages by simply typing the variable name:
Win.Pct
[1] 61.53846 67.74194 55.55556 60.00000 55.26316 61.11111 42.42424
For a sequence of integer values, the colon notation will also work:
Year <- 1946 : 1952
Suppose we wish to calculate Spahn’s age for these seasons. Spahn was
born in April 1921 and we can compute his age by subtracting 1921 from each
season value – the resulting vector is stored in the variable Age.
Age <- Year - 1921
plot(Age, Win.Pct)
We see that Spahn was pretty successful for most of his Boston seasons – his
winning percentage exceeded 55% for six of his seven seasons.
[Figure: scatterplot with Age (25 to 31) on the horizontal axis and Win.Pct (45 to 65) on the vertical axis.]
FIGURE 2.4
Scatterplot of the winning percentage against age for Warren Spahn’s seasons
playing for the Boston Braves.
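The quantities behind this plot can be rebuilt from win and loss counts. The commands that define W and L are not shown in this excerpt, so the two vectors below are a reconstruction inferred from the outputs displayed in this section (for instance, 8 wins together with a winning percentage of 61.538 imply 5 losses in 1946).

```r
W <- c(8, 21, 15, 21, 21, 22, 14)    # wins, 1946 to 1952
L <- c(5, 10, 12, 14, 17, 14, 19)    # losses, inferred from Win.Pct
Win.Pct <- 100 * W / (W + L)         # element-by-element arithmetic
Year <- 1946 : 1952
Age <- Year - 1921

round(Win.Pct, 2)
sum(Win.Pct > 55)                    # seasons with a percentage above 55
```

The last line counts the TRUE values of a logical vector, which is one way to check the statement that the winning percentage exceeded 55% in six of the seven seasons.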
One can sort the win numbers from low to high by the sort function:
sort(W)
[1] 8 14 15 21 21 21 22
Taking cumulative sums of the win totals in season order (the cumsum function),
we see that Spahn won 8 games in the first season, 29 games
in the first two seasons, and so on. The summary function applied to the
winning percentages displays several summary statistics of the vector values
such as the extremes (low and high values), the quartiles (first and third), the
median, and the mean.
summary(Win.Pct)
Min. 1st Qu. Median Mean 3rd Qu. Max.
42.42 55.41 60.00 57.66 61.32 67.74
This output tells us that his median winning percentage was 60, his mean
percentage was 57.66, and the entire group of winning percentages ranged
from 42.42 to 67.74.
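As a brief sketch of the vector functions discussed here, the commands below apply cumsum, mean, median, and range to the Spahn vectors. The values of W and L are again the reconstruction inferred from the outputs in this section, not the original defining commands.

```r
W <- c(8, 21, 15, 21, 21, 22, 14)    # wins, 1946 to 1952 (reconstructed)
L <- c(5, 10, 12, 14, 17, 14, 19)    # losses (reconstructed)
Win.Pct <- 100 * W / (W + L)

cumsum(W)          # running total of wins by season
mean(Win.Pct)      # the Mean entry of summary(Win.Pct)
median(Win.Pct)    # the Median entry
range(Win.Pct)     # the Min. and Max. entries as a vector of length two
```

The first two entries of cumsum(W) are 8 and 29, the totals quoted in the text.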
will extract the first, second, and fifth entries of the vector W. The first four
values of the vector can be extracted by typing
W[1 : 4]
[1] 8 21 15 21
Win.Pct > 60
[1] TRUE TRUE FALSE FALSE FALSE TRUE FALSE
The result of this calculation is a logical vector – the output indicates that
Spahn had a winning percentage exceeding 60% for the first, second, and sixth
seasons (TRUE), and not exceeding 60% for the remaining seasons (FALSE).
Were there any seasons where Spahn won more than 20 games and his winning
percentage exceeded 60%? We use the logical & (AND) operator to find the
years where W > 20 and Win.Pct > 60.
(W > 20) & (Win.Pct > 60)
[1] FALSE TRUE FALSE FALSE FALSE TRUE FALSE
The output indicates that both conditions were true for the second and sixth
seasons.
By using logical variables and the square bracket notation, we can find
subsets of vectors satisfying different conditions. During this period, when did
Spahn have his highest winning percentage? We use
Win.Pct == max(Win.Pct)
[1] FALSE TRUE FALSE FALSE FALSE FALSE FALSE
to create a logical vector which is true when this condition is satisfied. (Note
the use of the double equals sign notation to indicate logical equality.) Then we
select the corresponding year by indexing Year by this logical vector.
Year[Win.Pct == max(Win.Pct)]
[1] 1947
We see that the highest winning percentage occurred in 1947 during this
period.
In what seasons did the number of decisions (wins plus losses) exceed 30?
We first create a logical vector based on W + L > 30, and then choose the
seasons using this logical vector.
Year[W + L > 30]
[1] 1947 1949 1950 1951 1952
We see that the number of decisions exceeded 30 for the five seasons 1947,
1949, 1950, 1951, and 1952.
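Base R also offers a few helper functions that work well with logical vectors. The sketch below uses which, which.max, and sum on the reconstructed Spahn vectors (the values of W and L are inferred from the outputs in this section); it is an alternative to the indexing shown above, not a replacement for it.

```r
W <- c(8, 21, 15, 21, 21, 22, 14)    # wins, 1946 to 1952 (reconstructed)
L <- c(5, 10, 12, 14, 17, 14, 19)    # losses (reconstructed)
Year <- 1946 : 1952
Win.Pct <- 100 * W / (W + L)

which(Win.Pct > 60)         # positions of the TRUE values
Year[which.max(Win.Pct)]    # season with the highest winning percentage
sum(W + L > 30)             # number of seasons with more than 30 decisions
```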
and logical in the previous section. We store a number of objects into a con-
tainer. A vector is a simple type of container where we place a number of
objects of the same type, say objects that are all numeric or all logical. Here
we illustrate some of the different object types and containers that we find
useful in working with baseball data.
There are other ways to store objects besides vectors. For example, suppose
we wish to display the World Series contestants in a tabular format. A matrix
is a rectangular grid of objects of the same type. A matrix can be created by
the matrix function; the arguments are the objects to be put in the matrix
and the number of rows and the number of columns. By default the objects are
placed in the matrix by columns. Suppose we want to create a matrix with 10
rows and 2 columns with the National League contestants in the first column
and the American League contestants in the second column. We combine the
two team vectors into one vector by the c function and apply the matrix
function, storing the result in variable results.
results <- matrix(c(NL, AL), 10, 2)
results
[,1] [,2]
[1,] "FLA" "NYY"
[2,] "STL" "BOS"
[3,] "HOU" "CHW"
[4,] "STL" "DET"
[5,] "COL" "BOS"
[6,] "PHI" "TBR"
[7,] "PHI" "NYY"
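The row and column labels of a matrix can be made more informative with the dimnames argument. In the sketch below, the NL vector follows the factor encoding shown in Section 2.4.2, while the last three AL entries (for 2010 to 2012) do not appear in this excerpt and are filled in from the World Series record; the season labels assume the 2003 to 2012 span given in the caption of Figure 2.5.

```r
NL <- c("FLA", "STL", "HOU", "STL", "COL", "PHI", "PHI", "SFG", "STL", "SFG")
AL <- c("NYY", "BOS", "CHW", "DET", "BOS", "TBR", "NYY", "TEX", "TEX", "DET")

# Same matrix as in the text, but with named rows and columns
results <- matrix(c(NL, AL), nrow = 10, ncol = 2,
                  dimnames = list(as.character(2003 : 2012), c("NL", "AL")))

dim(results)          # number of rows and columns
results["2012", ]     # both contestants for one season, selected by name
results[, "NL"]       # the whole NL column
```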
2.4.2 Factors
A factor is a special way of representing character data. To motivate the con-
sideration of factors, suppose we construct a frequency table of the National
League representatives to the World Series in the character vector NL.
table(NL)
NL
COL FLA HOU PHI SFG STL
1 1 1 2 2 3
[Figure: bar graph with one bar each for AL and NL; the vertical axis runs from 0 to 7.]
FIGURE 2.5
Bar graph of the number of wins of the American League and National League
teams in the World Series between 2003 and 2012.
Note that R will organize the teams alphabetically (from COL to STL) in the
frequency table. It may be preferable to organize the teams by the division
(East, Central, and West). We can change the organization of the team labels
by converting this character type to a factor.
We make this conversion by means of the factor function. The basic
arguments are the vector of data to be converted and a vector levels that
gives the ordered values of the variable. Here we list the values ordered by the
East, Central, and West divisions. The result is stored in the factor variable
NL2.
NL2 <- factor(NL, levels=c("FLA", "PHI", "HOU", "STL", "COL", "SFG"))
One can understand how factor variables are stored by using the str function
to examine the structure of the variable NL2.
str(NL2)
Factor w/ 6 levels "FLA","PHI","HOU",..: 1 4 3 4 5 2 2 6 4 6
We see that a factor variable is actually encoded by integers (1, 4, 3, ...) where
the levels are the team names. If we table this new factor variable
table(NL2)
NL2
FLA PHI HOU STL COL SFG
1 2 1 3 1 2
we obtain the same frequencies as before, but the teams are now listed in the
order specified in the factor function.
Generally, we will see that R will often convert character-type
data to factors automatically (for example, read.csv did so by default in
versions of R before 4.0.0). Many R functions require the use of factors, and
the use of factors gives one finer control over how character labels are
displayed in output and graphs.
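The integer encoding and the effect of the levels argument can be inspected directly. The following sketch rebuilds NL from the str output above and compares the ordering of the two frequency tables.

```r
NL <- c("FLA", "STL", "HOU", "STL", "COL", "PHI", "PHI", "SFG", "STL", "SFG")
NL2 <- factor(NL, levels = c("FLA", "PHI", "HOU", "STL", "COL", "SFG"))

as.integer(NL2)       # the integer codes stored inside the factor
levels(NL2)           # the labels that the codes refer to
names(table(NL))      # character vector: alphabetical order
names(table(NL2))     # factor: order given by the levels argument
```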
2.4.3 Lists
All of the containers we have described such as vectors and matrices require
that data values have the same type. For example, vectors contain all numeric
data or all character data; one cannot mix numeric and character data in a
single vector. A list is a convenient way of storing data of different types. To
illustrate, suppose we wish to collect the league that won the World Series (a
character type), the number of games played (a numeric type), and a short
description (a character type) into a single variable. Using the list function,
we create a new list World.Series with components Winner, Number.Games,
and Seasons.
World.Series <- list(Winner=Winner, Number.Games=N.Games,
Seasons="2003 to 2012")
Or we can use the double square brackets to display the second component of
the list.
World.Series[[2]]
[1] 6 4 4 5 4 5 6 5 7 4
As an alternative, we can use the single square brackets with the name of the
component in quotes.
World.Series["Number.Games"]
$Number.Games
[1] 6 4 4 5 4 5 6 5 7 4
Note that the first two options return vectors and the third option returns a
list with a single component Number.Games.
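The difference between the three ways of extracting a component can be checked with the class function. In this sketch the vector N.Games is taken from the output above, while the Winner vector is not shown in this excerpt and is filled in from the World Series record for 2003 to 2012.

```r
Winner <- c("NL", "AL", "AL", "NL", "AL", "NL", "AL", "NL", "NL", "NL")
N.Games <- c(6, 4, 4, 5, 4, 5, 6, 5, 7, 4)
World.Series <- list(Winner = Winner, Number.Games = N.Games,
                     Seasons = "2003 to 2012")

class(World.Series$Number.Games)       # a numeric vector
class(World.Series[[2]])               # also a numeric vector
class(World.Series["Number.Games"])    # a list with one component
mean(World.Series[["Number.Games"]])   # vectors can be used in calculations
```

Functions such as mean expect a vector, so the dollar sign or double square brackets are the forms to use when a component is passed on to further calculations.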
FIGURE 2.6
Snapshot of the RStudio interface after executing commands from an R script.
2.5.2 R functions
We have illustrated the use of a number of R built-in functions. One attractive
feature of R is the capability to create one’s own functions to implement
specific computations and graphs of interest.
As a simple example, suppose you are interested in writing a function to
compute a player’s home run rates for a collection of seasons. One inputs a
vector age of player ages, a vector hr of home run counts, and a vector ab of
at-bats. You want the function to compute the player’s home run rates (as a
percentage, rounded to the nearest tenth), and output the ages and rates in
a form amenable to graphing.
The following function hr.rates will perform the desired calculations. All
functions start with the syntax Name.of.function <- function(arguments),
where arguments is a list of input variables. All of the work in the function
goes inside the curly brackets that follow. The result of the last line of the
function is returned as the output. In our example, the name of the function
is hr.rates and there are three vector inputs age, hr, and ab. The round
function is used to compute the home run rates; round(x, n) rounds x to n
decimal places. The output of this function is a list with two components:
x is the vector of ages, and y is the vector of home run rates.
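hr.rates <- function(age, hr, ab){
  # home runs per 100 at-bats, rounded to one decimal place
  rates <- round(100 * hr / ab, 1)
  # return the ages (x) and rates (y) together as a list
  list(x=age, y=rates)
}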
To use this function, first it needs to be read into R. This can be done by
entering it directly into the Console window, or by saving the function in a
file, say hr.rates.R, and reading it into R by the source function.
source("hr.rates.R")
We illustrate using this function on some home run data for Mickey Mantle for
the seasons 1951 to 1961. We enter Mantle’s home run counts in the vector HR,
the corresponding at-bats in AB, and the ages in Age. We apply the function
hr.rates with inputs Age, HR, AB, and the output is a list with Mantle’s
ages and corresponding home run rates.
HR <- c(13, 23, 21, 27, 37, 52, 34, 42, 31, 40, 54)
AB <- c(341, 549, 461, 543, 517, 533, 474, 519, 541, 527, 514)
Age <- 19 : 29
hr.rates(Age, HR, AB)
$x
[1] 19 20 21 22 23 24 25 26 27 28 29
$y
[1] 3.8 4.2 4.6 5.0 7.2 9.8 7.2 8.1 5.7 7.6 10.5
One can easily construct a scatterplot (not shown here) of Mantle’s rates
against age by the plot function on the output of the function.
plot(hr.rates(Age, HR, AB))
Note that Mantle’s home run rates rose steadily in the first six seasons of his
career.
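The remark about the first six seasons can be verified from the function output itself. The sketch below repeats the function and data from the text so that it runs on its own, then uses diff to look at season-to-season changes and which.max to locate the peak rate.

```r
hr.rates <- function(age, hr, ab){
  rates <- round(100 * hr / ab, 1)
  list(x = age, y = rates)
}

HR <- c(13, 23, 21, 27, 37, 52, 34, 42, 31, 40, 54)
AB <- c(341, 549, 461, 543, 517, 533, 474, 519, 541, 527, 514)
Age <- 19 : 29

out <- hr.rates(Age, HR, AB)
diff(out$y[1 : 6])               # changes between consecutive seasons
all(diff(out$y[1 : 6]) > 0)      # TRUE if the rate rose every season
out$x[which.max(out$y)]          # age at which the rate was highest
```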
placed the file in the current working directory. One can check the location of
the current working directory in R by means of typing in the Console window:
getwd()
In RStudio, one can change the working directory by selecting the “Change
Working Directory” option on the Tools menu or by using the setwd
function. One can easily import this dataset in RStudio by pressing the “Import
Dataset” button in the top right window. You select the “From Text File”
option and find the dataset of interest. After you select the file, the Import
Dataset window opens; Figure 2.7 shows a snapshot of it. One sees the input file and
also the format of the data that will be saved into R. It is important to check
the box indicating that the file contains a heading, which means the first line of the
input file contains the variable names.
FIGURE 2.7
Snapshot of the Import Dataset window in the RStudio interface.
expression reads the comma separated value file “spahn.csv” and saves the
data into a data frame with name spahn.
spahn <- read.csv("spahn.csv")
We use the write.csv function to save the data to the current working
directory. We supply two arguments: the R object Mantle that we
wish to save, and the output file name “mantle.csv”. By adding the
option row.names = FALSE, row names will be omitted from the file that is
saved.
write.csv(Mantle, "mantle.csv", row.names=FALSE)
It is good to confirm that a new file “mantle.csv” exists in the current working
directory.
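One way to confirm the result from within R is to check that the file exists and read it back. In the sketch below the construction of the Mantle data frame is hypothetical (its definition is not shown in this excerpt), and the file is written to a temporary directory rather than the working directory so that the example leaves no files behind.

```r
# Hypothetical construction of Mantle from the vectors used earlier
Mantle <- data.frame(Age = 19 : 29,
                     HR = c(13, 23, 21, 27, 37, 52, 34, 42, 31, 40, 54),
                     AB = c(341, 549, 461, 543, 517, 533, 474, 519, 541,
                            527, 514))

out_file <- file.path(tempdir(), "mantle.csv")
write.csv(Mantle, out_file, row.names = FALSE)

file.exists(out_file)                   # was the file created?
Mantle2 <- read.csv(out_file)           # read it back in
identical(dim(Mantle), dim(Mantle2))    # same numbers of rows and columns
head(Mantle2, 3)
```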
the vectors into a matrix where the vectors are columns of the matrix.
The header labels Year, Age, Tm, W, L, W.L, ERA, G, GS are some variable
names of the data frame; the numbers 1, 2, 3 displayed on the left give the row
numbers. We can display all variables for the first row by leaving the second
argument blank.
spahn[1, ]
  Year Age  Tm Lg W L W.L  ERA G GS GF CG SHO SV   IP  H  R ER HR BB
1 1942  21 BSN NL 0 0  NA 5.74 4  2  0  1   0  0 15.2 25 15 10  0 11
  IBB SO HBP BK WP BF ERA.  WHIP  H.9 HR.9 BB.9 SO.9 SO.BB Awards
1  NA  7   0  0  0 79   59 2.298 14.4    0  6.3    4  0.64
summary(spahn$ERA)
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.100 2.940 3.040 3.256 3.260 5.740
From this display, we see that 50% of Spahn’s season ERAs fell between the
lower quartile 2.940 and the upper quartile 3.260. Using logical operators, the
age when Spahn had his lowest ERA can be found by use of the following
expression.
spahn$Age[spahn$ERA == min(spahn$ERA)]
[1] 32
Using the ERA measure, Spahn had his best pitching season at the age of 32.
$$\mathrm{FIP} = \frac{13\,\mathrm{HR} + 3\,\mathrm{BB} - 2\,\mathrm{K}}{\mathrm{IP}}.$$
We add a new variable to a current data frame using the $ convention. In the
following R code, the with function indicates that the variables HR, BB, SO,
and IP are understood in the environment of the spahn data frame.
spahn$FIP <- with(spahn, (13 * HR + 3 * BB - 2 * SO) / IP)
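As a small check of this calculation, the sketch below applies the same expression to a one-row data frame built from the 1942 season displayed earlier (HR = 0, BB = 11, SO = 7, IP = 15.2). The data frame name spahn1942 is ours, and IP is used exactly as it is printed in the table.

```r
# One-row data frame holding the 1942 values shown in spahn[1, ]
spahn1942 <- data.frame(Year = 1942, HR = 0, BB = 11, SO = 7, IP = 15.2)

# with() lets the expression refer to the columns by name ...
spahn1942$FIP <- with(spahn1942, (13 * HR + 3 * BB - 2 * SO) / IP)

# ... which is equivalent to spelling out each column with $
fip_long <- (13 * spahn1942$HR + 3 * spahn1942$BB - 2 * spahn1942$SO) /
  spahn1942$IP

spahn1942$FIP
all.equal(spahn1942$FIP, fip_long)
```

The value obtained for 1942 should correspond to the maximum FIP of 1.2500 reported for the Boston seasons in the by summary later in this section.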
fielders.
It is interesting that Spahn’s best FIP seasons occurred during the middle of
his career. Also, note that Spahn had a smaller (better) FIP in 1952 compared
to 1953, although his ERA was significantly larger in 1952.
Since Spahn pitched primarily for two cities, Boston and Milwaukee, sup-
pose we are interested in comparing his pitching for the two cities. We first
create a new data frame spahn1 containing only the statistics for the two
teams. This is done using the subset function with two arguments – the first
argument is the data frame to subset, and the second argument is the logical
condition defining the new data frame. (We introduce the logical OR operator
|.)
spahn1 <- subset(spahn, Tm == "BSN" | Tm == "MLN")
The current factor variable Tm has three possible values, “BSN,” “MLN,”
and “TOT” (for the total statistics for the 1965 season when Spahn played
for two teams). We redefine Tm using the factor function so that there are
only two possible values.
spahn1$Tm <- factor(spahn1$Tm, levels=c("BSN", "MLN"))
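The reason for redefining the factor is that subset keeps all of the original levels, even those that no longer occur in the data. The sketch below shows this on a tiny data frame named teams, whose four rows are illustrative values only and are not taken from the spahn data frame.

```r
# Illustrative data: a factor with the three team codes mentioned in the text
teams <- data.frame(Tm = factor(c("BSN", "MLN", "TOT", "MLN")))

teams1 <- subset(teams, Tm == "BSN" | Tm == "MLN")
levels_before <- levels(teams1$Tm)    # "TOT" is still listed as a level

teams1$Tm <- factor(teams1$Tm, levels = c("BSN", "MLN"))
levels_after <- levels(teams1$Tm)     # only the two levels in use remain

levels_before
levels_after
```

The base function droplevels offers another way to discard unused levels; with by, removing the unused level avoids an empty group in the output.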
To compare various pitching statistics for the two teams, we use the by
function. The three arguments to by are the data frame to be summarized
(here the subset of the main data frame consisting of the variables W.L, ERA,
WHIP, and FIP), the grouping variable Tm, and the function (here summary)
that will be applied to each group. The output gives the summary statistics
for the Boston seasons and the Milwaukee seasons.
by(spahn1[, c("W.L", "ERA", "WHIP", "FIP")], spahn1$Tm, summary)
spahn1$Tm: BSN
W.L ERA WHIP FIP
Min. :0.4240 Min. :2.330 Min. :1.136 Min. :0.3448
1st Qu.:0.5545 1st Qu.:2.970 1st Qu.:1.154 1st Qu.:0.6251
Median :0.6000 Median :3.025 Median :1.222 Median :0.8219
Mean :0.5766 Mean :3.364 Mean :1.331 Mean :0.7922
3rd Qu.:0.6130 3rd Qu.:3.297 3rd Qu.:1.230 3rd Qu.:0.9836
Max. :0.6770 Max. :5.740 Max. :2.298 Max. :1.2500
NA’s :1
------------------------------------------------------------
spahn1$Tm: MLN
W.L ERA WHIP FIP
Min. :0.3160 Min. :2.100 Min. :1.058 Min. :0.3620
1st Qu.:0.5780 1st Qu.:2.757 1st Qu.:1.123 1st Qu.:0.8345
Median :0.6405 Median :3.030 Median :1.163 Median :0.9944
Mean :0.6202 Mean :3.121 Mean :1.187 Mean :0.9839
3rd Qu.:0.6695 3rd Qu.:3.170 3rd Qu.:1.226 3rd Qu.:1.0764
Max. :0.7670 Max. :5.290 Max. :1.474 Max. :1.7263
It is interesting that Spahn’s ERAs were higher in Boston (the middle 50%
between 2.970 and 3.297 in Boston, compared to the middle 50% between
2.757 and 3.170 in Milwaukee), but Spahn’s FIPs were lower in Boston. This
indicates that Spahn may have had a weaker defense behind him, or may have
been unlucky with hits on balls in play, in Boston.
This command assumes that the two data frames NLbatting and ALbatting
have the same variables; otherwise an error message will be displayed.
Suppose instead that we have read in the batting data NLbatting and the
pitching data NLpitching for the NL teams in the 2011 season and we wish to
merge these data frames horizontally, so that a row of the merged data
frame contains the batting and pitching statistics for a particular team.
In this case, we use the function merge, where we specify the two data frames
and the by argument indicates the common variable (Tm) to merge by.
NLpitching <- read.csv("NLpitching.csv")
NL <- merge(NLbatting, NLpitching, by="Tm")
The new data frame NL contains 16 (the number of NL teams) rows and all
of the variables from both the NLbatting and NLpitching data frames.
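As a small illustration of horizontal merging, the sketch below builds two toy data frames and merges them on Tm. The team abbreviations and statistics are made-up placeholders, not the 2011 values, and the names toy.batting and toy.pitching are hypothetical.

```r
# Toy data: hypothetical values, not the actual 2011 team statistics
toy.batting <- data.frame(Tm = c("ARI", "ATL", "CHC"),
                          HR = c(172, 173, 148))
toy.pitching <- data.frame(Tm = c("CHC", "ARI", "ATL"),
                           ERA = c(4.33, 3.80, 3.48))

# Rows are matched on the common variable Tm, regardless of row order
toy.merged <- merge(toy.batting, toy.pitching, by = "Tm")
print(toy.merged)
```

Each row of toy.merged holds both the batting and the pitching columns for one team, even though the two inputs list the teams in different orders.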
A third useful operation is choosing a subset of a data frame that satisfies
a particular condition. Suppose one has the data frame NLbatting and one
wishes to focus on the batting statistics for only the teams who hit over 150
home runs this season. We use the subset function – the arguments are the
original data frame and the logical condition that describes how teams are
selected.
NL.150 <- subset(NLbatting, HR > 150)
The new data frame NL.150 contains the batting statistics for the eight teams
who hit over 150 home runs.
2.8 Packages
This book will focus on functions available in the base R system as
installed. But one attractive feature of R is the availability of collections of
functions and datasets in R packages. Currently, there are over 4000 packages
contributed by R users available on the R website cran.r-project.org/ and
these packages expand the capabilities of the R system. In our book, we focus
on a few contributed packages that we find useful in our baseball work.
To illustrate installing and loading an R package, we recently found a
package, Lahman, that contains the data files from the Lahman database described
in Section 1.2. Assuming one is connected to the Internet, one can install the
current version of this package into R by means of the command
install.packages("Lahman")
Alternatively, one can install packages by use of the Install Packages button on
the Packages tab in RStudio.
After a package has been installed, then one needs to load the package into
R to have access to the functions and datasets. For example, to load the new
package Lahman, one types
library(Lahman)
To confirm that the package has been loaded correctly, we use the help func-
tion to learn about the dataset Batting in the Lahman package. (A general
discussion of the help function is given in Section 2.10.)
?Batting
When one launches R, one needs to load the packages that are not automati-
cally loaded in the system.
Since we are focusing on the 1960s, the subset function is used to select
batting data only for the seasons between 1960 and 1969, creating the new
data frame Batting.60.
Batting.60 <- subset(Batting, yearID >= 1960 & yearID <= 1969)
By use of the unique function, a vector of the ids for all of the players
in the 1960s is created. The home run totals for all players are computed by the
sapply function – the arguments are the vector players and the function
compute.hr that will be applied to each element in the vector. The output is
a vector S containing the total home run count for all players in the vector.
players <- unique(Batting.60$playerID)
S <- sapply(players, compute.hr)
A new data frame R is created using the data.frame function. The syntax
indicates there are two variables in the new data frame – Player corresponds
to the player ids contained in the vector players and HR corresponds to the
home run counts contained in the vector S. Using the order function, we sort
this data frame so that the best home run hitters are on the top, and display
the first lines of this data frame.
R <- data.frame(Player=players, HR=S)
R <- R[order(R$HR, decreasing=TRUE), ]
head(R)
Player HR
857 killeha01 393
1 aaronha01 375
The best home run hitters in the 1960s were Harmon Killebrew, Hank Aaron,
Willie Mays, and Frank Robinson.
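The sapply, data.frame, and order steps can be tried on a toy example. In the sketch below the data values are made up, and compute.hr is a hypothetical stand-in written for this illustration; it is not necessarily the definition used for the 1960s analysis.

```r
# Toy season-level batting data (made-up values)
toy.batting <- data.frame(playerID = c("a01", "b01", "a01", "c01", "b01"),
                          HR = c(30, 12, 25, 40, NA),
                          stringsAsFactors = FALSE)

# Hypothetical helper: total home runs for one player id
compute.hr <- function(pid) {
  d <- subset(toy.batting, playerID == pid)
  sum(d$HR, na.rm = TRUE)
}

players <- unique(toy.batting$playerID)   # one entry per player
S <- sapply(players, compute.hr)          # apply the helper to each id

# Collect the results and sort so the largest totals come first
R <- data.frame(Player = players, HR = S)
R <- R[order(R$HR, decreasing = TRUE), ]
head(R)
```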
Now that we have this new variable Career.AB, we can use the subset
function to choose only the season batting statistics for the players with at
least 5000 career AB.
Batting.5000 <- subset(Batting, Career.AB >= 5000)
For each player in the data frame Batting.5000, we want to compute the
career AB, career HR, and career SO. This is another example of the “split,
apply, combine” operation done conveniently by the ddply function. We first
write a small function ab.hr.so that works on the season batting statistics
for a single player. The input is the data frame d containing the statistics for
one player and the output is a data frame with the career AB, HR, and SO.
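ab.hr.so <- function(d){
  # d: season batting rows for one player; sums ignore missing values
  c.AB <- sum(d$AB, na.rm=TRUE)
  c.HR <- sum(d$HR, na.rm=TRUE)
  c.SO <- sum(d$SO, na.rm=TRUE)
  # return a one-row data frame of career totals
  data.frame(AB=c.AB, HR=c.HR, SO=c.SO)
}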
To illustrate the use of ab.hr.so, we extract Hank Aaron’s batting statistics
and apply this function on Aaron’s data frame aaron.
aaron <- subset(Batting.5000, playerID == "aaronha01")
ab.hr.so(aaron)
AB HR SO
1 12364 755 1383
This confirms that Aaron had 755 career home runs and 1383 career strikeouts.
To apply this function to each batter and collect the results, we again use
the function ddply. The arguments are the data frame Batting.5000 to split,
the splitting variable playerID, and the function ab.hr.so to apply on each
part.
d.5000 <- ddply(Batting.5000, .(playerID), ab.hr.so)
The resulting data frame d.5000 contains the career AB, HR, and SO for all
batters with at least 5000 career AB. To confirm, the first six lines of the data
frame are displayed by the head function.
head(d.5000)
playerID AB HR SO
1 aaronha01 12364 755 1383
2 abreubo01 8128 284 1763
3 adamssp01 5557 9 223
4 adcocjo01 6606 336 1059
5 alfoned01 5385 146 617
6 allendi01 6332 351 1556
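For readers who want to see the three steps of “split, apply, combine” spelled out, the following sketch performs the same kind of calculation with base R functions only (split, lapply, and do.call with rbind). It is an illustrative alternative on made-up data, not the ddply call used above, and the data frame toy is hypothetical.

```r
# Toy season data for two players (made-up values)
toy <- data.frame(playerID = c("a01", "a01", "b01", "b01", "b01"),
                  AB = c(500, 520, 300, NA, 410),
                  HR = c(30, 25, 5, 2, 8),
                  SO = c(90, 100, 40, 10, 55),
                  stringsAsFactors = FALSE)

# Career totals for one player's seasons
ab.hr.so <- function(d) {
  data.frame(AB = sum(d$AB, na.rm = TRUE),
             HR = sum(d$HR, na.rm = TRUE),
             SO = sum(d$SO, na.rm = TRUE))
}

pieces <- split(toy, toy$playerID)      # split: one data frame per player
results <- lapply(pieces, ab.hr.so)     # apply: career totals for each piece
career <- do.call(rbind, results)       # combine: stack the one-row results
career <- data.frame(playerID = names(results), career, row.names = NULL)
print(career)
```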
Is there an association between a player’s home run rate and his strike-
out rate? Using the plot function, we construct a scatterplot of HR/AB and
SO/AB. Using the lines function we add a smoothing curve (using the function
lowess) to the scatterplot. (See Figure 2.8.)
with(d.5000, plot(HR/AB, SO/AB))
with(d.5000, lines(lowess(HR/AB, SO/AB)))
It is clear from the graph that batters with higher home run rates tend to
have higher strikeout rates.
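To see what lowess contributes, the sketch below repeats the plot on simulated rates. The numbers are generated with an assumed positive relationship purely for illustration; they are not the career data. The lowess function returns a list with components x (sorted) and y (the smoothed values), which is the form the lines function expects.

```r
set.seed(1)
# Simulated rates with a built-in positive association (illustration only)
hr.rate <- runif(100, 0.005, 0.08)
so.rate <- 0.05 + 1.5 * hr.rate + rnorm(100, sd = 0.02)

plot(hr.rate, so.rate, xlab = "HR/AB", ylab = "SO/AB")
fit <- lowess(hr.rate, so.rate)   # list with sorted x and smoothed y
lines(fit)                        # overlay the smooth on the scatterplot

# A numerical companion to the visual impression
cor(hr.rate, so.rate)
```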
[Figure 2.8: scatterplot with HR/AB on the horizontal axis and SO/AB on the vertical axis (roughly 0.05 to 0.30), with a smoothing curve]
FIGURE 2.8
Scatterplot of the home run rates and strikeout rates of all players with at
least 5000 career at-bats. A smoothing curve is added to the plot to show that
home run rates and strikeout rates have a positive association.
you see a long description of this function including all of the possible function
arguments. To find out about related functions, one can preface “dotchart”
by two question marks to find all objects that contain this character string:
??dotchart
2.12 Exercises
1. (Top Base Stealers in the Hall of Fame)
The following table gives the number of stolen bases (SB), the number of
times caught stealing (CS), and the number of games played (G) for nine
players currently inducted in the Hall of Fame.
Player             SB    CS   G
Rickey Henderson   1406  335  3081
Lou Brock          938   307  2616
Ty Cobb            897   212  3034
Eddie Collins      741   195  2826
Max Carey          738   109  2476
Joe Morgan         689   162  2649
Luis Aparicio      506   136  2599
Paul Molitor       504   131  2683
Roberto Alomar     474   114  2379
(a) In R, place the stolen base, caught stealing, and game counts in the
vectors SB, CS, and G.
(b) For all players, compute the number of stolen base attempts SB + CS
and store in the vector SB.Attempt.
(c) For all players, compute the success rate Success.Rate = SB /
SB.Attempt.
(d) Compute the number of stolen bases per game SB.Game = SB /
G.
(e) Construct a scatterplot of the stolen bases per game against the suc-
cess rates. Are there particular players with unusually high or low
stolen base success rates? Which player had the greatest number of
stolen bases per game?
Player              W    L    SO    BB
Pete Alexander      373  208  2198  951
Roger Clemens       354  184  4672  1580
Pud Galvin          364  310  1806  745
Walter Johnson      417  279  3509  1363
Greg Maddux         355  227  3371  999
Christy Mathewson   373  188  2502  844
Kid Nichols         361  208  1868  1268
Warren Spahn        363  245  2583  1434
Cy Young            511  316  2803  1217
(a) In R, place the wins and losses in the vectors W and L, respectively.
Also, create a character vector Name containing the last names of
these pitchers.
(b) Compute the winning percentage for all pitchers defined by 100 ×
W/(W + L) and put these winning percentages in the vector Win.PCT.
(c) By use of the command
Wins.350 <- data.frame(Name, W, L, Win.PCT)
create a data frame Wins.350 containing the names, wins, losses, and
winning percentages.
(d) By use of the order function, sort the data frame Wins.350 by win-
ning percentage. Among these pitchers, who had the largest and
smallest winning percentages?
4. (Pitchers in the 350 Wins Club, Continued)
(a) In R, place the strikeout and walk totals from the 350 win pitchers
in the vectors SO and BB, respectively. Also, create a character vector
Name containing the last names of these pitchers.
(b) Compute the strikeout-walk ratio by SO/BB and put these ratios in
the vector SO.BB.Ratio.
(c) By use of the command
SO.BB <- data.frame(Name, SO, BB, SO.BB.Ratio)
create a data frame SO.BB containing the names, strikeouts, walks,
and strikeout-walk ratios.
(d) By use of the subset function, find the pitchers who had a strikeout-
walk ratio exceeding 2.8.
(e) By use of the order function, sort the data frame by the number of
walks. Did the pitcher with the largest number of walks have a high
or low strikeout-walk ratio?
5. (Pitcher Strikeout/Walk Ratios)
(a) Read the Lahman “pitching.csv” data file into R, storing it in a data frame
Pitching.
(b) The following function computes the cumulative strikeouts, cumula-
tive walks, mid career year, and the total innings pitched (measured
in terms of outs) for a pitcher whose season statistics are stored in
the data frame d.
stats <- function(d){
  # d: season pitching rows for one pitcher; missing values are ignored
  c.SO <- sum(d$SO, na.rm=TRUE)
  c.BB <- sum(d$BB, na.rm=TRUE)
  c.IPouts <- sum(d$IPouts, na.rm=TRUE)
  c.midYear <- median(d$yearID, na.rm=TRUE)
  data.frame(SO=c.SO, BB=c.BB, IPouts=c.IPouts,
             midYear=c.midYear)
}
Using the function ddply (plyr package) together with the function
stats, find the career statistics for all pitchers in the pitching dataset.
Call this new data frame career.pitching.
(c) Use the merge function to merge the Pitching and career.pitching
data frames.
(d) Use the subset function to construct a new data frame career.10000
consisting of data for only those pitchers with at least 10,000 career
IPouts.
(e) For the pitchers with at least 10,000 career IPouts, construct a scat-
terplot of mid career year and ratio of strikeouts to walks. Comment
on the general pattern in this scatterplot.
3
Traditional Graphics
CONTENTS
3.1 Introduction
3.2 Factor Variable
3.2.1 A bar graph
3.2.2 Add axes labels and a title
3.2.3 Other graphs of a factor
3.3 Saving Graphs
3.4 Dot plots
3.5 Numeric Variable: Stripchart and Histogram
3.6 Two Numeric Variables
3.6.1 Scatterplot
3.6.2 Building a graph, step-by-step
3.7 A Numeric Variable and a Factor Variable
3.7.1 Parallel stripcharts
3.7.2 Parallel boxplots
3.8 Comparing Ruth, Aaron, Bonds, and A-Rod
3.8.1 Getting the data
3.8.2 Creating the player data frames
3.8.3 Constructing the graph
3.9 The 1998 Home Run Race
3.9.1 Getting the data
3.9.2 Extracting the variables
3.9.3 Constructing the graph
3.10 Further Reading
3.11 Exercises
3.1 Introduction
To illustrate basic methods for creating graphs in R in the graphics package,
consider all the career batting statistics for the current members of the Hall of
Fame. If we remove the pitchers’ batting statistics from the dataset, then we
have statistics for 147 non-pitchers. The data file “hofbatting.csv” contains
the career batting statistics for this group. We read this data file into R by
the read.csv function; the statistics are stored in a data frame named hof.
hof <- read.csv("hofbatting.csv")
The type of graph we use depends on the measurement scale of the variable.
There are two fundamental data types – measurement and categorical – which
are represented in R as numeric and factor variables. We initially describe
graphs for a single factor variable and a single numeric variable, and then
describe graphical displays helpful for understanding relationships between
the variables. Using the traditional graphics package of R, it is easy to modify
the attributes of a graph by adding labels and changing the style of plotting
symbols and lines. After describing the graphical methods, we describe the
process of creating graphs for two home run stories. The first graph compares
the home run career progress of four great sluggers in baseball history and
the second graph illustrates the famous home run race of Mark McGwire and
Sammy Sosa during the 1998 season.
[Figure 3.1: two bar graphs, panels (a) and (b), of the era frequencies; panel (b) labels the axes Era and Frequency]
FIGURE 3.1
Bar graphs of era of the Hall of Fame non-pitchers. The right graph adds axes
labels and a title to the basic plot.
constructs a pie chart. Figure 3.2 shows these alternative displays. A line graph
is helpful when there are a large number of categories of the factor. Although
pie charts are popular in displaying frequencies in the media, we prefer a bar
graph since it can be more difficult for a reader to visually compare the relative
sizes of slices of a pie chart than lengths of bars in a bar chart.
[Figure 3.2: panel (a) line graph of the era frequencies (0 to 40); panel (b) pie chart with slices 19th Century, Dead Ball, Lively Ball, Integration, Expansion, Free Agency, Long Ball]
FIGURE 3.2
Line graph and pie chart of the frequencies of era of the Hall of Fame non-
pitchers.
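The three displays of a factor can be compared side by side on a small example. The sketch below uses a made-up factor era (a hypothetical stand-in for a variable such as hof$Era) and the base graphics functions barplot, plot, and pie applied to its frequency table.

```r
# A small made-up factor standing in for an era variable
era <- factor(c("Dead Ball", "Lively Ball", "Lively Ball", "Expansion",
                "Expansion", "Expansion", "Long Ball"),
              levels = c("Dead Ball", "Lively Ball", "Expansion", "Long Ball"))
era.counts <- table(era)   # frequencies of each category

barplot(era.counts, xlab = "Era", ylab = "Frequency")  # bar graph
plot(era.counts)                                       # line graph
pie(era.counts)                                        # pie chart
```

Setting the levels explicitly controls the order in which the categories appear in all three graphs.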
allows one to “Save Plot as Image,” “Save Plot as PDF,” or “Copy Plot to the
Clipboard.” If one chooses the “Save Plot as Image” option, then by choosing
an option from a drop-down menu, one can save the graph in PNG, JPEG,
TIFF, BMP, metafile, clipboard, SVG, or EPS formats. The PNG format is
convenient for uploading to a web page and the EPS and PDF formats are
well-suited for use in a LaTeX document. The metafile and clipboard options
are useful for insertion of the graph into a Microsoft Word document.
Alternatively, plots can be saved by use of R functions typed in the Console
window. For example, suppose we wish to save the bar graph shown in Figure
3.1(b) in a graphics file of PNG format. We use the special png function where
the argument is the name of the saved graphics file. We follow this with the
R commands to produce the graph, and conclude with the dev.off function.
(Note that no graph will be displayed in this operation.)
png("bargraph.png")
barplot(table(hof$Era), xlab="Era", ylab="Frequency",
main="Era of the Nonpitching Hall of Famers")
dev.off()
RStudioGD
2
one will see command instructions for saving graphs in other graphics formats.
This method of saving graphs is especially useful if one wishes to save a number
of graphs in a single file. For example, if one types
pdf("graphs.pdf")
barplot(table(hof$Era))
plot(table(hof$Era))
dev.off()
RStudioGD
2
then the bar graph and the line graph will be saved together in the PDF file
“graphs.pdf.”
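A self-contained variation on this idea is sketched below. It writes to a temporary directory (an assumption made so the example does not clutter the working directory) and uses a small made-up table in place of table(hof$Era); it then checks that the file was actually written.

```r
# Write to the session's temporary directory (location chosen for illustration)
out.file <- file.path(tempdir(), "graphs.pdf")
counts <- table(c("A", "B", "B", "C", "C", "C"))   # made-up frequencies

pdf(out.file)        # open the PDF graphics device
barplot(counts)      # first page of the file
plot(counts)         # second page of the file
dev.off()            # close the device so the file is completed

file.exists(out.file)
```

Forgetting dev.off() is a common source of empty or unreadable graphics files, since the device keeps the file open until it is closed.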
The dot plot, shown in Figure 3.3, simply represents each frequency by an open
circle on a scale against the corresponding label. It is easy to compare the era
frequencies from this display.
[Figure 3.3: dot plot of Frequency (horizontal axis, 10 to 40) for the eras, listed from top to bottom: Long Ball, Free Agency, Expansion, Integration, Lively Ball, Dead Ball, 19th Century]
FIGURE 3.3
Dot plot of era of the Hall of Fame non-pitchers.
Dot plots can be used to display any collection of labeled numeric data.
Suppose we are interested in exploring the career OPS values for the Hall of
Fame players with at least 500 career home runs. We first use the subset
function to create a new data frame hof.500 consisting of the statistics of
the players with at least 500 career home runs. By use of the order function,
we order the rows of this data frame by the career OPS. We use the dotchart
function to construct a dot plot of the OPS values where the labels are the
names of the players (variable X in the data frame hof.500).
hof.500 <- subset(hof, HR >= 500)
hof.500 <- hof.500[order(hof.500$OPS), ]
dotchart(hof.500$OPS, labels=hof.500$X, xlab="OPS")
In this display (Figure 3.4), we see that Babe Ruth, Ted Williams, and Jimmie
Foxx stand out as the top OPS players in this 500-home run group.
[Figure 3.4: dot plot with OPS on the horizontal axis and player names as labels]
FIGURE 3.4
Dot plot of career OPS values for Hall of Famers with at least 500 career home
runs.
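The effect of sorting before calling dotchart can be seen on a toy example. The player labels and OPS values below are made up for illustration; only the column names X and OPS mirror the data frame described above.

```r
# Made-up labels and values (not actual Hall of Fame statistics)
toy <- data.frame(X = c("Player A", "Player B", "Player C", "Player D"),
                  OPS = c(0.905, 1.010, 0.870, 0.955),
                  stringsAsFactors = FALSE)

# Sort the rows by OPS so the dots rise steadily from bottom to top
toy <- toy[order(toy$OPS), ]
dotchart(toy$OPS, labels = toy$X, xlab = "OPS")
```

Because dotchart draws the first row at the bottom, sorting in increasing order places the largest value at the top of the display.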
of the OPS values. For example, is the distribution of OPS values symmetric,
or is the distribution right or left skewed? Also we are interested in learn-
ing about the typical or representative Hall-of-Fame OPS value, and how the
OPS values are spread out. Graphical displays provide a quick visual way of
studying distributions of collections of baseball statistics.
For a single numeric variable, two useful displays for visualizing a distribu-
tion are the stripchart or one-dimensional scatterplot, and the histogram. A
stripchart is basically a number line graph, where the values of the statistics
are plotted over a number line ranging over all possible values of the variable.
This graph is constructed in R by the stripchart function. To begin, we use
the windows function to open a new graphics window 7 inches wide and 3.5
inches tall (see footnote 1). This is done since we don’t want to use the
default 7 in. by 7 in. format. The only required argument in the stripchart
function is the vector of data to be graphed. The optional argument
method="jitter" indicates that the points are randomly placed in a band over
their values; this “jittering” method of plotting is helpful when you have
multiple plotting points with the same value. The pch=1 argument indicates
that the plotting symbol 1 (an open circle) is to be used (see footnote 2),
and the xlab argument indicates that the x axis is labeled by “Mid Career”.
windows(width=7, height=3.5)
stripchart(hof$MidCareer, method="jitter", pch=1,
xlab="Mid Career")
The resulting graph is shown in Figure 3.5. One interesting observation from
this graph is the presence of a gap between the seasons 1910 and 1920 with
no Hall of Famers represented.
A second graphical display for a numeric variable is a histogram where
the values are grouped into bins of equal width and the bin frequencies are
displayed as non-overlapping bars over the bins. A histogram is constructed
in R using the function hist. The only required input to hist is the vector
of mid careers hof$MidCareer. The xlab adds a label to the x axis and the
main="" argument removes the default title that is produced with hist.
hist(hof$MidCareer, xlab="Mid Career", main="")
The histogram of mid career values in Figure 3.6, as expected, resembles the
bar graph of the variable Era described in the previous section. One issue
in constructing a histogram is the choice of bins and the function hist will
typically make reasonable choices for the bins to produce a good display of the
data distribution. One can select one’s own bins in the function hist by use of
the argument breaks. For example, if one wanted to choose the alternative bin
endpoints 1880, 1900, 1920, 1940, 1960, 1980, 2000, then one could construct
the histogram by the following code (the figure is not displayed):
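One way to write such a call is sketched here. Simulated values stand in for hof$MidCareer, and the names mid.career and bins are illustrative; the endpoints are generated with seq rather than typed out.

```r
set.seed(2)
# Simulated stand-in for hof$MidCareer (147 values between 1880 and 2000)
mid.career <- round(runif(147, 1880, 2000))

# Bin endpoints 1880, 1900, ..., 2000
bins <- seq(1880, 2000, by = 20)
h <- hist(mid.career, xlab = "Mid Career", main = "", breaks = bins)

h$counts   # number of observations falling in each of the six bins
```

The breaks must cover the full range of the data; otherwise hist stops with an error.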
1 On a Macintosh computer, the quartz function is used to open up a new graphics
window.
2 See www.statmethods.net/advgraphs/parameters.html for a display of all the possible
[Figure 3.5: jittered stripchart with Mid Career on the horizontal axis]
FIGURE 3.5
One-dimensional scatterplot (stripchart) of mid career values of HOF non-
pitchers.
[Figure 3.6: histogram with Mid Career on the horizontal axis and Frequency (0 to 20) on the vertical axis]
FIGURE 3.6
Histogram of mid career values of HOF nonpitchers.
scatterplot (see Figure 3.7), we notice four unusual career OPS values, three
large values and one very small value, and we’d like to identify the players
with these extreme values. Identification of specific points is accomplished by
the identify function; the inputs are the two variables in the scatterplot, the
vector giving labels (here, names) for the points, and the number of points to
identify. When identify is executed, a hairline will appear when the mouse
is placed on the graph, and the user clicks the mouse near the extreme points.
Figure 3.7 shows the scatterplot with points identified.
with(hof, plot(MidCareer, OPS))
with(hof, lines(lowess(MidCareer, OPS, f=0.3)))
with(hof, identify(MidCareer, OPS, X, n=4))
What do we learn from Figure 3.7? The typical OPS of a Hall of Famer
has stayed pretty constant through the years. But there was an increase in the
OPS during the 1930s when Babe Ruth and Lou Gehrig were in their primes.
There has been a steady decline in the average OPS (among these Hall of
Famers) over the last 30 years. It is interesting to note that the variability of
the OPS values among these players seems small in recent seasons.
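The identify function needs an interactive graphics window. In a script, one possible alternative is to pick out the extreme points by computation and label them with the text function. The sketch below does this on simulated data; the data frame toy and its values are hypothetical, with column names chosen to mirror X, MidCareer, and OPS.

```r
set.seed(3)
# Simulated data (illustration only)
toy <- data.frame(X = paste("Player", 1:30),
                  MidCareer = round(runif(30, 1880, 2000)),
                  OPS = round(rnorm(30, mean = 0.85, sd = 0.08), 3),
                  stringsAsFactors = FALSE)

with(toy, plot(MidCareer, OPS))

# Row indices of the three largest and the single smallest OPS values
extreme <- c(order(toy$OPS, decreasing = TRUE)[1:3], which.min(toy$OPS))

# Label those points just above their plotting symbols
with(toy[extreme, ], text(MidCareer, OPS, labels = X, pos = 3, cex = 0.8))
```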
[Figure 3.7: scatterplot with MidCareer on the horizontal axis and OPS on the vertical axis (0.5 to 1.0), with a smoothing curve and four labeled points]
FIGURE 3.7
Scatterplot of OPS against mid career season for HOF nonpitchers.
[Figure 3.8: scatterplot with OBP on the horizontal axis and SLG on the vertical axis (0.3 to 0.7)]
FIGURE 3.8
Scatterplot of OBP against SLG for HOF members.
Looking at this figure, we see several problems with this display. First,
due to the one outlier in the bottom-left section of the graph, most of the
points fall in a relatively small region of the plotting region. Second, it may
be preferable to use an alternative plotting symbol such as a filled circle that
is more distinctive than the default open circle symbol. Last, the graph would
be easier to read if more descriptive labels were used for the two axes. A new
figure is plotted to incorporate these new ideas. By use of the xlim and ylim
arguments, we change the limits of the horizontal and vertical axes. By the
new choices of the limits (0.25, 0.50) for the horizontal and (0.28, 0.75) for
the vertical, we remove the outlier and allow for more space in the upper-left
section of the graph for labels. By use of the pch=19 argument, we change the
plotting symbol to a solid black circle. We use the xlab and ylab arguments to
replace OBP and SLG respectively with “On-Base Percentage” and “Slugging
Percentage.” The updated display is shown in Figure 3.9.
with(hof, plot(OBP, SLG, xlim=c(0.25, 0.50),
ylim=c(0.28, 0.75), pch=19,
xlab="On-Base Percentage",
ylab="Slugging Percentage"))
[Figure 3.9: scatterplot with Slugging Percentage on the vertical axis (0.3 to 0.7) and solid plotting symbols]
FIGURE 3.9
Scatterplot of OBP against SLG, changing axes limits and axes labels.
OPS = OBP + SLG. To evaluate hitters in our graph on the basis of OPS, it
would be helpful to draw constant values of OPS on the graph. If we represent
OBP and SLG by x and y, suppose we wish to draw a line where OPS = 0.7 or
where x+y = 0.7. Equivalently, we want to draw the function y = 0.7−x on the
graph; this is accomplished in R by the curve function where the argument
of the function is represented by the variable x. The add=TRUE argument
indicates that this function is to be drawn on the current graph. Similarly, we
apply the curve function three more times to draw lines on the graph where
OPS takes on the values 0.8, 0.9, and 1.0. The resulting display is shown in
Figure 3.10.
curve(.7 - x, add=TRUE)
curve(.8 - x, add=TRUE)
curve(.9 - x, add=TRUE)
curve(1.0 - x, add=TRUE)
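Since each of these curves is a straight line with slope −1 and intercept equal to the OPS value, the same lines could also be drawn in a loop with the abline function. The sketch below is one such alternative on an empty plotting region; it is offered as an illustration, and the name ops.levels is hypothetical.

```r
# Set up an empty plotting region with the axis limits used for the scatterplot
plot(0.3, 0.4, type = "n", xlim = c(0.25, 0.50), ylim = c(0.28, 0.75),
     xlab = "On-Base Percentage", ylab = "Slugging Percentage")

ops.levels <- c(0.7, 0.8, 0.9, 1.0)
for (ops in ops.levels) {
  abline(a = ops, b = -1)   # the line y = ops - x
}

# A hitter with OBP 0.3 sits on the OPS = 0.8 line when SLG is 0.5
ops.levels - 0.3
```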
[Figure 3.10: scatterplot with Slugging Percentage on the vertical axis (0.3 to 0.7) and lines of constant OPS]
FIGURE 3.10
Scatterplot of OBP against SLG, adding lines of constant values of OPS =
OBP + SLG.
In our final iteration, we add labels to the lines showing the constant values
of OPS, and we label the points corresponding to players having a lifetime
OPS exceeding one. Each of the line labels is added using the text
function – the three arguments are the x location and y location where the
text is to be drawn, and the string of text to be displayed.
text(.27, .42, "OPS = 0.7")
text(.27, .52, "OPS = 0.8")
text(.27, .62, "OPS = 0.9")
text(.27, .72, "OPS = 1.0")
To label the points for the best hitters by use of mouse clicks, the identify
function is used. The inputs are the x and y plotting variables, the vector of
point labels, and the number of points to label. The final graph is displayed
in Figure 3.11.
with(hof, identify(OBP, SLG, X, n=6))
[Figure 3.11: the scatterplot of Figure 3.10 with text labels added for the OPS lines and the identified players]
FIGURE 3.11
Scatterplot of OBP against SLG, adding text.
This final graph is very informative about the batting performance of these
Hall of Famers. We see that a large group of these batters have career OPS
values between 0.8 and 0.9, and only six players (Hank Greenberg, Rogers
Hornsby, Jimmie Foxx, Ted Williams, Lou Gehrig, and Babe Ruth) had career
OPS values exceeding 1.0. Points to the right of the major point cloud
correspond to players with strong skills in getting on base, but relatively weak
in advancing runners home. In contrast, the two points to the left of the major
point cloud correspond to hitters who are better in slugging than in reaching
base.
eras. Suppose one focuses on the home run rate defined by HR / AB for our
Hall of Fame players. We add a new variable HR.Rate to the data frame hof:
hof$HR.Rate <- with(hof, HR / AB)
[Figure 3.12: parallel stripcharts of HR.Rate (horizontal axis, 0.00 to 0.08) for the eras, listed from top to bottom: Long Ball, Free Agency, Expansion, Integration, Lively Ball, Dead Ball, 19th Century]
FIGURE 3.12
Stripcharts of home run rates of HOFers for each era.
The parallel boxplot display is shown in Figure 3.13. Each rectangle in the
display shows the location of the lower quartile, the median, and the upper
quartile, and lines are drawn to the extreme values. Unusual points (outliers)
that fall far from the rest of the distribution are indicated by open circle points.
This graph confirms the observations we made when we viewed the stripchart
display. Home run hitting was low in the first two eras and started to increase
in the Lively Ball era. It is interesting that the only “outlier” among these
Hall of Famers was Babe Ruth’s career home run rate of 0.085.
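Both parallel displays can be produced with the formula interface, in which a numeric variable is written to the left of ~ and the grouping factor to the right. The sketch below uses simulated rates for three made-up groups; the data frame toy and its values are hypothetical, and the argument choices are one reasonable option rather than a record of the calls behind Figures 3.12 and 3.13.

```r
set.seed(4)
# Simulated home run rates for three eras (illustration only)
toy <- data.frame(
  Era = factor(rep(c("Dead Ball", "Lively Ball", "Long Ball"), each = 20),
               levels = c("Dead Ball", "Lively Ball", "Long Ball")),
  HR.Rate = c(runif(20, 0.00, 0.02), runif(20, 0.02, 0.04),
              runif(20, 0.04, 0.07)))

# Parallel stripcharts: one jittered row of points per era
stripchart(HR.Rate ~ Era, data = toy, method = "jitter", pch = 1, las = 2)

# Parallel boxplots, drawn horizontally to match the stripchart layout
b <- boxplot(HR.Rate ~ Era, data = toy, horizontal = TRUE, las = 2,
             xlab = "HR Rate")
b$stats[3, ]   # the group medians drawn inside each box
```

The las=2 argument turns the axis labels perpendicular to the axes so the era names are written horizontally; long names may still need wider margins.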
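[Figure 3.13: parallel boxplots of HR Rate (horizontal axis, 0.00 to 0.08) for the eras, listed from top to bottom: Long Ball, Free Agency, Expansion, Integration, Lively Ball, Dead Ball, 19th Century; one outlier is marked in the Lively Ball era]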
FIGURE 3.13
Boxplots of home run rates of HOFers for each era.
We begin by reading in the Lahman master file, storing the file in the data
frame master.
master <- read.csv("Master.csv")
From the Master table, we wish to extract the player id and the birth year
for a particular player. Since we will be doing this operation for four players,
it is convenient to write a new function getinfo to get this information for an
arbitrary player of interest. The inputs to this function are the first and last
names of the player and the output will be a list (a special data structure in
R) giving the player’s id and birth year. Some comments can be made about
the R code in the function.
• The subset function is used to extract the row in the master data frame
matching the player’s first and last names; this row of this data frame is
stored in the variable playerline.
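To make the description concrete, here is a rough sketch of what a function of this kind might look like. It is a hypothetical reconstruction, named getinfo.sketch to keep it distinct from getinfo, and it runs on a two-row toy table whose second row is a made-up player. The column names follow the Lahman master file, and the age adjustment (assigning players born after June to the following year) is an assumed rule chosen for illustration.

```r
# Toy stand-in for the master table; the second row is a made-up player
toy.master <- data.frame(playerID = c("ruthba01", "doejo01"),
                         nameFirst = c("Babe", "John"),
                         nameLast = c("Ruth", "Doe"),
                         birthYear = c(1895, 1970),
                         birthMonth = c(2, 9),
                         stringsAsFactors = FALSE)

getinfo.sketch <- function(firstname, lastname, master = toy.master) {
  # Row of the master table matching the first and last names
  playerline <- subset(master,
                       nameFirst == firstname & nameLast == lastname)
  name.code <- as.character(playerline$playerID)
  # Assumed age rule: players born after June are assigned the next year
  byear <- with(playerline,
                ifelse(birthMonth <= 6, birthYear, birthYear + 1))
  list(name.code = name.code, byear = byear)
}

getinfo.sketch("Babe", "Ruth")
```

Returning a list lets the function hand back values of different types (a character id and a numeric year) under descriptive names.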
We use the function getinfo to get the information for the sluggers Babe
Ruth, Hank Aaron, Barry Bonds, and Alex Rodriguez and store the infor-
mation in variables. By displaying ruth.info, we see the player id and birth
year for Babe Ruth.
ruth.info <- getinfo("Babe", "Ruth")
aaron.info <- getinfo("Hank", "Aaron")
bonds.info <- getinfo("Barry", "Bonds")
arod.info <- getinfo("Alex", "Rodriguez")
ruth.info
$name.code
[1] "ruthba01"
$byear
[1] 1895
One of the variables in the batting data frame is playerID. To get the
batting data for Babe Ruth, we use the subset function to extract the rows
of the batting data frame where playerID is equal to “ruthba01”. We create a
new variable Age defined to be the season year minus the player’s birth year.
(Recall that we made a slight modification to the byear variable so that one
obtains a player’s correct age for a season.)
ruth.data <- subset(batting, playerID == ruth.info$name.code)
ruth.data$Age <- ruth.data$yearID - ruth.info$byear
We perform similar commands to get batting data frames for the sluggers
Hank Aaron, Barry Bonds, and Alex Rodriguez.
aaron.data <- subset(batting, playerID == aaron.info$name.code)
aaron.data$Age <- aaron.data$yearID - aaron.info$byear
bonds.data <- subset(batting, playerID == bonds.info$name.code)
bonds.data$Age <- bonds.data$yearID - bonds.info$byear
arod.data <- subset(batting, playerID == arod.info$name.code)
arod.data$Age <- arod.data$yearID - arod.info$byear
cumsum(c(1, 2, 3, 4))
[1] 1 3 6 10
We use the plot function to graph Ruth’s cumulative home run count
against his age. The arguments to plot indicate that a line graph will be drawn
(type="l") using a dotted line type (lty=3) of double-thickness (lwd=2). The
xlab and ylab arguments label the horizontal and vertical axes and the xlim
and ylim arguments give the limits of the two axes.
with(ruth.data, plot(Age, cumsum(HR), type="l", lty=3, lwd=2,
xlab="Age", ylab="Career Home Runs",
xlim=c(18, 45), ylim=c(0, 800)))
Using three applications of the lines function, three lines are added to the
current graph corresponding to the cumulative home runs of Aaron, Bonds,
and Rodriguez. Different line styles are applied by use of the lty argument so
we can distinguish the four lines of the graph. Using the legend function, a
legend is added to the graph connecting the line styles with the players. The
arguments to legend are the x and y coordinates of the location, a vector of
character strings to display, the corresponding vector of line styles (lty), and
the line width (lwd). Figure 3.14 displays the completed graph.
with(aaron.data, lines(Age, cumsum(HR), lty=2, lwd=2))
with(bonds.data, lines(Age, cumsum(HR), lty=1, lwd=2))
with(arod.data, lines(Age, cumsum(HR), lty=4, lwd=2))
legend(20, 700, legend=c("Bonds", "Aaron", "Ruth", "ARod"),
lty=1 : 4, lwd=2)
[Figure 3.14 plot area: line graph of Career Home Runs (0 to 800) against Age (20 to 45), with a legend for Bonds, Aaron, Ruth, and ARod.]
FIGURE 3.14
Career home runs by age for four great home run hitters in baseball history.
We use the function createdata twice, once on Sosa’s batting data and
once on McGwire’s batting data, obtaining the new data frames mac.hr and
sosa.hr. We display the first few lines (using the head function) of sosa.hr
to show the format of these new data frames.
[Figure 3.15 plot area: Home Runs in the Season (0 to 70, with the value 62 marked) against Date; final totals McGwire (70) and Sosa (66).]
FIGURE 3.15
Seasonal home runs for Mark McGwire and Sammy Sosa during the 1998 race.
3.11 Exercises
1. (Hall of Fame Pitching Dataset)
The data file “hofpitching.csv” contains the career pitching statistics for
all of the pitchers inducted in the Hall of Fame. This data file can be read
into R by means of the read.csv function.
hofpitching <- read.csv("hofpitching.csv")
(a) By use of the order function, order the rows of the data frame by
the value of WAR.Season.
(b) Construct a dot plot of the values of WAR.Season where the labels
are the pitcher names.
(c) Which two 1960+ pitchers stand out with respect to wins above re-
placement per season?
(a) Read the Lahman “Master.csv” and “batting.csv” data files into R.
(b) Use the getinfo function to obtain three data frames for the season batting
statistics for the great hitters Ty Cobb, Ted Williams, and Pete Rose.
(c) Add the variable Age to each data frame corresponding to the ages
of the three players.
(d) Using the plot function, construct a line graph of the cumulative hit
totals against age for Pete Rose.
(e) Using the lines function, overlay the cumulative hit totals for Cobb
and Williams.
(f) Write a short paragraph summarizing what you have learned about
the hitting pattern of these three players.
7. (Working with the Retrosheet Play-by-Play Dataset)
In Section 3.9, we used the Retrosheet play-by-play data to explore the
home run race between Mark McGwire and Sammy Sosa in the 1998 sea-
son. Another way to compare the patterns of home run hitting of the
two players is to compute the spacings, the number of plate appearances
between home runs.
(a) Following the work in Section 3.9, create the two data frames
mac.data and sosa.data containing the batting data for the two
players.
(b) Use the following R commands to restrict the two data frames to
the plays where a batting event occurred. (The relevant variable
BAT_EVENT_FL is either TRUE or FALSE.)
mac.data <- subset(mac.data, BAT_EVENT_FL == TRUE)
sosa.data <- subset(sosa.data, BAT_EVENT_FL == TRUE)
(c) For each data frame, create a new variable PA that numbers the plate
appearances 1, 2, ... (The function nrow gives the number of rows of
a data frame.)
mac.data$PA <- 1:nrow(mac.data)
sosa.data$PA <- 1:nrow(sosa.data)
(d) The following commands will return the numbers of the plate appear-
ances when the players hit home runs.
mac.HR.PA <- mac.data$PA[mac.data$EVENT_CD==23]
sosa.HR.PA <- sosa.data$PA[sosa.data$EVENT_CD==23]
(e) Using the R function diff, the following commands compute the
spacings between the occurrences of home runs.
mac.spacings <- diff(c(0, mac.HR.PA))
sosa.spacings <- diff(c(0, sosa.HR.PA))
(f) By use of the summary and hist functions on the vectors
mac.spacings and sosa.spacings, compare the home run spacings
of the two players.
4
The Relation Between Runs and Wins
CONTENTS
4.1 Introduction
4.2 The Teams Table in Lahman’s Database
4.3 Linear Regression
4.4 The Pythagorean Formula for Winning Percentage
4.5 The Exponent in the Pythagorean Formula
4.6 Good and Bad Predictions by the Pythagorean Formula
4.7 How Many Runs for a Win?
4.8 Further Reading
4.9 Exercises
4.1 Introduction
The goal of a baseball team is, just as in any other sport, to win games.
Similarly, the goal of the baseball analyst is to measure what happens on
the field in terms of wins. Answering a question such as “Who is the
better player between Brett Gardner and Prince Fielder?” becomes an easier
task if one succeeds in estimating how much Gardner’s speed and slick fielding
contribute to his team’s victories and how many wins can be attributed
to Prince’s powerful bat.
Victories are obtained by outscoring opponents, so the percentage of
wins obtained by a team over the course of a season is strongly related to
the number of runs it scores and allows. This chapter explores the relationship
between runs and wins. Understanding this relationship is a critical step
towards answering questions about players’ value. In fact, while it is impossible to
directly quantify the impact of players in terms of wins, it will be seen in the
following chapters that it is possible to measure their contributions in terms of
runs.
tail(myteams)
$$Wpct = a + b \times RD + \epsilon,$$
where a and b are unknown constants and $\epsilon$ is the error term which
captures all other factors influencing the dependent variable (Wpct).
This is a special case of a linear model fit using the lm function
from the stats package (which is installed and loaded in R by de-
fault). The most basic call to the function requires a formula, specified
as response ~ predictor1 + predictor2 + ..., data=dataset, in which
[Figure 4.1 plot area: scatterplot with winning percentage (0.3 to 0.7) on the vertical axis.]
FIGURE 4.1
Scatterplot of team run differential against team winning percentage for major
league teams from 2001 to 2011. A best-fitting line is overlaid on top of the
scatterplot.
linfit
Call:
lm(formula = Wpct ~ RD, data=myteams)
Coefficients:
(Intercept) RD
0.499992 0.000623
The fitted line in the plot of Figure 4.1 is obtained with the abline function
(see footnote 1) and the coef function for extracting the coefficients of the
model (linfit).
This formula tells us that a team with a run differential of zero (RD = 0)
will win half of its games (estimated intercept ≈ .500) which is reasonable. In
addition, a one-unit increase in run differential corresponds to an increase of
0.000623 in winning percentage. To give further insight into this relationship,
a team scoring 750 runs and allowing 750 runs is predicted to win half of its
games corresponding to 81 games in a typical MLB season of 162 games. In
contrast, a team scoring 760 runs and allowing 750 has a run differential of +10
and is predicted to have a winning percentage of 0.500 + 10 · 0.000623 ≈ 0.506.
A winning percentage of 0.506 in a 162-game schedule corresponds to 82 wins.
Thus an increase of 10 runs in the run differential of a team corresponds,
according to the straight-line model, to an additional win in the standings.
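The arithmetic in this paragraph can be retraced with a few lines of R. This is only a sketch: the coefficients are copied by hand from the printed output of linfit, and the helper name predicted_wins is our own, not part of the original code.

```r
# Coefficients copied from the printed linfit output above.
a <- 0.499992   # estimated intercept
b <- 0.000623   # estimated slope (winning percentage per run of differential)

# Hypothetical helper: predicted wins over a schedule of `games` games.
predicted_wins <- function(RD, games = 162) games * (a + b * RD)

# Predicted wins for run differentials of 0 and +10 (about 81 and 82).
predicted_wins(c(0, 10))

# Run differential needed for one extra win under the straight-line model
# (roughly ten runs).
runs_per_win <- 1 / (162 * b)
runs_per_win
```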
One concern is that predictions from this fitted line can fall outside
the range [0, 1]. For example, a hypothetical team that outscores its
opponents by a total of 805 runs would be predicted to win more than 100
percent of its games, which is impossible. However, since over 99 percent of
teams throughout major league baseball history have run differentials between
-350 and +350, the straight-line model is a reasonable approximation.
Once one has a fitted model, the function predict can be used to calculate
the predicted values from the model, while the function residuals computes
the difference between the response values and the fitted values (i.e., between
the actual and the estimated winning percentages).
myteams$linWpct <- predict(linfit)
myteams$linResiduals <- residuals(linfit)
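For readers without the Lahman files at hand, the relationship between predict, residuals, and the response can be checked on simulated data. In this sketch the data frame sim and the simulation parameters are invented for illustration; only lm, predict, and residuals are used as in the text.

```r
set.seed(1)

# Simulated stand-in for myteams: run differentials and winning percentages.
sim <- data.frame(RD = round(rnorm(200, mean = 0, sd = 100)))
sim$Wpct <- 0.5 + 0.0006 * sim$RD + rnorm(200, sd = 0.025)

simfit <- lm(Wpct ~ RD, data = sim)

# Fitted values and residuals, stored as in the text.
sim$linWpct <- predict(simfit)
sim$linResiduals <- residuals(simfit)

# Each residual is the actual value minus the fitted value.
all.equal(sim$Wpct - sim$linWpct, sim$linResiduals,
          check.attributes = FALSE)
```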
Figure 4.2 displays a plot of the residuals against the run differential using
the following code. The abline and text functions were introduced in
Chapter 3, while points is used to draw data points at specified coordinates
(in particular, the points and text functions are used to mark and label a
few anomalous data points).
plot(myteams$RD, myteams$linResiduals,
     xlab="run differential",
     ylab="residual")
abline(h=0, lty=3)
points(c(68, 88), c(.0749, -.0733), pch=19)
text(68, .0749, "LAA '08", pos=4, cex=.8)
text(88, -.0733, "CLE '06", pos=4, cex=.8)
1 In the abline call for Figure 4.1, rather than a single argument h or v for plotting a
horizontal or a vertical line, abline is supplied with two arguments a and b indicating the
intercept and the slope of the line to be drawn.
[Figure 4.2 plot area: scatterplot with residual (-0.05 to 0.05) on the vertical axis; the points LAA '08 and CLE '06 are labeled.]
FIGURE 4.2
Residuals versus run differential for the fitted linear model. Two large resid-
uals are labeled corresponding to the 2008 Los Angeles Angels and the 2006
Cleveland Indians.
The average value of the residuals for this model is equal to zero (up to
rounding error), which means that overestimates and underestimates of the
winning percentage balance out on average, or that the method for fitting the
model is unbiased. To estimate the average magnitude of the errors, one
first squares the residuals so that each error has a positive value, calculates
the mean of the squared residuals, and takes the square root of this mean
to get back to the original scale. The value so calculated is the root
mean square error, abbreviated as RMSE. (The square root function sqrt is
introduced.)
mean(myteams$linResiduals)
[1] -2.952603e-19
linRMSE <- sqrt(mean(myteams$linResiduals ^ 2))
linRMSE
[1] 0.02507176
Approximately two thirds of the residuals fall between −RMSE and
+RMSE, while about 95% of the residuals are between −2·RMSE and +2·RMSE
(see footnote 2). These statements can be confirmed with the following lines
of code. (The function abs computes the absolute value.)
nrow(subset(myteams, abs(linResiduals) < linRMSE)) /
nrow(myteams)
[1] 0.6757576
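The same check can be made for the band of two RMSE. Since the team data are not reproduced here, the following sketch uses simulated, normally distributed stand-in residuals; the vector res and its standard deviation of 0.025 are assumptions chosen to resemble the values in the text.

```r
set.seed(42)

# Stand-in residuals: normal errors on the scale of a winning percentage.
res <- rnorm(10000, mean = 0, sd = 0.025)

rmse <- sqrt(mean(res ^ 2))

# Share of residuals within one and within two RMSE of zero.
within1 <- mean(abs(res) < rmse)
within2 <- mean(abs(res) < 2 * rmse)
c(within1 = within1, within2 = within2)
```

With roughly normal residuals, the two proportions come out close to two thirds and 95%, which is the rule of thumb quoted above.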
by the linear model comes within four wins of the actual number of wins in two-thirds of
the cases, while for 19 out of 20 teams the difference is not higher than 8 wins.
One can use this formula to predict winning percentages by use of the following
R code.
myteams$pytWpct <- with(myteams, R ^ 2 / (R ^ 2 + RA ^ 2))
Here the residuals need to be calculated explicitly, but that’s not a hard task.
A new variable pytResiduals is defined that is the difference between the
actual and predicted winning percentages. The RMSE is computed for these
new predictions.
myteams$pytResiduals <- myteams$Wpct - myteams$pytWpct
sqrt(mean(myteams$pytResiduals ^ 2))
[1] 0.02545247
The RMSE calculated on the Pythagorean predictions is similar in value to
the one calculated with the linear predictions (it’s actually slightly higher for
the 2001-2011 data we have been using here). Thus using a more complex model
does not seem justified on accuracy alone. However, the Pythagorean expectation
has desirable properties missing in the linear model. These advantages can
be illustrated with two examples.
Suppose there exists a powerhouse team that scores an average of ten runs
per game, while allowing a close to average five runs per game. In a 162-
game schedule, this team would score 1620 runs, while allowing 810, for a run
differential of 810. Replacing RD with 810 in the linear equation, one obtains a
winning percentage of over 1, which is impossible. On the other hand, replacing
R and RA with 1620 and 810 respectively in the Pythagorean expectation, the
resulting winning percentage is equal to 0.8, a more reasonable prediction. A
second hypothetical team has pitchers who never allow runs, while the hitters
always manage to score the only run they need. Such a team will score 162 runs
in a season and win all of its games, but the linear equation would predict it to
be merely a .601 team. The Pythagorean formula, instead, correctly predicts
this team to win all of its games.
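Both hypothetical teams can be run through the two models in a few lines. This is a sketch only: the function names lin_wpct and pyt_wpct are ours, and the linear coefficients are copied from the linfit output shown earlier.

```r
# Linear model with the coefficients printed for linfit.
lin_wpct <- function(RD, a = 0.499992, b = 0.000623) a + b * RD

# Pythagorean expectation with exponent k (2 by default).
pyt_wpct <- function(R, RA, k = 2) R ^ k / (R ^ k + RA ^ k)

# Powerhouse team: 1620 runs scored, 810 allowed.
lin_wpct(1620 - 810)   # slightly above 1, an impossible value
pyt_wpct(1620, 810)    # 0.8

# Team that wins every game 1-0: 162 runs scored, none allowed.
lin_wpct(162 - 0)      # about .601
pyt_wpct(162, 0)       # 1
```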
While neither of the above examples is ever going to materialize, there are
some extreme situations in modern baseball history. For example, the 2001
Seattle Mariners had 116 wins and 46 losses for a +300 run differential and
the 2003 Detroit Tigers had a 43-119 record with a -337 run differential.
In these unlikely scenarios, the Pythagorean formula will give more sensible
winning percentage estimates.
Recall our statement at the end of the introductory section that the
runs-to-wins relationship is crucial in assessing the contribution of players to their
team’s wins. Once we estimate the number of runs players contribute to their
teams (as will be shown in the following chapters), runs-to-wins formulas
can be used to convert these run values to wins. One can now answer questions
like “How many wins would a lineup of nine Albert Pujols’ accumulate in
a season?” For these kinds of investigations, the scenarios in which the linear
formula breaks down are more likely to occur, thus highlighting the need for
a formula such as the Pythagorean expectation that gives reasonable predictions.
$$W\% = \frac{R^k}{R^k + RA^k}$$
With some algebra, the equation can be rewritten as follows:
$$\frac{W}{L} = \frac{R^k}{RA^k}$$
Taking the logarithm on both sides of the equation (using the function log),
one obtains the linear relationship
$$\log\left(\frac{W}{L}\right) = k \cdot \log\left(\frac{R}{RA}\right)$$
The value of k can now be estimated using linear regression, where the re-
sponse variable is log(W/L) and the predictor is log(R/RA). In the following
R code, we compute the logarithm of the ratio of wins to losses, the logarithm
of the ratio of runs to runs allowed, and fit a simple linear model with these
transformed variables. (In the call of the lm function, a model with a zero
intercept is indicated by a zero term on the right side of the formula.)
myteams$logWratio <- log(myteams$W / myteams$L)
myteams$logRratio <- log(myteams$R / myteams$RA)
pytFit <- lm(logWratio ~ 0 + logRratio, data=myteams)
pytFit
Call:
lm(formula = logWratio ~ 0 + logRratio, data=myteams)
Coefficients:
logRratio
1.903
The R output suggests a Pythagorean exponent of 1.903, which is significantly
smaller than the value 2.
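To see that a zero-intercept regression on the log ratios does recover the exponent, one can try it on simulated data where the true value is known. Everything in this sketch (the data frame toy, the exponent 1.9, the noise level) is an assumption made for illustration; it is not the team data used in the text.

```r
set.seed(2011)
n <- 300

# Simulated seasons: runs scored and allowed drawn from a plausible range.
toy <- data.frame(R = runif(n, 600, 900), RA = runif(n, 600, 900))

# Generate log(W/L) from a known exponent plus a little noise.
true_k <- 1.9
toy$logRratio <- log(toy$R / toy$RA)
toy$logWratio <- true_k * toy$logRratio + rnorm(n, sd = 0.05)

# Same model form as pytFit: no intercept, one slope.
toyFit <- lm(logWratio ~ 0 + logRratio, data = toy)
k_hat <- unname(coef(toyFit))
k_hat
```

The estimate k_hat should land close to the exponent used to generate the data.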
$$162 \times \frac{875^2}{875^2 + 737^2} \approx 95$$
The Red Sox actually won 90 games. The five games difference was quite
costly to the Red Sox, as they missed clinching the Wild Card (that went to
the Rays) in the final game (actually in the final minute) of the season. The
Pythagorean formula is more on target with the Tampa Bay Rays of the same
season, as the prediction of 92 (coming from their 707 runs scored versus 614
runs allowed) is just a bit higher than the actual 91.
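The two predictions quoted above can be reproduced directly from the run totals given in the text. The helper pyt_wins is a name introduced here for illustration.

```r
# Hypothetical helper: Pythagorean wins over a schedule of `games` games.
pyt_wins <- function(R, RA, games = 162, k = 2) {
  games * R ^ k / (R ^ k + RA ^ k)
}

round(pyt_wins(875, 737))   # 2011 Red Sox: 95 predicted wins
round(pyt_wins(707, 614))   # 2011 Rays: 92 predicted wins
```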
Why does the Pythagorean formula miss so badly on the Red Sox? In
other words, why did they win five fewer games than expected from their run
differential? Let’s have a look at their season game by game.
The file gl2011.txt (a game log file downloaded from Retrosheet; see Section
1.3.3) contains detailed information on every game played in the 2011 season.
The following commands load the file into R, select the lines pertaining to the
Red Sox games, and keep only the run-related columns.
gl2011 <- read.table("gl2011.txt", sep=",")
glheaders <- read.csv("retrosheet/game_log_header.csv")
names(gl2011) <- names(glheaders)
BOS2011 <- subset(gl2011, HomeTeam=="BOS" | VisitingTeam=="BOS")[
, c("VisitingTeam", "HomeTeam", "VisitorRunsScored",
"HomeRunsScore")]
head(BOS2011)
Using the results of every game featuring the Boston team, run differentials
(ScoreDiff) are calculated both for games won and lost and a column W is
added indicating whether the Red Sox won the game.
[Figure 4.3 plot area: scatterplot of Pythagorean residuals (-0.04 to 0.04) against one run wins (20 to 32); the points SFN and SDN are labeled.]
FIGURE 4.3
Scatterplot of number of one-run games won and Pythagorean residuals for
major league teams in 2011.
season, a team scoring ten more runs is likely to have one more win in the
standings. The number comes directly from the Pythagorean formula with
an exponent of two. Suppose a team scores an average of five runs per game,
while allowing the same number of runs. In a 162-game season, the team would
score (and allow) 810 runs. Inserting 810 in the Pythagorean formula one gets
(as expected) a perfect .500 expected winning percentage with 81 wins. If one
substitutes 810 with 820 for the number of runs scored in the formula, one
obtains a .506 winning percentage that translates to 82 wins in 162 games.
The same result is obtained for a team scoring 810 runs and allowing 800.
Ralph Caola has derived the number of extra runs needed to get an extra
win in a more rigorous way using calculus. He starts from the equivalent
representation of the Pythagorean formula.
$$W = G \cdot \frac{R^2}{R^2 + RA^2}$$
If one takes the partial derivative of the right side of the above equation with
respect to R, holding RA constant, the result is the incremental number of wins
per run scored. Taking the reciprocal of this result, one can derive the number
of runs needed for an extra win.
R is capable of calculating partial derivatives, thus we can retrace Ralph’s
steps in R by using the functions D and expression to take the partial
derivative of G · R^2 / (R^2 + RA^2) with respect to R.
D(expression(G * R ^ 2 / (R ^ 2 + RA ^ 2)), "R")
G * (2 * R)/(R^2 + RA^2) - G * R^2 * (2 * R)/(R^2 + RA^2)^2
Unfortunately R does not do the simplifying. The reader has the choice of
either doing the tedious work himself or believing that the final equation for
incremental runs per win (IR/W) is the following (see footnote 3):
$$IR/W = \frac{(R^2 + RA^2)^2}{2 \cdot G \cdot R \cdot RA^2}$$
If R and RA are expressed in runs per game, the G is removed from the above
formula.
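Readers who prefer not to do the algebra can at least check the simplified formula numerically. The sketch below evaluates the derivative returned by D at one set of values and compares its reciprocal with the closed form; the object names dW, vals, numeric_rpw, and closed_rpw are ours, and the values describe the .500 team with 810 runs scored and allowed used earlier.

```r
# Symbolic partial derivative of wins with respect to runs scored.
dW <- D(expression(G * R ^ 2 / (R ^ 2 + RA ^ 2)), "R")

# A team scoring and allowing 810 runs over 162 games.
vals <- list(G = 162, R = 810, RA = 810)

# Runs per extra win: reciprocal of the evaluated derivative.
numeric_rpw <- 1 / eval(dW, vals)

# The same quantity from the simplified closed-form expression.
closed_rpw <- with(vals, (R ^ 2 + RA ^ 2) ^ 2 / (2 * G * R * RA ^ 2))

c(numeric = numeric_rpw, closed = closed_rpw)
```

Both numbers equal ten for this team, which agrees with the rule of thumb of ten runs per win.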
Using this formula, one can compute the incremental runs needed per
one win for various runs scored/runs allowed scenarios. As a first step, a
function IR is created to calculate the incremental runs, according to Caola’s
formula; this function takes runs scored per game and runs allowed per game
as arguments.
IR <- function(RS=5, RA=5){
round((RS ^ 2 + RA ^ 2)^2 / (2 * RS * RA ^ 2), 1)
}
3 The formula is the result of algebraic simplification and taking the reciprocal.
This function is used to create a table for various runs scored/runs allowed
combinations. We perform this step by using the functions seq and
expand.grid. The seq function is used to create a vector containing a regular
sequence, specifying as arguments the start value, the end value, and the
increment value. Here seq creates a vector of values from 3 to 6 in increments
of 0.5. Then the expand.grid function is used to obtain a data frame
containing all the combinations of the elements of the supplied vectors. In the
following code the first and the final few lines of the new data frame IRtable
are displayed.
IRtable <- expand.grid(RS=seq(3, 6, .5), RA=seq(3, 6, .5))
rbind(head(IRtable), tail(IRtable))
RS RA
1 3.0 3
2 3.5 3
3 4.0 3
4 4.5 3
5 5.0 3
6 5.5 3
44 3.5 6
45 4.0 6
46 4.5 6
47 5.0 6
48 5.5 6
49 6.0 6
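A smaller grid makes the behavior of expand.grid easier to see. In this sketch the data frame small is a made-up two-by-two example; note that the first variable varies fastest, just as RS does in IRtable.

```r
# seq(3, 6, .5) produces seven values, so IRtable has 7 * 7 = 49 rows.
length(seq(3, 6, .5))

# A two-by-two grid: every combination of the supplied values.
small <- expand.grid(RS = c(4, 5), RA = c(4, 5))
small
```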
Finally, the incremental runs are calculated for the various scenarios. The
xtabs function in the second line of the following code is used to show the
results in a tabular form. The formula specified as the first argument has the
variable which populates the cells on the left side of the tilde character, while
on the right side the cross-classifying variables are separated by a + sign.
IRtable$IRW <- IR(IRtable$RS, IRtable$RA)
xtabs(IRW ~ RS + RA, data=IRtable)
RA
RS 3 3.5 4 4.5 5 5.5 6
3 6.0 6.1 6.5 7.0 7.7 8.5 9.4
3.5 7.2 7.0 7.1 7.5 7.9 8.5 9.2
4 8.7 8.1 8.0 8.1 8.4 8.8 9.4
4.5 10.6 9.6 9.1 9.0 9.1 9.4 9.8
5 12.8 11.3 10.5 10.1 10.0 10.1 10.3
5.5 15.6 13.4 12.2 11.4 11.1 11.0 11.1
6 18.8 15.8 14.1 13.0 12.4 12.1 12.0
Looking at the results we notice that the rule of ten is appropriate in typical
run scoring environments (4 to 5 runs per game). However, in very low scoring
environments (the upper-left corner of the table), a lower number of runs is
needed to gain an extra win; on the other hand, in high scoring environments
(lower-right corner), one needs a larger number of runs for an added win.
4.9 Exercises
1. (Relationship Between Winning Percentage and Run Differential
Across Decades)
Section 4.3 used a simple linear model to predict a team’s winning per-
centage based on its run differential. This model was fit using team data
since the 2001 season.
(a) Refit this linear model using data from the seasons 1961-1970, the
seasons 1971-1980, the seasons 1981-1990, and the seasons 1991-2000.
(b) Compare across the five decades the predicted winning percentage
for a team with a run differential of 10 runs.
2. (Pythagorean Residuals for Poor and Great Teams in the 19th
Century)
As baseball was evolving into its ultimate form, nineteenth century leagues
often featured abysmal teams that did not even succeed in finishing their
season, as well as some dominant clubs.
(a) Fit a Pythagorean formula model to the run-differential, win-loss data
for teams who played in the 19th century.
(b) By inspecting the residual plot of your fitted model from (a), did the
great and poor teams in the 19th century do better or worse than
one would expect on the basis of their run differentials?
CONTENTS
5.1 The Runs Expectancy Matrix
5.2 Runs Scored in the Remainder of the Inning
5.3 Creating the Matrix
5.4 Measuring Success of a Batting Play
5.5 Albert Pujols
5.6 Opportunity and Success for All Hitters
5.7 Position in the Batting Lineup
5.8 Run Values of Different Base Hits
5.8.1 Value of a home run
5.8.2 Value of a single
5.9 Value of Base Stealing
5.10 Further Reading and Software
5.11 Exercises
we explore the value of a home run and a single. This chapter is concluded by
using the runs expectancy matrix and runs values to understand the benefit
of stealing a base and the cost of being caught stealing.
To begin, a variable RUNS is created that is equal to the sum of the visitor’s
score (AWAY_SCORE_CT) and the home team’s score (HOME_SCORE_CT) at each
plate appearance.
data2011$RUNS <- with(data2011, AWAY_SCORE_CT + HOME_SCORE_CT)
A new variable HALF.INNING is also created, using the paste function, com-
bining the game id, the inning, and the team at bat.
data2011$HALF.INNING <- with(data2011,
paste(GAME_ID, INN_CT, BAT_HOME_ID))
The maximum total score in a half-inning is the sum of the initial total runs
and the runs scored. A new data frame MAX is created, defining a new variable
x equal to the maximum runs scored. The merge function is used to merge this
information with the data frame data2011 and create the new maximum total
score variable MAX.RUNS. (The function ncol gives the number of columns of
a data frame.)
MAX <- data.frame(HALF.INNING=RUNS.SCORED.START$HALF.INNING)
MAX$x <- RUNS.SCORED.INNING$x + RUNS.SCORED.START$x
data2011 <- merge(data2011, MAX)
N <- ncol(data2011)
names(data2011)[N] <- "MAX.RUNS"
Now the runs scored in the remainder of the inning (new variable RUNS.ROI)
can be computed by taking the difference of MAX.RUNS and RUNS.
data2011$RUNS.ROI <- with(data2011, MAX.RUNS - RUNS)
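The logic of the runs-in-remainder-of-inning calculation can be traced on a toy example. The sketch below is not the code used in this section: it uses the base function ave in place of the aggregate-and-merge steps, the data frame plays and the column RUNS.SCORED are made up for illustration, and the plays are assumed to be listed in chronological order within each half-inning.

```r
# Toy play-by-play data for two half-innings (made-up values).
plays <- data.frame(
  HALF.INNING = c("G1 1 0", "G1 1 0", "G1 1 0", "G1 1 1", "G1 1 1"),
  RUNS        = c(0, 0, 2, 2, 2),   # total score before each play
  RUNS.SCORED = c(0, 2, 0, 0, 1)    # runs scored on each play
)

# Total score at the start of each half-inning (first play of the group).
start <- ave(plays$RUNS, plays$HALF.INNING, FUN = function(x) x[1])

# Runs scored during each half-inning.
scored <- ave(plays$RUNS.SCORED, plays$HALF.INNING, FUN = sum)

# Maximum total score, then runs scored in the remainder of the inning.
plays$MAX.RUNS <- start + scored
plays$RUNS.ROI <- plays$MAX.RUNS - plays$RUNS
plays
```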
Currently, there are three variables BASE1_RUN_ID, BASE2_RUN_ID, and
BASE3_RUN_ID containing the player codes of the baserunners (if any) who
are respectively on first, second, or third base. Three new binary variables
RUNNER1, RUNNER2, and RUNNER3 are created that are 1 if the corresponding
base is occupied and 0 if it is empty. (The as.character function
converts a factor variable to a character variable.)
RUNNER1 <- ifelse(as.character(data2011[ ,"BASE1_RUN_ID"]) == "", 0, 1)
RUNNER2 <- ifelse(as.character(data2011[ ,"BASE2_RUN_ID"]) == "", 0, 1)
RUNNER3 <- ifelse(as.character(data2011[ ,"BASE3_RUN_ID"]) == "", 0, 1)
One particular state value would be “011 2” which indicates that there are
currently runners on second and third base with two outs. A second state
value “100 0” indicates there is a runner at first with no outs.
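State strings of this form can be built with paste. The function below is a minimal sketch under the assumption that a state is the three runner indicators pasted together, followed by a space and the number of outs; the name make.state is hypothetical and is not necessarily how get.state is written.

```r
# Hypothetical helper: combine runner indicators and outs into a state label.
make.state <- function(runner1, runner2, runner3, outs) {
  runners <- paste(runner1, runner2, runner3, sep = "")
  paste(runners, outs)
}

make.state(0, 1, 1, 2)   # runners on second and third, two outs
make.state(1, 0, 0, 0)   # runner on first, no outs
```

Because paste is vectorized, the same function works on whole columns of indicators at once.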
We want to only consider plays in our data frame where there is a change
in the runners on base, number of outs, or the runs scored. Three new variables
NRUNNER1, NRUNNER2, NRUNNER3 are created which indicate, respectively, if first
base, second base, and third base are occupied after the play. (The function
as.numeric converts a logical variable to a numeric variable.) The variable
NOUTS is the number of outs after the play, and RUNS.SCORED is the number of
runs scored on the play. Again the get.state function is used to create the
variable NEW.STATE giving the runners on each base and the number of outs
after the play.
NRUNNER1 <- with(data2011, as.numeric(RUN1_DEST_ID == 1 |
BAT_DEST_ID == 1))
NRUNNER2 <- with(data2011, as.numeric(RUN1_DEST_ID == 2 |
RUN2_DEST_ID == 2 | BAT_DEST_ID==2))
NRUNNER3 <- with(data2011, as.numeric(RUN1_DEST_ID == 3 |
RUN2_DEST_ID == 3 | RUN3_DEST_ID == 3 | BAT_DEST_ID == 3))
NOUTS <- with(data2011, OUTS_CT + EVENT_OUTS_CT)
data2011$NEW.STATE <- get.state(NRUNNER1, NRUNNER2, NRUNNER3, NOUTS)
Before the runs expectancies are computed, one final adjustment is
necessary. The play-by-play database includes scoring information for all
half-innings during the 2011 season, including partial half-innings at the end of
the game where the winning run is scored with fewer than three outs. In our
work, we want to use only complete half-innings where three outs are
recorded. The ddply function in the plyr package is applied to compute the
number of outs for each half-inning, and the merge function is used to add
a new variable Outs.Inning to the data frame. The subset function is used
to extract the data from the half-innings in data2011 with exactly three outs;
the new data frame is named data2011C. (By removing the incomplete
innings, one is introducing a small bias since these innings are not complete
due to the scoring of at least one run.)
library(plyr)
data.outs <- ddply(data2011, .(HALF.INNING), summarize,
Outs.Inning=sum(EVENT_OUTS_CT))
data2011 <- merge(data2011, data.outs)
data2011C <- subset(data2011, Outs.Inning == 3)
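If the plyr package is not available, the same filtering step can be sketched in base R. This is an alternative to the ddply and merge calls above, not the method of the text; the data frame plays is a made-up example with one complete and one incomplete half-inning.

```r
# Toy data: half-inning "A" records three outs, "B" only two.
plays <- data.frame(
  HALF.INNING   = c("A", "A", "A", "B", "B"),
  EVENT_OUTS_CT = c(1, 2, 0, 1, 1)
)

# ave attaches the per-group total to every row, so no merge is needed.
plays$Outs.Inning <- ave(plays$EVENT_OUTS_CT, plays$HALF.INNING, FUN = sum)

# Keep only plays from half-innings with exactly three outs.
complete <- subset(plays, Outs.Inning == 3)
complete
```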
The expected number of runs scored in the remainder of the inning (the
runs expectancy) is computed for each of the 24 bases/outs situations by use
of the aggregate function, grouping by STATE with the mean function.
RUNS <- with(data2011C, aggregate(RUNS.ROI, list(STATE), mean))
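The shape of the object returned by aggregate is worth a quick look before it is rearranged. In this sketch the data frame toy holds made-up values for two states; the column names Group.1 and x are the defaults that aggregate assigns when the grouping list is unnamed.

```r
# Toy data: runs scored in the remainder of the inning for two states.
toy <- data.frame(
  STATE    = c("000 0", "000 0", "100 1", "100 1", "100 1"),
  RUNS.ROI = c(0, 1, 0, 0, 3)
)

# Mean of RUNS.ROI within each state, as in the call above.
re <- with(toy, aggregate(RUNS.ROI, list(STATE), mean))
re
```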
To see how the run expectancy values have changed over time, the 2002
season values as reported in Albert and Bennett (2003) are collected in the
matrix RUNS.2002. The 2011 and 2002 expectancies are displayed side-by-side
for comparison purposes.
RUNS.2002 <- matrix(c(.51, 1.40, 1.14, 1.96, .90, 1.84, 1.51, 2.33,
.27, .94, .68, 1.36, .54, 1.18, .94, 1.51,
.10, .36, .32, .63, .23, .52, .45, .78), 8, 3)
dimnames(RUNS.2002) <- dimnames(RUNS.out)
cbind(RUNS.out, RUNS.2002)
0 outs 1 out 2 outs 0 outs 1 out 2 outs
000 0.47 0.25 0.10 0.51 0.27 0.10
001 1.45 0.94 0.32 1.40 0.94 0.36
010 1.06 0.65 0.31 1.14 0.68 0.32
011 1.93 1.34 0.54 1.96 1.36 0.63
100 0.84 0.50 0.22 0.90 0.54 0.23
101 1.75 1.15 0.49 1.84 1.18 0.52
110 1.41 0.87 0.42 1.51 0.94 0.45
111 2.17 1.47 0.76 2.33 1.51 0.78
It is somewhat remarkable how little these run expectancy values have changed
over the recent history of baseball. This indicates that there has been little
change in the average run scoring tendencies of MLB teams between 2002
and 2011.
We wish to consider only the batting plays where Albert was the hitter, so
the subset function is used to select the rows where the batting flag (variable
BAT_EVENT_FL) is true (see footnote 1).
albert <- subset(albert, BAT_EVENT_FL == TRUE)
How did Albert do on his first two plate appearances this season? To
answer this, we display the first two rows of the data frame albert, showing
the original state, new state, and run value variables:
albert[1:2, c("STATE", "NEW.STATE", "RUNS.VALUE")]
STATE NEW.STATE RUNS.VALUE
6556 100 1 000 3 -0.4960492
6574 001 2 001 3 -0.3173913
On his first plate appearance, there was a runner on first with one out. The
outcome of this plate appearance was three outs, indicating that Albert hit
into a double-play, and the runs value for this play was −0.496. On his second
plate appearance, there was a runner on third with two outs. Evidently Albert
got out (the final state had three outs) and the run value was −0.317. Based
on the runs values of these first plate appearances, Albert didn’t have a very
good start to the 2011 season.
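These two run values can be reproduced approximately by hand. The sketch below
assumes the usual definition of a run value, namely the run expectancy of the
new state minus the run expectancy of the starting state plus the runs scored
on the play, and uses the rounded 2011 expectancies shown earlier. The helper
run_value is a hypothetical name introduced only for this illustration.

```r
# Hypothetical helper: run value of a play (assumed definition, see lead-in)
run_value <- function(runs_start, runs_end, runs_scored) {
  runs_end - runs_start + runs_scored
}

# First PA: runner on first, one out (about 0.50 runs) to three outs (0 runs)
pa1 <- run_value(runs_start = 0.50, runs_end = 0, runs_scored = 0)

# Second PA: runner on third, two outs (about 0.32 runs) to three outs (0 runs)
pa2 <- run_value(runs_start = 0.32, runs_end = 0, runs_scored = 0)

c(pa1, pa2)
```

The small differences from the displayed values of −0.496 and −0.317 come from
using run expectancies rounded to two decimal places.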
When one evaluates the run values for any player, there are two primary
questions. First, we need to understand the player’s opportunities for produc-
ing runs. What were the runner/outs situations for the player’s plate appear-
ances? Second, what did the batter do with these opportunities to score runs?
1 The variable BAT_EVENT_FL distinguishes batting events from non-batting events such as
We see that Albert generally was batting with the bases empty (000) or with
only a runner on first (100). Most of the time, Albert was batting with no
runners in scoring position.
How did Albert perform with these opportunities? Using the following R
code, we construct a stripchart (using the stripchart function) that shows
the runs values for all plate appearances organized by the runners state. (See
Figure 5.1.) A horizontal line at the value zero is added to the graph – points
above the line (below the line) correspond to positive (negative) contributions.
with(albert, stripchart(RUNS.VALUE ~ RUNNERS, vertical=TRUE, jitter=0.2,
xlab="RUNNERS", method="jitter", pch=1, cex=0.8))
abline(h=0)
There are many duplicate runs values, so we jitter the points (using the
method="jitter" argument) to better show the density of runs values. When
the bases were empty (000), the range of possible runs values was relatively
small. For this state, the large cluster of points at a negative runs value corre-
sponds to the many occurrences when Albert got an out with the bases empty.
The cluster of points at (000) at the value 1 corresponds to Albert’s home
runs with the bases empty. (A home run with the bases empty will not change
the bases/outs state and the value of this play is exactly one run.) For other
situations, say the bases-loaded situation (111), there is much variation in the
runs values. For one plate appearance, the state moved from 111 1 to 000
3, indicating that Albert hit into an inning-ending double play with the bases
loaded with a runs value of −1.41. In contrast, Albert did hit a home run with
the bases loaded with no outs and the runs value of this outcome was 2.38.
(The runs value of a grand slam is not 4 since the run potential of the end
state of bases empty is much smaller than the run potential of a bases-loaded
state.)
To understand Albert’s total run production for the 2011 season, the
aggregate function together with the sum and length functions can be used
to compute the number of opportunities and sum of runs values for each of
the runners situations.
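A minimal sketch of this kind of summary is given below. It uses a small
made-up data frame (toy) rather than the actual albert data, and the column
names Runners and N in the result are assumptions chosen for illustration.

```r
# Toy plate-appearance data: runner state and run value for five PAs
toy <- data.frame(
  RUNNERS = c("000", "000", "100", "100", "110"),
  RUNS.VALUE = c(-0.25, 1.00, -0.50, 0.40, -0.90)
)

# Total run value and number of PAs for each runner state
runs_sum <- aggregate(toy$RUNS.VALUE, list(toy$RUNNERS), sum)
runs_n <- aggregate(toy$RUNS.VALUE, list(toy$RUNNERS), length)

# aggregate() names the grouping column Group.1, so merge on that column
A_toy <- merge(runs_sum, runs_n, by = "Group.1")
names(A_toy) <- c("Runners", "RUNS", "N")
A_toy

# Total run value over all situations
sum(A_toy$RUNS)
```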
FIGURE 5.1
Dotplot of runs values of Albert Pujols for all 2011 plate appearances as a
function of the runners state. The points have been jittered since there are
many plate appearances with identical runs values.
[Vertical axis: RUNS.VALUE; horizontal axis: RUNNERS.]
We see, for example, that Albert came to bat with the bases empty 354
times, and his total runs value contribution over these 354 PAs was 10.76. Albert
didn’t do particularly well with runners in scoring position. For example, there
were 36 PAs where he came to bat with runners on first and second, and
his net contribution in runs for this situation was −4.56. Albert’s total runs
contribution for the 2011 season can be computed by summing the RUNS column
of this data frame.
sum(A$RUNS)
[1] 27.25679
It is not surprising that Albert has a positive total contribution in his PAs in
2011, but it is difficult to understand the size of 27 runs unless this value is
compared with the contribution of other players. In the next section, we will
see how Albert compares to all hitters in the 2011 season.
It is difficult to compare the total run values of two players on face value,
since they have different opportunities to create runs for their teams. One
player in the middle of the batting order may come to bat many times when
there are runners in scoring position and good opportunities to create runs.
Other players towards the bottom of the batting order may not get the same
opportunities to bat with runners on base. One can measure a player’s
opportunity to create runs by the sum of the runs potential of the starting
state (variable RUNS.STATE) over all of his plate appearances. We can summarize
a player’s batting performance in a season by the total number of plate
appearances, the sum of the runs potentials, and the sum of the runs values.
The R function aggregate is helpful in obtaining these summaries. In the
following R code, the data frame runs.pa contains the number of plate
appearances for all batters in the 2011 season, the data frame runs.sums contains
the total runs value for these players, and the data frame runs.start contains
the total starting runs potential for the players. Using two applications of the
merge function, we merge the three data frames, creating the data frame runs. A
row of this data frame will contain the number of plate appearances, the total
run value, and the total run potential for a particular player.
runs.sums <- aggregate(data2011b$RUNS.VALUE, list(data2011b$BAT_ID), sum)
runs.pa <- aggregate(data2011b$RUNS.VALUE, list(data2011b$BAT_ID), length)
runs.start <- aggregate(data2011b$RUNS.STATE, list(data2011b$BAT_ID), sum)
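The two merge calls described above are not shown in this listing. The sketch
below illustrates one way they could look, using a small made-up data frame
(toy) in place of data2011b. The column names Batter, Runs, PA, and Runs.Start
match the output displayed later for runs400; everything else is an assumption.

```r
# Toy play-by-play data for two batters
toy <- data.frame(
  BAT_ID = c("a001", "a001", "b002", "b002", "b002"),
  RUNS.VALUE = c(0.5, -0.2, 1.0, -0.3, -0.3),
  RUNS.STATE = c(0.47, 0.25, 0.84, 0.10, 0.50)
)

sums <- aggregate(toy$RUNS.VALUE, list(toy$BAT_ID), sum)     # total run value
pa <- aggregate(toy$RUNS.VALUE, list(toy$BAT_ID), length)    # number of PAs
start <- aggregate(toy$RUNS.STATE, list(toy$BAT_ID), sum)    # total run potential

# Two merges on the grouping column that aggregate() names Group.1
runs_toy <- merge(sums, pa, by = "Group.1")
runs_toy <- merge(runs_toy, start, by = "Group.1")
names(runs_toy) <- c("Batter", "Runs", "PA", "Runs.Start")
runs_toy
```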
The data frame runs contains batting data for both pitchers and nonpitchers.
It seems reasonable to restrict attention to nonpitchers, since pitchers and
nonpitchers have very different batting abilities. We also limit our focus to
the players who are primarily starters on their teams. One can remove pitchers
and nonstarters by focusing on batters with at least 400 plate appearances. A
new data frame runs400 is created by an application of the subset function.
There are 203 players in this data frame; we display the first few rows by use
of the head function.
runs400 <- subset(runs, PA >= 400)
head(runs400)
Batter Runs PA Runs.Start
1 abreb001 9.695694 585 251.6928
15 andir001 -8.511571 511 246.0340
16 andre001 4.847767 665 323.6004
18 ankir001 -4.326331 415 183.3090
19 arenj001 6.358087 486 226.9625
24 avila001 32.125231 551 266.5247
From viewing Figure 5.2, we see that batters with larger values of Runs.Start
tend to have larger runs contributions. But there is a wide spread in the
runs values for these players. In the group of players who have Runs.Start
values between 300 and 350, four of these players actually have negative runs
contributions and other players created over 60 runs in the 2011 season.
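The code that draws Figure 5.2 is not shown in this listing. As a rough sketch
of how such a display could be built, the code below plots simulated values
(the data frame toy is made up) and adds a smoother with the lowess function.
The choice of lowess for the smoothing curve is an assumption; the caption
states only that a smoothing curve is added.

```r
# Simulated stand-in for runs400: run potential and run value for 50 players
set.seed(1)
toy <- data.frame(Runs.Start = runif(50, 180, 380))
toy$Runs <- 0.1 * (toy$Runs.Start - 250) + rnorm(50, sd = 15)

# Scatterplot of run value against run potential
with(toy, plot(Runs.Start, Runs))

# Smoothing curve (assumed here to be a lowess fit) and a reference line at zero
fit <- with(toy, lowess(Runs.Start, Runs))
lines(fit)
abline(h = 0)
```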
From the graph, we see that only a limited number of players created
more than 40 runs for their teams. Who are these players? In the R code, we
create a new data frame runs400.top containing the runs statistics for only
the players who created at least 40 runs. For labeling purposes, we would
like to obtain the last names of the players available on the “roster2011.csv”
data file. We read in this file using the read.csv function and use the merge
function to merge the roster information with the data frame runs400.top.
By use of the text function, point labels are added to the previous scatterplot
for these outstanding hitters. (See Figure 5.2.)
runs400.top <- subset(runs400, Runs >= 40)
roster2011 <- read.csv("roster2011.csv")
runs400.top <- merge(runs400.top,
roster2011, by.x="Batter", by.y="Player.ID")
with(runs400.top, text(Runs.Start, Runs, Last.Name, pos=1))
From this figure, we learn that the best hitters using the runs criterion are
Miguel Cabrera (71.11), Jose Bautista (67.37), Prince Fielder (60.82), Joey
Votto (60.65), and Matt Kemp (59.64). There is an interesting outlier in this
figure – Mike Napoli created over 40 runs for his team despite only having a
Runs.Start value close to 200. Napoli was a very productive batter for his
team given his opportunities to produce runs.
In the following R code, the players’ run opportunities are plotted against
their runs values. By use of the type="n" option, the axes are drawn but points
are not plotted. Instead, we use the text function with labels contained in
the vector position to display the batting positions as plotting points. (See
Figure 5.3.)
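The plotting code itself is not shown in this listing. A minimal sketch of the
technique, using a small made-up data frame (toy) with a hypothetical position
column, is given below.

```r
# Toy data: run potential, run value, and usual batting position for four players
toy <- data.frame(
  Runs.Start = c(210, 260, 300, 340),
  Runs = c(41, -5, 12, 60),
  position = c(6, 9, 1, 3)
)

# type = "n" sets up the axes without drawing any points
with(toy, plot(Runs.Start, Runs, type = "n"))

# The batting positions are then drawn as the plotting symbols
with(toy, text(Runs.Start, Runs, position))
```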
FIGURE 5.2
Scatterplot of total run value against the runs potential for all players in the
2011 season with at least 400 plate appearances. A smoothing curve is added
to the scatterplot, showing that players who had more run potential tend
to have large run values. The players with a total runs value at least 40 are
labeled.
[Vertical axis: Runs. Labeled players: Cabrera, Bautista, Fielder, Votto, Kemp,
Braun, Ellsbury, Berkman, Gonzalez, Napoli, Martinez, Cano, Granderson.]
From this figure, we better understand the relationship between batting
position, run opportunities, and run values. The best hitters, the ones who
create a large number of runs, generally bat third, fourth, and fifth in the
batting order. The number of runs created by the leadoff (first) and second
batters in the lineup is much smaller than the number created by the best
hitters in the middle (third and fourth positions) of the lineup. There are some
exceptions to this general pattern of batting positions. Mike Napoli, the unusual
hitter who created over 40 runs with a run potential of only about 200, bats
only sixth in the lineup. Also, there are many cleanup hitters displayed who
have mediocre values of runs created.
FIGURE 5.3
Scatterplot of total run value against the runs potential for all players in the
2011 season with at least 400 plate appearances. The points are labeled by
the position in the batting lineup and the large point corresponds to Albert
Pujols.
[Vertical axis: Runs; horizontal axis: Runs.Start.]
How does Albert Pujols, with his total runs value of 27.3, compare with the
group of hitters with at least 400 plate appearances? By use of the subset
function, we find Pujols’ data from the runs400 data frame. Using the points
function, we display Albert’s (Runs.Start, Runs) value by a large solid dot.
In this particular season (2011), Albert was one of the better hitters in terms
of creating runs for his team.
AP <- subset(runs400, Batter == albert.id)
points(AP$Runs.Start, AP$Runs, pch=19, cex=3)
What are the runners/outs states for the home runs hit during the 2011
season? We answer this question by use of the table function.
table(d.homerun$STATE)
000 0 000 1 000 2 001 0 001 1 001 2 010 0 010 1 010 2 011 0 011 1
1226 812 630 15 50 54 57 94 144 17 28
011 2 100 0 100 1 100 2 101 0 101 1 101 2 110 0 110 1 110 2 111 0
32 262 301 286 32 50 61 60 115 128 15
111 1 111 2
42 41
By use of the prop.table function, the relative frequencies are computed and
the round function is used to round the values to three decimal places.
round(prop.table(table(d.homerun$STATE)), 3)
000 0 000 1 000 2 001 0 001 1 001 2 010 0 010 1 010 2 011 0 011 1
0.269 0.178 0.138 0.003 0.011 0.012 0.013 0.021 0.032 0.004 0.006
011 2 100 0 100 1 100 2 101 0 101 1 101 2 110 0 110 1 110 2 111 0
0.007 0.058 0.066 0.063 0.007 0.011 0.013 0.013 0.025 0.028 0.003
111 1 111 2
0.009 0.009
We see from this table that the fraction of home runs hit with the bases empty
is 0.269 + 0.178 + 0.138 = 0.585. So over half of the home runs are hit with
no runners on base.
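The same fraction can be computed from the raw counts rather than from the
rounded proportions. The sketch below re-enters the frequency table shown above
as a named vector (hr_counts is a hypothetical name) and sums the entries whose
runner state is 000.

```r
# Home run counts by starting state, copied from the frequency table above
states <- c("000", "001", "010", "011", "100", "101", "110", "111")
hr_counts <- setNames(
  c(1226, 812, 630,  15,  50,  54,  57,  94, 144,  17,  28,  32,
     262, 301, 286,  32,  50,  61,  60, 115, 128,  15,  42,  41),
  paste(rep(states, each = 3), 0:2)
)

# Logical index of the bases-empty states ("000 0", "000 1", "000 2")
bases_empty <- substr(names(hr_counts), 1, 3) == "000"

# Fraction of all home runs hit with the bases empty
frac_empty <- sum(hr_counts[bases_empty]) / sum(hr_counts)
round(frac_empty, 3)
```

Working from the counts gives about 0.586, which differs in the third decimal
place from the sum of the three rounded proportions because of rounding.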
What are the runs values of these home runs? We already observed in the
analysis of Pujols’ data that the runs value of a home run with the bases
empty is one. A histogram of the run values for all home runs is constructed
using the truehist function in the MASS package.2 (See Figure 5.4.)
library(MASS)
truehist(d.homerun$RUNS.VALUE)
FIGURE 5.4
Histogram of the runs values of the home runs hit during the 2011 season.
The vertical line shows the location of the mean runs value of a home run.
[Horizontal axis: d.homerun$RUNS.VALUE.]
2 We prefer truehist to the hist function since by default density values, instead of
It is obvious from this graph that most home runs (the ones with the bases
empty) have a runs value of one. But there is a cluster of home runs with
values between 1.5 and 2.0, and there is a small group of home runs with
runs value exceeding three. Which runners/outs situation leads to the most
valuable home runs? Using the subset function, the row of the data frame
corresponding to the largest runs value is extracted.
subset(d.homerun, RUNS.VALUE == max(RUNS.VALUE))[1,
c("STATE", "NEW.STATE", "RUNS.VALUE")]
STATE NEW.STATE RUNS.VALUE
6748 111 2 000 2 3.336175
As one might expect, the most valuable home run occurs when the bases are
loaded with two outs. From our earlier work, it was seen that this type of home
run occurred 41 times during the season, and the runs value of this home run
is 3.34.
Overall, what is the runs value of a home run? This question is answered by
computing the mean runs value of all the home runs in the data frame d.homerun.
mean.HR <- mean(d.homerun$RUNS.VALUE)
mean.HR
[1] 1.392393
A vertical line is drawn on the graph showing the mean runs value and a label
is added to this line. (See Figure 5.4.)
abline(v=mean.HR, lwd=3)
text(1.5, 5, "Mean Runs Value", pos=4)
This average runs value is pretty small, but this value partially reflects the
fact that most home runs are hit with the bases empty.
Looking at the histogram of the run values of singles, there are three large
spikes between 0 and 0.5. These large spikes can be explained by constructing
a frequency table of the beginning state.
table(d.single$STATE)
000 0 000 1 000 2 001 0 001 1 001 2 010 0 010 1 010 2 011 0 011 1
7078 5067 3814 85 315 356 502 844 959 87 225
011 2 100 0 100 1 100 2 101 0 101 1 101 2 110 0 110 1 110 2 111 0
216 1593 2063 1826 154 344 385 346 728 770 100
111 1 111 2
284 277
FIGURE 5.5
Histogram of the runs values of the singles hit during the 2011 season. The
vertical line shows the location of the mean runs value of a single.
[Horizontal axis: d.single$RUNS.VALUE.]
We see that most of the singles occur with the bases empty, and the three
spikes in the histogram, as one moves from left to right in Figure 5.5,
correspond to singles with no runners on and two outs, one out, and no outs. The
small cluster of runs values in the interval 0.5 to 2.0 corresponds to singles hit
with runners on base.
What is the most valuable single from the runs value perspective? We use
the subset function to find the beginning and end states for the single that
resulted in the largest runs value.
subset(d.single, d.single$RUNS.VALUE ==
max(d.single$RUNS.VALUE))[ , c("STATE", "NEW.STATE", "RUNS.VALUE")]
In this particular play, the hitter came to bat with the bases loaded and two
outs, and the final state was a runner on third with two outs. How could
this have happened with a single? The data frame does not contain a description
of the play, but from the data frame we identify the play as happening during the
bottom of the 7th inning of a game between the Brewers and Twins on July
3, 2011. We check with www.espn.com to find the following play description.
“D Valencia singled to deep left, J Mauer and M Cuddyer scored, J Thome to
second, J Thome scored, D Valencia to third on error by left fielder M Kotsay.”
So evidently, the left fielder made an error on the fielding of the single that
allowed all three runners to score and the batter to reach third base.
At the other extreme, by use of the subset function, two plays are iden-
tified which achieved the smallest runs value.
subset(d.single, d.single$RUNS.VALUE == min(d.single$RUNS.VALUE))[
, c("STATE", "NEW.STATE", "RUNS.VALUE")]
STATE NEW.STATE RUNS.VALUE
69618 010 0 100 1 -0.5622312
138351 010 0 100 1 -0.5622312
How could the run value of a single be roughly negative one-half of a run? With
further investigation, we find that in each case, there was a runner on second who
was thrown out at the plate as a result of the single.
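The size of this negative value can be checked against the run expectancy
matrix. The sketch below assumes that a run value is the expectancy of the new
state minus the expectancy of the starting state plus the runs scored, and uses
the rounded 2011 expectancies (1.06 for a runner on second with no outs, 0.50
for a runner on first with one out). The vector name runs_exp is hypothetical.

```r
# Rounded 2011 run expectancies for the two states involved in these plays
runs_exp <- c("010 0" = 1.06, "100 1" = 0.50)

# No run scored: the lead runner was put out and the batter reached first
runs_scored <- 0

value <- unname(runs_exp["100 1"] - runs_exp["010 0"] + runs_scored)
value
```

The result, about −0.56, agrees with the displayed value of −0.5622312 up to
the rounding of the expectancies.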
As in the case of the home run, it is straightforward to compute the mean
runs value of a single. We display this mean value on the histogram in Figure
5.5.
mean.single <- mean(d.single$RUNS.VALUE)
mean.single
[1] 0.4424186
abline(v=mean.single, lwd=3)
text(.5, 5, "Mean Runs Value", pos=4)
In this case, we see that the mean value of a single is approximately equal to
the runs value when a single is hit with the bases empty with no outs. It is
interesting that the runs value of a single can be large (in the 1 to 2 range).
These large runs values reflect the fact that the benefit of the single depends
on the advancement of the runners.
outcomes – either the runner will be successful in stealing the base or the
runner will be caught stealing. Overall, is there a net benefit to attempting to
steal a base?
The variable EVENT_CD gives the code of the play and codes of 4 and 6
correspond respectively to a stolen base (SB) or caught stealing (CS). Using
the subset function, a new data frame stealing is created that consists of
only the plays where a stolen base is attempted.
stealing <- subset(data2011, EVENT_CD == 6 | EVENT_CD == 4)
table(stealing$EVENT_CD)
4 6
2863 864
Among all stolen base attempts, the proportion of stolen bases is 2863 /(2863
+ 864) = 0.768.
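The same proportion can be obtained with prop.table. The sketch below enters
the two counts by hand in a named vector (attempts is a hypothetical name).

```r
# Counts of stolen bases (event code 4) and caught stealing (event code 6)
attempts <- c(SB = 2863, CS = 864)

# Proportion of attempts that succeeded and failed
rates <- round(prop.table(attempts), 3)
rates
```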
What are common runners/outs situations for attempting a stolen base?
This is answered by constructing a frequency table for the STATE variable.
table(stealing$STATE)
001 1 001 2 010 0 010 1 010 2 011 1 100 0 100 1 100 2 101 0 101 1
10 1 19 114 101 1 758 1018 1160 35 111
101 2 110 0 110 1 110 2 111 1
180 27 102 89 1
We see that stolen base attempts typically happen with a runner only on first
(state “100”). But there is a wide variety of situations where runners attempt
to steal.
Every stolen base attempt has a corresponding runs value that is stored in
the variable RUNS.VALUE. This runs value reflects the success of the attempt
(either SB or CS) and the situation (runners and outs) where this attempt
occurs. Using the truehist function, a histogram is constructed of the
runs values for all the stolen base attempts.
library(MASS)
truehist(stealing$RUNS.VALUE)
Generally, all of the successful SBs have positive runs value, although most of
the values fall in the interval from 0 to 0.3. In contrast, the unsuccessful CSs
(as expected) have negative runs values. In further exploration, one can show
the three spikes for negative runs values correspond to CS when there is only
a runner on first with 0, 1, and 2 outs.
Let’s focus on the benefits of stolen base attempts in a particular situation.
We create a new data frame which gives the attempted stealing data when
there is a runner on first base with one out (state “100 1”).
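The code that creates this data frame is not shown in this listing. A minimal
sketch of the step, using a small made-up data frame (toy_stealing) in place of
stealing, is given below. The name stealing.1001 used later in the text suggests
a subset on STATE equal to "100 1", which is the assumption made here.

```r
# Toy steal attempts: starting state and event code (4 = SB, 6 = CS)
toy_stealing <- data.frame(
  STATE = c("100 1", "100 0", "100 1", "010 1", "100 1"),
  EVENT_CD = c(4, 4, 6, 4, 4)
)

# Keep only the attempts that start with a runner on first and one out
toy_1001 <- subset(toy_stealing, STATE == "100 1")

# Tabulate successes and failures in this situation
table(toy_1001$EVENT_CD)
```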
FIGURE 5.6
Histogram of the runs values of all steal attempts during the 2011 season.
[Horizontal axis: stealing$RUNS.VALUE.]
By tabulating the EVENT_CD variable, we see the runner successfully stole 753
times out of 753 + 265 attempts for a success rate of 74.0%.
table(stealing.1001$EVENT_CD)
4 6
753 265
This provides more information than simply recording a stolen base. On 704
occurrences, the runner successfully advanced to second base. On an additional
52 occurrences, the runner advanced to third. Perhaps this extra base was due
to a bad throw from the catcher or a misplay by the infielder; more can be
learned about the details of these plays by further examination of the other
variables.
We are most interested in the value of attempting stolen bases in this
situation – we address this by computing the mean run values of all of the
attempts with a runner on first with one out.
mean(stealing.1001$RUNS.VALUE)
[1] 0.02649315
Stolen base attempts in this situation are worthwhile on average, although the
overall value is small, about 0.03 runs per attempt. Of course, the actual
benefit of the attempt depends on the success or failure and on the situation
(runners and outs) where the stolen base is attempted.
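To see why the average is so close to zero, one can weigh the gain from a
success against the loss from a failure. The sketch below is a simplification
under stated assumptions: it uses the rounded 2011 run expectancies, treats
every success as an advance to second only, and ignores extra bases and runs
scored, so it will not reproduce the observed mean exactly. All variable names
are hypothetical.

```r
# Rounded 2011 run expectancies for the states involved
start <- 0.50     # runner on first, one out  ("100 1")
success <- 0.65   # runner on second, one out ("010 1")
failure <- 0.10   # bases empty, two outs     ("000 2")

gain <- success - start   # run value of a stolen base
loss <- start - failure   # run value given up when caught stealing

# Expected run value per attempt at the observed success rate of 74.0%
p <- 0.74
expected_value <- p * gain - (1 - p) * loss

# Success rate at which an attempt neither gains nor loses runs on average
break_even <- loss / (gain + loss)

round(c(expected_value = expected_value, break_even = break_even), 3)
```

Under these simplified assumptions the break-even success rate is about 73%,
only slightly below the observed rate, which is consistent with the small
positive mean run value reported above.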
5.11 Exercises
1. (Runs Values of Hits)
In Section 5.8, we found the average runs value of a home run and a single.
(a) Use similar R code as described in Section 5.8 for the 2011 season
data to find the mean run values for a double, and for a triple.
(b) Albert and Bennett (2001) use a regression approach to obtain the
weights 0.46, 0.80, 1.02, and 1.40 for a single, double, triple, and home
run, respectively. Compare the results from Section 5.8 and part (a)
with the weights of Albert and Bennett.
2. (Value of Different Ways of Reaching First Base)
There are three different ways for a batter to reach first base: a single, a walk
(BB), or a hit-by-pitch (HBP). But these three outcomes have different runs
values due to the different advancement of the runners on base. Use runs
values based on data from the 2011 season to compare the benefit of a
walk, a hit-by-pitch, and a single when there is a single runner on first
base.
3. (Comparing Two Players with Similar OBPs)
Rickie Weeks (batter id “weekr001”) and Michael Bourn (batter id
“bourm001”) both were leadoff hitters during the 2011 season. They had
similar on-base percentages: .350 for Weeks and .349 for Bourn. By
exploring the runs values of these two players, investigate which player was
really more valuable to his team. Can you explain the difference in runs
values in terms of traditional batting statistics such as AVG, SLG, or
OBP?
4. (Create Probability of Scoring a Run Matrix)
In Section 5.3, the construction of the runs expectancy matrix from 2011
season data was illustrated. Suppose instead that one was interested in
computing the proportion of times when at least one run was scored for
each of the 24 possible bases/outs situations. Use R to construct this
probability of scoring matrix.
5. (Runner Advancement with a Single)
Suppose one is interested in studying how runners move with a single.
(a) Using the subset function, select the plays when a single was hit.
(The value of EVENT_CD for a single is 20.) Call the new data frame
d.single.
(b) Use the table function with the data frame d.single to construct
a table of frequencies of the variables STATE (the beginning run-
ners/outs state) and NEW.STATE (the final runners/outs state).
(c) Suppose there is a single runner on first base. Using the table from
part (b), explore where runners move with a single. Is it more likely
for the lead runner to move to second, or to third base?
(d) Suppose instead there are runners on first and second. Explore where
runners move with a single. Estimate the probability a run is scored
on the play.
CONTENTS
6.1 Introduction
6.2 The lattice Package
  6.2.1 Introduction
  6.2.2 The verlander dataset
  6.2.3 Basic plotting with lattice
  6.2.4 Multipanel conditioning
  6.2.5 Superposing group elements
  6.2.6 Scatterplots and dot plots
  6.2.7 The panel function
  6.2.8 Building a graph, step-by-step
6.3 The ggplot2 Package
  6.3.1 Introduction
  6.3.2 The cabrera dataset
  6.3.3 The first layer
  6.3.4 Grouping factors
  6.3.5 Multipanel conditioning (faceting)
  6.3.6 Adding elements
  6.3.7 Combining information
  6.3.8 Adding a smooth line with error bands
  6.3.9 Dealing with cluttered charts
  6.3.10 Adding a background image
6.4 Further Reading
6.5 Exercises
6.1 Introduction
Chapter 3 introduced graphics in R, illustrating a variety of displays with
functions provided by the graphics package. While this traditional package
has sufficient flexibility for many purposes, its functions are more difficult
to use in graphing complicated data structures such as associations among
three variables. This chapter introduces the lattice and ggplot2 packages,
which are well suited for constructing more sophisticated graphical
displays. We focus on the use of these packages for baseball data.
The datasets used in this chapter are contained in an R workspace, a file
with extension .Rdata which can be loaded with the following command line:
load("data/balls_strikes_count.Rdata")
sampling should be performed with replacement and is set as FALSE by default, and a vector
prob if one wants to give different selection probabilities for the set of integers.
TABLE 6.1
A random sample of twenty rows of the verlander dataset.
(row#) season gamedate pitch_type balls strikes pitches speed px pz batter_hand
Advanced Graphics
The first two columns of the verlander data frame, season and gamedate,
contain the season and the date on which the game was played. The
pitch_type column indicates the type of pitch thrown by the pitcher as a
two-character abbreviation. The five pitches in Verlander’s repertoire are the
four-seam fastball (FF), the two-seam fastball (FT), the curveball (CU), the
change-up (CH), and the slider (SL). The balls and strikes columns
report the ball-strike count on the batter when the pitch was delivered and the
pitches column contains the number of pitches already thrown by the pitcher
in that particular game. Reading the first row in Table 6.1, we see that Verlan-
der threw a four-seam fastball, delivered as the 69th pitch in the game, with
a 1-0 count on the batter. The next three columns include PITCHf/x data:
speed is the recorded speed at the release of the pitch, px and pz define the
location of the pitch when it crosses the front of the plate. For example, the
first pitch in the table crossed the plate 0.35 feet to the right of the middle of
the plate, 3.32 feet from the ground. The final column indicates whether the
opposing player was batting from the left or the right side of the plate.
FIGURE 6.1
Histogram and density plot of Verlander’s pitch speeds using functions in
lattice package.
[Panel (a): Percent of Total against speed; panel (b): Density against speed.]
element vector is specified, indicating the number of columns and rows, respectively.
clearly shows that Verlander’s fastballs are thrown in the mid-90s, his sliders
and changeups in the mid-80s, and his curveballs about 80 mph.
FIGURE 6.2
Panels of density plots of Verlander’s pitch speed by pitch type.
[Panels: CH, CU, FF, FT, SL; vertical axis: Density; horizontal axis: speed.]
FIGURE 6.3
Density plot of Verlander’s pitch speed by pitch type using superposed lines.
[Legend: CH, CU, FF, FT, SL; vertical axis: Density; horizontal axis: speed.]
FIGURE 6.4
Scatterplots of the speeds of Verlander’s four-seam fastball against the day of
the year for the four seasons 2009 through 2012.
[Panels: 2009, 2010, 2011, 2012; vertical axis: pitch speed (mph).]
The graph of Figure 6.5 shows that Verlander’s fastball/change-up speed dif-
ferential has slightly diminished through the years, as Justin lost a very small
amount of speed on his fastball, while simultaneously throwing his change-up
progressively harder from 2009 to 2011.
FIGURE 6.5
Dot plot of Verlander fastball and change-up speeds for the 2009 through 2012
seasons.
[Rows: seasons 2009 through 2012; plotting symbols: C and F; horizontal axis: speed.]
speeds at every pitch count for the four seasons considered. The overall average
fastball speed is stored in the avgSpeedComb object.
avgSpeed <- aggregate(speed ~ pitches + season, data=F4verl,
FUN=mean)
xyplot(speed ~ pitches | factor(season),
data=avgSpeed)
avgSpeedComb <- mean(F4verl$speed)
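To make the aggregation step concrete, the sketch below applies the same
formula interface of aggregate to a small made-up data frame (toy) instead of
F4verl. The toy values and the names avg_toy and overall are assumptions used
only for illustration.

```r
# Toy pitch data: season, pitch count within the game, and speed in mph
toy <- data.frame(
  season = c(2009, 2009, 2009, 2010, 2010),
  pitches = c(1, 1, 2, 1, 1),
  speed = c(95, 97, 96, 94, 96)
)

# Mean speed for every combination of pitch count and season
avg_toy <- aggregate(speed ~ pitches + season, data = toy, FUN = mean)
avg_toy

# Overall mean speed, computed from the individual pitches
overall <- mean(toy$speed)
overall
```

Note that the overall mean is taken over the individual pitches, not over the
per-count averages, so pitch counts with more observations carry more weight.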
To add the reference lines and text, one adds a panel argument which is
a function describing the different components graphed. The particular panel
function used for our example is displayed below. The panel.xyplot function
draws the scatterplot, panel.abline functions add the vertical line at the 100
pitch count and the horizontal one at the average speed value, the panel.text
function adds the text, and the panel.arrows function draws the arrows.
panel=function(...){
panel.xyplot(...)
panel.abline(v=100, lty="dotted")
panel.abline(h=avgSpeedComb)
panel.text(25, 100, "avg. speed")
panel.arrows(25, 99.5, 0, avgSpeedComb,
length=.1)
}
The panel.abline and panel.text functions are the lattice equivalents of
the abline and text functions in the base graphics described in Chapter
3. Similarly the lattice function panel.arrows corresponds to the arrows
function in base graphics.3
If the panel function argument is added to the basic xyplot function, a
scatterplot is obtained with the reference lines and text added (see Figure
6.6).
xyplot(speed ~ pitches | factor(season),
data=avgSpeed,
panel=function(...){
panel.xyplot(...)
panel.abline(v=100, lty="dotted")
panel.abline(h=avgSpeedComb)
panel.text(25, 100, "avg. speed")
panel.arrows(25, 99.5, 0, avgSpeedComb,
length=.1)
}
)
This figure clearly shows that the speed of Verlander’s fastball steadily in-
creases during a game, even past the 100-pitch count, a commonly used limit
for starting pitchers.
example, by typing ?arrows in the R console) for figuring out the arguments that can be
specified to the various panel functions.
[Figure 6.6 graphic: scatterplot panels for 2009, 2010, 2011, and 2012; x-axis pitches (0 to 100), y-axis speed (94 to 100); each panel carries an "avg. speed" label.]
FIGURE 6.6
Verlander’s four-seam fastball speed through the game - Scatterplot with a
vertical reference line at 100 pitches and a horizontal reference line drawn at
the overall average fastball speed (as indicated by the text labels and the
arrows).
pitch_type contains the pitch type. A scatterplot of the pitch locations is
constructed using the function xyplot; by use of the “| batter_hand” option,
separate panels are created for left-handed and right-handed batters, and by
use of the groups=pitch_type argument, each of the pitch types is represented
by a different symbol. The auto.key=TRUE argument creates a legend for the
plotting symbols. This first graph is displayed in Figure 6.7.
xyplot(pz ~ px | batter_hand, data=NoHit, groups=pitch_type,
auto.key=TRUE)
Since both axes are measured in feet, the units on the x-axis should be
drawn on the same scale as the units on the y-axis. This can be ensured by
choosing isometric scales, which is accomplished with the aspect="iso"
argument. With this change, one obtains the graph shown in Figure 6.8.
[Figure 6.7 graphic: panels L and R; legend CH, CU, FF, FT, SL; x-axis px (-2 to 2), y-axis pz.]
FIGURE 6.7
Location of Verlander’s pitches in his second career no-hitter - base graph.
Next, we want to limit the plotting area from two feet to the left (of the
center of the plate) to two feet to the right, and from the ground to five feet
above it. The limits of the x and y axes are controlled by the arguments xlim
and ylim. The axes can be labeled with meaningful text using the xlab and
ylab arguments.4 The third graph in our sequence is displayed in Figure 6.9.
xyplot(pz ~ px | batter_hand, data=NoHit, groups=pitch_type,
auto.key=TRUE,
aspect="iso",
xlim=c(-2.2, 2.2),
ylim=c(0, 5),
xlab="Horizontal Location\n(ft. from middle of plate)",
ylab="Vertical Location\n(ft. from ground)")
4 The \n character sequence is used to split the text over multiple lines.
[Figure 6.8 graphic: legend CH, CU, FF, FT, SL; x-axis px (-2 to 2), y-axis pz.]
FIGURE 6.8
Location of Verlander’s pitches in his second career no-hitter - graph with
change in aspect ratio.
Further improvements can be made to the display in Figure 6.9. The graph
is missing the rectangle of the strike zone, and the legend uses abbreviations
for the pitch types and has neither a title (“pitch type”) nor a border.
The first change is to improve the legend. Besides a logical value, the
auto.key argument can accept a list of parameters to fine-tune the legend.
The following code prepares this list, after a vector of labels for the pitch
type is created.
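A minimal sketch of such a legend list, combined with a strike zone rectangle drawn inside a panel function, might look as follows. The toyPitches data frame is simulated, and the strike zone boundaries (topKzone, botKzone, inKzone, outKzone) are assumed illustrative values in feet rather than figures taken from the text; the myKey and pitchnames names are likewise chosen here for illustration.

```r
library(lattice)

# Hypothetical stand-in for the NoHit data frame (locations are simulated).
set.seed(1)
toyPitches <- data.frame(
  px = rnorm(50, 0, 0.8),
  pz = rnorm(50, 2.5, 0.8),
  batter_hand = rep(c("L", "R"), each = 25),
  pitch_type  = rep(c("CH", "CU", "FF", "FT", "SL"), times = 10)
)

# Full names for the legend, in the alphabetical order of the abbreviations.
pitchnames <- c("change-up", "curveball", "4S-fastball",
                "2S-fastball", "slider")

# List of legend settings passed to auto.key in place of TRUE.
myKey <- list(space = "right", border = TRUE, title = "pitch type",
              cex.title = 0.8, text = pitchnames)

# Assumed strike zone boundaries in feet (illustrative values only).
topKzone <- 3.5
botKzone <- 1.6
inKzone  <- -0.95
outKzone <- 0.95

p <- xyplot(pz ~ px | batter_hand, data = toyPitches, groups = pitch_type,
            auto.key = myKey, aspect = "iso",
            xlim = c(-2.2, 2.2), ylim = c(0, 5),
            panel = function(...) {
              panel.xyplot(...)
              # Draw the strike zone rectangle on top of the points.
              panel.rect(inKzone, botKzone, outKzone, topKzone,
                         border = "black", lty = 3)
            })
```

Typing p (or print(p)) at the console draws the plot. The text component of the list replaces the default legend labels, so its order has to match the order of the levels of the grouping variable.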
[Figure 6.9 graphic: panels L and R; legend CH, CU, FF, FT, SL; x-axis "Horizontal Location (ft. from middle of plate)" (-2 to 2), y-axis "Vertical Location (ft. from ground)".]
FIGURE 6.9
Location of Verlander’s pitches in his second career no-hitter - graph with
changes in axes limits and labels.
This final graph is very informative about the location of Verlander’s pitches
during this no-hit game. For example, we see that Verlander was successful
in throwing his slider down and away to right-handed hitters. We also see
Verlander’s tendency to throw his change-up down and away to left-handed
hitters. His fastballs generally were thrown within the strike zone.
[Figure 6.10 graphic: panels L and R; legend titled "pitch type" with entries change-up, curveball, 4S-fastball, 2S-fastball, slider; x-axis "horizontal location (ft. from middle of plate)", y-axis "vertical location (ft. from ground)".]
FIGURE 6.10
Location of Verlander’s pitches in his second career no-hitter - graph with
changes in legend and addition of strike zone box.
[Figure 6.11 graphic: scatterplot; y-axis hity (0 to 500).]
FIGURE 6.11
Scatterplot of Miguel Cabrera’s batted balls (2009 - 2012).
[Figure 6.12 graphic: scatterplot; y-axis hity (0 to 500); legend hit_outcome with values E, H, O.]
FIGURE 6.12
Outcomes of Miguel Cabrera’s batted balls (2009 - 2012).
This figure illustrates how the locations of Cabrera’s batted balls have changed
over the four seasons.
[Figure 6.13 graphic: panels 2009, 2010, 2011, and 2012; y-axis hity (0 to 500); legend hit_outcome with values E, H, O.]
FIGURE 6.13
Outcomes of Miguel Cabrera’s batted balls by season.
Note that, since we are going to draw a path, the coordinates of home plate
appear both at the beginning and at the end of the values of x and y.
The base paths (Figure 6.14) are displayed by adding a layer to the previous
plot.
The new plot with base paths added is constructed by adding the
geom_path function component to the current plot p3. In this case, data and
the mapping of aesthetics (aes) are specified as arguments in the geom_path
function, overriding the defaults set in the original ggplot function. When
displaying the p4 object, two additional layers are added to draw the
foul lines using the geom_segment function.
p4 <- p3 + geom_path(aes(x=x, y=y), data=bases)
p4 +
geom_segment(x=0, xend=300, y=0, yend=300) +
geom_segment(x=0, xend=-300, y=0, yend=300)
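One way to build a closed base path of this kind is sketched below. It assumes that coordinates are in feet with home plate at the origin and the foul lines at 45 degrees, which is consistent with the two geom_segment calls above. The helper names d and sideLengths are introduced here for illustration only.

```r
# Assumption: coordinates are in feet, home plate is at the origin, and the
# foul lines run at 45 degrees, so consecutive bases are 90 feet apart.
d <- 90 / sqrt(2)

# Home, first, second, third, and home again to close the path.
bases <- data.frame(
  x = c(0, d, 0, -d, 0),
  y = c(0, d, 2 * d, d, 0)
)

# Length of each of the four segments of the path.
sideLengths <- sqrt(diff(bases$x)^2 + diff(bases$y)^2)
print(sideLengths)
```

Each of the four segment lengths works out to 90 feet, and repeating the home plate coordinates in the last row is what makes the drawn path return to its starting point.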
[Figure 6.14 graphic: panels 2009, 2010, 2011, and 2012; y-axis hity (0 to 500); legend hit_outcome with values E, H, O.]
FIGURE 6.14
Outcomes of Miguel Cabrera’s batted balls by season (with base lines
superposed).
ing smoothed lines with shading reflecting the error bands (Figure 6.16).8
This graph is constructed in ggplot2 in five layers. The geom_point function
gives the plotted points, the facet_wrap function produces separate panels by
year, the geom_smooth function provides the smoothed line with error bands, the
geom_vline function gives the vertical line at 100 pitches, and the geom_line
function draws the horizontal line.9
8 Adding smoothed lines to the lattice version of the plot would also be possible, but
[Figure 6.15 graphic: scatterplot; y-axis hity (0 to 500); legends hit_outcome (H, O), speed (70 to 100), and pitch_type (CH, CU, FC, FT, SC, SI).]
FIGURE 6.15
Miguel Cabrera’s batted balls in the final month of his Triple Crown season
(September/October 2012).
ggplot(F4verl, aes(pitches, speed)) +
facet_wrap(~ season) +
geom_line(stat="hline", yintercept="mean", lty=3) +
geom_point(aes(pitches, speed),
data=F4verl[sample(1:nrow(F4verl), 1000),]) +
geom_smooth(col="black") +
geom_vline(aes(xintercept=100), col="black", lty=2)
To avoid cluttering the graph with data points, the data and the aesthetics
mapping are respecified in the geom_point layer, where a random sample of
1000 points is used from the F4verl data frame. The smoothing is calculated
on the original F4verl data frame, because no data or aesthetics mapping has
been specified in the geom_smooth layer.
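The row-sampling idiom used in the geom_point layer can be tried on its own. The sketch below assumes a simulated data frame (toyFastballs, a hypothetical stand-in for F4verl) and fixes the random seed so that the same rows are drawn each time.

```r
set.seed(42)  # make the random sample reproducible

# Hypothetical data frame standing in for F4verl (speeds are simulated).
toyFastballs <- data.frame(
  pitches = rep(1:100, times = 50),
  speed   = rnorm(5000, mean = 95, sd = 1.5)
)

# Draw 1000 distinct row indices and keep only those rows for plotting.
idx <- sample(1:nrow(toyFastballs), 1000)
toySample <- toyFastballs[idx, ]

print(dim(toySample))
```

Because sample draws without replacement by default, no pitch appears twice in the subsample, while any layer that keeps the full data frame still uses all of the rows.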
[Figure 6.16 graphic: panels 2009, 2010, 2011, and 2012; x-axis pitches (0 to 100), y-axis speed (92.5 to 102.5).]
FIGURE 6.16
Verlander’s four-seam fastball speed through the game - Scatterplot with
reference line at 100 pitches and a smooth line with shading for errors.
of statistical transformations of the data. For example, one can add summary
statistics such as means and standard deviations to the current graph.
These stat layers are useful for dealing with cluttered data. As an ex-
ample, if we are to use a scatterplot to graph the locations of Verlander’s
four-seam fastballs from 2009 to 2012 using the following code, we see an
indistinguishable cloud of black points and it is difficult to see any patterns
(Figure 6.17).
kZone <- data.frame(
x=c(inKzone, inKzone, outKzone, outKzone, inKzone),
y=c(botKzone, topKzone, topKzone, botKzone, botKzone)
)
ggplot(F4verl, aes(px, pz)) +
geom_point() +
facet_wrap(~ batter_hand) +
coord_equal() +
geom_path(aes(x, y), data=kZone, lwd=2, col="white")
[Figure 6.17 graphic: panels L and R; x-axis px (-3 to 2), y-axis pz (0 to 6).]
FIGURE 6.17
A cluttered scatterplot - Locations of Justin Verlander’s four-seam fastballs
by batter handedness (2009-2012).
One way of handling the cluttering of data points is tiling the (x, y) plane
with hexagonal bins and coloring the bins according to the number
of data points they contain. One can create a new graph by replacing the
geom_point layer with a stat_binhex layer. The new graph is displayed in
Figure 6.18.
ggplot(F4verl, aes(px, pz)) +
stat_binhex() +
facet_wrap(~ batter_hand) +
coord_equal() +
geom_path(aes(x, y), data=kZone, lwd=2, col="white", alpha=.3)
Note how the alpha argument in the geom_path layer is used to adjust the
transparency of the strike zone border.
[Figure 6.18 graphic: panels L and R; x-axis px (-3 to 2), y-axis pz (0 to 6); legend count (10 to 40).]
FIGURE 6.18
Hexagonal binning as a way to portray 2D density - Locations of Justin
Verlander’s four-seam fastballs by batter handedness (2009-2012).
ball locations might be an exception, since the figure of the baseball field may
give a better reference guide than the horizontal and vertical axes.
Packages exist for reading common format images into R. For example, if
one obtains a diagram of Detroit’s Comerica Park10 as a jpeg file, the jpeg
package can be used for reading the image into R by the readJPEG function.11
library(jpeg)
diamond <- readJPEG("Comerica.jpg")
The diamond object is a three-dimensional array of dimension x × y × 3,
where x and y correspond to the dimension of the image in pixels. The array
can be seen as a collection of three x × y matrices, containing information
on the red, green, and blue (RGB) components at every pixel of the image
respectively, expressed as values in the [0, 1] range.
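The layout of such an array can be explored without an image file. The sketch below builds a tiny array by hand (a hypothetical 4 by 6 pixel image named toyImage, not the Comerica Park diagram) using only base R, and assumes the rows-by-columns-by-channels layout described above.

```r
# Hypothetical 4 x 6 pixel "image" built by hand instead of read from a file.
h <- 4
w <- 6
toyImage <- array(0, dim = c(h, w, 3))

# Red channel fully on everywhere; green stays at 0.
toyImage[, , 1] <- 1
# Blue channel runs from 0 to 1 across the pixels.
toyImage[, , 3] <- matrix(seq(0, 1, length.out = h * w), nrow = h)

print(dim(toyImage))    # rows, columns, colour channels
print(range(toyImage))  # all values lie in the [0, 1] range

# Red, green, and blue components of the top-left pixel.
topLeft <- toyImage[1, 1, ]
print(topLeft)

# Base R can convert such an array into a raster object for plotting.
toyRaster <- as.raster(toyImage)
```

Indexing the third dimension selects one colour component for the whole image, while fixing the first two indices returns the three components of a single pixel.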
10 This diagram was retrieved as a svg file from MLBAM Gameday at gd2.mlb.com/
The diagram can be added in ggplot2 as a layer of the plot using the
annotation_raster function. The following code is used to obtain Figure
6.19.
ggplot(cabrera, aes(hitx, hity)) +
coord_equal() +
annotation_raster(diamond, -310, 305, -100, 480) +
stat_binhex(alpha=.9, binwidth=c(5, 5)) +
scale_fill_gradient(low="grey70", high="black")
The four numbers passed to the annotation_raster layer indicate where
the image has to be positioned and were found by trial and error.12 In the
stat_binhex layer two arguments have been specified: alpha gives the degree
of transparency to the hexagons so that their coloring does not completely
hide the diamond diagram and binwidth sets the dimensions of the hexagons.
Finally, the scale_fill_gradient layer sets the coloring of the hexagonal bins
as a gradient starting from the color grey70 and ending at the color black.13
12 Since the image is imported in R as an array of RGB values, it would be possible to
retrieve the position of two points (for example home plate and second base) and compute
the four values to be passed in the annotation_raster layer. For an example on how to
identify an object inside an image with R look at
is-r.tumblr.com/post/36874307174/finding-a-bright-object.
13 A map of R colors by name is available at
research.stowers-institute.org/efg/R/Color/Chart/ColorChart.pdf.
[Figure 6.19 graphic: hexagonal bins over the ballpark diagram; y-axis hity (100 to 500); legend count (5 to 15).]
FIGURE 6.19
Scatterplot of Cabrera’s batted balls (2009 - 2012) with Detroit’s Comerica
Park diagram in the background. Note: batted balls for games in other
ballparks are included as well.
6.5 Exercises
1. (Location of Pitches for Left- and Right-Handed Batters)
Use a density plot to display the horizontal location of Justin Verlander’s
pitches by opponent’s handedness. Choose the conditioning and grouping
variables so that one can easily detect the differences in location by
handedness. Add a legend (if necessary) and vertical reference lines indicating
the borders of the strike zone.
2. (Comparing Pitch Locations for Two Pitchers)
The sanchez data frame contains 2008-2012 PITCHf/x data for pitcher
Jonathan Sanchez. The structure of this data frame is similar to the
verlander data frame described in the chapter. Use a graphical display
to compare the ability of Sanchez and Verlander in maintaining their
fastball speed through the game. (See Sections 6.2.7 and 6.3.8.) Use either
the lattice or ggplot2 graphics package and display the data either as
a multipanel plot or a superposed lines plot.
3. (Graphical View of the Speeds of Justin Verlander’s Fastballs)
(a) The cut function is useful for recoding a continuous variable into
intervals. Use this function to categorize the pitches variable in the
verlander data frame in groups of ten pitches.
(b) Use the bwplot function from the lattice package to produce a
boxplot of Verlander’s four-seam fastball speed (use the F4verl data
frame) for each ten-pitches group. Compare the information conveyed
by the resulting chart with that of Figure 6.6.
(a) Create a data frame by selecting, from the cabrera data frame, the
instances where the hit_outcome variable assumes the value H (for
base hit).
(b) Using the hitx and hity variables, create a new variable equal to
the distance, in feet from home plate, of Cabrera’s base hits. (This
variable is computed by simply applying the Pythagorean Theorem;
remember that home plate is at the origin.)
(c) In the newly created data frame, create a gameDay variable indicating
the day of the year (from 0 to 365) in which the game took place (see
Section 6.2.6).
(d) Build a scatterplot featuring gameDay on the x-axis, distance on the
y-axis and a smooth line with error bands. Does the resulting plot
appear to indicate changes in Cabrera’s power during the season?
7
Balls and Strikes Effects
CONTENTS
7.1 Introduction
7.2 Hitter’s Counts and Pitcher’s Counts
7.2.1 Introduction
7.2.2 An example for a single pitcher
7.2.3 Pitch sequences on Retrosheet
7.2.3.1 Functions for string manipulation
7.2.3.2 Finding plate appearances going through a given count
7.2.4 Expected run value by count
7.2.5 The importance of the previous count
7.3 Behaviors by Count
7.3.1 Swinging tendencies by count
7.3.1.1 Propensity to swing by location
7.3.1.2 Effect of the ball/strike count
7.3.2 Pitch selection by count
7.3.3 Umpires’ behavior by count
7.4 Further Reading
7.5 Exercises
7.1 Introduction
In this chapter we explore the effect of the ball/strike count on the behav-
ior of players and umpires and on the final outcome of a plate appearance.
Retrosheet data from the 2011 season is used to estimate how the ball/strike
count affects the run expectancy. PITCHf/x data is also used to explore how
one pitcher modifies his pitch selection based on the count, how one batter
alters his swing zone, and how umpires judge pitches according to the
count. Functions for string manipulation are introduced that are useful for
managing the pitch sequences from the Retrosheet files. Level plots and con-
tour plots, created with the use of the lattice package, will be used for the
explorations of batters’ swing tendencies and umpires’ strike zones.
to calculate, it correlates very well, at the team level, with runs scored.
Figure 7.1 uses a heat map to display Mussina’s tOPS+ through the various
counts. If one focuses on a particular number of strikes, a higher number
of balls in the count makes the outcome more likely to be favorable to the
hitter (lighter shades). Conversely, if one fixes the number of balls, the
balance moves towards the pitcher (darker shades) as one increases the number
of strikes. This figure emphasizes the importance, from a pitcher’s perspective,
of beginning the duel with a strike. When Mussina fell behind 1-0 in his
career, batters performed 18% better than usual in the plate appearance;
conversely, after a first pitch strike, they were limited to 72% of their
potential performance.
              strikes
              0     1     2
balls   0   100    72    30
        1   118    82    38
        2   157   114    64
        3   207   171   122
FIGURE 7.1
Heat map of tOPS+ for Mike Mussina through each balls/strikes count. Data
from Baseball Reference website.
How is the heatmap display of Figure 7.1 created in R? First a data
frame mussina is prepared with all the possible balls/strikes counts, using
the expand.grid function as previously illustrated in Section 4.7. A new vari-
able value is added with the tOPS+ values taken from the Baseball-Reference
website.
mussina <- expand.grid(balls=0:3, strikes=0:2)
mussina$value <- c(100, 118, 157, 207, 72, 82, 114, 171, 30, 38,
                   64, 122)
mussina
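Because expand.grid varies its first argument fastest, the twelve values have to be entered with balls changing before strikes. One quick way to check the ordering, offered here only as a sketch, is to cross-tabulate the data frame with xtabs so that it takes the same layout as Figure 7.1. The name mussina.tab is introduced for this illustration.

```r
mussina <- expand.grid(balls = 0:3, strikes = 0:2)
mussina$value <- c(100, 118, 157, 207, 72, 82, 114, 171, 30, 38, 64, 122)

# Rows are balls, columns are strikes, cells hold the tOPS+ values
mussina.tab <- xtabs(value ~ balls + strikes, data = mussina)
mussina.tab
```

If a value had been typed in the wrong position, it would show up in the wrong cell of this table.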
The arguments axes=FALSE, xlab="", ylab="" indicate that axes and labels
are not drawn. The axes and labels are explicitly drawn by use of the axis
and mtext functions.
The axis function controls how axes are drawn. The side of the plot where
the axis should appear is controlled by the side parameter; to place an axis
on top of the plot, as was done in Figure 7.1, one specifies side=3. The at
parameter controls the positions at which tick marks are to be drawn, while
labels indicates the text that is to be placed at the tick points. Several
other optional parameters can be passed to the axis function to control the
positioning of the axis, the appearance of the line and the tick marks and the
font for the labels (type ?axis on the R console for more details).
The mtext function places text in the margins of a plot. As with
axis, the side parameter specifies whether the text
should be placed at the bottom (side=1), the left (2), the top (3), or the right
(4) of the plot. Other parameters are used to indicate the distance from the
margin at which the text should be placed (line) and its appearance.
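The roles of axes=FALSE, axis and mtext can be seen in a small base-graphics sketch. It is not the countmap function itself: it assumes the base image function as the drawing routine, and the names vals and z are introduced only for this illustration.

```r
# tOPS+ values arranged with balls (0-3) in rows and strikes (0-2) in columns
vals <- matrix(c(100, 118, 157, 207, 72, 82, 114, 171, 30, 38, 64, 122),
               nrow = 4, dimnames = list(balls = 0:3, strikes = 0:2))

# image() draws z[i, j] at (x[i], y[j]), so transpose to put strikes on x
# and reverse the balls so that the 0-ball row ends up at the top
z <- t(vals)[, 4:1]

pdf(NULL)  # draw on a null device; remove this line to see the plot
image(x = 1:3, y = 1:4, z = z, col = grey(seq(0.3, 1, length.out = 12)),
      axes = FALSE, xlab = "", ylab = "")
axis(side = 3, at = 1:3, labels = 0:2)            # strikes along the top
axis(side = 2, at = 1:4, labels = 3:0, las = 1)   # balls down the left side
mtext("strikes", side = 3, line = 2)
mtext("balls", side = 2, line = 2)
invisible(dev.off())
```

Low values receive the darkest grey here, which mirrors the convention of darker shades for counts that favor the pitcher.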
whether the values should be displayed inside the cells, or an integer number, dictating the
maximum number of decimal places to be displayed.
4 Source: www.retrosheet.org/eventfile.htm.
TABLE 7.1
Pitch codes used by Retrosheet.
Symbol  Description
+       following pickoff throw by the catcher
*       indicates the following pitch was blocked by the catcher
.       marker for play not involving the batter
1       pickoff throw to first
2       pickoff throw to second
3       pickoff throw to third
>       indicates a runner going on the pitch
B       ball
C       called strike
F       foul
H       hit batter
I       intentional ball
K       strike (unknown type)
L       foul bunt
M       missed bunt attempt
N       no pitch (on balks and interference calls)
O       foul tip on bunt
P       pitchout
Q       swinging on pitchout
R       foul ball on pitchout
S       swinging strike
T       foul tip
U       unknown or missed pitch
V       called ball because pitcher went to his mouth
X       ball put into play by batter
Y       ball put into play on pitchout
However, as indicated in Table 7.1, there are some characters in the Retrosheet
strings denoting actions that are not pitches, such as pickoff attempts.
The functions grep and grepl are used to find a pattern within elements
of character vectors. The function grep returns the indices of the elements
for which a match is found, and the function grepl returns a logical vector,
indicating for each element of the vector whether a match is found. For both
functions, the first argument is the string pattern to search and the second
argument is the vector of strings where matches are sought. For example, we
apply the two functions to the vector of pitch sequences sequences in search
for pickoff attempts to first base denoted by the code 1.
sequences <- c("BBX", "C11BBC1S", "1X")
grep("1", sequences)
[1] 2 3
grepl("1", sequences)
[1] FALSE TRUE TRUE
The function grep tells us that “1” is contained in the second (2) and third
(3) components of the character vector sequences, and grepl outputs this
same information by means of a logical vector.
The pattern to search for does not have to be a single character. For
example, we may want to look for consecutive pickoff attempts to first, which
is the pattern “11”. The output below shows that “11” is contained only in the
second component of sequences.
grepl("11", sequences)
[1] FALSE TRUE FALSE
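Some of the Retrosheet codes in Table 7.1, such as the period and the asterisk, have a special meaning in regular expressions, which is how grep, grepl and gsub interpret the pattern by default. The sketch below, on made-up sequences, shows the difference between the default behavior and a literal search requested with fixed=TRUE.

```r
# Made-up sequences containing the "." and "*" codes
seqs <- c("BBX", "B.BX", "C*BS")

grepl(".", seqs)                 # "." as a regular expression matches any character
grepl(".", seqs, fixed = TRUE)   # literal dot: only the second sequence
grepl("*", seqs, fixed = TRUE)   # literal asterisk: only the third sequence
grepl("\\*", seqs)               # the same search with an escaped regular expression
```

The first call returns TRUE for every sequence, which is rarely what is wanted when searching for the Retrosheet marker.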
Also the function gsub allows for the substitution of the pattern found with
a replacement. The replacement can be an empty string, in which case the
pattern is simply removed. For example, the following code removes the pickoff
attempts to first from the pitch sequences.
gsub("1", "", sequences)
[1] "BBX" "CBBCS" "X"
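A character class lets gsub remove several non-pitch codes in one pass, and an anchored pattern can then test how a plate appearance began. The following is a minimal sketch: the grouping of codes into balls (B, I, P, V) and strikes (C, F, K, L, M, O, Q, R, S, T) is read off Table 7.1 and is an assumption of this example, as are the names pitches.only, first.ball and first.strike.

```r
sequences <- c("BBX", "C11BBC1S", "1X")

# Inside [...] the characters . + * > are literal, so this strips pickoff
# throws (1, 2, 3), no-pitch markers (N) and the other non-pitch symbols
pitches.only <- gsub("[.>123N+*]", "", sequences)
pitches.only

# "^" anchors the match at the start of the string
first.ball <- grepl("^[BIPV]", pitches.only)          # went through 1-0
first.strike <- grepl("^[CFKLMOQRST]", pitches.only)  # went through 0-1
data.frame(pitches.only, first.ball, first.strike)
```

The third sequence starts with a ball in play, so it passes through neither the 1-0 nor the 0-1 count.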
on regular expressions, featuring examples, tutorials, reference for syntax, and a list of
related books.
GAME_ID EVENT_ID c00 c10 c20 c11 c01 c30 c21 c31 c02 c12
1 ANA201104080 2 1 0 0 0 0 0 0 0 0 0
2 ANA201104080 3 1 0 0 1 1 0 0 0 0 1
3 ANA201104080 4 1 0 0 1 1 0 1 1 0 0
4 ANA201104080 5 1 1 0 1 0 0 0 0 0 1
5 ANA201104080 6 1 0 0 1 1 0 1 0 0 0
c22 c32 RUNS.VALUE
1 0 0 -0.1555661
2 0 0 -0.0953746
3 0 0 0.3571207
4 1 0 -0.3347205
5 1 0 -0.3919715
For example, the at-bat in the second line of the data frame started with a 0-1
count (value 1 in column c01), then moved to the counts 1-1 and to 1-2 and
generated a run value of -0.095. The pbp11rc data frame has all the necessary
information to calculate the run values of the various balls/strikes counts,
in the same way the value of a home run and of a single were calculated in
Chapter 5.
As an illustration, one can measure the importance of getting ahead on
the first pitch. The mean run value is calculated for at-bats starting with a
ball and for the at-bats starting with a strike.
ab10 <- subset(pbp11rc, c10 == 1)
ab01 <- subset(pbp11rc, c01 == 1)
c(mean(ab10$RUNS.VALUE), mean(ab01$RUNS.VALUE))
[1] 0.03969483 -0.03546708
The conclusion is that the difference between a first pitch strike and a first
pitch ball, as estimated with data from the 2011 season, is over 0.07 runs.
The runs value can be calculated for each possible ball/strike count. First
a data.frame named runs.by.count is prepared with the twelve possible
counts as done in Section 7.2 and a zero value is temporarily assigned to all
counts.
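One way to fill in the values is to compute, for each count, the mean run value of the plate appearances that passed through it. The sketch below does this on a toy stand-in for pbp11rc that covers only four counts; the data values and the names toy, counts and run.value.at are hypothetical.

```r
# Toy stand-in for pbp11rc (values are made up for illustration)
toy <- data.frame(c00 = c(1, 1, 1, 1),
                  c10 = c(1, 0, 1, 0),
                  c01 = c(0, 1, 0, 1),
                  c11 = c(1, 1, 0, 0),
                  RUNS.VALUE = c(0.30, -0.10, 0.10, -0.20))

# One row per count, with a temporary zero value
counts <- expand.grid(balls = 0:1, strikes = 0:1)
counts$value <- 0

# Mean run value of the plate appearances that went through count b-s
run.value.at <- function(b, s, data) {
  column <- paste0("c", b, s)
  mean(data$RUNS.VALUE[data[[column]] == 1])
}

counts$value <- mapply(run.value.at, counts$balls, counts$strikes,
                       MoreArgs = list(data = toy))
counts
```

With the full data frame the same idea would run over balls 0 to 3 and strikes 0 to 2.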
Finally, using the countmap function introduced in Section 7.2 with the
runs.by.count data frame, the run values are visualized for all of the possible
balls/strikes counts. (See Figure 7.2.)
countmap(runs.by.count)
By glancing at the values and shading colors in Figure 7.2, one can construct
reasonable definitions for the terms “hitter’s count” and “pitcher’s
count.” Ball/strike counts can be roughly divided into the following four
categories7:
              strikes
              0       1       2
balls   0     0   −0.04   −0.09
FIGURE 7.2
Run value for plate appearances through each balls/strikes count. Values
estimated on data from the 2011 season.
count as having the same run expectancy, no matter if the pitcher started
ahead 0-2 or fell behind 2-0. The implicit assumption in these calculations is
that the previous counts have no influence on the outcome on a particular
count. However, a pitcher getting ahead 0-2 is likely to “waste some pitches.”
That is, he would likely throw a few balls out of the strike zone with the sole
intent of making the batter (who cannot afford another strike) swing at them
and possibly miss or make poor contact. On the other hand, with a plate
appearance starting with two balls, the batter has the luxury of not swinging
at strikes in undesirable locations and of waiting for the pitcher to deliver
a pitch of his liking.
Given the above discussion, it would seem that the run expectancy on a
2-2 count would be higher if the plate appearance started with two balls than
if the pitcher started quickly with a 0-2 count. Let’s investigate if there is
numerical evidence to actually reflect this conjecture.
We begin by taking the subset of plays from the 2011 season that went
through a 2-2 count and calculate their mean run value.
count22 <- subset(pbp11rc, c22 == 1)
mean(count22$RUNS.VALUE)
[1] -0.03134373
A new variable after2 is created, denoting the ball/strike count after two
pitches. The mean run value is calculated for each of the three possible levels
of after2.
count22$after2 <- ifelse(count22$c20 == 1, "2-0",
ifelse(count22$c02 == 1, "0-2", "1-1"))
aggregate(RUNS.VALUE ~ after2, data=count22, FUN=mean)
after2 RUNS.VALUE
1 0-2 -0.02440420
2 1-1 -0.03277434
3 2-0 -0.03570539
The above results appear counterintuitive, as they seemingly imply that plate
appearances going through a 2-2 count after having started with two strikes
are more favorable to the hitter than those beginning with two balls.
This surprising result is actually a byproduct of a selection bias. Many
plate appearances starting with two strikes end without ever reaching the
2-2 count, in most cases with an unfavorable outcome for the batter.8 The
plate appearances that survive a 0-2 count and reach 2-2 are hardly a random
sample of all plate appearances. Hard-to-strike-out batters are likely
overrepresented in such a sample, as are pitchers who do not possess a quality
pitch to finish off opponents.
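The mechanism can be made concrete with a few made-up plate appearances. In this sketch the data frame toy and every number in it are invented for illustration; it only shows how conditioning on reaching 2-2 discards the poor outcomes that followed an 0-2 start.

```r
# Made-up plate appearances: four start 0-2, two start 2-0
toy <- data.frame(c02 = c(1, 1, 1, 1, 0, 0),
                  c20 = c(0, 0, 0, 0, 1, 1),
                  c22 = c(1, 0, 0, 0, 1, 1),
                  RUNS.VALUE = c(0.10, -0.25, -0.30, -0.20, 0.05, -0.05))

# Share of each starting path that ever reaches 2-2
reach22 <- c("0-2" = mean(toy$c22[toy$c02 == 1]),
             "2-0" = mean(toy$c22[toy$c20 == 1]))
reach22

# Mean run value after 0-2: all plate appearances versus the 2-2 survivors
all02 <- mean(toy$RUNS.VALUE[toy$c02 == 1])
survivors02 <- mean(toy$RUNS.VALUE[toy$c02 == 1 & toy$c22 == 1])
c(all = all02, survivors = survivors02)
```

In the toy data the survivors look far better than the full set of 0-2 plate appearances, which is the pattern the selection bias produces.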
A similar study comparing the paths leading to 1-1 counts gives
results more in line with common sense, as this count is less susceptible to the
same selection bias.
count11 <- subset(pbp11rc, c11 == 1)
count11$after1 <- ifelse(count11$c10 == 1, "1-0", "0-1")
aggregate(RUNS.VALUE ~ after1, data=count11, FUN=mean)
after1 RUNS.VALUE
1 0-1 -0.013234572
2 1-0 -0.009274024
The numbers above suggest that after reaching a 1-1 count, the batter is
expected to perform slightly better if the first pitch was a ball than if it was
a strike.
8 In 2011, 80% of plate appearances beginning with two strikes and not reaching the 2-2
After the surface on the dataset has been fit, we are interested in predicting
the likelihood of a swing by Cabrera at various pitch locations. Using the
expand.grid function, a data frame is built consisting of combinations of
horizontal locations from −2 (two feet to the left of the middle of home plate)
to +2 (two feet to the right of the middle of the plate) and vertical locations
cerned than in Figure 7.3. If one wants to obtain the black and white version displayed
in this book, the following line of code has to be inserted before the call to xyplot:
trellis.par.set(canonical.theme(color=FALSE))
FIGURE 7.3
Scatterplot of Miguel Cabrera’s swinging tendency by location. Sample of 500
pitches. View from the catcher’s perspective. (Vertical axis: vertical location (ft.);
legend: swung, not swung.)
from the ground (value of zero) to six feet of height, using subintervals of
0.1 feet. By using the predict function,10 the likelihood of Miguel’s swing is
obtained at every location in the data frame.
pred.area <- expand.grid(px=seq(-2, 2, 0.1), pz=seq(0, 6, 0.1))
pred.area$fit <- c(predict(miggy.loess, pred.area))
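The fit-then-predict pattern can be tried without the Cabrera data. The sketch below uses simulated pitches, so the swing rule, the seed and the names sim, sim.loess and grid are all assumptions of this example. A coarse grid with steps of 0.5 is used so that the grid values are exactly representable and can be selected with ==.

```r
set.seed(1)
n <- 500
sim <- data.frame(px = runif(n, -2, 2), pz = runif(n, 0, 5))
# Made-up rule: swings become less likely away from the point (0, 2.5)
sim$swung <- rbinom(n, 1, plogis(2 - 1.5 * (sim$px^2 + (sim$pz - 2.5)^2)))

sim.loess <- loess(swung ~ px + pz, data = sim,
                   control = loess.control(surface = "direct"))

# predict() returns an array when newdata comes from expand.grid;
# c() flattens it into one fitted value per grid row
grid <- expand.grid(px = seq(-2, 2, 0.5), pz = seq(0, 5, 0.5))
grid$fit <- c(predict(sim.loess, grid))

subset(grid, px == 0 & pz == 2.5)   # pitch down the middle
subset(grid, px == 2 & pz == 2.5)   # mid-height but far outside
```

On a finer grid, such as steps of 0.1, comparing with a small tolerance is safer than ==, because not every multiple of 0.1 is stored exactly.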
From the data frame pred.area the likelihood that Miguel will swing is
estimated for three different locations – a pitch down the middle and two and
a half feet from the ground (“down Broadway”), a ball that hits the ground in
the middle of the plate (“ball in the dirt”), and another one delivered at mid-
height (2.5 feet from the ground) but way outside (two feet from the middle
of the plate). In each case, the subset function is used to take a subset of the
prediction data frame pred.area with specific values of the horizontal and
vertical locations px and pz.
10 The c function used in the second assignment converts the matrix resulting from the
The results are quite consistent with what one would expect: the pitch right
in the heart of the strike zone generates Cabrera’s swing more than 80 percent
of the time, while the ball in the dirt and the ball outside generate swings
at 17 percent and six percent rates, respectively.
A contour plot of the likelihood of the swing as a function of the horizon-
tal and vertical locations of the pitch is constructed using the contourplot
function in the lattice package. The meaning of most arguments in the R
code has been explained in Chapter 6; the at parameter is used to indicate
the levels at which we want the contour lines to be drawn. Figure 7.4 shows
the resulting contour plot.
contourplot(fit ~ px * pz, data=pred.area,
at=c(.2, .4, .6, .8),
aspect="iso",
xlim=c(-2, 2), ylim=c(0, 5),
xlab="horizontal location (ft.)",
ylab="vertical location (ft.)",
panel=function(...){
panel.contourplot(...)
panel.rect(inKzone, botKzone, outKzone, topKzone,
border="black", lty="dotted")
})
As expected, the likelihood of a swing decreases the further the ball is delivered
from the middle of the strike zone. The plot also shows that Cabrera has a
tendency to swing at pitches on the inside part of the plate.
FIGURE 7.4
Contour plot of Miguel Cabrera’s swinging tendency by location, viewed from
the catcher’s perspective. The contour lines are labeled by the
probability of swinging at the pitch. (Contour levels: 0.2, 0.4, 0.6, 0.8;
vertical axis: vertical location (ft.).)
ball-strike count. Using the subset function, a new data frame miggy00 is
constructed consisting of pitch data only when the ball-strike count is 0-0. The
loess function is used to find the likelihood of swing surface, and the predict
function computes the likelihood for every horizontal and vertical location in
the data frame pred.area.
cabrera$bscount <- paste(cabrera$balls, cabrera$strikes, sep="-")
miggy00 <- subset(cabrera, bscount == "0-0")
miggy00loess <- loess(swung ~ px + pz, data=miggy00, control=
loess.control(surface="direct"))
pred.area$fit00 <- c(predict(miggy00loess, pred.area))
The same procedure was repeated for the 0-2 and 2-0 counts. The code
is not shown here, but the reader can write the R code by slight modifications
of the code for the 0-0 count. Once the additional variables fit02 and
fit20 are obtained for the pred.area data frame, contour plots on separate
panels can be constructed to compare Cabrera’s swinging tendencies by pitch
count (Figure 7.5). In the contourplot function, the separate-panel display
is produced by means of the formula fit00 + fit02 + fit20 ~ px * pz.
contourplot(fit00 + fit02 + fit20 ~ px * pz, data=pred.area,
at=c(.2, .4, .6),
aspect="iso",
xlim=c(-2, 2), ylim=c(0, 5),
xlab="horizontal location (ft.)",
ylab="vertical location (ft.)",
panel=function(...){
panel.contourplot(...)
panel.rect(inKzone, botKzone, outKzone, topKzone,
border="black", lty="dotted")
})
As expected, Cabrera expands his swing zone when behind 0-2 (his 40% con-
tour line on 0-2 counts has an area comparable to his 20% contour line on 0-0
counts). The third panel of Figure 7.5 does not suggest a shrinkage of Cabrera’s
swinging zone on 2-0 counts. Miguel seems to be increasingly looking for
pitches up-and-in when ahead in the count.
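For readers who want a starting point for the fit02 and fit20 columns, one possible approach is to wrap the 0-0 steps in a helper and loop over the counts. This is a sketch on simulated data, not the code used for Figure 7.5: the helper name fit.by.count, the data frame sim and the swing rule are all hypothetical.

```r
# Hypothetical helper: fit the swing surface for one count and predict on a grid
fit.by.count <- function(pitches, count, grid) {
  one.count <- pitches[pitches$bscount == count, ]
  fit <- loess(swung ~ px + pz, data = one.count,
               control = loess.control(surface = "direct"))
  c(predict(fit, grid))
}

# Simulated stand-in for the cabrera data frame (all values are made up)
set.seed(7)
n <- 900
sim <- data.frame(bscount = rep(c("0-0", "0-2", "2-0"), each = 300),
                  px = runif(n, -2, 2), pz = runif(n, 0, 5),
                  stringsAsFactors = FALSE)
sim$swung <- rbinom(n, 1, plogis(1.5 - 1.5 * (sim$px^2 + (sim$pz - 2.5)^2)))

grid <- expand.grid(px = seq(-2, 2, 0.5), pz = seq(0, 5, 0.5))
for (count in c("0-0", "0-2", "2-0")) {
  # "0-2" becomes the column name "fit02", and so on
  grid[[paste0("fit", sub("-", "", count))]] <- fit.by.count(sim, count, grid)
}
head(grid)
```

With the real data, the same loop would be run on cabrera and pred.area after the bscount variable has been created.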
FIGURE 7.5
Miguel Cabrera’s 50/50 swing zone in different balls/strikes counts. View from
the catcher’s perspective. (Panels: fit00, fit02, fit20; vertical axis:
vertical location (ft.).)
(especially when hitters are not expecting them), are harder to control and
rarely used by pitchers behind in the count. In this section we look at one
pitcher (arguably one of the best in MLB at the time of this writing) and
explore how he chooses from his pitch repertoire according to the ball/strike
count.
The verlander data frame, with over 15 thousand observations,
contains pitch data for Justin Verlander for five seasons. Using the table
command, a frequency table of the types of pitches Verlander has thrown since
2009 is obtained.
table(verlander$pitch_type)
CH CU FF FT SL
2550 2716 6756 2021 1264
As is the case with most major league pitchers, Verlander most frequently
uses the fastball. He uses two variations of a fastball, a four-seamer (FF) and
a two-seamer (FT). He complements his fastballs with a curve ball (CU), a
change-up (CH), and a slider (SL).
The prop.table function is used to obtain the pitch type proportions
rather than their frequencies. The input to prop.table is the table of fre-
quencies. To obtain percentages, the result is multiplied by 100 and rounded
(using the round function) to the nearest integer.
round(100 * prop.table(table(verlander$pitch_type)))
CH CU FF FT SL
17 18 44 13 8
It can be seen from the table that 44% of Verlander’s pitches during this
five-season period were four-seamers.
Before moving to exploring pitch selection by ball/strike count, a frequency
table is used to explore the pitch selection by batter handedness. One constructs
frequencies of pitch type for each batter hand by specifying two parameters
in the table function. To compute proportions of pitch type for
each batter hand, the prop.table function is used with argument margin=2.
(The function computes proportions by row if margin=1 and by column if
margin=2.)
type_verlander_hand <- with(verlander, table(pitch_type,
batter_hand))
round(100 * prop.table(type_verlander_hand, margin=2))
batter_hand
pitch_type L R
CH 23 8
CU 17 18
FF 43 45
FT 15 11
SL 2 17
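The effect of the margin argument is easy to verify on a tiny made-up table, where the sums can be checked by hand. The names toy, by.row and by.col are introduced only for this sketch.

```r
# Made-up pitches: two change-ups and a fastball to lefties,
# two fastballs and a slider to righties
toy <- table(pitch_type  = c("CH", "CH", "FF", "FF", "FF", "SL"),
             batter_hand = c("L",  "L",  "L",  "R",  "R",  "R"))

by.row <- prop.table(toy, margin = 1)   # each pitch type sums to 1
by.col <- prop.table(toy, margin = 2)   # each batter hand sums to 1
rowSums(by.row)
colSums(by.col)
round(100 * by.col)
```

Since the question here is how pitches to each batter hand are split among pitch types, the column version (margin=2) is the one that applies.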
Note that the pitch selection is quite different depending on the handedness
of the opposing batter. In particular, the right-handed Verlander uses his
change-up nearly a quarter of the time against left-handed hitters, but only
eight percent of the time against right-handed hitters. Conversely the slider is
nearly absent from his repertoire when he faces lefties, while he uses it close
to one out of five times against righties.
Batter-hand differences in pitch selection are common among major league
pitchers and they exist because the effectiveness of a given pitch depends
on the handedness of the pitcher and the batter. The slider and change-up
comparison is a typical example: a slider is very effective against batters of
the same handedness, while a change-up can be successful when facing
opposite-handed batters.
Justin’s pitch selection can also be explored by ball/strike count. First, a new
bscount variable is created by merging the values in the columns balls and
strikes. Since pitch selection depends on batter handedness, pitch mixing is
examined separately by the opponent handedness. In the following code, the
subset function is used to select Verlander’s pitches delivered to right-handed
batters. The table function constructs a table of frequencies by count and
pitch type and the prop.table function with the margin=1 argument com-
putes row proportions.
verlander$bscount <- paste(verlander$balls, verlander$strikes,
sep="-")
verl_RHB <- subset(verlander, batter_hand == "R")
verl_type_cnt_R <- table(verl_RHB$bscount, verl_RHB$pitch_type)
round(100 * prop.table(verl_type_cnt_R, margin=1))
CH CU FF FT SL
0-0 7 11 53 16 13
0-1 6 24 40 10 19
0-2 16 28 27 6 22
1-0 5 11 52 11 21
1-1 8 24 40 10 19
1-2 14 33 28 6 19
2-0 2 2 70 15 12
2-1 6 8 51 14 20
2-2 10 29 36 9 16
3-0 8 0 81 10 0
3-1 2 0 78 12 8
3-2 4 4 69 12 11
The effect of the ball/strike count on the choice of pitches is apparent when
comparing pitcher’s counts and hitter’s counts. When behind 2-0, Verlander
uses his four-seamer seven times out of ten; the percentage goes up to 78%
when trailing 3-1 and 81% on 3-0 counts. Conversely, when Justin has the
chance to strike the batter out, the use of the four-seamer diminishes. In fact
he throws it less than 30 percent of the time both on 0-2 and 1-2 counts. On
a full count, Verlander’s propensity to go with the fastball is similar to the
one in hitters’ counts – this is consistent with the numbers in Figure 7.2 that
indicate the 3-2 count being slightly favorable to the hitter. Similarly, one can
explore Verlander’s choices by count when facing a left-handed hitter.
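The left-handed version needs only a different value in the subsetting step. One way to avoid repeating the code, sketched here on made-up pitches, is a small helper that takes the batter hand as an argument; the helper name type.by.count and the toy data are hypothetical.

```r
# Hypothetical helper: pitch-type percentages by count against one batter hand
type.by.count <- function(pitches, hand) {
  one.hand <- pitches[pitches$batter_hand == hand, ]
  one.hand$bscount <- paste(one.hand$balls, one.hand$strikes, sep = "-")
  round(100 * prop.table(table(one.hand$bscount, one.hand$pitch_type),
                         margin = 1))
}

# Made-up pitches standing in for the verlander data frame
toy <- data.frame(balls       = c(0, 0, 0, 3, 3, 0, 0),
                  strikes     = c(0, 0, 2, 0, 0, 2, 0),
                  pitch_type  = c("FF", "CH", "CU", "FF", "FF", "CH", "SL"),
                  batter_hand = c("L", "L", "L", "L", "L", "L", "R"),
                  stringsAsFactors = FALSE)
lefties <- type.by.count(toy, "L")
lefties
```

Calling the helper with the verlander data frame and "L" would give the table for left-handed hitters, and "R" would reproduce the table shown above.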
By slightly modifying the code above, the reader can easily repeat the process
for other counts. In this section we compare the 0-0 count to the most extreme
batter and pitcher counts, 3-0 and 0-2 counts, respectively.
In this instance the contour lines of balls/strikes calling in different counts
will be plotted in a single panel, rather than in separate panels as was done
for swinging tendencies in Figure 7.5. The contourLines function12 is helpful
for this task, as it calculates coordinates to plot contour lines. The following
code calculates, for the 0-0 count, the 50% strike calling contour line and
11 www.hardballtimes.com/main/article/the-compassionate-umpire/.
12 The function comes with the grDevices package, which is loaded by default in R.
TABLE 7.2
A twenty-row sample of the umpires dataset.
(row #)  season  umpire           batter_hand  pitch_type  balls  strikes     px    pz  called_strike
  80357    2012  Doug Eddings     R            SL              0        0  -0.17  2.88              1
  33199    2012  Alfonso Marquez  L            FF              0        0   0.03  2.53              1
 187437    2012  Angel Hernandez  L            SL              1        1  -1.48  2.35              0
 260385    2012  Ed Rapuano       R            FT              1        1  -1.44  2.05              0
 169709    2012  Sam Holbrook     L            FF              2        1   0.80  2.42              0
 195437    2012  Dan Bellino      L            FT              0        0   1.07  1.59              0
 166333    2012  Eric Cooper      R            FF              0        0  -0.77  2.35              1
 264734    2012  Gerry Davis      L            FF              0        0   0.55  2.08              1
 191453    2012  D.J. Reyburn     R            FC              1        1  -1.13  2.69              1
  10811    2012  Kerwin Danley    R            SL              1        1   0.89  2.89              0
  12474    2012  Mark Ripperger   R            FF              0        0  -1.11  1.80              0
 134379    2012  Laz Diaz         L            FF              0        0  -1.84  3.21              0
  55770    2012  Jeff Nelson      R            CU              2        2   1.70  2.02              0
 203183    2012  Tim Timmons      R            FF              1        0   1.18  2.97              0
 202977    2012  Sam Holbrook     L            SL              2        1  -0.23  1.01              0
  58915    2012  Mark Wegner      L            CH              0        1  -1.65  2.42              0
 300411    2012  Tim Timmons      L            FF              0        0  -1.21  2.40              0
  84853    2012  Paul Nauert      R            CU              3        2  -0.23  1.04              0
 312177    2012  Jim Joyce        L            SL              3        2  -0.25  1.65              0
 216589    2012  Brian Runge      L            FT              1        1  -0.66  2.29              1
converts the resulting list into a data frame, to which a new variable bscount
indicating the balls/strikes count is added.
ump00contour <- contourLines(x=seq(-2, 2, 0.1),
y=seq(0, 6, 0.1),
z=predict(ump00.loess, pred.area),
levels=c(.5))
ump00df <- as.data.frame(ump00contour)
ump00df$bscount <- "0-0"
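The behavior of contourLines can be explored on a surface whose 50% line is known in advance. In the sketch below the surface is made up so that the probability equals 0.5 exactly on a circle of radius one around the point (0, 2.5); the names lines50 and contour.df are introduced for this illustration. The z argument has to be a matrix with one row per x value and one column per y value.

```r
x <- seq(-2, 2, 0.1)
y <- seq(0, 6, 0.1)
# Made-up surface: probability 0.5 exactly on a circle of radius 1 around (0, 2.5)
z <- outer(x, y, function(px, pz) plogis(4 - 4 * (px^2 + (pz - 2.5)^2)))

lines50 <- contourLines(x = x, y = y, z = z, levels = 0.5)

# contourLines returns a list with one element (level, x, y) per line;
# binding the pieces row-wise also works when a level is split into several lines
contour.df <- do.call(rbind, lapply(lines50, as.data.frame))
contour.df$bscount <- "0-0"
head(contour.df)
```

Data frames built this way for several counts can be stacked with rbind and then drawn in a single panel, using bscount to distinguish the lines.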
This figure shows that the umpire’s strike zone is shrunk in a 0-2 pitch count,
and slightly expanded in a 3-0 count.
FIGURE 7.6
Umpires’ 50/50 strike calling zone in different balls/strikes counts viewed from
the catcher’s perspective. (Legend, balls/strikes count: 0-0, 0-2, 3-0;
vertical axis: vertical location (ft.).)
7.5 Exercises
1. (Run Value of Individual Pitches)
(a) Calculate the run value of a ball and of a strike at any count. For
3-ball and 2-strike counts you need the value of a walk and a strike-
out respectively (you can calculate them as done for other events in
Chapter 5).
(b) Compare your values to the ones proposed by John Walsh in the
article www.hardballtimes.com/main/article/searching-for-the-games-best-pitch/.
CONTENTS
8.1 Introduction
8.2 Mickey Mantle’s Batting Trajectory
8.3 Comparing Trajectories
8.3.1 Some preliminary work
8.3.2 Computing career statistics
8.3.3 Computing similarity scores
8.3.4 Defining age, OBP, SLG, and OPS variables
8.3.5 Fitting and plotting trajectories
8.4 General Patterns of Peak Ages
8.4.1 Computing all fitted trajectories
8.4.2 Patterns of peak age over time
8.4.3 Peak age and career at-bats
8.5 Trajectories and Fielding Position
8.6 Further Reading
8.7 Exercises
8.1 Introduction
The R system is well-suited for fitting statistical models to data. One popular
topic in sabermetrics is the rise and fall of a player’s season batting, fielding,
or pitching statistics from his MLB debut to retirement. Generally, it is
believed that most players peak in their late 20s, although some players tend
to peak at later ages. A simple way of modeling a player’s trajectory is by
means of a quadratic or parabolic curve. By use of the lm (linear model)
function in R, it is straightforward to fit this model using the player’s age and
his OPS statistics.
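As a minimal sketch of such a fit, the code below applies lm to made-up, noise-free seasons. It assumes a quadratic in age centered at 30, which is only a convenience for reading the coefficients, and the names toy, peak.age and peak.ops are introduced for this illustration. The peak age follows from setting the derivative of the quadratic to zero.

```r
# Made-up, noise-free seasons that peak at age 28 with an OPS of 1.000
toy <- data.frame(Age = 19:36)
toy$OPS <- 1 - 0.002 * (toy$Age - 28)^2

# I() protects the arithmetic so that lm treats each term as one predictor
fit <- lm(OPS ~ I(Age - 30) + I((Age - 30)^2), data = toy)
b <- unname(coef(fit))   # b[1] = A, b[2] = B, b[3] = C

# Setting the derivative B + 2C(Age - 30) to zero gives the peak age
peak.age <- 30 - b[2] / (2 * b[3])
peak.ops <- b[1] - b[2]^2 / (4 * b[3])
c(peak.age = peak.age, peak.ops = peak.ops)
```

Because the toy data lie exactly on a parabola, the fit recovers the peak age of 28 and the maximum OPS of 1; real seasons would scatter around the fitted curve.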
We begin in Section 8.2 by considering a famous career trajectory. Mickey
Mantle made an immediate impact on the New York Yankees at age 19 and
quickly matured into one of the best hitters in baseball. But injuries took a toll
on Mantle’s performance and his hitting declined until his retirement at age
36. We use Mantle to introduce the quadratic model – using this model, one
can define his peak age, the maximum performance, and the rate of increase
and decline in performance.
To compare career performances of similar players, it is helpful to contrast
their trajectories and Section 8.3 illustrates the computation of many fitted
trajectories. Using Bill James’ notion of similarity scores, we write a function
that will find players who are most similar to a given hitter. Then we graphi-
cally compare the OPS trajectories of these similar players; by viewing these
graphs we gain a general understanding of the possible trajectory shapes.
A general problem focuses on a player’s peak age. In Section 8.4, we look
at the fitted trajectories of all hitters with at least 2000 career at-bats. The
pattern of peak ages across eras and as a function of the number of career
at-bats is explored. Also, since it is common to compare players who play the
same position, in Section 8.5 we focus on the period 1985-1995 and contrast
the peak ages for players who play different fielding positions.
We extract Mantle’s playerID from the Master data frame. By use of the
subset function, the line in the Master data file is found where nameFirst ==
"Mickey" and nameLast == "Mantle". His player id is stored in the variable
mantle.id.
mantle.info <- subset(Master,
nameFirst == "Mickey" & nameLast == "Mantle")
mantle.id <- as.character(mantle.info$playerID)
One small complication is that certain statistics such as SF and HBP were
not recorded for older seasons and are currently coded as NA. A convenient
way of recoding these missing values to 0 is by the recode function in the car
package.
library(car)
Batting$SF <- recode(Batting$SF, "NA = 0")
Batting$HBP <- recode(Batting$HBP, "NA = 0")
To compute Mantle’s age for each season, we need to know his birth year
which is available in the Master data frame. Major League Baseball defines a
player’s age as his age on June 30 of that particular season. To facilitate the
computation of ages, a new function get.birthyear is defined which gives
the “official” birth year of a player with id player.id, similar to what was
done in the getinfo function of Section 3.8.
get.birthyear <- function(player.id){
playerline <- subset(Master, playerID == player.id)
birthyear <- playerline$birthYear
birthmonth <- playerline$birthMonth
ifelse(birthmonth >= 7, birthyear + 1, birthyear)
}
To check this function, the MLB birth year for Mantle is found using his id
stored in mantle.id.
get.birthyear(mantle.id)
[1] 1932
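As a hedged illustration of how the official birth year feeds into a season-by-season table, the sketch below computes Age and OPS for a small set of seasons. The data frame toy and its numbers are hypothetical and are not taken from the Lahman data. The SLG and OBP formulas are the standard definitions, and SF and HBP are assumed to have been recoded from NA to 0 already.

```r
# Hypothetical season lines for one player (made-up numbers, not real data).
toy <- data.frame(yearID = c(1951, 1952),
                  AB = c(300, 500), H = c(80, 150),
                  X2B = c(10, 30), X3B = c(5, 7), HR = c(10, 20),
                  BB = c(40, 70), HBP = c(0, 0), SF = c(0, 0))
birthyear <- 1932  # value returned by get.birthyear(mantle.id) above

# Age is the season minus the "official" birth year.
toy$Age <- toy$yearID - birthyear

# Standard definitions of slugging and on-base percentage.
toy$SLG <- with(toy, (H - X2B - X3B - HR + 2 * X2B + 3 * X3B + 4 * HR) / AB)
toy$OBP <- with(toy, (H + BB + HBP) / (AB + BB + HBP + SF))
toy$OPS <- with(toy, SLG + OBP)
toy[, c("yearID", "Age", "SLG", "OBP", "OPS")]
```

With a birth year of 1932, the 1951 season is assigned age 19, which agrees with the age quoted for Mantle's debut.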
FIGURE 8.1
Scatterplot of OPS against age for Mickey Mantle.
Looking at this figure, it is clear that Mantle’s OPS values tend to increase
from age 19 to his late 20s, and then generally decrease until his retirement
at age 36. One can model this up-and-down relationship by use of a smooth
curve. This curve will help us understand and summarize Mantle’s career
batting trajectory and will make it easier to compare Mantle’s trajectory
with other players with similar batting performances.
A convenient choice of smooth curve is a quadratic function of the form
$$A + B(\mathrm{Age} - 30) + C(\mathrm{Age} - 30)^2,$$
where the constants A, B, and C are chosen so that the curve is the “best” match
to the points in the scatterplot. This quadratic curve has the following nice
properties that make it easy to use.
1. The constant A is the predicted value of OPS when the player is 30 years
old.
2. The function reaches its largest value at
$$\mathrm{PEAK.AGE} = 30 - \frac{B}{2C}.$$
This is the age where the player is estimated to have his peak batting
performance during his career.
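For readers who want the intermediate step, the peak age follows from setting the derivative of the quadratic to zero. The symbol f is introduced here only for this derivation, and the stationary point is a maximum only when C is negative.

```latex
\begin{aligned}
f(\mathrm{Age}) &= A + B(\mathrm{Age} - 30) + C(\mathrm{Age} - 30)^2, \\
f'(\mathrm{Age}) &= B + 2C(\mathrm{Age} - 30) = 0
  \;\Longrightarrow\; \mathrm{PEAK.AGE} = 30 - \frac{B}{2C}, \\
f(\mathrm{PEAK.AGE}) &= A - \frac{B^2}{2C} + \frac{B^2}{4C} = A - \frac{B^2}{4C}.
\end{aligned}
```

The last line is the maximum value of the fitted curve, which is the quantity computed as Max in the code of this section.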
indicates that OPS is the response and (Age - 30) and (Age - 30)^2 are the
predictors. The estimated coefficients A, B, and C are saved in the vector b.
The peak age and maximum value are stored in the variables Age.max and
Max.
fit.model <- function(d){
fit <- lm(OPS ~ I(Age - 30) + I((Age - 30)^2), data=d)
b <- coef(fit)
Age.max <- 30 - b[2] / b[3] / 2
Max <- b[1] - b[2] ^ 2 / b[3] / 4
list(fit=fit,
Age.max=Age.max, Max=Max)
}
Using this model, Mantle peaked at age 27 and his maximum OPS for the curve
is estimated to be 0.985. The estimated value of the curvature parameter is
−0.00352, thus Mantle’s decrease in OPS between his peak age and one year
older is 0.00352.
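One way to check these interpretations is to fit the same model to synthetic data whose peak is known in advance. The sketch below is only an illustration: fit.quadratic is a local copy of the fitting logic shown above, and the data frame toy is a hypothetical noise-free trajectory, not a real player.

```r
# Local copy of the quadratic fit, mirroring fit.model above.
fit.quadratic <- function(d){
  fit <- lm(OPS ~ I(Age - 30) + I((Age - 30)^2), data = d)
  b <- coef(fit)
  list(fit = fit,
       Age.max = unname(30 - b[2] / b[3] / 2),
       Max = unname(b[1] - b[2]^2 / b[3] / 4))
}

# Synthetic, noise-free trajectory built with peak age 28 and peak OPS 0.95.
toy <- data.frame(Age = 20:36)
toy$OPS <- 0.95 - 0.003 * (toy$Age - 28)^2

out <- fit.quadratic(toy)
out$Age.max          # recovers the peak age used to build the data
out$Max              # recovers the peak OPS used to build the data
coef(out$fit)[3]     # curvature C

# Drop in fitted OPS between the peak age and one year later equals |C|.
drop <- predict(out$fit, data.frame(Age = out$Age.max)) -
  predict(out$fit, data.frame(Age = out$Age.max + 1))
drop
```

Because the curve is symmetric about its peak, the same drop applies one year before the peak.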
This best quadratic curve is placed on the scatterplot. The predict func-
tion is used to estimate Mantle’s OPS from the curve for the sequence of age
values and the lines function overlays these values as a line on the current
plot. Two applications of abline show the locations of the peak age and the
maximum, and the text function is used to label these values. The resulting
graph is displayed in Figure 8.2.
lines(Mantle$Age, predict(F2$fit, newdata=data.frame(Age=Mantle$Age)), lwd=3)
abline(v=F2$Age.max, lwd=3, lty=2, col="grey")
abline(h=F2$Max, lwd=3, lty=2, col="grey")
text(29, .72, "Peak.age" , cex=2)
text(20, 1, "Max", cex=2)
Although the focus was on the best fitting quadratic curve, more details
about the fitting procedure are stored in the output of lm, which is saved in the
component fit. We display part of this output by taking the summary of
the fit.
summary(F2$fit)
...
Residual standard error: 0.07501 on 15 degrees of freedom
Multiple R-squared: 0.6093, Adjusted R-squared: 0.5572
F-statistic: 11.69 on 2 and 15 DF, p-value: 0.0008692
The value of $R^2$ is 0.6093 – this means that approximately 61% of the
variability in Mantle’s OPS values can be explained by the quadratic curve. The
residual standard error is equal to 0.075. Approximately 2/3 of the vertical
deviations (the “residuals”) from the curve fall between plus and minus one
residual standard error. In this case, the interpretation is that approximately
2/3 of the residuals fall between −0.075 and 0.075.
FIGURE 8.2
Scatterplot of OPS against age for Mickey Mantle with a quadratic fit added.
The location of the peak age and the maximum OPS fit are displayed.
Many players in the history of baseball have had short careers, and in our
study of trajectories it seems reasonable to limit our analysis to players who
have had a minimum number of at-bats. We consider only players with at least
2000 career at-bats; this removes the hitting data of pitchers and other players
with short careers.
To take this subset of the Batting data frame, we use the ddply function
in the plyr package to compute the career at-bats for all players – the new
variable is called Career.AB. By use of the merge function, we add this new
variable to the Batting data frame. Finally, using the subset function, a new
data frame Batting.2000 is created consisting only of the “minimum 2000
AB” hitters.
library(plyr)
AB.totals <- ddply(Batting, .(playerID),
summarize,
Career.AB=sum(AB, na.rm=TRUE))
Batting <- merge(Batting, AB.totals)
Batting.2000 <- subset(Batting, Career.AB >= 2000)
In the following code, PLAYER is a character vector of player ids for all players
with at least 2000 career at-bats. The sapply function is used to find the pri-
mary fielding position for all players and the new data frame Fielding.2000
is created using the function data.frame containing the player ids and the
fielding positions (variable POS). This new information is placed into the
Batting.2000 data frame by use of the merge function.
PLAYER <- as.character(unique(Batting.2000$playerID))
POSITIONS <- sapply(PLAYER, find.position)
Fielding.2000 <- data.frame(playerID=names(POSITIONS),
POS=POSITIONS)
Batting.2000 <- merge(Batting.2000, Fielding.2000)
library(plyr)
C.totals <- ddply(Batting.2000, .(playerID),
summarize,
C.G=sum(G, na.rm=TRUE),
C.AB=sum(AB, na.rm=TRUE),
C.R=sum(R, na.rm=TRUE),
C.H=sum(H, na.rm=TRUE),
C.2B=sum(X2B, na.rm=TRUE),
C.3B=sum(X3B, na.rm=TRUE),
C.HR=sum(HR, na.rm=TRUE),
C.RBI=sum(RBI, na.rm=TRUE),
C.BB=sum(BB, na.rm=TRUE),
C.SO=sum(SO, na.rm=TRUE),
C.SB=sum(SB, na.rm=TRUE))
In the new data frame, we compute each player’s career batting average
C.AVG and his career slugging percentage C.SLG.
C.totals$C.AVG <- with(C.totals, C.H / C.AB)
C.totals$C.SLG <- with(C.totals,
(C.H - C.2B - C.3B - C.HR + 2 * C.2B +
3 * C.3B + 4 * C.HR) / C.AB)
The career statistics data frame C.totals is merged with the fielding data
frame Fielding.2000. Each fielding position has an associated value, and a
series of nested ifelse calls is used to define a position value variable Value.POS
from the position variable POS.
C.totals <- merge(C.totals, Fielding.2000)
C.totals$Value.POS <- with(C.totals,
ifelse(POS == "C", 240,
ifelse(POS == "SS", 168,
ifelse(POS == "2B", 132,
ifelse(POS == "3B", 84,
ifelse(POS == "OF", 48,
ifelse(POS == "1B", 12, 0)))))))
The function similar will find the players most similar to a given player
using similarity scores on career statistics and fielding position. One inputs
the id for the particular player and the number of similar players to be found
(including the given player). The output is a data frame of player statistics,
sorted in decreasing order of similarity score.
similar <- function(p, number=10){
P <- subset(C.totals, playerID == p)
C.totals$SS <- with(C.totals,
1000 -
floor(abs(C.G - P$C.G) / 20) -
floor(abs(C.AB - P$C.AB) / 75) -
floor(abs(C.R - P$C.R) / 10) -
floor(abs(C.H - P$C.H) / 15) -
floor(abs(C.2B - P$C.2B) / 5) -
floor(abs(C.3B - P$C.3B) / 4) -
floor(abs(C.HR - P$C.HR) / 2) -
floor(abs(C.RBI - P$C.RBI) / 10) -
floor(abs(C.BB - P$C.BB) / 25) -
floor(abs(C.SO - P$C.SO) / 150) -
floor(abs(C.SB - P$C.SB) / 20) -
floor(abs(C.AVG - P$C.AVG) / 0.001) -
floor(abs(C.SLG - P$C.SLG) / 0.002) -
abs(Value.POS - P$Value.POS))
C.totals <- C.totals[order(C.totals$SS, decreasing=TRUE), ]
C.totals[1:number, ]
}
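To see how a single similarity score is assembled, the sketch below applies the same divisors to two career lines. The helper sim.score and the vectors p1 and p2 are hypothetical and exist only for this illustration; the divisors and position values are the ones used in the function above.

```r
# Divisors from the similarity-score definition used in similar().
divisors <- c(G = 20, AB = 75, R = 10, H = 15, X2B = 5, X3B = 4, HR = 2,
              RBI = 10, BB = 25, SO = 150, SB = 20, AVG = 0.001, SLG = 0.002)

# Score between two named vectors of career statistics plus position values.
sim.score <- function(x, y, pos.x, pos.y){
  penalties <- floor(abs(x - y)[names(divisors)] / divisors)
  1000 - sum(penalties) - abs(pos.x - pos.y)
}

# Two hypothetical career lines (illustrative numbers only).
p1 <- c(G = 2400, AB = 8100, R = 1600, H = 2400, X2B = 340, X3B = 70, HR = 530,
        RBI = 1500, BB = 1700, SO = 1700, SB = 150, AVG = 0.298, SLG = 0.557)
p2 <- c(G = 2380, AB = 8500, R = 1500, H = 2300, X2B = 350, X3B = 72, HR = 510,
        RBI = 1450, BB = 1440, SO = 1480, SB = 68, AVG = 0.271, SLG = 0.509)

sim.score(p1, p1, 48, 48)   # a player compared with himself loses no points
sim.score(p1, p2, 48, 84)   # an outfielder (48) versus a third baseman (84)
```

Every category can only subtract points, so a score of 1000 is the ceiling and is reached only when no category differs by a full divisor.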
From reading the player ids, we see five similar players, in terms of career
hitting statistics and position: Eddie Mathews, Mike Schmidt, Gary Sheffield,
Frank Thomas, and Sammy Sosa.
A small complication is that the birth year is not recorded for a few 19th
century ballplayers, and so the age variable is missing for these players.
The complete.cases function is used to identify the records where the age is not
missing, and the updated data frame Batting.2000 only contains players for
whom the Age variable is available.
Batting.2000 <- Batting.2000[complete.cases(Batting.2000$Age), ]
require(ggplot2)
get.name <- function(playerid){
d1 <- subset(Master, playerID == playerid)
with(d1, paste(nameFirst, nameLast))
}
player.id <- subset(Master,
nameFirst == first & nameLast == last)$playerID
player.id <- as.character(player.id)
player.list <- as.character(similar(player.id, n.similar)$playerID)
Batting.new <- subset(Batting.2000, playerID %in% player.list)
Looking at Figure 8.3 and Figure 8.4, we see notable differences in these
trajectories.
• There are players such as Eddie Mathews, Frank Thomas, Mickey Mantle,
and Roberto Alomar who appeared to peak early in their careers.
• In contrast, other players such as Mike Schmidt, Craig Biggio, and Julio
Franco peaked in their 30s.
• The players also show differences in the shape of the trajectory. Johnny
Damon and Julio Franco had relatively constant trajectories, and Frankie
Frisch and Roberto Alomar had trajectories with high curvature.
One can summarize these trajectories by the peak age, the maximum value,
and the curvature. A short function summarize.trajectory is written to
compute these quantities for a particular fitted trajectory. The input is the
data frame d containing the batting statistics for a player and the output is a
data frame with three variables Age.max, Max, and Curve.
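The definition of this function is not reproduced here. A minimal sketch of one plausible version, assuming the fit.model function from earlier in the chapter, is given below under the deliberately different name summarize.trajectory.sketch. It is a guess at the structure described above, not the authors' code, and the data frame toy is hypothetical.

```r
# Assumed helper, mirroring fit.model from earlier in the chapter.
fit.model <- function(d){
  fit <- lm(OPS ~ I(Age - 30) + I((Age - 30)^2), data = d)
  b <- coef(fit)
  list(fit = fit,
       Age.max = 30 - b[2] / b[3] / 2,
       Max = b[1] - b[2]^2 / b[3] / 4)
}

# Possible form of the summary function (illustrative, not the authors' code).
summarize.trajectory.sketch <- function(d){
  f <- fit.model(d)
  data.frame(Age.max = unname(f$Age.max),
             Max = unname(f$Max),
             Curve = unname(coef(f$fit)[3]))
}

# Hypothetical noise-free seasons built with a peak at age 29.
toy <- data.frame(Age = 21:38)
toy$OPS <- 0.90 - 0.002 * (toy$Age - 29)^2
s <- summarize.trajectory.sketch(toy)
s
```

Returning a one-row data frame is what allows a grouping function such as ddply to stack the summaries for many players into a single table.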
FIGURE 8.3
Fitted trajectories of OPS against age for Mickey Mantle and five similar
players.
Recall that the output of plot.trajectories was a data frame containing the
season batting statistics for a group of players. One can use the ddply function
together with summarize.trajectory to find the summary statistics for all
players. This is illustrated for Jeter and eight similar players.
d <- plot.trajectories("Derek", "Jeter", 9, 3)
S <- ddply(d, .(playerID), summarize.trajectory)
S
playerID Age.max Max Curve
1 alomaro01 28.3 0.885 -0.00309
FIGURE 8.4
Fitted trajectories of OPS against age for Derek Jeter and eight similar
players.
FIGURE 8.5
Scatterplot of peak age and curvature measures for Derek Jeter and eight
similar players. The points are labeled by the player last names.
Figure 8.5 clearly indicates that Alomar peaked at an early age and Franco
at a late age, and that Alomar and Frisch exhibited the greatest curvature,
indicating that they declined rapidly in performance after the peak.
For each player, the variable yearID contains the seasons played. We
define a new variable Midyear to be the average of a player’s first and
last seasons. The function ddply is used to compute Midyear for all players
and this new variable is added to the Batting.2000 data frame using the
merge function.
library(plyr)
midcareers <- ddply(Batting.2000, .(playerID),
summarize,
Midyear=(min(yearID) + max(yearID))/2)
Batting.2000 <- merge(Batting.2000, midcareers)
Quadratic curves are fit to all of the career trajectories by another
application of the ddply function. A short function coefficients.trajectory is
defined that fits the quadratic model for the season data for a particular player
and returns the coefficients A, B, C, Midyear, and the player’s career at-bats
Career.AB. We apply ddply where playerID is the grouping variable and
coefficients.trajectory is the function to be applied on each subset. The
output Beta.coef is a data frame containing the coefficients for all players,
where a row corresponds to a particular player.
coefficients.trajectory <- function(d){
b <- coef(lm(OPS ~ I(Age - 30) + I((Age - 30) ^ 2), data=d))
data.frame(A=b[1], B=b[2], C=b[3],
Midyear=d$Midyear[1], Career.AB=d$Career.AB[1])
}
Beta.coef <- ddply(Batting.2000, .(playerID), coefficients.trajectory)
The estimated peak ages are computed for all players using the formula
Peak.age = 30 − B/(2C). The new variable Peak.age is added to the data
frame Beta.coef.
Beta.coef$Peak.age <- with(Beta.coef, 30 - B / 2 / C)
i <- is.finite(Beta.coef$Peak.age)
with(Beta.coef,
lines(lowess(Midyear[i], Peak.age[i]), lwd=3))
FIGURE 8.6
Scatterplot of peak age against mid career for all players with at least 2000
career at-bats.
Looking at this figure, we see a gradual increase in peak age over time. The
peak age for an average player was approximately 27 in 1880, and this average
gradually increased to about 28 by 2000.
FIGURE 8.7
Scatterplot of peak age against log career AB for all players with at least 2000
career at-bats.
Here we see a clear relationship. Players with relatively short careers, around
2000 career at-bats, tend to peak at about age 27. In contrast, players with long
careers, say 9000 or more at-bats, tend to peak at ages closer to 30.
1995. Using the subset function, a new data frame Batting.2000a is created
consisting of only these players.
Batting.2000a <- subset(Batting.2000, Midyear >= 1985 & Midyear <= 1995)
A stripchart is used to graph the peak ages of the players against the
fielding position. (See Figure 8.8.) Since some of the peak age estimates are
not reasonable values, the limits on the horizontal axis are set to 20 and 40.
stripchart(Peak.Age ~ Position, data=Beta.estimates1,
xlim=c(20, 40), method="jitter", pch=1)
special <- with(Beta.estimates1, identify(Peak.Age, Position,
n=5, labels=nameLast))
Generally, for all fielding positions, the peak ages for these 1990 players tend
to fall between 27 and 32. The variability in the peak age estimates reflects
the fact that hitters have different career trajectory shapes. There are three
outfielders and two shortstops who seem to stand out by having a high peak
age estimate. Using the identify function, we click on these unusual points
with the mouse, and the row numbers for these players are stored in
the variable special. The five players are Eric Davis, Gary DiSarcina, Jim
Eisenreich, Alvaro Espinoza, and Tony Phillips.
FIGURE 8.8
Peak age estimates for players with mid-career 1985–1995 graphed against
fielding position.
A new data frame dnew is formed containing the hitting statistics for these
five players with large peak age estimates, and this information is merged with
the Master data frame. The ggplot function is used to graph the data together
with the quadratic fits for these five players. (See Figure 8.9.) The geom_point
function adds the points, the stat_smooth function adds the quadratic curves,
and the facet_wrap function creates separate panels for the five players.
dnew <- subset(Batting.2000, playerID %in% Beta.estimates1$playerID[special])
dnew <- merge(dnew, Master)
ggplot(dnew, aes(Age, OPS)) + geom_point(size=4) +
facet_wrap(~ nameLast, ncol=2) + ylim(0.4, 1.05) +
stat_smooth(method="lm", se=FALSE, size=1.5,
formula=y ~ poly(x, 2, raw=TRUE)) + theme_bw()
FIGURE 8.9
OPS trajectory graphs for five players with unusually large peak age estimates.
8.7 Exercises
1. (Career Trajectory of Willie Mays)
(a) Use the get.stats function to extract the hitting data for Willie
Mays for all of his seasons in his career.
(b) Construct a scatterplot of Mays’ OPS season values against his age.
(c) Fit a quadratic function to Mays’ career trajectory. Based on this
model, estimate Mays’ peak age and his estimated largest OPS value
based on the fit.
2. (Comparing Trajectories)
(a) Using James’ similarity score measure (function similar), find the
five hitters with hitting statistics most similar to Willie Mays.
(b) Fit quadratic functions to the (Age, OPS) data for Mays and the five
similar hitters. Display the six fitted trajectories on a single panel.
(c) Based on your graph, describe the differences between the six player
trajectories. Which player had the smallest peak age?
3. (Comparing Trajectories of the Career Hits Leaders)
(a) Find the batters who have had at least 3200 career hits.
(b) Fit the quadratic functions to the (Age, AVG) data for this group of
hitters, where AVG is the batting average. Display the fitted trajec-
tories on a single panel.
(c) On the basis of your work, which player was the most consistent hitter
on average? Explain how you measured consistency on the basis of
the fitted trajectory.
4. (Comparing Trajectories of Home Run Hitters)
(a) Find the ten players in baseball history who have had the most career
home runs.
(b) Fit the quadratic functions to the home run rates of the ten players,
where HR.RATE = HR/AB. Display the fitted trajectories on a
single panel.
(c) On the basis of your work, which player had the highest estimated
home run rate at his peak? Which player among the ten had the
smallest peak home run rate?
(d) Do any of the players have unusual career trajectory shapes? Is there
any possible explanation for these unusual shapes?
5. (Peak Ages in the History of Baseball)
(a) Find all the players who entered baseball between 1940 and 1945 with
at least 2000 career at-bats.
(b) Find all the players who entered baseball between 1970 and 1975 with
at least 2000 career at-bats.
(c) By fitting quadratic functions to the (Age, OPS) data, estimate the
peak ages for all players in parts (a) and (b).
(d) By comparing the peak ages of the 1940s players with the peak ages
of the 1970s players, can you make any conclusions about how the
peak ages have changed in this 30-year period?
9
Simulation
CONTENTS
9.1 Introduction
9.2 Simulating a Half Inning
9.2.1 Markov chains
9.2.2 Review of work in runs expectancy
9.2.3 Computing the transition probabilities
9.2.4 Simulating the Markov chain
9.2.5 Beyond runs expectancy
9.2.6 Transition probabilities for individual teams
9.3 Simulating a Baseball Season
9.3.1 The Bradley-Terry model
9.3.2 Making up a schedule
9.3.3 Simulating talents and computing win probabilities
9.3.4 Simulating the regular season
9.3.5 Simulating the post-season
9.3.6 Function to simulate one season
9.3.7 Simulating many seasons
9.4 Further Reading
9.5 Exercises
9.1 Introduction
A baseball season consists of a collection of games between teams, where each
game consists of nine innings, and a half-inning consists of a sequence of plate
appearances. Because of this clean structure, the sport can be represented
by relatively simple probability models. Simulations from these models are
helpful in understanding different characteristics of the game.
One attractive aspect of the R system is its ability to simulate from a
wide variety of probability distributions. In this chapter, we illustrate the
use of R functions to simulate a game consisting of a large number of plate
appearances. Also, R is used to simulate the game-to-game team competition
of teams during an entire season.
Section 9.2 focuses on simulating the events in a baseball half-inning using
a special probability model called a Markov chain. The runners on base
and the number of outs define a state and this probability model describes
movements between states until one reaches three outs. The movement or
transition probabilities are found using actual data from the 2011 season. By
simulating many half-innings using this model, one gets a basic understanding
of the pattern of run scoring.
Section 9.3 describes a simulation of an entire baseball season using the
Bradley-Terry probability model. Teams are assigned talents from a bell-
shaped (normal) distribution and a season of baseball games is played us-
ing win probabilities based on the talents. By simulating many seasons, one
learns about the relationship between a team’s talent and its performance in
a 162-game season. We describe simulating the post-season series and assess
the probability that the “best” team, that is, the team with the best ability,
actually wins the World Series.
library(plyr)
data.outs <- ddply(data2011, .(HALF.INNING), summarize,
Outs.Inning=sum(EVENT_OUTS_CT))
data2011 <- merge(data2011, data.outs)
data2011C <- subset(data2011, Outs.Inning == 3)
Let’s contrast this with the possible transitions starting from the “010 2”
state, runner on second with two outs. The most likely transitions are “3 outs”
(probability 0.640), “runners on first and second with two outs” (probability
0.163), and “runner on first with 2 outs” (probability 0.083).
P2 <- round(P.matrix["010 2", ], 3)
data.frame(Prob=P2[P2 > 0])
Prob
000 2 0.020
001 2 0.006
010 2 0.055
100 2 0.083
101 2 0.034
110 2 0.163
3 0.640
There are two key statements in this simulation. If the current state is s, the
function sample will simulate a new state using the s row in the transition
matrix P – the new state is denoted s.new. The total number of runs scored
in the inning is updated using the value in the s row and the s.new column
of the runs matrix R.
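The two key statements can be seen in isolation on a much smaller chain. The sketch below is an assumption-laden toy, not the 25-state model: the four states are 0, 1, and 2 outs plus the absorbing three-outs state, each batter either homers with probability 0.3 or makes an out, and the names P.toy, R.toy, and simulate.toy.inning are hypothetical.

```r
# Toy chain: states are 0, 1, 2 outs and the absorbing "3 outs" state.
# Each batter either homers (prob 0.3: same state, one run) or makes an out.
P.toy <- matrix(c(0.3, 0.7, 0.0, 0.0,
                  0.0, 0.3, 0.7, 0.0,
                  0.0, 0.0, 0.3, 0.7,
                  0.0, 0.0, 0.0, 1.0), 4, 4, byrow = TRUE)
R.toy <- diag(c(1, 1, 1, 0))   # a run scores only when the state repeats

simulate.toy.inning <- function(P, R, start = 1){
  s <- start
  runs <- 0
  absorbing <- nrow(P)
  while (s < absorbing){
    s.new <- sample(1:nrow(P), 1, prob = P[s, ])  # draw the next state
    runs <- runs + R[s, s.new]                    # add runs for this transition
    s <- s.new
  }
  runs
}

set.seed(1)
RUNS.toy <- replicate(10000, simulate.toy.inning(P.toy, R.toy))
mean(RUNS.toy)   # theory for this toy chain: 3 * 0.3 / 0.7, about 1.29
```

In this toy chain the expected number of runs can be worked out by hand, which makes it a convenient check that the sampling loop and the runs bookkeeping behave as intended.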
Using the replicate function, one can simulate a large number of half-innings
of baseball. In the code below, we simulate 10,000 half-innings starting
with no runners and no outs (state 1), collecting the runs scored in the vector
RUNS.
RUNS <- replicate(10000, simulate.half.inning(T.matrix, R))
To find the possible runs scored in a half-inning, the table function is used
to tabulate the values in RUNS.
table(RUNS)
RUNS
0 1 2 3 4 5 6 7 8 9 10
7483 1334 659 312 133 44 22 8 1 3 1
The mean number of runs scored is computed by applying the mean function
on RUNS.
mean(RUNS)
[1] 0.4584
Recall that our simulation model is based only on batting plays. To under-
stand the effect of non-batting plays (stealing, caught stealing, wild pitches,
etc.) on run scoring, we compare this runs expectancy matrix with the one
found in Chapter 5 using all batting and non-batting plays. A new matrix
Runs is created with the earlier matrix and we compute the difference Runs
- Runs.Expectancy – this is the contribution of non-batting plays to the
average number of runs scored.
Runs <- matrix(
c(0.47, 0.25, 0.10, 1.45, 0.94, 0.32,
1.06, 0.65, 0.31, 1.93, 1.34, 0.54,
0.84, 0.50, 0.22, 1.75, 1.15, 0.49,
1.41, 0.87, 0.42, 2.17, 1.47, 0.76),
8, 3, byrow=TRUE)
Runs - Runs.Expectancy
0 outs 1 out 2 outs
000 0.02 -0.01 0.01
001 0.09 0.03 0.00
010 -0.02 0.02 0.01
011 0.03 0.01 0.04
100 0.02 0.01 0.01
101 0.02 0.02 0.02
110 -0.02 0.01 0.02
111 -0.06 -0.05 0.07
Note that most of the values of the difference are positive, indicating that
these non-batting plays generally do create runs. The three largest values are
0.09, 0.07, and 0.04 corresponding to the “001, 0 outs”, “111, 2 outs”, and
“011, 2 outs” situations. These positive values make sense since these are all
situations with a runner on third who can score with a wild pitch or passed
ball.
The first row of P.matrix.3 gives the probabilities of being in each of the 25
states after three hitters starting at the “000 0” state. We round these values
to three decimal places, sort from largest to smallest, and display the largest
values.
sorted.P <- sort(round(P.matrix.3["000 0", ], 3), decreasing=TRUE)
head(data.frame(Prob=sorted.P))
Prob
3 0.369
100 2 0.241
110 1 0.086
010 2 0.083
000 2 0.045
001 2 0.030
After three PAs, the most likely outcomes are three outs (probability 0.369),
runner on first with 2 outs (probability 0.241), and runners on first and second
with one out (probability 0.086).
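Multi-step probabilities of this kind come from multiplying the transition matrix by itself. As a hedged sketch, the code below does this for a hypothetical four-state toy chain (0, 1, 2 outs and three outs, with a 0.3 chance that the batter homers and the state repeats); P.toy and P.toy.3 are illustrative names and are not objects from the text.

```r
# Toy transition matrix with named states; the last state is absorbing.
states <- c("0 outs", "1 out", "2 outs", "3 outs")
P.toy <- matrix(c(0.3, 0.7, 0.0, 0.0,
                  0.0, 0.3, 0.7, 0.0,
                  0.0, 0.0, 0.3, 0.7,
                  0.0, 0.0, 0.0, 1.0), 4, 4, byrow = TRUE,
                dimnames = list(states, states))

# Three matrix multiplications give the three-step transition probabilities.
P.toy.3 <- P.toy %*% P.toy %*% P.toy

# Distribution of the state after three batters, starting with no outs.
sort(round(P.toy.3["0 outs", ], 3), decreasing = TRUE)
```

In this toy chain the entry for three outs after three batters is simply the chance of three straight outs, 0.7 cubed, which gives a quick sanity check on the matrix product.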
It is also easy to learn about the number of visits to all runner-outs states.
Define the matrix Q to be the 24-by-24 submatrix found from the transition
matrix by removing the last row and column (the three outs state). By sub-
tracting the matrix Q from the identity matrix and taking the inverse of the
result, we obtain the fundamental matrix N of an absorbing Markov chain.
(The diag function is used to construct the identity matrix and the function
solve takes the matrix inverse.)
Q <- P.matrix[-25, -25]
N <- solve(diag(rep(1, 24)) - Q)
N
000 0 1.04
000 1 0.74
000 2 0.58
001 0 0.01
001 1 0.03
001 2 0.05
Starting at the beginning of the inning (the “000 0” state), the average number
of times the inning will be in the “000 0” state is 1.04, the average number
of times in the “000 1” state is 0.74, the average number of times in the “000
2” state is 0.58, and so on. By using the sum function, we find the average
number of states that are visited.
sum(N.0000)
[1] 4.28
This tells us the length of the remainder of the inning, on average, starting
with each possible state. For example, starting at the bases empty, one out
state, we expect on average to have 2.88 more batters. In contrast, with a
runner on third with two outs, we expect to have 1.51 more batters.
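The same calculation can be traced on a small example. The sketch below uses a hypothetical four-state toy chain (0, 1, 2 outs and the absorbing three-outs state, with probability 0.3 that the state repeats); the names P.toy, Q.toy, N.toy, and Length.toy are assumptions made for this illustration.

```r
# Toy absorbing chain; the fourth state ("3 outs") is absorbing.
states <- c("0 outs", "1 out", "2 outs", "3 outs")
P.toy <- matrix(c(0.3, 0.7, 0.0, 0.0,
                  0.0, 0.3, 0.7, 0.0,
                  0.0, 0.0, 0.3, 0.7,
                  0.0, 0.0, 0.0, 1.0), 4, 4, byrow = TRUE,
                dimnames = list(states, states))

# Remove the absorbing row and column, then invert (I - Q).
Q.toy <- P.toy[-4, -4]
N.toy <- solve(diag(3) - Q.toy)
round(N.toy, 2)          # expected visits to each state, by starting state

# Row sums give the expected number of batters remaining from each state.
Length.toy <- N.toy %*% rep(1, 3)
round(Length.toy, 2)
```

Each out takes on average 1 / 0.7 batters in this toy chain, so the expected lengths are 3 / 0.7, 2 / 0.7, and 1 / 0.7 batters from the three starting states.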
that gives the batting team in each half-inning. By use of the substr function,
we define the home team variable HOME_TEAM_ID, and an ifelse function is
used to define the batting team.
data2011C$HOME_TEAM_ID <- with(data2011C, substr(GAME_ID, 1, 3))
data2011C$BATTING.TEAM <- with(data2011C,
ifelse(BAT_HOME_ID == 0,
as.character(AWAY_TEAM_ID),
as.character(HOME_TEAM_ID)))
For example, the matrix Team.T["ANA", , ] gives the transition counts for
Anaheim in the 2011 season.
If one is interested in comparing run productions for different batting
teams, it is necessary to make some adjustments to the team transition prob-
ability matrices to get realistic predictions of performance. To illustrate the
problem, we focus on transitions from the “100 2” state. The transition counts
are stored in the variable Team.T.S, and the rows of this table for six of
the teams are displayed.
d.state <- subset(data2011C, STATE == "100 2")
Team.T.S <- with(d.state, table(BATTING.TEAM, NEW.STATE))
Team.T.S
NEW.STATE
BATTING.TEAM 000 2 001 2 010 2 011 2 100 2 101 2 110 2 3
ANA 11 3 7 8 0 16 56 253
ARI 11 4 13 2 0 15 73 240
ATL 7 2 4 7 0 23 68 273
...
TEX 12 5 16 6 1 20 67 268
TOR 7 3 9 10 1 18 51 269
WAS 9 1 5 10 0 25 61 243
For some of the less common transitions, there is much variability in the
counts across teams and this causes the corresponding team transition
probabilities to be unreliable. If $p^{TEAM}$ represents the transition
probabilities for a particular team, and $p^{ALL}$ are the average transition
probabilities, then a better estimate of the team’s probabilities has the form
$$p^{EST} = \frac{n}{n+K}\, p^{TEAM} + \frac{K}{n+K}\, p^{ALL},$$
where n is the number of transitions for the team and K is a smoothing count.
The description of the methodology is beyond the level of this book, but in this
Note that the improved transition proportions are a compromise between the
team’s proportions and the overall values. For example, for a transition from
the state “100 2” to “010 2”, the Washington value is 0.0141, the overall value
is 0.0242, and the improved value 0.0220 falls between the Washington and
overall values. This method is especially helpful for transitions such as “100
2” to “100 2” which did not occur for Washington in this season but we know
there is a positive chance of these transitions happening in the future.
This smoothing method can be applied for all teams and all rows of the
transition matrix to obtain improved estimates of the teams’ transition
probability matrices. With the team transition matrices computed in this way, one
can explore the run-scoring behavior of individual batting teams.
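The Washington example can be reproduced approximately in a few lines. In the sketch below, the counts are Washington's row of the transition table shown earlier and 0.0242 is the overall value quoted in the text. The smoothing count K = 1274 is an assumption made for this illustration, picked because it gives a result close to the quoted 0.0220; it is not a value stated in this section.

```r
# Washington's 2011 transition counts from the "100 2" state (table above).
was <- c("000 2" = 9, "001 2" = 1, "010 2" = 5, "011 2" = 10,
         "100 2" = 0, "101 2" = 25, "110 2" = 61, "3" = 243)

n <- sum(was)                        # number of transitions for the team
p.team <- unname(was["010 2"]) / n   # team proportion for "100 2" -> "010 2"
p.all <- 0.0242                      # overall proportion quoted in the text
K <- 1274                            # assumed smoothing count (illustrative)

# Weighted compromise between the team value and the overall value.
p.est <- n / (n + K) * p.team + K / (n + K) * p.all
round(c(team = p.team, overall = p.all, smoothed = p.est), 4)
```

Because the weight on the team proportion is n / (n + K), a team with few observed transitions is pulled more strongly toward the overall value.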
$$P(A \text{ wins}) = \frac{\exp(T_A)}{\exp(T_A) + \exp(T_B)}.$$
This model is closely related to the log5 method developed by Bill James
in his Baseball Abstract books in the 1980s. If $P_A$ and $P_B$ are the winning
percentages of teams A and B, then James’ formula is given by
$$P(A \text{ wins}) = \frac{P_A/(1 - P_A)}{P_A/(1 - P_A) + P_B/(1 - P_B)}.$$
Comparing the two formulas, one sees that the log5 method is a special case
of the Bradley-Terry model where a team’s talent T is set equal to the log
odds of winning, log(P/(1 − P)). A team with a talent T = 0 will win (in the
long run) half of its games (P = 0.5). In contrast, a team with talent T = 0.2
will win (using the log5 values) approximately 55% of its games and a team
with talent T = −0.2 will win approximately 45% of its games.
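The link between the two formulas is easy to verify numerically. The sketch below is a small check under stated assumptions: bt.prob and log5.prob are hypothetical helper names, and the winning percentages 0.60 and 0.45 are arbitrary illustrative values.

```r
# Bradley-Terry probability that team A beats team B, given talents.
bt.prob <- function(TA, TB) exp(TA) / (exp(TA) + exp(TB))

# log5 probability that team A beats team B, given winning percentages.
log5.prob <- function(PA, PB){
  odds.A <- PA / (1 - PA)
  odds.B <- PB / (1 - PB)
  odds.A / (odds.A + odds.B)
}

# Against an average team (talent 0), talents of 0.2 and -0.2.
bt.prob(0.2, 0)
bt.prob(-0.2, 0)

# Setting each talent to the log odds makes the two formulas agree.
PA <- 0.60
PB <- 0.45
TA <- log(PA / (1 - PA))
TB <- log(PB / (1 - PB))
all.equal(bt.prob(TA, TB), log5.prob(PA, PB))
```

The first two calls illustrate the 55% and 45% figures quoted above, and the final comparison shows the special-case relationship directly.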
Using this model, one can simulate a baseball season as follows.
1. Construct the 1968 baseball schedule. In this season, each of the 10 teams
in each league plays every other team in the same league 18 times, with 9
games played in each team’s ballpark. (There was no interleague play
in 1968.)
2. Simulate 20 talents from a normal distribution with mean 0 and standard
deviation $s_T$. The value of $s_T$ is chosen so that the simulated season
winning percentages from this model resemble the actual winning percentages
during this season.
3. Using the probability formula and the talent values, one computes the
probability that the home team wins each game. By a series of coin flips
with these probabilities, one determines the winners of all games.
4. Determine the winner of each league (ties need to be broken by some
random mechanism) and play a best-of-seven World Series using winning
probabilities computed using the Bradley-Terry formula and the two talent
numbers.
This function is used to construct the schedule for the 1968 season. Two
vectors NL and AL are constructed containing abbreviations for the National
League and American League teams. We apply the function make.schedule
twice, once for each league, using k = 9 since each team hosts each other team
for nine games. The rbind function is used to paste together the NL and AL schedules,
creating the data frame schedule.
NL <- c("ATL", "CHN", "CIN", "HOU", "LAN", "NYN", "PHI",
"PIT", "SFN", "SLN")
AL <- c("BAL", "BOS", "CAL", "CHA", "CLE", "DET", "MIN",
"NYA", "OAK", "WS2")
teams <- c(NL, AL)
league <- c(rep(1, 10), rep(2, 10))
schedule <- rbind(make.schedule(NL, 9),
make.schedule(AL, 9))
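The definition of make.schedule does not appear in the text above. As a rough sketch of what such a helper could look like, the hypothetical function make.schedule.sketch below lists every ordered (Home, Visitor) pair of distinct teams and repeats each pairing k times. It is an assumption for illustration, not the authors' function.

```r
# Hypothetical stand-in for make.schedule: every ordered (Home, Visitor)
# pair of distinct teams, repeated k times
make.schedule.sketch <- function(teams, k) {
  pairs <- expand.grid(Home = teams, Visitor = teams,
                       stringsAsFactors = FALSE)
  pairs <- pairs[pairs$Home != pairs$Visitor, ]
  # repeat the full set of pairings k times
  sched <- pairs[rep(seq_len(nrow(pairs)), times = k), ]
  rownames(sched) <- NULL
  sched
}

NL <- c("ATL", "CHN", "CIN", "HOU", "LAN", "NYN", "PHI",
        "PIT", "SFN", "SLN")
nl.schedule <- make.schedule.sketch(NL, 9)
nrow(nl.schedule)                                 # games in one league
table(c(nl.schedule$Home, nl.schedule$Visitor))   # games per team
```

With 10 teams and k = 9, each team hosts 9 opponents for 9 games each and visits them equally often, which gives the 162-game season described in step 1.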
The first six rows of the data frame SCH are displayed, where one sees the
games scheduled, the talents of the home and away teams, and the probability
that the home team wins the matchup.
head(SCH)
  Visitor Home League.x Talent.Home League.y Talent.Visitor prob.Home
1     ATL  PHI        1 -0.02757542        1    -0.06858703 0.5102515
2     ATL  PIT        1  0.14765145        1    -0.06858703 0.5538500
3     ATL  SFN        1 -0.07264467        1    -0.06858703 0.4989856
4     ATL  NYN        1 -0.20925903        1    -0.06858703 0.4648899
5     ATL  HOU        1 -0.19522186        1    -0.06858703 0.4683835
6     ATL  SLN        1 -0.09780258        1    -0.06858703 0.4926966
The teams, home win probabilities, and outcomes of the first six games are
displayed.
head(SCH[, c("Visitor", "Home", "prob.Home", "outcome","winner")])
Visitor Home prob.Home outcome winner
1 ATL PHI 0.5102515 0 ATL
2 ATL PIT 0.5538500 1 PIT
3 ATL SFN 0.4989856 1 SFN
4 ATL NYN 0.4648899 1 NYN
5 ATL HOU 0.4683835 1 HOU
6 ATL SLN 0.4926966 1 SLN
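The code that produced prob.Home, outcome, and winner is not shown above. A minimal sketch of how these columns could be computed, using a toy three-game schedule built from the talent values displayed earlier, is given below. The data frame name games and the seed are assumptions for illustration.

```r
set.seed(1968)  # arbitrary seed so the sketch is reproducible

# Toy schedule using the first three rows of talents shown in the text
games <- data.frame(
  Visitor = c("ATL", "ATL", "ATL"),
  Home = c("PHI", "PIT", "SFN"),
  Talent.Home = c(-0.02757542, 0.14765145, -0.07264467),
  Talent.Visitor = c(-0.06858703, -0.06858703, -0.06858703),
  stringsAsFactors = FALSE
)

# Bradley-Terry probability that the home team wins
games$prob.Home <- with(games,
  exp(Talent.Home) / (exp(Talent.Home) + exp(Talent.Visitor)))

# One Bernoulli draw per game: 1 = home team wins, 0 = visitor wins
games$outcome <- rbinom(nrow(games), 1, games$prob.Home)
games$winner <- with(games, ifelse(outcome == 1, Home, Visitor))
games
```

The computed prob.Home values can be compared with the first three rows of the head(SCH) display; the simulated outcomes depend on the random draws.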
How did the teams perform during this particular simulated season? Using
the table function, we find the number of wins for all teams. This information
is collected together with the team names in the data frame WIN, and using
the merge function the season results are combined with the team talents to
create the data frame RESULTS.
wins <- table(SCH$winner)
WIN <- data.frame(Team=names(wins), Wins=as.numeric(wins))
RESULTS <- merge(TAL, WIN)
16 PHI 1 -0.057648893 79 0 0
17 PIT 1 0.678827397 111 1 1
18 SFN 1 -0.191052425 85 0 0
19 SLN 1 -0.230921646 77 0 0
20 WS2 2 0.451958661 103 1 0
This function is applied twice (once for each league) and the cbind function
combines the two standings into a single data frame. The league champions
and the World Series winner are also displayed.
cbind(display.standings(RESULTS, 1), display.standings(RESULTS, 2))
Team Wins Losses Team Wins Losses
17 PIT 111 51 WS2 103 59
1 ATL 93 69 BOS 93 69
18 SFN 85 77 OAK 91 71
7 CIN 84 78 NYA 86 76
11 LAN 79 83 MIN 81 81
16 PHI 79 83 CLE 79 83
6 CHN 78 84 DET 77 85
19 SLN 77 85 CHA 72 90
10 HOU 63 99 BAL 70 92
14 NYN 61 101 CAL 58 104
with(RESULTS, as.character(Team[Winner.Lg == 1]))
[1] "PIT" "WS2"
with(RESULTS, as.character(Team[Winner.WS == 1]))
[1] "PIT"
Many.Results data frame to NULL and use a for loop to repeat the simulation
for 1000 seasons, storing the output in Many.Results.
Many.Results <- NULL
for(j in 1:1000)
Many.Results <- rbind(Many.Results, one.simulation.68(0.20))
The data frame Many.Results contains the talent number and number of
wins for 1000 × 20 = 20,000 teams. The smoothScatter function is used to
construct a smoothed scatterplot of Talent and Wins and Figure 9.1 shows
the result. (Here the plot function would have resulted in an overly cluttered
scatterplot.)
with(Many.Results, smoothScatter(Talent, Wins))
FIGURE 9.1
Smoothed scatterplot of talent and number of season wins for teams in 1000
simulated seasons. (Horizontal axis: Talent; vertical axis: Wins.)
As expected, there is a positive trend in the graph, indicating that better teams
tend to win more games. But there is much vertical spread in the scatterplot,
which indicates that the relationship between talent and wins is not strong.
To reinforce the last point, suppose we focus on “average” teams that have
a talent number between −0.05 and 0.05. Using the subset function, a new
data frame Results.avg is created containing the talent and wins data for
these average teams. A histogram is constructed of the season wins for these
teams. (See Figure 9.2.)
Results.avg <- subset(Many.Results, Talent > -0.05 & Talent < 0.05)
hist(Results.avg$Wins)
FIGURE 9.2
Histogram of the number of season wins for “average” teams in the 1000
simulated seasons. (Horizontal axis: Results.avg$Wins; vertical axis: Frequency.)
One expects these average teams to win about 80 games. But what is surprising
is the variability in the win totals – average teams can regularly have win totals
between 70 and 90, and it is possible (but not likely) to have a win total close
to 100.
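A back-of-the-envelope calculation helps explain this spread. The sketch below assumes, as a simplification, that an average team wins each of its 162 games independently with probability 0.5 (in the simulation the probability varies with the opponent), so that the win total is binomial.

```r
n.games <- 162
p <- 0.5                              # assumed constant win probability

n.games * p                           # expected number of wins
sqrt(n.games * p * (1 - p))           # standard deviation of the win total

# probability of a win total between 70 and 90 inclusive
pbinom(90, n.games, p) - pbinom(69, n.games, p)

# probability of 95 or more wins
1 - pbinom(94, n.games, p)
```

Under this assumption the standard deviation is a little over six wins, so totals ten wins away from the mean are not unusual for a team of exactly average talent.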
What is the relationship between a team’s talent and its post-season suc-
cess? Consider first the relationship between a team’s talent (variable Talent)
and winning the league (the variable Winner.Lg). Since Winner.Lg is a bi-
nary (0 or 1) variable, a common approach for representing this relationship
is a logistic model – this is a generalization of the usual regression model
where the response variable is binary instead of continuous. The glm func-
tion with the family=binomial argument is used to fit a logistic model – the
$$p = \frac{\exp(a + bT)}{1 + \exp(a + bT)},$$
where T is a team’s talent, (a, b) are the regression coefficients, and p is the
probability of the event. In the following code, the regression coefficients of
the “win pennant” logistic fit are stored in the variable b1. By use of the
curve function, the fitted probability of winning the pennant is graphed as
a function of the talent. A second application of curve is used to overlay
the fitted probability of winning the World Series. The completed graph is
displayed in Figure 9.3.
b1 <- coef(fit1)
curve(exp(b1[1] + b1[2] * x) / (1 + exp(b1[1] + b1[2] * x)),
-0.4, 0.4, xlab="Talent", ylab="Probability", lwd=2,
ylim=c(0, 1))
b2 <- coef(fit2)
curve(exp(b2[1] + b2[2] * x) / (1 + exp(b2[1] + b2[2] * x)),
add=TRUE, lwd=2, lty=2)
legend(-0.2, 0.8, legend=c("Win Pennant", "Win World Series"),
lwd=2, lty=c(1, 2))
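The objects fit1 and fit2 are used above without their definitions appearing. One plausible way to obtain such a fit is with glm and family=binomial, sketched here on synthetic data rather than on the simulation output. The data frame sim, the seed, and the coefficients used to generate the data are all assumptions for illustration.

```r
set.seed(9)  # arbitrary seed

# Synthetic stand-in for Many.Results: talents and a 0/1 pennant indicator
# generated from a known logistic relationship (values are illustrative)
n <- 5000
sim <- data.frame(Talent = rnorm(n, 0, 0.20))
sim$Winner.Lg <- rbinom(n, 1, plogis(-2.5 + 6 * sim$Talent))

# Logistic regression of the binary outcome on talent
fit <- glm(Winner.Lg ~ Talent, data = sim, family = binomial)
b <- coef(fit)

# Fitted probability at a few talent values; plogis(z) = exp(z) / (1 + exp(z))
T.grid <- c(-0.4, 0, 0.4)
p.manual <- plogis(b[1] + b[2] * T.grid)
p.predict <- predict(fit, newdata = data.frame(Talent = T.grid),
                     type = "response")
cbind(Talent = T.grid, p.manual, p.predict)
```

The two probability columns agree, which shows that the expression passed to curve above is the same quantity that predict returns with type="response".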
As expected, the chance of a team winning the pennant (solid line) increases
as a function of the talent. An average team with T = 0 has only a small
chance of winning the pennant; an excellent team with a talent close to 0.4
has about a 60% chance of winning the pennant. The probabilities of winning
the World Series (represented by a dashed line) are substantially smaller than
the chances of winning the pennant. For example, this excellent (T = 0.4)
team has only about a 35% chance of winning the World Series. In fact, it can
be demonstrated that the team winning the World Series is likely not to be
the team with the best talent (largest value of T ).
FIGURE 9.3
Probability of winning the league and the World Series for teams of different
talents. (Horizontal axis: Talent; vertical axis: Probability; solid line: Win
Pennant; dashed line: Win World Series.)
9.5 Exercises
1. (A Simple Markov Chain)
Suppose one is interested only in the number of outs in an inning. There
are four possible states in an inning (0 outs, 1 out, 2 outs, and 3 outs)
and you move between these states in each plate appearance. Suppose at
each PA, the chance of not increasing the number of outs is 0.3, and the
probability of increasing the outs by one is 0.7. The following R code puts
the transition probabilities of this Markov chain in a matrix P.
P <- matrix(c(.3, .7, 0, 0,
0, .3, .7, 0,
0, 0, .3, .7,
0, 0, 0, 1), 4, 4, byrow=TRUE)
(a) Multiply the matrix P by itself to obtain the matrix P2:
P2 <- P %*% P
The first row of P2 gives the probabilities of moving from 0 outs
to each of the four states after two plate appearances. Compute P2.
Based on this computation, find the probability of moving from 0
outs to 1 out after two plate appearances.
(b) The fundamental matrix N is computed as
N <- solve(diag(c(1, 1, 1)) - P[-4, -4])
The first row gives the average number of PAs at 0 out, 1 out, and 2
outs in an inning. Compute N and find the average number of PAs in
one inning in this model.
(a) Write a function to simulate a World Series. The input is the
probability p that the AL team will defeat the NL team in a single game.
(b) Suppose an AL team with talent 0.40 plays a NL team with talent
0.25. Using the Bradley-Terry model, determine the probability p
that the AL wins a game.
(c) Using the value of p determined in part (b), simulate 1000 World
Series and find the probability the AL team wins the World Series.
(d) Repeat parts (b) and (c) for AL and NL teams who have the same
talents.
10
Exploring Streaky Performances
CONTENTS
10.1 Introduction
10.2 The Great Streak
10.2.1 Finding game hitting streaks
10.2.2 Moving batting averages
10.3 Streaks in Individual At-Bats
10.3.1 Streaks of hits and outs
10.3.2 Moving batting averages
10.3.3 Finding hitting slumps for all players
10.3.4 Were Suzuki and Ibanez unusually streaky?
10.4 Local Patterns of Weighted On-Base Average
10.5 Further Reading
10.6 Exercises
10.1 Introduction
Some of the most interesting phenomena in baseball are streaky or hot/cold
performances by hitters and pitchers. During particular periods in the season,
a particular player will hit for a high batting average, and in other periods,
the player will be in a “cold streak” and all batted balls appear to be fielded
for outs. In this chapter, we’ll use R to explore streaky hitting performances.
One of the great hitting accomplishments in baseball history is Joe DiMag-
gio’s 56-game hitting streak and Section 10.2 explores DiMaggio’s game-to-
game hitting for the 1941 season. An R function is used to find all of DiMag-
gio’s hitting streaks, and a moving average function is used to explore DiMag-
gio’s batting average over short time intervals. Retrosheet play-by-play data
records batters’ performances in all plate appearances and we use this data in
Section 10.3 to explore hitting streaks in individual at-bats. Suppose a hitter
is going through an “0 for 20” hitting slump – should we be surprised? One
way of answering this question is to find the longest hitting slumps for all
hitters in a particular baseball season. A second way to understand the size of
this hitting slump is to contrast it with the pattern of slumps under a
random model. A method for simulating a random pattern of hits and outs is
described and this method is used to assess if a particular player exhibits more
streakiness in his hitting sequence than what one would expect by chance.
This discussion of streakiness focuses on patterns of hits and outs, and
certainly the quality of an at-bat depends on more than just getting a hit.
Section 10.4 discusses patterns of streakiness using the players’ weighted
on-base average (wOBA), where positive outcomes of a plate appearance are
weighted by their run values. We look at players’ wOBA over groups of five
games during a season. A way to describe streaky hitting behavior is to look
at the variability of the five-game wOBA values. Using this measure of streak-
iness, we identify the streaky hitters during the 2011 season.
For each game during the season, the data frame records AB, the number of
at-bats, and H, the number of hits. As a quick check that the data has been
entered correctly, we compute DiMaggio’s season batting average by summing
the game hit totals and dividing by the total at-bats.
sum(joe$H) / sum(joe$AB)
[1] 0.3567468
The result agrees with DiMaggio’s 1941 batting average of .357. (Actually,
although this was a high average, it was overshadowed by Ted Williams’ .406
average during the 1941 season.)
A hitting streak is commonly defined as the number of consecutive games in
which a player gets at least one base hit. Suppose we’re interested in computing
all of DiMaggio’s hitting streaks for the 1941 season. Towards this goal, using
the ifelse function, we create a new variable HIT for each game that is either
1 or 0 depending on whether DiMaggio recorded at least one hit in the game.
joe$HIT <- ifelse(joe$H >= 1, 1, 0)
Displaying the values of HIT shows DiMaggio’s streaky hitting
performance visually.
joe$HIT
[1] 1 1 1 1 1 1 1 1 0 0 0 1 1 1 0 1 1 0 0 1 0 0 0 1 1 1 0 0 1 1 1 1
[33] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[65] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1
[97] 1 1 1 1 1 0 0 1 1 1 1 0 0 1 1 0 0 0 1 1 1 1 0 0 0 1 1 1 1 1 1 1
[129] 0 1 0 1 1 1 1 1 0 1 1
We see that DiMaggio started the season with an eight-game hitting streak,
then had three games with no hits, a hitting streak of three games, and so on.
Suppose we wish to compute all hitting streaks for a particular player.
This is conveniently done using the following user-defined function streaks.
The input to this function is a vector y of 0s and 1s corresponding to game
results where the player was hitless (0) or received at least one hit (1). The
output will be a vector containing the lengths of all hitting streaks.
streaks <- function(y){
n <- length(y)
where <- c(0, y, 0) == 0
location.zeros <- (0 : (n+1))[where]
streak.lengths <- diff(location.zeros) - 1
streak.lengths[streak.lengths > 0]
}
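To see how streaks works, it can be applied to a short sequence. The sketch below repeats the function with comments added and applies it to the first 19 values of joe$HIT as displayed above; the vector name hit is an assumption for illustration.

```r
# The streaks function from the text, with comments added
streaks <- function(y){
  n <- length(y)
  # pad with a 0 on each side so streaks at either end are bounded
  where <- c(0, y, 0) == 0
  # positions (0, ..., n + 1) of the zeros in the padded sequence
  location.zeros <- (0 : (n+1))[where]
  # the number of 1s between consecutive zeros is the gap minus one
  streak.lengths <- diff(location.zeros) - 1
  streak.lengths[streak.lengths > 0]
}

# First 19 values of joe$HIT as displayed above
hit <- c(1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0)
streaks(hit)        # lengths of runs of games with a hit
streaks(1 - hit)    # lengths of runs of hitless games
```

The first call returns the streak lengths 8, 3, and 2, matching the description of the start of DiMaggio's season; the second call reverses the roles of 0 and 1 and returns the lengths of the hitless runs.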
FIGURE 10.1
Moving average plot of DiMaggio’s batting average for the 1941 season using
a window of 10 games. The horizontal line shows DiMaggio’s season batting
average. The games where DiMaggio had at least one base hit are displayed
on the horizontal axis. (Horizontal axis: Game; vertical axis: Average.)
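The function that computes the moving averages is not shown above. As a rough sketch of the idea, the hypothetical helper moving.avg.sketch below computes a trailing moving batting average with base R's stats::filter. It may position the window differently from the function used for Figure 10.1, and the game log is made up.

```r
# Hypothetical helper (not the text's moving.average): trailing moving
# batting average over the most recent `width` games
moving.avg.sketch <- function(H, AB, width) {
  w <- rep(1, width)
  # sides = 1 sums the current game and the (width - 1) games before it
  hits <- stats::filter(H, w, sides = 1)
  at.bats <- stats::filter(AB, w, sides = 1)
  data.frame(Game = seq_along(H),
             Average = as.numeric(hits / at.bats))
}

# Made-up game log: hits and at-bats in six games
H <- c(1, 2, 0, 1, 3, 0)
AB <- c(4, 4, 4, 4, 4, 3)
moving.avg.sketch(H, AB, width = 3)
```

Each average is total hits divided by total at-bats in the window, not the mean of the single-game averages, so games with more at-bats carry more weight. The first width − 1 values are NA because a full window is not yet available.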
The subset function is used to define a new data frame ichiro.AB; records
are chosen where the batting id is “suzui001” (Suzuki’s code id) and the at-bat
flag is TRUE. (In this exploration, only Suzuki’s official at-bats are considered.)
ichiro.AB <- subset(data2011, BAT_ID == "suzui001" & AB_FL == TRUE)
From the variable HIT, the lengths of all hitting streaks are identified,
where a streak refers to a sequence of consecutive base hits. Using the streaks
function defined in Section 10.2, the streak lengths are obtained for Suzuki in
the 2011 season.
streaks(ichiro.AB$HIT)
[1] 1 1 1 2 1 1 1 1 1 1 1 2 1 1 1 1 2 2 1 1 1 4 1 1 1 1 2 3 1 1
[31] 1 1 1 1 1 2 2 1 1 1 1 1 2 2 1 2 2 1 1 1 2 1 2 1 1 2 1 3 1 3
[61] 1 1 1 1 1 1 1 1 1 2 1 1 2 1 2 1 1 1 3 1 1 1 2 1 4 1 1 2 3 1
[91] 1 1 1 1 1 1 2 1 2 1 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1
[121] 1 1 4 1 1 1 1 2 2 1 1 3 1 1 1 2
As expected, most of the hitting streak lengths are 1, although several times
Suzuki had four consecutive hits.
It may be more interesting to explore the lengths of the gaps between hits.
By the operation 1 - HIT, the roles of 0 and 1 are reversed in the sequence,
and the function streaks is applied to find the lengths of all of the gaps
between hits that are 1 or larger.
streaks(1 - ichiro.AB$HIT)
[1] 1 2 2 4 8 1 3 6 2 3 6 1 2 8 4 1 1 1 5 2
[21] 1 1 4 4 1 1 4 1 5 6 2 3 1 2 4 4 5 4 3 2
[41] 10 2 2 12 11 5 16 2 2 15 2 1 1 2 1 1 2 2 4 2
[61] 3 3 9 1 3 2 1 3 10 2 1 1 10 3 15 6 4 3 1 2
[81] 3 3 3 5 4 11 5 1 2 4 3 5 1 3 4 5 10 3 3 3
[101] 1 1 3 2 1 2 3 2 2 2 1 1 1 4 6 1 5 3 1 8
[121] 4 1 1 7 1 7 6 5 1 2 6 5 6 2 3 6
 1  2  3  4  5  6  7  8  9 10 11 12 15 16
35 28 22 15 11  9  2  3  1  4  2  1  2  1
It is seen that Suzuki had a streak of 12 outs once, a streak of 15 outs twice,
and a streak of 16 outs once.
FIGURE 10.2
Moving average plot of Ichiro Suzuki’s batting average for the 2011 season
using a window of 30 at-bats. The horizontal line shows Suzuki’s season batting
average. The at-bats where Suzuki had a base hit are shown on the
horizontal axis. (Horizontal axis: AB; vertical axis: Average.)
a hitting slump of length 16? Let’s compare Suzuki’s long slump with the
longest slumps for all regular players during the 2011 season.
First a new function longest.ofer is written that computes the length of
the longest hitting slump for a given batter. (An “ofer” is a slang word for a
hitless streak in baseball.) The inputs to this function are the batter id code
batter and the batting data frame data. The output of the function is the
length of the longest slump.
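longest.ofer <- function(batter, data){
  # at-bats for this batter only
  d.AB <- subset(data, BAT_ID == batter & AB_FL == TRUE)
  # 1 if the at-bat ended in a hit, 0 otherwise
  d.AB$HIT <- ifelse(d.AB$H_FL > 0, 1, 0)
  # characters 4-12 of GAME_ID hold the date and game number
  d.AB$DATE <- substr(d.AB$GAME_ID, 4, 12)
  # put the at-bats in chronological order by game
  d.AB <- d.AB[order(d.AB$DATE), ]
  # load the streaks function (assumes streaks.R is in the working directory)
  source("streaks.R")
  # longest run of consecutive outs
  max(streaks(1 - d.AB$HIT))
}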
After reading this function into R, we confirm that it works by finding the
longest hitting slump for Suzuki.
longest.ofer("suzui001", data2011)
[1] 16
Suppose we want to compute the length of the longest hitting slump for all
players in this season with at least 400 at-bats. Using the aggregate function,
the number of at-bats is computed for all players, and players.400 is defined
to be the vector of the id codes of all players with 400 or more at-bats. By use
of the sapply function together with the longest.ofer function, the length
of the longest slump is computed for all regular hitters. Using the data.frame
function, the vector S is converted to a data frame, adding the variable Player.
(The rownames(S) <- NULL command is used to remove the player ids from
the row names.)
A <- aggregate(data2011$AB_FL, list(Player=data2011$BAT_ID), sum)
players.400 <- A$Player[A$x >= 400]
S <- sapply(players.400, longest.ofer, data2011)
S <- data.frame(Player=names(S), Streak=S)
rownames(S) <- NULL
To decipher the player ids, it is helpful to merge the data frame of the
longest hitting slumps S with the player roster information contained in the
Retrosheet file “roster2011.csv.” This roster file is read into R, saving it in
the data frame roster2011. Then the merge function is applied, merging
data frames S and roster2011, matching on the variables Player (in S) and
Player.ID (in roster2011).
roster2011 <- read.csv("roster2011.csv")
S1 <- merge(S, roster2011, by.x="Player", by.y="Player.ID")
The slump lengths are ordered in decreasing order using the function order
with the decreasing=TRUE argument and the rows of the data frame S1 are
reordered using this ordering. The top six slump lengths are displayed by the
head function.
S.ordered <- S1[order(S1$Streak, decreasing=TRUE), ]
head(S.ordered)
      Player Streak    X Last.Name First.Name Bats Pitches Team V7
80  ibanr001     35  941    Ibanez       Raul    L       R  PHI OF
6   aybae001     30    3     Aybar      Erick    B       R  ANA SS
113 mcgec001     27  726   McGehee      Casey    R       R  MIL 3B
122 olivm001     27 1095     Olivo     Miguel    R       R  SEA  C
152 ruizc001     26  958      Ruiz     Carlos    R       R  PHI  C
48  ellim001     25  896     Ellis       Mark    R       R  OAK SS
The six longest hitting slumps during the 2011 season were by Raul Ibanez
(35), Erick Aybar (30), Casey McGehee (27), Miguel Olivo (27), Carlos Ruiz
(26), and Mark Ellis (25). Relative to these long hitting slumps, Suzuki’s
hitting slump of 16 at-bats looks short.
This method is first illustrated for Ichiro Suzuki’s 2011 hitting data. As
before, Suzuki’s hitting data is read into the data frame ichiro.AB – the
variable HIT is a sequence of 0s and 1s, where 0 corresponds to an out and 1
corresponds to a hit.
ichiro.AB <- subset(data2011, BAT_ID == "suzui001" & AB_FL == TRUE)
ichiro.AB$HIT <- ifelse(ichiro.AB$H_FL > 0, 1, 0)
ichiro.AB$DATE <- substr(ichiro.AB$GAME_ID, 4, 12)
ichiro.AB <- ichiro.AB[order(ichiro.AB$DATE), ]
Since the value of 3047 is in the middle of the histogram distribution, the
streakiness pattern in Suzuki’s hitting is consistent with a random model.
There is insufficient evidence that Suzuki is truly streaky.
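The comparison with a random model can be sketched as a permutation test. In the sketch below, clump.stat and random.mix.sketch are hypothetical helper names, the hit sequence is made up rather than Suzuki's data, and the clumpiness statistic is taken to be the sum of squared lengths of the runs of outs, as in the clump.test function later in this section.

```r
set.seed(2011)  # arbitrary seed

# streaks function as defined in Section 10.2
streaks <- function(y){
  n <- length(y)
  where <- c(0, y, 0) == 0
  location.zeros <- (0 : (n+1))[where]
  streak.lengths <- diff(location.zeros) - 1
  streak.lengths[streak.lengths > 0]
}

# Clumpiness statistic: sum of squared lengths of the runs of outs
clump.stat <- function(hit) sum(streaks(1 - hit) ^ 2)

# Statistic for one random rearrangement of the same hits and outs
random.mix.sketch <- function(hit) clump.stat(sample(hit))

# Made-up sequence of 200 at-bats containing 54 hits (not real data)
hit <- sample(rep(c(1, 0), c(54, 146)))

observed <- clump.stat(hit)
ST <- replicate(1000, random.mix.sketch(hit))

# Share of random arrangements at least as clumpy as the observed one
mean(ST >= observed)
```

A share near 0.5 means the observed value sits in the middle of the simulated distribution, while a share near 0 would place it in the right tail and suggest more streakiness than chance alone produces.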
FIGURE 10.3
Histogram of one thousand values of the clumpiness statistic assuming all
arrangements of hits and outs for Suzuki are equally likely. The observed
value of the clumpiness statistic for Suzuki is shown using a vertical line.
(Horizontal axis: ST; the vertical line is labeled Suzuki.)
This method can be used to check if the streaky patterns of any hitter
are non-random. A new function clump.test is constructed using the R code
previously discussed. The inputs are the player id code playerid and the season
batting data frame data. One thousand values of the clumpiness measure are
computed by 1000 replications of the simulation procedure. A histogram of
the clumpiness measures is constructed and the observed clumpiness statistic
is shown as a vertical line.
library(MASS)  # provides truehist()
clump.test <- function(playerid, data){
  # at-bats for this player, in chronological order
  player.AB <- subset(data, BAT_ID == playerid & AB_FL == TRUE)
  player.AB$HIT <- ifelse(player.AB$H_FL > 0, 1, 0)
  player.AB$DATE <- substr(player.AB$GAME_ID, 4, 12)
  player.AB <- player.AB[order(player.AB$DATE), ]
  # clumpiness statistic for 1000 random arrangements of hits and outs
  ST <- replicate(1000, random.mix(player.AB$HIT))
  truehist(ST, xlab="Clumpiness Statistic")
  # observed statistic: sum of squared lengths of the runs of outs
  stat <- sum((streaks(1 - player.AB$HIT)) ^ 2)
  abline(v=stat, lwd=3)
  text(stat * 1.05, 0.0010, "OBSERVED", cex=1.2)
}
Note that Ibanez’s clumpiness measure is in the right tail of this distribution,
indicating that Ibanez clearly displayed more streakiness than one would ex-
pect by chance.
FIGURE 10.4
Histogram of one thousand values of the clumpiness statistic assuming all
arrangements of hits and outs for Raul Ibanez are equally likely. The observed
value of the clumpiness statistic for Ibanez is shown using a vertical line.
(The vertical line is labeled OBSERVED.)
if(n.r > 0) {
i <- (n.g * g + 1) : n
gp[i] <- rep(n.g, length(i))
}
aggregate(d, list(gp), sum)[, -1]
}
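Only the tail of the grouping function survives above. A minimal sketch of the same idea follows, under the assumption (suggested by the gp[i] <- rep(n.g, length(i)) line) that leftover games are folded into the last complete group. The name group.sums.sketch and the game data are hypothetical.

```r
# Hypothetical helper: sum each column of d over consecutive blocks of g rows
group.sums.sketch <- function(d, g = 5) {
  d <- as.data.frame(d)
  n <- nrow(d)
  n.g <- floor(n / g)                 # number of complete groups
  gp <- rep(seq_len(n.g), each = g)   # group label for each game
  if (n > n.g * g) {
    # leftover games join the last complete group
    gp <- c(gp, rep(n.g, n - n.g * g))
  }
  aggregate(d, list(Group = gp), sum)[, -1]
}

# Made-up data: V1 = wOBA numerator per game, V2 = plate appearances per game
games <- data.frame(V1 = 1:12, V2 = rep(4, 12))
grouped <- group.sums.sketch(games, g = 5)
grouped
grouped$V1 / grouped$V2               # group-level ratios
```

With 12 games and g = 5, the first group covers games 1 to 5 and the second covers games 6 to 12, so the last group is larger than the others.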
The process of finding the five-game hitting data has been illustrated for
Suzuki. When we look at the sequence of five-game wOBAs for an arbitrary
player, the wOBAs for a consistent player will have small variation, and the
wOBA values for a streaky player will have high variability. A common
measure of variability is the standard deviation, which roughly measures
the typical size of the deviations from the mean.
A new function is written to compute the mean and standard deviation of
the group wOBAs for a given player. This function get.streak.data performs
this operation for a given player with id code playerid, the weight matrix
for all plays S, and a grouping of g games (by default g = 5). The output is
a vector with the sum of plate appearances N, the mean of the group wOBAs
Mean and the standard deviation of the group wOBAs SD.
get.streak.data <- function(playerid, S, g=5){
  S.player <- subset(S, Player == playerid)
  S.player$Date <- with(S.player, substr(Game, 4, 12))
  S.player <- S.player[order(S.player$Date), ]
  S.player.gp <- regroup(S.player$x, g)
  s.woba.avg <- with(S.player.gp, V1 / V2)
  c(N=sum(S.player.gp$V2),
    Mean=mean(s.woba.avg), SD=sd(s.woba.avg))
}
Suzuki had 721 plate appearances, the mean of his five-game wOBAs was
0.285 and the standard deviation of his five-game wOBAs was 0.092.
The function get.streak.data is applied to all players in the 2011 season.
The vector player.list is defined to be the vector of all unique player ids
and the sapply function is used to apply get.streak.data to all players in
player.list. We want to focus on “regular” players and the subset function
is used to collect the streakiness data only for players where the number of
plate appearances (variable N) is 500 or greater.
player.list <- unique(S$Player)
Results <- data.frame(Player=player.list,
                      t(sapply(player.list,
                               get.streak.data, S)))
Results.500 <- subset(Results, N >= 500)
The streakiest hitter during the 2011 season using this standard deviation
measure was Justin Upton. Likewise, the most consistent player, Ichiro Suzuki,
is identified as the one with the smallest standard deviation of the period
FIGURE 10.5
Scatterplot of mean and standard deviation of five-game wOBAs for all players
in the 2011 season with at least 500 PA. Two players are identified, Suzuki and Upton,
who had small and large standard deviations, respectively. (Vertical axis: SD.)
The function returns the period number Period and the corresponding
weighted on-base average wOBA.
Using this new function, a data frame d1 is created with Suzuki’s data and
a similar data frame d2 is created for Upton’s data, and the two data frames
are combined by use of the rbind function. The graphics functions ggplot,
geom_line, and facet_grid in the ggplot2 package are used to create the
line graphs. One nice feature of ggplot2 graphics is that it automatically
uses the same vertical scale for the two panels and shows the player names on
the right of the graph.
d1 <- get.streak.data2("suzui001", S)
d2 <- get.streak.data2("uptoj001", S)
d <- rbind(data.frame(Player="Suzuki", d1),
data.frame(Player="Upton", d2))
library(ggplot2)
ggplot(d, aes(Period, wOBA)) +
geom_line(size=1) + facet_grid(Player ~ . )
Note that, as expected, Suzuki and Upton have dramatically different patterns
of five-game wOBAs. Most of Suzuki’s five-game wOBAs fall between 0.200
and 0.400. In contrast, Upton had a change in wOBA of 0.000 to 0.740 in two
periods; he was a remarkably streaky hitter during the 2011 season.
FIGURE 10.6
Line plots of five-game wOBAs against period number for Ichiro Suzuki and
Justin Upton for the 2011 season. Suzuki had a very consistent pattern of
wOBAs, while Upton’s pattern of wOBAs is very volatile. (Panels: Suzuki, Upton;
horizontal axis: Period; vertical axis: wOBA.)
10.6 Exercises
1. (Ted Williams)
The data file “williams.1941.csv” contains Ted Williams’ game-to-game
hitting data for the 1941 season. This season was notable in that Williams
had a season batting average of .406 (the most recent season batting av-
erage exceeding .400). Read this dataset into R.
(a) Using the R function streaks, find the lengths of all of Williams’
hitting streaks during this season. Compare the lengths of these hitting
streaks with those of Joe DiMaggio during this same season.
(b) Use the function streaks to find the lengths of all hitless streaks of
Williams during the 1941 season. Compare these lengths with those
of DiMaggio during the 1941 season.
2. (Ted Williams, Continued)
(a) Use the R function moving.average to find the moving batting av-
erages of Williams for the 1941 season using a window of 5 games.
Graph these moving averages and describe any hot and cold patterns
in Williams’ hitting during this season.
(b) Compute and graph moving batting averages of Williams using sev-
eral alternative choices for the window of games.
3. (Streakiness of the 2008 Lance Berkman)
Lance Berkman had a remarkable hot period of hitting during the 2008
season.
(a) Download the Retrosheet play-by-play data for the 2008 season, and
extract the hitting data for Berkman.
(b) Using the function streaks, find the lengths of all hitting streaks of
Berkman. What was the length of his longest streak of consecutive
hits?
(c) Use the streaks function to find the lengths of all streaks of consec-
utive outs. What was Berkman’s longest “oh-for” during this season?
(d) Construct a moving batting average plot using a window of 20 at-
bats. Comment on the patterns in this graph – was there a period
when Berkman was unusually hot?
4. (Streakiness of the 2008 Lance Berkman, Continued)
(a) Use the method described in Section 10.3.4 to see if Berkman’s
streaky patterns of hits and outs are consistent with patterns from a
random model.
(b) The method of Section 10.3.4 used the sum of squares of the gaps as
a measure of streakiness. Suppose one uses the longest streak of con-
secutive outs as an alternative measure. Rerun the method with this
new measure and see if Berkman’s longest streak of outs is consistent
with the random model.
5. (Streakiness of All Players During the 2008 Season)
(a) Using the 2008 Retrosheet play-by-play data, extract the hitting data
for all players with at least 400 at-bats.
(b) For each player, find the length of the longest streak of consecutive
outs. Find the hitters with the longest streaks and the hitters with
shortest streaks. How does Berkman’s longest “oh-for” compare in
the group of longest streaks?
CONTENTS
11.1 Introduction
11.2 Installing MySQL and Creating a Database
11.3 Connecting R to MySQL
11.3.1 Connecting using package RMySQL
11.3.2 Connecting using Package RODBC
11.4 Filling a MySQL Game Log Database from R
11.4.1 From Retrosheet to R
11.4.2 From R to MySQL
11.5 Querying Data from R
11.5.1 Introduction
11.5.2 Coors Field and run scoring
11.6 Baseball Data as MySQL Dumps
11.6.1 Lahman’s database
11.6.2 Retrosheet database
11.6.3 PITCHf/x database
11.7 Calculating Basic Park Factors
11.7.1 Loading the data in R
11.7.2 Home run park factor
11.7.3 Assumptions of the proposed approach
11.7.4 Applying park factors
11.8 Further Reading
11.9 Exercises
11.1 Introduction
In this book, analyses were performed entirely from baseball datasets loaded
into R. That was possible because we were dealing with datasets with a rel-
atively small number of rows. However, when one wants to work on multiple
seasons of play-by-play (or pitch-by-pitch) data, it becomes more difficult to
manage all of the data inside R.1 While Retrosheet gamelogs consist of ap-
proximately 250,000 records, there are approximately 10 million Retrosheet
play-by-play events, and MLBAM provides data on roughly 800,000 pitches
per year for MLB games.
A solution to this “big data” problem is to store the data using a Database
Management System (DBMS) and, by connecting that system to R,
access only the data needed for the particular analysis. In this chapter some
guidance is provided on this task. Our choice for the DBMS is MySQL, likely
the most popular open-source DBMS. However, readers familiar with other
software can find similar solutions for their DBMS of choice.
MySQL and its R interface are then used to gain some understanding
of park effects in baseball. Unlike most of the other team sports, baseball
ballparks vary greatly in shape and dimensions. The left-field wall in Fenway
Park, home of the Boston Red Sox, is listed at 310 feet from home plate, while
the left-field fence in Wrigley Field (home of the Chicago Cubs) is 355 feet
away. The left-field wall in Boston, commonly known as The Green Monster,
is 37 feet high, while the left-field fence at Dodger Stadium in Los Angeles is
only four feet high. Such differences in ballpark shapes and dimensions and
the prevalent weather conditions have a profound effect on the game and the
associated player measures of performance.
We first show how to obtain and set up MySQL, and then illustrate con-
necting R to a MySQL database for the purpose of retrieving data in R and
appending data to MySQL tables. This interface is used to present evidence
of the effect of Coors Field (home of the Colorado Rockies) on run scoring.
The reader is directed to online resources providing baseball data (of the sea-
sonal to pitch-by-pitch type) ready for import into MySQL. The chapter is
concluded by providing the readers with a basic approach for calculating park
factors and using these factors to make suitable adjustments to players’ stats.
it can read.
2 www.apachefriends.org/en/xampp.html
the Windows distribution, but readers using other operating systems should
find a similar installation procedure.
If one has not modified the defaults during the installation, XAMPP can
be launched from the Programs option under the Start Menu. In the box
that appears (Figure 11.1) one clicks on the Start buttons beside Apache and
MySQL to start those services. By clicking on Admin... on the right side of
MySQL the default browser is opened and one is taken to the localhost/
phpmyadmin page (Figure 11.2).
FIGURE 11.1
XAMPP Control Panel Application.
Let’s now create a new database named RBaseball. First, click on the
Databases tab, then type RBaseball in the text box under Create new
database and click the Create button. After a few seconds one should see
the RBaseball database listed on the left frame.3 If one clicks on the newly
created database, a message will appear saying there are no tables in the
database.
3 If that is not the case, simply clicking the Reload navigation frame green arrow will
make it appear.
FIGURE 11.2
Browser window displaying the localhost/phpmyadmin/ page.
accessing the MySQL database (if they have been specified when the database
was created), while dbname indicates the default database to which R will be
connected (in our case RBaseball, created in Section 11.2). Note that the
connection is assigned to an R object (conn here), as it will be required as an
argument by several functions.
library(RMySQL)
conn <- dbConnect(MySQL(), user='username', password='password'
, dbname='RBaseball')
(a) Enter a Data Source Name (DSN): in our case we choose RBaseball.
(b) Optionally add a Description.
(c) Enter localhost in the Server text box and leave the default Port
value.
(d) Enter a User and a Password for the connection.
(e) From the Database popup menu, select the RBaseball database.
7. Click OK and the DSN is saved.6
Once the ODBC connector has been configured, the database can be ac-
cessed from R by use of the function odbcConnect. The argument dsn is the
name of the data source used when creating the ODBC connection (point
6a in the previous subsection), while uid and pwd need to be provided if a
user name and a password have been specified (6d) when setting the ODBC
connection.
library(RODBC)
conn <- odbcConnect(dsn="RBaseball", uid="user"
, pwd="password")
ured.
After the function has been read into R, one season of game logs (for
example, the year 2012) is loaded by typing the command:
gl2012 <- load.gamelog(2012)
Since game log files downloaded from Retrosheet do not have column headers,
the resulting gl2012 data frame has column names assigned by default
by R. They can be replaced with meaningful names stored in the
game_log_header.csv file as done elsewhere in this book.
glheaders <- read.csv("retrosheet/game_log_header.csv")
names(gl2012) <- names(glheaders)
library(RODBC)
conn <- odbcConnect("RBaseball")
sqlSave(channel=conn, dat=gl2012, tablename="gamelogs"
, append=TRUE, rownames=FALSE)
• The channel argument requires an open connection; here the one that
was previously set (conn) is specified.
• The dat argument requires the name of the R data frame to be appended
to the table in the MySQL database.
• The tablename argument requires a string indicating the name of the
table (in the database) where the data are to be appended.
• Setting append to TRUE indicates that, should a table by the name
"gamelogs" already exist, data from gl2012 will be appended to the ta-
ble. If append is set to FALSE the table "gamelogs" (if it exists) will be
overwritten.
• The rownames=FALSE argument indicates the row names will not be
appended.
The R code using RMySQL is similar to the RODBC code, but requires signif-
icantly less time for appending data. The following code is preferable if one
has successfully installed the RMySQL package.
library(RMySQL)
conn <- dbConnect(MySQL(), user='username'
, dbname='RBaseball')
dbWriteTable(conn, name="gamelogs", value=gl2012
, append=TRUE, row.names=FALSE)
In the previous section, code was provided for appending one season of
game logs into a MySQL table. However, we have demonstrated in previous
chapters that it is straightforward to use R to work with a single season of
game logs. To fully appreciate the advantages of storing data in a DBMS, a
MySQL table will be populated with game logs going back through baseball
history. With a historical database and an R connection, we demonstrate the
use of R to perform analysis over multiple seasons.
A new function appendGameLogs is written which loops through a specified
set of years (potentially from 1871 to the present), downloads the files from
Retrosheet,7 and appends the data to the gamelogs table in the RBaseball
MySQL database.8 The whole process may take several minutes. If one is not
interested in downloading files dating back to 1871, seasons from 1995 are
sufficient for reproducing the example of the next section.
7 The downloading of data from Retrosheet is performed by the previously presented
load.gamelog function, thus the reader has to make sure said function is loaded for the
code in this section to work.
The function appendGameLogs takes the following parameters as inputs:
• start and end indicate the seasons one wants to download from Ret-
rosheet and append to the MySQL database. By default the function will
work on seasons from 1871 to 2012.
• connPackage provides the user with the option of selecting whether to
use the RODBC or the RMySQL package for performing the work. Note that
RMySQL works faster and is the preferred choice.
• headersFile points to the full path where the file containing the game
log headers is stored.
• dbTableName specifies the name of the table in the MySQL database
where the data are to be uploaded.
choice.
9 The three-dots (...) construct is used here for allowing the user to specify additional
arguments to the appendGameLogs function as needed by the functions called inside it (either
dbConnect or odbcConnect).
        , gamelogs$DoubleHeader, sep="")
    gamelogs$YEAR_ID <- substr(gamelogs$Date, 1, 4)
    if(connPackage == "RMySQL"){
      dbWriteTable(conn, name=dbTableName, value=gamelogs
                   , append=TRUE, row.names=FALSE)
    } else {
      sqlSave(conn, dat=gamelogs, tablename=dbTableName
              , append=TRUE, rownames=FALSE)
    }
  }
}
head(chiAttendance)
hometeam dayofweek attendance
1 CHN Thu 55000
2 CHN Mon 38655
3 CHN Wed 26838
4 CHN Thu 20152
5 CHN Fri 21324
6 CHA Fri 38912
The sqlQuery function submits the query to the database. Its arguments
here are the connection handle established in the previous line (conn)
and a string consisting of a valid SQL statement. Readers familiar with SQL
will have no problem in understanding the meaning of the query. For those
unfamiliar with SQL, we present here a brief explanation of the purpose of
the query, inviting anyone who is interested in learning about the language to
look for the numerous resources devoted to the subject.
The first row in the SQL statement indicates the columns of the table that
are to be select-ed (in this case date, hometeam, dayofweek, and attendance).
The second line states from which table they have to be retrieved (gamelogs).
Finally, the where clause specifies conditions for the rows that are to be re-
trieved: the date has to be greater than 20000101 and the value of hometeam
has to be one of CHN and CHA.
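For readers more at home in R than in SQL, the logic of the query can be mirrored in base R. The sketch below is only an illustration on a toy data frame: gamelogsToy and its values are made up for this example and are not Retrosheet data. With a real database, the filtering is better left to MySQL so that only the needed rows reach R.

```r
# Toy stand-in for the gamelogs table (hypothetical values, for illustration only)
gamelogsToy <- data.frame(
  date       = c(19990405, 20000403, 20010402, 20020401, 20030331),
  hometeam   = c("CHN", "CHN", "CHA", "BOS", "CHA"),
  dayofweek  = c("Mon", "Mon", "Mon", "Mon", "Mon"),
  attendance = c(38000, 39000, 25000, 33000, 22000),
  stringsAsFactors = FALSE
)

# "where" clause: games after 1 January 2000 with a Chicago home team
keep <- gamelogsToy$date > 20000101 &
  gamelogsToy$hometeam %in% c("CHN", "CHA")

# "select" clause: keep only the four columns of interest
chiToy <- gamelogsToy[keep, c("date", "hometeam", "dayofweek", "attendance")]
chiToy
```

The row filter plays the role of the where clause and the column selection plays the role of the select list.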
The same task can be performed with the following code if one wants to
use RMySQL rather than RODBC.
library(RMySQL)
conn <- dbConnect(MySQL(), user='username', password='password'
, dbname='RBaseball')
chiAttendance <- dbGetQuery(conn, "
select date, hometeam, dayofweek
, attendance
from gamelogs
where date > 20000101
and hometeam in ('CHN', 'CHA')
")
day and a single ticket is required for attending both) the attendance is reported only for
the second game, while it is set at zero for the first.
library(lattice)
xyplot(attendance ~ dayofweek, data=avgAtt
, groups=hometeam
, pch=c("S", "C"), cex=2, col= "black"
, xlab="day of week"
, ylab="attendance"
)
[Plot: attendance (vertical axis, about 25,000 to 40,000) against day of week (horizontal axis), with points labeled C and S.]
FIGURE 11.3
Comparison of attendance by day of the week on games played at home by
the Cubs (C) and the White Sox (S).
The game data is conveniently stored in the rockies.games data frame and
can be explored with R commands. The sum of runs scored in each game is
computed by adding the runs scored by the home team and the visiting team.
A new column coors is added indicating whether the game was played at
Coors Field.12
rockies.games$runs <- rockies.games$awR + rockies.games$hmR
rockies.games$coors <- (rockies.games$parkid == "DEN02")
visitorrunsscored as awR tells SQL that, in the results returned by the query, the column
visitorrunsscored will be named awR.
12 Retrosheet code for Coors Field is DEN02. A list of all ballparks codes is available at
www.retrosheet.org/parkcode.txt.
stat_summary(fun.data="mean_cl_boot") +
xlab("season") +
ylab("runs per game (both teams combined)") +
scale_linetype_discrete(name="location"
, labels=c("other parks", "Coors Field"))
The stat_summary layer is used in ggplot2 to summarize the y values at
every unique value of x. The fun.data argument lets the user specify a summarizing
function; in this case "mean_cl_boot" implements a nonparametric
bootstrap procedure for obtaining confidence bands for the population mean.
The output of this layer is the set of vertical bars appearing for each
data point. The scale_linetype_discrete layer is used for assigning a name
to the legend (name argument) and labeling the series (labels).
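To give an idea of what a nonparametric bootstrap interval for a mean involves, here is a minimal base R sketch. It is not the code that ggplot2 runs internally; the runs vector is made-up data, and the 1000 resamples and the 95% level are assumptions chosen for the illustration.

```r
set.seed(2013)  # makes the resampling reproducible

# Hypothetical runs-per-game values for one season at one location
runs <- c(9, 12, 7, 15, 10, 8, 11, 14, 6, 13,
          9, 10, 12, 16, 5, 11, 8, 10, 13, 9)

# Resample the games with replacement and recompute the mean each time
bootMeans <- replicate(1000, mean(sample(runs, replace = TRUE)))

# Percentile interval: the middle 95% of the bootstrap means
ci <- quantile(bootMeans, probs = c(0.025, 0.975))
mean(runs)
ci
```

The two values in ci correspond to the ends of one vertical bar in a plot of this kind.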
[Plot: runs per game (both teams combined) on the vertical axis, from about 8 to 16, against season; legend "location" with series "other parks" and "Coors Field".]
FIGURE 11.4
Comparison of runs scored by the Rockies and their opponents at Coors Field
and in other ballparks.
From Figure 11.4 one notices how Coors Field has been an offense-friendly
park, boosting run scoring by as much as six runs per game. However, the effect
of the Colorado ballpark has somewhat decreased in the new millennium,
displaying minimal differences in the 2006-2008 period. One reason for Coors
becoming less of an extreme park is the installation of a humidor. Since the
2002 season, baseballs have been stored, prior to each game, in a room at
a higher humidity, with the intent of compensating for the unusual natural
atmospheric conditions.13
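The comparison shown in Figure 11.4 can also be tabulated. The following sketch uses a small made-up data frame (rockiesToy is hypothetical and only mimics the runs and coors columns created earlier) to compute the average runs per game by season and location, and the gap between the two locations.

```r
# Hypothetical games: season, whether played at Coors Field, total runs scored
rockiesToy <- data.frame(
  season = c(1996, 1996, 1996, 1996, 2007, 2007, 2007, 2007),
  coors  = c(TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE),
  runs   = c(16, 14, 9, 9, 10, 10, 10, 8)
)

# Mean runs per game in a season-by-location table;
# the columns are named "FALSE" (other parks) and "TRUE" (Coors Field)
avgRuns <- with(rockiesToy, tapply(runs, list(season, coors), mean))

# Difference between Coors Field and the other parks, season by season
coorsEffect <- avgRuns[, "TRUE"] - avgRuns[, "FALSE"]
coorsEffect
```

On real data, rockies.games would take the place of rockiesToy, provided it also carries a season column.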
Note that the SQL dump file creates a new database named lahman.
13 For a detailed analysis of the humidor effects, see this article by Alan Nathan,
Zimmerman’s SQL dump creates a number of lookup tables, useful for decoding
information such as the type of batted ball, the base situation, and so on.
Details on coded values are provided in Appendix A.
The SQL dump also contains a games table, featuring general information
on the games, which can be linked to atbats with the common game_id column.
Finally, a few reference tables provide details on players, umpires, teams,
and pitch types.
# get data
hrPF <- sqlQuery(conn, "
select away_team_id, home_team_id, event_cd
from events
where year_id=1996
and event_cd in (2, 18, 19, 20, 21, 22, 23)
")
16 Refer to Section 11.6.2 for performing the necessary steps to get the data into a MySQL
database.
head(evCompare)
team_id event_fl_A event_fl_H
1 ANA 0.03808699 0.03053270
2 ARI 0.03013345 0.03935524
3 ATL 0.04524887 0.04268827
4 BAL 0.03729904 0.04293456
5 BOS 0.04663212 0.03374167
6 CHA 0.04045377 0.05190166
Park factors are typically calculated so that the value 100 indicates a neutral
ballpark (one which has no effect on the particular statistic) while values over
100 indicate playing fields that increase the likelihood of the event (home
run in this case) and values under 100 indicate ballparks that decrease the
likelihood of the event.
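As a sketch of how the two frequency columns and the resulting factor fit together, the example below works on a handful of made-up batted-ball events. The toy data, and the use of aggregate and merge to build the table, are assumptions made for this illustration; the only facts taken from the text are the column names and the Retrosheet code 23 for a home run.

```r
# Toy batted-ball events (hypothetical values); event_cd 23 is a home run
hrPF <- data.frame(
  away_team_id = c("ATL", "ATL", "COL", "COL", "COL", "ATL", "ATL", "COL"),
  home_team_id = c("COL", "COL", "ATL", "ATL", "ATL", "COL", "COL", "ATL"),
  event_cd     = c(23, 2, 20, 23, 2, 23, 20, 2)
)
hrPF$event_fl <- ifelse(hrPF$event_cd == 23, 1, 0)

# Home run frequency in each team's home games and in its away games
evHome <- aggregate(event_fl ~ home_team_id, data = hrPF, FUN = mean)
evAway <- aggregate(event_fl ~ away_team_id, data = hrPF, FUN = mean)
names(evHome) <- c("team_id", "event_fl_H")
names(evAway) <- c("team_id", "event_fl_A")

# One row per team; 100 means the home park is neutral
evCompare <- merge(evAway, evHome, by = "team_id")
evCompare$PF <- round(100 * evCompare$event_fl_H / evCompare$event_fl_A)
evCompare
```

In this toy example home runs are twice as frequent in COL home games as in COL away games, so its factor is well above 100, while the opposite holds for ATL.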
The 1996 home run park factors are obtained with the first line of the
following code. The resulting data frame evCompare is then sorted in descending
order by PF. The head function is used to display the most HR-friendly parks
and the tail function displays the least HR-friendly parks. Coors Field
is at the top of the HR-friendly list, displaying an extreme value of 158: this
park boosted home run frequency by over 50% in 1996. At the other end of
the spectrum in 1996 was Dodger Stadium in Los Angeles, featuring a home
run park factor of 71.
evCompare$PF <- round(100 * evCompare$event_fl_H
/ evCompare$event_fl_A)
evCompare <- evCompare[order(-evCompare$PF),]
head(evCompare)
team_id event_fl_A event_fl_H PF
9 COL 0.03405430 0.05383393 158
4 CAL 0.03870425 0.04831843 125
1 ATL 0.03228680 0.03717311 115
3 BOS 0.03849444 0.04425145 115
10 DET 0.04571550 0.05058280 111
6 CHN 0.03741794 0.04066623 109
tail(evCompare)
team_id event_fl_A event_fl_H PF
13 KCA 0.03380102 0.02905167 86
8 CLE 0.04402386 0.03722397 85
5 CHA 0.04240652 0.03486662 82
12 HOU 0.03441600 0.02723649 79
19 NYN 0.03634809 0.02888087 79
14 LAN 0.03595703 0.02561144 71
url www.retrosheet.org/neutral.htm.
has seen left-handed batters take advantage of the short distance of the right
field fence.
Finally, the proposed park factors (as well as most published versions of
park factors) essentially ignore the players involved in each event (in this case
the batter and the pitcher). As teams rely more on the analysis of play-by-play
data, they typically adapt their strategies to accommodate the peculiarities
of ballparks. For example, while the diminished effect of Coors Field on run
scoring displayed in Figure 11.4 is mostly attributable to the humidor, part of
the effect is certainly due to teams employing different strategies when playing
in this park. For instance, teams can use pitchers who induce a high number
of groundballs that are less impacted by the effect of the rarefied air.
# identify HRs
andres$event_fl <- ifelse(andres$event_cd == 23, 1, 0)
Then the previously calculated park factors are added to the andres data
frame. This is done by merging data frames andres and evCompare, using
the columns home_team_id and team_id as the merging columns. Using the
merged data frame andres, we calculate the mean park factor for Galarraga’s
plate appearances and store it in andresPF.
andres <- merge(andres, evCompare[,c("team_id", "PF")]
, by.x="home_team_id"
, by.y="team_id")
andresPF <- mean(andres$PF)
andresPF
[1] 129.1384
The compounded park factor for Galarraga, derived from the 252 batted balls
he had at home and the 225 he had on the road (ranging from 9 in Dodger
Stadium in Los Angeles to 23 at the Astrodome in Houston), indicates Andres
had his home run frequency increased by an estimated 29% relative to a
neutral environment. In order to get the estimate of home runs in a neutral
environment, we divide Galarraga’s home runs by his average home run park
factor and multiply the result by 100.
47 / andresPF * 100
[1] 36.39507
According to our estimates, Galarraga’s benefit from the ballparks he played
in (particularly his home Coors Field) amounted to roughly 47 - 36 = 11 home
runs in the 1996 season.
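Because andresPF is a mean over batted balls, it is a weighted average of the park factors of the ballparks Galarraga batted in. As a rough consistency check that uses only figures quoted in this section (252 batted balls at Coors Field with a park factor of 158, 225 on the road, and an overall mean of 129.1384), the sketch below backs out the implied average road park factor and repeats the neutralization. The variable names are ours, and the result is approximate because the published park factors are rounded.

```r
homeBalls <- 252        # batted balls at Coors Field
roadBalls <- 225        # batted balls in all other parks
coorsPF   <- 158        # rounded home run park factor for Coors Field
andresPF  <- 129.1384   # mean park factor over all batted balls

# Weighted average:
#   andresPF = (homeBalls * coorsPF + roadBalls * roadPF) / (homeBalls + roadBalls)
# solved for the average road park factor
roadPF <- (andresPF * (homeBalls + roadBalls) - homeBalls * coorsPF) / roadBalls
roadPF

# Neutralized home run total: observed home runs scaled by 100 / park factor
neutralHR <- 47 / andresPF * 100
47 - neutralHR   # estimated home runs owed to the ballparks
```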
11.9 Exercises
1. (Runs Scored at the Astrodome)
(a) Using either the sqlQuery function from the RODBC package or the
(b) Choose an event (even different from the seven shown at SeamHeads)
and calculate how ballparks affect its frequency. As a suggestion, the
reader may want to look at seasons in the ’80s, when artificial turf
was installed in close to 40% of MLB fields, and verify whether parks
with concrete/synthetic grass surfaces featured a higher frequency of
batted balls (home runs excluded) converted into outs.19
19 SeamHeads provides information on the surface of play in each stadium’s page. For
example, in the previously mentioned page relative to the Astrodome, if one hovers the
mouse over the ballpark name in a given season, a pop-up will appear providing information
both on the ballpark cover and its playing field surface. SeamHeads is currently providing
its ballpark database as a zip archive containing comma-separated-value (.csv) files that
can be easily read by R: the link for downloading it is found at the bottom of each page in
the ballpark database section.
12
Exploring Fielding Metrics with Contributed
R Packages
CONTENTS
12.1 Introduction
12.2 A Motivating Example: Comparing Fielding Metrics
12.2.1 Introduction
12.2.2 The fielding metrics
12.2.3 Reading an Excel spreadsheet (XLConnect)
12.2.4 Summarizing multiple columns (doBy)
12.2.5 Finding the most similar string (stringdist)
12.2.6 Applying a function on multiple columns (plyr)
12.2.7 Weighted correlations (weights)
12.2.8 Displaying correlation matrices (ellipse)
12.2.9 Evaluating the fielding metrics (psych)
12.3 Comparing Two Shortstops
12.3.1 Reshaping the data (reshape2)
12.3.2 Plotting the data (ggplot2 and directlabels)
12.4 Further Reading
12.5 Exercises
12.1 Introduction
R packages were introduced in Chapter 2 of this book. Here the evaluation
of fielding is used as a motivating example to further illustrate the capabilities
of a number of R packages. To use any of the packages described in this
chapter, the package must be installed in R before it is loaded with
the library function. One installs a package by use of the install.packages
function or the Install Packages button in RStudio’s Packages tab.
tem. Its numbers are based on the results of a yearly survey among baseball
fans conducted by analyst and blogger Tom Tango.
• Wizardry’s Runs (Runs) are taken from Humphreys’ book and are
derived from seasonal data available in Lahman’s database. One advantage
of this system is that it is possible to calculate fielding ratings
throughout baseball history.
As a rule of thumb for the counting metrics (all of the above, except
RZR), players credited with saving 15 or more runs can be regarded as
extremely good defenders and, conversely, those estimated to have cost
their teams 15 or more runs as awful ones. For RZR, where an average
fielder usually posts a rate around .835, the best players are found at the .940
mark, while the worst ones record values close to .700.
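The rule of thumb for the counting metrics can be written as a small classification step. The sketch below is illustrative only: the runsSaved values are invented, and the labels are our own shorthand for the categories just described.

```r
# Hypothetical seasonal ratings, in runs saved (negative means runs cost)
runsSaved <- c(22.4, 15, 3.1, -7.8, -15, -19.6)

# Apply the +15 / -15 thresholds of the rule of thumb
rating <- ifelse(runsSaved >= 15, "extremely good",
          ifelse(runsSaved <= -15, "very poor", "in between"))

data.frame(runsSaved, rating)
```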
library(XLConnect)
xlwzr <- loadWorkbook("Appendix_C_Shortstop.xls")
wzr <- readWorksheet(xlwzr, sheet=1, startRow=7)
wzr <- subset(wzr, Year == 2009)
wzr$Name <- paste(wzr$First, wzr$Last)
head(wzr)
Year L T.mcs First Last Pos IP Runs
236 2009 N FLA Alfredo Amezaga SS 42.0000 1.0780581
275 2009 A BAL Robert Andino SS 478.3333 -0.7244126
284 2009 A TEX Elvis Andrus SS 1238.0000 5.0929674
417 2009 A KC Mike Aviles SS 269.3333 5.0588753
2 www.oup.com/us/companion.websites/9780195397765/appendices/?view=usa.
passed, and thus performed, at once. For example with FUN=c(mean, sd) one simultaneously
computes the mean and the standard deviation on the variables specified on the left side
of the formula. When more than one function is passed to FUN, it is not possible to set
keep.names as TRUE.
library(doBy)
wzr <- summaryBy(IP + Runs + v.Tm ~ Name, data=wzr, FUN=sum,
keep.names=TRUE)
To confirm that this function is operating correctly, the first few lines of the
wzr data frame are displayed.
head(wzr)
Name IP Runs v.Tm
1 A. Hernandez 289.6667 -3.2744424 -0.6064142
2 Aaron Miles 44.0000 0.0488766 -0.3746494
3 Adam Everett 942.6667 2.7007132 -2.3298411
4 Adam Rosales 33.0000 0.8441731 0.7099337
5 Alberto Callaspo 2.0000 -0.2630282 -0.2770827
6 Alberto Gonzalez 279.3333 -4.9981058 -4.6320992
fg.mismatches
[1] "Brent Lillibridge" "Anderson Hernandez"
[3] "Willie Bloomquist" "Yuniesky Betancourt"
wzr.mismatches
[1] "A. Hernandez" "B. Lillibridge" "W. Bloomquist"
[4] "Y. Betancourt"
We have learned that Humphreys has elected to use the initial of the first
name for some players with a long last name. In this case, with only
four elements on each side, matching by hand would be feasible, but it is
helpful to demonstrate a way of identifying similar strings, which could be
useful for longer lists of elements to match. The stringdist package
provides functions for computing so-called string distances. The function
stringdistmatrix compares the elements of the fg.mismatches and the
wzr.mismatches vectors, yielding the distance matrix shown in the following
code.
library(stringdist)
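As a dependency-free illustration of the same idea, base R's adist function (from the utils package) computes the generalized Levenshtein edit distance, which is one of several string distances. This sketch uses adist in place of stringdistmatrix, so the numbers need not coincide with those produced by the stringdist package; the object names distances and closest are ours.

```r
fg.mismatches <- c("Brent Lillibridge", "Anderson Hernandez",
                   "Willie Bloomquist", "Yuniesky Betancourt")
wzr.mismatches <- c("A. Hernandez", "B. Lillibridge", "W. Bloomquist",
                    "Y. Betancourt")

# Rows correspond to fg.mismatches, columns to wzr.mismatches
distances <- adist(fg.mismatches, wzr.mismatches)

# For each name in fg.mismatches, pick the closest name in wzr.mismatches
closest <- apply(distances, 1, which.min)
data.frame(fg = fg.mismatches, wzr = wzr.mismatches[closest])
```

Taking the minimum of each row pairs every full name with its abbreviated counterpart, which is the step that replaces matching by hand.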
The coalesce function is then applied to every column of defense. One can
avoid calling this function multiple times thanks to the colwise function in
the plyr package which takes a vector-input function as the argument and
returns another function which works columnwise on a data frame.
library(plyr)
coalesceColumns <- colwise(coalesce)
defense <- coalesceColumns(defense)
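The idea of a function that returns another function may be easier to see in a base R sketch. The helper colwiseSketch below is a simplified stand-in written for this illustration, not plyr's implementation, and naToZero is a hypothetical example of a vector-input function; the toy data frame is also made up.

```r
# A vector-input function: replace missing values with zero
naToZero <- function(x) ifelse(is.na(x), 0, x)

# Takes a vector function f and returns a function that applies f
# to every column of a data frame
colwiseSketch <- function(f) {
  function(df) {
    df[] <- lapply(df, f)  # assigning into df[] keeps the data frame structure
    df
  }
}

toy <- data.frame(UZR = c(4.2, NA, -1.0), DRS = c(NA, 3, -2))
naToZeroColumns <- colwiseSketch(naToZero)
cleaned <- naToZeroColumns(toy)
cleaned
```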
round(cor(defense[,-1:-3]),3)
UZR DRS TZL RZR FSR Runs
UZR 1.000 0.782 0.727 0.142 0.489 0.305
DRS 0.782 1.000 0.724 0.090 0.596 0.387
TZL 0.727 0.724 1.000 0.124 0.495 0.322
RZR 0.142 0.090 0.124 1.000 0.050 0.059
FSR 0.489 0.596 0.495 0.050 1.000 0.089
Runs 0.305 0.387 0.322 0.059 0.089 1.000
Every observation (i.e., every row in the defense data frame) counts the
same when calculating the correlations. This may be inappropriate – a short-
stop who has more opportunities perhaps should have greater weight than
a part-time shortstop in the computation of the correlations. The function
wtd.cor in the weights package can compute a correlation allowing for different
weights for the observations. The function wtd.cor takes two inputs here: the
matrix of observations and a vector of weights, which in this case corresponds
to the number of innings played. The function returns a list of matrices: a
matrix of correlation coefficients, along with matrices of standard errors,
t-values, and p-values. In the
code that follows we limit the output to the correlation matrix, which can be
compared to the one produced by cor.
library(weights)
round(wtd.cor(defense[,-1:-3], weight=defense$Inn)$correlation, 3)
UZR DRS TZL RZR FSR Runs
UZR 1.000 0.794 0.731 0.618 0.502 0.241
DRS 0.794 1.000 0.745 0.433 0.598 0.364
TZL 0.731 0.745 1.000 0.565 0.487 0.288
RZR 0.618 0.433 0.565 1.000 0.275 0.171
FSR 0.502 0.598 0.487 0.275 1.000 0.018
Runs 0.241 0.364 0.288 0.171 0.018 1.000
The matrix of weighted correlations differs from the raw (unweighted)
one. In particular, it appears the very low correlation between
the Revised Zone Rating (RZR) and the other metrics was largely driven
by the numbers posted by the occasional players.
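To see what weighting does, a weighted correlation can be computed in base R. The sketch below uses the cov.wt function from the stats package on invented data (defenseToy is hypothetical); it is meant to illustrate the concept and is not the code used inside wtd.cor.

```r
# Hypothetical shortstops: innings played and two fielding metrics
defenseToy <- data.frame(
  Inn = c(1250, 1100, 900, 120, 45, 20),
  UZR = c(8.1, -4.0, 2.5, 1.5, -2.0, 0.8),
  RZR = c(0.86, 0.80, 0.84, 0.70, 0.95, 0.60)
)

# Unweighted: every shortstop counts the same
unweighted <- cor(defenseToy$UZR, defenseToy$RZR)

# Weighted by innings: part-time players contribute very little
wts <- defenseToy$Inn / sum(defenseToy$Inn)
weighted <- cov.wt(defenseToy[, c("UZR", "RZR")], wt = wts, cor = TRUE)$cor

unweighted
weighted["UZR", "RZR"]
```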
library(ellipse)
Dcor <- wtd.cor(defense[,-1:-3], weight=defense$Inn)$correlation
plotcorr(Dcor)
[Plot: correlation matrix display with rows and columns labeled UZR, DRS, TZL, RZR, FSR, and Runs.]
FIGURE 12.1
Visualization of the correlation matrix for defensive metrics.
defensive ratings for the seasons 2007 to 2009. In addition to the six met-
rics previously presented, this file features an additional rating, the Defensive
Efficiency Record, labeled DER. This fielding measure, first proposed by Bill
James, is calculated by dividing the outs recorded on batted balls by the
number of balls playable by the defense (every batted ball except home runs).
DER provides a fair, albeit not perfect, estimation of defensive value at the
team level.9 In the remainder of this section the six previously shown metrics
will be compared with DER.
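As a worked illustration of the definition, the ratio can be written as a short function. The counts below are invented for the example, and the function name der and its argument names are ours.

```r
# Defensive Efficiency Record:
# outs recorded on batted balls / batted balls playable by the defense
der <- function(outsOnBattedBalls, battedBalls, homeRuns) {
  outsOnBattedBalls / (battedBalls - homeRuns)
}

# Hypothetical team season: 4400 batted balls, 160 of them home runs,
# and 2950 outs recorded on the remaining balls
teamDER <- der(outsOnBattedBalls = 2950, battedBalls = 4400, homeRuns = 160)
teamDER
```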
The teamD.csv file is read into R and the weighted correlations among DER and
the six metrics are computed, with the help of the wtd.cor function in the weights
package, as done in Section 12.2.7.
teamD <- read.csv("teamD.csv")
Dcor <- wtd.cor(teamD[,c("DER", "UZR", "DRS", "TZL", "RZR"
, "FSR", "Runs")], weight=teamD$IP)
A new visualization, produced by the function cor.plot in the psych package, is
used to display the resulting correlation matrix. In our example, the function
mat.sort is first used to sort the correlation matrix so that similar items
are grouped together. Then the function cor.plot displays the correlation
matrix where darker boxes correspond to larger correlation values. By setting
the numbers argument to TRUE, the values of the correlations, on a 0 to 100
scale, are also displayed.
library(psych)
sortedCor <- mat.sort(Dcor$correlation)
cor.plot(sortedCor, numbers=TRUE)
According to Figure 12.2, TZL has the highest correlation with DER for the
three-year period.
same, thus not accounting for the different outcomes of batted balls (singles, doubles, and
so on).
10 We will actually be using part of the career of Rollins and Jeter, particularly the seasons
covered by BIS data, thus having UZR and DRS values available.
11 Registration is free of charge.
      DER Runs TZL UZR DRS RZR FSR
DER   100   76  78  63  57  69  30
Runs   76  100  66  62  66  59  20
TZL    78   66 100  58  56  58  40
UZR    63   62  58 100  63  50  45
DRS    57   66  56  63 100  50  63
RZR    69   59  58  50  50 100  44
FSR    30   20  40  45  63  44 100
FIGURE 12.2
Correlation plot for the comparison of team defensive metrics.
library(ggplot2)
p <- ggplot(jetrol2, aes(x=Season, y=runs
, col=fieldingMetric)) +
geom_line() +
facet_grid(. ~ Name) +
scale_color_manual(name="Fielding\nmetric"
, values=c("black", "grey70")) +
scale_x_continuous(breaks=seq(2004, 2012, 4)) +
geom_hline(yintercept=0, lty=3)
The package directlabels provides a quick-to-use function direct.label
which adds labels directly to the plot, thus removing the color legend, as shown
in Figure 12.3.12
library(directlabels)
direct.label(p)
The plots in Figure 12.3 show that Jeter has been consistently rated as a
below-average fielder (despite both UZR and DRS indicating an unusually
good season for him in 2009), while Rollins has mostly been an above-average
fielding shortstop. UZR and DRS numbers follow more or less the same path
for the Yankee captain, but their portraits of the Philadelphia shortstop convey
different information: where DRS seems to indicate a steep decline in Jimmy’s
defensive value, he appears to have maintained his ability according to UZR.
[Plot: two panels showing runs (vertical axis, about −20 to 10) against Season (horizontal axis, 2004 to 2012), with lines labeled UZR and DRS.]
FIGURE 12.3
Derek Jeter’s and Jimmy Rollins’ defensive value through the years as mea-
sured by UZR and DRS. Labels added directly on the plot area.
12.5 Exercises
1. (Data Reshaping: Exploring Team Plate Discipline)
(a) On the FanGraphs website (www.fangraphs.com/) generate a report
containing team batting plate discipline stats from 2008 to the latest
completed season. Export the data as a csv file and load it into R.
Rename the columns if necessary.
(b) Calculate the correlation between percentage of swings outside the
zone (O-Swing%) and the percentage of pitches in the strike zone
(Zone%). Are teams with a high propensity toward swinging at bad
pitches fed with a lower number of pitches in the strike zone?
(c) Reshape the loaded data so that the new data frame will consist of
four columns: season, team, the name of the discipline statistic, and
the value of the discipline statistic.
(d) Subset the data frame so that it only contains O-Swing% and Zone%
rows.
(e) Draw a plot where teams are displayed in different panels, each fea-
turing the year-to-year trend of both O-Swing% and Zone%.
2. (Applying a Function on Multiple Columns: Exploring the Avail-
ability of Statistics Through Baseball History)
Not every baseball statistic has been tracked since the early days of base-
ball. Stolen Bases and Caught Stealing are examples of numbers which are
not available in some earlier seasons.
(a) Load the batting statistics from Lahman’s database and create a new
data frame containing only the following information: year, league,
stolen bases, and runners caught stealing.
(b) Write a function that takes a vector as its only argument and returns
the percentage of missing values (NAs) in the vector. Note that is.na
is a function for identifying NAs and that the function length returns
the number of elements in a vector.
(c) Apply the newly written function to the columns of the previously
created data frame.
(d) Use the ddply function (from the plyr package) to apply the function
by year and by league. Was the recording of SBs and CSs introduced
at the same time in the National and the American League?
3. (Comparing Defensive Ratings and Stealing Ability in Center-
fielders)
As speed is an important trait of a centerfielder’s skill set, good base stealers
are often found among players manning the middle outfield position.
(a) Download seasonal fielding ratings for centerfielders from Wizardry’s
online resources. Read the Excel file into R.
(b) Using tables from Lahman’s database, prepare a data frame contain-
ing seasonal data on offensive stolen bases and caught stealing for
players who have appearances (in the given season) at centerfield.
(c) Estimates like those performed in Chapter 5 indicate that successful
stolen bases are worth roughly 0.2 runs to the offensive team, while
a caught stealing costs it 0.5 runs. Using these values compute stolen
base runs for every player-season.
(d) Merge defensive ratings with stolen base rating data. Note that for
the previous point you may want to gather the players’ first and
last names from the Master table. Try to perform string matching
as shown in Section 12.2.5 to match the highest number of
player-seasons.
(e) Compute the correlation between Stolen Base Runs and Defensive
Runs (as reported in Wizardry).
A
Retrosheet Files Reference
CONTENTS
A.1 Downloading Play-by-Play Files
A.1.1 Introduction
A.1.2 Setup
A.1.3 Using a special function for a particular season
A.1.4 Reading the files into R
A.1.5 The function parse.retrosheet.pbp
A.2 Retrosheet Event Files: a Short Reference
A.2.1 Game and event identifiers
A.2.2 The state of the game
A.3 Parsing Retrosheet Pitch Sequences
A.3.1 Introduction
A.3.2 Setup
A.3.3 Evaluating every count
A.1.2 Setup
In the current R working directory, create a new folder “download.folder”
and in this new folder, create two folders “zipped” and “unzipped.” There
are special software tools for working with the Retrosheet files that should be
downloaded from sourceforge.net/projects/chadwick/files/. When the
compressed file is expanded, the file “cwevent.exe” should be placed in the
“unzipped” folder.
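The folder layout described above can also be created from within R. The following is a minimal sketch, not part of the original setup: the variable root is a hypothetical stand-in for the working directory, set here to a temporary directory so that running the example leaves no files behind.

```r
# 'root' is a hypothetical placeholder; use getwd() to build the folders
# in the current working directory instead of a temporary one.
root <- tempdir()
base <- file.path(root, "download.folder")

# recursive = TRUE also creates "download.folder" itself if it is missing.
for (sub in c("zipped", "unzipped")) {
  dir.create(file.path(base, sub), recursive = TRUE, showWarnings = FALSE)
}

list.files(base)
```

Creating the folders programmatically makes the setup repeatable when several seasons are processed on different machines.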
If you look in the “download.folder/unzipped” folder, you will see two new
files: “all1950.csv” contains the play-by-play records for the 1950 season, and
“roster1950.csv” contains roster information for all players of that season.
shell("del TEAM*")
setwd(wd)
}
download.retrosheet(season)
unzip.retrosheet(season)
create.csv.file(season)
create.csv.roster(season)
cleanup()
}
net/doc/cwtools.html. In particular, the tool for processing the event files (cwevent) is
documented at chadwick.sourceforge.net/doc/cwevent.html#cwtools-cwevent.
doubleheaders (thus “1” indicates the first game, “2” the second game, and
“0” means only one game was played on the day).
Events are numbered sequentially within each game (column EVENT_ID), so
every single action in the Retrosheet database can be uniquely identified by
the combination of the game identifier and the event identifier.
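As an illustration of how the two identifiers combine into a unique key, the sketch below builds one on a toy data frame. The identifiers are made up, and the code assumes the usual 12-character Retrosheet game identifier (home team, date, and a final digit for the game number) stored in columns named GAME_ID and EVENT_ID.

```r
# Made-up identifiers in the Retrosheet style: home team, date (yyyymmdd),
# and a final digit for the game number within the day.
events <- data.frame(
  GAME_ID  = c("SFN201104080", "SFN201104080", "NYA201107301", "NYA201107302"),
  EVENT_ID = c(1, 2, 1, 1),
  stringsAsFactors = FALSE
)

# Combine game and event identifiers into a single key.
events$key <- paste(events$GAME_ID, events$EVENT_ID, sep = "_")
anyDuplicated(events$key)  # 0 means every key is unique

# The last character of the game identifier is the doubleheader digit.
events$game.number <- substr(events$GAME_ID, 12, 12)
events$game.number
```

A key of this kind is convenient when events have to be merged with other tables or when duplicated rows need to be detected after combining several seasons.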
TABLE A.1
Retrosheet coding for the situation of runners on base.
stolen bases, wild pitches and, generally, any event that does not mark the
end of a plate appearance.
• H_CD is a numeric code indicating the base hit type, going from 1 for
a single to 4 for a home run.
• BATTEDBALL_CD is a single character code denoting the batted ball
type. It can assume one of the following values: G (ground ball), L (line
drive), F (fly ball), P (pop-up). Note that for most of the seasons in the
Retrosheet database, the batted ball type is reported only for plate
appearances ending with the batter making an out and is not available
for base hits.
• BATTEDBALL_LOC_TX is a string indicating the batted ball location,
coded according to the diagram shown at www.retrosheet.org/location.htm.
Note that this information is available for a limited number of seasons.
• FLD_CD is a numeric code denoting the fielder first touching a batted
ball, coded with the conventional baseball fielding notation going from 1
(the pitcher) to 9 (the right fielder).
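The numeric codes described in this list are easier to read once translated into labels. The following is a small sketch on made-up values, assuming the columns are named H_CD and FLD_CD and that H_CD is 0 when the event is not a base hit; neither the data nor the helper names come from the Retrosheet files themselves.

```r
# Toy values standing in for a few rows of play-by-play data.
# Assumption: H_CD is 0 when the event is not a base hit.
events <- data.frame(H_CD   = c(0, 1, 2, 0, 3),
                     FLD_CD = c(6, 7, 8, 3, 9))

# Translate the numeric hit code (1 to 4) into readable labels;
# codes outside 1:4 become NA.
events$hit.type <- factor(events$H_CD, levels = 1:4,
                          labels = c("single", "double", "triple", "home run"))

# Translate the fielder code (1 to 9) into position abbreviations.
positions <- c("P", "C", "1B", "2B", "3B", "SS", "LF", "CF", "RF")
events$fielder <- positions[events$FLD_CD]

table(events$hit.type, useNA = "ifany")
```

Using a factor with all four levels keeps home runs in the table even when none occur in the subset being examined.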
The sequence of pitches is recorded in the PITCH_SEQ_TX column and is
discussed in Chapter 7, where Table 7.1 displays how the different pitch
outcomes are coded. Several columns are generated from this one, indicating
counts of the various types of pitch outcomes, as displayed in Table A.3.
A.3.2 Setup
We first load Retrosheet data for the 2011 season.
# all2011.csv has no header row; column names come from fields.csv
pbp2011 <- read.csv("retrosheet/all2011.csv", header = FALSE)
headers <- read.csv("retrosheet/fields.csv")
names(pbp2011) <- headers$Header
TABLE A.2
Retrosheet coding for the type of event.
Then a new column sequence is created in which the pitch sequence is
reported, stripped of any character not indicating an actual pitch to the batter.
pbp2011$sequence <- gsub("[.>123+*N]", "", pbp2011$PITCH_SEQ_TX)
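To see what this substitution does, it can be tried on a few made-up pitch strings. The sketch below is only an illustration; the sequences are invented and the names raw.seq and clean.seq are hypothetical.

```r
# Made-up pitch sequences containing non-pitch characters.
raw.seq <- c("CBFX", "B1.>BX", "*BCC+2S")

# Inside the brackets every character is taken literally, so ".", ">", "+"
# and "*" need no escaping.
clean.seq <- gsub("[.>123+*N]", "", raw.seq)
clean.seq
# [1] "CBFX" "BBX"  "BCCS"
```

Only the characters listed in the bracket expression are removed, so the order of the remaining pitch outcomes is preserved.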
TABLE A.3
Columns reporting counts of various pitch types.
|[BIPV][CFKLMOQRST][CFKLMOQRST]
|[CFKLMOQRST][BIPV][CFKLMOQRST])", pbp2011$sequence)
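# The patterns are assembled with paste0 so that they contain no line breaks.
pbp2011$c22 <- grepl(paste0("^(",
  "[CFKLMOQRST][CFKLMOQRST][FR]*[BIPV][FR]*[BIPV]",
  "|[BIPV][BIPV][CFKLMOQRST][CFKLMOQRST]",
  "|[BIPV][CFKLMOQRST][BIPV][CFKLMOQRST]",
  "|[BIPV][CFKLMOQRST][CFKLMOQRST][FR]*[BIPV]",
  "|[CFKLMOQRST][BIPV][CFKLMOQRST][FR]*[BIPV]",
  "|[CFKLMOQRST][BIPV][BIPV][CFKLMOQRST]",
  ")"), pbp2011$sequence)
# A 3-2 count requires three balls and two strikes; the & is kept at the end
# of the line so that R reads both conditions as a single expression.
pbp2011$c32 <- grepl(paste0("^[CFKLMOQRST]*[BIPV][CFKLMOQRST]*",
  "[BIPV][CFKLMOQRST]*[BIPV]"), pbp2011$sequence) &
  grepl("^[BIPV]*[CFKLMOQRST][BIPV]*[CFKLMOQRST]", pbp2011$sequence)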
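The logic of the full-count test can be checked on a handful of invented sequences. The sketch below wraps the two conditions in a hypothetical helper, reaches.32, which is not part of the original code; the character classes for balls and strikes are the ones used in the patterns above.

```r
# Character classes for balls and strikes, as used in the patterns above.
b <- "[BIPV]"
s <- "[CFKLMOQRST]"

# Hypothetical helper: TRUE when a cleaned sequence reaches a 3-2 count.
reaches.32 <- function(sequence) {
  # three balls, with any number of strikes (including fouls) in between
  three.balls <- paste0("^", s, "*", b, s, "*", b, s, "*", b)
  # two strikes, with any number of balls in between
  two.strikes <- paste0("^", b, "*", s, b, "*", s)
  grepl(three.balls, sequence) & grepl(two.strikes, sequence)
}

toy.seq <- c("BCBFB", "BBCBCX", "CCB", "BBBB")
reaches.32(toy.seq)
# [1]  TRUE  TRUE FALSE FALSE
```

The third sequence fails because only one ball is thrown, and the fourth because the batter walks without seeing a strike.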
B
Accessing and Using MLBAM Gameday and
PITCHf/x Data
CONTENTS
B.1 Introduction
B.2 Where are the Data Stored?
B.3 Suitable Formats for PITCHf/x Data
B.3.1 Obtaining data from on-line resources
B.3.2 Parsing in R
B.3.2.1 A wrapper function
B.4 Details on the Data
B.4.1 atbat attributes
B.4.2 pitch attributes
B.4.3 hip attributes (hit locations data)
B.5 Special Notes About the Gameday and PITCHf/x Data
B.6 Miscellanea
B.6.1 Calculating the pitch trajectory
B.6.2 An R package for getting and visualizing PITCHf/x data: pitchRx
B.6.3 Cross-referencing with other data sources
B.6.4 Online resources
B.1 Introduction
This appendix provides further details on Gameday and PITCHf/x data. It
shows where the data are stored, how they can be retrieved using resources
available on the Web, and how they can be parsed using R. It also describes
the most important fields and gives an overview of the issues a researcher
should be aware of when analyzing these data.
FIGURE B.1
MLB directory of games played on June 13, 2012.
games.
inning_all.xml XML page and they appear on the Web page as in Figure
B.2. Similarly, data on hit locations are stored in the inning_hit.xml page.
FIGURE B.2
Pitch-by-pitch data of the Astros – Giants game played on June 13, 2012.
XML documents have a tree structure. The inning_all.xml files
have the following structure.
<game>
  <inning>
    <top>
      <atbat>
        <pitch>
      </atbat>
    </top>
    <bottom>
      <atbat>
        <pitch>
      </atbat>
    </bottom>
  </inning>
</game>
The <game> element is said to be the root element of the tree – this indicates
that this document is a game. The <inning> element is a child of the root,
and <top> and <bottom> are in turn children of <inning>. Both <top> and
<bottom> elements have <atbat> children, which then have <pitch> children.
Note that all elements are delimited by an <element> tag at the beginning
and an </element> tag at the end.
In the opening tag (the one without the slash) of elements, attributes can
be found. For example, the <inning> element usually has the format
B.3.2 Parsing in R
The XML package provides XML parsing functionality in R. The first step in
the parsing process reads the XML file into R using the xmlParse function.
library(XML)
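# The address is assembled with paste0 so that it contains no line breaks;
# the http:// prefix lets xmlParse treat the string as a URL.
gameUrl <- paste0("http://gd2.mlb.com/components/game/mlb/year_2012/",
  "month_06/day_13/gid_2012_06_13_houmlb_sfnmlb_1/inning/inning_all.xml")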
xmlGame <- xmlParse(gameUrl)
Several functions are available to access specific nodes of the tree. For
example, one can obtain the inning nodes by using the getNodeSet function.
xmlInnings <- getNodeSet(xmlGame, "//inning")
length(xmlInnings)
[1] 9
By using the xmlAttrs function, one can retrieve the attributes of a node.
The following code obtains the attributes for the first inning.
xmlAttrs(xmlInnings[[1]])
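The same functions can be tried without an Internet connection on a small hand-written document. The sketch below is only an illustration: the XML text mimics the layout of the inning_all.xml files, but the attribute names and values are chosen for the example and the object names (xmlText, toyGame, toyPitches) are hypothetical.

```r
library(XML)

# A tiny hand-written document mimicking the inning_all.xml layout;
# attribute names and values here are illustrative only.
xmlText <- '<game>
  <inning num="1">
    <top>
      <atbat num="1" event="Groundout">
        <pitch des="Ball" type="B"/>
        <pitch des="In play, out(s)" type="X"/>
      </atbat>
    </top>
    <bottom>
      <atbat num="2" event="Single">
        <pitch des="Called Strike" type="S"/>
      </atbat>
    </bottom>
  </inning>
</game>'

# asText = TRUE tells xmlParse that the argument is XML content, not a file.
toyGame <- xmlParse(xmlText, asText = TRUE)

# All pitch nodes, wherever they sit in the tree.
toyPitches <- getNodeSet(toyGame, "//pitch")
length(toyPitches)

# Attributes of one node come back as a named character vector.
xmlAttrs(toyPitches[[1]])

# One attribute collected across all pitch nodes.
sapply(toyPitches, xmlGetAttr, "des")
```

Working on a small inline document first makes it easier to check the XPath expression before applying it to a full game file.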
frame (hence the “d” at the beginning of its name), ldply requires a list (hence the “l”).
dim(pitchesData)
[1] 279 39
For the June 13, 2012 game between Houston and San Francisco, data on 279
pitches are parsed into the pitchesData data frame which consists of 279
rows and 39 columns. Functions specifically devised for retrieving PITCHf/x
data are available in the pitchRx package, introduced in Chapter 12.
• stand and p_throws: handedness of the batter and the pitcher for the
at-bat.
• des and event: detailed and short descriptions of the at-bat outcome.
right position and the height of the release point in the same coordinate
system as px and pz.
• vx0, vy0, and vz0: Components of the pitch velocity in three dimen-
sions, measured at release in feet per second.
• ax, ay, and az: Components of the pitch acceleration in three
dimensions, measured at release in ft/s².
5 For a visual explanation of the three break attributes, see the figure displayed at the
end of Mike Fast’s glossary.
TABLE B.1
Key for the pitch type abbreviations used by MLBAM.
Abbreviation  Description
AB            Automatic Ball
CH            Change-up
CU            Curveball
EP            Eephus pitch
FA            Fastball (unspecified)
FC            Cut-fastball (cutter)
FF            Four-seam fastball
FO            Forkball
FS            Split-fingered fastball (splitter)
FT            Two-seam fastball
IN            Intentional ball
KC            Knuckle curve
KN            Knuckleball
PO            Pitchout
SC            Screwball
SI            Sinker
SL            Slider
UN            Unknown pitch type
on-it/.
makes use of the PITCHf/x data. The algorithm has been altered through
the years. While the modifications have generally improved the accuracy
of the classification of pitches, the year-to-year comparison of pitchers’
repertoires requires extra caution, as differences might be a product of the
changes in the classifying algorithm.
• Batted ball locations: Batted ball location data are recorded manually
and suffer from several problems. First, coordinate systems vary
from ballpark to ballpark, as stringers mark spots on 250 × 250 pixel field
diagrams with inconsistent home plate positioning and pixel-to-feet ratios.
Second, researchers have shown that biases exist due to both the position
(i.e., height) the stringer is assigned at the ballpark and the outcome of the
batted ball. Finally, the stringers are instructed to mark the place where
the ball is collected by a fielder, so in case of deflections or caroms off the
walls it is impossible to infer the original angle of the batted ball.
B.6 Miscellanea
B.6.1 Calculating the pitch trajectory
As seen in the previous sections, PITCHf/x tracks data on location, velocity,
and acceleration of a pitch. Using the kinematics equation for constant accel-
eration, the position of the ball at a given time t can be determined by the
following equations:
$$x = x_0 + v_{x0}\,t + \frac{1}{2} a_x t^2 \tag{B.1}$$
$$y = y_0 + v_{y0}\,t + \frac{1}{2} a_y t^2 \tag{B.2}$$
$$z = z_0 + v_{z0}\,t + \frac{1}{2} a_z t^2 \tag{B.3}$$
The previous equations are translated to R with use of the following func-
tion pitchloc.7
pitchloc <- function(t, x0, ax, vx0, y0, ay, vy0, z0, az, vz0) {
  # position at time t under constant acceleration, one coordinate at a time
  x <- x0 + vx0 * t + 0.5 * ax * I(t ^ 2)
  y <- y0 + vy0 * t + 0.5 * ay * I(t ^ 2)
  z <- z0 + vz0 * t + 0.5 * az * I(t ^ 2)
  # a single time returns a vector; several times return a matrix
  if(length(t) == 1) {
    loc <- c(x, y, z)
  } else {
    loc <- cbind(x, y, z)
  }
  return(loc)
}
7 The code in this section has been slightly adapted from code.google.com/p/
r-pitchfx/.
The function pitch.trajectory calculates the trajectory of a pitch from
release point to home plate at specified time intervals (the default choice of
the argument interval is 0.01 seconds).
pitch.trajectory <- function(x0, ax, vx0, y0, ay, vy0, z0, az,
                             vz0, interval = .01) {
  # time at which the ball reaches y = 0 (root of the y equation)
  cross.plate <- (-1 * vy0 - sqrt(I(vy0 ^ 2) - 2 * y0 * ay)) / ay
  # one row of (x, y, z) for every time step up to the crossing time
  tracking <- t(sapply(seq(0, cross.plate, interval), pitchloc,
    x0 = x0, ax = ax, vx0 = vx0, y0 = y0, ay = ay, vy0 = vy0,
    z0 = z0, az = az, vz0 = vz0))
  colnames(tracking) <- c("x", "y", "z")
  tracking <- data.frame(tracking)
  return(tracking)
}
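The crossing time used by pitch.trajectory comes from setting y = 0 in the constant-acceleration equation for y and solving the resulting quadratic, which gives t = (-vy0 - sqrt(vy0^2 - 2 y0 ay)) / ay. The sketch below checks this on made-up values (y0, vy0 and ay are illustrative numbers in feet and seconds, not taken from a real pitch) and assumes a negative vy0 and a positive ay, so that the expression picks the first time the ball reaches y = 0.

```r
# Illustrative (made-up) values, in feet and seconds.
y0  <- 50     # distance from home plate at which tracking starts
vy0 <- -130   # velocity toward the plate (negative direction)
ay  <- 28     # deceleration along y

# Time at which y reaches 0: the smaller root of
# y0 + vy0 * t + 0.5 * ay * t^2 = 0.
cross.plate <- (-vy0 - sqrt(vy0^2 - 2 * y0 * ay)) / ay

# Position along y on a 0.01-second grid, plus the crossing time itself.
t.grid <- c(seq(0, cross.plate, by = 0.01), cross.plate)
y <- y0 + vy0 * t.grid + 0.5 * ay * t.grid^2

cross.plate   # roughly 0.4 seconds
tail(y, 1)    # essentially zero
```

With these numbers the ball takes about four tenths of a second to travel the 50 feet, and the y coordinate decreases steadily over the whole grid.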
strikeFX(pitches, geom = "tile",
  layer = facet_grid(pitcher_name ~ stand))
[Figure: tile plot of pitch density. Panel columns: batter stand (L, R); panel rows: Mariano Rivera, Phil Hughes; x-axis: Horizontal Pitch Location; y-axis: Height from Ground; fill: density.]
FIGURE B.3
Locations of four-seamers and cutters delivered by Mariano Rivera and Phil
Hughes in 2011, by batter handedness.