
Golf Strategy Optimization and the Value of Golf Skills

Gautier Stauffer, Faculty of Business and Economics (HEC Lausanne), Department of Operations, University of Lausanne, Quartier Unil-Chamberonne, 1015 Lausanne, Switzerland, Email: [email protected]

Matthieu Guillot, Laboratoire DISP, IUT Lumière, Université Lumière Lyon2, France, Email: [email protected]

Abstract

This study investigates strategic considerations in professional golf’s Stroke Play format. We develop a Markov Decision Process (MDP) model, specifically a stochastic shortest path model, to optimize a golfer’s strategy for any golf course. The model integrates golf course layout details and player skills. We demonstrate this approach using Professional Golfers’ Association Tour data and aerial views of golf courses. To the best of our knowledge, this is the first exact, data-driven approach for golf strategy optimization in the literature. While MDPs are commonly used for sport strategy optimization, scaling this approach to golf poses a challenge due to the curse of dimensionality. Our primary objective is to prove that an exact approach is computationally feasible for such large-scale problems, provided that low-level coding and meticulous code optimization are employed. Furthermore, we illustrate how this framework could be used to determine which aspects of a player’s game should be prioritized for improvement and challenge the ‘Drive for show, putt for dough’ adage. Additionally, we demonstrate how our methodology can be used to quantify the value of different golf skills. To ensure replicability and facilitate the adaptation and extension of our methodology, we provide open access to all our code and analyses (in R and C++).

1 Introduction

Golf is a game in which players aim to put a ball into a cup (sometimes also called a hole) with the help of clubs (the main types are woods, irons, and putters) using the fewest possible shots (see [1] for the 2019 official rules of golf). The field where the golfer plays is called a (golf) course. It consists of eighteen independent (and different) holes. A hole comprises different areas as shown in Fig. 1 (see subsection 1 for more details).

Figure 1: A hole and its different areas; the area beyond is out-of-bounds.

The golfer’s score on a hole is essentially the number of shots taken to put the ball into the cup, plus any possible penalties (when the ball ends up in the water or goes out-of-bounds) [1]. The main type of competition is stroke play, where the players usually play 4 rounds of 18 holes. The player’s final score is obtained by summing the scores of the 72 holes, and the winner is the player with the lowest total score. We focus on this type of competition. Before a competition, elite amateurs and professional players typically practice on the tournament’s golf course to identify the main hazards and risks. This usually includes inspecting the course for obstacles, hazards, slopes, and weather conditions. They typically record such information in a golf course booklet, together with personal recommendations on which club to use and which target to aim for under different scenarios (pin position, wind, etc.). The main purpose of the course booklet is to help golfers develop an effective strategy for playing their game during the competition. An ideal personalized booklet would describe which shot to play from any (reachable) position on the golf course, given the current performances of the player. Building such a booklet is beyond human capabilities. In this paper, we develop a methodology to automate the construction of such a “strategic” booklet using (available) historical data on the performance of players and Markov Decision Processes, and we show that it is computationally tractable. The underlying problem is strategy optimization.

Markov chains and Markov decision processes are the models of choice for performance assessment and optimization in sports, as they capture the inherent probabilistic nature of the success of every ‘action’ performed by athletes or teams. They have been used, for instance, in tennis, basketball, volleyball, ice-hockey, golf, soccer, darts, and snooker, e.g. [40, 41, 36, 34, 26, 27, 23, 32, 11, 39]. Building such a model for golf strategy optimization requires detailed information about the golf course and the player’s past performances. In this work, we use simple 2D information on golf courses extracted from aerial views and historical data on the player’s performances taken from the Shotlink^™ database.

The introduction of the Shotlink^™ intelligence program (an initiative of the US Professional Golfer Association (PGA) to share data it collects in real-time on all shots taken on the PGA Tour - the PGA championship - since 2001) has stimulated a lot of academic research in the past 15 years. Broadie’s pioneering work on the strokes-gained method [9, 10] has revolutionized the analyses of professional golfers’ performances on the PGA Tour. In addition, a large body of work exploits the Shotlink^™ database to study various aspects of the game of golf (such as the effect of luck, pressure on performance, the existence of the hot hand phenomenon) through statistical analyses (e.g. [6, 33, 18, 16, 35, 38, 19, 15, 14, 13, 25, 24, 3, 22]), performance prediction through machine learning (e.g. [29, 28, 37, 42, 31, 17], see [12] for a recent survey), and the evaluation of different parameters (distance, dispersion, hole size) on performance through simulation and/or optimization [5, 11]. This work is part of the area of research called golf analytics.

Golf strategy optimization was addressed first by [39]. The authors show how to approximate the optimal strategy of a player using a skill model, simulation and Q-learning. The underlying simulation and skill models are similar to those of [11], who used a greedy approach to model a golfer’s strategy (basically they assume that a golfer always chooses the best shot assuming he will play a perfect one - which is actually a strategy often chosen by amateurs). [39] showcase their approach using different types of “average” players whose skills are characterized by parametrized distributions (the parameters are taken from statistical information available on “average” players). In this work, we use similar simulation and skill models as those presented in [39] and [11], but we apply the methodology using empirical distributions of PGA Tour players, constructed from historical data available through the Shotlink^™ database.

We show in particular that the natural Markov Decision Problem associated with the strategic optimization problem (this is essentially the same underlying MDP in [39]) can be solved exactly in a reasonable amount of time. In addition, we illustrate how the corresponding methodology can be used for skill improvement. The corresponding approach should help PGA tour professionals (and amateurs collecting similar data using systems such as Arccos^™) to substantially improve their performances.

Some golf terminology

The tee is the area where the player starts (there might be several potential areas but the official starting point for a (round of a) tournament is delimited by tee balls - usually corresponding to a rectangular area of around 5 $m^{2}$ ): the grass is short and this is the only place where the golfer can use a tee (same name but referring to a small t-shaped piece of wood) to raise the ball from the ground (and ease the shot). The green is the area where the grass is very closely mown, and the cup and pin (a flagstick that makes the cup visible from a long distance) are placed. The fairway is the part of the hole between the tee and the green where the grass is kept short and shows the (intended) path to the hole. The rough is an area of higher grass around the fairway. Usually, the further you get from the fairway, the higher the grass. There are usually different types of roughs that vary in grass height and density, namely light or heavy roughs. The bunkers are hollows filled with sand that serve as “traps” from which it might be difficult to escape (depending on the depth and texture of the sand). The water hazards are typically ponds or other bodies of water where the player cannot usually play (unless very shallow or the ball lies on the shore) with a 1-shot penalty to get out (usually close to the point of entrance). Out-of-bounds is an area where it is forbidden to play: the player receives a 1-shot penalty if shooting a ball out-of-bounds and has to place the ball in the previous position. There are other “obstacles” (formally speaking, obstacles refer to water or bunkers only in the game of golf) such as bushes or trees. See again Fig. 1 for an illustration of the different areas.

2 Modelling and building representative PGA players’ skills from the Shotlink^™ data

In golf, like in many skill games, the result of a shot might differ from the intention. There are several elements that may influence the deviation of a shot from the intended target: subtle differences in golf swing that can result in different launch parameters (speed, spin, angle, etc.), weather conditions (wind, humidity, etc.) and lie conditions (fairway, rough, bunker, uphill, downhill, buried ball, ball in a divot, etc.). At the strategic level, one need not take into account all these aspects. Some can be taken into account at the operational level when playing the game. We focus here on the variation of launch parameters and on different initial surfaces (fairway, bunker, rough). To understand the effect of different launch conditions from the same surface (fairway), we present in Figure 2 data collected for an elite amateur golfer using Trackman^™.

A Trackman^™ profile provides, for certain explicit targets, and given the type of surface, an empirical distribution of the realization of the shots for the player (note that this does not take into account the roll of the ball and this assumes 2D trajectories). These profiles are what we consider in the remainder of the manuscript. More formally, we assume that the position of the balls in 2D, around a targeted point $(0,d)$ , with $d\in[0,D_{{max}}]$ , and from a surface $s$ (fairway, bunker, rough) is a random variable $X_{d,s}$ that can be described by a probability density function in 2D ( $D_{{max}}$ is the maximum distance a player can target). Precisely, we consider the sample space $\Omega:=\{(x,y)\in\mathbb{R}^{2}\}$ and we assume that $X_{d,s}$ follows a probability density function $f_{d,s}:(x,y)\in\Omega\mapsto f_{d,s}(x,y)\in\mathbb{R}_{+}$ for any $d\in[0,D_{{max}}]$ and for any $s\in\{fairway,rough,bunker\}$ . The Trackman^™ profile from Figure 2 hence represents samples from the random variables associated with the 5 distances targeted, from the fairway. In this work, we do not impose specific distributional forms (e.g., normal distributions) for the functions $f_{d,s}$ . Instead, we rely on empirical distributions derived from observed data, assuming these are representative samples of the corresponding random variables across specific distances and surfaces.

These empirical distributions can be represented by a Trackman^™ profile. Hence in the following, we will consider that skills of the player, outside of the green, are given to us through a Trackman^™ profile.

Collecting accurate Trackman^™ profiles of a player is nearly impossible, as it requires the player to hit thousands of balls from the same surface with different target distances. Instead, we will infer approximate Trackman^™ profiles by exploiting the Shotlink^™ database and common strategies of PGA tour players. The Shotlink^™ database collects the positions of the ball (3D coordinates) of PGA Tour golfers since 2004 in every PGA Tour competition (as well as other information). Recovering Trackman^™ profiles from such data is not straightforward, as we have no information about the player’s intention: knowing the final destination does not help in assessing the deviation from the intended target. Moreover, we do not have information on the carry and roll of the ball. However, there are some invariants in professional game plans, and we build upon common strategies that professionals use to infer the intention and the ball’s trajectory from the database. Next, we explain the core idea of our approach. It is important to note that our goal is not to replicate exact Trackman profiles of the players, but rather to create profiles that are sufficiently realistic to effectively demonstrate our methodology on data representative of PGA tour players.

First, professionals tend to target the pin whenever possible and not too risky, which is the case for reasonably short shots - say less than 150m (in reality, they tend to aim slightly off the pin if there is an obvious obstacle close to the pin (e.g. a water hazard or bunker), or if the green is not flat, as they generally prefer uphill to downhill putts, but we omit such details). For longer shots, they might play it a bit safer, aiming between the pin and the middle of the green, but since we do not have information about the geometry of the green (and the position of the pin with respect to this geometry), we also assume in this case that they target the pin again. Second, professional golfers tend to choose a general strategy for each of the tee shots (on par-4 or par-5) before the first day of the tournament (in the training rounds on the previous days). So unless there are very different conditions between two rounds (e.g. tee positions, weather conditions), we can reasonably assume that they aim at the same target.

As mentioned, we do not have information about the carry and roll of the ball (nor on the potential lateral spin). We therefore assume for simplicity - and lack of data - that the trajectories are straight and that the final endpoint is independent of the carry/roll trajectory, so that the empirical distribution of a shot around the target only depends on the distance and on the lie of the ball. We believe that the first assumption is not very restrictive, as most shots played by professional golfers are straight (some players may have a slight preference for a certain lateral spin - fade or draw - but this is usually very subtle). For the second hypothesis, the main shortcoming might derive from very short game situations - say below 30m - where different trajectories can be chosen (with a preference usually for rolling the ball on the green as much as possible, but where a “lob shot” is needed when there is an “obstacle” (say a bunker) close to the pin and in between the ball and the hole). In such situations, we implicitly assume that the distribution of the shots around the target does not depend on the type of shot played, which is certainly a limitation. For other shots, the ball tends to roll more on the fairway and the green than in the rough when coming from a long distance (but not much when at a distance of less than 150m). However, we believe that this effect is limited.

So, in concrete terms, we build the Trackman^™ profiles from the trace of the shots in Shotlink^™. The position of the ball is stored as a 3D coordinate $(x,y,z)$ in the database, where $x$ and $y$ are essentially the longitude and latitude of the ball (possibly expressed in a different coordinate system) and $z$ is the elevation. We assume here for the exposition that the coordinates are expressed in meters. We get rid of the $z$ coordinate as we assume, in this project, that the trajectories are flat, for computational efficiency. We proceed as follows. Assume that $(x_{0},y_{0})$ is the position of the ball before the shot and that the ball lies on the fairway, $(x_{1},y_{1})$ is the position of the ball after the shot, and $(x^{P},y^{P})$ is the position of the pin. Let $d=\|(x_{0}-x^{P},y_{0}-y^{P})\|_{2}=\sqrt{(x_{0}-x^{P})^{2}+(y_{0}-y^{P})^{2}}$ and let $M$ be the (transpose of the) rotation matrix that maps $(x^{P}-x_{0},y^{P}-y_{0})$ to $(0,d)$ . We assume that $M\cdot(x_{1}-x_{0},y_{1}-y_{0})$ is drawn from $X_{d,s}$ (with $s=fairway$ ) and thus we can build proxies of Trackman^™ profiles using this procedure by using all data from a player in the database (we restrict to data from the last 12 months). For tee shots, we assume, as explained above, that the player targets the average position over the 4 rounds that he played (if he played only two rounds, we discard the corresponding data - we also check that the tee positions are within a 5 meter radius over the 4 rounds) and we apply a similar procedure.
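As an illustration of the change of frame described above, the following Python sketch expresses one recorded shot in the coordinate system where the pin lies at $(0,d)$ . It is only a minimal sketch under our own conventions: the function name `to_target_frame` and the tuple-based inputs are hypothetical and are not taken from the companion code.

```python
import math

def to_target_frame(start, end, pin):
    """Express a shot in the frame where the pin lies at (0, d).

    start, end, pin are (x, y) tuples in meters (hypothetical input format).
    Returns d and the rotated displacement (lateral, along).
    """
    dx, dy = pin[0] - start[0], pin[1] - start[1]
    d = math.hypot(dx, dy)
    if d == 0.0:
        raise ValueError("start and pin coincide")
    ux, uy = dx / d, dy / d                      # unit vector towards the pin
    vx, vy = end[0] - start[0], end[1] - start[1]
    # Rotation [[uy, -ux], [ux, uy]] sends (ux, uy) to (0, 1).
    lateral = uy * vx - ux * vy                  # > 0 means right of the target line
    along = ux * vx + uy * vy                    # distance travelled towards the pin
    return d, (lateral, along)
```

For a shot played from $(0,0)$ towards a pin at $(100,0)$ that finishes at $(90,5)$ , the sketch returns $d=100$ and the rotated point $(-5,90)$ , that is, 10m short and 5m left of the target.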

We have inferred the Trackman^™ profiles from this procedure. The Fig. 3 “original” panel shows the outcome for Phil Mickelson on fairway shots. Two things clearly emerge from the Fig. 3 “original” panel (or Fig. 5 and Fig. 7 for shots played from the bunker and the rough). First, we do not have historical data for every distance. Second, there are pairs that look suspicious: the pair with a target point at a distance of roughly 370m and final destination at 250m is probably an outlier. Indeed in this case, it is fairly obvious that the player did not try to target the pin (way too far to be within reach), but probably played a safe shot on the fairway between his position and the target. One way of detecting this kind of outlier is to put a cap on the maximum distance error that a player could make in principle. Indeed, professional players are known to be fairly accurate in terms of distance control (unless there are very particular conditions, e.g. hard ground, wind blowing suddenly, hit of a tree, etc.). We used the Trackman^™ data for PGA Tour professionals taken from https://blog.trackmangolf.com/category/tour-stats (see Fig. 3 (left)) to set the cap, as we explain below.

We chose to keep all wedge shots (that is, shots under 100m), as in this case, the assumption of targeting the pin makes perfect sense (there might still be outliers, for instance, if the player’s ball ends up in a water hazard and the player drops it close to the entry point, then the observed deviation does not correspond to the true one), while we cap the maximum distance error to 20m for shots over 100m (the maximum in the PGA Tour Trackman^™ data available to us is 15m, see Fig. 3 (left)). The results are presented in the “cleaned” panel of Fig. 3 . This choice is consistent with statistics from the literature (see Fig. 3 in [30]), and we validated this number with two trainers of the French U21 elite amateur team using data from their elite amateurs. Fig. 4 shows the results for 24 PGA Tour professionals.
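The filtering rule for fairway shots can be sketched as follows in Python. This is a hedged illustration only: the function name `is_kept` is hypothetical, we assume that the distance error is the gap between the targeted distance and the distance actually travelled along the target line, and we treat a target of exactly 100m as a capped shot.

```python
def is_kept(target_distance, along, wedge_limit=100.0, max_error=20.0):
    """Return True if a fairway sample passes the distance-error filter.

    target_distance: distance to the assumed target (m).
    along: distance travelled along the target line (m).
    """
    if target_distance < wedge_limit:
        return True                      # wedge shots are all kept
    return abs(along - target_distance) <= max_error
```

With these defaults, the suspicious pair discussed above (target at roughly 370m, destination at 250m) is discarded. Calling the function with `max_error=30.0` corresponds to the looser threshold used for the other surfaces and for tee shots.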

Although rigorous statistical validation is not possible (since we do not have the true Trackman^™ profiles for the corresponding players), the figures are fairly consistent with the limited Trackman^™ data we have (see Fig. 3 (left)). However, there are still some obvious outliers (e.g. Grace and Kisner have shots that end up behind them - perhaps due to hitting a tree), but we are not too concerned about these few remaining cases, as the effect of these outliers will be smoothened in the next preprocessing phase.

The task is somewhat harder from the rough, as there are many types from light to heavy, and the lie of the ball might vary substantially and might have a strong impact on the outcome: it could be more difficult to hit a buried ball in a light rough than a ball lying on the surface of a heavy rough. As we do not have access to such information, we collected data from the rough without differentiating these situations. Of course, this is a limitation. For similar reasons as the fairway shots, we can remove outliers when the distance error is too large. We do not have Trackman^™ data of PGA professionals from the rough, so we need to set the thresholds somewhat more arbitrarily. The distance control error might be more important in the rough. The accuracy error on fairway and rough shots was investigated in [30]. Based on the statistics reported in Fig. 2 of the corresponding manuscript, we set the threshold on distance control error for outlier detection at 30m. We again validated this number with two of the trainers of the French U21 elite amateur team using data from their elite amateurs. The results are shown in Fig. 5 and Fig. 6.

For the bunkers, we apply the exact same strategy as for the rough. The results are shown in Fig. 7 and Fig. 8. We have very scarce data for distances over 40-50 meters, which is not surprising since there are many more bunkers around the greens than so-called fairway bunkers. We will consider how to interpolate the missing data. While interpolation has limitations (especially where we have very few points), it has minimal consequences, as few shots are played from fairway bunkers.

We now focus on the driving data off the tee. The results are shown in Fig. 9 and Fig. 10. Here again we applied a threshold of 30m to the distance control error for outlier detection.

To create complete Trackman profiles, we would need empirical distributions for any possible distance. We use a form of bootstrapping to generate missing data. Observe that this procedure would be needed even if we had true Trackman profiles from the beginning, such as the one in Fig. 2, as these contain only a limited number of targeted distances and sample points. The main idea is to use a local linear approximation. For any distance $d$ we might target, the basic idea is to grow a disk around $(0,d)$ until it contains enough sample target-destination pairs (from the inferred data), and scale the coordinates of the arrival points by the ratio of $d$ to the original targeted distance (that is, assuming a linear relation: if the targeted point was $(0,t)$ (within the disk) and the arrival point $(x,y)$ , we assume that for the hypothetical target $(0,d)$ this would have resulted in the arrival point $\frac{d}{t}\cdot(x,y)$ ). The radius of the disk is defined so as to grab enough data to be statistically relevant, but not too many when this includes points that are too remote - this is the case when we have fewer data available, such as for fairway bunkers, for instance: we set the number of points to 50 and put a cap of 30 meters on the radius unless we have fewer than 10 points available (in which case, we take the closest 10 points). We have also assumed that the distribution is symmetric along the y-axis, and that the lateral error and the distance control error are independent. We used these hypotheses to “shuffle” the x and y coordinates of the arrival points and avoid too much bias toward the existing data points. Additionally, we used the 95th percentile of the maximum observed target distance as the maximum targetable distance of a surface, and capped the maximum distance that the ball might reach by the maximum observed distance. The results are shown in Fig. 11, Fig. 12, Fig. 13, Fig. 14 and Fig. 15.
In all the figures, we have restricted the target distances to multiples of 2.5m, and generated 15 realizations for each distance. Note that for consistency, we ensured that the average lateral dispersion would increase with distance. Hence, we slightly rescaled the lateral dispersion by the inverse of the ratio with the average of the previous distance when this was not the case. Furthermore, we paid attention to the fact that the average lateral deviation from the rough and from the bunker (for a given distance) could not be less than that from the fairway to compensate for missing data that could bias the results. Although there are still a few inconsistent points, the results appear fairly clean. While the parameters used above could be fine-tuned, we are satisfied with the results from the current settings (see Table 1 and Table 2 and the associated discussion) as, again, our objective is to construct realistic PGA Tour player profiles, not to create exact virtual clones.
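The resampling step can be sketched in Python as follows. This is a simplified illustration rather than the procedure used to produce the figures: the name `resample_profile`, its parameters and the input format are hypothetical, the "disk" is reduced to a window on the targeted distance, and the monotonicity and surface-consistency corrections described above are omitted.

```python
import random

def resample_profile(pairs, d, n_target=50, max_radius=30.0, n_min=10,
                     n_out=15, rng=None):
    """Generate n_out arrival points for a hypothetical target (0, d).

    pairs: list of (t, (x, y)), with t > 0 the original targeted distance
    and (x, y) the arrival point in the target frame (assumed format).
    """
    if not pairs:
        raise ValueError("no data available")
    rng = rng or random.Random(0)
    # Rank observations by how close their target distance is to d.
    ranked = sorted(pairs, key=lambda p: abs(p[0] - d))
    within = [p for p in ranked if abs(p[0] - d) <= max_radius]
    # At most n_target points within the capped radius; if too few are
    # available there, fall back to the n_min closest points overall.
    chosen = within[:n_target] if len(within) >= n_min else ranked[:n_min]
    # Local linear scaling: (x, y) observed for target t becomes (d/t)(x, y).
    scaled = [(d / t * x, d / t * y) for t, (x, y) in chosen]
    xs = [x for x, _ in scaled]
    ys = [y for _, y in scaled]
    out = []
    for _ in range(n_out):
        # Independence: draw x and y separately; symmetry: random sign on x.
        x = rng.choice(xs) * rng.choice((-1.0, 1.0))
        y = rng.choice(ys)
        out.append((x, y))
    return out
```

Drawing the lateral and distance components separately is one possible reading of the “shuffle” mentioned above; other schemes (for instance a random permutation of one coordinate) would respect the same two hypotheses.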

The last skills left to consider are putting skills. On a green, a professional PGA Tour golfer typically makes 1, 2, or 3 putts (perhaps 4 or 5 putts in exceptional situations, but this represents less than 3 cases out of 10000 in our data, so we assume that only these three situations can occur). As we do not have details about the slopes of the greens, we assume that the average number of putts is simply a function of the distance to the hole, that is, it follows a certain function $p:d\in\mathbb{R}\mapsto p(d)\in\mathbb{R}$ . In order to evaluate this function, we estimated the probability of 1-putt, 2-putts, and 3-putts for any possible distance on a green. We focus now on the 1-putt and 3-putts probabilities, as the 2-putts probability is easily computed from the other two.

The longest possible distance on a green (that is, the diameter of the geometrical object) is usually no more than 25m, especially in the US (the old course in Saint-Andrews (UK) is an exception with a diameter of some greens reaching 50m). We thus consider distance to hole below 32m (above this value, we have very little data: about 1 per 16,000). To build the histogram, we need enough data for each distance. Unfortunately, we have more data close to the pin than far from it (as players usually get closer each time they play without putting the ball into the hole). For individual players, we thus inferred the probability from the data available below 16m. We used buckets of doubling size (to have enough data within each bucket for statistical relevance - more than 30 points typically) with the following “breakpoints” (in meters): (0, 0.5, 1, 2, 4, 8, 16) and assigned the value to the midpoint of the interval. So, in concrete terms, for a given player, we collected all data regarding putts made from a distance in a given range, say $[1,2]$ , and we estimated the 1-putt (2-putt, 3-putt resp.) probability as the frequency of 1-putt (2-putt, 3-putt resp.). We then assumed that all putts were played from the mid-distance, that is 1.5 m here. For distances between 16m and 32m, due to the lack of data for individual players, we aggregated all data from all players to estimate the corresponding probabilities. We then used a linear interpolation to build a proxy of the function $p$ for any player.
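The bucketing and interpolation can be illustrated with the Python sketch below. It covers only the individual-player range (below 16m) and rests on assumptions of ours: the names `putt_table` and `expected_putts` are hypothetical, buckets are taken half-open, and the interpolated value is held constant outside the range of the bucket midpoints.

```python
from bisect import bisect_right

BREAKS = [0.0, 0.5, 1.0, 2.0, 4.0, 8.0, 16.0]   # breakpoints in meters

def putt_table(putts):
    """putts: list of (distance_m, n_putts) with n_putts in {1, 2, 3}.

    Returns [(midpoint, (p1, p2, p3)), ...] for the non-empty buckets.
    """
    counts = [[0, 0, 0] for _ in range(len(BREAKS) - 1)]
    for dist, n in putts:
        if 0.0 <= dist < BREAKS[-1] and n in (1, 2, 3):
            counts[bisect_right(BREAKS, dist) - 1][n - 1] += 1
    table = []
    for b, c in enumerate(counts):
        total = sum(c)
        if total:
            mid = 0.5 * (BREAKS[b] + BREAKS[b + 1])
            table.append((mid, tuple(k / total for k in c)))
    return table

def expected_putts(table, d):
    """Linear interpolation of the average number of putts at distance d."""
    pts = [(mid, p1 + 2 * p2 + 3 * p3) for mid, (p1, p2, p3) in table]
    if d <= pts[0][0]:
        return pts[0][1]          # assumption: constant before first midpoint
    if d >= pts[-1][0]:
        return pts[-1][1]         # assumption: constant after last midpoint
    for (d0, v0), (d1, v1) in zip(pts, pts[1:]):
        if d0 <= d <= d1:
            return v0 + (v1 - v0) * (d - d0) / (d1 - d0)
```

In practice each bucket would hold at least 30 observations, as stated above; the sketch does not enforce this threshold.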

The results are presented in Fig. 17. If we aggregate all the data from professional PGA Tour players, we obtain the result in Fig. 16, which is in line with the literature (see Fig. 1 in [30]). We also compare the average number of putts as a function of the distance for different PGA Tour players. The results are shown in Fig. 18 (of course, our estimations could probably be improved and smoothened, but we believe that using more advanced approaches is not relevant at this stage, as there are other simplifications in our models that probably dominate this one).

We preprocessed the data with R, and the corresponding code is available, upon request, in a companion zip file for the replicability of our results.

3 Modelling the game of golf

In order to optimize a golfer’s strategy with a Markov Decision Process, we need a model to predict the (stochastic) outcome of any “action” we may use. In our case, as will become clear later, actions correspond to shots and we need to specify the result of any (possible) shot.

We explained in the previous section how we inferred reasonable Trackman^™ profiles of players on each surface and for each distance given the “trace” of the players on the different tournaments available in the database. These profiles are 2D and we assume in the following that they represent the projections of the flight of the ball: we thus implicitly assume that the ball does not roll and that the trajectories are straight. Now we use a 2D representation of the golf course as well, using stylized 2D pictures of the holes and a clear encoding of each surface and obstacle similar to Fig. 1 (bushes/trees in dark green, roughs in green, fairways in light green, greens in yellow green, and bunkers in egg nog). We actually created (manually) 2D rasters of the holes of three golf courses using aerial views from Google^™ Maps: Augusta National Golf Club (that hosts one of the four major tournaments: the Masters), Le Golf National (that hosted the Ryder Cup competition in 2018, a very famous biennial competition between male teams from Europe and the United States) and the Bay Hill Club and Lodge, Orlando (that hosts the Arnold Palmer Invitational). We chose the resolution so that 1 cell roughly represents a square region of side 1m (typically between 0.7m and 1.5m, depending on the hole).

For a position $(x,y)$ on a surface $s$ (actually a cell in the raster), a shot is essentially a selection of a target point $(x^{t},y^{t})$ . This target point can be characterized by a distance $d$ and a direction/angle. Let $M$ be the (transpose of the) rotation matrix associated with the corresponding angle. The target point is $(x,y)+M\cdot(0,d)$ . Now in order to estimate the distribution of the outcome of the shot, we consider $k$ realizations sampled from $X_{d,s}$ . Let $(x_{1},y_{1}),...,(x_{k},y_{k})$ be the corresponding samples. In the absence of trees, water hazards and out-of-bounds areas, the distribution of the outcome could be approximated by the sample $(x^{\prime}_{1},y^{\prime}_{1}),...,(x^{\prime}_{k},y^{\prime}_{k})$ , where $(x^{\prime}_{i},y^{\prime}_{i})=(x,y)+M\cdot(x_{i},y_{i})$ for all $i=1,...,k$ . We call this the hypothetical empirical distribution.
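A minimal Python sketch of the hypothetical empirical distribution is given below. The function name `place_samples` is hypothetical, and we assume, for the purpose of the illustration, that the shot direction is given as an angle measured counterclockwise from the x-axis of the raster.

```python
import math

def place_samples(position, angle, samples):
    """Map profile samples (target frame, target at (0, d)) onto the course.

    position: (x, y) of the ball; angle: direction of the target in radians,
    counterclockwise from the +x axis (our convention); samples: list of
    (x_i, y_i) drawn from the profile for the chosen distance and surface.
    """
    ux, uy = math.cos(angle), math.sin(angle)    # unit vector towards target
    placed = []
    for sx, sy in samples:
        # The rotation sends (0, 1) to (ux, uy) and (1, 0) to (uy, -ux).
        px = position[0] + sx * uy + sy * ux
        py = position[1] - sx * ux + sy * uy
        placed.append((px, py))
    return placed
```

For a ball at $(10,20)$ and a shot aimed along the x-axis, the sample $(5,90)$ (5m right of the line, 90m long) is placed at $(100,15)$ .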

In the presence of trees (and the like) in the ball’s trajectory, we assume that the ball will stop on the trajectory right before the first obstacle it encounters. That is, the ball will hit the obstacle and it will neither “bounce off” nor penetrate the obstacle, it will simply stop (consider a collision with a dense tree, like fir). More precisely, we assume that trees are infinitely high, and that the ball simply stops right before the contact point. When the ball falls into a water hazard, we assume that the player will “drop” the ball at the entry point (which is the most common option out of the different possible options for a golfer in such situation). When the ball ends up out-of-bounds, we (re)position the ball at the origin of the shot (with a 1-shot penalty according to the rules of golf).

Technically speaking, we apply the following procedure to identify the destination cell. We use Bresenham’s algorithm ([8]) to identify an ordered set of cells from the raster/picture that are traversed by the trajectory. Suppose we consider the trajectory from $(x,y)$ to $(x^{\prime}_{1},y^{\prime}_{1})$ . Let $\mathcal{C}=\{c_{1},...,c_{l}\}$ be the ordered set of cells returned by Bresenham’s algorithm. Let $i$ be the index of the first tree cell in $\mathcal{C}$ ( $i=l+1$ if there is none). We first truncate the trajectory to $\mathcal{C}^{\prime}=\{c_{1},...,c_{i-1}\}$ (observe that we do not allow a player to play from a tree so $i>1$ ). Then, if $c_{i-1}$ is a water cell, we let $j$ be the largest index such that $c_{j}$ is not a water cell (again here we do not allow a golfer to play from a water hazard so $i-1>j>1$ ) and we truncate the trajectory to $\mathcal{C}^{\prime\prime}=\{c_{1},...,c_{j-1}\}$ . Finally if $c_{l}$ is an out-of-bound cell, we truncate the trajectory to $\mathcal{C}^{\prime\prime}=\{c_{1}\}$ . An illustrative (non-realistic) example of a simulation is given in Fig. 19.
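The truncation procedure can be sketched as follows. This Python fragment is an illustration, not the C++ implementation: the dictionary-based raster, the labels and the name `landing_cell` are hypothetical, cells missing from the raster are treated as out-of-bounds, the ball is dropped on the last non-water cell before the water, and the out-of-bounds test is applied to the cell where the (possibly tree-truncated) flight ends. The last two choices are our reading of the rules above.

```python
TREE, WATER, OOB = "tree", "water", "oob"

def bresenham(x0, y0, x1, y1):
    """Ordered list of integer cells on the segment (x0, y0)-(x1, y1)."""
    cells = []
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy
    while True:
        cells.append((x0, y0))
        if x0 == x1 and y0 == y1:
            return cells
        e2 = 2 * err
        if e2 >= dy:
            err += dy
            x0 += sx
        if e2 <= dx:
            err += dx
            y0 += sy

def landing_cell(surface, start, end):
    """Return (cell, penalty) for a shot from start aimed to land on end.

    surface maps (col, row) to a label; start must be a playable cell
    (neither tree nor water), as in the text.
    """
    path = bresenham(start[0], start[1], end[0], end[1])
    for i, cell in enumerate(path):
        if surface.get(cell, OOB) == TREE:
            path = path[:i]              # stop right before the first tree
            break
    kind = surface.get(path[-1], OOB)
    if kind == OOB:
        return start, 1                  # replay from the origin, 1 penalty
    if kind == WATER:
        j = len(path) - 1
        while surface.get(path[j], OOB) == WATER:
            j -= 1                       # walk back to the entry point
        return path[j], 1                # drop there, 1 penalty
    return path[-1], 0
```

Storing the raster as a flat array indexed by cell would be the natural choice in a performance-oriented implementation; the dictionary is used here only to keep the example short.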

We have implemented the corresponding model in C++ for better performance (as will become clear later, we need to call this model a billion times just to create the stochastic shortest path model). The corresponding C++ code is available, upon request, in a companion zip file for the replicability of our results.

4 The optimization model

Before explaining the model, we start with a brief introduction of the stochastic shortest path problem, following [21].

The stochastic shortest path (SSP) problem is a Markov decision process (MDP) that generalizes the classic deterministic shortest path problem. We want to control an agent who evolves dynamically in a system composed of different states, so as to converge to a predefined target. The agent is controlled by taking actions in each time period (we focus here on discrete time (infinite) horizon problems): actions are associated with costs, and transitions in the system are governed by probability distributions that depend exclusively on the previous action taken, and are thus independent of the past. We restrict to finite state/action spaces: the goal is to choose an action for each state, i.e. a deterministic and stationary policy, that reaches the target state with probability one (such a policy is called proper), so as to minimize the total expected cost incurred by the agent before reaching the (absorbing) target state when starting from a given initial state. The problem is well-defined when there is a way to reach the target from any state and when there is no improper policy that allows accumulating an infinitely negative cost [21].

More formally, a stochastic shortest path instance is defined by a tuple $(\mathcal{S},\mathcal{A},J,P,c)$ where ${\mathcal{S}}=\{0,1,\ldots,n\}$ is a finite set of states, ${\mathcal{A}}=\{0,1,\ldots,m\}$ is a finite set of actions, $J$ is a 0/1 matrix with $m$ rows and $n$ columns and general term $J(a,s)$ , for all $a\in\{1,...,m\}$ and $s\in\{1,...,n\}$ , with $J(a,s)=1$ if and only if action $a$ is available in state $s$ , $P$ is a row substochastic matrix (a row substochastic matrix is a matrix with nonnegative entries so that every row adds up to at most $1$ . Observe that it is not a usual stochastic matrix as state $0$ and action $0$ are left out) with $m$ rows and $n$ columns and general term $P(a,s):=p(s|a)$ (probability of ending in $s$ when taking action $a$ ), for all $a\in\{1,...,m\}$ , $s\in\{1,...,n\}$ , and a cost vector $c\in{\mathbb{R}}^{m}$ . The state $0$ is called the target state and the action $0$ is the unique action available in that state. Action $0$ leads to state $0$ with probability $1$ . We denote with ${\mathcal{A}}(s)$ the set of actions available from $s\in\{1,...,n\}$ and assume without loss of generality ( If not, we simply duplicate the actions) that for all $a\in\mathcal{A}$ , there exists a unique $s$ , such that $a\in\mathcal{A}(s)$ . We denote with ${\mathcal{A}}^{-1}(s)$ the set of actions that lead to $s$ , i.e. ${\mathcal{A}}^{-1}(s):=\{a:P(a,s)>0\}$ .

A (deterministic and stationary) policy $\Pi$ is a function $\Pi:s\in{\mathcal{S}}\mapsto{\mathcal{A}}(s)$ , that is, it assigns an action for each possible state. Let $y^{\Pi}_{k}\in{\mathbb{R}}_{+}^{n}$ be the substochastic (in general, not a purely stochastic vector, as state $0$ is left out.) vector representing the state of the system in period $k$ when following policy $\Pi$ (from an initial distribution $y^{\Pi}_{0}$ ). That is, $y^{\Pi}_{k}(s)$ is the probability of being in state $s$ , for all $s=1,...,n$ at time $k$ following policy $\Pi$ . Similarly, we denote with $x_{k}^{\Pi}\in\mathbb{R}_{+}^{m}$ the substochastic (in general, not a purely stochastic vector, as action $0$ is left out) vector representing the probability of performing action $a$ , for all $a=1,...,m$ , at time $k$ following policy $\Pi$ . Given a policy $\Pi$ and an initial distribution $y^{\Pi}_{0}$ at time $0$ , by the law of total probability (and because each action is available in exactly one state), we have $x_{k}^{\Pi}=\Pi^{T}\cdot y^{\Pi}_{k}$ for all $k\geq 0$ .

Given a state $s\in\{1,...,n\}$ , a policy $\Pi$ is said to be $s$ -proper if $\sum_{k\geq 0}x_{k}^{\Pi}$ is finite, when $y^{\Pi}_{0}:=e_{s}$ . Observe that $\sum_{k\geq 0}y_{k}^{\Pi}$ is also finite for s-proper policies (as $y_{k}^{\Pi}=P^{T}x_{k-1}^{\Pi}$ ). In particular, $\lim_{k\rightarrow+\infty}y^{\Pi}_{k}=0$ , and thus the policy leads to the target state $0$ with probability $1$ from state $s$ . An $s$ -proper policy is thus a policy that converges to the target with probability one, and whose expected number of visits in each action is finite. The expected cost of such policy is thus the well-defined value $c^{T}\sum_{k\geq 0}x_{k}^{\Pi}$ . The $s$ -stochastic-shortest-path problem ( $s$ -SSP for short) is the problem of finding an $s$ -proper policy $\Pi$ of minimal cost $c^{T}\sum_{k\geq 0}x_{k}^{\Pi}$ .
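As an illustration of these definitions, the expected cost $c^{T}\sum_{k\geq 0}x_{k}^{\Pi}$ of a proper policy can be approximated by propagating the substochastic vector $y_{k}^{\Pi}$ until almost all the probability mass has reached the target. The sketch below is a toy Python version under stated assumptions: the dictionary-based data layout, the names `policy_cost`, `putt` and `approach`, and the numbers are all hypothetical.

```python
def policy_cost(P, c, policy, start, tol=1e-12, max_iter=100000):
    """Approximate c^T sum_k x_k for a proper policy.

    P[a]      : dict {non-target state: probability} (missing mass goes to the target)
    c[a]      : cost of action a
    policy[s] : action chosen in state s
    """
    y = {start: 1.0}          # y_0 = e_s
    total = 0.0
    for _ in range(max_iter):
        if sum(y.values()) < tol:   # remaining mass outside the target
            break
        nxt = {}
        for s, prob in y.items():
            a = policy[s]
            total += prob * c[a]    # contribution of x_k(a) = y_k(s)
            for s2, p in P[a].items():
                nxt[s2] = nxt.get(s2, 0.0) + prob * p   # y_{k+1} = P^T x_k
        y = nxt
    return total


# Toy instance: from state 1 the target is reached with probability 0.5 per attempt;
# from state 2 the system moves to state 1 with probability 1.
P = {"putt": {1: 0.5}, "approach": {1: 1.0}}
c = {"putt": 1.0, "approach": 1.0}
policy = {1: "putt", 2: "approach"}
cost_from_1 = policy_cost(P, c, policy, start=1)   # geometric series, close to 2
cost_from_2 = policy_cost(P, c, policy, start=2)   # close to 3
```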

As explained in the previous sections, we built (discrete) 2D models for holes and Trackman™ profiles, and a 2D simulator of ball trajectories (using Bresenham’s algorithm). With these elements, we can evaluate a player’s performance for any strategy (a strategy is the choice of shot for any position on the corresponding hole, that is, essentially a choice of direction and targeted distance, according to our 2D representations) on any hole. Indeed, given a strategy, we can build a Markov chain whose states are the possible positions on the hole (the pixels basically), and the transition matrix can be built from the choice of direction and targeted distance in any state as follows: take the different empirical realizations from the Trackman™ profile corresponding to the surface where the ball lies, and simulate the outcome of the different realizations on the corresponding hole (we have 15 realizations for each shot from the Trackman™ profiles generated in Section 2). This provides an empirical distribution over the state space and an expected cost for the corresponding “action” (1 if no penalty occurs). If the strategy is sound (that is, converging to the target from any position), the corresponding Markov chain is absorbing, and the expected number of steps before being absorbed by the cup can be easily evaluated by computing the fundamental matrix (see for instance [20]).
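The evaluation of a fixed strategy can be sketched as follows in Python. The example assumes a toy chain with two transient states and hypothetical outcome counts (15 realizations per shot, every shot costing 1). Instead of forming the fundamental matrix $N=(I-Q)^{-1}$ explicitly, it solves $(I-Q)t=\mathbf{1}$, which yields the same expected numbers of strokes $t=N\mathbf{1}$. The names `empirical_row` and `expected_strokes` are illustrative only.

```python
from collections import Counter


def empirical_row(outcomes):
    """Empirical distribution over landing states from simulated realizations."""
    n = len(outcomes)
    return {s: k / n for s, k in Counter(outcomes).items()}


def expected_strokes(Q):
    """Solve (I - Q) t = 1 by Gauss-Jordan elimination with partial pivoting."""
    n = len(Q)
    A = [[(1.0 if i == j else 0.0) - Q[i][j] for j in range(n)] + [1.0]
         for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        d = A[col][col]
        A[col] = [v / d for v in A[col]]
        for r in range(n):
            if r != col and A[r][col] != 0.0:
                f = A[r][col]
                A[r] = [vr - f * vc for vr, vc in zip(A[r], A[col])]
    return [A[i][n] for i in range(n)]


# Hypothetical outcomes of the 15 realizations of the chosen shot in each state.
states = ["fairway", "green"]                      # "cup" is the absorbing target
rows = {
    "fairway": empirical_row(["green"] * 12 + ["fairway"] * 3),
    "green": empirical_row(["green"] * 6 + ["cup"] * 9),
}
Q = [[rows[s].get(s2, 0.0) for s2 in states] for s in states]
strokes = dict(zip(states, expected_strokes(Q)))   # expected shots to hole out
```

For instances with thousands of states, a sparse linear solver would replace the dense elimination used here.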

We can go one step further and find the optimal strategy by building an absorbing Markov decision process (an SSP in fact) by simply adding all the possible actions available in each state to the Markov chain described above. That is, in this SSP model, the states are still the positions on the hole (again the pixels), the actions are the triplets (state, targeted distance, direction), and the (empirical) transition matrix and the costs are computed as explained above for the Markov chain model. We have restricted the set of possible directions to an angle (in radians) in $\{0,\frac{2\pi}{180},\ldots,179\cdot\frac{2\pi}{180}\}$ (with this discretization, the player has an aiming precision of at least $\approx$ 1.75m at a distance of 100m), and the targeted distance from the Trackman™ profiles is restricted to multiples of 2.5m (as discussed earlier).
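The aiming precision quoted above can be checked with a short computation. The sketch below takes one reading of that figure, namely that the worst-case lateral offset is half the arc between two adjacent directions; the helper names are hypothetical.

```python
import math

N_DIRECTIONS = 180
STEP = 2 * math.pi / N_DIRECTIONS                 # 2 degrees, in radians
directions = [k * STEP for k in range(N_DIRECTIONS)]


def max_lateral_offset(distance_m):
    """Worst-case offset (in meters) between a desired line and the nearest direction."""
    return distance_m * STEP / 2


def nearest_direction(angle):
    """Index of the discretized direction closest to `angle` (in radians)."""
    return round(angle / STEP) % N_DIRECTIONS


offset_100m = max_lateral_offset(100.0)           # about 1.745 m
```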

In our 2D representations of the hole, we ensured that only locations where the target could be reached were kept (for instance, no rough surrounded by trees or out-of-bounds only: in such case, we would redefine the corresponding zone as a tree area or as an out-of-bounds area), and hence our models are well-defined instances of SSP.

The corresponding instances have on the order of 10,000 states and 150 million actions. We implemented the value iteration algorithm in C++ (see [21] for details of the algorithm). Most of the time spent solving the SSP problem is actually devoted to creating the model (with more than a billion calls to the simulator and Bresenham’s algorithm, as we have 15 realizations to simulate for each of the 150 million actions). Although we have attempted to optimize our code as much as possible, optimizing the computational performance is not the main purpose of this study. Indeed, the computational performance would clearly improve by parallelizing the construction of the model. Instead, we again aim at showing that the models are computationally tractable, in order to stimulate further investigations.
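For readers unfamiliar with value iteration on an SSP, a minimal Python sketch is given below. It is not our C++ implementation: it uses in-place (Gauss-Seidel style) updates, a dictionary-based data layout, and a hypothetical three-state hole in which every shot costs 1 and any missing probability mass goes to the cup.

```python
def q_value(cost, trans, V):
    """Expected cost of an action: immediate cost plus expected cost-to-go."""
    return cost + sum(p * V[s2] for s2, p in trans.items())


def value_iteration(actions, tol=1e-9, max_iter=100000):
    """actions[s] = list of (cost, {next_state: prob}); the target has value 0."""
    V = {s: 0.0 for s in actions}
    for _ in range(max_iter):
        delta = 0.0
        for s, acts in actions.items():
            best = min(q_value(c, t, V) for c, t in acts)
            delta = max(delta, abs(best - V[s]))
            V[s] = best                      # in-place update
        if delta < tol:
            break
    policy = {s: min(range(len(acts)), key=lambda k: q_value(*acts[k], V))
              for s, acts in actions.items()}
    return V, policy


# Hypothetical hole: two candidate shots from the tee, one shot elsewhere.
hole = {
    "tee": [(1.0, {"fairway": 1.0}),                  # action 0: lay up
            (1.0, {"green": 0.5, "fairway": 0.5})],   # action 1: go for the green
    "fairway": [(1.0, {"green": 0.8, "fairway": 0.2})],
    "green": [(1.0, {"green": 0.4})],                 # holed with probability 0.6
}
V, policy = value_iteration(hole)
```

On this toy instance the second tee shot is selected, since it reaches the green directly half of the time at no extra cost.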

Computational experiments

We conducted our experiments on the SD530 nodes within the Curta platform, which consists of 336 nodes. Each node is powered by an Intel Xeon Gold SKL-6130 processor operating at 2.1 GHz, has 32 cores, and is equipped with 96 GB of RAM. More details on the hardware can be found at https://redmine.mcia.fr/projects/cluster-curta.

Our analysis focused on 119 golf players from the 2018 Arnold Palmer Invitational (out of 165 participants) for whom we had ShotLink data available to construct TrackMan profiles. We utilized all relevant data from 2017, as well as data from early 2018 leading up to the tournament in March. The average time required to build the model and optimize the strategy for a single hole was approximately 27 minutes, with a standard deviation of 4.5 minutes. The fastest and slowest times recorded were 12 minutes and 44 minutes, respectively. Note that the main reason for setting the number of realizations to 15 was memory limits: the minimum and maximum memory consumption was 67 and 77 GB respectively (and no more than 92 GB could be reserved on the nodes).

The bulk of the computational effort is devoted to model building (a couple of minutes is needed for value iteration typically - 115 seconds on average with a standard deviation of 42 seconds and a maximum of 304 seconds). The computation time for model creation is essentially multi-linear in the parameters that influence the number of simulations/Bresenham calculations. These parameters include the angle discretization, which affects the number $d$ of possible shot directions; the target distance discretization, which affects the number $t$ of target distances; and the number $r$ of realizations used to generate the bootstrapped Trackman profiles. Therefore, the empirical computation time is of the order $O(d\cdot t\cdot r)$ .

Significant improvements in computation time could be achieved through parallel processing. Since the outcomes of individual actions within the model are independent, parallelizing the model creation over $M$ machines could reduce the time to $O\left(\frac{d\cdot t\cdot r}{M}\right)$ . The value iteration algorithm could also be parallelized, although this would require careful organization due to the dependencies between computations. However, the focus of this paper is to demonstrate the feasibility of solving exactly the SSP model associated with golf strategy optimization, rather than exploring computational optimization through parallelization. We should also note that one could easily reduce the action space by filtering out dominated actions, e.g. aiming in the direction opposite to the pin is clearly suboptimal most of the time.
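The independence of actions means that the model can be built chunk by chunk. The following Python sketch only illustrates this partitioning: `simulate_action` is a hypothetical stand-in for the simulator, and Python threads are used for brevity even though a real speed-up would require separate processes, machines, or native threads in C++.

```python
from concurrent.futures import ThreadPoolExecutor


def simulate_action(action):
    """Hypothetical stand-in for the simulator: returns (action, outcome)."""
    state, distance, direction = action
    return action, (state + distance + direction) % 7


def build_model(actions, workers=4):
    """Split the independent actions into `workers` chunks and merge the results."""
    chunks = [actions[k::workers] for k in range(workers)]

    def run(chunk):
        return [simulate_action(a) for a in chunk]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(run, chunks)
    return dict(pair for part in parts for pair in part)


# Toy action set: (state, targeted distance index, direction index) triplets.
actions = [(s, t, d) for s in range(5) for t in range(4) for d in range(6)]
model = build_model(actions, workers=4)
```

Because no chunk depends on another, the merged model is identical whatever the number of workers.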

The corresponding C++ code and script to launch the code on the Curta platform (or similar) are available, upon request, in a companion zip file.

5 Representativeness of our virtual PGA tour players

We have simulated the performances of the 119 golf players in the Arnold Palmer Invitational to ensure they reasonably represent PGA Tour players, though they are not exact clones of each individual.

In Tables 1 and 2, we present standard golf metrics to compare the performances of these virtual players with their actual performances at the 2018 Arnold Palmer Invitational, held at Bay Hill Club and Lodge in Orlando, a prestigious PGA Tour event. To maintain clarity and avoid overloading the main presentation, the confidence intervals have been provided in the Appendix.

	FirstName	LastName	Score	Tee-shot	Fairway	L	R	GiR	Water	Bunker
1	Rory	McIlroy	269	266.504	0.696	0.107	0.196	0.736	0.000	0.153
2	Bryson	DeChambeau	273	260.902	0.768	0.125	0.107	0.764	0.000	0.125
3	Justin	Rose	274	263.270	0.786	0.054	0.161	0.750	0.028	0.139
4	Henrik	Stenson	275	253.695	0.839	0.054	0.107	0.806	0.000	0.139
5	Tiger	Woods	277	258.218	0.643	0.107	0.250	0.708	0.000	0.181
6	Ryan	Moore	278	254.739	0.839	0.018	0.143	0.806	0.028	0.250
7	Kevin	Chappell	280	266.802	0.732	0.125	0.143	0.806	0.042	0.153
8	Marc	Leishman	280	259.035	0.714	0.071	0.214	0.722	0.014	0.208
9	Patrick	Rodgers	280	257.997	0.589	0.161	0.250	0.722	0.014	0.250
10	Chris	Kirk	281	254.156	0.732	0.143	0.125	0.708	0.014	0.264

Table 1: Historical golf metrics for the top ten players from the 2018 Arnold Palmer Invitational. Score: total score over the 4 rounds; Tee-shot: average tee-shot distance in meters (on par 4 and par 5 only); Fairway: proportion of fairways hit with the tee-shot on par 4 and par 5; L: proportion of fairways missed on the left on par 4 and par 5; R: proportion of fairways missed on the right on par 4 and par 5; GiR: proportion of greens reached in at most par minus 2 shots; Water: proportion of water hazard penalties; Bunker: proportion of bunker shots.

	FirstName	LastName	Score	Tee-shot	Fairway	L	R	GiR	Water	Bunker
1	Rory	McIlroy	273.486	284.492	0.685	0.126	0.189	0.735	0.009	0.137
2	Bryson	DeChambeau	277.623	270.841	0.676	0.141	0.183	0.760	0.009	0.121
3	Justin	Rose	275.630	269.931	0.700	0.147	0.153	0.742	0.004	0.144
4	Henrik	Stenson	281.944	258.893	0.699	0.135	0.166	0.718	0.007	0.144
5	Tiger	Woods	269.085	263.598	0.765	0.088	0.147	0.795	0.005	0.099
6	Ryan	Moore	282.356	262.925	0.686	0.133	0.181	0.718	0.008	0.157
7	Kevin	Chappell	278.488	291.782	0.624	0.121	0.255	0.721	0.015	0.118
8	Marc	Leishman	274.260	277.578	0.702	0.136	0.162	0.770	0.009	0.101
9	Patrick	Rodgers	284.414	271.482	0.628	0.181	0.191	0.678	0.006	0.164
10	Chris	Kirk	289.085	243.917	0.647	0.135	0.217	0.693	0.013	0.119

Table 2: Simulated golf metrics for the top ten players from the 2018 Arnold Palmer Invitational (10000 simulations). Score: average score over the 4 rounds; Tee-shot: average tee-shot distance in meters (on par 4 and par 5 only); Fairway: proportion of fairways hit with the tee-shot on par 4 and par 5; L: proportion of fairways missed on the left on par 4 and par 5; R: proportion of fairways missed on the right on par 4 and par 5; GiR: proportion of greens reached in at most par minus 2 shots; Water: proportion of water hazard penalties; Bunker: proportion of bunker shots.

Despite some discrepancies, the constructed virtual players are representative of typical PGA Tour players, which is sufficient for demonstrating how our methodology can be leveraged to improve performance.

6 Leveraging the methodology for prioritizing training: the value of golf skills

While some statistics enable professional players to compare their performances with others (e.g., strokes gained [9, 10]), identifying the specific areas for improvement to maximize performance remains a challenge. Should a player focus on enhancing distance control, lateral dispersion, putting skills, or driving length? A significant advantage of our modeling approach is its flexibility in conducting interventions. This allows us to substitute certain skills with alternatives and compare the outcomes. For example, our approach can easily assess scenarios such as a player having Rory McIlroy’s exceptional driving length or Tiger Woods’ outstanding putting skills.

We have simulated the performances of our 119 representatives of the PGA Tour, under such interventions. First, we have adjusted their driving skills to match Rory McIlroy’s performance (we have substituted their driving stats with Rory McIlroy’s stats). Then, we have modified their putting skills to match Tiger Woods’ putting abilities. This allows us to assess the impact of these changes on their average scores. The results of these simulations are reported in Fig. 20 and Fig. 21.

Interestingly, the average gain per hole using Rory McIlroy’s driving skill is 0.139 (95% confidence interval: [0.126, 0.152]), whereas the average gain per hole using Tiger Woods’ putting skill is only 0.046 (95% confidence interval: [0.041, 0.050]). This seems to challenge the well-known maxim “Drive for show, putt for dough.” Although several authors have already critiqued this saying using statistical methods [2, 7], we believe that the answer is not universal but rather specific to both the player and the course, and our model offers a way to address this question. A detailed analysis of this matter is beyond the scope of the current paper.
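For reference, an average gain and its confidence interval can be obtained from per-player gains as sketched below in Python. This is only an illustration: it assumes a normal-approximation interval (mean plus or minus 1.96 standard errors), which is not necessarily the construction used for the figures above, and the gains in the example are made-up numbers, not our results.

```python
import math
import statistics


def mean_ci(values, z=1.96):
    """Sample mean with a normal-approximation 95% confidence interval."""
    m = statistics.mean(values)
    half_width = z * statistics.stdev(values) / math.sqrt(len(values))
    return m, m - half_width, m + half_width


# Hypothetical per-player gains (strokes per hole) under an intervention.
gains = [0.10, 0.14, 0.18]
avg, lower, upper = mean_ci(gains)
```

Note that a symmetric interval of this kind can extend below zero when applied to proportions of rare events.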

Our approach, combined with accurate Trackman profiles of players, makes it possible to quantify the value of certain hypothetical skills (for a specific player on a specific course), as we have illustrated here with Rory McIlroy’s driving skills and Tiger Woods’ putting skills. We believe that this could provide an invaluable tool to prioritize training for professional players and elite amateurs, but also for weekend players. Indeed, one could use the same approach to quantify the impact of increasing a player’s driving length by 10m or reducing their driving lateral dispersion by 10%. This could also be applied to other skills like long shots, wedging, chipping and putting, under various scenarios. Depending on the effort needed to reach the corresponding improvement and the expected benefit on the average score, the player would have a rational way of prioritizing, with his or her training team, between different training strategies.

7 Conclusion and perspectives

The primary aim of this study was to demonstrate the computational feasibility of our methodology, designed to optimize professional golfers’ performance on the PGA Tour using data from ShotLink™. We have also explored how it could assist golfers in making training decisions and challenge conventional wisdom in golf, such as the “Drive for show, putt for dough” saying.

Our models could also guide golf course design or redesign. By simulating different course layouts (for instance, adding new obstacles or repositioning tees and pins) with different PGA Tour players, course architects could tailor courses for greater competitive challenge or enhanced spectacle. Moreover, one could use our methodology to rank golf courses based on the collective performance of a representative set of players, providing an additional intriguing application.

Finally, our method would likely gain from integrating a 3D simulator instead of the current 2D model, and utilizing the 3D trajectories provided by ShotLink™. While this would increase the computational demands, the rise would not be excessive. Indeed, 3D adaptations of Bresenham’s algorithm are available [4], and parallelizing the code could help manage computing times effectively.

8 Acknowledgement

We would like to thank the PGA Tour for giving us access to the ShotLink™ data. We would also like to thank Renaud Gris and Jason Belot (French U21 Elite amateur trainers) for the constructive discussions of our models and for providing data on some of their elite amateurs. At the time of writing, the ShotLink Intelligence program has been discontinued, and ShotLink data is no longer publicly available to academics. However, we have received authorization from the PGA Tour to make our data accessible for replication and validation by other researchers. We would like to express our gratitude to the PGA Tour, and to Ken Lovell in particular, for their support. Please note that any further use of the data requires prior permission from the PGA Tour.

References

[1] The rules of golf. https://www.randa.org/rog/the-rules-of-golf, 2019. Accessed: 2024-03-15.
[2] DL Alexander and W Kern. Drive for show and putt for dough? an analysis of the earnings of pga tour golfers. Journal of sports Economics, 6(1):46–60, 2005.
[3] J Arkes. The hot hand vs. cold hand on the pga tour. International Journal of Sport Finance, 11(2):99–113, 2016.
[4] C Au and T Woo. Three dimensional extension of bresenham’s algorithm with voronoi diagram. Computer-Aided Design, 43(4):417–426, 2011.
[5] M Bansal and M Broadie. A simulation model to analyze the impact of hole size on putting in golf. In Simulation Conference, pages 2826–2834. WSC 2008, 2008.
[6] CD Baugher, JP Day, and EW Burford Jr. Drive for show and putt for dough? not anymore. Journal of Sports Economics, 17(2):207–215, 2016.
[7] CD Baugher, JP Day, and EW Burford Jr. Drive for show and putt for dough? not anymore. Journal of Sports Economics, 17(2):207–215, 2016.
[8] J Bresenham. Algorithm for computer control of a digital plotter. IBM Systems Journal, 4:25–30, 1965.
[9] M Broadie. Assessing golfer performance using golfmetrics. In Science and golf V: Proceedings of the 2008 world scientific congress of golf, pages 253–262. St. Andrews: World Scientific Congress of Golf Trust, 2008.
[10] M Broadie. Assessing golfer performance on the pga tour. Interfaces, 42(2):146–165, 2012.
[11] M Broadie and S Ko. A simulation model to analyze the impact of distance and direction on golf scores. In Winter Simulation Conference, WSC 2009, pages 3109–3120, 2009.
[12] RP Bunker and F Thabtah. A machine learning framework for sport result prediction. Applied Computing and Informatics, 15(1):27–33, 2019.
[13] R Connolly and RJ Rendleman. Tournament selection efficiency: an analysis of the pga tour’s fedexcup. Journal of Quantitative Analysis in Sports, 8(4), 2012.
[14] RA Connolly and RJ Rendleman. Dominance, intimidation, and ‘choking’ on the pga tour. Journal of Quantitative Analysis in Sports, 5(3), 2009.
[15] RA Connolly and RJ Rendleman Jr. Skill, luck, and streaky play on the pga tour. Journal of the American Statistical Association, 103(481):74–88, 2008.
[16] RA Connolly and RJ Rendleman Jr. What it takes to win on the pga tour (if your name is “tiger” or if it isn’t). Interfaces, 42(6):554–576, 2012.
[17] C Drappi and LC Ting Keh. Predicting golf scores at the shot level. Journal of Sports Analytics, 5(2):1–9, 2018.
[18] D Fearing, J Acimovic, and SC Graves. How to catch a tiger: understanding putting performance on the pga tour. Journal of Quantitative Analysis in Sports, 7(1), 2011.
[19] E Gnagy, M Dixon, E Clingerman, and J Bartholomew. An exploration of strategic decision making in golf: take a chance, it’s worth the risk. International Journal of Golf Science, 4(2):89–109, 2015.
[20] CM Grinstead and JL Snell. Introduction to probability. American Mathematical Society, Providence, RI, 1997.
[21] M Guillot and G Stauffer. The stochastic shortest path problem: a polyhedral combinatorics perspective. European Journal of Operational Research, 285(1):148–158, 2018.
[22] EL Heiny and R Heiny. And the 2011 driving champion is? dustin johnson. Journal of Quantitative Analysis in Sports, 8(4), 2012.
[23] EL Heiny and R Heiny. Stochastic model of the 2012 pga tour season. Journal of Quantitative Analysis in Sports, 10(4), 2014.
[24] DC Hickman, C Kerr, and N Metz. Rank and performance in dynamic tournaments: evidence from the pga tour. Journal of Sports Economics, 20(4):509–534, 2019.
[25] DC Hickman and NE Metz. The impact of pressure on performance: evidence from the pga tour. Journal of Economic Behavior and Organization, 116:319–330, 2015.
[26] S Hoffmeister and J Rambau. Strategy optimization in sports: a two-scale approach via markov decision problems. http://www.wm.uni-bayreuth.de/de/download/xcf2d3wd4lkj2/preprint_sso_bv.pdf, 2015.
[27] S Hoffmeister and J Rambau. Sport strategy optimization in beach volleyball? how to bound direct point probabilities dependent on individual skills. In MathSport International 2017 Conference, 2017.
[28] KY Huang and WL Chang. A neural network method for prediction of 2006 world cup football game. In The 2010 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2010.
[29] J Hucaljuk and A Rakipovic. Predicting football scores using machine learning techniques. In MIPRO, Proceedings of the 34th International Convention, pages 1623–1627. IEEE, 2011.
[30] N James and GD Rees. Approach shot accuracy as a performance indicator for us pga tour golf professionals. International Journal of Sports Sciences Coach, 3(1):145–160, 2008.
[31] J Lim, Y Lim, and J Song. Prediction of golf scores on the pga tour using statistical models. Korean Journal of Applied Statistics, 30(1):41–55, 2017.
[32] M Maher. Stochastic modelling of sport. In 2012 Ninth International Conference on Quantitative Evaluation of Systems, pages 207–208, 2012.
[33] S Ozbeklik and JK Smith. Risk taking in competition: evidence from match play golf tournaments. Journal of Corporate Finance, 44:506–523, 2017.
[34] M Pfeiffer, H Zhang, and A Hohmann. A markov chain model of elite table tennis competition. International Journal of Sports Sciences Coach, 5(2):205–222, 2010.
[35] S Robertson, AF Burnett, and R Gupta. Two tests of approach-iron golf skill and their ability to predict tournament performance. Journal of Sports Sciences, 32(14):1341–1349, 2014.
[36] K Routley and O Schulte. A markov game model for valuing player actions in ice hockey. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, pages 782–791, Arlington, Virginia, United States, 2015. AUAI Press.
[37] Z Shi, S Moorthy, and A Zimmermann. Predicting ncaab match outcomes using ml techniques–some results and lessons learned. In ECML/PKDD 2013 Workshop on Machine Learning and Data Mining for Sports Analytics, 2013.
[38] M Stockl and PF Lamb. The variable and chaotic nature of professional golf performance. Journal of Sports Sciences, 36(9):978–984, 2018.
[39] S Sugawara, H Kawamura, and K Suzuki. Skill-based simulation model for optimizing strategy in golf. In Advanced Intelligent Mechatronics (AIM), 2013 IEEE/ASME International Conference, pages 1591–1596, 2013.
[40] A Terroba, W Kosters, J Varon, and CS Manresa-Yee. Finding optimal strategies in tennis from video sequences. International Journal of Pattern Recognition and Artificial Intelligence, 27(06):1355010, 2013.
[41] E Štrumbelj and P Vračar. Simulating a basketball match with a homogeneous markov model and forecasting the outcome. International Journal of Forecasting, 28(2):532–542, 2012.
[42] O Wiseman. Using machine learning to predict the winning score of professional golf events on the PGA Tour. PhD thesis, National College of Ireland, Dublin, 2016.

9 Appendix

The tables below present the confidence intervals for the mean of the various metrics considered in Table 1. For each metric, ’LB’ represents the lower bound and ’UB’ represents the upper bound of the interval.

	FirstName	LastName	Score	Score LB	Score UB	Tee-shot	Tee-shot LB	Tee-shot UB
1	Rory	McIlroy	269.000	258.816	279.184	266.504	262.347	270.661
2	Bryson	DeChambeau	273.000	262.445	283.555	260.902	257.769	264.036
3	Justin	Rose	274.000	263.146	284.854	263.270	259.396	267.145
4	Henrik	Stenson	275.000	266.162	283.838	253.695	249.943	257.446
5	Tiger	Woods	277.000	267.877	286.123	258.218	253.077	263.359
6	Ryan	Moore	278.000	268.006	287.994	254.739	250.969	258.510
7	Kevin	Chappell	280.000	268.684	291.316	266.802	262.215	271.389
8	Marc	Leishman	280.000	269.146	290.854	259.035	254.889	263.180
9	Patrick	Rodgers	280.000	270.532	289.468	257.997	254.502	261.492
10	Chris	Kirk	281.000	270.205	291.795	254.156	250.510	257.801

	FirstName	LastName	Fairway	Fairway LB	Fairway UB	L	L LB	L UB
1	Rory	McIlroy	0.696	0.594	0.799	0.107	0.036	0.178
2	Bryson	DeChambeau	0.768	0.675	0.860	0.125	0.061	0.189
3	Justin	Rose	0.786	0.691	0.880	0.054	0.000	0.107
4	Henrik	Stenson	0.839	0.758	0.921	0.054	0.006	0.101
5	Tiger	Woods	0.643	0.536	0.750	0.107	0.036	0.178
6	Ryan	Moore	0.839	0.754	0.925	0.018	-0.013	0.049
7	Kevin	Chappell	0.732	0.636	0.828	0.125	0.047	0.203
8	Marc	Leishman	0.714	0.620	0.809	0.071	0.015	0.128
9	Patrick	Rodgers	0.589	0.500	0.678	0.161	0.087	0.234
10	Chris	Kirk	0.732	0.640	0.825	0.143	0.067	0.218

	FirstName	LastName	R	R LB	R UB	GiR	GiR LB	GiR UB
1	Rory	McIlroy	0.196	0.107	0.286	0.736	0.638	0.834
2	Bryson	DeChambeau	0.107	0.036	0.178	0.764	0.682	0.846
3	Justin	Rose	0.161	0.075	0.246	0.750	0.648	0.852
4	Henrik	Stenson	0.107	0.032	0.183	0.806	0.725	0.886
5	Tiger	Woods	0.250	0.146	0.354	0.708	0.610	0.806
6	Ryan	Moore	0.143	0.063	0.223	0.806	0.729	0.883
7	Kevin	Chappell	0.143	0.072	0.214	0.806	0.714	0.897
8	Marc	Leishman	0.214	0.123	0.305	0.722	0.618	0.826
9	Patrick	Rodgers	0.250	0.159	0.341	0.722	0.623	0.822
10	Chris	Kirk	0.125	0.052	0.198	0.708	0.605	0.811

	FirstName	LastName	Water	Water LB	Water UB	Bunker	Bunker LB	Bunker UB
1	Rory	McIlroy	0.000	0.000	0.000	0.153	0.065	0.240
2	Bryson	DeChambeau	0.000	0.000	0.000	0.125	0.046	0.204
3	Justin	Rose	0.028	-0.004	0.059	0.139	0.056	0.222
4	Henrik	Stenson	0.000	0.000	0.000	0.139	0.069	0.209
5	Tiger	Woods	0.000	0.000	0.000	0.181	0.093	0.268
6	Ryan	Moore	0.028	-0.027	0.082	0.250	0.151	0.349
7	Kevin	Chappell	0.042	-0.019	0.103	0.153	0.062	0.243
8	Marc	Leishman	0.014	-0.013	0.041	0.208	0.115	0.301
9	Patrick	Rodgers	0.014	-0.013	0.041	0.250	0.158	0.342
10	Chris	Kirk	0.014	-0.013	0.041	0.264	0.161	0.367

The tables below present the confidence intervals for the mean of the various metrics considered in Table 2. For each metric, ’LB’ represents the lower bound and ’UB’ represents the upper bound of the interval.

	FirstName	LastName	Score	Score LB	Score UB	Tee-shot	Tee-shot LB	Tee-shot UB
1	Rory	McIlroy	273.486	272.555	274.418	284.492	283.870	285.113
2	Bryson	DeChambeau	277.623	276.739	278.506	270.841	270.328	271.354
3	Justin	Rose	275.630	274.768	276.493	269.931	269.510	270.353
4	Henrik	Stenson	281.944	281.022	282.867	258.893	258.463	259.323
5	Tiger	Woods	269.085	268.275	269.895	263.598	263.269	263.927
6	Ryan	Moore	282.356	281.473	283.240	262.925	262.520	263.330
7	Kevin	Chappell	278.488	277.572	279.405	291.782	291.217	292.348
8	Marc	Leishman	274.260	273.386	275.135	277.578	277.045	278.110
9	Patrick	Rodgers	284.414	283.510	285.317	271.482	271.055	271.908
10	Chris	Kirk	289.085	288.156	290.013	243.917	243.217	244.617

	FirstName	LastName	Fairway	Fairway LB	Fairway UB	L	L LB	L UB
1	Rory	McIlroy	0.685	0.677	0.693	0.126	0.120	0.133
2	Bryson	DeChambeau	0.676	0.668	0.685	0.141	0.135	0.147
3	Justin	Rose	0.700	0.691	0.709	0.147	0.140	0.154
4	Henrik	Stenson	0.699	0.690	0.707	0.135	0.129	0.141
5	Tiger	Woods	0.765	0.757	0.773	0.088	0.083	0.093
6	Ryan	Moore	0.686	0.677	0.695	0.133	0.126	0.139
7	Kevin	Chappell	0.624	0.615	0.632	0.121	0.115	0.127
8	Marc	Leishman	0.702	0.694	0.711	0.136	0.130	0.142
9	Patrick	Rodgers	0.628	0.619	0.637	0.181	0.174	0.188
10	Chris	Kirk	0.647	0.639	0.656	0.135	0.129	0.142

	FirstName	LastName	R	R LB	R UB	GiR	GiR LB	GiR UB
1	Rory	McIlroy	0.189	0.182	0.196	0.735	0.726	0.743
2	Bryson	DeChambeau	0.183	0.176	0.190	0.760	0.752	0.768
3	Justin	Rose	0.153	0.146	0.160	0.742	0.734	0.750
4	Henrik	Stenson	0.166	0.159	0.173	0.718	0.710	0.727
5	Tiger	Woods	0.147	0.141	0.154	0.795	0.787	0.803
6	Ryan	Moore	0.181	0.175	0.188	0.718	0.710	0.727
7	Kevin	Chappell	0.255	0.248	0.262	0.721	0.713	0.730
8	Marc	Leishman	0.162	0.155	0.168	0.770	0.761	0.778
9	Patrick	Rodgers	0.191	0.184	0.198	0.678	0.669	0.687
10	Chris	Kirk	0.217	0.210	0.224	0.693	0.684	0.701

	FirstName	LastName	Water	Water LB	Water UB	Bunker	Bunker LB	Bunker UB
1	Rory	McIlroy	0.009	0.007	0.011	0.137	0.130	0.144
2	Bryson	DeChambeau	0.009	0.007	0.010	0.121	0.114	0.128
3	Justin	Rose	0.004	0.003	0.005	0.144	0.137	0.150
4	Henrik	Stenson	0.007	0.006	0.009	0.144	0.137	0.151
5	Tiger	Woods	0.005	0.003	0.006	0.099	0.093	0.105
6	Ryan	Moore	0.008	0.006	0.009	0.157	0.150	0.164
7	Kevin	Chappell	0.015	0.013	0.017	0.118	0.112	0.125
8	Marc	Leishman	0.009	0.007	0.011	0.101	0.095	0.107
9	Patrick	Rodgers	0.006	0.005	0.008	0.164	0.157	0.172
10	Chris	Kirk	0.013	0.011	0.015	0.119	0.113	0.126
