Data Driven Investment Management — Baseball Steered The Path

Christian Schitton
7 min read · Jun 17, 2023

Statistical modelling and data-driven investment management are far more widely accepted today than they were in the early 2000s, when the baseball business started to use statistics to identify undervalued and overvalued players and to rebuild effective hybrid teams within tight budget constraints.

photo credit: pixabay.com

At that time, the baseball industry began to replace gut feelings, decades-old thinking patterns and biased expectations with statistical models in order to make clear-cut investment decisions.

More precisely, statistics helped to make clear-cut investment decisions under budget constraints, with the aim of significantly reducing the cost of a win.

And it took only one season to turn the industry upside down, forcing baseball clubs to adapt very quickly or be wiped out like dinosaurs.

Today, using statistical modelling is the standard in the baseball business.

How It Started

If you have watched the movie Moneyball, you know the story: the Oakland Athletics were a rather small baseball team with a limited budget. When it came to hiring talented players, they could never compete with better-funded teams such as the Boston Red Sox or the New York Yankees.

Given those budget constraints, the Oakland Athletics started to use statistical modelling to calculate the odds that certain players would perform well at specific tasks throughout the season while still being priced relatively cheaply under the then-prevailing market conditions.

The goal was to create a kind of “hybrid” team of mispriced players whose predicted performance in certain aspects of the game would, as a team, match that of the big teams hiring all the expensive talent. All this had to be done within the tight budget. At the other end of the spectrum, players on the team’s own roster who were currently overvalued were to be sold at a good price in order to improve the budget situation.

Overall, the aim was to reduce the cost per win significantly.

Although they were ridiculed in the beginning and even lost games during the adjustment period, the Oakland Athletics proved this method to be a reliable basis for making the right business decisions. They became one of the most cost-effective teams and, for example, set a new record with 20 consecutive wins in 2002.

Let’s see how this works in principle.

An Overvalued Asset

Starting Point

It is the year 2013, and a baseball team has a young and fairly unknown player called José, who did extraordinarily well in the first part of the season. He showed the following performance metrics:

A Batting Average of 0.45 would make José one of the most successful players in baseball. No player has finished a season with a Batting Average of 0.4 or higher since 1941.

So, are we looking at a big talent who can be expected to break the old mark from 1941, or was it just luck that José performed so exceptionally well?

Again, we can make an educated guess or, alternatively, do some preparation and try to extract the information from the data available to us. In this case, we decided to follow the data and to predict José’s Batting Average for the remainder of the season.

Technically, we use a hierarchical model from Bayesian statistics.

The Data Framework

In a typical season, players have about 500 At Bats.

Filtering the three preceding seasons, i.e. the years 2010, 2011 and 2012, for all players with at least 500 At Bats yields a mean Batting Average of 0.275. Here are the summary statistics:

and José’s performance at the beginning of the 2013 season:

José’s Batting Average of 0.45 is therefore more than 6 standard deviations above what can usually be expected as a Batting Average over a baseball season, which makes him a clear outlier. The question is: did this happen because of José’s huge talent, was it just a lucky streak, or was it a combination of both?
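The "more than 6 standard deviations" statement can be checked with a few lines of Python. This is only a sketch: the mean of 0.275 and José's 0.45 come from the text, while the standard deviation of 0.027 is an assumption (the summary table is not reproduced here) chosen to be consistent with the figures quoted in this article.

```python
# Values stated in the article
league_mean = 0.275   # mean Batting Average, seasons 2010 to 2012
jose_avg = 0.450      # José's Batting Average early in the 2013 season

# ASSUMPTION: standard deviation across players; the article's summary
# table is not shown, 0.027 is consistent with "more than 6 standard deviations"
league_sd = 0.027

# Distance of José's average from the league mean, in standard deviations
z_score = (jose_avg - league_mean) / league_sd
print(f"José is {z_score:.2f} standard deviations above the league mean")
```

Under this assumption the distance comes out at roughly 6.5 standard deviations.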

The Hierarchical Model

In short, the hierarchical model provides a mathematical description of how we came to observe (in this case) a Batting Average of 0.45 for José.

And as for the hierarchy, we have two levels to take into account:

  • First Level: This level accounts for the player-to-player variability and is summarised in the prior distribution. In other words, the Batting Average is now a random variable (denoted as p) which describes the randomness in picking a player.
  • Second Level: This is the level of the one chosen player and represents the variability due to luck when batting. It is summarised in the sampling distribution (likelihood). In other words, this is the distribution of the observed Batting Average (denoted as Y) of one player GIVEN that this player has a talent p.
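The two levels can be illustrated with a small simulation. This is a sketch under stated assumptions, not the model used in the case study: the prior standard deviation (0.027) and the number of At Bats (20) are assumed values, and the luck at the second level is simulated hit by hit (a binomial draw) rather than with the normal approximation.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

MU = 0.275          # prior mean, from the article
TAU = 0.027         # ASSUMPTION: prior standard deviation across players
N_AT_BATS = 20      # ASSUMPTION: At Bats early in the season
N_PLAYERS = 100_000

def simulate_player():
    # Level 1: pick a player, i.e. draw a talent p from the prior
    p = random.gauss(MU, TAU)
    # Level 2: luck when batting, one Bernoulli trial per At Bat
    hits = sum(random.random() < p for _ in range(N_AT_BATS))
    return p, hits

players = [simulate_player() for _ in range(N_PLAYERS)]

# Keep only players whose early-season average is 9/20 = 0.45
hot_starts = [p for p, hits in players if hits == 9]
mean_talent = sum(hot_starts) / len(hot_starts)

print(f"{len(hot_starts)} simulated players started at 0.45")
print(f"their average true talent p is {mean_talent:.3f}")
```

Among the simulated players who start at 0.45, the average talent stays close to the prior mean rather than anywhere near 0.45, which is the intuition behind the hierarchical model.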

From examining the seasons 2010 to 2012 we know that the Batting Average approximately follows a normal distribution. Visualised, the situation looks as follows; you can also see how exceptionally well José is doing at the moment:

image by author

Mathematically, there is the following situation:

So, now we are ready to compute the so-called posterior distribution in order to summarise the prediction for p:

In case of our single player, José, this means:

and

with

resulting in the posterior distribution for one single player
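The formulas themselves are not reproduced in this text. As a sketch, the standard normal-normal hierarchical model that matches the description above can be written as follows. The symbols μ and τ (mean and standard deviation of the prior), σ (standard deviation of the sampling distribution), y (observed Batting Average) and B (shrinkage weight) are assumed notation, not taken from the article.

```latex
\begin{align*}
  % Level 1 (prior): talent varies from player to player
  p &\sim N(\mu, \tau^2) \\
  % Level 2 (sampling distribution): luck when batting, given talent p
  Y \mid p &\sim N(p, \sigma^2) \\
  % Posterior mean: weighted average of prior mean and observation
  \mathrm{E}(p \mid Y = y) &= B\,\mu + (1 - B)\,y
      = \mu + (1 - B)\,(y - \mu),
      \qquad B = \frac{\sigma^2}{\sigma^2 + \tau^2} \\
  % Posterior variance
  \mathrm{SE}(p \mid Y = y)^2 &= \frac{1}{1/\sigma^2 + 1/\tau^2}
\end{align*}
```

The weight B expresses how strongly the observation is pulled back towards the prior mean: the noisier the observation (large σ relative to τ), the closer B is to 1 and the less the early-season figure counts.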

Prediction Results for José

So, in terms of the Batting Average, how do we

  • predict the performance of José for the overall season 2013 (the posterior distribution)
  • based on what we saw up to April (the likelihood)
  • and based on what we saw from the seasons 2010 to 2012 (the prior distribution)?

In other words, given that we saw a Batting Average of José amounting to 0.45 in the beginning of the season, the predicted Batting Average of José for the whole season follows a normal distribution with the following parameters:

Hence, given the inherent uncertainty in predictions (expressed in terms of a 95% credible interval), we expect José to have a Batting Average for the whole 2013 season of 0.285 ± 0.052, i.e. an expected range between 0.233 and 0.337.
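The figures above can be retraced with a short calculation. This is a sketch under assumptions, not the author's code: the prior mean (0.275) and the observation (0.45) are from the text, whereas the prior standard deviation (0.027) and the number of At Bats (20) are assumed values that are consistent with the quoted result. The interval uses two posterior standard deviations as an approximate 95% range.

```python
import math

def normal_posterior(prior_mean, prior_sd, observed, sampling_sd):
    """Posterior of a normal prior combined with a normal likelihood."""
    # Shrinkage weight: share of the estimate that comes from the prior
    shrink = sampling_sd**2 / (sampling_sd**2 + prior_sd**2)
    post_mean = shrink * prior_mean + (1 - shrink) * observed
    post_sd = math.sqrt(1 / (1 / sampling_sd**2 + 1 / prior_sd**2))
    return shrink, post_mean, post_sd

mu = 0.275       # prior mean, from the article
y = 0.450        # José's observed Batting Average, from the article
tau = 0.027      # ASSUMPTION: prior standard deviation across players
n_at_bats = 20   # ASSUMPTION: José's At Bats early in the season

# Standard error of an average over n_at_bats trials (plug-in estimate)
sigma = math.sqrt(y * (1 - y) / n_at_bats)

B, post_mean, post_sd = normal_posterior(mu, tau, y, sigma)

# Approximate 95% credible interval: mean plus/minus two standard deviations
half_width = 2 * post_sd
lower, upper = post_mean - half_width, post_mean + half_width

print(f"shrinkage weight B = {B:.3f}")
print(f"posterior mean     = {post_mean:.3f}")
print(f"interval           = {lower:.3f} to {upper:.3f}")
```

With these inputs the posterior mean rounds to 0.285 and the half-width to 0.052. The interval bounds differ from the article's in the third decimal only because the article subtracts the already rounded values.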

Here is the whole process visualised:

image by author
image by author

As can be seen, the expected Batting Average for the whole season ranges between 0.233 and 0.337 and is far off the current extraordinary performance José showed during the early stage of the season!

“Data Driven” Consequences

With those data-driven considerations at hand, the management of José’s team understood that the extraordinary performance José showed in the early stage of the season was most likely attributable more to luck than to extraordinary talent. As the season went on, his performance could therefore be expected to level off and move towards the overall average of players.

The management also understood that other baseball teams were quite impressed by José’s current Batting Average of 0.45 and that those teams saw this as a strong indication of an extraordinary talent.

Put another way, José as a baseball player was an overpriced asset under current market conditions!

Realising this, the management sold José to another team for really good money.

The Batting Average of José for the rest of the season was as follows:

Indeed, the exceptional performance levelled off over the rest of the season and stayed a notch above the league average.

Conclusion

Not too long ago, applying statistical concepts in the baseball industry was out of the question. By now, it is a proven concept and the standard for making investment decisions.

Once successfully introduced, the transformation into a statistically oriented baseball business happened very fast, leaving little room for hesitation in adapting to the new standards.

Another aspect of this transformation is the role of the experts. Instead of investment decisions being based directly on experts’ knowledge and experience, this very expertise was channelled into identifying which parameters to look at when building the mathematical framework.

With the mathematical framework to back them up, decision makers could draw their investment conclusions much more effectively: based on a clear goal, within clearly defined (statistical) boundaries and well within any constraints (e.g. budget).

References

The case study was taken from Prof. Rafael Irizarry’s course “Inference and Modelling” (Harvard University).

Christian Schitton

Combining Real Estate Investment & Finance expertise with advanced predictive analytics modelling. Created risk algorithms introducing data driven investing.