Wednesday, December 5, 2012

The Data Has Arrived!

After a long search for a free and comprehensive set of NCAA Men's Basketball results, I have finally found what we need. Ken Pomeroy has been nice enough to let me use his most basic game results for this project, for which I am immensely grateful. As mentioned in a previous post, the data we care about most at this stage is relatively simple: for each game between Division I teams we want to know which two teams are involved, where the game took place, and what the score was. Thankfully, that's exactly what Ken has given us via his raw data.
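
To make that concrete, here is a rough sketch of what each parsed game record might look like. The field names are my own placeholders for illustration, not the actual layout of Ken's files:

```python
from dataclasses import dataclass

@dataclass
class Game:
    """One game record; field names are illustrative placeholders."""
    team_a: str
    team_b: str
    score_a: int
    score_b: int
    location: str  # e.g. team_a's home floor, or a neutral site
```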

Alright, so let's dive into the predictive model. I was able to put together a quick macro in Excel that ran through every game in the 2000-01 season and modified the ratings of the teams involved in each game based on the outcome. In all, there were just under 5,000 games played throughout the season involving over 500 different teams. For the first round of tests I used a value of $K=36$ (if you need to remind yourself what the Elo calculations look like, check my previous post).
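
For quick reference, the standard Elo calculation works like this: before a game, team A's expected score against team B is

$$E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}},$$

and after the game team A's rating is updated via

$$R_A' = R_A + K\,(S_A - E_A),$$

where $S_A$ is 1 for a win, 0 for a loss, and $K=36$ as noted above.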

At the beginning of the season, each team was given a starting rating of 1200. I then went through and churned out new ratings for each team after each game using a simple Excel macro. Since I just wanted to perform a super, super simplistic test, home court advantage is not taken into account. Starting every team at the same rating is also an unrealistic assumption, since teams obviously don't begin the season at the same skill level. As a result, I did not particularly expect these first results to be terribly accurate, but hoped that at least the basic concept would work reasonably well.
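
For anyone who'd rather see the logic spelled out than squint at an Excel macro, here is a minimal sketch of that same rating pass in Python. This is an illustration, not the actual implementation; `games` is assumed to be a chronological list of (winner, loser) pairs derived from the raw scores:

```python
from collections import defaultdict

K = 36
START_RATING = 1200

def expected_score(r_a, r_b):
    """Standard Elo expected score for a team rated r_a vs. one rated r_b."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

def rate_season(games):
    """Run every game in order, updating ratings after each result."""
    ratings = defaultdict(lambda: START_RATING)  # every team starts at 1200
    for winner, loser in games:
        e_win = expected_score(ratings[winner], ratings[loser])
        delta = K * (1.0 - e_win)   # winner's actual score is 1
        ratings[winner] += delta    # winner gains what the loser sheds;
        ratings[loser] -= delta     # no home-court adjustment yet
    return ratings
```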

After running through all 5,000 games, the final Elo rating for each team was calculated. These ratings ranged from 1664 down to 846. The table below shows the top- and bottom-10 performing teams based on the results of this first season.

In the 2001 NCAA Tournament, Duke ended up beating Arizona in the finals, with Michigan State and Maryland losing in the semi-finals. Considering all of the Final Four teams made it into our top 7, I'd say we're off to a smashing start. This is great news, given that this is an incredibly simplistic first cut at a modeling process that will undergo many, many iterations in the coming months.

One last test we can quickly perform is comparing our end-of-season results to the last week of AP Top 25 rankings from 2001. The table below shows where our model placed each of those teams at the end of the season.

While our model pretty clearly begins to show some weird results toward the bottom of the AP list, the top 15 teams from the AP poll match our top 15 with one exception: the AP ranked Iowa State as the 10th best team (we had them 17th, so not far off), while we had Gonzaga rounding out our top 15 even though the Zags were nowhere to be found on the AP's list. Considering Gonzaga made it to the Sweet Sixteen in the 2001 tournament, this could be seen as a lapse in judgement on the part of the AP rather than a failure of our own modeling.
Alright, that's it for this time. Look for a post in the near future outlining how we're going to deal with home court advantage, and hopefully extending our model all the way through the 2011-2012 season!
