Hi, welcome to Word of Mike, my little corner of the internet. I am a Software/Web Developer working in North Yorkshire. I mainly write about programming but my other passion is politics so beware.

2014-09-14 19:11:36 UTC

Building a Horse Racing Predictor Part 1


It's now been 12 months since I first put pen to whiteboard to figure out how to build the best horse racing predictor that I could. What started out as a curious hobby has developed into a bit of an obsession.

For most casual punters, the fools' game (or so it's often called) is nothing more than a lottery with added entertainment. More serious punters try to get an "edge" by looking at statistics and using their experience to weigh up all the factors in their head and make the most informed choice.

Being a man of science, I wasn't particularly excited by gut feel and experience. I felt as though I ought to be able to get my edge by statistical means. Of course, this wasn't a novel thought; many people before me have tried to do the same thing, with varying degrees of success. However, having unbounded faith in my own abilities (arrogance), I was still excited by the possibilities.

I'd had a passive interest in horse racing for a number of years. That, combined with the availability of historical data and the sheer quantity of horse races, made it feel like a fertile testing ground in which to put some of my ideas into practice. Let this be the ramshackle documentation of said exploration.

Gathering data

The first thing I scribbled on my whiteboard was all the data I wanted to collect. I wrote down nigh-on everything that could possibly affect how well a horse runs in a race, and all the things I thought I'd need to know to be able to judge that performance. My next step involved finding a source for this information.

I quickly discovered (not unexpectedly) that there's no holy-grail free API for historical horse racing results. To gloss over many days of searching the web and weighing up commercial API costs, I ended up settling on web scraping from Timeform (specifically http://service.timeform.betfair.com/), because I found the page structure easy to work with and it had the most data.

Ruby is my thang, so I began work in earnest writing the scraper using Nokogiri and open-uri. It was a fairly quick process; simple scraping like this is fairly easy stuff. With the scraper proven, I set it to work pulling the last 5 years of data from Timeform (2008-2013), dealing with oddities and errors as they arose (dead heats, incomplete results, result reversal by stewards' enquiry, etc.) and then putting it back to work. The scraping took about 13 hours in all.

It was quite exciting to have all this data at my fingertips, but I knew that the hard work was ahead of me. What I didn't know was quite how much hard work it would turn out to be!

Rating a performance

The crunch of any racing system is in accurately rating performances. I had decided from the beginning that it would be a speed-oriented rating, but in UK & Ireland horse racing results, only the winning time is reported. I knew that for this system to be effective I'd have to be able to attribute a time to the runners-up; I would do this by estimating to the best of my ability how much slower than the winner they were, based on their reported distances (in 'lengths') behind the winner.

After a bit of research, I settled on a length being a fifth of a second (200ms), with the smaller margins such as head, neck, etc. being fractions of this, e.g.:

#
# Experimental distances as fractions of a length (8 ft, 200ms)
#
DIST_ABBR_TO_LEN = {
  'nk'  => 2.0  / 8, # neck, 1/4 length, 50ms
  'snk' => 1.5  / 8, # short neck, 3/16 length, 37.5ms
  'hd'  => 1.0  / 8, # head, 1/8 of a length, 25ms
  'sh'  => 0.5  / 8, # short head, 1/16 of a length, 12.5ms
  'ns'  => 0.25 / 8, # nose, 1/32 of a length, 6.25ms
  'dh'  => 0.0       # dead heat, 0 lengths, 0ms
}
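To show how a table like this turns a reported distance into an estimated time, here is a minimal sketch. The helper names (`distance_to_lengths`, `estimated_time_ms`) and the `MS_PER_LENGTH` constant are mine, invented for illustration, and it assumes numeric distances arrive as plain decimal strings ("2.5") measured back to the winner. The real parser may well differ.

```ruby
# Copy of the table above so this sketch runs on its own.
DIST_ABBR_TO_LEN = {
  'nk' => 2.0 / 8, 'snk' => 1.5 / 8, 'hd' => 1.0 / 8,
  'sh' => 0.5 / 8, 'ns' => 0.25 / 8, 'dh' => 0.0
}.freeze

MS_PER_LENGTH = 200.0 # one length taken as a fifth of a second

# Convert a distance token ("nk", "2.5", "3") to lengths.
# Assumption: anything that isn't an abbreviation is a decimal number;
# Float() raises ArgumentError for tokens it can't parse.
def distance_to_lengths(token)
  DIST_ABBR_TO_LEN.fetch(token) { Float(token) }
end

# Estimate a runner's time from the winning time and the distance
# (assumed to be measured behind the winner) it was beaten by.
def estimated_time_ms(winning_time_ms, distance_behind_winner)
  winning_time_ms + distance_to_lengths(distance_behind_winner) * MS_PER_LENGTH
end
```

With a winning time of 60 seconds, a horse beaten a neck would be given `estimated_time_ms(60_000, 'nk')`, i.e. 50ms slower than the winner.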

I'd had the foresight to build this into my parser, so I had times for all runners in my 5 years of historical data. There's a lot more to rating a performance than just looking at the speed, though: all kinds of influences can seriously affect how fast a horse runs. Going is probably the most obvious. For instance, if I were to compare an average horse running on firm to a decent horse running on heavy, I might naively assume that because the average horse ran faster, he is the better horse.

So just what are these factors, and how do we adjust for them? This is the hottest question in this game, and the best efforts to answer it produce the best rating systems. There's little dispute about what the main influences are: distance, going, weight carried and draw.

One approach to adjusting for going, and my first, is to normalise every race to "good" by adjusting each result based on the average speeds over each of the goings. For example, if the average speed over 1m on heavy was 17 yards/second, but on good it was 18 yards/second, I'd normalise all results on heavy ground by adding a yard a second to the actual speed.
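As a rough sketch of that normalisation: take the average speed per going, then shift each result by the gap between the "good" average and the average for the going it was run on, so a run on slower ground is adjusted upwards. The result hashes and method names here are hypothetical, and the sketch assumes the results have already been filtered down to a single distance.

```ruby
# Average speed (yards/second) for each going.
# Each result is a hypothetical hash such as { going: 'heavy', speed: 17.2 }.
def average_speed_by_going(results)
  results.group_by { |r| r[:going] }
         .transform_values { |rs| rs.sum { |r| r[:speed] }.to_f / rs.size }
end

# Shift a speed onto the "good" scale using the gap between the
# average on good and the average on the going it was run on.
def normalise_to_good(speed, going, averages)
  speed + (averages.fetch('good') - averages.fetch(going))
end

results = [
  { going: 'good',  speed: 18.0 },
  { going: 'good',  speed: 18.0 },
  { going: 'heavy', speed: 17.0 }
]
averages = average_speed_by_going(results)
normalised = normalise_to_good(16.5, 'heavy', averages)
```

Here a horse that ran at 16.5 yards/second on heavy is treated as if it had run a yard a second faster on good ground.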

I was getting somewhere closer to a more effective rating once I'd taken going into consideration, and I'd settled on a rudimentary adjustment for weight carried, too.

lbs = result.weight
lbs_from_base_weight = lbs - 110 # base weight is 110lbs
ms_per_yard_per_lb = 10 / 220.0  # 10ms per furlong (220 yards) per lb
speed += ms_per_yard_per_lb * lbs_from_base_weight

I'd adjust everything to the base weight, 7st 12lb (110lbs), by deducting 10ms from the time it took that horse to travel a furlong for each pound it is carrying over the base weight. (They'd travel 1 furlong 10ms slower, or 1 yard roughly 0.045ms slower, for each pound.) So in practice, over a mile race, a horse carrying 10lbs more than a rival will effectively be slowed by 10*10*8 = 800ms, which is roughly equivalent to 4 lengths based on the earlier calculations.
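The arithmetic above can be checked with a small sketch that works on whole-race times. The constant and method names are made up for illustration, and only the figures (110lbs base, 10ms per furlong per pound, 220 yards to a furlong) come from the text; it is not a copy of the code in my rating system.

```ruby
BASE_WEIGHT_LBS = 110
MS_PER_FURLONG_PER_LB = 10.0
YARDS_PER_FURLONG = 220

# Time (ms) attributed to weight carried above the base weight over
# the whole race. Negative when the horse carries less than the base.
def weight_allowance_ms(weight_lbs, distance_yards)
  furlongs = distance_yards / YARDS_PER_FURLONG.to_f
  MS_PER_FURLONG_PER_LB * (weight_lbs - BASE_WEIGHT_LBS) * furlongs
end

# Deduct the allowance to get a time as if run at the base weight.
def weight_adjusted_time_ms(time_ms, weight_lbs, distance_yards)
  time_ms - weight_allowance_ms(weight_lbs, distance_yards)
end
```

Over a mile (1760 yards), `weight_allowance_ms(120, 1760)` gives the 800ms from the example, and two horses 10lbs apart differ by the same amount whatever their absolute weights, because the adjustment is linear in both pounds and distance.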

This value that I've used (10ms per furlong per pound) is a guesstimate based on some research papers that I read, and anecdotally seems to be most accurate around the 1m distance. My next landmark change is to make this variable based on race type (flat, hurdles, jumps), distance, and going. Obviously carrying 1lb more over a 5f sprint isn't going to make as much difference as 1lb over a 3m National Hunt chase course, but currently I'm only taking distance into account in a linear way.

To be continued ...