Running Smart with Machine Learning and Strava

January 16, 2020

An ordered list of your biggest potential training gains, based on sports watch data from you and other athletes.

Running smart?

As a runner always looking to improve my Personal Bests (PBs), I often wonder how I can optimise my time spent running to achieve the best gains whilst being as lazy as possible. Sometimes I run seven times a week, but don’t improve my speeds over a few months. Other times, after weeks spent hiking, drinking and only running once or twice a week, I have run really good PBs. This goes against wisdom from running forums and training plans. Fortunately, I’ve recorded my training for over five years on Strava, and so have many others. There must be a pattern in all this data.

So I built a service to pull data from Strava, the athlete’s social network, with the permission of other athletes. A machine learning algorithm determines, from all athletes’ data, which factors are most important to improve their running. From this algorithm, it is possible to give athletes an ordered list of training improvements they can make in order to get faster.

This writeup is long.

The problem is not an easy one to solve, and the solution not easy to explain in short, as the overlap between running coaches and machine learning engineers is probably fairly small. For that reason, I’ve put a summary section at the top, and kept each technical section fairly self-contained for those looking into the details. I have explained the entire process, from start to finish, including productionising the ‘product.’

Summary

After obtaining data from multiple runners, I built two models:

  1. A model that predicts an athlete's vdot (overall running fitness) from the three months of training before a Personal Best.

  2. A model that predicts the change in vdot, i.e. how much the athlete improved, from the same training features.

The entire system is now available in a simple website, where athletes can sign up and gain insights into their training plan [update: I have taken the website down as the system ran through my hosting budget very quickly].

A prototype visualisation for the system output from the first model is shown below. Here, the blue markers show how the athlete trains, relative to the top 10% and bottom 10% of athletes. The list is ordered by the most important factors in that athlete's training: the product of the importance of each factor and the athlete's shortcoming in it.

Here, it is quite obvious that the athlete needs to run more times per week, and more distance. While doing this, they should spend a lot more time in heart rate zone 2. However, there are some things that they did really well even relative to the best athletes. For instance, for the three months before their PB, they reduced their time spent on non-run activities as they focused harder on running, and they increased the proportion of long runs.

The data

In-depth explanation

From here on, things get more technical:

Setting some boundaries for the problem

Getting technical - first with some running science

The problem I'm solving for here is distance running: between 5km and 42km. While these are still very different races, the dynamic is similar to the extent that there is a formula, with a single constant, governing the mapping between an athlete's speeds at any distances in this range. Assuming I regularly run fast (tempo or threshold) runs and long runs of around 25km to 30km, it is fairly simple to extrapolate my marathon time from my 5k time. This is done through "vdot," a constant and formula developed by legendary running coach Jack Daniels. Throughout my model, the assumption is that vdot is an accurate representation of your overall running fitness, and that we are dealing with fitness in the 5-42km distance range.
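For reference, below is a minimal sketch of the widely published Daniels-Gilbert approximation that maps a race distance and time to vdot. The coefficients are the standard published ones; the function is purely illustrative and not necessarily the exact implementation in the pipeline.

```python
import math

def vdot(distance_m: float, time_min: float) -> float:
    """Estimate vdot from a race result using the Daniels-Gilbert formula.

    distance_m: race distance in metres
    time_min:   finishing time in minutes
    """
    velocity = distance_m / time_min  # metres per minute

    # Oxygen cost of running at this velocity (ml/kg/min).
    vo2 = -4.60 + 0.182258 * velocity + 0.000104 * velocity ** 2

    # Fraction of VO2max that can be sustained for a race of this duration.
    pct_max = (0.8
               + 0.1894393 * math.exp(-0.012778 * time_min)
               + 0.2989558 * math.exp(-0.1932605 * time_min))

    return vo2 / pct_max

# Example: a 20:00 5k corresponds to a vdot of roughly 49.8.
print(round(vdot(5000, 20.0), 1))
```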

However, vdot is not a perfect synonym for fitness so much as it is a measure of speed. For a holistic measure of fitness, one would have to look at OBLA, VO2max, muscle efficiency, and muscle power output versus foot strike duration. We can't measure those with wristwatch data. But we can measure what helps people improve the end goal - speed.

Most runners use Strava, a social network for athletes, to upload their data, which is where I am acquiring data from. Strava DOES estimate your fitness (vdot or their own measure) in their own models, but doesn't make this available through their API. For that reason, the model I built can only pin down your fitness at the moments when you run a Personal Best (PB), which Strava does make available as "Estimated best efforts." There were other options, such as estimating fitness from the relationship between your heart rate and speed, but this would have added too much noise to the model for a Minimum Viable Product (MVP).

Estimated best efforts

To achieve these best efforts, it is generally accepted that you have to follow a finely tuned balance between training hard and recovering, as well as keeping your machine oiled through lots of easy runs. The former is defined by what TrainingPeaks calls the Impulse-Response model - you put your body under stress, and it responds to this stress by getting stronger. The latter is simply because science has shown that our body makes the most efficient performance gains in the aerobic heart rate zone, which corresponds to a relatively easy running pace in "Zone 2".

Which brings us to heart rate zones: Coaches and athletes have simplified things for us by defining, based on each athlete's maximum heart rate, a set of zones or ranges that give us target regions to train in based on our heart rate data. These zones range from Z1 to Z5, from minimum to maximum effort. Many professional athletes train entirely by heart rate zone, rather than by pace, because our hearts are the best indicator of our body's response to training. The image below gives an indication of the zones. Please note that it is relatively bad practice to estimate your heart rate zones based on your age, as maximum heart rate varies greatly with genetics rather than fitness. For instance, Chris Froome's max heart rate is 161, whereas the age-based model would put it around 186. Fortunately, Strava does not make this assumption, but determines your maximum heart rate over time.

Heart rate zones
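For illustration, here is a small sketch of how such zones can be computed from a maximum heart rate. The percentage boundaries below are one common convention; they are not necessarily the ones in the image above or the ones Strava uses.

```python
def hr_zones(max_hr: int) -> dict:
    """Five heart rate zones as (lower, upper) bounds in beats per minute.

    Boundaries are one common convention (percent of max HR); coaches and
    platforms differ slightly in where they draw the lines.
    """
    bounds = {
        "Z1": (0.50, 0.60),  # recovery
        "Z2": (0.60, 0.70),  # aerobic / easy
        "Z3": (0.70, 0.80),  # moderate
        "Z4": (0.80, 0.90),  # threshold
        "Z5": (0.90, 1.00),  # maximal
    }
    return {z: (round(lo * max_hr), round(hi * max_hr)) for z, (lo, hi) in bounds.items()}

print(hr_zones(186))  # e.g. Z2 is roughly 112-130 bpm for a max HR of 186
```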

We can only sustain limited time training in the higher heart rate zones, but benefit greatly from a bit of time spent there. Different coaches see it differently, but the general advice is to spend 80% of your time in the "easy" zones, Z1-Z3, and 20% in the "hard" zones, Z4 and Z5. However, this also differs depending on whether you are early in your training season, preparing to "peak" for a race, or "tapering" your efforts to be rested for a race.
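A quick sketch of checking that split from an activity's heart rate samples, using the same zone convention as above (the stream format here is an assumption for illustration):

```python
def easy_hard_split(hr_stream: list[int], max_hr: int) -> tuple[float, float]:
    """Fraction of samples in the easy (Z1-Z3) versus hard (Z4-Z5) zones."""
    z4_floor = 0.80 * max_hr  # Z4 starts at ~80% of max HR in the convention above
    hard = sum(1 for hr in hr_stream if hr >= z4_floor)
    total = max(len(hr_stream), 1)
    return (total - hard) / total, hard / total

# Over a week of second-by-second samples, the 80/20 advice corresponds
# to a result of roughly (0.8, 0.2).
```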

To manage your time in these zones, and to further simplify training, coaches advise several types of runs:

  - easy runs, mostly in the low zones, which make up the bulk of weekly mileage;
  - long runs, typically around 25km to 30km, done at an easy effort;
  - tempo or threshold runs, which are sustained harder efforts;
  - interval sessions, repeated hard efforts with recovery in between; and
  - supplementary work such as strides and hill repeats.

And if that isn’t enough, there also exists science on how much to increase your mileage per week, a macro-type impulse response model, where your body slowly adapts to more efforts, hence becoming fitter. The price to pay for increasing this too much is of course injury or burnout.

The running science, simplified

Given all of the above, for each athlete there exists an optimal, but very difficult to find, balance between:

  - hard training and recovery;
  - time spent in each heart rate zone;
  - the mix of run types; and
  - how quickly weekly mileage is increased.

Without great knowledge yourself, or a coach, it is hard to find this balance, even as a seasoned runner. There are tools available, such as TrainingPeaks' performance manager and Strava's premium offering. They are great, especially in how they tell you that you are training too hard and need to rest. However, I find that they all focus on estimating your current training load, and hence rely on a lot of inaccurate daily heart rate data. I needed something to advise me on what I could do differently over a period of months, as opposed to on a day-to-day basis.

So, as a runner, I built this analysis as a supplementary tool: to tell me what I could do differently over longer periods of months, relative to other athletes.

High level overview of the system

There are several parts to a system like this, which needs to harvest data, train a model, run the model, and produce insights for users in the front end. One could see them interacting as follows, with green lines indicating synchronous flows and orange lines representing asynchronous flows (asynchronous because of the time required for API access or processing):

Data flow

Getting the data: The Strava API

Strava has heavy limitations on its API usage. It cannot be queried without an athlete's permission, and it is then rate limited such that, more or less, only 10 runners' data can be pulled per day. This slows down data harvesting.

Amongst other things, the Strava API provides, per athlete:

  - a list of all their recorded activities, including type, distance, duration and overall effort;
  - per-activity data streams, such as pace and heart rate; and
  - "Estimated best efforts": their fastest recorded times over standard distances, which I use as PBs.
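Below is a rough, illustrative sketch of the harvesting step, assuming a valid OAuth access token has already been obtained for the athlete (token refresh and error handling omitted; the endpoint is the public Strava v3 API, but the surrounding structure is not the exact production code):

```python
import time
import requests

API = "https://www.strava.com/api/v3"

def fetch_activities(access_token: str, per_page: int = 200) -> list[dict]:
    """Page through all of an athlete's activities."""
    headers = {"Authorization": f"Bearer {access_token}"}
    activities, page = [], 1
    while True:
        resp = requests.get(
            f"{API}/athlete/activities",
            headers=headers,
            params={"page": page, "per_page": per_page},
        )
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            return activities
        activities.extend(batch)
        page += 1
        time.sleep(1)  # stay well inside Strava's rate limits
```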

20 of my friends were kind enough to provide me with all of their data, via Strava. Given that they had an average of 10 PBs in their data each, this provided almost 200 inputs to my models, enough to build an accurate model to estimate vdot, but not enough to build a model to estimate improvement in vdot - a problem that I will explain later.

The data model(s) and feature engineering

The model used to build up the database to train the machine learning algorithm is shown below.

Data flow

  1. All athlete activities are iterated through, determining which are PBs. For each PB, if it is more than a month after the previous PB, and more than 0.5 vdot over the previous record, it becomes a data point. This is to make sure each data point is a standalone PB of sufficient magnitude.

  2. From all athlete activities, a regressor is built mapping pace to heart rate for this athlete over one-month intervals. This is because a lot of athlete activities were done without heart rate monitors (generally, heart rate monitors have only been used widely in the past two or three years). However, heart rate can still be fairly accurately estimated with this regressor, with an R^2 of 0.9. This comes in exceptionally useful in training the machine learning model: it improved the R^2 of the overall model by 0.1 (a minimal sketch of such a regressor is shown after this list).

  3. For each PB, the vdot of the PB is determined using Jack Daniels' formula. This is the "y" of the first machine learning model: the target the model is trained to predict. For the second model, the change in vdot (delta-y) is used. Also for each PB, all activities from the three months before the PB are extracted. The three-month window was chosen based on its effect on the R^2 of the final model.

  4. For each activity, useful features are extracted, such as type of activity, length, overall effort, etc. Whilst I won't go too in-depth on model features, things like the standard deviation of heart rate are used to determine whether the activity is an interval training session, and the relative effort and length of the activity are used to determine whether or not it is a tempo or a long run. Similar logic applies to picking up strides, hill repeats, etc.

  5. Especially for activities from several years ago, pace data is used to extrapolate the expected heart rate for the activity, so that the heart-rate-based features above and below can still be extracted.

  6. Activities are grouped into weeks, a method I am still slightly uncomfortable with, as some athletes might train in cycles of 10 or 14 days. For each week, more features are extracted, such as weekly mileage, the split between runs and other activities, and the ratios between types of runs.

  7. The same is done when grouping weeks into training blocks. Here, the entire three months are considered, and the features are more along the lines of relative increases in mileage and effort, and changes in overall behaviour over the three months. Finally, the taper before the race is also analysed as a factor.

All the above features are stored in the database per PB, per athlete. They are all used to train the final machine learning models. It is important to note that pace is NOT a feature: pace determines vdot directly, so including it would leak the target into the inputs.
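For step 2 above, here is a minimal sketch of the kind of per-athlete pace-to-heart-rate regressor that fills in missing heart rate. The simple linear model is illustrative, not necessarily the exact regressor used in the pipeline.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_pace_to_hr(pace_m_per_s: np.ndarray, heart_rate: np.ndarray) -> LinearRegression:
    """Fit a simple per-athlete regressor from pace to heart rate.

    Trained on the athlete's activities that do have heart rate data,
    then used to estimate heart rate for activities recorded without a monitor.
    """
    model = LinearRegression()
    model.fit(pace_m_per_s.reshape(-1, 1), heart_rate)
    return model

# Usage: estimate heart rate for an old, HR-less activity from its pace stream.
# model = fit_pace_to_hr(paces_with_hr, hrs)
# estimated_hr = model.predict(old_activity_paces.reshape(-1, 1))
```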

The machine learning model

This section will be rather underwhelming for machine learning experts. Based on the features chosen, the best performing model was a decision tree, closely followed by a vanilla XGBoost. Tweaking these models showed little gain relative to improving the data with some feature engineering, such as extrapolating heart rate from pace where heart rate data wasn't present.
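A minimal sketch of that comparison, assuming the feature table X and vdot targets y produced by the pipeline above (hyperparameters shown are illustrative defaults, not tuned values):

```python
import xgboost as xgb
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

def compare_models(X, y):
    """Cross-validated R^2 for a decision tree versus a vanilla XGBoost regressor."""
    tree = DecisionTreeRegressor(max_depth=5)
    boost = xgb.XGBRegressor()  # default ("vanilla") settings
    return {
        "decision_tree": cross_val_score(tree, X, y, scoring="r2", cv=5).mean(),
        "xgboost": cross_val_score(boost, X, y, scoring="r2", cv=5).mean(),
    }
```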

Intuitively, and based on model performance, the problem is near-linear.

The model is split into two, as mentioned very early on in the article:

  1. A model trained on vdot itself, showing what the fittest athletes do in the three months before their PBs.

  2. A model trained on the change in vdot, showing what athletes in general do when they improve the most.

The features of each (based on my small dataset of n=200) are below. I will update this as the database grows:

Features explaining vdot (fitness)

Here, we have a fairly healthy SHAP plot. In a simple system like this, a healthy plot is usually indicated by a clear separation between the red and blue dots.

Features explaining vdot
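For readers unfamiliar with these plots, here is a short sketch of how one can be generated with the shap library, assuming a fitted tree model and the feature table from earlier:

```python
import shap

def plot_feature_impacts(model, X):
    """Summary plot of how each feature pushes the predicted vdot up or down."""
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X)
    shap.summary_plot(shap_values, X)
```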

Keep in mind that these features are deemed important by a non-linear model (a random forest regression algorithm), so we can't read this as a straightforward checklist of important features for runners. I would take some features, like "r_proportion_yoga", with a pinch of salt. For instance, the model might have determined that if the athlete doesn't do much easy running, then their time spent doing yoga is exceptionally important. But since the R^2 of this model is fairly good, and given that we have an almost linear problem, it is reasonably safe to look at the features this way, or alternatively to run a linear XGBoost feature analysis.

There are several fairly safe conclusions we can make about how fast runners behave before their PBs:

Features explaining relative vdot (improvement in fitness)

As a quick recap, rather than looking at what fit athletes do, I trained a model on what athletes, in general, did to improve the most in terms of vdot. This could be an exceptionally powerful analysis with enough data, as it would take mere correlation out of the equation: we would see what the average Joe actually did to get better.

Here, the SHAP plot is not as healthy. There are many outlying features with a high impact on the model.

Features explaining improvement in vdot

The best R^2 for this model was 0.5. Therefore, with the current dataset size, even for the most explicable athlete performances, the model captures at most around 50% of the variance in what it takes to get faster.

Equally importantly, this model is less linear than the model predicting purely vdot. Some athletes might run PBs by being new to the sport, so the model may decide "Those who run less than 30km per week on average need to have run more than usual to achieve their PB, but those who run more than 120km per week need to run relatively less but run faster." However, let's take a look at what an intentional selection of the features says:

Visualisations

So how does this come in useful for an individual? With the above models, it is possible to compare a particular athlete's training to the ideal. This is shown below, with an explanation of how to read the diagram here.

The features are ordered by potential gain: the power of the feature in the model, multiplied by how relatively bad your training was in that aspect. Apart from the obvious “run more” and “run more often,” this model clearly instructs the athlete to “run more in heart rate Z2,” “run more hills” and “swim less.”

Visualisation
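A minimal sketch of this ordering, assuming per-feature importances from the model and a percentile describing where the athlete sits between the bottom 10% and top 10% of athletes (feature names and scoring details are illustrative):

```python
def rank_potential_gains(importances: dict[str, float],
                         athlete_percentiles: dict[str, float]) -> list[tuple[str, float]]:
    """Order features by model importance multiplied by the athlete's shortfall.

    athlete_percentiles: 0.0 means the athlete matches the worst athletes on
    this feature, 1.0 means they match the best.
    """
    gains = {
        feature: importance * (1.0 - athlete_percentiles.get(feature, 0.5))
        for feature, importance in importances.items()
    }
    return sorted(gains.items(), key=lambda kv: kv[1], reverse=True)

# Illustrative output: [('weekly_mileage', 0.21), ('runs_per_week', 0.17), ('time_in_z2', 0.12), ...]
```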

Productionising

Productionising at this small scale was fairly straightforward, and I went with an entirely Google Cloud setup. The result is that everything is blazing fast, apart from the Strava API queries, which are rate limited at around 1 per second.

What’s next?

There are many improvements that can be made, especially to the visualisations and the explanations thereof. However, first I would like to acquire more data in order to build a better model explaining what it takes to get faster. This more causal model will be infinitely more useful to individual athletes.
