Can my Frankenstein of a time series regression model — inspired by Prophet — compete with the real deal?
In what’s likely to be the last installment in my journey to build on Meta’s great forecasting package Prophet, I’ll be taking a look at how my homemade version stacks up against the original.
It’s going to be a quick one: we’ll first take a look at the data before visualising how the two approaches forecast on out-of-time data. We’ll then more formally determine which is the better forecaster using some metrics before discussing whether or not it was a fair comparison at all.
Let’s get cracking.
Aside: I mention other installments — two other articles, to be precise. The first covered feature engineering for time series, based on Prophet’s approach, and can be found here:
In a sequel, I tackle the model build using our shiny new features. That lives here:
Many of the topics discussed here today are covered in more detail in the linked articles — worth a read if you’re one for the fine print.
We’re using UK road traffic accident data¹, summarised to a monthly count.
Image by author
We see a few features in the time series:
- A strong downward trend throughout the series
- A change in the rate of decrease somewhere between 2012 and 2014
- Fairly strong seasonality in the early part of the series
- Potentially variable seasonal effects, particularly toward the end of the series.
The aim of the game
We have two models — we’ll refer to our homemade Frankenstein model as the LASSO model, and Meta’s Prophet as, well… Prophet.
For each of the models, we’re going to produce out-of-time forecasts. This essentially means fitting to a subset of our monthly count data and then forecasting 12 months into the future.
Each forecast will be compared to actual observed data; whichever model gets closest — on average — wins.
Aside: this is essentially a cross-validation test. If you’re familiar with standard cross-validation approaches but haven’t used them in a time series analysis, you might find (2) below quite useful.
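To make the setup concrete, here’s a minimal sketch of the expanding-window splits described above. The series and the function name are illustrative — the assumption is simply monthly data, a 12-month forecast horizon, and a training set that grows by one year per fold:

```python
import pandas as pd

# Hypothetical monthly accident counts indexed by month start.
idx = pd.date_range("2005-01-01", "2015-12-01", freq="MS")
counts = pd.Series(range(len(idx)), index=idx)

def expanding_window_splits(series, horizon=12, first_test_year=2010):
    """Yield (train, test) pairs: fit on everything before 1 January of
    the test year, then forecast the following `horizon` months."""
    last_year = series.index[-1].year
    for year in range(first_test_year, last_year + 1):
        cutoff = pd.Timestamp(year=year, month=1, day=1)
        train = series[series.index < cutoff]
        test = series[cutoff : cutoff + pd.DateOffset(months=horizon - 1)]
        if len(test) == horizon:  # skip incomplete final folds
            yield train, test

splits = list(expanding_window_splits(counts))
```

Each fold mimics a real deployment: the model only ever sees data from before the period it’s asked to forecast.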
We can visualise the out-of-time forecasts from each of the models — LASSO in red, Prophet in blue — and compare them against the realised actuals.
We should remember that each of the forecasts was built using all of the data prior to the forecast period. As an example, the forecast for 2010 was built using data up to and including 2009.
Image by author
That’s a pretty clear picture: with the exception of one year (2013), Prophet looks to be a bit off the mark.
What is interesting to note is the similarity in the forecast patterns created by the two approaches:
- Both models produce lower forecasts over time — i.e. they reflect the overall downward trend.
- Both models have intra-annual increases and mid-year spikes — i.e. the forecasts produce a similar seasonality pattern.
How far — exactly — are the two models from reality? We’ll need to look at some performance metrics to find out.
We’ll measure performance using the usual suspects — mean absolute error (MAE), mean absolute percentage error (MAPE), and root mean squared error (RMSE) — as well as a newcomer (to me at least): MASE.
The Mean Absolute Scaled Error
The mean absolute scaled error (MASE) is a “generally applicable measurement of forecast accuracy without the problems seen in the other measurements”³ and “can be used to compare forecast methods on a single series and also to compare forecast accuracy between series”³.
Mathematically, the MASE is the ratio of out-of-time forecast error to the in-sample forecast error produced by a naive forecasting approach. Since we’re using monthly data, I’ve taken the naive forecast prediction to be the value at the same point in time in the previous year — e.g. the forecast for May 2012 is simply the value for May 2011. Very naive.
When comparing forecasting methods, the method with the lowest MASE is the preferred method³.
Important to note is that MASE > 1 implies that the forecast method performs poorly relative to a naive forecast.
Aside: I’ve used the implementation described in the linked article — i.e. the “error” is the mean absolute error. I believe that we can use other measures of performance in place of the MAE — e.g. MAPE — as long as the error measure is used consistently in the scaled error calculation.
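The MASE calculation as described above can be sketched in a few lines. This follows the seasonal-naive setup from the article — the denominator is the in-sample MAE of a "same month last year" forecast — though the function itself is my own illustration, not a library call:

```python
import numpy as np

def mase(actual, forecast, train, m=12):
    """Mean absolute scaled error.

    Scales the out-of-time forecast MAE by the in-sample MAE of a
    seasonal naive forecast: each training point is "predicted" by the
    value m steps earlier (m=12 for monthly data, i.e. last year's value).
    """
    actual, forecast, train = map(np.asarray, (actual, forecast, train))
    mae_forecast = np.mean(np.abs(actual - forecast))
    mae_naive = np.mean(np.abs(train[m:] - train[:-m]))
    return mae_forecast / mae_naive
```

A value below 1 means the model beats the naive "copy last year" forecast on average; above 1 means it loses to it.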
Let’s summarise out-of-fold and overall average model performance using the metrics we’ve described:
Image by author
That’s a fairly comprehensive win for the LASSO model, with Prophet only outperforming in small pockets.
Knives and gun fights?
As we’ve seen, it’s not pretty reading if you’re a Prophet fan: Meta’s tool manages to snatch a few folds (metric dependent) to avoid a complete whitewash. Impartial commentators might suggest a return to the clubhouse to re-evaluate tactics.
While the result isn’t great for Prophet, there are a few reasons why performance like this can be expected.
The LASSO model uses features that have been specifically engineered for this particular time series. The set of input features available to it is essentially a superset of what’s available to Prophet with a little extra on the side.
Additionally, some of the features are subtly different in the LASSO model. For instance, features describing potential change points are not as constrained in the LASSO as they are in the Prophet model.
Think of it as trying to out-guess someone while knowing less than they do, or knowing slightly different things. Not so easy.
The out-of-fold data is not as “unseen” as I’ve made it out to be.
In a previous article we covered the parameterisation of the LASSO model: how we use out-of-fold data to select the strength of regularisation which optimised the model’s ability to forecast. In that sense the LASSO model has been tuned to forecast well over each cut of the data while the Prophet model has been thrown straight out of the box and into the deep end.
In “normal” hyperparameter optimisation exercises, we can usually expect to see performance increases of about 1–2%; the performance increase in a time series context is likely much greater as “out-of-fold” really is “out-of-time”.
Time to call it a day with Prophet then?
Not so fast… this series of articles has certainly highlighted a few things — let’s talk through a few of them.
Out of the box, Prophet works incredibly well. Although it can indeed be beaten, it takes a bit of work to do so — much more than the 10 lines of code you need to get Prophet up and forecasting.
The interpretability of the LASSO model is far superior to what’s available from Prophet. Yes, Prophet gives us estimates of uncertainty for forecasts but we can’t tell what’s actually driving the predictions. I’m not even sure we can put Prophet through SHAP.
I’ve also found that Prophet isn’t so straightforward to tune. Maybe it’s because I’m not an advanced user of the package, or maybe it’s because of the roundabout way in which you have to tune parameters. This is certainly not the case with the LASSO model.
While the LASSO approach arguably represents an improvement in performance and interpretability, perhaps what we really need is to use both approaches: one as a litmus test for the other. For example, if a “naive” Prophet model produces sensible forecasts, it might be reasonable to replicate the LASSO approach (the “False Prophet”) to maximise performance.
That’s it from me. I hope you’ve enjoyed reading this series of articles as much as I’ve enjoyed writing them.
As always, please let me know what you think — I’m really interested to hear about your experiences with Prophet or with modelling time series in different ways.
Until next time.
References and useful resources
1. https://roadtraffic.dft.gov.uk/downloads — used under the Open Government Licence (nationalarchives.gov.uk)
2. Let’s Do: Time Series Cross-Validation | Python in Plain English
3. Mean absolute scaled error — Wikipedia
False Prophet: Comparing a Regression Model to Meta’s Prophet was originally published in Towards Data Science on Medium.