Kaggle forecasting competitions: An overlooked learning opportunity

Casper Solheim Bojer, Jens Peder Meldgaard

Research output: Contribution to journal, Journal article, Research, peer-review

115 Citations (Scopus)
189 Downloads (Pure)

Abstract

We review the results of six forecasting competitions based on the online data science platform Kaggle, which have been largely overlooked by the forecasting community. In contrast to the M competitions, the competitions reviewed in this study feature daily and weekly time series with exogenous variables, business hierarchy information, or both. Furthermore, the Kaggle data sets all exhibit higher entropy than the M3 and M4 competitions, and they are intermittent.

In this review, we confirm the conclusion of the M4 competition that ensemble models using cross-learning tend to outperform local time series models and that gradient boosted decision trees and neural networks are strong forecast methods. Moreover, we present insights regarding the use of external information and validation strategies, and discuss the impacts of data characteristics on the choice of statistical or machine learning methods. Based on these insights, we construct nine ex-ante hypotheses for the outcome of the M5 competition to allow empirical validation of our findings.
Original language: English
Journal: International Journal of Forecasting
Volume: 37
Issue number: 2
Pages (from-to): 587-603
Number of pages: 17
ISSN: 0169-2070
Publication status: Published - 2021

Keywords

  • Benchmarking
  • Business forecasting
  • Forecast accuracy
  • Forecasting competition review
  • M competitions
  • Machine learning methods
  • Time series methods
  • Time series visualization
