SoBigData Articles

A Narrative Review for a Machine Learning Application in Sports: An Example Based on Injury Forecasting in Soccer

Technological advances over the last few decades have made it possible to record a huge quantity of data from athletes. Wearable devices, video analysis systems, tracking systems, and questionnaires are only a few examples of the tools currently used to record data in sports. These data can be used for scouting, performance analysis, and tactical analysis, and there is growing interest in using them to assess the risk of injuries. Analysing this volume of data calls for complex models and, for this reason, machine learning models are increasingly used in sports science. To describe the correct methodology for developing these models, Rossi et al. [1] provide an example machine learning application focused on injury prediction in soccer.

Injuries have a great impact on the sports industry, affecting both team performance and the club’s economic status. Injury-related absence from training and matches among top-league players results in a total cost (in terms of players’ recovery, rehabilitation, and salaries) of EUR 188 million per season. Beyond soccer, players of other team sports (e.g., rugby, Australian football, and American football) also sustain a high number of injuries (81 and 6 injuries per 1000 match hours and training hours, respectively) [2,3,4]. Given these figures, it is not surprising that injury forecasting and prevention are becoming prominent topics for researchers, managers, coaches, and athletic trainers.

In the last decade, several papers have proposed models to assess athletes’ injury risk. Because players’ health depends on many factors linked to the complex human response to external stimuli, reducing the training workload variables to a single feature (e.g., the acute:chronic workload ratio, ACWR, of one external training workload) does not give a complete picture of their status. This simplification hides the complexity of the training stimuli and prevents the detection of complex workload patterns linked to injuries [5,6,7]. For this reason, the literature on multidimensional injury prediction models is growing fast [8].
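To make the limitation of a single-feature summary concrete, the sketch below computes an ACWR as the mean load of the previous 7 days divided by the mean load of the previous 28 days. The window lengths, the function name `acwr`, and the daily distance values are illustrative assumptions, not taken from the review.

```python
from statistics import mean

def acwr(daily_load, day, acute_days=7, chronic_days=28):
    """Acute:chronic workload ratio for `day`, using only the days before it.

    Window lengths are assumptions for illustration (one week, four weeks).
    """
    if day < chronic_days:
        return None  # not enough history to fill the chronic window
    acute = mean(daily_load[day - acute_days:day])
    chronic = mean(daily_load[day - chronic_days:day])
    return acute / chronic if chronic > 0 else None

# Hypothetical daily total distance (km) for three players over 28 days.
steady = [8.0] * 28                 # constant moderate load
heavy = [16.0] * 28                 # constant load, twice as high
spike = [8.0] * 21 + [12.0] * 7     # sharp increase in the last week

acwr_steady = acwr(steady, 28)  # 1.0
acwr_heavy = acwr(heavy, 28)    # 1.0, identical despite double the load
acwr_spike = acwr(spike, 28)    # 12 / 9, about 1.33
```

The steady and heavy players receive the same ratio even though one trains twice as much, which is the kind of information a single aggregated feature can hide.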

One of the main issues in applying machine learning models to injury prediction is the distribution of the data. Injury datasets are not balanced between injury and non-injury observations: it was found that players incurred an injury in only 2% of the cases, while the remaining 98% are no-injury examples. Such a highly unbalanced dataset makes it difficult to train a machine learning model that can clearly detect the workload patterns that discriminate between injury and no-injury examples. Oversampling and undersampling approaches can mitigate this problem. These sampling strategies balance the dataset so that patterns in the training set stand out, enabling the machine learning model to achieve better predictive results.
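As a minimal sketch of the two sampling strategies mentioned above, the following code balances a toy training set with a 2% injury rate by random oversampling and random undersampling. The helper names, the label convention (1 = injury, assumed to be the minority class), and the toy data are assumptions for illustration; the review does not prescribe a specific resampling algorithm.

```python
import random

def split_by_class(y):
    """Return the row indices of injury (1) and no-injury (0) examples."""
    injured = [i for i, label in enumerate(y) if label == 1]
    healthy = [i for i, label in enumerate(y) if label == 0]
    return injured, healthy

def random_oversample(X, y, seed=0):
    """Duplicate injury rows at random until both classes have equal size."""
    rng = random.Random(seed)
    injured, healthy = split_by_class(y)
    extra = [rng.choice(injured) for _ in range(len(healthy) - len(injured))]
    keep = healthy + injured + extra
    rng.shuffle(keep)
    return [X[i] for i in keep], [y[i] for i in keep]

def random_undersample(X, y, seed=0):
    """Drop no-injury rows at random until both classes have equal size."""
    rng = random.Random(seed)
    injured, healthy = split_by_class(y)
    keep = rng.sample(healthy, len(injured)) + injured
    rng.shuffle(keep)
    return [X[i] for i in keep], [y[i] for i in keep]

# Toy training set: 1000 examples, 20 injuries (2%), one dummy feature each.
X_train = [[float(i)] for i in range(1000)]
y_train = [1 if i % 50 == 0 else 0 for i in range(1000)]

X_over, y_over = random_oversample(X_train, y_train)
X_under, y_under = random_undersample(X_train, y_train)

print(len(y_over), sum(y_over))    # 1960 examples, 980 injuries
print(len(y_under), sum(y_under))  # 40 examples, 20 injuries
```

In this sketch the resampling is applied to the training data only; the test set would keep its original 2% injury rate so that the evaluation reflects real conditions.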

Testing the predictive performance of injury prediction models can hide pitfalls. The most widely used validation strategy is cross-validation [8,9], even though it can suffer from overfitting. The problem arises when similar examples end up in both the training and test sets, usually because the data preprocessing makes training and test examples overlap. For example, the training set could contain the example for day n and the test set the example for day n + 1. In this case, the external workload features aggregated as acute (mean of the previous week) or chronic (mean of the previous month) are computed on almost exactly the same previous training sessions. This allows the machine learning model to predict the injury risk easily, because an almost identical example was used to train it. Simulating the evolution of the competitive season permits a better assessment of the accuracy of injury prediction models, even though a similar problem can still occur. However, in contrast to cross-validation, an evolutive scenario replicates what happens in the real world and, consequently, the overfitting problem can be considered marginal.
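The overlap described above can be quantified with a short sketch. It first measures how much of the acute and chronic windows two consecutive days share, then compares a shuffled split with a chronological one by counting the test days that have an adjacent day in the training set. The window lengths (7 and 28 days), the 200-day season, and the 80/20 split are assumptions chosen for illustration.

```python
import random

def window(day, length):
    """Indices of the `length` days that precede `day`."""
    return set(range(day - length, day))

def overlap(day_a, day_b, length):
    """Fraction of the window of day_a that is shared with day_b."""
    return len(window(day_a, length) & window(day_b, length)) / length

acute_overlap = overlap(100, 101, 7)     # 6 of 7 days shared
chronic_overlap = overlap(100, 101, 28)  # 27 of 28 days shared

def neighbours_in_train(train_days, test_days):
    """Share of test days whose previous or next day is a training example."""
    train = set(train_days)
    hits = sum(1 for d in test_days if d - 1 in train or d + 1 in train)
    return hits / len(test_days)

days = list(range(28, 228))  # 200 days with a full chronic window

# Shuffled split, as happens inside a shuffled cross-validation fold.
rng = random.Random(0)
shuffled = days[:]
rng.shuffle(shuffled)
random_train, random_test = shuffled[:160], shuffled[160:]

# Chronological split: train on the past, test on the future.
chrono_train, chrono_test = days[:160], days[160:]

random_leak = neighbours_in_train(random_train, random_test)
chrono_leak = neighbours_in_train(chrono_train, chrono_test)
print(f"shuffled split: {random_leak:.0%} of test days have a neighbour in training")
print(f"chronological split: {chrono_leak:.1%} of test days have a neighbour in training")
```

In the chronological split only the first test day borders the training period, which matches the remark that a similar problem can still appear in the evolutive scenario but is marginal.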

Finally, one of the most important issues detected in the literature is that almost none of the papers compare their predictive performance with that of a baseline model. The baseline is a dummy classifier that makes predictions using simple rules (e.g., always predict the no-injury class, always predict the injury class, or a stratified model that predicts in accordance with the distribution of injury examples). This baseline is useful for assessing whether the machine learning model is really able to discriminate among players with different injury risks. In particular, if the trained model shows an accuracy similar to or lower than that of the baseline models, the machine learning algorithm cannot be considered valid for injury prediction.
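The three baseline rules listed above can be sketched in a few lines. The labels below are a toy test set with the 2% injury rate mentioned earlier; the helper names and the use of recall alongside accuracy are assumptions added for illustration.

```python
import random

def accuracy(y_true, y_pred):
    """Fraction of examples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred):
    """Fraction of true injuries that were predicted as injuries."""
    injuries = [p for t, p in zip(y_true, y_pred) if t == 1]
    return sum(injuries) / len(injuries)

# Toy test set: 20 injuries and 980 no-injury examples (2% injury rate).
y_true = [1] * 20 + [0] * 980

# Baseline 1: always predict the no-injury class.
always_no = [0] * len(y_true)
# Baseline 2: always predict the injury class.
always_yes = [1] * len(y_true)
# Baseline 3: stratified, predicting injury at random at the observed rate.
rng = random.Random(0)
injury_rate = sum(y_true) / len(y_true)
stratified = [1 if rng.random() < injury_rate else 0 for _ in y_true]

for name, pred in [("always no injury", always_no),
                   ("always injury", always_yes),
                   ("stratified", stratified)]:
    print(f"{name}: accuracy={accuracy(y_true, pred):.3f}, "
          f"recall={recall(y_true, pred):.3f}")
```

On this toy set the "always no injury" rule already reaches 98% accuracy while detecting no injuries at all, so a trained model that reports a similar accuracy has not demonstrated any discriminative ability.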

Given the considerable confusion regarding the application of machine learning techniques in sports science, the narrative review by Rossi et al. [1] aims to provide a guideline for correctly building and evaluating injury prediction models. In particular, the authors describe in depth the strengths and limitations of each component needed to create a big data analytics framework for injury forecasting. Moreover, the paper describes the features that can be used to predict injuries, the possible preprocessing approaches, how to train and test the predictive models, and how to extract insights from interpretable and black-box models.

Author: Alessio Rossi

Exploratory: Sport Data Science

Sustainable Goal: Good health and well-being



  1. Rossi, A.; Pappalardo, L.; Cintia, P. A Narrative Review for a Machine Learning Application in Sports: An Example Based on Injury Forecasting in Soccer. Sports 2022, 10, 5.

  2. Kaplan, K.; Goodwillie, A.; Strauss, E.; Rosen, J. Rugby injuries: A review of concepts and current literature. Bull. NYU Hosp. Jt. Dis. 2008, 66, 86–93. 

  3. West, S.; Williams, S.; Cazzola, D.; Kemp, S.; Cross, M.; Stokes, K. Training Load and Injury Risk in Elite Rugby Union: The Largest Investigation to Date. Int. J. Sports Med. 2020, 42, 731–739.

  4. Williams, S.; Trewartha, G.; Kemp, S.; Stokes, K. A meta-analysis of injuries in senior men's professional Rugby Union. Sports Med. 2013, 43, 1043–1055.

  5. Bittencourt, N.; Meeuwisse, W.; Mendonça, L.; Nettel-Aguirre, A.; Ocarino, J.; ST, F. Complex systems approach for sports injuries: Moving from risk factor identification to injury pattern recognition—Narrative review and new concept. Br. J. Sports Med. 2015, 50, 1309–1314.

  6. Hulme, A.; Finch, C. From monocausality to systems thinking: A complementary and alternative conceptual approach for better understanding the development and prevention of sports injury. Inj. Epidemiol. 2015, 2, 31. 

  7. Quatman, C.; Quatman, C.; Hewett, T. Prediction and prevention of musculoskeletal injury: A paradigm shift in methodology. Br. J. Sports Med. 2009, 43, 1100–1107.

  8. Seow, D.; Graham, I.; Massey, A. Prediction models for musculoskeletal injuries in professional sporting activities: A systematic review. Transl. Sports Med. 2020, 3, 505–517.

  9. Vallance, E.; Sutton-Charani, N.; Imoussaten, A.; Montmain, J.; Perrey, S. Combining internal- and external-training-loads to predict non-contact injuries in soccer. Appl. Sci. 2020, 10, 5261.