Show students soccer analytics and they will love computer science
Soccer analytics is attracting increasing interest in academia and industry. Apart from a few sporadic attempts, soccer statistics have developed only in recent years, thanks to sensing technologies that provide high-fidelity data streams for every match. These data streams are mainly used by academic researchers, industry data scientists, and sports data journalists to extract meaningful knowledge and tell stories. However, these data are still little known to the public, and especially to young students who may want to pursue a career in sports analytics.
The purpose of our seminar “Train your algorithm”, held during the event “Incontra l'Informatica” organized by the Department of Computer Science of the University of Pisa on April 16th, 2021, was specifically to involve high-school students in a laboratory focused on soccer data analytics. We introduced the use of the Python programming language for sports analytics (the code used is available at https://jovian.ai/jonpappalord/data-exploration) and showed how to explore a dataset of soccer-logs (https://www.nature.com/articles/s41597-019-0247-7), which describe the events that occur during a match and are collected through proprietary tagging software.
Students' involvement was amazing: they showed a huge interest in sports analytics and asked several questions about data acquisition, data analysis, and work opportunities in the sports field. In the following, we summarize the most frequently asked questions and our answers, hoping that they will be useful to all students who could not attend the event. Our feeling was that sports analytics is a good “Trojan horse” to attract students to computer science, artificial intelligence, and data analytics.
QUESTIONS AND ANSWERS
1. How are the soccer-log data collected?
The data are collected by specialized companies, which sell them to clubs, institutions, journalists, and broadcasters. In particular, the data we show in this lesson have been collected by Wyscout, a leading company in the soccer industry that connects soccer professionals worldwide and supports more than 50 soccer associations and more than 1,000 professional clubs around the world. Data collection is performed by expert video analysts (the operators), who are trained and focused on data collection for soccer, using proprietary software (the tagger). The tagger has been developed and improved over several years and is constantly updated to keep improving its performance. To guarantee the accuracy of data collection, the events in a match are tagged from the match video by three operators: one operator per team and one operator acting as the supervisor responsible for the output of the whole match. Optionally, for near-live data delivery, a team of four operators is used, with the fourth speeding up the collection of complex events that need additional, specific attributes or a quick review.
The tagging of a match consists of three main steps.
- Step 1: setting formations. At the beginning of the match, an operator sets the teams’ starting formations, the positions of the players on the pitch, and their jersey numbers. The formation of a team consists of the list of players in the starting lineup and the list of players on the bench.
- Step 2: event tagging. For each ball touch in the match, the operator selects one player and creates a new event on the timeline. The operator then adds the type (e.g., pass, duel, shot) and subtype (e.g., a duel can be aerial or ground) of the event using a special custom keyboard that lets operators insert events and data in a streamlined way. Finally, the operator adds the coordinates on the pitch and all the additional attributes of the event. These attributes depend on the event type, for example whether a pass is high or low, the foot used, or the dribbling side. When a player shoots on goal, the system asks the operator to fill in a shot-specific module that records where the shot ends (on goal, off target, or on the post, together with the exact position).
- Step 3: quality control. After the tagging, each match undergoes a quality-control procedure consisting of two main steps. The first step is automatic: an algorithm catches the majority of the errors made by operators, considerably reducing the margin of error. For example, the algorithm matches the events tagged by the two team operators to crosscheck that events involving both teams, such as duels, were collected with the same positioning and interpretation. Similarly, the algorithm suggests events missed by the operators and searches for impossible combinations of event sequences. The second step is manual and supervised by quality controllers. It mainly consists of an in-depth check carried out once the match is completed. Going through each event of some sample matches, the controller can inspect and, if necessary, correct any entered parameter. Sample matches for quality control are chosen by another algorithm in order to guarantee well-distributed and statistically meaningful coverage with respect to the kind and number of analyzed matches.
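To make the automatic crosscheck of Step 3 more concrete, here is a minimal Python sketch. It is not Wyscout's actual algorithm: the Event record, the unmatched_duels function, the time and distance tolerances, and the assumption that both operators tag positions in the same 0-100 coordinate frame are all hypothetical choices made for illustration. The idea is simply that a duel tagged for one team should have a counterpart, close in time and position and with the same subtype, among the events tagged for the other team.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class Event:
    """Hypothetical record for one tagged event (not the real tagger format)."""
    second: float   # time from kick-off, in seconds
    team: str
    player: str
    type: str       # e.g. "pass", "duel", "shot"
    subtype: str    # e.g. "aerial" or "ground" for a duel
    x: float        # position along the pitch length, 0-100 (assumed shared frame)
    y: float        # position along the pitch width, 0-100 (assumed shared frame)


def unmatched_duels(own, other, max_dt=2.0, max_dist=10.0):
    """Return the duels in `own` that have no close counterpart in `other`.

    Two duels match when they share the subtype, are at most `max_dt`
    seconds apart and at most `max_dist` pitch units apart. Both
    tolerances are illustrative assumptions.
    """
    other_duels = [e for e in other if e.type == "duel"]
    flagged = []
    for a in own:
        if a.type != "duel":
            continue
        matched = any(
            a.subtype == b.subtype
            and abs(a.second - b.second) <= max_dt
            and hypot(a.x - b.x, a.y - b.y) <= max_dist
            for b in other_duels
        )
        if not matched:
            flagged.append(a)
    return flagged


# Made-up events tagged by the two team operators.
home = [
    Event(10.0, "home", "H9", "duel", "aerial", 50, 50),
    Event(30.0, "home", "H4", "duel", "ground", 20, 20),
    Event(40.0, "home", "H8", "pass", "simple", 10, 10),
]
away = [
    Event(10.5, "away", "A5", "duel", "aerial", 52, 49),
]

to_review = unmatched_duels(home, away)
print([e.player for e in to_review])  # prints ['H4']
```

In this toy example the ground duel of player H4 has no counterpart on the other team's timeline, so it would be flagged for review by the supervisor.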
2. Are these data publicly available?
Free Wyscout data describing seven prominent men's soccer competitions (the Spanish, Italian, English, German, and French first divisions for the 2017/2018 season, the World Cup 2018, and the European Championship 2016) are provided in a FigShare repository and are described in a paper published in Nature Scientific Data (https://www.nature.com/articles/s41597-019-0247-7). To the best of our knowledge, this is the largest free dataset of soccer-log data. More recent data can be purchased from several companies, e.g., Wyscout and Opta.
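As a rough idea of what a first exploration of these files might look like, the following Python sketch counts events by type and extracts shot locations. It assumes that each competition's events are stored as a JSON list of records with fields such as eventName, subEventName, eventSec, and positions; these names reflect our reading of the dataset description and should be checked against the downloaded files. The helper names and the sample records are made up for illustration.

```python
import json
from collections import Counter


def load_events(path):
    """Load one events file (path is whatever file you downloaded)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def count_event_types(events):
    """Count how many events of each type appear in the list."""
    return Counter(e["eventName"] for e in events)


def shot_locations(events):
    """Return the (x, y) start position of every shot."""
    return [
        (e["positions"][0]["x"], e["positions"][0]["y"])
        for e in events
        if e["eventName"] == "Shot"
    ]


# Made-up sample records that mimic the assumed structure.
sample_events = [
    {"eventName": "Pass", "subEventName": "Simple pass", "eventSec": 2.7,
     "positions": [{"x": 49, "y": 49}, {"x": 31, "y": 78}]},
    {"eventName": "Duel", "subEventName": "Air duel", "eventSec": 6.1,
     "positions": [{"x": 51, "y": 75}, {"x": 35, "y": 71}]},
    {"eventName": "Pass", "subEventName": "High pass", "eventSec": 9.4,
     "positions": [{"x": 35, "y": 71}, {"x": 80, "y": 55}]},
    {"eventName": "Shot", "subEventName": "Shot", "eventSec": 12.0,
     "positions": [{"x": 88, "y": 52}, {"x": 100, "y": 50}]},
]

counts = count_event_types(sample_events)
shots = shot_locations(sample_events)
print(counts.most_common())
print(shots)
```

With a real file, sample_events would be replaced by the list returned by load_events.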
3. Is it possible to compute the Expected goals (xG) from these data?
Yes! xG is a predictive model used to assess the likelihood of scoring for every shot in a game. For every shot, the xG model calculates the probability of scoring based on event parameters derived from soccer-logs: the shot location, the assist location, whether the shot is taken with the foot or the head, the assist type, whether a dribble past a player or the goalkeeper immediately precedes the shot, whether the shot comes from a set piece, whether it occurs during a counterattack or a transition, and the tagger's assessment of the danger of the shot. These parameters (plus other, more technical ones) are used to train the xG model on historical soccer-logs and to predict the probability of a shot becoming a goal.
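The idea can be sketched with a deliberately simplified model. The Python code below is not the xG model described above: it uses only two features (distance from the goal and whether the shot is a header), a plain logistic regression trained by gradient descent, and a handful of made-up shots. The pitch size of 105 x 68 meters, the 0-100 coordinate frame with the goal at (100, 50), and all function names are assumptions chosen for illustration.

```python
import math


def features(shot):
    """Turn a shot into [bias, scaled distance to goal, header flag]."""
    dx = (100 - shot["x"]) * 1.05   # meters, assuming a 105 m long pitch
    dy = (50 - shot["y"]) * 0.68    # meters, assuming a 68 m wide pitch
    return [1.0, math.hypot(dx, dy) / 10.0, 1.0 if shot["header"] else 0.0]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(shots, goals, lr=0.1, epochs=2000):
    """Fit logistic-regression weights with batch gradient descent."""
    X = [features(s) for s in shots]
    w = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        grad = [0.0, 0.0, 0.0]
        for xi, yi in zip(X, goals):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j in range(len(w)):
                grad[j] += err * xi[j]
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w


def xg(w, shot):
    """Predicted probability that the shot becomes a goal."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, features(shot))))


# Toy, made-up training shots (1 = goal, 0 = no goal); not real data.
train_shots = [
    {"x": 95, "y": 50, "header": False},
    {"x": 92, "y": 45, "header": False},
    {"x": 90, "y": 55, "header": True},
    {"x": 88, "y": 50, "header": False},
    {"x": 80, "y": 40, "header": False},
    {"x": 75, "y": 60, "header": False},
    {"x": 70, "y": 50, "header": False},
    {"x": 94, "y": 52, "header": True},
]
train_goals = [1, 1, 0, 1, 0, 0, 0, 1]

weights = train(train_shots, train_goals)
xg_near = xg(weights, {"x": 96, "y": 50, "header": False})
xg_far = xg(weights, {"x": 65, "y": 50, "header": False})
print(round(xg_near, 3), round(xg_far, 3))
```

Even this toy model learns that shots taken closer to the goal are more likely to be scored. A realistic xG model would be trained on many thousands of historical shots and on the richer set of parameters listed above.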
4. Are data scientists in demand among soccer teams?
In recent years, the leading European clubs have publicly declared their interest in sports analytics, and in particular in the application of data science and machine learning to the world of sports. So far, data analysis has been concentrated in the top clubs in Europe: only clubs with big budgets have created data analysis departments that work with the performance analysts and scouting departments. However, more and more clubs with smaller budgets are now taking their first steps in that direction.
Article written by: Alessio Rossi and Luca Pappalardo
SoBigData++ Exploratory: Sports Data Science