Real Time Predictions in the TIBCO F1™ Simulator - a Data Science Story
For the last 3 seasons, TIBCO has been in a (very successful!) partnership with the Mercedes-AMG Petronas Motorsport team, with the team going on to win each season. As we enter our 4th year together, we continue to strengthen that working relationship. As a massive Formula 1™ fan of over 25 years, I was very excited to get the opportunity to apply data science to one of my lifelong passions. While the secrets of success in Formula 1™ are closely guarded, one data science story we can tell is how we brought data science into our very own TIBCO F1™ Simulator. The simulator acts as TIBCO's digital twin: something we can learn from, and use to simulate and predict future performances.
You can watch this TIBCO webinar, where I cover all the content in this blog as well as an overview of how we work with analytics and data science in our partnership with the Mercedes-AMG Petronas Motorsport team, here: https://www.tibco.com/resources/demand-webinar/data-science-formula-one-and-beyond
The TIBCO F1™ Simulator
At many of the events TIBCO attends, we bring our F1™ Simulator, which uses the F1™ 2019 official FIA game. This is the game used for the official F1™ e-sports championship, i.e. the virtual racing series that many of the official F1™ teams enter with their own dedicated e-sports teams. With the realism of these games reaching new levels, some of the drivers who take part in the e-sports series are finding opportunities in real-life racing teams, and e-sports is becoming a complementary route for any prospective driver that until recently was not available. TIBCO uses a combination of streaming technologies such as Flogo, TIBCO StreamBase and TIBCO Spotfire to provide a unique experience to those who visit us at events. Contestants race in our simulator, sometimes even getting to sit in a Formula 1™ car. Using our technologies, we stream their performance and statistics live in Spotfire, all as part of a fastest lap contest with prizes going to the fastest drivers.
The Race, Driver and Car Data
There is a wealth of detailed data that comes from the simulator, all in real-time. In total, there are over 130 columns, and a full row of data is output approximately 10 times a second. This means that for one lap we receive ~105,000 data points, depending on the track and driver. The data covers basics such as current speed, throttle and steering, but also many more detailed aspects such as:
- Position and local velocity on the track i.e. x and y coordinates, angle of direction etc.
- Wheel, suspension and tyre data
- Temperature data i.e. of tyres and engine
- Energy Recovery System (ERS) data
- Damage to wings, tyres and components
You can view a full list here: https://forums.codemasters.com/topic/44592-f1-2019-udp-specification/
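As a rough check on the data volume quoted above, the arithmetic can be sketched in a few lines of Python. The 130 columns and 10 rows per second come from the text; the 81-second lap is an assumed, illustrative lap time and not a figure from the event.

```python
# Rough data-volume arithmetic for one simulator lap.
COLUMNS_PER_ROW = 130  # "over 130 columns" (from the text)
ROWS_PER_SECOND = 10   # approximate output rate (from the text)


def data_points_per_lap(lap_seconds: float) -> int:
    """Approximate number of individual values received during one lap."""
    rows = round(lap_seconds * ROWS_PER_SECOND)
    return rows * COLUMNS_PER_ROW


# ASSUMPTION: an 81-second lap, chosen only for illustration.
print(data_points_per_lap(81.0))  # 105300
```

A slower driver simply produces more rows, which is why the total depends on the track and driver.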
The Data Science Challenge
One of the greatest challenges in racing is to anticipate the driver's performance on any given lap from as early on as possible. By being able to model and predict this, we could determine their likely position after the lap, for example if the driver is trying to do a one-off fast lap during qualifying for a race. However, we could also then learn how to optimise lap times. For example, models could be used to determine the most important factors around a given circuit in terms of racing line, car setup, and how the driver tackles each track section. Each track in Formula 1™ has a unique set of corners and straights, so a good model could even provide track-specific insights. Our challenge was therefore to build machine learning models that could learn and anticipate a driver's final lap time on each lap from the beginning of the lap to the end, all in real-time as they drive their lap. Of course, many things can happen as a driver races a lap, such as making mistakes and taking different approaches to each and every track section. We found that through the use of machine learning, our models were able to anticipate some of these lap events.
For this challenge, we used our collaborative data science tool Team Studio with Spotfire. There we can gather all our data and assets in one place and build data science workflows as a team, collaborating on work and ideas.
We used the data gathered at last year's BigDataLDN event in London as our training and test data. At this event over 200 people raced the Interlagos Brazil circuit in our simulator. This was run as a competition with a live leaderboard over the two days to see who could produce the fastest lap. Every person could view their stats and predicted lap time all in real-time as they raced.
Before undertaking any data science task, we should always get a good understanding of the data available in terms of quality, coverage, limitations and potential. To do this we brought all the event data into TIBCO Spotfire for analysis. First we wanted to analyse different drivers' performance in how they tackled the track. For example, in the dashboard below we could select data from drivers approaching a corner, during corner entry and exit, and then analyse the difference between those who posted fast lap times and those who were slower:
Figure 1 - analysing all laps raced at the event and comparing racing line, and statistics of drivers
From this analysis, we could see that the best drivers all brake much harder and earlier into the corner highlighted above: notice the best laps are braking from 65% to 74% whereas slower laps are 40% or lower at the same point on the track. They then turn tighter into the apex of the corner than slower drivers, and are able to apply heavy throttle i.e. acceleration out of the corner by being able to 'point' the car much straighter out of the corner. They do all this despite arriving at the corner approach (the section highlighted above) much faster than other drivers. In contrast, slow drivers brake late, often take a wider entry into the corner and apply throttle and braking unevenly. Another interesting finding was that the very fastest drivers all used a larger steering angle than other fast drivers, suggesting a key insight for drivers who are not quite the fastest yet.
Now we have a feel for our data and are already gaining insights into how to drive fast (or slow!) without even being in the simulator. The next step is to start assessing features in these data to use for predicting lap times.
To run models in real-time we will need to capture various aspects of how a driver is behaving and performing. However, each row of data transmitted from the simulator only describes their actions at a single instant. On its own, this would not be enough to understand the performance of the car or driver, or their driving style. To solve this, a common technique in data science is to use feature generation. Through generic manipulation of data, we can generate features that are of use for predictions. For our simulator, we determined that a rolling 3 second average of each statistic the simulator provides, as well as cumulative sums, were useful. We could then manually explore the variables available to us and select the best ones using Spotfire. In the dashboard below, we can select each lap variable (from the menu on the left), and compare fast lap times (purple) vs slow laps (white/grey) as a lap progresses. We can also use boxplots to view the difference in medians and distributions from fast laps (on the left of the boxplot below) vs. slow laps (on the right of the boxplot below).
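To make the feature generation step concrete, here is a minimal Python sketch of a rolling 3 second average and a cumulative sum over a stream of telemetry rows. It is not the StreamBase or Team Studio implementation used at the event: the function name, the "brake" column and the generated feature names are hypothetical, and the 30-row window assumes the approximate 10 rows per second mentioned earlier.

```python
from collections import deque
from typing import Dict, Iterable, Iterator

ROWS_PER_SECOND = 10  # approximate simulator output rate (from the text)
WINDOW_SECONDS = 3    # rolling window length (from the text)


def add_rolling_features(
    rows: Iterable[Dict[str, float]], column: str
) -> Iterator[Dict[str, float]]:
    """Yield each row with a rolling mean and a cumulative sum for `column`.

    The "<column>_roll3s" and "<column>_cumsum" names are hypothetical.
    """
    # deque(maxlen=...) drops the oldest value automatically once it is full.
    window = deque(maxlen=ROWS_PER_SECOND * WINDOW_SECONDS)
    cumulative = 0.0
    for row in rows:
        value = row[column]
        window.append(value)
        cumulative += value
        enriched = dict(row)  # copy so the input row is left untouched
        enriched[f"{column}_roll3s"] = sum(window) / len(window)
        enriched[f"{column}_cumsum"] = cumulative
        yield enriched


# Hypothetical three-row stream with a single "brake" column.
stream = [{"brake": 0.0}, {"brake": 0.5}, {"brake": 1.0}]
for enriched_row in add_rolling_features(stream, "brake"):
    print(enriched_row)
```

Because the function is a generator, it can enrich rows one at a time as they arrive, which suits a real-time feed. Until 3 seconds of data have arrived, the average is taken over whatever rows are available.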
Figure 2 - Reviewing different variables to assess how well they differentiate between fast and slow lap times as a lap progresses
Selecting features manually can be effective. However, it may not surface all the relations between variables, as we only view one variable at a time. It is also quite labour intensive. To solve this we use a workflow built in TIBCO Data Science - Team Studio to automatically select the best features for us, using a random forest model's ability to assess feature importance. Note that we are splitting the lap into 3 sectors for this task. Each Formula 1™ track contains 3 sectors, and each sector covers a unique section of the track. This allows us to specialise models for each sector rather than modelling the whole lap at once:
Figure 3 - A variable importance workflow in Team Studio - finding the most important variables to predict lap times
We can integrate and control this workflow through Spotfire using the Spotfire Team Studio Big Data Function. In this way we can control which track we use, how many laps to learn from and how deep the random forest can dive while visualising the results in Spotfire:
Figure 4 - Using machine learning models to derive feature importance per sector of the track
Notice how a different set of features is selected as important for each sector, highlighting the different properties of these areas of track. However, there are also commonalities across the sectors, showing that certain features have high predictive power for all areas of the track. For example, x and y, which precisely describe the position of the car on the track at any given moment, are two such features.
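The idea of ranking features separately for each sector can be illustrated with a short Python sketch. To be clear, this does not reproduce the random forest importance computed in Team Studio: as a much simpler stand-in it ranks features by their absolute Pearson correlation with lap time, which only captures linear, one-variable-at-a-time relationships. The column names ("sector", "speed", "brake", "lap_time") and the sample rows are hypothetical.

```python
from collections import defaultdict
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_features_by_sector(
    rows: Sequence[Dict[str, float]],
    features: Sequence[str],
    target: str = "lap_time",
) -> Dict[float, List[Tuple[str, float]]]:
    """Rank features per sector by |correlation| with the target, strongest first."""
    by_sector = defaultdict(list)
    for row in rows:
        by_sector[row["sector"]].append(row)

    rankings = {}
    for sector, sector_rows in sorted(by_sector.items()):
        y = [r[target] for r in sector_rows]
        scores = [
            (name, abs(pearson([r[name] for r in sector_rows], y)))
            for name in features
        ]
        rankings[sector] = sorted(scores, key=lambda item: item[1], reverse=True)
    return rankings


# Hypothetical rows: speed matters most in sector 1, braking in sector 2.
sample = [
    {"sector": 1, "speed": 200.0, "brake": 0.1, "lap_time": 80.0},
    {"sector": 1, "speed": 190.0, "brake": 0.3, "lap_time": 82.0},
    {"sector": 1, "speed": 180.0, "brake": 0.2, "lap_time": 84.0},
    {"sector": 2, "speed": 150.0, "brake": 0.5, "lap_time": 80.0},
    {"sector": 2, "speed": 150.0, "brake": 0.7, "lap_time": 82.0},
    {"sector": 2, "speed": 150.0, "brake": 0.9, "lap_time": 84.0},
]
for sector, ranked in rank_features_by_sector(sample, ["speed", "brake"]).items():
    print(sector, ranked)
```

A tree-based importance measure, as used in the actual workflow, can also pick up non-linear effects and interactions between variables, which a correlation ranking like this one will miss.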
Now that we have assessed which features to model with, we can again use Team Studio to build multiple machine learning models to predict our lap times (using these features). As Team Studio is a collaborative tool, multiple analysts and data scientists are able to work and collaborate over the same data science workflows. In this example, I worked on two models available directly through Team Studio: linear regression and gradient boosted trees. Meanwhile my colleague and fellow data scientist, Noora Husseini, worked on a Keras neural network using Python in a Jupyter notebook. Team Studio not only provides drag and drop functionality to build data science models (via Spark), it also allows for R and Python code to be run. This means we can get the best of both worlds: a no-code/low-code experience coupled with the flexibility of coding in Python:
Figure 5 - Our model building workflow where multiple data scientists have collaborated to produce models for review
Assessing Model Performance
Using the workflow shown above in Figure 5, we built 3 different models per sector (linear regression, gradient boosted trees and Keras neural network). We now want to assess each of these in terms of predictive performance. Again we can use Spotfire in combination with Team Studio to provide this analysis. We can use simple model comparison statistics such as R-squared and root mean square error to get a sense of where the models are in terms of performance. However, to truly understand a model we need to delve into the predictions in much greater detail, and even look at individual predictions. Below is a summary of the 3 models' predictions split by the 3 sectors. Displayed are each model's predicted time to complete a lap (y-axis) vs. the actual time to complete the lap (x-axis). On the right, all the residuals from each model are displayed (y-axis) vs. the actual time to complete a lap (x-axis). This gives us a visual overview of all predictions throughout a lap, for all laps vs. the true data.
Figure 6 - Comparing predicted vs. actual for all models across sectors as well as analysing model residuals
Three stories stand out from the analysis shown above:
- The models show a large improvement from sector 1 to 2, and sector 2 to 3. This makes sense for multiple reasons: we are predicting time to complete the lap. This means as the lap progresses we know more of the actual lap time which contributes to the final lap time i.e. if we are 70% into the lap, we only need to predict the final 30% of lap time. The other reason we may expect this improvement in performance is that the further into the lap we are, the more information we have on how that driver behaves and has performed so far.
- The linear regression and Keras neural network show similar performance across the sectors. However, the gradient boosted trees outperform the other models in sectors 1 and 3, while being close to the best in sector 2. Since it is easy to run multiple models through Team Studio in the same workflow, we are able to rapidly assess more machine learning methods. This would allow us to implement linear regression for sector 2 but gradient boosted trees for the other sectors, for example. We could also combine the models into some form of weighted prediction should we choose.
- Analysing the residuals, i.e. the difference between the predicted and the actual time to complete the lap, shows that the gradient boosted trees' average residual is closest to 0 (i.e. perfect predictions), and they have a reasonable spread vs. the other models.
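For readers who want to reproduce the summary statistics mentioned above, here is a minimal Python sketch of R-squared, root mean square error and the average residual. The function name and the sample numbers are hypothetical, and the residual is taken as predicted minus actual, so a negative value means the model predicted a faster time than was achieved.

```python
from math import sqrt
from typing import Dict, Sequence


def regression_summary(
    actual: Sequence[float], predicted: Sequence[float]
) -> Dict[str, float]:
    """Return R-squared, RMSE and mean residual (predicted minus actual)."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("actual and predicted must be non-empty and equal length")

    n = len(actual)
    residuals = [p - a for p, a in zip(predicted, actual)]
    mean_actual = sum(actual) / n
    ss_res = sum(r * r for r in residuals)                  # unexplained variation
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)    # total variation

    return {
        # R-squared is undefined when every actual value is identical.
        "r_squared": 1.0 - ss_res / ss_tot if ss_tot else float("nan"),
        "rmse": sqrt(ss_res / n),
        "mean_residual": sum(residuals) / n,
    }


# Hypothetical times to complete the lap, in seconds.
actual_times = [80.0, 82.0, 84.0]
predicted_times = [81.0, 82.0, 83.0]
print(regression_summary(actual_times, predicted_times))
```

Running this once per model and per sector gives the kind of side-by-side comparison described above, although as the next sections argue, single summary scores should be read with caution.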
Of course, we could implement and explore more hyperparameter settings for each model. However, as with all real data science projects, we had limited time before we needed our models ready to run live at our next event.
From the results here, we decided to run with the gradient boosted tree models for each sector at our next event.
Improving How We Assess Models in a Formula 1™ Context
Another key aspect of data science is the ability to continually learn, especially with more quality data. We re-trained these trees as people raced, so they should improve over the course of the event. Again using R-squared and residuals as a simple measure of performance, we see the results below comparing the same models built from the training data before the event, and then re-trained after day one of the event:
Figure 7 - Comparing how R-squared and residual spread has improved after adding another day of training laps
Here we can clearly see an improvement in R-squared scores in all sectors from having an extra day of laps in the training data. We also see a tighter range of residuals and now the average is almost 0.
However, the models do still struggle in sector 1. Why is this and is R-squared really a good way to assess model performance? A key lesson I have learnt from being involved in many data science projects is that single value summary scores always hide important detail and truths, so should be treated with caution.
So how can we put our model predictions into a Formula 1™ context? We can do this by taking into consideration what the driver is doing while each prediction is happening, as well as where they are in their lap. For example, a driver may be driving very well with a predicted low lap time. The driver then makes a single mistake, losing time, which results in a slower final lap time. In this scenario the residuals and R-squared of the model predictions vs. the actual final lap time would be very poor before this mistake happened, but is this a fair reflection of model performance? The examples in Figure 8 below illustrate this. Each line chart is a real lap raced at the event. The y-values represent the residuals of the prediction vs. the final lap time as the lap progressed (on the x-axis). On the right is a map showing the location of the driver at each point of prediction, coloured by how close the prediction is to the final lap time at that point (purple is close or exact, yellow to blue is getting further away):
Figure 8 - Analysing model predictions through the lap, and mapping these to on-track events
In the case of the first lap, shown on the top line chart, we see large negative residuals of over 10 seconds(!) for the initial lap predictions, meaning the model was predicting a much faster lap time than the driver eventually achieved. Then around the 2,500 metre point of the lap, the model predictions start to rise as a small mistake happens (shown by the red parts of the line, and highlighted on the map). Interestingly, the models then show another sharp rise right before a second, much larger mistake (again highlighted on the line chart and map). This second mistake is where the driver lost control and made contact with a wall, damaging their car (signified by the X's in the line chart). After this point, the residuals rapidly become near zero until the lap is over. For this reason, when assessing overall model performance, we should only assess the model's predictions after the last mistake. We should not take into account the residuals before these mistakes, as there is no way to know whether those predictions would have been accurate had the driver not made them.
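One way to apply this rule is sketched below in Python: keep only the residuals for predictions made after the last mistake on a lap. The `Prediction` record, its `mistake` flag and the sample lap are hypothetical. How a mistake is actually detected in the simulator data is not covered here, and the predictions are assumed to be ordered by distance into the lap.

```python
from typing import List, NamedTuple, Sequence


class Prediction(NamedTuple):
    lap_distance_m: float        # distance into the lap when the prediction was made
    predicted_lap_time_s: float  # model's predicted final lap time at that point
    mistake: bool                # hypothetical flag: a mistake was in progress here


def residuals_after_last_mistake(
    predictions: Sequence[Prediction], actual_lap_time_s: float
) -> List[float]:
    """Residuals (predicted minus actual) for predictions after the last mistake.

    If the lap contains no mistakes, every prediction is kept.
    """
    last_mistake = max(
        (i for i, p in enumerate(predictions) if p.mistake), default=-1
    )
    return [
        p.predicted_lap_time_s - actual_lap_time_s
        for p in predictions[last_mistake + 1:]
    ]


# Hypothetical lap with two mistakes and a final lap time of 88.0 seconds.
lap = [
    Prediction(500.0, 75.0, False),
    Prediction(2500.0, 80.0, True),
    Prediction(3000.0, 86.0, True),
    Prediction(3500.0, 88.5, False),
    Prediction(4000.0, 88.0, False),
]
print(residuals_after_last_mistake(lap, 88.0))  # [0.5, 0.0]
```

The early residual of -13 seconds is excluded, so the score reflects only the part of the lap where the prediction could still be checked fairly against the final time.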
Compare this to the lap below where there are only two minor mistakes made during the lap, and the model residuals largely hover around 0, and are never greater than 4 seconds.
An unexpected but highly useful finding is that the models were able to predict mistakes before they happened (as shown where the residuals increase just before a red section of each line chart). This shows the models could recognise behaviour likely to lead to a mistake before it happened, e.g. not braking enough or using the wrong racing line. It is particularly interesting in the first lap, where the sharp rise in lap time predictions happened before the large mistake. This occurs just before sector 3 starts, at a place on the track where a driver can easily lose control by over-accelerating out of a tight and difficult corner. Therefore, we can use our knowledge of the track to show the models are learning the more error-prone and dangerous parts of the track. We could also learn from the data here to advise a driver how to prevent this from occurring in the future. More on this in the follow-up blog on predicting likelihood of accidents throughout a lap - coming soon!
It has been a great pleasure working on the TIBCO F1™ Simulator at events and incorporating data science into our dashboards, and lap analyses. Doing this work has provided many analytical and data science insights into how to drive, how to improve lap times but also the key components that contribute to fast laps (or not!). This is all highly valuable in being able to simulate and learn for real-life scenarios using techniques such as digital twins. It also highlighted the crucial point of putting any data science work into the context of your domain. Whether this is in sports and racing, medical research, manufacturing or banking, understanding your model's interpretation and use of data with respect to your use case and area is vitally important. This can influence not only the models you use, and how you deploy them in your use case but also how valuable and actionable the insights are.
Colin Gray - TIBCO Data Science - Apr 2020.
Colin Gray is a data scientist working in the Data Science team in TIBCO, in the EMEA region. He has a keen interest in all things data science especially around how we present and communicate machine learning models and analytics, as well as finding innovative ways to combine technologies to bring data science to more fields and users. He loves sports (boxing, NFL, Formula 1), music and dogs and will find any excuse to combine data science with these interests!