{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Formula One Project: Final Report\n", "\n", "DUE: December 10th, 2024 (Tue) \n", "Name(s): Sean O'Connor, Connor Coles \n", "Class: CSCI 349 - Intro to Data Mining \n", "Semester: Fall 2024 \n", "Instructor: Brian King " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Assignment Description\n", "This is your final report! Consider the scenario of you giving a report to a client or your supervisor on your study. Include good reporting techniques. Use good tables and other visualizations. Structure your notebook using proper headers for major sections. THIS NOTEBOOK SHOULD BE A COMPLETE STANDALONE NOTEBOOK FROM START TO FINISH! But, ONLY INCLUDE THE BEST OUTCOMES FROM YOUR PREVIOUS NOTEBOOKS!\n", "\n", "Include the following sections:\n", "\n", "1. **Introduction**\n", " - This should have mostly been done in your first notebook. Just copy over relevant cells from your first part of the project, and add any new information you have learned. Your aim is to motivate the reader with the importance and relevance of your project.\n", "\n", "2. **Data**\n", " - Again, most of this was likely done in your first part of the project. So, feel free to copy the important cells over.\n", " - After your introduction, you should be introducing the original, raw data. Where was it collected? When? How? Explain the meaning of the variables. What does each observation represent? And be sure to explain the key target variable (assuming you are doing classification/regression).\n", " - PLEASE DO NOT SHOW PAGES AND PAGES OF YOUR DATA! Display only a few observations so that the reader can see what your raw, original data look like.\n", "\n", "3. **Data Preparation**\n", " - While the previous section shows the raw data, this section is going to carefully explain the steps you followed to clean and preprocess the data in a form suitable for analysis, visualization, and modeling. 
You should be setting proper variable types, dealing with missing data, etc. Preprocessing steps should be explained with justification. Include any dimensionality reduction techniques you might have done. Summarize what you needed to do to clean it, and show some example observations from your final, cleaned data.\n", "\n", "4. **EDA**\n", " - We expect good EDA to understand your data. Visualizations after preprocessing will do far more to convey your summary statistics than just numbers. Discuss the distributions, correlations with the target variable, etc.\n", "\n", "5. **Modeling**\n", " - What modeling methods did you try? And, which method(s) did you ultimately determine were the best? What hyperparameters were selected? Justify the selection of parameters. (i.e., did you do a grid search? You cannot simply say, \"XGBoost was the only one I evaluated, and I used default parameters.\" Boring, and unlikely to obtain the best results. You are expected to evaluate different models and different hyperparameters. It is very rare that default parameters are the best in machine learning.)\n", "\n", "6. **Performance Results**\n", " - Once you've selected the best model, clearly convey the results of your model. I expect to see ROC curves, precision/recall curves, confusion matrices, tables with prediction performance by class, (or, if regression, use appropriate regression measures), etc.\n", "\n", "7. **Discussion**\n", " - Reflect on your project. For example: discuss any challenges you had with cleaning and preparing the data. Did you find any surprises during your modeling? Compare and contrast the methods and hyperparameters you evaluated. And, it's often useful to discuss the features that you thought were the most predictive, and those that were least useful. (Search for feature importance scikit-learn for more info!) Any info that might be of interest to me related to your project goes here.\n", "\n", "8. 
**Conclusions**\n", " - A short summary of your project goes here." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "ExecuteTime": { "end_time": "2024-12-09T15:24:16.987194Z", "start_time": "2024-12-09T15:24:16.974515Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Python version: 3.10.13 (main, Sep 11 2023, 08:24:56) [Clang 14.0.6 ]\n", "Pandas version: 2.2.2\n", "Numpy version: 1.23.5\n", "Matplotlib version: 3.8.4\n", "Seaborn version: 0.13.2\n", "FastF1 version: 3.4.4\n", "Scikit-learn version: 1.5.1\n", "XGBoost version: 2.1.1\n" ] } ], "source": [ "# Importing Libraries\n", "import sys\n", "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "import os\n", "\n", "import fastf1\n", "import fastf1.plotting\n", "from fastf1.ergast.structure import FastestLap\n", "\n", "import sklearn\n", "from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.linear_model import LinearRegression\n", "from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor\n", "from sklearn.metrics import mean_squared_error, r2_score\n", "from sklearn.tree import DecisionTreeRegressor\n", "from sklearn.svm import SVR\n", "import xgboost as xgb\n", "\n", "# Print library versions\n", "print(f'Python version: {sys.version}')\n", "print(f'Pandas version: {pd.__version__}')\n", "print(f'Numpy version: {np.__version__}')\n", "print(f'Matplotlib version: {plt.matplotlib.__version__}')\n", "print(f'Seaborn version: {sns.__version__}')\n", "print(f'FastF1 version: {fastf1.__version__}')\n", "print(f'Scikit-learn version: {sklearn.__version__}')\n", "print(f'XGBoost version: {xgb.__version__}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": 
[ "## Problem Statement\n", "We are analyzing Formula One driver performance to understand and predict race outcomes based on various conditions. Specifically, we aim to:\n", "1. Predict lap times based on weather and track conditions\n", "2. Understand how different variables affect driver performance\n", "3. Create models that can forecast race performance\n", "\n", "This is primarily a regression problem, as we're predicting continuous values (lap times) based on multiple features." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "ExecuteTime": { "end_time": "2024-12-09T15:24:17.026610Z", "start_time": "2024-12-09T15:24:17.013465Z" } }, "outputs": [], "source": [ "# Set up FastF1 plotting and caching\n", "cache_dir = '../data/cache'\n", "os.makedirs(cache_dir, exist_ok=True)  # create the cache directory if it is missing\n", "\n", "fastf1.Cache.enable_cache(cache_dir)\n", "fastf1.plotting.setup_mpl(misc_mpl_mods=False, color_scheme=None)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "ExecuteTime": { "end_time": "2024-12-09T15:24:17.053839Z", "start_time": "2024-12-09T15:24:17.050836Z" } }, "outputs": [], "source": [ "# Define years, sessions, and events of interest\n", "years = [2021, 2022, 2023, 2024]\n", "sessions = ['Race']\n", "events = ['Bahrain Grand Prix', 'British Grand Prix', 'Belgian Grand Prix', 'United States Grand Prix', 'Mexico City Grand Prix']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Why these events, sessions, and years?\n", "\n", "These events were chosen because they are all on the 2024 calendar and were also held in previous years.
\n", "\n", "Each event has a specific set of conditions that may affect driver performance:\n", "- Bahrain: Hot and humid, with high track temperatures\n", "- British: Cool and changeable, with frequent rain\n", "- Belgian: Overcast and cool, with frequent weather changes\n", "- United States: Very hot, with high track temperatures\n", "- Mexico City: Cool and changeable, with frequent rain\n", "\n", "As for years, we chose 2021 to 2024 because they are the most recent seasons for which data is available. The technical regulations were revised for 2021 and then overhauled in 2022 to allow for more overtaking, so lap times from earlier seasons are not directly comparable to those in this window.\n", "\n", "We chose to use only the 'Race' session because it is the most representative of race conditions, as opposed to qualifying, which can be very erratic." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "ExecuteTime": { "end_time": "2024-12-09T15:24:20.703407Z", "start_time": "2024-12-09T15:24:17.065829Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Processing 2021 Bahrain Grand Prix - Race\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "core INFO \tLoading data for Bahrain Grand Prix - Race [v3.4.4]\n", "req INFO \tUsing cached data for session_info\n", "req INFO \tUsing cached data for driver_info\n", "req INFO \tUsing cached data for session_status_data\n", "req INFO \tUsing cached data for lap_count\n", "req INFO \tUsing cached data for track_status_data\n", "req INFO \tUsing cached data for _extended_timing_data\n", "req INFO \tUsing cached data for timing_app_data\n", "core INFO \tProcessing timing data...\n", "req INFO \tUsing cached data for car_data\n", "req INFO \tUsing cached data for position_data\n", "req INFO \tUsing cached data for weather_data\n", "req INFO \tUsing cached data for race_control_messages\n", "core INFO \tFinished loading data for 20 drivers: ['44', '33', '77', '4', '11', '16', '3', '55', '22', '18', '7', '99', '31', '63', '5', '47', '10', 
'6', '14', '9']\n" ] } ], "source": [ "# Get data from FastF1 API\n", "\n", "# Data containers\n", "weather_data_list = []\n", "lap_data_list = []\n", "\n", "# Quick test: restrict this run to a single race (overrides the lists defined above)\n", "years = [2021]\n", "events = ['Bahrain Grand Prix']\n", "\n", "\n", "# Loop through years, events, and sessions\n", "for year in years:\n", "    for event_name in events:\n", "        for session_name in sessions:\n", "            try:\n", "                print(f\"Processing {year} {event_name} - {session_name}\")\n", "\n", "                # Load the session\n", "                session = fastf1.get_session(year, event_name, session_name, backend='fastf1')\n", "                session.load()\n", "\n", "                # Process weather data\n", "                weather_data = session.weather_data\n", "                if weather_data is not None:\n", "                    weather_df = pd.DataFrame(weather_data)\n", "                    # Add context columns\n", "                    weather_df['Year'] = year\n", "                    weather_df['Event'] = event_name\n", "                    weather_df['Session'] = session_name\n", "                    weather_data_list.append(weather_df)\n", "\n", "                # Process lap data\n", "                lap_data = session.laps\n", "                if lap_data is not None:\n", "                    lap_df = pd.DataFrame(lap_data)\n", "                    # Add context columns\n", "                    lap_df['Year'] = year\n", "                    lap_df['Event'] = event_name\n", "                    lap_df['Session'] = session_name\n", "                    # Ensure driver information is included\n", "                    if 'Driver' not in lap_df.columns:\n", "                        lap_df['Driver'] = lap_df['DriverNumber'].map(session.drivers)\n", "                    # Add team information if available\n", "                    if 'Team' not in lap_df.columns:\n", "                        lap_df['Team'] = lap_df['Driver'].map(session.drivers_info['TeamName'])\n", "                    lap_data_list.append(lap_df)\n", "\n", "            except Exception as e:\n", "                print(f\"Error with {event_name} {session_name} ({year}): {e}\")\n", "\n", "# Combine data into DataFrames\n", "if weather_data_list:\n", "    weather_data_combined = pd.concat(weather_data_list, ignore_index=True)\n", "    # Ensure consistent column ordering\n", "    weather_cols = ['Time', 'Year', 'Event', 'Session',\n", "                    'AirTemp', 'Humidity', 'Pressure', 'Rainfall',\n", "                    'TrackTemp', 'WindDirection', 
'WindSpeed']\n", "    weather_data_combined = weather_data_combined[weather_cols]\n", "\n", "if lap_data_list:\n", "    lap_data_combined = pd.concat(lap_data_list, ignore_index=True)\n", "    # Ensure consistent column ordering\n", "    lap_cols = ['Time', 'Year', 'Event', 'Session',\n", "                'Driver', 'Team', 'LapNumber', 'LapTime',\n", "                'Sector1Time', 'Sector2Time', 'Sector3Time',\n", "                'Compound', 'TyreLife', 'FreshTyre',\n", "                'SpeedI1', 'SpeedI2', 'SpeedFL', 'SpeedST']\n", "    # Only include columns that exist\n", "    existing_cols = [col for col in lap_cols if col in lap_data_combined.columns]\n", "    lap_data_combined = lap_data_combined[existing_cols]\n", "\n", "# Time conversion\n", "# Function to convert timedelta to datetime\n", "def convert_timedelta_to_datetime(df, base_date='2021-01-01'):\n", "    if 'Time' in df.columns:\n", "        # Create a base datetime and add the timedelta\n", "        base = pd.Timestamp(base_date)\n", "        if df['Time'].dtype == 'timedelta64[ns]':\n", "            df['Time'] = base + df['Time']\n", "    return df\n", "\n", "# Apply conversion to both dataframes\n", "weather_data_combined = convert_timedelta_to_datetime(weather_data_combined)\n", "lap_data_combined = convert_timedelta_to_datetime(lap_data_combined)\n", "\n", "# Remove missing values\n", "weather_data_combined = weather_data_combined.dropna()\n", "lap_data_combined = lap_data_combined.dropna()\n", "\n", "# Create a new column for lap time in seconds\n", "lap_data_combined['LapTime_seconds'] = lap_data_combined['LapTime'].dt.total_seconds()\n", "\n", "# Merge the data\n", "merged_data = pd.merge_asof(\n", "    lap_data_combined.sort_values('Time'),\n", "    weather_data_combined.sort_values('Time'),\n", "    on='Time',\n", "    by=['Event', 'Year'],  # Match within same event and year\n", "    direction='nearest',\n", "    tolerance=pd.Timedelta('1 min')  # Allow matching within 1 minute\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Description\n", "Our data comes from the FastF1 API, which provides 
detailed Formula One racing data. Each observation represents a single lap during a race session and includes:\n", "\n", "Key Variables:\n", "- **Time**: Timestamp of the lap\n", "- **Driver**: Driver identifier\n", "- **LapTime**: Time taken to complete the lap\n", "- **Weather Conditions**:\n", "  - TrackTemp: Track temperature in Celsius\n", "  - AirTemp: Air temperature in Celsius\n", "  - Humidity: Percentage\n", "  - Rainfall: Boolean indicating presence of rain\n", "- **Performance Metrics**:\n", "  - Sector times (1, 2, 3)\n", "  - Speed measurements at various points\n", "  - Compound: Tire compound used\n", "  - TyreLife: Age of tires in laps\n", "\n", "Each lap is represented as a fixed-length vector containing these attributes, making it suitable for machine learning algorithms." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "ExecuteTime": { "end_time": "2024-12-09T15:25:18.878045Z", "start_time": "2024-12-09T15:25:18.858848Z" } }, "outputs": [ { "data": { "text/html": [ "
|  | Time | Year | Event | Session | Driver | Team | LapNumber | LapTime | Sector1Time | Sector2Time | Sector3Time | Compound | TyreLife | FreshTyre | SpeedI1 | SpeedI2 | SpeedFL | SpeedST | LapTime_seconds |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2021-01-01 00:41:37.134 | 2021 | Bahrain Grand Prix | Race | GAS | AlphaTauri | 2.0 | 0 days 00:02:22.263000 | 0 days 00:00:45.220000 | 0 days 00:01:00.086000 | 0 days 00:00:36.957000 | MEDIUM | 5.0 | False | 120.0 | 134.0 | 182.0 | 236.0 | 142.263 |
| 4 | 2021-01-01 00:48:28.044 | 2021 | Bahrain Grand Prix | Race | GAS | AlphaTauri | 5.0 | 0 days 00:02:11.534000 | 0 days 00:01:05.748000 | 0 days 00:00:41.956000 | 0 days 00:00:23.830000 | HARD | 1.0 | True | 231.0 | 251.0 | 275.0 | 213.0 | 131.534 |
| 5 | 2021-01-01 00:50:04.721 | 2021 | Bahrain Grand Prix | Race | GAS | AlphaTauri | 6.0 | 0 days 00:01:36.677000 | 0 days 00:00:30.990000 | 0 days 00:00:41.802000 | 0 days 00:00:23.885000 | HARD | 2.0 | True | 233.0 | 254.0 | 275.0 | 280.0 | 96.677 |
| 6 | 2021-01-01 00:51:41.675 | 2021 | Bahrain Grand Prix | Race | GAS | AlphaTauri | 7.0 | 0 days 00:01:36.954000 | 0 days 00:00:31.176000 | 0 days 00:00:41.678000 | 0 days 00:00:24.100000 | HARD | 3.0 | True | 232.0 | 252.0 | 274.0 | 282.0 | 96.954 |
| 8 | 2021-01-01 00:54:56.129 | 2021 | Bahrain Grand Prix | Race | GAS | AlphaTauri | 9.0 | 0 days 00:01:37.030000 | 0 days 00:00:31.256000 | 0 days 00:00:41.911000 | 0 days 00:00:23.863000 | HARD | 5.0 | True | 234.0 | 248.0 | 276.0 | 286.0 | 97.030 |