{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Formula One Project: Final Report\n", "\n", "DUE: December 10th, 2024 (Tue) \n", "Name(s): Sean O'Connor, Connor Coles \n", "Class: CSCI 349 - Intro to Data Mining \n", "Semester: Fall 2024 \n", "Instructor: Brian King " ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2024-12-10T18:38:06.152974Z", "start_time": "2024-12-10T18:38:06.139745Z" } }, "outputs": [], "source": [ "# Importing Libraries\n", "import sys\n", "import pandas as pd\n", "import numpy as np\n", "import seaborn as sns\n", "import random\n", "import logging\n", "import warnings\n", "import os\n", "\n", "import fastf1\n", "import fastf1.plotting\n", "from fastf1.ergast.structure import FastestLap\n", "\n", "import matplotlib.pyplot as plt\n", "import matplotlib as mpl\n", "from matplotlib.collections import LineCollection\n", "\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.ensemble import RandomForestRegressor\n", "from sklearn.metrics import mean_squared_error, r2_score\n", "from sklearn.ensemble import GradientBoostingRegressor\n", "import xgboost as xgb\n", "from sklearn.model_selection import train_test_split, GridSearchCV\n", "from sklearn.pipeline import Pipeline, make_pipeline\n", "from sklearn.impute import SimpleImputer\n", "from sklearn.metrics import mean_absolute_error\n", "\n", "from xgboost import XGBRegressor\n", "from lightgbm import LGBMRegressor" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Problem Statement\n", "We are analyzing Formula One driver performance to understand and predict race outcomes based on various conditions. Specifically, we aim to:\n", "1. Predict lap times based on weather and track conditions\n", "2. Understand how different variables affect driver performance\n", "3. 
Create models that can forecast race performance\n", "\n", "This is primarily a regression problem, as we're predicting continuous values (lap times) based on multiple features." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2024-12-10T17:42:22.557629Z", "start_time": "2024-12-10T17:42:22.532022Z" } }, "outputs": [], "source": [ "# Set up FastF1 plotting and caching\n", "cache_dir = '../data/cache'\n", "if not os.path.exists(cache_dir):\n", " os.makedirs(cache_dir)\n", "\n", "fastf1.Cache.enable_cache(cache_dir)\n", "fastf1.plotting.setup_mpl(misc_mpl_mods=False, color_scheme=None)\n", "logging.disable(logging.INFO)\n", "warnings.filterwarnings('ignore', category=UserWarning)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2024-12-10T17:42:23.135727Z", "start_time": "2024-12-10T17:42:23.132572Z" } }, "outputs": [], "source": [ "# Define years, sessions, and events of interest\n", "years = [2021, 2022, 2023, 2024]\n", "sessions = ['Race']\n", "events = ['Bahrain Grand Prix', 'Saudi Arabian Grand Prix', 'Dutch Grand Prix', 'Italian Grand Prix', 'Austrian Grand Prix', 'Hungarian Grand Prix', 'British Grand Prix', 'Belgian Grand Prix', 'United States Grand Prix', 'Mexico City Grand Prix', 'São Paulo Grand Prix']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Why these events, sessions, and years?\n", "\n", "These events were chosen because they are all currently scheduled for the 2024 season, as well as having occurred in previous years. 
\n", "\n", "Each event has a specific set of conditions that may affect driver performance, for example:\n", "- Bahrain: Hot and humid, with high track temperatures\n", "- British: Cool and changeable, with frequent rain\n", "- Belgian: Overcast and cool, with frequent weather changes\n", "- United States: Very hot, with high track temperatures\n", "- Mexico City: Cool and changeable, with frequent rain\n", "\n", "As for years, we chose 2021 to 2024 because they are the most recent years for which data is available. Note that the major regulation overhaul intended to allow for more overtaking took effect in 2022, so lap times from 2021 may not be directly comparable to those from 2022 onward.\n", "\n", "We chose to use only the 'Race' session because it is the most representative of race conditions, as opposed to qualifying, where lap times can be very sporadic." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "ExecuteTime": { "end_time": "2024-12-10T17:45:57.601953Z", "start_time": "2024-12-10T17:42:24.338445Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "core WARNING \tDriver 7: Lap timing integrity check failed for 1 lap(s)\n", "events WARNING \tCorrecting user input 'Sao Paulo Grand Prix' to 'São Paulo Grand Prix'\n", "core WARNING \tNo lap data for driver 22\n", "core WARNING \tNo lap data for driver 47\n", "core WARNING \tFailed to perform lap accuracy check - all laps marked as inaccurate (driver 22)\n", "core WARNING \tFailed to perform lap accuracy check - all laps marked as inaccurate (driver 47)\n", "events WARNING \tCorrecting user input 'United States Grand Prix' to 'United States Grand Prix'\n", "events WARNING \tCorrecting user input 'Sao Paulo Grand Prix' to 'São Paulo Grand Prix'\n", "events WARNING \tCorrecting user input 'United States Grand Prix' to 'United States Grand Prix'\n", "events WARNING \tCorrecting user input 'Sao Paulo Grand Prix' to 'São Paulo Grand Prix'\n", "events WARNING \tCorrecting user input 'United States Grand Prix' to 'United States Grand 
Prix'\n", "events WARNING \tCorrecting user input 'Sao Paulo Grand Prix' to 'São Paulo Grand Prix'\n", "core WARNING \tNo lap data for driver 23\n", "core WARNING \tFailed to perform lap accuracy check - all laps marked as inaccurate (driver 23)\n" ] } ], "source": [ "# Get data from FastF1 API\n", "\n", "# Data containers\n", "weather_data_list = []\n", "lap_data_list = []\n", "\n", "# Loop through years and sessions\n", "for year in years:\n", " for event_name in events: \n", " for session_name in sessions:\n", " try:\n", " # Load the session\n", " session = fastf1.get_session(year, event_name, session_name, backend='fastf1')\n", " session.load()\n", " \n", " # Process weather data\n", " weather_data = session.weather_data\n", " if weather_data is not None:\n", " weather_df = pd.DataFrame(weather_data)\n", " # Add context columns\n", " weather_df['Year'] = year\n", " weather_df['Event'] = event_name\n", " weather_df['Session'] = session_name\n", " weather_data_list.append(weather_df)\n", "\n", " # Process lap data\n", " lap_data = session.laps\n", " if lap_data is not None:\n", " lap_df = pd.DataFrame(lap_data)\n", " # Add context columns\n", " lap_df['Year'] = year\n", " lap_df['Event'] = event_name\n", " lap_df['Session'] = session_name\n", " # Ensure driver information is included\n", " if 'Driver' not in lap_df.columns:\n", " lap_df['Driver'] = lap_df['DriverNumber'].map(session.drivers)\n", " # Add team information if available\n", " if 'Team' not in lap_df.columns:\n", " lap_df['Team'] = lap_df['Driver'].map(session.drivers_info['TeamName'])\n", " lap_data_list.append(lap_df)\n", " \n", " except Exception as e:\n", " print(f\"Error with {event_name} {session_name} ({year}): {e}\")\n", "\n", "# Combine data into DataFrames\n", "if weather_data_list:\n", " weather_data_combined = pd.concat(weather_data_list, ignore_index=True)\n", " # Ensure consistent column ordering\n", " weather_cols = ['Time', 'Year', 'Event', 'Session', \n", " 'AirTemp', 'Humidity', 
'Pressure', 'Rainfall', \n", " 'TrackTemp', 'WindDirection', 'WindSpeed']\n", " weather_data_combined = weather_data_combined[weather_cols]\n", " \n", "if lap_data_list:\n", " lap_data_combined = pd.concat(lap_data_list, ignore_index=True)\n", " # Ensure consistent column ordering\n", " lap_cols = ['Time', 'Year', 'Event', 'Session', \n", " 'Driver', 'Team', 'LapNumber', 'LapTime',\n", " 'Sector1Time', 'Sector2Time', 'Sector3Time',\n", " 'Compound', 'TyreLife', 'FreshTyre',\n", " 'SpeedI1', 'SpeedI2', 'SpeedFL', 'SpeedST']\n", " # Only include columns that exist\n", " existing_cols = [col for col in lap_cols if col in lap_data_combined.columns]\n", " lap_data_combined = lap_data_combined[existing_cols]\n", " \n", "# Time conversion\n", "# Function to convert timedelta to datetime\n", "def convert_timedelta_to_datetime(df, base_date='2021-01-01'):\n", " if 'Time' in df.columns:\n", " # Create a base datetime and add the timedelta\n", " base = pd.Timestamp(base_date)\n", " if df['Time'].dtype == 'timedelta64[ns]':\n", " df['Time'] = base + df['Time']\n", " return df\n", "\n", "# Apply conversion to both dataframes\n", "weather_data_combined = convert_timedelta_to_datetime(weather_data_combined)\n", "lap_data_combined = convert_timedelta_to_datetime(lap_data_combined)\n", "\n", "# Remove missing values\n", "weather_data_combined = weather_data_combined.dropna()\n", "lap_data_combined = lap_data_combined.dropna()\n", "\n", "# Create a new column for lap time in seconds\n", "lap_data_combined['LapTime_seconds'] = lap_data_combined['LapTime'].dt.total_seconds()\n", "\n", "# Merge the data\n", "merged_data = pd.merge_asof(\n", " lap_data_combined.sort_values('Time'),\n", " weather_data_combined.sort_values('Time'),\n", " on='Time',\n", " by=['Event', 'Year'], # Match within same event and year\n", " direction='nearest',\n", " tolerance=pd.Timedelta('1 min') # Allow matching within 1 minute\n", ")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data 
Description\n", "Our data comes from the FastF1 API, which provides detailed Formula One racing data. Each observation represents a single lap during a race session, including:\n", "\n", "Key Variables:\n", "- **Time**: Session timestamp at which the lap was completed\n", "- **Driver**: Driver identifier\n", "- **LapTime**: Time taken to complete the lap\n", "- **Weather Conditions**:\n", " - TrackTemp: Track temperature in Celsius\n", " - AirTemp: Air temperature in Celsius\n", " - Humidity: Relative humidity as a percentage\n", " - Rainfall: Boolean indicating presence of rain\n", "- **Performance Metrics**:\n", " - Sector times (1,2,3)\n", " - Speed measurements at various points\n", " - Compound: Tire compound used\n", " - TyreLife: Age of tires in laps\n", "\n", "Each lap is represented as a fixed-length vector containing these attributes, making it suitable for machine learning algorithms." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "ExecuteTime": { "end_time": "2024-12-10T17:46:25.733104Z", "start_time": "2024-12-10T17:46:25.694313Z" } }, "outputs": [ { "data": { "text/html": [ "
| | Time | Year | Event | Session | Driver | Team | LapNumber | LapTime | Sector1Time | Sector2Time | Sector3Time | Compound | TyreLife | FreshTyre | SpeedI1 | SpeedI2 | SpeedFL | SpeedST | LapTime_seconds |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2021-01-01 00:41:37.134 | 2021 | Bahrain Grand Prix | Race | GAS | AlphaTauri | 2.0 | 0 days 00:02:22.263000 | 0 days 00:00:45.220000 | 0 days 00:01:00.086000 | 0 days 00:00:36.957000 | MEDIUM | 5.0 | False | 120.0 | 134.0 | 182.0 | 236.0 | 142.263 |
| 4 | 2021-01-01 00:48:28.044 | 2021 | Bahrain Grand Prix | Race | GAS | AlphaTauri | 5.0 | 0 days 00:02:11.534000 | 0 days 00:01:05.748000 | 0 days 00:00:41.956000 | 0 days 00:00:23.830000 | HARD | 1.0 | True | 231.0 | 251.0 | 275.0 | 213.0 | 131.534 |
| 5 | 2021-01-01 00:50:04.721 | 2021 | Bahrain Grand Prix | Race | GAS | AlphaTauri | 6.0 | 0 days 00:01:36.677000 | 0 days 00:00:30.990000 | 0 days 00:00:41.802000 | 0 days 00:00:23.885000 | HARD | 2.0 | True | 233.0 | 254.0 | 275.0 | 280.0 | 96.677 |
| 6 | 2021-01-01 00:51:41.675 | 2021 | Bahrain Grand Prix | Race | GAS | AlphaTauri | 7.0 | 0 days 00:01:36.954000 | 0 days 00:00:31.176000 | 0 days 00:00:41.678000 | 0 days 00:00:24.100000 | HARD | 3.0 | True | 232.0 | 252.0 | 274.0 | 282.0 | 96.954 |
| 8 | 2021-01-01 00:54:56.129 | 2021 | Bahrain Grand Prix | Race | GAS | AlphaTauri | 9.0 | 0 days 00:01:37.030000 | 0 days 00:00:31.256000 | 0 days 00:00:41.911000 | 0 days 00:00:23.863000 | HARD | 5.0 | True | 234.0 | 248.0 | 276.0 | 286.0 | 97.030 |