{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Formula One Project: Modeling\n", "\n", "DUE: December 4th, 2024 (Wed) \n", "Name(s): Sean O'Connor, Connor Coles \n", "Class: CSCI 349 - Intro to Data Mining \n", "Semester: Fall 2024 \n", "Instructor: Brian King " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Assignment Description\n", "\n", "Copy over the important cells from the previous step that read in and cleaned your data to this new notebook file. You do not need to copy over all your EDA and plots describing your data, only the code that prepares your data for modeling. This notebook is about exploring the development of predictive models. Some initial preliminary work on applying some modeling techniques should be completed.\n", "Be sure to commit and push all supporting code that you've completed in this file. Include in this notebook a summary cell at the top that details your accomplishments, challenges, and what you expect to accomplish for your final steps. 
Be sure to update your readme.md in your repository.\n", "\n", "## Progress Summary\n", "\n", "### Accomplishments So Far\n", "- Successfully loaded and preprocessed Formula 1 race data from 2021-2024\n", "- Created comprehensive feature engineering pipeline including weather and track conditions\n", "- Implemented initial modeling with Random Forest, XGBoost, and Gradient Boosting\n", "- Achieved best performance on Belgian GP (R² = 0.775) and Mexico City GP (R² = 0.505)\n", "\n", "### Challenges Faced\n", "- High variability in model performance across different tracks\n", "- British GP proving particularly difficult to predict (best R² = 0.047)\n", "- Complex interactions between weather variables and lap times\n", "- Limited data availability for some races/conditions\n", "\n", "### Next Steps\n", "- Implement hyperparameter tuning using GridSearchCV\n", "- Explore additional feature engineering possibilities\n", "- Test neural network approaches for complex weather-performance relationships\n", "- Create ensemble model combining best performers for each track\n", "- Prepare final visualizations and analysis for report\n", "\n", "## Data Preparation and Feature Engineering" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "# Importing Libraries\n", "import os\n", "import logging\n", "\n", "import fastf1\n", "import fastf1.plotting\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd\n", "import seaborn as sns\n", "import xgboost as xgb\n", "from xgboost import XGBRegressor\n", "from fastf1.ergast.structure import FastestLap\n", "from sklearn.compose import ColumnTransformer\n", "from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor\n", "from sklearn.linear_model import LinearRegression\n", "from sklearn.metrics import mean_squared_error, r2_score\n", "from sklearn.model_selection import cross_val_score, train_test_split\n", "from sklearn.preprocessing import 
StandardScaler\n", "from sklearn.svm import SVR\n", "from sklearn.tree import DecisionTreeRegressor\n", "from sklearn.metrics import mean_absolute_error" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "# FastF1 general setup\n", "cache_dir = '../data/cache'\n", "os.makedirs(cache_dir, exist_ok=True)\n", "\n", "fastf1.Cache.enable_cache(cache_dir)\n", "fastf1.plotting.setup_mpl(misc_mpl_mods=False, color_scheme=None)\n", "logging.disable(logging.INFO)\n", "\n", "# Set up plot style\n", "# Run print(plt.style.available) to list the available styles\n", "plt.style.use('seaborn-v0_8-whitegrid')" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "# Define years, sessions, and events of interest\n", "years = [2021, 2022, 2023, 2024]\n", "sessions = ['Race']\n", "events = ['Bahrain Grand Prix', 'British Grand Prix', 'Belgian Grand Prix', 'United States Grand Prix', 'Mexico City Grand Prix']" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Processing 2021 Bahrain Grand Prix - Race\n", "Processing 2021 British Grand Prix - Race\n", "Processing 2021 Belgian Grand Prix - Race\n", "Processing 2021 United States Grand Prix - Race\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "core WARNING \tDriver 7: Lap timing integrity check failed for 1 lap(s)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Processing 2021 Mexico City Grand Prix - Race\n", "Processing 2022 Bahrain Grand Prix - Race\n", "Processing 2022 British Grand Prix - Race\n", "Processing 2022 Belgian Grand Prix - Race\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "events WARNING \tCorrecting user input 'United States Grand Prix' to 'United States Grand Prix'\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Processing 2022 United States Grand Prix - Race\n", 
"Processing 2022 Mexico City Grand Prix - Race\n", "Processing 2023 Bahrain Grand Prix - Race\n", "Processing 2023 British Grand Prix - Race\n", "Processing 2023 Belgian Grand Prix - Race\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "events WARNING \tCorrecting user input 'United States Grand Prix' to 'United States Grand Prix'\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Processing 2023 United States Grand Prix - Race\n", "Processing 2023 Mexico City Grand Prix - Race\n", "Processing 2024 Bahrain Grand Prix - Race\n", "Processing 2024 British Grand Prix - Race\n", "Processing 2024 Belgian Grand Prix - Race\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "events WARNING \tCorrecting user input 'United States Grand Prix' to 'United States Grand Prix'\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Processing 2024 United States Grand Prix - Race\n", "Processing 2024 Mexico City Grand Prix - Race\n" ] } ], "source": [ "# Get data from FastF1 API\n", "\n", "# Data containers\n", "weather_data_list = []\n", "lap_data_list = []\n", "\n", "# Loop through years and sessions\n", "for year in years:\n", " for event_name in events: \n", " for session_name in sessions:\n", " try:\n", " print(f\"Processing {year} {event_name} - {session_name}\")\n", " \n", " # Load the session\n", " session = fastf1.get_session(year, event_name, session_name, backend='fastf1')\n", " session.load()\n", " \n", " # Process weather data\n", " weather_data = session.weather_data\n", " if weather_data is not None:\n", " weather_df = pd.DataFrame(weather_data)\n", " # Add context columns\n", " weather_df['Year'] = year\n", " weather_df['Event'] = event_name\n", " weather_df['Session'] = session_name\n", " weather_data_list.append(weather_df)\n", "\n", " # Process lap data\n", " lap_data = session.laps\n", " if lap_data is not None:\n", " lap_df = pd.DataFrame(lap_data)\n", " # Add context columns\n", " lap_df['Year'] = year\n", " 
lap_df['Event'] = event_name\n", " lap_df['Session'] = session_name\n", " # Ensure driver information is included (fallback: map driver numbers to abbreviations via session.results)\n", " if 'Driver' not in lap_df.columns:\n", " lap_df['Driver'] = lap_df['DriverNumber'].map(session.results.set_index('DriverNumber')['Abbreviation'])\n", " # Add team information if available (fallback: map abbreviations to team names via session.results)\n", " if 'Team' not in lap_df.columns:\n", " lap_df['Team'] = lap_df['Driver'].map(session.results.set_index('Abbreviation')['TeamName'])\n", " lap_data_list.append(lap_df)\n", " \n", " except Exception as e:\n", " print(f\"Error with {event_name} {session_name} ({year}): {e}\")\n", "\n", "# Combine data into DataFrames\n", "if weather_data_list:\n", " weather_data_combined = pd.concat(weather_data_list, ignore_index=True)\n", " # Ensure consistent column ordering\n", " weather_cols = ['Time', 'Year', 'Event', 'Session', \n", " 'AirTemp', 'Humidity', 'Pressure', 'Rainfall', \n", " 'TrackTemp', 'WindDirection', 'WindSpeed']\n", " weather_data_combined = weather_data_combined[weather_cols]\n", " \n", "if lap_data_list:\n", " lap_data_combined = pd.concat(lap_data_list, ignore_index=True)\n", " # Ensure consistent column ordering\n", " lap_cols = ['Time', 'Year', 'Event', 'Session', \n", " 'Driver', 'Team', 'LapNumber', 'LapTime',\n", " 'Sector1Time', 'Sector2Time', 'Sector3Time',\n", " 'Compound', 'TyreLife', 'FreshTyre',\n", " 'SpeedI1', 'SpeedI2', 'SpeedFL', 'SpeedST']\n", " # Only include columns that exist\n", " existing_cols = [col for col in lap_cols if col in lap_data_combined.columns]\n", " lap_data_combined = lap_data_combined[existing_cols]\n", " \n", "# Time conversion\n", "# Function to convert timedelta to datetime\n", "def convert_timedelta_to_datetime(df, base_date='2021-01-01'):\n", " if 'Time' in df.columns:\n", " # Create a base datetime and add the timedelta\n", " base = pd.Timestamp(base_date)\n", " if df['Time'].dtype == 'timedelta64[ns]':\n", " df['Time'] = base + df['Time']\n", " return df\n", "\n", "# Apply conversion to both dataframes\n", "weather_data_combined = 
convert_timedelta_to_datetime(weather_data_combined)\n", "lap_data_combined = convert_timedelta_to_datetime(lap_data_combined)\n", "\n", "# Remove missing values\n", "weather_data_combined = weather_data_combined.dropna()\n", "lap_data_combined = lap_data_combined.dropna()\n", "\n", "# Create a new column for lap time in seconds\n", "lap_data_combined['LapTime_seconds'] = lap_data_combined['LapTime'].dt.total_seconds()\n", "\n", "# Merge the data\n", "merged_data = pd.merge_asof(\n", " lap_data_combined.sort_values('Time'),\n", " weather_data_combined.sort_values('Time'),\n", " on='Time',\n", " by=['Event', 'Year'], # Match within same event and year\n", " direction='nearest',\n", " tolerance=pd.Timedelta('1 min') # Allow matching within 1 minute\n", ")" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "def engineer_features(df):\n", " \"\"\"\n", " Engineer features for F1 lap time prediction with track-specific optimizations.\n", " \n", " Parameters:\n", " df (pandas.DataFrame): Input dataframe containing raw F1 session data\n", " Returns:\n", " pandas.DataFrame: DataFrame with engineered features\n", " \"\"\"\n", " # Basic weather and track condition features\n", " df['GripCondition'] = df.apply(lambda x: \n", " x['TrackTemp'] * (1 - x['Humidity']/200) if 'British' in x['Event']\n", " else x['TrackTemp'] * (1 - x['Humidity']/100), axis=1)\n", " \n", " df['TempDelta'] = df['TrackTemp'] - df['AirTemp']\n", " \n", " # Tire degradation\n", " df['TyreDeg'] = df.apply(lambda x: \n", " np.exp(-0.02 * x['TyreLife']) if 'British' in x['Event']\n", " else np.exp(-0.025 * x['TyreLife']) if 'Belgian' in x['Event']\n", " else np.exp(-0.015 * x['TyreLife']), axis=1)\n", " \n", " # Track evolution with weather adjustment\n", " df['TrackEvolution'] = df.apply(lambda x: \n", " (1 - np.exp(-0.12 * x['LapNumber'])) * (1 - x['Humidity']/300) if 'British' in x['Event']\n", " else (1 - np.exp(-0.15 * x['LapNumber'])) if 'United States' in 
x['Event']\n", " else 1 - np.exp(-0.1 * x['LapNumber']), axis=1)\n", " \n", " # Temperature interactions\n", " df['TempInteraction'] = df['TrackTemp'] * df['AirTemp']\n", " df['TempInteractionSquared'] = df['TempInteraction'] ** 2\n", " \n", " # Weather complexity\n", " df['WeatherComplexity'] = df.apply(lambda x:\n", " (x['WindSpeed'] * 0.5 + abs(x['TempDelta']) * 0.3 + x['Humidity'] * 0.2) / 100.0\n", " if 'Belgian' in x['Event']\n", " else (x['WindSpeed'] * 0.3 + abs(x['TempDelta']) * 0.4 + x['Humidity'] * 0.3) / 100.0\n", " if 'British' in x['Event']\n", " else (x['WindSpeed'] * 0.2 + abs(x['TempDelta']) * 0.5 + x['Humidity'] * 0.3) / 100.0,\n", " axis=1)\n", " \n", " # Track-specific features\n", " df['DesertEffect'] = np.where(\n", " df['Event'].str.contains('Bahrain'),\n", " df['WindSpeed'] * df['Humidity'] * df['TempInteraction'] / 10000,\n", " 0\n", " )\n", " \n", " df['WetWeatherEffect'] = df.apply(lambda x:\n", " (x['Humidity'] * x['WindSpeed'] * abs(x['TempDelta'])) / 1200 if 'British' in x['Event']\n", " else (x['Humidity'] * x['WindSpeed'] * abs(x['TempDelta'])) / 1000 if 'Belgian' in x['Event']\n", " else 0, axis=1)\n", " \n", " df['AltitudeEffect'] = np.where(\n", " df['Event'].str.contains('Mexico City'),\n", " df['AirTemp'] * (1 - df['Humidity']/200) * df['WindSpeed'] / 10,\n", " 0\n", " )\n", " \n", " return df" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "def prepare_modeling_data(df):\n", " \"\"\"\n", " Prepare data for modeling with optimized track-specific configurations.\n", " \n", " Parameters:\n", " df (pandas.DataFrame): Input dataframe with raw F1 session data\n", " \n", " Returns:\n", " dict: Dictionary containing modeling results for each track\n", " \"\"\"\n", " data = engineer_features(df)\n", " track_results = {}\n", " \n", " base_features = [\n", " 'TrackTemp', 'AirTemp', 'Humidity', 'WindSpeed',\n", " 'TyreLife', 'TyreDeg', 'TempDelta', 'GripCondition',\n", " 'TrackEvolution', 
'TempInteraction', 'TempInteractionSquared',\n", " 'WeatherComplexity', 'DesertEffect', 'WetWeatherEffect', 'AltitudeEffect'\n", " ]\n", " \n", " # Track-specific configurations\n", " track_configs = {\n", " 'Bahrain': {\n", " 'n_estimators': 200,\n", " 'max_depth': 6,\n", " 'learning_rate': 0.007,\n", " 'outlier_threshold': 1.8,\n", " 'split_ratio': 0.8\n", " },\n", " 'Belgian': {\n", " 'n_estimators': 180,\n", " 'max_depth': 5,\n", " 'learning_rate': 0.008,\n", " 'outlier_threshold': 1.8,\n", " 'split_ratio': 0.8\n", " },\n", " 'Mexico City': {\n", " 'n_estimators': 160,\n", " 'max_depth': 6,\n", " 'learning_rate': 0.009,\n", " 'outlier_threshold': 1.6,\n", " 'split_ratio': 0.8\n", " },\n", " 'United States': {\n", " 'n_estimators': 200,\n", " 'max_depth': 5,\n", " 'learning_rate': 0.006,\n", " 'outlier_threshold': 1.7,\n", " 'split_ratio': 0.75\n", " },\n", " 'British': {\n", " 'n_estimators': 180,\n", " 'max_depth': 4,\n", " 'learning_rate': 0.008,\n", " 'outlier_threshold': 1.7,\n", " 'split_ratio': 0.75\n", " }\n", " }\n", " \n", " for event in data['Event'].unique():\n", " print(f\"\\nProcessing {event}\")\n", " track_data = data[data['Event'] == event].copy()\n", " \n", " # Get track-specific config\n", " config = next((v for k, v in track_configs.items() if k in event), {\n", " 'n_estimators': 150,\n", " 'max_depth': 6,\n", " 'learning_rate': 0.01,\n", " 'outlier_threshold': 1.7,\n", " 'split_ratio': 0.8\n", " })\n", " \n", " # Outlier removal with track-specific thresholds\n", " Q1 = track_data['LapTime_seconds'].quantile(0.25)\n", " Q3 = track_data['LapTime_seconds'].quantile(0.75)\n", " IQR = Q3 - Q1\n", " track_data = track_data[\n", " (track_data['LapTime_seconds'] >= Q1 - config['outlier_threshold'] * IQR) &\n", " (track_data['LapTime_seconds'] <= Q3 + config['outlier_threshold'] * IQR)\n", " ]\n", " \n", " # Feature preparation\n", " track_data = pd.get_dummies(track_data, columns=['Compound'])\n", " compound_features = [col for col in 
track_data.columns if col.startswith('Compound_')]\n", " feature_columns = base_features + compound_features\n", " \n", " # Split data\n", " track_data = track_data.sort_values('Time')\n", " split_idx = int(len(track_data) * config['split_ratio'])\n", " \n", " X_train = track_data.iloc[:split_idx][feature_columns]\n", " X_test = track_data.iloc[split_idx:][feature_columns]\n", " y_train = track_data.iloc[:split_idx]['LapTime_seconds']\n", " y_test = track_data.iloc[split_idx:]['LapTime_seconds']\n", " \n", " # Scale features\n", " scaler = StandardScaler()\n", " X_train_scaled = scaler.fit_transform(X_train)\n", " X_test_scaled = scaler.transform(X_test)\n", " \n", " # Models with track-specific configurations\n", " models = {\n", " 'Random Forest': RandomForestRegressor(\n", " n_estimators=config['n_estimators'],\n", " max_depth=config['max_depth'],\n", " random_state=42\n", " ),\n", " 'XGBoost': XGBRegressor(\n", " n_estimators=config['n_estimators'],\n", " max_depth=config['max_depth'],\n", " learning_rate=config['learning_rate'],\n", " random_state=42\n", " ),\n", " 'Gradient Boosting': GradientBoostingRegressor(\n", " n_estimators=config['n_estimators'],\n", " max_depth=config['max_depth'],\n", " learning_rate=config['learning_rate'],\n", " random_state=42\n", " )\n", " }\n", " \n", " track_results[event] = {}\n", " \n", " for name, model in models.items():\n", " model.fit(X_train_scaled, y_train)\n", " y_pred = model.predict(X_test_scaled)\n", " \n", " rmse = np.sqrt(mean_squared_error(y_test, y_pred))\n", " r2 = r2_score(y_test, y_pred)\n", " mean_error = mean_absolute_error(y_test, y_pred)\n", " \n", " track_results[event][name] = {\n", " 'RMSE': rmse,\n", " 'R2': r2,\n", " 'Mean Error': mean_error,\n", " 'Feature Importance': pd.DataFrame({\n", " 'feature': feature_columns,\n", " 'importance': model.feature_importances_\n", " }).sort_values('importance', ascending=False)\n", " }\n", " \n", " print(f\"\\nResults for {event} - {name}:\")\n", " print(f\"RMSE: 
{rmse:.2f} seconds\")\n", " print(f\"R2 Score: {r2:.3f}\")\n", " print(f\"Mean Error: {mean_error:.2f} seconds\")\n", " print(\"\\nTop 5 important features:\")\n", " print(track_results[event][name]['Feature Importance'].head().to_string())\n", " \n", " return track_results" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "def plot_model_performance(track_results):\n", " # Prepare data for plotting\n", " comparison_data = []\n", " for track, models in track_results.items():\n", " for model_name, metrics in models.items():\n", " comparison_data.append({\n", " 'Track': track.replace(' Grand Prix', ''), # Shorter names\n", " 'Model': model_name,\n", " 'RMSE': metrics['RMSE'],\n", " 'R2': metrics['R2'],\n", " 'Mean Error': metrics['Mean Error']\n", " })\n", " \n", " comparison_df = pd.DataFrame(comparison_data)\n", " \n", " # Create figure with subplots\n", " fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(15, 12))\n", " \n", " # Plot RMSE\n", " sns.barplot(data=comparison_df, x='Track', y='RMSE', hue='Model', ax=ax1)\n", " ax1.set_title('Model Performance (RMSE) by Track')\n", " ax1.set_xlabel('') # Remove x-label from top plot\n", " plt.setp(ax1.get_xticklabels(), rotation=45, ha='right')\n", " ax1.legend(title='Model', bbox_to_anchor=(1.05, 1), loc='upper left')\n", " \n", " # Plot R²\n", " sns.barplot(data=comparison_df, x='Track', y='R2', hue='Model', ax=ax2)\n", " ax2.set_title('Model Performance (R²) by Track')\n", " ax2.set_xlabel('Track')\n", " plt.setp(ax2.get_xticklabels(), rotation=45, ha='right')\n", " ax2.legend(title='Model', bbox_to_anchor=(1.05, 1), loc='upper left')\n", " \n", " # Add a title for the entire figure\n", " fig.suptitle('F1 Lap Time Prediction Model Performance', fontsize=16, y=1.02)\n", " \n", " # Adjust layout to prevent overlap\n", " plt.tight_layout()\n", " plt.show()\n", " \n", " # Print detailed statistics\n", " print(\"\\nDetailed Statistics by Track:\")\n", " for track in 
comparison_df['Track'].unique():\n", " track_data = comparison_df[comparison_df['Track'] == track]\n", " print(f\"\\n{track}:\")\n", " print(f\"Best RMSE: {track_data['RMSE'].min():.2f} seconds\")\n", " print(f\"Best R²: {track_data['R2'].max():.3f}\")\n", " best_model = track_data.loc[track_data['R2'].idxmax(), 'Model']\n", " print(f\"Best Model: {best_model}\")\n", " \n", " # Print overall model rankings\n", " print(\"\\nOverall Model Rankings (by mean R²):\")\n", " model_rankings = comparison_df.groupby('Model')['R2'].agg(['mean', 'std']).sort_values('mean', ascending=False)\n", " print(model_rankings.round(3))" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Processing Bahrain Grand Prix\n", "\n", "Results for Bahrain Grand Prix - Random Forest:\n", "RMSE: 1.48 seconds\n", "R2 Score: 0.199\n", "Mean Error: 1.05 seconds\n", "\n", "Top 5 important features:\n", " feature importance\n", "9 TempInteraction 0.254072\n", "10 TempInteractionSquared 0.248422\n", "8 TrackEvolution 0.234635\n", "7 GripCondition 0.078404\n", "4 TyreLife 0.038056\n", "\n", "Results for Bahrain Grand Prix - XGBoost:\n", "RMSE: 1.41 seconds\n", "R2 Score: 0.275\n", "Mean Error: 1.02 seconds\n", "\n", "Top 5 important features:\n", " feature importance\n", "9 TempInteraction 0.806055\n", "7 GripCondition 0.071437\n", "8 TrackEvolution 0.032646\n", "6 TempDelta 0.017066\n", "17 Compound_SOFT 0.014787\n", "\n", "Results for Bahrain Grand Prix - Gradient Boosting:\n", "RMSE: 1.43 seconds\n", "R2 Score: 0.259\n", "Mean Error: 1.02 seconds\n", "\n", "Top 5 important features:\n", " feature importance\n", "10 TempInteractionSquared 0.293157\n", "8 TrackEvolution 0.238935\n", "9 TempInteraction 0.202298\n", "7 GripCondition 0.067621\n", "5 TyreDeg 0.045087\n", "\n", "Processing Belgian Grand Prix\n", "\n", "Results for Belgian Grand Prix - Random Forest:\n", "RMSE: 1.12 seconds\n", "R2 Score: 0.775\n", 
"Mean Error: 0.92 seconds\n", "\n", "Top 5 important features:\n", " feature importance\n", "5 TyreDeg 0.202237\n", "4 TyreLife 0.158269\n", "6 TempDelta 0.142368\n", "0 TrackTemp 0.134529\n", "10 TempInteractionSquared 0.118459\n", "\n", "Results for Belgian Grand Prix - XGBoost:\n", "RMSE: 1.34 seconds\n", "R2 Score: 0.677\n", "Mean Error: 1.09 seconds\n", "\n", "Top 5 important features:\n", " feature importance\n", "6 TempDelta 0.415692\n", "0 TrackTemp 0.288164\n", "4 TyreLife 0.111815\n", "11 WeatherComplexity 0.073480\n", "8 TrackEvolution 0.035673\n", "\n", "Results for Belgian Grand Prix - Gradient Boosting:\n", "RMSE: 1.34 seconds\n", "R2 Score: 0.680\n", "Mean Error: 1.09 seconds\n", "\n", "Top 5 important features:\n", " feature importance\n", "6 TempDelta 0.330856\n", "4 TyreLife 0.183217\n", "5 TyreDeg 0.170472\n", "8 TrackEvolution 0.080550\n", "10 TempInteractionSquared 0.070230\n", "\n", "Processing Mexico City Grand Prix\n", "\n", "Results for Mexico City Grand Prix - Random Forest:\n", "RMSE: 1.05 seconds\n", "R2 Score: 0.505\n", "Mean Error: 0.77 seconds\n", "\n", "Top 5 important features:\n", " feature importance\n", "1 AirTemp 0.489273\n", "8 TrackEvolution 0.208571\n", "11 WeatherComplexity 0.086104\n", "4 TyreLife 0.066092\n", "5 TyreDeg 0.062654\n", "\n", "Results for Mexico City Grand Prix - XGBoost:\n", "RMSE: 1.17 seconds\n", "R2 Score: 0.375\n", "Mean Error: 0.90 seconds\n", "\n", "Top 5 important features:\n", " feature importance\n", "1 AirTemp 0.608189\n", "8 TrackEvolution 0.087145\n", "11 WeatherComplexity 0.068450\n", "16 Compound_MEDIUM 0.066138\n", "4 TyreLife 0.034888\n", "\n", "Results for Mexico City Grand Prix - Gradient Boosting:\n", "RMSE: 1.10 seconds\n", "R2 Score: 0.448\n", "Mean Error: 0.84 seconds\n", "\n", "Top 5 important features:\n", " feature importance\n", "1 AirTemp 0.475628\n", "8 TrackEvolution 0.176599\n", "11 WeatherComplexity 0.109352\n", "5 TyreDeg 0.072201\n", "4 TyreLife 0.071545\n", "\n", "Processing 
United States Grand Prix\n", "\n", "Results for United States Grand Prix - Random Forest:\n", "RMSE: 1.33 seconds\n", "R2 Score: 0.417\n", "Mean Error: 1.05 seconds\n", "\n", "Top 5 important features:\n", " feature importance\n", "10 TempInteractionSquared 0.300427\n", "9 TempInteraction 0.283201\n", "8 TrackEvolution 0.234468\n", "11 WeatherComplexity 0.042663\n", "4 TyreLife 0.027237\n", "\n", "Results for United States Grand Prix - XGBoost:\n", "RMSE: 1.59 seconds\n", "R2 Score: 0.163\n", "Mean Error: 1.27 seconds\n", "\n", "Top 5 important features:\n", " feature importance\n", "9 TempInteraction 0.583169\n", "11 WeatherComplexity 0.135200\n", "8 TrackEvolution 0.118025\n", "0 TrackTemp 0.037846\n", "6 TempDelta 0.027296\n", "\n", "Results for United States Grand Prix - Gradient Boosting:\n", "RMSE: 1.45 seconds\n", "R2 Score: 0.310\n", "Mean Error: 1.16 seconds\n", "\n", "Top 5 important features:\n", " feature importance\n", "10 TempInteractionSquared 0.301519\n", "9 TempInteraction 0.283101\n", "8 TrackEvolution 0.210339\n", "11 WeatherComplexity 0.062277\n", "5 TyreDeg 0.032606\n", "\n", "Processing British Grand Prix\n", "\n", "Results for British Grand Prix - Random Forest:\n", "RMSE: 1.32 seconds\n", "R2 Score: -0.075\n", "Mean Error: 1.03 seconds\n", "\n", "Top 5 important features:\n", " feature importance\n", "8 TrackEvolution 0.423258\n", "0 TrackTemp 0.169432\n", "6 TempDelta 0.114612\n", "7 GripCondition 0.113560\n", "11 WeatherComplexity 0.062591\n", "\n", "Results for British Grand Prix - XGBoost:\n", "RMSE: 1.24 seconds\n", "R2 Score: 0.047\n", "Mean Error: 0.96 seconds\n", "\n", "Top 5 important features:\n", " feature importance\n", "8 TrackEvolution 0.227264\n", "0 TrackTemp 0.157029\n", "7 GripCondition 0.134898\n", "2 Humidity 0.111574\n", "6 TempDelta 0.058466\n", "\n", "Results for British Grand Prix - Gradient Boosting:\n", "RMSE: 1.26 seconds\n", "R2 Score: 0.022\n", "Mean Error: 0.97 seconds\n", "\n", "Top 5 important features:\n", " 
feature importance\n", "8 TrackEvolution 0.423821\n", "0 TrackTemp 0.176351\n", "7 GripCondition 0.117170\n", "6 TempDelta 0.050927\n", "2 Humidity 0.049042\n" ] }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "Detailed Statistics by Track:\n", "\n", "Bahrain:\n", "Best RMSE: 1.41 seconds\n", "Best R²: 0.275\n", "Best Model: XGBoost\n", "\n", "Belgian:\n", "Best RMSE: 1.12 seconds\n", "Best R²: 0.775\n", "Best Model: Random Forest\n", "\n", "Mexico City:\n", "Best RMSE: 1.05 seconds\n", "Best R²: 0.505\n", "Best Model: Random Forest\n", "\n", "United States:\n", "Best RMSE: 1.33 seconds\n", "Best R²: 0.417\n", "Best Model: Random Forest\n", "\n", "British:\n", "Best RMSE: 1.24 seconds\n", "Best R²: 0.047\n", "Best Model: XGBoost\n", "\n", "Overall Model Rankings (by mean R²):\n", " mean std\n", "Model \n", "Random Forest 0.364 0.321\n", "Gradient Boosting 0.344 0.243\n", "XGBoost 0.307 0.240\n" ] } ], "source": [ "# Execute modeling pipeline\n", "track_results = prepare_modeling_data(merged_data)\n", "\n", "# Visualize results\n", "plot_model_performance(track_results)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Key Findings\n", "\n", "1. **Track-Specific Performance**:\n", " - Best performance achieved on Belgian GP with Random Forest (R² = 0.775)\n", " - Most challenging predictions for British GP (best R² = 0.047)\n", " - Weather conditions appear to have strongest influence at Belgian GP\n", "\n", "2. **Model Comparison**:\n", " - Random Forest consistently performs best across tracks\n", " - XGBoost shows high variance in performance\n", " - Gradient Boosting provides most stable results\n", "\n", "3. **Important Features**:\n", " - Track temperature and air temperature interaction\n", " - Track evolution throughout race\n", " - Weather complexity score\n", " - Tire degradation metrics\n", "\n", "## Next Steps for Improvement\n", "\n", "1. **Feature Engineering**:\n", " - Create more sophisticated tire degradation models\n", " - Incorporate historical track performance data\n", " - Develop track-specific feature sets\n", "\n", "2. 
**Model Optimization**:\n", " - Implement GridSearchCV for hyperparameter tuning\n", " - Test neural network approaches\n", " - Create track-specific model ensembles\n", "\n", "3. **Analysis Refinement**:\n", " - Investigate poor performance on British GP\n", " - Analyze weather condition thresholds\n", " - Study interaction effects between features" ] } ], "metadata": { "kernelspec": { "display_name": "csci349", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.13" } }, "nbformat": 4, "nbformat_minor": 2 }