Initial commit - repo setup/skeletons

2026-02-05 00:06:39 -05:00 · 2024-11-18 08:14:05 -05:00
commit 32186b26da
4 changed files with 171 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,39 @@
 # CSCI 349 Final Project: Formula One Driver Performance Analysis
 ### Team Members:
 - Sean O'Connor
 - Connor Coles
 ### Project Summary
 We are conducting a data mining project focused on analyzing driver performance in Formula One racing. Our goal is to correlate driver performance with track and weather conditions, and to predict future race results using these correlations. We will apply various data mining techniques to extract meaningful insights from the dataset.
 ### Important Dates
 - **Data Selection Due:** November 13, 2024
 - **DataPrep_EDA.ipynb Due:** November 22, 2024
 - **Modeling.ipynb Due:** December 4, 2024
 - **Final Report PDF Due:** December 10, 2024
 - **Video Presentation Due:** December 13, 2024
 ### Package Structure
 Directories:
 - **data** - Contains the dataset used for analysis.
 - **project** - Contains Jupyter notebooks for data preparation, EDA, modeling, and the final report.
 ### Third-Party Libraries
 - pandas
 - numpy
 - matplotlib
 ### Video Presentation
 <!-- Our video presentation can be found [here](insert_video_link). -->
 Our video presentation will be linked here.
 ### Final Deliverables
 - **DataPrep_EDA.ipynb** - Notebook for data preparation and exploratory data analysis.
 - **Modeling.ipynb** - Notebook for developing and evaluating predictive models.
 - **Final_Report.pdf** - Comprehensive report summarizing our findings and methodologies, submitted to Gradescope.
 - **Video Presentation** - A recorded video summarizing our project, linked above.
 ### Important Links
 - [Dataset Source](https://openf1.org)
 - [GitLab Repository](https://gitlab.bucknell.edu/sso005/csci349_final_project)
--- a/project/DataPrep_EDA.ipynb
+++ b/project/DataPrep_EDA.ipynb
@@ -0,0 +1,38 @@
 {
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Formula One Project: Data Preparation and EDA\n",
    "\n",
    "DUE: November 22nd, 2024 (Fri)  \n",
    "Name(s): Sean O'Connor, Connor Coles    \n",
    "Class: CSCI 349 - Intro to Data Mining  \n",
    "Semester: Fall 2024  \n",
    "Instructor: Brian King  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Assignment Description\n",
    "Create your first notebook file, DataPrep_EDA.ipynb. Use both markdown and code cells to convey the following:\n",
    "- What problem are you working on? Summarize in a single cell.\n",
    "- What data are you using to understand the problem? Describe the data in a very general sense. Where did it come from? You should understand what every observation in the data represents, and what each variable represents.\n",
    "- Remember that the key to achieving good machine learning outcomes is understanding how each real-world entity in your data will be represented as a fixed length vector of attributes in your dataset! Preprocessing your data will be a big part of this challenge. If you do not expect to spend quality time cleaning and prepping your data, you will not get good results. Once you have established how each data object is represented in a form ready for a data mining algorithm, and the data are clean, you will have a substantial part of your battle toward modeling solved.\n",
    "- Strive to generate good summary statistics, show what the data looks like, and include good EDA and visualizations with boxplots, bar charts, density plots for key variables, or whatever other plots you want that are specific to your data and problem to help the reader understand basic distributions of important variables. Visualizations can help you convey general info about your data and are extremely helpful.\n",
    "- In your final cells, discuss the modeling methods you expect to use. Start by clearly explaining whether this is a classification, regression, clustering, or association rule mining problem, and justify your answer. You have much of the framework to apply most algorithms, even those beyond what we covered in class. Feel free to explore different methods if you have good justification for doing so. If there are any papers of significance that have been published with these data, then discuss the ones most interesting/relevant to the team.\n",
    "- Finally, what is your overarching aim with this project? What are you hoping to learn? Or, what hypothesis are you using the data to confirm or disprove? What challenges do you foresee on this project? Discuss your concerns. How will you get your work done? Give a reasonable list of milestones to reach to arrive at the final deadline for the project."
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
 }
--- a/project/Final_Report.ipynb
+++ b/project/Final_Report.ipynb
@@ -0,0 +1,60 @@
 {
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Formula One Project: Final Report\n",
    "\n",
    "DUE: December 10th, 2024 (Tue)  \n",
    "Name(s): Sean O'Connor, Connor Coles    \n",
    "Class: CSCI 349 - Intro to Data Mining  \n",
    "Semester: Fall 2024  \n",
    "Instructor: Brian King  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Assignment Description\n",
    "This is your final report! Consider the scenario of you giving a report to a client or your supervisor on your study. Include good reporting techniques. Use good tables and other visualizations. Structure your notebook using proper headers for major sections. THIS NOTEBOOK SHOULD BE A COMPLETE STANDALONE NOTEBOOK FROM START TO FINISH! But, ONLY INCLUDE THE BEST OUTCOMES FROM YOUR PREVIOUS NOTEBOOKS!\n",
    "\n",
    "Include the following sections:\n",
    "\n",
    "1. **Introduction**\n",
    "   - This should have mostly been done in your first notebook. Just copy over relevant cells from your first part of the project, and add any new information you have learned. Your aim is to motivate the reader with the importance and relevance of your project.\n",
    "\n",
    "2. **Data**\n",
    "   - Again, most of this was likely done in your first part of the project. So, feel free to copy the important cells over.\n",
    "   - After your introduction, you should be introducing the original, raw data. Where was it collected? When? How? Explain the meaning of the variables. What does each observation represent? And be sure to explain the key target variable (assuming you are doing classification/regression).\n",
    "   - PLEASE DO NOT SHOW PAGES AND PAGES OF YOUR DATA! Display only a few observations so that the reader can see what your raw, original data look like.\n",
    "\n",
    "3. **Data Preparation**\n",
    "   - While the previous section shows the raw data, this section is going to carefully explain the steps you followed to clean and preprocess the data in a form suitable for analysis, visualization, and modeling. You should be setting proper variable types, dealing with missing data, etc. Preprocessing steps should be explained with justification. Include any dimensionality reduction techniques you might have done. Summarize what you needed to do to clean it, and show some example observations from your final, cleaned data.\n",
    "\n",
    "4. **EDA**\n",
    "   - We expect good EDA to understand your data. Visualizations after preprocessing will do far more to convey your summary statistics than just numbers. Discuss the distributions, correlations with the target variable, etc.\n",
    "\n",
    "5. **Modeling**\n",
    "   - What modeling methods did you try? And, which method(s) did you ultimately determine were the best? What hyperparameters were selected? Justify the selection of parameters (e.g., did you do a grid search?). You cannot simply say, \"XGBoost was the only one I evaluated, and I used default parameters.\" That is boring, and unlikely to obtain the best results. You are expected to evaluate different models and different hyperparameters. It is very rare that default parameters are the best in machine learning.\n",
    "\n",
    "6. **Performance Results**\n",
    "   - Once you've selected the best model, clearly convey the results of your model. I expect to see ROC curves, precision/recall curves, confusion matrices, tables with prediction performance by class (or, if regression, appropriate regression measures), etc.\n",
    "\n",
    "7. **Discussion**\n",
    "   - Reflect on your project. For example: discuss any challenges you had with cleaning and preparing the data. Did you find any surprises during your modeling? Compare and contrast the methods and hyperparameters you evaluated. And, it's often useful to discuss the features that you thought were the most predictive, and those that were least useful. (Search for feature importance scikit-learn for more info!) Any info that might be of interest to me related to your project goes here.\n",
    "\n",
    "8. **Conclusions**\n",
    "   - Write a short summary of your project here."
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
 }
--- a/project/Modeling.ipynb
+++ b/project/Modeling.ipynb
@@ -0,0 +1,34 @@
 {
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Formula One Project: Modeling\n",
    "\n",
    "DUE: December 4th, 2024 (Wed)  \n",
    "Name(s): Sean O'Connor, Connor Coles    \n",
    "Class: CSCI 349 - Intro to Data Mining  \n",
    "Semester: Fall 2024  \n",
    "Instructor: Brian King  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Assignment Description\n",
    "\n",
    "Copy over the important cells from the previous step that read in and cleaned your data to this new notebook file. You do not need to copy over all your EDA and plots describing your data, only the code that prepares your data for modeling. This notebook is about exploring the development of predictive models. Some initial preliminary work on applying some modeling techniques should be completed.\n",
    "Be sure to commit and push all supporting code that you've completed in this file. Include in this notebook a summary cell at the top that details your accomplishments, challenges, and what you expect to accomplish for your final steps. Be sure to update the README.md in your repository."
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
 }