TDM 20200: Project 10 — 2024
Motivation: Machine learning and AI are huge buzzwords in industry. In this project, we will get an introduction to some machine-learning-related libraries in Python, such as tensorflow and scikit-learn, and we aim to understand some basic machine learning workflow concepts.
Context: The purpose of these projects is to give you exposure to machine learning tools and some of their basic functionality, and to show why they are useful, without needing any special math or statistics background. We will try to build a model to predict the arrival delay (ArrDelay) of flights, based on features like departure delay, distance of the flight, departure time, arrival time, etc.
Scope: Python, tensorflow, scikit-learn
Readings and Resources
You need to use 2 cores for your Jupyter Lab session for Project 10 this week.
We added five new videos to help you with Project 10. BUT the example videos are about a data set of beer reviews; you need to work instead on the flight data given here:
Questions
Question 1 (2 points)
For this project, we will only need these columns of the data set:
mycols = [
'DepDelay', 'ArrDelay', 'Distance',
'CarrierDelay', 'WeatherDelay',
'DepTime', 'ArrTime', 'Diverted', 'AirTime'
]
- Load just a few rows of the data set. Explore the dataset columns, and figure out the data types of the columns listed in mycols. Based on your exploration, define a dictionary variable called my_col_types that holds the column names and the type of each of the columns listed in mycols.
- Now load the first 10,000 rows of the data set (but only the columns specified in mycols) into a data frame called myDF.
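Here is a minimal sketch of one way to approach this with pandas. The file path below is a placeholder for the flight data file given for this project, and the dtypes are illustrative only; use the types you find in your own exploration.

import pandas as pd

mycols = [
    'DepDelay', 'ArrDelay', 'Distance',
    'CarrierDelay', 'WeatherDelay',
    'DepTime', 'ArrTime', 'Diverted', 'AirTime'
]

# Peek at a few rows to inspect the raw values and the inferred types
peek = pd.read_csv('flights.csv', usecols=mycols, nrows=5)  # 'flights.csv' is a placeholder path
print(peek.dtypes)

# Illustrative only: record the type you found for each column in mycols
my_col_types = {col: 'float64' for col in mycols}

# Load the first 10,000 rows, restricted to mycols, using the chosen types
myDF = pd.read_csv('flights.csv', usecols=mycols, nrows=10000, dtype=my_col_types)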
Question 2 (2 points)
- Import the following libraries:

import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np
import time
- For each column, fill in the missing values within that column, using the median value of that column.
(Note: This works in this situation because all of our columns contain numerical data. In the future, if you want to fill in missing values in a column that contains an object data type, you will need to use a different procedure.)
(Another note: We are filling in missing values because a machine learning model can be confused by them; some machine learning models depend on every value being present to make a decision.)
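Here is a minimal sketch of the median fill, assuming the myDF from Question 1:

# Replace the missing values in each column with that column's median
for col in myDF.columns:
    myDF[col] = myDF[col].fillna(myDF[col].median())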
Question 3 (2 points)
Now let’s look into how to prepare our features and labels for the machine learning model. You may use the following example code.
# Splitting features and labels
features = myDF.drop('ArrDelay', axis=1)
labels = myDF['ArrDelay']
- What is the difference between features and labels?
- Considering the following example code, why do we need to split our data into training and testing sets?
# Split
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, random_state=42)
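As a quick, optional sanity check, you can inspect the shapes of the resulting pieces; the counts below assume the 10,000-row myDF from Question 1, with 8 feature columns remaining after dropping ArrDelay:

print(X_train.shape, y_train.shape)  # expect (8000, 8) and (8000,)
print(X_test.shape, y_test.shape)    # expect (2000, 8) and (2000,)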
Question 4 (2 points)
- Now let us standardize our data, using this example code.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train).astype(np.float32)
X_test_scaled = scaler.transform(X_test).astype(np.float32)
This is what scaling does to the data, and the reason why we need it for machine learning models:
- Machine learning models usually assume all features are on a similar scale, so the data need to be standardized to a common scale.
- Standardizing is like translating and rescaling every point on a graph to fit within a new frame, so the machine learning model can understand the data better.
- StandardScaler() is a function used to pre-process data before feeding it into a machine learning model.
- The StandardScaler adjusts the data features so they have a mean of 0 and a standard deviation of 1, making models like neural networks perform better, because they are sensitive to the scale of the input data.
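You can verify the effect of the scaler with a quick, optional check; the values will be approximately (not exactly) 0 and 1 because of floating-point precision:

print(X_train_scaled.mean(axis=0))  # close to 0 for every column
print(X_train_scaled.std(axis=0))   # close to 1 for every column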
- Now let us slice our data, using this example code.
train_dataset = tf.data.Dataset.from_tensor_slices((X_train_scaled, y_train)).batch(14)
test_dataset = tf.data.Dataset.from_tensor_slices((X_test_scaled, y_test)).batch(14)
This is a brief description of how TensorFlow slices data:
- from_tensor_slices() is a function that takes a tuple of arrays (or tensors) as input and outputs a dataset. Each element of the dataset is a slice across these arrays: a tuple of one row from X (the features) and the corresponding row from y (the labels). This pairing allows the model to match each input with its corresponding output.
- batch(14) divides the dataset into batches of 14 elements. Instead of feeding all of the data to the model at one time, the data are processed iteratively, so that the computation is not too memory-intensive.
- We can choose how many pieces of data are used at a time; for instance, we can use 14 slices at a time. The batch size can impact the model's performance and how long the model takes to learn. You may need to try different batch sizes to figure out which works best.
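To see the batching in action, you can pull a single batch from the dataset (an optional check, assuming the train_dataset defined above and the 8 feature columns from Question 3):

for batch_features, batch_labels in train_dataset.take(1):
    print(batch_features.shape)  # (14, 8): one batch of 14 rows of features
    print(batch_labels.shape)    # (14,): the 14 matching labels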
Question 5 (2 points)
- Now (finally!) we will build a machine learning model, train it, and evaluate it using TensorFlow. The following example code defines a model architecture, compiles the model, trains the model on a dataset, and evaluates it on a separate dataset, to ensure the model's effectiveness. Please create and run the whole program, namely: load the dataset, clean the data, specify the features and labels, specify the training and testing data, define the model, compile and train the model, and clean things up after building the model.
# Define model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(X_train_scaled.shape[1],)),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(1)
])

# Compile
model.compile(optimizer='adam', loss='mean_squared_error', metrics=['mean_absolute_error'])

# Train
history = model.fit(train_dataset, epochs=10, validation_data=test_dataset)

# Cleanup
del X_train_scaled, X_test_scaled, train_dataset, test_dataset
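The example above already monitors the test data during training via validation_data. If you also want an explicit final evaluation (an optional addition, not part of the example code), run it before the Cleanup step, since the cleanup deletes test_dataset:

# Evaluate the trained model on the held-out test set
loss, mae = model.evaluate(test_dataset)
print(f'Test MSE: {loss:.2f}, Test MAE: {mae:.2f}')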
In the next project, we will reflect on what we learned during this project. We will continue to explore!
Project 10 Assignment Checklist
- Jupyter Lab notebook with your code, comments, and outputs for the assignment
  - firstname-lastname-project10.ipynb
- Python file with code and comments for the assignment
  - firstname-lastname-project10.py
- Submit files through Gradescope
Please make sure to double-check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting, to make sure that what you think you submitted was what you actually submitted. In addition, please review our submission guidelines before submitting your project.