TDM 20100: Project 9 — 2022
Motivation: Although SQL syntax may still feel unnatural and foreign, with more practice it will start to make more sense. The ability to read and write SQL queries is a "bread-and-butter" skill for anyone working with data.
Context: We are in the second of a series of projects that focus on learning the basics of SQL. In this project, we will continue to harden our understanding of SQL syntax, and introduce common SQL functions like AVG
, MIN
, and MAX
.
Scope: SQL, sqlite
Dataset(s)
The following questions will use the following dataset(s):
-
/anvil/projects/tdm/data/taxi/taxi_sample.db
Questions
Question 1
In previous projects, we used awk
to parse through and summarize data. While awk
is extremely convenient and can work well, but SQL is even better.
Write a query that will return the fare_amount
, surcharge
, tip_amount
, and tolls_amount
as a percentage of total_amount
.
Make sure to limit the output to only 100 rows! Use the |
Use the |
-
Code used to solve this problem.
-
Output from running the code.
Question 2
Check out the payment_type
column. Write a query that counts the number of each type of payment_type
. The end result should print something like the following.
payment_type, count CASH, 123
You can use aliasing to control the output header names. |
Write a query that sums the total_amount
for payment_type
of "CASH". What is the total amount of cash payments?
-
Code used to solve this problem.
-
Output from running the code.
Question 3
Write a query that gets the largest number of passengers in a single trip. How far was the trip? What was the total amount? Answer all of this in a single query.
Whoa, there must be some erroneous data in the database! Not too surprising. Write a query that explores this more, explain what your query does and how it helps you understand what is going on.
Make sure all queries limit output to only 100 rows. |
-
Code used to solve this problem.
-
Output from running the code.
Question 4
Write a query that gets the average total_amount
for each year in the database. Which year has the largest average total_amount
? Use the pickup_datetime
column to determine the year.
Read this page and look at the strftime function. |
If you want the headers to be more descriptive, you can use aliases. |
-
Code used to solve this problem.
-
Output from running the code.
Question 5
What percent of data in our database has information on the location of pickup and dropoff? Examine the data, to see if there is a pattern to the rows with that information and without that information.
There is a distinct pattern. Pay attention to the date and time of the data. |
Confirm your hypothesis with the original data set(s) (in /anvil/projects/tdm/data/taxi/yellow/*.csv
), using bash. This doesn’t have to be anything more thorough than running a simple head
command with a 1-2 sentence explanation.
Of course, there will probably be some erroneous data for the latitude and longitude columns. However, you could use the |
-
Code used to solve this problem.
-
Output from running the code.
Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted. In addition, please review our submission guidelines before submitting your project. |