STAT 39000: Project 12 — Spring 2021
Motivation: As we mentioned before, data wrangling is a big part in any data driven project. "Data Scientists spend up to 80% of the time on data cleaning and 20 percent of their time on actual data analysis." Therefore, it is worth to spend some time mastering how to best tidy up our data.
Context: We are continuing to practice using various tidyverse
packages, in order to wrangle data.
Scope: python
Make sure to read about, and use the template found here, and the important information about projects submissions here.
The first step in any data science project is to define our problem statement. In this project, our goal is to gain insights into customers' behaviours with regards to online orders and restaurant ratings.
Dataset
The following questions will use the dataset found in Scholar:
/class/datamine/data/restaurant_recommendation/*.csv
Questions
Question 1
Load the tidyverse
suite a packages, and read the data from files orders.csv
, train_customers.csv
, and vendors.csv
into tibble`s named `orders
, customers
, and vendors
respectively.
Take a look the tibbles
and describe in a few sentences the type of information contained in each dataset. Although the name can be self-explanatory, it is important to get an idea of what exactly we are looking at. For each combination of 2 datasets, which column would you use to join them?
-
R code used to solve the problem.
-
Output from running your code.
-
1-2 sentences explaining each dataset (
orders
,customers
, andvendors
). -
1-2 sentences for each combination of 2 datasets describing if we could combine the datasets or not, and which column you would you use to join them.
Question 2
Let’s tidy up our datasets a bit prior to joining them. For each dataset, complete the tasks below.
-
orders
: remove columns from and betweenpreparationtime
todelivered_time
(inclusive). -
customers
: take a look at the columndob
. Based on its values, what do you believe it was supposed to contain? Can we rely on the numbers selected? Why or why not? Based on your answer, keep the columnsakeed_customer_id
,gender
, anddob
, OR justakeed_customer_id
andgender
. -
vendors
: take a look at columnscountry_id
andcity_id
. Would they be useful to compare the vendors in our dataset? Why or why not? If not, remove the columns from the dataset.
-
R code used to solve the problem.
-
Output from running your code.
-
1-2 sentences describing what columns you kept for
vendors
andcustomers
and why.
Question 3
Use your solutions from questions (1) and (2), and the join functions from tidyverse (inner_join
, left_join
, right_join
, and full_join
) to create a single tibble
called myDF
containing information only where all 3 tibbles
intersect.
For example, we do not want myDF
to contain orders from customers that are not in customers
tibble. Which function(s) from the tidyverse did you use to merge the datasets and why?
|
When combining two datasets, you may want to change the argument |
-
R code used to solve the problem.
-
Output from running your code.
-
1-2 sentences describing which function you used, and why.
Question 4
Great, now we have a single, tidy dataset to work with. There are 2 vendor categories in myDF, Restaurants
and Sweets & Bakes
. We would expect there to be some differences. Let’s compare them using the following variables: deliverydistance
, item_count
, grand_total
, and vendor_discount_amount
. Our end goal (by the end of question 5) is to create a histogram colored by the vendor’s category (vendor_category_en
), for each variable.
To accomplish this easily using ggplot
, we will take advantage of pivot_longer
. Pivot columns deliverydistance
, item_count
, grand_total
, and vendor_discount_amount
in myDF
. The end result should be a tibble
with columns variable
and values
, which contain the name of the pivoted column (variable
), and values of those columns (values
) Call this modified dataset myDF_long
.
-
R code used to solve the problem.
-
Output from running your code.
Question 5
Now that we have the data in the ideal format for our plot, create a histogram for each variable. Make sure to color them by vendor category (vendor_category_en
). How do the two types of vendors compare in these 4 variables?
Use the argument |
You may want to add some transparency to your plot. Add it using |
You may want to change the argument |
-
R code used to solve the problem.
-
Output from running your code.
-
2-3 sentences comparing
Restaurants
andSweets & Bakes
fordeliverydistance
,item_count
,grand_total
andvendor_discount_amount
.