STAT 39000: Project 13 — Spring 2021
Motivation: Data wrangling tasks can vary between projects. Examples include joining multiple data sources, removing data that is irrelevant to the project, handling outliers, etc. Although we’ve practiced some of these skills, it is always worth it to spend some extra time to master tidying up our data.
Context: We will continue to gain familiarity with the tidyverse
suite of packages (including ggplot
), and data wrangling tasks.
Scope: r, tidyverse
Dataset
The following questions will use the dataset found in Scholar:
/anvil/projects/tdm/data/consumer_complaints/complaints.csv
Questions
Question 1
Read the dataset into a tibble
named complaintsDF
. This dataset contains consumer complaints for over 5,000 companies. Our goal is to create a tibble
called companyDF
containing the following summary information for each company:
-
Company
: The company name (Company
) -
State
: The state (State
) -
percent_timely_response
: Percentage of timely complaints (Timely response?
) -
percent_consumer_disputed
: Percentage of complaints that were disputed by the consumer (Consumer disputed?
) -
percent_submitted_online
: Percentage of complaints that were submitted online (use columnSubmitted via
, and consider a submission to be an online submission if it was submitted viaWeb
orEmail
) -
total_n_complaints
: Total number of complaints
There are various ways to create companyDF
. Let’s practice using the pipes (%>%
) to get companyDF
. The idea is that our code at the end of question 2 will look something like this:
companyDF <- complaintsDF %>%
insert_here_code_to_change_variables %>% # (question 1)
insert_here_code_to_group_and_get_summaries_per_group # (question 2)
First, create logical columns (columns containing TRUE
or FALSE
) for Timely response?
, Consumer disputed?
and Submitted via
named timely_response_log
, consumer_disputed_log
and submitted_online
, respectively.
timely_response_log
and consumer_disputed_log
will have value TRUE
if Timely response?
and Consumer disputed?
have values Yes
respectively, and FALSE
if the value for the original column is No
. submitted_online
will have value TRUE
if the the complaint was submitted via Web
or Email
.
You can double check your results for each column by getting a table with the original and modified column, as shown below. In this case, we would want all TRUE
values to be in row Yes
, and all FALSE
to be in row No
.
table(companyDF$`Timely response?`, companyDF$timely_response_log)
-
R code used to solve the problem.
-
Output from running your code.
Question 2
Continue the pipeline we started in question (1). Get the summary information for each company. Note that you will need to include more pipes in the pseudo-code from question (1) as we want the summary for each company in each state. If a company is present in 4 states, companyDF
should have 4 rows for that company — one for each state. For the rest of the project, we will refer to a company as its unique combination of Company
and State
.
The function |
-
R code used to solve the problem.
-
Output from running your code.
Question 3
Using ggplot2
, create a scatterplot showing the relationship between percent_timely_response
and percent_consumer_disputed
for companies with at least 500 complaints. Based on your results, do you believe there is an association between how timely the company’s response is, and whether the consumer disputes? Why or why not?
Remember, here we consider each row of |
-
R code used to solve the problem.
-
Output from running your code.
Question 4
Which company, with at least 250 complaints, has the highest percent of consumer dispute?
We are learning |
-
R code used to solve the problem.
-
Output from running your code.
Question 5
Create a graph using ggplot2
that compares States
based on any columns from companyDF
or complaintsDF
. You may need to summarize the data, filter, or even create new variables depending on what your metric of comparison is. Below are some examples of graphs that can be created. Do not feel limited by them. Make sure to change the labels for each axis, add a title, and change the theme.
-
Cleveland’s dotplot for the top 10 states with the highest ratio between percent of disputed complaints and timely response.
-
Bar graph showing the total number of complaints in each state.
-
Scatterplot comparing the percentage of timely responses in the state and average number of complaints per state.
-
Line plot, where each line is a state, showing the total number of complaints per year.
-
R code used to solve the problem.
-
Output from running your code.
-
The plot produced.
-
1-2 sentences commenting on your plot.