STAT 29000: Project 7 — Spring 2022

Motivation: Being able to analyze and create good visualizations is a skill that is invaluable in many fields. It can be pretty fun too! In this project, we will create plots using the plotly package, as well as do some data manipulation using pandas.

Context: This is the second project focused around creating visualizations in Python.

Scope: plotly, Python, pandas

Learning Objectives
  • Demonstrate the ability to create basic graphs with default settings.

  • Demonstrate the ability to modify axes labels and titles.

  • Demonstrate the ability to customize a plot (color, shape/linetype).

Make sure to read about, and use the template found here, and the important information about projects submissions here.

Dataset(s)

The following questions will use the following dataset(s):

  • /depot/datamine/data/disney/total.parquet

  • /depot/datamine/data/disney/metadata.csv

Questions

Question 1

Read in the data from the parquet file /depot/datamine/data/disney/total.parquet and store it in a variable called dat. Do the same for /depot/datamine/data/disney/metadata.csv and call it meta.

Plotly express makes it really easy to create nice, clean graphics, and it integrates with pandas superbly. You can find links to all of the plotly express functions on this page.

Let’s start out simple. Create a bar chart for the total number of observations for each ride. Make sure your plot has labels for the x axis, y axis, and overall plot.

While the default plotly plots look amazing and have great interactivity, they won’t render in your notebook well in Gradescope. For this reason, please use fig.show(renderer="jpg") for all of your plots, otherwise they will not show up in gradescope and you will not get full credit.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 2

Great! Wouldn’t it be interesting to see how the total number of observations changes over time, by ride?

Create a single plot that contains a bar plot for all of the rides, with the total number of observations for each ride by year.

This page has a good example of making facetted subplots.

First, create a new column called year based on the year from the datetime column.

Next, group by both the ride_name and year columns. Use the count method to get the total number of observations for each combination of ride_name and year. After that, use the reset_index method so that both ride_name and year become columns again (instead of indices).

The x axis should be the year, y axis could be datetime (which actually contains the count of observations), the color argument should be year, facet_col should be ride_name, and you can limit the number of plots per column by specifying facet_col_wrap to be 4 (for 4 plots per row).

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 3

Create a plot that shows the association between the average SPOSTMIN and WDWMAXTEMP for a ride of your choice. Some notes to help you below.

  1. Create a new column in dat called day that is the date. Make sure to use pd.to_datetime to convert the date to the correct type.

  2. Use the groupby method to group by both ride_name and day. Get the average by using the mean method on your grouped data. In order to make ride_name and day columns instead of indices, call the reset_index method. Finally, use the query method to subset your data to just be data for your given ride.

  3. Convert the DATE column in meta to the correct type using pd.to_datetime.

  4. Use the merge method to merge your grouped data with the metadata on day (from the grouped data) and DATE (from meta).

  5. Make the scatterplot.

Is there an obvious relationship between the two variables for your chosen ride? Did you expect the results?

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 4

This is an extremely rich dataset with lots and lots of plotting potential! In addition, there are a lot of interesting questions you could ask about wait times and rides that could actually be useful if you were to visit Disney!

Create a graphic using a plot we have not yet used from this webpage. Make sure to use proper labels, and make sure the graphic shows some sort of potentially interesting relationship. Write 1-2 sentences about why you decided to create this plot.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 5

Ask yourself a question regarding the information in the dataset. For example, maybe you think that certain events from the meta dataframe will influence a certain ride. Perhaps you think the time the park opens is relevant to the time of year? Write down the question you would like to answer using a plot. Choose the type of plot you are going to use, and write 1-2 sentences explaining your reasoning. Create the plot. What were the results? Was the plot an effective way to answer your question?

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connect ion, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.

In addition, please review our submission guidelines before submitting your project.