
Name: ________________

Section: _______________

Statistics Assignment #2

Descriptive Statistics (20 points)

Marketing Research

INSTRUCTIONS:

This assignment must be completed individually. This means that you may not ask any person other than the professor for help. You are permitted to use the internet to look up how to perform certain operations in R or to look up certain statistical concepts. If you have any questions, please ask the professor directly.

Instructions for this assignment assume that you are using RStudio. Remember that you must install “base” R in order to use RStudio. You may use a different statistical software (e.g., SPSS, Stata) if you like, but it is your responsibility to make sure that the results, graphs, etc. from this software match the correct R output. You may not use Excel for this assignment.

In this document, I will tell you what code to run in RStudio. The specific code will be written in Courier font (i.e., the font you are reading now). Other instructions and notes will be written in Times New Roman (i.e., the font you are reading now). If the font is not Courier, you should not type that into RStudio. It will not work.

For early assignments, I will give you relatively detailed code and notes to go with them. However, once we have covered a certain command, the instructions will be less detailed. You should go back and look at old assignments and questions to see the more detailed notes on a certain command.

Finally, if I ask you to copy and paste your output, it is your responsibility to make sure that the output is readable. You may lose points if the output is not readable (e.g., the columns are not lined up properly). Therefore, I recommend taking a screenshot rather than copying and pasting the text.

For this assignment, please use the “IMDb Data.csv” file posted on Blackboard.

This data file has the following variables:

Rank: IMDb rank for the last 10 years

Title: Name of the movie

Genre: Genre of the movie

Description: Brief description of the plot

Director: Director of the movie

Actors: List of actors in the movie

Year: Year that the movie was released

Runtime: Length of the movie (in minutes)

Rating: Average viewer score (on a 1 to 10 scale)

Votes: Number of viewer ratings

Revenue: Millions of dollars made at the box office

Metascore: The average critic score (converted to a 1 to 100 scale)

“BACKWARD LOOKING” QUESTIONS (15 points total)

GRADED FOR ACCURACY

Question 1:

Load the “IMDb Data.csv” file into RStudio. Hint: See Assignment 1, Question 1 for instructions.
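If you load the file with code rather than through the RStudio menus, the sketch below shows one possible way to do it. It assumes base R's read.csv, which may not be the method from Assignment 1. So that it runs on its own, it writes a tiny made-up file to a temporary location; toy_path, imdb_toy, and every value in the file are invented for illustration and are not taken from the IMDb data.

```r
# Hypothetical three-row file standing in for the real CSV (values are made up).
toy_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Title,Year,Runtime (Minutes),Rating,Votes",
  "Movie One,2014,100,7.0,1000",
  "Movie Two,2015,110,6.5,2000",
  "Movie Three,2016,120,8.0,3000"
), toy_path)

# check.names = FALSE keeps the column name "Runtime (Minutes)" exactly as it
# appears in the file. With the default (TRUE), read.csv rewrites the name so
# that the space and parentheses become dots.
imdb_toy <- read.csv(toy_path, check.names = FALSE)

names(imdb_toy)   # column names as R stores them
str(imdb_toy)     # column types and a preview of the values
```

Checking names() right after loading is a quick way to confirm how a column must be spelled in later code.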

Create a histogram showing the number of movies in the top 1000 by year. Hint: See Assignment 1, Question 4 for similar code. Because you want to show the data for the entire dataset, you do not need to include the part of the code that is in “[]”.
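As a reminder of how hist behaves, here is a minimal sketch on made-up data. It is not the code from Assignment 1; the data frame toy, its values, and the choice of one bin per year are all assumptions made for illustration.

```r
# Made-up release years, for illustration only.
toy <- data.frame(Year = c(2006, 2007, 2007, 2008, 2008, 2008, 2009))

# One bar per year: put the bin edges halfway between consecutive years.
year_breaks <- seq(min(toy$Year) - 0.5, max(toy$Year) + 0.5, by = 1)

h <- hist(toy$Year, breaks = year_breaks,
          main = "Movies per year (toy data)", xlab = "Year")

# hist also returns the bar heights, which is handy for reading exact counts.
h$counts
```

Without the breaks argument, hist chooses the bins for you, so a single bar may cover more than one year.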

Copy/paste the code that you used to create the histogram. (1 point)

Copy/paste the histogram below. (1 point)

In which year were the most movies in the top 1000 released? (1 point)

Approximately how many movies were released in that year (i.e., the year you answered in part c)? (1 point)

What do you think explains the pattern you see in the histogram? Do not simply explain the shape of the histogram. Describe a real-world phenomenon that would produce these results. Remember that this data was collected in 2016. (2 points)

Question 2:

Imagine that you wanted to test the hypothesis that movies with a higher number of ratings (i.e., more votes) receive higher ratings. That is, movies that are seen more often (and therefore receive more ratings) are rated more highly.

Use the following code to run a simple linear regression, using Rating as a dependent (Y) variable and Votes as your independent (X) variable.

reg1<-lm(Rating~Votes, data=imdb)

summary(reg1)

See Assignment 1, Question 5 for more details about running regressions in R.

What is the coefficient associated with “Votes”? (1 point)

Interpret this coefficient. How do changes in the number of votes a movie receives influence its IMDb rating? Be precise. Don’t just say, “it goes up.” (2 points)

Using an alpha value (i.e., a p-value cutoff) of 0.05, is the relationship between Votes and Rating statistically significant? Explain your answer. (1 point)

Think about your answer from part (b). Why might that interpretation be wrong? That is, what could happen in the real world that might make you doubt that the number of votes directly influences how high a movie is rated? (2 points)

Create a multiple regression model (i.e., a regression with more than one X variable), using “Votes” and “Runtime (Minutes)” to predict Rating. Use the following code:

reg2<-lm(Rating~Votes+`Runtime (Minutes)`, data=imdb)

summary(reg2)

You can add additional independent variables to your regression by using the plus sign (+) on the right-hand side of the tilde (~). So in the code above, Votes+`Runtime (Minutes)` means that you are going to use Votes and Runtime to predict Rating.

Remember that because the variable “Runtime (Minutes)” has a space and parentheses in its name, you need to put “backquotes” around the name (i.e., `Runtime (Minutes)`) whenever you refer to it in R code.
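The two notes above can be tried out on a small example. The sketch below uses a made-up data frame (toy and all of its values are invented for illustration) to show a formula with two predictors, one of which needs backquotes.

```r
# Made-up data, for illustration only.
toy <- data.frame(
  Rating = c(6.1, 6.8, 7.0, 7.4, 7.9, 8.2),
  Votes  = c(20000, 45000, 30000, 90000, 120000, 200000)
)

# A column name with a space and parentheses must be wrapped in backquotes.
toy$`Runtime (Minutes)` <- c(95, 102, 118, 110, 131, 140)

# Each + on the right-hand side of ~ adds one more predictor.
fit <- lm(Rating ~ Votes + `Runtime (Minutes)`, data = toy)

# One intercept plus one coefficient per predictor.
coef(fit)
```

If you leave out the backquotes, R tries to read Runtime (Minutes) as a call to a function named Runtime, and the model will not run.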

Copy and paste your regression output. (1 point)

Compare the output for your regression in part (e) to the output for your regression in part (a). Which of the two regression models would do a better job estimating a movie’s IMDb rating? Explain your answer. (1 point)

Use the regression from part (e). Imagine that there are two movies. Movie A is 90 minutes long and has 50,000 votes on IMDb. Movie B is 120 minutes long and has 40,000 votes on IMDb. What would you expect each movie’s rating to be? (1 point; 0.5 points each)

Movie A: _________

Movie B: _________

“FORWARD LOOKING” QUESTIONS (5 points total)

GRADED FOR EFFORT

Question 3:

Run a regression comparing the average rating of Adventure movies to the average rating of Comedy movies. Use the following code.

reg3<-lm(Rating~as.factor(Genre), data=imdb[imdb$Genre=="Adventure"|imdb$Genre=="Comedy",])

Note: R cannot read certain symbols, such as the curly quotation marks that word processors insert, and these often sneak in when you copy/paste. So type in the code above rather than copy/pasting it, and use straight quotation marks. You should also type it all as one line of code (not split up into two lines).

In the code, [imdb$Genre=="Adventure"|imdb$Genre=="Comedy",] means that you want to choose only certain rows of data. Specifically, you want to choose only the rows where Genre equals “Adventure” OR rows where Genre equals “Comedy.” Putting the “|” in the code means that you are choosing one OR the other. The comma before the closing bracket means that you keep all of the columns.

If you want to use string (i.e., text) variables in your regressions as “factor” or “group” variables, you need to use the as.factor function, which converts the text to something that is usable by R. This is why you have as.factor(Genre) as an X variable in this regression instead of just Genre.

summary(reg3)
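To see what the row selection and as.factor do on their own, here is a minimal sketch on made-up data. The names toy, keep, and two_genres, and all of the values, are invented for illustration.

```r
# Made-up data, for illustration only.
toy <- data.frame(
  Genre  = c("Adventure", "Comedy", "Horror", "Comedy", "Adventure", "Drama"),
  Rating = c(7.0, 6.0, 5.5, 6.4, 7.4, 8.0),
  stringsAsFactors = FALSE
)

# TRUE for rows that are Adventure OR Comedy, FALSE for everything else.
keep <- toy$Genre == "Adventure" | toy$Genre == "Comedy"

# Rows where keep is TRUE; the empty slot after the comma keeps every column.
two_genres <- toy[keep, ]

# as.factor turns the text column into a grouping variable with named levels.
levels(as.factor(two_genres$Genre))

fit <- lm(Rating ~ as.factor(Genre), data = two_genres)

# The first level (alphabetically) has no coefficient of its own.
names(coef(fit))
```

Notice that only one of the two genres appears in the coefficient names. Keep that in mind as you work through the questions below.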

What is the beta (β) coefficient associated with Comedy? (0.5 points)

On average, do you think that comedies are rated higher or lower than Adventure movies? (0.5 points)

Create a table showing the average Ratings for each Genre. Include all genres. Copy/paste the output. (1 point)

Hint: Use the summaryBy function from Assignment 1, Question 1. Also remember to run library(doBy) before you try this question.

Look at the table that you created for part (c). Compare it to your regression output from part (a). What do you notice about the relationship between the average ratings for Adventure and Comedy movies and the coefficients from your regression output? (1 point)

Based on your answer from part (d), how should you interpret the beta (β) coefficient associated with Comedy? (1 point)

Run a t-test comparing the average ratings for Adventure and Comedy movies. Use the following code:

t.test(Rating~Genre, data=imdb[imdb$Genre=="Adventure"|imdb$Genre=="Comedy",])

As with the code used for regressions, Rating~Genre means that you are using “Rating” as a dependent variable and “Genre” as an “independent” or “grouping” variable. That is, you are testing if there are differences in Rating depending on the Genre.

Because you specified that you only want to look at the Genres “Adventure” OR “Comedy,” the t-test is only comparing those two groups. If, for example, you wanted to look at Adventure and Horror, you would write data=imdb[imdb$Genre=="Adventure"|imdb$Genre=="Horror",]
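The Adventure and Horror variation described above can be sketched on made-up data as follows. The names toy, pair, and tt, and all of the values, are invented for illustration.

```r
# Made-up data, for illustration only.
toy <- data.frame(
  Genre  = c("Adventure", "Adventure", "Adventure",
             "Horror", "Horror", "Horror",
             "Comedy", "Comedy"),
  Rating = c(7.0, 7.4, 6.8, 5.5, 6.1, 5.8, 6.0, 6.4),
  stringsAsFactors = FALSE
)

# Keep only the two genres being compared; t.test needs exactly two groups.
pair <- toy[toy$Genre == "Adventure" | toy$Genre == "Horror", ]

# By default, t.test does not assume the two groups have equal variances.
tt <- t.test(Rating ~ Genre, data = pair)

tt$estimate   # the average Rating in each group
tt$p.value    # the p-value for the difference between the two averages
```

Printing tt by itself shows the full output, which is what you would copy/paste for an assignment.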

Copy/paste the output from your t-test. (0.5 points)

Look at the p-value from the t-test and compare it to the p-value for the “Genre” coefficient from the regression you ran for part (a). What do you notice? (0.5 points)