Burrito synergy in SoCal

Today we will look at a dataset of reviews of burritos in southern California. This is a data set collected by a group of college friends who live in the greater San Diego area.

burrito <- read_csv("https://statsmaths.github.io/stat_data/burrito.csv")
## Parsed with column specification:
## cols(
##   location = col_character(),
##   cost = col_double(),
##   yelp = col_double(),
##   google = col_double(),
##   chips_included = col_double(),
##   hunger = col_double(),
##   tortilla = col_double(),
##   temp = col_double(),
##   meat = col_double(),
##   fillings = col_double(),
##   meat_filling = col_double(),
##   uniformity = col_double(),
##   salsa = col_double(),
##   synergy = col_double(),
##   wrap = col_double(),
##   overall = col_double(),
##   lat = col_double(),
##   lon = col_double()
## )

The available variables in the data are:

Fit a linear regression predicting the average yelp score of locations in this dataset and look at the model using reg_table:

model <- lm_basic(yelp ~ 1, data = burrito)
reg_table(model, level = 0.95)
## 
## Call:
## lm_basic(formula = yelp ~ 1, data = burrito)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.430 -0.305  0.070  0.070  0.570 
## 
## Coefficients:
##             Estimate 2.5 % 97.5 %
## (Intercept)    3.930 3.808  4.052
## 
## Residual standard error: 0.4287 on 49 degrees of freedom

The average Yelp score of all restaurants in this area is just 3.7; looking at the confidence interval from the model does it seem likely that burrito locations have a higher Yelp rating than locations in general, or is this just noise in the data?

Answer: No, it seems like they intentionally selected restaurants with good yelp scores.

The regression model confidence interval requires that the data are sampled independently from the larger population. Why might this dataset not hold to this assumption?

Answer: They likely only selected good restaurants or those that were close to where they lived.

Now fit a linear regression that predicts the yelp score as a function of whether the burrito includes chips.

model <- lm_basic(yelp ~ chips_included, data=burrito)
reg_table(model, level = 0.95)
## 
## Call:
## lm_basic(formula = yelp ~ chips_included, data = burrito)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.41176 -0.30101  0.08824  0.08824  0.58824 
## 
## Coefficients:
##                Estimate    2.5 % 97.5 %
## (Intercept)     3.91176  3.76270  4.061
## chips_included  0.05699 -0.20652  0.320
## 
## Residual standard error: 0.4323 on 48 degrees of freedom
## Multiple R-squared:  0.003923,   Adjusted R-squared:  -0.01683 
## F-statistic: 0.1891 on 1 and 48 DF,  p-value: 0.6656

Describe what the chips coefficient in the previous model means in words:

Answer: Its the additional amount that you would expect a restaurant that includes chips with their burritos to have as a yelp score.

Is the slope coefficient statistically significant from zero (in other words, is zero in the confidence interval)?

Answer: No, it runs from -0.20652 to 0.320 (a negative to a positive value).

Describe this in terms of your answers above.

Answer: There does not seem to be a statistically significant difference in the Yelp score based on whether chips are included with the burrito.

The word synergy can be defined as:

the interaction or cooperation of two or more organizations, substances, or other agents to produce a combined effect greater than the sum of their separate effects.

This is not something I would have thought of when considering tacos, but let’s see how synergy effected the scores of both the reviewers as well as Google and Yelp users in general.

First, I’ll add a variable called high_synergy defined as whether the synergy score is greater than 3.5.

burrito$high_synergy <- (burrito$synergy > 3.5)

Fit a regression predicting the Yelp score as a function of the variable high_synergy.

model <- lm_basic(yelp ~ high_synergy, data = burrito)
reg_table(model, level = 0.95)
## 
## Call:
## lm_basic(formula = yelp ~ high_synergy, data = burrito)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.27778 -0.21224 -0.01562  0.22222  0.72222 
## 
## Coefficients:
##                   Estimate     2.5 % 97.5 %
## (Intercept)       3.777778  3.580078  3.975
## high_synergyTRUE  0.237847 -0.009277  0.485
## 
## Residual standard error: 0.4172 on 48 degrees of freedom
## Multiple R-squared:  0.07237,    Adjusted R-squared:  0.05305 
## F-statistic: 3.745 on 1 and 48 DF,  p-value: 0.05887

Is there strong evidence that Yelp scores are higher for high synergy burrito restaurants?

Answer: No, again the confidence interval extends from a negative value to a positive one.

Now fit a regression predicting the overall score as a function of the variable high_synergy.

model <- lm_basic(overall ~ high_synergy, data = burrito)
reg_table(model, level = 0.95)
## 
## Call:
## lm_basic(formula = overall ~ high_synergy, data = burrito)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1963 -0.2641  0.1359  0.3359  0.9037 
## 
## Coefficients:
##                  Estimate  2.5 % 97.5 %
## (Intercept)        2.6963 2.4583  2.934
## high_synergyTRUE   1.1678 0.8703  1.465
## 
## Residual standard error: 0.5021 on 48 degrees of freedom
## Multiple R-squared:  0.5649, Adjusted R-squared:  0.5558 
## F-statistic: 62.31 on 1 and 48 DF,  p-value: 3.19e-10

Is there evidence that high synergy burrito locations have a higher overall score than low synergy burrito locations?

Answer: Yes! The confidence interval runs from 0.87 to 1.465.

Using the model from the previous question, what does the model predict will be the overall score of a location with low synergy?

Answer: 2.69.

What does the model predict will be the overall score of a location with high synergy?

Answer: 2.69 + 1.1678 = 3.8578.

Finally, fit a regression model that predicts the overall score as a function of the variable synergy coded as a factor. That is, use overall ~ factor(synergy) in the formula. We are doing this so that each value of the synergy score (1, 2, 3, 4, and 5) are treated as a category rather than a continuous variable.

model <- lm_basic(overall ~ factor(synergy), data = burrito)
reg_table(model, level = 0.95)
## 
## Call:
## lm_basic(formula = overall ~ factor(synergy), data = burrito)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.05323 -0.25323  0.00417  0.31875  0.72222 
## 
## Coefficients:
##                  Estimate   2.5 % 97.5 %
## (Intercept)        2.5000  1.6294  3.371
## factor(synergy)2  -0.2222 -1.1399  0.695
## factor(synergy)3   0.6917 -0.2317  1.615
## factor(synergy)4   1.3532  0.4687  2.238
## factor(synergy)5   1.7000  0.4688  2.931
## 
## Residual standard error: 0.4322 on 45 degrees of freedom
## Multiple R-squared:  0.6977, Adjusted R-squared:  0.6708 
## F-statistic: 25.96 on 4 and 45 DF,  p-value: 3.408e-11

What does the model predict the overall score for a restaurant with a synergy score of 3 will be?

Answer: 2.5 + 0.6917 = 3.1917

Finally, draw a plot that corresponds to the model you just produced for the synergy score. Hint: You’ll need the geom_confint layer.

ggplot(burrito, aes(factor(synergy), overall)) +
  geom_confint()
## Warning: Removed 2 rows containing missing values (geom_pointrange).