Analysis of Variance (ANOVA) is a statistical test used to determine whether there is a statistically significant difference between the means of categorical groups. In general, we expect that one or more groups may influence the dependent variable, while the remaining group is treated as the control group.

Before conducting the test, we need to check the test assumptions: \(\bullet\) the subjects are independent (no relationship between them); \(\bullet\) the different groups have equal sample sizes; \(\bullet\) the dependent variable is normally distributed; \(\bullet\) the population variances are equal (homogeneity of variance).
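The normality and equal-variance assumptions can be checked directly in base R. The sketch below uses simulated data with hypothetical variables `y` and `group`; in practice the same tests would be run on the model's own residuals and grouping variable.

```r
# Illustrative assumption checks with simulated data (hypothetical `y`, `group`)
set.seed(1)
group <- factor(rep(c("A", "B", "C"), each = 20))
y <- rnorm(60, mean = rep(c(1, 2, 3), each = 20))

fit <- lm(y ~ group)

# Normality of residuals: Shapiro-Wilk test
shapiro.test(residuals(fit))

# Homogeneity of variance across groups: Bartlett's test
bartlett.test(y ~ group)
```

Large p-values in both tests would indicate no evidence against the normality and equal-variance assumptions.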

ANOVA Test on MLR Prevalence

These assumptions were checked, and our model for prevalence meets these requirements. We then conduct an ANOVA test on our model, using the F-test to determine whether the variables fit the data better than the intercept-only model. If the F-statistic is sufficiently large, or the p-value is sufficiently small, we can conclude that our model fits the data better than the intercept-only model and that at least one coefficient is not equal to zero. But if we fail to reject the null hypothesis, it means that all the coefficients are simultaneously zero, and the intercept-only model is preferable to our model.

ANOVA test for model \[Prevalence = location\beta_{location}+ln(test)\beta_{ln(test)}+ln(unemployment)\beta_{ln(unemployment)}+ln(area)\beta_{ln(area)}+ln(population)\beta_{ln(population)}+ \epsilon_i\]

Then, we conduct a hypothesis test.

\[H_0: \beta_{location}=\beta_{ln(test)}=\beta_{ln(unemployment)}=\beta_{ln(area)}=\beta_{ln(population)}=0\] \[H_1: At\ least\ one\ \beta\ is\ not\ zero \]
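This overall F-test amounts to comparing the full model against the intercept-only model. The sketch below illustrates the comparison with simulated data standing in for `ca_model_data`; the variable names are hypothetical.

```r
# Hypothetical sketch: overall F-test of a full model vs. the intercept-only model
set.seed(42)
n <- 60
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 2 * x1 + rnorm(n)   # x1 carries real signal, x2 does not

full <- lm(y ~ x1 + x2)
null <- lm(y ~ 1)            # intercept-only model under H0

# F-test of H0: all slope coefficients are zero
anova(null, full)

# The same overall F-statistic is reported by summary(full)
summary(full)$fstatistic
```

A small p-value in `anova(null, full)` rejects H0, i.e. at least one slope is nonzero.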

For the prevalence model, we decided to use weighted least squares regression, since fitting the model by weighted least squares is better here than an unweighted fit. Because our error variance is unknown, we need to estimate the standard deviation function and the weights. In a weighted fit, less weight is given to the less precise measurements and more weight to the more precise ones.

mlr_model = lm(prevalence~location+ln_test+ln_unemployment+ln_area+ln_population,data = ca_model_data)
sd_function <- lm(abs(mlr_model$residuals) ~ mlr_model$fitted.values)

var_fitted <- sd_function$fitted.values^2

wt <- 1/var_fitted

wls_prevalence <- lm(prevalence~location+ln_test+ln_unemployment+ln_area+ln_population,data = ca_model_data, weights = wt)
aov(wls_prevalence) %>% summary()
##                 Df Sum Sq Mean Sq F value   Pr(>F)    
## location         1   6.15    6.15    3.45 0.068913 .  
## ln_test          1 191.51  191.51  107.45 2.96e-14 ***
## ln_unemployment  1  47.63   47.63   26.72 3.80e-06 ***
## ln_area          1  24.60   24.60   13.80 0.000497 ***
## ln_population    1  19.91   19.91   11.17 0.001544 ** 
## Residuals       52  92.68    1.78                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the table above, we can see that the variables \(ln(test)\), \(ln(unemployment)\), \(ln(area)\), and \(ln(population)\) are statistically significant, meaning their p-values are smaller than 0.05. Only the variable \(location\) has a p-value greater than 0.05, but its p-value of 0.068913 can still be considered small, so we decide to keep it in our model.

ANOVA Test on MLR Fatality

For our fatality model, the test assumptions are checked in the same way. ANOVA test for model \[death\ rate = location\beta_{location}+vaccinated\beta_{vaccinated}+labor\ rate\beta_{labor\ rate}+ln(tests)\beta_{ln(tests)}+ln(area)\beta_{ln(area)}+ln(unemployment)\beta_{ln(unemployment)}+ln(bed)\beta_{ln(bed)}+\epsilon_i\]

Then, we conduct a hypothesis test for fatality.

\[H_0: \beta_{location}=\beta_{vaccinated}=\beta_{labor\ rate}=\beta_{ln(tests)}=\beta_{ln(area)}=\beta_{ln(unemployment)}=\beta_{ln(bed)}=0\] \[H_1: At\ least\ one\ \beta\ is\ not\ zero \]

mlr_model = lm(death_rate~location+vaccinated+labor_rate+ln_tests+ln_area+ln_unemployment+ln_bed,data = ca_model_data_death)
anova(mlr_model) %>% broom::tidy()
## # A tibble: 8 × 6
##   term               df   sumsq  meansq statistic    p.value
##   <chr>           <int>   <dbl>   <dbl>     <dbl>      <dbl>
## 1 location            1 0.0915  0.0915     20.7    0.0000365
## 2 vaccinated          1 0.00149 0.00149     0.337  0.564    
## 3 labor_rate          1 0.0535  0.0535     12.1    0.00108  
## 4 ln_tests            1 0.0180  0.0180      4.07   0.0493   
## 5 ln_area             1 0.0478  0.0478     10.8    0.00189  
## 6 ln_unemployment     1 0.0168  0.0168      3.80   0.0571   
## 7 ln_bed              1 0.0272  0.0272      6.14   0.0167   
## 8 Residuals          48 0.212   0.00442    NA     NA

From the table above, we can see that the variables \(location\), \(labor\ rate\), \(ln(tests)\), \(ln(area)\), and \(ln(bed)\) are statistically significant, meaning their p-values are smaller than 0.05. The variable \(ln(unemployment)\) is marginal, with a p-value of 0.0571 just above 0.05, and only the variable \(vaccinated\) has a p-value well above 0.05. We attribute this to the effect of collinearity rather than irrelevance: since \(vaccinated\) is closely related to the \(death\ rate\), we decide to keep the variable \(vaccinated\) in our model.
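One way to gauge collinearity is the variance inflation factor (VIF), which can be computed by hand in base R by regressing each predictor on the others. The sketch below uses simulated predictors as stand-ins for the actual `ca_model_data_death` columns; the names `x1`, `x2`, `x3` are hypothetical.

```r
# Hedged sketch: computing VIFs by hand to gauge collinearity (simulated data)
set.seed(7)
n <- 60
x1 <- rnorm(n)
x2 <- 0.9 * x1 + rnorm(n, sd = 0.3)   # deliberately correlated with x1
x3 <- rnorm(n)

X <- data.frame(x1, x2, x3)

# VIF for predictor v: 1 / (1 - R^2) from regressing v on the other predictors
vif <- sapply(names(X), function(v) {
  r2 <- summary(lm(reformulate(setdiff(names(X), v), response = v), data = X))$r.squared
  1 / (1 - r2)
})
round(vif, 2)
```

VIFs well above 1 (a common rule of thumb flags values over 5 or 10) indicate that a predictor is largely explained by the others, which inflates its standard error and can mask its individual significance.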