Analysis of Variance (ANOVA) is a statistical test used to determine whether there is a statistically significant difference between the means of categorical groups. In general, we expect that one or more groups influence the dependent variable, while another group is treated as the control group.
Before conducting the test, we need to check its assumptions:
\(\bullet\) The subjects are independent (no relationship between them).
\(\bullet\) The groups have equal sample sizes.
\(\bullet\) The dependent variable is normally distributed.
\(\bullet\) The population variances are equal (homogeneity of variance).
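The sample-size, normality and equal-variance assumptions can be checked with base R functions. The following is a minimal sketch on simulated data, not the checks run for our models: the data frame `toy`, its columns `group` and `y`, and the seed are all made up for illustration.

```r
# Simulated data: three groups of 20 observations (hypothetical, for illustration only)
set.seed(1)
toy <- data.frame(
  group = factor(rep(c("a", "b", "c"), each = 20)),
  y     = rnorm(60, mean = rep(c(0, 0.5, 1), each = 20), sd = 1)
)

fit <- lm(y ~ group, data = toy)

# Equal sample sizes: count the observations in each group
group_sizes <- table(toy$group)

# Normality: Shapiro-Wilk test on the residuals (H0: residuals are normal)
normality <- shapiro.test(residuals(fit))

# Homogeneity of variance: Bartlett's test (H0: group variances are equal)
homogeneity <- bartlett.test(y ~ group, data = toy)

print(group_sizes)
print(normality$p.value)
print(homogeneity$p.value)
```

A large p-value in either test means the data give no evidence against that assumption. Independence of the subjects has to be argued from the study design rather than tested this way.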
These assumptions were checked, and our model for prevalence meets these requirements. We then apply the ANOVA test to our model and use the F-test to determine whether our model fits the data better than the intercept-only model. If the F-statistic is sufficiently large, or equivalently the p-value is sufficiently small, we reject the null hypothesis and conclude that our model fits the data better than the intercept-only model, so at least one coefficient is not equal to zero. If we fail to reject the null hypothesis, we do not have enough evidence that any coefficient differs from zero, and our model does not fit significantly better than the intercept-only model.
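The comparison against the intercept-only model can be sketched in R as follows. This is an illustration on simulated data, not our analysis: `x1`, `x2`, `y`, and the seed are assumptions made up for the example.

```r
# Simulated data (hypothetical): y depends on x1 but not on x2
set.seed(42)
n  <- 60
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 + rnorm(n)

null_model <- lm(y ~ 1)        # intercept-only model
full_model <- lm(y ~ x1 + x2)  # model with predictors

# F-test comparing the two nested models: does the full model fit significantly better?
comparison <- anova(null_model, full_model)
print(comparison)

# The same overall F-statistic is stored in summary()
f_overall <- summary(full_model)$fstatistic
print(f_overall)
```

The second row of `comparison` holds the F-statistic and its p-value. A small p-value there is the evidence that at least one slope coefficient is not zero.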
ANOVA test for the prevalence model: \[Prevalence = location\beta_{location}+ln(test)\beta_{ln(test)}+ln(unemployment)\beta_{ln(unemployment)}+ln(area)\beta_{ln(area)}+ln(population)\beta_{ln(population)}+ \epsilon_i\]
Then, we conduct a hypothesis test.
\[H_0: \beta_{location}=\beta_{ln(test)}=\beta_{ln(unemployment)}=\beta_{ln(area)}=\beta_{ln(population)}=0\] \[H_1: At\ least\ one\ \beta\: is\ not\ zero \]
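For reference, the overall F-statistic used to test this null hypothesis can be written as follows. The notation is introduced here and does not come from the analysis itself: \(RSS_0\) is the residual sum of squares of the intercept-only model, \(RSS_1\) that of our model, \(p\) the number of slope coefficients, and \(n\) the number of observations.

\[F = \frac{(RSS_0 - RSS_1)/p}{RSS_1/(n-p-1)} \sim F_{p,\ n-p-1} \quad under\ H_0\]

Here \(p = 5\). A large value of \(F\) corresponds to a small p-value and leads us to reject \(H_0\).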
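For the prevalence model, we decided to use weighted least squares (WLS) regression, because the weighted fit is better than the unweighted one. Since our error variance is unknown, we need to estimate the standard deviation function and the weights. We first fit the unweighted model, regress the absolute residuals on the fitted values to estimate the standard deviation of each observation, and take the reciprocal of the squared estimate as the weight. In a weighted fit, less weight is given to the less precise measurements and more weight to the more precise ones.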
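In symbols, a weighted fit chooses the coefficients that minimize a weighted residual sum of squares. The notation below is ours and only sketches the idea: \(y_i\) is the observed prevalence, \(\mathbf{x}_i\) the predictor vector, \(w_i\) the weight, and \(\hat{\sigma}_i\) the estimated standard deviation of observation \(i\).

\[\hat{\beta}_{WLS} = \arg\min_{\beta} \sum_{i=1}^{n} w_i\,(y_i - \mathbf{x}_i^{\top}\beta)^2, \qquad w_i = \frac{1}{\hat{\sigma}_i^{2}}\]

Setting every \(w_i = 1\) recovers ordinary (unweighted) least squares.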
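library(magrittr)  # provides the %>% pipe
# Step 1: unweighted fit, used only to estimate the error structure
mlr_model <- lm(prevalence ~ location + ln_test + ln_unemployment + ln_area + ln_population, data = ca_model_data)
# Step 2: estimate the standard deviation function by regressing |residuals| on the fitted values
sd_function <- lm(abs(mlr_model$residuals) ~ mlr_model$fitted.values)
# Step 3: the squared fitted SDs estimate the variances; the weights are their reciprocals
var_fitted <- sd_function$fitted.values^2
wt <- 1 / var_fitted
# Step 4: refit by weighted least squares and print the ANOVA table
wls_prevalence <- lm(prevalence ~ location + ln_test + ln_unemployment + ln_area + ln_population, data = ca_model_data, weights = wt)
aov(wls_prevalence) %>% summary()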
##                 Df Sum Sq Mean Sq F value   Pr(>F)
## location         1   6.15    6.15    3.45 0.068913 .
## ln_test          1 191.51  191.51  107.45 2.96e-14 ***
## ln_unemployment  1  47.63   47.63   26.72 3.80e-06 ***
## ln_area          1  24.60   24.60   13.80 0.000497 ***
## ln_population    1  19.91   19.91   11.17 0.001544 **
## Residuals       52  92.68    1.78
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the table above, we can see that the variables \(ln(test)\), \(ln(unemployment)\), \(ln(area)\) and \(ln(population)\) are statistically significant: their p-values are smaller than 0.05. Only the variable \(location\) has a p-value greater than 0.05, but at 0.068913 it is still small (significant at the 0.1 level), so we decide to keep it in our model.
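The table tests each term in sequence, while the null hypothesis concerns all coefficients at once. As a rough check, the overall F-statistic can be rebuilt from the printed sums of squares. This sketch uses the rounded values from the table, so the result is approximate, and the variable names are ours.

```r
# Sums of squares copied from the ANOVA table above (rounded as printed)
ss_terms    <- c(location = 6.15, ln_test = 191.51, ln_unemployment = 47.63,
                 ln_area = 24.60, ln_population = 19.91)
ss_residual <- 92.68
df_model    <- length(ss_terms)  # 5 terms, one degree of freedom each
df_residual <- 52

# Overall F = (model sum of squares / model df) / (residual sum of squares / residual df)
f_overall <- (sum(ss_terms) / df_model) / (ss_residual / df_residual)

# The upper-tail probability of the F distribution gives the p-value
p_overall <- pf(f_overall, df_model, df_residual, lower.tail = FALSE)

print(round(f_overall, 2))
print(p_overall)
```

With these rounded inputs the statistic comes out near 32.5 on 5 and 52 degrees of freedom, far in the upper tail, which is consistent with rejecting \(H_0\).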
For our fatality model, the test assumptions were also checked. ANOVA test for the fatality model: \[death\ rate = location\beta_{location}+vaccinated\beta_{vaccinated}+labor\ rate\beta_{labor\ rate}+ln(tests)\beta_{ln(tests)}+ln(area)\beta_{ln(area)}+ln(unemployment)\beta_{ln(unemployment)}+ln(bed)\beta_{ln(bed)}+\epsilon_i\]
Then, we conduct a hypothesis test for fatality.
\[H_0: \beta_{location}=\beta_{vaccinated}=\beta_{labor\ rate}=\beta_{ln(tests)}=\beta_{ln(area)}=\beta_{ln(unemployment)}=\beta_{ln(bed)}=0\] \[H_1: At\ least\ one\ \beta\: is\ not\ zero \]
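# Unweighted multiple linear regression for the fatality model
mlr_model <- lm(death_rate ~ location + vaccinated + labor_rate + ln_tests + ln_area + ln_unemployment + ln_bed, data = ca_model_data_death)
# Sequential ANOVA table, returned as a tibble by broom::tidy()
anova(mlr_model) %>% broom::tidy()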
## # A tibble: 8 × 6
##   term               df   sumsq  meansq statistic    p.value
##   <chr>           <int>   <dbl>   <dbl>     <dbl>      <dbl>
## 1 location            1 0.0915  0.0915     20.7    0.0000365
## 2 vaccinated          1 0.00149 0.00149     0.337  0.564
## 3 labor_rate          1 0.0535  0.0535     12.1    0.00108
## 4 ln_tests            1 0.0180  0.0180      4.07   0.0493
## 5 ln_area             1 0.0478  0.0478     10.8    0.00189
## 6 ln_unemployment     1 0.0168  0.0168      3.80   0.0571
## 7 ln_bed              1 0.0272  0.0272      6.14   0.0167
## 8 Residuals          48 0.212   0.00442    NA     NA
From the table above, we can see that the variables \(location\), \(labor\ rate\), \(ln(tests)\), \(ln(area)\) and \(ln(bed)\) are statistically significant, with p-values smaller than 0.05. The p-value of \(ln(unemployment)\) is 0.0571, slightly above 0.05, which we still consider small. The variable \(vaccinated\) has a much larger p-value (0.564). We attribute this to collinearity with the other predictors. Since \(vaccinated\) is closely related to the \(death\ rate\), we decide to keep it in our model.
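`anova()` applied to a single `lm` fit reports sequential tests, so each p-value depends on the terms entered before it, and collinear predictors can change a term's significance when the order changes. The sketch below shows this on simulated data. The variables `x1`, `x2` and `y` are hypothetical and unrelated to our data set.

```r
# Simulated, strongly correlated predictors (hypothetical, for illustration only)
set.seed(7)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)  # x2 is nearly a copy of x1
y  <- 1 + x1 + rnorm(n)        # y truly depends on x1 only

# Same model, two different term orders
fit_x2_first <- lm(y ~ x2 + x1)
fit_x1_first <- lm(y ~ x1 + x2)

# Sequential (Type I) p-value for x2 in each ordering
p_x2_first <- anova(fit_x2_first)["x2", "Pr(>F)"]
p_x2_last  <- anova(fit_x1_first)["x2", "Pr(>F)"]

print(cor(x1, x2))
print(c(entered_first = p_x2_first, entered_last = p_x2_last))
```

Entered first, `x2` absorbs the effect it shares with `x1`. Entered last, it is tested only on what `x1` leaves unexplained. This is one reason a term's p-value in a sequential table should be read with its position in the formula in mind.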