After exploring our outcomes geographically by county, as well as the
results on key demographic figures(age, gender, race), we proceeded with
next step to include all potential predictors and then determine which
predictors among our large data set, were significantly associated with
our outcome variables, prevalence and
death rate, at 5% significance level.
ca_model_comp_df %>%
select(-county) %>%
GGally::ggpairs(upper = list(continuous = GGally::wrap("cor", size = 2))) +
theme(legend.title = element_text(size = 5),
legend.key.size = unit(0.3, 'cm'),
legend.text = element_text(size = 4)) +
theme(
axis.title.x = element_text(size = 5),
axis.text.x = element_text(size = 5),
axis.title.y = element_text(size = 5),
axis.text.y = element_text(size = 5)) +
scale_y_continuous(labels = point)

Associations between Predictors and Outcomes After exploring our
outcomes geographically by county, as well as the results on key
demographic figures(age, gender, race), we proceeded with next step to
include all potential predictors and then determine which predictors
among our large data set, were significantly associated with our outcome
variables, prevalence and death rate, at 5%
significance level.
This correlated matrix plot clearly illustrates the relationship between each variables and our possible outcomes. Besides, there is no abnormal multicollinearity indicating our data is reasonable for further regressions. Since the LS estimate and big likelihood function may be distorted by the variable with abnormally high correlation and lose the predictive ability and accuracy.
In the correlation matrix above, we find out that the correlates of worse outcomes: * Variables highly correlated with highest prevalence: unemployment % , test participation % * Variables highly correlated with highest death rates: unemployment %, land area
The following graphs explore the associations between the selected highly-correlated predictors and each outcome variables across county.
pre_best_1 =
ca_model_comp_df %>%
mutate(county = factor(county)) %>%
ggplot(aes(x = test, y = prevalence, color = county)) +
geom_point() +
geom_smooth(formula = y ~ x, method = "lm", color = "red", se = F) +
theme(
axis.title.x = element_text(size = 5),
axis.text.x = element_text(size = 5),
axis.title.y = element_text(size = 5),
axis.text.y = element_text(size = 5)) +
theme(legend.position = "none") +
labs(title = "Prevalence vs Test Participation")
pre_best_2 =
ca_model_comp_df %>%
mutate(county = factor(county)) %>%
ggplot(aes(x = unemployment, y = prevalence, color = county)) +
geom_point() +
geom_smooth(formula = y ~ x, method = "lm", color = "red", se = F) +
theme(
axis.title.x = element_text(size = 5),
axis.text.x = element_text(size = 5),
axis.title.y = element_text(size = 5),
axis.text.y = element_text(size = 5)) +
theme(legend.position = "none") +
labs(title = "Prevalence vs Test Unemployment")
pre_best_1 + pre_best_2

pre_best_3 =
ca_model_comp_df %>%
mutate(county = factor(county)) %>%
ggplot(aes(x = unemployment, y = death_rate, color = county)) +
geom_point() +
geom_smooth(formula = y ~ x, method = "lm", color = "red", se = F) +
theme(
axis.title.x = element_text(size = 5),
axis.text.x = element_text(size = 5),
axis.title.y = element_text(size = 5),
axis.text.y = element_text(size = 5)) +
theme(legend.position = "none") +
labs(title = "Death Rate vs Unemployment")
pre_best_4 =
ca_model_comp_df %>%
mutate(county = factor(county)) %>%
ggplot(aes(x = land_area_sq_mi, y = prevalence, color = county)) +
geom_point() +
geom_smooth(formula = y ~ x, method = "lm", color = "red", se = F) +
theme(
axis.title.x = element_text(size = 5),
axis.text.x = element_text(size = 5),
axis.title.y = element_text(size = 5),
axis.text.y = element_text(size = 5)) +
theme(legend.position = "none") +
labs(title = "Death Rate vs land Area")
pre_best_3 + pre_best_4
