COVID-19 has resulted in many long-term effects on societies all across the world, especially in areas with high population density such as the State of California. The motivation of this project is to investigate the trends of COVID-19 new cases development in the past in relation with factors like vaccine administration and demographic factors. This project aims to eventually come up with conclusions that can identify vulnerable groups and areas to COVID-19 and offer suggestions for the protection of people against the disease. This project is particularly interested in topics such as the relationship between county population size and COVID-19 prevalence, and the COVID-19 prevalence between different age groups.

Begin the data processing procedure by loading all of the interested
data frames from the internet, including Age.csv,
Demo.csv,12-plus.csv,16-plus.csv,
and Post_stat.csv.
age_df = read_csv("data/age.csv")
demo_df = read_csv("data/demo.csv")
stat_12 = read_csv("data/12_plus.csv")
stat_16 = read_csv("data/16_plus.csv")
post_stat = read_csv("data/post_stat.csv")
Age.csv originally has 13267 rows and 8 variables. It
includes COVID-19 data such as total case, percent case, deaths in
respect to three different demographic groups including age group, race
ethnicity, and gender. Thus, analysts decide to separate it into three
distinct data frames, each describing on one demographic group.
Subsequently, arranged the sequence of the columns by moving the ‘date’
variable in front of other variables to better match the sequence of
other data frames in this study.
## datasets of age_df
age =
age_df %>%
janitor::clean_names() %>%
filter(demographic_category == "Age Group") %>%
select(report_date,
demographic_value,
total_cases,
deaths,
percent_deaths,
percent_of_ca_population) %>%
drop_na() %>%
rename("age_group" = demographic_value,
"date" = report_date) %>%
filter(!age_group %in% c("missing", "Missing","Total"))
race =
age_df %>%
janitor::clean_names() %>%
filter(demographic_category == "Race Ethnicity") %>%
select(report_date,
demographic_value,
total_cases,
deaths,
percent_deaths,
percent_of_ca_population) %>%
drop_na() %>%
rename("race_group" = demographic_value,
"date" = report_date) %>%
filter(!race_group %in% c("Total", "Other"))
gender =
age_df %>%
janitor::clean_names() %>%
filter(demographic_category == "Gender") %>%
select(report_date,
demographic_value,
total_cases,
deaths,
percent_deaths,
percent_of_ca_population) %>%
drop_na() %>%
rename("gender" = demographic_value,
"date" = report_date) %>%
filter(!gender %in% c("Total", "Unknown"))
Demo.csv originally has 62647 rows and 17 variables.It
contains COVID-19 data belonging to different demographic areas, mainly
the different counties in the state of California, but it also includes
COVID-19 data for the whole states. For our purposes, analysts decide to
focus our analysis on comparing the COVID-19 data in different counties.
Thus analysts decide to drop any rows with the area_type of
state. Then, delete the column area_type and rename the
column area to county_name. Analysts also
decide to focus the analysis on the cumulative COVID-19 data instead of
the changes per day. Then, delete some columns including
total_test, positive_test,
reported_cases, reported_deaths, and
reported_tests. Lastly, drop all of the rows containing N/A
values.The resulted data frame includes data such as the population, the
cumulative cases, tests, and reported deaths of the county.
## dataset of demo
demo_df = read_csv("data/demo.csv")
demo =
demo_df %>%
filter(area_type == "County") %>%
select(-c(area_type, total_tests, reported_tests)) %>%
rename("county_name" = area) %>%
filter(!county_name %in% c("Unknown"))
12-plus.csv, 16-plus.csv,
Post_stat.csv are legacy tables about COVID-19 status based
on previous separation between standard vaccinated and boosted groups,
with restriction on the age group above 16, above 12, and no restriction
respectively. 12-plus.csv originally has 532 rows and 24
variables. 16-plus.csv originally has 286 rows and 17
variables. Post_stat.csv originally has 607 rows and 17
variables.Since the AREA and AREA_TYP columns
contain the exact information throughout all of these data frames, it is
not valuable for our purpose and thus we remove them. We also removed
six data rows before the date 02/07/2021 due to missing data in the
12-plus.csv and Post_stat.csv.
## datasets of post_stat
stat_12 = stat_12[-(1:6),]
stat_12 =
stat_12 %>%
arrange(date, decreasing = F) %>%
select(-c(area_type, area))
stat_16 = stat_16[-(1:6),]
stat_16 =
stat_16 %>%
janitor::clean_names() %>%
arrange(date, decreasing = F) %>%
select(-c(area_type, area))
post_stat = post_stat[-(1:6),]
post_stat =
post_stat %>%
janitor::clean_names() %>%
arrange(date, decreasing = F) %>%
select(-c(area_type, area))
After retrieving the CA_Labor and CA_Land_Area data set from the data
sources, first remove the variable columns of employment,
unemployment, and rank from the CA_Labor data
set due to the interest for this investigation. Then rename the column
unemployment_rate_per_cent as unemployment for
future data processing. The resulted data set has 58 rows and 3 variable
columns.
For CA_Land_Area data set, separate the counties into costal counties
and inland counties first using mutate() function, then
remove the variable column of rank,state, and
population. The resulted data set has 58 observations and 3
variable columns.
In the end, using left_join() function to combine the
two processed data set into one final data set by county names, the
final data set is named as ca_nonmedical_data and contains
58 rows and 5 variable columns, including name of the county, county’s
area in square miles, location type of the county, unemployment and
labor force.
labor_data = read_csv("./data/CA_Labor.csv") %>%
janitor::clean_names() %>%
dplyr::select(-employment,-unemployment,-rank) %>%
rename(unemployment = unemployment_rate_per_cent)
land_data = read_csv("./data/CA_Land_Area.csv") %>%
janitor::clean_names() %>%
mutate(
location = ifelse(county %in% c("Los Angeles", "Orange", "San Diego", "Monterey", "San Benito", "San Luis Obispo", "Santa Barbara", "Santa Cruz", "Ventura", "Alameda", "Contra Costa", "Marin", "Napa", "San Francisco", "San Mateo", "Santa Clara", "Solano", "Sonoma", "Del Norte", "Humboldt", "Mendocino"), "costal", "inland")) %>%
dplyr::select(-rank, -state, -population)
ca_nonmedical_data = left_join(land_data, labor_data, by = c("county"))
For the purpose of model building, a data set focused on reflecting
covid information such as death rate, total test rate, and population
with respect to each county is needed. To do this, use the function
mutate() and some calculation to produce the death rate and
total test rate column for the new data set. Then select the variables
of interest including county_name, death_rate,
test, population and group them by county. The
resulted data set is named as demon_covid.
demo_covid = demo %>%
mutate(cumulative_reported_deaths = as.numeric(cumulative_reported_deaths),
cumulative_cases = as.numeric(cumulative_cases),
death_rate = 0.01 + cumulative_reported_deaths/population*100,
test = cumulative_total_tests/population*100) %>%
dplyr::select(county_name, death_rate, test,population) %>%
rename(county=county_name) %>%
group_by(county) %>%
summarize(death_rate=max(death_rate,na.rm = T),
test = max(test),
population = max(population)) %>%
filter(!county %in% c("Out of state","Unknown"))
To obtain vaccine status of each county, read in the
CA_covid19vaccines.csv file and group the data by county, then obtain
the total number of fully vaccinated people by using the
sum() function. Drop all the N/A values and any county
besides the 58 counties of interest. The resulted data set includes 58
rows and 2 column of variables.
vaccine = read_csv("./data/CA_covid19vaccines.csv") %>%
janitor::clean_names() %>%
group_by(county) %>%
summarise(
fully_vaccinated = sum(fully_vaccinated)
# total number of fully vaccinated people
) %>%
arrange(county) %>%
drop_na() %>%
filter(!county %in% c("Unknown", "All CA Counties", "All CA and Non-CA Counties","Outside California"))
This investigation is interested in knowing the hospitalization
ability of each county in California. After reading in the
CA_Covid19_Hospital.csv data file, group the data by county and find the
maximum amount of hospital bed and available ICU beds recorded for each
county. Then select county, all_hospital_beds,
icu_available_beds to form the processed data set, which
includes 56 rows and 3 column variables.
hospital = read_csv("./data/CA_Covid19_Hospital.csv") %>%
group_by(county) %>%
summarize(all_hospital_beds = max(all_hospital_beds,na.rm = T),
icu_available_beds = max(icu_available_beds,na.rm = T)) %>%
dplyr::select(county, all_hospital_beds, icu_available_beds)
After the data pre-processing activities, we are ready to form a
final pre-model data frame. Combine the demo_covid,
vaccine, hospital, and
ca_nonmedical_data together by using the function
leftjoin() while binding the data sets by the variable
county. Then relocate the variable death rate
to the front and do some math adjustment for other data variables
including vaccinated, labor_rate,
hospital_bed, icu_bed for the purpose of
better visual representation afterwards. Remove some columns of
variables as needed. The final pre-modal dataframe includes 58 rows and
10 columns of variables.
ca_medical_data = left_join(demo_covid, vaccine, by =c("county"))
ca_medical_data2 = left_join(ca_medical_data, hospital, by =c("county"))
ca_premodel_data = left_join(ca_medical_data2, ca_nonmedical_data, by = c("county")) %>%
relocate(death_rate) %>%
mutate(
vaccinated=fully_vaccinated/population*100,
labor_rate = labor_force/population*100,
hospital_bed = all_hospital_beds/population*100,
icu_bed = icu_available_beds/population*100
)%>%
dplyr::select(-fully_vaccinated, -population, -labor_force,-all_hospital_beds,-icu_available_beds)
skimr::skim(ca_premodel_data)
| Name | ca_premodel_data |
| Number of rows | 58 |
| Number of columns | 10 |
| _______________________ | |
| Column type frequency: | |
| character | 2 |
| numeric | 8 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| county | 0 | 1 | 4 | 15 | 0 | 58 | 0 |
| location | 0 | 1 | 6 | 6 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| death_rate | 0 | 1.00 | 0.21 | 0.09 | 0.01 | 0.13 | 0.19 | 0.28 | 0.51 | ▂▇▅▃▁ |
| test | 0 | 1.00 | 367.42 | 171.84 | 104.51 | 277.33 | 312.06 | 402.06 | 983.95 | ▅▇▂▁▁ |
| land_area_sq_mi | 0 | 1.00 | 2685.85 | 3102.32 | 46.87 | 959.49 | 1535.34 | 3454.40 | 20056.92 | ▇▂▁▁▁ |
| unemployment | 0 | 1.00 | 7.33 | 2.12 | 4.42 | 5.97 | 6.96 | 8.22 | 17.37 | ▇▆▁▁▁ |
| vaccinated | 0 | 1.00 | 63.20 | 13.47 | 29.78 | 54.34 | 61.29 | 72.35 | 94.09 | ▁▇▇▇▃ |
| labor_rate | 0 | 1.00 | 44.16 | 5.95 | 30.47 | 40.29 | 44.59 | 47.75 | 61.19 | ▂▅▇▁▁ |
| hospital_bed | 2 | 0.97 | 0.32 | 0.39 | 0.08 | 0.18 | 0.22 | 0.29 | 2.79 | ▇▁▁▁▁ |
| icu_bed | 2 | 0.97 | 0.03 | 0.05 | 0.00 | 0.01 | 0.01 | 0.03 | 0.35 | ▇▁▁▁▁ |