Data Introduction and Further Processing

Introduction

COVID-19 has resulted in many long-term effects on societies all across the world, especially in areas with high population density such as the State of California. The motivation of this project is to investigate the trends of COVID-19 new cases development in the past in relation with factors like vaccine administration and demographic factors. This project aims to eventually come up with conclusions that can identify vulnerable groups and areas to COVID-19 and offer suggestions for the protection of people against the disease. This project is particularly interested in topics such as the relationship between county population size and COVID-19 prevalence, and the COVID-19 prevalence between different age groups.

Begin the data processing procedure by loading all of the interested data frames from the internet, including Age.csv, Demo.csv,12-plus.csv,16-plus.csv, and Post_stat.csv.

age_df = read_csv("data/age.csv")
demo_df = read_csv("data/demo.csv")
stat_12 = read_csv("data/12_plus.csv")
stat_16 = read_csv("data/16_plus.csv")
post_stat = read_csv("data/post_stat.csv")

Age, race and gender datasets

Age.csv originally has 13267 rows and 8 variables. It includes COVID-19 data such as total case, percent case, deaths in respect to three different demographic groups including age group, race ethnicity, and gender. Thus, analysts decide to separate it into three distinct data frames, each describing on one demographic group. Subsequently, arranged the sequence of the columns by moving the ‘date’ variable in front of other variables to better match the sequence of other data frames in this study.

## datasets of age_df
age =
    age_df %>%
    janitor::clean_names() %>%
    filter(demographic_category == "Age Group") %>%
    select(report_date,
            demographic_value,
            total_cases,
            deaths,
            percent_deaths,
            percent_of_ca_population) %>%
    drop_na() %>%
    rename("age_group" = demographic_value,
           "date" = report_date) %>% 
    filter(!age_group %in% c("missing", "Missing","Total"))

race =
    age_df %>%
    janitor::clean_names() %>%
    filter(demographic_category == "Race Ethnicity") %>%
    select(report_date,
           demographic_value,
           total_cases,
           deaths,
           percent_deaths,
           percent_of_ca_population) %>%
    drop_na() %>%
    rename("race_group" = demographic_value,
           "date" = report_date) %>% 
    filter(!race_group %in% c("Total", "Other"))

gender =
    age_df %>%
    janitor::clean_names() %>%
    filter(demographic_category == "Gender") %>%
    select(report_date,
           demographic_value,
           total_cases,
           deaths,
           percent_deaths,
           percent_of_ca_population) %>%
    drop_na() %>%
    rename("gender" = demographic_value,
           "date" = report_date) %>% 
    filter(!gender %in% c("Total", "Unknown"))

Demographic dataset

Demo.csv originally has 62647 rows and 17 variables.It contains COVID-19 data belonging to different demographic areas, mainly the different counties in the state of California, but it also includes COVID-19 data for the whole states. For our purposes, analysts decide to focus our analysis on comparing the COVID-19 data in different counties. Thus analysts decide to drop any rows with the area_type of state. Then, delete the column area_type and rename the column area to county_name. Analysts also decide to focus the analysis on the cumulative COVID-19 data instead of the changes per day. Then, delete some columns including total_test, positive_test, reported_cases, reported_deaths, and reported_tests. Lastly, drop all of the rows containing N/A values.The resulted data frame includes data such as the population, the cumulative cases, tests, and reported deaths of the county.

## dataset of demo
demo_df = read_csv("data/demo.csv")

demo =
    demo_df %>%
    filter(area_type == "County") %>%
    select(-c(area_type, total_tests, reported_tests)) %>%
    rename("county_name" = area) %>% 
    filter(!county_name  %in% c("Unknown"))

Post statistics categorized by age

12-plus.csv, 16-plus.csv, Post_stat.csv are legacy tables about COVID-19 status based on previous separation between standard vaccinated and boosted groups, with restriction on the age group above 16, above 12, and no restriction respectively. 12-plus.csv originally has 532 rows and 24 variables. 16-plus.csv originally has 286 rows and 17 variables. Post_stat.csv originally has 607 rows and 17 variables.Since the AREA and AREA_TYP columns contain the exact information throughout all of these data frames, it is not valuable for our purpose and thus we remove them. We also removed six data rows before the date 02/07/2021 due to missing data in the 12-plus.csv and Post_stat.csv.

## datasets of post_stat
stat_12 = stat_12[-(1:6),]
stat_12 =
    stat_12 %>%
    arrange(date, decreasing = F) %>%
    select(-c(area_type, area))



stat_16 = stat_16[-(1:6),]
stat_16 =
    stat_16 %>%
    janitor::clean_names() %>%
    arrange(date, decreasing = F) %>%
    select(-c(area_type, area))

post_stat = post_stat[-(1:6),]
post_stat =
    post_stat %>%
    janitor::clean_names() %>%
    arrange(date, decreasing = F) %>%
    select(-c(area_type, area))

County labor force market and unemployment rate

After retrieving the CA_Labor and CA_Land_Area data set from the data sources, first remove the variable columns of employment, unemployment, and rank from the CA_Labor data set due to the interest for this investigation. Then rename the column unemployment_rate_per_cent as unemployment for future data processing. The resulted data set has 58 rows and 3 variable columns.

For CA_Land_Area data set, separate the counties into costal counties and inland counties first using mutate() function, then remove the variable column of rank,state, and population. The resulted data set has 58 observations and 3 variable columns.

In the end, using left_join() function to combine the two processed data set into one final data set by county names, the final data set is named as ca_nonmedical_data and contains 58 rows and 5 variable columns, including name of the county, county’s area in square miles, location type of the county, unemployment and labor force.

labor_data = read_csv("./data/CA_Labor.csv") %>%
    janitor::clean_names() %>%
    dplyr::select(-employment,-unemployment,-rank) %>%
    rename(unemployment = unemployment_rate_per_cent)

land_data = read_csv("./data/CA_Land_Area.csv") %>%
    janitor::clean_names() %>%
    mutate(
        location = ifelse(county %in% c("Los Angeles", "Orange", "San Diego", "Monterey", "San Benito", "San Luis Obispo", "Santa Barbara", "Santa Cruz", "Ventura", "Alameda", "Contra Costa", "Marin", "Napa", "San Francisco", "San Mateo", "Santa Clara", "Solano", "Sonoma", "Del Norte", "Humboldt", "Mendocino"), "costal", "inland")) %>%
    dplyr::select(-rank, -state, -population)

ca_nonmedical_data = left_join(land_data, labor_data, by = c("county"))

Demographical COVID data

For the purpose of model building, a data set focused on reflecting covid information such as death rate, total test rate, and population with respect to each county is needed. To do this, use the function mutate() and some calculation to produce the death rate and total test rate column for the new data set. Then select the variables of interest including county_name, death_rate, test, population and group them by county. The resulted data set is named as demon_covid.

demo_covid = demo %>%
     mutate(cumulative_reported_deaths = as.numeric(cumulative_reported_deaths),
            cumulative_cases = as.numeric(cumulative_cases),
        death_rate = 0.01 + cumulative_reported_deaths/population*100,
        test = cumulative_total_tests/population*100) %>%
    dplyr::select(county_name, death_rate, test,population) %>%
    rename(county=county_name) %>%
    group_by(county) %>%
    summarize(death_rate=max(death_rate,na.rm = T),
              test = max(test),
              population = max(population)) %>%
    filter(!county %in% c("Out of state","Unknown"))

Vaccination

To obtain vaccine status of each county, read in the CA_covid19vaccines.csv file and group the data by county, then obtain the total number of fully vaccinated people by using the sum() function. Drop all the N/A values and any county besides the 58 counties of interest. The resulted data set includes 58 rows and 2 column of variables.

vaccine = read_csv("./data/CA_covid19vaccines.csv") %>%
    janitor::clean_names() %>%
    group_by(county) %>%
    summarise(
        fully_vaccinated = sum(fully_vaccinated)
        # total number of fully vaccinated people
            ) %>%
    arrange(county) %>%
    drop_na() %>%
    filter(!county %in% c("Unknown", "All CA Counties", "All CA and Non-CA Counties","Outside California"))

Hospitalization

This investigation is interested in knowing the hospitalization ability of each county in California. After reading in the CA_Covid19_Hospital.csv data file, group the data by county and find the maximum amount of hospital bed and available ICU beds recorded for each county. Then select county, all_hospital_beds, icu_available_beds to form the processed data set, which includes 56 rows and 3 column variables.

hospital = read_csv("./data/CA_Covid19_Hospital.csv") %>%
    group_by(county) %>%
    summarize(all_hospital_beds = max(all_hospital_beds,na.rm = T),
              icu_available_beds = max(icu_available_beds,na.rm = T)) %>%
    dplyr::select(county, all_hospital_beds, icu_available_beds)

Final premodel data frame

After the data pre-processing activities, we are ready to form a final pre-model data frame. Combine the demo_covid, vaccine, hospital, and ca_nonmedical_data together by using the function leftjoin() while binding the data sets by the variable county. Then relocate the variable death rate to the front and do some math adjustment for other data variables including vaccinated, labor_rate, hospital_bed, icu_bed for the purpose of better visual representation afterwards. Remove some columns of variables as needed. The final pre-modal dataframe includes 58 rows and 10 columns of variables.

ca_medical_data = left_join(demo_covid, vaccine, by =c("county"))
ca_medical_data2 = left_join(ca_medical_data, hospital, by =c("county"))
ca_premodel_data = left_join(ca_medical_data2, ca_nonmedical_data, by = c("county")) %>%
    relocate(death_rate) %>%
    mutate(
        vaccinated=fully_vaccinated/population*100,
        labor_rate = labor_force/population*100,
        hospital_bed = all_hospital_beds/population*100,
        icu_bed = icu_available_beds/population*100
    )%>%
    dplyr::select(-fully_vaccinated, -population, -labor_force,-all_hospital_beds,-icu_available_beds)

skimr::skim(ca_premodel_data)

Data summary
Name	ca_premodel_data
Number of rows	58
Number of columns	10
_______________________
Column type frequency:
character	2
numeric	8
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
county	0	1	4	15	0	58	0
location	0	1	6	6	0	2	0

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
death_rate	0	1.00	0.21	0.09	0.01	0.13	0.19	0.28	0.51	▂▇▅▃▁
test	0	1.00	367.42	171.84	104.51	277.33	312.06	402.06	983.95	▅▇▂▁▁
land_area_sq_mi	0	1.00	2685.85	3102.32	46.87	959.49	1535.34	3454.40	20056.92	▇▂▁▁▁
unemployment	0	1.00	7.33	2.12	4.42	5.97	6.96	8.22	17.37	▇▆▁▁▁
vaccinated	0	1.00	63.20	13.47	29.78	54.34	61.29	72.35	94.09	▁▇▇▇▃
labor_rate	0	1.00	44.16	5.95	30.47	40.29	44.59	47.75	61.19	▂▅▇▁▁
hospital_bed	2	0.97	0.32	0.39	0.08	0.18	0.22	0.29	2.79	▇▁▁▁▁
icu_bed	2	0.97	0.03	0.05	0.00	0.01	0.01	0.03	0.35	▇▁▁▁▁