I am starting with 6 different lists/CSVs that each contain one character column. This column shows the Home Owners' Loan Corporation (HOLC) neighborhood grades of census block groups, so the columns look something like the examples shown below. I am new to RStudio and I am wondering whether the first step would be to combine the lists. Another option could be to add a new binary column to each list that marks a row as 1 if the grade is not NA and 0 if it is. Then each of the lists can be condensed into the categories A, B, C, D, and NA, and the new binary column can be summed.
Ideally I am interested in using ggplot but I am open to other options. Thanks for your help! I appreciate it.
Example image of how I would like the results to look. In this example, each line represents a different list/csv table:
houston_grade2020 |
---|
NA |
NA |
B |
A |
NA |
C |
D |
minneapolis_grade2020 |
---|
A |
NA |
NA |
B |
C |
C |
D |
houston_grade1990 |
---|
B |
B |
B |
A |
A |
C |
D |
minneapolis_grade1990 |
---|
B |
A |
NA |
A |
NA |
NA |
D |
etc.
(I started by working with one csv to try and visualize it but alas it did not work. In this example, I did not add the binary column.)
# Group by Grade
Houston_2020_group <-
  data.frame(
    values = c(Houston_2020_sub$houston_grade2020),
    group = c(rep("Houston 2020", nrow(Houston_2020_sub)))
  )
ggplot(data = Houston_2020_group, aes(x = values, y = group, fill = group)) +
  geom_line() +
  labs(title = "HOLC Grades")
results:
In this example, I failed to sum the count of the appearances of each grade. For the final result I would like all lists/csvs to be represented in the graph.
Your biggest challenge here is rearranging your data into an appropriate format for plotting. Essentially, you should get all your data in a single data frame, with all the grades in a single column, and have a second column indicating which data set the grades came from. Then you can group the data according to this second column and count the number of each grade. This then allows easy plotting:
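library(tidyverse)

# Collect the data frames in a named list; the names become the group labels
list(Houston_2020 = Houston_2020_sub,
     Minneapolis_2020 = Minneapolis_2020_sub,
     Houston_1990 = Houston_1990_sub,
     Minneapolis_1990 = Minneapolis_1990_sub) %>%
  # Rename each single column to 'grade' so the tables can be stacked
  lapply(function(x) setNames(x, 'grade')) %>%
  # Stack into one data frame, recording the source list name in 'group'
  bind_rows(.id = 'group') %>%
  mutate(grade = factor(grade)) %>%
  group_by(group) %>%
  # Count each grade per group, keeping grades that have zero rows
  count(grade, .drop = FALSE) %>%
  ggplot(aes(grade, n, colour = group, group = group)) +
  geom_line() +
  geom_point(color = 'black') +
  facet_grid(group~.)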
If you want all the lines on the same panel, just get rid of that final facet_grid line. With the sample data, the single-panel version looks messy because the counts are so small and the lines overlap.
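If the six tables are still CSV files on disk rather than data frames in your session, something like the following base R sketch could build the named list that the pipeline starts from. The folder name, the file names, and the tiny example contents are all hypothetical; the sketch writes two small files to a temporary folder only so that it runs on its own.

```r
# Hypothetical setup: write two tiny CSVs to a temp folder so the sketch runs
csv_dir <- file.path(tempdir(), "holc_csvs")
dir.create(csv_dir, showWarnings = FALSE)
write.csv(data.frame(houston_grade2020 = c(NA, "B", "A")),
          file.path(csv_dir, "Houston_2020.csv"), row.names = FALSE)
write.csv(data.frame(minneapolis_grade2020 = c("A", NA, "C")),
          file.path(csv_dir, "Minneapolis_2020.csv"), row.names = FALSE)

# Read every CSV in the folder into a list named after the files
paths <- list.files(csv_dir, pattern = "\\.csv$", full.names = TRUE)
tables <- lapply(paths, read.csv, stringsAsFactors = FALSE)
names(tables) <- tools::file_path_sans_ext(basename(paths))

# Give each single column the same name so the tables can be stacked
tables <- lapply(tables, function(x) setNames(x, "grade"))
str(tables)
```

The resulting `tables` list could then take the place of the hand-written `list(...)` at the top of the pipeline, with the file names serving as the group labels.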
Data in reproducible format, taken from question
Houston_2020_sub <- data.frame(houston_grade2020 = c(NA, NA, 'B', 'A',
NA, 'C', 'D'))
Minneapolis_2020_sub <- data.frame(minneapolis_grade2020 = c('A', NA, NA, "B",
"C", "C", "D"))
Houston_1990_sub <- data.frame(houston_grade1990 = c('B', 'B', 'B', 'A', 'A',
'C', 'D'))
Minneapolis_1990_sub <- data.frame(minneapolis_grade1990 = c('B', 'A', NA, 'A',
NA, NA, 'D'))
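As a cross-check on the counting step, here is a minimal base R sketch that tabulates the same sample values without any plotting. It is an alternative to the dplyr pipeline rather than part of it, and it treats NA as an explicit fifth category, so no separate binary column is needed. The object names `grades`, `long`, and `counts` are made up for this illustration.

```r
# Same sample values as in the question, held as a named list of vectors
grades <- list(
  Houston_2020     = c(NA, NA, "B", "A", NA, "C", "D"),
  Minneapolis_2020 = c("A", NA, NA, "B", "C", "C", "D"),
  Houston_1990     = c("B", "B", "B", "A", "A", "C", "D"),
  Minneapolis_1990 = c("B", "A", NA, "A", NA, NA, "D")
)

# Stack into one long data frame: one row per census block group
long <- data.frame(
  group = rep(names(grades), lengths(grades)),
  grade = unlist(grades, use.names = FALSE)
)

# Turn missing grades into an explicit "NA" level so they are counted too
long$grade <- factor(ifelse(is.na(long$grade), "NA", long$grade),
                     levels = c("A", "B", "C", "D", "NA"))

# Cross-tabulate: rows are data sets, columns are grades
counts <- table(long$group, long$grade)
print(counts)
```

Each row of `counts` corresponds to one line in the plot, so comparing the two is a quick way to confirm that the grouping and counting behaved as expected.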