I am rather stuck. I have 2 data frames - df1 has unique station IDs, % values by month, & num of occurrences (years) of that station in the data; df2 has repeated station IDs by year and values per month by year.
df1: represents percentage on non-missing temperature data for each station by month; n represents number of years in that station's record
station_ID Jan Feb Mar ... Dec n
10160355 37 39 38 39 141
10160360 94 91 98 89 56
10160390 83 87 85 82 163
df2: temperature data by station for each month and year; n from df1 is the length of the repeated station_ID in df2
station_ID year Jan Feb Mar ... Dec
10160355 1878 NA 10 12 12
10160355 1879 12 12 13 10
...
10160355 2018 14 11 15 14
10160360 1963 12 10 12 14
10160360 1964 10 12 15 11
...
(repeats for all stations & total rows = 277604)
What I need: For each monthly column, if df1$station < 50%, replace the data in df2 with NAs for all rows for that station/month - else, leave df2 as is. So, since df1$station_ID[1] shows only 37% for Jan, all January's for that station (df2$station[1:141]) becomes NA.
Example output I need:
station_ID year Jan Feb ... Dec
10160355 1878 NA NA NA
10160355 1879 NA NA NA
...
10160360 1963 12 10 14
10160360 1964 10 12 11
...
I've tried about 20 different methods, but I think I need some form of dplyr with rep to repeat the NAs for rows for each station when the condition is true.
Latest attempt with just one month at a time since I couldn't figure out how to do all cols:
df3 = df2 %>%
group_by(station_ID) %>%
select(Jan) %>%
mutate(if_else(df1$Jan < 50, rep(NA_character_, df1$n), Jan))
This gives an error for invalid 'times' for rep. I think I might be close, but I appreciate any suggestions! Thanks!
Things like this are much easier to do in "long" format - especially with dplyr
library(dplyr)
library(tidyr)
df1_long = pivot_longer(df1, cols = Jan:Dec, names_to = "month", values_to = "non_missing")
df2_long = pivot_longer(df2, cols = Jan:Dec, names_to = "month", values_to = "temp")
result_long = df2_long %>%
left_join(df1_long) %>%
mutate(temp = ifelse(non_missing < 50, NA, temp))
result_long
# # A tibble: 20 x 6
# station_ID year month temp n non_missing
# <int> <int> <chr> <int> <int> <int>
# 1 10160355 1878 Jan NA 141 37
# 2 10160355 1878 Feb NA 141 39
# 3 10160355 1878 Mar NA 141 38
# 4 10160355 1878 Dec NA 141 39
# 5 10160355 1879 Jan NA 141 37
# 6 10160355 1879 Feb NA 141 39
# 7 10160355 1879 Mar NA 141 38
# 8 10160355 1879 Dec NA 141 39
# 9 10160355 2018 Jan NA 141 37
# 10 10160355 2018 Feb NA 141 39
# 11 10160355 2018 Mar NA 141 38
# 12 10160355 2018 Dec NA 141 39
# 13 10160360 1963 Jan 12 56 94
# 14 10160360 1963 Feb 10 56 91
# 15 10160360 1963 Mar 12 56 98
# 16 10160360 1963 Dec 14 56 89
# 17 10160360 1964 Jan 10 56 94
# 18 10160360 1964 Feb 12 56 91
# 19 10160360 1964 Mar 15 56 98
# 20 10160360 1964 Dec 11 56 89
In many cases (especially making charts, but modeling too), I'd advise you to stick with this long-format data. However, it can be converted back to your original wide format:
result_wide = result_long %>%
select(-n, -non_missing) %>%
pivot_wider(names_from = "month", values_from = "temp")
result_wide
# # A tibble: 5 x 6
# station_ID year Jan Feb Mar Dec
# <int> <int> <int> <int> <int> <int>
# 1 10160355 1878 NA NA NA NA
# 2 10160355 1879 NA NA NA NA
# 3 10160355 2018 NA NA NA NA
# 4 10160360 1963 12 10 12 14
# 5 10160360 1964 10 12 15 11
Using this data:
df1 = read.table(text = 'station_ID Jan Feb Mar Dec n
10160355 37 39 38 39 141
10160360 94 91 98 89 56
10160390 83 87 85 82 163', header = T)
df2 = read.table(text = 'station_ID year Jan Feb Mar Dec
10160355 1878 NA 10 12 12
10160355 1879 12 12 13 10
10160355 2018 14 11 15 14
10160360 1963 12 10 12 14
10160360 1964 10 12 15 11', header = T)