I have a precipitation database, where it is structured as follows.
Season; YEAR; MONTH; DAY 01; DAY 02; DAY 03 ..... DAY 31
At first I wanted to calculate the accumulated in each month (I did it using the precintcon), but only for one season. Now I want to do the same thing, but separating each station, where I will have the daily and monthly values for each station, in addition to changing the structure of the database. Where the first column would be the date and the other columns would be each season.
Date; season1; station2; estacao3 ....... estacaoN
01/01/1994;30;10;5;6
01/02/1994;10;12;55
.
.
.
.
.
.
.
31/07/2018
This task requires some reshaping of the dataset, first make it longer and then wider again. dc37's answer already describes how to do that with data.table
. I'd recommend a little different approach, using only tidyverse
functions.
You state, that you want to calculate the sum of the rainfall per month at each station, for that task it is actually easier to keep the data in a long format instead of making it wide again. I'll demonstrate both options (2a and 2b) below.
I would also recommend not merging the date variables, because that makes it harder to group the data by month, alternatively to my approach, you could merge year and month only, that would still allow for the necessary grouping. Anyways, 2a) demonstrates how to use tidyr::unite() to merge the date variables.
1) Convert dataset to long format
library(tidyverse)
library(readxl)
rainfall_df <- read_excel("Dados_precipitacao.xls", skip = 2)
rainfall_long_df <-
rainfall_df %>%
select(-Bacia) %>%
pivot_longer(`dia 1`:`dia 31`, names_to = "dia") %>%
mutate(dia = gsub("dia ", "", dia))
rainfall_long_df looks like this:
# A tibble: 1,931,889 x 5
`Município/Posto` Ano Mês dia value
<chr> <dbl> <dbl> <chr> <dbl>
1 Agua Branca 1994 1 1 0
2 Agua Branca 1994 1 2 0
3 Agua Branca 1994 1 3 0
4 Agua Branca 1994 1 4 0
5 Agua Branca 1994 1 5 0
6 Agua Branca 1994 1 6 8.6
7 Agua Branca 1994 1 7 0
8 Agua Branca 1994 1 8 2
9 Agua Branca 1994 1 9 0
10 Agua Branca 1994 1 10 0
# … with 1,931,879 more rows
2a) This is what you asked for: Calculating the sums per month and station from a wide dataset.
rainfall_wide_df <-
rainfall_long_df %>%
unite(data, dia, Mês, Ano, sep = "/", remove = FALSE) %>%
pivot_wider(names_from = `Município/Posto`)
rainfall_wide_df %>%
group_by(Ano, Mês) %>%
summarise_at(vars(`Agua Branca`:`Zabelê`), sum)
This results in:
# A tibble: 296 x 253
# Groups: Ano [26]
Ano Mês `Agua Branca` Aguiar `Alagoa Grande` `Alagoa Nova` Alagoinha Alcantil `Algodão de Jan…
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1994 1 174. 442. 101 68.5 64.6 NA NA
2 1994 2 NA NA NA NA NA NA NA
3 1994 3 285. 120. 239. 210. 213. NA NA
4 1994 4 NA NA NA NA NA NA NA
5 1994 5 176. 73.2 160. 233. 190 NA 41.8
6 1994 6 NA NA NA NA NA NA NA
7 1994 7 55.6 33.3 292. 188. 291. NA 51.4
8 1994 8 28 0 60.8 68.1 57.6 NA 16.1
9 1994 9 NA NA NA NA NA NA NA
10 1994 10 20 0 8.8 9.3 3.6 NA 0
# … with 286 more rows, and 244 more variables
2b) This is an alternative solution to get the sums for each station and month. Which is easier to work with for further steps (visualization in ggplot2 especially). Also I feel, that the code is more straight forward!
rainfall_long_df %>%
group_by(`Município/Posto`, Ano, Mês) %>%
summarise(rainfall_per_month = sum(value))
The result will be a long version of the sum of rainfall per month and station.
# A tibble: 62,319 x 4
# Groups: Município/Posto, Ano [5,522]
`Município/Posto` Ano Mês rainfall_per_month
<chr> <dbl> <dbl> <dbl>
1 Agua Branca 1994 1 174.
2 Agua Branca 1994 2 NA
3 Agua Branca 1994 3 285.
4 Agua Branca 1994 4 NA
5 Agua Branca 1994 5 176.
6 Agua Branca 1994 6 NA
7 Agua Branca 1994 7 55.6
8 Agua Branca 1994 8 28
9 Agua Branca 1994 9 NA
10 Agua Branca 1994 10 20
# … with 62,309 more rows