Search code examples
rhydrotsm

Data organization precipitation


I have a precipitation database, where it is structured as follows.

Season; YEAR; MONTH; DAY 01; DAY 02; DAY 03 ..... DAY 31 

data here

At first I wanted to calculate the accumulated in each month (I did it using the precintcon), but only for one season. Now I want to do the same thing, but separating each station, where I will have the daily and monthly values for each station, in addition to changing the structure of the database. Where the first column would be the date and the other columns would be each season.

Date; season1; station2; estacao3 ....... estacaoN

01/01/1994;30;10;5;6
01/02/1994;10;12;55
.
.
.
.
.
.
.
31/07/2018

Solution

  • This task requires some reshaping of the dataset, first make it longer and then wider again. dc37's answer already describes how to do that with data.table. I'd recommend a little different approach, using only tidyverse functions.

    You state, that you want to calculate the sum of the rainfall per month at each station, for that task it is actually easier to keep the data in a long format instead of making it wide again. I'll demonstrate both options (2a and 2b) below.

    I would also recommend not merging the date variables, because that makes it harder to group the data by month, alternatively to my approach, you could merge year and month only, that would still allow for the necessary grouping. Anyways, 2a) demonstrates how to use tidyr::unite() to merge the date variables.

    1) Convert dataset to long format

      library(tidyverse)
      library(readxl)
      rainfall_df <- read_excel("Dados_precipitacao.xls", skip = 2)
    
      rainfall_long_df <-
        rainfall_df %>%
        select(-Bacia) %>%
        pivot_longer(`dia 1`:`dia 31`, names_to = "dia") %>%
        mutate(dia = gsub("dia ", "", dia))
    

    rainfall_long_df looks like this:

     # A tibble: 1,931,889 x 5
       `Município/Posto`   Ano   Mês dia   value
       <chr>             <dbl> <dbl> <chr> <dbl>
     1 Agua Branca        1994     1 1       0  
     2 Agua Branca        1994     1 2       0  
     3 Agua Branca        1994     1 3       0  
     4 Agua Branca        1994     1 4       0  
     5 Agua Branca        1994     1 5       0  
     6 Agua Branca        1994     1 6       8.6
     7 Agua Branca        1994     1 7       0  
     8 Agua Branca        1994     1 8       2  
     9 Agua Branca        1994     1 9       0  
    10 Agua Branca        1994     1 10      0  
    # … with 1,931,879 more rows
    

    2a) This is what you asked for: Calculating the sums per month and station from a wide dataset.

    rainfall_wide_df <-   
      rainfall_long_df %>%
        unite(data, dia, Mês, Ano, sep = "/", remove = FALSE) %>%
        pivot_wider(names_from = `Município/Posto`)
    
    rainfall_wide_df %>% 
        group_by(Ano, Mês) %>% 
        summarise_at(vars(`Agua Branca`:`Zabelê`), sum)
    

    This results in:

    # A tibble: 296 x 253
    # Groups:   Ano [26]
         Ano   Mês `Agua Branca` Aguiar `Alagoa Grande` `Alagoa Nova` Alagoinha Alcantil `Algodão de Jan…
       <dbl> <dbl>         <dbl>  <dbl>           <dbl>         <dbl>     <dbl>    <dbl>            <dbl>
     1  1994     1         174.   442.            101            68.5      64.6       NA             NA  
     2  1994     2          NA     NA              NA            NA        NA         NA             NA  
     3  1994     3         285.   120.            239.          210.      213.        NA             NA  
     4  1994     4          NA     NA              NA            NA        NA         NA             NA  
     5  1994     5         176.    73.2           160.          233.      190         NA             41.8
     6  1994     6          NA     NA              NA            NA        NA         NA             NA  
     7  1994     7          55.6   33.3           292.          188.      291.        NA             51.4
     8  1994     8          28      0              60.8          68.1      57.6       NA             16.1
     9  1994     9          NA     NA              NA            NA        NA         NA             NA  
    10  1994    10          20      0               8.8           9.3       3.6       NA              0  
    # … with 286 more rows, and 244 more variables
    

    2b) This is an alternative solution to get the sums for each station and month. Which is easier to work with for further steps (visualization in ggplot2 especially). Also I feel, that the code is more straight forward!

    rainfall_long_df %>%
        group_by(`Município/Posto`, Ano, Mês) %>%
        summarise(rainfall_per_month = sum(value))
    

    The result will be a long version of the sum of rainfall per month and station.

    # A tibble: 62,319 x 4
    # Groups:   Município/Posto, Ano [5,522]
       `Município/Posto`   Ano   Mês rainfall_per_month
       <chr>             <dbl> <dbl>              <dbl>
     1 Agua Branca        1994     1              174. 
     2 Agua Branca        1994     2               NA  
     3 Agua Branca        1994     3              285. 
     4 Agua Branca        1994     4               NA  
     5 Agua Branca        1994     5              176. 
     6 Agua Branca        1994     6               NA  
     7 Agua Branca        1994     7               55.6
     8 Agua Branca        1994     8               28  
     9 Agua Branca        1994     9               NA  
    10 Agua Branca        1994    10               20  
    # … with 62,309 more rows