Search code examples
rdataframedata-sciencemodeling

Data Wrangling for Modeling in R


I have a data set (original version, # A tibble: 33,478 x 12) of the form similar to the attached picture, and partial data:

enter image description here

dput(head(canals2, n=10))

 structure(list(Site = c(1, 2, 4, 11, 10, 12, 13, 14, 15, 16), 
    `Sample Date` = c("2/11/2004", "2/11/2004", "2/11/2004", 
    "2/11/2004", "2/11/2004", "2/11/2004", "2/11/2004", "2/11/2004", 
    "2/11/2004", "2/11/2004"), `Analysis code` = c("NH3", "NH3", 
    "NH3", "Chl a", "Chl a", "Chl a", "NH3", "Chl a", "NH3", 
    "NH3"), Analysis = c("Ammonia-Nitrogen", "Ammonia-Nitrogen", 
    "Ammonia-Nitrogen", "Chlorophyll a", "Chlorophyll a", "Chlorophyll a", 
    "Ammonia-Nitrogen", "Chlorophyll a", "Ammonia-Nitrogen", 
    "Ammonia-Nitrogen"), Result = c(0.068, 0.07, 0.014, 1.31, 
    1.39, 1.95, 0.247, 1.46, 0.113, 0.17), Units = c("mg/L", 
    "mg/L", "mg/L", "mg/m3", "mg/m3", "mg/m3", "mg/L", "mg/m3", 
    "mg/L", "mg/L")), row.names = c(NA, -10L), class = c("tbl_df", 
"tbl", "data.frame"))

I would like to try to predict, for instance, "Chlorophyll a" from "Ammonia-Nitrogen" using a linear model (say, using the lm() function). lm() takes column names as inputs for 'formula', yet this dataset was generated very differently. I should be using the values in the Results column for each analysis, but I can't seem to find a good way of organizing my data as such.

So far, I tried splitting the data by analysis, with the intention of creating a new dataframe for each analysis, then replacing the Result with the name of the analysis selected in that dataframe. Here is the function I used (ran it on the main data set, which is why it contains more analysis names):

analysis_list = unique(canals$Analysis)

> analysis_list

1 "Ammonia-Nitrogen" "Chlorophyll a" "Fecal Coliform"
[4] "Specific Conductance" "Copper" "Dissolved Oxygen"
[7] "E Coli" "Enterococci" "Nitrite + Nitrate"
[10] "Ortho-Phosphate" "pH" "Salinity"
[13] "Temperature" "Total Kjeldahl Nitrogen" "Total Nitrogen"
[16] "Total Phosphorus" "Turbidity"

split_analyses <- function()
  canals_byAnalysis <- vector(mode = "list", length = 0)
  for (i in 1:17) {
    analysis <- analysis_list[i]
    updated_analysis <- canals %>% 
                    subset(Analysis == analysis, 
                    select = -c(`Analysis code`))
    canals_byAnalysis[[i]] <- updated_analysis
  }

split_analyses()

Unfortunately, that didn't work as expected, and I had many issues merging the tables I created. I tried other ways as well, but I am getting nowhere. Is anyone willing to offer any suggestions?


Solution

  • If I understand correctly, then it sounds like you're trying to restructure your data to get it into the proper form for modelling purposes. I think using pivot_wider (from tidyr) will get you what you want. Here's what I did:

    First, here's your data as a dataframe:

    Site <- c(1, 2, 4, 11, 10, 12, 13, 14, 15, 16)
    Sample_Date <- c("2/11/2004", "2/11/2004", "2/11/2004", "2/11/2004", 
    "2/11/2004", "2/11/2004", "2/11/2004", "2/11/2004", "2/11/2004", "2/11/2004")
    Analysis_code <- c("NH3", "NH3", "NH3", "Chl a", "Chl a", "Chl a", "NH3", "Chl 
    a", "NH3", "NH3")
    Analysis <- c("Ammonia-Nitrogen", "Ammonia-Nitrogen", "Ammonia-Nitrogen", 
    "Chlorophyll a", "Chlorophyll a", "Chlorophyll a", "Ammonia-Nitrogen", 
    "Chlorophyll a", "Ammonia-Nitrogen", "Ammonia-Nitrogen")
    Results <- c(0.068, 0.07, 0.014, 1.31, 1.39, 1.95, 0.247, 1.46, 0.113, 0.17)
    Units <- c("mg/L", "mg/L", "mg/L", "mg/m3", "mg/m3", "mg/m3", "mg/L", "mg/m3", 
    "mg/L", "mg/L")
    
       Site Sample_Date Analysis_code         Analysis Results Units
    1     1   2/11/2004           NH3 Ammonia-Nitrogen   0.068  mg/L
    2     2   2/11/2004           NH3 Ammonia-Nitrogen   0.070  mg/L
    3     4   2/11/2004           NH3 Ammonia-Nitrogen   0.014  mg/L
    4    11   2/11/2004         Chl a    Chlorophyll a   1.310 mg/m3
    5    10   2/11/2004         Chl a    Chlorophyll a   1.390 mg/m3
    

    Next, we'll apply pivot_wider to spread the Analysis variable. This will leave you with a column for each of the Analysis types you have, along with their respective Results value.

    #spread the analysis variable
    new_df <- df %>%
      pivot_wider(names_from = "Analysis", values_from = "Results")
    
        Site Sample_Date Analysis_code Units `Ammonia-Nitrogen` `Chlorophyll a`
       <dbl> <chr>       <chr>         <chr>              <dbl>           <dbl>
     1     1 2/11/2004   NH3           mg/L               0.068           NA   
     2     2 2/11/2004   NH3           mg/L               0.07            NA   
     3     4 2/11/2004   NH3           mg/L               0.014           NA   
     4    11 2/11/2004   Chl a         mg/m3             NA                1.31
     5    10 2/11/2004   Chl a         mg/m3             NA                1.39