Search code examples
rtidyr

Using tidyr separate to split information based on spaces and other characters, and remove some others


I've some awkwardly formatted information scraped from a pdf table and I would like to split it into columns of numeric values based on both space and comma separators. Note that the spaces are variable in length between data. Some values also have "%" fixed to values like "11.48%" and some of the units also have ".%" affixed like "wt.%".

A couple of lines of the data below in the dataframe 'df'.

df <- data.frame(Data = c("Ag, ppm  0.219 0.025 0.169 0.269  0.144 0.294 11.48%  22.96%   34.43%  0.208  0.230",
                               "Al, wt.%  5.27  0.119 5.03  5.50   4.91  5.62 2.25%   4.51%   6.76%   5.00  5.53"))

df$Method <- "XRF"

Now when I run the following piece of code tidyr::separate' seems to work well separating the data into the desired columns, discarding the "%" symbols and the ".%" suffix, and assigning the desired names. However, it truncates the decimals, and as such the converted numeric values are integer rather than numeric. It seems to be with separate using "." flagging that the next characters up to the next separator, comma, or white space, should be discarded.

df2 <- df %>% 
       separate(., col = Data,  into = c("Var", "Unit", "Cert_Val", "ASD1", "ASD2L", 
                                    "ASD2H","ASD3L", "ASD3H", "RSD1","RSD2",
                                    'RSD3', 'Low', "High"), convert = TRUE) 

str(df2)
'data.frame':   2 obs. of  14 variables:
 $ Var     : chr  "Ag" "Al"
 $ Unit    : chr  "ppm" "wt"
 $ Cert_Val: int  0 5
 $ ASD1    : int  219 27
 $ ASD2L   : int  0 0
 $ ASD2H   : int  25 119
 $ ASD3L   : int  0 5
 $ ASD3H   : int  169 3
 $ RSD1    : int  0 5
 $ RSD2    : int  269 50
 $ RSD3    : int  0 4
 $ Low     : int  144 91
 $ High    : int  0 5
 $ Method  : chr  "XRF" "XRF"

I suspect there is some way of using the 'sep' command in 'separate' to get the function to not use "." as a discard signal character, but I'm asking for some help from perhaps a regular expression guru for the correct syntax. Or perhaps there is a better solution to this problem decoding problem than using 'separate'?


Solution

  • If you specify the separator as " +" (one or more spaces) and convert the percentages, the output looks correct to me:

    library(tidyverse)
    
    df <- data.frame(Data = c("Ag, ppm  0.219 0.025 0.169 0.269  0.144 0.294 11.48%  22.96%   34.43%  0.208  0.230",
                              "Al, wt.%  5.27  0.119 5.03  5.50   4.91  5.62 2.25%   4.51%   6.76%   5.00  5.53"))
    
    df$Method <- "XRF"
    
    df2 <- df %>% 
      separate(col = Data,  into = c("Var", "Unit", "Cert_Val", "ASD1", "ASD2L", 
                                        "ASD2H","ASD3L", "ASD3H", "RSD1","RSD2",
                                        'RSD3', 'Low', "High"), convert = TRUE,
               sep = " +") %>%
      mutate(across(starts_with("RSD"), ~parse_number(.x) / 100),
             Var = sub(",", "", Var))
    
    df2
    #>   Var Unit Cert_Val  ASD1 ASD2L ASD2H ASD3L ASD3H   RSD1   RSD2   RSD3   Low
    #> 1  Ag  ppm    0.219 0.025 0.169 0.269 0.144 0.294 0.1148 0.2296 0.3443 0.208
    #> 2  Al wt.%    5.270 0.119 5.030 5.500 4.910 5.620 0.0225 0.0451 0.0676 5.000
    #>   High Method
    #> 1 0.23    XRF
    #> 2 5.53    XRF
    
    str(df2)
    #> 'data.frame':    2 obs. of  14 variables:
    #>  $ Var     : chr  "Ag" "Al"
    #>  $ Unit    : chr  "ppm" "wt.%"
    #>  $ Cert_Val: num  0.219 5.27
    #>  $ ASD1    : num  0.025 0.119
    #>  $ ASD2L   : num  0.169 5.03
    #>  $ ASD2H   : num  0.269 5.5
    #>  $ ASD3L   : num  0.144 4.91
    #>  $ ASD3H   : num  0.294 5.62
    #>  $ RSD1    : num  0.1148 0.0225
    #>  $ RSD2    : num  0.2296 0.0451
    #>  $ RSD3    : num  0.3443 0.0676
    #>  $ Low     : num  0.208 5
    #>  $ High    : num  0.23 5.53
    #>  $ Method  : chr  "XRF" "XRF"
    

    Created on 2023-08-21 with reprex v2.0.2

    Does that solve your problem?