Search code examples
rregextext-parsing

Elimating double Quotes in a Dataframe


I have a data frame containing 27 columns. All these columns have data that has a structure similar to the one below.

principal_amt <- c('"pa": "5975.00"', '"pa": "2285.00"', '"pa": "15822.00"')
closed_accounts <- c( '"ca": 0', '"ca": 3', '"ca": 0')
status <- c(' "loan_status": "Paid" ', ' "loan_status": "Funded"',' "loan_status": "Funded"')
DF <- data.frame(principal_amt, closed_accounts)

I want to automatically remove the double quotes present in the observations so that the final data frame has a structure similar to this.

principal_amt <- c(5975.00, 2285.00, 15822.00)
closed_accounts <- c(0, 3, 0)
status <- c('Paid','Funded','Funded')
DF_Final <- data.frame(principal_amt, closed_accounts)

How do I go about this?


Solution

  • The readr package ships with a handy parse_number function for such use cases.

    library(tidyverse)
    
    DF %>%
      mutate(across(.fns = parse_number))
    
      principal_amt closed_accounts
    1          5975               0
    2          2285               3
    3         15822               0
    

    Updated approach for nominal variables - extracting a match. Look behind : ", then extract everything until empty string at edge of the word.

    DF %>%
      mutate(across(c(1,3), parse_number),
             across(2, str_extract, "(?<=: \").*\\b"))
     principal_amt status closed_accounts
    1          5975   Paid               0
    2          2285 Funded               3
    3         15822 Funded               0