Search code examples
rregexstrsplit

Split salary range


I am trying to split the first column range into two separate columns as so

enter image description here

I have tried the following code:

testDF$IncomeLowerRange <- strsplit(gsub("[^-]+-", "", testDF$IncomeRange), ";")[1]
testDF$IncomeUpperRange <- strsplit(gsub("[^-]+-", "", testDF$IncomeRange), ";")[2]

Solution

  • I do not think there is a one-liner for this as you need to extract some of the data, replace fully or partially in other cases.

    I suggest using a simple logic for this:

    > test <- c("$25,000-49,000","Not displayed", "$100,000+")
    > df <- data.frame(test)
    > df$col1 <- sub("^\\$?([0-9,]+).*", "\\1", df$test)
    > df$col1[df$col1=="Not displayed"] <- "NA"
    > df$col2 <- sub("^[^-]+-(.*)", "\\1", df$test)
    > df$col2[df$col2=="Not displayed"] <- "NA"
    > df$col2[df$col2=="$100,000+"] <- "250,000"
    > 
    > df
                test    col1    col2
    1 $25,000-49,000  25,000  49,000
    2  Not displayed      NA      NA
    3      $100,000+ 100,000 250,000
    

    where col1 will be your IncomeLowerRange and col2 will be your IncomeUpperRange.

    The ^\\$?([0-9,]+).* regex is used to extract the first number from a range, and also 100,000 into the first column (^ matches the start of string, \$? matches 1 or 0 $ symbols, ([0-9,]+) captures digits and commas into Group 1 that is kept, and the rest (also, all after the number - .*) is removed.