Search code examples
rstringdplyrgrepl

Separating columns using regex used to split the columns into two variables


I have a dataset in which my unit of analysis and variable are together in the same column. I would like to split this up into two variables, but not sure how to because it involves regex.

Say I have the following data:

de <- data.frame(year = c(2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010,
                          2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012),
                 company = c("Company A - Variable 1", "Company A - Variable 2", "Company A - Variable 3",
                             "CompanyB - Variable 1", "CompanyB - Variable 2", "CompanyB - Variable 3", 
                             "Compan- C - Variable 1", "Compan- C - Variable 2", "Compan- C - Variable 3",
                             "Company A - Variable 1", "Company A - Variable 2", "Company A - Variable 3",
                             "CompanyB - Variable 1", "CompanyB - Variable 2", "CompanyB - Variable 3", 
                             "Compan- C - Variable 1", "Compan- C - Variable 2", "Compan- C - Variable 3"),
                 score = c(1,3,5,7,9,11,13,15,17,2,4,6,8,10,12,14,16,18))

Which looks like:

year company                score
<dbl> <chr>                  <dbl>
2010 Company A - Variable 1     1
2010 Company A - Variable 2     3
2010 Company A - Variable 3     5
2010 CompanyB - Variable 1      7
2010 CompanyB - Variable 2      9
2010 CompanyB - Variable 3     11
2010 Compan- C - Variable 1    13
2010 Compan- C - Variable 2    15
2010 Compan- C - Variable 3    17
2012 Company A - Variable 1     2
2012 Company A - Variable 2     4
2012 Company A - Variable 3     6
2012 CompanyB - Variable 1      8
2012 CompanyB - Variable 2     10
2012 CompanyB - Variable 3     12
2012 Compan- C - Variable 1    14
2012 Compan- C - Variable 2    16
2012 Compan- C - Variable 3    18

The company variable is a combination of company names (in all different forms and shapes), always followed by " - " and then the variable name. I want to wrangle the data so that I arrive at the following structure, splitting up the company and the variable into new columns:


year   company   variable score
2010 Company A Variable 1     1
2010 Company A Variable 2     3
2010 Company A Variable 3     5
2010  CompanyB Variable 1     7
2010  CompanyB Variable 2     9
2010  CompanyB Variable 3    11
2010 Compan- C Variable 1    13
2010 Compan- C Variable 2    15
2010 Compan- C Variable 3    17
2012 Company A Variable 1     2
2012 Company A Variable 2     4
2012 Company A Variable 3     6
2012  CompanyB Variable 1     8
2012  CompanyB Variable 2    10
2012  CompanyB Variable 3    12
2012 Compan- C Variable 1    14
2012 Compan- C Variable 2    16
2012 Compan- C Variable 3    18

I guess then use something to extract the text before " - " (the company name) and after " - " (the variable name). Then sort of split this up into two different variables: company and variable. Any elegant solution would be much appreciated. Thanks


Solution

  • You can use tidyr::separate with a lookahead regex to split at the blank space before the string "Variable".

    library(tidyr)
    
    separate(de, company, sep = "(?=Variable)", into = c('Company', 'Variable'))
    #>    year      Company   Variable score
    #> 1  2010 Company A -  Variable 1     1
    #> 2  2010 Company A -  Variable 2     3
    #> 3  2010 Company A -  Variable 3     5
    #> 4  2010  CompanyB -  Variable 1     7
    #> 5  2010  CompanyB -  Variable 2     9
    #> 6  2010  CompanyB -  Variable 3    11
    #> 7  2010 Compan- C -  Variable 1    13
    #> 8  2010 Compan- C -  Variable 2    15
    #> 9  2010 Compan- C -  Variable 3    17
    #> 10 2012 Company A -  Variable 1     2
    #> 11 2012 Company A -  Variable 2     4
    #> 12 2012 Company A -  Variable 3     6
    #> 13 2012  CompanyB -  Variable 1     8
    #> 14 2012  CompanyB -  Variable 2    10
    #> 15 2012  CompanyB -  Variable 3    12
    #> 16 2012 Compan- C -  Variable 1    14
    #> 17 2012 Compan- C -  Variable 2    16
    #> 18 2012 Compan- C -  Variable 3    18