
How to separate data with 5 question marks separator using separate()?


Hey so I have a tibble with head() printed like this:

# A tibble: 6 × 1
                                   id.make.model.year
                                             <chr>
1  27550?????AM General?????DJ Po Vehicle 2WD?????1984
2  28426?????AM General?????DJ Po Vehicle 2WD?????1984
3   27549?????AM General?????FJ8c Post Office?????1984
4   28425?????AM General?????FJ8c Post Office?????1984
5 1032?????AM General?????Post Office DJ5 2WD?????1985
6 1033?????AM General?????Post Office DJ8 2WD?????1985

with only one column. I want to separate this into four columns with those four column names. I tried to use separate():

A %>% 
  separate(id.make.model.year,into=c("id","make"),sep="?????")

and

A %>% 
  separate(id.make.model.year,into=c("id","make"),sep="\\?????")

but they both return the following error:

Error in stringi::stri_split_regex(value, sep, n_max) : Syntax error in regexp pattern. (U_REGEX_RULE_SYNTAX)

Yet another try:

A %>% 
  separate(id.make.model.year,into=c("id","make"),sep="[?????]")

which returns

# A tibble: 33,439 × 2
      id  make
*  <chr> <chr>
1  27550      
2  28426      
3  27549      
4  28425      
5   1032      
6   1033      
7   3347      
8  13309      
9  13310      
10 13311      
# ... with 33,429 more rows
Warning message:
Too many values at 33439 locations: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ... 

I also tried dropping sep, but all the spaces are clearly counted as separators.

What's the right way to do this? Thanks in advance.


Solution

  • The regex to match one question mark is \? or [?]. However, [?????] still matches only one occurrence of that character, because [...] defines a character class, just as [aaaaa] matches only one letter a, not five.

    So to match the five repetitions, I think you want \?{5} or [?]{5} (or \?\?\?\?\? or [?][?][?][?][?]). In an R string the backslash has to be doubled, e.g. sep = "\\?{5}".

    Until you post data with dput() I can't confirm.
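A minimal sketch of the idea, assuming the data looks like the rows printed in the question. The tibble `A` below is a hypothetical two-row stand-in typed from that printout, not the real data, and `res` is just an illustrative name. It first shows the character class versus the counted repetition on a toy string with base R's strsplit(), then applies the counted pattern in separate() with all four column names.

```r
library(tibble)
library(tidyr)

# Character class vs. counted repetition on a toy string:
# "[?????]" matches a single "?", so each "?" splits on its own
# and leaves empty pieces in between.
one_at_a_time <- strsplit("a?????b", "[?????]")[[1]]
# "[?]{5}" matches the run of five "?" as one separator.
five_at_once <- strsplit("a?????b", "[?]{5}")[[1]]

# Hypothetical stand-in for the real data (copied from the printed head()).
A <- tibble(
  id.make.model.year = c(
    "27550?????AM General?????DJ Po Vehicle 2WD?????1984",
    "1032?????AM General?????Post Office DJ5 2WD?????1985"
  )
)

# Split on exactly five literal question marks; note the doubled
# backslash inside the R string. sep = "[?]{5}" should behave the same.
res <- separate(
  A,
  id.make.model.year,
  into = c("id", "make", "model", "year"),
  sep = "\\?{5}"
)

print(res)
```

All four resulting columns stay character by default; passing convert = TRUE to separate() should turn id and year into integers. If your tidyr is recent enough (1.3.0 or later, as far as I know), separate_wider_delim() with delim = "?????" treats the delimiter as a literal string, which avoids the regex escaping altogether.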