Search code examples
rtidyrdata-cleaningseparator

How to prevent tidyr's separate function from pulling in row numbers and then dropping data


I am trying to write a line of code to separate a text string whenever a capital letter in encountered without removing the letter. The approach I have taken is as follows:

set.seed(1)

# create a dataframe of fused alpha numeric codes that I wish to separate
df1 <- as.data.frame(matrix(
    paste0(sample(LETTERS, 20, replace = TRUE), sample(seq(1, 7, 0.1), 20, replace = TRUE)), 
    nrow = 10)) %>% unite(col = "ab", sep = "")
df1

# Add a space (" ") before any captial letter encountered
df2 <- df1 %>% mutate(ab = gsub('([[:upper:]])', ' \\1', ab))
df2

# use separate to split the column based on the space
df3 <- df2 %>% separate(col=ab, into=c("a", "b"), sep = " ")
df3

When I run separate I get a warning and the output is not correct:

#Warning message:
#Expected 2 pieces. Additional pieces discarded in 10 rows [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]. 
#> df3
#   a    b
#1      Y3
#2    D4.6
#3      G5
#4    A3.4
#5    B5.5
#6    W4.6
#7    K4.6
#8    N4.3
#9    R5.1
#10   S3.4

The contents intended for column "a" have been placed on column "b", whilst those intended for "b" appear to have been removed entirely.


Solution

  • This is because you create a white space before your first letter: to remove it, you can use trimws or str_trim:

    df1 %>% 
      mutate(ab = trimws(gsub('([[:upper:]])', ' \\1', ab))) %>%
      separate(col=ab, into=c("a", "b"), sep = " ")
    
          a    b
    1    Y3 A5.3
    2  D4.6 U2.4
    3    G5 U4.2
    4  A3.4 J2.9
    5  B5.5 V4.4
    6  W4.6 N1.5
    7  K4.6 J1.9
    8  N4.3 G5.1
    9  R5.1 I4.7
    10 S3.4 O5.6