How to remove part of a string in a column of dataframe in R?


I have a data frame like the following:

df:

 S            S1       S2     S3       S4   
100130426     0        0    0.9066     0
100133144   16.3644  9.2659 11.6228 12.0894
100134869   12.9316  17.379 9.2294  11.0799
     3457   1910.3  2453.50 2695.37 1372.3624
     9834   1660.13 857.30  1240.53 1434.6463
ATP5L2|267    0        0     0.9066    0
ATP5L|1063  1510.29 1270.79 2965.54 2397.1866
ATP5O|539   2176.17 1868.95 2004.53 2360.3641

I want to remove "|" and the numbers after it in the first column. For example, ATP5L2|267 should become ATP5L2.

So I tried in the following way:

SD <- sapply(strsplit(df$s, split='|', fixed=TRUE), function(x) (x[1]))

But this gave me an error:

Error in strsplit(s, split = "|", fixed = TRUE) : non-character argument.

The output should look like the following:

df:

 S            S1       S2     S3       S4   
100130426     0        0    0.9066     0
100133144   16.3644  9.2659 11.6228 12.0894
100134869   12.9316  17.379 9.2294  11.0799
     3457   1910.3  2453.50 2695.37 1372.3624
     9834   1660.13 857.30  1240.53 1434.6463
   ATP5L2     0        0     0.9066    0
    ATP5L   1510.29 1270.79 2965.54 2397.1866
    ATP5O   2176.17 1868.95 2004.53 2360.3641

Solution

  • You can do this with sub and a regular expression.

    df$S = sub("\\|.*", "", as.character(df$S))
    df
              S        S1        S2        S3        S4
    1 100130426    0.0000    0.0000    0.9066    0.0000
    2 100133144   16.3644    9.2659   11.6228   12.0894
    3 100134869   12.9316   17.3790    9.2294   11.0799
    4      3457 1910.3000 2453.5000 2695.3700 1372.3624
    5      9834 1660.1300  857.3000 1240.5300 1434.6463
    6    ATP5L2    0.0000    0.0000    0.9066    0.0000
    7     ATP5L 1510.2900 1270.7900 2965.5400 2397.1866
    8     ATP5O 2176.1700 1868.9500 2004.5300 2360.3641
    

    Details:

    sub replaces whatever matches the first argument with the second argument. In this case, we want to match | and everything after it. You can't just write | because it has a special meaning in regular expressions (alternation), so you "escape" it by writing \\|. It is followed by .*. The . means "any character" and * means "zero or more times", so together \\|.* means | followed by any number of characters. We replace that match with the empty string "".

    We apply this operation to as.character(df$S) because your error message makes it look like your variable df$S may be a factor rather than a string. strsplit and sub both work on character vectors, and strsplit raises "non-character argument" when given a factor.
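    To check the explanation on a small case, here is a minimal sketch. The vector ids is a hypothetical stand-in for df$S, stored as a factor on the assumption that this is what triggered the error. It runs the sub call from the answer and also the strsplit approach from the question, with the factor converted to character first.

```r
# Hypothetical sample standing in for df$S; stored as a factor on purpose
# to mimic the suspected cause of the "non-character argument" error.
ids <- factor(c("100130426", "3457", "ATP5L2|267", "ATP5L|1063", "ATP5O|539"))

# Regex approach: remove the first "|" and everything after it.
cleaned_regex <- sub("\\|.*", "", as.character(ids))

# strsplit approach from the question: convert the factor to character,
# split on a literal "|" (fixed = TRUE), and keep the first piece.
parts <- strsplit(as.character(ids), split = "|", fixed = TRUE)
cleaned_split <- vapply(parts, function(x) x[1], character(1))

print(cleaned_regex)
print(identical(cleaned_regex, cleaned_split))
```

    Values without a "|" pass through unchanged in both versions, so the numeric-looking IDs are left alone. If the column is already character, as.character is a harmless no-op.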