How to replace a range of numbers from a range of strings with sed


I'm trying to modify a text file in which I want to change strings like the following:

lcl|NC_018257.1_cds_XP_003862892.1_5067
lcl|NC_018241.1_cds_XP_003859498.1_1683
lcl|NC_018256.1_cds_XP_003862456.1_4633
lcl|NC_018237.1_cds_XP_003858978.1_1163
lcl|NC_018254.1_cds_XP_003861926.1_4104

so that it only contains the XP_n.1 part of the string.

I have successfully removed the lcl|NC_*.1_cds_ part from the strings with the following sed command:

sed 's/lcl|NC\_.*_cds_//g' cds.fa > cds4.fa

The resultant text file contains strings like XP_003862892.1_5067.

There are about 8014 strings like this, ranging from XP_*.1_1 to XP_*.1_8014. I want to delete the trailing _1 to _8014 part so that each string ends in .1.

I tried using

sed 's/1\_./1/g'

and it seemed to work. However, when I scrolled further down the list of strings, I saw that the double-digit numbers weren't fully replaced: only the digit immediately following the '_' was removed, so the result was a 1 followed by the remaining digits. The same happened with three- and four-digit numbers, e.g.:

XP_003857837.1_23   --->   XP_003857837.13
XP_003857942.1_228  --->   XP_003857942.128

I have no idea how to remove this; all my attempts have failed. Some people have asked what my desired output should look like. The ideal output would be XP_003857837.1: each string should end in .1 instead of .1_SomeNumberRangingFrom1to8014.


Solution

  • You can do everything in one go with a slightly more complex regex.

    sed 's/lcl|NC_.*_cds_\(XP_[0-9.]*\)_.*/\1/' cds.fa > cds4.fa
    

    The backslashed parentheses create a capturing group, and \1 in the replacement recalls the first captured group (\2 for the second, etc., if you have more than one). The regex inside the group looks for XP_ followed by digits and dots, and the expression after it matches the rest of the line from the next underscore on.

    In other words, this basically says "replace the whole line with just the part we care about".

    By the by, there is no reason to backslash underscores anywhere, and the /g option to the s command only makes sense when you want to replace multiple occurrences on the same input line.
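
As a quick sanity check, here is a minimal sketch that runs the one-pass command on the five sample IDs from the question. The input is piped in from a shell variable rather than read from cds.fa, and the variable names (sample, result, fixed) are made up for this demo. The last command is an alternative fix for the second step alone, assuming the unwanted number is always at the very end of the line: '.' matches exactly one character, which is why 1\_. only consumed a single digit, whereas _[0-9]*$ removes the whole trailing number.

```shell
# Sample IDs copied from the question (held in a variable for the demo).
sample='lcl|NC_018257.1_cds_XP_003862892.1_5067
lcl|NC_018241.1_cds_XP_003859498.1_1683
lcl|NC_018256.1_cds_XP_003862456.1_4633
lcl|NC_018237.1_cds_XP_003858978.1_1163
lcl|NC_018254.1_cds_XP_003861926.1_4104'

# One pass: capture XP_ plus digits/dots, discard everything around it.
result=$(printf '%s\n' "$sample" | sed 's/lcl|NC_.*_cds_\(XP_[0-9.]*\)_.*/\1/')
printf '%s\n' "$result"

# Second step on its own: strip an underscore plus any run of digits
# at the end of the line (assumes the number is the last thing on the line).
fixed=$(printf '%s\n' 'XP_003857837.1_23' 'XP_003857942.1_228' | sed 's/_[0-9]*$//')
printf '%s\n' "$fixed"
```

Lines that do not match the pattern pass through sed unchanged, so any other lines in the file are left as they are.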