regex bash sed bioinformatics fasta

Remove string between two space characters with sed


Somehow I can't wrap my head around this. I have the following string:

>sp.A9L976 PSBA_LEMMI Photosystem II protein D1 organism=Lemna minor taxid=4472 gene=psbA

I would like to use sed to remove the string between the 1st and 2nd occurrence of a space. Hence, in this case, PSBA_LEMMI should be removed. The string between the first two spaces does not contain any special characters.

So far I tried the following:

sed 's/\s.*\s/\s/'

But this removes everything until the last space, resulting in: >sp.A9L976 TESTgene=psbA. I thought that by leaving out the greedy expression g, sed would only match the first occurrence of the string. I also tried:

sed 's/(?<=\s).*(?=\s)//'

But this did not match / remove anything. Can someone help me out here? What am I missing?


Solution

  • You can use

    sed -E 's/\s+\S+\s+/ /'
    sed -E 's/[[:space:]]+[^[:space:]]+[[:space:]]+/ /'
    

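    Both patterns are equivalent: they match one or more whitespace characters, one or more non-whitespace characters, and one or more whitespace characters again. Because the middle part cannot match a space, the match stops at the second whitespace run instead of extending to the last one, as .* does. The second command uses only POSIX ERE syntax, while \s and \S in the first are GNU sed extensions that other sed implementations may not support.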

    Note that you cannot use \s as a whitespace char in the replacement part. \s is a regex pattern, and regex is used in the LHS (left-hand side) to search for whitespaces. So, a literal space is required to replace with a space.
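
    To make the difference concrete, here is a small sketch that runs both a greedy pattern and the bounded pattern on the header from the question. The variable names (line, greedy, fixed) are only illustrative, and the greedy command is a POSIX-class rewrite of the original attempt with a literal space as the replacement, not the exact command from the question.

    ```shell
    # Sample header taken from the question.
    line='>sp.A9L976 PSBA_LEMMI Photosystem II protein D1 organism=Lemna minor taxid=4472 gene=psbA'

    # Greedy attempt: .* extends to the LAST whitespace character on the line.
    greedy=$(printf '%s\n' "$line" | sed 's/[[:space:]].*[[:space:]]/ /')

    # Bounded pattern: [^[:space:]]+ cannot cross a space, so only the second field is removed.
    fixed=$(printf '%s\n' "$line" | sed -E 's/[[:space:]]+[^[:space:]]+[[:space:]]+/ /')

    printf '%s\n' "$greedy" "$fixed"
    ```

    The first result keeps only the first and last fields, while the second drops just PSBA_LEMMI.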

    Since an awk solution is also an option, you may use

    awk '{$2=""}1' file
    

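    Here, the lines ("records") are split into "fields" on whitespace (the default field separator), the second field ($2) is cleared with {$2 = ""}, and the trailing 1 is an always-true condition that makes awk run its default action, printing the record. Note that the separators on both sides of the emptied field are kept, so the output contains two consecutive spaces where the field used to be.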
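
    As a rough sketch of how the awk approach behaves in practice, the snippet below runs it on the header from the question and then on a two-line record. The sequence line MTAILERRES is made up for illustration, and the variant that only touches lines starting with > assumes the input is a FASTA file; without that guard, assigning to $2 on a one-field sequence line would append a trailing space to it.

    ```shell
    # Header from the question plus a hypothetical sequence line (not real data).
    header='>sp.A9L976 PSBA_LEMMI Photosystem II protein D1 organism=Lemna minor taxid=4472 gene=psbA'
    record=$(printf '%s\n%s\n' "$header" 'MTAILERRES')

    # Version from the answer: field 2 is emptied, but both separators stay.
    plain=$(printf '%s\n' "$header" | awk '{$2=""}1')

    # Variant: only edit header lines and collapse the resulting double space.
    clean=$(printf '%s\n' "$record" | awk '/^>/{$2=""; sub(/  /, " ")}1')

    printf '%s\n' "$plain" "$clean"
    ```

    Keep in mind that assigning to a field makes awk rebuild the record with single spaces between fields, so any runs of whitespace elsewhere in the header are normalized as well.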