How to remove the last character from a line only if it's a letter, in R or Linux


I have a list of ~28,000 gene transcripts, e.g.:

4R79.1b 
4R79.2b 
AC3.1a 
AC3.2 
AC3.3 
AC3.5a

I need to get gene names by removing the last character only if it's a letter. I've been googling for days and haven't found anything that comes close, so I must have missed something.

I thought there must be a simple solution, but my best attempt was

    sed 's/[[:alpha:]]$//' transcripts.txt > genes.txt

which did not seem to do anything: the output file is the same size as the original.


Solution

  • With awk:

    $ echo '4R79.1b 4R79.2b AC3.1a AC3.2 AC3.3 AC3.5a' | 
    awk '{for(i=1;i<=NF;i++) sub(/[[:alpha:]]$/,"",$i)} 1'   
    

    Prints:

    4R79.1 4R79.2 AC3.1 AC3.2 AC3.3 AC3.5 
    

    Or sed:

    sed -E 's/[[:alpha:]]([[:space:]]|$)/\1/g'
    

    For a new file, just redirect:

    sed -E 's/[[:alpha:]]([[:space:]]|$)/\1/g' file > new_file
    

    To edit the file in place with sed, keeping a backup as file.bak (the suffix is attached directly to -i, a form that both GNU and BSD sed accept):

    sed -i.bak -E 's/[[:alpha:]]([[:space:]]|$)/\1/g' file
    

    Or with awk, by writing to a temporary file and then overwriting the original (which is essentially what sed -i does behind the scenes):

    awk '{for(i=1;i<=NF;i++) sub(/[[:alpha:]]$/,"",$i)} 1' file > TEMP_FILE && mv -f TEMP_FILE file
    

    GNU awk (gawk 4.1 and later) also supports in-place editing through its inplace extension, invoked as gawk -i inplace.
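
As for why the original attempt seemed to do nothing: one likely cause, assuming the real file looks like the sample in the question, is invisible trailing whitespace (or a Windows carriage return) after each name, so [[:alpha:]]$ never finds a letter at the very end of the line. Below is a minimal sketch for checking and handling that on a file with one name per line. The sample data written by printf is made up from the question's example, and the file names transcripts.txt and genes.txt are the ones used in the question.

```shell
# Hypothetical sample: one transcript per line, most lines ending in a space
printf '4R79.1b \n4R79.2b \nAC3.1a \nAC3.2 \nAC3.3 \nAC3.5a\n' > transcripts.txt

# Make invisible characters visible: od -c prints every byte,
# so a trailing space or \r before the \n becomes easy to spot
head -n 2 transcripts.txt | od -c

# Remove an optional trailing letter together with any trailing whitespace
sed -E 's/[[:alpha:]]?[[:space:]]*$//' transcripts.txt > genes.txt

cat genes.txt
```

Because the [[:space:]] class also includes the carriage return, the same command should cope with files saved with Windows line endings. Names that already end in a digit are left unchanged apart from losing their trailing whitespace.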