
How to extract specific information from multiple files and make a table in Linux?


I have multiple text files, each containing information in the same format. Two of them are shown below:

Sample1.txt

Status  /documents/Sample1.sorted.bam
Assigned        50945040
Unassigned_Unmapped     947866
Unassigned_MappingQuality       0
Unassigned_Chimera      0
Unassigned_FragmentLength       0
Unassigned_Duplicate    0
Unassigned_MultiMapping 49013681
Unassigned_Secondary    0
Unassigned_Nonjunction  0
Unassigned_NoFeatures   21189312
Unassigned_Overlapping_Length   0
Unassigned_Ambiguity    4430011

Sample2.txt

Status  /documents/Sample2.sorted.bam
Assigned        36335614
Unassigned_Unmapped     870456
Unassigned_MappingQuality       0
Unassigned_Chimera      0
Unassigned_FragmentLength       0
Unassigned_Duplicate    0
Unassigned_MultiMapping 68688141
Unassigned_Secondary    0
Unassigned_Nonjunction  0
Unassigned_NoFeatures   23746485
Unassigned_Overlapping_Length   0
Unassigned_Ambiguity    3734593

For a single text file I'm using grep:

grep "Assigned\|Unmapped\|MultiMapping\|NoFeatures\|Ambiguity" Sample1.txt > output.txt

But I want the output to look like the table below, where a small script runs over all the text files and builds the table:

                        Sample1       Sample2
Assigned                50945040      36335614
Unassigned_Unmapped     947866        870456
Unassigned_MultiMapping 49013681      68688141
Unassigned_NoFeatures   21189312      23746485
Unassigned_Ambiguity    4430011       3734593

Solution

    $ cat tst.awk
    $2 != 0 {
        printf "%s%s", (NR>1 ? $1 : "Name"), OFS
        for (i=2; i<=NF; i+=2) {
            gsub(/^.*\/|\..*$/,"",$i)
            printf "%s%s", $i, (i<NF ? OFS : ORS)
        }
    }
    
    $ paste Sample1.txt Sample2.txt | awk -f tst.awk | column -t
    Name                     Sample1   Sample2
    Assigned                 50945040  36335614
    Unassigned_Unmapped      947866    870456
    Unassigned_MultiMapping  49013681  68688141
    Unassigned_NoFeatures    21189312  23746485
    Unassigned_Ambiguity     4430011   3734593
    

    To get output that Excel can open, rather than the layout shown in the question, set OFS to a comma and drop the column step:

    $ cat tst.awk
    BEGIN { OFS="," }
    $2 != 0 {
        printf "%s%s", (NR>1 ? $1 : "Name"), OFS
        for (i=2; i<=NF; i+=2) {
            gsub(/^.*\/|\..*$/,"",$i)
            printf "%s%s", $i, (i<NF ? OFS : ORS)
        }
    }
    
    $ paste Sample1.txt Sample2.txt | awk -f tst.awk > output.csv
    

    and then double-click on output.csv to open it with Excel.
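
The script above keeps a row only when its second field (the count from the first file) is non-zero, and it relies on paste to line the files up. As a rough alternative sketch, the rows can instead be picked by name, as the grep in the question does, with the files passed straight to awk. The function name make_table and the hard-coded row list are assumptions for illustration, and the sketch assumes every file starts with a Status line like the samples.

```shell
# Hypothetical helper: print a tab-separated table, one column per input file.
make_table() {
    awk '
    BEGIN {
        OFS = "\t"
        # Rows to keep, in output order (assumed list, taken from the question).
        n = split("Assigned Unassigned_Unmapped Unassigned_MultiMapping Unassigned_NoFeatures Unassigned_Ambiguity", want, " ")
    }
    FNR == 1 {
        # First line of each file: derive the column name from the path in $2,
        # e.g. /documents/Sample1.sorted.bam becomes Sample1.
        name = $2
        sub(/^.*\//, "", name)
        sub(/\..*$/, "", name)
        col[++nc] = name
        next
    }
    # Every other line: remember the value under (row name, column number).
    { val[$1, nc] = $2 }
    END {
        printf "%s", "Name"
        for (c = 1; c <= nc; c++) printf "%s%s", OFS, col[c]
        printf "%s", ORS
        for (r = 1; r <= n; r++) {
            printf "%s", want[r]
            for (c = 1; c <= nc; c++) printf "%s%s", OFS, val[want[r], c]
            printf "%s", ORS
        }
    }' "$@"
}
```

It would be called as make_table Sample*.txt | column -t for an aligned table. Rows are kept even when a count is 0, and a row missing from a file is left as an empty cell, so column -t may misalign in that case.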