
Extract from text the first and 6th column if there is a 6th column


I have data in the following format, and I want to extract the first column and the sixth column, but only for lines that have a sixth column:

ID1        Bacteria;Firmicutes;Clostridia;Clostridiales;
ID2        Bacteria;Firmicutes;Clostridia;Clostridiales;Eubacteriaceae;Eubacterium;Eubacterium hallii;
ID3        Bacteria;Firmicutes;
ID4        Bacteria;Firmicutes;
ID5        Bacteria;Firmicutes;Clostridia;
ID6        Bacteria;
ID7        Bacteria;Firmicutes;Clostridia;Clostridiales;Ruminococcaceae;Faecalibacterium;
ID8        Bacteria;Firmicutes;Clostridia;Clostridiales;Ruminococcaceae;Faecalibacterium;Faecalibacterium prausnitzii;

The output should be:

ID2 Eubacterium
ID7 Faecalibacterium
ID8 Faecalibacterium

I tried to solve the problem by splitting on ";" and taking the 6th column with cut -d ";" -f 6, but I think you will have a better solution. Thank you in advance!


Solution

  • You can use awk:

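    awk -F\; 'NF>=6{print substr($1, 1, 3), $6}' file
    

    With ; as the field delimiter, this prints fields 1 and 6 for every line that has 6 or more fields. Field 1 also contains the first taxon (for example "ID2        Bacteria"), so substr keeps only its first 3 characters, which are the ID. awk string positions start at 1, so substr($1, 1, 3) returns exactly those 3 characters.

    Sample output:

    $ awk -F\; 'NF>=6{print substr($1, 1, 3), $6}' file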
    ID2 Eubacterium
    ID7 Faecalibacterium
    ID8 Faecalibacterium
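
A possible variant, sketched under two assumptions that go beyond the sample data: IDs may be longer than three characters, and a line may stop after exactly five levels. Because every line ends with ;, awk counts an empty trailing field, so a five-level line has NF equal to 6 with an empty $6. The sketch below therefore tests $6 directly and takes the ID with split rather than a fixed width. The ID10 and ID11 lines and the names tmpfile, parts and result are made up for illustration.

```shell
# Build a small test file: ID1, ID2 and ID7 come from the question,
# ID10 and ID11 are hypothetical lines added to exercise the edge cases.
tmpfile=$(mktemp)
cat > "$tmpfile" <<'EOF'
ID1        Bacteria;Firmicutes;Clostridia;Clostridiales;
ID2        Bacteria;Firmicutes;Clostridia;Clostridiales;Eubacteriaceae;Eubacterium;Eubacterium hallii;
ID7        Bacteria;Firmicutes;Clostridia;Clostridiales;Ruminococcaceae;Faecalibacterium;
ID10       Bacteria;Firmicutes;Clostridia;Clostridiales;Ruminococcaceae;
ID11       Bacteria;Firmicutes;Clostridia;Clostridiales;Ruminococcaceae;Faecalibacterium;
EOF

# Print only lines whose 6th ;-separated field is non-empty.
# split($1, parts, " ") splits on runs of whitespace, so parts[1] is the
# whole ID regardless of its length.
result=$(awk -F';' '$6 != "" { split($1, parts, " "); print parts[1], $6 }' "$tmpfile")

printf '%s\n' "$result"
rm -f "$tmpfile"
```

On this test file the command prints ID2, ID7 and ID11 with their sixth field and skips ID10, whose sixth field is empty.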