I have data in the following format, and I want to extract the first column and column 6, if a sixth column exists:
ID1 Bacteria;Firmicutes;Clostridia;Clostridiales;
ID2 Bacteria;Firmicutes;Clostridia;Clostridiales;Eubacteriaceae;Eubacterium;Eubacterium hallii;
ID3 Bacteria;Firmicutes;
ID4 Bacteria;Firmicutes;
ID5 Bacteria;Firmicutes;Clostridia;
ID6 Bacteria;
ID7 Bacteria;Firmicutes;Clostridia;Clostridiales;Ruminococcaceae;Faecalibacterium;
ID8 Bacteria;Firmicutes;Clostridia;Clostridiales;Ruminococcaceae;Faecalibacterium;Faecalibacterium prausnitzii;
The output should be:
ID2 Eubacterium
ID7 Faecalibacterium
ID8 Faecalibacterium
I tried to solve the problem by splitting on ";" and taking the 6th column with cut -d ";" -f 6,
but I think you will have a better solution. Thank you in advance!
You can use awk:
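awk -F\; 'NF>=6{print substr($1, 1, 3), $6}' file
With ; as the field delimiter, the first field holds the ID plus the first rank (e.g. "ID2 Bacteria"), so substr($1, 1, 3) keeps only its first 3 characters, the ID (awk strings are indexed from 1).
A line is printed only if it has 6 or more fields, in which case the ID and field 6 are output. This assumes every ID is exactly 3 characters long.
Sample output:
$ awk -F\; 'NF>=6{print substr($1, 1, 3), $6}' file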
ID2 Eubacterium
ID7 Faecalibacterium
ID8 Faecalibacterium
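If the IDs can be longer than 3 characters, or a line can end right after the fifth rank (which leaves an empty sixth field because of the trailing ;), a slightly more defensive variant may help. This is only a sketch: the file name taxa.txt and the ID10 line are made up for illustration and are not part of the original data.

```shell
# Hypothetical sample input in the same format as the question:
# ID, a blank, then a ';'-terminated list of ranks.
cat > taxa.txt <<'EOF'
ID1 Bacteria;Firmicutes;Clostridia;Clostridiales;
ID2 Bacteria;Firmicutes;Clostridia;Clostridiales;Eubacteriaceae;Eubacterium;Eubacterium hallii;
ID6 Bacteria;
ID7 Bacteria;Firmicutes;Clostridia;Clostridiales;Ruminococcaceae;Faecalibacterium;
ID10 Bacteria;Firmicutes;Clostridia;Clostridiales;Ruminococcaceae;
EOF

# $6 != "" skips lines whose sixth field is missing or empty (e.g. ID10,
# which ends right after the fifth rank). split() on " " takes the ID up to
# the first run of blanks, so IDs of any length work.
result=$(awk -F';' '$6 != "" { split($1, id, " "); print id[1], $6 }' taxa.txt)
printf '%s\n' "$result"
```

Testing $6 != "" instead of NF>=6 matters here because the trailing ; adds one empty field to every line, so a line with only five ranks still has NF equal to 6.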