Search code examples
bashshellcsvpad

Pad Independently Missing Columns per Row in CSV with Bash (based off expected values)


I have a CSV file in which the ideal format for a row is this:

taxID#, scientific name, kingdom, k, phylum, p, class, c, order, o, family, f, genus, g

...where kingdom, phylum, etc. are identifiers, literals ("kingdom", ... "phylum"), and the values that follow the identifiers (k, p, etc.) are the actual values for those kingdoms, phyla, etc.

Example:

240395,Rugosa emeljanovi,kingdom,Metazoa,phylum,Chordata,class,Amphibia,order,Anura,family,Ranidae,genus,Rugosa

However, not all rows possess all levels of taxonomy, i.e. any one row might be missing the columns for an identifier/value pair, say, "class, c," and any 2-column PAIR can be missing independently of the other pairs missing or not. Also, if fields are missing, they will always be missing with their identifier field, so I'd never get "kingdom, phylum" together without the value for "k" between them. Thus much of my file is missing random fields:

...
135487,Nocardia cyriacigeorgica,class,Actinobacteria,order,Corynebacteriales,genus,Nocardia
10090,Mus musculus,kingdom,Metazoa,phylum,Chordata,class,Mammalia,order,Rodentia,family,Muridae,genus,Mus
152507,uncultured actinobacterium,phylum,Actinobacteria,class,Actinobacteria
171953,uncultured Acidobacteria bacterium,phylum,Acidobacteria
77133,uncultured bacterium
...

Question: How can I write a bash shell script that can "pad" every row in a file so that every field pair that may be missing from my ideal format is inserted, and its value column that follows is just blank. Desired output:

...
135487,Nocardia cyriacigeorgica,kingdom,,phylum,,class,Actinobacteria,order,Corynebacteriales,family,,genus,Nocardia
10090,Mus musculus,kingdom,Metazoa,phylum,Chordata,class,Mammalia,order,Rodentia,family,Muridae,genus,Mus
152507,uncultured actinobacterium,kingdom,,phylum,Actinobacteria,class,Actinobacteria,order,,family,,genus,
171953,uncultured Acidobacteria bacterium,phylum,Acidobacteria,clas,,order,,family,,genus,
77133,uncultured bacterium,kingdom,,phylum,,class,,order,,family,,genus,
...

Notes:

  • Notice if a genus was missing, the padded output should end with a comma to denote the value of genus doesn't exist.
  • taxID# and scientific name (the first two fields) will ALWAYS be present.
  • I don't care for time/resource efficiency if your solution is brute-forcey.

What I've tried:

  • I wrote a simple if/then script that checks sequentially if an expected field is gone. pseudocode:

    if "$f3" is not "kingdom", pad

    but the problem is that if kingdom was truly missing, it will get padded in output but the remaining field variables will be goofed up and I can't just follow that by saying

    if "$f5" is not "phylum", pad

    because if kingdom were missing, phylum would probably now be in field 3 ($f3), not $f5, that is, if it too weren't missing. (I did this by concatenating onto a string variable the expected output based on the absence of each field, and simply concatenating the original value if the field wasn't missing, and then echoing the finished, supposedly padded row to output).

I'd like to be able to execute my script like this

bash pad.sh prePadding.csv postPadding.csv

but I would accept answers using Mac Excel 2011 if needed.

Thank you!!


Solution

  • This would be the answer in bash using associative arrays:

    #!/bin/bash
    
    declare -A THIS
    while IFS=, read -a LINE; do
      # we always get the #ID and name
      if (( ${#LINE[@]} < 2 || ${#LINE[@]} % 2 )); then
        echo Invalid CSV line: "${LINE[@]}" >&2
        continue
      fi
      echo -n "${LINE[0]},${LINE[1]},"
      THIS=()
      for (( INDEX=2; INDEX < ${#LINE[@]}; INDEX+=2 )); do
        THIS[${LINE[INDEX]}]=${LINE[INDEX+1]}
      done
      for KEY in kingdom phylum class order family; do
        echo -n $KEY,${THIS[$KEY]},
      done
      echo genus,${THIS[genus]}
    done <$1 >$2
    

    It also validates CSV lines so that they contain at least 2 columns (ID and name) and that they have an even number of columns.

    The script can be extended to do more error checking (i.e. if both arguments are passed, if the input exists, etc), but it should work as expected with just the way you posted it.