Tags: bash, csv, arangodb, arangoimport

Bash/XSV: auto-populate empty values in a CSV column


I have a CSV export that I need to map to new values in order to then import it into a different system. I am using ArangoDB to create this data migration mapping.

Below is the full script used:

#!/bin/bash

execute () {
    filepath=$1
    prefix=$2
    keyField=$3
    filename=$(basename "${filepath%.csv}")
    collection="$prefix$filename"
    filepath="/data-migration/$filepath"

    # Check for "_key" column
    if ! xsv headers "$1" | grep -q _key
    # Add "_key" column using the keyfield provided
    then
        xsv select $keyField "$1" | sed -e "1s/$keyField/_key/" > "$1._key"
        xsv cat columns "$1" "$1._key" > "$1.cat"
        mv "$1.cat" "$1"
        rm "$1._key"
    fi

    # Import CSV into Arango Collection
    docker exec arango arangoimp --collection "$collection" --type csv "$filepath" --server.password ''
}

# This single line runs the execute() above
execute 'myDirectory/myFile.csv' prefix_ OLD_ORG_ID__C

So far I've deduced that the $keyField (OLD_ORG_ID__C) parameter passed to the execute() function is used in the if block of the script, which looks for the $keyField column and then migrates its values into a newly created _key column using the XSV toolkit (see the sketch after the table below).

OLD_ORG_ID__C | _key
  A123        ->  A123
  B123        ->  B123
              ->  ##    <-auto populate
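To make that concrete, here is roughly what those three steps produce on a tiny made-up file (the NAME column and these rows are invented for illustration, not my real data):

# myFile.csv (made-up sample)
#   OLD_ORG_ID__C,NAME
#   A123,Acme
#   B123,Bravo
#   ,Charlie

xsv select OLD_ORG_ID__C myFile.csv | sed -e '1s/OLD_ORG_ID__C/_key/' > myFile.csv._key
# myFile.csv._key now holds: _key / A123 / B123 / (empty line)

xsv cat columns myFile.csv myFile.csv._key > myFile.csv.cat
# myFile.csv.cat:
#   OLD_ORG_ID__C,NAME,_key
#   A123,Acme,A123
#   B123,Bravo,B123
#   ,Charlie,        <- empty _key, which is what trips up the import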

Unfortunately, not every row has a value for the OLD_ORG_ID__C column, and as a result the _key for that row is also empty, which then causes the import into Arango to fail.

Note: This _key field is necessary for my AQL scripts to work properly.

How can I rewrite this block to auto-index the blank values?

then
    xsv select $keyField "$1" | sed -e "1s/$keyField/_key/" > "$1._key"
    xsv cat columns "$1" "$1._key" > "$1.cat"
    mv "$1.cat" "$1"
    rm "$1._key"
fi

Is there a better way to solve this issue? Perhaps xsv sort by the keyField and then auto-populate from the blank rows to the end?
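(A very rough sketch of what I mean, assuming I'm reading xsv sort's --select option correctly; the fill step itself is still missing:)

xsv sort -s OLD_ORG_ID__C myFile.csv > myFile.sorted.csv
# the rows with a blank OLD_ORG_ID__C should now be grouped together,
# but something would still need to write generated values into them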

UPDATE: Per the comments/answer, I tried something along these lines, but so far it is still not working:

#!/bin/bash

execute () {
    filepath=$1
    prefix=$2
    keyField=$3
    filename=$(basename "${filepath%.csv}")
    collection="$prefix$filename"
    filepath="/data-migration/$filepath"

    # Check for "_key" column
    if ! xsv headers "$1" | grep -q _key
    # Add "_key" column using the keyfield provided
    then

        awk -F, 'NR==1 { for(i=1; i<=NF;++i) if ($i == "'$keyField'") field=i; print; next }
    $field == "" { $field = "_generated_" ++n }1' $1 > $1-test.csv

    fi

}
# import a single collection if needed
execute 'agas/Account.csv' agas_ OLD_ORG_ID__C

This creates an Account-test.csv file, but unfortunately it does not have the "_key" column or any changes to the OLD_ORG_ID__C values. Preferably, I would only want to see the "_key" values populated with auto-numbered values when OLD_ORG_ID__C is blank; otherwise they should copy the provided value.
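To illustrate what I'm after (the NAME column and rows are made up, and the exact auto-generated value doesn't matter):

OLD_ORG_ID__C,NAME,_key
A123,Acme,A123
B123,Bravo,B123
,Charlie,_generated_1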


Solution

  • If your question is "how can I find from the first header line of a CSV file which field is named OLD_ORG_ID__C, then on subsequent lines put a unique value in this column if it is empty", try something like

    awk -F, -v OFS=, 'NR==1 { for(i=1; i<=NF; ++i) if ($i == "OLD_ORG_ID__C") field=i; print; next }
        $field == "" { $field = "_generated_" ++n } 1' file >newfile
    

    This has no provision for coping with complexities like quoted fields with embedded commas. (I have no idea what xsv is but maybe it would be better equipped for such scenarios?)
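    For example (the file name and sample rows below are made up just to demonstrate the idea), a quick sanity check could look like:

    # toy input with one row missing the key value
    printf 'OLD_ORG_ID__C,NAME\nA123,Acme\n,Charlie\n' > file

    awk -F, -v OFS=, 'NR==1 { for(i=1; i<=NF; ++i) if ($i == "OLD_ORG_ID__C") field=i; print; next }
        $field == "" { $field = "_generated_" ++n } 1' file
    # expected output:
    #   OLD_ORG_ID__C,NAME
    #   A123,Acme
    #   _generated_1,Charlie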

    If I can guess what this code does

    xsv select $keyField "$1" |
    sed -e "1s/$keyField/_key/" > "$1._key"
    

    then probably you could replace it with something like

    xsv select "$keyField" "$1" |
    awk -v field="$keyField" 'NR==1 { $0 = field }
        /^$/ { $0 = NR } 1' >"$1._key"
    

    to replace the first line (the header) with _key and replace any subsequent empty lines with their line number.
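
    Wired back into the original then-block, the whole thing might look roughly like this (an untested sketch that only swaps the sed step for the awk step and leaves the rest of your script alone):

    if ! xsv headers "$1" | grep -q _key
    then
        xsv select "$keyField" "$1" |
        awk 'NR==1 { $0 = "_key" }      # header for the new column
             /^$/  { $0 = NR } 1' > "$1._key"   # blank keys get their line number
        xsv cat columns "$1" "$1._key" > "$1.cat"
        mv "$1.cat" "$1"
        rm "$1._key"
    fi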