I have a CSV export whose values I need to map to new values so that I can import it into a different system. I am using ArangoDB to create this data migration mapping.
Below is the full script used:
#!/bin/bash
execute () {
filepath=$1
prefix=$2
keyField=$3
filename=`basename "${filepath%.csv}"`
collection="$prefix$filename"
filepath="/data-migration/$filepath"
# Check for "_key" column
if ! xsv headers "$1" | grep -q _key
# Add "_key" column using the keyfield provided
then
xsv select $keyField "$1" | sed -e "1s/$keyField/_key/" > "$1._key"
xsv cat columns "$1" "$1._key" > "$1.cat"
mv "$1.cat" "$1"
rm "$1._key"
fi
# Import CSV into Arango Collection
docker exec arango arangoimp --collection "$collection" --type csv "$filepath" --server.password ''
}
# This single line runs the execute() above
execute 'myDirectory/myFile.csv' prefix_ OLD_ORG_ID__C
So far I've deduced that the $keyField (OLD_ORG_ID__C) parameter passed to the execute() function is used in the if block of the script. That block looks for the $keyField column and then copies its values to a newly created _key column using the XSV toolkit.
OLD_ORG_ID__C | _key
A123          -> A123
B123          -> B123
(blank)       -> ##    <- auto-populate
Unfortunately, not every row has a value for the OLD_ORG_ID__C column. As a result the _key for that row is also empty, which then causes the import to Arango to fail.
Note: This _key field is necessary for my AQL scripts to work properly.
How can I rewrite this block to auto-index the blank values?
then
xsv select $keyField "$1" | sed -e "1s/$keyField/_key/" > "$1._key"
xsv cat columns "$1" "$1._key" > "$1.cat"
mv "$1.cat" "$1"
rm "$1._key"
fi
Is there a better way to solve this issue? Perhaps xsv sort by the keyField and then auto-populate the blank rows through to the end?
UPDATE: Per the comments/answer I tried something along these lines, but so far it is still not working:
#!/bin/bash
execute () {
filepath=$1
prefix=$2
keyField=$3
filename=`basename "${filepath%.csv}"`
collection="$prefix$filename"
filepath="/data-migration/$filepath"
# Check for "_key" column
if ! xsv headers "$1" | grep -q _key
# Add "_key" column using the keyfield provided
then
awk -F, 'NR==1 { for(i=1; i<=NF;++i) if ($i == "'$keyField'") field=i; print; next }
$field == "" { $field = "_generated_" ++n }1' $1 > $1-test.csv
fi
}
# import a single collection if needed
execute 'agas/Account.csv' agas_ OLD_ORG_ID__C
This creates an Account-test.csv file, but unfortunately it does not have the "_key" column or any changes to the OLD_ORG_ID__C values. Preferably, I would only want to see the "_key" values populated with auto-numbered values when OLD_ORG_ID__C is blank; otherwise they should copy the provided value.
If your question is "how can I find from the first header line of a CSV file which field is named OLD_ORG_ID__C, then on subsequent lines put a unique value in this column if it is empty", try something like
awk -F, -v OFS=, 'NR==1 { for(i=1; i<=NF;++i) if ($i == "OLD_ORG_ID__C") field=i ; print; next }
$field == "" { $field = "_generated_" (++n) }1' file >newfile
This has no provision for coping with complexities like quoted fields with embedded commas. (I have no idea what xsv is, but maybe it would be better equipped for such scenarios?)
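To make the quoted-field caveat concrete, here is a tiny illustration. The sample record is made up for the demonstration; it has three logical fields, but the middle one contains a comma inside quotes.

```shell
# One record with three logical fields; the middle one has an embedded comma.
line='A123,"Acme, Inc.",active'

# Splitting on bare commas counts four fields instead of three,
# so any column after the quoted one is shifted by one position.
naive_count=$(printf '%s\n' "$line" | awk -F, '{ print NF }')
echo "$naive_count"
```

If your data can contain such values, a plain -F, split will look at the wrong column, and a CSV-aware tool is the safer choice.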
If I can guess what this code does
xsv select $keyField "$1" |
sed -e "1s/$keyField/_key/" > "$1._key"
then probably you could replace it with something like
xsv select "$keyField" "$1" |
awk 'NR==1 { $0 = "_key" }
/^$/ { $0 = NR } 1' >"$1._key"
to replace the first line with _key (as the sed command did) and replace any subsequent empty lines with their line number.
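Putting the two ideas together, here is a minimal awk-only sketch of what I understand the goal to be: append a _key column that copies the key field when it has a value and falls back to a generated value when it is blank. It assumes a simple CSV with no quoted fields containing commas. The function name add_key_column and the "gen_" prefix are hypothetical choices of mine, not anything from your script or from xsv.

```shell
#!/bin/bash
# add_key_column KEYFIELD INPUT OUTPUT   (hypothetical helper name)
# Appends a "_key" column: copies KEYFIELD when it is non-empty, otherwise
# writes a generated placeholder. Assumes no quoted fields with commas.
add_key_column () {
    awk -F, -v OFS=, -v keyField="$1" '
        NR == 1 {
            # Locate the key column by its header name.
            for (i = 1; i <= NF; ++i) if ($i == keyField) field = i
            if (!field) { print "column not found: " keyField > "/dev/stderr"; exit 1 }
            print $0, "_key"
            next
        }
        {
            key = $field
            if (key == "") key = "gen_" (++n)   # "gen_" prefix is an arbitrary choice
            print $0, key
        }
    ' "$2" > "$3"
}

# Example run on a small made-up file.
tmp=$(mktemp -d)
printf 'NAME,OLD_ORG_ID__C\nAcme,A123\nBeta,\nCorp,B123\nDelta,\n' > "$tmp/in.csv"
add_key_column OLD_ORG_ID__C "$tmp/in.csv" "$tmp/out.csv"
cat "$tmp/out.csv"
```

Pick a prefix that cannot occur among the real key values, otherwise a generated key could collide with an existing one.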