linux, bash, shell, ubuntu, aws-cli

How do I prepend a variable to an open file stream when using split to create CSVs?


I have a bash script that takes a large CSV and splits it into smaller CSVs, based on this blog post: https://medium.com/swlh/automatic-s3-file-splitter-620d04b6e81c. It works well and is fast because it never downloads the CSVs, which is great for a Lambda. However, the split CSVs have no header rows; only the originating CSV does. That is a problem for me because Apache PySpark cannot read a set of files where one file has a header row and the rest do not.

I want to add a header row to each CSV that is written.

What the code does

INFILE

  • "s3//test-bucket/test.csv"

OUTFILES - split into 300K-line chunks

  • "s3//dest-test-bucket/test.00.csv"
  • "s3//dest-test-bucket/test.01.csv"
  • "s3//dest-test-bucket/test.02.csv"
  • "s3//dest-test-bucket/test.03.csv"

Original code that works

LINECOUNT=300000
INFILE=s3://"${S3_BUCKET}"/"${FILENAME}"
OUTFILE=s3://"${DEST_S3_BUCKET}"/"${FILENAME%%.*}"

FILES=($(aws s3 cp "${INFILE}" - | split -d -l ${LINECOUNT} --filter "aws s3 cp - \"${OUTFILE}_\$FILE.csv\"  | echo \"\$FILE.csv\""))
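
For context (a local illustration, not part of the original script): GNU split's --filter runs the given command once per chunk, streaming the chunk to that command's stdin and exposing the chunk's would-be output name as $FILE. A minimal sketch with throwaway local file names, no S3 involved:

# Split 10 lines into 3-line chunks; each chunk is piped to the filter,
# which writes it to a local file named after $FILE (x00, x01, ... with -d)
# and echoes the name, mirroring what the S3 version does remotely.
seq 1 10 | split -d -l 3 --filter 'cat > "$FILE.csv"; echo "$FILE.csv"'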

This was my attempt to prepend the header variable to the outgoing file stream, but it did not work.

LINECOUNT=300000
INFILE=s3://"${S3_BUCKET}"/"${FILENAME}"
OUTFILE=s3://"${DEST_S3_BUCKET}"/"${FILENAME%%.*}"

HEADER=$(aws s3 cp "${INFILE}" - | head -n 1)

FILES=($(aws s3 cp "${INFILE}" - | split -d -l ${LINECOUNT} --filter "echo ${HEADER}; aws s3 cp - \"${OUTFILE}_\$FILE.csv\"  | echo \"\$FILE.csv\""))

Attempt 2:

LINECOUNT=300000
INFILE=s3://"${S3_BUCKET}"/"${FILENAME}"
OUTFILE=s3://"${DEST_S3_BUCKET}"/"${FILENAME%%.*}"

HEADER=$(aws s3 cp "${INFILE}" - | head -n 1)

FILES=($(aws s3 cp "${INFILE}" - | split -d -l ${LINECOUNT} --filter "{ echo -n ${HEADER}; aws s3 cp - \"${OUTFILE}_\$FILE.csv\"; } | echo \"\$FILE.csv\""))

The AWS documentation states:

You can use the dash parameter for file streaming to standard input (stdin) or standard output (stdout).

I don't know if this is even possible with an open file stream.
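
For reference, that dash streaming works in both directions and is what the snippets above already rely on; a minimal sketch (hello.txt is just a made-up key):

# Stream an S3 object to stdout (e.g. to peek at the first line)
aws s3 cp "s3://test-bucket/test.csv" - | head -n 1

# Stream stdin up to S3 as a new object
echo "hello" | aws s3 cp - "s3://dest-test-bucket/hello.txt"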


Solution

  • Hope this helps. I think you are only missing the cat aspect of adding the header.

    This article shows one way to split a file and provide the header, using the split command and its --filter argument.

    Applying that snippet to the code above seems to work. Notice the two commands inside the curly braces: echo ${HEADER} and cat. The first, echo, writes the header to stdout; the second, cat, copies the chunk arriving on the filter's stdin to stdout. Both feed the pipe into aws s3 cp -, which creates the new file on S3 with the header prepended.

    HEADER='"Name", "Team", "Position", "Height(inches)", "Weight(lbs)", "Age"'
    
    aws s3 cp ${INFILE} - | split -d -l ${LINECOUNT} --filter "{ \[ "\$FILE" != "x00" \] && echo ${HEADER} ; cat; } | aws s3 cp - \"${OUTFILE}\${FILE}.csv\""
    
    

    After running the command, I observed 3 new files and each file had the desired header.

    
    head -n2 *.csv
    ==> x00.csv <==
    "Name", "Team", "Position", "Height(inches)", "Weight(lbs)", "Age"
    "Adam Donachie", "BAL", "Catcher", 74, 180, 22.99
    
    ==> x01.csv <==
    Name, Team, Position, Height(inches), Weight(lbs), Age
    "John Rheinecker", "TEX", "Starting Pitcher", 74, 230, 27.76
    
    ==> x02.csv <==
    Name, Team, Position, Height(inches), Weight(lbs), Age
    "Chase Utley", "PHI", "Second Baseman", 73, 183, 28.2