We recently received a file to be ingested, the file is PSV format, however, all the fields are padded with extra characters $~$ on the left and right, so the entire PSV is like below:
$~$Field1$~$|$~$Field2$~$|$~$Field3$~$
$~$Data1$~$|$~$Data2$~$|$~$Data3$~$
$~$Data4$~$|$~$Data5$~$|$~$Data6$~$
$~$Data7$~$|$~$Data8$~$|$~$Data9$~$
$~$Data10$~$|$~$Data11$~$|$~$Data12$~$ .....
There are 100 Million rows in the file.
What would be the best way to trim these paddings so as to make it a standard PSV?
Thank you very much, any suggestion/sharing is appreciated here.
UPDATE:
The data is received from SFTP, and uploaded to Hadoop by IT Data Support (Unix Admin), we only have access to Hadoop cluster, but if it is an easy job for Data Support, maybe I can convince them to do the preprocess. Thanks.
tr
might be the faster solution. note, you can pipe any strings, so in this case, I'm cat
ing a file on disk, but this can also be a file stream from sftp.
~/Desktop/test $ cat data.txt
$~$Field1$~$|$~$Field2$~$|$~$Field3$~$
$~$Data1$~$|$~$Data2$~$|$~$Data3$~$
$~$Data4$~$|$~$Data5$~$|$~$Data6$~$
$~$Data7$~$|$~$Data8$~$|$~$Data9$~$
# the '>' will open a new file for writing
~/Desktop/test $ cat data.txt | tr -d \$~\$ > output.psv
# see the results here
~/Desktop/test $ cat output.psv
Field1|Field2|Field3
Data1|Data2|Data3
Data4|Data5|Data6
Data7|Data8|Data9
examples: https://shapeshed.com/unix-tr/#what-is-the-tr-command-in-unix