We have several thousand large (more than 10M lines each) text files of tabular data, produced by a Windows machine, which we need to prepare for upload to a database.
We need to change the encoding of these files from cp1252 to utf-8, replace any bare Unix LF sequences (i.e. \n) with spaces, then replace the DOS line-end sequences ("CR-LF", i.e. \r\n) with Unix line-end sequences (i.e. \n).
The dos2unix utility is not available for this task.
We initially had a bash function that packaged these operations together using iconv and sed, with iconv doing the encoding and sed dealing with the LF/CRLF sequences. I'm trying to replace part of this bash function with a perl command.
Based on some helpful code review, I want to change this function to a perl script.
The author of the code review suggested the following perl to replace CRLF (i.e. "\r\n") with LF ("\n").
perl -g -pe 's/(?<!\r)\n/ /g; s/\r\n/\n/g;'
The explanation for why this is better than what we had previously makes perfect sense, but this line fails for me with:
Unrecognized switch: -g (-h will show valid options).
More interestingly, the author of the code review also suggests it is possible to perform the decode/recode in a perl script, too, but I am completely unsure where to start.
Please can someone explain why the suggested answer fails with Unrecognized switch: -g (-h will show valid options)?
If it helps, the line is supposed to receive piped input from iconv as follows (though I am interested in learning how to use perl to do the decoding/recoding step, too):
iconv --from-code=CP1252 --to-code=UTF-8 "$1" | \
perl -g -pe 's/(?<!\r)\n/ /g; s/\r\n/\n/g;' > "$2"
Example input:
apple|orange|\n|lemon\r\nrasperry|strawberry|mango|\n\r\n
Desired output:
apple|orange| |lemon\nrasperry|strawberry|mango| \n
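Perl v5.36.0 added the command-line switch -g as an alias for -0777 ('gulp' mode, which reads the whole input as a single record). Any earlier perl does not know the switch, which is why you get Unrecognized switch: -g.
This works in Perl v5.36.0 and later: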
s=$(printf "Line 1\nStill Line 1\r\nLine 2\r\nLine 3\r\n")
perl -g -pe 's/(?<!\r)\n/ /g; s/\r\n/\n/g;' <<<"$s"
Prints:
Line 1 Still Line 1
Line 2
Line 3
With any version of perl earlier than v5.36.0, use -0777 instead:
perl -0777 -pe 's/(?<!\r)\n/ /g; s/\r\n/\n/g;' <<<"$s"
# same
BTW, the conversion you are looking for is way easier in this case with awk, since it is close to the defaults.
Just do this:
awk -v RS="\r\n" '{gsub(/\n/," ")} 1' <<<"$s"
Line 1 Still Line 1
Line 2
Line 3
Or, if you have a file:
awk -v RS="\r\n" '{gsub(/\n/," ")} 1' file
This is superior to the posted perl solution since the file is processed record by record (each block of text separated by \r\n) rather than having to read the entire file into memory. Note that a multi-character RS is an extension supported by gawk and mawk; POSIX leaves the behavior unspecified.
(On Windows you may need to do awk -v RS="\r\n" -v ORS="\n" '...')
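To tie this back to the bash function in the question, the awk step can simply take the place of sed after iconv. This is a sketch, not your original function: clean_file is a hypothetical name, the two positional arguments are assumed to be the input and output paths as in your snippet, and awk is assumed to be gawk or mawk.

```shell
# Hypothetical helper: clean_file INPUT OUTPUT
# iconv does the cp1252 -> UTF-8 step; awk handles the line endings.
clean_file() {
  iconv --from-code=CP1252 --to-code=UTF-8 "$1" |
    awk -v RS='\r\n' '{ gsub(/\n/, " ") } 1' > "$2"
}
```

Both stages stream, so memory use should stay roughly constant however long the file is.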
Another note:
You can get similar behavior from Perl with: $/="\r\n" in a BEGIN block; the -l switch, so every line has the input record separator removed; tr for speedy replacement of \n with ' '; and $/="\n", on Windows.
Full command:
perl -lpE 'BEGIN{$/="\r\n"} tr/\n/ /' file