I am trying to create a file with fixed-width columns on Unix. The file contains Russian Cyrillic characters, and those characters are handled differently from ordinary 1-byte characters.
I am using the script below to modify the file (the column delimiter is @-@ and the row delimiter is \r\n):
input_file=$1
output_file=$2
awk -F '@-@' '{printf("%-200s%-200s%-200s%-200s%-200s%-200s%-200s%-200s\r\n", $1, $2, $3, $4, $5, $6, $7, $8)}' "$input_file" > "$output_file"
For columns with ordinary characters, the output columns are correctly 200 characters wide, but for a column with 30 Cyrillic characters, the output column is only 170 characters wide. As a result, the lines in the file do not all have the same length: each Cyrillic character occupies 2 bytes, and the script counts bytes rather than characters.
Example: НИКОЛАЕВНА has 10 characters, but the script counts it as 20 because it occupies 20 bytes.
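The byte/character mismatch can be checked directly from the shell. This is only a minimal sketch, assuming the text is UTF-8 encoded and that a locale named C.UTF-8 is installed (the locale name is an assumption; `locale -a` lists what is actually available):

```shell
word='НИКОЛАЕВНА'

# Byte count: wc -c counts bytes; each Cyrillic letter is 2 bytes in UTF-8.
bytes=$(printf '%s' "$word" | LC_ALL=C wc -c)

# Character count: wc -m counts characters, but only under a UTF-8 locale.
# C.UTF-8 is an assumed locale name; substitute one reported by `locale -a`.
chars=$(printf '%s' "$word" | LC_ALL=C.UTF-8 wc -m)

echo "bytes=$bytes chars=$chars"
```

If the two numbers differ, any tool that pads by byte count will produce columns that are too short for this text.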
An example input line:
НИКОЛАЕВНА@-@russ@-@12345@-@asklle@-@НИКОЛАЕВНА@-@454@-@111@-@asdfg
Can you please suggest a way to create the padding so that all the rows have the same number of characters?
Thank you!
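I don't believe every awk can do this (some implementations, such as mawk, count bytes rather than characters), but gawk should handle multibyte characters by default as long as your locale uses a UTF-8 encoding rather than "C". For example, setting LC_ALL=en_US.UTF-8 should provide the expected behavior with gawk.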
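As a rough sketch of that suggestion, assuming gawk is installed and at least one UTF-8 locale is available: `pad_fields` and `utf8_locale` are hypothetical names introduced here for illustration, and the width is passed in as a parameter so the effect is easy to see with a small value. Stripping the trailing carriage return is an extra step, added on the assumption that the input rows end in \r\n as described in the question.

```shell
# Pick an installed UTF-8 locale rather than hard-coding one.
# en_US.UTF-8 (from the answer above) is only used as a fallback.
utf8_locale=$(locale -a 2>/dev/null | grep -i 'utf-*8$' | head -n 1)
utf8_locale=${utf8_locale:-en_US.UTF-8}

# pad_fields WIDTH: hypothetical helper. Reads @-@-delimited rows on stdin
# and left-justifies each of the 8 fields to WIDTH characters.
pad_fields() {
    LC_ALL=$utf8_locale gawk -F '@-@' -v w="$1" '
    BEGIN { fmt = "%-" w "s" }      # e.g. "%-200s" when w is 200
    {
        sub(/\r$/, "")              # drop the CR of a CRLF line, if present
        for (i = 1; i <= 8; i++)
            printf fmt, $i          # width is counted in characters here
        printf "\r\n"
    }'
}

# Example run with a small width so the padding is visible.
printf '%s\r\n' 'НИКОЛАЕВНА@-@russ@-@12345@-@asklle@-@НИКОЛАЕВНА@-@454@-@111@-@asdfg' |
    pad_fields 12
```

For the original script this would be called as `pad_fields 200 < "$input_file" > "$output_file"`. Note that the rows then have the same number of characters but not the same number of bytes, so any downstream consumer also needs to read the file as UTF-8.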