I'm trying to convert many text files to xls files. The style of the txt file is as follow:
"Name";"Login";"Role"
"Max Muster";"Bla102";"user"
"Heidi Held";"Held100";"admin"
I tried to work with this bash script:
for file in *.txt; do
tr ";" "," < "$file" | paste -d, <(seq 1 $(wc < "$file")) - > "${file%.*}.xls"
soffice --headless --convert-to xls:"MS Excel 95" filename.xls "${file%.*}.xls"
done
with this, I lost the header row and I also get a column with many Chinese signs, but the rest looks okay:
攀挀琀 | Max Muster | Bla102 | user
氀愀猀 | Heidi Held | Held100 | admin
How can I get rid of these Chinese signs and keep the header row?
The question unfortunately does not provide enough details to be sure what exactly the issues are; but we have identified in comments at least the following.
soffice
attempted to read the file as UTF-16, which is what produced the essentially random Chinese characters. (The characters could be anything; it's just more probable that a random Unicode BMP character will be in a Chinese/Japanese block.)With those observations and a refactoring of the existing script, try
for file in *.txt; do
awk -F ';' 'BEGIN { OFS="," }
FNR==1 {
# Add UTF-8 BOM
printf "\357\273\277"
# Generate header line for soffice to discard
for (i=1; i<=NF; i++) printf "bogus%s", (i==NF ? "\n" : OFS)
}
{ sub(/\015/, ""); print FNR, $0 }' "$file" > "${file%.*}.xls"
soffice --headless --convert-to xls:"MS Excel 95" filename.xls "${file%.*}.xls"
done
In so many words, the Awk script splits each input line on semicolons (-F ';'
) and sets the output field separator OFS
to a comma. On the first output line, we add a BOM and a synthetic header line for soffice
to discard before the real output, so that the header line appears like a regular data line in the output. The sub
takes care of removing any DOS carriage return character, and the variable FNR
is the current input line's line number.
I'm not sure if the BOM or the bogus header line are strictly necessary, or if perhaps you need to pass in some additional options to make soffice
treat the input as proper UTF-8. Perhaps you also need to include LC_ALL=C
somewhere in the pipeline.