Search code examples
bashunixlibreoffice

Convert many txt files to xls files with bash script


I'm trying to convert many text files to xls files. The style of the txt file is as follow:

"Name";"Login";"Role"
"Max Muster";"Bla102";"user"
"Heidi Held";"Held100";"admin"

I tried to work with this bash script:

for file in *.txt; do
      tr ";" "," < "$file" | paste -d, <(seq 1 $(wc < "$file")) - > "${file%.*}.xls"
      soffice --headless --convert-to xls:"MS Excel 95" filename.xls "${file%.*}.xls"
done

with this, I lost the header row and I also get a column with many Chinese signs, but the rest looks okay:

攀挀琀 |  Max Muster |  Bla102   |  user
氀愀猀 |  Heidi Held |  Held100  |  admin

How can I get rid of these Chinese signs and keep the header row?


Solution

  • The question unfortunately does not provide enough details to be sure what exactly the issues are; but we have identified in comments at least the following.

    • Apparently, the input file contains DOS carriage returns.
    • Apparently, soffice attempted to read the file as UTF-16, which is what produced the essentially random Chinese characters. (The characters could be anything; it's just more probable that a random Unicode BMP character will be in a Chinese/Japanese block.)

    With those observations and a refactoring of the existing script, try

    for file in *.txt; do
        awk -F ';' 'BEGIN { OFS="," }
            FNR==1 {
                # Add UTF-8 BOM
                printf "\357\273\277"
                # Generate header line for soffice to discard
                for (i=1; i<=NF; i++) printf "bogus%s", (i==NF ? "\n" : OFS)
            }
            { sub(/\015/, ""); print FNR, $0 }' "$file" > "${file%.*}.xls"
        soffice --headless --convert-to xls:"MS Excel 95" filename.xls "${file%.*}.xls"
    done
    

    In so many words, the Awk script splits each input line on semicolons (-F ';') and sets the output field separator OFS to a comma. On the first output line, we add a BOM and a synthetic header line for soffice to discard before the real output, so that the header line appears like a regular data line in the output. The sub takes care of removing any DOS carriage return character, and the variable FNR is the current input line's line number.

    I'm not sure if the BOM or the bogus header line are strictly necessary, or if perhaps you need to pass in some additional options to make soffice treat the input as proper UTF-8. Perhaps you also need to include LC_ALL=C somewhere in the pipeline.