Search code examples
pythonopenoffice-calcxlsxwriter

openoffice calc - newline causes duplicate value in cells (pandas/openpyxl)


Does anyone know how to workaround a problem of OpenOffice Calc not handling new lines in cells correct?

I have a python script that is dynamically generating an excel workbook using openpyxl via pandas.

The script works fine but when I view cells in OpenOffice that contain newlines all the values are duplicated multiple times. If I open the same file using the Microsoft Excel Viewer everything is displayed correctly and if I use a character other than a new line (e.g. comma, #, etc) it displays fine in both also.

I have a workaround to go into the excel and replace the random character using a macro but would like to avoid that if possible as the process really needs to be completely automated. also because the file will be processed by an another internal tool, I do need these cells to be processed with a new line and I can't change the character.

I have also tried using chr(10) and/or chr(13) but in the former case it just get's replaced in the output by '\n' anyway as expected.

The code I'm currently using is similar to:

test_list = []
for x in range(1,18):
    test_list.append([
        "value1",
        "\n".join(['element1', 'element2', 'element3']),
        "value3"
    ])

data_df = pd.DataFrame(test_list)

fn = r'/path/to/excel/file.xlsx'

writer = pd.ExcelWriter(fn, engine='xlsxwriter')
data_df.to_excel(writer, sheet_name='Data', index=False, header=0)
workbook = writer.book
worksheet = writer.sheets['Data']
worksheet.set_column('A:ZZ',50,
                     workbook.add_format({'text_wrap': True}))
writer.save()

What happens with the Element data is that it shows in the OpenOffice Calc cell as something like:

Openoffice Cells

Oddly the last item appears to be correct

The same data viewed as a list or via DataFrame.head() appears fine:

pprint(test_list)
[['value1', 'element1\nelement2\nelement3', 'value3'],
['value1', 'element1\nelement2\nelement3', 'value3'],
['value1', 'element1\nelement2\nelement3', 'value3'],
['value1', 'element1\nelement2\nelement3', 'value3'],
['value1', 'element1\nelement2\nelement3', 'value3'],
['value1', 'element1\nelement2\nelement3', 'value3'],
...
['value1', 'element1\nelement2\nelement3', 'value3']]

data_df.head(18):
         0                             1       2
0   value1  element1\nelement2\nelement3  value3
1   value1  element1\nelement2\nelement3  value3
2   value1  element1\nelement2\nelement3  value3
...
15  value1  element1\nelement2\nelement3  value3
16  value1  element1\nelement2\nelement3  value3

It's just when it get passed to the openpyxl library and viewed in OpenOffice.

Thanks


Solution

  • The code worked fine for me using OpenOffice 4.1.2 on Windows:

    enter image description here

    For this screenshot, I double-clicked on the bottom of the second row to expand it. Before that, it just showed element3 with a red triangle. But that seems different from the behavior you described.

    EDIT:

    Ok, I can now confirm the problem. As you said, it occurs with the mysterious number of 18 items. It looks like a bug in OpenOffice, because there is not much difference in the XML files viewed by unzipping file.xlsx.

    I also tried adding CR and LF directly to the XML files, but this just resulted in:

    enter image description here

    That leaves us with three solutions:

    1. Use LibreOffice instead, which does not have this problem (tested LO 5.1.0.3).
    2. Report the bug and wait for a new version.
    3. Use OpenOffice's preferred .ods format instead of MS Office's preferred format.