Search code examples
pythonexcelpandasexport-to-excelexport-to-csv

Export Pandas data frame with text column containg utf-8 text and URLs to Excel


My Pandas data frame consists of Tweets and meta data of each tweet (300.000 rows). Some of my colleagues need to work with this data in Excel which is why I need to export it.

I wanted to use either .to_csv or .to_excel which are both provided by Pandas but I can't get it to work properly.

When I use .to_csv my problem is that it keeps failing in the text part of the data frame. I've played around with different separators but the file is never 100% aligned. The text column seems to contain tabs, pipe characters etc. which confuses Excel.

df.to_csv('test.csv', sep='\t', encoding='utf-8')

When I try to use .to_excel together with the xlsxwriter engine I'm confronted with a different problem, which is that my text column contains to many URLs (I think). xlswriter tries to make special clickable links of these URLs instead of just handling them as strings. I've found some information on how to circumvent this but, again, I can't get it to work.

The following bit of code should be used to disable the function that I think is causing trouble:

workbook = xlsxwriter.Workbook(filename, {'strings_to_urls': False})

However, when using to_excel I can't seem to adjust this setting of the Workbook object before I load the data frame into the Excel file.

In short how do I export a column with wildly varying text from a Pandas data frame to something that Excel understands?

edit: example:

@geertwilderspvv @telegraaf ach Wilders toch, nep-voorzitter van een nep-partij met maar één lid, \nzeur niet over nep-premier of parlement!

So in this case It is obviously a line brake that is my data. I will try to find some more examples.

edit2:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<recoveryLog xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"><logFileName>error047600_01.xml</logFileName><summary>Er zijn fouten aangetroffen in bestand C:\Users\Guy Mahieu\Desktop\Vu ipython notebook\pandas_simple.xlsx</summary><removedRecords summary="Hier volgt een lijst van verwijderde records:"><removedRecord>Verwijderde records: Formule van het onderdeel /xl/worksheets/sheet1.xml</removedRecord></removedRecords></recoveryLog>

Translation of Dutch stuff:

Errors were found in "file". Here follows a list of removed records: removed records: formula of the part /xl/worksheets/sheet1.xml


Solution

  • I don't think it is currently possible to pass XlsxWriter constructor options via the Pandas API but you can workaround the strings_to_url issue as follows:

    import pandas as pd
    
    df = pd.DataFrame({'Data': ['http://python.org']})
    
    # Create a Pandas Excel writer using XlsxWriter as the engine.
    writer = pd.ExcelWriter('pandas_simple.xlsx', engine='xlsxwriter')
    
    # Don't convert url-like strings to urls.
    writer.book.strings_to_urls = False
    
    # Convert the dataframe to an XlsxWriter Excel object.
    df.to_excel(writer, sheet_name='Sheet1')
    
    # Close the Pandas Excel writer and output the Excel file.
    writer.save()
    

    Update: In recent version of Pandas you can pass XlsxWriter constructor options to ExcelWriter() directly and you do not need to set writer.book.strings_to_urls indirectly:

    writer = pd.ExcelWriter('pandas_example.xlsx',
                            engine='xlsxwriter',
                            engine_kwargs={'options': {'strings_to_urls': False}})
    

    See also Passing XlsxWriter constructor options to Pandas