Search code examples
pythonpandasgdata

Loading a generic Google Spreadsheet in Pandas


When I try to load a Google Spreadsheet in pandas

from StringIO import StringIO  
import requests
r = requests.get('https://docs.google.com/spreadsheet/ccc?key=<some_long_code>&output=csv')
data = r.content
df = pd.read_csv(StringIO(data), index_col=0)

I get the following:

CParserError: Error tokenizing data. C error: Expected 1316 fields in line 73, saw 1386

Why? I would think that one could identify the spreadsheet set of rows and columns with data and use the spreadsheets rows and columns as the dataframe index and columns respectively (with NaN for anything empty). Why does it fail?


Solution

  • This question of mine shows how Getting Google Spreadsheet CSV into A Pandas Dataframe

    As one of the commentators noted you have not asked for the data in CSV format you have the "edit" request at the end of the url You can use this code and see it work on the spreadsheet (which by the way needs to be public..) It is possible to do private sheets as well but that is another topic.

    from StringIO import StringIO  # got moved around in python3 if you're using that.
    
    import requests
    r = requests.get('https://docs.google.com/spreadsheet/ccc?key=0Ak1ecr7i0wotdGJmTURJRnZLYlV3M2daNTRubTdwTXc&output=csv')
    data = r.content
    
    In [10]: df = pd.read_csv(StringIO(data), index_col=0,parse_dates=['Quradate'])
    
    In [11]: df.head()
    Out[11]: 
              City                                            region     Res_Comm  \
    0       Dothan  South_Central-Montgomery-Auburn-Wiregrass-Dothan  Residential   
    10       Foley                              South_Mobile-Baldwin  Residential   
    12  Birmingham      North_Central-Birmingham-Tuscaloosa-Anniston   Commercial   
    38       Brent      North_Central-Birmingham-Tuscaloosa-Anniston  Residential   
    44      Athens                 North_Huntsville-Decatur-Florence  Residential   
    
              mkt_type            Quradate  National_exp  Alabama_exp  Sales_exp  \
    0            Rural 2010-01-15 00:00:00             2            2          3   
    10  Suburban_Urban 2010-01-15 00:00:00             4            4          4   
    12  Suburban_Urban 2010-01-15 00:00:00             2            2          3   
    38           Rural 2010-01-15 00:00:00             3            3          3   
    44  Suburban_Urban 2010-01-15 00:00:00             4            5          4   
    

    The new Google spreadsheet url format for getting the csv output is

    https://docs.google.com/spreadsheets/d/177_dFZ0i-duGxLiyg6tnwNDKruAYE-_Dd8vAQziipJQ/export?format=csv&id
    

    Well they changed the url format slightly again now you need:

    https://docs.google.com/spreadsheets/d/177_dFZ0i-duGxLiyg6tnwNDKruAYE-_Dd8vAQziipJQ/export?format=csv&gid=0 #for the 1st sheet
    

    I also found I needed to do the following to deal with Python 3 a slight revision to the above:

    from io import StringIO 
    

    and to get the file:

    guid=0 #for the 1st sheet
    act = requests.get('https://docs.google.com/spreadsheets/d/177_dFZ0i-duGxLiyg6tnwNDKruAYE-_Dd8vAQziipJQ/export?format=csv&gid=%s' % guid)
    dataact = act.content.decode('utf-8') #To convert to string for Stringio
    actdf = pd.read_csv(StringIO(dataact),index_col=0,parse_dates=[0], thousands=',').sort()
    

    actdf is now a full pandas dataframe with headers (column names)