
pandas.io.ga not working for me


I have worked through the Hello Analytics tutorial and confirmed that OAuth2 works as expected for me, but I'm not having any luck with the pandas.io.ga module. In particular, I am stuck on this error:

In [1]: from pandas.io import ga

In [2]: df = ga.read_ga("pageviews", "pagePath", "2014-07-08")
/usr/local/lib/python2.7/dist-packages/pandas/core/index.py:1162: FutureWarning: using '-' to provide set differences 
with Indexes is deprecated, use .difference()
"use .difference()",FutureWarning)
/usr/local/lib/python2.7/dist-packages/pandas/core/index.py:1147: FutureWarning: using '+' to provide set union with 
Indexes is deprecated, use '|' or .union()
"use '|' or .union()",FutureWarning)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-2-b5343faf9ae6> in <module>()
----> 1 df = ga.read_ga("pageviews", "pagePath", "2014-07-08")

/usr/local/lib/python2.7/dist-packages/pandas/io/ga.pyc in read_ga(metrics, dimensions, start_date, **kwargs)
    105     reader = GAnalytics(**reader_kwds)
    106     return reader.get_data(metrics=metrics, start_date=start_date,
--> 107                            dimensions=dimensions, **kwargs)
    108 
    109 

/usr/local/lib/python2.7/dist-packages/pandas/io/ga.pyc in get_data(self, metrics, start_date, end_date, dimensions, 
segment, filters, start_index, max_results, index_col, parse_dates, keep_date_col, date_parser, na_values, converters, 
sort, dayfirst, account_name, account_id, property_name, property_id, profile_name, profile_id, chunksize)
    293 
    294         if chunksize is None:
--> 295             return _read(start_index, max_results)
    296 
    297         def iterator():

/usr/local/lib/python2.7/dist-packages/pandas/io/ga.pyc in _read(start, result_size)
    287                                         dayfirst=dayfirst,
    288                                         na_values=na_values,
--> 289                                         converters=converters, sort=sort)
    290             except HttpError as inst:
    291                 raise ValueError('Google API error %s: %s' % (inst.resp.status,

/usr/local/lib/python2.7/dist-packages/pandas/io/ga.pyc in _parse_data(self, rows, col_info, index_col, parse_dates, 
keep_date_col, date_parser, dayfirst, na_values, converters, sort)
    313                                   keep_date_col=keep_date_col,
    314                                   converters=converters,
--> 315                                   header=None, names=col_names))
    316 
    317         if isinstance(sort, bool) and sort:

/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.pyc in _read(filepath_or_buffer, kwds)
    237 
    238     # Create the parser.
--> 239     parser = TextFileReader(filepath_or_buffer, **kwds)
    240 
    241     if (nrows is not None) and (chunksize is not None):

/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.pyc in __init__(self, f, engine, **kwds)
    551             self.options['has_index_names'] = kwds['has_index_names']
    552 
--> 553         self._make_engine(self.engine)
    554 
    555     def _get_options_with_defaults(self, engine):

/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.pyc in _make_engine(self, engine)
    694             elif engine == 'python-fwf':
    695                 klass = FixedWidthFieldParser
--> 696             self._engine = klass(self.f, **self.options)
    697 
    698     def _failover_to_python(self):

/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.pyc in __init__(self, f, **kwds)
   1412         if not self._has_complex_date_col:
   1413             (index_names,
-> 1414              self.orig_names, self.columns) = self._get_index_name(self.columns)
   1415             self._name_processed = True
   1416             if self.index_names is None:

/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.pyc in _get_index_name(self, columns)
   1886             # Case 2
   1887             (index_name, columns_,
-> 1888              self.index_col) = _clean_index_names(columns, self.index_col)
   1889 
   1890         return index_name, orig_names, columns

/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.pyc in _clean_index_names(columns, index_col)
   2171                     break
   2172         else:
-> 2173             name = cp_cols[c]
   2174             columns.remove(name)
   2175             index_names.append(name)

TypeError: list indices must be integers, not Index

OAuth2 is working as expected, and I have only used these parameters as demo values; the query itself is junk. I cannot figure out where the error is coming from and would appreciate any pointers.

Thanks!

SOLUTION (SORT OF)

I'm not sure whether this has to do with the data I'm trying to access, but the offending Index type error arises because the index_col variable in pandas.io.ga.GDataReader.get_data() is of type pandas.core.index.Index. It is passed on to pandas.io.parsers._read() from _parse_data(), which then falls over. I don't understand why, but this is the breaking point for me.

As a fix--if anyone else is having this problem--I have edited line 270 of ga.py to:

index_col = _clean_index(list(dimensions), parse_dates).tolist()

and everything is now smooth as butter, but I suspect this may break things in other situations...
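
To illustrate why the .tolist() call helps, here is a minimal, stdlib-only sketch. It is not the pandas source: IndexLike is a hypothetical stand-in for pandas.core.index.Index, and clean_index_names is a simplified guess at what the parser does, assuming it wraps anything that is not a plain list or tuple into a one-element list and then uses each entry as a list subscript.

```python
class IndexLike(object):
    """Hypothetical stand-in for pandas.core.index.Index: iterable, but not a list."""

    def __init__(self, values):
        self._values = list(values)

    def __iter__(self):
        return iter(self._values)

    def tolist(self):
        return list(self._values)


def clean_index_names(columns, index_col):
    """Simplified, hypothetical version of the parser's index-name lookup."""
    # Assumption: anything that is not a plain list/tuple is treated as one scalar.
    if not isinstance(index_col, (list, tuple)):
        index_col = [index_col]
    # Each entry is used as a list subscript, so it must be an integer.
    return [columns[c] for c in index_col]


columns = ["pagePath", "pageviews"]

try:
    clean_index_names(columns, IndexLike([0]))
    failed = False
except TypeError:
    failed = True  # the wrapped object itself was used as a subscript

# Converting to a plain list first yields integer subscripts, so the lookup works.
fixed = clean_index_names(columns, IndexLike([0]).tolist())
```

If that assumption holds, the non-list object is the thing that ends up inside the square brackets, which would match the "list indices must be integers, not Index" message in the traceback.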


Solution

  • Unfortunately, this module isn't really documented, and its errors aren't always meaningful. Pass your account_name, property_name, and profile_name (profile_name is the View in the online version), along with the dimensions and metrics you are interested in. Also make sure that client_secrets.json is in the pandas.io directory. An example:

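    from datetime import datetime

    from pandas.io import ga

    # account_name, property_name, profile_name, start_date, end_date and
    # max_results are placeholders that must be defined beforehand.
    df = ga.read_ga(account_name=account_name,
                    property_name=property_name,
                    profile_name=profile_name,
                    dimensions=['date', 'hour', 'minute'],
                    metrics=['pageviews'],
                    start_date=start_date,
                    end_date=end_date,
                    index_col=0,
                    # Combine the three dimensions into one 'datetime' column.
                    parse_dates={'datetime': ['date', 'hour', 'minute']},
                    date_parser=lambda x: datetime.strptime(x, '%Y%m%d %H %M'),
                    max_results=max_results)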
    

    Also have a look at my recent step-by-step blog post about GA with pandas.
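
The date_parser in the example above can be checked on its own. The sketch below assumes that the date, hour, and minute values reach the parser joined by single spaces, which is what the '%Y%m%d %H %M' format string expects. The function name parse_ga_timestamp and the sample string are made up for illustration.

```python
from datetime import datetime


def parse_ga_timestamp(text):
    """Parse a 'YYYYMMDD HH MM' string into a datetime (hypothetical helper)."""
    return datetime.strptime(text, '%Y%m%d %H %M')


# Sample value in the assumed 'date hour minute' layout.
stamp = parse_ga_timestamp('20140708 13 05')
```

Trying the format string against a sample value like this is a quick way to rule out the date parsing as the source of an error before calling read_ga.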