Search code examples
pythonpandasdata-analysisbigdata

Python. BigData. Need to extract Web-Browser and OS users' preferences from cells. Best-performance method?


So. We have a clean dataframe that was made of messy TSV file like that (special thanks to @unutbu): enter image description here

chunksize = 50000    # the number of rows to be processed per iteration
dfs = []
reader = pd.read_table('data/data.tsv', sep='\t+',header=None, engine='python',
                       iterator=True, chunksize=chunksize)
for df in reader:
    df = df.stack().str.extract(r'([^=]*)=(.*)', expand=True).dropna(axis=0)
    df.columns = ['field', 'value']

    df = df.set_index('field', append=True)
    df.index = df.index.droplevel(level=1)
    df = df['value'].unstack(level=1)

    df = df.fillna('')
    df['user_vhost'] = df[['user_vhost', 'vhost', 'canonized_vhost']].apply(lambda x: ''.join(x), axis=1)
    df['user_ip'] = df[['user_ip', 'ip']].apply(lambda x: ''.join(x), axis=1)
    df.drop('vhost', axis=1, inplace=True)
    df.drop('canonized_vhost', axis=1, inplace=True)
    df.drop('ip', axis=1, inplace=True)
    dfs.append(df)

df = pd.concat(dfs, ignore_index=True)

The 'user_agent' cell looks like:

user_agent=Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36

One thing that comes to mind is the user_agent library. How would you apply (dataframe.apply/applymap?)

user_agent.browser.family

and

user_agent.os.family

to the dataframe(s) in order to make two new separate columns (browser and os) in the DataFrame?


Solution

  • this works, but it's pretty slow:

    df = pd.DataFrame({'user_agent':['Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36']})
    
    # create 10.000 rows DF:
    df = pd.concat([df] * 10**4)
    
    def parse_ua_series(ua):
        p = user_agents.parse(ua)
        return pd.Series([p.browser.family, p.os.family])
    
    
    def parse_ua_list(ua):
        p = user_agents.parse(ua)
        return [p.browser.family, p.os.family]
    

    How to apply:

    In [83]: df.head().user_agent.apply(parse_ua_series).rename(columns={0:'browser', 1:'os'})
    Out[83]:
      browser         os
    0  Chrome  Windows 7
    0  Chrome  Windows 7
    0  Chrome  Windows 7
    0  Chrome  Windows 7
    

    or

    In [85]: df.head().user_agent.apply(parse_ua_list).apply(pd.Series).rename(columns={0:'browser', 1:'os'})
    Out[85]:
      browser         os
    0  Chrome  Windows 7
    0  Chrome  Windows 7
    0  Chrome  Windows 7
    0  Chrome  Windows 7
    0  Chrome  Windows 7
    

    Timings:

    In [80]: %timeit df.user_agent.apply(parse_ua_series).rename(columns={0:'browser', 1:'os'})
    1 loop, best of 3: 5.83 s per loop
    
    In [81]:
    
    In [81]: %timeit df.user_agent.apply(parse_ua_list).apply(pd.Series).rename(columns={0:'browser', 1:'os'})
    1 loop, best of 3: 5.61 s per loop
    
    In [82]:
    
    In [82]: %timeit df.user_agent.apply(lambda x: pd.Series(parse_ua_list(x))).rename(columns={0:'browser', 1:'os'})
    1 loop, best of 3: 5.78 s per loop
    

    UPDATE:

    using only ua_parser will be a little bit faster:

    def parse_ua(ua):
        p = ua_parser.user_agent_parser.Parse(ua)
        return [p.get('os').get('family'), p.get('user_agent').get('family')]
    
    In [103]: %timeit df.user_agent.apply(parse_ua).apply(pd.Series).rename(columns={0:'browser', 1:'os'})
    1 loop, best of 3: 5.28 s per loop
    

    Conclusion:

    the fastest method takes approx. 528 microseconds per row. I.e. it'll take approx. 9 minutes per 1M rows on my notebook