Search code examples
pythonpandasurlparse

Extracting URL parameters into Pandas DataFrame


There is a list containing URL adresses with parameters:

http://example.com/?param1=apple&param2=tomato&param3=carrot
http://sample.com/?param1=banana&param3=potato&param4=berry
http://example.org/?param2=apple&param3=tomato&param4=carrot

Each URL may contain any of 4 parameters.

I want to extract URL parameters and add them into Pandas DataFrame. The DataFrame should have a URL column and 4 columns with parameters. If a parameter is not present in the URL, the cell is empty:

URL    param1    param2     param3    param4
...    apple     tomato     carrot
...    banana               potato    berry
...              apple      tomato    carrot

I was planning to use python built-in urlparse module, which allows to extract parameters easily:

import urlparse
url = 'http://example.com/?param1=apple&param2=tomato&param3=carrot'
par = urlparse.parse_qs(urlparse.urlparse(url).query)
print par['param1'], par['param2']

Out: ['apple'] ['tomato']

With urlparse I can get the list of parameters in URLs:

import pandas as pd

urls = ['http://example.com/?param1=apple&param2=tomato&param3=carrot',
        'http://sample.com/?param1=banana&param3=potato&param4=berry',
        'http://example.org/?param2=apple&param3=tomato&param4=carrot']

df = pd.DataFrame(urls, columns=['url'])
params = [urlparse.parse_qs(urlparse.urlparse(url).query) for url in urls]
print params

Out: [{'param1': ['apple'], 'param2': ['tomato'], 'param3': ['carrot']},
      {'param1': ['banana'], 'param3': ['potato'], 'param4': ['berry']},
      {'param2': ['apple'], 'param3': ['tomato'], 'param4': ['carrot']}]
...

I don't know how to add extracted parameters into the DataFrame. Maybe there is a better way of doing it? The original file is ~1m URLs.


Solution

  • You can use a dictionary comprehension to extract the data in the parameters per parameter. I'm not sure if you wanted the final values in list form. If not, it would be easy to extract it.

    >>> pd.DataFrame({p: [d.get(p) for d in params] 
                      for p in ['param1', 'param2', 'param3', 'param4']})
         param1    param2    param3    param4
    0   [apple]  [tomato]  [carrot]      None
    1  [banana]      None  [potato]   [berry]
    2      None   [apple]  [tomato]  [carrot]
    

    or...

    >>> pd.DataFrame({p: [d[p][0] if p in d else None for d in params] 
                      for p in ['param1', 'param2', 'param3', 'param4']})
       param1  param2  param3  param4
    0   apple  tomato  carrot    None
    1  banana    None  potato   berry
    2    None   apple  tomato  carrot