There is a list containing URL adresses with parameters:
http://example.com/?param1=apple¶m2=tomato¶m3=carrot
http://sample.com/?param1=banana¶m3=potato¶m4=berry
http://example.org/?param2=apple¶m3=tomato¶m4=carrot
Each URL may contain any of 4 parameters.
I want to extract URL parameters and add them into Pandas DataFrame. The DataFrame should have a URL column and 4 columns with parameters. If a parameter is not present in the URL, the cell is empty:
URL param1 param2 param3 param4
... apple tomato carrot
... banana potato berry
... apple tomato carrot
I was planning to use python built-in urlparse module, which allows to extract parameters easily:
import urlparse
url = 'http://example.com/?param1=apple¶m2=tomato¶m3=carrot'
par = urlparse.parse_qs(urlparse.urlparse(url).query)
print par['param1'], par['param2']
Out: ['apple'] ['tomato']
With urlparse I can get the list of parameters in URLs:
import pandas as pd
urls = ['http://example.com/?param1=apple¶m2=tomato¶m3=carrot',
'http://sample.com/?param1=banana¶m3=potato¶m4=berry',
'http://example.org/?param2=apple¶m3=tomato¶m4=carrot']
df = pd.DataFrame(urls, columns=['url'])
params = [urlparse.parse_qs(urlparse.urlparse(url).query) for url in urls]
print params
Out: [{'param1': ['apple'], 'param2': ['tomato'], 'param3': ['carrot']},
{'param1': ['banana'], 'param3': ['potato'], 'param4': ['berry']},
{'param2': ['apple'], 'param3': ['tomato'], 'param4': ['carrot']}]
...
I don't know how to add extracted parameters into the DataFrame. Maybe there is a better way of doing it? The original file is ~1m URLs.
You can use a dictionary comprehension to extract the data in the parameters per parameter. I'm not sure if you wanted the final values in list form. If not, it would be easy to extract it.
>>> pd.DataFrame({p: [d.get(p) for d in params]
for p in ['param1', 'param2', 'param3', 'param4']})
param1 param2 param3 param4
0 [apple] [tomato] [carrot] None
1 [banana] None [potato] [berry]
2 None [apple] [tomato] [carrot]
or...
>>> pd.DataFrame({p: [d[p][0] if p in d else None for d in params]
for p in ['param1', 'param2', 'param3', 'param4']})
param1 param2 param3 param4
0 apple tomato carrot None
1 banana None potato berry
2 None apple tomato carrot