I new to pandas. I have a dataset like this one:
df = pd.DataFrame({'Date':['10/2/2011', '11/2/2011', '12/2/2011', '13/2/2011'],
'Event':['music.example.com', 'poetry.example.com', 'theatre.domain.com', 'comedy.domain.com'],
'Cost':[10000, 5000, 15000, 2000]})
And would like to add a column for "base domain" so I can do aggregate functions on base domain instead of subdomain. In this example there new column would have values
'baseDomain':['example.com', 'example.com', 'domain.com', 'domain.com'],
It should not just blindly split on the "." so should probably use something like tld
although the domains are not as URLs
========== Update
Used adhg and Henry Ecker solution and did it like this:
def get_base_domain(event):
ext = tldextract.extract(event)
return ext.domain + '.' + ext.suffix
df['baseDomain'] = df.apply(lambda x: get_base_domain(x['Event']), axis=1)
you can do this:
def get_base_domain(event):
return event[event.index('.')+1:]
df['baseDomain'] = df.apply(lambda x: get_base_domain(x['Event']), axis=1)
desired result:
Date Event Cost baseDomain
0 10/2/2011 music.example.com 10000 example.com
1 11/2/2011 poetry.example.com 5000 example.com
2 12/2/2011 theatre.domain.com 15000 domain.com
3 13/2/2011 comedy.domain.com 2000 domain.com
adjust get_base_domain
if you have unclean Event domain data