python pandas urlparse

Urlparse applied to a column for extracting length and TLD info


I'm trying to extract length and suffix (tld) from a list of websites in a pandas data frame.

Website.             Label
18egh.com            1
fish.co.uk           0
www.description.com  1
http://world.com     1

My desired output should be

Website              Label  Length  Tld
18egh.com            1      5       com
fish.co.uk           0      4       co.uk
www.description.com  1      11      com
http://world.com     1      5       com

I first tried to get the length, as follows:

def get_domain(df):  
    my_list=[]
    for x in df['Website'].tolist():
          domain = urlparse(x).netloc
          my_list.append(domain)
          df['Domain']  = my_list
          df['Length']=df['Domain'].str.len()
    return df

But when I check, the list is empty. I believe urlparse is probably enough to extract the domain and TLD, but if I'm wrong I'd appreciate it if you could point me in the right direction.


Solution

  • Update:

    To extract the subdomain, domain, and suffix, try tldextract, which does the splitting for you.

    Example:

    import pandas as pd
    import tldextract # pip install tldextract | # conda install -c conda-forge tldextract
    
    df = pd.DataFrame({'Website.': {0: '18egh.com',
      1: 'fish.co.uk',
      2: 'www.description.com',
      3: 'http://world.com',
      4: 'http://forums.news.cnn.com/'},
     'Label': {0: 1, 1: 0, 2: 1, 3: 1, 4: 0}})
    
    df[['subdomain', 'domain', 'suffix']] = df.apply(lambda x: pd.Series(tldextract.extract(x['Website.'])), axis=1)
    
    print(df)
    
                              Website.  Label    subdomain       domain suffix
        0                    18egh.com      1                     18egh    com
        1                   fish.co.uk      0                      fish  co.uk
        2          www.description.com      1          www  description    com
        3             http://world.com      1                     world    com
        4  http://forums.news.cnn.com/      0  forums.news          cnn    com
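
As for why the urlparse attempt in the question came back empty: urlparse only fills in netloc when the string contains "//", so scheme-less entries such as 18egh.com end up entirely in path. The sketch below uses only the standard library; get_netloc is a hypothetical helper name, and prefixing "//" is one common workaround, not the only one.

```python
from urllib.parse import urlparse


def get_netloc(site):
    """Return the host part of `site`, with or without a scheme (hypothetical helper)."""
    # urlparse only fills .netloc when the string contains "//";
    # otherwise the whole string lands in .path and .netloc is "".
    if "//" not in site:
        site = "//" + site
    return urlparse(site).netloc


sites = ["18egh.com", "fish.co.uk", "www.description.com", "http://world.com"]

raw = [urlparse(s).netloc for s in sites]   # plain urlparse, as in the question
fixed = [get_netloc(s) for s in sites]      # with the "//" workaround

print(raw)    # ['', '', '', 'world.com']
print(fixed)  # ['18egh.com', 'fish.co.uk', 'www.description.com', 'world.com']
```

Even with the workaround, the result is still the full host (including www. and the suffix), so splitting it into domain and suffix still needs something like tldextract.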
    

    Original answer below


    Try:

    import pandas as pd
    
    df = pd.DataFrame({'Website.': {0: '18egh.com',
      1: 'fish.co.uk',
      2: 'www.description.com',
      3: 'http://world.com'},
     'Label': {0: 1, 1: 0, 2: 1, 3: 1}})
    
    pattern = r'(?:https?:\/\/|www\.|https?:\/\/www\.)?(.*?)\.'
    
    df['Domain'] = df['Website.'].str.extract(pattern)
    df['Domain_Len'] = df['Domain'].str.len()
    
    print(df)
    
        Website.             Label  Domain          Domain_Len
    0   18egh.com            1      18egh           5
    1   fish.co.uk           0      fish            4
    2   www.description.com  1      description     11
    3   http://world.com     1      world           5
    

    Alternatively:

    pattern = r'(?:https?:\/\/|www\.|https?:\/\/www\.)?(.*?)\.(.*?)$'
    
    df[['Domain', 'TLD']] = df['Website.'].str.extract(pattern, expand=True)
    df['TLD_Len'] = df['TLD'].str.len()
    df['Domain_Len'] = df['Domain'].str.len()
    
    print(df)
    
        Website.             Label  TLD     TLD_Len     Domain       Domain_Len
    0   18egh.com            1      com     3           18egh        5
    1   fish.co.uk           0      co.uk   5           fish         4
    2   www.description.com  1      com     3           description  11
    3   http://world.com     1      com     3           world        5
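
One caveat with the regex approach: it splits at the first dot after the optional prefix, so it works for the four sample rows but not for URLs with extra subdomains, a trailing slash, or both a scheme and www. A small sketch with the standard re module, reusing the two-group pattern above (split_with_regex is a hypothetical helper name):

```python
import re

# Two-group pattern from the regex alternative above.
pattern = re.compile(r'(?:https?:\/\/|www\.|https?:\/\/www\.)?(.*?)\.(.*?)$')


def split_with_regex(site):
    """Return (domain, tld) as captured by the pattern (hypothetical helper)."""
    match = pattern.search(site)
    return match.group(1), match.group(2)


# Works for the sample data:
print(split_with_regex("fish.co.uk"))           # ('fish', 'co.uk')
print(split_with_regex("www.description.com"))  # ('description', 'com')

# Breaks on deeper hosts, trailing slashes, and scheme + www:
print(split_with_regex("http://forums.news.cnn.com/"))  # ('forums', 'news.cnn.com/')
print(split_with_regex("http://www.example.com"))       # ('www', 'example.com')
```

If the real data contains such URLs, the tldextract version is likely the safer choice.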