
Extracting a website name from a URL in Python


I want to extract website names from URLs. For example, https://plus.google.com/in/test.html should give the output "plus google".

Some more test cases:

  1. WWW.OH.MADISON.STORES.ADVANCEAUTOPARTS.COM/AUTO_PARTS_MADISON_OH_7402.HTML

Output: OH MADISON STORES ADVANCEAUTOPARTS

  2. WWW.LQ.COM/LQ/PROPERTIES/PROPERTYPROFILE.DO?PROPID=6054

Output: LQ

  3. WWW.LOCATIONS.DENNYS.COM

Output: LOCATIONS DENNYS

  4. WV.WESTON.STORES.ADVANCEAUTOPARTS.COM

Output: WV WESTON STORES ADVANCEAUTOPARTS

  5. WOODYANDERSONFORDFAYETTEVILLE.NET/

Output: WOODYANDERSONFORDFAYETTEVILLE

  6. WILMINGTONMAYFAIRETOWNCENTER.HGI.COM

Output: WILMINGTONMAYFAIRETOWNCENTER HGI

  7. WHITEHOUSEBLACKMARKET.COM/

Output: WHITEHOUSEBLACKMARKET

  8. WINGATEHOTELS.COM

Output: WINGATEHOTELS

string = str(input("Enter the url "))
new_list = list(string)
count=0
flag=0

if 'w' in new_list:
    index1 = new_list.index('w')
    new_list.pop(index1)
    count += 1
if 'w' in new_list:
    index2 = new_list.index('w')
    if index2 != -1 and index2 == index1:
        new_list.pop(index2)
        count += 1
if 'w' in new_list:
    index3= new_list.index('w')
    if index3!= -1 and index3== index2 and new_list[index3+1]=='.':
        new_list.pop(index3)
        count+=1      
        flag = 1
if flag == 0:
    start = string.find('/')
    start += 2
    end = string.rfind('.')

    new_string=string[start:end]
    print(new_string)
elif flag == 1:
    start = string.find('.')
    start = start + 1
    end = string.rfind('.')

    new_string=string[start:end]
    print(new_string)

The code above works for some test cases but not all. How can I make it handle all of them?

Thanks


Solution

  • This is something you could build upon, using urllib.parse.urlparse:

    from urllib.parse import urlparse
    
    tests = ('https://plus.google.com/in/test.html',
             ('WWW.OH.MADISON.STORES.ADVANCEAUTOPARTS.COM/'
              'AUTO_PARTS_MADISON_OH_7402.HTML'),
             'WWW.LQ.COM/LQ/PROPERTIES/PROPERTYPROFILE.DO?PROPID=6054')
    
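    def extract(url):
        # urlparse only fills in netloc when the URL has a scheme, so add
        # one if it is missing (checked case-insensitively, and with '://'
        # so that a host like 'httpbin.org' is not mistaken for a scheme)
        if not url.lower().startswith(('http://', 'https://')):
            url = 'http://' + url
        host = urlparse(url).netloc
        # drop the last label; this assumes a one-label TLD such as .com
        labels = host.split('.')[:-1]
        if labels and labels[0].lower() == 'www':
            labels = labels[1:]
        return ' '.join(labels)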
    
    for url in tests:
        print(extract(url))
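
To check this approach against every example in the question, a small table-driven test along the following lines might help. This is only a sketch: `site_name` and `CASES` are names made up here, the function assumes a one-label TLD (so something like `.co.uk` would not be stripped correctly), and the expected value for the WOODYANDERSONFORDFAYETTEVILLE case assumes the missing "D" in the question's output is a typo.

```python
from urllib.parse import urlparse

# (url, expected) pairs taken from the question.
CASES = [
    ("https://plus.google.com/in/test.html", "plus google"),
    ("WWW.OH.MADISON.STORES.ADVANCEAUTOPARTS.COM/AUTO_PARTS_MADISON_OH_7402.HTML",
     "OH MADISON STORES ADVANCEAUTOPARTS"),
    ("WWW.LQ.COM/LQ/PROPERTIES/PROPERTYPROFILE.DO?PROPID=6054", "LQ"),
    ("WWW.LOCATIONS.DENNYS.COM", "LOCATIONS DENNYS"),
    ("WV.WESTON.STORES.ADVANCEAUTOPARTS.COM", "WV WESTON STORES ADVANCEAUTOPARTS"),
    # Assumption: the question's expected output dropped the "D" by mistake.
    ("WOODYANDERSONFORDFAYETTEVILLE.NET/", "WOODYANDERSONFORDFAYETTEVILLE"),
    ("WILMINGTONMAYFAIRETOWNCENTER.HGI.COM", "WILMINGTONMAYFAIRETOWNCENTER HGI"),
    ("WHITEHOUSEBLACKMARKET.COM/", "WHITEHOUSEBLACKMARKET"),
    ("WINGATEHOTELS.COM", "WINGATEHOTELS"),
]


def site_name(url):
    """Return the host labels of url, minus a leading 'www' and the last label."""
    # urlparse leaves netloc empty when there is no scheme, so add one.
    if not url.lower().startswith(("http://", "https://")):
        url = "http://" + url
    labels = urlparse(url).netloc.split(".")
    # Strip a leading 'www' in any letter case.
    if labels[0].lower() == "www":
        labels = labels[1:]
    # Drop the final label (assumed to be a one-label TLD).
    return " ".join(labels[:-1])


# Collect any cases where the result differs from the expected output.
failures = [(url, site_name(url), expected)
            for url, expected in CASES
            if site_name(url) != expected]

for url, got, expected in failures:
    print(f"FAIL {url}: got {got!r}, expected {expected!r}")
print(f"{len(CASES) - len(failures)} of {len(CASES)} cases passed")
```

Handling multi-label suffixes properly would need a list of public suffixes, which the standard library does not provide.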