Search code examples
pythonpandaspython-re

How to remove string after .com and "https://" from an URL in Python


I need to merge two dataframe by using url as a primary key. However, there are some extra strings in the url like in df1, I have https://www.mcdonalds.com/us/en-us.html, where in df2, I have https://www.mcdonalds.com

I need to remove the /us/en-us.html after the .com and the https:// from the url, so I can perform the merge using url between 2 dfs. Below is a simplified example. What would be the solution for this?

df1={'url': ['https://www.mcdonalds.com/us/en-us.html','https://www.cemexusa.com/find-your- 
location']}
df2={'url':['https://www.mcdonalds.com','www.cemexusa.com']}

df1['url']==df2['url']
Out[7]: False

Thanks.


Solution

  • You can use urlparse as suggested by others, or you could also use urlsplit. However, both will not handle www.cemexusa.com. So if you do not need the scheme in your key, you could use something like this:

    def to_key(url):
        if "://" not in url:  # or: not re.match("(?:http|ftp|https)://"", url)
            url = f"https://{url}"
        return urlsplit(url).hostname
    
    df1["Key"] = df1["URL"].apply(to_key)
    

    Here is a full working example:

    import pandas as pd
    import io
    
    from urllib.parse import urlsplit
    
    df1_data = io.StringIO("""
    URL,Description
    https://www.mcdonalds.com/us/en-us.html,Junk Food
    https://www.cemexusa.com/find-your-location,Cemex
    """)
    
    df2_data = io.StringIO("""
    URL,Last Update
    https://www.mcdonalds.com,2021
    www.cemexusa.com,2020
    """)
    
    df1 = pd.read_csv(df1_data)
    df2 = pd.read_csv(df2_data)
    
    def to_key(url):
        if "://" not in url:  # or: not re.match("(?:http|ftp|https)://"", url)
            url = f"https://{url}"
        return urlsplit(url).hostname
        
    df1["Key"] = df1["URL"].apply(to_key)
    df2["Key"] = df2["URL"].apply(to_key)
    
    joined = df1.merge(df2, on="Key", suffixes=("_df1", "_df2"))
    
    # and if you want to get rid of the original urls
    joined = joined.drop(["URL_df1", "URL_df2"], axis=1)
    

    The output of print(joined) would be:

      Description                Key  Last Update
    0   Junk Food  www.mcdonalds.com         2021
    1       Cemex   www.cemexusa.com         2020
    

    There may be other special cases not handled in this answer. Depending on your data, you may also need to handle an omitted www:

    urlsplit("https://realpython.com/pandas-merge-join-and-concat").hostname
    # realpython.com
    
    urlsplit("https://www.realpython.com").hostname  # also a valid URL
    # www.realpython.com
    

    What is the difference between urlparse and urlsplit?

    It depends on your use case and what information you would like to extract. Since you do not need the URL's params, I would suggest using urlsplit.

    [urlsplit()] is similar to urlparse(), but does not split the params from the URL. https://docs.python.org/3/library/urllib.parse.html#urllib.parse.urlsplit