python urllib

Python: how to parse 2 URLs from a string and then map them back?


I have a column in a pandas DataFrame where some of the values are in this format: "From https://....com?gclid=... to https://...com". I would like to clean up only the first URL so that the gclid and other IDs vanish, and then write the result back into the DataFrame, e.g.: "From https://....com to https://...com"

I know that there is a Python module called urllib, but if I apply it to this string and call path() on the result, it only parses the first URL and I lose the rest of the string, which is just as important as the first part.

Could somebody please help me? Thank you!


Solution

  • If you are working with a DataFrame, use .str.replace(), which can take a regex to find text like "?.... " (text that starts with ? and ends at a space; in other words, a ? followed by one or more characters other than a space: '\?[^ ]+')

    import pandas as pd
    
    df = pd.DataFrame({'text': ["From https://....com?gclid=... to https://...com"]})
    
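    # regex=True is required: since pandas 2.0 the pattern is treated as a literal string by default
    df['text'] = df['text'].str.replace(r'\?[^ ]+', '', regex=True)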
    

    Result

                                         text
    0  From https://....com to https://...com
    

    BTW: you can also use a more complex regex to make sure the ?... part belongs to a URL that starts with http.

    df['text'] = df['text'].str.replace(r'(http[^?]+)\?[^ ]+', '\\1', regex=True)
    

    I use (...) to capture the URL before ?... and I put it back using \\1 (now without the ?... part)
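
Since the question mentions urllib, here is a minimal sketch of an alternative that combines a regex with urllib.parse, in case only the first URL should be cleaned and the second one left untouched. It is not part of the original answer. The names URL_PATTERN, strip_query and clean_first_url are made up for illustration, the example.com / example.org URLs are placeholders, and the sketch assumes that a URL is any run of non-whitespace characters starting with http:// or https://.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Assumption: a URL is any run of non-whitespace characters
# that starts with http:// or https://
URL_PATTERN = re.compile(r"https?://\S+")


def strip_query(match):
    """Rebuild the matched URL without its query string and fragment."""
    parts = urlsplit(match.group(0))
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def clean_first_url(text):
    """Clean only the first URL in text; everything else is kept as-is."""
    return URL_PATTERN.sub(strip_query, text, count=1)


# Placeholder data, not taken from the question
sample = "From https://example.com/page?gclid=abc123 to https://example.org/other?x=1"
print(clean_first_url(sample))
```

To apply it to the column, something like df['text'] = df['text'].map(clean_first_url, na_action='ignore') should work, assuming the non-missing values are all strings. Dropping count=1 would clean every URL in the string, which is closer to what the regex-only version above does.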