I have a column in a pandas dataframe where some of the values are in this format: "From https://....com?gclid=... to https://...com". I would like to strip the query string from the first URL, so that the gclid and other IDs vanish, and write the result back into the dataframe, e.g.: "From https://....com to https://...com"
I know that there is a Python module called urllib, but if I apply it to this string and take the path, it only parses the first URL and I lose the rest of the string, which is just as important as the first part.
Could somebody please help me? Thank you!
If you use a DataFrame, then use .str.replace(), which can take a regex to find text like "?.... " (text which starts with ? and ends at a space, or, put differently, which starts with ? and contains only characters other than a space): '\?[^ ]+'
import pandas as pd
df = pd.DataFrame({'text': ["From https://....com?gclid=... to https://...com"]})
# regex=True is needed in pandas 2.0+, where str.replace matches literally by default
df['text'] = df['text'].str.replace(r'\?[^ ]+', '', regex=True)
Result
text
0 From https://....com to https://...com
BTW: you can also try a more complex regex to make sure the match is part of a URL which starts with http.
df['text'] = df['text'].str.replace(r'(http[^?]+)\?[^ ]+', '\\1', regex=True)
I use (...) to capture the URL before the ?... part, and I put it back using \\1 (now without the ?... part).
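Since the question mentions urllib, here is a minimal sketch of an alternative that lets urllib.parse do the URL handling while a regex only locates the URLs inside the text. It assumes a URL is any run of non-space characters starting with http:// or https://, and that every value in the column is a string. The example URLs and the helper names strip_query and strip_queries are made up for illustration.

```python
import re
from urllib.parse import urlsplit, urlunsplit

import pandas as pd

# Assumption: a URL is any run of non-space characters starting with http:// or https://
URL_PATTERN = re.compile(r'https?://\S+')

def strip_query(match):
    # Hypothetical helper: rebuild one matched URL without its query string and fragment
    parts = urlsplit(match.group(0))
    return urlunsplit((parts.scheme, parts.netloc, parts.path, '', ''))

def strip_queries(text):
    # Apply strip_query to every URL found in the text; the rest of the text is untouched
    return URL_PATTERN.sub(strip_query, text)

# Made-up sample data in the same shape as the question's strings
df = pd.DataFrame({'text': ["From https://example.com/page?gclid=abc123&id=7 to https://example.org/home"]})
df['text'] = df['text'].map(strip_queries)
print(df['text'][0])
```

Unlike the regex-only versions above, this cleans every URL in the string, not only text after the first ?, and it also drops #fragments. If the column can contain missing values, they would need to be filtered out or handled before calling map.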