I'm trying to extract two features from df['http_path'] and enrich the data with them as new columns. The problem is the ? separator I split on: not every row has one. I want to fill in NaN wherever no value was recorded for an event (row), so that the rows can still be processed further. Later I will replace the NaN for the events that have no info and iterate over the rows. To avoid repeated events, I want to keep the events that have both pieces of info, A and B, and concat the result to df. I tried the following code:
import numpy as np
import pandas as pd

# example value: http_path = 'https://example.org/path/to/file?param=42#fragment'
# general shape: http_path = ...A?B
# new columns extracted from single column http_path
#api = A or /path/to/file
#param = B or param=42
http_path = df.http_path.str.split('?', n=1)  # split on the first ? separator only
api_param_df = pd.DataFrame([row if len(row) == 2 else row+[np.nan] for row in http_path.values], columns=["api", "param"])
df = pd.concat([df, api_param_df], axis=1)
Below is the example:
http_path | API URL | URL parameters |
---|---|---|
https://example.org/path/to/file?param=42#fragment | path/to/file | param=42#fragment |
https://example.org/path/to/file | path/to/file | NaN |
Is there any elegant way to do this?
You can use str.extract with the regex (?:https?://[^/]+/)?(?P<api>[^?]+)\??(?P<param>.+)?:
df = pd.DataFrame({'http_path': ['https://example.org/path/to/file?param=42#fragment', 'https://example.org/path/to/file']})
df
# http_path
#0 https://example.org/path/to/file?param=42#frag...
#1 https://example.org/path/to/file
df.http_path.str.extract(r'(?:https?://[^/]+/)?(?P<api>[^?]+)\??(?P<param>.+)?')
# api param
#0 path/to/file param=42#fragment
#1 path/to/file NaN
where, in the regex pattern:
(?:https?://[^/]+/)? optionally matches the domain but doesn't capture it
(?P<api>[^?]+) matches everything up to ?
\? matches ? literally
(?P<param>.+) matches everything after ?
Notice we also make \? and the second capture group optional, so that when there are no query parameters in the http path, it returns NaN.
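If it helps to check the pattern outside pandas, here is a minimal sketch that runs the same regex through Python's built-in re module on the two example URLs. The helper name split_http_path is made up for illustration and is not part of pandas. The pattern is written as a raw string so that \? is passed to the regex engine as-is.

```python
import re

# Same pattern as above, as a raw string so the backslash in \? is kept.
PATTERN = re.compile(r'(?:https?://[^/]+/)?(?P<api>[^?]+)\??(?P<param>.+)?')


def split_http_path(http_path):
    """Hypothetical helper: return the api and param groups for one URL string."""
    match = PATTERN.search(http_path)
    if match is None:
        return {'api': None, 'param': None}
    # A named group that did not take part in the match comes back as None;
    # this is the case that str.extract reports as NaN.
    return match.groupdict()


with_query = split_http_path('https://example.org/path/to/file?param=42#fragment')
without_query = split_http_path('https://example.org/path/to/file')

print(with_query)     # {'api': 'path/to/file', 'param': 'param=42#fragment'}
print(without_query)  # {'api': 'path/to/file', 'param': None}
```

The named groups api and param are what become the column names in the str.extract result, so the dictionary keys here line up with the columns shown earlier.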