python regex pandas dataframe feature-extraction

How can I add new columns to a DataFrame while filling in NaN and avoiding empty events?

I'm trying to extract two features from df['http_path'] and add them to the DataFrame as new columns. I split on the ? separator. When an event/row has no value recorded after the ?, I want NaN there so the rows can still be processed later; I will then handle the NaN for the events that have no info and iterate over the rows. To avoid repeated events, I want to keep the events that have info A and B and concat the result to df. I tried the following code:

http_path = https://example.org/path/to/file?param=42#fragment
#http_path = ...A?B            ^^^^^^^^^^^^^ ^^^^^^^^

# new columns extracted from single column http_path
#api = A or /path/to/file
#param = B or param=42

http_path = df.http_path.str.split('?', n=1)   # split on the first ? separator only
api_param_df = pd.DataFrame([row if len(row) == 2 else row+[np.nan] for row in http_path.values], columns=["api", "param"])
df = pd.concat([df, api_param_df], axis=1)

Below is the example:

http_path	API URL	URL parameters
https://example.org/path/to/file?param=42#fragment	path/to/file	param=42#fragment
https://example.org/path/to/file	path/to/file	NaN

Is there any elegant way to do this?

Solution

You can use str.extract with regex (?:https?://[^/]+/)?(?P<api>[^?]+)\??(?P<param>.+)?:

df = pd.DataFrame({'http_path': ['https://example.org/path/to/file?param=42#fragment', 'https://example.org/path/to/file']})
df
#                                           http_path
#0  https://example.org/path/to/file?param=42#frag...
#1                   https://example.org/path/to/file

df.http_path.str.extract(r'(?:https?://[^/]+/)?(?P<api>[^?]+)\??(?P<param>.+)?')

#            api              param
#0  path/to/file  param=42#fragment
#1  path/to/file                NaN

where in regex pattern:

(?:https?://[^/]+/)? optionally matches domain but doesn't capture it
(?P<api>[^?]+) matches everything up to ?
\? matches ? literally
(?P<param>.+) matches everything after ?

Notice that both \? and the second capture group are optional, so when the http path has no query parameters, param comes back as NaN.
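
To get the two new columns into the original DataFrame, as the question asks, the result of str.extract can be joined back onto df. The following is a minimal sketch using the same sample data as above; the variable names pattern and extracted are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "http_path": [
        "https://example.org/path/to/file?param=42#fragment",
        "https://example.org/path/to/file",
    ]
})

# Raw string, so the backslash in \? reaches the regex engine unchanged.
pattern = r"(?:https?://[^/]+/)?(?P<api>[^?]+)\??(?P<param>.+)?"

# str.extract returns a DataFrame with one column per named group,
# carrying the same index as df.
extracted = df["http_path"].str.extract(pattern)

# Join on the index to append the api and param columns to df.
df = df.join(extracted)
```

Because the extracted frame shares its index with df, the join lines the rows up without a manual list comprehension, and rows without a query string keep NaN in param.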