python regex pandas dataframe feature-extraction

How can I add new columns to a DataFrame while filling in NaN and avoiding empty events?


I'm trying to extract two features from df['http_path'] and add them to the DataFrame as new columns. The two parts are separated by ?, so I split on that separator. Where an event (row) has no recorded value, I want NaN in its place so the rows can be processed further. I will then replace the NaN for the events that carry no information and iterate over the rows. To avoid repeated events, I want to keep the events that have both A and B and concat the new columns to df. I tried the following code:

import numpy as np
import pandas as pd

# http_path = https://example.org/path/to/file?param=42#fragment
# http_path = ...A?B             ^^^^^^^^^^^^^ ^^^^^^^^

# New columns extracted from the single column http_path:
#   api   = A, i.e. /path/to/file
#   param = B, i.e. param=42

# Split on the first ? separator only
http_path = df.http_path.str.split('?', n=1)
api_param_df = pd.DataFrame(
    [row if len(row) == 2 else row + [np.nan] for row in http_path.values],
    columns=["api", "param"],
    index=df.index,  # keep rows aligned with df for the concat below
)
df = pd.concat([df, api_param_df], axis=1)

Below is the example:

http_path                                            API URL        URL parameters
https://example.org/path/to/file?param=42#fragment   path/to/file   param=42#fragment
https://example.org/path/to/file                     path/to/file   NaN

Is there any elegant way to do this?


Solution

  • You can use str.extract with the regex (?:https?://[^/]+/)?(?P<api>[^?]+)\??(?P<param>.+)?:

    df = pd.DataFrame({'http_path': ['https://example.org/path/to/file?param=42#fragment', 'https://example.org/path/to/file']})
    df
    #                                           http_path
    #0  https://example.org/path/to/file?param=42#frag...
    #1                   https://example.org/path/to/file
    
    # Raw string, so that \? reaches the regex engine as an escaped question mark
    df.http_path.str.extract(r'(?:https?://[^/]+/)?(?P<api>[^?]+)\??(?P<param>.+)?')
    
    #            api              param
    #0  path/to/file  param=42#fragment
    #1  path/to/file                NaN
    

    where, in the regex pattern:

    • (?:https?://[^/]+/)? optionally matches the scheme and domain but doesn't capture them
    • (?P<api>[^?]+) captures everything up to the first ? as api
    • \? matches ? literally
    • (?P<param>.+) captures everything after the ? as param

    Note that both \? and the second capture group are made optional, so when the http path has no query parameters, param comes back as NaN.
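
To get from the extracted columns back to the enriched DataFrame the question asks for, one option is to join the result onto df and then either fill or drop the rows with missing parameters. The following is a minimal sketch using the sample df from above; the names pattern, enriched, filled, and complete, as well as the empty-string placeholder, are illustrative choices and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "http_path": [
        "https://example.org/path/to/file?param=42#fragment",
        "https://example.org/path/to/file",
    ]
})

pattern = r"(?:https?://[^/]+/)?(?P<api>[^?]+)\??(?P<param>.+)?"

# str.extract returns a frame with the same index as df,
# so join lines the new columns up with the original rows.
extracted = df["http_path"].str.extract(pattern)
enriched = df.join(extracted)

# Option 1: keep every row and replace missing parameters with a placeholder
# (the empty string here is an assumption; pick whatever suits later processing).
filled = enriched.fillna({"param": ""})

# Option 2: keep only the rows where both api and param were found.
complete = enriched.dropna(subset=["api", "param"])

print(enriched)
print(complete)
```

Because the join is index-based, this avoids building the rows by hand and should stay aligned even if df does not have a default integer index.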