Search code examples
pysparkurl-parametersurl-parsing

What is the best practice to remove url parameters from web request log in PySpark?


I'm a newbie in PySpark and I want to exclude/remove the URL parameters from the raw weblog that existed in the spark dataframe. The nature of data is the following:

+----------------------------------------------------------------------------------------+
|weblog                                                                                  |
+----------------------------------------------------------------------------------------+
|[03/Oct/2021:09:26:37 +0000]                                                            |
|SsAzIiWuV1Bw9CtthtxTtav8VdmP3N2jkJ/ZTsx6u8ATOC8HFwxKYmWwMrwl6t7heGKU7+Q==               |     
|user_ZwfikI/2BdNcrhkwWai/bh+zX66co70YwGKAigzuLTW4khCvc1LLmFN1aBH7K0Loq8g==              |
|"HEAD /xxxx/pub/ping?xxxx-client=005 HTTP/1.1" 200 "-b" 53b 7ms                         |
|"Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)"                                    |
|WepX20WkyvTydOpOuk/IDIVsxN+4zOZbRzng== 50000 - -                                        |
+----------------------------------------------------------------------------------------+
|[03/Oct/2021:00:19:24 +0000]                                                            |
|W+APDZiRZIOjc/gmklDpL95WFxwkMRGthMXLnLDxbNZ6qZA== xxxxx.xxx.xxxx.corp                   | 
|"GET /xxxx/d5d/data/v10/notification_events/NotifcationEventCollection?                 |
|$format=json&$filter=%20%20%2%20%20StartDate%20eq%20datetime'2021-03-                   |
|24T00:15:05'%20and%20substringof('dude',SystemRoles)&$expand=MailLog&$skiptoken=3701%20 |
|HTTP/1.1" 200 "-b" 7273b 391ms "python-requests/2.25.1" soso80-emea.xxxx.corp 50001 - - |                                                                               
+----------------------------------------------------------------------------------------+

so I want to remove whatever immediately after ? as below:

+----------------------------------------------------------------------------------------+
|this part should be removed from weblog                                                 |                      
+----------------------------------------------------------------------------------------+
|xxxx-clientt=005                                                                        |
+----------------------------------------------------------------------------------------+
|$format=json&$filter=%20%20%2%20%20StartDate%20eq%20datetime'2021-03-                   |
|24T00:15:05'%20and%20substringof('dude',SystemRoles)&$expand=MailLog&$skiptoken=3701%20 |                                       
+----------------------------------------------------------------------------------------+

My expected output is like this:

+----------------------------------------------------------------------------------------+
|weblog                                                                                  |
+----------------------------------------------------------------------------------------+
|[03/Oct/2021:09:26:37 +0000]                                                            |
|SsAzIiWuV1Bw9CtthtxTtav8VdmP3N2jkJ/ZTsx6u8ATOC8HFwxKYmWwMrwl6t7heGKU7+Q==               |     
|user_ZwfikI/2BdNcrhkwWai/bh+zX66co70YwGKAigzuLTW4khCvc1LLmFN1aBH7K0Loq8g==              |
|"HEAD /xxxx/pub/ping? HTTP/1.1" 200 "-b" 53b 7ms                                        |
|"Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)"                                    |
|WepX20WkyvTydOpOuk/IDIVsxN+4zOZbRzng== 50000 - -                                        |
+----------------------------------------------------------------------------------------+
|[03/Oct/2021:00:19:24 +0000]                                                            |
|W+APDZiRZIOjc/gmklDpL95WFxwkMRGthMXLnLDxbNZ6qZA== xxxxx.xxx.xxxx.corp                   | 
|"GET /xxxx/d5d/data/v10/notification_events/NotifcationEventCollection?                 |
|HTTP/1.1" 200 "-b" 7273b 391ms "python-requests/2.25.1" soso80-emea.xxxx.corp 50001 - - |                                                                               
+----------------------------------------------------------------------------------------+

So I tried to find a quick and safe way inspired by this post but I couldn't adapt if you see the colab notebook at end of this question as I tried:

from urllib.parse import urlsplit, urlunsplit

def remove_query_params_and_fragment(url):
    return urlunsplit(urlsplit(url)._replace(query=""))

I tried following way unsuccessfully and sadly couldn't exclude desired part from the rest and clean it:

from pyspark.sql.functions import udf
from urllib.parse import urlsplit

schema2 = StructType(
    [
        StructField("path", StringType(), False),
        StructField("query", ArrayType(StringType(), False), True),
        StructField("fragment", StringType(), True),
    ]
)


def _parse_url(s):
    data = urlsplit(s)
    if data[3]:
        query_params = list()
        query_params.append(data[3])
    else:
        query_params = None
    return {
        "path": "{}://{}/{}".format(data[0], data[1].rstrip("/"), data[2]),
        "query": query_params,
        "fragment": data[4],
    }


url_parse_udf = f.udf(_parse_url, schema2)

parsed = sdf.select("*", url_parse_udf(sdf["weblog"]).alias("data"))

#+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
#|col                                                                                                                                                                                                                                                 #|
#+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
#|xxxx-client=005 HTTP/1.1" 200 "-b" 53b 7ms "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)" WepX20WkyvTydOpOuk/IDIVsxN+4zOZbRzng== 50000 - -                                                                                                    #|
#|$format=json&$filter=%20%20%2%20%20StartDate%20eq%20datetime"2021-03-24T00:15:05"%20and%20substringof("dude",SystemRoles)&$expand=MailLog&$skiptoken=3701%20 HTTP/1.1" 200 "-b" 7273b 391ms "python-requests/2.25.1" soso80-emea.xxxx.corp 50001 - -|
#+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

The problem with my attempt is it returns everything after ? in the raw weblog. I provided the colab notebook for quick debugging. I also was thinking if there is a mechanism to parse the weblog and extract URL parameters and then subtract two columns from each other like:

sdf1 = sdf.withColumn('Result', ( sdf['weblog'] - sdf['url_parameters'] ))
weblog url_parameters Results (weblog - url_parameters)
03/Oct/2021:09:26:37 +0000...xxxx-clientt=005... xxxx-clientt=005 ...
03/Oct/2021:00:19:24 +0000...$format=json&$... $format=json&$... ...

Solution

  • Avoid using UDF if possible. UDF is like a black box to pyspark and thus spark cannot efficiently apply optimizations on them. For details please read this.

    Rather than using Udfs, you can directly use pyspark's sql functions.

    
    from pyspark.sql.functions import split
    # from urllib.parse import urlsplit
    split_with_question_mark = split(sdf.weblog, '\\?')
    param_separated_df = sdf.withColumn("before_param", split_with_question_mark[0]).withColumn("after_param", split_with_question_mark[1])
    param_separated_df.show(truncate=False)
    

    Result:

    
    +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |weblog                                                                                                                                                                                                                                                                                                                                                                                                                        |before_param                                                                                                                                                                                          |after_param                                                                                                                                                                                                                                         |
    +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |[03/Oct/2021:09:26:37 +0000] SsAzIiWuV1Bw9CtthtxTtav8VdmP3N2jkJ/ZTsx6u8ATOC8HFwxKYmWwMrwl6t7heGKU7+Q== user_ZwfikI/2BdNcrhkwWai/bh+zX66co70YwGKAigzuLTW4khCvc1LLmFN1aBH7K0Loq8g== "HEAD /xxxx/pub/ping?xxxx-client=005 HTTP/1.1" 200 "-b" 53b 7ms "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)" WepX20WkyvTydOpOuk/IDIVsxN+4zOZbRzng== 50000 - -                                                                       |[03/Oct/2021:09:26:37 +0000] SsAzIiWuV1Bw9CtthtxTtav8VdmP3N2jkJ/ZTsx6u8ATOC8HFwxKYmWwMrwl6t7heGKU7+Q== user_ZwfikI/2BdNcrhkwWai/bh+zX66co70YwGKAigzuLTW4khCvc1LLmFN1aBH7K0Loq8g== "HEAD /xxxx/pub/ping|xxxx-client=005 HTTP/1.1" 200 "-b" 53b 7ms "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)" WepX20WkyvTydOpOuk/IDIVsxN+4zOZbRzng== 50000 - -                                                                                                    |
    |[03/Oct/2021:00:19:24 +0000] W+APDZiRZIOjc/gmklDpL95WFxwkMRGthMXLnLDxbNZ6qZA== xxxxx.xxx.xxxx.corp "GET /xxxx/d5d/data/v10/notification_events/NotifcationEventCollection?$format=json&$filter=%20%20%2%20%20StartDate%20eq%20datetime"2021-03-24T00:15:05"%20and%20substringof("dude",SystemRoles)&$expand=MailLog&$skiptoken=3701%20 HTTP/1.1" 200 "-b" 7273b 391ms "python-requests/2.25.1" soso80-emea.xxxx.corp 50001 - -|[03/Oct/2021:00:19:24 +0000] W+APDZiRZIOjc/gmklDpL95WFxwkMRGthMXLnLDxbNZ6qZA== xxxxx.xxx.xxxx.corp "GET /xxxx/d5d/data/v10/notification_events/NotifcationEventCollection                             |$format=json&$filter=%20%20%2%20%20StartDate%20eq%20datetime"2021-03-24T00:15:05"%20and%20substringof("dude",SystemRoles)&$expand=MailLog&$skiptoken=3701%20 HTTP/1.1" 200 "-b" 7273b 391ms "python-requests/2.25.1" soso80-emea.xxxx.corp 50001 - -|
    +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    
    

    Once you have separated the before Query Url, You can split the after query part by Http method type i.e HTTP/1.1 to get the query parameters.

    import pyspark.sql.functions as func
    
    separated_by_comma = param_separated_df.withColumn("query_param", func.split(param_separated_df["after_param"], 'HTTP/1.1')[0]);
    separated_by_comma.show(truncate=False)
    
    

    Result:

    +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |weblog                                                                                                                                                                                                                                                                                                                                                                                                                        |before_param                                                                                                                                                                                          |after_param                                                                                                                                                                                                                                         |query_param                                                                                                                                                  |
    +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |[03/Oct/2021:09:26:37 +0000] SsAzIiWuV1Bw9CtthtxTtav8VdmP3N2jkJ/ZTsx6u8ATOC8HFwxKYmWwMrwl6t7heGKU7+Q== user_ZwfikI/2BdNcrhkwWai/bh+zX66co70YwGKAigzuLTW4khCvc1LLmFN1aBH7K0Loq8g== "HEAD /xxxx/pub/ping?xxxx-client=005 HTTP/1.1" 200 "-b" 53b 7ms "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)" WepX20WkyvTydOpOuk/IDIVsxN+4zOZbRzng== 50000 - -                                                                       |[03/Oct/2021:09:26:37 +0000] SsAzIiWuV1Bw9CtthtxTtav8VdmP3N2jkJ/ZTsx6u8ATOC8HFwxKYmWwMrwl6t7heGKU7+Q== user_ZwfikI/2BdNcrhkwWai/bh+zX66co70YwGKAigzuLTW4khCvc1LLmFN1aBH7K0Loq8g== "HEAD /xxxx/pub/ping|xxxx-client=005 HTTP/1.1" 200 "-b" 53b 7ms "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)" WepX20WkyvTydOpOuk/IDIVsxN+4zOZbRzng== 50000 - -                                                                                                    |xxxx-client=005                                                                                                                                              |
    |[03/Oct/2021:00:19:24 +0000] W+APDZiRZIOjc/gmklDpL95WFxwkMRGthMXLnLDxbNZ6qZA== xxxxx.xxx.xxxx.corp "GET /xxxx/d5d/data/v10/notification_events/NotifcationEventCollection?$format=json&$filter=%20%20%2%20%20StartDate%20eq%20datetime"2021-03-24T00:15:05"%20and%20substringof("dude",SystemRoles)&$expand=MailLog&$skiptoken=3701%20 HTTP/1.1" 200 "-b" 7273b 391ms "python-requests/2.25.1" soso80-emea.xxxx.corp 50001 - -|[03/Oct/2021:00:19:24 +0000] W+APDZiRZIOjc/gmklDpL95WFxwkMRGthMXLnLDxbNZ6qZA== xxxxx.xxx.xxxx.corp "GET /xxxx/d5d/data/v10/notification_events/NotifcationEventCollection                             |$format=json&$filter=%20%20%2%20%20StartDate%20eq%20datetime"2021-03-24T00:15:05"%20and%20substringof("dude",SystemRoles)&$expand=MailLog&$skiptoken=3701%20 HTTP/1.1" 200 "-b" 7273b 391ms "python-requests/2.25.1" soso80-emea.xxxx.corp 50001 - -|$format=json&$filter=%20%20%2%20%20StartDate%20eq%20datetime"2021-03-24T00:15:05"%20and%20substringof("dude",SystemRoles)&$expand=MailLog&$skiptoken=3701%20 |
    +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+
    

    All the above changes are made in collab that you shared.