apache-sparkpysparkapache-spark-sqlsubstringextract

Extract substring from URL / value of a key from URL


I have a table which has "url" column.

I need to extract all the values of "tag".

TableA

#+---------------------------------------------------------------------+
#|   url                                                               |
#+---------------------------------------------------------------------+
#|   https://www.amazon.in/primeday?tag=final&value=true               | 
#|   https://www.filipkart.in/status?tag=presubmitted&Id=124&key=2     | 
#|   https://www.google.com/active/search?tag=inreview&type=addtional  |
#|   https://www.google.com/filter/search?&type=nonactive              |  

Output

#+------------------+
#|   Tag            |
#+------------------+
#|   final          | 
#|   presubmitted   | 
#|   inreview       | 

I am able to do it in Spark SQL using this:

spark.sql("""select parse_url(url,'QUERY','tag') as Tag from TableA""")

Any option via dataframe or regular expression?


Solution

  • PySpark:

    df \
      .withColumn("partialURL", split("url", "tag=")[1]) \
      .withColumn("tag", split("partialURL", "&")[0]) \
      .drop("partialURL")