apache-spark pyspark apache-spark-sql substring extract

Extract substring from URL / value of a key from URL

I have a table which has "url" column.

I need to extract all the values of "tag".

TableA

#+---------------------------------------------------------------------+
#|   url                                                               |
#+---------------------------------------------------------------------+
#|   https://www.amazon.in/primeday?tag=final&value=true               | 
#|   https://www.filipkart.in/status?tag=presubmitted&Id=124&key=2     | 
#|   https://www.google.com/active/search?tag=inreview&type=addtional  |
#|   https://www.google.com/filter/search?&type=nonactive              |

Output

#+------------------+
#|   Tag            |
#+------------------+
#|   final          | 
#|   presubmitted   | 
#|   inreview       |

I am able to do it in Spark SQL using this:

spark.sql("""select parse_url(url,'QUERY','tag') as Tag from TableA""")

Any option via dataframe or regular expression?

Solution

PySpark:

df \
  .withColumn("partialURL", split("url", "tag=")[1]) \
  .withColumn("tag", split("partialURL", "&")[0]) \
  .drop("partialURL")