I have a table which has "url" column.
I need to extract all the values of "tag".
TableA
#+---------------------------------------------------------------------+
#| url |
#+---------------------------------------------------------------------+
#| https://www.amazon.in/primeday?tag=final&value=true |
#| https://www.filipkart.in/status?tag=presubmitted&Id=124&key=2 |
#| https://www.google.com/active/search?tag=inreview&type=addtional |
#| https://www.google.com/filter/search?&type=nonactive |
Output
#+------------------+
#| Tag |
#+------------------+
#| final |
#| presubmitted |
#| inreview |
I am able to do it in Spark SQL using this:
spark.sql("""select parse_url(url,'QUERY','tag') as Tag from TableA""")
Any option via dataframe or regular expression?
PySpark:
df \
.withColumn("partialURL", split("url", "tag=")[1]) \
.withColumn("tag", split("partialURL", "&")[0]) \
.drop("partialURL")