I want to create a new column that contains all characters after the second-to-last occurrence of the '.' character.
If there are fewer than two '.' characters, the entire string should be kept.
I need to do this in Spark 2.4.8 without using a UDF. Any ideas?
This is what I have tried so far:
data = [
df = sc.parallelize(data).toDF(['host'])
df.withColumn('domain', functions.regexp_extract(df['host'], r'\b\w+\.\w+\b', 0)).show()
+----------------+----------+
|            host|    domain|
+----------------+----------+
|      google.com|google.com|
|a.d.a.google.com|       a.d|
|  www.google.com|www.google|
+----------------+----------+
The desired result is the following.
+--------------------+----------+
|                host|    domain|
+--------------------+----------+
|          google.com|google.com|
|asdasdasd.google.com|google.com|
|    a.d.a.google.com|google.com|
|      www.google.com|google.com|
+--------------------+----------+
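Simply use substring_index. With a count of -2 it returns everything to the right of the second-to-last '.', and it returns the whole string when there are fewer than two delimiters.
from pyspark.sql import functions as f
df.withColumn('domain', f.substring_index('host', '.', -2)).show(truncate=False)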
+----------------+----------+
|host            |domain    |
+----------------+----------+
|google.com      |google.com|
|a.d.a.google.com|google.com|
|www.google.com  |google.com|
+----------------+----------+
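To see why a count of -2 covers both requirements, here is a rough plain-Python sketch of how substring_index treats its arguments. The helper below is only an illustration written for this answer (it is not part of PySpark and is not how Spark implements the function), and the 'localhost' entry is an assumed extra input added to show the no-delimiter case.

```python
def substring_index(s: str, delim: str, count: int) -> str:
    """Illustrative stand-in for Spark SQL's substring_index (not the real implementation)."""
    # An empty delimiter or a count of 0 gives an empty string.
    if count == 0 or not delim:
        return ""
    parts = s.split(delim)
    if count > 0:
        # Positive count: keep everything left of the count-th delimiter.
        return delim.join(parts[:count])
    # Negative count: keep everything right of the count-th delimiter from the end.
    # If there are fewer delimiters than requested, the slice is the whole list,
    # so the original string comes back unchanged.
    return delim.join(parts[count:])


# 'localhost' is an assumed extra input with no '.' at all.
hosts = ["google.com", "asdasdasd.google.com", "a.d.a.google.com", "www.google.com", "localhost"]
domains = {h: substring_index(h, ".", -2) for h in hosts}

for host, domain in domains.items():
    print(host, "->", domain)
```

Because the negative count is counted from the right, short inputs such as 'google.com' or a bare hostname fall through unchanged, which is the "keep the entire string" behaviour asked for in the question.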