I am looking to create a new column that contains all characters after the second-to-last occurrence of the '.' character.
If there are fewer than two '.' characters, then keep the entire string.
I am looking to do this in Spark 2.4.8 without using a UDF. Any ideas?
from pyspark.sql import functions

data = [
('google.com',),
('asdasdasd.google.com',),
('a.d.a.google.com',),
('www.google.com',)
]
df = sc.parallelize(data).toDF(['host'])
df.withColumn('domain', functions.regexp_extract(df['host'], r'\b\w+\.\w+\b', 0)).show()
+--------------------+----------------+
| host| domain|
+--------------------+----------------+
| google.com| google.com|
|asdasdasd.google.com|asdasdasd.google|
| a.d.a.google.com| a.d|
| www.google.com| www.google|
+--------------------+----------------+
The desired result is the following.
+--------------------+----------------+
| host| domain|
+--------------------+----------------+
| google.com| google.com|
|asdasdasd.google.com| google.com|
| a.d.a.google.com| google.com|
| www.google.com| google.com|
+--------------------+----------------+
Simply use substring_index. With a negative count it keeps everything to the right of that many occurrences of the delimiter counting from the end, and when the delimiter appears fewer times than that, the whole string is returned (f here is the pyspark.sql.functions module):
df.withColumn('domain', f.substring_index('host', '.', -2)).show(truncate=False)
+--------------------+----------+
|host |domain |
+--------------------+----------+
|google.com |google.com|
|asdasdasd.google.com|google.com|
|a.d.a.google.com |google.com|
|www.google.com |google.com|
+--------------------+----------+
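To sanity-check the "fewer than two '.'" requirement, you can run the same expression over a host with no '.' at all (localhost below is just an illustrative value, not part of the question's data). Both rows should come back unchanged, since substring_index returns the full string when the delimiter occurs fewer than two times:
# hypothetical extra rows to confirm the short-host behavior
extra = sc.parallelize([('localhost',), ('google.com',)]).toDF(['host'])
extra.withColumn('domain', f.substring_index('host', '.', -2)).show(truncate=False)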