Tags: python, regex, string, pyspark, substring

PySpark: extract everything that comes after the second-to-last period


I am looking to create a new column that contains all characters after the second last occurrence of the '.' character.

If there are fewer than two '.' characters, then keep the entire string.

I am looking to do this in Spark 2.4.8 without using a UDF. Any ideas?

from pyspark.sql import functions

data = [
    ('google.com',),
    ('asdasdasd.google.com',),
    ('a.d.a.google.com',),
    ('www.google.com',)
]

df = sc.parallelize(data).toDF(['host'])
df.withColumn('domain', functions.regexp_extract(df['host'], r'\b\w+\.\w+\b', 0)).show()
+--------------------+----------------+
|                host|          domain|
+--------------------+----------------+
|          google.com|      google.com|
|asdasdasd.google.com|asdasdasd.google|
|    a.d.a.google.com|             a.d|
|      www.google.com|      www.google|
+--------------------+----------------+

The desired result is the following.

+--------------------+----------------+
|                host|          domain|
+--------------------+----------------+
|          google.com|      google.com|
|asdasdasd.google.com|      google.com|
|    a.d.a.google.com|      google.com|
|      www.google.com|      google.com|
+--------------------+----------------+

Solution

  • Simply use substring_index with a negative count: -2 keeps everything after the second-to-last '.', and strings with fewer than two '.' characters are returned unchanged, which is exactly the required fallback.

    from pyspark.sql import functions as f

    df.withColumn('domain', f.substring_index('host', '.', -2)).show(truncate=False)
    
    +--------------------+----------+
    |host                |domain    |
    +--------------------+----------+
    |google.com          |google.com|
    |asdasdasd.google.com|google.com|
    |a.d.a.google.com    |google.com|
    |www.google.com      |google.com|
    +--------------------+----------+