
Apply function to all elements in array<string> column


I have a column with arrays of strings, e.g. like this:

["test.a" "random.ac"]
["test.41" "random.23" "test.123"]

I want to get only the text before the ".". I managed to do this only for the first element of the array. How do I apply it to all elements, preferably without UDFs?

df = df.withColumn("address", substring_index(col("all_addresses").getItem(0), ".", 1))

Solution

  • I would use a similar idea to @wwnde's: the transform function. transform takes an array, applies the provided function to each of its elements, and returns an array of the same size containing the transformed elements. Exactly what you need.

    However, having the same original idea, I would probably implement it differently.
    2 options:

    from pyspark.sql import functions as F
    df = spark.createDataFrame(
        [(["test.a", "random.ac"],),
         (["test.41", "random.23", "test.123"],)],
        ['c1']
    )
    
    df = df.withColumn('c2', F.transform('c1', lambda x: F.element_at(F.split(x, r'\.'), 1)))
    df = df.withColumn('c3', F.transform('c1', lambda x: F.regexp_extract(x, r'(.+)\.', 1)))
    
    df.show()
    # +--------------------+--------------------+--------------------+
    # |                  c1|                  c2|                  c3|
    # +--------------------+--------------------+--------------------+
    # | [test.a, random.ac]|      [test, random]|      [test, random]|
    # |[test.41, random....|[test, random, test]|[test, random, test]|
    # +--------------------+--------------------+--------------------+
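As a quick sanity check of the per-element logic, here is the same idea in plain Python (no Spark needed). Note that the two options are not always interchangeable: splitting on the first "." and extracting with the greedy pattern `(.+)\.` agree for the sample data, but diverge for strings containing more than one dot.

```python
import re

addrs = ["test.a", "random.ac", "test.41", "random.23", "test.123"]

# Option 1: text before the FIRST "." (mirrors split + element_at(..., 1))
before_first = [a.split(".", 1)[0] for a in addrs]

# Option 2: text before the LAST "." (mirrors regexp_extract(r'(.+)\.', 1),
# since (.+) is greedy and consumes up to the final dot)
before_last = [re.match(r"(.+)\.", a).group(1) for a in addrs]

print(before_first)  # ['test', 'random', 'test', 'random', 'test']
print(before_last)   # same here, but "a.b.c" would give 'a' vs 'a.b'
```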