Split string on custom delimiter in PySpark

Tags: pyspark, apache-spark-sql


I have data with a column foo whose values look like this:

foo
abcdef_zh
abcdf_grtyu_zt
pqlmn@xl

From this I want to create two columns such that:

Part 1       Part 2
abcdef       zh
abcdf_grtyu  zt
pqlmn        xl

The code I am using for this is:

data = data.withColumn("Part 1",split(data["foo"],substring(data["foo"],-3,1))).get_item(0)
data = data.withColumn("Part 2",split(data["foo"],substring(data["foo"],-3,1))).get_item(1)

However, I am getting a "Column is not iterable" error.
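
Why this fails: pyspark.sql.functions.split() expects its pattern argument to be a plain Python string, not a Column, so passing substring(...) as the delimiter is what triggers the "Column is not iterable" TypeError. Separately, the method is getItem() (not get_item), and it belongs on the Column inside withColumn(), not on the DataFrame that withColumn() returns. As a minimal sketch (assuming the DataFrame data and column foo from the question, and Spark 2.4+ for element_at), the whole expression can instead be built in SQL via expr(), where split() does accept a column-derived delimiter:

    from pyspark.sql.functions import expr

    # The delimiter is always the third character from the end. split() treats
    # it as a regex and cuts on *every* occurrence, so for "abcdf_grtyu_zt"
    # item 0 would be "abcdf", not "abcdf_grtyu". Taking the last element for
    # Part 2 and a fixed-length prefix for Part 1 sidesteps both issues.
    data = data.withColumn(
        "Part 2", expr("element_at(split(foo, substring(foo, -3, 1)), -1)")
    )
    data = data.withColumn("Part 1", expr("substring(foo, 1, length(foo) - 3)"))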


Solution

  • The following should work:

    >>> from pyspark.sql import Row
    >>> from pyspark.sql.functions import expr
    >>> df = sc.parallelize(['abcdef_zh', 'abcdf_grtyu_zt', 'pqlmn@xl']).map(lambda x: Row(x)).toDF(["col1"])
    >>> df.show()
    +--------------+
    |          col1|
    +--------------+
    |     abcdef_zh|
    |abcdf_grtyu_zt|
    |      pqlmn@xl|
    +--------------+
    >>> df.withColumn('part2', df.col1.substr(-2, 2)).withColumn('part1', expr('substr(col1, 1, length(col1)-3)')).select('part1', 'part2').show()
    +-----------+-----+
    |      part1|part2|
    +-----------+-----+
    |     abcdef|   zh|
    |abcdf_grtyu|   zt|
    |      pqlmn|   xl|
    +-----------+-----+
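
  • For reference, here is the same fixed-width idea with the modern DataFrame API; a minimal sketch assuming a SparkSession named spark (which current PySpark shells provide), rather than building the DataFrame through an RDD:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import expr

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("abcdef_zh",), ("abcdf_grtyu_zt",), ("pqlmn@xl",)], ["col1"]
    )
    # The delimiter always sits third from the end, so fixed positions
    # work no matter which character it actually is.
    df.select(
        expr("substring(col1, 1, length(col1) - 3)").alias("part1"),
        expr("substring(col1, -2, 2)").alias("part2"),  # last two characters
    ).show()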