Tags: apache-spark, split, rdd, pyspark

Trouble splitting a column into more columns in PySpark


I'm having trouble splitting a dataframe's column into more columns in PySpark:

I have a list of lists and I want to transform it into a dataframe, each value in one column.

What I have tried:

I created a dataframe from this list:

[['COL-4560', 'COL-9655', 'NWG-0610', 'D81-3754'],
 ['DLL-7760', 'NAT-9885', 'PED-0550', 'MAR-0004', 'LLL-5554']]

Using this code:

from pyspark.sql import Row
R = Row('col1', 'col2')

# use enumerate to add the ID column
df_from_list = spark.createDataFrame([R(i, x) for i, x in enumerate(recs_list)])

The result I got is:

+----+--------------------+
|col1|                col2|
+----+--------------------+
|   0|[COL-4560, COL-96...|
|   1|[DLL-7760, NAT-98...|
+----+--------------------+

I want to separate the values by comma into columns, so I tried:

from pyspark.sql import functions as F

df2 = df_from_list.select('col1', F.split('col2', ', ').alias('col2'))

# If you don't know the number of columns:
df_sizes = df2.select(F.size('col2').alias('col2'))
df_max = df_sizes.agg(F.max('col2'))
nb_columns = df_max.collect()[0][0]

df_result = df2.select('col1', *[df2['col2'][i] for i in range(nb_columns)])
df_result.show()

But the line df2 = df_from_list.select('col1', F.split('col2', ', ').alias('col2')) raises an error:

AnalysisException: cannot resolve 'split(`col2`, ', ', -1)' due to data type mismatch: argument 1 requires string type, however, '`col2`' is of array<string> type.;;
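The message states the cause directly: F.split expects a string column, but col2 already holds array<string> values. F.split behaves much like Python's str.split, so it only makes sense when the column contains the comma-joined text rather than a list. A small plain-Python illustration of what splitting would do in that case:

```python
# Hypothetical: if col2 held the joined string instead of a list,
# splitting on ', ' would recover the individual codes.
s = 'COL-4560, COL-9655, NWG-0610, D81-3754'
parts = s.split(', ')
print(parts)  # ['COL-4560', 'COL-9655', 'NWG-0610', 'D81-3754']
```

Since col2 is already a list per row, this step is unnecessary, which is what the solution below exploits.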

My ideal final output would be like this:

+----------+----------+----------+----------+----------+
|  SKU     |  REC_01  | REC_02   | REC_03   | REC_04   |
+----------+----------+----------+----------+----------+
| COL-4560 | COL-9655 | NWG-0610 | D81-3754 | null     |
| DLL-7760 | NAT-9885 | PED-0550 | MAR-0004 | LLL-5554 |
+----------+----------+----------+----------+----------+

Some rows may have four values, but some may have more or fewer; I don't know the exact number of columns the final dataframe will have.

Does anyone have any idea of what is happening? Thank you very much in advance.


Solution

  • The col2 column of df_from_list is already of array type, so there is no need to split it (F.split works on string columns, but here we have array<string>).

    Here are the steps that will work for you.

    recs_list=[['COL-4560', 'COL-9655', 'NWG-0610', 'D81-3754'],
     ['DLL-7760', 'NAT-9885', 'PED-0550', 'MAR-0004', 'LLL-5554']]
    
    from pyspark.sql import Row
    R = Row('col1', 'col2')
    
    # use enumerate to add the ID column
    df_from_list = spark.createDataFrame([R(i, x) for i, x in enumerate(recs_list)])
    
    from pyspark.sql import functions as F
    
    df2 = df_from_list
    
    # If you don't know the number of columns:
    df_sizes = df2.select(F.size('col2').alias('col2'))
    df_max = df_sizes.agg(F.max('col2'))
    nb_columns = df_max.collect()[0][0]
    
    # Build the column names from the computed width instead of hard-coding them
    cols = ['SKU'] + ['REC_%02d' % i for i in range(1, nb_columns)]
    df_result = df2.select(*[df2['col2'][i] for i in range(nb_columns)]).toDF(*cols)
    df_result.show()
    #+--------+--------+--------+--------+--------+
    #|     SKU|  REC_01|  REC_02|  REC_03|  REC_04|
    #+--------+--------+--------+--------+--------+
    #|COL-4560|COL-9655|NWG-0610|D81-3754|    null|
    #|DLL-7760|NAT-9885|PED-0550|MAR-0004|LLL-5554|
    #+--------+--------+--------+--------+--------+
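This works because indexing an array column past its last element yields null, which is what fills the missing REC_04 in the first row. Conceptually, the select above does the same as this pure-Python padding sketch (plain lists standing in for the dataframe):

```python
recs_list = [['COL-4560', 'COL-9655', 'NWG-0610', 'D81-3754'],
             ['DLL-7760', 'NAT-9885', 'PED-0550', 'MAR-0004', 'LLL-5554']]

# The widest row decides the column count, as F.max(F.size('col2')) does
nb_columns = max(len(r) for r in recs_list)  # 5

# Pad shorter rows with None, mirroring col2[i] -> null for i >= len(row)
rows = [r + [None] * (nb_columns - len(r)) for r in recs_list]
print(rows[0])  # ['COL-4560', 'COL-9655', 'NWG-0610', 'D81-3754', None]
```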