Tags: python, apache-spark, pyspark, spark-koalas

Split a koalas column of lists into multiple columns


How do I go from df to df1 where df and df1 are shown below?

import numpy as np
import databricks.koalas as koalas

df = koalas.DataFrame({"teams": [["SF", "NYG"] for _ in range(7)],
                       'teams1': [np.random.randint(0, 10) for _ in range(7)]})
df
output:
    teams   teams1
0   [SF, NYG]   0
1   [SF, NYG]   5
2   [SF, NYG]   8
3   [SF, NYG]   1
4   [SF, NYG]   2
5   [SF, NYG]   8
6   [SF, NYG]   5
df1 = koalas.DataFrame({"col1": ["SF" for _ in range(7)],
                        "col2": ["NYG" for _ in range(7)],
                        'teams1': [np.random.randint(0, 10) for _ in range(7)]})
df1
output:
    col1 col2 teams1
0   SF  NYG 8
1   SF  NYG 2
2   SF  NYG 9
3   SF  NYG 4
4   SF  NYG 8
5   SF  NYG 3
6   SF  NYG 1

I can see a solution for pandas here, but that solution collects all the data on the driver side, which is not what I want to happen. I want a koalas (pandas on PySpark) solution.
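For reference, the pandas approach I'm referring to looks roughly like the sketch below (variable names are just for illustration); it relies on tolist(), which materializes every row on the driver, which is exactly what I want to avoid:

import pandas as pd

# Driver-side sketch of the pandas-style solution (NOT what I want):
# to_pandas()/tolist() pull the whole column onto the driver before
# rebuilding it as two separate columns.
pdf = df.to_pandas()
split_cols = pd.DataFrame(pdf['teams'].tolist(), index=pdf.index, columns=['col1', 'col2'])
pdf = pd.concat([split_cols, pdf.drop(columns='teams')], axis=1)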


Solution

  • One way I found that uses only functions operating on the workers, without collecting all the data to the driver, is:

    df['teams'] \
      .astype(str) \
      .str.replace(r'\[|\]', '') \
      .str.split(pat=',', n=1, expand=True)
    
    #     0     1
    # 0  SF   NYG
    # 1  SF   NYG
    # 2  SF   NYG
    # 3  SF   NYG
    # 4  SF   NYG
    # 5  SF   NYG
    # 6  SF   NYG
    

    I had to cast the column to type string because it was a numpy array, and PySpark couldn't operate on it directly.


    To get the original dataframe back along with its other columns, you can use a simple concat:

    import databricks.koalas as ks
    
    ks.concat([
      df['teams'].astype(str).str.replace(r'\[|\]', '').str.split(pat=',', n=1, expand=True),
      df.drop(columns='teams')
    ], axis=1)
    
    #     0     1  teams1
    # 0  SF   NYG       2
    # 1  SF   NYG       2
    # 2  SF   NYG       1
    # 3  SF   NYG       1
    # 4  SF   NYG       7
    # 5  SF   NYG       8
    # 6  SF   NYG       6
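
    If you want the exact column names from the question's df1, you can rename the split columns afterwards. A minimal sketch, assuming the split produces the integer column labels 0 and 1 shown above (the names col1/col2 just mirror the question); it also trims the leading space that splitting on ',' can leave in the second column, still using only worker-side operations:

    import databricks.koalas as ks

    result = ks.concat([
        df['teams'].astype(str).str.replace(r'\[|\]', '').str.split(pat=',', n=1, expand=True),
        df.drop(columns='teams')
    ], axis=1)

    # Rename the integer columns produced by the split (assumption: they are 0 and 1)
    result = result.rename(columns={0: 'col1', 1: 'col2'})

    # Splitting on ',' can leave a leading space in the second column; strip it on the workers
    result['col2'] = result['col2'].str.strip()

    # result now has columns col1, col2, teams1, matching df1's layout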