
Str split with expand in Dask Dataframe


I have 34 million rows and only one column. I want to split the string into 4 columns.

Here is my sample dataset (df):

    Log
0   Apr  4 20:30:33 100.51.100.254 dns,packet user: --- got query from 10.5.14.243:30648:
1   Apr  4 20:30:33 100.51.100.254 dns,packet user: id:78a4 rd:1 tc:0 aa:0 qr:0 ra:0 QUERY 'no error'
2   Apr  4 20:30:33 100.51.100.254 dns,packet user: question: tracking.intl.miui.com:A:IN
3   Apr  4 20:30:33 dns user: query from 9.5.10.243: #4746190 tracking.intl.miui.com. A

I want to split it into four columns using this code:

df1 = df['Log'].str.split(n=3, expand=True)
df1.columns=['Month','Date','Time','Log']
df1.head()

Here is the result that I expected:

     Month Date      Time                                              Log
0      Apr    4  20:30:33  100.51.100.254 dns,packet user: --- go...
1      Apr    4  20:30:33  100.51.100.254 dns,packet user: id:78a...
2      Apr    4  20:30:33  100.51.100.254 dns,packet user: questi...
3      Apr    4  20:30:33  dns transjakarta: query from 9.5.10.243: #474...
4      Apr    4  20:30:33  100.51.100.254 dns,packet user: --- se...

but instead I get this error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-36-c9b2023fbf3e> in <module>
----> 1 df1 = df['Log'].str.split(n=3, expand=True)
      2 df1.columns=['Month','Date','Time','Log']
      3 df1.head()

TypeError: split() got an unexpected keyword argument 'expand'

Is there any way to split the string using Dask?


Solution

  • Edit: this works now

    Dask dataframe does support the str.split method's expand= keyword if you provide an n= keyword as well to tell it how many splits to expect.
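
    For example, on a reasonably recent Dask version this sketch should run as-is (the sample rows are taken from the question; npartitions=2 is arbitrary):

    import dask.dataframe as dd
    import pandas as pd

    # Build a small dask dataframe from two of the sample log lines
    pdf = pd.DataFrame({'Log': [
        'Apr  4 20:30:33 100.51.100.254 dns,packet user: --- got query from 10.5.14.243:30648:',
        'Apr  4 20:30:33 dns user: query from 9.5.10.243: #4746190 tracking.intl.miui.com. A',
    ]})
    ddf = dd.from_pandas(pdf, npartitions=2)

    # n=3 tells dask how many splits to expect, so it knows the output
    # has four columns; expand=True then works as it does in pandas
    df1 = ddf['Log'].str.split(n=3, expand=True)
    df1.columns = ['Month', 'Date', 'Time', 'Log']
    print(df1.compute())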

    Old answer

    It looks like Dask dataframe's str.split method doesn't implement the expand= keyword. You might raise an issue if one does not already exist.

    As a short-term workaround, you could write a Pandas function and then use the map_partitions method to apply it across your dask dataframe:

    import pandas

    def f(df: pandas.DataFrame) -> pandas.DataFrame:
        """ This is your code from above, as a function """
        df1 = df['Log'].str.split(n=3, expand=True)
        df1.columns = ['Month', 'Date', 'Time', 'Log']
        return df1  # return the split frame, not the original input

    ddf = ddf.map_partitions(f)  # apply f to every pandas dataframe within the dask dataframe
    

    Because Dask dataframes are just collections of Pandas dataframes, it's relatively easy to build things yourself when Dask dataframe doesn't support something.
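
    For completeness, here is a runnable sketch of that workaround end to end. The meta= keyword is optional; it describes the output schema up front so dask doesn't have to infer it by calling f on an empty dummy frame (the column names are the ones from the question, and the one-row sample frame is just for illustration):

    import dask.dataframe as dd
    import pandas as pd

    def f(df: pd.DataFrame) -> pd.DataFrame:
        df1 = df['Log'].str.split(n=3, expand=True)
        df1.columns = ['Month', 'Date', 'Time', 'Log']
        return df1

    pdf = pd.DataFrame({'Log': [
        'Apr  4 20:30:33 dns user: query from 9.5.10.243: #4746190 tracking.intl.miui.com. A',
    ]})
    ddf = dd.from_pandas(pdf, npartitions=1)

    # An empty pandas frame with the expected columns/dtypes serves as meta
    meta = pd.DataFrame({c: pd.Series(dtype='object')
                         for c in ['Month', 'Date', 'Time', 'Log']})
    print(ddf.map_partitions(f, meta=meta).compute())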