I have 34 million rows and only one column. I want to split the string into 4 columns.
Here is my sample dataset (df):
Log
0 Apr 4 20:30:33 100.51.100.254 dns,packet user: --- got query from 10.5.14.243:30648:
1 Apr 4 20:30:33 100.51.100.254 dns,packet user: id:78a4 rd:1 tc:0 aa:0 qr:0 ra:0 QUERY 'no error'
2 Apr 4 20:30:33 100.51.100.254 dns,packet user: question: tracking.intl.miui.com:A:IN
3 Apr 4 20:30:33 dns user: query from 9.5.10.243: #4746190 tracking.intl.miui.com. A
I want to split it into four columns using this code:
df1 = df['Log'].str.split(n=3, expand=True)
df1.columns=['Month','Date','Time','Log']
df1.head()
Here is the result that I expected:
Month Date Time Log
0 Apr 4 20:30:33 100.51.100.254 dns,packet user: --- go...
1 Apr 4 20:30:33 100.51.100.254 dns,packet user: id:78a...
2 Apr 4 20:30:33 100.51.100.254 dns,packet user: questi...
3 Apr 4 20:30:33 dns transjakarta: query from 9.5.10.243: #474...
4 Apr 4 20:30:33 100.51.100.254 dns,packet user: --- se...
but the response is this error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-36-c9b2023fbf3e> in <module>
----> 1 df1 = df['Log'].str.split(n=3, expand=True)
2 df1.columns=['Month','Date','Time','Log']
3 df1.head()
TypeError: split() got an unexpected keyword argument 'expand'
Is there any solution to split the string using dask?
Recent versions of Dask dataframe do support the str.split method's expand= keyword, provided you also pass an n= keyword as well to tell it how many splits to expect. The TypeError above suggests that your version of Dask dataframe's str.split method doesn't implement the expand= keyword yet; consider upgrading, and raise an issue if one does not already exist.
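For reference, this is how the same split behaves in plain Pandas, whose API Dask dataframe mirrors; a minimal sketch with illustrative log lines modeled on the question's data:

```python
import pandas as pd

# Sample data mirroring the question's log format (values are illustrative)
df = pd.DataFrame({
    "Log": [
        "Apr 4 20:30:33 100.51.100.254 dns,packet user: --- got query",
        "Apr 4 20:30:33 dns user: query from 9.5.10.243",
    ]
})

# n=3 caps the number of whitespace splits at three, yielding exactly
# four columns; the remainder of each line stays in the last column
df1 = df["Log"].str.split(n=3, expand=True)
df1.columns = ["Month", "Date", "Time", "Log"]
print(df1)
```

Because n=3 is given, every row produces four columns even though the log messages contain many more spaces.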
As a short-term workaround, you could write a Pandas function and then use the map_partitions method to apply it across your Dask dataframe:
import pandas

def f(df: pandas.DataFrame) -> pandas.DataFrame:
    """This is your code from above, as a function."""
    df1 = df['Log'].str.split(n=3, expand=True)
    df1.columns = ['Month', 'Date', 'Time', 'Log']
    return df1  # return the split result, not the original frame
ddf = ddf.map_partitions(f) # apply to all pandas dataframes within dask dataframe
Because Dask dataframes are just collections of Pandas dataframes, it's relatively easy to build things yourself when Dask dataframe doesn't support them.