
Using cuDF to split a Series of strings into chunks


I have a cuDF Series containing long strings, and I would like to split each string into equal-sized chunks.

My code to do this looks something like:

import cudf

s = cudf.Series(["abcdefg", "hijklmnop"])

def chunker(string):
    chunk_size = 3
    return [string[i:i+chunk_size] for i in range(0, len(string), chunk_size)]

print(s.apply(chunker))

This gives the error:

No implementation of function Function(<class 'range'>) found for signature:
 
 >>> range(Literal[int](0), Masked(int32), Literal[int](3))

If I replace len(string) with a constant, then I get another error complaining about the indexing:

No implementation of function Function(<built-in function getitem>) found for signature:
 
 >>> getitem(Masked(string_view), slice<a:b>)

The code works fine in regular pandas, but I was hoping to run it on some really large datasets and benefit from cuDF's GPU operations.


Solution

  • You can use str.findall for this operation, with a regular expression that matches any character between 1 and 3 times (3 being the chunk size). This will be faster in both pandas and cuDF:

    import pandas as pd
    import cudf
    
    N = 1000000
    s = pd.Series(["abcdefg", "hijklmnop"]*N)
    gs = cudf.from_pandas(s)
    
    %time out = s.str.findall(".{1,3}")   # pandas (CPU)
    %time out = gs.str.findall(".{1,3}")  # cuDF (GPU)
    out.head()

    CPU times: user 3.55 s, sys: 164 ms, total: 3.72 s
    Wall time: 3.7 s
    CPU times: user 118 ms, sys: 31.8 ms, total: 150 ms
    Wall time: 150 ms
    
    0      [abc, def, g]
    1    [hij, klm, nop]
    2      [abc, def, g]
    3    [hij, klm, nop]
    4      [abc, def, g]
    dtype: list
    

    You may also be interested in cudf.pandas, the zero-code-change accelerator for pandas code.
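
As a quick sanity check of the regular expression, the sketch below compares it against the slicing function from the question, using only Python's built-in re module (no pandas or cuDF needed). The helper name chunk_pattern is hypothetical, added here to show how the pattern could be built for other chunk sizes. It also shows one caveat: in Python's re, "." does not match a newline unless re.DOTALL is set, so strings containing newlines would be chunked differently from the slicing version. Whether cuDF's regex engine behaves identically on this point is an assumption worth verifying on your own data.

```python
import re

def chunker(string, chunk_size=3):
    # Slice-based reference implementation, as in the question.
    return [string[i:i + chunk_size] for i in range(0, len(string), chunk_size)]

def chunk_pattern(chunk_size):
    # Hypothetical helper: builds ".{1,N}" for an arbitrary chunk size N.
    return ".{1,%d}" % chunk_size

strings = ["abcdefg", "hijklmnop", ""]
pattern = chunk_pattern(3)

# Apply both approaches to each string and compare the results.
regex_chunks = [re.findall(pattern, s) for s in strings]
slice_chunks = [chunker(s, 3) for s in strings]
print(regex_chunks)
print(regex_chunks == slice_chunks)

# Caveat: without re.DOTALL, "." skips newline characters,
# so the newline is dropped instead of being kept in a chunk.
print(re.findall(pattern, "ab\ncd"))
print(re.findall(pattern, "ab\ncd", flags=re.DOTALL))
```

For strings without newlines the two approaches agree, which is why the same pattern can be passed to str.findall on a pandas or cuDF Series in place of apply.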