python pandas series rapids cudf

Equivalent of pd.Series.str.slice() and pd.Series.apply() in cuDF


I want to convert the following code, which runs in pandas, into code that runs in cuDF.

Sample data from the .head() of the Series being manipulated is plugged into the original code in the third code cell below, so you should be able to copy, paste, and run it.

Original code in pandas

# both are float columns now
# rawcensustractandblock
s_rawcensustractandblock = df_train['rawcensustractandblock'].apply(lambda x: str(x))

# adjust/set new tract number 
df_train['census_tractnumber'] = s_rawcensustractandblock.str.slice(4,11)

# adjust block number
df_train['block_number'] = s_rawcensustractandblock.str.slice(start=11)
df_train['block_number'] = df_train['block_number'].apply(lambda x: x[:4]+'.'+x[4:]+'0' )
df_train['block_number'] = df_train['block_number'].apply(lambda x: int(round(float(x),0)) )
df_train['block_number'] = df_train['block_number'].apply(lambda x: str(x).ljust(4,'0') )

Data being manipulated

# series of values from df_train['rawcensustractandblock'].head()
data = pd.Series([60371066.461001, 60590524.222024, 60374638.00300401, 
                  60372963.002002, 60590423.381006])

Code adjusted to start with this sample data

Here's how the code looks when using the sample data above instead of the entire dataframe.

Based on the errors encountered when trying to convert, the issue is at the Series level, so converting the cell below to execute in cuDF should solve the problem.

import pandas as pd

# series of values from df_train['rawcensustractandblock'].head()
data = pd.Series([60371066.461001, 60590524.222024, 60374638.00300401, 
                  60372963.002002, 60590423.381006])

# how the first line looks using the series
s_rawcensustractandblock = data.apply(lambda x: str(x))

# adjust/set new tract number 
census_tractnumber = s_rawcensustractandblock.str.slice(4,11)

# adjust block number
block_number = s_rawcensustractandblock.str.slice(start=11)
block_number = block_number.apply(lambda x: x[:4]+'.'+x[4:]+'0' )
block_number = block_number.apply(lambda x: int(round(float(x),0)) )
block_number = block_number.apply(lambda x: str(x).ljust(4,'0') )

Expected changes (output)

df_train['census_tractnumber'].head()

# out
0    1066.46
1    0524.22
2    4638.00
3    2963.00
4    0423.38
Name: census_tractnumber, dtype: object

df_train['block_number'].head()

0    1001
1    2024
2    3004
3    2002
4    1006
Name: block_number, dtype: object

Solution

  • You can use cuDF string methods (via nvStrings) for almost everything you're trying to do. Converting these floats to strings inside cuDF loses some precision (though it may not matter in your example above), so for this example I've simply converted them in pandas beforehand. If possible, I would recommend creating rawcensustractandblock as a string column in the first place rather than as a float column.

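    import cudf
    import pandas as pd

    # same sample values as `data` in the question
    pd_data = pd.Series([60371066.461001, 60590524.222024, 60374638.00300401,
                         60372963.002002, 60590423.381006])

    # convert to strings in pandas first, then move the Series to the GPU
    gdata = cudf.from_pandas(pd_data.astype('str'))

    tractnumber = gdata.str.slice(4,11)
    blocknumber = gdata.str.slice(11)
    # re-insert the decimal point after the first four digits
    blocknumber = blocknumber.str.slice(0,4).str.cat(blocknumber.str.slice(4), '.')
    blocknumber = blocknumber.astype('float').round(0).astype('int')
    blocknumber = blocknumber.astype('str').str.ljust(4, '0')
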
    tractnumber
    0    1066.46
    1    0524.22
    2    4638.00
    3    2963.00
    4    0423.38
    dtype: object
    
    blocknumber
    0    1001
    1    2024
    2    3004
    3    2002
    4    1006
    dtype: object
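
The recommendation above, to create rawcensustractandblock as a string column from the start, could look something like the sketch below. It is written in plain pandas so it runs without a GPU. The CSV text is a hypothetical stand-in for the real input file and reuses the sample values from the question. Only `.str` methods and `astype` calls are used, mirroring the calls in the cuDF answer, but this sketch has not been run against cuDF here.

```python
import io

import pandas as pd

# Hypothetical CSV content; the values are the sample rows from the question.
csv_text = """rawcensustractandblock
60371066.461001
60590524.222024
60374638.00300401
60372963.002002
60590423.381006
"""

# dtype=str keeps the digits exactly as written in the file,
# so no float-to-string conversion is needed later.
df = pd.read_csv(io.StringIO(csv_text), dtype={"rawcensustractandblock": str})
raw = df["rawcensustractandblock"]

# characters 4..10 hold the tract number, e.g. '1066.46'
census_tractnumber = raw.str.slice(4, 11)

# everything from position 11 on holds the block digits
digits = raw.str.slice(start=11)

# same steps as the lambdas in the question: insert a decimal point after
# four digits, append '0', round to an integer, then pad to width 4
block_number = digits.str.slice(0, 4) + "." + digits.str.slice(start=4) + "0"
block_number = block_number.astype(float).round(0).astype(int)
block_number = block_number.astype(str).str.ljust(4, "0")

print(census_tractnumber.tolist())
print(block_number.tolist())
```

Because the column never passes through a float, the slices operate on the digits exactly as they appear in the source data.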