Tags: python, pandas, data-science-experience

Python performance problems while classifying column values


This question is closely related to my earlier question: here
Sorry that I have to ask again!

The code below runs and delivers the correct results, but it is again rather slow (4 minutes for 80K rows). I am having trouble using the pandas Series class with concrete values. Can someone recommend how I can classify these columns instead?

I could not find relevant information in the documentation:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html

Running Code:

# p_test_SOLL_test_D10

for x in range(len(tableContent[6])):
    # Read and convert the cell once instead of on every comparison
    value = float(tableContent[6].loc[x, 'p_test_LAENGE'])
    if value >= 100.0:
        tableContent[6].loc[x, 'p_test_LAENGE'] = 'yes'
    elif 10 <= value < 30.0:
        tableContent[6].loc[x, 'p_test_LAENGE'] = 'yes2'
    elif 5 <= value < 10.0:
        tableContent[6].loc[x, 'p_test_LAENGE'] = 'yes3'
    else:
        tableContent[6].loc[x, 'p_test_LAENGE'] = 'no'

print(tableContent[6]['p_test_LAENGE'])

Series Try:

if tableContent[6]['p_test_LAENGE'].astype(float) >=100.0:
    tableContent[6]['p_test_LAENGE']='yes'
elif (tableContent[6]['p_test_LAENGE'].astype(float) <30.0 and tableContent[6]['p_test_LAENGE'].astype(float) >= 10):
    tableContent[6]['p_test_LAENGE']='yes1'
elif (tableContent[6]['p_test_LAENGE'].astype(float) <10.0 and tableContent[6]['p_test_LAENGE'].astype(float) >= 5):
    tableContent[6]['p_test_LAENGE']='yes2'
else:
    tableContent[6]['p_test_LAENGE']='no'


print (tableContent[6]['p_test_LAENGE'])

Solution

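  • I do not have your df to test, so you may need to adapt the following code. It assumes that every value in the column is at least 10e-7 and less than 10e7; anything outside that range becomes NaN. Passing right=False makes each bin include its lower edge and exclude its upper edge, which matches the >= and < comparisons in the question. Passing ordered=False (available from pandas 1.1) is needed because the label 'no' appears twice; without it pd.cut rejects duplicate labels.

    import pandas as pd
    # bins are half-open: [10e-7, 5), [5, 10), [10, 30), [30, 100), [100, 10e7)
    bins = [10e-7, 5, 10, 30, 100, 10e7]
    labels = ['no', 'yes2', 'yes1', 'no', 'yes']
    df['p_test_LAENGE_class'] = pd.cut(df['p_test_LAENGE'].astype(float), bins=bins, labels=labels, right=False, ordered=False)

    Hope this helps.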
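
  • A note on why the "Series Try" in the question fails: a comparison on a Series gives one boolean per row, and a plain if cannot reduce that to a single True/False, so pandas raises a ValueError about an ambiguous truth value. One possible alternative, sketched below, is numpy.select, which evaluates the conditions element-wise and reads much like the original if/elif chain. The DataFrame df and its sample values are made up for illustration (the real data is in tableContent[6]), and the labels follow the "Series Try" naming.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for tableContent[6]
df = pd.DataFrame(
    {"p_test_LAENGE": ["120", "15", "7.5", "50", "2", "100", "10", "5", "30"]}
)

# Convert the whole column once instead of calling float() per cell
length = df["p_test_LAENGE"].astype(float)

# Element-wise conditions: combine with & (not `and`) and keep the parentheses
conditions = [
    length >= 100.0,
    (length >= 10.0) & (length < 30.0),
    (length >= 5.0) & (length < 10.0),
]
choices = ["yes", "yes1", "yes2"]

# The first matching condition wins, like if/elif; `default` acts as the else
df["p_test_LAENGE_class"] = np.select(conditions, choices, default="no")

print(df)
```

    Unlike the binning approach, this needs no assumption about the smallest or largest value, because everything that matches no condition falls through to the default.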