Search code examples
pythonpandascsvrangequantile

How can I replace values in a CSV column from a range?


I am attempting to change the values of two columns in my dataset from specific numeric values (2, 10, 25 etc.) to single values (1, 2, 3 or 4) based on the percentile of the specific value within the dataset.

Using the pandas quantile() function I have got the ranges I wish to replace between, but I haven't figured out a working method to do so.

age1 = datasetNB.Age.quantile(0.25)
age2 = datasetNB.Age.quantile(0.5)
age3 = datasetNB.Age.quantile(0.75)

fare1 = datasetNB.Fare.quantile(0.25)
fare2 = datasetNB.Fare.quantile(0.5)
fare3 = datasetNB.Fare.quantile(0.75)

My current solution attempt for this problem is as follows:

for elem in datasetNB['Age']:
    if elem <= age1:
        datasetNB[elem].replace(to_replace = elem, value = 1)
        print("set to 1")
    elif (elem > age1) & (elem <= age2):
        datasetNB[elem].replace(to_replace = elem, value = 2)
        print("set to 2")
    elif (elem > age2) & (elem <= age3):
        datasetNB[elem].replace(to_replace = elem, value = 3)
        print("set to 3")
    elif elem > age3:
        datasetNB[elem].replace(to_replace = elem, value = 4)
        print("set to 4")
    else:
        pass

for elem in datasetNB['Fare']:
    if elem <= fare1:
        datasetNB[elem] = 1
    elif (elem > fare1) & (elem <= fare2):
        datasetNB[elem] = 2
    elif (elem > fare2) & (elem <= fare3):
        datasetNB[elem] = 3
    elif elem > fare3:
        datasetNB[elem] = 4
    else:
        pass

What should I do to get this working?


Solution

  • pandas already has one function to do that, pandas.qcut.

    You can simply do

    q_list = [0, 0.25, 0.5, 0.75, 1]
    labels = range(1, 5)
    
    df['Age'] = pd.qcut(df['Age'], q_list, labels=labels) 
    df['Fare'] = pd.qcut(df['Fare'], q_list, labels=labels) 
    

    Input

    import numpy as np
    import pandas as pd
    
    # Generate fake data for the sake of example 
    df = pd.DataFrame({
        'Age': np.random.randint(10, size=6),
        'Fare': np.random.randint(10, size=6)
    })
    
    >>> df 
    
       Age  Fare
    0    1     6
    1    8     2
    2    0     0
    3    1     9
    4    9     6
    5    2     2
    

    Output

    DataFrame after running the above code

    >>> df
    
      Age Fare
    0   1    3
    1   4    1
    2   1    1
    3   1    4
    4   4    3
    5   3    1
    

    Note that in your specific case, since you want quartiles, you can just assign q_list = 4.