Search code examples
pythonpandasauto-vectorization

vectorizing Pandas dataframes


This is a data-cleaning exercise where specific elements of a dataframe A ought to be set to NaN depending on values that are decoded through B.

I have written the following code in which the 3-nested loops would run for 17h:

def Convert(input):
    X = np.fromstring(input[1:-1], dtype=np.int, sep=',')
    return X
tf = B
# B is a dataframe of descriptors for the A dataframe
# the column 'missing_or_unknown' in B is used to determine the elements of A to be replaced
tf['missing_or_unknown'] = B['missing_or_unknown'].apply(lambda x: Convert(x))
Y = tf['missing_or_unknown'].values
for i in range(0,len(A)):
    for j in range(0,85):
        for k in range (0,len(Y[j])):
            if A.iloc[i,j] == Y[j][k]:
                A[i,j] = np.nan

I suspect the bottleneck is the long outer loop since len(A) is ~ 1 million. So, this is not the best way to use Pandas, and I would instead go for:

for j in range(0,85):
      for k in range (0,len(Y[j])):
        if A.iloc[:,j] == Y[j][k]:
                A.iloc[:,j] = np.nan

The latter however throws an exception:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()

2 Questions:

  1. Am I right about the performance bottleneck & benefit from vectorization
  2. how to express the if condition in a correct way
  3. The calculation of Y is not expensive, it is a look up array to determine how to manipulate the A dataframe

Solution

  • If you want to substitute with np.nan any entry of A equal to the entry of Y in the same position you can use:

    A[A==Y]=np.nan
    

    Does this solve your problem?

    Your first code works but it's very slow.

    Your second code doesn't work because the if statement is comparing the whole column (Series) A.iloc[:,j]to a value, which you can do with .any(), as suggested.

    Here I compare the speed of my code wrt your first try on two dataframe with dimension 100x85:

    import time
    A = pd.DataFrame(np.zeros([100,85]))
    A.iloc[0,1] = 1
    Y = pd.DataFrame(np.ones([100,85]))
    start_time = time.time()
    A[A==Y]=np.nan
    print("--- %s seconds ---" % (time.time() - start_time))
    --- 0.030421018600463867 seconds ---
    
    start_time = time.time()
    for i in range(0,len(A)):
    for j in range(0,85):
        for k in range (0,len(Y[j])):
            if A.iloc[i,j] == Y[j][k]:
                A[i,j] = np.nan
    print("--- %s seconds ---" % (time.time() - start_time))
    --- 17.413578748703003 seconds ---