This is a data-cleaning exercise in which specific elements of a dataframe A must be set to NaN, based on codes decoded from a second dataframe B.
I wrote the following code, whose three nested loops would take about 17 hours to run:
import numpy as np
import pandas as pd

def Convert(input):
    # Parse a bracketed string such as "[-1,0,9]" into an integer array
    X = np.fromstring(input[1:-1], dtype=int, sep=',')
    return X
tf = B
# B is a dataframe of descriptors for the A dataframe;
# its 'missing_or_unknown' column lists the values of A to be replaced
tf['missing_or_unknown'] = B['missing_or_unknown'].apply(Convert)
Y = tf['missing_or_unknown'].values
for i in range(len(A)):
    for j in range(85):
        for k in range(len(Y[j])):
            if A.iloc[i, j] == Y[j][k]:
                A.iloc[i, j] = np.nan
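For context, Convert simply parses the bracketed descriptor string into an integer array; for example (assuming the strings look like "[-1,0,9]"):

Convert("[-1,0,9]")   # -> array([-1,  0,  9])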
I suspect the bottleneck is the long outer loop, since len(A) is about 1 million. This is not the best way to use Pandas, so I tried instead:
for j in range(85):
    for k in range(len(Y[j])):
        if A.iloc[:, j] == Y[j][k]:
            A.iloc[:, j] = np.nan
The latter, however, throws an exception:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()
Two questions: why does the second version raise this exception, and what is the fast, idiomatic Pandas way to do this replacement?
If you want to substitute np.nan for any entry of A that equals the entry of Y in the same position, you can use:

A[A == Y] = np.nan

Does this solve your problem?
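Note that A[A == Y] = np.nan assumes Y is a dataframe with the same shape as A. In your actual setup, Y[j] is the decoded array of codes for column j, so a vectorized version would loop only over the 85 columns. A minimal sketch, assuming Y[j] is array-like, using isin and mask (mask replaces with NaN by default):

for j in range(A.shape[1]):
    col = A.iloc[:, j]
    # NaN wherever the value appears in that column's missing/unknown codes
    A.iloc[:, j] = col.mask(col.isin(Y[j]))

This keeps all the per-row work inside Pandas, leaving only 85 Python-level iterations.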
Your first code works, but it is very slow.
Your second code fails because the if statement compares a whole column (a Series), A.iloc[:, j], against a single value; to collapse that comparison into one truth value you need .any(), as the error message suggests.
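For completeness, here is a sketch of your second loop made to run: the comparison produces a boolean Series, .any() collapses it to a scalar for the if, and the Series itself serves as a mask so that only the matching cells are replaced (a bare if (...).any(): A.iloc[:, j] = np.nan would blank the entire column on a single match):

for j in range(85):
    for k in range(len(Y[j])):
        match = A.iloc[:, j] == Y[j][k]       # boolean Series, one flag per row
        if match.any():                       # scalar truth value: no ValueError
            A.iloc[match.values, j] = np.nan  # NaN only the matching cells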
Here I compare the speed of my code with your first attempt, on two dataframes of dimension 100x85:
import time
import numpy as np
import pandas as pd

A = pd.DataFrame(np.zeros([100, 85]))   # all zeros
A.iloc[0, 1] = 1                        # one cell that will match Y
Y = pd.DataFrame(np.ones([100, 85]))    # all ones
start_time = time.time()
A[A == Y] = np.nan
print("--- %s seconds ---" % (time.time() - start_time))
--- 0.030421018600463867 seconds ---
start_time = time.time()
for i in range(len(A)):
    for j in range(85):
        for k in range(len(Y[j])):
            if A.iloc[i, j] == Y[j][k]:
                A.iloc[i, j] = np.nan
print("--- %s seconds ---" % (time.time() - start_time))
--- 17.413578748703003 seconds ---
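That is roughly a 570x speedup (17.41 s vs 0.03 s) on just 100 rows; since the elementwise loop scales linearly with row count, the ~17-hour estimate on ~1 million rows is plausible, while the vectorized version should stay on the order of seconds.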