I have a CSV file such as this one (example.csv
).
STRING_COL,INT_1,INT_2,FLOAT,INT_3
Hello,9,65151651,3234.54848,7832
This is a string,2,5484651,34.234,-999
Another,2,62189548,51.51658,-999
Test,2,2131514,5.2156,-999
Ham,9,6546548,2.15,-999
String,9,3216546,2.15468,-999
Every cell has a different number of decimals. They can be strings as well or integers (Int64, Int8, ...). Then I also have a similar CSV, but with some values changed. I want to check de differences between both files.
Therefore I have written a code similar to this one to compare the values cell by cell:
import pandas as pd
df = pd.read_csv(
'example.csv', delimiter=',', comment='#', skip_blank_lines=True,
verbose=False, engine='python', dtype=str
)
df = df.apply(lambda x: pd.to_numeric(x, errors='ignore', downcast='integer'))
df_2 = pd.read_csv(
'example_2.csv', delimiter=',', comment='#', skip_blank_lines=True, # file with small changes
verbose=False, engine='python', dtype=str
)
df_2 = df_2.apply(lambda x: pd.to_numeric(x, errors='ignore', downcast='integer'))
for i in list(df.index):
for column in list(df.columns):
old = df.loc[i, column]
new = df_2.loc[i, column]
if old != new:
print('DIFFERENT VALUE >> INDEX: {} | OLD: {} | NEW: {}'.format(i, old, new))
If you run this example with the small CSV file I am pretty sure it is going to work well. But with a huge CSV file some weird things are happening. I do not understand why sometimes many values are truncated to this ones:
1.6440000000000001 >> original value 1.644
7.7189999999999985 >> original value 7.7189
Then if I compare them it finds that they are different and this is not true because the value is the same. What is happening? Is there a way to fix this? Is there a better way to compare values with DataFrames?
NOTE: Maybe I am doing anything wrong in other part of my original code, but I think I have written the most important and relevant.
NOTE 2: I take into account that the !=
operator does not work well with NaN
values. I use np.isnan
to check this changes.
Update. I do not need to compare and say "yes, it is equal" and "no, it is not equal". I need to retrieve the values that have changes cell by cell.
Finally I found something to make the comparison in an appropriate way: np.isclose()
. I have read the duplicated question I found and some other questions about the epsilon value: numpy.finfo(), epsilon
Epsilon: Numbers which differ by less than machine epsilon are numerically the same
abs(a - b) < epsilon
absolute(a - b) <= (atol + rtol * absolute(b)) # np.isclose() method
So I need to make something like this. I have to check what happens if I am comparing between float32 and float64 or float16
eps64 = np.finfo(np.float64).eps
for col in df.columns:
np.isclose(
df[col],
df_2[col],
equal_nan=False,
atol=0.0,
rtol=eps64
)
But now I am facing the problem that if I want copy the value to other variable I copy the inaccurate value 1.6440000000000001
. What I am doing now to fix this is to cast the value to float >> float(1.6440000000000001)