I want to remove all rows that have atleast one 'NaN'. The dataframe looks like below but the actual one has around 1000004 rows.
The full CSV file: file
Time (µs) | ChannelA (mV) | ChannelB (mV) | ChannelC (mV) | ChannelD (mV) |
---|---|---|---|---|
1 | 0.1 | 0.2 | 0.3 | 0.4 |
2 | 0.5 | NaN | 0.4 | 0.5 |
3 | 0.6 | NaN | 0.3 | 0.11 |
4 | 0.8 | 0.6 | NaN | 0.6 |
5 | 0.7 | 0.1 | 0.4 | NaN |
The code I have written is as follows:
def fft(x, Plot_ShareY=True):
#loads the csv files
dfs = pd.read_csv(x, delimiter=";", skiprows=(1,2), decimal=",", na_values='NaN')
#removes spaces in column names
dfs = pd.DataFrame(dfs)
dfs.columns = dfs.columns.str.replace(' ', '') #removes spaces from col names
#removes infinity symbol and converts non-numeric values to NaN
for c in dfs:
dfs[c] = pd.to_numeric(dfs[c], errors='coerce')
#drops NaN
dfs = dfs.dropna(axis=0, how='any')
print(dfs)
I expect to have atleast 300000 rows but I get only around 200000 rows. When I check in the actual CSV file, the first NaN does not occur atleast until the 380000 th row. So why is dropna()
removing extra rows ?
I took the different approach as I see using pd.to_numeric(dfs[c], errors='coerce')
making it all Nan
in my testing and returning empty DataFrame while applying dfs.dropna(axis=0, how='any')
, So, best bet it to replace symbols(∞ & -∞
) to np.nan
and then remove Nan
rows.
While trimming all the Whitespace, better use strip
to remove both ends space on all the columns.
You will be getting 387060
rows.
dfs = pd.read_csv("Pipe03_2.csv", delimiter=";", skiprows=(1,2), decimal=",", na_values=['(NA)'], low_memory=False)
dfs = dfs.rename(columns=lambda x: x.strip())
dfs = dfs.replace({'-∞': np.nan, '∞': np.nan})
dfs = dfs.dropna(axis=0, how='any')
print(dfs)
dfs.head()
Time Channel A Channel B Channel C Channel D
0 -20.000096 -0,36625560 -1,09876700 0,00000000 -0,36625560
1 -20.000046 -0,36625560 -1,46502200 -0,36625560 0,30521300
2 -19.999996 0,30521300 -0,73251120 0,00000000 -0,36625560
3 -19.999946 -0,36625560 -1,09876700 0,00000000 0,00000000
4 -19.999896 0,67146860 -0,36625560 0,30521300 0,00000000
dfs.tail()
Time Channel A Channel B Channel C Channel D
944234 27.211601 -358,74740000 499,81680000 -169,21010000 -236,17380000
944237 27.211751 -358,74740000 499,81680000 -169,57640000 -235,44130000
944239 27.211851 -358,74740000 499,81680000 -169,94260000 -235,80760000
944240 27.211901 -358,74740000 499,81680000 -170,30890000 -235,80760000
944241 27.211951 -359,11360000 499,81680000 -170,30890000 -236,17380000