Search code examples
pythonpandas

Why dropna() is dropping extra rows?


I want to remove all rows that have atleast one 'NaN'. The dataframe looks like below but the actual one has around 1000004 rows.

The full CSV file: file

Time (µs) ChannelA (mV) ChannelB (mV) ChannelC (mV) ChannelD (mV)
1 0.1 0.2 0.3 0.4
2 0.5 NaN 0.4 0.5
3 0.6 NaN 0.3 0.11
4 0.8 0.6 NaN 0.6
5 0.7 0.1 0.4 NaN

The code I have written is as follows:

def fft(x, Plot_ShareY=True):

    #loads the csv files 
    dfs = pd.read_csv(x, delimiter=";", skiprows=(1,2), decimal=",", na_values='NaN')
 
    #removes spaces in column names   
    dfs = pd.DataFrame(dfs)  
    dfs.columns = dfs.columns.str.replace(' ', '') #removes spaces from col names
    
    #removes infinity symbol and converts non-numeric values to NaN
    for c in dfs:
        dfs[c] = pd.to_numeric(dfs[c], errors='coerce') 
    
    #drops NaN
    dfs = dfs.dropna(axis=0, how='any')
    print(dfs)     

I expect to have atleast 300000 rows but I get only around 200000 rows. When I check in the actual CSV file, the first NaN does not occur atleast until the 380000 th row. So why is dropna() removing extra rows ?


Solution

  • I took the different approach as I see using pd.to_numeric(dfs[c], errors='coerce') making it all Nan in my testing and returning empty DataFrame while applying dfs.dropna(axis=0, how='any'), So, best bet it to replace symbols(∞ & -∞) to np.nan and then remove Nan rows.

    While trimming all the Whitespace, better use strip to remove both ends space on all the columns.

    Solution:

    You will be getting 387060 rows.

    dfs = pd.read_csv("Pipe03_2.csv", delimiter=";", skiprows=(1,2), decimal=",", na_values=['(NA)'], low_memory=False)
    dfs = dfs.rename(columns=lambda x: x.strip())
    dfs = dfs.replace({'-∞': np.nan, '∞': np.nan})
    dfs = dfs.dropna(axis=0, how='any')
    print(dfs)
    

    Results:

    dfs.head()
    
            Time    Channel A    Channel B    Channel C    Channel D
    0 -20.000096  -0,36625560  -1,09876700   0,00000000  -0,36625560
    1 -20.000046  -0,36625560  -1,46502200  -0,36625560   0,30521300
    2 -19.999996   0,30521300  -0,73251120   0,00000000  -0,36625560
    3 -19.999946  -0,36625560  -1,09876700   0,00000000   0,00000000
    4 -19.999896   0,67146860  -0,36625560   0,30521300   0,00000000
    
    dfs.tail()
                 Time      Channel A     Channel B      Channel C      Channel D
    944234  27.211601  -358,74740000  499,81680000  -169,21010000  -236,17380000
    944237  27.211751  -358,74740000  499,81680000  -169,57640000  -235,44130000
    944239  27.211851  -358,74740000  499,81680000  -169,94260000  -235,80760000
    944240  27.211901  -358,74740000  499,81680000  -170,30890000  -235,80760000
    944241  27.211951  -359,11360000  499,81680000  -170,30890000  -236,17380000