Search code examples
pythonpandasdataframecsvdelimiter

Drop All Rows After SECOND Occurrence of Column Value


I'd like to drop all data in a .txt file that I've converted to a data frame after the second instance of a column value. In this case a delimiter "---".

Dataframe is constructed as follows:

15 Leading Causes of Death  15                          Code        Deaths      Population          Crude Rate  Crude Rate Lower 95% Confidence Interval    Crude Rate Upper 95% Confidence Interval
#Accidents (unintentional injuries) (V01-X59,Y85-Y86)   GR113-112   21          152430              13.8        8.5                                         21.1
#Intentional self-harm (suicide) (*U03,X60-X84,Y87.0)   GR113-124   15          152430              Unreliable  5.5                                         16.2
---                     
Dataset: Underlying Cause of Death, 1999-2019                       
Query Parameters:                       
States: Marin County, CA (06041)                        
Ten-Year Age Groups: 25-34 years                        
Year/Month: 1999; 2000; 2001; 2002; 2003                        
Group By: 15 Leading Causes of Death                        
Show Totals: Disabled                       
Show Zero Values: Disabled                      
Show Suppressed: Disabled                       
Calculate Rates Per: 100,000                        
Rate Options: Default intercensal populations for years 2001-2009 (except Infant Age Groups)                        
---                     
Help: See http://wonder.cdc.gov/wonder/help/ucd.html for more information.                      
---                     
Query Date: Sep 23, 2021 6:51:59 PM

I've seen plenty of solutions for how to do so after the first instance of a column value or NaN etc. but nothing for the second or nth for that matter...

Here's the simple code I have so far to read in the file.

import pandas as pd

dl = pd.read_csv('Underlying Cause of Death, 1999-2019(3).txt', sep = '\t')
dl.to_csv('test.csv', index = False)

Solution

  • Find rows start with '---' and apply a cumulative sum then get the index of the first row equals to 2 and slice your dataframe to this index.

    >>> df.iloc[:df.iloc[:, 0].str.startswith('---').cumsum().eq(2).idxmax()]
    
    0   #Accidents (unintentional injuries) (V01-X59,Y...  GR113-112    21.0    152430.0        13.8                                       8.5                                      21.1
    1   #Intentional self-harm (suicide) (*U03,X60-X84...  GR113-124    15.0    152430.0  Unreliable                                       5.5                                      16.2
    2                                                 ---        NaN     NaN         NaN         NaN                                       NaN                                       NaN
    3       Dataset: Underlying Cause of Death, 1999-2019        NaN     NaN         NaN         NaN                                       NaN                                       NaN
    4                                   Query Parameters:        NaN     NaN         NaN         NaN                                       NaN                                       NaN
    5                    States: Marin County, CA (06041)        NaN     NaN         NaN         NaN                                       NaN                                       NaN
    6                    Ten-Year Age Groups: 25-34 years        NaN     NaN         NaN         NaN                                       NaN                                       NaN
    7                                    Year/Month: 1999       2000  2001.0      2002.0        2003                                       NaN                                       NaN
    8                Group By: 15 Leading Causes of Death        NaN     NaN         NaN         NaN                                       NaN                                       NaN
    9                               Show Totals: Disabled        NaN     NaN         NaN         NaN                                       NaN                                       NaN
    10                         Show Zero Values: Disabled        NaN     NaN         NaN         NaN                                       NaN                                       NaN
    11                          Show Suppressed: Disabled        NaN     NaN         NaN         NaN                                       NaN                                       NaN
    12                       Calculate Rates Per: 100,000        NaN     NaN         NaN         NaN                                       NaN                                       NaN
    13  Rate Options: Default intercensal populations ...        NaN     NaN         NaN         NaN                                       NaN                                       NaN