Search code examples
pythonpandasdataframetime-seriesperiodicity

Pandas print date periods where a condition holds in a data frame column?


I have a txt. file that looks as follows:

Name 1
                                                           
                                                           
   @Name( )                  Value       WATER       WHP     
                  Date       Unit         Unit       Unit     
-------------- ---------- ---------- ---------- ---------- 
Name 1       20081220      2900.00       0.00    3300.00 
Name 1       20081221      0.00          0.00    3390.00 
Name 1       20081222      2500.00       0.00    2802.00 
Name 1       20081223      0.00          0.00    3022.00
Name 1       20081224      0.00          0.00    3022.00

I used the following code to import into python:

df = pd.read_csv(r'test_prd.txt', skiprows=6, engine="python", header=None)
df.columns = ['Test']
df.drop(df.tail(1).index, inplace = True) # because of file format
df = df.Test.str.split(expand=True)

df.rename(columns ={0:'Name', 1:'Number', 2:'Date', 3:'Value', 4:'Water', 5:'WHP'}
,inplace=True)
df['Date'] = pd.to_datetime(df['Date']).dt.floor('D').dt.strftime('%Y-%m-%d')
df['Note'] = (df['Value']).apply(lambda x: 'yes' if x==0 else '')
del df['Water']
del df['WHP']
    
df['Name'] = df['Name'].astype(str) + ' ' + df['Number'].astype(str)
del df['Number']

After using this code the data frame looks like:

           Name      Date       Value       Note
    0     Name 1  2008-12-20    2900.00      
    1     Name 1  2008-12-21    0.00         Yes
    2     Name 1  2008-12-22    2500.00      
    3     Name 1  2008-12-23    0.00         Yes
    4     Name 1  2008-12-24    0.00         Yes 
   ...    ...        ...        ...          ...
    78    Name    2009-03-15    0.00         Yes
    79    Name    2009-03-16    3000.00      
    80    Name    2009-03-17    0.00         Yes
   ...    ...        ...        ...          ...

I want to print the periods of time (start date - end date) for which the 'Value' column equals zero, i.e, when 'Note'=Yes. Any other row were the value is non-zero can be removed from the data frame. If there is a standalone value of zero (preceded and followed by a non-zero value), the start and end date would be the same.

The expected output should look like this:

      Name     Start Date      End Date     Value       Note
1     Name     2008-12-21    2008-12-21     0.00         Yes
2     Name     2008-12-23    2009-03-15     0.00         Yes
3     Name     2009-03-17        ***        0.00         Yes
       ...        ...           ...         ...          ...

I was trying to use a conditional if statement or df.loc but I don't know my way around Python enough to put it together. Any advice would be appreciated.


Solution

  • First let's import your dataframe using read_csv

    df = PD.read_csv("yourfile.txt", sep="\s+", engine="python", parse_dates=["Date"])
    df["Value"] = df["Value"].astype(float)
    

    Be sure to replace the value of sep with the correct column separator. Here I assume the separator is one or more whitespaces, if not please adapt it.
    Be also sure to convert the "Date" column to a datetime, by using the parse_dates parameter, and that "Value" column is of type float.

    Now with df being your dataframe, this snippet should do what you want.

    df["Start"] = (df["Value"] == 0) & (df["Value"].shift(1) != 0)
    ddf = df[df["Value"] == 0]
    ddf["Group"] = ddf["Start"].cumsum()
    rdf = ddf.groupby("Group").apply(lambda x: PD.Series({"Name":x["Name"].iloc[0],
                                                          "Start Date":x["Date"].min(),
                                                          "End Date":x["Date"].max(),
                                                          "Value": 0.,
                                                          "Note": "Yes",
                                                         })).reset_index(drop=True)
    

    The point here is to use some pandas function to achieve what you want in an efficient way. Don't use loops, if your dataframe is big, you'll need a lot of time to execute your code.

    1. Here I first create a "Start" column where I check which row is the starting of a zero interval series of rows. I do it by shifting the "Value" row forward by 1 position, and comparing each row. "Start" column has a True value for each row where an interval should be started.
    2. Then I remove the non zero "Value" rows.
    3. Then I use cumsum to sum over the "Start" column. This will make a new column which I can use to group together the intervals which should be joined.
    4. Eventually I can use groupby with apply to join together the groups, and for each group create a single row of a new dataframe where I can take the earliest and latest date from the "Date" column.

    With the rows you have posted, the final result is:

       Name Start Date   End Date  Value Note
    0  Name 2008-12-21 2008-12-21    0.0  Yes
    1  Name 2008-12-23 2009-03-15    0.0  Yes
    2  Name 2009-03-17 2009-03-17    0.0  Yes