Search code examples
pythonpandassequencespss-modeler

How to returns the number of records that have passed since EXPR in Python


I'm trying to replicate the IBM SPSS function @SINCE using Python and Pandas, but unfortunately, I've got stuck in a part of my process. Is there a direct function that replicates IBM SPSS CLEM @SINCE using python?

Here is the link for more info: Link

IMB @SINCE function description

"This function returns the offset of the last record where this condition was true--that is, the number of records before this one in which the condition was true. If the condition has never been true, @SINCE returns @INDEX + 1." (IBM, 2020)

I've been trying to replicate this function from scratch, but I haven't find the right way to do it:

How do I do this with pandas and Python?

Here the problem,

My data looks like this:

Original Data

+------+----------+
| Type | Flag     |
+------+----------+
| d    |          |
+------+----------+
| A    | myStatus |
+------+----------+
| c    |          |
+------+----------+
| B    | myStatus |
+------+----------+
| c    |          |
+------+----------+
| c    | myStatus |
+------+----------+
| c    |          |
+------+----------+
| d    |          |
+------+----------+
| d    |          |
+------+----------+
| A    | myStatus |
+------+----------+

In IBM SPSS I use this formula to get this data:

if Type = 'A' or Type = 'B' then @SINCE(Flag = 'myStatus') else -1 endif

And this is the output:

+------+----------+----------------+
| Type | Flag     | Expected Count |
+------+----------+----------------+
| d    |          | -1             |
+------+----------+----------------+
| A    | myStatus | 0              |
+------+----------+----------------+
| c    |          | -1             |
+------+----------+----------------+
| B    | myStatus | 2              |
+------+----------+----------------+
| c    |          | -1             |
+------+----------+----------------+
| c    | myStatus | -1             |
+------+----------+----------------+
| c    |          | -1             |
+------+----------+----------------+
| d    |          | -1             |
+------+----------+----------------+
| d    |          | -1             |
+------+----------+----------------+
| A    | myStatus | 4              |
+------+----------+----------------+

Solution

  • So, I found the way to fix this problem: here is the code:

    df = pd.DataFrame({"Type":["d", "A", "c", "B", "c", "c", "c", "d", "d", "A"],
                   "Flag":[np.nan, "myStatus", np.nan, "myStatus", np.nan, "myStatus", np.nan, np.nan, np.nan, "myStatus"]})
    

    Function to solve the problem:

    def spssSince(df):
       df_temp = df
       df_temp = df[df.Flag=="myStatus"]
       df_temp['last_ind'] = df_temp.index
       df_temp['last_ind'] = df_temp.last_ind.shift(1)
       df_temp['last_ind'] = df_temp['last_ind'].fillna(1)
       df_temp["Expected Count"] = df_temp.index - df_temp.last_ind
       df_temp.loc[~df_temp.Type.isin(["A", "B"]), "Expected Count"] = -1
       DFreturn = pd.merge(left=df, right=df_temp.drop(['Type', 'Flag', 'last_ind'], axis=1), how="left", left_index=True, right_index=True)
       DFreturn["Expected Count"] = DFreturn["Expected Count"].fillna(-1)
       return DFreturn
    

    Basically the function calculates the last SINCE value from a condition, calculating the actual index among the index that has the validation (using shift()).