Labelling error during version comparison

I have a dataframe that looks like this:

api_spec_id   commit_date             info_version          label
500                2020-07-22           1.1 
500                2020-11-09           1.1 
500                2020-11-16           1.1 
500                2020-11-16           1.1 
500                2020-11-23           1.1 
500                2021-02-01           1.1     
138641             2020-06-25          0.1.0                 major
138641             2020-06-25          0.1.0    
138641             2020-06-27          0.1.0    
138641             2020-06-27          0.1.9                 patch
138641             2020-06-27          0.1.10                patch
138641             2020-06-27          0.1.11                patch
138641             2020-06-27          0.1.13                patch
138641             2020-06-27          0.1.14                patch
138641             2020-06-27          0.1.15                patch
138641             2020-06-28          0.2.0                 minor.patch
138641             2020-06-30          0.2.1                 patch
138641             2020-07-01          0.3.0                 minor.patch
138641             2020-07-08          0.4.0                 minor
138641             2020-07-11          0.5.0                 minor
138641             2020-07-12          0.6.0                 minor

I am trying to compare the versions in consecutive rows and label each change. The problem is that the first commit_date of some api_spec_id values gets a label as well, as we can see for id 138641, when it should be empty because there is no previous row to compare with.

This is the code below. I suspect the issue comes from sem, because it extracts the version for all rows at once, which could be affecting the diff. Strangely, it works for some of the ids, and I have not been able to debug it.

pat = r'(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)(?:\.(?P<micro>\d+))?'

sem = new['info_version'].str.extract(pat).fillna(0).astype(int)

diff = sem.diff().fillna(0).ne(0)

new['label'] = diff.dot(sem.columns + '.').str.rstrip('.')

The second part of the code handles versions that are not valid semantic versions and have to be parsed with the Version class.

attrs = ['major', 'minor', 'micro', 'pre', 'post', 'dev', 'local']
def extract_version(ver):
    ver = Version(ver)  
    return pd.Series({attr: getattr(ver, attr) for attr in attrs}, dtype=str)


sem = new['info_version'].agg(extract_version).fillna('').rename(columns={'micro': 'patch'})
diff = sem.ne(sem.shift().fillna(sem.iloc[0]))
new['label'] = diff.dot(sem.columns + '.').str.rstrip('.')

Is there any other way I could calculate the difference? Any suggestions or ideas would be really appreciated.

Solution

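As in one of your previous questions, I recommend grouping by the api_spec_id column when processing versions. Without groupby, the first row of each api_spec_id is compared with the last row of the previous api_spec_id: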

api_spec_id       commit_date   info_version    label
500                2021-02-01            1.1     
138641             2020-06-25          0.1.0    major  # <- without groupby

If you use groupby, the output will be:

api_spec_id       commit_date   info_version    label
500                2021-02-01            1.1     
138641             2020-06-25          0.1.0           # <- with groupby

So you should use:

pat = r'(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)(?:\.(?P<micro>\d+))?'

sem = new['info_version'].str.extract(pat).fillna(0).astype(int)

# diff = sem.diff().fillna(0).ne(0)
diff = sem.groupby(new['api_spec_id']).diff().fillna(0).ne(0)

new['label'] = diff.dot(sem.columns + '.').str.rstrip('.')
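
To see the effect of the grouping in isolation, here is a minimal sketch on a five-row sample. The sample data and the to_label helper are made up for illustration, and fillna("0") is used instead of fillna(0) only so that the fill value has the same type as the extracted strings.

```python
import pandas as pd

# Invented sample: two api_spec_id groups, the first with a two-part version
new = pd.DataFrame({
    "api_spec_id": [500, 500, 138641, 138641, 138641],
    "info_version": ["1.1", "1.1", "0.1.0", "0.1.9", "0.2.0"],
})

pat = r'(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)(?:\.(?P<micro>\d+))?'

# "1.1" does not match the three-part pattern, so all its fields become 0
sem = new["info_version"].str.extract(pat).fillna("0").astype(int)
names = sem.columns + "."

def to_label(changed):
    # Hypothetical helper: join the names of the columns that changed
    return changed.dot(names).str.rstrip(".")

# Plain diff: row 2 is compared with row 1, which belongs to another id
plain = to_label(sem.diff().fillna(0).ne(0))

# Grouped diff: the first row of every api_spec_id has nothing to compare to
grouped = to_label(sem.groupby(new["api_spec_id"]).diff().fillna(0).ne(0))

print(pd.DataFrame({"info_version": new["info_version"],
                    "plain": plain, "grouped": grouped}))
```

In this sample the plain column labels the first 138641 row as minor, because 1.1 is read as 0.0.0 by the pattern, while the grouped column leaves it empty.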

The second part of your code has exactly the same problem:

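import pandas as pd
from packaging.version import Version  # assumption: the packaging library's Version

attrs = ['major', 'minor', 'micro', 'pre', 'post', 'dev', 'local']
def extract_version(ver):
    ver = Version(ver)
    return pd.Series({attr: getattr(ver, attr) for attr in attrs}, dtype=str)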


sem = new['info_version'].agg(extract_version).fillna('').rename(columns={'micro': 'patch'})
# diff = sem.ne(sem.shift().fillna(sem.iloc[0]))
diff = (sem.groupby(new['api_spec_id'], group_keys=False)
           .apply(lambda x: x.ne(x.shift().fillna(x.iloc[0]))))

new['label'] = diff.dot(sem.columns + '.').str.rstrip('.')
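
If the apply with a lambda turns out to be slow on a large frame, a possible alternative is groupby(...).shift(), which returns the previous row within each group directly. This is only a sketch: the sample data is invented, and sem is built with a simple str.split as a stand-in for extract_version, not with Version.

```python
import pandas as pd

# Invented sample data
new = pd.DataFrame({
    "api_spec_id": [500, 500, 138641, 138641, 138641],
    "info_version": ["1.1", "1.1", "0.1.0", "0.1.9", "0.2.0"],
})

# Stand-in for extract_version: one string column per version component,
# with missing components filled as "0"
sem = (new["info_version"].str.split(".", expand=True)
       .reindex(columns=range(3))
       .fillna("0"))
sem.columns = ["major", "minor", "patch"]

# Previous row within the same api_spec_id; the first row of each group is NaN
prev = sem.groupby(new["api_spec_id"]).shift()

# Fill those NaN rows with the row itself, so a first row never differs
diff = sem.ne(prev.fillna(sem))

new["label"] = diff.dot(sem.columns + ".").str.rstrip(".")
print(new)
```

Filling the shifted frame with sem itself plays the same role as fillna(x.iloc[0]) inside the lambda: the first row of each group is compared with itself and gets an empty label.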

In both cases, the output is now:

>>> new
    api_spec_id commit_date info_version        label
0           500  2020-07-22          1.1             
1           500  2020-11-09          1.1             
2           500  2020-11-16          1.1             
3           500  2020-11-16          1.1             
4           500  2020-11-23          1.1             
5           500  2021-02-01          1.1             
6        138641  2020-06-25        0.1.0             
7        138641  2020-06-25        0.1.0             
8        138641  2020-06-27        0.1.0             
9        138641  2020-06-27        0.1.9        patch
10       138641  2020-06-27       0.1.10        patch
11       138641  2020-06-27       0.1.11        patch
12       138641  2020-06-27       0.1.13        patch
13       138641  2020-06-27       0.1.14        patch
14       138641  2020-06-27       0.1.15        patch
15       138641  2020-06-28        0.2.0  minor.patch
16       138641  2020-06-30        0.2.1        patch
17       138641  2020-07-01        0.3.0  minor.patch
18       138641  2020-07-08        0.4.0        minor
19       138641  2020-07-11        0.5.0        minor
20       138641  2020-07-12        0.6.0        minor