Labelling error during version comparison

I have a dataframe that looks like this:

api_spec_id   commit_date   info_version   label
500           2020-07-22    1.1
500           2020-11-09    1.1
500           2020-11-16    1.1
500           2020-11-16    1.1
500           2020-11-23    1.1
500           2021-02-01    1.1
138641        2020-06-25    0.1.0          major
138641        2020-06-25    0.1.0
138641        2020-06-27    0.1.0
138641        2020-06-27    0.1.9          patch
138641        2020-06-27    0.1.10         patch
138641        2020-06-27    0.1.11         patch
138641        2020-06-27    0.1.13         patch
138641        2020-06-27    0.1.14         patch
138641        2020-06-27    0.1.15         patch
138641        2020-06-28    0.2.0          minor.patch
138641        2020-06-30    0.2.1          patch
138641        2020-07-01    0.3.0          minor.patch
138641        2020-07-08    0.4.0          minor
138641        2020-07-11    0.5.0          minor
138641        2020-07-12    0.6.0          minor

I am trying to compare the versions in consecutive rows and label each change. The problem is that the first commit_date of some api_spec_id values often gets a label as well, as can be seen for id 138641, when it should be empty because there is no previous row to compare it to.

My code is below. I suspect the issue comes from the sem step, since it extracts the version for all rows at once, which could be affecting the diff. Strangely, it works for some of the ids, and I have not been able to debug it.

pat = r'(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)(?:\.(?P<micro>\d+))?'

sem = new['info_version'].str.extract(pat).fillna(0).astype(int)

diff = sem.diff().fillna(0).ne(0)

new['label'] = diff.dot(diff.columns + '.').str.rstrip('.')
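
For reference, here is a minimal sketch that reproduces the symptom on a tiny hand-made frame. The ids and versions are made up for illustration, and the labelling step is written as an explicit join over the changed columns rather than as a one-liner, so treat it as an approximation of the code above rather than an exact copy.

```python
import pandas as pd

# Hypothetical toy data: two ids, versions chosen only for illustration.
new = pd.DataFrame({
    "api_spec_id": [1, 1, 2, 2],
    "info_version": ["1.0.0", "1.0.1", "0.1.0", "0.1.9"],
})

pat = r'(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)(?:\.(?P<micro>\d+))?'

# Unmatched groups (here: micro) come back as NaN and are treated as 0.
sem = new["info_version"].str.extract(pat).fillna("0").astype(int)

# Plain diff: every row is compared with the row directly above it,
# even when that row belongs to a different api_spec_id.
diff = sem.diff().fillna(0).ne(0)

# Join the names of the components that changed in each row.
labels = diff.apply(
    lambda row: ".".join(col for col, changed in row.items() if changed),
    axis=1,
)

print(pd.concat([new, labels.rename("label")], axis=1))
```

In this sketch the first row of id 2 gets a label even though it has no earlier row of its own, because it was compared with the last row of id 1.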

The second part of the code is for all versions which are not valid semantic versions and have to be parsed with the Version class.

attrs = ['major', 'minor', 'micro', 'pre', 'post', 'dev', 'local']
def extract_version(ver):
    ver = Version(ver)  
    return pd.Series({attr: getattr(ver, attr) for attr in attrs}, dtype=str)

sem = new['info_version'].agg(extract_version).fillna('').rename(columns={'micro': 'patch'})
diff = sem.ne(sem.shift().fillna(sem.iloc[0]))
new['label'] = diff.dot(diff.columns + '.').str.rstrip('.')

Is there any other way I could calculate the difference? Any suggestions or ideas would be really appreciated.


  • As in one of your previous questions, I recommend grouping by the api_spec_id column when processing versions. Without it, the first row of each id is compared with the last row of the previous id:

    api_spec_id       commit_date   info_version    label
    500                2021-02-01            1.1     
    138641             2020-06-25          0.1.0    major  # <- without groupby

    If you use groupby, the output will be:

    api_spec_id       commit_date   info_version    label
    500                2021-02-01            1.1     
    138641             2020-06-25          0.1.0           # <- with groupby

    So you should use:

    pat = r'(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)(?:\.(?P<micro>\d+))?'
    sem = new['info_version'].str.extract(pat).fillna(0).astype(int)
    # diff = sem.diff().fillna(0).ne(0)
    diff = sem.groupby(new['api_spec_id']).diff().fillna(0).ne(0)
    new['label'] = diff.dot(diff.columns + '.').str.rstrip('.')

    For the second part of your code, it is exactly the same problem:

    attrs = ['major', 'minor', 'micro', 'pre', 'post', 'dev', 'local']
    def extract_version(ver):
        ver = Version(ver)  
        return pd.Series({attr: getattr(ver, attr) for attr in attrs}, dtype=str)
    sem = new['info_version'].agg(extract_version).fillna('').rename(columns={'micro': 'patch'})
    # diff = sem.ne(sem.shift().fillna(sem.iloc[0]))
    diff = (sem.groupby(new['api_spec_id'], group_keys=False)
               .apply(lambda x: x.ne(x.shift().fillna(x.iloc[0]))))
    new['label'] = diff.dot(diff.columns + '.').str.rstrip('.')

    In both cases, the output is now:

    >>> new
        api_spec_id commit_date info_version        label
    0           500  2020-07-22          1.1             
    1           500  2020-11-09          1.1             
    2           500  2020-11-16          1.1             
    3           500  2020-11-16          1.1             
    4           500  2020-11-23          1.1             
    5           500  2021-02-01          1.1             
    6        138641  2020-06-25        0.1.0             
    7        138641  2020-06-25        0.1.0             
    8        138641  2020-06-27        0.1.0             
    9        138641  2020-06-27        0.1.9        patch
    10       138641  2020-06-27       0.1.10        patch
    11       138641  2020-06-27       0.1.11        patch
    12       138641  2020-06-27       0.1.13        patch
    13       138641  2020-06-27       0.1.14        patch
    14       138641  2020-06-27       0.1.15        patch
    15       138641  2020-06-28        0.2.0  minor.patch
    16       138641  2020-06-30        0.2.1        patch
    17       138641  2020-07-01        0.3.0  minor.patch
    18       138641  2020-07-08        0.4.0        minor
    19       138641  2020-07-11        0.5.0        minor
    20       138641  2020-07-12        0.6.0        minor
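
As for whether there is another way to compute the difference: one possible alternative is to keep the plain diff and blank out the first row of each id afterwards. The sketch below compares that with the grouped diff on a shortened, hand-picked subset of the data. The helper name `to_labels` and the variable names are made up for illustration, and the masking variant assumes that all rows of one api_spec_id sit next to each other and are already in commit order.

```python
import pandas as pd

# Shortened subset of the question's data, for illustration only.
new = pd.DataFrame({
    "api_spec_id": [500, 500, 138641, 138641, 138641, 138641],
    "info_version": ["1.1", "1.1", "0.1.0", "0.1.9", "0.2.0", "0.2.1"],
})

pat = r'(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)(?:\.(?P<micro>\d+))?'

# "1.1" has only two components, so the pattern does not match it
# and all of its fields fall back to 0.
sem = new["info_version"].str.extract(pat).fillna("0").astype(int)


def to_labels(diff):
    """Join the names of the changed components, e.g. 'minor.patch'."""
    return diff.apply(
        lambda row: ".".join(col for col, changed in row.items() if changed),
        axis=1,
    )


# Option 1: diff within each api_spec_id, so the first row of an id
# has nothing to be compared with.
grouped = sem.groupby(new["api_spec_id"]).diff().fillna(0).ne(0)

# Option 2: plain diff, then reset rows where the id differs from the
# id of the row above (assumes rows of one id are contiguous).
first_of_id = new["api_spec_id"].ne(new["api_spec_id"].shift())
masked = sem.diff().fillna(0).ne(0)
masked.loc[first_of_id] = False

new["label"] = to_labels(grouped)
print(new)
print("both options agree:", grouped.equals(masked))
```

The grouped version does not depend on row order across ids, so it is probably the safer default. The masking version is only equivalent when the frame is sorted so that each id forms one contiguous block.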