python, pandas, versioning, semantic-versioning

Semantic Version comparison and labelling without packages


I have a dataframe that looks like this:

api_spec_id label    info_version   commitDate
    803            2.3.0            2019-09-12
    803            2.4.1            2019-10-04
    803            2.4.2            2019-10-07
    803            2.5.3            2019-10-08
    803            2.6.1            2019-10-08
    803            2.6.3            2019-11-25
    803            2.6.5            2019-12-10
    803            2.6.6            2019-12-11
    803            2.7.2            2019-12-11
    803            2.8.0            2019-12-19

The packaging.version.Version class breaks each version into basic components (major, minor, and so on), but my dataset is a bit different: an upgrade can go from 2.5.3 to 2.6.1, so minor and patch change at the same time. I do not want to give such an upgrade only one label, since that hinders my analysis and introduces bias. We also have cases where versions go from 2.3.0 to 2.4.1, where the upgrade pattern is Major.Minor, and likewise cases of Major.Patch. The dataset has such odd conventions because it is crawled from real API data.

Apart from this, we also have cases where major.minor.patch are all equal and the versions carry pre-release identifiers such as alpha, beta, rc, dev, or pre. In such cases I want to compare them according to the order alpha < beta < rc < pre < dev. In the exceptional case where the identifier is the same as well, I want to compare the number that follows it. There are also versions of the form 1.2.6.7, which I think can be treated as major.minor.patch.micro (micro here being the fourth number).

The expected output should be like this:

api_spec_id label    info_version           commitDate
    803      -              2.3.0                 2019-09-12
    803    minor.patch      2.4.1                 2019-10-04
    803    patch            2.4.2                 2019-10-07
    803    minor.patch      2.5.3                 2019-10-08
    803    minor.patch      2.6.1                 2019-10-08
    803    patch            2.6.3                 2019-11-25
    803    patch            2.6.5                 2019-12-10
    803    patch            2.6.6                 2019-12-11
    803    minor.patch      2.7.2                 2019-12-11
    803    major            2.8.0                 2019-12-19
    803    pre              2.8.0a1               2019-12-24
    803    pre              2.8.0a2               2019-12-27
    803    pre              2.9.0.dev             2021-01-03
    803    micro            2.9.0.2               2021-01-10
    803    no change        2.9.0.2               2021-01-15

I parsed all these versions, and they are in canonical format according to the PEP 440 regex, but I am not sure how to compare them, because a regex alone won't work here, so I am in a bit of a fix. Does anyone have ideas or suggestions on how to tackle this?


Solution

  • Update 1

    As a second step, you can use the packaging package from the PyPA. Note that packaging names the third release component micro, so it is renamed to patch below:

    # pip install packaging
    import pandas as pd
    from packaging.version import Version, parse
    
    # Version attributes to compare, most significant first
    attrs = ['major', 'minor', 'micro', 'pre', 'post', 'dev', 'local']
    def extract_version(ver):
        ver = parse(ver)  # or Version(ver)
        return pd.Series({attr: getattr(ver, attr) for attr in attrs}, dtype=str)
    
    # packaging calls the third release component "micro"; label it "patch"
    sem = df['info_version'].apply(extract_version).fillna('').rename(columns={'micro': 'patch'})
    # Flag every component that differs from the previous row
    diff = sem.ne(sem.shift().fillna(sem.iloc[0]))
    df['label'] = diff.dot(sem.columns + '.').str.rstrip('.')
    

    Output

        api_spec_id info_version  commitDate          label
    0           803        2.3.0  2019-09-12               
    1           803        2.4.1  2019-10-04    minor.patch
    2           803        2.4.2  2019-10-07          patch
    3           803        2.5.3  2019-10-08    minor.patch
    4           803        2.6.1  2019-10-08    minor.patch
    5           803        2.6.3  2019-11-25          patch
    6           803        2.6.5  2019-12-10          patch
    7           803        2.6.6  2019-12-11          patch
    8           803        2.7.2  2019-12-11    minor.patch
    9           803        2.8.0  2019-12-19    minor.patch
    10          803      2.8.0a1  2019-12-24            pre
    11          803      2.8.0a2  2019-12-27            pre
    12          803    2.9.0.dev  2021-01-03  minor.pre.dev
    13          803      2.9.0.2  2021-01-10            dev
    14          803      2.9.0.2  2021-01-15               
    

    Other output:

    >>> sem
       major minor patch       pre post dev local
    0      2     3     0                         
    1      2     4     1                         
    2      2     4     2                         
    3      2     5     3                         
    4      2     6     1                         
    5      2     6     3                         
    6      2     6     5                         
    7      2     6     6                         
    8      2     7     2                         
    9      2     8     0                         
    10     2     8     0  ('a', 1)               
    11     2     8     0  ('a', 2)               
    12     2     9     0                  0      
    13     2     9     0                         
    14     2     9     0                         
    
    >>> diff
        major  minor  patch    pre   post    dev  local
    0   False  False  False  False  False  False  False
    1   False   True   True  False  False  False  False
    2   False  False   True  False  False  False  False
    3   False   True   True  False  False  False  False
    4   False   True   True  False  False  False  False
    5   False  False   True  False  False  False  False
    6   False  False   True  False  False  False  False
    7   False  False   True  False  False  False  False
    8   False   True   True  False  False  False  False
    9   False   True   True  False  False  False  False
    10  False  False  False   True  False  False  False
    11  False  False  False   True  False  False  False
    12  False   True  False   True  False   True  False
    13  False  False  False  False  False   True  False
    14  False  False  False  False  False  False  False
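
The major, minor and micro attributes cover only the first three release numbers, so a fourth component such as the 2 in 2.9.0.2 does not show up in the tables above (row 13 is labelled dev only because the dev marker disappeared). A minimal sketch that reads the full release tuple instead, assuming at most four components matter; release_parts and RELEASE_NAMES are hypothetical names for illustration, not part of packaging:

```python
import pandas as pd
from packaging.version import Version

# Hypothetical column names for up to four release components.
RELEASE_NAMES = ["major", "minor", "patch", "micro"]

def release_parts(ver: str) -> pd.Series:
    """Return the numeric release segment, zero-padded to four components."""
    release = Version(ver).release[:4]
    padded = release + (0,) * (len(RELEASE_NAMES) - len(release))
    return pd.Series(padded, index=RELEASE_NAMES)

versions = pd.Series(["2.9.0.dev", "2.9.0.2", "2.9.0.2"])
parts = versions.apply(release_parts)

# A component counts as changed when it differs from the previous row.
changed = parts.diff().fillna(0).ne(0)
labels = changed.dot(parts.columns + ".").str.rstrip(".")
print(labels.tolist())
```

Under these assumptions the second row is labelled micro. Pre-release, post and dev markers are ignored here and would still need the attribute comparison shown earlier.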
    

    Original answer

    As a starting point, you can use the following (assuming the dataframe is sorted by commitDate):

    # Simple regex pattern (subset of PEP440)
    pat = r'(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)(?:\.(?P<micro>\d+))?'
    
    # Extract component version
    sem = df['info_version'].str.extract(pat).fillna(0).astype(int)
    
    # Check the difference
    diff = sem.diff().fillna(0).ne(0)
    
    # Use dot product
    df['label'] = diff.dot(sem.columns + '.').str.rstrip('.')
    

    Output:

    >>> df
      api_spec_id info_version commitDate        label
    0         803        2.3.0 2019-09-12             
    1         803        2.4.1 2019-10-04  minor.patch
    2         803        2.4.2 2019-10-07        patch
    3         803        2.5.3 2019-10-08  minor.patch
    4         803        2.6.1 2019-10-08  minor.patch
    5         803        2.6.3 2019-11-25        patch
    6         803        2.6.5 2019-12-10        patch
    7         803        2.6.6 2019-12-11        patch
    8         803        2.7.2 2019-12-11  minor.patch
    9         803        2.8.0 2019-12-19  minor.patch
    

    Other output:

    >>> sem
       major  minor  patch  micro
    0      2      3      0      0
    1      2      4      1      0
    2      2      4      2      0
    3      2      5      3      0
    4      2      6      1      0
    5      2      6      3      0
    6      2      6      5      0
    7      2      6      6      0
    8      2      7      2      0
    9      2      8      0      0
    
    >>> diff
       major  minor  patch  micro
    0  False  False  False  False
    1  False   True   True  False
    2  False  False   True  False
    3  False   True   True  False
    4  False   True   True  False
    5  False  False   True  False
    6  False  False   True  False
    7  False  False   True  False
    8  False   True   True  False
    9  False   True   True  False
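
The dot product step may look surprising, so here is a small self-contained illustration of why it builds the labels. It relies on Python's rule that True * "minor." is "minor." and False * "minor." is the empty string; the two-row frame below is made-up example data, not taken from the dataset:

```python
import pandas as pd

# Boolean "did this component change?" flags for two hypothetical upgrades.
diff = pd.DataFrame(
    {"major": [False, False], "minor": [True, False], "patch": [True, True]}
)

# Each row is multiplied element-wise with the column names and summed,
# which concatenates the names of the columns that are True in that row.
labels = diff.dot(diff.columns + ".").str.rstrip(".")

# Equivalent, more explicit formulation for comparison.
explicit = [
    ".".join(name for name, flag in zip(diff.columns, row) if flag)
    for row in diff.to_numpy()
]
print(labels.tolist())
```

The explicit loop is slower but can be easier to read if the dot product feels too clever.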
    

    Note: this does not handle pre-release versions yet, but the regex pattern can be extended to cover them.
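
One possible way to extend the pattern, offered as a rough sketch rather than a full PEP 440 parser: add an optional pre-release tag and an optional number after it. The group names pre and pre_num, the list of accepted tags, and the choice to fold tag and number changes into a single pre label are all assumptions made for this example:

```python
import pandas as pd

# Assumed extension of the pattern: optional fourth number, then an
# optional pre-release tag with an optional trailing number.
# Longer tags come first so that "alpha" is not matched as just "a".
pat = (
    r'(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)'
    r'(?:\.(?P<micro>\d+))?'
    r'(?:[.-]?(?P<pre>alpha|beta|rc|pre|dev|a|b)[.-]?(?P<pre_num>\d+)?)?'
)

versions = pd.Series(
    ['2.8.0', '2.8.0a1', '2.8.0a2', '2.9.0.dev', '2.9.0.2', '2.9.0.2']
)
parts = versions.str.extract(pat)

# Numeric components: a missing component counts as 0.
num_cols = ['major', 'minor', 'patch', 'micro', 'pre_num']
nums = parts[num_cols].fillna('0').astype(int)
changed = nums.diff().fillna(0).ne(0)

# The tag is text, so compare it with the previous row directly.
tag = parts['pre'].fillna('')
tag_changed = tag.ne(tag.shift().fillna(tag.iloc[0]))

# Fold tag and tag-number changes into a single "pre" flag.
flags = changed[['major', 'minor', 'patch', 'micro']].copy()
flags['pre'] = changed['pre_num'] | tag_changed

labels = flags.dot(flags.columns + '.').str.rstrip('.')
print(labels.tolist())
```

This version reports every component that changed, so 2.8.0a2 to 2.9.0.dev comes out as minor.pre rather than pre alone. If only one label per row is wanted, the flags would need an extra priority rule.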