I have a dataframe that looks like this:
api_spec_id label info_version commitDate
803 2.3.0 2019-09-12
803 2.4.1 2019-10-04
803 2.4.2 2019-10-07
803 2.5.3 2019-10-08
803 2.6.1 2019-10-08
803 2.6.3 2019-11-25
803 2.6.5 2019-12-10
803 2.6.6 2019-12-11
803 2.7.2 2019-12-11
803 2.8.0 2019-12-19
The packaging.version.Version class splits each version into the basic components major, minor, etc., but my dataset is a bit different: an upgrade can go from 2.5.3 to 2.6.1, so minor and patch change at the same time. I do not want to give such a change only one label, since that hinders my analysis and leads to bias. Similarly, an upgrade such as 2.3.0 to 2.4.1 changes two components at once, and the same can happen for major.minor and major.patch. The dataset has such odd conventions because it is crawled from real API data.
Apart from this, we also have cases where major.minor.patch are all equal and the versions differ only in a pre-release identifier such as alpha, beta, rc, dev or pre. In such cases I want to compare them according to the order alpha < beta < rc < pre < dev. In the exceptional case where this identifier is the same as well, I want to compare the number that follows it. There are also versions of the form 1.2.6.7, which I think can go in the same scheme as major.minor.patch.micro (micro here being the 4th number).
The expected output should be like this:
api_spec_id label info_version commitDate
803 - 2.3.0 2019-09-12
803 minor.patch 2.4.1 2019-10-04
803 patch 2.4.2 2019-10-07
803 minor.patch 2.5.3 2019-10-08
803 minor.patch 2.6.1 2019-10-08
803 patch 2.6.3 2019-11-25
803 patch 2.6.5 2019-12-10
803 patch 2.6.6 2019-12-11
803 minor.patch 2.7.2 2019-12-11
803 major 2.8.0 2019-12-19
803 pre 2.8.0a1 2019-12-24
803 pre 2.8.0a2 2019-12-27
803 pre 2.9.0.dev 2021-01-03
803 micro 2.9.0.2 2021-01-10
803 no change 2.9.0.2 2021-01-15
I parsed all these versions and they are in canonical form according to the PEP 440 regex, but I am not sure how to compare them, because a regex alone won't work here, so I am in a bit of a fix. Does anyone have ideas or suggestions on how to tackle this?
Update 1
As a second step, you can use the packaging package from the PyPA team:
# pip install packaging
import pandas as pd
from packaging.version import Version, parse

# Version exposes the first three release numbers as major, minor and micro
attrs = ['major', 'minor', 'micro', 'pre', 'post', 'dev', 'local']

def extract_version(ver):
    ver = parse(ver)  # or Version(ver)
    return pd.Series({attr: getattr(ver, attr) for attr in attrs}, dtype=str)

# packaging calls the third release number "micro"; rename it to "patch"
sem = df['info_version'].agg(extract_version).fillna('').rename(columns={'micro': 'patch'})
# Compare each row with the previous one (the first row is compared with itself)
diff = sem.ne(sem.shift().fillna(sem.iloc[0]))
df['label'] = diff.dot(sem.columns + '.').str.rstrip('.')
Output
api_spec_id info_version commitDate label
0 803 2.3.0 2019-09-12
1 803 2.4.1 2019-10-04 minor.patch
2 803 2.4.2 2019-10-07 patch
3 803 2.5.3 2019-10-08 minor.patch
4 803 2.6.1 2019-10-08 minor.patch
5 803 2.6.3 2019-11-25 patch
6 803 2.6.5 2019-12-10 patch
7 803 2.6.6 2019-12-11 patch
8 803 2.7.2 2019-12-11 minor.patch
9 803 2.8.0 2019-12-19 minor.patch
10 803 2.8.0a1 2019-12-24 pre
11 803 2.8.0a2 2019-12-27 pre
12 803 2.9.0.dev 2021-01-03 minor.pre.dev
13 803 2.9.0.2 2021-01-10 dev
14 803 2.9.0.2 2021-01-15
Other output:
>>> sem
   major minor patch       pre post dev local
0      2     3     0
1      2     4     1
2      2     4     2
3      2     5     3
4      2     6     1
5      2     6     3
6      2     6     5
7      2     6     6
8      2     7     2
9      2     8     0
10     2     8     0  ('a', 1)
11     2     8     0  ('a', 2)
12     2     9     0                  0
13     2     9     0
14     2     9     0
>>> diff
major minor patch pre post dev local
0 False False False False False False False
1 False True True False False False False
2 False False True False False False False
3 False True True False False False False
4 False True True False False False False
5 False False True False False False False
6 False False True False False False False
7 False False True False False False False
8 False True True False False False False
9 False True True False False False False
10 False False False True False False False
11 False False False True False False False
12 False True False True False True False
13 False False False False False True False
14 False False False False False False False
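One limitation of the attribute-based approach is that Version only exposes the first three release numbers as attributes, so the fourth number in 2.9.0.2 never shows up in the comparison. A minimal sketch that reads the full Version.release tuple instead is given below. It assumes that the fourth number should be called micro as in the question, that missing release numbers count as 0, and that pre-release and dev markers are lumped into one pre label. The helper names release_parts and label_change are made up for illustration.

```python
from packaging.version import Version

# Assumed label names; "micro" for the fourth number follows the question.
NAMES = ["major", "minor", "patch", "micro"]

def release_parts(ver: str) -> dict:
    """Split a PEP 440 version into named release numbers plus a pre/dev marker."""
    v = Version(ver)
    # Version.release is the full tuple of release numbers, e.g. (2, 9, 0, 2).
    # Pad with zeros so that 2.9.0 and 2.9.0.0 compare equal (an assumption).
    release = (list(v.release) + [0] * len(NAMES))[:len(NAMES)]
    parts = dict(zip(NAMES, release))
    # Lump pre-release and dev markers together,
    # e.g. (('a', 1), None) for 2.8.0a1 or (None, 0) for 2.9.0.dev.
    parts["pre"] = (v.pre, v.dev)
    return parts

def label_change(prev: str, curr: str) -> str:
    """Name every component that differs between two consecutive versions."""
    a, b = release_parts(prev), release_parts(curr)
    changed = [name for name in a if a[name] != b[name]]
    return ".".join(changed) if changed else "no change"

versions = ["2.7.2", "2.8.0", "2.8.0a1", "2.8.0a2", "2.9.0.dev", "2.9.0.2", "2.9.0.2"]
# The first row has no predecessor, so it gets "-" as in the expected output.
labels = ["-"] + [label_change(p, c) for p, c in zip(versions, versions[1:])]
print(labels)
```

On this sample, 2.9.0.dev is labelled minor.pre and the first 2.9.0.2 is labelled micro.pre, because the dev marker appears and then disappears again. Whether those rows should collapse to pre and micro, as in the expected output of the question, is a policy decision that this sketch leaves open.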
Original answer
As a starting point, you can use the following (assuming the dataframe is sorted by commitDate):
# Simple regex pattern (subset of PEP440)
pat = r'(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)(?:\.(?P<micro>\d+))?'
# Extract component version
sem = df['info_version'].str.extract(pat).fillna(0).astype(int)
# Check the difference
diff = sem.diff().fillna(0).ne(0)
# Use dot product
df['label'] = diff.dot(sem.columns + '.').str.rstrip('.')
Output:
>>> df
api_spec_id info_version commitDate label
0 803 2.3.0 2019-09-12
1 803 2.4.1 2019-10-04 minor.patch
2 803 2.4.2 2019-10-07 patch
3 803 2.5.3 2019-10-08 minor.patch
4 803 2.6.1 2019-10-08 minor.patch
5 803 2.6.3 2019-11-25 patch
6 803 2.6.5 2019-12-10 patch
7 803 2.6.6 2019-12-11 patch
8 803 2.7.2 2019-12-11 minor.patch
9 803 2.8.0 2019-12-19 minor.patch
Other output:
>>> sem
major minor patch micro
0 2 3 0 0
1 2 4 1 0
2 2 4 2 0
3 2 5 3 0
4 2 6 1 0
5 2 6 3 0
6 2 6 5 0
7 2 6 6 0
8 2 7 2 0
9 2 8 0 0
>>> diff
major minor patch micro
0 False False False False
1 False True True False
2 False False True False
3 False True True False
4 False True True False
5 False False True False
6 False False True False
7 False False True False
8 False True True False
9 False True True False
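The dot-product step can look opaque at first. In the matrix product, every True multiplied by a column label gives the label itself and every False gives an empty string (this is Python's bool * str behaviour), and the row-wise sum concatenates the pieces. The sketch below shows this on a small made-up boolean frame, next to a more explicit join that should give the same result.

```python
import pandas as pd

# Made-up boolean frame: True marks a component that changed in that row.
diff = pd.DataFrame(
    {
        "major": [False, False, False],
        "minor": [False, True, False],
        "patch": [False, True, True],
    }
)

# True * 'minor.' == 'minor.' and False * 'minor.' == '', so the row-wise
# sum of products concatenates the names of the True columns.
via_dot = diff.dot(diff.columns + ".").str.rstrip(".")

# Equivalent but more explicit: join the names of the True columns per row.
via_join = diff.apply(lambda row: ".".join(row.index[row.to_numpy()]), axis=1)

print(via_dot.tolist())
print(via_join.tolist())
```

The dot version is vectorised and compact; the join version is slower on large frames but easier to adapt, for example to emit "no change" for rows where nothing is True.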
Note: this does not handle pre-release versions yet, but the regex pattern can be extended to cover them.
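One possible extension of the pattern is sketched here with the standard re module; the same pattern string could be passed to str.extract as above. The tag and num groups, the RANK table and the helper names are assumptions made for illustration. The ranking follows the order given in the question (alpha < beta < rc < pre < dev), with a final release ranked highest. That differs from PEP 440, which sorts dev releases before alpha.

```python
import re

# Assumed extension of the pattern above: an optional pre-release tag and number.
PAT = re.compile(
    r"(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)"
    r"(?:\.(?P<micro>\d+))?"
    r"(?:[.\-_]?(?P<tag>alpha|beta|a|b|rc|pre|dev)[.\-_]?(?P<num>\d+)?)?"
)

# Order taken from the question; ranking a final release (no tag) highest
# is an assumption.
RANK = {"a": 0, "alpha": 0, "b": 1, "beta": 1, "rc": 2, "pre": 3, "dev": 4, None: 5}

def parse_version(ver: str) -> dict:
    m = PAT.fullmatch(ver)
    if m is None:
        raise ValueError(f"unsupported version string: {ver!r}")
    d = m.groupdict()
    # Missing numbers (no 4th component, no tag number) are treated as 0.
    parts = {k: int(d[k] or 0) for k in ("major", "minor", "patch", "micro")}
    # Keep tag rank and number together so a change in either shows up as "pre".
    parts["pre"] = (RANK[d["tag"]], int(d["num"] or 0))
    return parts

def sort_key(ver: str) -> tuple:
    p = parse_version(ver)
    return (p["major"], p["minor"], p["patch"], p["micro"], p["pre"])

def label_change(prev: str, curr: str) -> str:
    a, b = parse_version(prev), parse_version(curr)
    changed = [k for k in a if a[k] != b[k]]
    return ".".join(changed) or "no change"

versions = ["2.8.0", "2.8.0rc1", "2.8.0a2", "2.8.0a1", "2.8.0.dev", "2.9.0.2"]
print(sorted(versions, key=sort_key))
print(label_change("2.8.0a1", "2.8.0a2"))
```

Because the tag rank and its number are stored as one tuple, two versions with the same tag are ordered by the number that follows it, which matches the tie-breaking rule described in the question.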