I have a dataframe that looks like this:
api_spec_id commit_date info_version label
500 2020-07-22 1.1
500 2020-11-09 1.1
500 2020-11-16 1.1
500 2020-11-16 1.1
500 2020-11-23 1.1
500 2021-02-01 1.1
138641 2020-06-25 0.1.0 major
138641 2020-06-25 0.1.0
138641 2020-06-27 0.1.0
138641 2020-06-27 0.1.9 patch
138641 2020-06-27 0.1.10 patch
138641 2020-06-27 0.1.11 patch
138641 2020-06-27 0.1.13 patch
138641 2020-06-27 0.1.14 patch
138641 2020-06-27 0.1.15 patch
138641 2020-06-28 0.2.0 minor.patch
138641 2020-06-30 0.2.1 patch
138641 2020-07-01 0.3.0 minor.patch
138641 2020-07-08 0.4.0 minor
138641 2020-07-11 0.5.0 minor
138641 2020-07-12 0.6.0 minor
I am trying to compare the versions between consecutive rows and label the change. The problem is that the first commit_date of some api_spec_id values often gets a label as well, as we can see for the id 138641, when it should be empty because there is no previous row to compare to.
This is the code below. I suspect the issue comes from the sem step, because it extracts the version for all the rows at once, which could be affecting the diff. Then again, it works for some of the ids, which is strange, and I have not been able to debug it.
pat = r'(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)(?:\.(?P<micro>\d+))?'
sem = new['info_version'].str.extract(pat).fillna(0).astype(int)
diff = sem.diff().fillna(0).ne(0)
new['label'] = diff.dot(sem.columns + '.').str.rstrip('.')
The second part of the code is for all versions that are not valid semantic versions and have to be parsed with the Version class.
attrs = ['major', 'minor', 'micro', 'pre', 'post', 'dev', 'local']
def extract_version(ver):
    ver = Version(ver)
    return pd.Series({attr: getattr(ver, attr) for attr in attrs}, dtype=str)
sem = new['info_version'].agg(extract_version).fillna('').rename(columns={'micro': 'patch'})
diff = sem.ne(sem.shift().fillna(sem.iloc[0]))
new['label'] = diff.dot(sem.columns + '.').str.rstrip('.')
Is there any other way I could calculate the difference? Any suggestions or ideas would be really appreciated.
Based on one of your previous questions, I recommend grouping by the api_spec_id column when processing versions. Otherwise, the first row of each id is compared with the last row of the previous id:
api_spec_id commit_date info_version label
500 2021-02-01 1.1
138641 2020-06-25 0.1.0 major # <- without groupby
If you use groupby, the output will be:
api_spec_id commit_date info_version label
500 2021-02-01 1.1
138641 2020-06-25 0.1.0 # <- with groupby
So you should use:
pat = r'(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)(?:\.(?P<micro>\d+))?'
sem = new['info_version'].str.extract(pat).fillna(0).astype(int)
# diff = sem.diff().fillna(0).ne(0)
diff = sem.groupby(new['api_spec_id']).diff().fillna(0).ne(0)
new['label'] = diff.dot(sem.columns + '.').str.rstrip('.')
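To see the effect in isolation, here is a small self-contained sketch. The six-row dataframe is a hypothetical miniature of yours, and I fill unmatched regex groups with the string "0" rather than the integer 0 so the columns stay homogeneous before the cast; the rest follows the code above.

```python
import pandas as pd

# Hypothetical miniature of the `new` dataframe from the question.
new = pd.DataFrame({
    "api_spec_id": [500, 500, 138641, 138641, 138641, 138641],
    "info_version": ["1.1", "1.1", "0.1.0", "0.1.0", "0.1.9", "0.2.0"],
})

pat = r'(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)(?:\.(?P<micro>\d+))?'

# "1.1" does not match the three-part pattern, so it becomes all zeros.
sem = new["info_version"].str.extract(pat).fillna("0").astype(int)

def to_label(diff):
    # Join the names of the columns that changed, e.g. "minor.patch".
    return diff.dot(sem.columns + ".").str.rstrip(".")

# Plain diff: the first row of 138641 is compared with the last row of 500.
label_plain = to_label(sem.diff().fillna(0).ne(0))

# Grouped diff: the first row of every id has nothing to compare against.
label_grouped = to_label(sem.groupby(new["api_spec_id"]).diff().fillna(0).ne(0))

new["label"] = label_grouped
print(pd.DataFrame({"plain": label_plain, "grouped": label_grouped}))
```

Only the row at index 2, the first row of 138641, should differ between the two columns: the plain diff labels it, the grouped diff leaves it empty.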
For the second part of your code, it is exactly the same problem:
attrs = ['major', 'minor', 'micro', 'pre', 'post', 'dev', 'local']
def extract_version(ver):
    ver = Version(ver)
    return pd.Series({attr: getattr(ver, attr) for attr in attrs}, dtype=str)
sem = new['info_version'].agg(extract_version).fillna('').rename(columns={'micro': 'patch'})
# diff = sem.ne(sem.shift().fillna(sem.iloc[0]))
diff = (sem.groupby(new['api_spec_id'], group_keys=False)
           .apply(lambda x: x.ne(x.shift().fillna(x.iloc[0]))))
new['label'] = diff.dot(sem.columns + '.').str.rstrip('.')
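A runnable sketch of the same idea, with a few assumptions of mine: Version is taken to be packaging.version.Version (its attributes match your attrs list), the five-row dataframe is a made-up sample, the components are built as a list of dicts instead of going through agg, and a grouped shift replaces the groupby/apply. It is a variant of the code above, not a drop-in copy.

```python
import pandas as pd
from packaging.version import Version  # assumption: this is the Version class in use

# Hypothetical miniature of the `new` dataframe from the question.
new = pd.DataFrame({
    "api_spec_id": [500, 500, 138641, 138641, 138641],
    "info_version": ["1.1", "1.1", "0.1.0", "0.1.9", "0.2.0"],
})

attrs = ["major", "minor", "micro", "pre", "post", "dev", "local"]

def extract_version(ver):
    # Store every component as a string so all columns compare uniformly.
    ver = Version(ver)
    return {attr: str(getattr(ver, attr)) for attr in attrs}

sem = pd.DataFrame(
    [extract_version(v) for v in new["info_version"]], index=new.index
).rename(columns={"micro": "patch"})

def to_label(diff):
    # Join the names of the columns that changed, e.g. "minor.patch".
    return diff.dot(sem.columns + ".").str.rstrip(".")

# Ungrouped comparison: 0.1.0 is compared with 1.1 from the previous id.
label_plain = to_label(sem.ne(sem.shift().fillna(sem.iloc[0])))

# Grouped comparison: the first row of each id is compared with itself.
prev = sem.groupby(new["api_spec_id"]).shift()
label_grouped = to_label(sem.ne(prev.fillna(sem)))

new["label"] = label_grouped
print(pd.DataFrame({"plain": label_plain, "grouped": label_grouped}))
```

In the ungrouped column the first row of 138641 comes out as major, which is the symptom in your data; in the grouped column it stays empty.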
In both cases, the output is now:
>>> new
api_spec_id commit_date info_version label
0 500 2020-07-22 1.1
1 500 2020-11-09 1.1
2 500 2020-11-16 1.1
3 500 2020-11-16 1.1
4 500 2020-11-23 1.1
5 500 2021-02-01 1.1
6 138641 2020-06-25 0.1.0
7 138641 2020-06-25 0.1.0
8 138641 2020-06-27 0.1.0
9 138641 2020-06-27 0.1.9 patch
10 138641 2020-06-27 0.1.10 patch
11 138641 2020-06-27 0.1.11 patch
12 138641 2020-06-27 0.1.13 patch
13 138641 2020-06-27 0.1.14 patch
14 138641 2020-06-27 0.1.15 patch
15 138641 2020-06-28 0.2.0 minor.patch
16 138641 2020-06-30 0.2.1 patch
17 138641 2020-07-01 0.3.0 minor.patch
18 138641 2020-07-08 0.4.0 minor
19 138641 2020-07-11 0.5.0 minor
20 138641 2020-07-12 0.6.0 minor