I have a pandas DataFrame containing long-term data:
point_id issue_date latitude longitude rainfall
0 1.0 2020-01-01 6.5 66.50 NaN
1 2.0 2020-01-02 6.5 66.75 NaN
... ... ... ... ... ... ... ...
6373888 17414.0 2020-12-30 38.5 99.75 NaN
6373889 17415.0 2020-12-31 38.5 100.00 NaN
6373890 rows × 5 columns
I want to extract the Standard Meteorological Week from its issue_date column, as given in this figure.
I have tried two approaches.
First:
lulc_gdf['smw'] = lulc_gdf['issue_date'].astype('datetime64[ns]').dt.strftime('%V')
Second:
lulc_gdf['iso'] = lulc_gdf['issue_date'].astype('datetime64[ns]').dt.isocalendar().week
The output is the same in both cases:
point_id issue_date latitude longitude rainfall smw iso
0 1.0 2020-01-01 6.5 66.50 NaN 01 1
1 2.0 2020-01-02 6.5 66.75 NaN 01 1
... ... ... ... ... ... ... ...
6373888 17414.0 2020-12-30 38.5 99.75 NaN 53 53
6373889 17415.0 2020-12-31 38.5 100.00 NaN 53 53
6373890 rows × 7 columns
The issue is that both methods count weeks from a fixed starting weekday (Sunday or Monday), regardless of the weekday on which the year begins.
For example, 1 January 2020 is a Wednesday (not a Monday), so the first week of 2020 contains only 5 days (Wed, Thu, Fri, Sat and Sun):
year week day issue_date
0 2020 1 3 2020-01-01
1 2020 1 4 2020-01-02
2 2020 1 5 2020-01-03
3 2020 1 6 2020-01-04
... ... ... ...
6373889 2020 53 4 2020-12-31
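The five-day first week described above can be reproduced on a handful of dates. This is a small sketch that uses only `Series.dt.isocalendar()`, which the question already calls; the variable names are illustrative.

```python
import pandas as pd

# First eight days of 2020; 1 January 2020 is a Wednesday.
dates = pd.Series(pd.date_range('2020-01-01', '2020-01-08'))
iso = dates.dt.isocalendar()

# ISO weeks run Monday to Sunday, so within 2020 week 1 covers only
# 1-5 January, and week 2 already begins on Monday, 6 January.
print(pd.concat([dates.rename('issue_date'), iso], axis=1))

days_in_week_1 = int((iso['week'] == 1).sum())
print(days_in_week_1)
```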
For Standard Meteorological Weeks, however, I want the following for every year:
Week 1: 1 January to 7 January
Week 2: 8 January to 14 January
Week 3: 15 January to 21 January
and so on, irrespective of the weekday on which the year starts (Sunday, Monday, etc.).
How can I do this?
Use:
import numpy as np
import pandas as pd

df = pd.DataFrame({'issue_date': pd.date_range('2000-01-01', '2000-12-31')})

# inspired by https://stackoverflow.com/a/61592907/2901002
# week number for each day of a normal year; week 52 absorbs the last day
normal_year = np.append(np.arange(363) // 7 + 1, np.repeat(52, 5))
# leap year: add one more week-9 entry so that 29 February falls in week 9
leap_year = np.concatenate((normal_year[:59], [9], normal_year[59:366]))

days = df['issue_date'].dt.dayofyear
df['smw'] = np.where(df['issue_date'].dt.is_leap_year,
                     leap_year[days - 1],
                     normal_year[days - 1])
print (df[df['smw'] == 9])
issue_date smw
56 2000-02-26 9
57 2000-02-27 9
58 2000-02-28 9
59 2000-02-29 9
60 2000-03-01 9
61 2000-03-02 9
62 2000-03-03 9
63 2000-03-04 9
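For reuse, the lookup-table approach above could be wrapped in a function. The following is a minimal sketch, not the answer's original code: the name `smw_from_dates` is hypothetical, and the tables are trimmed to exactly 365 and 366 entries, with the index clipped for the shorter one. The sample dates check that week 1 ends on 7 January, that week 52 takes the leftover days, and that week 9 gains 29 February in leap years.

```python
import numpy as np
import pandas as pd

def smw_from_dates(dates: pd.Series) -> pd.Series:
    """Hypothetical helper: week 1 = 1-7 Jan, ..., week 52 absorbs the year end."""
    # 365 entries: weeks 1..51 have 7 days, week 52 has 8 (24-31 December).
    normal_year = np.minimum(np.arange(365) // 7 + 1, 52)
    # 366 entries: repeat week 9 once more so it also covers 29 February.
    leap_year = np.insert(normal_year, 59, 9)

    idx = dates.dt.dayofyear.to_numpy() - 1
    is_leap = dates.dt.is_leap_year.to_numpy()
    # np.where evaluates both branches, so clip the index for the shorter table.
    weeks = np.where(is_leap, leap_year[idx], normal_year[np.minimum(idx, 364)])
    return pd.Series(weeks, index=dates.index, name='smw')

dates = pd.Series(pd.to_datetime(
    ['2021-01-07', '2021-01-08', '2021-12-31',
     '2020-02-29', '2020-03-05', '2020-12-31']))
smw = smw_from_dates(dates)
print(smw.tolist())  # [1, 2, 52, 9, 10, 52]
```

Building the leap-year table with `np.insert` keeps the two tables in step: every date after 29 February simply shifts by one position.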
Performance:
#11323 rows
df = pd.DataFrame({'issue_date': pd.date_range('2000-01-01','2030-12-31')})
In [6]: %%timeit
...: normal_year = np.append(np.arange(363) // 7 + 1, np.repeat(52, 5))
...: leap_year = np.concatenate((normal_year[:59], [9], normal_year[59:366]))
...: days = df['issue_date'].dt.dayofyear
...:
...: df['smw'] = np.where(df['issue_date'].dt.is_leap_year, leap_year[days - 1], normal_year[days - 1])
...:
3.51 ms ± 154 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [7]: %%timeit
...: df['smw1'] = get_smw(df['issue_date'])
...:
17.2 ms ± 312 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
#51500 rows
df = pd.DataFrame({'issue_date': pd.date_range('1900-01-01','2040-12-31')})
In [9]: %%timeit
...: normal_year = np.append(np.arange(363) // 7 + 1, np.repeat(52, 5))
...: leap_year = np.concatenate((normal_year[:59], [9], normal_year[59:366]))
...: days = df['issue_date'].dt.dayofyear
...:
...: df['smw'] = np.where(df['issue_date'].dt.is_leap_year, leap_year[days - 1], normal_year[days - 1])
...:
...:
11.9 ms ± 1.47 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [10]: %%timeit
...: df['smw1'] = get_smw(df['issue_date'])
...:
...:
64.3 ms ± 483 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
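The `%%timeit` cells above need IPython. To repeat the measurement in a plain script, something like the following sketch could be used. `add_smw` is a hypothetical wrapper around the lookup-table code, and the repeat counts are chosen arbitrarily to keep the run short. Timings depend on the machine, so no expected numbers are given. `get_smw` is not defined in this post and is therefore left out.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({'issue_date': pd.date_range('2000-01-01', '2030-12-31')})

def add_smw(frame: pd.DataFrame) -> None:
    """Hypothetical wrapper around the lookup-table code being timed."""
    normal_year = np.append(np.arange(363) // 7 + 1, np.repeat(52, 5))
    leap_year = np.concatenate((normal_year[:59], [9], normal_year[59:366]))
    days = frame['issue_date'].dt.dayofyear.to_numpy()
    frame['smw'] = np.where(frame['issue_date'].dt.is_leap_year,
                            leap_year[days - 1],
                            normal_year[days - 1])

# Best of 5 repeats of 20 calls each, reported per call in milliseconds.
runs = timeit.repeat(lambda: add_smw(df), repeat=5, number=20)
best_ms = min(runs) / 20 * 1e3
print(f'{len(df)} rows: {best_ms:.2f} ms per call')
```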