Fetching Standard Meteorological Week from pandas dataframe date column

I have a pandas dataframe which is having long term data,


        point_id     issue_date  latitude   longitude   rainfall    
0           1.0      2020-01-01    6.5          66.50      NaN      
1           2.0      2020-01-02    6.5          66.75      NaN       

... ... ... ... ... ... ... ...


6373888   17414.0    2020-12-30   38.5          99.75      NaN       
6373889   17415.0    2020-12-31   38.5          100.00     NaN  


6373890 rows × 5 columns

I want to extract the Standard Meteorological Week from its issue_date column, as given in this figure.

I have tried in 2 ways.

1st

lulc_gdf['smw'] = lulc_gdf['issue_date'].astype('datetime64[ns]').dt.strftime('%V')

2nd

lulc_gdf['iso'] = lulc_gdf['issue_date'].astype('datetime64[ns]').dt.isocalendar().week

The output in both cases is same


        point_id     issue_date  latitude   longitude   rainfall    smw   iso
0           1.0      2020-01-01    6.5          66.50      NaN       01    1
1           2.0      2020-01-02    6.5          66.75      NaN       01    1

... ... ... ... ... ... ... ...


6373888   17414.0    2020-12-30   38.5          99.75      NaN       53   53
6373889   17415.0    2020-12-31   38.5          100.00     NaN       53   53


6373890 rows × 7 columns

The issue is that the week starts here by taking reference of Sunday or Monday as the starting day of week, irrespective of year.

Like here in case of year 2020 the day on 1st January is Wednesday (not Monday), so the 1st week is of 5 days only i.e (Wed, Thu, Fri, Sat & Sunday).

            year    week    day    issue_date
0           2020      1      3     2020-01-01
1           2020      1      4     2020-01-02
2           2020      1      5     2020-01-03
3           2020      1      6     2020-01-04
... ... ... ...

6373889     2020      53     4     2020-12-31

But in the case of Standard Meteorological Weeks, I want output as: for every year

1st week should always be from - 1st January to 07th January
2nd week from -                  8th January to 14th January 
3rd week from -                  15th January to 21st January 
------------------------------- and so on

irrespective of the starting day of year (Sunday, monday etc).

How to do so?

Solution

Use:

df  = pd.DataFrame({'issue_date': pd.date_range('2000-01-01','2000-12-31')})
    
#inspire https://stackoverflow.com/a/61592907/2901002
normal_year = np.append(np.arange(363) // 7 + 1, np.repeat(52, 5))
leap_year  = np.concatenate((normal_year[:59], [9], normal_year[59:366]))
days = df['issue_date'].dt.dayofyear

df['smw'] = np.where(df['issue_date'].dt.is_leap_year, 
                     leap_year[days - 1], 
                     normal_year[days - 1])
print (df[df['smw'] == 9])
   issue_date  smw
56 2000-02-26    9
57 2000-02-27    9
58 2000-02-28    9
59 2000-02-29    9
60 2000-03-01    9
61 2000-03-02    9
62 2000-03-03    9
63 2000-03-04    9

Performance:

#11323 rows
df  = pd.DataFrame({'issue_date': pd.date_range('2000-01-01','2030-12-31')})


In [6]: %%timeit
   ...: normal_year = np.append(np.arange(363) // 7 + 1, np.repeat(52, 5))
   ...: leap_year  = np.concatenate((normal_year[:59], [9], normal_year[59:366]))
   ...: days = df['issue_date'].dt.dayofyear
   ...: 
   ...: df['smw'] = np.where(df['issue_date'].dt.is_leap_year,  leap_year[days - 1],  normal_year[days - 1])
   ...: 
3.51 ms ± 154 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [7]: %%timeit
   ...: df['smw1'] = get_smw(df['issue_date'])
   ...: 
17.2 ms ± 312 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

#51500 rows
df  = pd.DataFrame({'issue_date': pd.date_range('1900-01-01','2040-12-31')})


In [9]: %%timeit
   ...: normal_year = np.append(np.arange(363) // 7 + 1, np.repeat(52, 5))
   ...: leap_year  = np.concatenate((normal_year[:59], [9], normal_year[59:366]))
   ...: days = df['issue_date'].dt.dayofyear
   ...: 
   ...: df['smw'] = np.where(df['issue_date'].dt.is_leap_year,  leap_year[days - 1],  normal_year[days - 1])
   ...: 
   ...: 
11.9 ms ± 1.47 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [10]: %%timeit
    ...: df['smw1'] = get_smw(df['issue_date'])
    ...: 
    ...: 
64.3 ms ± 483 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)