I am a beginner with Python and pandas.
The following is part of my pandas DataFrame.
I want to extract all the dates in the INTERVAL_DATETIME column,
like 16/11/2020, 31/05/2020, ..., 21/11/2023, without any duplicates.
I could not find a way to do this.
I would be thankful if someone could help me.
Use:
import pandas as pd

# sample data in day/month/year format
df = pd.DataFrame({'INTERVAL_DATETIME': ['6/11/2020 16:30',
                                         '6/11/2020 17:00',
                                         '31/05/2020 18:00',
                                         '31/05/2020 18:30',
                                         '21/11/2023 18:00']})
print (df)
INTERVAL_DATETIME
0 6/11/2020 16:30
1 6/11/2020 17:00
2 31/05/2020 18:00
3 31/05/2020 18:30
4 21/11/2023 18:00
L = df.INTERVAL_DATETIME.str.split().str[0].unique().tolist()
print (L)
['6/11/2020', '31/05/2020', '21/11/2023']
Explanation:
First use Series.str.split to split on whitespace (the default, so no sep parameter is needed). The output is a Series of lists:
print (df.INTERVAL_DATETIME.str.split())
0 [6/11/2020, 16:30]
1 [6/11/2020, 17:00]
2 [31/05/2020, 18:00]
3 [31/05/2020, 18:30]
4 [21/11/2023, 18:00]
Name: INTERVAL_DATETIME, dtype: object
Then select the first element of each list by indexing with str[0]:
print (df.INTERVAL_DATETIME.str.split().str[0])
0 6/11/2020
1 6/11/2020
2 31/05/2020
3 31/05/2020
4 21/11/2023
Name: INTERVAL_DATETIME, dtype: object
Get the unique values as a NumPy array with Series.unique:
print (df.INTERVAL_DATETIME.str.split().str[0].unique())
['6/11/2020' '31/05/2020' '21/11/2023']
... and finally convert to a list:
print (df.INTERVAL_DATETIME.str.split().str[0].unique().tolist())
['6/11/2020', '31/05/2020', '21/11/2023']
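The steps above can be bundled into a small helper. This is only a minimal sketch: the function name unique_date_strings is made up for illustration and is not part of pandas, and it assumes every value is a "date time" string separated by whitespace.

```python
import pandas as pd


def unique_date_strings(series):
    """Return the unique date parts of 'date time' strings,
    in order of first appearance (hypothetical helper)."""
    # split on whitespace, keep the first piece (the date),
    # then drop duplicates and convert to a plain list
    return series.str.split().str[0].unique().tolist()


df = pd.DataFrame({'INTERVAL_DATETIME': ['6/11/2020 16:30',
                                         '6/11/2020 17:00',
                                         '31/05/2020 18:00',
                                         '31/05/2020 18:30',
                                         '21/11/2023 18:00']})

dates = unique_date_strings(df['INTERVAL_DATETIME'])
print(dates)
```

Series.unique keeps the order in which values first appear, so the result follows the row order of the DataFrame rather than chronological order.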
Or use Series.str.extract to get the values before the first space:
L = df.INTERVAL_DATETIME.str.extract(r'^(.*)\s+', expand=False).unique().tolist()
print (L)
['6/11/2020', '31/05/2020', '21/11/2023']
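One thing to keep in mind with this pattern: .* is greedy, so it captures everything up to the last run of whitespace, not the first. For the sample data that makes no difference, because each value contains a single space. The sketch below uses a made-up value with an AM/PM suffix (an assumption, not something from the question) to show the difference, and a possible alternative pattern ^(\S+) that always takes the first token.

```python
import pandas as pd

# the second value is hypothetical, used only to show the greedy behaviour
s = pd.Series(['6/11/2020 16:30', '6/11/2020 4:30 PM'])

# greedy: captures up to the LAST whitespace run
greedy = s.str.extract(r'^(.*)\s+', expand=False).tolist()

# first token only: captures the leading run of non-whitespace characters
first_token = s.str.extract(r'^(\S+)', expand=False).tolist()

print(greedy)
print(first_token)
```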
If you need the dates as strings in YYYY-MM-DD format, use to_datetime with Series.dt.strftime:
L = (pd.to_datetime(df.INTERVAL_DATETIME, dayfirst=True)
.dt.strftime('%Y-%m-%d').unique().tolist())
print (L)
['2020-11-06', '2020-05-31', '2023-11-21']
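The dayfirst=True argument matters here because a value such as 6/11/2020 is ambiguous: by default it is read month first (11 June), while the data is day first (6 November). The sketch below illustrates this on a single value and, as one possible alternative, passes an explicit format string. The format '%d/%m/%Y %H:%M' is an assumption based on the sample rows.

```python
import pandas as pd

# a single ambiguous value parsed both ways
month_first = pd.to_datetime('6/11/2020 16:30')
day_first = pd.to_datetime('6/11/2020 16:30', dayfirst=True)
print(month_first.month, day_first.month)

# an explicit format removes the ambiguity for the whole column
s = pd.Series(['6/11/2020 16:30', '31/05/2020 18:00'])
parsed = pd.to_datetime(s, format='%d/%m/%Y %H:%M')
iso_dates = parsed.dt.strftime('%Y-%m-%d').unique().tolist()
print(iso_dates)
```

With an explicit format, any value that does not match raises an error instead of being silently parsed in the wrong order.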
For Python date objects, use Series.dt.date instead of strftime:
L = pd.to_datetime(df.INTERVAL_DATETIME, dayfirst=True).dt.date.unique().tolist()
print (L)
[datetime.date(2020, 11, 6), datetime.date(2020, 5, 31), datetime.date(2023, 11, 21)]
For completeness, Series.dt.normalize gives Timestamps without times (that is, with the time set to 00:00:00):
L = pd.to_datetime(df.INTERVAL_DATETIME, dayfirst=True).dt.normalize().unique().tolist()
print (L)
[Timestamp('2020-11-06 00:00:00'),
Timestamp('2020-05-31 00:00:00'),
Timestamp('2023-11-21 00:00:00')]