The Yelp dataset provides check-in information as strings:
Business_id | Date |
---|---|
A | 2010-04-22 05:31:33, 2010-05-09 18:24:50,... |
B | 2010-03-07 02:04:38, 2010-04-11 01:45:57,2014-05-02 18:40:35, 2014-05-06 17:59:33,... |
I want to calculate the daily number of check-in for each business.
Let suppose your data in a text file or a CSV file:
Sample Data
A,"2010-03-07 02:04:38,2010-04-11 01:45:57,2014-05-02 18:40:35,2014-05-06 17:59:33,2021-07-06 08:02:15,2021-07-06 10:01:18"
B,"2010-03-07 02:04:38,2010-04-11 01:45:57,2014-05-02 18:40:35,2014-05-06 17:59:33,2014-05-07 18:02:33,2021-07-06 08:05:15,2021-07-06 10:01:20"
C,"2010-03-07 02:04:38,2010-04-11 01:45:57,2014-05-02 18:40:35,2014-05-06 17:59:33,2014-05-08 16:05:20,2014-05-08 17:06:10,2021-07-06 10:01:19,2021-07-06 08:02:30,2021-07-06 10:01:20,2021-07-06 10:01:28"
You could read the data into a Dataframe and attempt the following:
df = pd.read_csv(r"/dir/filepath/filename.txt", header=None, delimiter=',')
df.columns = ["B_id", "Date"]
# explode converts the list into separate rows
df = df.assign(Date= df.Date.str.split(',')).explode("Date")
df["Date"] = pd.to_datetime(df["Date"])
print(df)
today = datetime.today().date()
today_df = df[df["Date"].dt.date == today]
grouped_df = today_df.groupby("B_id")["Date"].count()
grouped_df.head()
The Output after .explode():
B_id Date
0 A 2010-03-07 02:04:38
0 A 2010-04-11 01:45:57
0 A 2014-05-02 18:40:35
0 A 2014-05-06 17:59:33
0 A 2021-07-06 08:02:15
0 A 2021-07-06 10:01:18
1 B 2010-03-07 02:04:38
1 B 2010-04-11 01:45:57
1 B 2014-05-02 18:40:35
1 B 2014-05-06 17:59:33
1 B 2014-05-07 18:02:33
1 B 2021-07-06 08:05:15
1 B 2021-07-06 10:01:20
2 C 2010-03-07 02:04:38
2 C 2010-04-11 01:45:57
2 C 2014-05-02 18:40:35
2 C 2014-05-06 17:59:33
2 C 2014-05-08 16:05:20
2 C 2014-05-08 17:06:10
The final output:
B_id
A 2
B 2
C 4