
Python string to datetime - Yelp data


The Yelp dataset provides check-in information as strings:

Business_id Date
A 2010-04-22 05:31:33, 2010-05-09 18:24:50,...
B 2010-03-07 02:04:38, 2010-04-11 01:45:57,2014-05-02 18:40:35, 2014-05-06 17:59:33,...

I want to calculate the daily number of check-ins for each business.


Solution

  • Suppose your data is in a text file or a CSV file:

    Sample Data

    A,"2010-03-07 02:04:38,2010-04-11 01:45:57,2014-05-02 18:40:35,2014-05-06 17:59:33,2021-07-06 08:02:15,2021-07-06 10:01:18"
    B,"2010-03-07 02:04:38,2010-04-11 01:45:57,2014-05-02 18:40:35,2014-05-06 17:59:33,2014-05-07 18:02:33,2021-07-06 08:05:15,2021-07-06 10:01:20"
    C,"2010-03-07 02:04:38,2010-04-11 01:45:57,2014-05-02 18:40:35,2014-05-06 17:59:33,2014-05-08 16:05:20,2014-05-08 17:06:10,2021-07-06 10:01:19,2021-07-06 08:02:30,2021-07-06 10:01:20,2021-07-06 10:01:28"
    

    You can read the data into a DataFrame, split each date string into one row per check-in, and then count today's check-ins for each business:

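    from datetime import datetime

    import pandas as pd

    # No header row; the quoted second field holds every timestamp for a business
    df = pd.read_csv(r"/dir/filepath/filename.txt", header=None, delimiter=',')
    df.columns = ["B_id", "Date"]

    # Split the string into a list, then explode the list into one row per timestamp
    df = df.assign(Date=df.Date.str.split(',')).explode("Date")
    df["Date"] = pd.to_datetime(df["Date"])
    print(df)

    # Keep only today's check-ins and count them per business
    today = datetime.today().date()
    today_df = df[df["Date"].dt.date == today]
    grouped_df = today_df.groupby("B_id")["Date"].count()
    print(grouped_df.head())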
    

    The output after .explode():

        B_id    Date
    0   A   2010-03-07 02:04:38
    0   A   2010-04-11 01:45:57
    0   A   2014-05-02 18:40:35
    0   A   2014-05-06 17:59:33
    0   A   2021-07-06 08:02:15
    0   A   2021-07-06 10:01:18
    1   B   2010-03-07 02:04:38
    1   B   2010-04-11 01:45:57
    1   B   2014-05-02 18:40:35
    1   B   2014-05-06 17:59:33
    1   B   2014-05-07 18:02:33
    1   B   2021-07-06 08:05:15
    1   B   2021-07-06 10:01:20
    2   C   2010-03-07 02:04:38
    2   C   2010-04-11 01:45:57
    2   C   2014-05-02 18:40:35
    2   C   2014-05-06 17:59:33
    2   C   2014-05-08 16:05:20
    2   C   2014-05-08 17:06:10
    2   C   2021-07-06 10:01:19
    2   C   2021-07-06 08:02:30
    2   C   2021-07-06 10:01:20
    2   C   2021-07-06 10:01:28
    

    The final output, which matches the 2021-07-06 rows in the sample data:

    B_id  
    A     2
    B     2
    C     4
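
The code above counts only the check-ins that fall on the current date. If the goal is the daily count for every date in the data, one possible variation is to group by both the business and the calendar day. The following is a minimal sketch under a few assumptions: the helper name `daily_checkins`, the `Day` and `checkins` labels, and the in-memory `SAMPLE` string (a trimmed subset of the sample data, with one `", "` separator as in the original question) are all illustrative and not part of the original answer. `ignore_index=True` on `explode` needs pandas 1.1 or later.

```python
import io
from datetime import date

import pandas as pd

# Trimmed subset of the sample data; a real run would read the file instead.
SAMPLE = '''A,"2010-03-07 02:04:38, 2021-07-06 08:02:15,2021-07-06 10:01:18"
C,"2014-05-08 16:05:20,2014-05-08 17:06:10,2021-07-06 10:01:19"
'''


def daily_checkins(raw: pd.DataFrame) -> pd.Series:
    """Count check-ins per business per calendar day (hypothetical helper)."""
    # One row per timestamp, with a fresh 0..n-1 index.
    rows = raw.assign(Date=raw["Date"].str.split(",")).explode(
        "Date", ignore_index=True
    )
    # Strip stray spaces left by ", " separators before parsing.
    stamps = pd.to_datetime(rows["Date"].str.strip(), format="%Y-%m-%d %H:%M:%S")
    rows["Day"] = stamps.dt.date
    return rows.groupby(["B_id", "Day"]).size().rename("checkins")


raw = pd.read_csv(io.StringIO(SAMPLE), header=None, names=["B_id", "Date"])
counts = daily_checkins(raw)
print(counts)

# Look up a single day without depending on when the script is run.
table = counts.reset_index()
print(table[table["Day"] == date(2021, 7, 6)])
```

Grouping on the day rather than filtering on `datetime.today()` keeps the result reproducible, since the output no longer depends on the date on which the script runs.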