Search code examples
pythonpandasdataframedata-sciencedata-analysis

Filtering pandas dataframe by date to count views for timeline of programs


I need to count viewers by program for a streaming channel from a json logfile. I identify the programs by their starttimes, such as:

So far I have two Dataframes like this:

The first one contains all the timestamps from the logfile

viewers_from_log = pd.read_json('sqllog.json', encoding='UTF-8')
# Convert date string to pandas datetime object:
viewers_from_log['time'] = pd.to_datetime(viewers_from_log['time'])

enter image description here

Source JSON file:

[
    {
        "logid": 191605,
        "time": "0:00:17"
    },
    {
        "logid": 191607,
        "time": "0:00:26"
    },
    {
        "logid": 191611,
        "time": "0:01:20"
    }
]

The second contains the starting times and titles of the programs

programs_start_time = pd.DataFrame.from_dict('programs.json', orient='index')

enter image description here Source JSON file:

{
    "2019-05-29": [
        {
            "title": "\"Amiről a kövek mesélnek\"",
            "startTime_dt": "2019-05-29T00:00:40Z"
        },
        {
            "title": "Koffer - Kedvcsináló Kul(t)túrák Külföldön",
            "startTime_dt": "2019-05-29T00:22:44Z"
        },
        {
            "title": "Gubancok",
            "startTime_dt": "2019-05-29T00:48:08Z"
        }
    ]
}

So what I need to do is to count the entries / program in the log file and link them to the program titles.


Solution

  • My approach is to slice log data for each date range from program data and get the shape. Next add column for program data with results:

    import pandas as pd
    
    # setup test data
    log_data = {'Time': ['2019-05-30 00:00:26', '2019-05-30 00:00:50', '2019-05-30 00:05:50','2019-05-30 00:23:26']}
    log_data = pd.DataFrame(data=log_data)
    
    program_data = {'Time': ['2019-05-30 00:00:00', '2019-05-30 00:22:44'],
                'Program': ['Program 1', 'Program 2']}
    program_data = pd.DataFrame(data=program_data)
    
    
    counts = []
    for index, row in program_data.iterrows():
        # get counts on selected range
        try:
            log_range = log_data[(log_data['Time'] > program_data.loc[index].values[0]) & (log_data['Time'] < program_data.loc[index+1].values[0])]
            counts.append(log_range.shape[0])
        except:
            log_range = log_data[log_data['Time'] > program_data.loc[index].values[0]]
            counts.append(log_range.shape[0])
    
     # add aditional column with collected counts
     program_data['Counts'] = counts
    

    Output:

                      Time    Program  Counts
     0  2019-05-30 00:00:00  Program 1       3
     1  2019-05-30 00:22:44  Program 2       1