Search code examples
pythonpandasdictionarygoogle-analytics-apilucidchart

Pandas convert list of dictionaries (GA output) into dataframe(s) that makes sense


I have been struggling with this logic. I am getting data from google analytics in this format, per user per site. (So these are all the activities that one user performed on the site) I cannot change the format in which I receive the data.

PROBLEM: I am running a loop through all of the users and get this output for each user. I want to put this data into a data frame in order to use it later. The problem I have, is the 'activities':[{.....},{......}] part, I cannot figure out to store all of this data in a way that makes sense.

{'sampleRate': 1,
 'sessions': [{'activities': [{'activityTime': '2020-01-08T16:00:44.399101Z',
                               'activityType': 'PAGEVIEW',
                               'campaign': '(not set)',
                               'channelGrouping': 'Direct',
                               'customDimension': [{'index': 1}],
                               'hostname': 'company.domain.com',
                               'keyword': '(not set)',
                               'landingPagePath': '/login',
                               'medium': '(none)',
                               'pageview': {'pagePath': '/thepath',
                                            'pageTitle': 'thecurrentwebpage'},
                               'source': '(direct)'},
                              {'activityTime': '2020-01-08T15:58:43.077293Z',
                               'activityType': 'PAGEVIEW',
                               'campaign': '(not set)',
                               'channelGrouping': 'Direct',
                               'customDimension': [{'index': 1}],
                               'hostname': 'company.domain.com',
                               'keyword': '(not set)',
                               'landingPagePath': '/login',
                               'medium': '(none)',
                               'pageview': {'pagePath': '/theotherpath',
                                            'pageTitle': 'thecurrentwebpage'},
                               'source': '(direct)'}],
               'dataSource': 'web',
               'deviceCategory': 'desktop',
               'platform': 'Windows',
               'sessionDate': '2020-01-08',
               'sessionId': '1578491x03d'},
              {'activities': [{'activityTime': '2019-12-28T21:58:48.993944Z',
                               'activityType': 'PAGEVIEW',
                               'campaign': '(not set)',.....

EXPECTED OUTPUT:

For each user's data to be stored in tables organised as such: Lucid chart brainstorming of ERD layouts

If there is some logic error in the picture, I am happy to change what I have. I just need the data to work.

PS: I need to use SQL and ERD's in LucidChart, I have never before had to manipulate data in this format. Any help, to get the data -that is structured like the example above- into a dataframe(s).

EDITED:

Example of two different types of activities (the activity is always classified as either 'pageview' or 'event'):

{'activityTime':
                               # Pageview activity
                               '2020-01-08T15:48:38.012671Z',
                               'activityType': 'PAGEVIEW',
                               'campaign': '(not set)',
                               'channelGrouping': 'Direct',
                               'customDimension': [{'index': 1}],
                               'hostname': 'company.domain.com',
                               'keyword': '(not set)',
                               'landingPagePath': '/login',
                               'medium': '(none)',
                               'pageview': {'pagePath': '/login',
                                            'pageTitle': 'titleofthepage'},
                               'source': '(direct)'},

                              # Event activity
                              {'activityTime': '2020-01-08T15:48:37.915105Z',
                               'activityType': 'EVENT',
                               'campaign': '(not set)',
                               'channelGrouping': 'Direct',
                               'customDimension': [{'index': 1}],
                               'event': {'eventAction': 'Successfully Logged '
                                                        'In',
                                         'eventCategory': 'Auth',
                                         'eventCount': '1',
                                         'eventLabel': '(not set)'},
                               'hostname': 'company.domain.com',
                               'keyword': '(not set)',
                               'landingPagePath': '/login',
                               'medium': '(none)',
                               'source': '(direct)'}]

Solution

  • For example you can do it like this:

    import pandas as pd
    import json
    str = """{"sampleRate": 1,
     "sessions": [{"activities": [{"activityTime": "2020-01-08T16:00:44.399101Z",
                                   "activityType": "PAGEVIEW",
                                   "campaign": "(not set)",
                                   "channelGrouping": "Direct",
                                   "customDimension": [{"index": 1}],
                                   "hostname": "company.domain.com",
                                   "keyword": "(not set)",
                                   "landingPagePath": "/login",
                                   "medium": "(none)",
                                   "pageview": {"pagePath": "/thepath",
                                                "pageTitle": "thecurrentwebpage"},
                                   "source": "(direct)"},
                                  {"activityTime": "2020-01-08T15:48:37.915105Z",
                                   "activityType": "EVENT",
                                   "campaign": "(not set)",
                                   "channelGrouping": "Direct",
                                   "customDimension": [{"index": 1}],
                                   "event": {"eventAction": "Successfully Logged In",
                                             "eventCategory": "Auth",
                                             "eventCount": "1",
                                             "eventLabel": "(not set)"},
                                   "hostname": "company.domain.com",
                                   "keyword": "(not set)",
                                   "landingPagePath": "/login",
                                   "medium": "(none)",
                                   "source": "(direct)"}],
                   "dataSource": "web",
                   "deviceCategory": "desktop",
                   "platform": "Windows",
                   "sessionDate": "2020-01-08",
                   "sessionId": "1578491x03d"}]}"""
    
    
    data = json.loads(str)
    
    session_keys = "sessionId,dataSource,deviceCategory,platform,sessionDate,DB_id".split(",")
    event_keys = "activityTime,eventCategory,eventCount,eventLabel,eventAction".split(",")
    pageview_keys = "activityTime,pageTitle,pagePath".split(",")
    
    sessions = {k:[] for k in session_keys}
    events = {k:[] for k in event_keys}
    pageviews = {k:[] for k in pageview_keys}
    activities = {"sessionId":[],"activityTime":[]}
    
    for session in data["sessions"]:
        for easy_key in session_keys[:5]:
            sessions[easy_key] += [session[easy_key]]
        for activity in session["activities"]:
            activity_time = activity["activityTime"]
            activities["sessionId"] += [session["sessionId"]]
            activities["activityTime"] += [activity_time]
            if activity["activityType"] == "PAGEVIEW":
                pageviews["activityTime"] += [activity_time]
                pageviews["pageTitle"] += [activity["pageview"]["pageTitle"]]
                pageviews["pagePath"] += [activity["pageview"]["pagePath"]]
            elif activity["activityType"] == "EVENT":
                events["activityTime"] += [activity_time]
                events["eventAction"] += [activity["event"]["eventAction"]]
                events["eventCategory"] += [activity["event"]["eventCategory"]]
                events["eventCount"] += [activity["event"]["eventCount"]]
                events["eventLabel"] += [activity["event"]["eventLabel"]]
            else:
                print("Unknown Activity: {}".format(activity["activityType"]))
    
        sessions["DB_id"] += [0]
    
    df_session = pd.DataFrame.from_dict(sessions)
    df_session.set_index('sessionId', inplace=True)
    df_event = pd.DataFrame.from_dict(events)
    df_event.set_index('activityTime', inplace=True)
    df_pageview = pd.DataFrame.from_dict(pageviews)
    df_pageview.set_index('activityTime', inplace=True)
    df_activities = pd.DataFrame.from_dict(activities)
    

    Output each DF:

    #df_session:
    
                dataSource deviceCategory platform sessionDate  DB_id
    sessionId                                                        
    1578491x03d        web        desktop  Windows  2020-01-08      0
    
    
    
    #df_activities:
         sessionId                 activityTime
    0  1578491x03d  2020-01-08T16:00:44.399101Z
    1  1578491x03d  2020-01-08T15:48:37.915105Z
    
    
    
    #df_event:
                                eventCategory eventCount eventLabel             eventAction
    activityTime                                                                           
    2020-01-08T15:48:37.915105Z          Auth          1  (not set)  Successfully Logged In
    
    
    
    #df_pageview:
                                         pageTitle  pagePath
    activityTime                                            
    2020-01-08T16:00:44.399101Z  thecurrentwebpage  /thepath
    

    Output example join

    #As example for a join, I only want the event data
    df_sa = df_activities.join(df_session, on="sessionId").join(df_event,on="activityTime",how="right") 
    
    print(df_sa)
    
         sessionId                 activityTime dataSource deviceCategory platform sessionDate  DB_id eventCategory eventCount eventLabel             eventAction
    1  1578491x03d  2020-01-08T15:48:37.915105Z        web        desktop  Windows  2020-01-08      0          Auth          1  (not set)  Successfully Logged In
    

    Schema

    It is the same as you specified above with 2 changes:

    1. The Table session doesn't have a column activities anymore.

    2. The Table Activity has an additional column sessionId.

    SQL

    How to execute SQL in pandas Dataframe you can look up online probably to much to cover here. See here for example: Executing an SQL query over a pandas dataset

    How to get the data

    Some examples: (but in the end you have to figure it out yourself if you want something specific, I don't make an SQL course here)

    • If you only want session data: Query df_session
    • If you want all activities: Query df_event and df_pageview
    • If you want all activities and combined with sessions: join df_session with df_activities then join with df_event and df_pageview

    I don't want a Dataframe... I need MYSQL Database (or something else)

    Nothing easier than that. The dataframe are in a "properly" database format.

    Example for Session:

    for index, row in df_sessions.iterrows():
        # for event and pageview the index would be activityTime
        # the df activities don't have a specific index
        sessionId = index 
        dataSource = row['dataSource']
        deviceCategory = row['deviceCategory']
        platform = row['platform']
        sessionDate = row['sessionDate']
        DB_id = row['DB_id']
        # function to save a row in a SQL DB basically:
        # INSERT INTO session (sessionId,dataSource,deviceCategory,platform,sessionDate,DB_id) VALUES(x,x,x,x,x,x)
        save_to_sql(sessionId,dataSource,deviceCategory,platform,sessionDate,DB_id)
    

    the save_to_sql is your own implementation depending on what database you are using. And it would not fit this question to explain that to you.

    Comments

    1. DB_id don't know the origin of this value. I set it to 0.