I have been struggling with this logic. I am getting data from google analytics in this format, per user per site. (So these are all the activities that one user performed on the site) I cannot change the format in which I receive the data.
PROBLEM: I am running a loop through all of the users and get this output for each user. I want to put this data into a data frame in order to use it later. The problem I have, is the 'activities':[{.....},{......}] part, I cannot figure out to store all of this data in a way that makes sense.
{'sampleRate': 1,
'sessions': [{'activities': [{'activityTime': '2020-01-08T16:00:44.399101Z',
'activityType': 'PAGEVIEW',
'campaign': '(not set)',
'channelGrouping': 'Direct',
'customDimension': [{'index': 1}],
'hostname': 'company.domain.com',
'keyword': '(not set)',
'landingPagePath': '/login',
'medium': '(none)',
'pageview': {'pagePath': '/thepath',
'pageTitle': 'thecurrentwebpage'},
'source': '(direct)'},
{'activityTime': '2020-01-08T15:58:43.077293Z',
'activityType': 'PAGEVIEW',
'campaign': '(not set)',
'channelGrouping': 'Direct',
'customDimension': [{'index': 1}],
'hostname': 'company.domain.com',
'keyword': '(not set)',
'landingPagePath': '/login',
'medium': '(none)',
'pageview': {'pagePath': '/theotherpath',
'pageTitle': 'thecurrentwebpage'},
'source': '(direct)'}],
'dataSource': 'web',
'deviceCategory': 'desktop',
'platform': 'Windows',
'sessionDate': '2020-01-08',
'sessionId': '1578491x03d'},
{'activities': [{'activityTime': '2019-12-28T21:58:48.993944Z',
'activityType': 'PAGEVIEW',
'campaign': '(not set)',.....
EXPECTED OUTPUT:
For each user's data to be stored in tables organised as such:
If there is some logic error in the picture, I am happy to change what I have. I just need the data to work.
PS: I need to use SQL and ERD's in LucidChart, I have never before had to manipulate data in this format. Any help, to get the data -that is structured like the example above- into a dataframe(s).
EDITED:
Example of two different types of activities (the activity is always classified as either 'pageview' or 'event'):
{'activityTime':
# Pageview activity
'2020-01-08T15:48:38.012671Z',
'activityType': 'PAGEVIEW',
'campaign': '(not set)',
'channelGrouping': 'Direct',
'customDimension': [{'index': 1}],
'hostname': 'company.domain.com',
'keyword': '(not set)',
'landingPagePath': '/login',
'medium': '(none)',
'pageview': {'pagePath': '/login',
'pageTitle': 'titleofthepage'},
'source': '(direct)'},
# Event activity
{'activityTime': '2020-01-08T15:48:37.915105Z',
'activityType': 'EVENT',
'campaign': '(not set)',
'channelGrouping': 'Direct',
'customDimension': [{'index': 1}],
'event': {'eventAction': 'Successfully Logged '
'In',
'eventCategory': 'Auth',
'eventCount': '1',
'eventLabel': '(not set)'},
'hostname': 'company.domain.com',
'keyword': '(not set)',
'landingPagePath': '/login',
'medium': '(none)',
'source': '(direct)'}]
For example you can do it like this:
import pandas as pd
import json
str = """{"sampleRate": 1,
"sessions": [{"activities": [{"activityTime": "2020-01-08T16:00:44.399101Z",
"activityType": "PAGEVIEW",
"campaign": "(not set)",
"channelGrouping": "Direct",
"customDimension": [{"index": 1}],
"hostname": "company.domain.com",
"keyword": "(not set)",
"landingPagePath": "/login",
"medium": "(none)",
"pageview": {"pagePath": "/thepath",
"pageTitle": "thecurrentwebpage"},
"source": "(direct)"},
{"activityTime": "2020-01-08T15:48:37.915105Z",
"activityType": "EVENT",
"campaign": "(not set)",
"channelGrouping": "Direct",
"customDimension": [{"index": 1}],
"event": {"eventAction": "Successfully Logged In",
"eventCategory": "Auth",
"eventCount": "1",
"eventLabel": "(not set)"},
"hostname": "company.domain.com",
"keyword": "(not set)",
"landingPagePath": "/login",
"medium": "(none)",
"source": "(direct)"}],
"dataSource": "web",
"deviceCategory": "desktop",
"platform": "Windows",
"sessionDate": "2020-01-08",
"sessionId": "1578491x03d"}]}"""
data = json.loads(str)
session_keys = "sessionId,dataSource,deviceCategory,platform,sessionDate,DB_id".split(",")
event_keys = "activityTime,eventCategory,eventCount,eventLabel,eventAction".split(",")
pageview_keys = "activityTime,pageTitle,pagePath".split(",")
sessions = {k:[] for k in session_keys}
events = {k:[] for k in event_keys}
pageviews = {k:[] for k in pageview_keys}
activities = {"sessionId":[],"activityTime":[]}
for session in data["sessions"]:
for easy_key in session_keys[:5]:
sessions[easy_key] += [session[easy_key]]
for activity in session["activities"]:
activity_time = activity["activityTime"]
activities["sessionId"] += [session["sessionId"]]
activities["activityTime"] += [activity_time]
if activity["activityType"] == "PAGEVIEW":
pageviews["activityTime"] += [activity_time]
pageviews["pageTitle"] += [activity["pageview"]["pageTitle"]]
pageviews["pagePath"] += [activity["pageview"]["pagePath"]]
elif activity["activityType"] == "EVENT":
events["activityTime"] += [activity_time]
events["eventAction"] += [activity["event"]["eventAction"]]
events["eventCategory"] += [activity["event"]["eventCategory"]]
events["eventCount"] += [activity["event"]["eventCount"]]
events["eventLabel"] += [activity["event"]["eventLabel"]]
else:
print("Unknown Activity: {}".format(activity["activityType"]))
sessions["DB_id"] += [0]
df_session = pd.DataFrame.from_dict(sessions)
df_session.set_index('sessionId', inplace=True)
df_event = pd.DataFrame.from_dict(events)
df_event.set_index('activityTime', inplace=True)
df_pageview = pd.DataFrame.from_dict(pageviews)
df_pageview.set_index('activityTime', inplace=True)
df_activities = pd.DataFrame.from_dict(activities)
#df_session:
dataSource deviceCategory platform sessionDate DB_id
sessionId
1578491x03d web desktop Windows 2020-01-08 0
#df_activities:
sessionId activityTime
0 1578491x03d 2020-01-08T16:00:44.399101Z
1 1578491x03d 2020-01-08T15:48:37.915105Z
#df_event:
eventCategory eventCount eventLabel eventAction
activityTime
2020-01-08T15:48:37.915105Z Auth 1 (not set) Successfully Logged In
#df_pageview:
pageTitle pagePath
activityTime
2020-01-08T16:00:44.399101Z thecurrentwebpage /thepath
#As example for a join, I only want the event data
df_sa = df_activities.join(df_session, on="sessionId").join(df_event,on="activityTime",how="right")
print(df_sa)
sessionId activityTime dataSource deviceCategory platform sessionDate DB_id eventCategory eventCount eventLabel eventAction
1 1578491x03d 2020-01-08T15:48:37.915105Z web desktop Windows 2020-01-08 0 Auth 1 (not set) Successfully Logged In
It is the same as you specified above with 2 changes:
The Table session doesn't have a column activities anymore.
The Table Activity has an additional column sessionId.
How to execute SQL in pandas Dataframe you can look up online probably to much to cover here. See here for example: Executing an SQL query over a pandas dataset
Some examples: (but in the end you have to figure it out yourself if you want something specific, I don't make an SQL course here)
df_session
df_event
and df_pageview
df_session
with
df_activities
then join with df_event
and df_pageview
Nothing easier than that. The dataframe are in a "properly" database format.
Example for Session:
for index, row in df_sessions.iterrows():
# for event and pageview the index would be activityTime
# the df activities don't have a specific index
sessionId = index
dataSource = row['dataSource']
deviceCategory = row['deviceCategory']
platform = row['platform']
sessionDate = row['sessionDate']
DB_id = row['DB_id']
# function to save a row in a SQL DB basically:
# INSERT INTO session (sessionId,dataSource,deviceCategory,platform,sessionDate,DB_id) VALUES(x,x,x,x,x,x)
save_to_sql(sessionId,dataSource,deviceCategory,platform,sessionDate,DB_id)
the save_to_sql
is your own implementation depending on what database you are using. And it would not fit this question to explain that to you.
DB_id
don't know the origin of this value. I set it to 0.