Making a histogram from JSON data


I have data in JSON format that looks something like this:

{
   "ts": 1393631983,
   "visitor_uuid": "ade7e1f63bc83c66",
   "visitor_source": "external",
   "visitor_device": "browser",
   "visitor_useragent": "Opera/9.80 (Windows NT 6.1) Presto/2.12.388 Version/12.16",
   "visitor_ip": "b5af0ba608ab307c",
   "visitor_country": "BR",
   "visitor_referrer": "53c643c16e8253e7",
   "env_type": "reader",
   "env_doc_id": "140222143932-91796b01f94327ee809bd759fd0f6c76",
   "event_type": "pagereadtime",
   "event_readtime": 1010,
   "subject_type": "doc",
   "subject_doc_id": "140222143932-91796b01f94327ee809bd759fd0f6c76",
   "subject_page": 3
} {
    "ts": 1393631983,
    "visitor_uuid": "232eeca785873d35",
    "visitor_source": "internal",
    "visitor_device": "browser",
    "visitor_useragent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.117 Safari/537.36",
    "visitor_ip": "fcf9c67037f993f0",
    "visitor_country": "MX",
    "visitor_referrer": "63765fcd2ff864fd",
    "env_type": "stream",
    "env_ranking": 10,
    "env_build": "1.7.118-b946",
    "env_name": "explore",
    "env_component": "editors_picks",
    "event_type": "impression",
    "subject_type": "doc",
    "subject_doc_id": "100713205147-2ee05a98f1794324952eea5ca678c026",
    "subject_page": 1
}

My task requires me to find the subject_doc_id that matches user input and then display a histogram of the countries in which that document has been viewed.

I can already read the data with my code, and I know how to plot a histogram, but I need help counting the countries and displaying those counts in the histogram.

For example, the data above contains "visitor_country": "MX" and "visitor_country": "BR", so I want the count for each country.

Any ideas on how I can achieve that?


Solution

  • Your JSON file isn't valid JSON. You need to add a "[" at the start and a "]" at the end of the file, and separate each "{}" section with a comma. Here is an example:

    data.json

    [
    {
        "ts": 1393631983,
        "visitor_uuid": "ade7e1f63bc83c66",
        "visitor_source": "external",
        "visitor_device": "browser",
        "visitor_useragent": "Opera/9.80 (Windows NT 6.1) Presto/2.12.388 Version/12.16",
        "visitor_ip": "b5af0ba608ab307c",
        "visitor_country": "BR",
        "visitor_referrer": "53c643c16e8253e7",
        "env_type": "reader",
        "env_doc_id": "140222143932-91796b01f94327ee809bd759fd0f6c76",
        "event_type": "pagereadtime",
        "event_readtime": 1010,
        "subject_type": "doc",
        "subject_doc_id": "140222143932-91796b01f94327ee809bd759fd0f6c76",
        "subject_page": 3
    }, {
        "ts": 1393631983,
        "visitor_uuid": "232eeca785873d35",
        "visitor_source": "internal",
        "visitor_device": "browser",
        "visitor_useragent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.117 Safari/537.36",
        "visitor_ip": "fcf9c67037f993f0",
        "visitor_country": "MX",
        "visitor_referrer": "63765fcd2ff864fd",
        "env_type": "stream",
        "env_ranking": 10,
        "env_build": "1.7.118-b946",
        "env_name": "explore",
        "env_component": "editors_picks",
        "event_type": "impression",
        "subject_type": "doc",
        "subject_doc_id": "100713205147-2ee05a98f1794324952eea5ca678c026",
        "subject_page": 1
    }, {
        "ts": 1393631983,
        "visitor_uuid": "232eeca785873d35",
        "visitor_source": "internal",
        "visitor_device": "browser",
        "visitor_useragent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.117 Safari/537.36",
        "visitor_ip": "fcf9c67037f993f0",
        "visitor_country": "PL",
        "visitor_referrer": "63765fcd2ff864fd",
        "env_type": "stream",
        "env_ranking": 10,
        "env_build": "1.7.118-b946",
        "env_name": "explore",
        "env_component": "editors_picks",
        "event_type": "impression",
        "subject_type": "doc",
        "subject_doc_id": "100713205147-2ee05a98f1794324952eea5ca678c026",
        "subject_page": 1
    }, {
        "ts": 1393631983,
        "visitor_uuid": "232eeca785873d35",
        "visitor_source": "internal",
        "visitor_device": "browser",
        "visitor_useragent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.117 Safari/537.36",
        "visitor_ip": "fcf9c67037f993f0",
        "visitor_country": "PL",
        "visitor_referrer": "63765fcd2ff864fd",
        "env_type": "stream",
        "env_ranking": 10,
        "env_build": "1.7.118-b946",
        "env_name": "explore",
        "env_component": "editors_picks",
        "event_type": "impression",
        "subject_type": "doc",
        "subject_doc_id": "100713205147-2ee05a98f1794324952eea5ca678c026",
        "subject_page": 1
    }
    ]
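
If editing the file by hand is impractical, the back-to-back objects from the question can also be parsed as they are. The following is a minimal sketch, assuming the file contains complete JSON objects separated only by whitespace; `parse_concatenated_json` is a hypothetical helper name, and the sample string is made up for illustration.

```python
import json


def parse_concatenated_json(text):
    """Parse several JSON objects written back to back into a list."""
    decoder = json.JSONDecoder()
    records = []
    pos = 0
    length = len(text)
    while pos < length:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < length and text[pos].isspace():
            pos += 1
        if pos >= length:
            break
        # raw_decode returns the parsed object and the index where it ended.
        record, pos = decoder.raw_decode(text, pos)
        records.append(record)
    return records


# Made-up sample in the same shape as the question's file
sample = '{"visitor_country": "BR"} {"visitor_country": "MX"}'
records = parse_concatenated_json(sample)
print(len(records))
```

The resulting list can be used in place of the `json.load` result, or written back to disk with `json.dump` to get a valid JSON array.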
    

    After that, for each element in the data.json file, I check whether it matches the input subject_doc_id. If it does, I append its country to a list of matches, which collects the data for the histogram. Then I need the number of bins, which should equal the number of unique countries, so I build a list of unique countries and take its length.

    import matplotlib.pyplot as plt
    import json
    
    with open("data.json") as json_file:
        data = json.load(json_file)
    
    # Here is the subject id I'm using for the data presentation:
    # 100713205147-2ee05a98f1794324952eea5ca678c026
    subject_id = input("subject_doc_id: ")

    # Collect the country of every record that matches the requested document
    visitors = []
    for record in data:
        if subject_id == record["subject_doc_id"]:
            print("got a match from {}".format(record["visitor_country"]))
            visitors.append(record["visitor_country"])

    # The number of unique countries determines the number of bins
    countries = []
    for country in visitors:
        if country not in countries:
            countries.append(country)

    # Only plot when at least one record matched
    if visitors:
        plt.hist(visitors, bins=len(countries))
        plt.show()
    else:
        print("No matches for given subject_doc_id")
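
Since the question asks specifically for the count of each country, one alternative is to count explicitly with `collections.Counter` and draw one bar per country. This is only a sketch of that idea, not the code above: `count_countries` and `plot_counts` are hypothetical helper names, and the `doc-a` / `doc-b` ids are made-up stand-ins for real `subject_doc_id` values.

```python
from collections import Counter


def count_countries(records, subject_id):
    """Count visitor_country values for records whose subject_doc_id matches."""
    return Counter(
        record["visitor_country"]
        for record in records
        if record.get("subject_doc_id") == subject_id
        and "visitor_country" in record
    )


def plot_counts(counts, title):
    """Draw one bar per key of a Counter."""
    # Imported here so the counting step works even without matplotlib installed.
    import matplotlib.pyplot as plt

    labels = list(counts)
    plt.bar(labels, [counts[label] for label in labels])
    plt.xlabel("visitor_country")
    plt.ylabel("views")
    plt.title(title)
    plt.show()


# Tiny stand-in for the list loaded from data.json
records = [
    {"subject_doc_id": "doc-a", "visitor_country": "PL"},
    {"subject_doc_id": "doc-a", "visitor_country": "PL"},
    {"subject_doc_id": "doc-a", "visitor_country": "MX"},
    {"subject_doc_id": "doc-b", "visitor_country": "BR"},
]
counts = count_countries(records, "doc-a")
print(counts)
```

Calling `plot_counts(counts, "doc-a")` would then show the chart. A bar chart keeps each bar lined up with its country label, which is not guaranteed when categorical values are passed through histogram bins.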
    

    If you want to group the results by continent, you first need to know which country belongs to which continent. My example:

    continents = {
        "europe": ["PL", "DE"],
        "south_america": ["BR"],
        "north_america": ["MX"],
    }
    

    I'm a Python newbie, so I don't know any fancy techniques for sorting the previous lists other than for loops.

    continent_data = []
    for visitor_country in visitors:
        # Find the continent whose country list contains this visitor's country
        for continent, continent_countries in continents.items():
            if visitor_country in continent_countries:
                continent_data.append(continent)
    print(continent_data)
    

    After that, you can reuse the previous code to find the unique values for the bins and create a histogram, as in the example above.
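
As a possible shortcut for the continent step, the mapping can be inverted once so that each country is looked up directly, and the results counted with `collections.Counter`. This is a sketch under assumptions: the mapping below is illustrative and incomplete, `count_continents` is a hypothetical helper name, and the `"unknown"` bucket for unmapped codes is my own choice rather than part of the answer.

```python
from collections import Counter

# Assumed, incomplete mapping; extend it to cover every country code in your data.
CONTINENTS = {
    "europe": ["PL", "DE"],
    "south_america": ["BR"],
    "north_america": ["MX"],
}

# Invert the mapping once so each lookup is a single dictionary access.
COUNTRY_TO_CONTINENT = {
    country: continent
    for continent, countries in CONTINENTS.items()
    for country in countries
}


def count_continents(visitor_countries):
    """Count continents for a list of country codes; unmapped codes go to "unknown"."""
    return Counter(
        COUNTRY_TO_CONTINENT.get(country, "unknown")
        for country in visitor_countries
    )


# "ZZ" is a made-up code that is deliberately missing from the mapping
continent_counts = count_continents(["PL", "PL", "MX", "BR", "ZZ"])
print(continent_counts)
```

The keys and values of `continent_counts` can be plotted the same way as the per-country counts.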