Tags: python, json, pandas, jupyter-notebook, data-preprocessing

Data preprocessing with JSON data using Python (Jupyter Notebook)


I am trying to apply some preprocessing operations to a JSON dataset. This was easy with a .csv file, but I cannot work out how to use operations such as isnull(), fillna(), dropna() and the imputer class on JSON data.

Below are some of the commands I have run so far. None of them let me perform the operations mentioned above, because I cannot figure out how to work with a JSON dataset.

Dataset link: https://drive.google.com/file/d/1puNNrRaV-Jt_kt709fuYGCvDW9-EuwoB/view?usp=sharing

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import json

dataset = pd.read_json('moviereviews.json', orient='columns')
print(dataset)

movies = pd.read_json( ( dataset).to_json(), orient='index')
print(movies)
print(type(movies))

movie = pd.read_json( ( dataset['12 Strong']).to_json(), orient='index')
print(movie)

movie_name = [
    "12 Strong",
    "A Ciambra",
    "All The Money In The World",
    "Along With The Gods: The Two Worlds",
    "Bilal: A New Breed Of Hero",
    "Call Me By Your Name",
    "Condorito: La Película",
    "Darkest Hour",
    "Den Of Thieves",
    "Downsizing",
    "Father Figures",
    "Film Stars Don'T Die In Liverpool",
    "Forever My Girl",
    "Happy End",
    "Hostiles",
    "I, Tonya",
    "In The Fade (Aus Dem Nichts)",
    "Insidious: The Last Key",
    "Jumanji: Welcome To The Jungle",
    "Mary And The Witch'S Flower",
    "Maze Runner: The Death Cure",
    "Molly'S Game",
    "Paddington 2",
    "Padmaavat",
    "Phantom Thread",
    "Pitch Perfect 3",
    "Proud Mary",
    "Star Wars: Episode Viii - The Last Jedi",
    "Star Wars: The Last Jedi",
    "The Cage Fighter",
    "The Commuter",
    "The Final Year",
    "The Greatest Showman",
    "The Insult (L'Insulte)",
    "The Post",
    "The Shape Of Water",
    "Una Mujer Fantástica",
    "Winchester"
]
print(movie_name)

data = []
for moviename in movie_name:
    movie = pd.read_json( ( dataset[moviename]).to_json(), orient='index')
    data.append(movie)
   
print(data)

Solution

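  • You can split the dictionary into its values and keys and build the DataFrame from them, replacing every NaN with the string 'None' in one step.

    If your parsed JSON is in a variable called data (a list whose first element is the dictionary of movies), then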

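    # One row per movie: each value of data[0] is that movie's dict of attributes.
    df = pd.DataFrame(list(data[0].values())).fillna('None')
    # The keys of data[0] are the movie names; use them as the index.
    df['Movie Name'] = list(data[0].keys())
    df.set_index('Movie Name', inplace=True)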
    
    df.head()
    
                                             Genre       Gross IMDB Metascore Popcorn Score   Rating Tomato Score popcornscore rating tomatoscore
    Movie Name
    12 Strong                               Action  $1,465,000             54            72        R           54         None   None        None
    A Ciambra                                Drama     unknown             70       unknown  unrated       unkown         None   None        None
    All The Money In The World                None        None           None          None     None         None         72.0      R        76.0
    Along With The Gods: The Two Worlds       None        None           None          None     None         None         90.0     NR        50.0
    Bilal: A New Breed Of Hero           Animation     unknown             52       unknown  unrated       unkown         None   None        None
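
The calls from the question (isnull(), fillna(), dropna()) work on a frame like this once the placeholder strings and the two spellings of the score columns are cleaned up. Below is a minimal sketch, not a definitive recipe. It assumes the same structure as above (data[0] maps each movie name to a dict of attributes). The three sample records are copied from the output above; storing the scores as strings is an assumption about the real file, and merging popcornscore/rating/tomatoscore into the capitalised columns is an illustrative choice.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the parsed JSON (values taken from the output above;
# the real file may store the numbers differently).
data = [{
    "12 Strong": {"Genre": "Action", "Gross": "$1,465,000", "IMDB Metascore": "54",
                  "Popcorn Score": "72", "Rating": "R", "Tomato Score": "54"},
    "A Ciambra": {"Genre": "Drama", "Gross": "unknown", "IMDB Metascore": "70",
                  "Popcorn Score": "unknown", "Rating": "unrated", "Tomato Score": "unkown"},
    "All The Money In The World": {"popcornscore": 72.0, "rating": "R", "tomatoscore": 76.0},
}]

# One row per movie; attributes missing from a record become NaN automatically.
df = pd.DataFrame.from_dict(data[0], orient="index")
df.index.name = "Movie Name"

# Treat the placeholder strings (including the misspelt one) as missing values.
df = df.replace(["unknown", "unkown"], np.nan)

# Merge the two spellings of each column into one, then drop the duplicate.
pairs = {"Popcorn Score": "popcornscore", "Rating": "rating", "Tomato Score": "tomatoscore"}
for keep, dup in pairs.items():
    df[keep] = df[keep].fillna(df[dup])
    df = df.drop(columns=dup)

# Convert the numeric columns; "$1,465,000" needs its symbols stripped first.
df["Gross"] = df["Gross"].str.replace(r"[$,]", "", regex=True)
num_cols = ["Gross", "IMDB Metascore", "Popcorn Score", "Tomato Score"]
for col in num_cols:
    df[col] = pd.to_numeric(df[col], errors="coerce")

# isnull(): count the missing values in each column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# dropna(): keep only the movies that have both scores.
rows_with_scores = df.dropna(subset=["Popcorn Score", "Tomato Score"])

# fillna(): column medians for numbers, a label for text.
filled = df.copy()
filled[num_cols] = filled[num_cols].fillna(filled[num_cols].median())
filled["Genre"] = filled["Genre"].fillna("Unknown")
print(filled)
```

The median fill is a pandas-only stand-in for what an imputer does. If scikit-learn is installed, SimpleImputer(strategy='median') from sklearn.impute could be applied to the numeric columns instead.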