python indexing bigdata yelp

Read a range of lines from the Yelp dataset with Python


I want to change this code so that it reads only lines 1400001 to 1450000. What do I need to modify? The file contains a single object type, one JSON object per line. I also want to save the output to a .csv file. How should I do that?

revu=[]
with open("review.json", 'r',encoding="utf8") as f:
    for line in f:
        revu = json.loads(line[1400001:1450000])

Solution

  • If it is JSON per line:

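    import json

    with open("review.json", 'r', encoding="utf8") as f:
        # readlines() loads the whole file into memory first, so this can
        # run out of memory on a large file.
        # Slice indices are 0-based and the end is exclusive, so
        # [1400000:1450000] covers lines 1400001 to 1450000 (counting from 1).
        revu = [json.loads(s) for s in f.readlines()[1400000:1450000]]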
    

    This is easy to test on the /etc/passwd file (it is not JSON, of course, so the parsing is left out):

    with open("/etc/passwd", 'r') as f:
        # expensive statement: reads the whole file
        revu = f.readlines()[5:10]
    
    print(revu)  # gives lines 6 to 10 (list indices 5 to 9)
    

    Or iterate over the file line by line, which avoids the memory issue:

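    revu = []
    with open("...", 'r') as f:
        # start=1 makes i the line number, counting from 1
        for i, line in enumerate(f, start=1):
            if i > 1450000:
                break  # past the range, no need to read the rest
            if i >= 1400001:
                revu.append(json.loads(line))
    
    # process revu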
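
The line-by-line approach above can also be written with `itertools.islice` from the standard library, which skips the leading lines and stops reading once the range is complete. This is only a sketch: the helper name `read_line_range` is made up for illustration, and it assumes line numbers count from 1 and that both ends of the range are included.

```python
import json
from itertools import islice


def read_line_range(path, first, last):
    """Return the parsed JSON objects on lines first..last (1-based, inclusive)."""
    with open(path, encoding="utf8") as f:
        # islice uses a 0-based start and an exclusive stop, so the
        # 1-based inclusive range first..last becomes (first - 1, last).
        return [json.loads(line) for line in islice(f, first - 1, last)]
```

For the question's range this would be called as `revu = read_line_range("review.json", 1400001, 1450000)`. Only the selected lines are parsed and kept in memory, and nothing after line `last` is read.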
    

    To write the result to CSV:

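    import pandas as pd
    import json
    
    def mylines(filename, _from, _to):
        # yield the parsed objects on lines _from.._to (inclusive, counting from 1)
        with open(filename, encoding="utf8") as f:
            for i, line in enumerate(f, start=1):
                if i > _to:
                    break  # past the range, stop reading
                if i >= _from:
                    yield json.loads(line)
    
    df = pd.DataFrame(list(mylines("review.json", 1400001, 1450000)))
    # index=False leaves out the DataFrame's row-number column
    df.to_csv("/tmp/whatever.csv", index=False)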
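
If pandas is not available, the same conversion could be sketched with only the standard library's `csv` module, writing each row as soon as it is read. The function name `json_lines_to_csv` is hypothetical, and the sketch assumes flat records (no nested objects) that all share the keys of the first record in the range, which matches the "single object type" described in the question but is not checked against the actual file.

```python
import csv
import json
from itertools import islice


def json_lines_to_csv(src, dst, first, last):
    """Write lines first..last (1-based, inclusive) of a JSON-lines file to CSV.

    Returns the number of rows written.
    """
    count = 0
    with open(src, encoding="utf8") as fin, \
            open(dst, "w", newline="", encoding="utf8") as fout:
        writer = None
        for line in islice(fin, first - 1, last):
            record = json.loads(line)
            if writer is None:
                # Assumption: the first record's keys define the CSV columns.
                # Missing keys become empty cells, unexpected keys are dropped.
                writer = csv.DictWriter(fout, fieldnames=list(record),
                                        restval="", extrasaction="ignore")
                writer.writeheader()
            writer.writerow(record)
            count += 1
    return count
```

A call such as `json_lines_to_csv("review.json", "/tmp/whatever.csv", 1400001, 1450000)` would then never hold more than one record in memory at a time.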