
How Do I Use a Generator on a Data File to Convert JSON and TSV Rows Into a Dataframe?


I have a ".data" file containing the two sample rows below. The first row is JSON and the second row is TSV. I would like to convert each row, whether JSON or TSV, into a Python dictionary and then load the results into a DataFrame using a generator.

###SAMPLE LINES of ".DATA" FILE###

{"Book": "American Horror", "Author": "Me", "date": "12/12/2012", "publisher": "Fox"}
Sports Law  Some Body   06/12/1999  Random House 1000

###MY ATTEMPT SO FAR###

import json
import pandas as pd

def generator(file):
    # Open the file here so the caller only has to pass a path.
    with open(file, encoding="ISO-8859-1") as fh:
        for row in fh:
            print(row)
            if "{" in row:
                yield json.loads(row)
            else:
                # I don't know where to begin with the TSV data.
                # The TSV fields must fit under the column names of the JSON data.
                for tsv in row:
                    yield tsv

file = ".data_file"
df = pd.DataFrame(data=generator(file))
df

Solution

  • By "TSV" I assume that your data is tab-separated, i.e. the fields are delimited by a single tab character. If so, you can use str.split('\t') to break a line into its fields, like this:

    >>> line = 'Sports Law\tSome Body\t06/12/1999\tRandom House 1000\n'
    >>> line.rstrip().split('\t')
    ['Sports Law', 'Some Body', '06/12/1999', 'Random House 1000']
    

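    The rstrip() call removes the newline at the end of each line read from the file. Note that a bare rstrip() also strips trailing tabs, so if the last field can be empty, rstrip('\n') is the safer choice.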

    Then create a dictionary and yield it:

    book, author, date, publisher = line.rstrip().split('\t')
    yield dict(Book=book, Author=author, date=date, publisher=publisher)
    

    Or, if you already have a list of column names:

    columns = ['Book', 'Author', 'date', 'publisher']
    
    yield dict(zip(columns, line.rstrip().split('\t')))
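
Putting the pieces above together, here is a minimal sketch of a complete generator. It assumes that JSON rows start with "{", that TSV rows are genuinely tab-delimited, and that the TSV fields arrive in the same order as the JSON keys. The names iter_records and COLUMNS are made up for this illustration, and the temporary file only stands in for your real ".data" file.

```python
import json
import os
import tempfile

# Assumed column order for TSV rows; it mirrors the keys of the JSON sample row.
COLUMNS = ["Book", "Author", "date", "publisher"]


def iter_records(path, columns=COLUMNS, encoding="ISO-8859-1"):
    """Yield one dict per non-empty line, whether the line is JSON or TSV."""
    with open(path, encoding=encoding) as fh:
        for line in fh:
            # Strip only the line ending so trailing tabs (empty fields) survive.
            line = line.rstrip("\r\n")
            if not line.strip():
                continue  # skip blank lines
            if line.lstrip().startswith("{"):
                yield json.loads(line)
            else:
                yield dict(zip(columns, line.split("\t")))


# Demo: write the two sample rows to a throwaway file and read them back.
sample = (
    '{"Book": "American Horror", "Author": "Me", '
    '"date": "12/12/2012", "publisher": "Fox"}\n'
    "Sports Law\tSome Body\t06/12/1999\tRandom House 1000\n"
)
with tempfile.NamedTemporaryFile(
    "w", suffix=".data", delete=False, encoding="ISO-8859-1"
) as tmp:
    tmp.write(sample)

records = list(iter_records(tmp.name))
os.remove(tmp.name)
print(records)
```

To build the DataFrame, pass the generator straight to pandas, for example pd.DataFrame(iter_records(path), columns=COLUMNS), where path points at your file. Be aware that zip() stops at the shorter input, so a TSV row with too few or too many fields is silently truncated; add a length check if that matters for your data.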