I'm working on an application where I need to store a collection of tweets along with attributes such as Tweet ID, Date of Tweet, Language and Username inside of a MySQL database.
This is an image of what I'm aiming for: https://i.sstatic.net/VWpyx.png
To do this, I wrote a Python program that collects 100+ tweets from Twitter into a JSON file. I then converted the JSON file to a CSV file using Microsoft Excel. After this, I imported the CSV file into phpMyAdmin as a table and got the following outcome: https://i.sstatic.net/V11mK.png (10 rows x 185 columns).
The problem with the above is that some tweets carry more data, such as media, which causes the data to spread over many more columns.
How do I quickly clean this table so that it contains only my desired attributes? Do I need to start from scratch and work from my Python code, or can I clean the table/CSV file directly?
If the tweets are stored in JSON format and you need only some of the fields, I recommend using the json module to parse the needed fields and the pandas module to convert them into a structured table that can be written to MySQL, for example:
import json
import pandas as pd
# Open and read the text file that holds the tweets (one JSON object per line)
with open('tweets.txt') as f:
    tweets = f.readlines()
# Parse each line into a Python dict
tweets_json = [json.loads(tweet) for tweet in tweets]
# Convert the list of tweets into a structured DataFrame
df = pd.DataFrame(tweets_json)
# Finally, keep only the attributes you need
df = df[['created_at', 'id', ...]]
# Write the table to MySQL
df.to_sql(...)
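One thing the snippet above does not handle is the username, which usually sits inside a nested object rather than at the top level of the tweet. Below is a minimal sketch of picking the four wanted attributes before building the table. It assumes the field names `id`, `created_at`, `lang` and `user.screen_name` from Twitter's v1.1 tweet JSON; the helper names `extract_fields` and `parse_tweets`, the output column names, and the two sample tweets are made up for illustration, so adjust them to your data.

```python
import json

# Assumption: field names follow Twitter's v1.1 tweet JSON; check your own file.
def extract_fields(tweet):
    """Return only the four wanted attributes from one tweet dict."""
    user = tweet.get("user") or {}  # the username is nested under "user"
    return {
        "tweet_id": tweet.get("id"),
        "created_at": tweet.get("created_at"),
        "lang": tweet.get("lang"),
        "username": user.get("screen_name"),
    }

def parse_tweets(lines):
    """Parse one JSON tweet per line, skipping blank lines."""
    return [extract_fields(json.loads(line)) for line in lines if line.strip()]

# Two made-up sample tweets; the second carries extra media data.
sample_lines = [
    '{"id": 1, "created_at": "Mon Jan 01 00:00:00 +0000 2018", '
    '"lang": "en", "user": {"screen_name": "alice"}}',
    '{"id": 2, "created_at": "Tue Jan 02 00:00:00 +0000 2018", '
    '"lang": "fr", "user": {"screen_name": "bob"}, '
    '"entities": {"media": [{"type": "photo"}]}}',
]

rows = parse_tweets(sample_lines)
for row in rows:
    print(row)
```

Because every row has the same four keys, passing `rows` to `pd.DataFrame(rows)` should give a four-column table no matter how much extra data (media and so on) an individual tweet carries.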