python, r, json, twitter, data-analysis

How to reformat a corrupt json file with escaped ' and "?


Problem

I have a large JSON file (~700,000 lines, 1.2 GB) containing Twitter data that I need to preprocess for data and network analysis. During data collection an error happened: single quotes (') were used as string delimiters instead of double quotes ("). As this does not conform to the JSON standard, the file cannot be processed by R or Python.

Information about the dataset: roughly every 500 lines there is a block of meta information (including meta information for the users, etc.), followed by the tweets as JSON objects (the order of fields is not stable), one tweet per line, each line starting with a space.

This is what I tried so far:

  1. A simple data.replace('\'', '\"') is not possible, as the "text" fields contain tweets which may contain ' or " themselves.
  2. Using a regex, I was able to catch some of the instances, but not all of them: re.compile(r'"[^"]*"(*SKIP)(*FAIL)|\'')
  3. Using literal_eval(data) from the ast module also throws an error.
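To make the failure of option one concrete, here is a minimal sketch using three made-up one-field records (they are illustrative assumptions modeled on the samples in this question, not lines copied from the file). It also shows a second obstacle: the Python-style literal False is not valid JSON, so even lines without any quotes inside the text fail after a plain replace.

```python
import json


def naive_fix(line: str) -> str:
    """Option one: blindly swap every single quote for a double quote."""
    return line.replace("'", '"')


def is_valid_json(s: str) -> bool:
    """Return True if json.loads accepts the string."""
    try:
        json.loads(s)
        return True
    except json.JSONDecodeError:
        return False


# Hypothetical records, written the way Python prints a dict
simple = "{'lang': 'en', 'retweet_count': 20}"
with_apostrophe = """{'description': "I don't believe in your God."}"""
with_bool = "{'possibly_sensitive': False}"

print(is_valid_json(naive_fix(simple)))           # plain strings and numbers survive
print(is_valid_json(naive_fix(with_apostrophe)))  # the apostrophe ends the string early
print(is_valid_json(naive_fix(with_bool)))        # JSON expects false, not False
```

The mix of 'single-quoted' and "double-quoted" strings plus False suggests the lines are Python dict representations rather than broken JSON, which is why quote-swapping alone is unlikely to be enough.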

As the order of the fields and the length of each field are not stable, I am stuck on how to reformat the file so that it conforms to JSON.

Normal sample line of the data (options one and two would work for this line, but note that the tweets are also in non-English languages, which use " or ' in their text):

 {'author_id': '1236888827605725186', 'entities': {'mentions': [{'start': 108, 'end': 124, 'username': 'realDonaldTrump'}], 'hashtags': [{'start': 49, 'end': 55, 'tag': 'QAnon'}, {'start': 56, 'end': 66, 'tag': 'ProudBoys'}]}, 'context_annotations': [{'domain': {'id': '10', 'name': 'Person', 'description': 'Named people in the world like Nelson Mandela'}, 'entity': {'id': '799022225751871488', 'name': 'Donald Trump', 'description': 'US President Donald Trump'}}, {'domain': {'id': '35', 'name': 'Politician', 'description': 'Politicians in the world, like Joe Biden'}, 'entity': {'id': '799022225751871488', 'name': 'Donald Trump', 'description': 'US President Donald Trump'}}], 'text': 'RT @NinjaHodon: Here’s an example of the average #QAnon #ProudBoys crackass trash that’s going to vote for @realDonaldTrump. \n\n https://t.…', 'referenced_tweets': [{'type': 'retweeted', 'id': '1315363137240010753'}], 'conversation_id': '1315441338427506689', 'id': '1315441338427506689', 'lang': 'en', 'public_metrics': {'retweet_count': 20, 'reply_count': 0, 'like_count': 0, 'quote_count': 0}, 'created_at': '20201011T23:57:09.000Z', 'source': 'Twitter for Android', 'possibly_sensitive': False}

Reformatted sample line which causes an issue:

    {"users": [{"id": "437781219", "username": "HakesJon", "location": `"Wisconsin", "description": "#IndieFictionWriter. Husband. Father. Bearded.\n#BlackLivesMatter #DemilitarizeThePolice #DismantlePolicing", "name": "Jon Hakes", "created_at": "20111215T20:42:41.000Z"}, {"id": "1171947445841997824", "username": "FactNc", "location": "Under Carolina blue skies ", "description": "Defender of truth, justice and the American way.  "I never give them hell. I just tell the truth and they think it\'s hell." Harry S. Truman", "name": "NCFactFinder", "created_at": "20190912T00:44:21.000Z"}, {"id": "315041625", "username": "o0rimbuk0o", "description": "Your desire to put pronouns here is not my issue. Get help.\n\n#resist #notmypresident\n#FBiden", "name": "Sick of it", "created_at": "20110611T06:16:11.000Z"}, {"id": "3141427487", "username": "theGeekSheek", "description": "I don't believe in your God.  Don't tell me he hates me.", "name": "Chic Geek", "created_at": "20150406T18:34:45.000Z"}, {"id": "1084112678", "username": "KarinBorjeesson", "description": "Love to help people & animals in need. Love music. Fucking hate racists. 
#Anon #OpExposeCPS #BLM #FreePalestine #Yemen #OpSerenaShim #Animalrights #NoDAPL", "name": "AnonyMISSKarin", "created_at": "20130112T20:57:28.000Z"}, {"id": "1003712866011308033", "username": "persian_pesar", "description": "\u200f\u200f\u200f\u200f\u200f\u200f\u200f\u200f\u200f\u200f\u200f\u200f\u200f\u200f\u200fبه ستواری و سختی رشک پولاد/\nبه راه عشق سرها داده بر باد/\nقرین بیستون هم\u200cسنگ فرهاد/\nز کرمانشاهیان یاد اینچنین باد\n\u200e#Civil_Environment_Engineer", "name": "persianpesar🏳\u200d🌈", "created_at": "20180604T18:59:30.000Z"}, {"id": "814795859644809217", "username": "Aazadist", "description": "\u200f\u200e#Equality🌐\n\u200e#Humanity🌐\nخواهی نشوی همرنگ ، رسوای جماعت شو", "name": "Aazad 🏳️\u200d🌈 آزاد", "created_at": "20161230T11:30:45.000Z"}, {"id": "790375699638915072", "username": "Isaihstewart", "location": "Los Angeles, CA", "description": "Part time assistant manager at “Sheets and Things”", "name": "Dey got the henessey 🗣", "created_at": "20161024T02:13:46.000Z"}, {"id": "4846243708", "username": "williamvercetti", "location": "Virginia Beach, VA", "description": "vma. art. modelo papi. tpain to the dms.", "name": "William Vercetti", "created_at": "20160125T17:21:50.000Z"}, {"id": "1160723882", "username": "k_cawsey", "location": "Halifax, Nova Scotia", "description": "Chaucer, Malory, Arthur Tolkien. @Dal_English", "name": "Dr. Kathy Cawsey", "created_at": "20130208T17:15:30.000Z"}, {"id": "3789298943", "username": "solomonesther17", "location": "Lagos, Nigeria", "description": "FairBib Legal Practitioners", "name": "Esther Solomon", "created_at": "20150927T04:52:29.000Z"}, {"id": "14860380", "username": "Dejify", "location": "San Francisco", "description": "The Nigerian State is a festering boil that the world can't afford to ignore. 
Because, when it pops, its rancid ooze won't be pleasant nor easy to contain.", "name": "Buhari: Uber Ment (Dèjì Akọ́mọláfẹ́)", "created_at": "20080521T18:57:27.000Z"}, {"id": "1120883223070773248", "username": "Donna780780", "description": "", "name": "Donna Swidley", "created_at": "20190424T02:52:40.000Z"}, {"id": "1253742908487929858", "username": "Neros_sis", "location": "Florida", "description": "", "name": "@Nero's Fiddle  GOP has a terrorism problem", "created_at": "20200424T17:50:00.000Z"}, {"id": "585090491", "username": "vickierae562", "location": "The LBC", "description": "That’s Right, I’m a Lefty 🤣 and I don’t feed trolls! #resist #DumpTrump #DitchMitch #LooseLindsey", "name": "Vickie Rae", "created_at": "20120519T21:00:28.000Z"}, {"id": "1262122532607574022", "username": "EmilySi49944255", "description": "", "name": "Skylar Aubrey", "created_at": "20200517T20:47:34.000Z"}, {"id": "1401663176", "username": "mdeHummelchen", "location": "Tief im Westen", "description": "Pflegewissenschaftlerin,Pflegeberaterin,Dozentin,Lächeln und winken...Pro Pflegekammer", "name": "Madame Hummelchen 💙", "created_at": "20130504T07:44:32.000Z"}, {"id": "2381808114", "username": "mommy97giraffe", "location": "Antifa HQs/Mom Division Office", "description": "Follower of Jesus, Mennonite mom&wife, lover of books, world, peo, poetry&art. 
6 autoimmunes&fibro🥄ie Proud Mama Bear of 1gayD & 1pan&autistic son, in 20s🌈💖", "name": "Mennonite Mom(she/her)", "created_at": "20140310T08:51:02.000Z"}, {"id": "2362182011", "username": "rd2glry", "location": "Washington, DC", "description": "", "name": "ateachr", "created_at": "20140224T04:07:21.000Z"}, {"id": "974917494870700032", "username": "GiraffeOld", "location": "Arizona, USA", "description": "", "name": "old man giraffe", "created_at": "20180317T07:56:58.000Z"}, {"id": "830939480", "username": "redz041", "description": "", "name": "Jan Mouzone", "created_at": "20120918T12:18:36.000Z"}, {"id": "3346032292", "username": "kumccaig44", "description": "", "name": "Katrine McCaig", "created_at": "20150625T21:25:21.000Z"}, {"id": "80630279", "username": "LuluTheCalm", "location": "Green Grass & Puddles, Canada", "description": "Mischief in My Eyes & Adventure in My Soul. \nLet's Have a Laugh &, you know, Make the World a Better Place.😎 \nAus/Brit/Cdn🇨🇦", "name": "Lulu 🇨🇦#BeKindBeCalmBeSafe💞 😷 🎏", "created_at": "20091007T17:26:56.000Z"}, {"id": "3252437864", "username": "engelhardterin", "location": "Houston, TX || Lubbock, TX", "description": "24 || Texas Tech || ♀️ || she/her", "name": "Erin Engelhardt", "created_at": "20150622T07:26:28.000Z"}, {"id": "93797267", "username": "mcbeaz", "location": "he/him", "description": "black lives matter.", "name": "mike", "created_at": "20091201T05:28:58.000Z"}, {"id": "2585773107", "username": "michiganington", "location": "Washington, D.C. ", "description": "", "name": "Allyoop", "created_at": "20140606T02:12:33.000Z"}, {"id": "27857135", "username": "JackRayher", "location": "Northport, NY", "description": "Senior Marketing Executive\nLifelong Democrat\n#BidenHarris", "name": "Jack Rayher", "created_at": "20090331T12:12:03.000Z"}, {"id": "1078457644736827392", "username": "RobertCooper58", "description": "Bilingual community advocate. Father of five wonderful kids. 
Lifelong progressive and proud member of @TheDemCoalition. Early supporter of President @JoeBiden.", "name": "Robert Cooper 🌊", "created_at": "20181228T01:08:34.000Z"}, {"id": "206860139", "username": "MariaArtze", "location": "Münster, Deutschland", "description": "Nas trincheiras da ESO\nEmigrante a medio retornar. Womansplainer.\n(Sie  vostede)\n\nTrans rights are human rights.", "name": "A Malvada Profe mediovacinada", "created_at": "20101023T22:27:26.000Z"}, {"id": "2903906123", "username": "lm1067", "location": "London, England", "description": "B A FINE ARTIST GRADUATED", "name": "Luis Pais", "created_at": "20141203T15:53:10.000Z"}, {"id": "64119853", "username": "IAM_SHAKESPEARE", "location": "Tweeting from the Grave", "description": "This bot has tweeted the complete works of Shakespeare (in order) 5 times over the last 12years. On hiatus for a bit. Created by @strebel", "name": "Willy Shakes", "created_at": "20090809T05:41:08.000Z"}, {"id": "3176623941", "username": "acastellich", "location": "Chicago, Il.", "description": "Abogado,Restaurantero,Immigrant , UVM. AD1 IPADE MBA. Restaurant Hospitality Industry, Chicago IL.", "name": "Alejandro Castelli", "created_at": "20150417T13:23:17.000Z"}, {"id": "782765390925533185", "username": "Diane_L_Espo", "location": "Florida, USA", "description": "", "name": "DianeEspo 🇺🇲🗽", "created_at": "20161003T02:13:07.000Z"}, {"id": "67471020", "username": "thedcma", "location": "Fort Lauderdale, FL", "description": "🖤💎 Style is the only substance I abuse.💎🖤 I’m just a 🌈 Gay 🐔Hillbilly 🔮Warlock 🛵 Riding a 👨🏻\u200d🎤Vaporwave Fever Dream #blacklivesmatter", "name": "Grace Kelly on Steiroids", "created_at": "20090821T00:32:37.000Z"}, {"id": "78797635", "username": "graciosodiablo", "description": "Too much of a good thing can be bad.  So too little of a bad thing must be good. 
160 characters or less of me should be perfect.", "name": "gracioso diabloint", "created_at": "20091001T03:59:16.000Z"}, {"id": "268314713", "username": "philppedurand", "location": "Auxerre", "description": "Je suis une personne gentille je milite pour la PMA. je suis militant communiste je suis aussi à l’association des Rosoirs je suis conseillé quartier", "name": "Philippe durand", "created_at": "20110318T14:37:36.000Z"}, {"id": "37996028", "username": "nicrawhide", "location": "Pinconning Michigan ", "description": "Just your average small town gay with big town sensibility!!", "name": "Nicholas Bean", "created_at": "20090505T19:20:37.000Z"}, {"id": "1236656342674407427", "username": "LadyJayPersists", "location": "Valhalla", "description": "USN Veteran | Shieldmaiden | Mom | Not here for a man, I have one | PTSD Warrior | My mind is a beautiful servant to a dangerous master", "name": "Jax", "created_at": "20200308T14:13:48.000Z"}, {"id": "171183306", "username": "dawndawnB", "location": "United States", "description": "Mrs. B, mother of 2 amazing kids, Substance Abuse Counselor, Volunteer, Music Lover. Born in DC but a VA Lo❤️er!", "name": "nwad", "created_at": "20100726T19:21:24.000Z"}, {"id": "817247846751555587", "username": "me2020_2021", "location": "Brisbane, Queensland", "description": "Proud Aussie, living a wonderful life with my wife, Australian Cricket 🏏👏,😷 🍺🥃 \U0001f9ae Alex", "name": "👀🏳️\u200d🌈 "A girl has no Name”', 'created_at': '20170106T05:54:05.000Z'}, {'id': '879459933988585472', 'username': 'Davecl3069', 'location': 'San Francisco Bay Area', 'description': 'proud of my views, life long learner,& hopefully, that guy!\n#LowerTheFlagForCovidVictims #VoteBlue #BLM #SupportThePlayers #LGBTQ #WeNeedToDoBetter #ResistStill', 'name': 'David', 'created_at': '20170626T22:02:42.000Z'}`

Code used

Regex:

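    import regex as re  # third-party "regex" package; the stdlib re module rejects (*SKIP)(*FAIL)

    def convert_to_json(file):
        with open(file, "r", encoding="utf-8") as f:
            x = f.read()
        x = x.replace("-", "")
        # Skip anything already inside double quotes, then match the remaining single quotes
        rx = re.compile(r'"[^"]*"(*SKIP)(*FAIL)|\'')
        decoded = rx.sub('"', x)
        return decoded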

literal_eval:

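    import json
    from ast import literal_eval

    def open_json():
        with open("data.json", "r", encoding="utf-8") as f:
            # literal_eval needs the file's text, not the file object
            data = literal_eval(f.read())
        # Serialize the parsed object to a JSON string
        return json.dumps(data)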

What I would like to achieve

  1. Reformat the data to conform to JSON (this question) in order to be able to
  2. Build a dataframe with the relevant tweet text, user information and metadata (secondary goal) to be used in further analyses.

Thanks in advance for any suggestions! :)


Solution

  • So I figured out a way to process the corrupt data. The solution can be found here.

    Using ast.literal_eval(input_string) lets me read in the corrupt JSON lines as dictionaries. The only thing to make sure of is that the input string contains no leading or trailing whitespace, commas, etc. Example code for reading in the data with ast.literal_eval():

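    from ast import literal_eval

    dictlist = []
    with open("inputdata.json", "r", encoding="utf-8") as f:
        for line in f:
            # Use the loop variable; calling f.readline() here would skip every other line
            x = line.strip().rstrip(",")
            if not x:
                continue  # ignore blank lines
            data = literal_eval(x)  # parse the Python-style dict literal
            dictlist.append(data)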
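Once a line has been parsed into a dictionary, writing it back out with json.dumps should give valid JSON, which covers the first goal of the question. The following is a minimal sketch under that assumption; the function names and file paths are hypothetical, and the sample record is made up for illustration rather than taken from the data.

```python
import json
from ast import literal_eval


def repr_line_to_json(line: str) -> str:
    """Parse one Python-style dict line and re-serialize it as JSON."""
    record = literal_eval(line.strip().rstrip(","))
    # ensure_ascii=False keeps non-English tweet text readable in the output
    return json.dumps(record, ensure_ascii=False)


def convert_file(src_path: str, dst_path: str) -> None:
    """Write one JSON object per line (paths are placeholders)."""
    with open(src_path, "r", encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if line.strip():
                dst.write(repr_line_to_json(line) + "\n")


# Made-up sample with a leading space, a trailing comma and a Python boolean
sample = """ {'id': '1315441338427506689', 'text': "I don't know", 'possibly_sensitive': False},\n"""
converted = repr_line_to_json(sample)
print(converted)
```

The resulting file has one JSON object per line, so it can be read line by line with json.loads or with any reader that supports line-delimited JSON. Lines that literal_eval cannot parse (for example, ones already altered by the regex attempt) would still raise an error and need separate handling.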