
How can I extract some contents from the cells of a web-scraped CSV file?


I am struggling with a CSV file that was scraped from a crowdfunding website.

My goal is to load all the information as separate columns, but some of it ends up mixed into a single column when I load the file with 1) R, 2) Stata, or 3) Python.

Since the real data is very dirty, here is an abbreviated version of the current dataset.

ID Pledge creator
000001 13.7 {"urls":{"web":{"user":"www.kickstarter.com/profile/731"}}, "name":John","id":709510333}
000002 26.4 {"urls":{"web":{"user":"www.kickstarter.com/profile/759"}}, "name":Kellen","id":703514812}
000003 7.6 {"urls":{"web":{"user":"www.kickstarter.com/profile/7522"}}, "name":Jach","id":609542647}

My goal is to extract "name" and "id" as separate columns, but they are mixed with URLs in the creator column.

Is there any way that I can extract names (John, Kellen, Jach) and ids as separate columns? I prefer R, but Stata and Python would also be helpful!

Thank you so much for considering this.


Solution

  • If you only want the name and id, without any of the other values, replace the code that sets the creator column with the expression below, where `creator` stands for whatever variable holds the parsed dictionary:

    {"name": creator["name"], "id": creator["id"]}
    
    

    If the JSON data is not well formed (for example, a missing quote), you can try regular expressions instead.
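
    Putting both suggestions together, here is a minimal sketch in Python. It assumes each creator cell has already been read as a string. The function name `extract_creator` and the two regex patterns are made up for illustration, and the patterns only cover the kind of damage shown in the question (a missing opening quote around the name), so real data may need looser ones.

```python
import json
import re

# Hypothetical fallback patterns for cells whose JSON is malformed,
# e.g. "name":John" with the opening quote missing.
NAME_RE = re.compile(r'"name"\s*:\s*"?([^",}]+)')
ID_RE = re.compile(r'"id"\s*:\s*(\d+)')


def extract_creator(raw):
    """Return (name, id) for one raw creator cell, or None for missing parts."""
    if not isinstance(raw, str):
        return None, None
    try:
        # Preferred path: the cell is valid JSON.
        creator = json.loads(raw)
        return creator["name"], creator["id"]
    except (json.JSONDecodeError, KeyError, TypeError):
        # Fallback path: pull the two fields out with regular expressions.
        name = NAME_RE.search(raw)
        cid = ID_RE.search(raw)
        return (
            name.group(1) if name else None,
            int(cid.group(1)) if cid else None,
        )


# The three creator cells from the question, copied as-is (malformed JSON).
rows = [
    '{"urls":{"web":{"user":"www.kickstarter.com/profile/731"}}, "name":John","id":709510333}',
    '{"urls":{"web":{"user":"www.kickstarter.com/profile/759"}}, "name":Kellen","id":703514812}',
    '{"urls":{"web":{"user":"www.kickstarter.com/profile/7522"}}, "name":Jach","id":609542647}',
]

results = [extract_creator(r) for r in rows]
for name, cid in results:
    print(name, cid)
```

    If the data is in a pandas DataFrame, the same function could be applied to the creator column (for example with `Series.apply`) and the resulting pairs split into two new columns.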