Tags: python, r, json, csv, twitter

Is there any way to transform this data format into CSV?


I have a large amount of extracted JSON data in the format shown below. Is there a way to convert it to CSV, with the features as columns and their values in the rows?

{"state": "New Jersey", "text": "RT @joncoopertweets: Register to join the #WeThePeopleMarch on September 21st in Washington, D.C. \u2014 or one of the 50+ marches that will be\u2026", "has_emoji": false, "created_at": "Mon Sep 02 16:32:05 +0000 2019", "id": 1168562246349467649, "entities": {"hashtags": [{"text": "WeThePeopleMarch", "indices": [42, 59]}], "urls": [], "user_mentions": [{"screen_name": "joncoopertweets", "name": "Jon Cooper", "id": 27493883, "id_str": "27493883", "indices": [3, 19]}], "symbols": []}, "source": "Twitter for iPad", "location": "Leonia, NJ", "verified": false, "geocode": null}
{"state": "Indiana", "text": "RT @dariusherron1: Don\u2019t nobody love they girl like Mexicans ", "has_emoji": false, "created_at": "Mon Sep 02 16:32:05 +0000 2019", "id": 1168562246378827776, "entities": {"hashtags": [], "urls": [{"url": "", "expanded_url": "", "display_url": "", "indices": [61, 84]}], "user_mentions": [{"screen_name": "dariusherron1", "name": "Darius Herron", "id": 1680891876, "id_str": "1680891876", "indices": [3, 17]}], "symbols": []}, "source": "Twitter for iPhone", "location": "Indianapolis, IN", "verified": false, "geocode": null}



Solution

  • I'm not entirely clear on your expected output (see the comments on @user5783745's answer). Your JSON strings contain nested objects, which produce a nested list structure when parsed with jsonlite::fromJSON. Since you don't give the expected output for your sample data, there are several reasonable ways to handle these nested entries.

    One possibility is to parse each JSON string and then flatten the resulting list twice before binding the rows.

    library(tidyverse)
    library(jsonlite)
    # purrr::flatten() is named explicitly because jsonlite also exports a flatten()
    map(json, ~fromJSON(.x) %>% purrr::flatten() %>% purrr::flatten()) %>% bind_rows()
    ## A tibble: 2 x 15
    #  state text  has_emoji created_at     id indices screen_name name  id_str
    #  <chr> <chr> <lgl>     <chr>       <dbl> <list>  <chr>       <chr> <chr>
    #1 New … WeTh… FALSE     Mon Sep 0… 2.75e7 <int [… joncoopert… Jon … 27493…
    #2 Indi… "RT … FALSE     Mon Sep 0… 1.68e9 <int [… dariusherr… Dari… 16808…
    ## … with 6 more variables: source <chr>, location <chr>, verified <lgl>,
    ##   url <chr>, expanded_url <chr>, display_url <chr>
    

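    The result is a tibble with some list columns (here, indices). A CSV cell cannot hold a list, so to write the tibble to a CSV file you would first drop or collapse those columns. Note also that in the output above the nested text and id values appear to have replaced the top-level ones (id shows the mentioned user's id, not the tweet's), so check for name collisions before relying on this approach.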


    Sample data

    json <- c(
        '{"state": "New Jersey", "text": "RT @joncoopertweets: Register to join the #WeThePeopleMarch on September 21st in Washington, D.C. \u2014 or one of the 50+ marches that will be\u2026", "has_emoji": false, "created_at": "Mon Sep 02 16:32:05 +0000 2019", "id": 1168562246349467649, "entities": {"hashtags": [{"text": "WeThePeopleMarch", "indices": [42, 59]}], "urls": [], "user_mentions": [{"screen_name": "joncoopertweets", "name": "Jon Cooper", "id": 27493883, "id_str": "27493883", "indices": [3, 19]}], "symbols": []}, "source": "Twitter for iPad", "location": "Leonia, NJ", "verified": false, "geocode": null}',
        '{"state": "Indiana", "text": "RT @dariusherron1: Don\u2019t nobody love they girl like Mexicans ", "has_emoji": false, "created_at": "Mon Sep 02 16:32:05 +0000 2019", "id": 1168562246378827776, "entities": {"hashtags": [], "urls": [{"url": "", "expanded_url": "", "display_url": "", "indices": [61, 84]}], "user_mentions": [{"screen_name": "dariusherron1", "name": "Darius Herron", "id": 1680891876, "id_str": "1680891876", "indices": [3, 17]}], "symbols": []}, "source": "Twitter for iPhone", "location": "Indianapolis, IN", "verified": false, "geocode": null}')
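
For readers working in Python rather than R, the same idea (parse each line, flatten the nested objects, write one row per record) can be sketched with the standard library alone. This is a minimal sketch, not the answer's method: the helper names `flatten` and `json_lines_to_csv` are made up for illustration, nested keys are joined with a dot to avoid name collisions, and lists are stored as JSON text instead of being dropped. The two sample records are abbreviated versions of the ones in the question.

```python
import csv
import io
import json


def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into one level; nested keys are joined with `sep`."""
    flat = {}
    for key, value in record.items():
        name = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        elif isinstance(value, list):
            # A list has no natural single-cell form, so keep it as JSON text.
            flat[name] = json.dumps(value)
        else:
            flat[name] = value
    return flat


def json_lines_to_csv(lines, out):
    """Write one CSV row per non-empty JSON line and return the column names."""
    rows = [flatten(json.loads(line)) for line in lines if line.strip()]
    # Columns are the union of all keys, in order of first appearance.
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return fieldnames


# Abbreviated versions of the two records from the question (assumption:
# the real input has one JSON object per line, as shown there).
sample = [
    '{"state": "New Jersey", "id": 1168562246349467649, "entities": {"hashtags": [{"text": "WeThePeopleMarch", "indices": [42, 59]}], "urls": []}, "source": "Twitter for iPad", "verified": false, "geocode": null}',
    '{"state": "Indiana", "id": 1168562246378827776, "entities": {"hashtags": [], "urls": []}, "source": "Twitter for iPhone", "verified": false, "geocode": null}',
]

buffer = io.StringIO()
columns = json_lines_to_csv(sample, buffer)
print(buffer.getvalue())
```

To process a real file, pass an open file object as `lines` and a second file opened with `newline=""` as `out`. Python parses the tweet `id` as an exact integer, which matters here because these ids are larger than 2^53 and cannot be represented exactly as a double.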