Extracting geo coordinates from a complex nested Twitter json, using Python

I am reading multiple complex json files and trying to extract geo coordinates. I cannot attach the file itself right now, but I can print the tree here. The file has several hundred options and some objects repeat.

Please see the structure of the file in .txt format.

When I read the json with Spark in Python, it shows me these coordinates in coordinates column and it is there.

It is stored in coordinates column. Please see a proof.

I am obviously trying to reduce the number of columns and select some columns.

The last two columns are my geo coordinates. I tried both coordinates and geo and also coordinates.coordinates with geo.coordinates. Both options do not work.

df_tweets = tweets.select(['text', 
                       'user.name', 
                       'user.screen_name', 
                       'user.id', 
                       'user.location',  
                       'place.country', 
                       'place.full_name', 
                       'place.name',
                       'user.followers_count', 
                       'retweet_count',
                       'retweeted',
                       'user.friends_count',
                       'entities.hashtags.text', 
                       'created_at', 
                       'timestamp_ms', 
                       'lang',
                       'coordinates.coordinates', # or just `coordinates`
                       'geo.coordinates' # or just `geo`
                       ])

In the first case with coordinates and geo I get the following, printing the schema:

df_tweets.printSchema()

root
 |-- text: string (nullable = true)
 |-- name: string (nullable = true)
 |-- screen_name: string (nullable = true)
 |-- id: long (nullable = true)
 |-- location: string (nullable = true)
 |-- country: string (nullable = true)
 |-- full_name: string (nullable = true)
 |-- name: string (nullable = true)
 |-- followers_count: long (nullable = true)
 |-- retweet_count: long (nullable = true)
 |-- retweeted: boolean (nullable = true)
 |-- friends_count: long (nullable = true)
 |-- text: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- created_at: string (nullable = true)
 |-- timestamp_ms: string (nullable = true)
 |-- lang: string (nullable = true)
 |-- coordinates: struct (nullable = true)
 |    |-- coordinates: array (nullable = true)
 |    |    |-- element: double (containsNull = true)
 |    |-- type: string (nullable = true)
 |-- geo: struct (nullable = true)
 |    |-- coordinates: array (nullable = true)
 |    |    |-- element: double (containsNull = true)
 |    |-- type: string (nullable = true)

When I do coordinates.coordinates and geo.coordinates, I get

root
 |-- text: string (nullable = true)
 |-- name: string (nullable = true)
 |-- screen_name: string (nullable = true)
 |-- id: long (nullable = true)
 |-- location: string (nullable = true)
 |-- country: string (nullable = true)
 |-- full_name: string (nullable = true)
 |-- name: string (nullable = true)
 |-- followers_count: long (nullable = true)
 |-- retweet_count: long (nullable = true)
 |-- retweeted: boolean (nullable = true)
 |-- friends_count: long (nullable = true)
 |-- text: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- created_at: string (nullable = true)
 |-- timestamp_ms: string (nullable = true)
 |-- lang: string (nullable = true)
 |-- coordinates: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- coordinates: array (nullable = true)
 |    |-- element: double (containsNull = true)

When I print both dataframes in Pandas, none of them gives me coordinates, I still have None.

How to extract geo coordinates properly?

Solution

If I look at my dataframe with tweet data, I see it like this

In [44]: df[df.coordinates.notnull()]['coordinates']
Out[44]:
98    {'type': 'Point', 'coordinates': [-122.32111, ...
99    {'type': 'Point', 'coordinates': [-122.32111, ...
Name: coordinates, dtype: object

So it's a dictionary that has to be parsed

tweets_coords = df[df.coordinates.notnull()]['coordinates'].tolist()

for coords in tweets_coords:
    print(coords)
    print(coords['coordinates'])
    print(coords['coordinates'][0])
    print(coords['coordinates'][1])

Output:

{'type': 'Point', 'coordinates': [-122.32111, 47.62366]}
[-122.32111, 47.62366]
-122.32111
47.62362
{'type': 'Point', 'coordinates': [-122.32111, 47.62362]}
[-122.32111, 47.62362]
-122.32111
47.62362

You can setup a lambda function in apply() to parse these out row by row, otherwise you can use list comprehension using what i've provided as the basis for your analysis.

All that said, maybe check this first... Where you are using coordinates.coordinates and geo.coordinates, try coordinates['coordinates'] and geo['coordinates']