I have a ORC file i am able to read that into a DataFrame using Pyspark 2.2.0
from pyspark.context import SparkContext
from pyspark.sql.types import *
from pyspark.sql.functions import *
df = spark.read.orc("s3://leadid-sandbox/krish/lead_test/")
The above df has a schema like below
root
|-- item: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
sample data looks like this(Just a sample data not the entire dataset)
item
{http_Accept-Language={"s":"en-US"}, Win64={"n":"1"},
geoip_region={"s":"FL"}, Platform={"s":"Win7"}, geoip_postal_code=
{"s":"33432"}, JavaApplets={"n":"1"}, http_Accept={"s":"*/*"},
Version={"s":"11.0"}, Cookies={"n":"1"}, Platform_Version=
{"s":"6.1"}, http_Content-Type={"s":"application/x-www-form-
urlencoded"}}
{http_Accept-Language={"s":"en-US"}, Win64={"n":"1"}, IFrames=
{"n":"1"}, geoip_region={"s":"CA"}, Platform={"s":"Win7"}, Parent=
{"s":"IE 11.0"}, http_Dnt={"n":"1"}}
So I exploded "item" like below
expDf = df.select(explode("item"))
The above DataFrame has below schema and when i do an show(2) has the below details
root
|-- key: string (nullable = false)
|-- value: string (nullable = true)
+------------+----------+
| key| value|
+------------+----------+
|geoip_region|
{
"s": "FL"
}
|
| Tables|
{
"n": "1"
}
|
+------------+----------+
How can i select data out of this DataFrame ? I have tried different ways but of no use. So i would need 'geoip_region' with value as 'FL' and so on. Any help is appreciated.
Thanks Joshi for your response, for some reason i was getting row[0] not found error on my code, I was running this on AWS Glue environment may be that could be a reason.
I got what i wanted using the below code.
# Creating a DataFrame of the raw file
df = spark.read.orc("s3://leadid-sandbox/krish/lead_test/")
# Creating a temp view called Leads for the above dataFrame
df.createOrReplaceTempView("leads")
# Extracting the data using normal SQL from the above created Temp
View
tblSel = spark.sql("SELECT get_json_object(item['token'], '$.s') as
token, get_json_object(item['account_code'], '$.s') as account_code
from leads").show()