python | apache-spark | pyspark | orc

Extract data out of ORC using Pyspark


I have an ORC file that I am able to read into a DataFrame using PySpark 2.2.0:

from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import *

# `spark` is the active SparkSession (created by the shell/Glue, or explicitly as below)
spark = SparkSession.builder.getOrCreate()

df = spark.read.orc("s3://leadid-sandbox/krish/lead_test/")

The above df has the following schema:

root
 |-- item: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)

Sample data looks like this (just a sample, not the entire dataset):

item
{http_Accept-Language={"s":"en-US"}, Win64={"n":"1"}, 
geoip_region={"s":"FL"}, Platform={"s":"Win7"}, geoip_postal_code=
{"s":"33432"}, JavaApplets={"n":"1"}, http_Accept={"s":"*/*"}, 
Version={"s":"11.0"}, Cookies={"n":"1"}, Platform_Version=
{"s":"6.1"}, http_Content-Type={"s":"application/x-www-form-
urlencoded"}}
{http_Accept-Language={"s":"en-US"}, Win64={"n":"1"}, IFrames=
{"n":"1"}, geoip_region={"s":"CA"}, Platform={"s":"Win7"}, Parent=
{"s":"IE 11.0"}, http_Dnt={"n":"1"}} 

So I exploded "item" like below:

expDf = df.select(explode("item"))

The above DataFrame has the schema below, and show(2) displays the following:

root
 |-- key: string (nullable = false)
 |-- value: string (nullable = true)

+------------+----------+
| key| value|
+------------+----------+
|geoip_region|
{
    "s": "FL"
}
|
| Tables|
{
   "n": "1"
}
|
+------------+----------+

How can I select data out of this DataFrame? I have tried different approaches without success. I need 'geoip_region' with the value 'FL', and so on. Any help is appreciated.


Solution

  • Thanks Joshi for your response; for some reason I was getting a "row[0] not found" error with that code. I was running this in an AWS Glue environment, which may be the reason.

    I got what I wanted using the code below.

    # Create a DataFrame from the raw file
    df = spark.read.orc("s3://leadid-sandbox/krish/lead_test/")

    # Register a temp view called "leads" for the above DataFrame
    df.createOrReplaceTempView("leads")

    # Extract the data with plain SQL against the temp view
    tblSel = spark.sql("""
        SELECT get_json_object(item['token'], '$.s') AS token,
               get_json_object(item['account_code'], '$.s') AS account_code
        FROM leads
    """)
    tblSel.show()
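
    For comparison, the same extraction can be done without a temp view, using the get_json_object column function directly. A self-contained sketch (the 'token' and 'account_code' keys and their values here are made-up stand-ins for the real data):

    ```python
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import get_json_object
    from pyspark.sql.types import MapType, StringType, StructField, StructType

    spark = SparkSession.builder.master("local[1]").appName("json-extract").getOrCreate()

    # Stand-in for the ORC DataFrame: map<string,string> with JSON-encoded values
    schema = StructType([StructField("item", MapType(StringType(), StringType()), True)])
    df = spark.createDataFrame(
        [({"token": '{"s":"abc123"}', "account_code": '{"s":"acct-1"}'},)], schema
    )

    # item['token'] looks up the map key; get_json_object pulls "$.s" out of the JSON value
    result = df.select(
        get_json_object(df["item"]["token"], "$.s").alias("token"),
        get_json_object(df["item"]["account_code"], "$.s").alias("account_code"),
    )
    result.show()
    ```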