I have following code for my pyspark script. I am trying to generate a query and run it every time with different values of i
. The query should select nested JSON elements and calculate the size (i.e. number of occurrences). I am calculating this to help me with unit testing the final table which I am creating separately with "explode" functionality
for i in range(1,10) :
onerowDF = spark.sql("""SELECT items['responses'][i]['id'] as items_response_id, items['responses'][i]['name'] as responses_name FROM responses""")
onerowDf.select(size("items_response_id"), size("responses_name")).show()
I am getting error when I run this:
AnalysisException: u"cannot resolve '`i`' given input columns: [hasMore, items, total]; line 1 pos 74;\n'Project [items#1.id AS items_id#149, items#1.responseTime AS items.responseTime#154, items#1.responses['i][id] AS items_response_id#150, items#1.responses['i][name] AS responses_name#151, items#1.responses['i][type] AS responses_type#152, items#1.responses['i][answers] AS responses_answers#153]\n+- SubqueryAlias responses\n +- Relation[hasMore#0,items#1,total#2L] json\n"
I have knowingly removed some elements from the code above to make it simpler which is why the error lists more element than my code here.
So why can't I replace the value of I in each query and run the 2 statements and get results?
for i in range(1,10) :
onerowDF = spark.sql("SELECT items['responses']["+i+"]['id'] as items_response_id,
items['responses']["+i+"]['name'] as responses_name FROM responses")")
onerowDf.select(size("+items_response_id+"), size("+responses_name+")).show()
make sure you concatenate your variables. do not include in the string. instead of having "...i..."
make sure you do "..."+i+"...."
.
if u have another i you are using may i am wrong. but also check
"onerowDf.select(size("+items_response_id+"), `size("+responses_name+")).show()"`