Search code examples
pythonloopspysparkapache-spark-sqlcode-generation

Loop to generate and execute commands doesn't work


I have following code for my pyspark script. I am trying to generate a query and run it every time with different values of i. The query should select nested JSON elements and calculate the size (i.e. number of occurrences). I am calculating this to help me with unit testing the final table which I am creating separately with "explode" functionality

for i in range(1,10) : 
onerowDF = spark.sql("""SELECT items['responses'][i]['id'] as items_response_id, items['responses'][i]['name'] as responses_name FROM responses""")
onerowDf.select(size("items_response_id"), size("responses_name")).show()

I am getting error when I run this:

AnalysisException: u"cannot resolve '`i`' given input columns: [hasMore, items, total]; line 1 pos 74;\n'Project [items#1.id AS items_id#149, items#1.responseTime AS items.responseTime#154, items#1.responses['i][id] AS items_response_id#150, items#1.responses['i][name] AS responses_name#151, items#1.responses['i][type] AS responses_type#152, items#1.responses['i][answers] AS responses_answers#153]\n+- SubqueryAlias responses\n   +- Relation[hasMore#0,items#1,total#2L] json\n"

I have knowingly removed some elements from the code above to make it simpler which is why the error lists more element than my code here.

So why can't I replace the value of I in each query and run the 2 statements and get results?


Solution

  • for i in range(1,10) : 
        onerowDF = spark.sql("SELECT items['responses']["+i+"]['id'] as items_response_id, 
               items['responses']["+i+"]['name'] as responses_name FROM responses")")
        onerowDf.select(size("+items_response_id+"), size("+responses_name+")).show()
    

    make sure you concatenate your variables. do not include in the string. instead of having "...i..." make sure you do "..."+i+"....". if u have another i you are using may i am wrong. but also check

     "onerowDf.select(size("+items_response_id+"), `size("+responses_name+")).show()"`