Tags: python, dataframe, apache-spark-sql, rdd

Convert RDD rows into one column


I am trying to get all the values from the rows into a single column (a flat list). I don't have an index, so I'm finding it hard to get them all into one column.

Code for getting the values:

from pyspark.sql.types import StructField, StringType

traceFilters = sqlContext.read.format("csv").options(header='true', delimiter=',').load("/data/*.txt")

traceFilters.take(5)
fields = [
    StructField("City", StringType(), False),
    StructField("Country", StringType(), False)
]

for row in traceFilters.rdd.collect():
    a = row.City
    print(a)

This is the output I get from the code above:

New York
London
Vienna

and this is the result I want:

[ New York, London, Vienna ]

I tried using transpose, but it's not working, and the same with zip. Code I tried:

print(a.transpose())

or val1 = a.set_index('City').T

Any help appreciated.

Thanks


Solution

  • It looks like you are printing each value individually, when what you really want is a list. Append each value to a list, then print the list:

    from pyspark.sql.types import StructField, StringType
    
    traceFilters = sqlContext.read.format("csv").options(header='true', delimiter=',').load("/data/*.txt")
    
    fields = [
        StructField("City", StringType(), False),
        StructField("Country", StringType(), False)
    ]
    
    a = []
    for row in traceFilters.rdd.collect():
        a.append(row.City)
    print(a)
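
    The same idea can also be written as a list comprehension. Here is a minimal, self-contained sketch of that pattern; since a Spark `Row` supports attribute access like a namedtuple, a `namedtuple` stands in for the collected rows so the example runs without a Spark session (the sample data is made up for illustration):

    ```python
    from collections import namedtuple

    # Stand-in for pyspark.sql.Row: a Spark Row supports the same
    # row.City attribute access as a namedtuple.
    Row = namedtuple("Row", ["City", "Country"])

    # Hypothetical sample data, mimicking traceFilters.rdd.collect()
    rows = [Row("New York", "USA"), Row("London", "UK"), Row("Vienna", "Austria")]

    # Build the whole column in one step instead of appending in a loop
    cities = [row.City for row in rows]
    print(cities)  # ['New York', 'London', 'Vienna']
    ```

    With a real DataFrame, the equivalent would be a comprehension over `traceFilters.rdd.collect()`, pulling `row.City` from each row.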