
Spark RDD loads all fields in the csv file as string


I have a CSV file and I'm loading it as follows:

sc.textFile("market.csv").take(3)

The output is this:

['"ID","Area","Postcode","Amount"',
'"1234/232","City","8479","20000"',
'"5987/215","Metro","1111","25000"']

Loading it with a map operation:

sc.textFile("market.csv").map(lambda line: line.split(",")).take(3)

Gives me:

[['"ID"','"Area"','"Postcode"','"Amount"'],
['"1234/232"','"City"','"8479"','"20000"'],
['"5987/215"','"Metro"','"1111"','"25000"']]

There are too many extra quotes (" and '), which makes it impossible to analyze my results!

I want to have an output like this:

[["ID","Area","Postcode","Amount"],
["1234/232","City",8479,20000],
["5987/215","Metro",1111,25000]]

where the text values are strings and the numbers are int/double.

How can I do that? Thanks.


Solution

  • Here is one way: you have to parse the lines manually.

    # Strip the quotes from each line, then split on commas
    rdd = sc.textFile("market.csv")
    rdd = rdd.map(lambda line: line.replace('"', '').split(','))

    # A row is the header if it contains the 'ID' column name
    def isHeader(row):
        return 'ID' in str(row)

    # Keep the header row as-is; cast the numeric columns of the data rows to int
    rdd1 = rdd.filter(isHeader)
    rdd2 = rdd.filter(lambda x: not isHeader(x)) \
              .map(lambda line: [line[0], line[1], int(line[2]), int(line[3])])

    rdd1.union(rdd2).collect()
    
    
    [['ID', 'Area', 'Postcode', 'Amount'],
     ['1234/232', 'City', 8479, 20000],
     ['5987/215', 'Metro', 1111, 25000]]
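
  • Alternatively, if a DataFrame is acceptable, you can let Spark's CSV reader handle the quoting and the type casting. This is a minimal sketch, assuming Spark 2.x+ with a SparkSession available as spark and the same market.csv file:

    # Read the CSV with a header row; inferSchema casts numeric columns for you
    df = spark.read.csv("market.csv", header=True, inferSchema=True, quote='"')
    df.printSchema()   # Postcode and Amount should come back as integer columns

    # Convert back to an RDD of lists if you still need one
    rdd = df.rdd.map(list)

    Note that the manual split(',') above also breaks if a quoted field ever contains a comma; the CSV reader handles that case for you.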