Search code examples
fileapache-sparkpysparkapache-spark-sqltext-files

Insert data to different columns based on the first value from a text file


This is my text file:

1|16255|usa|Test||TEST806282|||||
2|16255|806282|company_member|False|true
3|16255|my_data
1|18299|usa|Test||TEST260092|||||
2|18299|260092|company_member|False|False

I want to insert this data into different columns in a single table based on the key (1,2,3,1,2) using Spark SQL.

Desired output:

enter image description here


Solution

  • Thank you for the clarifications. Something along the following lines should do the trick. Text files are best read using rdds.

    rdd = spark.sparkContext.textFile('my_path/my_file.txt')
    rdd = rdd.map(lambda k: k.split('|'))
    
    def relocate(e):
        if e[0] == '1':
            return [e[0], e[1], e[2], e[5], '', '', '', '', '', '']
        elif e[0] == '2':
            return [e[0], e[1], '', '', '', e[2], e[3], e[4], e[5], '']
        elif e[0] == '3':
            return [e[0], e[1], '', '', '', '', '', '', '', e[2]]
    
    rdd = rdd.map(relocate)
    df = rdd.toDF(['recordtype', 'idnumber', 'country', 'designation', 'dept', 'empnumber', 'membertype', 'billable', 'active', 'Emplinfo'])
    
    df.show()
    # +----------+--------+-------+-----------+----+---------+--------------+--------+------+--------+
    # |recordtype|idnumber|country|designation|dept|empnumber|    membertype|billable|active|Emplinfo|
    # +----------+--------+-------+-----------+----+---------+--------------+--------+------+--------+
    # |         1|   16255|    usa| TEST806282|    |         |              |        |      |        |
    # |         2|   16255|       |           |    |   806282|company_member|   False|  true|        |
    # |         3|   16255|       |           |    |         |              |        |      | my_data|
    # |         1|   18299|    usa| TEST260092|    |         |              |        |      |        |
    # |         2|   18299|       |           |    |   260092|company_member|   False| False|        |
    # +----------+--------+-------+-----------+----+---------+--------------+--------+------+--------+