Search code examples
pythonpython-3.xdata-scienceetlaws-glue-spark

Concat / Join / Transform multiple columns to one struct column


I have very big, legacy file with ~5000 columns and very big amount of record. Many columns are named like a_1,a_2,...,a_200 etc.

I want to concatenate number of columns into struct (for better data manipulation later), so instead:

_| a_1 | a_2 | a_3 |...
0| true | false | true |...
1| false | true | false |...

I would like to have struct a { 1: true, 2: false, ... 200: true } .

How to transform it using Python, probably Panda's? Columns have always same prefix, like a_, b_... etc.

Greeetings


Solution

  • Considering a CSV that look like this

    _|a_1|a_2|a_3|a_4|b_1|b_2|b_3|b_4
    0|true|false|true|false|true|false|true|false
    1|false|true|false|true|false|true|false|true
    

    Here's how you could to it in python with just the standard library:

    import csv
    
    with open("data.csv", newline='') as csvfile:
        reader = csv.reader(csvfile, delimiter='|')
        
        headers = next(reader) # The first line is the table columns
        
        # let's extract the tuples (letter, number) from the table columns
        dataranks = [tuple(x.split("_")) for x in headers[1:]] # dataranks = [('a', '1'), ('a', '2'), ('a', '3'), ('a', '4'), ('b', '1'), ('b', '2'), ('b', '3'), ('b', '4')
    
        joined_data = []
    
        for row in reader:
        
            # for each row, let's make a new dictionary
            aggregate = {}
        
            # for each value in the row, let's associate it to it's (letter, number) column data tuple
            for value, ranks in zip(row[1:], dataranks):
        
                # for each letter, let's use a nested dict for the number values
                if ranks[0] not in aggregate:
                    aggregate[ranks[0]] = {}
    
                # just need to fill the dict now
                aggregate[ranks[0]][ranks[1]] = value
    
            # and add it to our list.
            joined_data.append(aggregate)
        
        print(joined_data)
    

    The content of joined_data would be:

    [{'a': {'1': 'true', '2': 'false', '3': 'true', '4': 'false'},
      'b': {'1': 'true', '2': 'false', '3': 'true', '4': 'false'}},
     {'a': {'1': 'false', '2': 'true', '3': 'false', '4': 'true'},
      'b': {'1': 'false', '2': 'true', '3': 'false', '4': 'true'}}]