
How to obtain a dictionary of dictionaries that stores aggregated values from a CSV file


I have a data file that contains the following:

 Part#1
         A 10 20 10 10 30 10 20 10 30 10 20
         B 10 10 20 10 10 30 10 30 10 20 30
  Part#2
         A 30 30 30 10 10 20 20 20 10 10 10
         B 10 10 20 10 10 30 10 30 10 30 10
  Part#3
         A 10 20 10 30 10 20 10 20 10 20 10
         B 10 10 20 20 20 30 10 10 20 20 30

From there I'm looking to have a dictionary of dictionaries with summarized data per letter, so it will be something like this:

dictionary = {Part#1: {A: {10:6, 20:3, 30:2},
                       B: {10:6, 20:2, 30:3}},
              Part#2: {A: {10:5, 20:3, 30:3},
                       B: {10:7, 20:1, 30:3}},
              Part#3: {A: {10:6, 20:4, 30:1},
                       B: {10:4, 20:5, 30:2}}}

That way, if I want to display each part, it will give me an output like this:

dictionary[Part#1]

A
 10: 6
 20: 3
 30: 2

B
 10: 6
 20: 2
 30: 3

… and so on for the next couple of partitions in the file.
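
Roughly, the display step itself would just be two nested loops over that structure; a quick sketch, assuming the nested dictionary above exists with string keys:

def show_part(dictionary, part):
    """Print one partition in the layout shown above (sketch only;
    assumes string keys like 'Part#1' and 'A')."""
    for letter, counts in dictionary[part].items():
        print(letter)
        for value, count in counts.items():
            print(f" {value}: {count}")

# e.g. show_part(dictionary, "Part#1")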

At the moment I've been able to parse the file from txt to CSV and convert that into a dictionary, let's say the outer dictionary. I've been testing a couple of ways to see the output I get, and so far this piece of code is the one closest (but not entirely) to the structure I'm looking for, which I already described above.

partitions_dict = df.head(5).to_dict(orient='list')

print(partitions_dict)

Output:

{0: ['A', 'B', 'A', 'B', 'A'], 1: ['10', '10', '10', '10', '10'], 2: [10, 10, 10, 10, 10], 3: [10, 10, 10, 10, 10], 4: [10, 10, 10, 10, 10], 5: [10, 10, 10, 10, 10], 6: [10, 10, 10, 10, 10], 7: [10, 10, 10, 10, 10]
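
For context, to_dict(orient='list') keys the result on the column labels, which is why the bare column positions end up as the outer keys here; a throwaway example:

import pandas as pd

# Toy frame with default integer column labels, mirroring my case
toy = pd.DataFrame([["A", 10, 20], ["B", 10, 30]])
print(toy.to_dict(orient="list"))
# {0: ['A', 'B'], 1: [10, 10], 2: [20, 30]}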

The functions I'm using to parse the file:

import csv
import os

def fileFormatConverter(txt_file):
    """ Receives a generated text file of partitions as a parameter
        and converts it into csv format.
        input: text file
        return: csv file """

    filename, ext = os.path.splitext(txt_file)
    csv_file = filename + ".csv"
    # Use context managers so both files are closed properly
    with open(txt_file, "r") as in_file, open(csv_file, "w", newline="") as out_file:
        in_txt = csv.reader(in_file, delimiter=" ")
        out_csv = csv.writer(out_file)
        out_csv.writerows(in_txt)
    return csv_file

# removes "Part#0" as a header from the dataframe
df_traces = pd.read_csv(fileFormatConverter("sample.txt"), skiprows=1, header=None)   #, error_bad_lines=False)
df_traces.head()

Output:

    0   1   2   3   4   5   6   7   8   9   ...     15  16  17  18  19  20  21  22  23  24
0   A,  10,     20,     10,     10,     30,     10,     20,     10,     30,     ...     20,     10,     10,     30,     10,     30,     10,     20,     30.0    NaN
1   Part#2  NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     ...     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
2   A,  30,     30,     30,     10,     10,     20,     20,     20,     10,     ...     20,     10,     10,     30,     10,     30,     10,     30,     10.0    NaN
3   Part#3  NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     ...     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
4   A,  10,     20,     10,     30,     10,     20,     10,     20,     10,     ...     20,     20,     20,     30,     10,     10,     20,     20,     30.0    NaN

I used a function to change the headers so it would be easier to manipulate the letters inside of each partition:

def changeDFHeaders(df):
    df_transpose = df.T
    new_header = df_transpose.iloc[0]       # stores the first row for the header
    df_transpose = df_transpose[1:]         # take the data less the header row
    df_transpose.columns = new_header       # set the header row as the df header
    return df_transpose


# The counter column serves as an index for the entire dataframe
#df_transpose['counter'] = range(len(df_transpose))      # adds the counter for rows column
#df_transpose.set_index('counter', inplace=True)
df_transpose_headers = changeDFHeaders(df_traces)
# infer_objects() returns a new frame, so assign the result back before displaying
df_transpose_headers = df_transpose_headers.infer_objects()
df_transpose_headers

Output:

    A,  Part#2  A,  Part#3  A,
1   10,     NaN     30,     NaN     10,
2   20,     NaN     30,     NaN     20,
3   10,     NaN     30,     NaN     10,
4   10,     NaN     10,     NaN     30,
5   30,     NaN     10,     NaN     10,
6   10,     NaN     20,     NaN     20,
7   20,     NaN     20,     NaN     10,
8   10,     NaN     20,     NaN     20,
9   30,     NaN     10,     NaN     10,
10  10,     NaN     10,     NaN     20,
11  20,     NaN     10,     NaN     10,
12  B,  NaN     B,  NaN     B,
13  10,     NaN     10,     NaN     10,
14  10,     NaN     10,     NaN     10,
15  20,     NaN     20,     NaN     20,
16  10,     NaN     10,     NaN     20,
17  10,     NaN     10,     NaN     20,
18  30,     NaN     30,     NaN     30,
19  10,     NaN     10,     NaN     10,
20  30,     NaN     30,     NaN     10,
21  10,     NaN     10,     NaN     20,
22  20,     NaN     30,     NaN     20,
23  30  NaN     10  NaN     30
24  NaN     NaN     NaN     NaN     NaN

Still not quite right...
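
As a sanity check, the transpose-and-promote trick itself can be verified on a clean throwaway frame:

import pandas as pd

# Throwaway frame whose first column holds the intended headers,
# so the transpose puts them on the first row
toy = pd.DataFrame([["A", 10, 30], ["B", 20, 40]])
print(changeDFHeaders(toy))
# columns become 'A' and 'B', with [10, 20] and [30, 40] underneath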

and if you check this statement:

df = df_transpose_headers
partitions_dict = df.head(5).to_dict(orient='list')      

print(partitions_dict) 

Output:

{'A,': ['10,', '20,', '10,', '30,', '10,'], 'Part#2': [nan, nan, nan, nan, nan], 'Part#3': [nan, nan, nan, nan, nan]}
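
For reference, the kind of pandas route I'm circling would look roughly like this, assuming I could first reshape the data into a long frame with one row per value (the reshaping is exactly the part I haven't solved); the toy frame below is hypothetical:

import pandas as pd

# Hypothetical long-format frame: one row per observed value
long_df = pd.DataFrame({
    "part":   ["Part#1"] * 4,
    "letter": ["A", "A", "B", "B"],
    "value":  [10, 20, 10, 10],
})

# Count occurrences per (part, letter, value) and nest the result
counts = long_df.groupby(["part", "letter", "value"]).size()
nested = {}
for (part, letter, value), n in counts.items():
    nested.setdefault(part, {}).setdefault(letter, {})[value] = int(n)

print(nested)   # {'Part#1': {'A': {10: 1, 20: 1}, 'B': {10: 2}}}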

Solution

  • With an input file of:

      Part#1
             A 10 20 10 10 30 10 20 10 30 10 20
             B 10 10 20 10 10 30 10 30 10 20 30
      Part#2
             A 30 30 30 10 10 20 20 20 10 10 10
             B 10 10 20 10 10 30 10 30 10 30 10
      Part#3
             A 10 20 10 30 10 20 10 20 10 20 10
             B 10 10 20 20 20 30 10 10 20 20 30
    

    This should work:

    def parse_file(file_name):
        return_dict = dict()
        section = str()
        with open(file_name, "r") as source:
            for line in source.readlines():
                if not line.strip():                     # skip blank lines
                    continue
                if "#" in line:                          # a "Part#..." header line
                    section = line.strip()
                    return_dict[section] = dict()
                    continue
                tmp = line.strip().split()
                group = tmp.pop(0)                       # the letter (A, B, ...)
                return_dict[section][group] = dict()
                for item in tmp:
                    if item in return_dict[section][group]:
                        return_dict[section][group][item] += 1
                    else:
                        return_dict[section][group][item] = 1

        return return_dict
    

    When run on the file above, this outputs:

    {'Part#1': {'A': {'10': 6, '20': 3, '30': 2},
                'B': {'10': 6, '20': 2, '30': 3}},
     'Part#2': {'A': {'10': 5, '20': 3, '30': 3},
                'B': {'10': 7, '20': 1, '30': 3}},
     'Part#3': {'A': {'10': 6, '20': 4, '30': 1},
                'B': {'10': 4, '20': 5, '30': 2}}}
    

    I honestly don't understand why you'd want an intermediate stage. If you have to parse the file once to create a CSV anyway, you can just put the logic that builds your dict() into that same pass. So if I missed some subtlety in the question, I apologize.
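
    If it helps, the same single pass can be sketched a bit more compactly with collections.Counter and defaultdict; same assumptions about the multi-line format as above:

    from collections import Counter, defaultdict

    def parse_file_counter(file_name):
        """Sketch of the same parse using Counter/defaultdict.
        Assumes 'Part#...' header lines followed by letter-prefixed rows."""
        result = defaultdict(dict)
        section = None
        with open(file_name, "r") as source:
            for line in source:
                tokens = line.split()
                if not tokens:                    # skip blank lines
                    continue
                if "#" in tokens[0]:              # new section
                    section = tokens[0]
                else:                             # letter followed by its values
                    group, *values = tokens
                    result[section][group] = dict(Counter(values))
        return dict(result)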

    Edit: reformulated the answer based on comments that the input file is actually a single line.

    So with an input file of:

    Part#1 A 10 20 10 10 30 10 20 10 30 10 20 B 10 10 20 10 10 30 10 30 10 20 30 Part#2 A 30 30 30 10 10 20 20 20 10 10 10 B 10 10 20 10 10 30 10 30 10 30 10 Part#3 A 10 20 10 30 10 20 10 20 10 20 10 B 10 10 20 20 20 30 10 10 20 20 30
    

    The following modified code will work:

    import string
    from pprint import pprint

    def parse_file2(file_name):
        return_dict = dict()
        section = None
        group = None
        with open(file_name, "r") as source:
            for line in source.readlines():
                tmp_line = line.strip().split()
                for token in tmp_line:
                    if "#" in token:                          # a "Part#..." token starts a new section
                        section = token
                        return_dict[section] = dict()
                        continue
                    elif token in string.ascii_uppercase:     # a single uppercase letter starts a new group
                        group = token
                        return_dict[section][group] = dict()
                        continue
                    if section and group:
                        if token in return_dict[section][group]:
                            return_dict[section][group][token] += 1
                        else:
                            return_dict[section][group][token] = 1

        return return_dict

    if __name__ == "__main__":
        # Example file names; point these at the multi-line and single-line inputs shown above
        file_name = "parts_multiline.txt"
        file_name2 = "parts_oneline.txt"
        pprint(parse_file(file_name))
        pprint(parse_file2(file_name2))


    Please note that this function is specifically keyed to the file format you've described in the comments. If the file format is not as you say, it will probably blow up.

    Based on the problem as described, though, this should work.

    Additionally, if you can simplify the question above to just the actual file contents and the desired result (essentially "I have structure A and want to convert it to structure B"), I'll clean up all of the history in this post and give a simpler answer as well.

    Hope this helps! :)