I have a data file that contains the following:
Part#1
A 10 20 10 10 30 10 20 10 30 10 20
B 10 10 20 10 10 30 10 30 10 20 30
Part#2
A 30 30 30 10 10 20 20 20 10 10 10
B 10 10 20 10 10 30 10 30 10 30 10
Part#3
A 10 20 10 30 10 20 10 20 10 20 10
B 10 10 20 20 20 30 10 10 20 20 30
From there I'm looking to have dictionary of dictionaries with summarized data per letter, so it will be something like this:
dictionary = {{Part#1:{A:{10:6, 20:3, 30:2},
B:{10:6, 20:2, 30:3}}},
{Part#2:{A:{10:5, 20:3, 30:3},
B:{10:7, 20:1, 30:3}}},
{Part#3:{A:{10:6, 20:4, 30:1},
B:{10:4, 20:5, 30:2}}}}
that way if I want to display each part it will give me an output like this:
dictionary[Part#1]
A
10: 6
20: 3
30: 2
B
10: 6
20: 2
30: 3
… and so on for the next couple of partitions in the file.
At the moment I've been able to parse the file from txt to csv. and convert such into a dictionary let's say the outer dictionary. I've been testing a couple of ways to see the output I get, and so far this piece of code is the one closer (but not in its entirety) to the structure I'm looking for, which I already described above.
partitions_dict = df_head(5).to_dict(orient='list')
print(partitions_dict)
Output:
{0: ['A', 'B', 'A', 'B', 'A'], 1: ['10', '10', '10', '10', '10'], 2: [10, 10, 10, 10, 10], 3: [10, 10, 10, 10, 10], 4: [10, 10, 10, 10, 10], 5: [10, 10, 10, 10, 10], 6: [10, 10, 10, 10, 10], 7: [10, 10, 10, 10, 10]
The functions I'm using to parse the file:
def fileFormatConverter(txt_file):
""" Receives a generated text file of partitions as a parameter
and converts it into csv format.
input: text file
return: csv file """
filename, ext = os.path.splitext(txt_file)
csv_file = filename + ".csv"
in_txt = csv.reader(open(txt_file, "r"), delimiter = ' ')
out_csv = csv.writer(open(csv_file,'w'))
out_csv.writerows(in_txt)
return (csv_file)
# removes "Part#0" as a header from the dataframe
df_traces = pd.read_csv(fileFormatConverter("sample.txt"), skiprows=1, header=None) #, error_bad_lines=False)
df_traces.head()
output:
0 1 2 3 4 5 6 7 8 9 ... 15 16 17 18 19 20 21 22 23 24
0 A, 10, 20, 10, 10, 30, 10, 20, 10, 30, ... 20, 10, 10, 30, 10, 30, 10, 20, 30.0 NaN
1 Part#2 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 A, 30, 30, 30, 10, 10, 20, 20, 20, 10, ... 20, 10, 10, 30, 10, 30, 10, 30, 10.0 NaN
3 Part#3 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 A, 10, 20, 10, 30, 10, 20, 10, 20, 10, ... 20, 20, 20, 30, 10, 10, 20, 20, 30.0 NaN
I used a function to change the headers so it would be easier to manipulate the letters inside of each partition:
def changeDFHeaders(df):
df_transpose = df.T
new_header = df_transpose.iloc[0] # stores the first row for the header
df_transpose = df_transpose[1:] # take the data less the header row
df_transpose.columns = new_header # set the header row as the df header
return(df_transpose)
# The counter column serves as an index for the entire dataframe
#df_transpose['counter'] = range(len(df_transpose)) # adds the counter for rows column
#df_transpose.set_index('counter', inplace=True)
df_transpose_headers = changeDFHeaders(df_traces)
df_transpose_headers.infer_objects()
Output:
A, Part#2 A, Part#3 A,
1 10, NaN 30, NaN 10,
2 20, NaN 30, NaN 20,
3 10, NaN 30, NaN 10,
4 10, NaN 10, NaN 30,
5 30, NaN 10, NaN 10,
6 10, NaN 20, NaN 20,
7 20, NaN 20, NaN 10,
8 10, NaN 20, NaN 20,
9 30, NaN 10, NaN 10,
10 10, NaN 10, NaN 20,
11 20, NaN 10, NaN 10,
12 B, NaN B, NaN B,
13 10, NaN 10, NaN 10,
14 10, NaN 10, NaN 10,
15 20, NaN 20, NaN 20,
16 10, NaN 10, NaN 20,
17 10, NaN 10, NaN 20,
18 30, NaN 30, NaN 30,
19 10, NaN 10, NaN 10,
20 30, NaN 30, NaN 10,
21 10, NaN 10, NaN 20,
22 20, NaN 30, NaN 20,
23 30 NaN 10 NaN 30
24 NaN NaN NaN NaN NaN
--still not quite right...
and if you check this statement:
df = df_transpose_headers
partitions_dict = df.head(5).to_dict(orient='list')
print(partitions_dict)
output:
{'A,': ['10,', '20,', '10,', '30,', '10,'], 'Part#2': [nan, nan, nan, nan, nan], 'Part#3': [nan, nan, nan, nan, nan]}
With an input file of:
Part#1
A 10 20 10 10 30 10 20 10 30 10 20
B 10 10 20 10 10 30 10 30 10 20 30
Part#2
A 30 30 30 10 10 20 20 20 10 10 10
B 10 10 20 10 10 30 10 30 10 30 10
Part#3
A 10 20 10 30 10 20 10 20 10 20 10
B 10 10 20 20 20 30 10 10 20 20 30
This should work
def parse_file(file_name):
return_dict = dict()
section = str()
with open(file_name, "r") as source:
for line in source.readlines():
if "#" in line:
section = line.strip()
return_dict[section] = dict()
continue
tmp = line.strip().split()
group = tmp.pop(0)
return_dict[section][group] = dict()
for item in tmp:
if item in return_dict[section][group].keys():
return_dict[section][group][item] += 1
else:
return_dict[section][group][item] = 1
return return_dict
outputs
{'Part#1': {'A': {'10': 6, '20': 3, '30': 2},
'B': {'10': 6, '20': 2, '30': 3}},
'Part#2': {'A': {'10': 5, '20': 3, '30': 3},
'B': {'10': 7, '20': 1, '30': 3}},
'Part#3': {'A': {'10': 6, '20': 4, '30': 1},
'B': {'10': 4, '20': 5, '30': 2}}}
I honestly don't understand why you'd want an intermediate stage, it seems like if you have to parse the file once to create a CSV, you can just put your logic to create your dict() into that. So if I missed some subtlety in the question I apologize.
Edit: reformulated answer based on comments that the input file is actually a single line
so with an input file of
Part#1 A 10 20 10 10 30 10 20 10 30 10 20 B 10 10 20 10 10 30 10 30 10 20 30 Part#2 A 30 30 30 10 10 20 20 20 10 10 10 B 10 10 20 10 10 30 10 30 10 30 10 Part#3 A 10 20 10 30 10 20 10 20 10 20 10 B 10 10 20 20 20 30 10 10 20 20 30
The following modified code will work
import string
from pprint import pprint
def parse_file2(file_name):
return_dict = dict()
section = None
group = None
with open(file_name, "r") as source:
for line in source.readlines():
tmp_line = line.strip().split()
for token in tmp_line:
if "#" in token:
section = token
return_dict[section] = dict()
continue
elif token in string.ascii_uppercase:
group = token
return_dict[section][group] = dict()
continue
if section and group:
if token in return_dict[section][group].keys():
return_dict[section][group][token] += 1
else:
return_dict[section][group][token] = 1
return return_dict
if __name__ == "__main__":
pprint(parse_file(file_name))
pprint(parse_file2(file_name2))
Please note that this function is specifically keyed to the file format you've noted in the comments. If the file format is not as you say it will probably blow up.
Based on the problem though this should work.
Additionally if you can simplify the question post above to just illustrate the actual file contents, and desired results, or just put in I have structure A and want to convert it to structure B, I'll clean up all of the history in this post and have a more simple answer as well.
Hope this helps! :)