I have a .txt file with the structure below:
#n 1
a 1:0.0002 3:0.0003...
#n 2
b 2:0.0002 3:0.0003...
#n 3
a 1:0.0002 2:0.0003...
...
I am trying to parse it into a dataframe of the following structure:
# type 1 2 3
1 a 0.0002 null 0.0003 ....
2 b null 0.0002 0.0003 ....
3 a 0.0002 0.0003 null ....
...
The rule is:
# i - 'i' is the row number
n:data - 'n' is the number of the column to fill, 'data' is the value to put into the i'th row of that column
If the number of columns were small enough this could be done manually, but the file has roughly 2000-3000 column values per row and some of them are missing.
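Just to illustrate the format (this is not code from my actual attempt), a single token splits into a column label and a value like this:
# toy illustration of one "n:data" token from the format above
token = "1:0.0002"
col, value = token.split(":")
print(int(col), float(value))   # -> 1 0.0002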
import pandas as pd
data = pd.read_csv("filename.txt", sep = "#", header = None)
data1 = data.iloc[1::2]
data2 = data.iloc[::2]
I tried to remove the odd rows in data1 and the even rows in data2, hoping to then figure out how to split the remaining values and merge the two dataframes (see the sketch below), but there is probably a faster and cleaner way to do it, which is why I am asking here.
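For reference, here is a toy sketch of the split-and-merge I had in mind, using made-up inline data and assuming the header and data lines always alternate:
import pandas as pd

# toy sketch: pair each '#n i' line with the data line that follows it
data = pd.DataFrame({0: ["#n 1", "a 1:0.0002 3:0.0003", "#n 2", "b 2:0.0002 3:0.0003"]})
headers = data.iloc[::2].reset_index(drop=True)   # the '#n i' lines
values = data.iloc[1::2].reset_index(drop=True)   # the data lines
merged = pd.concat([headers, values], axis=1, ignore_index=True)
print(merged)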
Update: I spent 3 hours figuring out how to work with dataframes, as I was not that familiar with them. This is what I have now:
import pandas as pd

df = pd.read_csv("myfile.txt", sep="#", header=None)

# even rows hold the '#n i' headers: keep just the row number
for index, col in df.iterrows():
    if index % 2 == 0:
        col[1] = int(col[1].split('\t')[1])

# odd rows hold the data: split each one into tokens
for index, col in df.iterrows():
    if index % 2 == 1:
        col[0] = col[0].split(' ')

df[0] = df[0].shift(-1)               # move each data row up next to its header
df = df.iloc[::2]                     # keep only the (former) header rows
df = df[[1, 0]]                       # reorder the two columns
df = df.rename(columns={0: 1, 1: 0})
df.index = range(len(df))
Any suggestions on how to add an unknown number of phantom columns and fill them from the "n:value" pairs, so that column "n" gets filled with "value"?
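To illustrate what I mean by phantom columns, here is a toy sketch with made-up labels (not my real code): assigning a single cell with .loc creates the column if it does not exist yet, which is the kind of fill I am after.
import pandas as pd

# start with only the 'type' column and create data columns as needed
df = pd.DataFrame({"type": ["a", "b"]}, index=[1, 2])
df.loc[1, "1"] = 0.0002   # column "1" did not exist; it is created here
df.loc[2, "3"] = 0.0003   # same for column "3"
print(df)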
I think you are better off parsing the file yourself rather than relying on read_csv and then dealing with the mess afterwards. Here is how I would do it. Since I do not have access to your real file, I am using the small example from your question.
First, load the file.
from io import StringIO
file = StringIO(
"""\
#n 1
a 1:0.0002 3:0.0003
#n 2
b 2:0.0002 3:0.0003
#n 3
a 1:0.0002 2:0.0003
""")
# You would just use file = open('myfile.txt', 'r') instead of the above
Then we read all the lines, group them in pairs, parse each pair, and stick the results into a dict:
# read all lines
lines = file.readlines()

# here we will store the results, a dictionary of dictionaries
parsing_res = {}

# a fancy way of processing two lines, odd and even, at the same time
for line1, line2 in zip(lines[::2], lines[1::2]):
    # line1 has the form '#n 1'; split on whitespace and take the second token
    row_index = line1.split()[1]
    # line2 is the other type of line; split it into tokens by whitespace
    tokens = line2.split()
    # the first token is 'type'
    t = tokens[0]
    # the others are 'x:y' pairs; split them into x, y and stick them into a dictionary with label x and value y
    row_dict = {token.split(':')[0]: token.split(':')[1] for token in tokens[1:]}
    # add the type
    row_dict['type'] = t
    # store the result for these two lines in the main dictionary
    parsing_res[row_index] = row_dict

parsing_res
Now we have something that looks like this:
{'1': {'1': '0.0002', '3': '0.0003', 'type': 'a'},
'2': {'2': '0.0002', '3': '0.0003', 'type': 'b'},
'3': {'1': '0.0002', '2': '0.0003', 'type': 'a'}}
This dict can now be used directly to create a dataframe, which we proceed to do; we also sort the columns and the index, since they come out in a somewhat random order:
df = pd.DataFrame.from_dict(parsing_res, orient='index')
df.reindex(sorted(df.columns), axis=1).reindex(sorted(df.index), axis=0)
output
1 2 3 type
1 0.0002 NaN 0.0003 a
2 NaN 0.0002 0.0003 b
3 0.0002 0.0003 NaN a
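Since the parsed values come out as strings, one extra step you may want (just a sketch continuing from the df built above, with num_cols as an illustrative name) is converting the data columns to floats:
# convert every column except 'type' to float; missing entries stay NaN
num_cols = [c for c in df.columns if c != 'type']
df[num_cols] = df[num_cols].astype(float)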