
Parsing a .txt file into a data frame, filling columns based on multiple separators


I have a .txt file structured as below:

#n  1
a 1:0.0002 3:0.0003...
#n  2
b 2:0.0002 3:0.0003...
#n  3
a 1:0.0002 2:0.0003... 
...

I'm trying to parse it into a dataframe with the following structure:

#    type  1        2       3 
1    a     0.0002   null    0.0003 ....
2    b     null     0.0002  0.0003 ....
3    a     0.0002   0.0003  null   ....
...

describing the rule:

# i - 'i' is the row number
n:data - 'n' is the column number to fill, 'data' is the value to fill into the i-th row

If the number of columns were small enough it could be done manually, but the txt file in question has roughly 2000-3000 column values, and some of them are missing.
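For concreteness, every data line has the same shape — a type token followed by `n:value` pairs — so plain string splitting recovers everything; a minimal sketch of the tokenization, using a sample line from the file above:

```python
# A data line: a type token followed by "n:value" pairs
line = "a 1:0.0002 3:0.0003"
tokens = line.split()
row_type = tokens[0]                            # "a"
cells = dict(t.split(":") for t in tokens[1:])  # {"1": "0.0002", "3": "0.0003"}
```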

import pandas as pd
data = pd.read_csv("filename.txt", sep = "#", header = None)

gives a frame in which the even rows hold the `#n i` headers and the odd rows hold the data, so I split them:

data1 = data.iloc[1::2]
data2 = data.iloc[::2]

This separates the data rows (data1) from the header rows (data2); I hoped to then figure out how to split the data column and merge the two frames, but there might be a faster and more elegant way to do it, which is why I'm asking here.
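As a side note, the two interleaved frames can be realigned without loops: resetting both indices makes equal positions line up, after which a column-wise concat pairs each `#n i` header with its data line. A small sketch, using a mocked-up frame in place of the real read_csv result:

```python
import pandas as pd

# Mock of what read_csv(..., sep="#", header=None) yields for the sample file:
# header lines split into ['', 'n  i'], data lines land whole in column 0
data = pd.DataFrame({0: ["", "a 1:0.0002 3:0.0003", "", "b 2:0.0002 3:0.0003"],
                     1: ["n  1", None, "n  2", None]})

data1 = data.iloc[1::2].reset_index(drop=True)  # data rows
data2 = data.iloc[::2].reset_index(drop=True)   # '#n i' header rows
paired = pd.concat([data2[1], data1[0]], axis=1)
```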

Update: I spent 3 hours figuring out how to work with dataframes, as I was not that familiar with them.

Using

import pandas as pd
df = pd.read_csv("myfile.txt", sep = "#", header = None)
# even rows hold the '#n  i' headers: keep just the row number i
for index, col in df.iterrows():
    if index%2 == 0:
        col[1] = int(col[1].split('\t')[1])
# odd rows hold the data: split them into tokens
for index, col in df.iterrows():
    if index%2 == 1:
        col[0] = col[0].split(' ')
df[0] = df[0].shift(-1)  # move each token list up, next to its row number
df = df.iloc[::2]        # keep only the rows that now hold both
df = df[[1,0]]
df = df.rename(columns={0: 1, 1: 0})
df.index = range(len(df))

After this, each row holds the row number and the list of tokens (the type plus the `n:value` pairs).

Any suggestions on how to add an unknown number of phantom columns and fill them using "n:value", so that "value" goes into column "n"?
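For the record, pandas can create those phantom columns automatically: if each row is turned into a dict keyed by column number, passing the per-row dicts (as Series) to the DataFrame constructor aligns the labels across rows and fills the gaps with NaN. A minimal sketch, using made-up token lists of the shape the loop above produces:

```python
import pandas as pd

# hypothetical token lists, one per row, as produced by split(' ')
rows = [["a", "1:0.0002", "3:0.0003"],
        ["b", "2:0.0002", "3:0.0003"]]

def to_series(tokens):
    d = {"type": tokens[0]}
    d.update(t.split(":", 1) for t in tokens[1:])
    return pd.Series(d)

# the constructor aligns labels across rows; missing columns become NaN
wide = pd.DataFrame([to_series(r) for r in rows])
```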


Solution

  • I think you are better off parsing the file yourself rather than relying on read_csv and then dealing with the mess. Here is how I would do it. Since I do not have access to your real file, I am using the small example from your question. First, load the file.

    from io import StringIO
    import pandas as pd

    file = StringIO(
    """\
    #n  1
    a 1:0.0002 3:0.0003
    #n  2
    b 2:0.0002 3:0.0003
    #n  3
    a 1:0.0002 2:0.0003
    """)
    # You would just use file = open('myfile.txt', 'r') instead of the above
    

    Then we read all the lines, group them in pairs, parse each pair, and stick the results into a dict:

    # read all lines
    lines = file.readlines()
    
    # here we will store the results, dictionary of dictionaries
    parsing_res = {}
    
    # a fancy way of processing two lines, odd and even, at the same time
    for line1,line2 in zip(lines[::2],lines[1::2]):
        # line1 has the form '#n  1'; we split on whitespace and take the second token
        row_index = line1.split()[1]
        # line2 is the other type of lines, split into tokens by whitespace
        tokens = line2.split()
        # first one is 'type'
        t = tokens[0]
    
        # the others are pairs 'x:y', split them into x,y and stick into a dictionary with label x and value y
        row_dict = {token.split(':')[0]:token.split(':')[1] for token in tokens[1:]}
    
        # add type
        row_dict['type'] = t
       
        # store the result for these two lines into the main dictionary
        parsing_res[row_index] = row_dict
    parsing_res
    

    Now we have something that looks like this:

    {'1': {'1': '0.0002', '3': '0.0003', 'type': 'a'},
     '2': {'2': '0.0002', '3': '0.0003', 'type': 'b'},
     '3': {'1': '0.0002', '2': '0.0003', 'type': 'a'}}
    

    This dict can now be used directly to create a dataframe; we also sort the columns and the index, since they come out in somewhat random order:

    df = pd.DataFrame.from_dict(parsing_res, orient='index')
    df = df.reindex(sorted(df.columns), axis=1).reindex(sorted(df.index), axis=0)
    df
    

    output

        1       2       3       type
    1   0.0002  NaN     0.0003  a
    2   NaN     0.0002  0.0003  b
    3   0.0002  0.0003  NaN     a
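One caveat: because the tokens were kept as strings, every cell in the numeric columns is a `str`, not a `float`. If numeric dtypes matter, `pd.to_numeric` can convert everything except `type`; a sketch reusing the parsed dict from above:

```python
import pandas as pd

parsing_res = {"1": {"1": "0.0002", "3": "0.0003", "type": "a"},
               "2": {"2": "0.0002", "3": "0.0003", "type": "b"},
               "3": {"1": "0.0002", "2": "0.0003", "type": "a"}}
df = pd.DataFrame.from_dict(parsing_res, orient="index")

# convert every column except 'type'; NaN cells pass through unchanged
num_cols = df.columns.difference(["type"])
df[num_cols] = df[num_cols].apply(pd.to_numeric)
```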