Search code examples
pythonpython-3.xpandasdataframetext-files

Opening textfile in python


The structure of the data file looks like this:

  2.0  0    3    9.15400
      5.40189    0.77828    0.66432
      0.44219    0.00000
  2.0  0    1    9.15400
      0.00000
  2.0  0    6    9.15400
      7.38451    3.99120    2.23459    1.49781    0.77828    0.00000
  2.0  0    3    9.15400
      2.09559    0.77828    0.00000
  2.0  0    3    9.15400
      2.09559    0.77828    0.65828
      0.58990    0.00000

and so on

I want to create a data frame that should look like this:

9.15400    5.40189    0.77828    0.66432    0.44219    0.00000
9.15400    0.00000
9.15400    7.38451    3.99120    2.23459    1.49781    0.77828    0.000
9.15400    2.09559    0.77828    0.00000
9.15400    2.09559    0.77828    0.65828    0.58990    0.00000

Can someone please help me how I can get started with this?


Solution

  • Why doesn't the 0th row of your desired dataframe include 0.66432?

    It's unclear how the table is structured. If it is as unstructured as is shown in your question, try this:

    Input:

      2.0  0    3    9.15400
          5.40189    0.77828    0.66432
          0.44219    0.00000
      2.0  0    1    9.15400
          0.00000
      2.0  0    6    9.15400
          7.38451    3.99120    2.23459    1.49781    0.77828    0.00000
      2.0  0    3    9.15400
          2.09559    0.77828    0.00000
      2.0  0    3    9.15400
          2.09559    0.77828    0.65828
          0.58990    0.00000
    
    
    filename = r"C:\Users\Bobson Dugnutt\Desktop\table2.txt"
    
    # This returns a dataframe with a single column
    df = pd.read_table(filename, header=None)
    
    # Split the bad-boy at every double space
    df = df[0].str.split("  ", expand=True)
    
    new_rows = []
    values = []
    
    for row in df.itertuples():
        # row[2] is the column which is either 2.0 or blank ("")
        if row[2]:
            if values:
                new_rows.append(values)
                values = []
            
            # row[7] is the column with a value like 9.15400
            values.append(row[7])
        else:
            # Add all non-blank values starting from the the 4th column.
            # The 4th column is the first column meaningful values are 
            # found for these rows
            values.extend(value for value in row[4:] if value)
        
    new_rows.append(values)
    
    # fillna("") so make the NaN values blank
    new_df = pd.DataFrame(new_rows).fillna("")
    print(new_df)
    

    Output:

            0       1       2       3       4       5       6
    0 9.15400 5.40189 0.77828 0.66432 0.44219 0.00000        
    1 9.15400 0.00000                                        
    2 9.15400 7.38451 3.99120 2.23459 1.49781 0.77828 0.00000
    3 9.15400 2.09559 0.77828 0.00000                        
    4 9.15400 2.09559 0.77828 0.65828 0.58990 0.00000             
    

    Or if each column is fixed width, and is just not displayed correctly in your question, try this:

    Input:

      2.0  0    3    9.15400
                     5.40189    0.77828    0.66432
                     0.44219    0.00000
      2.0  0    1    9.15400
                     0.00000
      2.0  0    6    9.15400
                     7.38451    3.99120    2.23459    1.49781    0.77828    0.00000
      2.0  0    3    9.15400
                     2.09559    0.77828    0.00000
      2.0  0    3    9.15400
                     2.09559    0.77828    0.65828
                     0.58990    0.00000
    
    import pandas as pd
    
    # Make the floats the same width of your desired output
    pd.options.display.float_format = '{:.5f}'.format
    
    filename = r"C:\Users\Bobson Dugnutt\Desktop\table.txt"
    
    df = pd.read_fwf(filename, header=None).fillna("")
    
    new_rows = []
    values = []
    
    for row in df.itertuples():
        # row[1] is the column which is either 2.0 or blank ("")
        if row[1] != "":
            if values:
                new_rows.append(values)
                values = []
            
            # row[4] is the column with a value like 9.15400
            values.append(row[4])
        else:
            # Add all non-blank values starting from the same column as above
            values.extend(value for value in row[4:] if value != "")
    
    new_rows.append(values)
    
    # fillna("") so make the NaN values blank
    new_df = pd.DataFrame(new_rows).astype(float).fillna("")
    
    print(new_df)
    

    Output:

            0       1       2       3       4       5       6
    0 9.15400 5.40189 0.77828 0.66432 0.44219 0.00000        
    1 9.15400 0.00000                                        
    2 9.15400 7.38451 3.99120 2.23459 1.49781 0.77828 0.00000
    3 9.15400 2.09559 0.77828 0.00000                        
    4 9.15400 2.09559 0.77828 0.65828 0.58990 0.00000        
    

    In both cases all you have to do is iterate over each row, check if one of the first columns is not blank (i.e. is 2.0), and if it is, then get all the other meaningful values in the next rows until you come across another similar row. The specific index differs depending on how the table was originally parsed.