Search code examples
pythonpandastext

Pandas read text file slicing columns possibly with empty strings/values according to strings/values alignment with header


Imagine a text file that looks like this:

Places       Person  Number         Comments
   bar    anastasia      75        very lazy
  home        jimmy           nothing to say
 beach                    2                 

Consider the first line as the header containing the names of the columns I want for my pandas data frame. You can see that there are empty cells. And there is a column that has strings with spaces. There is a readable patter in this file, columns are separated by at least 2 spaces and the information of each column can be red from the end point of the column name to the end point of the previous column name basically. There is no ambiguity in this.


If I do

df = pd.read_csv('text_file.txt')

I will get a 3 x 1 data frame where the only column gets called "Places Person Number Comments". So it fails to understand the table format.


If I do

df = pd.read_csv('text_file.txt', delim_whitespace = True)

It will create the good number of columns but won't be able to understand the spaces in the values in Comments and will split the comments and send it to different cells, like so:

          Places   Person Number Comments
bar    anastasia       75   very     lazy
home       jimmy  nothing     to      say
beach          2      NaN    NaN      NaN

If I do

df = pd.read_csv('text_file.txt', sep = '\s{2,}', engine = 'python')

It will understand that only if there are two or more spaces it can be considered part of another column. So that's correct. But it won't be able to understand that there are empty cells and will wrongly displace cells from one column to another.

  Places     Person          Number   Comments
0    bar  anastasia              75  very lazy
1   home      jimmy  nothing to say       None
2  beach          2            None       None

At this point I don't know what to do. Is there an elegant way to do this in Pandas?


Solution

  • You can use pd.read_fwf() to read your file, which is a file of fixed-width formatted lines, into DataFrame.

    df = pd.read_fwf('text_file.txt')
    

    Demo

    I use StringIO as demo. You can use your actual file name as parameter to the function call.

    text = """
    Places       Person  Number         Comments
       bar    anastasia      75        very lazy
      home        jimmy           nothing to say
     beach                    2                 
    """
    
    from io import StringIO
    df = pd.read_fwf(StringIO(text))
    
    print(df)
    
      Places     Person  Number        Comments
    0    bar  anastasia    75.0       very lazy
    1   home      jimmy     NaN  nothing to say
    2  beach        NaN     2.0             NaN