Search code examples
pythonpandas

Pandas DataFrame returning only 1 column after creating from a list


I'm using something similar to this as input.txt

 040525 $$$$$   9999         12345
 040525 $$$$$   8888         12345
 040525 $$$$$   7777         12345
 040525 $$$$$   6666         12345

Due to the way this input is being pre-processed, I cannot correctly use pd.read_csv. I must first create a list from the input; Then, create a DataFrame from the list.

data_list = []
with open('input.txt', 'r') as data:
    for line in data:
        data_list.append(line.strip())
df = pd.DataFrame(data_list)

This results in each row being considered 1 column

print(df.shape)
print(df)
print(df.columns.tolist())

(4, 1)
                                   0
0  040525 $$$$$   9999         12345
1  040525 $$$$$   8888         12345
2  040525 $$$$$   7777         12345
3  040525 $$$$$   6666         12345
[0]

How can I create 4 columns in this DataFrame? Desired output would be:

(4, 4)
       a      b     c      d
0  40525  $$$$$  9999  12345
1  40525  $$$$$  8888  12345
2  40525  $$$$$  7777  12345
3  40525  $$$$$  6666  12345
['a', 'b', 'c', 'd']

Solution

  • In your loop, you should split the strings into a list of substrings for the fields:

    for line in input_txt:
        data_list.append(line.strip().split())
    

    This will give you the correct number of columns.

    Alternatively, keep your loop as it is, but create a Series and str.split with expand=True. This might be less efficient, but could be more robust if you don't have a consistent number of fields:

    data_list = []
    with open('input.txt', 'r') as data:
        for line in data:
            data_list.append(line.strip())
    df = pd.Series(data_list).str.split(expand=True)
    

    Output:

            0      1     2      3
    0  040525  $$$$$  9999  12345
    1  040525  $$$$$  8888  12345
    2  040525  $$$$$  7777  12345
    3  040525  $$$$$  6666  12345
    

    For the first approach, if you want column names:

    df = pd.DataFrame(data_list, columns=['a', 'b', 'c', 'd'])
    

    Output:

            a      b     c      d
    0  040525  $$$$$  9999  12345
    1  040525  $$$$$  8888  12345
    2  040525  $$$$$  7777  12345
    3  040525  $$$$$  6666  12345