I'm using something similar to this as input.txt
040525 $$$$$ 9999 12345
040525 $$$$$ 8888 12345
040525 $$$$$ 7777 12345
040525 $$$$$ 6666 12345
Due to the way this input is being pre-processed, I cannot correctly use pd.read_csv. I must first create a list from the input; Then, create a DataFrame from the list.
data_list = []
with open('input.txt', 'r') as data:
for line in data:
data_list.append(line.strip())
df = pd.DataFrame(data_list)
This results in each row being considered 1 column
print(df.shape)
print(df)
print(df.columns.tolist())
(4, 1)
0
0 040525 $$$$$ 9999 12345
1 040525 $$$$$ 8888 12345
2 040525 $$$$$ 7777 12345
3 040525 $$$$$ 6666 12345
[0]
How can I create 4 columns in this DataFrame? Desired output would be:
(4, 4)
a b c d
0 40525 $$$$$ 9999 12345
1 40525 $$$$$ 8888 12345
2 40525 $$$$$ 7777 12345
3 40525 $$$$$ 6666 12345
['a', 'b', 'c', 'd']
In your loop, you should split the strings into a list of substrings for the fields:
for line in input_txt:
data_list.append(line.strip().split())
This will give you the correct number of columns.
Alternatively, keep your loop as it is, but create a Series
and str.split
with expand=True
. This might be less efficient, but could be more robust if you don't have a consistent number of fields:
data_list = []
with open('input.txt', 'r') as data:
for line in data:
data_list.append(line.strip())
df = pd.Series(data_list).str.split(expand=True)
Output:
0 1 2 3
0 040525 $$$$$ 9999 12345
1 040525 $$$$$ 8888 12345
2 040525 $$$$$ 7777 12345
3 040525 $$$$$ 6666 12345
For the first approach, if you want column names:
df = pd.DataFrame(data_list, columns=['a', 'b', 'c', 'd'])
Output:
a b c d
0 040525 $$$$$ 9999 12345
1 040525 $$$$$ 8888 12345
2 040525 $$$$$ 7777 12345
3 040525 $$$$$ 6666 12345