I have an Excel sheet (Bloomberg Data License output) that I read in with
import pandas as pd
raw_data = pd.read_excel('my-file.xlsx')
There is one column (START-OF-FILE) and a varying number of rows, depending on the amount of data returned.
I am interested in the data between two rows, specifically START-OF-DATA and END-OF-DATA. The rows in the column look like
19 START-OF-DATA
20 WTS Equity|0|6|WTS|50545|54.440000|54.000000|5...
21 XOM Equity|0|6|XOM|6555175|84.950000|85.300000...
22 SUP Equity|0|6|SUP|27405|19.250000|19.200000|1...
23 END-OF-DATA
with a varying number of rows (not always 20 to 22). How can I filter the rows in the column to only the data between those two cells, i.e. something like raw_data['START-OF-FILE']['START-OF-DATA' : 'END-OF-DATA'], and then use str.split('|') to separate the pipe-delimited data into separate columns for a new dataframe?
>>> import pandas as pd
>>> df = pd.DataFrame(['abcdef', 'START-OF-DATA', 'g|h|i', 'j|k|l', 'm|n|o', 'END-OF-DATA', 'pqrstu'], columns=['A'])
>>> df
A
0 abcdef
1 START-OF-DATA
2 g|h|i
3 j|k|l
4 m|n|o
5 END-OF-DATA
6 pqrstu
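>>> # Non-capturing group avoids the pandas warning about match groups; assumes each marker appears exactly once
>>> start, end = df[df['A'].str.contains('(?:START|END)-OF-DATA')].index.tolist()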
>>> pd.DataFrame(df[start+1:end]['A'].str.split('|').tolist(), columns=['A', 'B', 'C'])
A B C
0 g h i
1 j k l
2 m n o
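The same steps can be wrapped in a small helper so they apply directly to the one-column frame from the question. This is only a sketch: extract_data_block is a hypothetical name, it assumes each marker appears exactly once as the whole cell value, and it uses str.split with expand=True so the number of output columns does not have to be known in advance.

```python
import pandas as pd


def extract_data_block(frame, column,
                       start_marker="START-OF-DATA",
                       end_marker="END-OF-DATA"):
    """Split the pipe-delimited rows found between the two marker rows.

    Hypothetical helper: assumes each marker occurs exactly once in `column`
    and that the start marker comes before the end marker.
    """
    # Work on a clean positional index so labels and positions agree.
    values = frame[column].astype(str).str.strip().reset_index(drop=True)

    # Exact comparison instead of a regex; raises IndexError if a marker is missing.
    start = values[values == start_marker].index[0]
    end = values[values == end_marker].index[0]

    # Rows strictly between the markers.
    block = values.iloc[start + 1:end]

    # expand=True returns one column per field, labelled 0, 1, 2, ...
    return block.str.split("|", expand=True).reset_index(drop=True)


# Made-up sample rows shaped like the question's data (not real Bloomberg output).
sample = pd.DataFrame({"START-OF-FILE": [
    "some header line",
    "START-OF-DATA",
    "WTS Equity|0|6|WTS",
    "XOM Equity|0|6|XOM",
    "END-OF-DATA",
    "some footer line",
]})
result = extract_data_block(sample, "START-OF-FILE")
print(result)
```

With the real file this would be called as extract_data_block(raw_data, 'START-OF-FILE'). The resulting columns hold strings and are numbered rather than named, so they would still need renaming and type conversion. If the data lines end with a trailing pipe, the last column will be empty.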