
Python Pandas filter rows based on the string value of an entry


I have an Excel sheet (Bloomberg Data License output) that I read in with:

import pandas as pd
raw_data = pd.read_excel('my-file.xlsx')

There is one column (START-OF-FILE) and a varying number of rows, depending on the amount of data returned.

I am interested in the data between two rows, specifically START-OF-DATA and END-OF-DATA. The rows in the column look like

19                                      START-OF-DATA
20  WTS Equity|0|6|WTS|50545|54.440000|54.000000|5...
21  XOM Equity|0|6|XOM|6555175|84.950000|85.300000...
22  SUP Equity|0|6|SUP|27405|19.250000|19.200000|1...
23                                        END-OF-DATA

with a varying number of rows (not always 20 to 22). How can I filter the rows in the column down to only the data between those two cells, i.e. something like raw_data['START-OF-FILE']['START-OF-DATA' : 'END-OF-DATA'], and then use str.split('|') to separate the pipe-delimited data into separate columns of a new dataframe?


Solution

  • >>> import pandas as pd
    >>> df = pd.DataFrame(['abcdef', 'START-OF-DATA', 'g|h|i', 'j|k|l', 'm|n|o', 'END-OF-DATA', 'pqrstu'], columns=['A'])
    >>> df
                   A
    0         abcdef
    1  START-OF-DATA
    2          g|h|i
    3          j|k|l
    4          m|n|o
    5    END-OF-DATA
    6         pqrstu
    >>> start, end = df[df['A'].str.contains('(?:START|END)-OF-DATA')].index.tolist()
    >>> pd.DataFrame(df[start+1:end]['A'].str.split('|').tolist(), columns=['A', 'B', 'C'])
       A  B  C
    0  g  h  i
    1  j  k  l
    2  m  n  o
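
Applied to the layout in the question, the same idea might look like the sketch below. It is only a sketch under a few assumptions: the helper name extract_data_block is made up for illustration, the marker cells are assumed to match START-OF-DATA and END-OF-DATA exactly once surrounding whitespace is stripped, and the sample rows are cut down to the fields visible in the question rather than taken from a real file. It uses str.split('|', expand=True), which returns the pieces as columns directly, so the column names do not have to be listed up front.

```python
import pandas as pd


def extract_data_block(raw, column,
                       start_marker="START-OF-DATA",
                       end_marker="END-OF-DATA"):
    """Split the pipe-delimited rows between the two marker rows into columns.

    Hypothetical helper for illustration; not part of pandas.
    """
    # Work on a clean 0..n-1 index so labels and positions coincide.
    values = raw[column].astype(str).str.strip().reset_index(drop=True)

    # Position of the first row equal to each marker.
    start = values[values == start_marker].index[0]
    end = values[values == end_marker].index[0]

    # Rows strictly between the markers, one output column per field.
    block = values.iloc[start + 1:end]
    return block.str.split("|", expand=True).reset_index(drop=True)


# Stand-in for pd.read_excel('my-file.xlsx'); rows are shortened to the
# fields that are visible in the question.
raw_data = pd.DataFrame({"START-OF-FILE": [
    "some header line",
    "START-OF-DATA",
    "WTS Equity|0|6|WTS|50545|54.440000|54.000000",
    "XOM Equity|0|6|XOM|6555175|84.950000|85.300000",
    "SUP Equity|0|6|SUP|27405|19.250000|19.200000",
    "END-OF-DATA",
    "some footer line",
]})

data = extract_data_block(raw_data, "START-OF-FILE")
print(data)
```

The resulting columns are labelled 0, 1, 2, and so on, and every value is still a string, so numeric fields would need converting (for example with pd.to_numeric) before use.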