Search code examples
pandastextextractseparator

Python Pandas: any ideas on text extraction?


I have thousands of txt files like the following (the values are made up):

Date :  [ 2010-01-01 XX:XX:XX ]  Age :  [ 22 ]  Sex :  [ M ]   :  [ XXX ]
Height(cm) :  [ 145 ]  Weight(kg) :  [ 56.4 ]  Race :  [ Hispanic ]
Spirometry :  [ restrictive pattern ]
Treatment response :  [ Negative ]
Tissue volume :  [ Normal ]
Tissue volume
[ Normal RV ] 
Diffusing capacity :  [ Normal capacity ]
FVC Liters : [ 2.22 ] FVC Liters :  [ 67 ] FVC Liters :  [ 3.35 ] 
FEV1 Liters :  [ 1.96 ] FEV1 Liters :  [ 66 ] FEV1 Liters :  [ 2.06 ] 
FEV1 / FVC % :  [ 58 ] FEV1 / FVC % :  [ 62 ]
DLCO mL/mmHg/min :  [ 21.5 ] DLCO mL/mmHg/min :  [ 102 ]
DLCO Adj mL/mmHg/min :  [ 21.5 ] DLCO Adj mL/mmHg/min :  [ 102 ]
RV/TLC % :  [ 22 ]

I want to extract variable names and their corresponding values in a csv format. Luckily, as you have noticed, all txt files have similar format as such:

variable : [ value ]   
  1. My first problem is how to write a code that extract data with the above structure.

  2. My second problem is that I do now know how to separate when there are multiple sets of "variable : [value]" in a single line.(They are not comma separated!).

I've only managed to come up with the following code...but I am going in circles now. Any ideas?

df = pd.read_csv(filename, sep='\n')
df = df[0].str.split(':', expand=True)

Thanks in advance


Solution

  • Looks like you need regular expressions. Let's try this

    First, load sample data

    text = \
    """Date :  [ 2010-01-01 XX:XX:XX ]  Age :  [ 22 ]  Sex :  [ M ]   :  [ XXX ]
    Height(cm) :  [ 145 ]  Weight(kg) :  [ 56.4 ]  Race :  [ Hispanic ]
    Spirometry :  [ restrictive pattern ]
    Treatment response :  [ Negative ]
    Tissue volume :  [ Normal ]
    Tissue volume
    [ Normal RV ] 
    Diffusing capacity :  [ Normal capacity ]
    FVC Liters : [ 2.22 ] FVC Liters :  [ 67 ] FVC Liters :  [ 3.35 ] 
    FEV1 Liters :  [ 1.96 ] FEV1 Liters :  [ 66 ] FEV1 Liters :  [ 2.06 ] 
    FEV1 / FVC % :  [ 58 ] FEV1 / FVC % :  [ 62 ]
    DLCO mL/mmHg/min :  [ 21.5 ] DLCO mL/mmHg/min :  [ 102 ]
    DLCO Adj mL/mmHg/min :  [ 21.5 ] DLCO Adj mL/mmHg/min :  [ 102 ]
    RV/TLC % :  [ 22 ]
    """
    

    Next, use a regex to find all matching 'blah : [ blahblah ] ' pairs, and stick into a dictionary (striped from white space -- could have gone into a regex but to avoid over-complicating things)

    import re
    parsed = re.findall('(.*?)\:\s*?\[(.*?)\]',text)
    res = {g[0].strip() : g[1].strip() for g in parsed}
    res
    

    The result:

    {'Date': '2010-01-01 XX:XX:XX',
     'Age': '22',
     'Sex': 'M',
     '': 'XXX',
     'Height(cm)': '145',
     'Weight(kg)': '56.4',
     'Race': 'Hispanic',
     'Spirometry': 'restrictive pattern',
     'Treatment response': 'Negative',
     'Tissue volume': 'Normal',
     'Diffusing capacity': 'Normal capacity',
     'FVC Liters': '3.35',
     'FEV1 Liters': '2.06',
     'FEV1 / FVC %': '62',
     'DLCO mL/mmHg/min': '102',
     'DLCO Adj mL/mmHg/min': '102',
     'RV/TLC %': '22'}
    

    You can stick it into a dataframe if you want:

    df = pd.DataFrame.from_records([res])
    df
    

    to get

        Date                   Age  Sex           Height(cm)    Weight(kg)  Race      Spirometry           Treatment response    Tissue volume    Diffusing capacity      FVC Liters    FEV1 Liters    FEV1 / FVC %    DLCO mL/mmHg/min    DLCO Adj mL/mmHg/min    RV/TLC %
    --  -------------------  -----  -----  ---  ------------  ------------  --------  -------------------  --------------------  ---------------  --------------------  ------------  -------------  --------------  ------------------  ----------------------  ----------
     0  2010-01-01 XX:XX:XX     22  M      XXX           145          56.4  Hispanic  restrictive pattern  Negative              Normal           Normal capacity               3.35           2.06              62                 102                     102          22
    

    Note that the sample you provided has this line towards the top Sex : [ M ] : [ XXX ] which does not fit the pattern but the code deals with it by using an empty string '' as the key. I assume this is an issue with copy-paste rather than an issue in the original data, but if you have many of these you may have to handle them more carefully