I have thousands of txt files like the following (the values are made up):
Date : [ 2010-01-01 XX:XX:XX ] Age : [ 22 ] Sex : [ M ] : [ XXX ]
Height(cm) : [ 145 ] Weight(kg) : [ 56.4 ] Race : [ Hispanic ]
Spirometry : [ restrictive pattern ]
Treatment response : [ Negative ]
Tissue volume : [ Normal ]
Tissue volume
[ Normal RV ]
Diffusing capacity : [ Normal capacity ]
FVC Liters : [ 2.22 ] FVC Liters : [ 67 ] FVC Liters : [ 3.35 ]
FEV1 Liters : [ 1.96 ] FEV1 Liters : [ 66 ] FEV1 Liters : [ 2.06 ]
FEV1 / FVC % : [ 58 ] FEV1 / FVC % : [ 62 ]
DLCO mL/mmHg/min : [ 21.5 ] DLCO mL/mmHg/min : [ 102 ]
DLCO Adj mL/mmHg/min : [ 21.5 ] DLCO Adj mL/mmHg/min : [ 102 ]
RV/TLC % : [ 22 ]
I want to extract variable names and their corresponding values in a csv format. Luckily, as you have noticed, all txt files have similar format as such:
variable : [ value ]
My first problem is how to write a code that extract data with the above structure.
My second problem is that I do now know how to separate when there are multiple sets of "variable : [value]" in a single line.(They are not comma separated!).
I've only managed to come up with the following code...but I am going in circles now. Any ideas?
df = pd.read_csv(filename, sep='\n')
df = df[0].str.split(':', expand=True)
Thanks in advance
Looks like you need regular expressions. Let's try this
First, load sample data
text = \
"""Date : [ 2010-01-01 XX:XX:XX ] Age : [ 22 ] Sex : [ M ] : [ XXX ]
Height(cm) : [ 145 ] Weight(kg) : [ 56.4 ] Race : [ Hispanic ]
Spirometry : [ restrictive pattern ]
Treatment response : [ Negative ]
Tissue volume : [ Normal ]
Tissue volume
[ Normal RV ]
Diffusing capacity : [ Normal capacity ]
FVC Liters : [ 2.22 ] FVC Liters : [ 67 ] FVC Liters : [ 3.35 ]
FEV1 Liters : [ 1.96 ] FEV1 Liters : [ 66 ] FEV1 Liters : [ 2.06 ]
FEV1 / FVC % : [ 58 ] FEV1 / FVC % : [ 62 ]
DLCO mL/mmHg/min : [ 21.5 ] DLCO mL/mmHg/min : [ 102 ]
DLCO Adj mL/mmHg/min : [ 21.5 ] DLCO Adj mL/mmHg/min : [ 102 ]
RV/TLC % : [ 22 ]
"""
Next, use a regex to find all matching 'blah : [ blahblah ] ' pairs, and stick into a dictionary (strip
ed from white space -- could have gone into a regex but to avoid over-complicating things)
import re
parsed = re.findall('(.*?)\:\s*?\[(.*?)\]',text)
res = {g[0].strip() : g[1].strip() for g in parsed}
res
The result:
{'Date': '2010-01-01 XX:XX:XX',
'Age': '22',
'Sex': 'M',
'': 'XXX',
'Height(cm)': '145',
'Weight(kg)': '56.4',
'Race': 'Hispanic',
'Spirometry': 'restrictive pattern',
'Treatment response': 'Negative',
'Tissue volume': 'Normal',
'Diffusing capacity': 'Normal capacity',
'FVC Liters': '3.35',
'FEV1 Liters': '2.06',
'FEV1 / FVC %': '62',
'DLCO mL/mmHg/min': '102',
'DLCO Adj mL/mmHg/min': '102',
'RV/TLC %': '22'}
You can stick it into a dataframe if you want:
df = pd.DataFrame.from_records([res])
df
to get
Date Age Sex Height(cm) Weight(kg) Race Spirometry Treatment response Tissue volume Diffusing capacity FVC Liters FEV1 Liters FEV1 / FVC % DLCO mL/mmHg/min DLCO Adj mL/mmHg/min RV/TLC %
-- ------------------- ----- ----- --- ------------ ------------ -------- ------------------- -------------------- --------------- -------------------- ------------ ------------- -------------- ------------------ ---------------------- ----------
0 2010-01-01 XX:XX:XX 22 M XXX 145 56.4 Hispanic restrictive pattern Negative Normal Normal capacity 3.35 2.06 62 102 102 22
Note that the sample you provided has this line towards the top Sex : [ M ] : [ XXX ]
which does not fit the pattern but the code deals with it by using an empty string '' as the key. I assume this is an issue with copy-paste rather than an issue in the original data, but if you have many of these you may have to handle them more carefully