Search code examples
pythonregex

RegEx to differentiate chemical compound names from pattern


I have a RegEx problem with a string pattern that i used to extract data from them by breaking it. I am using Python. The string lines are, simplified, something like this (there are some lines in between but these are the problematic ones):


Compound Name,RT,Score\n2,4,6-Cycloheptatrien-1-one, 3,5-bis-trimethylsilyl-,8.705,626,Area,386902.453,tR (min),8.705

['Compound Name', 'RT', 'Score', '2', '4', '6-Cycloheptatrien-1-one', ' 3', '5-bis-trimethylsilyl-', '8.705', '626', 'Area', '386902.453', 'tR (min)', '8.705', '']

Compound Name,RT,Score\nBenzene, 4-ethenyl-1,2-dimethyl-,11.682,600,Area,16289.239,tR (min),11.682

['Compound Name', 'RT', 'Score', 'Benzene', ' 4-ethenyl-1', '2-dimethyl-', '11.682', '600', 'Area', '16289.239', 'tR (min)', '11.682', '']

Compound Name,RT,Score\n2-(p-Tolyl)propan-2-ol, TMS,11.921,605,Area,17976.558,tR (min),11.921

['Compound Name', 'RT', 'Score', '2-(p-Tolyl)propan-2-ol', ' TMS', '11.921', '605', 'Area', '17976.558', 'tR (min)', '11.921', '']

Compound Name,RT,Score\nTris(tert-butyldimethylsilyloxy)arsane,12.036,685,Area,76702.743,tR (min),12.036

['Compound Name', 'RT', 'Score', 'Tris(tert-butyldimethylsilyloxy)arsane', '12.036', '685', 'Area', '76702.743', 'tR (min)', '12.036', '']

Compound Name,RT,Score\nPhenyl-pentamethyl-disiloxane,13.646,656,Area,84633.281,tR (min),13.646

['Compound Name', 'RT', 'Score', 'Phenyl-pentamethyl-disiloxane', '13.646', '656', 'Area', '84633.281', 'tR (min)', '13.646', '']

Compound Name,RT,Score\n**[1,1'-Bicyclopropyl]-2-octanoic acid, 2'-hexyl-, methyl est**,16.642,605,Area,566.512,tR (min),16.642

['Compound Name', 'RT', 'Score', '[1', "1'-Bicyclopropyl]-2-octanoic acid", " 2'-hexyl-", ' methyl est', '16.642', '605', 'Area', '566.512', 'tR (min)', '16.642', '']

Compound Name,RT,Score\n**(t-Butyl-dimethylsilyl)[2-methyl-2-(4-methyl-pent-3-enyl)-c**,19.853,650,Area,10141.071,tR (min),19.853


The bold area are the chemical compound names which change the most and are the main cause of problems, and straightaway my Python result. These chemical names should be treated as a whole string which may happen in some cases. Everything else between commas should be separated.

The point is first substracting those chemical compounds as characters on their own. I had a first pattern that used to work with simpler compound names:

pattern = r',(?=(?:[^"]*"[^"]*")*[^"]*$)|\n(?=(?:[^"]*"[^"]*")*[^"]*$)'

which put all compound names between single quotes into double quotes to treat them as a single character string. However now it will split if , is found and as you can see in the examples the compounds aren't between single quotes.

I believe it is way easier to tackle from behind: get the ,8.705,626,Area,386902.453,tR (min),8.705 portion first and treat everything before as a single string. Here Area and tR(min) won't change, however their numbers accompanying will.

Any idea how to proceed to separate both sides of the line, the compound and the rest? Shall i use RegEx ^ & $ special characters?


CODE:

import os
import io # for utf8
import sys
import re
from pprint import pprint
import xlsxwriter # for the data, to write them into xlsx.
import tkinter as tk # for GUI
from tkinter import filedialog # for GUI
from collections import Counter # for another thing

file_path=filedialog.askopenfilename()

file=io.open(file_path,'r',encoding="utf-8") 
filecsvname=os.path.basename(file.name)
lines=file.readlines()
file.close()

filename=filecsvname.strip('.csv')

liness=[]
seqnames=[]
for i in range(len(lines)-1):
    i
    lines[i]
    if lines[i].startswith('Compound')==True and lines[i+1].startswith('A')==False:
        liness.append(lines[i]+lines[i+1])
    elif lines[i].startswith('Area')==True:
        liness.append(lines[i]+',Compound Name, ')
    elif lines[i].startswith('Sample Name')==True:
        liness.append(lines[i])
    elif lines[i].startswith('Sequence Name')==True:
        seqnames.append(lines[i])
        liness.append(lines[i])    
# some irrelevant extra code for seqnames.
pattern = r',(?=(?:[^"]*"[^"]*")*[^"]*$)|\n(?=(?:[^"]*"[^"]*")*[^"]*$)'
datalines=[]
for i in range(len(liness)):
    lineofdata=liness[i].rstrip('\t')
    splittedlines=re.split(pattern,lineofdata)
    datalines.append(splittedlines)

Minimal Example Extract of the input *.csv file. Link to file: https://drive.google.com/file/d/1rXanJdnZP-ey0D8dU7XQ6TrHLZK6HLo2/view?usp=sharing

The main idea here is to make it prettier for Excel automatically via the script.

Sequence Name,20240603_S24031_anonymous
Sample Name,Blank
Area,15906292.428,tR (min),3.696
Area,10164232.248,tR (min),3.783
Area,5541647.716,tR (min),4.091
Area,203969.547,tR (min),6.514
Area,542692.442,tR (min),6.579
Area,1039445.806,tR (min),6.788
Area,4011338.408,tR (min),7.249
Area,504687.529,tR (min),8.177
Area,108363.994,tR (min),8.420
Compound Name,RT,Score
2,4,6-Cycloheptatrien-1-one, 3,5-bis-trimethylsilyl-,8.705,626,Area,386902.453,tR (min),8.705
Area,368393.389,tR (min),8.850
Area,91574.405,tR (min),9.447
Area,899294.259,tR (min),9.651
Area,193389.843,tR (min),10.041
Area,56853.148,tR (min),10.812
Compound Name,RT,Score
Benzene, 4-ethenyl-1,2-dimethyl-,11.682,600,Area,16289.239,tR (min),11.682
Compound Name,RT,Score
2-(p-Tolyl)propan-2-ol, TMS,11.921,605,Area,17976.558,tR (min),11.921
Area,78966.216,tR (min),12.031
Compound Name,RT,Score
Tris(tert-butyldimethylsilyloxy)arsane,12.036,685,Area,76702.743,tR (min),12.036
Area,534.128,tR (min),12.285
Compound Name,RT,Score
Phenyl-pentamethyl-disiloxane,13.646,656,Area,84633.281,tR (min),13.646
Area,11464.379,tR (min),13.696
Compound Name,RT,Score
Cyclopentene, 3,3-dimethyl-4-methylene-1,2-bis(trimethylsil,14.543,645,Area,402308.085,tR (min),14.543

Solution

  • This is a common example of bad exported CSV, if you don't have any mean to replace the column separator then you will indeed post process the files.

    Anyway it is sufficiently structured to recover the information without the need for regular expression.

    tokens = [
        "Compound Name,RT,Score\n2,4,6-Cycloheptatrien-1-one, 3,5-bis-trimethylsilyl-,8.705,626,Area,386902.453,tR (min),8.705",
        "Compound Name,RT,Score\nBenzene, 4-ethenyl-1,2-dimethyl-,11.682,600,Area,16289.239,tR (min),11.682",
        "Compound Name,RT,Score\n2-(p-Tolyl)propan-2-ol, TMS,11.921,605,Area,17976.558,tR (min),11.921",
        "Compound Name,RT,Score\nTris(tert-butyldimethylsilyloxy)arsane,12.036,685,Area,76702.743,tR (min),12.036",
        "Compound Name,RT,Score\nPhenyl-pentamethyl-disiloxane,13.646,656,Area,84633.281,tR (min),13.646",
        "Compound Name,RT,Score\n**[1,1'-Bicyclopropyl]-2-octanoic acid, 2'-hexyl-, methyl est**,16.642,605,Area,566.512,tR (min),16.642",
        "Compound Name,RT,Score\n**(t-Butyl-dimethylsilyl)[2-methyl-2-(4-methyl-pent-3-enyl)-c**,19.853,650,Area,10141.071,tR (min),19.853",
    ]
    
    def process(token):
        lines = token.split("\n")
        fields = lines[-1].split(",")
        return {
            "compound": ",".join(fields[:-6]),
            "area": fields[-3],
            "rt": fields[-1],
        }
    
    results = [process(token) for token in tokens]
    

    The result gives three columns for each chromatographic peak.

                                                compound        area      rt
    0  2,4,6-Cycloheptatrien-1-one, 3,5-bis-trimethyl...  386902.453   8.705
    1                   Benzene, 4-ethenyl-1,2-dimethyl-   16289.239  11.682
    2                        2-(p-Tolyl)propan-2-ol, TMS   17976.558  11.921
    3             Tris(tert-butyldimethylsilyloxy)arsane   76702.743  12.036
    4                      Phenyl-pentamethyl-disiloxane   84633.281  13.646
    5  **[1,1'-Bicyclopropyl]-2-octanoic acid, 2'-hex...     566.512  16.642
    6  **(t-Butyl-dimethylsilyl)[2-methyl-2-(4-methyl...   10141.071  19.853
    

    Update

    Based on the file sample you provided the following function should import the format:

    import pandas as pd
    
    def process(filename):
        with open(filename, "r") as file:
            raw = file.read()
        results = []
        for sequence_id, sequence in enumerate(raw.split("\n\n")):
            lines = sequence.split("\n")
            sequence_name = lines[0].split(",")[-1]
            for line in lines:
                fields = line.split(",")
                data = {}
                if "Area," in line:
                    data |= {"area": float(fields[-3]), "rt": float(fields[-1])}
                if ",Area," in line:
                    data |= {"compound": ",".join(fields[:-6])}
                if data:
                    results.append({
                        "sequence_id": sequence_id,
                        "sequence_name": sequence_name,
                    } | data)
        return pd.DataFrame(results)
    

    It returns:

          sequence_id                sequence_name          area      rt  \
    0               0    20240603_S24031_anonymous  1.590629e+07   3.696   
    1               0    20240603_S24031_anonymous  1.016423e+07   3.783   
    2               0    20240603_S24031_anonymous  5.541648e+06   4.091   
    3               0    20240603_S24031_anonymous  2.039695e+05   6.514   
    4               0    20240603_S24031_anonymous  5.426924e+05   6.579   
    ...           ...                          ...           ...     ...   
    1675           37  20240604_S24032_BelénFlórez  2.436482e+07  19.036   
    1676           37  20240604_S24032_BelénFlórez  5.640226e+06  21.963   
    1677           37  20240604_S24032_BelénFlórez  5.021575e+05  23.108   
    1678           37  20240604_S24032_BelénFlórez  1.940947e+06  23.592   
    1679           37  20240604_S24032_BelénFlórez  9.665506e+05  24.550   
    
                                      compound  
    0                                      NaN  
    1                                      NaN  
    2                                      NaN  
    3                                      NaN  
    4                                      NaN  
    ...                                    ...  
    1675                         Hexanoic acid  
    1676                         Octanoic acid  
    1677  2H-Pyran-2-one, tetrahydro-6-pentyl-  
    1678                       n-Decanoic acid  
    1679    Hexanediamide, N,N'-di-benzoyloxy-