I have a regex problem that I have tried to keep as general as possible, although my code is written in MATLAB.
INFO:
LipidData is a 68×2 table that contains a name column and a Short column, whose values are strings like LPC, PC, AC4PIM2, SHexCer, SQDG and many more. LipidData is not going to change, whereas foundpattern may vary depending on the real input data it comes from.
foundpattern is an N×4 table, where N is 7 in my example. The only relevant column here is the first one, called ISDs, which contains the strings to check (for reproducibility you may copy only that column as a cell array). Here are both MATLAB tables:
INPUT:
>> LipidData
LipidData =
68×2 table
Lipid subclass name Short
___________________________________________________ ___________
{'Diacylated phosphatidylinositol monomannoside' } {'Ac2PIM1' }
{'Diacylated phosphatidylinositol dimannoside' } {'Ac2PIM2' }
{'Triacylated phosphatidylinositol dinomannoside' } {'Ac3PIM2' }
{'Tetraaacylated phosphatidylinositol dimannoside' } {'AC4PIM2' }
{'Anacardic Acid' } {'ACar' }
{'Acetylglucose andrographolide' } {'AcylGlcADG' }
{'Bis[monoacylglycero]phosphates' } {'BMP' }
{'Cholesteryl esters' } {'CE' }
{'Ceramide' } {'Cer' }
{'Ceramide alpha-hydroxy fatty acid-dihydrosphingosine' } {'CerADS' }
{'Ceramide alpha-hydroxy fatty acid-phytospingosine' } {'CerAP' }
{'Ceramide beta-hydroxy fatty acid-sphingosine' } {'CerAS' }
{'Ceramide beta-hydroxy fatty acid-dihydrosphingosine' } {'CerBDS' }
{'Ceramide beta-hydroxy fatty acid-sphingosine' } {'CerBS' }
{'Ceramide Esterified omega-hydroxy fatty acid-dihydrosphingosine'} {'CerEODS' }
{'Ceramide Esterified omega-hydroxy fatty acid-sphingosine' } {'CerEOS' }
{'Ceramide non-hydroxyfatty acid-dihydrosphingosine' } {'CerNDS' }
{'Ceramide non-hydroxyfatty acid-phytospingosine' } {'CerNP' }
{'Ceramide non-hydroxyfatty acid-sphingosine' } {'Cer_NS' }
{'Ceramide phosphate' } {'CerP' }
{'Cholesterol' } {'Cholesterol'}
{'Cardiolipins' } {'CL' }
{'Diacyl/alkylglycerides' } {'DG' }
{'Digalactosyldiacylglycerols' } {'DGDG' }
{'1,2-diacylglyceryl-3-O-4'-(N,N,N-trimethyl)-homoserine' } {'DGTS' }
{'Ether Oxygenated Phosphatidylcholines' } {'EtherOxPC' }
{'Ether Oxygenated Phosphatidylethanolamines' } {'EtherOxPE' }
{'Ether-linked Phosphatidylcoline' } {'EtherPC' }
{'Ether-linked Phosphatidylethanolamine' } {'EtherPE' }
{'Fatty Acids' } {'FA' }
{'Fatty acid ester of hydroxyl fatty acid' } {'FAHFA' }
{'Glucuronosyldiacylglycerol' } {'GlcADG' }
{'GM3 Ganglioside' } {'GM3' }
{'Hidroxy Bis[monoacylglycero]phosphates' } {'HBMP' }
{'Hexosylceramide alpha-hydroxy fatty acid-phytospingosine' } {'HexCerAP' }
{'Hexosylceramide non-hydroxyfatty acid-dihydrosphingosine' } {'HexCerNDS' }
{'Hexosylceramide non-hydroxyfatty acid-sphingosine' } {'HexCer_NS' }
{'Lyso 1,2-diacylglyceryl-3-O-4'-(N,N,N-trimethyl)-homoserine' } {'DGTS' }
{'Lyso Phosphatidic acids' } {'LPA' }
{'Lyso Phosphatidylcholines' } {'LPC' }
{'Lyso Phosphatidylethanolamines' } {'LPE' }
{'Lyso Phosphatidylglycerols' } {'LPG' }
{'Lyso Phosphatidylinositols' } {'LPI' }
{'Lyso Phosphatidylserines' } {'LPS' }
{'Monoacyl/alkylglycerides' } {'MG' }
{'Monogalactosyldiacylglycerols' } {'MGDG' }
{'Oxygenated Cardiolipins' } {'OxCL' }
{'Oxygenated Fatty Acids' } {'OxFA' }
{'Oxygenated Phosphatidic acids' } {'OxPA' }
{'Oxygenated Phosphatidylcholines' } {'OxPC' }
{'Oxygenated Phosphatidylethanolamines' } {'OxPE' }
{'Oxygenated Phosphatidylglycerols' } {'OxPG' }
{'Oxygenated Phosphatidylinositols' } {'OxPI' }
{'Oxygenated Phosphatidylserines' } {'OxPS' }
{'Oxygenated Triacyl/alkylglycerides' } {'OxTG' }
{'Phosphatidic acids' } {'PA' }
{'Phosphatidylbutyl alcohol' } {'PBtOH' }
{'Phosphatidylcholines' } {'PC' }
{'Phosphatidylethanolamines' } {'PE' }
{'Phosphatidyletanol' } {'PEtOH' }
{'Phosphatidylglycerols' } {'PG' }
{'Phosphatidylinositols' } {'PI' }
{'Phosphatidylmethanol' } {'PMeOH' }
{'Phosphatidylserines' } {'PS' }
{'Sulfatides hexosyl ceramide' } {'SHexCer' }
{'Sphingomyelines' } {'SM' }
{'Sulfoquinovosyl diacylglycerols' } {'SQDG' }
{'Triacyl/alkylglycerides' } {'TG' }
>> foundpattern
foundpattern =
7×4 table
ISDs tR Standard desv RSD
__________________________ ______ _____________ _______
{'18:1 (d7) MG' } 1.34 0.020418 1.5238
{'18:1(d7) LPC' } 1.5868 0.0056024 0.35305
{'18:1 (d9) SM' } 6.8999 0.08336 1.2081
{'15:0-18:1(d7) PC' } 7.989 0.072533 0.90791
{'15:0-18:1(d7) DG' } 12.085 0.097445 0.80631
{'15:0-18:1 (d7)-15:0 TG'} 17.487 0.029701 0.16984
{'Cholesterol (d7)' } 18.247 0.032275 0.17687
The problem arises when the expression built from the LipidData value PC is compared with a foundpattern value such as {'18:1(d7) LPC'}: it produces a match that I don't know how to avoid. I only need to find Short values that appear as exact, whole codes within foundpattern.ISDs. The same problem would hypothetically appear if foundpattern contained Cer_NS, which would match not only its LipidData value Cer_NS but also Cer.
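The over-matching can be reproduced in isolation. The following is a minimal sketch using only the two strings mentioned above; the variable names idxPC and idxCer are made up for illustration.

```matlab
% Unanchored patterns match anywhere inside the string.
idxPC  = regexp('18:1(d7) LPC', '(PC)', 'once');  % hits the 'PC' inside 'LPC'
idxCer = regexp('Cer_NS', '(Cer)', 'once');       % hits the 'Cer' prefix of 'Cer_NS'

% A non-empty index means regexp reported a match.
fprintf('PC matched: %d, Cer matched: %d\n', ~isempty(idxPC), ~isempty(idxCer));
```

Both calls return a non-empty start index, so wrapping the code in parentheses does not by itself restrict the match to whole codes.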
I believed that making each value a group (using parentheses in the regex, as you can see in the code) would be a solution, but the groups evidently still match 'slightly modified' codes, hence the repetition. I know I am missing something, but I don't know what.
Is there any way to avoid these repeated matches? As you can see in the OUTPUT, the Codes cell array should have only 7 entries instead of 8.
CODE:
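Codes = {};
for j = 1:size(LipidData,1)
    % Pattern built from the j-th Short code, wrapped in a capture group
    expression = strcat("(", char(LipidData{j,2}), ")");
    for i = 1:size(foundpattern,1)
        % regexp returns the start index of the match, or [] if there is none
        if ~isempty(regexp(char(foundpattern{i,1}), expression, 'once'))
            disp(foundpattern{i,1})
            disp(LipidData{j,2})
            Codes{end+1} = LipidData{j,2};
        end
    end
end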
OUTPUT:
>> Codes
Codes =
1×8 cell array
Columns 1 through 6
{1×1 cell} {1×1 cell} {1×1 cell} {1×1 cell} {1×1 cell} {1×1 cell}
Columns 7 through 8
{1×1 cell} {1×1 cell}
>> for i=1:size(Codes,2)
Codes{i}
end
ans =
1×1 cell array
{'Cholesterol'}
ans =
1×1 cell array
{'DG'}
ans =
1×1 cell array
{'LPC'}
ans =
1×1 cell array
{'MG'}
ans =
1×1 cell array
{'PC'}
ans =
1×1 cell array
{'PC'}
ans =
1×1 cell array
{'SM'}
ans =
1×1 cell array
{'TG'}
>>
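You need
expression=strcat('\<(', regexptranslate('escape', char(LipidData{j,2})),')\>')
The \< part matches the start of a word. regexptranslate('escape', char(LipidData{j,2})) escapes any special regex metacharacters in the Short value so that it is matched literally in the pattern. And \> matches the end of a word. Since letters, digits and underscores all count as word characters, PC no longer matches inside LPC, and Cer no longer matches inside Cer_NS.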
See this regex demo.
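To check the pattern outside the tables, here is a minimal self-contained sketch. It assumes a small subset of the Short codes and the ISDs column copied as plain cell arrays; the names shortCodes and isds are hypothetical stand-ins for LipidData{:,2} and foundpattern{:,1}.

```matlab
% Subset of the Short codes (assumption: enough to show the behaviour)
shortCodes = {'Cer', 'Cer_NS', 'Cholesterol', 'DG', 'LPC', 'MG', 'PC', 'SM', 'TG'};
% ISDs column of foundpattern, copied as a cell array
isds = {'18:1 (d7) MG'; '18:1(d7) LPC'; '18:1 (d9) SM'; '15:0-18:1(d7) PC'; ...
        '15:0-18:1(d7) DG'; '15:0-18:1 (d7)-15:0 TG'; 'Cholesterol (d7)'};

Codes = {};
for j = 1:numel(shortCodes)
    % Escape metacharacters, then require word boundaries on both sides
    expression = ['\<(', regexptranslate('escape', shortCodes{j}), ')\>'];
    for i = 1:numel(isds)
        if ~isempty(regexp(isds{i}, expression, 'once'))
            Codes{end+1} = shortCodes{j}; %#ok<AGROW>
        end
    end
end
disp(Codes)
```

With these inputs Codes ends up with 7 entries (Cholesterol, DG, LPC, MG, PC, SM, TG): PC is found only in '15:0-18:1(d7) PC', and neither Cer nor Cer_NS is reported because no ISD contains them as whole words.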