OBJECTIVE
Filter out a SMILES string if any letter in it (that is, any atom symbol), compared case-insensitively, does not come from the following list of elements: H, B, C, N, O, F, Al, Si, P, S, Cl. This list is truncated; in total, there are 38 allowed elements.
BACKGROUND
I have a database containing SMILES strings:
The simplified molecular-input line-entry system (SMILES) is a specification in the form of a line notation for describing the structure of chemical species using short ASCII strings.
(More info Wikipedia link)
SMILES example:
OC[C@H]1O[C@H]([C@H](O)[C@@H]1O)n1cnc2c(NC3CCCC3)ncnc12
The purpose is to remove rare elements and organometallics from the database.
One general approach, which should work in most SQL dialects, is to create a table (or CTE) containing a blacklist of disallowed atomic symbols. The SMILES strings to retain are those that do not match any blacklisted element symbol.
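WITH disallowed AS (
    -- blacklist of element symbols that must not appear
    SELECT 'He' AS symbol UNION ALL
    SELECT 'Li' UNION ALL
    SELECT 'Be' UNION ALL
    SELECT 'Ne' UNION ALL
    ...
    SELECT 'Lr'
)
-- keep only rows whose SMILES contains no blacklisted symbol
-- (|| is standard string concatenation; MySQL needs CONCAT(), SQL Server uses +)
SELECT t1.smile
FROM yourTable t1
WHERE NOT EXISTS (SELECT 1
                  FROM disallowed t2
                  WHERE t1.smile LIKE '%' || t2.symbol || '%');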
A compound such as UO2 would be filtered out by the NOT EXISTS subquery above, because it contains uranium, which is a blacklisted element.
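To try the blacklist query on a few rows, here is a minimal Python sketch that runs it against an in-memory SQLite database. The table names `yourTable` and `disallowed` come from the query above. The helper name `retained_smiles`, the sample rows, and the shortened blacklist (including `'U'`) are made up for illustration. Note that SQLite's `LIKE` is case-insensitive for ASCII by default, which may differ from your database.

```python
import sqlite3


def retained_smiles(smiles, disallowed):
    """Return the SMILES strings that contain none of the disallowed symbols.

    Hypothetical helper: it rebuilds the two tables from the query above in an
    in-memory SQLite database and runs the same NOT EXISTS filter.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE yourTable (smile TEXT)")
    conn.execute("CREATE TABLE disallowed (symbol TEXT)")
    conn.executemany("INSERT INTO yourTable VALUES (?)", [(s,) for s in smiles])
    conn.executemany("INSERT INTO disallowed VALUES (?)", [(d,) for d in disallowed])
    rows = conn.execute(
        """
        SELECT t1.smile
        FROM yourTable t1
        WHERE NOT EXISTS (SELECT 1
                          FROM disallowed t2
                          WHERE t1.smile LIKE '%' || t2.symbol || '%')
        ORDER BY t1.rowid
        """
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]


if __name__ == "__main__":
    # Illustrative data only: the real blacklist is much longer.
    sample = [
        "OC[C@H]1O[C@H]([C@H](O)[C@@H]1O)n1cnc2c(NC3CCCC3)ncnc12",
        "O=[U]=O",
        "[Li+].[Cl-]",
    ]
    blacklist = ["He", "Li", "Be", "Ne", "U"]
    for smile in retained_smiles(sample, blacklist):
        print(smile)
```

With this sample data, only the first SMILES string is retained. One caveat of plain substring matching: a blacklisted symbol can also match two adjacent atoms, for example `Sc` inside `CSc1ccccc1` (sulfur followed by an aromatic carbon), so some legitimate organic molecules may be dropped as well.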