
Find alphabetical letters in a SMILES string that are not from a list of elements


OBJECTIVE

Filter out SMILES strings in which any alphabetical letter (atom), regardless of capitalization, does not come from the following list of elements: H, B, C, N, O, F, Al, Si, P, S, Cl. This list is truncated; in total, there are 38 allowed elements.

BACKGROUND

I have a database containing SMILES strings:

The simplified molecular-input line-entry system (SMILES) is a specification in the form of a line notation for describing the structure of chemical species using short ASCII strings.

(More info Wikipedia link)

SMILES example:

OC[C@H]1O[C@H]([C@H](O)[C@@H]1O)n1cnc2c(NC3CCCC3)ncnc12

The purpose of this was to get rid of rare elements and organometallics from the database.
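
To make the criterion concrete, here is a rough Python sketch of how the element symbols in a SMILES string could be collected and checked against the allowed list. It is only an illustration under stated assumptions: the regular expression is a simplified pattern, not a full SMILES parser, and the names `ALLOWED`, `ATOM`, `elements_in`, and `is_allowed` are hypothetical. The whitelist is the truncated list given above, not the full 38 elements.

```python
import re
from typing import Iterable, Set

# Truncated whitelist copied from the objective; the full list is not given.
ALLOWED = {"H", "B", "C", "N", "O", "F", "Al", "Si", "P", "S", "Cl"}

# Simplified pattern (an assumption, not a complete SMILES grammar):
#   group 1: element symbol at the start of a bracket atom, e.g. [C@H], [Li+]
#   group 2: unbracketed atom from the organic subset, incl. aromatic lowercase
ATOM = re.compile(
    r"\[\d*([A-Z][a-z]?|[a-z]{1,2})[^\]]*\]"
    r"|(Cl|Br|[BCNOPSFI]|[bcnops])"
)


def elements_in(smiles: str) -> Set[str]:
    """Return the element symbols found, normalised to capitalised form.

    Hydrogens written inside a bracket atom (the H in [C@H]) are not reported.
    """
    return {
        (m.group(1) or m.group(2)).capitalize()
        for m in ATOM.finditer(smiles)
    }


def is_allowed(smiles: str, allowed: Iterable[str] = ALLOWED) -> bool:
    """True if every element found in the string is in the allowed set."""
    return elements_in(smiles) <= set(allowed)


example = "OC[C@H]1O[C@H]([C@H](O)[C@@H]1O)n1cnc2c(NC3CCCC3)ncnc12"
print(elements_in(example))
print(is_allowed(example), is_allowed("O=[U]=O"))
```

Reading atoms as tokens, rather than as raw letters, avoids confusing an aromatic pair such as `Sc` (sulfur followed by aromatic carbon) with the element scandium.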


Solution

  • One general approach, which should work on most SQL dialects (the || concatenation operator may need adapting), is to create a table (written here as a CTE) containing a blacklist of disallowed atomic symbols. The SMILES strings to retain are those that do not match any blacklisted element symbol.

    WITH disallowed AS (
        SELECT 'He' AS symbol UNION ALL
        SELECT 'Li' UNION ALL
        SELECT 'Be' UNION ALL
        SELECT 'Ne' UNION ALL
        ...
        SELECT 'Lr'
    )
    
    SELECT t1.smile
    FROM yourTable t1
    WHERE NOT EXISTS (SELECT 1 FROM disallowed t2
                      WHERE t1.smile LIKE '%' || t2.symbol || '%');
    

    A compound such as UO2 would be filtered out by the NOT EXISTS condition above, because it contains the blacklisted element uranium.
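
The blacklist query can be tried on a small in-memory SQLite database from Python. This is a minimal sketch, not a definitive setup: the sample rows are made up for illustration, and the blacklist is shortened to five symbols (with `U` added by hand) because the full list is elided above. The table and column names follow the query in the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yourTable (smile TEXT)")

# Hypothetical sample rows, chosen only to exercise the filter.
conn.executemany(
    "INSERT INTO yourTable (smile) VALUES (?)",
    [
        ("OC[C@H]1O[C@H]([C@H](O)[C@@H]1O)n1cnc2c(NC3CCCC3)ncnc12",),
        ("CCO",),
        ("O=[U]=O",),       # uranium dioxide
        ("[Li+].[Cl-]",),   # lithium chloride
    ],
)

# Shortened blacklist (assumption); the real one would list every
# disallowed element symbol.
query = """
WITH disallowed AS (
    SELECT 'He' AS symbol UNION ALL
    SELECT 'Li' UNION ALL
    SELECT 'Be' UNION ALL
    SELECT 'Ne' UNION ALL
    SELECT 'U'
)
SELECT t1.smile
FROM yourTable t1
WHERE NOT EXISTS (SELECT 1 FROM disallowed t2
                  WHERE t1.smile LIKE '%' || t2.symbol || '%')
ORDER BY t1.rowid
"""

# Note: SQLite's LIKE is case-insensitive for ASCII letters by default.
kept = [row[0] for row in conn.execute(query)]
print(kept)
```

With these rows, only the first two strings are kept. One caveat with substring matching is that a longer blacklist can reject valid molecules: `Sc` (scandium) also matches `Sc1ccccc1`, where it is sulfur followed by an aromatic carbon, and under case-insensitive matching `Co` (cobalt) would match `CCO`. Whether that matters depends on the blacklist and the database's case sensitivity.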