Search code examples
pythonpython-3.xrecursionchemistry

Counting the number of elements in a molecular compound with Python (recursion if possible)?


So I'm trying to code something to tell me the number of elements in any given compound. I'm not even sure where to start: I tried coding something but then realized that it only worked for simple compounds (or did not work at all). Here's an example of what I want:

>>> function_input : 'NaMg3Al6(BO3)3Si6O18(OH)4', 'O'

>>> function_return : 31

I've come so far in my mess of a code (IT DOESN'T WORK, it just illustrates my rough thought process):

def get_pos_from_count(string: str, letter: str):
    count = string.count(letter)
    lens = [-1]
    for i in range(count):
        lens += [string[lens[i] + 1:].index(letter) + lens[i] + 1]
    return lens[1:]



def find_number(string, letter):
    if string.count(letter) == 0: return 0
    numbers = '1234567890'
    try:
        mul1 = int(string[0])
    except ValueError:
        mul1 = 0
    mul2 = []
    sub_ = 1
    list_of_positions = get_pos_from_count(string, letter)
    for i in list_of_positions:
        try:
            sub_ += int(string[i + 1]) if string[i + 1] in numbers else 0
        except IndexError: pass
        if string[i + 1:].count(')') > string[i + 1].count('('):
            try:
                mul2 += int(string[string[i + 1:].count(')') + 1])
            except (IndexError, ValueError): pass
    return mul1 * sub_ * mul2

The approach I was trying to implement was:

  1. Find the number of occurrences of said element in compound.
  2. Find each subscript, multiplying by subscript outside bracket if said element is in bracket.
  3. Sum up all subscripts, multiply by number of compounds (first character in string)
  4. Return said number to user

But then I realized my code would either be extremely long or require recursion, which I don't know how to apply here.

If possible, I'd like a semi-working function, but a quick tip on how to approach this is also helpful!

And I don't want to use external libraries if possible.

tl;dr: This question for atomicity of elements, without external libraries (if possible).

EDIT: Yes, the question I linked does have hints on how to do this, but when I tried to make any code work for only one element, and set it's weight to 1, I ran into a host of issues I don't know to solve.


Solution

  • Let us divide the task into three parts:

    • Tokenize the string into a list of elements, numbers, and brackets;
    • Parse the bracket to have a nested list with sublists;
    • Count the elements in the nested list.

    Introducing my tools:

    from more_itertools import split_when, pairwise
    from itertools import chain
    from collections import Counter
    
    def nest_brackets(tokens, i = 0):
        l = []
        while i < len(tokens):
            if tokens[i] == ')':
                return i,l
            elif tokens[i] == '(':
                i,subl = nest_brackets(tokens, i+1)
                l.append(subl)
            else:
                l.append(tokens[i])
            i += 1
        return i,l
    
    def parse_compound(s):
        tokens = [''.join(t) for t in split_when(s, lambda a,b: b.isupper() or b in '()' or (b.isdigit() and not a.isdigit()))]
        tokens = [(int(t) if t.isdigit() else t) for t in tokens]
        i, l = nest_brackets(tokens)
        assert(i == len(tokens)) # crash if unmatched ')'
        return l
    
    def count_elems(parsed_compound):
        c = Counter()
        for a,b in pairwise(chain(parsed_compound, (1,))):
            if not isinstance(a, int):
                subcounter = count_elems(a) if isinstance(a, list) else {a: 1}
                n = b if isinstance(b, int) else 1
                for elem,k in subcounter.items():
                    c[elem] += k * n
        return c
    
    s = 'NaMg3Al6(B(CO2)3)3Si6O18(OH)4'
    
    l = parse_compound(s)
    print(l)
    # ['Na', 'Mg', 3, 'Al', 6, ['B', ['C', 'O', 2], 3], 3, 'Si', 6, 'O', 18, ['O', 'H'], 4]
    
    
    c = count_elems(l)
    print(c)
    # Counter({'O': 40, 'C': 9, 'Al': 6, 'Si': 6, 'Mg': 3, 'B': 3, 'Na': 1})
    
    print(c['O'])
    # 40