Tags: python, string, floating-point, grouping, categorical-data

How can I label a column of strings into numbered groups based on another column containing substrings?


The first column contains around 4920 different chemical compounds.

For example:

0              Ag(AuS)2      
1            Ag(W3Br7)2      
2      Ag0.5Ge1Pb1.75S4     
3     Ag0.5Ge1Pb1.75Se4     
4                Ag2BBr      
...                 ...      
4916             ZrTaN3     
4917               ZrTe      
4918             ZrTi2O      
4919             ZrTiF6      
4920               ZrW2  

The second column lists all the elements of the periodic table, ordered by atomic number:

0      H
1     He
2     Li
3     Be
4      B
..   ...
113   Fl
114  Uup
115   Lv
116  Uus
117  Uuo

How can I label each compound in the first column with the atomic number of its first element, looked up from the second column, so that the first column is returned as shown below?

The atomic number of Ag is 47 and the atomic number of Zr is 40, so the expected result is:

    0            47      
    1            47      
    2            47     
    3            47    
    4            47      
    ...                 ...      
    4916         40    
    4917         40       
    4918         40         
    4919         40        
    4920         40     

Solution

  • Since an element symbol can vary in length (one uppercase letter followed by optional lowercase letters), the simplest solution is to extract the leading symbol with a regular expression. For example:

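    import re
    
    compounds = ["Ag(AuS)2", "HTiF", "ZrTaN3"]
    
    for compound in compounds:
        # An element symbol is one uppercase letter followed by optional lowercase letters.
        match = re.match(r"[A-Z][a-z]*", compound)
        if match:
            first_element = match.group(0)
            print(first_element)  # Ag, then H, then Zr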
    

    This prints the first element of each compound. Note: if some compounds are more complex and you need to adjust the regex, I recommend using https://regex101.com/ as a playground.

    Once you have the first element, it needs to be connected with the matching element in the second column. The easiest way is to map that column to a dictionary from symbol to atomic number, which is the row index plus one (H sits at index 0 but has atomic number 1):

    {"H": 1, "He": 2, "Li": 3, ...}
    

    That lets you get the atomic number by calling dict_with_elements.get(first_element).

    From there on, the rest is just looping and writing data. I hope this helps.
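
The steps above can be put together in a minimal sketch. It uses plain lists rather than a DataFrame, and the names `elements`, `atomic_number`, and `first_element_number` are hypothetical, chosen only for illustration. The element list is truncated to the first rows shown in the question, so Zr and Ag are added by hand here; with the full second column, `enumerate(..., start=1)` alone should give every atomic number.

```python
import re

# A few compounds standing in for the first column.
compounds = ["Ag(AuS)2", "Ag0.5Ge1Pb1.75S4", "HTiF", "ZrTaN3", "ZrW2"]

# First rows of the second column (truncated for illustration).
elements = ["H", "He", "Li", "Be", "B"]

# The list is ordered by atomic number, so position + 1 is the atomic number.
atomic_number = {symbol: z for z, symbol in enumerate(elements, start=1)}

# Added manually only because the list above is truncated;
# the full column would already contain these entries.
atomic_number.update({"Zr": 40, "Ag": 47})

# One uppercase letter followed by optional lowercase letters.
FIRST_ELEMENT = re.compile(r"[A-Z][a-z]*")


def first_element_number(compound, lookup):
    """Return the atomic number of the compound's first element, or None."""
    match = FIRST_ELEMENT.match(compound)
    if match is None:
        return None
    return lookup.get(match.group(0))


labels = [first_element_number(c, atomic_number) for c in compounds]
print(labels)  # [47, 47, 1, 40, 40]
```

A compound whose first symbol is missing from the lookup yields `None` instead of raising, which makes unmatched rows easy to find afterwards.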