
Outer join of pandas DataFrames generated in a for loop


I have a for loop that in the first iteration generates a dataframe like:

pd.DataFrame(columns = ["Al", "Si", "K", "Th"], data = [[1,2,3,4]])

The second iteration produces a dataframe that looks like:

pd.DataFrame(columns = ["W", "Cu"], data = [[5,6]])

Both the columns and data variables are generated by the loop in each iteration. I want to add something at the end of the loop that performs an outer join of all the dataframes, such that the final result is:

pd.DataFrame(columns = ["Al", "Si", "K", "Th", "W", "Cu"], data = [[1,2,3,4, 0,0], [0,0,0,0, 5,6]])

I've tried append, concat and an outer join, but I can't get any of them to work, because I need the final dataframe to be updated live on each iteration.

It's also worth mentioning that I can't predefine the total number of columns a priori: the elements depend on the data and are only determined during the loop.

edit: Here's the loop:

formulas = ("NaAlSiO2", "WCu2")

for form in formulas:

    s = re.findall('([A-Z][a-z]?)([0-9]*)', form)

    perc_weight = []
    atoms = []

    total_weight = molecular_w_calc(form)  # constant per formula, so compute it once

    for elem, count in s:

        count = int(count) if count else 1  # an element without a number occurs once
        atoms.append(elem)
        perc_weight.append((Element_mass[elem]*100*count) / total_weight)

    perc_df = pd.DataFrame(columns = atoms, data = [perc_weight])

Element_mass is a dictionary with values for each atom. perc_df is the dataframe produced in each iteration. molecular_w_calc returns a single value.

Thanks!


Solution

  • If you want to extend the frame iteratively, then concat should actually work. This code

    df1 = pd.DataFrame(columns = ["Al", "Si", "K", "Th"], data = [[1,2,3,4]])
    df2 = pd.DataFrame(columns = ["W", "Cu"], data = [[5,6]])
    df = pd.concat([df1, df2], axis='rows')
    df.fillna(0, inplace=True)
    

    gives you

        Al   Si    K   Th    W   Cu
    0  1.0  2.0  3.0  4.0  0.0  0.0
    0  0.0  0.0  0.0  0.0  5.0  6.0
    

    Just a suggestion: wouldn't you be better off building the underlying data with plain Python first?

    Something like

    import re
    import pandas as pd
    
    re_comps = re.compile(r'([A-Z][a-z]?)([0-9]*)')
    
    formulas = ("NaAlSiO2", "WCu2")
    # Collect every element that occurs in any of the formulas
    elements = {element for formula in formulas
                        for element, _ in re_comps.findall(formula)}
    # One list per element plus 'Formula', pre-filled with None for every formula
    perc_dict = {key: len(formulas) * [None] for key in elements.union({'Formula'})}
    for i, formula in enumerate(formulas):
        perc_dict['Formula'][i] = formula
        total_weight = molecular_w_calc(formula)
        for element, count in re_comps.findall(formula):
            count = 1 if count == '' else int(count)  # no number means one atom
            perc_dict[element][i] = (Element_mass[element] * 100 * count) / total_weight
    

    and only then Pandas

    perc_df = pd.DataFrame(perc_dict)
    perc_df.set_index('Formula', drop=True, inplace=True)
    perc_df.sort_index(axis='columns', inplace=True)
    

    The resulting perc_df has the following structure (the values are placeholders, since I didn't have the Element_mass dictionary or the molecular_w_calc function). Elements missing from a formula show up as NaN; call fillna(0) as above if you want zeros instead:

               Al   Cu   Na    O   Si    W
    Formula                               
    NaAlSiO2  1.0  NaN  1.0  2.0  1.0  NaN
    WCu2      NaN  2.0  NaN  NaN  NaN  1.0
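
For the live update asked about in the question, the concat approach from the top of this answer can also be run inside the loop itself. The following is only a minimal sketch under some assumptions: ELEMENT_MASS is a hypothetical stand-in for the asker's Element_mass dictionary (rounded standard atomic weights), the total weight is summed directly instead of calling molecular_w_calc, and parse and running_percentages are illustrative names that do not appear in the original post.

```python
import re
import pandas as pd

# Hypothetical stand-in for the asker's Element_mass dictionary
# (rounded standard atomic weights, only for the elements used here).
ELEMENT_MASS = {"Na": 22.99, "Al": 26.98, "Si": 28.09, "O": 16.00,
                "W": 183.84, "Cu": 63.55}
RE_COMPS = re.compile(r"([A-Z][a-z]?)([0-9]*)")


def parse(formula):
    """Return (element, count) pairs; a missing count means one atom."""
    return [(el, int(n) if n else 1) for el, n in RE_COMPS.findall(formula)]


def running_percentages(formulas):
    """Build the combined frame step by step, one formula per iteration."""
    result = None
    for formula in formulas:
        pairs = parse(formula)
        # Stand-in for molecular_w_calc: sum of mass * count over the formula.
        total = sum(ELEMENT_MASS[el] * n for el, n in pairs)
        row = {}
        for el, n in pairs:
            # Accumulate, so an element that appears twice is not overwritten.
            row[el] = row.get(el, 0.0) + ELEMENT_MASS[el] * 100 * n / total
        perc_df = pd.DataFrame([row], index=[formula])
        # concat aligns on column names (outer join); new columns become NaN
        # in the older rows, and fillna(0) turns those into zeros.
        result = perc_df if result is None else pd.concat([result, perc_df])
        result = result.fillna(0)
        # At this point result is the up-to-date frame for this iteration.
    return result


result = running_percentages(("NaAlSiO2", "WCu2"))
print(result.round(2))
```

Note that calling concat on every iteration copies the accumulated frame each time. If the intermediate frames are not actually needed, collecting the per-formula frames in a list and calling concat once after the loop is usually cheaper.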