Tags: python, pandas, python-dataclasses

How would I return Python dataclass fields/values that are only defined in __post_init__ in print and pd.DataFrame?


TL;DR: I've created a dataclass where not all fields are defined in the init phase; some are added in __post_init__ based on an InitVar. When I print the class object, only the fields in init get printed, not those added in __post_init__. The same happens when I convert the object to a pandas DataFrame (which is what I ultimately would like to achieve). How do I need to change the class code or the print/pandas statements to get all dataclass fields/values?


This is an extract of my dataclass definition:

@dataclass
class PcpCompound:
    compound: InitVar[Compound] # Compound: class of the pubchempy package
    query_status: str
    query_term: str

    def __post_init__(self, compound: Compound | None):
        if compound is None:
            return
        self.query_finding: str = compound.iupac_name

Results and expectations are shown below.

My optimal solution would be to move everything from __post_init__ to init, but that would require that I check compound for None and write all the wanted details from it to their respective fields of the dataclass. Unfortunately, I couldn't figure out how to use compound programmatically in init ... and I believe that this is intended.

If this is not possible, how would I solve this? I would prefer not to define all fields in init, as this is not only tedious and doubles the amount of code, but would also lead to a lot of 'None' values if compound is itself None.


The calls to construct an object and to print it or convert it to a DataFrame:

PcpCompound = PcpCompound(query_status=status, query_term=query_term, compound=compound)
print(PcpCompound)

PdfCompound = pd.DataFrame([PcpCompound])
print(PdfCompound)

Actual output:

PcpCompound(query_status='Success!', query_term='110-89-4')
  query_status query_term
0     Success!   110-89-4

Expected output:

PcpCompound(query_status='Success!', query_term='110-89-4', query_finding='Piperidine')
  query_status query_term query_finding
0     Success!   110-89-4    Piperidine

Note: All expected values are correctly present in the object, as shown in the VSCode debug view. This can also be verified by print(PcpCompound.query_finding) which returns Piperidine as expected.


Solution

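  • query_finding is missing from the print output and from the DataFrame because dataclass is not aware of it. The decorator builds its field list, and the generated __repr__, from the class-level annotations only, so an attribute that is merely assigned in __post_init__ is an ordinary instance attribute rather than a field. pandas, in turn, builds its columns from those dataclass fields, so it never sees the attribute either.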
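
To see the difference concretely, here is a minimal sketch that compares what dataclasses.fields() reports with what the instance actually holds. The class Demo and its attribute names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, fields


@dataclass
class Demo:  # hypothetical class, for illustration only
    declared: str

    def __post_init__(self) -> None:
        # Plain instance attribute: there is no class-level annotation,
        # so the dataclass machinery never registers it as a field.
        self.added_later = "extra"


d = Demo(declared="x")
field_names = [f.name for f in fields(d)]  # the names dataclass knows about
instance_attrs = sorted(vars(d))           # everything stored on the instance

print(field_names)     # ['declared']
print(instance_attrs)  # ['added_later', 'declared']
print(d)               # the repr lists only the declared field
```

The attribute is stored on the object, which is why the debugger and a direct attribute access both show it, but anything driven by the field list skips it.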

    The solution is to declare the field in the class body with field(init=False), which tells dataclass that the field is not a parameter of the generated __init__ and is instead set in __post_init__:

    from dataclasses import InitVar, dataclass, field
    
    import pandas as pd
    
    
    @dataclass
    class PcpCompound:
        compound: InitVar[int]  # using `int` for reproducibility
        query_status: str
        query_term: str
        query_finding: int = field(init=False)  # int only because the stand-in query below returns an int
    
        def __post_init__(self, compound: int) -> None:
            self.query_finding = compound * 2  # pretend this is a real query
    
    
    c = PcpCompound(compound=123, query_status="Success!", query_term="someterm")
    print(c)
    print(pd.DataFrame([c]))
    

    Output:

    PcpCompound(query_status='Success!', query_term='someterm', query_finding=246)
      query_status query_term  query_finding
    0     Success!   someterm            246
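
One case from the question is not covered above: if compound is None and __post_init__ returns early, a field declared with field(init=False) and no default is never set, so printing the object would raise an AttributeError. A possible way around that, sketched here under assumptions, is to give the field a default of None. FakeCompound is a hypothetical stand-in for pubchempy's Compound, and the status strings and query terms are made-up sample values.

```python
from dataclasses import InitVar, asdict, dataclass, field
from typing import Optional


@dataclass
class FakeCompound:  # hypothetical stand-in for pubchempy's Compound
    iupac_name: str


@dataclass
class PcpCompound:
    compound: InitVar[Optional[FakeCompound]]
    query_status: str
    query_term: str
    # Declared as a field so repr/asdict include it; the default of None
    # keeps the instance complete when no compound is available.
    query_finding: Optional[str] = field(init=False, default=None)

    def __post_init__(self, compound: Optional[FakeCompound]) -> None:
        if compound is None:
            return  # query_finding keeps its default of None
        self.query_finding = compound.iupac_name


found = PcpCompound(FakeCompound("piperidine"), "Success!", "110-89-4")
missing = PcpCompound(None, "Not found", "no-such-term")

# asdict() walks the declared fields and skips the InitVar pseudo-field.
rows = [asdict(found), asdict(missing)]
print(rows)
```

Passing [found, missing] to pd.DataFrame should then give a query_finding column for both rows, with None in the row where no compound was found. Fields that depend on compound still have to be declared once in the class body, but they do not become __init__ parameters.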