Tags: python, pandas, dataframe, complex-numbers

Pandas: split only the columns with complex data into real and imaginary parts


I am new to pandas and am trying to work with DataFrames that contain a mix of complex-valued numerical data and other data (strings, etc.).

A stripped down version of what I am talking about:

import numpy as np
import pandas as pd
a = np.array([
        [0.1 + 1j, 0.2 + 0.2j, 0.2, 0.1j, "label_a", 1],
        [0.1 + 1j, 0.5 + 1.2j, 0.5, 1.0j, "label_b", 3],
    ])
columns = np.array([-12, -10, 10, 12, "label", "number"])
df = pd.DataFrame(data=a, columns=columns)
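
One caveat about the snippet above: `np.array` coerces a mixed list of numbers and strings to a single string dtype, so every column of `df` ends up holding strings. A minimal sketch of a typed variant, assuming the frame is built column by column and that string column names are acceptable (both are choices made for this illustration, not part of the original setup):

```python
import numpy as np
import pandas as pd

# A mixed-type list is coerced to a single string dtype by np.array
a = np.array([[0.1 + 1j, 0.2, "label_a", 1]])
print(a.dtype.kind)  # 'U' (unicode strings)

# Building the frame column by column keeps one dtype per column.
# String column names are an assumption made for this sketch.
typed = pd.DataFrame(
    {
        "-12": [0.1 + 1j, 0.1 + 1j],
        "-10": [0.2 + 0.2j, 0.5 + 1.2j],
        "10": [0.2, 0.5],
        "12": [0.1j, 1.0j],
        "label": ["label_a", "label_b"],
        "number": [1, 3],
    }
)
print(typed.dtypes)
```

With this construction only the columns that actually contain complex literals get a complex dtype; the column of plain floats stays `float64`.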

To save the data to disk and read it back, I need to split the complex values into their real and imaginary parts, since apparently none of the relevant disk formats (hdf5, parquet, etc.) support complex numbers.

If the DataFrame contained only complex numbers, I could do this by introducing a multi-index, which other questions already cover (e.g. "Modify dataframe with complex values into a new multiindexed dataframe with real and imaginary parts using pandas").

# save to file
pd.concat(
    [df.apply(np.real), df.apply(np.imag)],
    axis=1,
    keys=("R", "I"),
).swaplevel(0, 1, 1).sort_index(axis=1).to_parquet(file)

# read from file
df = pd.read_parquet(file)
real = df.loc[:, (slice(None), "R")].droplevel(1, axis=1)
imag = df.loc[:, (slice(None), "I")].droplevel(1, axis=1)
df = real + 1j * imag

However, this approach breaks down in the presence of e.g. string fields.

I am currently doing this by splitting the dataframe into one containing only complex numbers (i.e. the first four columns here) and the rest. Then I apply the above approach on the former, merge with the latter and save to file. This works, but isn't particularly nice, especially when the columns aren't ordered that neatly.

I was hoping that someone with more pandas experience would have a simpler way of achieving that. In case it matters: I don't care about write performance, but I do care about how fast the file can be read back into a DataFrame.


Solution

  • You can process the columns that you know are complex separately from the others. Add a dummy second level for the other columns so that both parts share the same two-level column index:

    Writing

    N = 4
    cols = df.columns[:N]  # or define an explicit list of names

    # ensure the dtype is complex
    # adjust the type if needed, e.g. np.complex128 for full double precision
    # (np.complex64 stores each part as a float32)
    tmp = df[cols].astype(np.complex64)

    (
        pd.concat(
            # real and imaginary parts of the complex columns
            # NB. computed on the whole frame at once, which is more
            # efficient than df.apply
            [
                pd.DataFrame(np.real(tmp), index=tmp.index, columns=cols),
                pd.DataFrame(np.imag(tmp), index=tmp.index, columns=cols),
            ],
            axis=1,
            keys=("R", "I"),
        )
        # add the other columns under a dummy top-level key
        .join(pd.concat({None: df[df.columns.difference(cols)]}, axis=1))
        .swaplevel(0, 1, 1)
        .sort_index(axis=1)
        .to_parquet('test_pqt')
    )
    

    Reading

    # read from file
    df = pd.read_parquet('test_pqt')
    
    N = 4
    cols = df.columns.get_level_values(0)[:N] # or define an explicit list of names
    
    other_cols = df.columns.get_level_values(0).difference(cols)
    
    real = df.loc[:, (cols, "R")].droplevel(1, axis=1)
    imag = df.loc[:, (cols, "I")].droplevel(1, axis=1)
    df = (real + 1j * imag).join(df.droplevel(1, axis=1)[other_cols])
    
    print(df)
    

    Output:

            -10       -12   10   10   12   12    label number
    0  0.2+0.2j  0.1+1.0j  0.0  0.2  0.1  0.0  label_a      1
    1  0.5+1.2j  0.1+1.0j  0.0  0.5  1.0  0.0  label_b      3
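
When the complex columns are not grouped at the front, a possible variation is to detect them by dtype and to store the two parts as flat, suffixed columns instead of a two-level column index. The following is only a sketch under stated assumptions: the frame has properly typed columns with string names, and the helper names `split_complex` and `merge_complex` as well as the `__re`/`__im` suffixes are hypothetical, introduced here for illustration. The file I/O is left out so the example has no Parquet dependency; the flat frame contains only non-complex columns and is what would be written to disk.

```python
import numpy as np
import pandas as pd

# Hypothetical suffixes; pick ones that cannot collide with real column names
RE, IM = "__re", "__im"


def split_complex(df):
    """Replace each complex column by two float columns, keeping column order."""
    out = {}
    for name in df.columns:
        col = df[name]
        if pd.api.types.is_complex_dtype(col.dtype):
            values = col.to_numpy()
            out[f"{name}{RE}"] = values.real
            out[f"{name}{IM}"] = values.imag
        else:
            out[name] = col
    return pd.DataFrame(out, index=df.index)


def merge_complex(df):
    """Inverse of split_complex: rebuild complex columns from suffix pairs."""
    out = {}
    for name in df.columns:
        if name.endswith(RE):
            base = name[: -len(RE)]
            out[base] = df[name].to_numpy() + 1j * df[base + IM].to_numpy()
        elif not name.endswith(IM):
            out[name] = df[name]
    return pd.DataFrame(out, index=df.index)


# Complex and non-complex columns deliberately interleaved
df = pd.DataFrame(
    {
        "-12": [0.1 + 1j, 0.1 + 1j],
        "label": ["label_a", "label_b"],
        "-10": [0.2 + 0.2j, 0.5 + 1.2j],
        "number": [1, 3],
    }
)

flat = split_complex(df)        # no complex dtypes left
restored = merge_complex(flat)  # same columns and order as df
```

On the read side the only work is one vectorized addition per complex column, which should keep the cost of rebuilding the frame small compared with reading the file itself.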