Tags: python, python-3.x, machine-learning, categorical-data, one-hot-encoding

Python 3: how to select the columns I want and avoid a KeyError if they are absent


I have some categorical values

E.g. things = 'car', 'dog', 'pen', 'bar'

I encode them as numerical values via one-hot encoding:

car dog pen bar
1   1   1   1

I want to use some of the columns in my dataset.

E.g. car, dog and pen, but not bar.

I do it by defining the specific columns:

dataset = dataset[['car', 'dog', 'pen']]

But sometimes some of the columns I want are absent from my dataset, e.g. 'car'.

Then Python raises the error:

KeyError: "['car'] not in index"

How can I solve this so that I:

  1. get the columns I want
  2. avoid the error if some of the columns I want are absent

Solution

  • You can check which of the requested columns actually exist before selecting them. An example is the following function:

    import pandas as pd

    def custom_dataset(dataset, req_cols):
        in_, out_ = [], []  # names of found and missing columns
        if isinstance(dataset, pd.DataFrame):  # optional type check
            for col in req_cols:  # check every requested column
                if col in dataset.columns:
                    in_.append(col)  # requested and present (i.e. valid)
                else:
                    out_.append(col)  # requested but absent (i.e. invalid)
        # an empty list is reported as None
        return (dataset[in_] if in_ else None), (out_ if out_ else None)
    

    As you can see, it returns a tuple of two elements:

    1. The dataset restricted to the requested columns that exist, or None if none of them exist (so check the result for None before using it).
    2. A list of the requested columns that were not found (for your records), or None if all of them were found.
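
    As a quick illustration of the two return values, here is a minimal sketch. The sample frame and its values are made up for the example (a one-hot encoded dataset in which 'car' never occurred), and the function is repeated in compact form only so the snippet runs on its own:

```python
import pandas as pd


def custom_dataset(dataset, req_cols):
    """Same logic as the function above, repeated so this snippet is self-contained."""
    in_, out_ = [], []
    if isinstance(dataset, pd.DataFrame):
        for col in req_cols:
            # sort each requested name into "found" or "missing"
            (in_ if col in dataset.columns else out_).append(col)
    return (dataset[in_] if in_ else None), (out_ if out_ else None)


# Hypothetical one-hot encoded data: there is no 'car' column
dataset = pd.DataFrame({"dog": [1, 0], "pen": [0, 1], "bar": [1, 1]})

subset, missing = custom_dataset(dataset, ["car", "dog", "pen"])

if subset is not None:
    print(list(subset.columns))   # ['dog', 'pen']
if missing is not None:
    print("missing:", missing)    # missing: ['car']
```

    Checking both results for None before using them keeps the calling code free of KeyError handling.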

    If the dataset is not a DataFrame, or no columns are requested, the function does not raise an error; it returns (None, None).
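
    If the goal is to always end up with the same set of columns (point 1 of the question), pandas' own tools may be enough. The sketch below is an alternative and not part of the answer above. It assumes that a category which never occurred can be represented by an all-zero column, which is usually what a missing one-hot column means. The sample data and the names `wanted`, `present` and `full` are made up for the example:

```python
import pandas as pd

# Hypothetical one-hot encoded data: there is no 'car' column
dataset = pd.DataFrame({"dog": [1, 0], "pen": [0, 1], "bar": [1, 1]})
wanted = ["car", "dog", "pen"]

# Option A: silently keep only the wanted columns that exist
present = [col for col in wanted if col in dataset.columns]
subset = dataset[present]
print(list(subset.columns))   # ['dog', 'pen']

# Option B: always return every wanted column, creating absent ones filled with 0
full = dataset.reindex(columns=wanted, fill_value=0)
print(list(full.columns))     # ['car', 'dog', 'pen']
print(full["car"].tolist())   # [0, 0]
```

    Option B keeps the column layout stable across datasets, which tends to matter when the result is passed to a model that expects a fixed set of features.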