Tags: pandas, scikit-learn, one-hot-encoding

How can I recode 53k unique addresses (stored as objects) without one-hot encoding in Pandas?


My data frame has 3.8 million rows and 20 or so features, many of which are categorical. After paring down the number of features, I can "dummy up" one critical column with 20 or so categories, and my Colab session (allegedly running on a TPU) won't crash.

But there's another column with about 53,000 unique values. Trying to "dummy up" this feature crashes my session. I can't ditch this column.

I've looked into target encoding, but the data set is very imbalanced and I'm concerned about target leakage. Is there a way around this?

EDIT: My target variable is a simple binary one.
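On the leakage concern: the standard workaround is out-of-fold target encoding, where each row's encoding is computed from target means on the *other* folds, so a row's own label never contributes to its encoding. A minimal sketch (the column and target names here are hypothetical placeholders):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_encode(df, col, target, n_splits=5, seed=0):
    """Out-of-fold target encoding: each row is encoded with the
    category's mean target computed on the other folds only,
    which limits target leakage."""
    global_mean = df[target].mean()
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, val_idx in kf.split(df):
        # Per-category target means from the training folds only
        fold_means = df.iloc[train_idx].groupby(col)[target].mean()
        encoded.iloc[val_idx] = df[col].iloc[val_idx].map(fold_means).to_numpy()
    # Categories unseen in a training fold fall back to the global rate
    return encoded.fillna(global_mean)

# Hypothetical usage with an "address" column and binary "label" target:
# df["address_te"] = oof_target_encode(df, "address", "label")
```

With a heavily imbalanced binary target, you may also want smoothing toward the global mean for rare categories, since a 53k-category address column will have many categories with only a handful of rows.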


Solution

  • Without knowing more details of the problem/feature, there's no obvious way to do this. This is the part of data science/machine learning that is an art, not a science. A few ideas:

    1. One-hot encode everything, then use a dimensionality-reduction algorithm to shrink the resulting matrix (PCA, truncated SVD, etc.; truncated SVD has the advantage of working directly on the sparse output).
    2. One-hot encode only the most frequent values (say, the top 10 or 100 categories rather than all 53,000), and lump the rest into an "other" category.
    3. If it's possible to construct an embedding for these values (not always possible), explore that.
    4. Group/bin the values in the column by some underlying feature. E.g., if the feature is something like days_since_X, bin it into ranges of 100; or if it's names of animals, group them by type instead (mammal, reptile, etc.).
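Idea 2 is usually the quickest to try. A minimal sketch in pandas (the series name and the `"OTHER"` label are illustrative choices, not fixed API):

```python
import pandas as pd

def top_k_dummies(s, k=100, other="OTHER"):
    """One-hot encode only the k most frequent categories in a Series;
    every remaining value is collapsed into a single 'other' column."""
    top = s.value_counts().nlargest(k).index
    collapsed = s.where(s.isin(top), other)
    return pd.get_dummies(collapsed, prefix=s.name)

# Hypothetical usage: at most k + 1 dummy columns instead of 53,000
# dummies = top_k_dummies(df["address"], k=100)
```

This caps the dummy matrix at k + 1 columns regardless of cardinality, at the cost of treating all rare addresses as interchangeable.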