Search code examples
pythonpandasone-hot-encoding

Restructure data in Pandas for One-Hot Encoding?


this is my first time ever asking a Stack Overflow question after years of lurking and reading posts.

I've been working with headerless CSV data sets that look like this:

list-of-things-that-are-present-in-this-entry, Yes
list-of-things-that-are-present-in-this-entry, No

Each data set ranges in size from ~30KB to ~100MB.

Now, each entry is a different length in terms of the number of values. As I read in the CSV file with Pandas, it has been aligning the entries for me -- but not in a way that I want it.

Let's say that the second entry is the shorter of the two. Pandas has been filling in the rest of the row with NaN so it's the same length as the other entries.

list-of-things-that-are-present-in-this-entry, Yes
list-of-things-that-are-present-in-this-entry, No, NaN, NaN

Problem is, this throws off the get_dummies function I've tried to use so far. The 'Yes' and 'No' values are for the same property. By "throws off", I mean that it's been treating each column of values as for the same property (eg, Yes and NaN, when it should be Yes and No being compared).

Any suggestions on what I could do? Maybe tack on a header? What I want is essentially this:

    A B C D E F isThingTrue?
0   0 1 0 1 0 0     0
1   1 0 1 0 0 0     1
2   0 1 1 1 0 1     1

from this:

B, D, No
A, C, Yes
B, C, D, F, Yes

The file itself would look something like this:

A, B, C, D, E, F, isThingTrue?
0, 1, 0, 1, 0, 0, 0
1, 0, 1, 0, 0, 0, 1
0, 1, 1, 1, 0, 1, 1

I'm not wedded to Pandas or anything; I'm asking out of desperation after going down a million and five rabbit holes of search enquiries. This is my first foray into data munging and Python. Let me know if I'm doing anything wrong.


Solution

  • You could use a scikit-learn CountVectorizer, you'll need to override the token pattern since by default it only captures words of 2 or more characters.

    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer
    
    s = """B, D, No
           A, C, Yes
           B, C, D, F, Yes"""
    
    c = CountVectorizer(token_pattern='\w+')
    values = c.fit_transform(s.split('\n'))
    
    pd.DataFrame(values.toarray(), columns=c.vocabulary_)
       a  d  c  yes  f  b  no
    0  0  1  0    1  0  1   0
    1  1  0  1    0  0  0   1
    2  0  1  1    1  1  0   1
    

    Your data format isn't really a CSV, so pandas probably isn't the best way to read it in, better to just read it in as a text file.