python pytorch neural-network

Converting a column of object type to a PyTorch tensor


I am new to machine learning and Python. I am working on data that has 2 columns of object type and a large number of columns of float type. For converting the float columns to a tensor, the code below works fine:

    cont_cols = ['row_id', 'player1', 'player2', 'playervar_0', 'player_1']
    conts = np.stack([train_df[col].values for col in cont_cols], 1)
    conts = torch.tensor(conts, dtype=torch.float)

But when I tried the same thing with the object-type columns, as below:

    obj_cols = ['winner','team'] 
    objs = np.stack([train_df[col].values for col in obj_cols],1)
    objs = torch.tensor(objs, dtype= torch.float)

I am getting the error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[60], line 2
      1 objs = np.stack([train_df[col].values for col in obj_cols],1)
----> 2 objs = torch.tensor(objs, dtype= torch.float)

TypeError: can't convert np.ndarray of type numpy.object_. The only supported types are: float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint8, and bool.

It would be a great help if someone could guide me on this.

Edit: To make the question clearer, the column 'winner' contains winner, loser, draw, loser, winner, winner, draw, ... The column 'team' contains team1, team2, team1, team1, team2, ...

Edit2

I tried the approach below. Is it fine, or is there a better approach?

    train_df['winner'] = train_df['winner'].map({'loser': 0, 'winner': 1, 'draw': 2})
    train_df['team'] = train_df['team'].map({'team1': 0, 'team2': 1})
    obj_cols = ['winner', 'team']
    objs = np.stack([train_df[col].values.tolist() for col in obj_cols], 1)
    objs = torch.tensor(objs, dtype=torch.float)
    objs[:5]

    # Output:
    # tensor([[1., 0.],
    #         [0., 1.],
    #         [0., 0.],
    #         [0., 1.],
    #         [2., 0.]])

Solution

  • To make the question clearer, the column 'winner' contains winner, loser, draw, loser, winner, winner, draw, ... The column 'team' contains team1, team2, team1, team1, team2, ...

    Your problem is that your data are strings (or "objects") that cannot be converted to a tensor directly.

    You have to convert your unique string values into numbers somehow. You are on the right path regarding what you did in "Edit2" :)

    If you want to preserve the labels of your columns, you could convert each column to pandas.Categorical and then use its integer codes (.cat.codes on a Series) to get the integers for the tensor, e.g.:

    import numpy as np
    import pandas as pd

    # Example data:
    #   winner  team
    #   loser   team1
    #   winner  team2
    #   draw    team1
    #   winner  team1
    df = pd.DataFrame({
        "winner": ["loser", "winner", "draw", "winner"],
        "team": ["team1", "team2", "team1", "team1"]
    })
    # you can control the order here, i.e. winner -> 0, loser -> 1, draw -> 2
    df["winner"] = pd.Categorical(df["winner"], ["winner", "loser", "draw"])
    df["team"] = pd.Categorical(df["team"], ["team1", "team2"])

    # one column of integer codes per categorical column
    objs = np.stack([df[col].cat.codes for col in ["winner", "team"]], 1)

    # Output of objs:
    # array([[1, 0],
    #        [0, 1],
    #        [2, 0],
    #        [0, 0]], dtype=int8)
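
To finish the Categorical route, the integer codes still have to be turned into a tensor. The following is a minimal sketch of that last step, assuming the same small example frame as above (not your real data). The check for -1 and the choice of torch.long are assumptions about how the codes will be used, not something stated in the question.

```python
import numpy as np
import pandas as pd
import torch

# Small stand-in frame; replace with your own train_df.
df = pd.DataFrame({
    "winner": ["loser", "winner", "draw", "winner"],
    "team": ["team1", "team2", "team1", "team1"],
})
df["winner"] = pd.Categorical(df["winner"], ["winner", "loser", "draw"])
df["team"] = pd.Categorical(df["team"], ["team1", "team2"])

codes = np.stack([df[col].cat.codes for col in ["winner", "team"]], 1)

# Values that are missing or not in the listed categories get the code -1,
# so it is worth checking for that before using the codes as indices.
assert (codes >= 0).all()

# int8 is a dtype torch accepts; cast to torch.long if the codes will index
# an embedding layer, or to torch.float to match the continuous columns.
objs = torch.tensor(codes, dtype=torch.long)
```

Whether torch.long or torch.float is the better dtype depends on what the model does with these columns; the conversion itself works with either.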
    
    

    Or you can simply use pandas.factorize() to get an integer representation of your labels. Note that factorize() returns a (codes, uniques) tuple, so take the first element:

    objs = np.stack([train_df[col].factorize()[0] for col in obj_cols], 1)
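
A slightly fuller sketch of the factorize() route is below. It also keeps the unique labels, so the integer codes can be mapped back to the original strings later. The train_df here is a small stand-in frame, and the uniques dictionary is a hypothetical helper name introduced for illustration.

```python
import numpy as np
import pandas as pd

# Small stand-in frame; replace with your own train_df.
train_df = pd.DataFrame({
    "winner": ["loser", "winner", "draw", "winner"],
    "team": ["team1", "team2", "team1", "team1"],
})
obj_cols = ["winner", "team"]

uniques = {}   # column name -> labels, in order of first appearance
columns = []
for col in obj_cols:
    # factorize returns (integer codes, unique labels)
    codes, labels = pd.factorize(train_df[col])
    columns.append(codes)
    uniques[col] = labels

objs = np.stack(columns, 1)

# Map the codes of one column back to the original strings.
recovered = list(uniques["winner"][objs[:, 0]])
```

Unlike pandas.Categorical with an explicit category list, factorize() numbers the labels in order of first appearance, so the mapping depends on the row order of the data.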