Search code examples
pythonpandasdataframeappendnan

Transfer the specific values from one pd dataframe to another


Can you please help me?

I have two pandas dataframes and I need to extract every tenth value from the specific column from the first dataframe in a list. Then I need to create a new column in a second dataframe and put these values in this column. I also need my number of a row to match in both dataframes. Here is the code:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt

# we will use random forest classifier as our classifier
forest_classifier=RandomForestClassifier(max_depth=4)

# reading in accelerometer data
train_time = pd.read_csv("https://courses.edx.org/assets/courseware/v1/b98039c3648763aae4f153a6ed32f38b/asset-v1:HarvardX+PH526x+3T2022+type@asset+block/train_time_series.csv", index_col=0)
train_labels = pd.read_csv("https://courses.edx.org/assets/courseware/v1/d64e74647423e525bbeb13f2884e9cfa/asset-v1:HarvardX+PH526x+3T2022+type@asset+block/train_labels.csv", index_col=0)

x = []
y = []
z = []

# making lists out of the x, y, z columns
for i in range(3, len(train_time), 10):
    x.append(train_time.iloc[i]["x"])
    y.append(train_time.iloc[i]["y"])
    z.append(train_time.iloc[i]["z"])
    
print(z)  # checking the list

# making new columns in train_labels file with the obtained lists
train_labels["x"] = pd.Series([x])
train_labels["y"] = pd.Series([y])
train_labels["z"] = pd.Series([z])

train_labels.head()

But I get the output, where the created columns have just the values of "NaN"

The output should be the dataframe with the created x, y, z columns, which have corresponding numbers of observations.

enter image description here


Solution

  • You can stack x, y and z and use:

    train_labels[['x', 'y', 'z']] = np.stack([x, y, z], axis=1)
    

    Output:

    >>> train_labels
               timestamp                 UTC time  label         x         y         z
    20589  1565109931087  2019-08-06T16:45:31.087      1 -0.053802 -0.987701  0.068985
    20599  1565109932090  2019-08-06T16:45:32.090      1  0.013718 -0.852371 -0.000870
    20609  1565109933092  2019-08-06T16:45:33.092      1  0.145584 -1.007843 -0.036819
    20619  1565109934094  2019-08-06T16:45:34.094      1 -0.099380 -1.209686  0.304489
    20629  1565109935097  2019-08-06T16:45:35.097      1  0.082794 -1.001434 -0.025375
    ...              ...                      ...    ...       ...       ...       ...
    24289  1565110302030  2019-08-06T16:51:42.030      4 -0.641953 -1.469177  0.301041
    24299  1565110303032  2019-08-06T16:51:43.032      4 -0.171616 -0.366074 -0.059082
    24309  1565110304034  2019-08-06T16:51:44.034      4  0.401810 -1.077698  0.258911
    24319  1565110305037  2019-08-06T16:51:45.037      4  0.330338 -1.470062  0.303894
    24329  1565110306039  2019-08-06T16:51:46.039      4  0.689346 -0.991043  0.034973
    
    [375 rows x 6 columns]
    

    Note 1: your code doesn't work because you create a pd.Series without using indexes from train_labels so the indexes are misaligned:

    train_labels['x'] = pd.Series(x, index=train_labels.index)
    train_labels['y'] = pd.Series(y, index=train_labels.index)
    train_labels['z'] = pd.Series(z, index=train_labels.index)
    

    Note 2: you can also avoid the loop:

    x, y, z = train_time.iloc[3::10][['x', 'y', 'z']].T.values
    

    So the code could be:

    x, y, z = train_time.iloc[3::10][['x', 'y', 'z']].T.values
    train_labels[['x', 'y', 'z']] = np.stack([x, y, z], axis=1)
    
    # Or (if you needn't x, y, and z)
    train_labels[['x', 'y', 'z']] = train_time.iloc[3::10][['x', 'y', 'z']