Can you please help me?
I have two pandas dataframes and I need to extract every tenth value from the specific column from the first dataframe in a list. Then I need to create a new column in a second dataframe and put these values in this column. I also need my number of a row to match in both dataframes. Here is the code:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt
# we will use random forest classifier as our classifier
forest_classifier=RandomForestClassifier(max_depth=4)
# reading in accelerometer data
train_time = pd.read_csv("https://courses.edx.org/assets/courseware/v1/b98039c3648763aae4f153a6ed32f38b/asset-v1:HarvardX+PH526x+3T2022+type@asset+block/train_time_series.csv", index_col=0)
train_labels = pd.read_csv("https://courses.edx.org/assets/courseware/v1/d64e74647423e525bbeb13f2884e9cfa/asset-v1:HarvardX+PH526x+3T2022+type@asset+block/train_labels.csv", index_col=0)
x = []
y = []
z = []
# making lists out of the x, y, z columns
for i in range(3, len(train_time), 10):
x.append(train_time.iloc[i]["x"])
y.append(train_time.iloc[i]["y"])
z.append(train_time.iloc[i]["z"])
print(z) # checking the list
# making new columns in train_labels file with the obtained lists
train_labels["x"] = pd.Series([x])
train_labels["y"] = pd.Series([y])
train_labels["z"] = pd.Series([z])
train_labels.head()
But I get the output, where the created columns have just the values of "NaN"
The output should be the dataframe with the created x, y, z columns, which have corresponding numbers of observations.
You can stack x
, y
and z
and use:
train_labels[['x', 'y', 'z']] = np.stack([x, y, z], axis=1)
Output:
>>> train_labels
timestamp UTC time label x y z
20589 1565109931087 2019-08-06T16:45:31.087 1 -0.053802 -0.987701 0.068985
20599 1565109932090 2019-08-06T16:45:32.090 1 0.013718 -0.852371 -0.000870
20609 1565109933092 2019-08-06T16:45:33.092 1 0.145584 -1.007843 -0.036819
20619 1565109934094 2019-08-06T16:45:34.094 1 -0.099380 -1.209686 0.304489
20629 1565109935097 2019-08-06T16:45:35.097 1 0.082794 -1.001434 -0.025375
... ... ... ... ... ... ...
24289 1565110302030 2019-08-06T16:51:42.030 4 -0.641953 -1.469177 0.301041
24299 1565110303032 2019-08-06T16:51:43.032 4 -0.171616 -0.366074 -0.059082
24309 1565110304034 2019-08-06T16:51:44.034 4 0.401810 -1.077698 0.258911
24319 1565110305037 2019-08-06T16:51:45.037 4 0.330338 -1.470062 0.303894
24329 1565110306039 2019-08-06T16:51:46.039 4 0.689346 -0.991043 0.034973
[375 rows x 6 columns]
Note 1: your code doesn't work because you create a pd.Series
without using indexes from train_labels
so the indexes are misaligned:
train_labels['x'] = pd.Series(x, index=train_labels.index)
train_labels['y'] = pd.Series(y, index=train_labels.index)
train_labels['z'] = pd.Series(z, index=train_labels.index)
Note 2: you can also avoid the loop:
x, y, z = train_time.iloc[3::10][['x', 'y', 'z']].T.values
So the code could be:
x, y, z = train_time.iloc[3::10][['x', 'y', 'z']].T.values
train_labels[['x', 'y', 'z']] = np.stack([x, y, z], axis=1)
# Or (if you needn't x, y, and z)
train_labels[['x', 'y', 'z']] = train_time.iloc[3::10][['x', 'y', 'z']