I have 3 lists like as shown below
numeric_cols = df.select_dtypes(include=np.number).columns # 3 items `qty`, `age`, `sqft`
date_cols = df.select_dtypes(include=['datetime64']).columns # 2 items
string_cols = df.select_dtypes(include=['object']).columns # 3 items `bucket`, `category`, `level`
Now, I would like to
a) select only one item from numeric cols - qty
and all the items from string_cols
(so dataframe has to have only 4 columns)
I tried the below
df[[*string_cols]] = df[[*string_cols]]
df.insert(2, "Qty",df['Qty'],True)
Please note that what I have shown is just a sample. In real data, I have millions of rows and 100's of columns. Hence, I would like to follow the above approach.
Can guide me on how can I do this efficiently?
You can select subset of columns using a list of column names:
new_df = df[[*string_cols] + ['Qty']]
For example, for DataFrame,
Qty Age Sqft bucket category level
0 11 8 1.0 a a a
1 2 9 0.0 b b b
2 3 10 0.0 c c c
3 18 3 1.0 d d d
4 21 2 NaN e e e
string_cols = df.select_dtypes(inclue=['object']).columns
new_df = df[[*string_cols] + ['Qty']]
produces:
bucket category level Qty
0 a a a 11
1 b b b 2
2 c c c 3
3 d d d 18
4 e e e 21