python, pandas, optimization, data-science, google-colaboratory

How can I optimize my code so my Google Colab doesn't crash?


I ran into an issue where Google Colab keeps running out of RAM. I'm on the free version, and I'm not sure whether it simply can't handle the dataset or whether my code is badly optimized. Since I'm new to the field, I suspect my code is slow and poorly optimized. I'd appreciate some help, as I'm still learning.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from xgboost import XGBRegressor
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor 
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.preprocessing import LabelEncoder
from google.colab import drive
drive.mount('/content/drive')

df = pd.read_csv('path/beforeNeural.csv')
df.shape
df.head()

df.isnull().sum()

encoder = LabelEncoder()

df['Property Type'] = encoder.fit_transform(df['Property Type'])
df['Old/New'] = encoder.fit_transform(df['Old/New'])
df['Record Status - monthly file only'] = encoder.fit_transform(df['Record Status - monthly file only'])
df['PPDCategory Type'] = encoder.fit_transform(df['PPDCategory Type'])
df['County'] = encoder.fit_transform(df['County'])
df['District'] = encoder.fit_transform(df['District'])
df['Town/City'] = encoder.fit_transform(df['Town/City'])
df['Duration'] = encoder.fit_transform(df['Duration'])
df['Transaction unique identifier'] = encoder.fit_transform(df['Transaction unique identifier'])
df['Date of Transfer'] = encoder.fit_transform(df['Date of Transfer'])

X = df.drop(columns='Price', axis=1)
Y = df['Price']

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=2)

df.shape

boostenc = XGBRegressor()

boostenc.fit(X_train, Y_train)

Solution

  • I'll give it a try; here is one possible way to optimize your code.

    Code:

    import pandas as pd
    from xgboost import XGBRegressor
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import OneHotEncoder
    from google.colab import drive
    drive.mount('/content/drive')
    
    df = pd.read_csv('path/beforeNeural.csv')
    
    categorical_columns = ['Property Type', 'Old/New', 'Record Status - monthly file only', 'PPDCategory Type', 'County', 'District', 'Town/City', 'Duration', 'Transaction unique identifier', 'Date of Transfer']
    encoder = OneHotEncoder()
    X_concat = encoder.fit_transform(df[categorical_columns])
    # Pick one of the two approaches below:
    # Approach 1: dense DataFrame (simple, but uses far more memory)
    X_concat = pd.DataFrame(X_concat.toarray(), columns = encoder.get_feature_names_out(categorical_columns))
    # Approach 2: sparse DataFrame (keeps memory usage low;
    # pd.SparseDataFrame was removed from pandas, so use the sparse accessor instead)
    X_concat = pd.DataFrame.sparse.from_spmatrix(X_concat, columns = encoder.get_feature_names_out(categorical_columns))
    
    X_numerical = df.drop(columns = categorical_columns + ['Price'])
    X = pd.concat([X_numerical, X_concat], axis = 1)
    Y = df['Price']
    
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 2)
    
    boostenc = XGBRegressor()
    boostenc.fit(X_train, Y_train)
    

    Note: I removed the unused imports and deleted calls such as df.head() from the middle of the code; used like that in a script, a bare expression does nothing and doesn't print anything.

    Code Explanation:

    1. Instead of using LabelEncoder, I used OneHotEncoder to one-hot-encode all of the categorical features. This creates a new binary column for each unique value of a categorical feature. In general, one-hot encoding handles nominal categorical features better than assigning arbitrary integer codes with LabelEncoder, because integer codes imply an ordering that most models will try to exploit (see the short sketch after this list).
    2. I extracted the names of all of the categorical columns into a list, which makes them easier to modify when needed.
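
    For illustration, here is a minimal, self-contained sketch of the difference between the two encoders, using a made-up 'Property Type' column (it assumes scikit-learn >= 1.0, where get_feature_names_out is available):

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder, OneHotEncoder

    # Toy data, just to show the shape of each encoder's output
    toy = pd.DataFrame({'Property Type': ['Detached', 'Flat', 'Terraced', 'Flat']})

    # LabelEncoder: a single column of arbitrary integer codes, which implies an ordering
    codes = LabelEncoder().fit_transform(toy['Property Type'])
    print(codes)  # [0 1 2 1]

    # OneHotEncoder: one binary column per unique value, returned as a sparse matrix
    ohe = OneHotEncoder()
    onehot = ohe.fit_transform(toy[['Property Type']])
    print(ohe.get_feature_names_out(['Property Type']))
    print(onehot.toarray())  # 4 x 3 matrix of 0s and 1s

    Keep in mind that one-hot-encoding a high-cardinality column such as 'Transaction unique identifier' creates one column per unique value, which is why keeping the result sparse (Approach 2 above) matters for memory on Colab.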