Creating a Pipeline model
In this section, we walk you through creating a Pipeline model. This type of model is highly recommended for use with EXPAI.

Download the files

    You can access the complete code here.
    Download the dataset here.

Imports

import os
import pickle

import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn import metrics as ms
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

Loading the data

Remember to download the data here.
# Define the path to the sample file
original_sample_path = os.path.abspath("./car_ad.csv")

# Read the file
df = pd.read_csv(original_sample_path, encoding="iso-8859-1", sep=";", index_col=0)
df.head()
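Before cleaning, it can help to inspect column types and missing-value counts so the next steps are easier to follow; an optional check:

# Optional: inspect column types and missing values before cleaning
df.info()
print(df.isna().sum())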

Data transformation

Transform and clean your data according to your needs.
# Drop rows with non-positive price (corrupted)
df = df.drop(df[df.price <= 0].index)

# Drop null values for engine volume and remove outliers
df = df.dropna(how="any", subset=["engV"])
df = df.drop(df[df.engV > 40].index)

# Drop remaining null values
df = df.dropna()

Train-Test split

Split your dataset into train and test sets.
# Select target column
y = df["price"]

# Drop target from input features
X = df.drop(["price"], axis=1)

# Split with 20% for test
data_train, data_test, label_train, label_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Turn the index into a regular column
data_train.reset_index(level=0, inplace=True)
data_test.reset_index(level=0, inplace=True)

Store input data

It is important to store the exact same data we used for the train-test split. This file will be the input dataset for EXPAI.
df.to_csv('./expai_input_data.csv')

Create a model using Pipeline

Pipelines are implemented by Scikit-Learn and allow users to bundle the whole analytical process into a single object. See docs.
In this case, we will build a Pipeline that:
    Encodes categorical variables
    Scales numerical variables
    Fits an XGBoost regressor
# Define a transformer for numerical and categorical variables
transformer = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ["mileage", "engV", "year"]),
        ('cat', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=np.nan), ["car", "body", "engType", "registration", "model", "drive"])
    ]
)
# Define XGBoost Regressor parameters
xgb_params = {
    'eta': 0.05,
    'max_depth': 5,
    'subsample': 0.7,
    'colsample_bytree': 0.7,
    'objective': 'reg:squarederror',
    'eval_metric': 'rmse',
    'verbosity': 0  # 'silent' was removed in XGBoost 1.0; use 'verbosity' instead
}

# Init model
model = xgb.XGBRegressor(**xgb_params)
# Create Pipeline object whose steps are the transformation and the model
clf = Pipeline(steps=[
    ('preprocessor', transformer),
    ('model', model)
])

# Fit the pipeline
clf.fit(X=data_train, y=label_train)
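The imports above include GridSearchCV, so you can optionally tune the regressor through the pipeline by prefixing parameter names with the step name ('model'). A minimal sketch; the grid values here are illustrative, not tuned recommendations:

# Illustrative grid: tune tree depth and number of trees through the 'model' step
param_grid = {
    'model__max_depth': [3, 5, 7],
    'model__n_estimators': [100, 200]
}

search = GridSearchCV(clf, param_grid, cv=3, scoring='neg_mean_squared_error')
search.fit(data_train, label_train)
print(search.best_params_)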

Measure performance

Since we are building a regressor, we will use Mean Squared Error (MSE) as our performance metric.
# Predict price for test data
y_hat = clf.predict(data_test)

# Compute MSE
ms.mean_squared_error(label_test, y_hat)
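Since MSE is expressed in squared price units, you may also want to report the RMSE, which is in the same units as the target and easier to interpret:

# RMSE is in the same units as the target price
rmse = np.sqrt(ms.mean_squared_error(label_test, y_hat))
print(f"RMSE: {rmse:.2f}")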

Export the model using Pickle

Use Pickle to store your model locally.
model_path = os.path.abspath("./model_pipeline.pkl")
with open(model_path, 'wb') as f:
    pickle.dump(clf, f)
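Before uploading the file to EXPAI, it is worth checking that the pickled pipeline loads back and reproduces the same predictions; a minimal sanity check:

# Reload the pickled pipeline and confirm predictions match the in-memory model
with open(model_path, 'rb') as f:
    clf_loaded = pickle.load(f)

assert np.allclose(clf_loaded.predict(data_test), y_hat)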