# Imports used in this tutorial
import os
import pandas as pd

# Define the path to the sample file
original_sample_path = os.path.abspath("./car_ad.csv")

# Read the file
df = pd.read_csv(original_sample_path, encoding='iso-8859-1', sep=";", index_col=0)
df.head()
Data transformation
Transform and clean your data according to your needs.
# Drop rows with non-positive price (corrupted)
df = df.drop(df[df.price <= 0].index)

# Drop null values for engine volume and remove outliers
df = df.dropna(how="any", subset=["engV"])
df = df.drop(df[df.engV > 40].index)

# Drop remaining null values
df = df.dropna()
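After cleaning, it can help to confirm that no null values remain and to see how many rows were kept. A small optional check, not part of the original workflow:

# Confirm there are no remaining nulls and inspect the resulting size
print(df.isna().sum())
print(f"Rows remaining after cleaning: {len(df)}")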
Train-Test split
Split your dataset into train and test sets.
from sklearn.model_selection import train_test_split

# Select target column
y_train = df["price"]

# Drop target from input
x_train = df.drop(["price"], axis=1)

# Split with 20% for test
data_train, data_test, label_train, label_test = train_test_split(x_train, y_train, test_size=0.2, random_state=42)

# Transform index into a column
data_train.reset_index(level=0, inplace=True)
data_test.reset_index(level=0, inplace=True)
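Optionally, you can verify the 80/20 split by inspecting the shapes of the resulting sets:

# Confirm the 80/20 split
print(data_train.shape, data_test.shape)
print(label_train.shape, label_test.shape)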
Store input data
It is really important to store the very same data we used for the train-test split. This will be the input dataset for EXPAI.
# Store the cleaned dataset so EXPAI receives exactly the data used for the split
df.to_csv('./expai_input_data.csv')
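As an optional sanity check (not part of the original workflow), you can reload the stored file and confirm it matches the in-memory dataset:

# Reload the stored file and confirm it matches the cleaned dataset
stored_df = pd.read_csv('./expai_input_data.csv', index_col=0)
assert stored_df.shape == df.shape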
Create a model using Pipeline
Pipelines are implemented by Scikit-Learn and allow users to build a single object for the whole analytical process. See the docs.
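The pipeline below uses a transformer and a model object that are assumed to be defined earlier in the tutorial. As a reference only, here is a minimal sketch of what they might look like; the column lists, encoder choices, and the RandomForestRegressor are assumptions for illustration, not the tutorial's exact definitions:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import RandomForestRegressor

# Hypothetical preprocessing: scale numeric columns, one-hot encode categoricals.
# The actual column lists depend on the car_ad dataset and may differ.
numeric_cols = ["mileage", "engV", "year"]
categorical_cols = ["car", "body", "engType", "registration", "drive", "model"]

transformer = ColumnTransformer(transformers=[
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

# Hypothetical regressor; any scikit-learn estimator works here.
model = RandomForestRegressor(random_state=42)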
from sklearn.pipeline import Pipeline

# Create a Pipeline object whose steps are the transformation and the model
clf = Pipeline(steps=[
    ('preprocessor', transformer),
    ('model', model)
])

# Fit the pipeline
clf.fit(X=data_train, y=label_train)
Measure performance
Since we are building a regressor, we will use Mean Squared Error as our metric to check performance.
from sklearn import metrics as ms

# Predict price for test data
y_hat = clf.predict(data_test)

# Compute MSE
ms.mean_squared_error(label_test, y_hat)
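MSE is expressed in squared price units, so it can be easier to read as a root mean squared error. A small optional follow-up:

import numpy as np

# RMSE is in the same units as the price, which is easier to interpret
rmse = np.sqrt(ms.mean_squared_error(label_test, y_hat))
print(f"Test RMSE: {rmse:.2f}")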