For this machine learning project, we will use the “Walmart Recruiting – Store Sales Forecasting” dataset from Kaggle.

The goal is to predict the Weekly Sales for specific stores, departments and dates.

Download Data

First, we install the opendatasets library.

In [1]:

pip install opendatasets --upgrade
Collecting opendatasets
  Downloading opendatasets-0.1.20-py3-none-any.whl (14 kB)
Requirement already satisfied: click in /opt/conda/lib/python3.9/site-packages (from opendatasets) (8.0.1)
Collecting kaggle
  Downloading kaggle-1.5.12.tar.gz (58 kB)
     |████████████████████████████████| 58 kB 4.8 MB/s eta 0:00:011
Requirement already satisfied: tqdm in /opt/conda/lib/python3.9/site-packages (from opendatasets) (4.61.2)
Requirement already satisfied: six>=1.10 in /opt/conda/lib/python3.9/site-packages (from kaggle->opendatasets) (1.16.0)
Requirement already satisfied: certifi in /opt/conda/lib/python3.9/site-packages (from kaggle->opendatasets) (2021.5.30)
Requirement already satisfied: python-dateutil in /opt/conda/lib/python3.9/site-packages (from kaggle->opendatasets) (2.8.2)
Requirement already satisfied: requests in /opt/conda/lib/python3.9/site-packages (from kaggle->opendatasets) (2.26.0)
Collecting python-slugify
  Downloading python_slugify-5.0.2-py2.py3-none-any.whl (6.7 kB)
Requirement already satisfied: urllib3 in /opt/conda/lib/python3.9/site-packages (from kaggle->opendatasets) (1.26.6)
Collecting text-unidecode>=1.3
  Downloading text_unidecode-1.3-py2.py3-none-any.whl (78 kB)
     |████████████████████████████████| 78 kB 6.7 MB/s  eta 0:00:01
Requirement already satisfied: charset-normalizer~=2.0.0 in /opt/conda/lib/python3.9/site-packages (from requests->kaggle->opendatasets) (2.0.0)
Requirement already satisfied: idna<4,>=2.5 in /opt/conda/lib/python3.9/site-packages (from requests->kaggle->opendatasets) (3.1)
Building wheels for collected packages: kaggle
  Building wheel for kaggle (setup.py) ... done
  Created wheel for kaggle: filename=kaggle-1.5.12-py3-none-any.whl size=73053 sha256=dd4304085b0006cc5664f8c02d94d7c54a480af025e40f5521209280e0c17b9a
  Stored in directory: /home/jovyan/.cache/pip/wheels/ac/b2/c3/fa4706d469b5879105991d1c8be9a3c2ef329ba9fe2ce5085e
Successfully built kaggle
Installing collected packages: text-unidecode, python-slugify, kaggle, opendatasets
Successfully installed kaggle-1.5.12 opendatasets-0.1.20 python-slugify-5.0.2 text-unidecode-1.3
Note: you may need to restart the kernel to use updated packages.

Now we import the libraries that we will use.

In [2]:

import opendatasets as od
import os
from zipfile import ZipFile

import numpy as np # linear algebra
import pandas as pd # data processing
import seaborn as sns; sns.set(style="ticks", color_codes=True)
import matplotlib.pyplot as plt

Here we download the dataset from Kaggle (opendatasets may prompt for your Kaggle username and API key).

In [4]:

dataset_url = 'https://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting'
od.download(dataset_url)
Skipping, found downloaded files in "./walmart-recruiting-store-sales-forecasting" (use force=True to force download)

Let’s review the downloaded files.

In [5]:

os.listdir('walmart-recruiting-store-sales-forecasting')

Out[5]:

['test.csv.zip',
 'train.csv.zip',
 'stores.csv',
 'features.csv.zip',
 'sampleSubmission.csv.zip']

1. stores.csv: This file contains anonymized information about the 45 stores, indicating the type and size of store.

2. train.csv: This is the historical training data, which covers 2010-02-05 to 2012-11-01. Within this file we will find the following fields:

  • Store – the store number
  • Dept – the department number
  • Date – the week
  • Weekly_Sales – sales for the given department in the given store
  • IsHoliday – whether the week is a special holiday week

3. test.csv: This file is identical to train.csv, except that the weekly sales have been withheld. We must predict the sales for each triplet of store, department, and date in this file.

4. features.csv: This file contains additional data related to the store, department, and regional activity for the given dates. It contains the following fields:

  • Store – the store number
  • Date – the week
  • Temperature – average temperature in the region
  • Fuel_Price – cost of fuel in the region
  • MarkDown1-5 – anonymized data related to promotional markdowns that Walmart is running. MarkDown data is only available after Nov 2011, and is not available for all stores all the time. Any missing value is marked with an NA.
  • CPI – the consumer price index
  • Unemployment – the unemployment rate
  • IsHoliday – whether the week is a special holiday week

Now, we open the zip files and load the datasets that we will use.

In [6]:

train_zip_file = ZipFile('walmart-recruiting-store-sales-forecasting/train.csv.zip')
features_zip_file = ZipFile('walmart-recruiting-store-sales-forecasting/features.csv.zip')
test_zip_file = ZipFile('walmart-recruiting-store-sales-forecasting/test.csv.zip')
sample_submission_zip_file = ZipFile('walmart-recruiting-store-sales-forecasting/sampleSubmission.csv.zip')

In [7]:

train_df = pd.read_csv(train_zip_file.open('train.csv'))
features_df = pd.read_csv(features_zip_file.open('features.csv'))
stores_df = pd.read_csv('walmart-recruiting-store-sales-forecasting/stores.csv')
test_df = pd.read_csv(test_zip_file.open('test.csv'))
submission_df = pd.read_csv(sample_submission_zip_file.open('sampleSubmission.csv'))

Before starting to work with the data, we merge the train, stores and features files, so that all the available input variables are in a single DataFrame (and we do the same for the test data).

In [8]:

dataset = train_df.merge(stores_df, how='left').merge(features_df, how='left')
test_dataset = test_df.merge(stores_df, how='left').merge(features_df, how='left')
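
The merges use the columns the tables share (Store for stores.csv; Store, Date and IsHoliday for features.csv), so each row of train.csv should map to exactly one merged row. A quick optional sanity check of that assumption:

# The left joins should not duplicate or drop any rows.
assert len(dataset) == len(train_df)
assert len(test_dataset) == len(test_df)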

Data exploration

Let’s have a look at our dataset.

In [9]:

dataset.head()

Out[9]:

   Store  Dept        Date  Weekly_Sales  IsHoliday Type    Size  Temperature  Fuel_Price  MarkDown1  MarkDown2  MarkDown3  MarkDown4  MarkDown5         CPI  Unemployment
0      1     1  2010-02-05      24924.50      False    A  151315        42.31       2.572        NaN        NaN        NaN        NaN        NaN  211.096358         8.106
1      1     1  2010-02-12      46039.49       True    A  151315        38.51       2.548        NaN        NaN        NaN        NaN        NaN  211.242170         8.106
2      1     1  2010-02-19      41595.55      False    A  151315        39.93       2.514        NaN        NaN        NaN        NaN        NaN  211.289143         8.106
3      1     1  2010-02-26      19403.54      False    A  151315        46.63       2.561        NaN        NaN        NaN        NaN        NaN  211.319643         8.106
4      1     1  2010-03-05      21827.90      False    A  151315        46.50       2.625        NaN        NaN        NaN        NaN        NaN  211.350143         8.106

We can identify the following input variables:

  • Store
  • Dept
  • Date
  • IsHoliday
  • Type
  • Size
  • Temperature
  • Fuel_Price
  • MarkDown1
  • MarkDown2
  • MarkDown3
  • MarkDown4
  • MarkDown5
  • CPI
  • Unemployment

The target variable is Weekly_Sales.

Now let’s review some summary statistics of the dataset.

In [10]:

dataset.describe(include = 'all')

Out[10]:

                Store           Dept        Date   Weekly_Sales  IsHoliday    Type           Size    Temperature     Fuel_Price      MarkDown1      MarkDown2      MarkDown3      MarkDown4      MarkDown5            CPI   Unemployment
count   421570.000000  421570.000000      421570  421570.000000     421570  421570  421570.000000  421570.000000  421570.000000  150681.000000  111248.000000  137091.000000  134967.000000  151432.000000  421570.000000  421570.000000
unique            NaN            NaN         143            NaN          2       3            NaN            NaN            NaN            NaN            NaN            NaN            NaN            NaN            NaN            NaN
top               NaN            NaN  2011-12-23            NaN      False       A            NaN            NaN            NaN            NaN            NaN            NaN            NaN            NaN            NaN            NaN
freq              NaN            NaN        3027            NaN     391909  215478            NaN            NaN            NaN            NaN            NaN            NaN            NaN            NaN            NaN            NaN
mean        22.200546      44.260317         NaN   15981.258123        NaN     NaN  136727.915739      60.090059       3.361027    7246.420196    3334.628621    1439.421384    3383.168256    4628.975079     171.201947       7.960289
std         12.785297      30.492054         NaN   22711.183519        NaN     NaN   60980.583328      18.447931       0.458515    8291.221345    9475.357325    9623.078290    6292.384031    5962.887455      39.159276       1.863296
min          1.000000       1.000000         NaN   -4988.940000        NaN     NaN   34875.000000      -2.060000       2.472000       0.270000    -265.760000     -29.100000       0.220000     135.160000     126.064000       3.879000
25%         11.000000      18.000000         NaN    2079.650000        NaN     NaN   93638.000000      46.680000       2.933000    2240.270000      41.600000       5.080000     504.220000    1878.440000     132.022667       6.891000
50%         22.000000      37.000000         NaN    7612.030000        NaN     NaN  140167.000000      62.090000       3.452000    5347.450000     192.000000      24.600000    1481.310000    3359.450000     182.318780       7.866000
75%         33.000000      74.000000         NaN   20205.852500        NaN     NaN  202505.000000      74.280000       3.738000    9210.900000    1926.940000     103.990000    3595.040000    5563.800000     212.416993       8.572000
max         45.000000      99.000000         NaN  693099.360000        NaN     NaN  219622.000000     100.140000       4.468000   88646.760000  104519.540000  141630.610000   67474.850000  108519.280000     227.232807      14.313000

Null Values

There are some null values, so let’s review them.

In [11]:

# Checking the NaN percentage
dataset.isnull().mean() * 100

Out[11]:

Store            0.000000
Dept             0.000000
Date             0.000000
Weekly_Sales     0.000000
IsHoliday        0.000000
Type             0.000000
Size             0.000000
Temperature      0.000000
Fuel_Price       0.000000
MarkDown1       64.257181
MarkDown2       73.611025
MarkDown3       67.480845
MarkDown4       67.984676
MarkDown5       64.079038
CPI              0.000000
Unemployment     0.000000
dtype: float64

We will handle these null values (by filling the missing MarkDown values with 0) in the Data Manipulation section.

Correlation of the Input Variables with the Target Weekly_Sales

In [12]:

corr = dataset.corr()
f, ax = plt.subplots(figsize=(15, 15))
cmap = sns.diverging_palette(220, 20, as_cmap=True)
sns.heatmap(corr, cmap=cmap, vmax=.3, center=0, annot=True,
            square=True, linewidths=.5, cbar_kws={'shrink': .5})
plt.show()

Looking at the correlation matrix, we can see that Weekly_Sales has its strongest correlations with Store, Dept and Size.
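
To read the relevant values directly instead of estimating them from the heatmap, we can also sort the absolute correlations against Weekly_Sales (optional; on recent pandas versions the dataset.corr() call above may require numeric_only=True):

# Absolute correlation of every numeric column with the target, strongest first.
corr['Weekly_Sales'].abs().sort_values(ascending=False)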

We will drop the variables with lower correlation in the Data Manipulation section.

Data Manipulation

In [13]:

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.model_selection import train_test_split

Now, we will do the following steps:

  • Fill the null values in the MarkDown variables with 0.
  • Create variables for year, month and week, based on the date field.
  • Remove the variables with low correlation.

In [14]:

dataset[['MarkDown1','MarkDown2','MarkDown3','MarkDown4', 'MarkDown5']] = dataset[['MarkDown1','MarkDown2','MarkDown3','MarkDown4','MarkDown5']].fillna(0)
dataset['Year'] = pd.to_datetime(dataset['Date']).dt.year
dataset['Month'] = pd.to_datetime(dataset['Date']).dt.month
dataset['Week'] = pd.to_datetime(dataset['Date']).dt.isocalendar().week
dataset = dataset.drop(columns=["Date", "CPI", "Fuel_Price", 'Unemployment', 'Temperature'])

We can move the target variable to the last column of the DataFrame, so that we can easily split the inputs and the target in the next step.

In [15]:

weekly_sales = dataset.pop('Weekly_Sales')
dataset['Weekly_Sales'] = weekly_sales

Here, we identify inputs and target columns.

In [16]:

input_cols, target_col = dataset.columns[:-1], dataset.columns[-1]
inputs_df, targets = dataset[input_cols].copy(), dataset[target_col].copy()

Now, we identify numeric and categorical columns.

In [17]:

numeric_cols = dataset[input_cols].select_dtypes(include=np.number).columns.tolist()
categorical_cols = dataset[input_cols].select_dtypes(include='object').columns.tolist()

Here, we impute (fill) and scale numeric columns.

In [18]:

imputer = SimpleImputer().fit(inputs_df[numeric_cols])
inputs_df[numeric_cols] = imputer.transform(inputs_df[numeric_cols])
scaler = MinMaxScaler().fit(inputs_df[numeric_cols])
inputs_df[numeric_cols] = scaler.transform(inputs_df[numeric_cols])

We can only use numeric data to train our models, which is why we apply a technique called “one-hot encoding” to our categorical columns.

One hot encoding involves adding a new binary (0/1) column for each unique category of a categorical column.

In [19]:

encoder = OneHotEncoder(sparse=False, handle_unknown='ignore').fit(inputs_df[categorical_cols])
encoded_cols = list(encoder.get_feature_names(categorical_cols))
inputs_df[encoded_cols] = encoder.transform(inputs_df[categorical_cols])
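
Since Type is our only object-typed column here, the encoder simply adds one 0/1 column per store type. We can confirm the new column names (note that newer scikit-learn versions use sparse_output=False and get_feature_names_out instead):

# The categorical column 'Type' expands into the binary columns Type_A, Type_B and Type_C.
print(encoded_cols)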

Finally, let’s split the dataset into a training and validation set. We’ll use a randomly selected 25% subset of the data for validation. Also, we’ll use just the numeric and encoded columns, since the inputs to our model must be numbers.

In [20]:

train_inputs, val_inputs, train_targets, val_targets = train_test_split(
    inputs_df[numeric_cols + encoded_cols], targets, test_size=0.25, random_state=42)
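
A quick look at the resulting shapes confirms the 75/25 split:

# Roughly 316k training rows and 105k validation rows out of the 421,570 rows in total.
print(train_inputs.shape, val_inputs.shape)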

Models

Now that we have our training and validation set, we will review two models of machine learning:

  • Decision Tree
  • Random Forest

Based on the results, we will pick one of them.

Decision Tree

A decision tree is a decision support tool that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm that only contains conditional control statements.

To create our decision tree model, we can use scikit-learn’s DecisionTreeRegressor class.

In [21]:

from sklearn.tree import DecisionTreeRegressor

In [22]:

tree = DecisionTreeRegressor(random_state=0)

Now, we fit our model to the training data.

In [23]:

%%time
tree.fit(train_inputs, train_targets)
CPU times: user 2.17 s, sys: 26.8 ms, total: 2.2 s
Wall time: 2.19 s

Out[23]:

DecisionTreeRegressor(random_state=0)

We generate predictions on the training and validation sets using the trained decision tree, and compute the Root Mean Squared Error (RMSE) loss.
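
As a reminder, the RMSE is just the square root of the mean of the squared prediction errors. A minimal NumPy equivalent of mean_squared_error(..., squared=False), the scikit-learn call used below, looks like this:

# Hand-rolled RMSE, equivalent to sklearn's mean_squared_error with squared=False.
def rmse(y_true, y_pred):
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))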

In [24]:

from sklearn.metrics import mean_squared_error

In [25]:

tree_train_preds = tree.predict(train_inputs)

In [26]:

tree_train_rmse = mean_squared_error(train_targets, tree_train_preds, squared=False)

In [27]:

tree_val_preds = tree.predict(val_inputs)

In [28]:

tree_val_rmse = mean_squared_error(val_targets, tree_val_preds, squared=False)

In [29]:

print('Train RMSE: {}, Validation RMSE: {}'.format(tree_train_rmse, tree_val_rmse))
Train RMSE: 6.045479193458183e-20, Validation RMSE: 5441.340994336662

Here, the RMSE loss for our training data is essentially zero (about 6e-20), while the RMSE loss for our validation data is about 5441. The unconstrained tree has effectively memorized the training data, so it is overfitting.

Decision tree visualization

In [30]:

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree, export_text
import seaborn as sns
sns.set_style('darkgrid')
%matplotlib inline

Let’s visualize the tree graphically using plot_tree.

In [31]:

plt.figure(figsize=(30,15))
plot_tree(tree, feature_names=train_inputs.columns, max_depth=3, filled=True);
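
We can also check how deep the fully grown tree is and how many leaves it has (optional):

# Depth and number of leaves of the unconstrained tree.
print(tree.get_depth(), tree.get_n_leaves())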

Now, let’s visualize the tree textually using export_text.

In [32]:

tree_text = export_text(tree, feature_names=list(train_inputs.columns))

Here we can display the first few lines.

In [33]:

print(tree_text[:2000])
|--- Dept <= 0.89
|   |--- Dept <= 0.13
|   |   |--- Size <= 0.35
|   |   |   |--- Size <= 0.08
|   |   |   |   |--- Dept <= 0.02
|   |   |   |   |   |--- Dept <= 0.01
|   |   |   |   |   |   |--- Type_A <= 0.50
|   |   |   |   |   |   |   |--- Week <= 0.75
|   |   |   |   |   |   |   |   |--- Week <= 0.32
|   |   |   |   |   |   |   |   |   |--- MarkDown5 <= 0.01
|   |   |   |   |   |   |   |   |   |   |--- Store <= 0.94
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 14
|   |   |   |   |   |   |   |   |   |   |--- Store >  0.94
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 13
|   |   |   |   |   |   |   |   |   |--- MarkDown5 >  0.01
|   |   |   |   |   |   |   |   |   |   |--- MarkDown5 <= 0.01
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 6
|   |   |   |   |   |   |   |   |   |   |--- MarkDown5 >  0.01
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 14
|   |   |   |   |   |   |   |   |--- Week >  0.32
|   |   |   |   |   |   |   |   |   |--- Store <= 0.07
|   |   |   |   |   |   |   |   |   |   |--- MarkDown5 <= 0.00
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 10
|   |   |   |   |   |   |   |   |   |   |--- MarkDown5 >  0.00
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 9
|   |   |   |   |   |   |   |   |   |--- Store >  0.07
|   |   |   |   |   |   |   |   |   |   |--- Store <= 0.83
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 15
|   |   |   |   |   |   |   |   |   |   |--- Store >  0.83
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 15
|   |   |   |   |   |   |   |--- Week >  0.75
|   |   |   |   |   |   |   |   |--- Week <= 0.85
|   |   |   |   |   |   |   |   |   |--- Week <= 0.81
|   |   |   |   |   |   |   |   |   |   |--- Store <= 0.07
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 5
|   |   |   |   | 

Decision Tree feature importance

Let’s look at the weights assigned to different columns, to figure out which columns in the dataset are the most important.

In [34]:

tree_importances = tree.feature_importances_

In [35]:

tree_importance_df = pd.DataFrame({
    'feature': train_inputs.columns,
    'importance': tree_importances
}).sort_values('importance', ascending=False)

In [36]:

tree_importance_df

Out[36]:

      feature  importance
1        Dept    0.636039
2        Size    0.198802
0       Store    0.072103
10       Week    0.055737
12     Type_B    0.013551
5   MarkDown3    0.004337
9       Month    0.003614
8        Year    0.003393
6   MarkDown4    0.003347
11     Type_A    0.003196
7   MarkDown5    0.002141
3   MarkDown1    0.001904
4   MarkDown2    0.001453
13     Type_C    0.000383

In [37]:

plt.title('Decision Tree Feature Importance')
sns.barplot(data=tree_importance_df.head(10), x='importance', y='feature');

The variables Dept, Size, Store and Week are the most important for this model.

Random Forests

Random forests are an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time.

For classification tasks, the output of the random forest is the class selected by most trees.

For regression tasks, the mean or average prediction of the individual trees is returned.

To create our random forest model, we can use scikit-learn’s RandomForestRegressor class.

In [38]:

from sklearn.ensemble import RandomForestRegressor

When I created the random forest with the default number of estimators (100), the Jupyter notebook crashed due to a lack of memory, so let’s start with 10 estimators.

In [39]:

rf1 = RandomForestRegressor(random_state=0, n_estimators=10)

Now, we fit our model to the training data.

In [40]:

%%time
rf1.fit(train_inputs, train_targets)
CPU times: user 13.1 s, sys: 112 ms, total: 13.2 s
Wall time: 13.2 s

Out[40]:

RandomForestRegressor(n_estimators=10, random_state=0)
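
As described above, a random forest regressor’s prediction is just the average of the predictions of its individual trees, which are stored in the estimators_ attribute. A quick optional check on a few validation rows:

# Averaging the 10 trees by hand reproduces the forest's prediction.
rows = val_inputs.values[:3]
by_hand = np.mean([t.predict(rows) for t in rf1.estimators_], axis=0)
print(np.allclose(by_hand, rf1.predict(rows)))  # should print True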

Now we generate predictions on the training and validation sets using the trained random forest, and compute the Root Mean Squared Error (RMSE) loss.

In [41]:

rf1_train_preds = rf1.predict(train_inputs)

In [42]:

rf1_train_rmse = mean_squared_error(train_targets, rf1_train_preds, squared=False)

In [43]:

rf1_val_preds = rf1.predict(val_inputs)

In [44]:

rf1_val_rmse = mean_squared_error(val_targets, rf1_val_preds, squared=False)

In [45]:

print('Train RMSE: {}, Validation RMSE: {}'.format(rf1_train_rmse, rf1_val_rmse))
Train RMSE: 1620.993367981347, Validation RMSE: 3997.6712441772224

Here, the RMSE loss for our training data is about 1621, and the RMSE loss for our validation data is about 3998. Unlike the decision tree, the forest does not fit the training data perfectly, but it generalizes noticeably better.

The random forest model shows better results for the validation RMSE, so we will use that model.

Hyperparameter Tuning

Let’s define a helper function test_params which can test the given value of one or more hyperparameters.

For these new random forest models, I will use 16 estimators.

In [46]:

def test_params(**params):
    model = RandomForestRegressor(random_state=0, n_jobs=-1, n_estimators=16, **params).fit(train_inputs, train_targets)
    train_rmse = mean_squared_error(model.predict(train_inputs), train_targets, squared=False)
    val_rmse = mean_squared_error(model.predict(val_inputs), val_targets, squared=False)
    return train_rmse, val_rmse

Let’s also define a helper function to test and plot different values of a single parameter.

In [47]:

def test_param_and_plot(param_name, param_values):
    train_errors, val_errors = [], [] 
    for value in param_values:
        params = {param_name: value}
        train_rmse, val_rmse = test_params(**params)
        train_errors.append(train_rmse)
        val_errors.append(val_rmse)
    plt.figure(figsize=(10,6))
    plt.title('Overfitting curve: ' + param_name)
    plt.plot(param_values, train_errors, 'b-o')
    plt.plot(param_values, val_errors, 'r-o')
    plt.xlabel(param_name)
    plt.ylabel('RMSE')
    plt.legend(['Training', 'Validation'])

In [48]:

test_params()

Out[48]:

(1538.6031204571725, 3946.105075819126)

With 16 estimators we already get slightly better results than with 10.

In [49]:

test_param_and_plot('min_samples_leaf', [1, 2, 3, 4, 5])

In [50]:

test_params(min_samples_leaf = 5)

Out[50]:

(3139.860605280673, 4191.528032848276)

Here, we can see how the RMSE increases with the min_samples_leaf parameter, so we will use the default value (1).

In [51]:

test_param_and_plot('max_leaf_nodes', [20, 25, 30, 35, 40])

In [52]:

test_params(max_leaf_nodes = 20)

Out[52]:

(13172.46592071017, 13485.14172240374)

The RMSE decreases as max_leaf_nodes increases, so we will keep the default value (None, i.e. unlimited).

In [53]:

test_param_and_plot('max_depth', [5, 10, 15, 20, 25])

In [54]:

test_params(max_depth = 10)

Out[54]:

(7597.643155388232, 8145.895516784769)

The RMSE decreases as max_depth increases, so we will keep the default value (None, i.e. unlimited depth).
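
The manual search above varies one hyperparameter at a time. As an optional alternative (a sketch only, and potentially slow and memory-hungry on this dataset), scikit-learn’s GridSearchCV can cross-validate several combinations at once:

from sklearn.model_selection import GridSearchCV

param_grid = {'max_depth': [10, 20, None], 'min_samples_leaf': [1, 2, 5]}
search = GridSearchCV(
    RandomForestRegressor(n_estimators=16, random_state=0, n_jobs=-1),
    param_grid, scoring='neg_root_mean_squared_error', cv=3)
# search.fit(train_inputs, train_targets)           # uncomment to run the search
# print(search.best_params_, -search.best_score_)   # best combination and its CV RMSE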

Training the Best Model

We create a new random forest model with the chosen hyperparameters: 16 estimators and the default values for everything else.

In [55]:

rf2 = RandomForestRegressor(n_estimators=16, random_state = 0, min_samples_leaf = 1)

Now we train the model.

In [56]:

rf2.fit(train_inputs, train_targets)

Out[56]:

RandomForestRegressor(n_estimators=16, random_state=0)

Now we generate predictions for the final model.

In [57]:

rf2_train_preds = rf2.predict(train_inputs)

In [58]:

rf2_train_rmse = mean_squared_error(train_targets, rf2_train_preds, squared=False)

In [59]:

rf2_val_preds = rf2.predict(val_inputs)

In [60]:

rf2_val_rmse = mean_squared_error(val_targets, rf2_val_preds, squared=False)

In [61]:

print('Train RMSE: {}, Validation RMSE: {}'.format(rf2_train_rmse, rf2_val_rmse))
Train RMSE: 1538.6031204571725, Validation RMSE: 3946.105075819126

Here, we can see a further decrease in the RMSE loss compared with the first random forest.

Random Forest Feature Importance

Let’s look at the weights assigned to different columns, to figure out which columns in the dataset are the most important for this model.

In [62]:

rf2_importance_df = pd.DataFrame({
    'feature': train_inputs.columns,
    'importance': rf2.feature_importances_
}).sort_values('importance', ascending=False)

In [63]:

rf2_importance_df

Out[63]:

      feature  importance
1        Dept    0.627801
2        Size    0.205967
0       Store    0.073425
10       Week    0.050421
12     Type_B    0.011083
5   MarkDown3    0.007831
9       Month    0.007729
8        Year    0.003370
11     Type_A    0.002985
6   MarkDown4    0.002700
7   MarkDown5    0.002547
3   MarkDown1    0.001938
4   MarkDown2    0.001794
13     Type_C    0.000409

In [64]:

sns.barplot(data=rf2_importance_df, x='importance', y='feature')

Out[64]:

<AxesSubplot:xlabel='importance', ylabel='feature'>

The variables Dept, Size, Store and Week are the most important for this model.

Making Predictions on the Test Set

Let’s make predictions on the test set provided with the data.

First, we need to reapply all the preprocessing steps.

In [65]:

test_dataset[['MarkDown1','MarkDown2','MarkDown3','MarkDown4', 'MarkDown5']] = test_dataset[['MarkDown1','MarkDown2','MarkDown3','MarkDown4','MarkDown5']].fillna(0)
test_dataset['Year'] = pd.to_datetime(test_dataset['Date']).dt.year
test_dataset['Month'] = pd.to_datetime(test_dataset['Date']).dt.month
test_dataset['Week'] = pd.to_datetime(test_dataset['Date']).dt.isocalendar().week
test_dataset = test_dataset.drop(columns=["Date", "CPI", "Fuel_Price", 'Unemployment', 'Temperature'])

In [66]:

test_dataset

Out[66]:

        Store  Dept  IsHoliday Type    Size  MarkDown1  MarkDown2  MarkDown3  MarkDown4  MarkDown5  Year  Month  Week
0           1     1      False    A  151315    6766.44    5147.70      50.82    3639.90    2737.42  2012     11    44
1           1     1      False    A  151315   11421.32    3370.89      40.28    4646.79    6154.16  2012     11    45
2           1     1      False    A  151315    9696.28     292.10     103.78    1133.15    6612.69  2012     11    46
3           1     1       True    A  151315     883.59       4.17   74910.32     209.91     303.32  2012     11    47
4           1     1      False    A  151315    2460.03       0.00    3838.35     150.57    6966.34  2012     11    48
...       ...   ...        ...  ...     ...        ...        ...        ...        ...        ...   ...    ...   ...
115059     45    98      False    B  118221    4842.29     975.03       3.00    2449.97    3169.69  2013      6    26
115060     45    98      False    B  118221    9090.48    2268.58     582.74    5797.47    1514.93  2013      7    27
115061     45    98      False    B  118221    3789.94    1827.31      85.72     744.84    2150.36  2013      7    28
115062     45    98      False    B  118221    2961.49    1047.07     204.19     363.00    1059.46  2013      7    29
115063     45    98      False    B  118221     212.02     851.73       2.06      10.88    1864.57  2013      7    30

115064 rows × 13 columns

In [67]:

test_dataset[numeric_cols] = imputer.transform(test_dataset[numeric_cols])
test_dataset[numeric_cols] = scaler.transform(test_dataset[numeric_cols])
test_dataset[encoded_cols] = encoder.transform(test_dataset[categorical_cols])

In [68]:

test_inputs = test_dataset[numeric_cols + encoded_cols]
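
Before predicting, it is worth confirming that the test inputs have exactly the same columns, in the same order, as the inputs the model was trained on (a minimal sanity check):

# The model expects the training column layout: numeric_cols followed by encoded_cols.
assert list(test_inputs.columns) == list(train_inputs.columns)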

We can now make predictions using our final model.

In [69]:

test_preds = rf2.predict(test_inputs)

Let’s replace the values of the Weekly_Sales column with our predictions.

In [70]:

submission_df['Weekly_Sales'] = test_preds

Let’s save it as a CSV file and download it.

In [71]:

submission_df.to_csv('submission.csv', index=False)

In [72]:

from IPython.display import FileLink
FileLink('submission.csv') # Doesn't work on Colab, use the file browser instead to download the file.

Out[72]:

submission.csv

Making Predictions on Single Inputs

In [73]:

def predict_input(model, single_input):
    input_df = pd.DataFrame([single_input])
    input_df[numeric_cols] = imputer.transform(input_df[numeric_cols])
    input_df[numeric_cols] = scaler.transform(input_df[numeric_cols])
    input_df[encoded_cols] = encoder.transform(input_df[categorical_cols].values)
    return model.predict(input_df[numeric_cols + encoded_cols])[0]

In [74]:

sample_input = {'Store':9, 'Dept':72, 'IsHoliday':True, 'Type':'B', 'Size':125833, 'MarkDown1':2.5, 'MarkDown2':0.02, 
                'MarkDown3':55952.99, 'MarkDown4':14.64, 'MarkDown5':310.72, 'Year':2012, 'Month':11, 'Week':47}

In [75]:

predicted_sales = predict_input(rf2, sample_input)

In [76]:

print('The predicted weekly sales is ${}'.format(predicted_sales))
The predicted weekly sales is $475908.2475000001

Saving the Model

In [77]:

import joblib

In [78]:

walmart_sales_rf = {
    'model': rf2,
    'imputer': imputer,
    'scaler': scaler,
    'encoder': encoder,
    'input_cols': input_cols,
    'target_col': target_col,
    'numeric_cols': numeric_cols,
    'categorical_cols': categorical_cols,
    'encoded_cols': encoded_cols
}

In [79]:

joblib.dump(walmart_sales_rf, 'walmart_sales_rf.joblib')

Out[79]:

['walmart_sales_rf.joblib']
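
To reuse the saved artifacts later (for example in another notebook), we can load the dictionary back and predict with the stored model; a minimal sketch:

# Load the saved dictionary and predict with the stored model.
saved = joblib.load('walmart_sales_rf.joblib')
print(saved['model'].predict(test_inputs[:5]))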

References

Check out the following resources to learn more: