
Scaling, Centering and Standardization


Centering or scaling variables can be advantageous in regression, although how, when, and what to standardize is largely a matter of convention that varies by scientific field. See this post for further discussion: https://statmodeling.stat.columbia.edu/2009/07/11/when_to_standar/

First, when regression is used to explain a phenomenon, interpreting the intercept matters. Centering gives each predictor a mean of zero, so the intercept becomes the expected value of Y when every predictor is at its mean. Without centering, the intercept is the expected value of Y when every predictor equals zero, which may not be a feasible or sensible combination of values.
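To make this concrete, here is a minimal sketch with synthetic data (the numbers are made up purely for illustration): after centering the predictor, the fitted intercept is simply the mean of Y, i.e. the expected response for an average observation.

#Minimal sketch with synthetic data: after centering the predictor,
#the OLS intercept equals the expected value of Y at the mean of X.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(50, 10, 200)                       #predictor far from zero
y = 3 + 0.5*x + rng.normal(0, 1, 200)

fit_raw = sm.OLS(y, sm.add_constant(x)).fit()
fit_centered = sm.OLS(y, sm.add_constant(x - x.mean())).fit()
print(fit_raw.params[0])                          #intercept at x = 0: hard to interpret here
print(fit_centered.params[0], y.mean())           #intercept at x = mean(x): equals the mean of y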

Second, standardization makes it easier to compare coefficients of variables measured on very different scales, e.g. when one regressor is measured in tiny units and another in very large ones. After standardization, all regression coefficients are expressed in the same (standard-deviation) units. Conveniently, centering or scaling has no impact on the p-values of the slope coefficients, so the model statistics can be interpreted just as they would be without centering or scaling.

Third, when creating sums or averages of variables measured on different scales, it may be necessary to rescale them to a common unit first.

Finally, other methods, such as Principal Components Analysis, may require centering or scaling. In PCA, the loadings are the weights by which each standardized original variable is multiplied to obtain a component score, and standardization affects the eigenvectors that define the orthogonal directions of the principal components. For now, we will focus on regression only.

Feature scaling is straightforward in Python. Note that it is recommended to split the data into training and test sets BEFORE scaling. If scaling is done on the full dataset first, every observation is scaled around the mean of the entire sample, which leaks information from the test set and generally differs from the training-set mean.
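A minimal sketch of that workflow, assuming the Boston housing features and target used throughout this post (boston_features_df and boston_target_df): the scaler is fitted on the training partition only and then applied to both partitions.

#Fit the scaler on the training data only, then transform both partitions
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(
    boston_features_df, boston_target_df, test_size=0.20, random_state=5)

scaler = StandardScaler().fit(X_train)        #learns the training means and standard deviations
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)      #no information from the test set leaks into the scaler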

Standardization:

One of the most frequently used scaling methods is standardization.  During standardization, we subtract the mean from each value and then divide by the standard deviation.  The result is a variable with a mean of zero and a standard deviation of one.
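In other words, each value is transformed as z = (x - mean) / standard deviation. A quick NumPy check on an arbitrary made-up array reproduces the effect:

#Manual standardization: z = (x - mean) / standard deviation
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
z = (x - x.mean()) / x.std()      #np.std defaults to the population standard deviation, as StandardScaler does
print(z)
print(z.mean(), z.std())          #approximately 0 and exactly 1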

Standardization can also improve model performance: an unstandardized feature with a very large variance may dominate the others, leading to subpar results.

Scikit-learn offers several scalers: a.) StandardScaler, b.) MinMaxScaler, c.) MaxAbsScaler and d.) RobustScaler.

Standard Scaler

StandardScaler, as its name suggests, is the standard, garden-variety standardization tool. It centers the data by subtracting the mean of a variable from each observation and then divides by the variable's standard deviation. It is also possible to center without scaling (set with_std=False) or to scale without centering (set with_mean=False).

#STANDARDIZATION #########################
#Standard Scaler
#centers data by removing the mean, then scales by dividing by the standard deviation
#boston_features_df holds the Boston housing features used throughout this post

import pandas as pd
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

boston_scaled_df=boston_features_df.copy()
boston_scaled_df=pd.DataFrame(scaler.fit_transform(boston_scaled_df), columns=boston_scaled_df.columns)
boston_scaled_df.head()
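The with_mean and with_std parameters mentioned above control the two halves of the transformation. A quick sketch of the centering-only and scaling-only variants on the same features (boston_centered_df and boston_scaleonly_df are just illustrative names):

#Center only: subtract the mean but keep the original spread
center_only = StandardScaler(with_std=False)
boston_centered_df = pd.DataFrame(center_only.fit_transform(boston_features_df),
                                  columns=boston_features_df.columns)

#Scale only: divide by the standard deviation but keep the original mean
scale_only = StandardScaler(with_mean=False)
boston_scaleonly_df = pd.DataFrame(scale_only.fit_transform(boston_features_df),
                                   columns=boston_features_df.columns)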
 
 
#Plot to demonstrate the effect of scaling on one of the variables: CRIM
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import matplotlib.gridspec as gridspec

fig1 = plt.figure(constrained_layout=True)
spec1 = gridspec.GridSpec(ncols=2, nrows=2, figure=fig1)

f1_ax1 = fig1.add_subplot(spec1[0, 0])
plt.scatter(boston_features_df['CRIM'], boston_target_df['MEDV'])
plt.xlabel('CRIM')
plt.ylabel("MEDV")

f1_ax2 = fig1.add_subplot(spec1[0, 1])
plt.scatter(boston_scaled_df['CRIM'], boston_target_df['MEDV'])
plt.xlabel('CRIM')
plt.ylabel('MEDV')
f1_ax3 = fig1.add_subplot(spec1[1, 0])
boston_features_df['CRIM'].plot(kind='hist',edgecolor='black',figsize=(6,3))
plt.title('CRIM', size=10)

f1_ax4 = fig1.add_subplot(spec1[1, 1])
boston_scaled_df['CRIM'].plot(kind='hist',edgecolor='black',figsize=(6,3))
plt.title('CRIM', size=10)
plt.show()
 
[Figure: CRIM vs. MEDV scatter plots and CRIM histograms, before and after standardization]
 

MinMax Scaler

The MinMaxScaler rescales each feature to a predetermined range.  It subtracts the minimum value of a variable from each observation and divides by the range (maximum minus minimum), then maps the result onto the specified feature_range, which defaults to 0-1.  This scaler is often preferred for non-normal distributions, but its drawback is its sensitivity to outliers.

#MinMax Scaler
#scaling each feature to a given range

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0,1))

boston_scaled2_df=boston_features_df.copy()
boston_scaled2_df=pd.DataFrame(scaler.fit_transform(boston_scaled2_df), columns=boston_scaled2_df.columns)
boston_scaled2_df.head()
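Under the hood, for the default 0-1 range this amounts to (x - min) / (max - min); a quick check on a single column should reproduce the scaler's output:

#Manual MinMax check on a single column
crim = boston_features_df['CRIM']
crim_manual = (crim - crim.min()) / (crim.max() - crim.min())
print(crim_manual.head())
print(boston_scaled2_df['CRIM'].head())     #should match the manual calculation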
 
 

MaxAbs Scaler

The MaxAbsScaler divides each observation by the maximum absolute value of that variable, so the transformed values fall in the [-1, 1] range.  Because it does not shift or center the data, it is most useful when the data is already centered around zero or is sparse.

#MaxAbs Scaler
#scales the data to a [-1,1] range based on the absolute maximum

from sklearn.preprocessing import MaxAbsScaler
scaler = MaxAbsScaler()

boston_scaled3_df=boston_features_df.copy()
boston_scaled3_df=pd.DataFrame(scaler.fit_transform(boston_scaled3_df), columns=boston_scaled3_df.columns)
boston_scaled3_df.head()
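Equivalently, each column is divided by its maximum absolute value, which can be verified on a single column:

#Manual MaxAbs check on a single column
crim = boston_features_df['CRIM']
print((crim / crim.abs().max()).head())
print(boston_scaled3_df['CRIM'].head())     #should match the manual calculation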
 
 

Robust Scaler

When the data contains many outliers, the mean and standard deviation are strongly influenced by them, and scaling with the scalers above can be problematic.  In that case, the RobustScaler may work better: it removes the median and scales the data according to a quantile range (by default the interquartile range), which can also be specified explicitly.

#Robust scaler
#removes the median and scales the data according to the quantile range

from sklearn.preprocessing import RobustScaler
robust = RobustScaler(quantile_range=(10.0, 90.0))   #quantile_range is given in percentiles (here the 10th to 90th)

boston_robust_df=boston_features_df.copy()
boston_robust_df=pd.DataFrame(robust.fit_transform(boston_robust_df), columns=boston_robust_df.columns)
boston_robust_df.head()
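The underlying formula is (x - median) / (upper quantile - lower quantile); with the 10th-90th percentile range used above, a single-column check looks like this:

#Manual robust-scaling check on a single column (10th to 90th percentile range, as above)
crim = boston_features_df['CRIM']
q10, q90 = crim.quantile(0.10), crim.quantile(0.90)
crim_manual = (crim - crim.median()) / (q90 - q10)
print(crim_manual.head())
print(boston_robust_df['CRIM'].head())      #should closely match the manual calculation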
 
 

Here, I want to quickly demonstrate that the coefficients differ depending on the scaler used, but the statistics describing the model remain the same. Take a look at the slope p-values or R-squared, for example.

As a demonstration, I printed the OLS regression table for three of the models. The first table contains statistics for the unscaled model, the second shows how the values change (or, in the case of slope p-values and R-squared, do not change) when the Standard Scaler is used, and the third shows the results of the MaxAbs scaler.

#Fit models to each standardized dataset
import sklearn
from sklearn.model_selection import train_test_split #importing sklearn does not automatically import its subpackages
from sklearn import linear_model
import statsmodels.api as sm
import numpy as np

#Partition the data
#Create training and test datasets
X1 = boston_features_df
X2 = boston_scaled_df
X3 = boston_scaled2_df
X4 = boston_scaled3_df
X5 = boston_robust_df
Y = boston_target_df

X1_train, X1_test, Y_train, Y_test = sklearn.model_selection.train_test_split(X1, Y, test_size = 0.20, random_state = 5)
X2_train, X2_test, Y_train, Y_test = sklearn.model_selection.train_test_split(X2, Y, test_size = 0.20, random_state = 5)
X3_train, X3_test, Y_train, Y_test = sklearn.model_selection.train_test_split(X3, Y, test_size = 0.20, random_state = 5)
X4_train, X4_test, Y_train, Y_test = sklearn.model_selection.train_test_split(X4, Y, test_size = 0.20, random_state = 5)
X5_train, X5_test, Y_train, Y_test = sklearn.model_selection.train_test_split(X5, Y, test_size = 0.20, random_state = 5)

#Train regression model: Unscaled
from sklearn.linear_model import LinearRegression

lin_mod1 = LinearRegression()
lin_mod1.fit(X1_train, Y_train)

#Create predictions using test features
Y1_pred = lin_mod1.predict(X1_test)

###### Standard Scaler
lin_mod2 = LinearRegression()
lin_mod2.fit(X2_train, Y_train)

#Create predictions using test features: Standard Scaler
Y2_pred = lin_mod2.predict(X2_test)

###### MinMax Scaler
lin_mod3 = LinearRegression()
lin_mod3.fit(X3_train, Y_train)

#Create predictions using test features: MinMax Scaler
Y3_pred = lin_mod3.predict(X3_test)

###### MaxAbs Scaler
lin_mod4 = LinearRegression()
lin_mod4.fit(X4_train, Y_train)

#Create predictions using test features: MaxAbs Scaler
Y4_pred = lin_mod4.predict(X4_test)

###### Robust Scaler
lin_mod5 = LinearRegression()
lin_mod5.fit(X5_train, Y_train)

#Create predictions using test features: Robust Scaler
Y5_pred = lin_mod5.predict(X5_test)

# Compute and print fit statistics
import sklearn
from sklearn import metrics

print('Mean Absolute Error (Y1 - Not Scaled):', metrics.mean_absolute_error(Y_test, Y1_pred))  
print('Mean Absolute Error (Y2 - Standard Scaler):', metrics.mean_absolute_error(Y_test, Y2_pred)) 
print('Mean Absolute Error (Y3 - MinMax Scaler):', metrics.mean_absolute_error(Y_test, Y3_pred))  
print('Mean Absolute Error (Y4 - MaxAbs Scaler):', metrics.mean_absolute_error(Y_test, Y4_pred))
print('Mean Absolute Error (Y5 - Robust Scaler):', metrics.mean_absolute_error(Y_test, Y5_pred))
print('')

print('Mean Squared Error (Y1 - Not Scaled):', metrics.mean_squared_error(Y_test, Y1_pred))  
print('Mean Squared Error (Y2 - Standard Scaler):', metrics.mean_squared_error(Y_test, Y2_pred))  
print('Mean Squared Error (Y3 - MinMax Scaler):', metrics.mean_squared_error(Y_test, Y3_pred))  
print('Mean Squared Error (Y4 - MaxAbs Scaler):', metrics.mean_squared_error(Y_test, Y4_pred))
print('Mean Squared Error (Y5 - Robust Scaler):', metrics.mean_squared_error(Y_test, Y5_pred)) 
print('')

print('Root Mean Squared Error (Y1 - Not Scaled):', np.sqrt(metrics.mean_squared_error(Y_test, Y1_pred)))
print('Root Mean Squared Error (Y2 - Standard Scaler):', np.sqrt(metrics.mean_squared_error(Y_test, Y2_pred)))
print('Root Mean Squared Error (Y3 - MinMax Scaler):', np.sqrt(metrics.mean_squared_error(Y_test, Y3_pred)))
print('Root Mean Squared Error (Y4 - MaxAbs Scaler):', np.sqrt(metrics.mean_squared_error(Y_test, Y4_pred)))
print('Root Mean Squared Error (Y5 - Robust Scaler):', np.sqrt(metrics.mean_squared_error(Y_test, Y5_pred)))
Mean Absolute Error (Y1 - Not Scaled): 3.21327049584237
Mean Absolute Error (Y2 - Standard Scaler): 3.2132704958423823
Mean Absolute Error (Y3 - MinMax Scaler): 3.2132704958423863
Mean Absolute Error (Y4 - MaxAbs Scaler): 3.213270495842384
Mean Absolute Error (Y5 - Robust Scaler): 3.213270495842394

Mean Squared Error (Y1 - Not Scaled): 20.869292183770682
Mean Squared Error (Y2 - Standard Scaler): 20.86929218377084
Mean Squared Error (Y3 - MinMax Scaler): 20.86929218377086
Mean Squared Error (Y4 - MaxAbs Scaler): 20.869292183770842
Mean Squared Error (Y5 - Robust Scaler): 20.86929218377092

Root Mean Squared Error (Y1 - Not Scaled): 4.568292042303193
Root Mean Squared Error (Y2 - Standard Scaler): 4.56829204230321
Root Mean Squared Error (Y3 - MinMax Scaler): 4.568292042303213
Root Mean Squared Error (Y4 - MaxAbs Scaler): 4.568292042303211
Root Mean Squared Error (Y5 - Robust Scaler): 4.568292042303219
#Print Model Data

#Model statistics: Unscaled
model1 = sm.OLS(Y_train, sm.add_constant(X1_train)).fit()
print_model1 = model1.summary()
print(print_model1)

#Model statistics: Standard
model2 = sm.OLS(Y_train, sm.add_constant(X2_train)).fit()
print_model2 = model2.summary()
print(print_model2)

#Model statistics: MinMax
model3 = sm.OLS(Y_train, sm.add_constant(X3_train)).fit()
print_model3 = model3.summary()
print(print_model3)

#Model statistics: MaxAbs
model4 = sm.OLS(Y_train, sm.add_constant(X4_train)).fit()
print_model4 = model4.summary()
print(print_model4)

#Model statistics: Robust
model5 = sm.OLS(Y_train, sm.add_constant(X5_train)).fit()
print_model5 = model5.summary()
print(print_model5)
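To see the earlier claim directly, the unscaled and standardized fits can also be compared side by side: the coefficients change, while the slope p-values do not. (The intercept's p-value does change, because after centering the intercept refers to an observation at the mean of every predictor.)

#Compare the unscaled and Standard Scaler fits: coefficients differ, slope p-values do not
comparison = pd.DataFrame({'coef_unscaled': model1.params,
                           'coef_standard': model2.params.values,
                           'pval_unscaled': model1.pvalues.values,
                           'pval_standard': model2.pvalues.values})
print(comparison)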
 
[OLS regression results: unscaled model]

[OLS regression results: Standard Scaler model (note that the collinearity warning is no longer shown)]

[OLS regression results: MaxAbs Scaler model]