Study the data pool and form a customer segmentation to support business strategy formulation by various business functions, especially the Marketing and Finance departments
IIT Kanpur
Mohammad Asad 22358019
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()
from scipy.stats import boxcox
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=FutureWarning)
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.arima.model import ARIMA  # the old statsmodels.tsa.arima_model module has been removed in recent versions
data = pd.read_csv("ecomdata.csv", encoding="ISO-8859-1")
data.shape
(541909, 8)
data.head(3)
InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | |
---|---|---|---|---|---|---|---|---|
0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 12/1/2010 8:26 | 2.55 | 17850.0 | United Kingdom |
1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 12/1/2010 8:26 | 2.75 | 17850.0 | United Kingdom |
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype
---  ------       --------------   -----
 0   InvoiceNo    541909 non-null  object
 1   StockCode    541909 non-null  object
 2   Description  540455 non-null  object
 3   Quantity     541909 non-null  int64
 4   InvoiceDate  541909 non-null  object
 5   UnitPrice    541909 non-null  float64
 6   CustomerID   406829 non-null  float64
 7   Country      541909 non-null  object
dtypes: float64(2), int64(1), object(5)
memory usage: 33.1+ MB
We can see that the datafile has around 5.5 lakh (541,909) entries and 8 variables with the properties listed above; each row describes a single transaction line item. Take a look at the InvoiceNo and the CustomerID of the first entries: customer 17850 from the United Kingdom placed a single order with InvoiceNo 536365, containing several products with different stock codes, descriptions, unit prices and quantities, all sharing the same InvoiceDate.
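For illustration, here is a minimal sketch (the invoice number 536365 is simply taken from the head() output above) that lists all line items belonging to that first invoice:
# all line items of the first invoice shown above (InvoiceNo is stored as a string in an object column)
data.loc[data.InvoiceNo == "536365",
         ["StockCode", "Description", "Quantity", "UnitPrice", "InvoiceDate", "CustomerID"]]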
What percentage of values is missing for each feature?
data_size=data.shape
missing_percentage = data.isnull().sum() / data.shape[0] * 100
missing_percentage
InvoiceNo       0.000000
StockCode       0.000000
Description     0.268311
Quantity        0.000000
InvoiceDate     0.000000
UnitPrice       0.000000
CustomerID     24.926694
Country         0.000000
dtype: float64
sns.heatmap(data.isnull(),yticklabels=False,cbar=False)
<Axes: >
Almost 25 % of the entries have an unknown customer (CustomerID)! That's very strange. In addition, about 0.27 % of the descriptions are missing. This looks dirty. Let's gain a further impression by considering some examples.
Missing descriptions
data.loc[data.Description.isnull()].count()
InvoiceNo      1454
StockCode      1454
Description       0
Quantity       1454
InvoiceDate    1454
UnitPrice      1454
CustomerID        0
Country        1454
dtype: int64
How often do we miss the customer as well? The CustomerID count in the output above is 0, so in all 1454 cases with a missing description the customer is unknown too.
And the unit price?
data.loc[(data.Description.isnull()) & (data.UnitPrice<=0),:]
InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | |
---|---|---|---|---|---|---|---|---|
622 | 536414 | 22139 | NaN | 56 | 12/1/2010 11:52 | 0.0 | NaN | United Kingdom |
1970 | 536545 | 21134 | NaN | 1 | 12/1/2010 14:32 | 0.0 | NaN | United Kingdom |
1971 | 536546 | 22145 | NaN | 1 | 12/1/2010 14:33 | 0.0 | NaN | United Kingdom |
1972 | 536547 | 37509 | NaN | 1 | 12/1/2010 14:33 | 0.0 | NaN | United Kingdom |
1987 | 536549 | 85226A | NaN | 1 | 12/1/2010 14:34 | 0.0 | NaN | United Kingdom |
... | ... | ... | ... | ... | ... | ... | ... | ... |
535322 | 581199 | 84581 | NaN | -2 | 12/7/2011 18:26 | 0.0 | NaN | United Kingdom |
535326 | 581203 | 23406 | NaN | 15 | 12/7/2011 18:31 | 0.0 | NaN | United Kingdom |
535332 | 581209 | 21620 | NaN | 6 | 12/7/2011 18:35 | 0.0 | NaN | United Kingdom |
536981 | 581234 | 72817 | NaN | 27 | 12/8/2011 10:33 | 0.0 | NaN | United Kingdom |
538554 | 581408 | 85175 | NaN | 20 | 12/8/2011 14:06 | 0.0 | NaN | United Kingdom |
1454 rows × 8 columns
In cases of missing descriptions we always miss the customer ID as well, and the unit price is zero. Why does the retailer record such entries without any description? It seems there is no well-defined procedure for dealing with and recording these transactions. This is already a hint that we can expect strange entries in the data and that they can be difficult to detect!
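As a quick cross-check of this statement (a small sketch using only the columns already shown), we can verify that rows without a description indeed lack the CustomerID and carry a zero unit price:
no_desc = data.loc[data.Description.isnull()]
print(no_desc.CustomerID.notnull().sum())  # missing-description rows that still have a customer (expected: 0)
print((no_desc.UnitPrice != 0).sum())      # missing-description rows with a non-zero unit price (expected: 0)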
Missing Customer IDs
data.loc[data.CustomerID.isnull(), ["UnitPrice", "Quantity"]].describe()
UnitPrice | Quantity | |
---|---|---|
count | 135080.000000 | 135080.000000 |
mean | 8.076577 | 1.995573 |
std | 151.900816 | 66.696153 |
min | -11062.060000 | -9600.000000 |
25% | 1.630000 | 1.000000 |
50% | 3.290000 | 1.000000 |
75% | 5.450000 | 3.000000 |
max | 17836.460000 | 5568.000000 |
That's bad as well. The prices and quantities of entries without a customer ID can show extreme outliers, including negative values. While preparing the customer segmentation model we need to drop such values, especially the negative prices and quantities, or put them into a special Faulty/Outliers segment. Perhaps one can purchase as a guest, but then it would be good, clean style to plug in a special ID that marks the purchase as a guest transaction; a minimal sketch of this flagging idea is given below.
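A minimal sketch of the flagging idea described above (the masks and printed shares are illustrative and are not applied to the data used in the rest of the analysis):
is_guest = data.CustomerID.isnull()                       # purchases that cannot be attributed to a known customer
is_faulty = (data.UnitPrice <= 0) | (data.Quantity <= 0)  # entries violating our price/quantity assumptions
print("guest share (%): ", round(is_guest.mean() * 100, 2))
print("faulty share (%):", round(is_faulty.mean() * 100, 2))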
Hidden missing descriptions
Can we find "nan" strings? To check, let's create a new feature that holds the descriptions in lowercase:
data.loc[data.Description.isnull()==False, "lowercase_descriptions"] = data.loc[
data.Description.isnull()==False,"Description"
].apply(lambda l: l.lower())
data.lowercase_descriptions.dropna().apply(
lambda l: np.where("nan" in l, True, False)
).value_counts()
lowercase_descriptions
False    539724
True        731
Name: count, dtype: int64
Can we find empty ""-strings?
data.lowercase_descriptions.dropna().apply(
lambda l: np.where("" == l, True, False)
).value_counts()
lowercase_descriptions
False    540455
Name: count, dtype: int64
We found additional, hidden missing values: descriptions that contain the string "nan" instead of a proper NaN. Let's transform them to NaN:
data.loc[data.lowercase_descriptions.isnull()==False, "lowercase_descriptions"] = data.loc[
data.lowercase_descriptions.isnull()==False, "lowercase_descriptions"
].apply(lambda l: np.where("nan" in l, None, l))
data.loc[:,["Description","lowercase_descriptions"]].count()
Description               540455
lowercase_descriptions    539724
dtype: int64
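A more idiomatic way to achieve the same replacement (a sketch that is equivalent in effect to the apply/np.where construct above) would be the pandas string accessor:
# mark every lowercase description that contains the substring "nan" as missing
hidden_nan = data.lowercase_descriptions.str.contains("nan", na=False)
data.loc[hidden_nan, "lowercase_descriptions"] = np.nan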
How long is the period in days?
data["InvoiceDate"] = pd.to_datetime(data.InvoiceDate, cache=True)
data.InvoiceDate.max() - data.InvoiceDate.min()
Timedelta('373 days 04:24:00')
print("Datafile starts with timepoint {}".format(data.InvoiceDate.min()))
print("Datafile ends with timepoint {}".format(data.InvoiceDate.max()))
Datafile starts with timepoint 2010-12-01 08:26:00
Datafile ends with timepoint 2011-12-09 12:50:00
How many different invoice numbers do we have?
data.InvoiceNo.nunique()
25900
In the data description we can find that cancelled transactions have an InvoiceNo starting with a "C". Let's create a feature to easily filter out these cases:
data["IsCancelled"]=np.where(data.InvoiceNo.apply(lambda l: l[0]=="C"), True, False)
data.IsCancelled.value_counts() / data.shape[0] * 100
IsCancelled
False    98.286059
True      1.713941
Name: count, dtype: float64
1.7 % of all entries are cancellations.
data.loc[data.IsCancelled==True].describe()
Quantity | InvoiceDate | UnitPrice | CustomerID | |
---|---|---|---|---|
count | 9288.000000 | 9288 | 9288.000000 | 8905.000000 |
mean | -29.885228 | 2011-06-26 03:42:05.943152640 | 48.393661 | 14991.667266 |
min | -80995.000000 | 2010-12-01 09:41:00 | 0.010000 | 12346.000000 |
25% | -6.000000 | 2011-03-21 16:15:00 | 1.450000 | 13510.000000 |
50% | -2.000000 | 2011-07-07 17:33:30 | 2.950000 | 14895.000000 |
75% | -1.000000 | 2011-10-06 20:36:00 | 5.950000 | 16393.000000 |
max | -1.000000 | 2011-12-09 11:58:00 | 38970.000000 | 18282.000000 |
std | 1145.786965 | NaN | 666.600430 | 1706.772357 |
All cancellations have negative quantities but positive, non-zero unit prices. Given this data we are not easily able to understand why a customer made a return, and it is very difficult to predict such cases as there could be several hidden reasons for a cancellation. Let's drop them:
data = data.loc[data.IsCancelled==False].copy()
data = data.drop("IsCancelled", axis=1)
data.shape
(532621, 9)
We have 532,621 remaining entries, down from the initial 541,909, after removing the cancellations.
Let us handle all the problematic cases identified above by dropping such values or by creating a special category where our assumptions are not met.
We won't drop the rows with a missing CustomerID; instead we drop the CustomerID column, as the customer ID does not play any role in our segmentation approach.
data.drop("CustomerID",axis=1)
InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | Country | lowercase_descriptions | |
---|---|---|---|---|---|---|---|---|
0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | United Kingdom | white hanging heart t-light holder |
1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | United Kingdom | white metal lantern |
2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | United Kingdom | cream cupid hearts coat hanger |
3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | United Kingdom | knitted union flag hot water bottle |
4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | United Kingdom | red woolly hottie white heart. |
... | ... | ... | ... | ... | ... | ... | ... | ... |
541904 | 581587 | 22613 | PACK OF 20 SPACEBOY NAPKINS | 12 | 2011-12-09 12:50:00 | 0.85 | France | pack of 20 spaceboy napkins |
541905 | 581587 | 22899 | CHILDREN'S APRON DOLLY GIRL | 6 | 2011-12-09 12:50:00 | 2.10 | France | children's apron dolly girl |
541906 | 581587 | 23254 | CHILDRENS CUTLERY DOLLY GIRL | 4 | 2011-12-09 12:50:00 | 4.15 | France | childrens cutlery dolly girl |
541907 | 581587 | 23255 | CHILDRENS CUTLERY CIRCUS PARADE | 4 | 2011-12-09 12:50:00 | 4.15 | France | childrens cutlery circus parade |
541908 | 581587 | 22138 | BAKING SET 9 PIECE RETROSPOT | 3 | 2011-12-09 12:50:00 | 4.95 | France | baking set 9 piece retrospot |
532621 rows × 8 columns
Negative Values?
Removing entries with Negative Quantity
data.loc[data.Quantity<=0,:].count()
InvoiceNo                 1336
StockCode                 1336
Description                474
Quantity                  1336
InvoiceDate               1336
UnitPrice                 1336
CustomerID                   0
Country                   1336
lowercase_descriptions     474
dtype: int64
data = data.loc[data.Quantity>0].copy()
data.shape
(531285, 9)
Removing entries with Negative Price
data.loc[data.UnitPrice<=0,:].count()
InvoiceNo                 1181
StockCode                 1181
Description                589
Quantity                  1181
InvoiceDate               1181
UnitPrice                 1181
CustomerID                  40
Country                   1181
lowercase_descriptions     589
dtype: int64
data = data.loc[data.UnitPrice>0].copy()
data.shape
(530104, 9)
Just to be sure: Is there a missing value left?
data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 530104 entries, 0 to 541908
Data columns (total 9 columns):
 #   Column                  Non-Null Count   Dtype
---  ------                  --------------   -----
 0   InvoiceNo               530104 non-null  object
 1   StockCode               530104 non-null  object
 2   Description             530104 non-null  object
 3   Quantity                530104 non-null  int64
 4   InvoiceDate             530104 non-null  datetime64[ns]
 5   UnitPrice               530104 non-null  float64
 6   CustomerID              397884 non-null  float64
 7   Country                 530104 non-null  object
 8   lowercase_descriptions  529382 non-null  object
dtypes: datetime64[ns](1), float64(2), int64(1), object(5)
memory usage: 40.4+ MB
How many unique stockcodes do we have?
data.StockCode.nunique()
3922
Which codes are most common?
stockcode_counts = data.StockCode.value_counts().sort_values(ascending=False)
fig, ax = plt.subplots(2,1,figsize=(20,15))
sns.barplot(x=stockcode_counts.iloc[0:20].index,
y=stockcode_counts.iloc[0:20].values,
alpha=0.8,
ax = ax[0], palette="Oranges_r")
ax[0].set_ylabel("Counts")
ax[0].set_xlabel("Stockcode")
ax[0].set_title("Which stockcodes are most common?");
sns.histplot(np.round(stockcode_counts/data.shape[0]*100, 2),  # histplot replaces the deprecated distplot
             kde=False,
             bins=20,
             ax=ax[1], color="Orange")
ax[1].set_title("How seldom are stockcodes?")
ax[1].set_xlabel("% of data with this stockcode")
ax[1].set_ylabel("Frequency");
Even though the majority of stockcodes consist of five numeric characters, there are other patterns as well: the length can vary (roughly between 1 and 12 characters) and there are stockcodes with no numeric characters at all! A quick check is sketched below.
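A small sketch to back up this statement (the helper name count_numeric_chars is illustrative and not part of the original analysis):
code_lengths = data.StockCode.apply(len)
print("stockcode length range:", code_lengths.min(), "to", code_lengths.max())
def count_numeric_chars(code):
    # number of digit characters in a stockcode
    return sum(1 for c in code if c.isdigit())
non_numeric_codes = data.loc[data.StockCode.apply(count_numeric_chars) == 0, "StockCode"].unique()
print("stockcodes without any numeric character:", non_numeric_codes)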
How many unique descriptions do we have?
data.Description.nunique()
4026
And which are most common?
description_counts = data.Description.value_counts().sort_values(ascending=False).iloc[0:30]
plt.figure(figsize=(20,5))
sns.barplot(x=description_counts.index, y=description_counts.values, palette="Purples_r")
plt.ylabel("Counts")
plt.title("Which product descriptions are most common?");
plt.xticks(rotation=90);
Ok, we can see that some descriptions correspond to a similar product type. Do you see the multiple occurrences of lunch bags? We often have colour information about the product as well. Furthermore, the most common descriptions seem to confirm that the retailer sells many different kinds of products. All descriptions seem to consist of uppercase characters. Now let's do some additional analysis on the descriptions by counting their length and their number of lowercase characters.
def count_lower_chars(l):
return sum(1 for c in l if c.islower())
#data["DescriptionLength"] = data.Description.apply(lambda l: len(l))
data["LowCharsInDescription"] = data.Description.apply(lambda l: count_lower_chars(l))
Oh, great! Almost all descriptions contain no lowercase characters, but we have found some exceptional cases!
lowchar_counts = data.loc[data.LowCharsInDescription > 0].Description.value_counts()
plt.figure(figsize=(15,3))
sns.barplot(x=lowchar_counts.index, y=lowchar_counts.values, palette="Purples_r")
plt.xticks(rotation=90);
We still have more descriptions than stockcodes and we should continue to find out why they differ.
data.groupby("StockCode").Description.nunique().sort_values(ascending=False).iloc[0:10]
StockCode
23236     4
23196     4
17107D    3
23366     3
23203     3
23370     3
23126     3
23244     3
23413     3
23131     3
Name: Description, dtype: int64
Wow, we still have stockcodes with multiple descriptions. Let's look at an example:
data.loc[data.StockCode == "23244"].Description.value_counts()
Description
ROUND STORAGE TIN VINTAGE LEAF    103
STORAGE TIN VINTAGE LEAF            7
CANNISTER VINTAGE LEAF DESIGN       2
Name: count, dtype: int64
Ok, browsing through these cases we can see that the descriptions of a stockcode sometimes differ slightly due to missing or changed words or typing errors. Nonetheless they look fine and we can continue.
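If we wanted to unify such variants, one option (a sketch, not applied to the working data here) would be to map every stockcode to its most frequent description:
most_common_desc = (
    data.groupby("StockCode").Description
        .agg(lambda s: s.value_counts().index[0])   # the dominant description per stockcode
)
most_common_desc.loc["23244"]   # e.g. the dominant description for the stockcode inspected above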
data.UnitPrice.describe()
count    530104.000000
mean          3.907625
std          35.915681
min           0.001000
25%           1.250000
50%           2.080000
75%           4.130000
max       13541.330000
Name: UnitPrice, dtype: float64
np.quantile(data.UnitPrice, 0.999)
165.0
99.9 % of the data has a unit price of less than 165.
data = data.loc[(data.UnitPrice > 0.001) & (data.UnitPrice < 200)].copy()
We can drop the outlier prices or keep them as they are for our purpose.
Ok, now the most important one - the quantity. Let's take a look at its distribution:
data.Quantity.describe()
count    529710.000000
mean         10.549021
std         155.581721
min           1.000000
25%           1.000000
50%           3.000000
75%          10.000000
max       80995.000000
Name: Quantity, dtype: float64
Ok, most products are sold in quantities from 1 to 30. But, we have extreme, unrealistic outliers again:
fig, ax = plt.subplots(1,2,figsize=(20,5))
sns.histplot(data.Quantity, ax=ax[0], kde=False, color="limegreen")          # histplot replaces the deprecated distplot
sns.histplot(np.log(data.Quantity), ax=ax[1], bins=20, kde=False, color="limegreen")
ax[0].set_title("Quantity distribution")
ax[0].set_yscale("log")
ax[1].set_title("Log-Quantity distribution")
ax[1].set_xlabel("Natural-Log Quantity");
As you can see from the log-transformed distribution, it would make sense to make a cut somewhere around:
np.exp(4)
54.598150033144236
np.quantile(data.Quantity,0.95)
30.0
np.quantile(data.Quantity, 0.999)
451.74599999981
By cutting at the 99.9 % quantile (≈452) we would still cover more than 99 % of the data!
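A quick coverage check of the two candidate cut-offs (a sketch; the exact percentages come from running it, none are quoted here):
print("cut at  55 keeps", round((data.Quantity <= 55).mean() * 100, 2), "% of the entries")
print("cut at 452 keeps", round((data.Quantity <= 452).mean() * 100, 2), "% of the entries")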
data = data.loc[data.Quantity <=452].copy()
plt.figure(figsize=(20,5))
plt.plot(data.groupby("InvoiceDate").Quantity.sum(), marker='+', c="darkorange", label='Original')
plt.plot(data.groupby("InvoiceDate").Quantity.sum().rolling(window=30, center=True).mean(),c="red", label='Rolling Mean')
plt.xticks(rotation=90)
plt.legend(loc='best')
plt.title("How many quantities are sold per day over the given time?");
# Perform the Augmented Dickey-Fuller (ADF) test:
print('Results of ADF Test:')
dftest = adfuller(data.Quantity, autolag='AIC')  # note: this runs on the full transaction-level series and can be slow
dfoutput = pd.Series(dftest[0:4],
                     index=['Test Statistic', 'p-value', '#Lags Used', 'Number of Observations Used'])
for key, value in dftest[4].items():
    dfoutput['Critical Value (%s)' % key] = value
print(dfoutput)