Customer Segmentation: Bertelsmann / Arvato Capstone Project

This is a blog post for a technical audience summarising my work on the final capstone project of Udacity's Data Scientist Nanodegree.

Bernardo Garcia del Rio
Nov 15, 2020

I have structured this blog post in 8 sections to make the reader's life easier. The first section is just a brief introduction to the project.

The second section describes the standard process model that I used throughout this project, the well-known CRISP-DM methodology.

Sections 3 to 6 correspond to the steps defined in CRISP-DM: here I explain the business problem, how I tried to understand all the available data, how I preprocessed it, the algorithms that I used and why, etc.

Section 7 talks about the experience in the Kaggle competition amongst Udacity students for this project.

In the last section, I put together some personal take-aways from this project that will definitely be useful for future projects.

All the code and information can be found in the GitHub repository below:

https://github.com/bergr7/Bertelsmann_Arvato_Project

1. Introduction to the project

In this capstone project, I was asked to analyse demographics data for customers of a mail-order sales company in Germany and compare it against demographics information for the general population.

The datasets used were provided by Bertelsmann Arvato Analytics, and the project represents a real-life data science task. This data has been strictly used for this project only and will be removed from my laptop after completion of the project in accordance with Bertelsmann Arvato terms and conditions.

The project is divided into 3 parts:

Part 1: Customer Segmentation Report

In this part, I used unsupervised learning techniques to analyse features of established customers and the general population in order to create customer segments with the aim of targeting them in marketing campaigns.

Part 2: Supervised Learning Model

The next step was to build a supervised machine learning model that predicted whether or not each individual would respond to the campaign.

Part 3: Kaggle Competition

Finally, after spot-checking several models, choosing the most promising one and fine-tuning its hyperparameters, I used the model to make predictions on the campaign data as part of a Kaggle Competition.

2. CRISP-DM Methodology

Cross-industry standard process for data mining (CRISP-DM) is the most widely-used model in data science projects. CRISP-DM allowed me to keep track of all the necessary steps in this project.

For those not familiar with this methodology, CRISP-DM sets out the following steps for an end-to-end data science project:

  1. Business Understanding
  2. Data Understanding
  3. Data Preparation
  4. Modeling
  5. Evaluation
  6. Deployment (Not applicable to this project)

In almost every project, it is necessary to iterate over different steps during the project's lifecycle.

You can find the CRISP-DM document that I created in Notion for this project in the GitHub repository.

3. Business Understanding

I identified the following business objectives for this project:

  • Customer Segmentation: Identify the parts of the population that best describe the core customer base of the company.
  • Improve marketing campaign efficiency: Using the different features that define customer groups (from the Customer Segmentation report), identify the individuals who are most likely to become customers of the company in order to target them in the marketing campaign, and then build a supervised ML model to predict whether an individual is likely to become a customer.

As important as identifying and understanding the business objectives of the project is identifying the project deliverables. This project had three main deliverables:

  • Customer Segmentation Report: A Jupyter Notebook with markdown notes and visualisations. Note that I needed to create a couple of supporting files to explore the data and preprocess it.
  • Supervised Learning Model: I built a supervised learning model that used demographics data to classify people into potential customers and non-potential customers.
  • Blog post for a technical audience: You are reading it! :)

Dividing the project into deliverables allows you to keep track of your progress and increases the chances of delivering the project on time.

4. Data Understanding

Bertelsmann Arvato Analytics provided demographics data for the general population of Germany, for customers of a mail-order company and for individuals who were targets of a marketing campaign.

Each row of the demographics files represents a single person, but also includes information outside of individuals, including information about their household, building, and neighborhood.

Bertelsmann Arvato Analytics also provided a top-level list of attributes and descriptions, organised by informational category and a detailed mapping of data values for each feature in alphabetical order.

My business domain knowledge was very limited so these Excel files helped me understand some of the features.

The majority of the features were categorical, with a mix of ordinal and nominal data. All these features had already been encoded, with the exception of the CAMEO_DEU_2015 feature, which needed to be label encoded in the data preprocessing stage.

There were also a few numerical variables in the datasets.

During the exploratory data analysis (EDA), I found two potential issues in the data:

  • Some features in the .csv files provided did not have a description in the Excel files. That means that I did not know the meaning of some of the features.
  • In the categorical features, there was a category called ‘unknown’ that had been mapped to 0, -1, ‘X’ or ‘XX’.

I went through the Excel files and the .csv files provided and manually created the following new .csv file:

Preview of the new attributes summary file (only 5 rows shown).

There were 54 features that had not been described/included in the Excel files.

Going through all the features (350+) gave me a better understanding of the data I had in hand. I was also able to identify a couple of variables with outliers.

For example, the feature ANZ_HAUSHALTE_AKTIV represents the number of households in the building and is numeric. The information provided in the Excel files stated that this variable is typically encoded from 1–10, but I spotted much larger values, so I plotted a histogram of the variable:

Histogram showing ANZ_HAUSHALTE_AKTIV feature distribution.

As we can see in the histogram above, there are some rare values that are likely to be errors in the data (the maximum value was 595.0), as well as outliers.
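As a minimal sketch of this kind of check (assuming the general population data has been loaded into a DataFrame called azdias_df), the summary statistics and a histogram quickly reveal the suspicious right tail:

import matplotlib.pyplot as plt

# inspect the numeric range of ANZ_HAUSHALTE_AKTIV (number of households in the building)
print(azdias_df['ANZ_HAUSHALTE_AKTIV'].describe())

# plot a histogram to visualise the long right tail
plt.figure(figsize=(10, 5))
azdias_df['ANZ_HAUSHALTE_AKTIV'].dropna().hist(bins=50)
plt.title('ANZ_HAUSHALTE_AKTIV distribution')
plt.xlabel('Number of households in the building')
plt.ylabel('Count')
plt.show()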

5. Data Preparation

Once I had understood the meaning of the data and the data types, with the particularities of each feature, I started preparing a Python script with functions that preprocessed the data for the clustering analysis.

The preprocessing steps were the following:

Step 1: Map unknown values to NaN’s

As mentioned in the previous section, there were unknown values mapped to a category, so I created a function to map them to NaN's. Note that the features already had some NaN's or missing values. The function used for this is below:

import numpy as np

def map_unknowns(attributes, df):
    """It maps unknown values identified during data exploration to NaN's.
    Parameters
    __________
    :param attributes: Attributes pandas DataFrame.
    :param df: AZDIAS or CUSTOMERS Pandas DataFrame
    __________
    :return: mapped_df: AZDIAS or CUSTOMERS Pandas DataFrame with unknown values mapped to NaN's
    """
    # create a dict with original dtypes for each column
    original_dtypes = dict()
    for col in df.columns:
        original_dtypes[col] = str(df[col].dtype)

    # convert all columns to string type
    df = df.astype(dtype='str')
    # loop through all attributes
    for attribute in attributes.index:
        # for each attribute, retrieve a list with unknown values
        unknowns_list = attributes['Unknown'].loc[attribute].strip('][').split(', ')[0].split(',')
        # if there are unknown values, map them to NaN's
        if unknowns_list != ['']:
            if attribute in ['CAMEO_DEUG_2015', 'CAMEO_DEU_2015', 'CAMEO_INTL_2015']:
                df.loc[df[attribute].isin(['X', 'XX', '-1']), attribute] = np.nan
            else:
                df.loc[df[attribute].isin(unknowns_list), attribute] = np.nan

    # transform columns back to original dtypes
    df = df.astype(original_dtypes, errors='ignore')

    mapped_df = df

    return mapped_df

Note that all the data is converted to string type to map the unknown values and then converted back to the original data types.
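A brief usage sketch, assuming the attributes summary and the demographics data have already been loaded into pandas DataFrames (the file names here are illustrative):

import pandas as pd

# illustrative file names; adjust to the actual data paths
attributes = pd.read_csv('attributes_summary.csv', index_col=0)
azdias_df = pd.read_csv('Udacity_AZDIAS_052018.csv', sep=';')

azdias_mapped = map_unknowns(attributes, azdias_df)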

Step 2: Clean the data

I created a function to clean the datasets. This function does the following:

  • It drops features that were not useful for analysis due to their nature, for example an ID column, or due to the amount of missing values.
Distribution of missing values per column in the training set.

The majority of the features had fewer than 200,000 missing values, which is about 22% missing values per attribute.

I decided to drop features with more than 60% missing values. Features with less than 60% missing values were imputed with the mode or most frequent value. To do this, Scikit-Learn's SimpleImputer class came in very handy.

  • It drops rows with a high proportion of missing values from the training set.
Distribution of missing values per row in the training set.

The majority of the rows had fewer than 100 missing values, which is about 27% missing values per row. I decided to drop rows with more than 150 missing values, only from the training set. The chunk of code that does this in the function is shown below:

# drop individuals with more than 150 missing values
rows_toDrop = list(azdias_df.isnull().sum(axis=1).loc[azdias_df.isnull().sum(axis=1) > 150].index)
azdias_df = azdias_df.drop(rows_toDrop, axis=0)
  • It drops outliers found in ANZ_HAUSHALTE_AKTIV and ANZ_PERSONEN features.
azdias_df = azdias_df.loc[azdias_df['ANZ_HAUSHALTE_AKTIV'] < 10, :]
azdias_df = azdias_df.loc[azdias_df['ANZ_PERSONEN'] < 3, :]

The criteria used to drop outliers were based on the 1.5 × Interquartile Range (IQR) method for outlier detection and the information provided in the Excel files, as sketched below.
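A condensed sketch of these cleaning criteria, assuming azdias_df is the general population DataFrame (variable names and thresholds as described above):

# drop columns with more than 60% missing values
missing_pct = azdias_df.isnull().mean()
cols_to_drop = missing_pct.loc[missing_pct > 0.6].index.tolist()
azdias_df = azdias_df.drop(columns=cols_to_drop)

# drop rows with more than 150 missing values (training set only)
azdias_df = azdias_df.loc[azdias_df.isnull().sum(axis=1) <= 150]

# 1.5 * IQR upper bound as a reference for the ANZ_HAUSHALTE_AKTIV cut-off
q1, q3 = azdias_df['ANZ_HAUSHALTE_AKTIV'].quantile([0.25, 0.75])
upper_bound = q3 + 1.5 * (q3 - q1)
print('1.5 * IQR upper bound:', upper_bound)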

Step 3: Impute missing values

Once the data was cleaned, I created another function to impute the missing values.

As the majority of the data is categorical, the strategy used to impute the missing values was the mode or most frequent value, using Scikit-Learn's SimpleImputer class. See the function below:

import pandas as pd
from sklearn.impute import SimpleImputer

def impute_mv(df, strategy):
    """It performs imputation of missing values using sklearn SimpleImputer.
    Parameters
    __________
    :param df: AZDIAS or CUSTOMERS Pandas DataFrame
    :param strategy: string. The imputation strategy for SimpleImputer
    __________
    :return: AZDIAS or CUSTOMERS df with imputed values
    """
    # instantiate SimpleImputer
    imputer = SimpleImputer(strategy=strategy)

    # impute missing values
    data_with_no_mv = imputer.fit_transform(df)

    # put back column names from df as fit_transform returns an array of shape (n_samples, n_features_new)
    imputed_df = pd.DataFrame(data_with_no_mv, columns=df.columns)

    return imputed_df
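For example, the cleaned general population DataFrame (here called azdias_clean, an illustrative name) can then be imputed with the most frequent value:

# impute missing values with the mode
azdias_imputed = impute_mv(azdias_clean, strategy='most_frequent')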

Step 4: Label Encode categorical variables

During the EDA, I identified a categorical feature (CAMEO_DEU_2015) that had not been encoded.

This is a nominal categorical variable, so I label encoded it. Again, Scikit-Learn came in very handy here with its LabelEncoder class. See the function below:

from sklearn.preprocessing import LabelEncoder

def label_encode_cameo(df):
    """It performs label encoding on 'CAMEO_DEU_2015' feature using sklearn LabelEncoder.
    Parameters
    __________
    :param df: AZDIAS or CUSTOMERS Pandas DataFrame
    __________
    :return: AZDIAS or CUSTOMERS Pandas DataFrame with encoded 'CAMEO_DEU_2015'.
    """
    # instantiate LabelEncoder
    lab_encoder = LabelEncoder()

    # pull out list of unique classes
    classes = list(df['CAMEO_DEU_2015'].unique())

    # fit encoder
    lab_encoder.fit(classes)

    # label encode 'CAMEO_DEU_2015'
    df['CAMEO_DEU_2015'] = lab_encoder.transform(df['CAMEO_DEU_2015'])

    encoded_df = df

    return encoded_df
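A quick usage sketch (again, the DataFrame name is illustrative):

# label encode CAMEO_DEU_2015 on the imputed DataFrame
azdias_encoded = label_encode_cameo(azdias_imputed)
print(azdias_encoded['CAMEO_DEU_2015'].unique()[:10])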

Step 5: Save DataFrame as a pickle file

Running the preprocessing steps on the datasets took some time due to the dimensionality of the data. For example, the dataset with the demographics data for the general population of Germany has 891,221 rows and 367 columns.

For that reason, I saved the preprocessed DataFrames as pickle files that would be loaded later for the clustering analysis.
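A minimal sketch of this step, assuming the fully preprocessed DataFrame is called azdias_preprocessed (the file name is illustrative):

import pandas as pd

# save the preprocessed DataFrame to disk
azdias_preprocessed.to_pickle('azdias_preprocessed.pkl')

# load it back later in the clustering analysis notebook
azdias_preprocessed = pd.read_pickle('azdias_preprocessed.pkl')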

Dataset for individuals who were targets of a marketing campaign

All the preprocessing steps above apply to the datasets for both Part 1: Customer Segmentation and Part 2: Supervised Learning Model. I just needed to make a couple of minor modifications to the script that ran the preprocessing steps on the dataset for individuals who were targets of a marketing campaign.

6. Modeling and Evaluation

6.1. Customer Segmentation Report

I used unsupervised learning techniques to describe the relationship between the demographics of the company’s existing customers and the general population of Germany.

There were mainly two difficulties in the clustering analysis:

  1. The dimensionality of the data: There were 350+ features to be considered. This makes the investigation of the clusters quite difficult. Furthermore, running even a simple clustering algorithm such as K-Means can be computationally expensive.
  2. The lack of description for some of the features: Here I could either drop the 54 features without a description or include them in the analysis and try to interpret the clusters with the help of the other features. I opted for the latter.

Due to the large number of columns, I used Principal Component Analysis (PCA) to reduce the dimensionality of the data and eliminate noise as well.

Note that, as the aim is to describe the relationship between the demographics of the company's existing customers and the general population of Germany, the model is trained on the general population data and the customers dataset is transformed with this model so that I could compare the results later.

I selected the first principal components that explained 80% of the variability of the data. To do this, I plotted a scree plot:

Scree plot showing the PCs (General Population dataset).

The first 92 PCs captured 80% of the variability in the data, so the dimensionality was reduced from 293 columns to 92 columns.
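A sketch of how the number of components can be chosen from the cumulative explained variance, assuming the preprocessed general population and customers DataFrames share the same columns (variable names here are illustrative):

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# scale the data, fitting only on the general population
scaler = StandardScaler()
azdias_scaled = scaler.fit_transform(azdias_preprocessed)
customers_scaled = scaler.transform(customers_preprocessed)

# fit a full PCA on the general population only
pca = PCA()
pca.fit(azdias_scaled)

# smallest number of PCs whose cumulative explained variance reaches 80%
cum_var = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.argmax(cum_var >= 0.80)) + 1
print('PCs needed for 80% of the variance:', n_components)  # 92 in this project

# refit with the selected number of components and transform both datasets
pca = PCA(n_components=n_components)
azdias_pca = pca.fit_transform(azdias_scaled)
customers_pca = pca.transform(customers_scaled)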

Having transformed the General Population data and the Customers data, I was ready to carry out the clustering analysis. I decided to use one of the simplest but also most popular algorithms, K-Means clustering.

The K-Means algorithm requires a predefined number of clusters. In order to find an optimal number of clusters, I used the Elbow method.
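Under the hood, the Elbow method is just a loop over candidate values of K that records the sum of squared errors (the inertia_ attribute of a fitted KMeans model). A minimal sketch, assuming azdias_pca holds the PCA-transformed general population data and the range of K values is illustrative:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# compute the SSE for a range of candidate K values
sse = []
k_values = range(2, 16)
for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(azdias_pca)
    sse.append(kmeans.inertia_)

# plot SSE vs. K and look for the "elbow"
plt.plot(list(k_values), sse, marker='o')
plt.xlabel('Number of clusters (K)')
plt.ylabel('Sum of squared errors (SSE)')
plt.title('Elbow plot')
plt.show()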

Elbow plot: Sum of squared errors (SSE) vs. number of clusters (K). General population.

By the look of the Elbow plot, an optimal number of clusters was 9.

I fitted K-Means on the Principal Components from the General Population data and predicted the labels for each individual. Then, I predicted the labels for the Customers using the same model.

Thus, I could compare which clusters were over-represented in the Customers base and which ones were under-represented.
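A sketch of that comparison, assuming azdias_pca and customers_pca are the PCA-transformed datasets from the previous step:

import pandas as pd
from sklearn.cluster import KMeans

# fit on the general population and predict labels for both datasets
kmeans = KMeans(n_clusters=9, random_state=42)
azdias_labels = kmeans.fit_predict(azdias_pca)
customers_labels = kmeans.predict(customers_pca)

# proportion of individuals per cluster in each dataset
azdias_props = pd.Series(azdias_labels).value_counts(normalize=True).sort_index()
customers_props = pd.Series(customers_labels).value_counts(normalize=True).sort_index()

# representation ratio: values above 1 mean the cluster is over-represented in the customer base
representation_ratio = customers_props / azdias_props
print(representation_ratio)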

Bar chart showing comparison of proportion of individuals in each cluster.
Bar chart showing Representation Ratio (Customers / General Population of Germany).

As we can see in the bar charts above, the over-represented clusters are clusters number 2, 6, 7 and 9 and the under-represented clusters are clusters number 1, 4, 5 and 8. Cluster number 3 is pretty much balanced.

It is sensible to believe that individuals of the general population of Germany in the over-represented clusters are more likely to become customers and that individuals in the under-represented clusters are less so.

By targeting people in the over-represented groups, the company could improve the efficiency of the marketing campaigns.

Now I needed to investigate the clusters to understand the characteristics of the individuals in each one. This was a bit tricky due to the number of features and because I used a dimensionality reduction technique before running K-Means.

A principal component (PC) is a unit vector that points in the direction of highest variance of the data points projected onto a new subspace, after accounting for the variance captured by the earlier PCs. If I could map each feature weight to its corresponding feature and sort them by weight to find out which features are most important for each PC, I would be able to find the most interesting features, which are the ones at the beginning and the end of the sorted weights. The reasons are explained below:

  • The larger the feature weight (in absolute value), the more the PC is in the direction of that feature.
  • If two features have large weights of the same sign (both positive or both negative), then an increase in one feature will yield an increase in the other. They show a positive correlation.
  • Similarly, two features with weights of different signs will be expected to show a negative correlation.

So I inverse-transformed the cluster centres using the previously trained PCA model, obtained the feature weights and plotted them for each cluster. I created the following functions:

import pandas as pd
import matplotlib.pyplot as plt

def get_feature_weights_cluster(pca_model, clustering_model, cluster):
    """It inversely transforms the cluster centers using a PCA model to find feature weights.
    Parameters
    ----------
    pca_model: PCA model trained on general population data
    clustering_model: KMeans model trained on PCA transformed general population
    cluster: int. Cluster number

    Returns
    ----------
    weights_by_feature: Series. Sorted weights by feature
    """
    feature_weights = pca_model.inverse_transform(clustering_model.cluster_centers_[cluster - 1])
    # convert to series indexed by the original feature names
    weights_by_feature = pd.Series(feature_weights, index=azdias_df.columns).sort_values()

    return weights_by_feature


def plot_feature_weights(weights):
    """It plots a bar chart showing the weights by feature.

    Parameters
    ----------
    weights: Series. Weights by feature.

    Returns
    ----------
    None
    """
    # plot bar chart with all weights by feature
    plt.figure(figsize=(15, 7))
    plt.bar([i for i in range(1, len(weights.index.tolist()) + 1)], weights.values.tolist(), color='lightcoral')
    plt.title("Weight by feature")
    plt.xlabel("Feature #")
    plt.ylabel("Feature Weight")
    plt.show()

    # plot bar chart with only the first 3 weights and the last 3 weights
    plt.figure(figsize=(15, 7))
    plt.bar(weights.iloc[[0, 1, 2, -3, -2, -1]].index.tolist(),
            weights.iloc[[0, 1, 2, -3, -2, -1]].values.tolist(), color='lightcoral')
    plt.title("Weight by feature")
    plt.xlabel("Feature name")
    plt.ylabel("Feature Weight")
    plt.show()
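These functions can then be called for each cluster. A brief usage sketch, assuming pca and kmeans are the models fitted in the previous steps (the cluster argument follows the 1-based numbering used in the charts):

# inspect the feature weights for cluster 6
weights_cluster_6 = get_feature_weights_cluster(pca, kmeans, cluster=6)
plot_feature_weights(weights_cluster_6)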

As an example, the plots for over-represented cluster 6 are shown below:

Weight by feature for Cluster 6
Weight by feature (only first 3 and last 3 in sorted weights) for Cluster 6

The features KBA13_HERST_BMW_BENZ, KBA13_KW_121 and KBA13_KMH_211 show a positive correlation, whereas KBA13_KMH_140_210, KBA13_HERST_FORD_OPEL and KBA13_KMH_180 show a negative correlation.

According to the feature information, KBA13_HERST_BMW_BENZ represents the share of BMW & Mercedes cars, KBA13_KW_121 represents the share of cars with an engine power of more than 121 kW and KBA13_KMH_211 represents the share of cars with a max speed greater than 210 km/h.

In contrast, KBA13_KMH_140_210, KBA13_HERST_FORD_OPEL and KBA13_KMH_180 represent less powerful and less luxurious cars.

This cluster corresponds to people that own luxurious and very powerful cars.

I did the same exercise for all the over-represented and under-represented clusters.

Due to the lack of meaning for some of the features used in the analysis, it is hard to describe some of the clusters. If I had to carry out the cluster analysis again, I would probably drop the features without a description, or go back to people with business domain knowledge if possible.

However, I have been able to identify the following groups of potential customers:

  • Individuals who own luxurious and powerful cars and are, therefore, likely to be in a good financial position and have high incomes.
  • Individuals who live in good neighborhood areas and municipalities, and have more than one car in their household.
  • Relatively young individuals with a great interest in environmental sustainability, who are somewhat disappointed with their current social status and interested in financial services with low interest rates.

If the company were to launch a marketing campaign for the current services, it would be sensible to target these groups of people.

I have also identified some groups of the German population that are under-represented in the customers base:

  • Young people with average cars and income.
  • Money savers with low interest in financial services.
  • Young people with low-midclass cars and low income.

If the company were looking to reach a larger part of the German population, it might be worth designing some type of financial service that aligns with the needs of these groups and targeting them with a different marketing campaign. The company would need to consider the profitability of these services before launching them, though, as customers with these profiles are unlikely to spend a lot of money often.

6.2. Supervised Learning Model

In the second part of the project, I built a supervised ML model that uses demographics information for each individual and decides whether or not it is worth including that person in the marketing campaign.

I had train and test sets available for the individuals who were targets of a marketing campaign. The test set would be used to make predictions and submit them to the Kaggle Competition (see Section 7).

I randomly split the train set into train and validation sets using Scikit-Learn's train_test_split function.
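A minimal sketch of the split, assuming the preprocessed mailout training data is held in X (features) and y (the RESPONSE column); the test_size value here is illustrative:

from sklearn.model_selection import train_test_split

# plain random 80/20 split; stratify=y could also be passed to preserve the class ratio exactly
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)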

The dataset was highly imbalanced, with a 1:68 ratio, as the majority of the individuals who received the promotion did not respond to it. This is a common issue in classification problems.

There are several ways of handling imbalanced datasets. Oversampling the minority class using synthetic samples, combined with undersampling of the majority class, has proved to be an effective approach.

The Imbalanced-learn library is very useful for this type of problem. It provides similar functionality to Scikit-Learn's, but specifically for problems with imbalanced datasets.

In this case, I used the Synthetic Minority Oversampling Technique (SMOTE) to oversample the minority class up to a 0.4 minority-to-majority ratio and randomly undersampled the majority class down to a 0.6 ratio. I arrived at these proportions through a bit of trial and error.

I started by defining a pipeline and spot-checking some algorithms: RandomForestClassifier, BalancedRandomForestClassifier, XGBClassifier and AdaBoostClassifier with BalancedRandomForestClassifier.

In this pipeline for spot-checking algorithms, I set the SMOTE sampling strategy to 0.1 and the undersampling strategy to 0.5.

Scikit-Learn's RepeatedStratifiedKFold class and cross_val_score function are very useful for quickly spot-checking algorithms using the whole training set. RepeatedStratifiedKFold was used as opposed to RepeatedKFold due to the imbalance in the dataset.

The most promising algorithm was XGBClassifier:

from imblearn.pipeline import Pipeline  # imblearn's Pipeline, so the samplers are applied only during fitting
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from xgboost import XGBClassifier

# define pipeline steps
scaler = StandardScaler()
over = SMOTE(sampling_strategy=0.1)  # oversample minority class up to a 0.1 minority-to-majority ratio
under = RandomUnderSampler(sampling_strategy=0.5)  # undersample majority class down to a 0.5 ratio
clf = XGBClassifier(random_state=42)
steps = [
    ('scaler', scaler),
    ('over', over),
    ('under', under),
    ('clf', clf)
]
pipeline = Pipeline(steps=steps)

# evaluate pipeline
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=42)
scores = cross_val_score(pipeline, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
print('Mean roc auc:', scores.mean())
Mean roc auc: 0.7190005023445788

Note that the evaluation metric was the Area Under the ROC Curve (AUC) as this would be the metric used in the Kaggle competition.

I selected the XGBClassifier model and ran GridSearchCV over the following parameters:

# define parameters for GridSearchCV
parameters = {
    'over__sampling_strategy': [0.2, 0.4, 0.7],
    'under__sampling_strategy': [0.4, 0.6, 0.8],
    'clf__n_estimators': [50, 100, 150, 300],
    'clf__max_depth': [50, 150, 300],
    'clf__learning_rate': [0.001, 0.01, 0.1],
    'clf__booster': ('gbtree', 'gblinear', 'dart'),
}
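The grid search was run over the pipeline defined earlier; a sketch of the call under those assumptions (the cv settings here are illustrative):

from sklearn.model_selection import GridSearchCV, StratifiedKFold

# exhaustive search over the parameter grid, scored by ROC AUC
grid_search = GridSearchCV(
    pipeline,
    param_grid=parameters,
    scoring='roc_auc',
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    n_jobs=-1
)
grid_search.fit(X, y)
print(grid_search.best_params_)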

This was particularly computationally demanding for my laptop and took over 4 hours to run. These were the best parameters:

{'clf__booster': 'gbtree',
'clf__learning_rate': 0.1,
'clf__max_depth': 150,
'clf__n_estimators': 100,
'over__sampling_strategy': 0.4,
'under__sampling_strategy': 0.4}

I could have spent more time fine-tuning the hyperparameters, but I decided to stick with these ones and extend the lifespan of my MacBook!

With these parameters, I trained the model on the training set and made predictions for the validation set.

from sklearn.metrics import roc_auc_score

# define XGBClassifier pipeline and tweak sampling parameters manually
scaler = StandardScaler()
over = SMOTE(sampling_strategy=0.4)  # oversample minority class up to a 0.4 minority-to-majority ratio
under = RandomUnderSampler(sampling_strategy=0.6)  # undersample majority class down to a 0.6 ratio
clf = XGBClassifier(booster='gbtree', learning_rate=0.1, n_estimators=100, random_state=42)
steps = [
    ('scaler', scaler),
    ('over', over),
    ('under', under),
    ('clf', clf)
]
pipeline = Pipeline(steps=steps)

# fit pipeline on train set
pipeline.fit(X_train, y_train)
# make predictions
y_preds = pipeline.predict_proba(X_val)
# score
score = roc_auc_score(y_val, y_preds[:, 1])
print("Area Under the Receiver Operating Characteristic Curve (ROC AUC) from prediction score is {}".format(score))
Area Under the Receiver Operating Characteristic Curve (ROC AUC) from prediction score is 0.7246471020853598

Note I manually tweaked the oversampling and undersampling proportions.

Both train and validation sets are highly skewed towards the majority class.

from collections import Counter

# summarise class distribution in the train set
counter = Counter(y_train)
print(counter)
"1:{} ratio".format(int(round((counter[0] / counter[1]), 0)))
Counter({0: 13869, 1: 194})
'1:71 ratio'

# summarise class distribution in the validation set
counter = Counter(y_val)
print(counter)
"1:{} ratio".format(int(round((counter[0] / counter[1]), 0)))
Counter({0: 3455, 1: 61})
'1:57 ratio'

I tried to handle this imbalance by applying oversampling and undersampling techniques. I could have further fine-tuned the hyperparameters, but that is computationally expensive for my personal laptop, so I proceeded to test my model's performance on the test set.

7. Kaggle competition

The Kaggle competition page can be found at the link below:

https://www.kaggle.com/c/udacity-arvato-identify-customers/leaderboard#score

For this competition, I needed to provide a .csv file with 2 columns:

  • ‘LNR’: which acts as an ID number for each individual in the “TEST” partition.
  • ‘RESPONSE’: some measure of how likely each individual is to become a customer. Predicting individual classes and using accuracy did not seem to be an appropriate evaluation method due to the high imbalance, so I predicted probabilities for each individual.

The preparation code is attached below:

# define XGBClassifier pipeline and tweak sampling parameters manually
scaler = StandardScaler()
over = SMOTE(sampling_strategy=0.4)  # oversample minority class up to a 0.4 minority-to-majority ratio
under = RandomUnderSampler(sampling_strategy=0.6)  # undersample majority class down to a 0.6 ratio
clf = XGBClassifier(booster='gbtree', learning_rate=0.1, n_estimators=100, random_state=42)
steps = [
    ('scaler', scaler),
    ('over', over),
    ('under', under),
    ('clf', clf)
]
pipeline = Pipeline(steps=steps)

# fit pipeline on full train set
pipeline.fit(X, y)
# make predictions on the test set
y_preds = pipeline.predict_proba(preprocessed_mailout_test.drop('LNR', axis=1))
# put preds together with LNR
predictions = pd.DataFrame(columns=['LNR', 'RESPONSE'])
predictions['LNR'] = LNR_col
predictions['RESPONSE'] = y_preds[:, 1]
predictions.head()
# convert to .csv file
predictions.to_csv('kaggle_bg.csv', index=False)

My position on the Kaggle leaderboard was 198, with a ROC AUC score of 0.73226. Not very impressive, but expected given the small amount of time that I spent tuning the hyperparameters. It is acceptable for the purpose of learning, though.

8. Take-aways

The Udacity Nanodegree Capstone Project was challenging for me. I had never had to deal with a dataset of 350+ columns in a customer segmentation problem, so finding a way to reduce the dimensionality and investigate the clusters was tricky and exciting at the same time.

The popular saying that "a Data Scientist spends 80% of their time cleaning and understanding the data" proved true for me in this project.

An eye for detail is important, for example, to find unexpected values in the features, like the unknown values that had been mapped to a category.

From my experience in previous projects, dealing with a highly imbalanced dataset in a classification problem is not a walk in the park and this project was not an exception.
