Segmentation - A Few Practical Considerations

October 26, 2014

Segmentation is a fundamental basis for understanding customers and gaining insight into their motivations and interests. The quest for knowledge in any intelligent being is really about finding an underlying basis for grouping objects and studying their characteristics. No knowledge could have been built by studying every object as unique. In fact, segmenting objects is instinctive for any curious mind.

I would like to restrict this article to the practical considerations involved in multivariate statistical methods for cluster analysis, rather than the conceptual considerations faced by marketing researchers.

1. To Standardize or Not: I have come across many researchers at either extreme: always standardizing all the variables, or always leaving all of them unstandardized. Neither rule of thumb is a wise strategy. Changing scales through standardization alters the distances between cases and reduces the variability among clusters. For example, the age of respondents in years will have a high mean and a wide dispersion, whereas wrist circumference will have a much lower mean and a very low variance. Standardization affects the separation power because the variables that dominate the separation of clusters by definition have higher variance, which is reflected in their F statistics; putting every variable on the same scale reduces that separation. On the other hand, when you leave the variables unstandardized, a few variables measured on larger scales with higher variance dominate the entire clustering. In such cases it is advisable to run the cluster analysis both ways and look at the differences in the results. In my experience, a spread in the ranges of the scales of up to 10 times the smallest scale is acceptable. Many times it is a good idea to rescale the large-scale variables to an appropriate scale instead of standardizing them to z-scores. SPSS does not allow selective treatment of the variables; it standardizes either all of them or none, so any different treatment of different variables has to be done manually. Ultimately, the choice depends on what we know and understand about the context and on how we plan to use the segmentation.
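To make the "run it both ways" advice concrete, here is a minimal sketch in plain Python. The six respondents, the variable names and the choice of dividing age by 10 are all made up for illustration; they are assumptions, not data from any study. The sketch only asks which respondent is closest to the first one under each treatment, since a change in nearest neighbours is what eventually changes the clusters.

```python
import math
import statistics

# Hypothetical respondents: (age in years, wrist circumference in cm).
cases = [(25, 15.0), (33, 15.2), (27, 18.5), (60, 16.0), (45, 17.0), (52, 16.5)]

def euclidean(a, b):
    """Euclidean distance between two cases."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def zscore_columns(rows):
    """Standardize every variable to mean 0 and (population) SD 1."""
    cols = list(zip(*rows))
    means = [statistics.mean(c) for c in cols]
    sds = [statistics.pstdev(c) for c in cols]
    return [tuple((v - m) / s for v, m, s in zip(row, means, sds)) for row in rows]

def nearest_to_first(rows):
    """Index of the case closest to case 0."""
    return min(range(1, len(rows)), key=lambda i: euclidean(rows[0], rows[i]))

# Treatment A: leave everything unstandardized (age dominates the distance).
raw_nearest = nearest_to_first(cases)

# Treatment B: z-score every variable (all-or-none standardization).
z_nearest = nearest_to_first(zscore_columns(cases))

# Treatment C: selective, manual rescaling of the large-scale variable only.
# Dividing age by 10 is an arbitrary illustrative choice.
rescaled = [(age / 10, wrist) for age, wrist in cases]
rescaled_nearest = nearest_to_first(rescaled)

print(raw_nearest, z_nearest, rescaled_nearest)
```

With these made-up numbers the unstandardized data pairs the first respondent with the one closest in age, while both rescaled versions pair it with the one closest in wrist size. Neither answer is "right"; the point is only that the treatment of scales is itself a modelling decision.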

2. Using Cluster Analysis to Identify Relevant Variables: It is dangerous to use cluster analysis to identify the relevant variables, because it has no mechanism for doing so. SPSS offers an option for step-by-step inclusion of variables, which prompts many to use it for this purpose. But the formation of clusters depends on the variables included: adding or removing a variable changes the underlying structure, so there is no way to do a marginal analysis by adding or subtracting variables. Throwing every available metric variable into the cluster analysis and then trying to separate the relevant from the irrelevant is therefore incorrect. Which variables to include or exclude should be decided on conceptual grounds.
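A small sketch of why marginal analysis does not work. The `kmeans` function below is a bare-bones version of Lloyd's algorithm written for this illustration (it is not what SPSS runs), and the four data points and starting centroids are hypothetical. Adding one variable does not refine the earlier solution; it produces a different partition altogether.

```python
def kmeans(points, centroids, iterations=20):
    """Plain Lloyd's algorithm with caller-supplied starting centroids."""
    labels = []
    for _ in range(iterations):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        labels = [
            min(range(len(centroids)),
                key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])))
            for point in points
        ]
        # Recompute each centroid as the mean of its members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[k])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels

# Hypothetical cases measured on one variable, then on two.
one_var = [(1,), (2,), (8,), (9,)]
two_vars = [(1, 0), (2, 10), (8, 0), (9, 10)]

# Start from the first and last case in both runs so only the variables differ.
labels_one = kmeans(one_var, [one_var[0], one_var[-1]])
labels_two = kmeans(two_vars, [two_vars[0], two_vars[-1]])

print(labels_one)
print(labels_two)
```

On the first variable alone the cases split into low and high values. Once the second variable is included, the same cases regroup along that variable instead, so the memberships from the two runs cannot be compared as "before and after adding a variable".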

3. Correlated variables: If you believe that certain variables are definite differentiators of clusters, you can consider including the same variable more than once to increase its weight in determining the clusters. Highly correlated variables have a similar effect unintentionally: they act like near-copies of one another, so the dimension they share gets extra weight. It is prudent to cluster on factor scores, particularly when there are far too many variables relative to observations, but it may be equally acceptable to run the factor analysis and then manually drop or aggregate some variables.
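The weighting effect of repeating a variable can be checked with simple arithmetic. In the sketch below (two hypothetical cases, squared Euclidean distance assumed as the dissimilarity measure), entering the first variable twice gives exactly the same distance as counting its squared difference twice, which is in turn the same as multiplying that variable by the square root of 2.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two cases."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Two hypothetical cases measured on two variables.
a = (3.0, 10.0)
b = (5.0, 14.0)

# Option 1: enter the first variable twice.
a_dup = (a[0], a[0], a[1])
b_dup = (b[0], b[0], b[1])
d_dup = sq_dist(a_dup, b_dup)

# Option 2: give the first variable an explicit weight of 2.
d_weighted = 2 * (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

# Option 3: multiply the first variable by sqrt(2) before clustering.
root2 = math.sqrt(2)
d_scaled = sq_dist((a[0] * root2, a[1]), (b[0] * root2, b[1]))

print(d_dup, d_weighted, d_scaled)
```

Seen this way, repeating a variable is just one way of choosing its scale, so the same care applies as in the standardization question.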

4. Choosing the Data Model: Statistical methods are generally developed on the assumption that the variables are normally distributed. In practice, variables may not be distributed that way. Most of the time they are mixtures of distributions, particularly ratio-scaled variables spread over a large range of values, which may appear to be a combination of normal and exponential. In such a case, consider removing some of the extreme values, not exceeding 5% of the total data points, and analysing them separately instead of including them in the clustering. Even standardizing to z-scores may not help. There are certain non-distance-based algorithms that may be more suitable for such data, but I shall address them in a separate article.
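One way to set the extreme values aside is sketched below. The function name `split_extremes`, the `threshold` argument and the spend figures are all hypothetical; how you decide what counts as "extreme" is a judgment call that the code does not make for you. The only rule it enforces is the cap of 5% of the data points.

```python
def split_extremes(values, threshold, max_percent=5):
    """Set aside values above `threshold`, but never more than max_percent of the data."""
    cap = len(values) * max_percent // 100        # most cases we may remove
    largest = sorted(values, reverse=True)[:cap]  # candidates: the top of the range
    extremes = [v for v in largest if v > threshold]
    core = sorted(values)[: len(values) - len(extremes)]
    return core, extremes

# Hypothetical monthly spend for 20 customers, with one very large value.
spend = [12, 15, 14, 18, 20, 17, 16, 13, 19, 21,
         15, 14, 18, 17, 16, 22, 13, 19, 20, 950]

core, extremes = split_extremes(spend, threshold=100)

print(len(core), extremes)
```

The `core` values would go into the clustering, and the `extremes` would be profiled separately as their own group.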

There are many more considerations, such as the treatment of rank-order data, binary variables and categorical variables, and the choice of algorithm. I shall write about them separately, when I will also discuss which clustering method to choose. Please feel free to share your experiences and considerations.
