Spotfire Tips & Tricks: Clustering made simple with Spotfire

Last updated: 1:21pm Oct 24, 2017

Introduction

Data clustering is the process of grouping items based on their similarities. Clustering can be used for data compression, data mining, pattern recognition, and machine learning. Example applications include segmenting consumers into market segments, classifying manufactured units by their failure signatures, identifying financial crime hot spots, and identifying regions with similar geographical characteristics. Once clusters are defined, the next step may be to build a predictive model. Spotfire makes it easy to perform clustering with two popular, user-friendly, out-of-the-box methods:

  1. K-means Clustering
  2. Hierarchical Clustering

A data function with additional capabilities is also introduced.

K-means Clustering

The k-means method is a popular and simple approach to clustering, and Spotfire line charts help you visualize the data before performing the calculation. To perform k-means clustering, first create a line chart in which each line is an element you would like to cluster, such as a customer ID, store ID, region, village, well, or wafer. Next, select multiple attributes on the Y-axis; these can share one scale or use multiple scales. Select [Column Names] as the X-axis so that the multiple Y-axis attributes are represented as points along each line. It is important to note that null values are not used in the clustering calculation. The Spotfire data panel lets you view, and even replace, null values before clustering.
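Spotfire handles null replacement interactively in the data panel; purely as an illustration of one possible substitution strategy (not Spotfire's actual behavior), here is a small Python sketch that replaces each null with the mean of that row's non-null values:

```python
def replace_nulls_with_row_mean(rows):
    """Illustrative null handling: replace None in each row with the mean
    of that row's non-null values (one of several reasonable strategies)."""
    cleaned = []
    for row in rows:
        present = [v for v in row if v is not None]
        mean = sum(present) / len(present) if present else 0.0
        cleaned.append([mean if v is None else v for v in row])
    return cleaned

rows = [[1.0, None, 3.0], [4.0, 5.0, None]]
print(replace_nulls_with_row_mean(rows))  # [[1.0, 2.0, 3.0], [4.0, 5.0, 4.5]]
```

Whether substitution is appropriate at all depends on the data; simply excluding nulls, as Spotfire does by default, is often the safer choice.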

Once your line chart is ready, select Tools > K-means Clustering.

K-means Clustering

You can select the distance measure and the number of clusters as inputs to the calculation. It is also possible to update an existing calculation.

K-means Distance Measure

Here is a quick explanation of the distance measure options:
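To make the distinction concrete, here is a small Python sketch (an illustration, not Spotfire's implementation) comparing Euclidean distance, which reacts to differences in magnitude, with correlation distance, which reacts only to differences in shape:

```python
import math

def euclidean(a, b):
    # Straight-line distance: sensitive to both shape and magnitude.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def correlation_distance(a, b):
    # 1 - Pearson correlation: near zero when two lines have the same
    # shape, even if their magnitudes differ greatly.
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return 1 - cov / (sa * sb)

line1 = [1.0, 2.0, 3.0, 4.0]
line2 = [10.0, 20.0, 30.0, 40.0]  # same shape, 10x the magnitude

print(euclidean(line1, line2))             # large: magnitudes differ
print(correlation_distance(line1, line2))  # near 0: shapes match
```

Which measure is appropriate depends on whether lines with the same trend but different levels should end up in the same cluster.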

Distance Measure explained

The result of this calculation is a categorical column that automatically assigns each line to a cluster group and displays each group in a separate trellis panel in the resulting clustered line chart. It is important to note that this new categorical column, called “K-means Clustering”, can be updated, and the clustered line chart makes the results easy to interpret.
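For intuition about what the calculation does under the hood, here is a minimal, self-contained k-means sketch in Python (a simplified illustration, not Spotfire's implementation), where each row of numbers plays the role of one line in the chart:

```python
import random

def kmeans(rows, k, iters=20, seed=0):
    """Minimal k-means: assign each row (a list of numbers) to one of k
    clusters using squared Euclidean distance."""
    rnd = random.Random(seed)
    centers = rnd.sample(rows, k)
    labels = [0] * len(rows)
    for _ in range(iters):
        # Assignment step: each row goes to its nearest center.
        labels = [min(range(k),
                      key=lambda c: sum((x - m) ** 2
                                        for x, m in zip(row, centers[c])))
                  for row in rows]
        # Update step: each center moves to the mean of its assigned rows.
        for c in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Two obviously separated groups of rows.
rows = [[1, 2], [1.1, 1.9], [0.9, 2.1], [8, 9], [8.1, 9.2], [7.9, 8.8]]
labels = kmeans(rows, k=2)
print(labels)  # rows 0-2 share one label, rows 3-5 the other
```

The returned labels correspond to the categorical cluster column Spotfire adds; in Spotfire the grouping then drives the trellised line chart.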

K-means Result

Hierarchical Clustering

Hierarchical clustering arranges items in a hierarchy with a treelike structure based on the distance or similarity between them. The graphical representation of the resulting hierarchy is a tree-structured graph called a dendrogram. In Spotfire, hierarchical clustering and dendrograms are strongly connected to heat map visualizations, which provide visual insight into the data and ease of interpretation.

The algorithm used for hierarchical clustering in Spotfire is a hierarchical agglomerative method. For row clustering, the analysis begins with each row placed in a separate cluster. The distance between all possible pairs of rows is then calculated using a selected distance measure, and the two most similar clusters are grouped together to form a new cluster. In subsequent steps, the distance between the new cluster and all remaining clusters is recalculated using a selected clustering method, so the number of clusters is reduced by one in each iteration. Eventually, all rows are grouped into one large cluster. The order of the rows in the dendrogram is defined by the selected ordering weight. Column clustering works the same way.
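The agglomerative steps described above can be sketched in a few lines of Python (single linkage is shown here for concreteness; Spotfire offers several clustering methods, and this is an illustration rather than its implementation):

```python
def agglomerative(points, target_clusters=1):
    """Agglomerative sketch: start with each row in its own cluster and
    repeatedly merge the two closest clusters until one remains."""
    clusters = [[i] for i in range(len(points))]

    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    def cluster_dist(ca, cb):
        # Single linkage: distance between the two closest members.
        return min(dist(points[i], points[j]) for i in ca for j in cb)

    merges = []  # the merge sequence is exactly what a dendrogram draws
    while len(clusters) > target_clusters:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda p: cluster_dist(clusters[p[0]], clusters[p[1]]))
        merges.append((clusters[i], clusters[j]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges

points = [[0.0], [0.1], [1.0], [5.0]]
for a, b in agglomerative(points):
    print(a, b)  # closest pairs merge first, the outlier row joins last
```

Reading the merge list from first to last reproduces the dendrogram bottom-up: early merges sit near the leaves, and the final merge is the root.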

The Spotfire User Guide provides details about the many distance measures and clustering methods that can be used in the calculation.

Hierarchical Clustering
Clustering Methods

The hierarchical clustering calculation results in a heat map visualization with the specified dendrograms. A cluster column is also added to the data table and made available in the filters panel.

Clustering with Variable Importance Data Function

This data function accepts an input table with numeric columns and uses k-means clustering to find groups of rows that belong to clusters. Next, a Random Forest model is built to determine which variables are most influential in determining the clusters. The two most influential variables are returned and can be plotted on a scatter plot. If a logarithmic transform is appropriate, it is applied prior to the clustering and variable importance calculation. This data function is available on the TIBCO Community Exchange from this link.
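As a rough illustration of the idea (not the actual data function, which uses a Random Forest), the Python sketch below scores each column by the ratio of between-cluster variance to total variance after clustering, a much simpler stand-in for variable importance:

```python
def variable_importance(rows, labels):
    """Score each column by how well it separates the given clusters:
    between-cluster variance divided by total variance (a simplified
    stand-in for the data function's Random Forest importance)."""
    n_cols = len(rows[0])
    scores = []
    for c in range(n_cols):
        col = [r[c] for r in rows]
        mean = sum(col) / len(col)
        total = sum((v - mean) ** 2 for v in col)
        between = 0.0
        for lab in set(labels):
            group = [v for v, g in zip(col, labels) if g == lab]
            gmean = sum(group) / len(group)
            between += len(group) * (gmean - mean) ** 2
        scores.append(between / total if total else 0.0)
    return scores

# Column 0 separates the two clusters; column 1 is unrelated noise.
rows = [[0.0, 5.0], [0.2, 1.0], [0.1, 9.0],
        [10.0, 2.0], [10.2, 8.0], [9.9, 4.0]]
labels = [0, 0, 0, 1, 1, 1]
scores = variable_importance(rows, labels)
print(scores)  # column 0 scores near 1.0, column 1 much lower
```

A Random Forest captures non-linear and interaction effects that this variance ratio misses, which is why the actual data function uses one; the sketch only conveys the shape of the workflow: cluster first, then ask which columns explain the clusters.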