Using Time Series to Understand Customer Engagement
When consumers engage with an organization — whether it be browsing a retailer’s website to buy products, or using electricity from a utility provider — over time people tend to exhibit standard patterns of behavior. Understanding these patterns can help better serve customer needs or identify interesting anomalies. In this blog, we’ll explain how TIBCO uses a number of different statistical and machine learning techniques to understand patterns of electricity consumption. We’ll discuss how time series analysis can be applied in a variety of scenarios across industries to understand customer behavior.
Part 1: Presenting insights with interactive TIBCO Spotfire dashboards
For this project, we built four Spotfire dashboards to showcase the analytical findings produced in the background by TIBCO’s Data Science platform. Let’s go through them one by one.
Screenshot: Spotfire Interactive Dashboard - Data Summary
In the first dashboard, Data Summary, the summary statistics table tells us that our dataset contains hourly electricity consumption records from 228 unique devices over seven days, Monday to Sunday. A bar chart displays the distribution of the device installation data, and a pie chart shows the number of devices of each device type. In the heat map on the right side, each row shows the consumption of one device. Blue represents an increasing trend in electricity usage and orange a decreasing trend, which allows us to observe seasonality.
Screenshot: Spotfire Interactive Dashboard - Time Series Summary
Moving on to the Time Series Summary dashboard, we can drill down into individual devices to understand our data a little more. The user can select a device ID from the text area at the top and inspect that device’s time series data. The dashboard then presents the results of four types of analysis.
Note that the following analysis uses the first order difference of the time series: at each timestamp, the value is replaced by the difference between it and the value at the previous timestamp. This is useful for converting a non-stationary time series to a stationary form.
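To make the differencing step concrete, here is a minimal Python sketch. The function name and the sample readings are hypothetical and chosen only for illustration; the project itself computes this inside TIBCO Data Science.

```python
def first_order_difference(values):
    """Return the differences between consecutive observations."""
    # The first observation has no predecessor, so the result is one element shorter.
    return [curr - prev for prev, curr in zip(values, values[1:])]


# Hypothetical hourly readings with a steady upward drift.
hourly_kwh = [10, 12, 15, 19, 24]
diffed = first_order_difference(hourly_kwh)
print(diffed)  # [2, 3, 4, 5]
```

The drift in the raw readings disappears from the level of the differenced series, which is why differencing is a common first step toward stationarity.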
On the left side, we can see the distribution of the electricity usage values, which for this selection can be treated as a normal distribution. We also plot sample auto-correlation functions, which can be used to explore the relationship between values at different time lags. In a time series, values may be influenced by prior values going back in time, and the auto-correlation functions tell us how far back in time we have to go before those influences are no longer significant.
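As an illustration of what an auto-correlation plot computes, the sketch below implements the standard sample auto-correlation estimator in plain Python. The function name and the toy series are assumptions for this example, and the significance threshold of 1.96 divided by the square root of n is a common rule of thumb, not necessarily what Spotfire draws.

```python
import math


def sample_acf(values, max_lag):
    """Sample auto-correlation for lags 0..max_lag."""
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]
    variance = sum(c * c for c in centered)
    if variance == 0:
        raise ValueError("a constant series has no defined auto-correlation")
    acf = []
    for lag in range(max_lag + 1):
        # Pair each value with the value `lag` steps earlier.
        covariance = sum(centered[t] * centered[t - lag] for t in range(lag, n))
        acf.append(covariance / variance)
    return acf


# Hypothetical series that repeats every 4 steps.
series = [1, 3, 1, -5] * 6
acf = sample_acf(series, max_lag=4)

# Rule-of-thumb bound: lags with |acf| above this are treated as significant.
threshold = 1.96 / math.sqrt(len(series))
significant_lags = [lag for lag, r in enumerate(acf) if lag > 0 and abs(r) > threshold]
print(significant_lags)
```

In this toy series the strongest non-zero lag is 4, matching the length of the repeating pattern. For hourly consumption data, one would look for similar spikes at lags such as 24.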
On the right side, the line chart “First Order Difference Time Series of Device ID” shows the first order difference time series over the entire timeline, and we can use the built-in forecast function in Spotfire to check predicted future values. Then, in the time series decomposition charts below, the results show clear patterns of trend, seasonality and remainder, which help us better understand the data and are also useful if we decide to build forecasting models.
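For readers who want to see how a series can be split into trend, seasonality and remainder, here is a sketch of a classical additive decomposition based on a centered moving average. This is a simple stand-in chosen for illustration. The decomposition method used in the dashboard is not specified here and may differ. The function name, the period and the synthetic series are all assumptions.

```python
def decompose_additive(values, period):
    """Split a series into trend, seasonal and remainder parts (additive model)."""
    n = len(values)
    if n < 2 * period:
        raise ValueError("need at least two full periods of data")
    half = period // 2

    # Trend: centered moving average spanning one full period.
    trend = [None] * n
    for t in range(half, n - half):
        window = values[t - half : t + half + 1]
        total = sum(window)
        if period % 2 == 0:
            # Even period: the window holds period + 1 points, so the two
            # end points each get half weight.
            total -= 0.5 * (window[0] + window[-1])
        trend[t] = total / period

    # Seasonal: average detrended value at each position within the period.
    by_phase = [[] for _ in range(period)]
    for t in range(n):
        if trend[t] is not None:
            by_phase[t % period].append(values[t] - trend[t])
    seasonal = [sum(vals) / len(vals) for vals in by_phase]
    offset = sum(seasonal) / period
    seasonal = [s - offset for s in seasonal]  # force the pattern to average zero

    # Remainder: whatever trend and seasonality do not explain.
    remainder = [
        None if trend[t] is None else values[t] - trend[t] - seasonal[t % period]
        for t in range(n)
    ]
    return trend, seasonal, remainder


# Synthetic series: a linear trend plus a pattern that repeats every 4 steps.
pattern = [2.0, -1.0, -1.0, 0.0]
series = [0.5 * t + pattern[t % 4] for t in range(16)]
trend, seasonal, remainder = decompose_additive(series, period=4)
print(seasonal)
```

Because the synthetic series is exactly trend plus pattern, the recovered seasonal component matches the pattern and the remainder is zero wherever the trend is defined. Real consumption data would leave a non-trivial remainder.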
Screenshot: Spotfire Interactive Dashboard - Exploratory Data Analysis
Next, on the Exploratory Data Analysis dashboard, we start to apply clustering to our data. We can apply hierarchical clustering across all devices, a non-parametric method, to check whether some devices are highly correlated with each other. It produces clusters at several levels and allows the user to explore the different partitions at each level. But we’d also like to try a simpler approach, with a single set of distinct groupings. For this, it’s better to use k-means clustering, a popular clustering algorithm that categorizes observations into “K” clusters. The goal of the k-means clustering algorithm is to find the cluster allocation such that the observations within the same cluster are as similar as possible, whereas the observations from different clusters are as dissimilar as possible. Each cluster is represented by its centroid, which is the mean of the points within the cluster.
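The idea behind k-means can be sketched in a few lines of plain Python using Lloyd's algorithm: assign every point to its nearest centroid, recompute each centroid as the mean of its members, and repeat until the assignments stop changing. This is only an illustrative sketch, not the TIBCO Data Science operator. The function names, the random initialization and the tiny usage profiles are assumptions.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(points):
    return [sum(dim) / len(points) for dim in zip(*points)]


def k_means(points, k, max_iter=100, seed=0):
    """Lloyd's algorithm: returns (labels, centroids)."""
    rng = random.Random(seed)
    # Start from k distinct observations picked at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = mean_vector(members)
    return labels, centroids


# Hypothetical normalized usage profiles (morning, midday, evening).
profiles = [
    [0.1, 0.9, 0.1], [0.2, 1.0, 0.0], [0.0, 0.8, 0.2],  # midday-heavy devices
    [0.9, 0.1, 0.9], [1.0, 0.0, 1.0], [0.8, 0.2, 0.8],  # morning/evening devices
]
labels, centroids = k_means(profiles, k=2)
print(labels)
```

The cluster numbers themselves are arbitrary. What matters is that the three midday-heavy profiles share one label and the three morning/evening profiles share the other.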
First, we have to identify the right number of clusters. There is often no ‘best’ number, but there are statistics that can tell us which values of K produce clusters that are better separated from each other while more tightly bound internally. On the right-hand side of the Exploratory Data Analysis dashboard, we have developed a k-means evaluator that takes the minimum and maximum values of K as parameters and returns plots of the Dunn Index and the Within-Cluster Sum of Squared Error (WSSE) over different numbers of clusters. From the WSSE chart, we can observe the elbow, which indicates diminishing returns in terms of tightly bound clusters. The elbow here occurs when the number of clusters is 5. Meanwhile, the Dunn Index is a ratio that simultaneously measures how separated and how tightly bound the clusters are, and it gives us a reasonably high value at 5 before dropping off sharply. Together, these results imply that five clusters should be a good choice for our dataset.
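Both metrics are straightforward to compute by hand, as the sketch below shows. The Dunn Index has several variants. This example uses the smallest distance between points in different clusters divided by the largest distance between points in the same cluster, which may not match the exact definition used by the k-means Cluster Evaluator operator. The function names, the points and the two candidate labelings are hypothetical.

```python
import math
from itertools import combinations


def group_by_label(points, labels):
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    return clusters


def wsse(points, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    total = 0.0
    for members in group_by_label(points, labels).values():
        centroid = [sum(dim) / len(members) for dim in zip(*members)]
        total += sum(math.dist(p, centroid) ** 2 for p in members)
    return total


def dunn_index(points, labels):
    """Minimum between-cluster distance over maximum within-cluster diameter.

    Requires at least two clusters.
    """
    clusters = list(group_by_label(points, labels).values())
    max_diameter = max(
        (math.dist(a, b) for members in clusters for a, b in combinations(members, 2)),
        default=0.0,
    )
    min_separation = min(
        math.dist(a, b)
        for c1, c2 in combinations(clusters, 2)
        for a in c1
        for b in c2
    )
    if max_diameter == 0:
        return math.inf  # every cluster is a single point
    return min_separation / max_diameter


# Three visually obvious groups of points.
points = [(0, 0), (0, 1), (1, 0), (10, 0), (10, 1), (11, 0), (0, 10), (0, 11), (1, 10)]
labels_k2 = [0, 0, 0, 1, 1, 1, 0, 0, 0]  # two groups merged into one cluster
labels_k3 = [0, 0, 0, 1, 1, 1, 2, 2, 2]  # one cluster per group

for name, labels in [("K=2", labels_k2), ("K=3", labels_k3)]:
    print(name, round(wsse(points, labels), 3), round(dunn_index(points, labels), 3))
```

Moving from two to three clusters makes the WSSE fall sharply and the Dunn Index rise, which is the same kind of signal we read from the evaluator charts when choosing K.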
Screenshot: Spotfire Interactive Dashboard - k-means
In the final dashboard, we apply k-means clustering to the entire dataset with five clusters, and we are then ready to dig into the clustering results. The most interesting part is the line chart at the bottom, “Original Consumption Time Series for Each Cluster”, which plots the original consumption values (y-axis) over the time period (x-axis) and reveals the natural behavior of each cluster. Since this is an interactive dashboard, we can select a cluster from the bar chart and observe the consumption patterns of the corresponding time series. Cluster 1 has very consistent peaks in the morning and evening, although its overall usage is not very high, so it may be a group of households. Cluster 2 maintains a very high usage level from 8am to 9pm on weekdays but almost zero usage on weekends, so we can guess that this group consists of offices that open on weekdays and close on weekends. Cluster 3 shows a pattern similar to cluster 2, except that there is active usage on Saturday, so these are perhaps restaurants or shops that open from Monday to Saturday. Cluster 4 is interesting: these users are only active late at night on Thursday, Friday, Saturday and Sunday, so they may be restaurants or nightclubs. Lastly, cluster 0 seems to contain users with relatively low usage or even inactive users, so we may treat this cluster as a special group for further analysis in the future.
These dashboards contain intuitive visualizations derived from complex statistical and time series analysis of very large datasets. You may wonder where these results come from, so let’s look at what we have done in TIBCO Data Science.
Part 2: Training machine learning models in TIBCO Data Science
TIBCO Data Science is a collaborative, web-based analytical platform that lets not only data scientists but also data engineers, business users and developers deliver end-to-end data science solutions. There we can create visual analytics workflows using a wide range of powerful drag-and-drop operators. In this project, we have created three workflows, as explained below.
Screenshot: TIBCO Data Science Visual Analytical Workflow - Data Manipulation
The data manipulation workflow is the starting point. We import electricity consumption data and device information data, then join them by device ID and do necessary data cleaning and preparation work, including null value replacement, row filtering, normalization by device, pivoting, etc. We also use a window function to calculate first order difference time series, and the variable operator to run Spark SQL commands.
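The preparation steps above can be sketched outside the platform in plain Python. This is an illustrative stand-in for the visual workflow, not a reproduction of it. The record layout, the device IDs, the choice to fill nulls with the device mean and the use of z-score normalization are all assumptions, since the workflow's exact settings are not described here.

```python
from statistics import mean, pstdev

# Hypothetical long-format records: (device_id, hour, kWh), with one missing value.
records = [
    ("dev_a", 0, 1.0), ("dev_a", 1, 3.0), ("dev_a", 2, None), ("dev_a", 3, 5.0),
    ("dev_b", 0, 10.0), ("dev_b", 1, 10.0), ("dev_b", 2, 20.0), ("dev_b", 3, 20.0),
]


def pivot_by_device(rows):
    """Turn long rows into {device_id: [values ordered by hour]}."""
    wide = {}
    for device_id, _hour, value in sorted(rows, key=lambda r: (r[0], r[1])):
        wide.setdefault(device_id, []).append(value)
    return wide


def fill_nulls(values):
    """Replace missing values with the device's mean (an assumed strategy)."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def z_score(values):
    """Normalize one device's series to zero mean and unit variance."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def lag_difference(values):
    """First order difference, the same result a lag-1 window function gives."""
    return [curr - prev for prev, curr in zip(values, values[1:])]


prepared = {
    device_id: lag_difference(z_score(fill_nulls(values)))
    for device_id, values in pivot_by_device(records).items()
}
print(prepared["dev_b"])  # [0.0, 2.0, 0.0]
```

Normalizing per device before clustering puts a small household and a large office on the same scale, so the clusters reflect the shape of the usage pattern and not its absolute level.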
Screenshot: TIBCO Data Science Visual Analytical Workflow - Exploratory Data Analysis
Next, we have the exploratory data analysis workflow that contains the key k-means Cluster Evaluator operator. Recall from Part 1 that we needed two common metrics to evaluate the clustering results — the Dunn Index and Within-Cluster Sum of Squared Error (WSSE). In TIBCO Data Science, users can simply use the k-means Cluster Evaluator operator to compute these two metrics from their datasets.
Screenshot: TIBCO Data Science Visual Analytical Workflow - k-means
In the last workflow, having prepared our data and computed the ideal value of K, we simply compute the clusters on the entire dataset. For more detailed information about the k-means operator in TIBCO Data Science, please refer to the documentation.
Part 3: Connecting TIBCO analytics together with the Big Data Function
How do we then make TIBCO Data Science and Spotfire work together? The key component here is the Spotfire Data Function for Team Studio.
This data function allows users to run parameterized workflows in TIBCO® Data Science Team Studio from a dashboard in Spotfire. It also provides multiple interfaces for downloading the results of the analysis to Spotfire. We cover the Big Data Function in full, including setup details, in this blog post: Accessing TIBCO® Data Science Team Studio workflow results in TIBCO Spotfire®.
In this project, we started with raw consumption data from consumer devices and used visual data science workflows to explore the data, engineer features and create a machine learning model with clustering. Finally, the processed data, time series analytics and clustering results are visualized in Spotfire to show interesting patterns of electricity consumption and actionable insights. With these insights, we can predict and plan the energy network more efficiently, offer better plans to specific groups of energy users and improve the customer experience.
Jingchuan Lin, Data Scientist, TIBCO
Jingchuan Lin is a data scientist currently working at the TIBCO Software Singapore office, where he applies data science knowledge to provide solutions for customers from multiple APJ countries. He has developed strong analytical skills from solving real-world problems in various industries. He is passionate about learning the latest techniques in data science and discovering valuable insights from big data. Jingchuan grew up in China and went to Singapore for college. He obtained his Master’s degree in Computing and Bachelor’s degree in Business Analytics from the National University of Singapore.
Eric Hsu, Senior Data Scientist, TIBCO
Eric Hsu is a senior data scientist based in the TIBCO New York office. His focus has been on developing state-of-the-art analytical solutions for the TIBCO Data Science product. Prior to joining TIBCO, he pursued his Master’s in statistical science at Duke University. He also likes listening to live music, exploring restaurants and walking around neighborhoods in NYC.