A Primer for BI and IoT/IIoT and Edge Analytics
Part 1: Connecting to Data
Michael O'Connell, PhD - Chief Analytics Officer, TIBCO Software
Thomas Hill, PhD - Senior Director Analytics, TIBCO Software
Note: The components used to generate the examples in this article are attached and can be downloaded as a zip file.
The purpose of this series of short white papers, tutorials, and use cases is to provide concise overviews of practical considerations and methods commonly applied to extract actionable information from continuous or streaming process data. This area of analytics is rapidly becoming more important as streaming IoT (Internet of Things) technologies transform, disrupt, or create new businesses across virtually all industries. IDC expects worldwide spend on such applications to grow at nearly 16% annually, to reach $1.3 trillion in 2020 (IDC, 2017).
The topics that will be covered are eclectic but organized around specific issues, use cases, and applications that are currently in use or in planning. The focus is not solely on the theory, mathematics, or technologies that enable the modeling and reporting against streaming data. Instead, this review will emphasize practical considerations and proven approaches to enable meaningful and actionable systems-of-insight.
This article is the first in a series. For Part 2, see: A Primer for BI and IoT/IIoT and Edge Analytics - Part 2: Process Monitoring for Steady States: QC Charting
Before discussing in some detail practical considerations of how to connect to and organize continuous streaming data for analysis, it is important to explore the differences between data representing discrete uncorrelated observations and data collected from continuous processes.
Discrete Data, Continuous Data
The main difference between continuous streaming data and discrete or non-continuous data is that the former is indexed or augmented by a time-stamp, or a proxy of time. To understand continuous process data it is critical to retain the order of the measurements and the trends over consecutive measurements. Traditional databases of static data are typically organized and indexed by other meaningful units-of-analyses; to understand discrete non-continuous data it is critical to understand the segmentations in the data, the correlations between variables, and so on.
Discrete observations. For example, data typically collected as part of the semiconductor manufacturing process are organized around lots, wafers, and dies. Dies are the individual "chips" on the wafers, which are produced in lots of multiple wafers. An analysis of historical data may, for example, look for commonalities among wafers with specific defects, with respect to specific machines and processing steps used to produce those wafers. Each wafer would be considered a new observation. The order in which the observations (wafers) were recorded into an off-line data store with historical observations would not be relevant for the analyses, and measurements for consecutive observations are assumed to be independent of each other.
Continuous process data. Contrast this with the nature of the data describing continuous processes. For example, the manufacture of soy products is a continuous and mostly automated process where soy beans are processed through multiple steps to separate fiber and other materials from the protein. Measurements are continuously recorded. In order to identify root causes of quality problems in such data it is necessary to relate final product quality measurements to upstream process measurements. Further, consecutive measurements of parameters such as pressures or temperatures are likely highly auto-correlated, as temperatures will not change quickly from one second to the next.
Understanding and optimizing such processes requires analyzing the trends and time-lagged measurements that precede the final quality measurements, to identify common combinations and dynamic changes in process measurements that may determine product quality.
Traditionally, continuous process data are often stored in specialized or specifically indexed databases called process historians. In most manufacturing environments such data stores retain parameter measurements indexed by time and parameter names ("tags"); data-quality and other flags are also often retained. Figure 2 shows a typical graph and representation of process data recorded into a data historian.
In non-manufacturing domains, transaction data such as customer visits to a web site or calls to a call center are organized in similar ways: Consecutive individual data points are time-stamped, and the consecutive measurements are typically auto-correlated.
When analyzing continuous process data it is important to recognize and deal with the auto-correlation of consecutive data points analytically, as will be discussed in later postings. Unlike in the analysis of independent observations and discrete data, information about data values preceding (in time) some outcome or event of interest is critical. For example, if the engine in a car quits unexpectedly, then knowing that the measurements of power, torque, and fuel-flow simultaneously drop to zero is of little value. Instead, in order to understand the root causes of such a malfunction, the data must be organized to enable meaningful analyses of preceding (in time) events, conditions, and trends.
An important aspect of how to connect to continuous and streaming data, how to analyze it, and how to inform decisions based on such data is the available time-to-action: What is the mechanism or method by which information extracted from streaming data is to be converted into actions, and how much time is available to take that action?
Time-to-action may be measured in microseconds or milliseconds. For example, if results are to be fed back to an automated control system or securities trading platform, then very fast actions are essential. In other applications, time-to-action might be measured in hundreds of milliseconds, for example, when an automated system or web-service is to inform a web-based self-service application to process credit applications. The total process from receiving an application to adjudication should not require more than a few seconds, or else potential borrowers may change their minds. Finally, some industrial manufacturing processes are characterized by slow maturation, for example, of chemicals in reactors; in such cases, the time-to-action might be measured in hours because actions are usually taken slowly and gradually to adjust the process.
Connecting to Data: Pulling/Polling Data, Pushing Data and Event Processing
There are different ways by which real-time streaming data can be connected and processed: A computing system or platform may either poll the most recent data at specific intervals, or the arrival of new data itself in the storage platform could trigger a computing process to analyze and score the new data. The former system is essentially a batch-execution system where new data are processed at a fixed time interval; the latter is a real-time event processing system, where new-data-"events" are intercepted, triggering the computations required to, for example, compute predictions based on machine learning models.
These two systems of "real-time" data processing are very different. The polling system usually leads to predictable amounts of data to be processed at each poll or batch; for example, average sensor values could be extracted every 5 minutes in order to characterize and apply machine learning to a continuous process. Every time the system pulls a new set of 5-minute data, a single mean (median, range, etc.) could be returned for the underlying sensors, regardless of how many sensor readings have been collected.
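The polling pattern described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `aggregate_window_means`, the `(timestamp, value)` reading format, and the sample values are hypothetical, and a real deployment would typically let the data historian perform this aggregation at the point of extraction.

```python
from collections import defaultdict
from statistics import mean

WINDOW_SECONDS = 5 * 60  # 5-minute polling interval, as in the text


def aggregate_window_means(readings, window_seconds=WINDOW_SECONDS):
    """Return {window_start: mean value} for (timestamp_seconds, value) pairs."""
    buckets = defaultdict(list)
    for timestamp, value in readings:
        # Floor each timestamp to the start of the window it falls into.
        window_start = int(timestamp // window_seconds) * window_seconds
        buckets[window_start].append(value)
    # One summary value per window, however many readings it contained.
    return {start: mean(values) for start, values in sorted(buckets.items())}


# Hypothetical sensor readings: (seconds since start, temperature).
readings = [(0, 2710.0), (60, 2720.0), (299, 2730.0), (300, 2690.0), (450, 2700.0)]
window_means = aggregate_window_means(readings)
```

Each poll therefore yields the same shape of result (one value per sensor per window), which is what makes the processing load predictable.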
On the other hand, a credit scoring application that supports an on-line self-service application portal would be an example of a real-time system that pushes data to the analysis when new data (applications for credit) arrive and need to be scored. Here the data volume is not uniform over time, and the system must be "elastic", i.e., be able to allocate sufficient resources to cope with possible peaks in scoring requests, while still maintaining minimum responsiveness.
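By contrast, a push or event-driven design can be sketched as a handler that runs whenever a new record arrives. The `EventBus` class, the `score_application` rule, and its 0.4 debt-to-income cut-off are hypothetical stand-ins for illustration only; they are not part of StreamBase or any other product mentioned here, and a production system would call a trained model rather than a fixed rule.

```python
class EventBus:
    """Minimal in-process stand-in for an event-processing engine."""

    def __init__(self):
        self._handlers = []

    def subscribe(self, handler):
        """Register a function to be called for every new event."""
        self._handlers.append(handler)

    def publish(self, event):
        # The arrival of the event itself triggers the computation.
        return [handler(event) for handler in self._handlers]


def score_application(application):
    """Hypothetical scoring rule based on a debt-to-income ratio."""
    ratio = application["debt"] / application["income"]
    return {"id": application["id"], "approved": ratio < 0.4}


bus = EventBus()
bus.subscribe(score_application)

# Applications arrive at irregular times; each one is scored immediately.
first = bus.publish({"id": "A-1", "income": 50000, "debt": 10000})
second = bus.publish({"id": "A-2", "income": 40000, "debt": 30000})
```

Because work is triggered per event rather than per interval, the load follows the arrival rate, which is why such systems need to scale elastically.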
Use Case Considerations
The terms "real-time", "IoT", or "streaming data" can imply very diverse requirements with respect to the time-to-action, and the respective architecture of how specifically to connect to data. If the requirement is to act automatically on new data as soon as it is collected then a push- or event-based architecture is required. Examples of such use cases would be trading desks where automated trading algorithms must respond very fast to new data in order to take advantage of opportunities for profitable trades as soon as possible, and faster than competing traders and their algorithms.
In manufacturing, automated closed-loop control systems respond in real-time to data as they are collected. Interactive customer facing Web portals also require data processing and taking actions – for example to recommend products to customers – as quickly as possible.
If the requirement is to monitor a continuous process for trends that may indicate problems or undesirable states that are developing, and if the action to prevent those problems can take some time to implement, then a pull/polling architecture is easier to apply. Such applications are typically found in continuous and continuous-batch manufacturing, for example in the pharma industry. Predictive maintenance applications in manufacturing also typically follow this pattern: Relevant continuous data streams may slowly shift or drift, indicating that problems or anomalies are slowly developing.
TIBCO Spotfire® (www.spotfire.tibco.com). Spotfire® is a general visual analytics platform for enterprise-wide analytic BI; Spotfire includes a proprietary version of high-performance R, can interface with various open-source libraries and languages including R and Python, and can connect to virtually any data source, including real-time data sources.
TIBCO Statistica™ Enterprise (www.tibco.com/products/tibco-statistica). A mature platform for batch-real-time (pull/poll) applications and use cases; supports enterprise-wide deployment with model management of data-prep and analytic workflows; natively supports a large number of statistical and machine learning algorithms, as well as open-source scripting languages and environments including Python, R, Scala (Spark), and C#.
TIBCO StreamBase® (www.tibco.com/products/tibco-streambase). A mature platform for real-time processing of streaming data and events; implements a push architecture; supports a large number of connectors to virtually all common streaming data sources and data historians; is capable of very high-volume data loads; can implement complex rules logic, prediction models, and R-based prediction models and computations.
TIBCO® Data Virtualization (www.tibco.com/index.php/products/tibco-data-virtualization). A mature platform for managing and aligning diverse data sources, including continuous Data Historians, relational, and non-relational data stores; provides essential capabilities for building useful real-time systems-of-insight that relate measured KPIs with dynamically streaming data.
Use Case: Implementing a Virtual Sensor for Anomaly Prediction
This example illustrates a common use case for continuous process monitoring based on a predictive model, implemented in Statistica™ and Spotfire. A common goal for predictive analytics when applied to process data is to predict anomalies, before they actually occur. In a sense, the prediction model serves as a virtual sensor of anomalies that are likely or about to occur.
This use case is implemented in many manufacturing environments, but is equally applicable to any domain where streaming data are collected that contain specific parameters critical for process health, whether that is defined as exceedances of engineering specifications or regulatory limits on process outputs, or simply as undesirable exceedances of upper or lower bounds for continuous process parameters.
This simple example will illustrate how to lag data in a Statistica workflow, and how to prepare data for predictive modeling. The goal is to estimate the risk that an anomaly will occur within the next hour. The actual data are based on the Statistica example file Cyclone-1. These data originated from an older-style combustion furnace for power generation; the data were anonymized and altered from the original, without affecting the relationships between the parameters.
Creating / Opening the Workspace
In Statistica, create a new Workspace and select as the input file the Statistica Example Data file Cyclone-1.sta. Figure 4 shows the variable names contained in this data file.
To create a new Workspace, use the File-New-Workspace option, and create a Blank Workspace. As the Data Source select Files and browse to the Example Datasets folder in the Statistica install directory.
You can also open the fully built example Workspace which has the data embedded. The fully built workspace is shown in Figure 5.
This Workspace already has embedded in it all data sources. When you re-run this Workspace, make sure that the Spotfire Data Export specifies an output directory on the current machine.
Note: In real deployments of such workspaces, data can be sourced directly from a database, Data Historian, or other data repository.
Description of Processing Steps
The following paragraphs describe briefly the data processing steps performed in this Workspace.
Select Variables. This processing node facilitates the selection of variables. In this case, all variables in the input file are selected into the Dependent Continuous list; different variables are created and selected for subsequent modeling. Note that the Select Variables node is only required here in order to compute the Autocorrelations and Crosscorrelations, for illustration purposes; that specific node is an older-style Statistica VB-based node.
Autocorrelations and Crosscorrelations. This analytic node will compute the autocorrelations for all variables in the data. Continuous data are usually autocorrelated, i.e., consecutive data points are more similar to each other than data points that are farther apart in time.
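For readers who want to see what an autocorrelation at a given lag measures, the following is a minimal sketch of the standard sample autocorrelation in plain Python. It is not the Statistica node's implementation; the function name `autocorrelation` and both example series are hypothetical.

```python
def autocorrelation(values, lag):
    """Sample autocorrelation of `values` at a non-negative integer lag."""
    n = len(values)
    mean_value = sum(values) / n
    deviations = [v - mean_value for v in values]
    # Total sum of squared deviations (the lag-0 term).
    denominator = sum(d * d for d in deviations)
    # Sum of products of deviations that are `lag` observations apart.
    numerator = sum(deviations[i] * deviations[i + lag] for i in range(n - lag))
    return numerator / denominator


# Hypothetical slowly drifting readings: neighbours are similar.
slow_drift = [2700 + i for i in range(20)]
# Hypothetical alternating readings: neighbours are dissimilar.
alternating = [2700 + (10 if i % 2 else -10) for i in range(20)]

r_drift = autocorrelation(slow_drift, 1)
r_alternating = autocorrelation(alternating, 1)
```

The slowly drifting series gives a strongly positive lag-1 value, while the alternating series gives a strongly negative one; process data such as temperatures usually resemble the former.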
Transformations of Variables: Lagging and Coding. This node performs the lagging of observations, and the coding of a new binary outcome variable FlameTempOK. This variable is created from the lagged variable Flame temperature (°F). Shown below are the specific transformations applied to the data.
FlameTemperatureLagged=lag("Flame temperature (°F)",-10)
The new variable FlameTemperatureLagged is created by lagging the original variable with flame temperature measurements by 10 observations. The data were extracted and aggregated as averages over 6-minute intervals, so lagging by 10 observations aligns the predictor (X) variables with the FlameTemperatureLagged values 1 hour later. The FlameTempOK variable is created as a binary (1/0) variable from FlameTemperatureLagged to record when the respective operations were or were not acceptable (>=2,700 °F, <2,700 °F, respectively). The DateTimeLagged variable is created to enable more convenient exploration of the data and predictions in Spotfire.
The outcome of these transformations is that in subsequent modeling the X (predictor) variables are now aligned with flame temperature measurements 1 hour forward.
Any successful prediction model that relates the X input variables to the (lagged) flame temperature measurements will now enable predictions 1-hour into the future, i.e., predict the risk that the flame temperature will fall below acceptable limits.
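The lagging and coding steps described above can be sketched outside Statistica as follows. This is an illustration only: it assumes, as the text describes, that a lag of -10 places the value recorded 10 rows later into the current row. The helper names `lead` and `code_ok` and the sample temperatures are hypothetical.

```python
def lead(values, steps):
    """Align each row with the value recorded `steps` rows later.

    Rows near the end have no future value yet and are set to None.
    """
    n = len(values)
    return [values[i + steps] if i + steps < n else None for i in range(n)]


def code_ok(temperature, threshold=2700.0):
    """Binary outcome: 1 if acceptable (>= threshold), 0 if not, None if missing."""
    if temperature is None:
        return None
    return 1 if temperature >= threshold else 0


# Hypothetical 6-minute flame temperature averages (°F); 10 rows span 1 hour.
flame_temperature = [2750.0] * 10 + [2650.0, 2720.0]

# Row 0 is now paired with the temperature measured 1 hour later, and so on.
flame_temperature_lagged = lead(flame_temperature, 10)
flame_temp_ok = [code_ok(t) for t in flame_temperature_lagged]
```

Note that the last 10 rows have no outcome value after lagging, so they cannot be used for training; in a live system these are exactly the rows for which the model supplies the 1-hour-ahead prediction.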
Frequency Tables. This analytic step will compute a simple frequency table for the FlameTempOK variable.
Feature Selection. This analytic step will apply a simple feature selection algorithm to rank order the importance of predictor X variables for predicting the lagged flame temperature indicator variable (FlameTempOK) 1 hour forward. The results of these analyses are useful to aid the understanding of what specific variables relate most strongly to problems as they develop and are likely to occur in the future.
Boosted Classification Trees. This computational step builds the predictive model. The Boosted Trees algorithm is used to predict the (future) FlameTempOK variable. This algorithm will automatically select a training and testing sample to avoid overfitting. In practice, additional hold-out samples may need to be defined in order to assess the accuracy of the model in an unbiased manner.
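One possible way to define an additional hold-out sample for time-ordered data is to reserve the most recent rows, so that the model is assessed only on observations collected after the training period. The sketch below shows that idea; it is an assumption about how such a sample could be drawn, not a description of what the Boosted Classification Trees node does, and the function name `chronological_split` and the 25% fraction are hypothetical.

```python
def chronological_split(rows, holdout_fraction=0.25):
    """Split time-ordered rows into (training, holdout) without shuffling."""
    if not 0 < holdout_fraction < 1:
        raise ValueError("holdout_fraction must be between 0 and 1")
    cut = int(len(rows) * (1 - holdout_fraction))
    # Earlier rows train the model; later rows are held back for assessment.
    return rows[:cut], rows[cut:]


# Placeholder row indices: 4 days of 6-minute averages is 4 * 24 * 10 = 960 rows.
rows = list(range(960))
training, holdout = chronological_split(rows)
```

Keeping the rows in time order avoids placing near-identical neighbouring observations on both sides of the split, which could otherwise make accuracy estimates look better than they are for autocorrelated data.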
PMML Model, Rapid Deployment (and deployment options). These computational steps will compute predictions based on the trained model. Note that other model deployment methods and languages are also available by changing the respective output options in the Boosted Classification Trees processing step. Note also that the PMML model can be exported to StreamBase® for very-low-latency real-time scoring in a real-time push/event-driven architecture.
Spotfire Data Export. This processing step will export the data along with predictions and classification probabilities to a Spotfire file for visualization and exploration. By default the Spotfire (.SBDF) file will be written out to the directory C:\Temp; adjust this path by modifying it in the node. Visualizing results with Spotfire provides a quick check of whether the prediction model successfully predicts low flame temperature before it actually occurs.
Reporting Documents. As in all Statistica Workspaces, the Reporting Documents node holds the results organized by computational step.
Reviewing Results in Spotfire
The workflow will write out the predicted values 0 or 1 to indicate predicted low vs. normal flame temperatures, respectively. These predictions are written out based on lagged predictors, i.e., predicting the expected flame temperature 1 hour forward. Figure 7 summarizes the results in the Spotfire dashboard.
The graph highlights the low-temperature-predictions in the line graph of consecutive temperature measurements on the original time scale. Low-temperature points are correctly predicted approximately 1 hour before the actual drops in flame temperatures evident in the line graph. Thus, this prediction model could provide a useful early-warning-virtual-sensor of impending critical low-temperature events.
Deploying the model for real-time scoring in Statistica, StreamBase. Based on these results, the respective prediction model could now be put into production, for example through Statistica Enterprise and Monitoring and Alerting Server, which implements a pull/batch architecture; or the model could be deployed directly through StreamBase for very-low-latency scoring in a push/event architecture.
The Spotfire dashboard is also available for download.
The purpose of this overview was to discuss the basic distinction between continuous process and discrete data, how those data are typically stored, and how they are modeled. Further, the general real-time architecture for preparing and analyzing streaming data can be a pull/batch architecture where the analytics or scoring engine extracts new data at regular intervals, or it can be based on a push/event architecture where the arrival-of-new-data "event" triggers immediate analyses and/or scoring.
Continuous process data almost always show auto-correlations of consecutive data points. This information (the diagnostic information of lagged predictors or outcomes) can be used to perform analyses and build useful models predicting future events from current observations or states. A simple way to build such a model is to include in the data preparation steps the appropriate lagging of variables. The size of the lag is often determined by requirements around the respective use cases. A critical issue is how predictions are to be converted into actions. This may happen automatically within milliseconds, or it may happen through manual operator intervention, requiring perhaps hours. The real-time modeling and scoring architecture should reflect those requirements.
IDC (2017). Internet of Things Spending Forecast to Grow 17.9% in 2016 Led by Manufacturing, Transportation, and Utilities Investments, According to New IDC Spending Guide. IDC Press Release, January 4, 2017; retrieved on August 4, 2017, from http://www.idc.com/getdoc.jsp?containerId=prUS42209117.
Figure 1: Autocorrelation Function Illustrating Correlations of Consecutive Observations in a Continuous Process
This illustration shows a typical auto-correlation function, describing the correlation (similarity) of consecutive observations in continuous process data.
Figure 2: Example TIBCO Spotfire Visualization of Process Data
This is a typical simple visualization of process data extracted from a Data Historian. The specific example data describe critical operational and performance parameters of an industrial furnace used in the power generation industry. The extracted data are shown in the lower pane of the Spotfire dashboard. Data were extracted from a Data Historian as parameter averages over consecutive 6-minute intervals. Modern Data Historians support various data aggregation functions that enable aggregation to user-defined intervals at the point of extraction. TIBCO software supports all common Data Historians, including OSI PI. This specific example is further discussed in an example later.
Figure 3: Scoring Real-Time Streaming Data in TIBCO StreamBase
TIBCO StreamBase is a system for analyzing streaming data using the Push/Event Architecture: Data are processed as / when they become available, and processed through a sequence of steps that can include transformations, conditional scoring logic, and machine learning or statistical prediction models. StreamBase supports real-time application of scoring or decision logic at very high speed (low latency scoring), and with very high data volumes. The illustration shows the workflow designer (design-time); once deployed the system enables automatic processing of real-time data as they are collected, for example, supporting very-high-speed automated securities trading.
Figure 4: Statistica Example Data File Cyclone-1.sta
This example data file captures the performance of a combustion furnace for a 4-day period. The data were extracted from a data historian as averages over 6-minute intervals. A key performance parameter is Flame Temperature, which needs to stay above 2,700 °F for efficient and low-emissions combustion. Pulverized coal is used for fuel, and combustion air is introduced through different ports (Primary air, Secondary air at different locations, Tertiary air). The Stoichiometric ratio is an air-to-fuel ratio scaled to a theoretically derived quantity indicating the minimum air required to completely burn the respective fuel (so values >1 indicate that more air is introduced into the combustion process than what is required to burn all fuel).
Figure 5: Workspace for Building a Virtual Sensor from Lagged Predictors
This example workspace with all data and results embedded is available for download with this example. The workspace builds a risk model for Flame temperature to fall below 2,700 °F one hour forward, based on other parameters also recorded into data file Cyclone-1.sta. Note that the Select Variables node is only required here in order to compute the Autocorrelations and Crosscorrelations, for illustration purposes; that specific node is an older-style SVB-based node.
Figure 6: Continuous Process Data after Lagging
Data after applying lagging transformation. Note how the original variable containing flame temperature measurements (see highlighted value for Variable Flame temperature (°F) at 9/24/05 1:00AM) is now aligned with the other variables and values in row 1. Variable DateTimeLagged shows the time stamp for the lagged flame temperature measurements. Note also that FlameTempOK is the binary variable indicating if the lagged flame temperature measurements are acceptable (>=2,700 °F).
Figure 7: TIBCO Spotfire Dashboard of Predicted Low-Temperature-Events
This simple TIBCO Spotfire visualization highlights the low-temperature-predictions from the Boosted-Trees model. Note that the X-axis in the line chart is the original Date / Time variable (time-stamps); the Y-axis shows the flame temperature averages. The highlighted points are those predicted as 0, i.e., that were predicted by the lagged predictors to fall below the critical 2,700 °F threshold 1 hour forward in time. The predicted 0 (zero-) values indicating risk of low flame temperature occur before the actual temperature drops occur. Thus, the prediction model can serve as an effective virtual sensor for predicting critical low-temperature events.