AutoML for TIBCO® Data Science - Team Studio
AutoML for TIBCO® Data Science – Team Studio is designed for the business analyst and the data scientist alike, and provides automated generation of ML workflows as well as opportunities for manual intervention and fine tuning.
AutoML is supported on Team Studio version 6.4 and above. Users need a working installation of any HDFS data source & read/write privileges to said data source.
Version 1.1. Tested with Team Studio 6.4 and 6.5, and Spotfire 7.13, 10.3 and 10.5.
TIBCO Component Exchange License
Machine Learning (ML) is a complex process, which involves several steps, choices and decisions. The objective of Automated Machine Learning (AutoML) is to make this process more manageable, by automating its most complex and lengthy parts. AutoML for TIBCO® Data Science – Team Studio is designed for the business analyst and the data scientist alike, and provides automated generation of ML workflows as well as opportunities for manual intervention and fine tuning.
Please use the tag #AutoML when asking questions about this extension
AutoML Product Extension for TIBCO® Data Science - Team Studio
AutoML for TIBCO Data Science - Team Studio is supported on Team Studio version 6.4 and above. Users need a working installation of any HDFS data source and read/write privileges to that data source. Spotfire 7.13 or later is optional; it is required only for viewing the model explainability visualizations.
TIBCO Data Science - Team Studio 6.4 or later
Spotfire 7.13 or later: optional - required for Model Explanation visuals
Data Function for TIBCO(R) Data Science - Team Studio in TIBCO Spotfire(R) version 1.1 or later
R packages: data.table
Machine Learning is an increasingly popular branch of Artificial Intelligence, aimed at generating predictions from a dataset via an arsenal of computational algorithms and statistical methodologies. ML is a complex process which involves several steps: data exploration and cleaning, data preparation and feature engineering, model training, and finally model scoring and selection. Depending on our goals and on the nature of the dataset, at each step of the way we are faced with a richness of choices and decisions. Not only are such decisions increasingly hard, owing to the growing number and complexity of the available algorithms, they also involve time-consuming testing of different options and combinations.
The objective of Automated Machine Learning is to make this process more manageable, by automating the most complex and lengthy decisions. In this implementation, the AutoML generation on one hand enables analysts to quickly set up a meaningful process; on the other, because of the transparency of the generated system, it allows expert data scientists to see which decisions were made and why, and fine-tune the process if desired.
AutoML for TIBCO® Data Science – Team Studio is a set of Team Studio Custom Operators that generate workflows within Team Studio, and a Spotfire model explainability template used for visual analytics. The AutoML workflows are generated into a Team Studio workspace and run in sequence, to cover the end-to-end ML process. Individual operators are included to perform data preparation, feature engineering, stability selection and automated modeling, along with a high-level orchestration operator that uses built-in logic to assemble these operators into AutoML workflows, run the analysis, and display all results. A Spotfire template is provided that integrates with TIBCO Data Science Team Studio to visualize the resulting predictions.
In addition to the 'getting started' information below, here are two videos and a blog that explain AutoML
- Short video by Neil Kanungo
- Longer demo recorded as part of TIBCO Analytics Meetup July 2019 by Dan Rope
- Demo plus installation instructions by George Chen
- The Real Power of AutoML blog by Steven Hillion
Using AutoML for TIBCO Data Science – Team Studio
To get started, create a new Team Studio workflow and read your dataset into it, as you would for any workflow. AutoML is designed to work with Hadoop, so the data needs to be in, or copied to, Hadoop. You will need to create a single data table, so any joins and merges to other data tables need to happen at this stage. AutoML expects data to be in wide mode, with column headers describing the content of each column. One of these columns needs to be the dependent variable (also referred to as response or target) for the predictive modelling. For instance, if your task is to predict fraud, the target could be the label assigning each row of the dataset to fraudulent or non-fraudulent behaviour. Currently, AutoML handles binary classification tasks, so the target column needs to have two distinct values. Missing data are generally allowed, but rows where the target value itself is missing will be filtered out.
After reading your data in, the next thing you need is to channel the data into the AutoML Orchestrator, as in Figure 1 below (in this example, three input datasets are first joined together and then sent into AutoML).
Figure 1. Example of AutoML-generating workflow
The AutoML Orchestrator is a custom operator that contains in-built logic to generate a number of workflows. It does not need much direction, mainly the hostname and port of the Team Studio installation (i.e. the URL of the site you are running Team Studio on), login credentials, the name of the target (dependent) column and some information about the format of the input dataset. You will also need to specify the output Workspace ID, that is the workspace into which all the workflows will be generated. Ideally, this will be a clean workspace, and it must be on the same Team Studio installation. It can be the same workspace where the generating workflow lives, as long as there are no pre-existing workflows with the same names as the ones that will be generated (see later for a complete list of these). The Workspace ID itself is a short number, the integer that comes after the hostname and port and the #workspaces keyword in the Team Studio URL of the output workspace (which you will need to have created manually beforehand). Please refer to the AutoML Orchestrator documentation for further details. The documentation for all AutoML operators is included in the zip file for AutoML for TIBCO Data Science - Team Studio.
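As a small illustration of where the Workspace ID sits in the URL, the sketch below extracts the integer that follows the #workspaces keyword. The exact URL layout (a "/" between the keyword and the number) and the host, port and ID used here are assumptions for illustration only, not taken from the product documentation.

```python
import re


def workspace_id_from_url(url: str) -> int:
    """Return the integer that follows '#workspaces' in a Team Studio URL.

    Assumes the ID is separated from the keyword by a single '/'.
    """
    match = re.search(r"#workspaces/(\d+)", url)
    if match is None:
        raise ValueError(f"no workspace ID found in {url!r}")
    return int(match.group(1))


# Hypothetical host, port and workspace ID.
print(workspace_id_from_url("http://teamstudio.example.com:8080/#workspaces/42"))  # prints 42
```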
The AutoML Orchestrator has options to generate a ‘Shallow’ or a ‘Deep’ AutoML. This choice controls the extent of the hyper-parameter search during the predictive modelling phase. A shallow search is quicker and gives a good initial idea of the most suitable model. A deep search will normally provide more accurate models, at the cost of being slower. All generated default values are visible as input parameters of the specific operators.
When the AutoML Orchestrator runs, visual workflows are generated and written into the output workspace. Workflows for model training are generated and run on the fly by the Orchestrator. They are run in sequence, as the output of one workflow becomes the input for the next one. Their names reflect the different phases of the process and are, in order: Target Learning, Data Preparation, Feature Engineering, Feature Selection and Modeling. In addition, two workflows are created but not run by the Orchestrator: Scoring and Explainability. The former can be used to apply the winning model to new datasets; the latter provides input to the Spotfire model explainability template that visualizes the results. Other artefacts that are not visual workflows are also generated into the workspace. Figure 2 shows the workflows generated during orchestration and the top exported models, in Team Studio’s Analytics Model (.am) format. Other workfiles are generated by the Explainability workflow when it is run. See the following sections for details.
Figure 2. Typical workfiles generated into the output workspace
The Target Learning workflow checks that the variable declared as target contains the expected number of unique values, and stops the orchestration if more than two distinct values are found.
Inside the Data Preparation workflow (see Figure 3), the original target variable is mapped to a 0/1 integer to ensure consistent binary classification. The new target is renamed AutoML_Mapped_Target and is used as dependent variable throughout the generated workflows. The mapping of the original target variable is performed within the Target Labeling operator.
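The target check and the 0/1 mapping described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the Target Labeling operator itself; in particular, which original level is mapped to 1 is an assumption here (levels are simply sorted by their string form).

```python
def map_binary_target(values):
    """Drop missing targets, check the number of levels, and map them to 0/1.

    Returns the mapped values and the mapping that was used.
    """
    kept = [v for v in values if v is not None]  # rows with a missing target are filtered out
    levels = sorted(set(kept), key=str)
    if len(levels) > 2:
        raise ValueError(f"target has {len(levels)} distinct values, expected two")
    # Assumption: the level that sorts first becomes 0, the other becomes 1.
    mapping = {level: index for index, level in enumerate(levels)}
    return [mapping[v] for v in kept], mapping


mapped, mapping = map_binary_target(["fraud", "ok", None, "fraud"])
print(mapped, mapping)  # prints [0, 1, 0] {'fraud': 0, 'ok': 1}
```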
The dataset is then split into Training and Testing sets (using an 80/20 row split) and summary statistics are calculated on the Training dataset. The AutoML Orchestrator also classifies categorical variables into groups according to their cardinality (the number of distinct values, or levels) and their imbalance (the ratio between the maximum and minimum frequency of these levels).
Figure 3. Example of generated Data Preparation workflow
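To make the split and the two measures used for grouping categorical variables concrete, here is a minimal sketch. Random shuffling before the 80/20 cut and the fixed seed are assumptions; the thresholds AutoML uses to assign variables to groups are not stated here, so the sketch only computes the cardinality and the imbalance.

```python
import random
from collections import Counter


def train_test_split(rows, train_fraction=0.8, seed=0):
    """Shuffle row indices and cut them 80/20 (random shuffling is an assumption)."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    cut = int(len(rows) * train_fraction)
    train = [rows[i] for i in indices[:cut]]
    test = [rows[i] for i in indices[cut:]]
    return train, test


def cardinality_and_imbalance(column):
    """Return (number of distinct levels, max level frequency / min level frequency)."""
    counts = Counter(v for v in column if v is not None)
    if not counts:
        return 0, None
    return len(counts), max(counts.values()) / min(counts.values())


colours = ["light", "light", "light", "dark", "medium", "medium"]
print(cardinality_and_imbalance(colours))  # prints (3, 3.0)
```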
In addition, date-time variables are handled by a custom operator, the Date Time Transformer. This operator extracts features such as year, month, day etc. from the input columns, as long as these columns are parsed as dates, datetimes or times according to a supported format. This is implemented within the Data Preparation workflow as shown in Figure 4. Please refer to the operator’s documentation for details.
Figure 4. Example of Data Preparation workflow segment with DateTime handling
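The kind of feature extraction performed on date-time columns can be illustrated with the Python standard library. The format string and the exact set of extracted features below are assumptions (the text above only names year, month and day); the Date Time Transformer documentation lists the supported formats and outputs.

```python
from datetime import datetime


def extract_datetime_features(value, fmt="%Y-%m-%d %H:%M:%S"):
    """Parse a date-time string and return a dictionary of simple features."""
    parsed = datetime.strptime(value, fmt)
    return {
        "year": parsed.year,
        "month": parsed.month,
        "day": parsed.day,
        "hour": parsed.hour,                  # assumed extra feature
        "day_of_week": parsed.isoweekday(),   # assumed extra feature, 1 = Monday
    }


features = extract_datetime_features("2019-07-15 13:45:00")
print(features["year"], features["month"], features["day"])  # prints 2019 7 15
```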
Within the Feature Engineering workflow (see Figure 5) transformations are applied to the dataset. All transformations are implemented taking into account the statistical properties that were computed on the Training dataset. Currently AutoML supports missing data imputation, normalization using mean/standard deviation, impact (target-mean) encoding, weight-of-evidence encoding and frequency encoding. Please refer to the individual AutoML Custom Operator documentation for further details. The last three are categorical encoding transformations, which are applied to transform categorical variables into numbers suitable for input into predictive-modeling operators.
Figure 5. Example of generated Feature Engineering workflow
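The three categorical encodings named above can be sketched as maps learned from the Training rows. This is an illustrative sketch under stated assumptions, not the operators' implementation: frequency encoding is taken to be the relative frequency of each level, impact encoding the mean of the 0/1 target per level, and weight of evidence the log ratio of the level's share of positives to its share of negatives, with 0.5 added to each count to avoid taking the log of zero. Other conventions for weight of evidence (sign, smoothing) exist.

```python
import math
from collections import defaultdict


def fit_encodings(categories, targets):
    """Learn frequency, impact and weight-of-evidence maps (targets are 0/1)."""
    n = len(categories)
    count = defaultdict(int)
    positives = defaultdict(int)
    for category, target in zip(categories, targets):
        count[category] += 1
        positives[category] += target
    total_pos = sum(targets)
    total_neg = n - total_pos

    frequency = {c: count[c] / n for c in count}
    impact = {c: positives[c] / count[c] for c in count}
    woe = {}
    for c in count:
        pos = positives[c] + 0.5             # smoothing is an assumption
        neg = count[c] - positives[c] + 0.5
        woe[c] = math.log((pos / (total_pos + 0.5)) / (neg / (total_neg + 0.5)))
    return frequency, impact, woe


frequency, impact, woe = fit_encodings(["a", "a", "b", "b", "b"], [1, 0, 1, 1, 1])
print(frequency["a"], impact["a"], impact["b"])  # prints 0.4 0.5 1.0
```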
Each data transformation that uses information from more than a single row (such as, for example, imputation using the mean) is performed after the Training/Testing split, using the parameters from the Training dataset (in this example, the mean) and then automatically applied to the Testing dataset using the same parameters. This ensures consistency of the ML process and helps minimise data leakage (the unintentional usage of Testing data information during the Training phase) and over-fitting. All data transformations follow this paradigm. The more complex categorical-encoding operators achieve this via a separate ‘applicator’ operator, called Categorical Feature Encoder, which is conceptually similar to the predictor or classifier operator of an ML algorithm: it takes a model (in this case, the encoding map) and applies it to a new dataset (the Testing dataset, or any new data flowing in). Depending on which cardinality/imbalance group they fall in, there could be different options for encoding variables. The result is potentially multiple branches, or alternative strategies of feature engineering. A feature engineering strategy is therefore simply the specific sequence of transformations applied to the variables. Depending on the number of categorical groups in a dataset, there could be different numbers and compositions of such feature engineering strategies. In Figure 5 we see two alternative strategies (WoE and Impact Encoding). The corresponding encoded datasets are assigned names consistent with the strategy branch they belong to. Feature Selection and Modeling will then be applied to each output dataset.
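The fit-on-Training, apply-to-Testing paradigm described above can be illustrated with mean imputation followed by mean/standard-deviation normalization. The function names are hypothetical and the use of the sample standard deviation is an assumption; the point is only that the parameters come from the Training data and are reused unchanged on any other dataset.

```python
from statistics import mean, stdev


def fit_imputer_scaler(train_column):
    """Learn the mean and standard deviation from the Training column only."""
    observed = [v for v in train_column if v is not None]
    return {"mean": mean(observed), "std": stdev(observed)}


def apply_imputer_scaler(column, params):
    """Impute missing values with the Training mean, then normalize."""
    filled = [params["mean"] if v is None else v for v in column]
    std = params["std"] or 1.0  # guard against a constant Training column
    return [(v - params["mean"]) / std for v in filled]


params = fit_imputer_scaler([2.0, 4.0, None, 6.0])   # Training data
print(apply_imputer_scaler([None, 8.0], params))      # Testing data; prints [0.0, 2.0]
```

Because the Testing rows never contribute to the mean or the standard deviation, no information from them leaks into the Training phase.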
The Feature Selection workflow (see Figure 6) applies feature selection techniques to the output of Feature Engineering: either using RandomForest variable importance, or randomised Lasso probability of inclusion. The AutoML Orchestrator naturally assigns a RandomForest stability selection to subsequent tree ML algorithms, and a randomised Lasso one to subsequent elastic-net Logistic Regression algorithms. In Figure 6, each of the two strategy branches output by Feature Engineering is further split into two branches for each feature selection method.
Figure 6. Example of generated Feature Selection workflow
The Modeling workflow (see Figure 7) handles the predictive modeling phase. The modeling operators (elastic-net regularised Logistic Regression, Random Forest and Gradient Boosted Tree) each perform an internal hyper-parameter optimisation using the open-source Spark MLlib library. The resulting models are exported to the output workspace in the Team Studio Analytics Model (.am) format (see Figure 2). There is one output .am workfile per feature engineering strategy and ML algorithm type, representing the best model of each kind after hyper-parameter optimisation.
Figure 7. Example of generated Modeling workflow
At the end of the Modeling flow, all the different models that were generated are scored against the Testing dataset, and a model leaderboard is produced, sorted by the resulting accuracy (but displaying a number of other metrics as well). Once the AutoML Orchestrator has finished running, and the generating workflow has completed, the user can click on the Orchestrator icon to see a complete report of the AutoML, with details on all phases and models, and links to the individual generated workflows, as shown in Figure 8.
Figure 8. Example of AutoML Orchestrator results
Figures 9, 10 and 11 show examples of the details that can be seen when clicking on Feature Engineering Variable Category, Feature Engineering Summary, and Model Leaderboard respectively.
Figure 9. Example of Feature Engineering Variable Category
Figure 10. Example of Feature Engineering Summary
Figure 11. Example of Model Leaderboard
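The idea behind the Model Leaderboard, scoring every model on the Testing dataset and sorting by accuracy, can be sketched as follows. The model names are hypothetical, and only accuracy is computed here, whereas the generated leaderboard displays other metrics as well.

```python
def accuracy(y_true, y_pred):
    """Fraction of Testing rows where the prediction matches the label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def leaderboard(y_true, predictions_by_model):
    """Return one row per model, best accuracy first."""
    rows = [
        {"model": name, "accuracy": accuracy(y_true, preds)}
        for name, preds in predictions_by_model.items()
    ]
    return sorted(rows, key=lambda row: row["accuracy"], reverse=True)


labels = [1, 0, 1, 1]
board = leaderboard(labels, {
    "impact_random_forest": [1, 0, 0, 0],        # hypothetical model name
    "woe_logistic_regression": [1, 0, 1, 0],     # hypothetical model name
})
print(board[0])  # prints {'model': 'woe_logistic_regression', 'accuracy': 0.75}
```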
Finally, it is important to note that AutoML is designed to simplify and shorten the task of building an ML process, but a second appraisal of the workflows and the results is nevertheless encouraged. The success of a Machine Learning project depends as much on business knowledge and ingenuity as it does on sophisticated methods and algorithms. This is why all the workflows and results generated by the AutoML Orchestrator are transparent and editable, so that they can be inspected, assessed and fine-tuned where desired.
The Scoring workflow
The AutoML Orchestrator selects a winning model, according to the top row of the Model Leaderboard (see Figure 11) as well as the corresponding set of feature transformations (FE strategy). In order for new data to be scored, both transformations and model need to be applied to the new data exactly as they were in the AutoML testing phase. This is the task of the Scoring workflow. An example of the Scoring workflow is displayed in Figure 12.
Figure 12. Example of generated Scoring workflow
The new data goes through all the data preparation phases, including feature extraction from datetimes where applicable, then the appropriate categorical variables are transformed (in the example, Impact and Frequency encoding are applied to separate sets of variables). Note the Scoring Result node appears initially red. In order to activate it (turn it to black) the workflow branch up to that node needs to be executed. This can be done by right-clicking on the operator immediately before the Scoring Result, and selecting Step Run. The Scoring Result node can then be activated by double-clicking on it and pressing OK. After scoring, the resulting data set contains the additional variables as normally added by the model: in this example, using Gradient Boosting, the variables are PRED_AGB, CONF_AGB and INFO_AGB.
The Explainability workflow and Spotfire model-explainability template
The concept of model explainability is not new: it is the foundation upon which scientific advancement is based. The type of models involved, however, has changed greatly through the centuries. Nowadays, machine learning provides us with very sophisticated models, far more complex than the traditional scientific formula. The price to pay has been an ever-decreasing understanding of the rationale behind the automated decisions. Model explainability is an open and active research field; AutoML for TIBCO Data Science provides a window into the model generated by the Orchestrator, by exploiting the integration between Team Studio and Spotfire, and the visual dynamism of Spotfire powered by TERR and IronPython. The components of this feature are an additional (automatically generated) Team Studio workflow (Explainability), an extension to Spotfire for running Team Studio data functions (Data Function for TIBCO® Data Science - Team Studio in TIBCO Spotfire®) and a Spotfire model-explainability template with embedded TERR data functions and IronPython automation.
The Explainability workflow is designed to prepare the data for usage in the Spotfire template. It is automatically generated by the Orchestrator though not run during the orchestration phase. The associated Spotfire template contains a pre-defined Team Studio data function that connects directly to this workflow.
The first two branches of the Explainability workflow take as input the training dataset, pre-transformed to prepare it for the winning model. The top branch uses a new custom operator, Data Grid Builder, to reduce the size of the dataset by mapping it onto a grid, the granularity of which can be controlled from Spotfire. The second branch generates a representative sample of the same input dataset and uses the new Data Reshuffling custom operator to create and collate a number of copies of the dataset in which each predictor column has been in turn randomized. Further details on these two operators can be found in their documentation. Both branches score the data by applying the winning model and then export the result as SBDF (Spotfire Binary Data Format) ready to be consumed by Spotfire.
Figure 13: Data processing branches of the Explainability workflow
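The reshuffling step in the second branch can be pictured with a short sketch: for each predictor in turn, a copy of the dataset is made in which only that column is randomly permuted, and all copies are collated. This is an illustration of the idea only; the column name used to tag each copy and the fixed seed are assumptions, not the Data Reshuffling operator's actual output schema.

```python
import random


def reshuffled_copies(rows, predictors, seed=0):
    """Collate one copy of `rows` per predictor, with that predictor's column shuffled."""
    rng = random.Random(seed)
    collated = []
    for name in predictors:
        shuffled = [row[name] for row in rows]
        rng.shuffle(shuffled)
        for row, value in zip(rows, shuffled):
            copy = dict(row)
            copy[name] = value
            copy["shuffled_column"] = name  # hypothetical tag identifying the copy
            collated.append(copy)
    return collated


data = [
    {"WIDTH": 1.0, "WEIGHT": 10.0},
    {"WIDTH": 2.0, "WEIGHT": 20.0},
    {"WIDTH": 3.0, "WEIGHT": 30.0},
]
copies = reshuffled_copies(data, ["WIDTH", "WEIGHT"])
print(len(copies))  # prints 6
```

Scoring each copy with the winning model and comparing the predictions with those on the unshuffled data gives an indication of how much the model relies on each predictor.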
The second part of the Explainability workflow collects and exports all the transformations that were applied to the input dataset. These too are turned to SBDF files and will be used by Spotfire to reverse (decode) the transformations, and present the insights in human-readable form.
Figure 14: Data decoding branches of the Explainability workflow
The Spotfire model-explainability template will connect to the Explainability workflow to extract its output datasets, then perform calculations on the data to inspect the model’s behaviour.
When the Spotfire template is first opened, a login box appears (Figure 15). Since no Team Studio instance is connected yet to Spotfire, some initial configuration is needed:
1. Click Cancel to exit the Login box.
2. Go to Notifications and click on Dismiss All.
3. Go to File | Manage Trust and Script and click on Trust All, OK, then Close.
4. Go to Tools | TERR Tools | Package Management and install package data.table from CRAN, then Close.
5. Go to Tools | Team Studio data function | Edit Team Studio data function.
6. Select the available data function, then OK.
7. Fill in the form with your Team Studio instance URL, username and password, then Login.
8. Choose a Workspace and select the Explainability workflow within it.
9. Press OK twice, then Yes.
The Team Studio instance containing your generated AutoML workspace and Explainability workflow is now connected to Spotfire. For further documentation on the Team Studio data function setup, see Data Function for TIBCO(R) Data Science - Team Studio in TIBCO Spotfire(R).
Figure 15: Team Studio data function login box
The Team Studio data function within the Spotfire model-explainability template needs to be connected to the Explainability workflow generated by the Orchestrator. The data function is predefined to take as input the number of bins for the Data Grid Builder custom operator to decide how to bin numeric predictors, and return as output the five SBDF files generated by the Explainability workflow. Additional I/O parameters processid and success are used to guide the running of the data function and do not need to be modified by the user (see documentation for details). The Team Studio data function can be repointed to any generated Explainability workflow by following steps 5 to 9 above, and then it can be run by clicking on the GENERATE EXPLANATIONS button on the START page (see Figure 16). The results of this data function are automatically used to generate a variable importance chart via an embedded TERR data function. Only the top predictors are displayed; this behavior can be overridden by configuring the bar chart and clearing the Limit data using expression in the Properties|Data tab.
Figure 16: Example of Spotfire template Start page
The second page of the template, Predictor Analysis and Behaviours, is designed for interactivity and exploration (see Figure 17). A number of TERR data functions and IronPython scripts (embedded in the Spotfire template) perform calculations on the data and react to marking. The top-left bar-chart shows the same importance plot as in the START page, but now this plot responds to clicking; selecting a variable in this chart (for instance, WIDTH in Figure 17) has the effect of updating the other three plots in the page.
The correlation plot (bottom-left) will show the interplay between the correlation of the selected predictor with other predictors, and these predictors’ relative effect on the model. This information can be used to get an idea of the balance between variable association and importance. In this example, WEIGHT is both highly correlated to WIDTH and important to the model’s prediction.
The bar-chart in the centre of the page shows the record counts available in the training dataset for the different values of WIDTH. This plot provides information on the amount of data that was available when training the model.
The chart on the right of the page displays the median target probability (between 0 - unlikely, and 1 - very likely) assigned by the model to the different values of WIDTH. Both the centre and the right charts display the data after they have been mapped onto a grid (binned), so the labels of WIDTH show the average bin values, and the probability shown on the right chart is weighted by the cell occupancy on the grid. The variability bars incorporate the effect of the other predictors on the model’s predictions for WIDTH. This effect can be analyzed by visualizing the filter panel (click on the Show Filter button, top-left) and exploring how sliding the values of the other predictors affects both the available data and the response.
Figure 17: Example of Predictor Analysis and Behaviors page
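The binned view described above (record counts, average bin values as labels, and a median probability per bin) can be approximated with the sketch below. Equal-width bins are an assumption, and the occupancy weighting across the other predictors' grid cells is left out for brevity, so this is a simplified picture rather than the calculation performed by the template.

```python
from statistics import mean, median


def grid_summary(values, probabilities, n_bins):
    """Bin one numeric predictor and summarise each occupied bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for a constant column
    bins = {}
    for value, probability in zip(values, probabilities):
        index = min(int((value - lo) / width), n_bins - 1)  # the maximum goes in the last bin
        bins.setdefault(index, []).append((value, probability))
    return [
        {
            "label": mean(v for v, _ in cell),               # average value in the bin
            "count": len(cell),                              # records available in the bin
            "median_probability": median(p for _, p in cell),
        }
        for _, cell in sorted(bins.items())
    ]


summary = grid_summary([1.0, 2.0, 3.0, 4.0], [0.1, 0.3, 0.7, 0.9], n_bins=2)
print([row["label"] for row in summary])  # prints [1.5, 3.5]
```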
The bottom-left correlation plot does in turn react to clicking: by selecting a second variable, the bar-chart in the centre of the page becomes a heat-map, so the joint distribution of available data between the two selected predictors can be examined. Furthermore, selecting from the centre (whether bar-chart or heat-map) has the effect of limiting the data displayed on the right. In Figure 18 we see an example of how the page is transformed when selecting COLOR from the correlation chart, then marking the heat-map column corresponding to COLOR=lightmed.
Figure 18: Activated Predictor Analysis and Behaviors page
This page can be reset by clicking on the Unmark All button (top-left) and resetting the applied filters as usual in Spotfire (Edit|Reset All Filters).
AutoML comes with .jar files that need to be manually installed on an existing Team Studio environment. Once downloaded and unzipped, there should be AutoML_Models-1.10.jar, AutoML_Orchestrator-1.10.jar, Feature Engineering-1.10.jar, StabilitySelection-1.10.jar, DateTime-1.10.jar, DataGridding-1.10.jar and DataReshuffling-1.10.jar.
These files can be installed on a Team Studio Environment by navigating to, and opening, any workflow located on the instance that is to be used for AutoML. Once the workflow is opened, users can upload Custom Operators (the AutoML .jar files), by selecting ‘Actions’ -> ‘Manage Custom Operators’ (see Figure 19 below)
Figure 19. Manage Custom Operators Action
Once here, select the ‘Upload’ button and locate the .jar files on the local machine. Once all of the .jar files listed above are uploaded, AutoML is ready to be run on the environment.
In order to enable the Team Studio data function, please download Data Function for TIBCO(R) Data Science - Team Studio in TIBCO Spotfire(R) and follow the installation instructions.
To download the data.table package from CRAN, use Tools|TERR Tools|Package Management tab, then select and install data.table from the CRAN Package Repository. For additional information, see the TERR Package Management instructions.
The AutoML Orchestrator only supports Hadoop data sources.
The target column should be a binary column of any data type. The operator will halt if there are more than two levels in the target column.
In the rare cases where integer variables have NaN values, such as values generated by zero division, the corresponding rows may be silently removed after the data has been read in. To prevent this, users should declare all numeric variables as double rather than long in the Hadoop File Structure section when importing the dataset.
There should not be a column named ‘label’ in the data source, as this has a special meaning in the Spark pipeline.
Rows with null target values are filtered out. This is done in the FilterTarget operator of the Data Preparation flow.
Non-word characters (including spaces) from categorical variables are removed during data preparation. This is done in the Data Cleaning operator of the Data Preparation flow.
Categorical variables with zero variance (only one value) or a number of unique values equal to or greater than 60% of the number of rows in the training dataset are removed from further processing. This is done in the Column Filter operator of the Feature Engineering flow.
Variables with over 40% missing values are removed.
No correction for target class imbalance is applied.
AutoML expects certain schemas when handing off one flow to another, so any manual editing, post-generation, that results in a change of the schema (i.e. the detailed column structure) in-between generated flows, will most likely cause AutoML to fail when re-running the workflows.
When opening the Model Explainability DXP for the first time, you will encounter errors because it has not been configured to work with your instance of Team Studio. Please follow all setup instructions listed above.
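The cleaning and column-removal rules listed above (non-word characters stripped from categorical values, over 40% missing values, zero variance, and unique values at or above 60% of the rows) can be summarised in a short sketch. The function names and the in-memory data layout are hypothetical; the sketch only mirrors the stated thresholds and does not reproduce the Data Cleaning or Column Filter operators.

```python
import re


def clean_category(value):
    """Remove non-word characters, including spaces, from a categorical value."""
    return re.sub(r"\W", "", value)


def columns_to_drop(columns, categorical):
    """Return {column name: reason} for columns the stated rules would remove.

    `columns` maps a column name to its list of values (None marks a missing value);
    `categorical` is the set of names treated as categorical.
    """
    dropped = {}
    for name, values in columns.items():
        n_rows = len(values)
        missing = sum(v is None for v in values)
        if missing / n_rows > 0.40:
            dropped[name] = "over 40% missing"
        elif name in categorical:
            distinct = len({v for v in values if v is not None})
            if distinct <= 1:
                dropped[name] = "zero variance"
            elif distinct >= 0.60 * n_rows:
                dropped[name] = "too many unique values"
    return dropped


table = {
    "id": ["a", "b", "c", "d", "e"],
    "const": ["x"] * 5,
    "sparse": [None, None, None, 1.0, 2.0],
    "colour": ["u", "u", "v", "v", "u"],
}
print(sorted(columns_to_drop(table, {"id", "const", "colour"})))  # prints ['const', 'id', 'sparse']
```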