TIBCO at the Gartner Analytics Showdown
A little while ago, Gartner issued a challenge to the leaders and visionaries in its Data Science and Machine Learning ‘magic quadrant’. They gave us data on life expectancy for Australia and New Zealand — the site of Gartner’s 2020 Data and Analytics Summit — and asked us to find what contributes to greater longevity. “What can governments and businesses do to improve population health as defined by low mortality rates?” they asked.
Summary and Results
We had only a few days to prepare before the summit in Sydney, but we really wanted to show off the breadth of TIBCO’s data science platform. So we decided to bring in a team of people each focused on different parts of the analytics process — an analyst to explore the data and ask questions, a data scientist to build models, an engineer to write custom Python code, and a useless manager (that was me!) to remind people about deadlines and generally get in the way.
What followed was a few days of intensive collaboration, which produced a powerful dashboard, backed by a collection of predictive models, built on multiple datasets that had required quite a bit of wrangling and preparation. We produced a recording of our results that summarizes each of these work streams in just a few minutes.
Collaborating on Several Workstreams
We started by collecting as much data as we could find. Gartner had pointed us towards two sources of historical data — health data from the World Health Organization, and socioeconomic data from the World Bank — and asked us to focus on just the last 17 years. We found that each year had a measure of life expectancy, plus hundreds of other variables (education levels, GDP, poverty rates, etc.) that might play into population health. We ended up using data from 15 different countries, for reasons I’ll explain later.
Our analyst, Neil, used TIBCO Spotfire to explore and clean up the data — joining the health and socioeconomic data together, using the built-in AI to auto-generate charts and find important relationships, imputing missing values, pivoting, and so on.
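Neil did this work interactively in Spotfire, but the same join-and-impute steps can be sketched in a few lines of pandas. This is a minimal illustration, not the actual pipeline: the column names and values below are hypothetical stand-ins for the WHO and World Bank datasets.

```python
import pandas as pd

# Hypothetical stand-ins for the WHO health data and the World Bank
# socioeconomic data (column names and values are illustrative only).
health = pd.DataFrame({
    "country": ["Australia", "Australia", "New Zealand", "New Zealand"],
    "year": [2018, 2019, 2018, 2019],
    "life_expectancy": [82.7, 82.9, 81.9, None],
})
socio = pd.DataFrame({
    "country": ["Australia", "Australia", "New Zealand", "New Zealand"],
    "year": [2018, 2019, 2018, 2019],
    "gdp_per_capita": [57000, 55000, 43000, 42000],
})

# Join the two sources on country and year.
merged = health.merge(socio, on=["country", "year"], how="inner")

# Impute missing values within each country (a simple forward-fill by
# year here; Spotfire offers a range of imputation strategies).
merged = merged.sort_values(["country", "year"])
merged["life_expectancy"] = merged.groupby("country")["life_expectancy"].ffill()

print(merged)
```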
Spotfire dashboard for exploring and cleaning data
Then Spotfire could pass the resulting dataset along to TIBCO Data Science where the team built models to predict longevity. To get things going, I used the AutoML capability in TIBCO Data Science to build preliminary features and models. The AutoML ‘Orchestrator’ actually builds out completely transparent workflows, one each for data preparation, feature generation, modeling, and scoring. That suggested to us some of the models we might use, as well as how to build out new features to enhance model accuracy.
Then Prem, our data scientist, took over, building out visual workflows that performed normalization and variable selection followed by a range of regression models to predict life expectancy. We mostly used out-of-the-box models, but we were also able to inject a Python Notebook into Prem’s workflow, so that we could try out Adaboost models from scikit-learn.
Workflows for data preparation and regression modeling (including Python models), built in TIBCO Data Science
TIBCO Data Science offers a number of ways to operationalize models, for both batch and real-time scoring. We chose to plug the models right back into Neil's dashboard. The TIBCO Community has a Data Function for TIBCO Data Science which allows you to run Data Science workflows and retrieve the results for display within a dashboard.
We were able to show not just the predictions for each country, but also measures of variable importance — exactly the sort of thing Gartner wanted, to show the levers that governments could pull on to improve population health. Neil created some very intuitive charts to show how these variables differed for Australia (or any other country) from the rest of the world.
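One generic way to compute variable-importance scores like the ones we displayed is permutation importance, which measures how much a model's accuracy degrades when each variable is shuffled. This sketch is an illustration of the idea, not our actual workflow; the feature names are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic data standing in for the socioeconomic/health variables;
# the feature names below are hypothetical examples.
X, y = make_regression(n_samples=300, n_features=5, n_informative=2,
                       random_state=0)
features = ["gdp", "education", "smoking", "immunization", "sanitation"]

model = RandomForestRegressor(random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)

# Rank variables by how much shuffling them hurts the model.
for name, score in sorted(zip(features, result.importances_mean),
                          key=lambda t: -t[1]):
    print(f"{name:14s} {score:.3f}")
```

Scores like these can be fed back into a dashboard to show which levers matter most for each country.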
The output of the predictive models, displayed in the Spotfire dashboard
All of this work was collected in a single collaborative workspace, which made it easier to share data and results and to integrate each other's work — data, workflows, code, and dashboards.
It was clear right from the beginning that we didn't have enough data to build accurate and defensible models. With only 17 years of data, it's hard to imagine what a hold-out sample looks like! So we agreed to get data for a dozen more countries besides Australia and New Zealand, and to build models across all of them at once — the assumption being that life expectancy depends on the various socioeconomic and health factors in a consistent way across the industrialized world. The resulting models weren't too bad — a mean absolute error of less than 1%, and reasonable R² values.
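Once countries are pooled, one defensible way to validate is to hold out entire countries per fold, so a model is always scored on countries it never saw. The sketch below illustrates that idea with scikit-learn's `GroupKFold` on synthetic data shaped like ours (15 countries × 17 years); it is not the validation scheme we actually ran.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GroupKFold, cross_val_score

# Synthetic data shaped like the pooled dataset: 15 countries x 17 years.
X, y = make_regression(n_samples=15 * 17, n_features=10, noise=1.0,
                       random_state=0)
countries = np.repeat(np.arange(15), 17)

# Each fold holds out whole countries, so the model is always scored
# on countries it was not trained on.
scores = cross_val_score(Ridge(), X, y, groups=countries,
                         cv=GroupKFold(n_splits=5),
                         scoring="neg_mean_absolute_error")
print(f"MAE across country-held-out folds: {-scores.mean():.2f}")
```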
There were also far too many variables to include in a single model with so few observations. So we used a variety of techniques to select the most meaningful. Besides our own intuition, we also used p-values from standard regressions, and also the Correlation Filter function in Team Studio. It’s one of my favorite methods for doing variable selection, and you can read about it in our documentation.
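The idea behind a correlation filter is simple: when two variables carry nearly the same information, keep only one of them. Here is a minimal approximation of that logic in pandas — not the Team Studio operator itself, and the variables (`gdp`, `gdp_log`, `schooling`) are hypothetical.

```python
import numpy as np
import pandas as pd

def correlation_filter(df: pd.DataFrame, threshold: float = 0.9) -> list:
    """Keep one variable from each highly correlated pair.

    Walks the upper triangle of the absolute correlation matrix and
    drops any column whose correlation with an earlier column exceeds
    the threshold.
    """
    corr = df.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return [c for c in df.columns if c not in drop]

rng = np.random.default_rng(0)
a = rng.normal(size=100)
df = pd.DataFrame({
    "gdp": a,
    "gdp_log": a * 1.01 + 0.001,      # perfectly correlated with gdp
    "schooling": rng.normal(size=100), # independent of the others
})
# gdp_log duplicates gdp, so only gdp and schooling survive the filter.
print(correlation_filter(df))
```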
Results of the Correlation Filter operator, which was used for variable selection
The feedback from Gartner, and from the audience at the summit, was very positive. In particular, everyone liked the way the team was able to collaborate. It emphasized the different personas that we target with the TIBCO Data Science stack — everyone from business analysts who are used to working with dashboards and spreadsheets, to data scientists and Python programmers.
This speaks not only to the breadth of the platform but also to its depth. You have interactive visualizations at the top of the stack, along with extensive data wrangling capabilities. These can be augmented by R code, SQL code, and of course the full capabilities of TIBCO Data Science workflows, which can execute a complete end-to-end analytics pipeline, including ETL in a relational database and PySpark models on a Spark cluster.
This reminds me of some projects that TIBCO’s data science team has worked on in the area of public policy and health. An analysis of opioid prescriptions found that U.S. medical providers wrote opioid prescriptions at an alarming rate compared to other developed countries. And a study of parking citations in the city of San Francisco led to policy changes by the local government.
Steven Hillion works on large-scale machine learning. He was the co-founder of Alpine Data, acquired by TIBCO, and before that he built a global team of data scientists at Pivotal and developed a suite of open-source and enterprise software in machine learning. Earlier, he led engineering at a series of start-ups. Steven is originally from Guernsey, in the British Isles. He received his Ph.D. in mathematics from the University of California, Berkeley, and before that read mathematics at Oxford University.