Who's Who in a Data Science Team

Published:
11:50am Aug 21, 2020

The personas of Data Science and where they interact with the TIBCO Data Science platform


On a recent Data Science Central webinar, we talked about a few life sciences projects that we’d worked on recently, including image-based models at PerkinElmer for detecting significant biomarkers for clinical prognosis, and a study by TIBCO’s Data Science team of the factors behind life expectancy in industrialized countries.

The technology and the analytics were interesting enough. Check out the recording to see how both teams used Spotfire Data Functions to create interactive dashboards backed up by complex models running in TIBCO Data Science. And you can see how Alberto’s team used the extensibility of TIBCO’s analytics platform to add new connectors and statistical analyses to their workflows that automatically scaled to a terabyte of data and yielded order-of-magnitude performance improvements. (You can also see some examples of the team's work in the screenshots below, as well as the appendix at the end of this article.)

Spotfire applications created by PerkinElmer for drug discovery. High content screening data is summarized by plates and features, together with some quality control metrics.

But what’s also interesting is how these projects illustrated the different roles within the data science team, and how they all need to work together in one integrated analytics environment. As Carlie Idoine recently pointed out in her Gartner Report, Worlds Collide as Augmented Analytics Draws Analytics, BI and Data Science Together, “Users — whether data scientists, analytics and BI analysts, or other citizen or expert users — become both producers and consumers, moving fluidly across capabilities as their analyses dictate.”

There are many roles on a data science team: for example, the Data Engineer who cleanses and transforms the data; the MLOps engineer who converts a hodgepodge of scripts into something operational; the Project Manager who keeps everything on track. But the core team members that Idoine mentions are those who work directly with the data using machine learning and statistical methods. Their roles are particularly ‘fluid’, but they manifest as three distinct personas that need to collaborate especially closely: the Business Analyst who explores and consumes the output of these methods to develop insights and create hypotheses; the Data Scientist who creates the models and data pipelines; and the ML Engineer who creates the functions, tools, and infrastructure that the Data Scientist needs.

The Data Scientist of course is chiefly responsible for using machine learning and statistical methods to create predictive models and insights. In general, they use a range of tools, from open-source Python libraries to commercial IDEs. Ideally, they like to work with off-the-shelf functions that are commonly used and well-tested. They’re too busy figuring out what’s in the data to spend time developing new functions and methods from scratch.

Example of a predictive workflow created by the Data Scientist.


But sometimes the team needs to implement new functions, maybe a particularly complex series of transformations for handling wide data that needs to be packaged up in a single Spark pipeline for speed and scale. Or maybe an implementation of a lesser-known time series encoding. Or maybe the infrastructure that supports all these capabilities. That calls for a data scientist with rather specialized skills, the Machine Learning Engineer. This hybrid role combines a knowledge of statistics and machine learning with software engineering skills that emphasize scale and mathematical rigor.

Custom code in Python Notebooks created by the Machine Learning Engineer. PySpark can also be used to run scalable code without moving massive amounts of data.
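
As one illustration of the kind of small building block an ML Engineer might package for reuse, here is a minimal sketch of a cyclical (sine/cosine) encoding of the time of day, a time series encoding that keeps 23:30 and 00:30 close together. It is plain Python rather than PySpark so that it runs anywhere. The function names are hypothetical, and this is not code from either team's project.

```python
import math
from datetime import datetime
from typing import Tuple


def encode_hour_cyclical(timestamp: datetime) -> Tuple[float, float]:
    """Map the time of day onto the unit circle as (sin, cos).

    Illustrative helper (hypothetical name): times that are close on the
    clock, such as 23:30 and 00:30, end up close in the encoded space.
    """
    # Fraction of the day elapsed, in [0, 1).
    seconds = timestamp.hour * 3600 + timestamp.minute * 60 + timestamp.second
    fraction = seconds / 86400
    angle = 2 * math.pi * fraction
    return math.sin(angle), math.cos(angle)


def encoded_distance(a: datetime, b: datetime) -> float:
    """Euclidean distance between two timestamps after encoding."""
    return math.dist(encode_hour_cyclical(a), encode_hour_cyclical(b))


if __name__ == "__main__":
    late = datetime(2020, 8, 21, 23, 30)
    early = datetime(2020, 8, 22, 0, 30)
    noon = datetime(2020, 8, 21, 12, 0)
    # Late evening and early morning are neighbours on the circle; noon is not.
    print(encoded_distance(late, early) < encoded_distance(late, noon))
```

Once a helper like this is wrapped as a reusable component, a Data Scientist can apply it in a workflow without needing to know how it works internally.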


The work of the ML Engineer is supported in TIBCO Data Science by multiple modes of extensibility. When a certain set of operations becomes routine in TIBCO Data Science, then it makes sense to consolidate them into custom components (“MODs”) to be used in any workflow by either advanced Business Analysts or Data Scientists. A good example is how PerkinElmer is accessing its Signals Platform. Custom connectors have been developed to access specific Signals datasets and use them immediately inside an analytics workflow as below:

Data Scientists can access specific Signals datasets using custom connectors created as TIBCO Data Science MODs by Machine Learning Engineers.

This level of flexibility lets companies have their ML Engineers build a catalog of the operators they need, while keeping the power of low-code and no-code tools available to their Data Scientists and Business Analysts.

Just as the Data Scientist depends on the Machine Learning Engineer to get into the weeds of programming, so the Business Analyst or Business User depends on the Data Scientist to get into the weeds of machine learning and statistics to tease models and insights from data. The Business Analyst will manipulate and interact with those analyses and will often have a thorough understanding of the techniques involved. But they are largely relying on the rest of the data science team to put the building blocks together.

Exploratory dashboards used by the Business Analyst or Business User to interact with models. Workflows can be started from Spotfire using data functions, and the results are automatically updated into the Spotfire visualizations.


You can think of the Business Analyst as a Lego enthusiast, the Machine Learning Engineer as the manufacturer of Lego Blocks — and the Data Scientist as whoever it is who designs those prefabricated Lego kits.

Of all these, I find the evolving role of the Business Analyst or Business User most interesting. Increasingly, the ‘business analyst’ is becoming the ‘citizen data scientist’, although they might not describe themselves this way. (The term ‘citizen’ is perhaps too patronizing to have a chance of surviving, and what’s wrong with ‘business analyst’ anyway? It’s arguably more descriptive and accurate than ‘data scientist’.)

For operational models, the business will continue to rely on a qualified team of Data Scientists. But for exploring data and deriving insights, there’s no reason why Business Users need to rely on anyone or anything else besides a well-equipped analytics application. Machine learning and statistical functions can be provided with self-explanatory interfaces and documentation. They can be designed to validate their own assumptions in the data to which they’re applied. In short, they can be made accessible.
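
To make the idea of a function that validates its own assumptions concrete, here is a minimal sketch: a Pearson correlation that checks its inputs and explains any problem in plain language before computing anything. The function name, the messages, and the small-sample threshold of 30 are all illustrative assumptions, not part of any TIBCO product.

```python
import math
from typing import List, Sequence, Tuple


def checked_correlation(x: Sequence[float], y: Sequence[float],
                        min_points: int = 3) -> Tuple[float, List[str]]:
    """Pearson correlation that validates its inputs before computing.

    Raises ValueError with a plain-language message when an assumption
    fails; otherwise returns (coefficient, warnings).
    """
    if len(x) != len(y):
        raise ValueError("Both columns must have the same number of rows.")
    if len(x) < min_points:
        raise ValueError(f"At least {min_points} rows are needed.")
    if any(math.isnan(v) for v in list(x) + list(y)):
        raise ValueError("Missing values found; fill or drop them first.")

    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dev_x = [v - mean_x for v in x]
    dev_y = [v - mean_y for v in y]
    ss_x = sum(d * d for d in dev_x)
    ss_y = sum(d * d for d in dev_y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("A column is constant, so correlation is undefined.")

    warnings: List[str] = []
    if len(x) < 30:
        # The threshold of 30 is an arbitrary, illustrative choice.
        warnings.append("Small sample: treat the result as indicative only.")

    r = sum(a * b for a, b in zip(dev_x, dev_y)) / math.sqrt(ss_x * ss_y)
    return r, warnings
```

A dashboard could surface the returned warnings next to the result, so that the Business User sees the caveat without having to read any code.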

And that’s what we’ve aimed to deliver with TIBCO Spotfire, backed up by the full power of TIBCO’s Data Science platform. By packaging up the methods of data science in a way that’s intuitive and straightforward, we believe they can be used by anyone who’s comfortable with (say) spreadsheets or business intelligence applications. In this way, the data science toolbox provided in a platform like TIBCO Data Science can be a point of collaboration between all three of the personas we’ve described. Carlie Idoine described the ideal approach this way: “Data scientists and BI analysts alike can move smoothly from one analytics capability to another whenever and wherever the need dictates.”

The methods of data science are, in many ways, very approachable. Most of them can, with appropriate care and technical guardrails, be treated as black boxes. Even methods as complex as deep learning can be used for anomaly detection or image recognition with hardly any knowledge of the internals. So why shouldn’t all of these methods be made available to the dashboard user? The PerkinElmer example shows that there’s enormous power in this idea.

Appendix: Sample Use Cases from PerkinElmer

High Content Screening Analysis

Feature selection and prediction of image-based features from High Content Screening experiments.

Advanced Cohort Creation in Signals Translational

TIBCO Data Science is used to generate cohorts of clinical subjects according to multiple conditions and TIBCO Spotfire is then used by business users to evaluate the resulting cohorts of subjects.
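
Selecting a cohort by multiple conditions can be sketched as a filter that keeps only the subjects satisfying every condition. The field names, the conditions, and the records below are made up for illustration. They are not real clinical data and do not reflect the Signals Translational data model.

```python
from typing import Callable, Dict, List

Subject = Dict[str, object]
Condition = Callable[[Subject], bool]


def build_cohort(subjects: List[Subject],
                 conditions: List[Condition]) -> List[Subject]:
    """Return the subjects that satisfy every condition."""
    return [s for s in subjects if all(cond(s) for cond in conditions)]


# Made-up records for illustration only.
subjects: List[Subject] = [
    {"id": "S1", "age": 64, "diagnosis": "NSCLC", "treated": True},
    {"id": "S2", "age": 41, "diagnosis": "NSCLC", "treated": False},
    {"id": "S3", "age": 58, "diagnosis": "NSCLC", "treated": True},
    {"id": "S4", "age": 70, "diagnosis": "melanoma", "treated": True},
]

# Hypothetical inclusion criteria, each expressed as a small predicate.
conditions: List[Condition] = [
    lambda s: s["diagnosis"] == "NSCLC",
    lambda s: s["age"] >= 50,
    lambda s: bool(s["treated"]),
]

cohort = build_cohort(subjects, conditions)
print([s["id"] for s in cohort])
```

Keeping each criterion as a separate predicate makes it easy to add or remove conditions and compare the resulting cohorts.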

Tumor mutational burden (TMB)

Measurement of mutations carried by tumor cells. This is a biomarker associated with response to immuno-oncology therapy. TIBCO Data Science is used for the calculations and Spotfire to visualize the TMB results.
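
TMB is commonly reported as the number of somatic mutations per megabase of sequenced DNA. Here is a minimal sketch under that assumption. Which mutations are counted and how the covered region is defined depend on the assay, and this is not necessarily the calculation used in the PerkinElmer workflows. The function name is hypothetical.

```python
def tumor_mutational_burden(mutation_count: int, covered_bases: int) -> float:
    """Somatic mutations per megabase of sequenced territory.

    Assumes the common definition: mutation count divided by the size of
    the covered region expressed in megabases (1 Mb = 1,000,000 bases).
    """
    if mutation_count < 0:
        raise ValueError("mutation_count cannot be negative.")
    if covered_bases <= 0:
        raise ValueError("covered_bases must be positive.")
    return mutation_count / (covered_bases / 1_000_000)


if __name__ == "__main__":
    # Illustrative numbers: 45 mutations over a 1.5 Mb panel.
    print(tumor_mutational_burden(45, 1_500_000))
```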


Authors

Alberto Pascual holds a doctorate in Bioinformatics and has extensive experience in the biomedical domain. A graduate in Computer Science Engineering, he received a PhD in Bioinformatics in 2002 from the Autonomous University of Madrid. He spent time as a postdoc at the KEY Institute for Brain-Mind Research in Zurich, working on neuroinformatics. In 2004 he joined the Computer Architecture Department at the Complutense University of Madrid, and in 2009 he joined the National Center for Biotechnology (CNB) as a senior researcher, leading a bioinformatics research and core facility group. Alberto was one of the founders of Integromics, a bioinformatics company that started in 2003. The company received the Frost and Sullivan European Bioinformatics Project of the Year Award for 2007. Integromics provided state-of-the-art bioinformatics software solutions for data management and data analysis in genomics and proteomics, using TIBCO Spotfire® as its development platform. Alberto’s mix of backgrounds includes business intelligence, biostatistics, machine learning, bioinformatics, and computational biology, among others. He has also published more than 70 papers in peer-reviewed, high-impact journals. He currently works as a Senior Analytics Solution Architect in the R&D organization at PerkinElmer Informatics, developing innovative bioinformatics and analytical applications in the context of translational medicine, screening, imaging, clinical, manufacturing, and other analytical areas.

Steven Hillion works on large-scale machine learning. He was the co-founder of Alpine Data, acquired by TIBCO, and before that he built a global team of data scientists at Pivotal and developed a suite of open-source and enterprise software in machine learning. Earlier, he led engineering at a series of start-ups. Steven is originally from Guernsey, in the British Isles. He received his Ph.D. in mathematics from the University of California, Berkeley, and before that read mathematics at Oxford University.