Using Python and R in TIBCO Data Science
Adding R and Python code to a visual workflow in TIBCO Data Science
As a collaborative data science platform designed to make advanced analytics accessible to everyone, TIBCO Data Science provides an intuitive way to create drag-and-drop workflows using a range of out-of-the-box machine learning algorithms. The platform also has built-in Python and R executors, so we can bring our own user-defined functions into a workflow by writing Python or R code.
In this blog post, we will learn how to use embedded Jupyter Notebooks in TIBCO Data Science to train a LightGBM model, and how to write R code to train a decision tree model, both on the iris dataset.
Part 1: Using Python for Building a LightGBM Model in TIBCO Data Science
The main steps for using Python in TIBCO Data Science are:
Create a Jupyter Notebook in the workspace
Open the Jupyter Notebook and fill in the Python code
Create a workflow in the workspace and add a Python Execute operator
Let’s go through the details of each step.
Step 1: Create a Jupyter Notebook in the workspace
After creating a workspace, we can easily add a Jupyter Notebook in the work files section.
Step 2: Open the Jupyter Notebook and fill in the Python code
In the notebook content section, we can import Python packages and install any that are missing (e.g. run “!pip install lightgbm” in a cell to install the LightGBM package).
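If the notebook is rerun often, it can be convenient to install a package only when it is missing. The following is a minimal sketch of that idea using only the Python standard library. The helper names “is_installed” and “ensure_package” are made up for this example and are not part of TIBCO Data Science.

```python
import importlib.util
import subprocess
import sys
from typing import Optional


def is_installed(module_name: str) -> bool:
    """Return True if the current interpreter can find the top-level module."""
    return importlib.util.find_spec(module_name) is not None


def ensure_package(module_name: str, pip_name: Optional[str] = None) -> bool:
    """Install a package with pip only if its module is missing.

    Returns True if an install was attempted, False if the module was
    already available.
    """
    if is_installed(module_name):
        return False
    # Use the interpreter that runs this notebook kernel, so the package
    # lands in the same environment the notebook imports from.
    subprocess.check_call(
        [sys.executable, "-m", "pip", "install", pip_name or module_name]
    )
    return True


# Example call (commented out so nothing is installed by accident):
# ensure_package("lightgbm")
```

Calling pip through “sys.executable” avoids the common situation where a bare “pip” on the path belongs to a different Python environment than the notebook kernel.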
Then we can import the dataset for analysis. Here we need to use the iris data table from a PostgreSQL database, so we should define the data source name, table name, schema name and database name for targeting and importing the data.
Note that we have set “use_input_substitution” to true and “execution_label” to 1. These settings allow the notebook to read its input data from the workflow later.
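The notebook's real import code is specific to the platform and is not reproduced here. As a rough mental model of what the two settings do, the toy sketch below resolves which table a notebook ends up reading. Everything in it is hypothetical: the classes “TableRef” and “NotebookInput”, the function “resolve_input”, the placeholder data source and table names, and the assumption that the notebook falls back to its own table when the workflow supplies no substitute.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class TableRef:
    """Identifies one database table (all names used below are placeholders)."""
    data_source: str
    database: str
    schema: str
    table: str


@dataclass(frozen=True)
class NotebookInput:
    """A notebook read request with the two settings discussed above."""
    default: TableRef
    use_input_substitution: bool
    execution_label: str


def resolve_input(request: NotebookInput,
                  substitute_inputs: Dict[str, TableRef]) -> TableRef:
    """Pick the table the notebook ends up reading.

    substitute_inputs stands for what the workflow operator supplies,
    keyed by label ("1" for "Substitute Input 1", and so on).
    """
    if request.use_input_substitution:
        # Assumption: with no substitute for this label (for example an
        # interactive run), the notebook reads its own default table.
        return substitute_inputs.get(request.execution_label, request.default)
    return request.default


iris = TableRef("my_postgres", "demo_db", "public", "iris")
upstream = TableRef("my_postgres", "demo_db", "public", "iris_standardized")
request = NotebookInput(iris, use_input_substitution=True, execution_label="1")

print(resolve_input(request, {}).table)               # iris
print(resolve_input(request, {"1": upstream}).table)  # iris_standardized
```

The label is what ties a particular read in the notebook to a particular input slot of the operator, which is why the same number has to be used in both places.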
There is a useful trick for importing the dataset without typing any code. First of all, we need to find the required dataset in “Data Sources” (find it by clicking the top left menu icon), and associate it with our workspace.
After making this association, we can go to the notebook and find the function “Import Dataset into Notebook” under the data menu.
Clicking the “Import” button automatically generates the data import code in the notebook with default settings, so you may need to modify parameters such as “use_input_substitution” and “execution_label” manually.
Then we can split the dataset into training and testing sets and build a LightGBM classifier. The flower’s sepal length, sepal width, petal length and petal width have strong explanatory power for predicting the iris species, so we can obtain a good classifier with all default settings. The evaluation shows that the model predicts the testing data accurately.
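The split, training and evaluation steps can be sketched as follows. This is not the notebook from the post: it assumes the “lightgbm” and “scikit-learn” packages are installed, it loads the copy of iris bundled with scikit-learn as a stand-in for the PostgreSQL table, and the 30% test share and the seed are arbitrary choices made for this example.

```python
def train_and_evaluate(test_size: float = 0.3, seed: int = 42) -> float:
    """Train a LightGBM classifier on iris and return the test accuracy."""
    # Imports are kept inside the function so the definition itself runs
    # even before the packages are installed.
    from lightgbm import LGBMClassifier
    from sklearn.datasets import load_iris
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Stand-in for the iris table read from the database in the post.
    X, y = load_iris(return_X_y=True)

    # Stratify so each species keeps the same share in both splits.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed, stratify=y
    )

    # Only the seed is set; every other hyperparameter stays at its default.
    model = LGBMClassifier(random_state=seed)
    model.fit(X_train, y_train)

    # Share of test rows whose species was predicted correctly.
    return accuracy_score(y_test, model.predict(X_test))
```

Calling “train_and_evaluate()” returns a single accuracy figure between 0 and 1. The exact value depends on the split and on the library versions.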
At the end of the notebook, we can define the output and save it as a data table in our database. Here we show the example of saving the testing data together with the predictions generated by our LightGBM model.
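The platform's own helper for writing an output table is not reproduced here. As a sketch of the shape of this step, the example below attaches a prediction column to the test rows and writes them to a table, using an in-memory SQLite database from the Python standard library as a stand-in for PostgreSQL. The function names, the table name and the column names are hypothetical, and the two predictions are placeholders rather than real model output.

```python
import sqlite3
from typing import Dict, List, Sequence


def attach_predictions(rows: List[Dict[str, object]],
                       predictions: Sequence[str]) -> List[Dict[str, object]]:
    """Return copies of the test rows with a 'prediction' column added."""
    if len(rows) != len(predictions):
        raise ValueError("need exactly one prediction per test row")
    return [{**row, "prediction": pred} for row, pred in zip(rows, predictions)]


def save_table(conn: sqlite3.Connection, table: str,
               rows: List[Dict[str, object]]) -> int:
    """Recreate the table from the rows' keys, insert the rows, return the count."""
    columns = list(rows[0])
    quoted = ", ".join(f'"{c}"' for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    conn.execute(f'DROP TABLE IF EXISTS "{table}"')
    conn.execute(f'CREATE TABLE "{table}" ({quoted})')
    conn.executemany(
        f'INSERT INTO "{table}" ({quoted}) VALUES ({placeholders})',
        [tuple(row[c] for c in columns) for row in rows],
    )
    conn.commit()
    return len(rows)


test_rows = [
    {"sepal_length": 5.1, "sepal_width": 3.5,
     "petal_length": 1.4, "petal_width": 0.2, "species": "setosa"},
    {"sepal_length": 7.0, "sepal_width": 3.2,
     "petal_length": 4.7, "petal_width": 1.4, "species": "versicolor"},
]

conn = sqlite3.connect(":memory:")
scored = attach_predictions(test_rows, ["setosa", "versicolor"])  # placeholders
print(save_table(conn, "iris_test_predictions", scored))  # 2
```

Keeping the actual species next to the predicted one in the saved table makes it easy to check the model's errors later with a plain SQL query.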
So far we have learnt how to create a Jupyter Notebook and build a model in Python. In the next step, we will integrate the notebook with our workflow.
Step 3: Create a workflow in the workspace and add a Python Execute operator
Here we have created a simple workflow that imports the iris dataset, standardizes the column types (this step only changes the data column types; you may omit it in your own project), and passes the result to the Python Execute operator.
When editing the Python Execute operator, we need to select the desired notebook as well as the substitute input. Because we set “execution_label” to 1 earlier, we need to configure the input in the “Substitute Input 1” section.
Now we are able to run the workflow and see the output of running the Python Jupyter Notebook. We can also observe that the results have been saved in a new data table as we defined.
Part 2: Using R for Building a Decision Tree Model in TIBCO Data Science
We can also run R code in a workflow by using the R Execute operator.
Here we have created a simple workflow that imports the iris dataset and connects it to the R Execute operator.
Then we can start writing R code by clicking the “Define Clause” button on the editing page.
We have already seen that the iris data can be used to build a classifier, so here we build a decision tree model on the full dataset to better understand variable importance.
Running the entire workflow, we are able to view the detailed modeling results including variable importance from the R Execute operator.
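The R code itself is not reproduced here, and R packages compute variable importance in their own ways. The underlying idea is language-neutral, though: a tree prefers variables whose splits separate the classes most cleanly. The sketch below illustrates that idea in Python by measuring how much the best single split on a variable reduces Gini impurity. The function names are made up for this example, and the four rows are a tiny illustrative sample in the style of the iris data, not the full dataset.

```python
from collections import Counter
from typing import Sequence, Tuple


def gini(labels: Sequence[str]) -> float:
    """Gini impurity: 0.0 for a pure group, higher for mixed groups."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split_gain(values: Sequence[float],
                    labels: Sequence[str]) -> Tuple[float, float]:
    """Return (threshold, impurity reduction) of the best single split."""
    n = len(labels)
    parent = gini(labels)
    best = (float("nan"), 0.0)
    distinct = sorted(set(values))
    # Candidate thresholds sit halfway between neighbouring distinct values.
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [lab for v, lab in zip(values, labels) if v <= threshold]
        right = [lab for v, lab in zip(values, labels) if v > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        gain = parent - weighted
        if gain > best[1]:
            best = (threshold, gain)
    return best


labels = ["setosa", "setosa", "versicolor", "versicolor"]
petal_length = [1.4, 1.4, 4.7, 4.5]
sepal_width = [3.5, 3.0, 3.2, 3.2]

_, petal_gain = best_split_gain(petal_length, labels)
_, sepal_gain = best_split_gain(sepal_width, labels)
print(petal_gain > sepal_gain)  # True
```

On this sample, one split on petal length separates the two species completely, while no split on sepal width does, so petal length would rank as the more important variable.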
Through these simple examples of using Python and R to build machine learning models on the iris dataset in TIBCO Data Science, we hope you can see how convenient it is to integrate code into your visual workflows. In this way, you can extend the analytics capabilities of your workflows and combine highly scalable data prep operators with advanced open-source functions in Python and R.
Jingchuan Lin, Data Scientist, TIBCO
Jingchuan Lin is a data scientist currently working at the TIBCO Software Singapore office, where he applies data science knowledge to provide solutions for customers from multiple APJ countries. He has developed strong analytical skills from solving real-world problems in various industries. He is passionate about learning the latest techniques in data science and discovering valuable insights from big data. Jingchuan grew up in China and went to Singapore for college. He obtained his Master’s degree in Computing and Bachelor’s degree in Business Analytics from the National University of Singapore.
Steven Hillion, Director of Data Science, TIBCO
Steven Hillion works on large-scale machine learning. He was the co-founder of Alpine Data, acquired by TIBCO, and before that he built a global team of data scientists at Pivotal and developed a suite of open-source and enterprise software in machine learning. Earlier, he led engineering at a series of start-ups. Steven is originally from Guernsey, in the British Isles. He received his Ph.D. in mathematics from the University of California, Berkeley, and before that read mathematics at Oxford University.