TIBCO Spotfire Data Function Library: Python and R Scripts for Data Analysis

Last updated: 12:25am Nov 09, 2021


What's a Data Function?

In the broadest sense, a Data Function is an operation you perform on your data. For the purposes of this page, Data Functions are R and Python scripts that extend your Spotfire analytics experience.

The functions operate on Spotfire input data in the form of Data Tables, Data Columns, and Property variables. If you'd like, Data Functions can be dynamically re-computed from interactive chart selections (markings) and filters, with all calculations happening in Spotfire's memory so you never have to manage storage of the results. The results are output in the same forms as the inputs: Data Tables, Columns, and Property variables.

Data Functions can also be written in other languages like MATLAB and SAS, and can connect directly to other software such as TDS Team Studio, TDS Statistica, KNIME, and more. Those topics are outside the scope of this companion, but they are good to be aware of.


Available Data Functions

While you can always create your own data functions, this page is a guide to easy-to-use prebuilt functions from TIBCO's Data Science Team. These data functions are all built in a generalized format -- something you can quickly plug-and-play into your own Spotfire analyses. Many are hosted for FREE DOWNLOAD on the TIBCO Exchange. Here's a 90-second helper video if you need it.

Download any of the available Data Functions below: 






>> Geoanalytics


Spatial Heatmap [R/TERR]

Compute a "spatial generalization" in three dimensions to give a clear, aggregate view of your spatial data. The heatmap takes x, y, z, and theta inputs, where x and y are your latitude and longitude coordinates and z is some variable of interest. Above, z represents the rental prices of Airbnb properties in Boston; theta lets you adjust the smoothing level of the underlying LOESS method. The contours and heatmap gradations emphasize the regions of highest-priced rentals.

Packages: none
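
The packaged function is R/TERR and uses LOESS internally; purely as a rough illustration of the idea, the Python sketch below bins made-up x/y/z points onto a grid and smooths them with a Gaussian filter (a stand-in for LOESS), with sigma playing the role of the theta smoothing input. All data and parameter values here are hypothetical.

    import numpy as np
    from scipy.stats import binned_statistic_2d
    from scipy.ndimage import gaussian_filter
    import matplotlib.pyplot as plt

    # Hypothetical inputs: lon/lat coordinates (x, y) and a value of interest (z),
    # e.g. rental price per listing.
    rng = np.random.default_rng(42)
    x = rng.uniform(-71.15, -71.00, 1000)                 # longitude
    y = rng.uniform(42.30, 42.40, 1000)                   # latitude
    z = 100 + 50 * np.exp(-((x + 71.06)**2 + (y - 42.35)**2) / 0.001) + rng.normal(0, 5, 1000)

    # Bin the z-values onto a regular grid (mean per cell), then smooth.
    stat, xedges, yedges, _ = binned_statistic_2d(x, y, z, statistic="mean", bins=60)
    grid = np.nan_to_num(stat, nan=np.nanmean(z))
    smoothed = gaussian_filter(grid, sigma=2)             # sigma plays the role of theta

    plt.imshow(smoothed.T, origin="lower",
               extent=[xedges[0], xedges[-1], yedges[0], yedges[-1]],
               aspect="auto", cmap="inferno")
    plt.colorbar(label="smoothed z")
    plt.show()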

(return to top)


Contour Plot [R/TERR]

Much like the Spatial Heatmap above, the contour plot data function "generalizes" spatial x and y coordinates for a given z-value, or variable of interest. Here the variable is the production of fictional oil and gas wells in the Texas/Oklahoma region, showing areas where wells are most and least productive. As with the Spatial Heatmap, a LOESS function is the underlying smoothing method.

Packages: none
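
For a rough Python sketch of the same idea (not the actual R/TERR implementation), a distance-weighted k-nearest-neighbors regression can stand in for the LOESS smoother before drawing contours; the well coordinates and production values below are made up.

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.neighbors import KNeighborsRegressor

    # Hypothetical well data: x/y coordinates and z = production.
    rng = np.random.default_rng(0)
    x = rng.uniform(-100.0, -95.0, 500)                   # longitude
    y = rng.uniform(33.0, 36.0, 500)                      # latitude
    z = np.sin(2 * x) + np.cos(2 * y) + rng.normal(0, 0.2, 500)

    # Local smoother (stand-in for LOESS): distance-weighted k-NN regression.
    model = KNeighborsRegressor(n_neighbors=30, weights="distance")
    model.fit(np.column_stack([x, y]), z)

    # Evaluate on a regular grid and draw filled contours.
    gx, gy = np.meshgrid(np.linspace(x.min(), x.max(), 200),
                         np.linspace(y.min(), y.max(), 200))
    gz = model.predict(np.column_stack([gx.ravel(), gy.ravel()])).reshape(gx.shape)
    plt.contourf(gx, gy, gz, levels=12, cmap="viridis")
    plt.colorbar(label="smoothed production")
    plt.show()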

(return to top)


Points-in-Polygons (aka Geofencing) [R/TERR]

Have a bunch of uncategorized point data? Do these points belong to certain regions or geometric polygons? Use this data function to automatically label all of your points by the regions they fall within -- shown here with individual New York City crimes labeled and colored by zip code.

This works across two tables: 1) the point data and 2) the polygon data. For each point in a table of locations defined by latitude and longitude coordinates, the function identifies its enclosing polygon from the separate polygon table. It returns an "identifier" column containing the enclosing polygon's identifier, which is appended to the point location table. Read the step-by-step Instructions for more info.

Packages: sp, wkb
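
The Spotfire data function relies on the R sp and wkb packages; as a minimal illustration of the same point-in-polygon labeling in Python (using shapely, with made-up point and polygon tables and hypothetical zip-code identifiers):

    import pandas as pd
    from shapely.geometry import Point, Polygon

    # Hypothetical "polygon" table: one named region per row.
    regions = {
        "zip_10001": Polygon([(-74.01, 40.74), (-73.99, 40.74), (-73.99, 40.76), (-74.01, 40.76)]),
        "zip_10002": Polygon([(-73.99, 40.71), (-73.97, 40.71), (-73.97, 40.73), (-73.99, 40.73)]),
    }

    # Hypothetical "point" table: one incident per row.
    points = pd.DataFrame({
        "lon": [-74.00, -73.98, -73.90],
        "lat": [40.75, 40.72, 40.70],
    })

    def enclosing_region(lon, lat):
        """Return the identifier of the first polygon containing the point, else None."""
        pt = Point(lon, lat)
        for name, poly in regions.items():
            if poly.contains(pt):
                return name
        return None

    # Append the identifier column, as the data function does for the point table.
    points["identifier"] = [enclosing_region(lon, lat)
                            for lon, lat in zip(points["lon"], points["lat"])]
    print(points)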

(return to top)


Spatial Density [R/TERR]

Here, a Density Map is computed and displayed for crimes in the New York City area. Density Maps provide a high-level summary for visualizing overall patterns in the density of spatial data, much like a two-dimensional histogram. Studying raw point data for patterns is otherwise difficult owing to uneven spatial coverage.

Density Maps start with the calculation of a smoothly varying surface representing the density of the data. That surface is then rendered as a colored heatmap with contours. Note that this differs from the Spatial Heatmap and Contour Plot Data Functions in that no z-value is used in the computation; the groupings shown by the Spatial Density calculation are based entirely on the concentration of the x- and y-coordinates alone.

Packages: none
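
A rough Python analogue of the density-surface calculation (a Gaussian kernel density estimate over synthetic point locations; not the actual R/TERR implementation) looks like this:

    import numpy as np
    from scipy.stats import gaussian_kde
    import matplotlib.pyplot as plt

    # Hypothetical point locations (e.g. incident longitude/latitude).
    rng = np.random.default_rng(1)
    x = np.concatenate([rng.normal(-73.95, 0.02, 800), rng.normal(-73.85, 0.03, 400)])
    y = np.concatenate([rng.normal(40.75, 0.02, 800), rng.normal(40.70, 0.03, 400)])

    # Kernel density estimate: a smooth surface based only on point concentration
    # (no z-value is involved, unlike the heatmap/contour functions above).
    kde = gaussian_kde(np.vstack([x, y]))
    gx, gy = np.meshgrid(np.linspace(x.min(), x.max(), 200),
                         np.linspace(y.min(), y.max(), 200))
    density = kde(np.vstack([gx.ravel(), gy.ravel()])).reshape(gx.shape)

    plt.contourf(gx, gy, density, levels=15, cmap="magma")
    plt.colorbar(label="estimated point density")
    plt.show()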

(return to top)


Voronoi Polygon [R/TERR]

From input point locations, create Voronoi polygons for Spotfire map visualization. The function can produce one Voronoi polygon per input point, or polygons that summarize multiple points. The resulting polygons form a tessellation that covers a study region.

The outline is determined either by a convex hull or by an optional separate boundary. Coordinates can be specified either as Cartesian (x, y) or geographic (Longitude, Latitude). The result is an object that can be dragged directly from the Spotfire data panel onto a map visualization.

Packages: ggvoronoi, sp, stats, wkb
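
As a minimal sketch of the underlying computation (using scipy.spatial rather than the R packages listed above, on made-up points):

    import numpy as np
    from scipy.spatial import Voronoi, voronoi_plot_2d
    import matplotlib.pyplot as plt

    # Hypothetical input point locations (Cartesian x/y shown; lon/lat works the same way here).
    rng = np.random.default_rng(7)
    points = rng.uniform(0, 10, size=(25, 2))

    # One Voronoi cell per input point; together the cells tessellate the plane.
    vor = Voronoi(points)

    # Each finite region is a polygon defined by vertex coordinates.
    for region_index in vor.point_region[:5]:
        region = vor.regions[region_index]
        if region and -1 not in region:                   # skip unbounded cells
            print(vor.vertices[region])                   # polygon vertices for this point

    voronoi_plot_2d(vor)
    plt.show()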

(return to top)


>> Machine Learning


Random Forest [R/TERR] 

Random forests, or random decision forests, are an ensemble learning method for classification, regression, and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees.

Random forests can be used in many areas, such as modeling and predicting a binary response variable (offer acceptance, customer churn, financial fraud, or product/equipment failure), as well as explaining detected anomalies.

This data function includes the R/TERR code for the random forest model, missing-data imputation, and Random Over-Sampling Examples to address the unbalanced-class issue for binary response variables. It uses the CRAN randomForest package within the Spotfire interface and is focused on supervised classification with a binary response variable. Read more here.
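
The packaged function is R/TERR; purely as an illustration of the over-sampling idea it describes, the Python sketch below balances a made-up binary response by naive random over-sampling of the minority class (a crude stand-in for the Random Over-Sampling Examples step) before fitting a random forest. Names and parameters are hypothetical.

    import numpy as np
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # Hypothetical unbalanced binary-response data (roughly 10% positives).
    rng = np.random.default_rng(0)
    X = pd.DataFrame(rng.normal(size=(1000, 5)), columns=[f"x{i}" for i in range(5)])
    y = (X["x0"] + rng.normal(0, 1, 1000) > 1.8).astype(int)

    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

    # Naive random over-sampling of the minority class until the classes are balanced.
    minority_idx = y_train[y_train == 1].index.to_numpy()
    n_needed = int((y_train == 0).sum() - len(minority_idx))
    extra = rng.choice(minority_idx, size=n_needed, replace=True)
    X_bal = pd.concat([X_train, X_train.loc[extra]])
    y_bal = pd.concat([y_train, y_train.loc[extra]])

    clf = RandomForestClassifier(n_estimators=500, random_state=0)
    clf.fit(X_bal, y_bal)
    print("test accuracy:", clf.score(X_test, y_test))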

(return to top)


Random Forest [Python] 

Similar to the Random Forest Data Function for R/TERR, but written in Python for those who prefer it.

Packages: pandas, numpy, scikit-learn
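
A minimal sketch of the kind of workflow this function wraps, using scikit-learn's RandomForestClassifier on an example dataset (the dataset and parameters are illustrative only, not the function's actual defaults):

    import pandas as pd
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split

    # Example binary-classification dataset standing in for your Spotfire input table.
    data = load_breast_cancer(as_frame=True)
    X_train, X_test, y_train, y_test = train_test_split(
        data.data, data.target, test_size=0.25, random_state=0)

    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(X_train, y_train)

    # Predictions and per-feature importances: the kinds of outputs a data
    # function would return to Spotfire as a table or columns.
    pred = clf.predict(X_test)
    print(classification_report(y_test, pred))
    importances = pd.Series(clf.feature_importances_, index=X_train.columns)
    print(importances.sort_values(ascending=False).head())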

(return to top)


Support Vector Machine [Python] 



Support Vector Machines (SVMs) are one of the most popular machine learning models and are particularly well suited for classification problems on small-to-medium sized datasets. SVM classifiers attempt to find the "widest possible street" between classes and minimize the number of margin violations (points between the dotted lines above).
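
A minimal scikit-learn sketch of an SVM classifier of the sort this data function trains (the dataset and parameters are illustrative, not the function's actual inputs):

    from sklearn.datasets import make_blobs
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # Two roughly separable classes; the SVM looks for the widest margin between them.
    X, y = make_blobs(n_samples=200, centers=2, cluster_std=1.5, random_state=0)

    # Feature scaling matters for SVMs, so pipeline a scaler with the classifier.
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
    model.fit(X, y)
    print("training accuracy:", model.score(X, y))
    print("support vectors per class:", model.named_steps["svc"].n_support_)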

(return to top)


Logistic Regression [Python] 



This data function will train and execute a logistic regression machine learning model on a given input dataset. Logistic regression is one of the most basic models used for classification and is ordinarily used for binary classification.
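
A minimal sketch of training and executing a logistic regression classifier with scikit-learn (synthetic data; the actual data function's inputs and options may differ):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Synthetic binary target standing in for the input dataset.
    X, y = make_classification(n_samples=500, n_features=8, n_informative=4, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    clf = LogisticRegression(max_iter=5000)
    clf.fit(X_train, y_train)

    # Class predictions plus predicted probabilities for the positive class.
    print("test accuracy:", clf.score(X_test, y_test))
    print("first five probabilities:", clf.predict_proba(X_test)[:5, 1])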

(return to top)


Local Outlier Factor [Python] 

This data function will use the unsupervised local outlier factor method to perform anomaly detection on a given dataset.
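
A minimal sketch of the local outlier factor method with scikit-learn (synthetic points with a few planted anomalies; parameter values are illustrative):

    import numpy as np
    from sklearn.neighbors import LocalOutlierFactor

    # Mostly "normal" points plus a few planted outliers.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, size=(300, 2)),
                   rng.uniform(-6, 6, size=(10, 2))])

    # Unsupervised LOF: -1 marks points whose local density is much lower than
    # that of their neighbors (i.e. likely anomalies).
    lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
    labels = lof.fit_predict(X)
    print("flagged outliers:", np.where(labels == -1)[0])
    print("negative outlier factors:", lof.negative_outlier_factor_[:5])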

(return to top)


Gradient Boosting Machine Regression [R/TERR] 

Gradient boosting is an ensemble decision-tree machine learning method; this data function uses it to identify the variables that best predict some outcome and to build highly accurate predictive models. For example, a retailer might use a gradient boosting algorithm to determine customers' propensity to buy a product based on their buying histories.
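
The packaged function is R/TERR; the sketch below shows the equivalent idea in Python with scikit-learn's GradientBoostingRegressor (illustrative dataset and parameters, not the function's actual implementation):

    import pandas as pd
    from sklearn.datasets import load_diabetes
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.model_selection import train_test_split

    # Example regression dataset standing in for, say, customer purchase-history features.
    data = load_diabetes(as_frame=True)
    X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, random_state=0)

    gbm = GradientBoostingRegressor(n_estimators=300, learning_rate=0.05,
                                    max_depth=3, random_state=0)
    gbm.fit(X_train, y_train)

    # R^2 on held-out data, plus which variables the boosted trees lean on most.
    print("test R^2:", gbm.score(X_test, y_test))
    importances = pd.Series(gbm.feature_importances_, index=X_train.columns)
    print(importances.sort_values(ascending=False).head())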

(return to top)


PCA [Python] 

This data function performs Principal Component Analysis (PCA) on a given numeric dataset. PCA is an unsupervised method often used for dimensionality reduction that reduces a dataset into one with a smaller set of variables that still captures the original dataset's information. Above, we can evaluate the reconstruction of images using varying numbers of principal components. 
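
A minimal scikit-learn sketch of the reduce-then-reconstruct idea described above (using the small built-in digits images as stand-in data):

    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA

    # 8x8 digit images flattened to 64 numeric columns.
    X, _ = load_digits(return_X_y=True)

    # Keep enough components to explain about 90% of the variance.
    pca = PCA(n_components=0.90, svd_solver="full")
    scores = pca.fit_transform(X)                 # reduced representation
    reconstructed = pca.inverse_transform(scores) # approximate images rebuilt from the scores

    print("original columns:", X.shape[1])
    print("components kept:", pca.n_components_)
    print("variance explained:", round(pca.explained_variance_ratio_.sum(), 3))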

(return to top)


K-Means Clustering [Python] 

This Python data function uses K-Means clustering to group similar items into clusters. Common uses for K-Means include customer segmentation, data pre-processing, and text classification.
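
A minimal scikit-learn sketch of K-Means clustering on made-up customer-like features (illustrative only; the data function's inputs and options may differ):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    # Hypothetical customer features (e.g. spend, visits, tenure) in three loose groups.
    rng = np.random.default_rng(3)
    X = np.vstack([rng.normal(loc, 1.0, size=(100, 3)) for loc in (0, 5, 10)])

    # Scale, then group into k clusters; each row gets a cluster label that could
    # be returned to Spotfire as a new column.
    X_scaled = StandardScaler().fit_transform(X)
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
    labels = kmeans.fit_predict(X_scaled)
    print("cluster sizes:", np.bincount(labels))
    print("cluster centers (scaled):", kmeans.cluster_centers_)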

(return to top)


Dunn Index [Python] 

This Python data function calculates the Dunn Index (DI), a metric for judging the quality of a clustering result. A higher DI implies better clustering, meaning that clusters are both compact and well separated from one another.
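
A minimal sketch of the calculation (smallest between-cluster distance divided by largest within-cluster diameter), written from the definition rather than taken from the data function itself:

    import numpy as np
    from scipy.spatial.distance import cdist, pdist
    from sklearn.cluster import KMeans

    def dunn_index(X, labels):
        """Dunn Index: smallest between-cluster distance divided by the
        largest within-cluster diameter. Higher is better."""
        clusters = [X[labels == k] for k in np.unique(labels)]
        max_diameter = max(pdist(c).max() for c in clusters if len(c) > 1)
        min_separation = min(cdist(a, b).min()
                             for i, a in enumerate(clusters)
                             for b in clusters[i + 1:])
        return min_separation / max_diameter

    # Example: score a K-Means result on synthetic, well-separated data.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(c, 0.5, size=(50, 2)) for c in (0, 5, 10)])
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
    print("Dunn Index:", dunn_index(X, labels))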

(return to top)


>> Statistical Analysis


Correlation [Python] 

This Python data function calculates the correlation coefficients between columns of data. Correlation analysis is an important step in determining whether columns are highly correlated and, if so, whether the correlation is positive or negative. This can help uncover relationships in the data and aid in data reduction, for example by removing highly correlated columns before building a data science model.

Three methods are available in this data function: Pearson, Spearman, and Kendall. All return a score between -1 and 1 for each pair of columns: -1 is the maximum negative correlation, 1 is the maximum positive correlation, and 0 means there is no correlation.
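
A minimal pandas sketch of the same calculation on a made-up table (the data function's actual inputs and outputs may be shaped differently):

    import numpy as np
    import pandas as pd

    # Hypothetical numeric data table.
    rng = np.random.default_rng(0)
    df = pd.DataFrame({"a": rng.normal(size=200)})
    df["b"] = df["a"] * 0.8 + rng.normal(scale=0.5, size=200)   # positively correlated with a
    df["c"] = -df["a"] + rng.normal(scale=0.3, size=200)        # negatively correlated with a
    df["d"] = rng.normal(size=200)                              # roughly uncorrelated

    # Pairwise correlation matrix; method can be "pearson", "spearman", or "kendall".
    print(df.corr(method="pearson").round(2))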

(return to top)


One-Hot Encoder [Python] 

This data function uses one-hot-encoding to transform categorical columns into multiple numeric columns containing 0's and 1's, based on the presence of a column-value combination. This method is commonly used to transform data prior to model creation in machine learning because models often require all numeric datasets as input.
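
As a minimal illustration of the transformation (using pandas get_dummies on a made-up table; not necessarily the library the data function itself uses):

    import pandas as pd

    # Hypothetical table with a categorical column.
    df = pd.DataFrame({"city": ["Boston", "Austin", "Boston", "Seattle"],
                       "spend": [120, 85, 60, 200]})

    # One 0/1 numeric column per category value, replacing the original column.
    encoded = pd.get_dummies(df, columns=["city"], dtype=int)
    print(encoded)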

(return to top)


>> Data Sources


Johns Hopkins University's COVID-19 Data [R/TERR] 

The quintessential COVID-19 dashboard and data source. Johns Hopkins University gathers, stores, and distributes data on COVID-19 infections worldwide. Using an R/TERR data function, you can download this data for your own analyses. Visit the following link to get the DXP template with the data function included.
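
As a rough Python illustration of pulling the same data directly (the URL below points at the public JHU CSSE GitHub repository that backs the dashboard; verify the path is still current, since the repository layout may have changed since this was written):

    import pandas as pd

    # Raw CSV from the public JHU CSSE COVID-19 GitHub repository.
    url = ("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/"
           "csse_covid_19_data/csse_covid_19_time_series/"
           "time_series_covid19_confirmed_global.csv")
    confirmed = pd.read_csv(url)

    # Reshape from one-column-per-date to a long table that is easier to chart in Spotfire.
    long = confirmed.melt(id_vars=["Province/State", "Country/Region", "Lat", "Long"],
                          var_name="Date", value_name="Confirmed")
    long["Date"] = pd.to_datetime(long["Date"])
    print(long.head())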

(return to top)