TIBCO Spotfire Data Function Library: Python and R Scripts for Data Analysis
What's a Data Function?
In the broadest sense, a Data Function is an operation you perform on your data. For the purposes of this page, Data Functions are R and Python scripts that extend your Spotfire analytics experience.
The functions operate on Spotfire input data in the form of Data Tables, Data Columns, and Property variables. If you'd like, Data Functions can be dynamically re-computed by interactive chart selections (markings) and filters, with each change triggering new calculations in Spotfire's memory, so the user never has to manage storage of intermediate results. The results are output in the same formats as the inputs: Data Tables, Data Columns, and Property variables.
Data Functions can be extended to other languages such as MATLAB and SAS, and can connect directly to other software such as TDS Team Studio, TDS Statistica, KNIME, and more. These concepts are outside the scope of this companion, but they are good to be aware of.
Available Data Functions
While you can always create your own data functions, this page is a guide to easy-to-use prebuilt functions from TIBCO's Data Science Team. These data functions are all built in a generalized format -- something you can quickly plug and play into your own Spotfire analyses. Many are hosted for free download on the TIBCO Exchange. Here's a 90-second helper video if you need it.
- Exploratory Data Analysis
  - Column Correlation [Python]
  - Distribution Testing **
  - Statistical Outlier Analysis **
- Data Preparation / Feature Engineering and Transformation
- Modeling and Prediction
  - Random Forest [R]
  - Random Forest [Python]
  - Isolation Forest [Python]
  - Basic Data Smoothing [R]
  - SVM Support Vector Machine [Python]
  - Logistic Regression [Python]
  - Local Outlier Factor [Python]
  - TensorFlow Regression and Classification [Python]
  - TensorFlow Autoencoder [Python]
  - K-Means Clustering [Python]
  - Holt Winters Forecasting [R]
  - GBM Gradient Boosted Machines Regression [R]
  - NLP Toolkit - Features, Entities, Sentiment [Python]
- Model Evaluation
  - Dunn Index [Python]
  - Gain & Lift Charts **
  - ROC & AUC Calculation **
  - Hyperparameter Search **
- Cloud Vendor Tools
- Geospatial Analysis
  - Spatial Heatmap [R]
  - Spatial Density Heatmap [R]
  - Contour Plot [R]
  - Points-in-Polygons (Geofencing) [R]
  - Voronoi Polygons [R]
  - Spatial Interpolation [R]
  - Spatial Join [R] **
  - Geocoder [R] **
  - Reverse Geocoder [R] **
  - Calculate Travel Route [R]
  - Trade Area from GPS Coordinate [R]
  - Trade Area from Address [R]
  - Driving Distance Matrix [R] **
  - Area of Polygons [R]
  - Draw Circular Radius [R]
  - Draw Bounding Box [R]
  - Draw Polygon from Points [R]
  - Convert Lines to Point Series [R] **
  - Transform CRS from Shapefile by EPSG Code [R]
  - Transform CRS from Shapefile by PROJ.4 String [R]
  - Transform CRS for Points by EPSG Code [R]
  - Transform CRS for Points by PROJ.4 String [R]
- Dataset Downloaders
** - Coming Soon!
>> Geospatial Analysis
Compute a "spatial generalization" in three dimensions to give a clear, aggregate view of your spatial data. The Spatial Heatmap uses x, y, z, and theta inputs, where x and y are your latitude and longitude coordinates and z is a variable of interest. Above, z represents the rental prices of Airbnb properties in Boston; theta lets you adjust the smoothing level of the underlying LOESS method. The contours and heatmap gradations emphasize the regions of highest-priced rentals.
Much like the Spatial Heatmap above, the Contour Plot data function "generalizes" spatial x and y coordinates for a given z-value, or variable of interest. Here the variable is production from fictional oil and gas wells in the Texas/Oklahoma region, showing the areas where wells are most and least productive. As with the Spatial Heatmap, LOESS is the underlying smoothing method.
Have a bunch of uncategorized point data? Do these points belong to certain regions or polygon geometries? Use this data function to automatically label each point with the region it falls within -- shown here with individual New York City crimes labeled and colored by zip code.
This works across two tables: 1) the point data and 2) the polygon data. For each point in a table of locations defined by latitude and longitude coordinates, the function identifies its enclosing polygon in a separate table. It returns an "identifier" column containing the enclosing polygon's identifier, which is appended to the point location table. Read the step-by-step instructions for more info.
Packages: sp, wkb
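To make the idea concrete, here is a minimal pure-Python sketch of the point-in-polygon labeling this data function performs. It uses a standard ray-casting test instead of the sp/wkb R packages the actual function relies on, and the polygon identifiers and coordinates are made up for illustration.

```python
def point_in_polygon(x, y, polygon):
    """Return True if (x, y) falls inside `polygon`, a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Count crossings of a horizontal ray cast to the right of the point;
        # an odd number of crossings means the point is inside.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def label_points(points, polygons):
    """For each point, return the identifier of its enclosing polygon (or None)."""
    labels = []
    for px, py in points:
        label = None
        for name, poly in polygons.items():
            if point_in_polygon(px, py, poly):
                label = name
                break
        labels.append(label)
    return labels

# Two square "zip code" polygons and three points (identifiers are hypothetical).
polygons = {"10001": [(0, 0), (1, 0), (1, 1), (0, 1)],
            "10002": [(2, 0), (3, 0), (3, 1), (2, 1)]}
points = [(0.5, 0.5), (2.5, 0.5), (5.0, 5.0)]
labels = label_points(points, polygons)
```

The returned `labels` list plays the role of the "identifier" column appended to the point table; the third point falls in no polygon and stays unlabeled.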
Here, a Density Map is computed and displayed for crimes in the New York City area. Density Maps provide a high-level summary for visualizing overall patterns in the density of spatial data, much like a two-dimensional histogram. Studying raw point data for patterns is otherwise difficult owing to uneven spatial coverage.
Density Maps start with the calculation of a smoothly varying surface representing the density of the data. This surface is then rendered as a colored heatmap with contours. Note that this differs from the Spatial Heatmap and Contour Plot data functions in that no z-value is used in the computation: the groupings shown by the Spatial Density calculation are based entirely on the concentration of the x- and y-coordinates alone.
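As a rough Python analogue (not the data function's actual R code), the first step resembles a two-dimensional histogram: count how many points fall into each cell of a grid. The random coordinates below are stand-ins for real longitude/latitude data; the real function additionally smooths these counts into a continuous surface.

```python
import numpy as np

# Synthetic point cloud standing in for spatial point data.
rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=1000)  # stand-in for longitude
y = rng.normal(loc=0.0, scale=1.0, size=1000)  # stand-in for latitude

# Bin the points on a 10x10 grid; denser regions get higher counts.
counts, x_edges, y_edges = np.histogram2d(x, y, bins=10)
```

A heatmap of `counts` already reveals where points concentrate; smoothing it yields the continuous density surface the data function draws.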
From input point locations, create Voronoi polygons for a Spotfire map visualization. The function can produce one Voronoi polygon per input point, or polygons that summarize multiple points. The resulting polygons form a tessellation that covers a study region.
The outline is determined either by a convex hull or by an optional separate boundary. Coordinates can be specified either as Cartesian (x, y) or geographic (Longitude, Latitude). The result is an object that can be dragged directly from the Spotfire data panel onto a map visualization.
Packages: ggvoronoi, sp, stats, wkb
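The defining rule of a Voronoi tessellation is simple: every location belongs to the cell of its nearest seed point. Here is an illustrative pure-Python sketch of that rule (the data function itself builds the actual polygons with the R packages listed above); the seed coordinates are made up.

```python
import math

def nearest_seed(point, seeds):
    """Index of the seed closest to `point` -- the rule that defines a Voronoi cell."""
    return min(range(len(seeds)),
               key=lambda i: math.dist(point, seeds[i]))

# Three hypothetical seed points.
seeds = [(0.0, 0.0), (4.0, 0.0), (2.0, 3.0)]

# Each query location lands in the cell of whichever seed is nearest.
cell_a = nearest_seed((0.5, 0.2), seeds)
cell_b = nearest_seed((3.8, -0.1), seeds)
cell_c = nearest_seed((2.0, 2.5), seeds)
```

The polygon boundaries the data function outputs are exactly the borders where two seeds tie for "nearest."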
>> Machine Learning
Random forests, or random decision forests, are an ensemble learning method for classification, regression, and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees.
Random forests can be used in many areas, such as modeling and predicting a binary response variable (offer acceptance, customer churn, financial fraud, or product/equipment failure), as well as explaining detected anomalies.
This data function includes the R/TERR code for the random forest model, missing-data imputation, and random over-sampling examples to resolve unbalanced class issues for binary response variables. It uses the CRAN randomForest package within the Spotfire interface and is focused on supervised classification with a binary response variable. Read more here.
Similar to the Random Forest data function for R/TERR, but written in Python for those who prefer it.
Packages: pandas, numpy, scikit-learn
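A minimal sketch of the same idea in scikit-learn (which this data function uses), on a toy binary dataset invented for illustration; the real data function wires the inputs and outputs to Spotfire tables and properties instead.

```python
from sklearn.ensemble import RandomForestClassifier

# Toy binary-classification data: one feature, class 1 when the value is large.
X = [[0.1], [0.2], [0.3], [0.4], [2.1], [2.2], [2.3], [2.4]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

# Each of the 25 trees votes; the forest predicts the majority class (the mode).
model = RandomForestClassifier(n_estimators=25, random_state=42)
model.fit(X, y)
preds = model.predict([[0.15], [2.25]])
```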
Support Vector Machines (SVMs) are among the most popular machine learning models and are particularly well suited to classification problems on small-to-medium-sized datasets. SVM classifiers attempt to find the "widest possible street" between classes while minimizing the number of margin violations (points between the dotted lines above).
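A hedged sketch of the concept with scikit-learn's `SVC` on made-up, linearly separable data; the data function's own inputs and parameters will differ.

```python
from sklearn.svm import SVC

# Linearly separable toy data in one dimension.
X = [[-2.0], [-1.5], [-1.0], [1.0], [1.5], [2.0]]
y = [0, 0, 0, 1, 1, 1]

# A linear-kernel SVM places the decision boundary midway between the closest
# points of each class -- the "widest possible street."
clf = SVC(kernel="linear")
clf.fit(X, y)
preds = clf.predict([[-1.8], [1.8]])
```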
This data function trains and executes a logistic regression machine learning model on a given input dataset. Logistic regression is one of the most basic classification models and is ordinarily used for binary classification.
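For illustration, a minimal scikit-learn version on an invented one-feature dataset (not the data function's actual code): the model outputs class probabilities via the logistic (sigmoid) function and thresholds them for the final binary prediction.

```python
from sklearn.linear_model import LogisticRegression

# Toy binary data: class 1 becomes likely as the feature grows.
X = [[0.5], [1.0], [1.5], [3.5], [4.0], [4.5]]
y = [0, 0, 0, 1, 1, 1]

clf = LogisticRegression()
clf.fit(X, y)

# predict_proba returns P(class 0) and P(class 1) for each row;
# predict applies the 0.5 threshold to give the class label.
probs = clf.predict_proba([[0.2], [5.0]])
preds = clf.predict([[0.2], [5.0]])
```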
This data function uses the unsupervised Local Outlier Factor method to perform anomaly detection on a given dataset.
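A small sketch of the method with scikit-learn's `LocalOutlierFactor` on made-up data (the data function's real inputs come from Spotfire tables): points in much sparser neighborhoods than their neighbors are flagged as outliers.

```python
from sklearn.neighbors import LocalOutlierFactor

# A tight cluster of normal points plus one far-away observation.
X = [[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [1.0, 0.9], [8.0, 8.0]]

# fit_predict returns 1 for inliers and -1 for detected outliers.
lof = LocalOutlierFactor(n_neighbors=3)
labels = lof.fit_predict(X)
```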
Gradient boosting is an ensemble decision-tree machine learning data function that is useful for identifying the variables that best predict an outcome and for building highly accurate predictive models. For example, a retailer might use a gradient boosting algorithm to determine customers' propensity to buy a product based on their buying histories.
This data function performs Principal Component Analysis (PCA) on a given numeric dataset. PCA is an unsupervised method often used for dimensionality reduction that reduces a dataset into one with a smaller set of variables that still captures the original dataset's information. Above, we can evaluate the reconstruction of images using varying numbers of principal components.
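To see the reduction in action, here is a minimal scikit-learn sketch on synthetic data (invented for illustration): two correlated features collapse to a single principal component with almost no information loss.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic 2-D data that varies almost entirely along one direction.
rng = np.random.default_rng(0)
t = rng.normal(size=100)
X = np.column_stack([t, t * 2.0 + rng.normal(scale=0.01, size=100)])

# Reduce the two columns to a single principal component.
pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X)
```

Because the two columns are nearly perfectly correlated, the single retained component explains almost all of the original variance.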
This Python data function uses K-Means clustering to group similar items into clusters. Common uses for K-Means include customer segmentation, data pre-processing, and text classification.
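A minimal scikit-learn sketch on toy 2-D data (made up for illustration, not the data function's actual code): K-Means assigns each point to the cluster with the nearest centroid.

```python
from sklearn.cluster import KMeans

# Two visually obvious groups of 2-D points.
X = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
     [8.0, 8.0], [8.2, 7.9], [7.9, 8.1]]

# Ask for two clusters; fit_predict returns each point's cluster label.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)
```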
This Python data function calculates the Dunn Index (DI), a metric for judging a clustering algorithm. A higher DI implies better clustering, meaning clusters are both compact and well separated from other clusters.
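A common formulation of the Dunn Index divides the smallest distance between points in different clusters by the largest distance within any single cluster; a pure-Python sketch of that definition on invented clusters (the data function's internals may differ):

```python
import math
from itertools import combinations

def dunn_index(clusters):
    """Min inter-cluster distance divided by max intra-cluster diameter."""
    # Smallest separation between any two points in different clusters.
    min_between = min(
        math.dist(p, q)
        for a, b in combinations(clusters, 2)
        for p in a for q in b)
    # Largest diameter of any single cluster.
    max_within = max(
        math.dist(p, q)
        for c in clusters
        for p, q in combinations(c, 2))
    return min_between / max_within

# Compact, well-separated clusters score high; loose, close clusters score low.
compact = [[(0.0, 0.0), (0.0, 1.0)], [(10.0, 0.0), (10.0, 1.0)]]
loose = [[(0.0, 0.0), (0.0, 4.0)], [(5.0, 0.0), (5.0, 4.0)]]
```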
>> Statistical Analysis
This Python data function calculates the correlation coefficients between columns of data. Correlation analysis is an important step in determining whether columns are highly correlated and, if so, whether the correlation is negative or positive. This can help uncover relationships in the data and aid data reduction by removing highly correlated columns in use cases such as data science model building.
Three methods are available in this data function: Pearson, Spearman, and Kendall. All return a correlation score between -1 and 1 for each pair of columns: -1 is the maximum negative correlation, 1 is the maximum positive correlation, and 0 means there is no correlation.
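The same three methods are exposed by pandas, which this data function builds on; a minimal sketch on made-up columns:

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],  # perfectly positively correlated with a
    "c": [5.0, 4.0, 3.0, 2.0, 1.0],   # perfectly negatively correlated with a
})

# DataFrame.corr accepts method="pearson", "spearman", or "kendall".
pearson = df.corr(method="pearson")
spearman = df.corr(method="spearman")
```

Each result is a square matrix of pairwise scores in [-1, 1], which is the shape of output this data function returns to Spotfire.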
This data function uses one-hot encoding to transform categorical columns into multiple numeric columns containing 0s and 1s, based on the presence of a column-value combination. This transformation is commonly applied before model creation in machine learning because models often require fully numeric datasets as input.
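For illustration, the transformation looks like this with pandas (a sketch with a made-up column, not the data function's actual code):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red"]})

# One new column per category; 1 marks the row's value, 0 everywhere else.
encoded = pd.get_dummies(df, columns=["color"], dtype=int)
```

The single categorical `color` column becomes numeric `color_blue` and `color_red` columns, ready for model input.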
>> Data Sources
The quintessential COVID-19 dashboard and data source. Johns Hopkins University gathers, stores, and distributes data on COVID-19 infections worldwide. Using an R/TERR data function, you can download this data for your own analyses. Visit the following link to get the DXP template with the data function included.