# TIBCO Statistica® Data Scientist

Customers can purchase add-ons to TIBCO Statistica® Data Scientist for a metadata store, job server, versioning/approval, monitoring & alerting, live scoring, manual data entry & analytics, and interactive dashboards. If an add-on is purchased, TIBCO Statistica® Data Scientist includes a license server. The license server is also included if concurrent user licensing is purchased.

TIBCO Statistica® Data Scientist can also be purchased for a single user on a desktop. The product contains the following features:

- automation for data cleaning; dirty data is the most common analytics problem
- business rules builder
- exploratory analysis & visualizations; learn about the problem space
- descriptive statistics, nonparametrics; learn and share facts about the problem to build situational awareness
- linear regression models, nonlinear regression models; estimate the relationships among your variables and create predictive models (machine learning); also use simulated data to create linear regression models and learn something new
- multivariate exploratory techniques; organize data into meaningful clusters, classify variables (reduce/relate variables), principal components & classification analysis
- process analysis, quality control, multivariate statistical process control; understand critical process parameters which impact critical quality attributes
- design of experiments, power analysis and interval estimation; experiment and discover; also use simulated data to execute virtual experiments
- tabulation options; everyone needs a summary table for their presentation to management

There are two modes of interaction with the analytics: spreadsheet and workspace. For ad-hoc analysis that does not need to be repeated, users can import data into a spreadsheet and interact with menus, variables, and rows of data. The workspace is a visual analytic workflow management tool and is the recommended mode, because it allows work to be saved and reused. No coding is needed to complete a workspace. For users who need to manage their own code, the workspace has a "code node" that can execute C#, Python, or R code.

### Data Profiling, Cleaning, Transformation

The Data Health Check node (data profiling) explores values, value ranges, discrete text labels, missing data, outliers, etc. for every variable. The result of this analysis is a diagnostic report. The node can also be configured to automatically fix the data problems that the analysis uncovers.
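To make the kind of checks described above concrete, here is a minimal Python sketch of column profiling. It is not the Data Health Check node's implementation: the function name `profile_column`, the report keys, and the 1.5 × IQR outlier rule are all assumptions chosen for illustration.

```python
import statistics

def profile_column(values):
    """Summarize one column: missing count, value range, IQR-based outliers.

    Hypothetical helper for illustration; None marks a missing value.
    """
    present = [v for v in values if v is not None]
    report = {"n": len(values), "missing": len(values) - len(present)}
    numeric = [v for v in present if isinstance(v, (int, float))]
    if len(numeric) >= 2 and len(numeric) == len(present):
        # Numeric column: report the range and flag points outside 1.5 * IQR.
        q1, _, q3 = statistics.quantiles(numeric, n=4)
        spread = q3 - q1
        low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
        report["min"], report["max"] = min(numeric), max(numeric)
        report["outliers"] = [v for v in numeric if v < low or v > high]
    else:
        # Text (or mixed) column: list the distinct labels.
        report["labels"] = sorted({str(v) for v in present})
    return report

# Made-up sample data: one missing reading and one implausible reading.
temperature = [20.1, 19.8, None, 20.5, 21.0, 95.0, 20.3, 20.0, 20.7]
grade = ["A", "B", None, "A"]

temperature_report = profile_column(temperature)
grade_report = profile_column(grade)
```

Running the helper over every column of a data set yields a simple diagnostic report; an automated fix would then act on the `missing` and `outliers` entries.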

Additional options to transform and clean data are available: feature selection (select the best predictors), remove duplicates, recode, rank, merge, process invariant variables, recode outliers, missing data imputation, recode missing data, subset, sample, filtering (variable 1 = "X"), crosstable (also known as pivot), etc.

The Box-Cox transformation is available to transform variables so that their distribution is as close to normal as possible (Box and Cox, 1964). This makes the data better suited to methods, such as regression analysis, that assume normality.
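The Box-Cox transform maps a positive value x to (x^λ − 1)/λ, or to ln x when λ = 0, and λ is usually chosen to maximize a normal-theory log-likelihood. The standard-library Python sketch below illustrates that idea. It is not Statistica's implementation: the function names, the coarse λ grid from −2 to 2, and the sample data are assumptions made for this example.

```python
import math
import statistics

def box_cox(values, lam):
    """Apply the Box-Cox power transform for a fixed lambda (values must be > 0)."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if abs(lam) < 1e-12:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]

def log_likelihood(values, lam):
    """Profile log-likelihood of lambda, assuming the transformed data are normal."""
    n = len(values)
    variance = statistics.pvariance(box_cox(values, lam))
    return -0.5 * n * math.log(variance) + (lam - 1.0) * sum(math.log(v) for v in values)

def best_lambda(values, grid=None):
    """Pick the lambda on a coarse grid that maximizes the profile log-likelihood."""
    if grid is None:
        grid = [i / 10 for i in range(-20, 21)]  # assumed search range: -2.0 .. 2.0
    return max(grid, key=lambda lam: log_likelihood(values, lam))

# Made-up, strongly right-skewed sample.
skewed = [1.0, 1.2, 1.5, 2.0, 2.4, 3.1, 4.5, 7.0, 12.0, 30.0]
lam = best_lambda(skewed)
transformed = box_cox(skewed, lam)
```

A grid search keeps the sketch short; a numerical optimizer would give a finer estimate of λ. The transform is only defined for strictly positive data, so variables containing zeros or negative values need a shift first.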

### Workspace Features

A workspace is a no-code and low-code tool that:

- documents the analytic steps
- imports Excel, CSV, and fixed-width (mainframe) data
- embeds data within the workspace as a lookup table; for example, transforms "m" to "Monday" for readability
- imports Spotfire SBDF data files and configures analytics (see options below)
- retrieves data from a database through an ODBC driver and configures analytics (see options below)
- creates data mashup
- creates visualizations
- formats output for reporting
- exports results to Excel, CSV, Spotfire SBDF, etc.
- writes results into a database: SQL Server, Oracle, Teradata, SQL Server PDW, PostgreSQL, DB2
- calls another workspace

The workspace can also be extended with R, C# or Python coding.
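The lookup-table item in the list above (turning "m" into "Monday") is the kind of step that could also be written as a few lines of code. The Python sketch below is illustrative only and does not use any Statistica API: the `recode` helper, the row layout, and every code other than "m" are assumptions.

```python
# Hypothetical lookup table; only "m" -> "Monday" comes from the text above.
DAY_LOOKUP = {"m": "Monday", "t": "Tuesday", "w": "Wednesday"}

def recode(rows, column, lookup):
    """Return copies of the rows with `column` replaced by its readable label.

    Codes that are missing from the lookup table are left unchanged.
    """
    return [{**row, column: lookup.get(row[column], row[column])} for row in rows]

rows = [
    {"day": "m", "units": 12},
    {"day": "w", "units": 7},
    {"day": "x", "units": 3},  # unknown code, kept as-is
]
readable = recode(rows, "day", DAY_LOOKUP)
```

Leaving unknown codes untouched, rather than raising an error, keeps unexpected values visible in the output so they can be reviewed.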

### Visualizations

2D and 3D visualizations are available with the product: histogram, line, scatterplot, means with error, bag plots, quantile-quantile (beta, exponential, extreme, gamma, lognormal, normal, Rayleigh, Weibull), variability, contour, wafer, normal probability, etc. Interactive dashboards are available for the analytic user.

#### PI Connector

The PI Server is a real-time data storage and distribution engine that powers the PI System (http://www.osisoft.com). Statistica has a PI Connector to retrieve data from this system for analysis. In 2015, OSIsoft announced that the PI SDK was being deprecated and recommended that users start transitioning to its replacement, PI AF (Asset Framework). Statistica will continue to support the PI SDK until the vendor retires it. A PI AF connector is also available.

### Analytics

- ANOVA/MANOVA
- Association Rules
- Automated Neural Networks
- Boosted Tree
- Calculators; Distributions, Pearson Product Moment Correlation Coefficient, Six Sigma
- Canonical Analysis
- Classification Trees
- Cluster Analysis
- Correlation
- Correspondence Analysis
- Cox Proportional Hazards Models
- Data Miner Recipes
- Descriptive Statistics
- Design of Experiments (DOE)
- Discriminant Function Analysis
- Distribution Fitting
- Distributions & Simulation
- Dynamic Time Warping
- Extract, Transform, and Load (analytics are used to align time based data)
- Factor Analysis
- Faster Independent Component Analysis
- Feature Selection
- Fixed Nonlinear Regression
- General CHAID Models
- General Classification and Regression Trees (C&RT)
- General Discriminant Analysis (GDA)
- General Linear Models (GLM)
- General Partial Least Squares Models (PLS)
- General Regression Models (GRM)
- Generalized Additive Models (GAM)
- Generalized Linear/Nonlinear Models (GLZ)
- Generates Predictive Models in C, C++, C#, Java, PMML, SAS, SQL Stored Procedure in C#, SQL User Defined Function in C#, Statistica Visual Basic
- Goodness of Fit, Classification, Prediction
- Independent Component Analysis
- Interactive Tree (C&RT, CHAID)
- Lasso Regression
- Link Analysis
- Log-Linear Analysis of Frequency Tables
- Machine Learning (Bayesian, Support Vectors, K-Nearest)
- Multidimensional Scaling (MDS)
- Multivariate Adaptive Regression Splines (MARSplines)
- Multiple Regression
- Nonlinear Estimation
- Nonparametric Statistics
- Power Analysis and Interval Estimation
- Multivariate Statistical Process Control (MSPC - PCA / PLS)
- Optimal Binning
- Predictor Screening
- Principal Components & Classification Analysis (PCCA)
- Process Analysis
- Process Optimization
- Quality Control Charts
- Random Forests
- Rapid Deployment of Predictive Models (PMML)
- Reliability and Item Analysis
- Sequence and Link Analysis
- Stability and Shelf Life Analysis (regulated by FDA)
- Stepwise Model Builder (what-if)
- Structural Equation Modeling and Path Analysis (SEPATH)
- Survival & Failure Time Analysis
- Time series / forecasting
- t-tests and other tests of group differences
- Tabulate
- Text Mining
- Variance Components & Mixed Model ANOVA/ANCOVA
- Weight of Evidence