TIBCO Spotfire® Data Science: Tricks of the Trade
12:37pm Oct 11, 2018
General workflow hints
- Build a workflow from left to right. Using clean and well-positioned operators makes for a more understandable workflow.
- Every operator has a Note field in its Properties dialog box. If an operator has a note, the note is visible when the user hovers a mouse pointer over the operator. In addition, there is a Note operator. Good use of notes makes it easier to share flows between people.
- Local files can be uploaded into your data sources from the data tab of an open Spotfire® Data Science workflow.
- If you are formatting an HDFS file that has no column headers, you can upload the header information from a separate file.
- Operators have a default location for storing their output. From the operator's dialog box menu, you can change this location to save important results.
- Some operators are terminal (for example, Pivot), meaning that no other operators or datasets can be connected to their output.
- If you are using terminal operators, build a second flow from the output table, and then schedule a job to run the two flows successively.
- Large amounts of missing data can degrade your model. Use the Summary Stats operator to identify problems, and use the Null Value Replacement operator for imputation.
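To illustrate what imputation does conceptually, here is a minimal mean-imputation sketch in plain Python. This is not the Null Value Replacement operator itself, and the "age" column name is invented for the example:

```python
# Sketch of mean imputation: replace nulls in a column with the
# mean of the observed (non-null) values. The "age" column is a
# made-up example, not a product dataset.
rows = [{"age": 34}, {"age": None}, {"age": 28}]

# Compute the mean over the non-null values only.
observed = [r["age"] for r in rows if r["age"] is not None]
mean_age = sum(observed) / len(observed)

# Replace each null with the mean.
for r in rows:
    if r["age"] is None:
        r["age"] = mean_age

print(rows)  # the None is replaced with the mean, 31.0
```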
- If you find yourself setting and changing the same values in many different operators, consider using workflow variables so that you can change them all in one place.
- Step Run executes a subsection of a workflow. To select multiple operators, lasso a group of them.
- You can copy and paste operators, or groups of operators, within a workflow or from one workflow to another (in the same workspace).
- To delete a connection between two operators, you can highlight the connection, and then click the middle dot.
- Spotfire Data Science administrators have access to the Preferences menu, which includes options for decimal precision, datetime formats, the maximum number of distinct values, and so on. You can find this menu in an open workflow, from the Action menu.
- To maximize Spark performance, you can tune parameters based on the job and cluster characteristics. To tune global performance across all workflows, use data source connection parameters; to tune a single workflow or operator, use workflow variables.
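As a hedged sketch of what such overrides typically look like: the property names below are standard Spark configuration keys, but exactly which ones Spotfire Data Science exposes, and through which mechanism, is an assumption to verify against your data source and workflow variable documentation:

```python
# Common Spark tuning properties (real Spark config names; whether
# and how Spotfire Data Science exposes each one is an assumption --
# check the data source connection or workflow variable docs).
spark_overrides = {
    "spark.executor.memory": "4g",         # per-executor heap size
    "spark.executor.cores": "2",           # cores per executor
    "spark.sql.shuffle.partitions": "200", # shuffle parallelism
}

# Render as key=value pairs, the usual form for connection parameters.
for key, value in sorted(spark_overrides.items()):
    print(f"{key}={value}")
```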
- Use the Jobs scheduler to run workflows in sequence.
- From either the activity stream or the Jobs tab, you can view a nice summary and the results log for the most recent run and historical runs of a scheduled job.
- The Variable operator has a multi-variable option that applies the same expression to multiple columns.
- The Variable operator has a Quantile Variables option that adds binned columns to the results dataset.
- For database tables, the Aggregation operator lets the user apply window functions.
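A window function computes an aggregate over a partition of rows while keeping every input row, unlike a plain GROUP BY. A minimal SQL sketch, run here through Python's built-in sqlite3 (the table and column names are invented, not from the product):

```python
import sqlite3

# Invented example table: sales per region.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [("east", 10), ("east", 30), ("west", 5)])

# SUM(...) OVER (PARTITION BY ...) attaches the per-region total to
# every row, instead of collapsing each region to one row.
rows = con.execute("""
    SELECT region, amount,
           SUM(amount) OVER (PARTITION BY region) AS region_total
    FROM sales
""").fetchall()
print(rows)
```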
- When you select columns, the Search bar accepts a wildcard search.
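Wildcard searches generally follow glob-style rules, where `*` matches any run of characters. A quick sketch of those semantics using Python's fnmatch, purely to illustrate the pattern syntax (the exact matching rules in the Search bar may differ, and the column names are invented):

```python
from fnmatch import fnmatch

# Invented column names, to show glob-style wildcard matching:
# "sales_*" matches any column name starting with "sales_".
columns = ["sales_q1", "sales_q2", "cost_q1", "region"]
matches = [c for c in columns if fnmatch(c, "sales_*")]
print(matches)  # ['sales_q1', 'sales_q2']
```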
- In a column checkbox selector, you can drag the mouse to select multiple (successive) columns.
- Row Filter has a “script mode” to create more complicated filters using Pig or SQL.
- Pig/SQL UDFs can be used wherever users can create their own Pig and SQL scripts.
- Hadoop Join has both MapReduce and Pig options.
- When you edit Pig or SQL code, you can type ctrl-space to see suggestions or auto-complete commands.
- The documentation is posted at https://alpine.atlassian.net/wiki/display/V6/. Optionally, you can go directly to the documentation page for a specific operator by selecting the operator in a workflow, and then clicking the blue question mark in the lower left-hand corner of the browser.
- RESTful API documentation is available at http://<host>:<port>/api
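As a small sketch of how a client would address that endpoint with Python's standard library: the host and port below are placeholders for your own server, and the code only builds the URL rather than calling a specific documented endpoint:

```python
from urllib.request import urlopen  # stdlib HTTP client

# Placeholders -- substitute your Spotfire Data Science server's
# actual host and port.
host, port = "analytics.example.com", 8080
url = f"http://{host}:{port}/api"  # the documented API docs root

# A real request would look like this (not executed here, since it
# needs a live server):
# with urlopen(url) as resp:
#     print(resp.status, resp.read()[:200])
print(url)
```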
Technical Support and Data Science consultation
- Submit a support ticket at TIBCO Support.
- An administrator can find the option Download Logs under Help and Support.
- Get deep questions answered quickly by visiting our community site!