Tips and Tricks for Working with Python in Spotfire

Last updated:
6:37am Nov 26, 2021

Introduction

When you create an analysis using Spotfire®, you can enhance its capabilities through the use of data functions. Data functions are a method of calling languages such as Python, TERR®, open-source R, or even MATLAB. This allows Spotfire users to interactively call Python through the Spotfire client or through the web player to greatly enhance their analytics. This guide is a summary of tips and tricks to aid in the development of Python data functions, as well as how they can be used in your analytics to enhance the insights and data science in Spotfire.

To learn more about data functions in Spotfire, please visit the Spotfire Enablement Hub or the Dr Spotfire YouTube channel.

Looking for tips on using R/TERR® data functions? Check out these equivalent wiki articles: Tips for debugging TERR scripts and Top 10 tricks for TERR data functions.

Setup and Background

For Spotfire 7.13 to 10.6, install the Python Data Function Extension. Spotfire 10.7+ has native support for Python Data Functions.

1. Spotfire and Python Data Type Mapping

When data in Spotfire is passed to a Python data function, the Spotfire data types are converted to Python data types. The reverse also happens for when data is returned from Python to Spotfire. View the data type conversions table in the official Spotfire Python documentation.
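
For instance, a Table input arrives in Python as a pandas DataFrame, a Column as a pandas Series, and a Value as a plain Python scalar. A quick way to confirm this inside a data function is to print the types; the input names below (input_table, selected_column, threshold) are hypothetical placeholders:

## hypothetical inputs: 'input_table' (Table), 'selected_column' (Column) and 'threshold' (Real Value)
print(type(input_table))      # pandas.core.frame.DataFrame
print(type(selected_column))  # pandas.core.series.Series
print(type(threshold))        # float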

2. Other Setup Tips

For tips on how to use a different Python interpreter than the Spotfire default, how to use a different package repository, or how to install packages when behind a firewall or proxy, please read the official Python documentation found here.

Debugging

1. Debugging Python data functions using an external IDE

For developing Python data functions it is often very useful, especially as a script grows larger, to be able to run and debug it in popular IDEs such as PyCharm and Visual Studio Code. To do this for a Python data function, follow this process:

Add the following code to your data function after the library imports. This code will export a pickle file with all the inputs to the data function. Note the file location which you may want to alter:

import pickle
import inspect

# Note: place this code after your library imports
# It writes all of the data function's input parameters to a pickle file
def is_input_parameter(key, value):
    # skip Python internals (names such as __builtins__) and anything that is a
    # module, class or function; keep the plain values and pandas Series/DataFrames
    # that Spotfire passes in as input parameters
    if key.startswith('__'):
        return False
    if inspect.ismodule(value) or inspect.isclass(value) or inspect.isroutine(value):
        return False
    return True

input_param = {k: v for k, v in globals().items() if is_input_parameter(k, v)}

# You may want to alter the file path
with open("C:/Temp/data_function_inputs.pkl", 'wb') as file:
    pickle.dump(input_param, file)

 

Now run your data function once (even if it fails); this will save all the inputs sent from Spotfire to Python as a pickle file. Then copy your data function code from Spotfire into a Python file open in Visual Studio Code or PyCharm (or equivalent), and add the following code to the top of that script after the library imports:

import pickle

# load input parameters in Python IDE:
# remember to alter the file path if you altered this in the first script
input_param = pickle.load(open('C:/Temp/data_function_inputs.pkl', 'rb'))
globals().update(input_param)

 

After running this code in your IDE, all the objects that your Spotfire Python data function takes as inputs will be loaded into your Python instance. This means you can run your Python code using the same objects that Spotfire passes to Python, making it easy to develop and debug your script. Once your script is ready, you can copy and paste it back into your data function script in Spotfire.

2. Configure your external Python IDE to use Spotfire's Python interpreter

In IDEs such as Visual Studio Code, you can often choose which Python interpreter your code runs against. This means you can configure your IDE to point to Spotfire's own Python instance, so you can more closely replicate how Spotfire will run your data function while developing in an external IDE. It also means the available libraries will match those installed in your Spotfire Python instance. To get the path to Spotfire's own Python interpreter, go to the Tools menu in Spotfire and select 'Python Tools'. From there you will see the path to the Python interpreter Spotfire uses and a button to copy this path. You can then use this to point your external IDE to Spotfire's own Python instance:

Examples:

  1. Add python interpreter to Visual Studio Code
  2. Add python interpreter to PyCharm

3. Using print statements for debugging

The print statement is one of the most useful quick tools when it comes to tracking down issues and debugging your code. With print statements you can, for example:

  • Check the content of a variable, table, etc.
  • Check the data type of a variable.
  • Check the data frame shape (useful in machine learning applications).

Below is an example of how to perform the tasks above:
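
This sketch assumes the data function has an input table called input_df (a pandas DataFrame); substitute your own input names:

## Check the content of a table (first five rows)
print(input_df.head())

## Check the data type of each variable/column
print(input_df.dtypes)

## Check the data frame shape (rows, columns) - useful in ML applications
print(input_df.shape)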

To see the printed output, make sure data function debugging is enabled: go to Tools -> Options -> Data Functions and check 'Enable Data Function Debugging'.

4. Using Python classes for better code debugging and reusability

Classes are a big help when it comes to making our code reusable and maintainable, and they also make the debugging process easier. Imagine a use case where you have several CSV files, each containing pressure and temperature readings for a specific pump in a pump station. You want to create a Pump() object that has the following attributes/methods:

Attributes:

  • Pump name
  • Pump location (which is included in the csv file name)
  • Pump size (included in the csv file name)
  • Source file location

Methods:

  • Find the max pressure value
  • Find the max temp value
  • Find the correlation between pressure and temp
  • Find summary stats

import os
import pandas as pd
from scipy.stats import pearsonr

class Pump():
    def __init__(self, file):
        self.file = file
        # each method's result is stored as an attribute when the object is created
        self.location = self.location()
        self.size = self.size()
        self.name = self.name()
        self.data = self.data()
        self.max_press = self.max_press()
        self.max_temp = self.max_temp()
        self.corr = self.corr()
        self.sumstat = self.sumstat()

    def location(self):
        # location is the second part of the file name, e.g. pump01_<location>_<size>.csv
        return os.path.basename(self.file).split('_')[1]

    def size(self):
        # size is the third part of the file name, with the .csv extension removed
        return os.path.basename(self.file).split('_')[2].replace('.csv', '')

    def name(self):
        # pump name is the first part of the file name
        return os.path.basename(self.file).split('_')[0]

    def data(self):
        # read the pressure/temperature readings from the csv file
        return pd.read_csv(self.file)

    def max_press(self):
        return max(self.data['press'])

    def max_temp(self):
        return max(self.data['temp'])

    def corr(self):
        # Pearson correlation between pressure and temperature
        return pearsonr(self.data['press'], self.data['temp'])[0]

    def sumstat(self):
        return self.data.describe()


file = r'C:\Users\aalattar\Downloads\pump01_houston_130kpa.csv'
p = Pump(file)
print(p.file)
print(p.location)
print(p.size)
print(p.name)
print(p.data)
print(p.max_press)
print(p.max_temp)
print(p.corr)
print(p.sumstat)

 

Enhanced f(x) Flyout and End User Experience

Spotfire 11.0 introduced the f(x) flyout to facilitate access to analytics tools and data functions. It provides a seamless interface for registering and running library data functions.

1. Clear Display Names and Documentation

Proper documentation and instructions are crucial for others to understand and use data functions. The flyout displays the data function name, the data function description, the display names of each input and output parameter, and the descriptions of each input and output parameter. To facilitate their use, ensure that the display names are clear and concise and follow the terminology of the targeted industry or field. For example, while you might use ‘input_df’ in your code, it is better to display it as ‘Input Data’.

2. Optional Parameters

Some data functions, such as machine learning models, will have many input parameters so that the user has full flexibility to configure or tweak them. To avoid data functions that are cumbersome to configure, make only the most essential parameters required. It helps to ask: what are the foundational inputs needed from the user for the function to run and return meaningful results? You might not get the best model without testing various parameters, but you will have a starting point. For example, in this general TensorFlow regression and classification data function, only the input data, target column name, and task are required (these appear as 'not configured' in the flyout); the remaining parameters can be left blank or overridden.

To make parameters optional, you will need to mark them as not required and often provide some default. Each input parameter in a data function is designated as ‘Required’ or not. To make a parameter not required, leave the ‘Required Parameter’ box unchecked. 

There are standard code checks to see whether these optional input variables have been specified by the user and, if not, to provide a default value. All non-required parameters appear as global variables and are set to null if no input is given.

If the parameter is a Python primitive type (this covers most input parameters of type ‘Value'), the following code will take the user’s input if it is provided and otherwise continue with a default:

#save model to existing file path if provided otherwise save to a default file path
#file_path is a string Value
if not ("file_path" in globals() and file_path != None and len(file_path) > 0):
	file_path = 'best_model.h5'

 

If the input parameter is of type ‘Column' or type 'Table':

#check for optional Column
#if user does not provide a subset of columns to use, use all the columns in the input data
if not ("selected_columns" in globals() and isinstance(selected_columns, pd.Series)):
	selected_columns = pd.Series(list(input_data.columns))

#check for optional Table
#if user does not provide a new dataset for scoring, model is scored on training data
if not ("new_data" in globals() and isinstance(new_data, pd.DataFrame)):
	new_data = training_data

 

3. Design of Input and Output Parameters

Taking into account the design of input and output parameters can facilitate work for both Python data function developers and end users building dashboards. While there are a multitude of designs depending on the organization of your data and use case, here are some common scenarios.

If your features or columns of interest are few or fixed in number, or you are working with columns from different tables, consider passing each column as its own input parameter of type ‘Column’. If you have many columns of interest, or a variable number of them, pass them all in a single input of type ‘Data table’.

Also consider how the output parameters will be used in the data canvas and in subsequent visualizations. If model predictions are output as a column, they can be added as a calculated column to any data table. To plot model predictions versus observed values in a line chart, scatter plot, etc., it helps to have a long data table with the input data rows and model prediction rows concatenated and labelled (‘actuals’ versus ‘predictions’). This can be done either in the data canvas or in the data function (although the latter may duplicate data).
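
As a minimal sketch, assuming hypothetical inputs input_data (a table with an 'observed' column) and predictions (a table with a 'predicted' column), such a long table could be built inside the data function like this:

import pandas as pd

## label the observed values
actuals = input_data[['observed']].rename(columns={'observed': 'value'})
actuals['series'] = 'actuals'

## label the model predictions
preds = predictions[['predicted']].rename(columns={'predicted': 'value'})
preds['series'] = 'predictions'

## long output table: one row per observation per series, ready to colour by 'series'
output_long = pd.concat([actuals, preds], ignore_index=True)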

Advanced Uses

1. Using a Model Produced in A Previous Data Function

If you create an object such as a machine-learning model and want to use that object in a second data function, you can use the pickle package to store the object in a document property in Spotfire. That way, the model can be set as an input to other Python data functions. The following example demonstrates using the pickle package to manage this process. In the data function which creates the model object, the model needs to be pickled and then set as an output of the data function:

#From data function number 1
import pickle

modelObject = pickle.dumps(myObject)

 

In your data function configuration, set the modelObject to be an output of type Value:

The binary data (called modelObject in the above example) can be stored in a Spotfire document property when the data function parameters are configured in Spotfire, and then loaded from that document property to the second data function as an input parameter:

Then in the second Python data function script, this model can be easily restored:

#To data function number 2
import pickle

myObject = pickle.loads(modelObject)

 

This technique should work generically for almost any object that you might want to retain between data function calls, not just models.
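
As a concrete, hypothetical sketch, the first data function might fit a scikit-learn model on an input table called training_data (with a 'target' column), and the second might score a table called new_data with the restored model; all names here are illustrative:

## Data function 1: fit a model and pickle it
import pickle
from sklearn.linear_model import LinearRegression

model = LinearRegression().fit(training_data.drop(columns=['target']), training_data['target'])
## set modelObject as an output of the data function and store it in a document property
modelObject = pickle.dumps(model)

## Data function 2: restore the model from the document property input and score new data
import pickle
import pandas as pd

model = pickle.loads(modelObject)
predictions = pd.DataFrame({'prediction': model.predict(new_data)})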

2. Formatting with an ID Column

Adding an optional ID column helps relate each row of a data function's results table back to the observation it came from. This is necessary if the results table has a one-to-many relationship with the input, or if the function manipulates or needs to know the order of observations in its inputs. Here are two ways of checking for, and otherwise providing, a valid ID column as an input parameter (the validity check is that all values in the column are unique).

If the ID column (id_column) is a ‘String Value’ naming a column in the main data table (here called model_data):

#check if the ID column (id_column) is provided and names a valid column in the data table model_data, otherwise provide a default one
#default ID starts from 0 and increments with subsequent rows
if not ("id_column" in globals() and (id_column != None) and (id_column in model_data.columns) and model_data[id_column].nunique() == len(model_data)):
	id_column = "ID"
	model_data[id_column] = range(0, len(model_data))
assert(model_data[id_column].nunique() == len(model_data))
assert(id_column in model_data.columns)
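
If instead the ID column is passed as an input parameter of type ‘Column' (a pandas Series), a similar check can be sketched as follows; the names id_col and model_data are illustrative:

#check if the optional Column input (id_col) is provided and contains only unique values, otherwise generate a default ID
#default ID starts from 0 and increments with subsequent rows
if not ("id_col" in globals() and isinstance(id_col, pd.Series) and id_col.nunique() == len(model_data)):
	id_col = pd.Series(range(0, len(model_data)))
#add (or overwrite) an ID column in the data table with the validated values
model_data["ID"] = id_col.values
assert(model_data["ID"].nunique() == len(model_data))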

 

Common Issues and Errors

Below is a list of messages and errors that may be returned when you try to run your Python data function. Common issues are often related to the data passed to the data function, such as passing in an empty data table that the Python script is not written to handle. The other common issue is trying to pass an empty data frame back from Python, or including invalid types such as lists in a data frame column.

1. Cannot determine type / all values are missing

This message indicates that a column in your data frame is completely empty, so Spotfire cannot determine which type of column it is (e.g. String, Numeric, etc.). If Spotfire can't determine a type, an error is returned. To resolve this, debug your code to see why one of your columns is empty. There are two main ways to do this: using simple print statements, or debugging your code in an external IDE such as Visual Studio Code. Also see this Stack Overflow question and answer. For example, add a print of the data frame's head just before it is last assigned, to see what is in each column:

## if my output data frame in my Python data function is called output_df
print(output_df.head())

 

2. ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

This commonly occurs when you have a column in a data frame of an array or list type. For example:

## A list of lists - assuming we already have a data frame called df
df['MyColumn'] = [[1, 2], [1, 2], [1, 2], [1, 2], [1, 2], [1, 2]]

 

In this case, Spotfire can't convert a column where each item or row is a list rather than a single value. The solution is to debug the outputs of the data function using simple print statements, such as printing head() for your output data frame(s), or printing the dtypes for each column and checking that none of them contain lists or arrays:

## Check the first 5 rows to ensure no columns have lists or arrays
print(df.head())

## And/or check dtypes
print(df.dtypes)

See the guides above on using simple print statements or debugging your code in an external IDE such as Visual Studio Code to help diagnose the issue.
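
If you do find such a column, one simple fix (a sketch, assuming the list entries can sensibly be represented as text) is to convert each list into a delimited string before returning the data frame:

## Convert each list in 'MyColumn' to a comma-separated string so that
## Spotfire receives a plain String column instead of a list per row
df['MyColumn'] = df['MyColumn'].apply(lambda x: ', '.join(str(v) for v in x))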

3. Passing in empty data to Python: IndexError: single positional indexer is out-of-bounds 

Errors of this type are commonly due to an input data table being empty. This can easily happen if your data function input table reacts to a filter scheme or marking, and it is therefore possible to mark no data or to filter out all data. In this case your Python script may fail in different ways, so the error may not match this one exactly. It is therefore important to handle the situation where an empty data table is passed into your Python data function. A simple method is to check the length of your input data frame; if it is 0, you can raise an exception or return a dummy data frame for the output table (assuming your data function outputs a data table). For example:

## if our data function input data table is called input_df
## and our expected output data table is called output_df
import pandas as pd

## check we have some rows of data
if len(input_df) > 0:
    ## Run your code as expected as we have data
    output_df = input_df  # ...replace this with your actual processing...
else:
    ## An empty data frame has been passed in
    ## To prevent an error we still need to return output_df as expected,
    ## so we create it with a single dummy row of data
    output_df = pd.DataFrame({'Column A': ['No Data Passed In'], 'Column B': [0.0]})

 

In the above example, we return a dummy data frame with a single row if no data was passed to the Python data function. This single row is added (rather than returning a completely empty data frame) so that Spotfire knows what data type each output column should be. So, if the expected output from the Python data function is a data frame with two columns called Column A and Column B, where Column A is a String column and Column B is a Real column, then in our dummy data frame we add these two columns with a single string value and a single numeric value: 'No Data Passed In' and 0.0 respectively.