Using PySpark in a TIBCO Data Science Team Studio Notebook
PySpark is the Python API for Apache Spark, which runs applications in parallel on a distributed cluster. Such a cluster is one of the data sources that you can work with in Team Studio. Because PySpark code is ordinary Python, you can combine it with other common open source packages to speed up development, for example by using multiple nodes to experiment with different hyperparameters in a deep learning model.
The Jupyter Notebooks in Team Studio have a helper function that makes it easy to initialize PySpark on your cluster, after which you can read data from HDFS as a Spark DataFrame.
- Create and open a new Notebook under Work Files in your Team Studio Workspace.
- Click the Data menu.
- Select "Initialize Pyspark for Cluster...". This will automatically create a cell with the following code (the information will be specific to your cluster):
    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SQLContext, HiveContext
    import os

    # Modify the line below to change the number of executors and the amount of memory that they use
    # You can run Spark in local mode by changing 'yarn-client' to 'local', and setting the 'YARN_CONF_DIR' to ''
    os.environ['PYSPARK_SUBMIT_ARGS'] = "--master yarn-client --num-executors 1 --executor-memory 1g --packages com.databricks:spark-csv_2.10:1.5.0,com.databricks:spark-avro_2.11:3.0.1 pyspark-shell"
    os.environ['YARN_CONF_DIR'] = '/data/hdfs_configs/xx-xx.xx.xx.xx'
    cc.datasource_name = 'Your datasource name'

    # Each worker node in the cluster needs Python 2.7.
    # If this is not the default Python on the node, provide the Python path here
    # os.environ['PYSPARK_PYTHON'] = ''

    # Do not remove or modify the following line:
    # [[performPysparkInit(15)]]

    # This environment variable has the value 'workflow' when the notebook is being executed as part of an analytics workflow
    os.environ['NOTEBOOK_EXECUTION_ENVIRONMENT'] = 'notebook'

    # This will stop the SparkContext if there is one left over from a different notebook execution
    try:
        sc.stop()
    except NameError:
        pass

    APP_NAME = 'PySpark example.ipynb-my_team_studio_username'
    conf = SparkConf().setAppName(APP_NAME)
    sc = SparkContext(conf=conf)
    sqlContext = SQLContext(sc)
- Specify the path to your data and read it into a Spark DataFrame:
    df = sqlContext.read.csv("/your/file/path.csv", header=True)
You can then perform any operations on 'df' using PySpark.
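As an illustration of the kind of operations you might run first, the sketch below prints the schema and a few rows, then returns the most frequent values of one column. The function name `profile_dataframe` and its parameters are hypothetical and not part of Team Studio; the sketch assumes `df` is the Spark DataFrame created in the previous step and uses only standard DataFrame methods (`printSchema`, `show`, `groupBy`, `count`, `orderBy`, `limit`).

```python
def profile_dataframe(df, group_col, top_n=5):
    """Print the schema and a few rows of df, then return a DataFrame
    holding the top_n most frequent values of group_col.

    Hypothetical helper for illustration; `df` is assumed to be a
    Spark DataFrame such as the one read in the previous step.
    """
    if group_col not in df.columns:
        raise ValueError("Column %r not found in %r" % (group_col, df.columns))

    df.printSchema()   # column names and types
    df.show(top_n)     # first few rows

    # Count rows per distinct value and keep the most frequent ones.
    return (df.groupBy(group_col)
              .count()
              .orderBy("count", ascending=False)
              .limit(top_n))


# In the notebook, after `df` has been created ("your_column" is a placeholder):
# top_values = profile_dataframe(df, "your_column")
# top_values.show()
```

Because `header=True` is set without a schema, every column is read as a string unless you also ask Spark to infer or apply types, so numeric aggregations may first need a cast.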
If you do not have access to a Hadoop cluster, you can run your PySpark job in local mode. You can still use the helper function; just change the following two settings in the generated cell:
    os.environ['PYSPARK_SUBMIT_ARGS'] = '--master local pyspark-shell'
    os.environ['YARN_CONF_DIR'] = ''
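If you switch between cluster and local mode often, one option is to assemble the `PYSPARK_SUBMIT_ARGS` string with a small function instead of editing it by hand. The following is a minimal sketch: `build_submit_args` is a hypothetical helper, not something Team Studio provides, and it only reproduces the two argument strings shown in this article. It assumes that executor settings can be left out in local mode, as in the example above.

```python
import os


def build_submit_args(master="yarn-client", num_executors=1,
                      executor_memory="1g", packages=()):
    """Return a PYSPARK_SUBMIT_ARGS string for the given settings.

    Hypothetical helper for illustration only.
    """
    parts = ["--master", master]
    # Executor settings are only added when not running in local mode.
    if not master.startswith("local"):
        parts += ["--num-executors", str(num_executors),
                  "--executor-memory", executor_memory]
    if packages:
        parts += ["--packages", ",".join(packages)]
    parts.append("pyspark-shell")
    return " ".join(parts)


# Local mode: set both variables before the SparkContext is created.
os.environ["PYSPARK_SUBMIT_ARGS"] = build_submit_args(master="local")
os.environ["YARN_CONF_DIR"] = ""
```

The environment variables are read when the SparkContext starts, so set them before the line that creates `sc`, and leave the rest of the generated cell unchanged.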