Configure SparkR to use TIBCO Enterprise Runtime for R

Last updated: 3:38pm, April 21, 2020

The SparkR package is an open-source R package that provides a front end for using the Apache™ Spark system for distributed computation. SparkR allows using R to invoke Spark jobs, which can then call R to perform computations on distributed worker nodes.

You can configure the SparkR package to use the TERR engine rather than the open-source R engine.

This technical note includes instructions for configuring version 2.4.0 of Spark and the SparkR package to work with TERR. Check your version of Spark before proceeding. If you are using an older version of Spark, we recommend upgrading. If upgrading is not an option at this time, contact TIBCO Technical Support for information about configuring TERR to work with Spark versions 1.3, 1.4, or 1.5.

This technical note is directed at users who have a working knowledge of open-source R, Spark, and the SparkR package. To use TERR with SparkR, you must be able to perform the tasks described in this technical note.

These instructions include the following:

  • Downloading and installing the required software.
  • Testing TERR with SparkR.
  • Troubleshooting your configuration.

Downloading and installing the required software

Before you use TERR with SparkR, you must download and install Hadoop and Spark.

Perform this task on a 64-bit computer that meets the requirements for running TERR.

Procedure

  1. Install Hadoop (with YARN).
    1. Browse to hadoop.apache.org.
    2. Follow the instructions to install Hadoop with YARN.
  2. Install Spark version 2.4.0 or later.
    1. Browse to spark.apache.org.
    2. Follow the instructions to install Spark.
  3. Install TERR and link it to the SparkR package.
    1. Browse to https://edelivery.tibco.com/storefront/eval/tibco-enterprise-runtime-for....
    2. Follow the instructions to install TERR.
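After installing, the executables for each package must be findable from your shell. The directories below (/home/hadoop, /home/spark, /home/TERR) are examples only, chosen to match the paths used later in this note; substitute the locations where you actually unpacked each package. A minimal sketch:

```shell
# Hypothetical install locations -- substitute the paths where you
# actually unpacked Hadoop, Spark, and TERR.
export HADOOP_HOME=/home/hadoop
export SPARK_HOME=/home/spark
export TERR_HOME=/home/TERR

# Put the Hadoop, Spark, and TERR executables on the PATH.
export PATH="$SPARK_HOME/bin:$HADOOP_HOME/bin:$TERR_HOME/bin:$PATH"
```

With these exports in place, the spark-submit, hadoop, and TERR commands can be run without full paths.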

Testing TIBCO Enterprise Runtime for R with SparkR

After you have installed SparkR and TERR, you can run a simple test to make sure the configuration works as expected.

Perform this task using a code editor, such as RStudio®, on a computer that meets the prerequisites.

Spark 1.5 and later includes SparkR, along with a contribution by TIBCO to createRProcess. This contribution enables using TERR with SparkR without modifying the source or rebuilding the package. (Previous versions of Spark did not include this contribution and required additional modification and configuration. If you are using an older version of Spark, we encourage you to upgrade to version 2.4.0.)

Prerequisite

You must have completed the steps for downloading and installing the required components, above.

Procedure

  1. Start TERR. (The test script below loads SparkR into TERR and starts a Spark session.)
  2. Run the following test of TERR with SparkR.

## run TERR, load SparkR, run test
/home/TERR/bin/TERR --no-restore --no-save

library(SparkR)

ss <- sparkR.session(master="local", spark.ui.showConsoleProgress="false", spark.sparkr.use.daemon="false", spark.sparkr.r.command="/home/TERR/bin/TERRscript")

xx <- data.frame(x=base::sample(LETTERS[1:5], 10000, replace=TRUE), y=rnorm(10000))

df <- as.DataFrame(xx)

collect(summarize(groupBy(df, df$x), count = n(df$x), mean=mean(df$y)))

## test: should print a data.frame with 5 rows and 3 columns

  • The parameter spark.ui.showConsoleProgress="false" is not required for using TERR, but it is useful: it turns off the progress bar printed to the console during Spark operations.
  • We specify spark.sparkr.use.daemon="false" so that SparkR does not create a daemon R process to spawn R engines. The SparkR code for this daemon uses several functions that are not currently implemented in TERR (for example, parallel:::mcfork, parallel:::mcexit, tools::pskill, socketSelect). (Eventually, we want to support these in TERR.)
  • The parameter spark.sparkr.r.command specifies the command to be used, in place of Rscript, when invoking the engine from worker nodes. Here, the path is given to the TERRscript command.
  • If the configuration is correct, the console output shows the TERR banner at startup, and the test prints a data.frame with 5 rows and 3 columns.
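Rather than passing these settings to sparkR.session() on every call, they can also be set once in Spark's configuration file, $SPARK_HOME/conf/spark-defaults.conf. A sketch of the equivalent entries, assuming the same property names used in the session call above and the same TERR installation path:

```
spark.ui.showConsoleProgress  false
spark.sparkr.use.daemon       false
spark.sparkr.r.command        /home/TERR/bin/TERRscript
```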

Troubleshooting your SparkR configuration

If you experience problems using SparkR with TERR, try testing SparkR with open-source R.

Open-source R is available under separate open source software license terms and is not part of TERR. As such, open-source R is not within the scope of your license for TERR. Open-source R is not supported, maintained, or warranted in any way by TIBCO Software Inc. Download and use of open-source R is solely at your own discretion and subject to the free open source license terms applicable to open-source R.

Prerequisites

You must have installed open-source R.

Procedure

  1. Make the SparkR package available to open-source R.
  2. Start open-source R.
  3. In the open-source R console, load the SparkR package.
  4. Run a test script and evaluate the results.
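For step 1, note that SparkR ships with the Spark distribution itself, under R/lib in the Spark 2.x layout. One way to make it visible to open-source R is to add that directory to R's library search path before starting R. A minimal sketch, assuming Spark is installed under /home/spark; adjust the path to your installation:

```shell
# SparkR is bundled with Spark under $SPARK_HOME/R/lib (Spark 2.x layout).
export SPARK_HOME=/home/spark

# Prepend it to R's library search path so library(SparkR) can find it.
export R_LIBS="$SPARK_HOME/R/lib${R_LIBS:+:$R_LIBS}"
```

After this, starting R and calling library(SparkR) should load the package from the Spark distribution.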

Example: Testing open-source R with SparkR

## run R, load sparkR, run test
/home/R/bin/R --no-restore --no-save
library(SparkR)
ss <- sparkR.session(master="local")
xx <- data.frame(x=base::sample(LETTERS[1:5], 10000, replace=TRUE), y=rnorm(10000))
df <- as.DataFrame(xx)
collect(summarize(groupBy(df, df$x), count = n(df$x), mean=mean(df$y)))
## test: should print a data.frame with 5 rows and 3 columns