What is your recommendation for when to load data up front and when to use on demand?
Spotfire® can extract data from a data source up front or load slices of data on demand based on your interactions with the Spotfire clients. Loading "static" data up front has many benefits: you have all the data at hand during your analysis session, with no loading times while analyzing your data, and you can analyze your data in the Spotfire data engine, independently of the data source, and choose when to refresh data. Data on demand, on the other hand, is essentially a big data feature: it enables more data to be analyzed by providing slices of data from the data source based on your interactions with the analysis. Data on demand is frequently used for importing data into Spotfire's in-memory data engine but is also available for in-database data tables. On demand in combination with in-database reduces the number of rows processed by in-database queries, enabling analytics across larger data sets and speeding up the rendering of visualizations. Below is a table with some of the pros and cons of each method.
| Scenario | Loading data up front/static | Loading data on demand |
| --- | --- | --- |
| Data size | The size of data that can be analyzed depends on how much memory the client machine has and how long it's feasible to wait for data to load. For larger data sizes, data loading is often scheduled to be performed during nighttime. If this option is feasible, it's the recommended data loading option. | Enables more data to be analyzed by only providing the data needed at a given time. Works both in-memory and in-database. |
| Loading times | All loading time happens up front, before any analytics can be done. | Data from the previous session is loaded immediately, as it's cached in the analysis file. New data is loaded during the analytic session as needed, which means there are occasional waiting times during an analytic session. |
We are replacing our SQL databases with Hadoop. What considerations are there from a Spotfire perspective when moving from SQL to Hadoop?
Just like with SQL databases, Spotfire connects to Hadoop's SQL interfaces. It's important to select a Hadoop query engine carefully to avoid surprises later on. Hive, Impala, Spark SQL and Drill all offer different aggregation functions. Below are a few things to consider.
Spotfire lets you analyze data from data tables. These data tables can be copied into Spotfire's in-memory engine or be kept in the external data source. The latter data table type, in-database data tables, are only represented by a metadata view in Spotfire.
| In-memory | Live queries/in-database | In-memory + live queries |
| --- | --- | --- |
| Queries are executed in Spotfire's in-memory data engine, which provides extremely fast calculations on any data set and enables all of Spotfire's features. On-demand data tables are supported. | Queries are executed in the external data source's data engine. Expect to be able to work with the largest amounts of data, but also longer waiting times for visualizations and filters to refresh compared to in-memory data tables. On-demand data tables are supported. | The key to success is often to combine data loading methods in the same analysis. For example, load as much as possible into memory up front. Then use live queries for the MASTER view of big data aggregates. When users drill down into the DETAILS, load row-level transactions into memory using on demand. |
With higher query speed, more data and more complex calculations become feasible to analyze using in-database data tables and live queries. Higher query speeds also mean that (scheduled) data extracts take less time to complete.
| | SQL databases | Hadoop |
| --- | --- | --- |
| Query speed | Analytical databases like Teradata, HPE Vertica, IBM Netezza, SAP HANA and IBM DB2 BLU are in general very fast. This means they are often capable of serving many concurrent Spotfire users and analysis files using in-database data tables and their live queries. | Query speeds are often not on par with a SQL database but, on the other hand, scale across virtually unlimited amounts of data. Hive on MapReduce is not recommended for in-database and interactive live querying. Expect better performance from the SQL engines Hive on Tez with LLAP, Impala, and Spark SQL. There are also databases tightly integrated with Hadoop, for example Apache HAWQ and IBM BigSQL; these are represented as a Greenplum and an IBM DB2 database to Spotfire, with similar capabilities. In addition, there are Microsoft SQL Server Analysis Services compatible cubes available on Hadoop, for example Apache Kylin, Kyvos and AtScale. The indexing engine Jethro also provides a very fast SQL query interface for Hadoop, and Facebook's database Presto is also worth mentioning. |
| Function support | Wide, often including functions like binning. | The number of supported functions is in general not on par with most SQL databases but is growing over time. Data connectors hold a mapping of Spotfire functions to functions in the connector's data source. If you believe a certain function should be mapped but isn't, please share your idea with us on Spotfire's Idea portal here. |
For more details on function support please see the Spotfire web help here, browse to Connectors.
We use Spotfire's Unpivot data transformation a lot. Is it possible to do the Unpivot before end users load the data by pushing the Unpivot operation down to the database by configuring it as part of an Information Link or a Data Connection?
Spotfire's Unpivot data transformation is available when adding a data source, for example an Information Link, to your analysis. It's also possible to add it later as a separate step. To perform this operation in the data source, custom SQL needs to be used, either as part of an Information Link or as part of a Custom Query in a Data Connection.
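As a sketch of what such custom SQL could look like, here is a minimal unpivot expressed as a `UNION ALL` query, run against an in-memory SQLite database for illustration. The table and column names are made up for the example; in Spotfire the SQL itself would go into an Information Link's custom SQL or a connector's Custom Query.

```python
import sqlite3

# Hypothetical wide table: one row per site, one column per month's sales.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (site TEXT, jan REAL, feb REAL, mar REAL)")
con.execute("INSERT INTO sales VALUES ('Stockholm', 10, 20, 30), ('Oslo', 5, 15, 25)")

# Unpivot in the database: one SELECT per value column, stacked with UNION ALL,
# so the wide table becomes a narrow (site, month, value) table before load.
unpivot_sql = """
SELECT site, 'jan' AS month, jan AS value FROM sales
UNION ALL
SELECT site, 'feb', feb FROM sales
UNION ALL
SELECT site, 'mar', mar FROM sales
ORDER BY site, month
"""
rows = con.execute(unpivot_sql).fetchall()
for r in rows:
    print(r)
```

The exact syntax varies by database (some offer a dedicated `UNPIVOT` operator), but the `UNION ALL` pattern works almost everywhere.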
What options are available when doing a UNION/inserting rows in Spotfire?
With Spotfire 7.8, the recommendation engine helps users add data as rows when applicable and match columns automatically. Additional columns can be added to the final data table, and Spotfire keeps track of new and existing origin columns. If it would be valuable to add rows using INTERSECT or MINUS, please let us know in Spotfire's Idea portal. Custom Queries or SQL can also be used; please see the section about Unpivot above.
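For reference, here is a minimal sketch of the three set operations in plain SQL, run against SQLite for illustration (SQLite spells MINUS as `EXCEPT`, as do several other databases). The table and column names are invented for the example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE q1 (customer TEXT);
CREATE TABLE q2 (customer TEXT);
INSERT INTO q1 VALUES ('Alice'), ('Bob');
INSERT INTO q2 VALUES ('Bob'), ('Carol');
""")

# UNION stacks the rows of both queries and removes duplicates.
union_rows = con.execute(
    "SELECT customer FROM q1 UNION SELECT customer FROM q2 ORDER BY 1").fetchall()
# INTERSECT keeps only rows present in both queries.
intersect_rows = con.execute(
    "SELECT customer FROM q1 INTERSECT SELECT customer FROM q2").fetchall()
# EXCEPT (MINUS) keeps rows in the first query that are absent from the second.
minus_rows = con.execute(
    "SELECT customer FROM q1 EXCEPT SELECT customer FROM q2").fetchall()

print(union_rows)      # [('Alice',), ('Bob',), ('Carol',)]
print(intersect_rows)  # [('Bob',)]
print(minus_rows)      # [('Alice',)]
```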
Relations in Spotfire seem to be based on INNER JOINS. Is there a way to change JOIN type for Relations?
Relations are used for brush linking (marking across visualizations) and for filtering across data represented as separate data tables, using for example the filter panel. However, because Relations behave like inner joins, they often exclude records, and many times "empty" records need to be added just to ensure that the same records in your main table are present in any related tables.
To eliminate this issue, data tables can be defined (in-memory or in-database) using connectors. By selecting tables and using custom or existing relations in the data source, Spotfire views all tables as one virtually joined data table. This eliminates the need for Spotfire Relations, as the table relations have already been defined in the data source/data connection. End users can then freely analyze the dimensions of one virtually joined table.
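The difference between the two join types is easy to see in plain SQL. This sketch, run against SQLite with invented table names, shows why an inner join drops records while a left outer join (the kind of relation you can define in the data source) keeps every row of the main table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE orders (order_id INTEGER, customer_id INTEGER);
CREATE TABLE customers (customer_id INTEGER, name TEXT);
INSERT INTO orders VALUES (1, 10), (2, 11), (3, 99);  -- 99 has no customer row
INSERT INTO customers VALUES (10, 'Alice'), (11, 'Bob');
""")

# Inner join: order 3 disappears because it has no matching customer.
inner = con.execute("""
    SELECT o.order_id, c.name FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id
""").fetchall()

# Left outer join: every order survives; missing customers come back as NULL.
left = con.execute("""
    SELECT o.order_id, c.name FROM orders o
    LEFT JOIN customers c ON o.customer_id = c.customer_id
""").fetchall()

print(len(inner))  # 2 rows: order 3 is dropped
print(len(left))   # 3 rows: order 3 kept with a NULL name
```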
How can I include or exclude data (rows) from a data set before loading it into Spotfire? For example based on values in a true or false column?
When connecting to a data table, defining inclusion/exclusion criteria in the data connection saves loading time, memory, and the time it takes to set filters. Spotfire provides a few different ways of doing this.
Prompts
Information Links and Data Connectors both support prompts. These make it easy to create selection dialogs from which business users can select unique values.
Parameterized Custom Queries
Data connectors support parameterized Custom Queries that let you pass values based on interactivity in your analysis into your SQL query. This can be used to limit data based on, for example, document properties or Spotfire's DOMAIN, GROUP and USER variables. The latter is useful for filtering out personal data from a generic data set.
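As a minimal sketch of the idea, the query below filters rows on a user value bound as a parameter. The table and column names are invented, and the current user is hard-coded here; in Spotfire the parameter would instead be bound to a document property or the USER variable.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE expenses (username TEXT, amount REAL)")
con.executemany("INSERT INTO expenses VALUES (?, ?)",
                [("alice", 100.0), ("alice", 50.0), ("bob", 75.0)])

# The placeholder stands in for a Custom Query parameter; only the current
# user's rows are returned by the database, so the rest never leave the source.
current_user = "alice"
rows = con.execute(
    "SELECT username, amount FROM expenses WHERE username = ?",
    (current_user,),
).fetchall()
print(rows)  # only alice's rows reach the analysis
```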
On-demand
On-demand is a popular way to limit data both before initial loading and during an analytic session, based on user interactions with marking, filtering, and actions causing document properties to change.
What's your recommendation for implementing Row Level Security in combination with very large embedded data?
Because of the time it takes to load an analysis file with gigabytes of embedded data, scheduled updates are used. Scheduled updates load the analysis file into the Web Player's memory so that it's ready for business users when they, for example, come into the office in the morning. However, since data is loaded by the scheduled updates user, which is a service account, all data is loaded, while in many cases users are only allowed to see their personal slice of the complete data set.
To combine preloading of large embedded data sets using scheduled updates with personal data, a lookup table and Spotfire user identities are used in combination with the complete data set. This is referred to as Personalized Information Links, which is described in more detail here.
The idea is that a join is much quicker to perform than data loading. When a user logs in and opens the analysis file, the first thing that happens is that the complete data table is joined with the user's rows in the lookup table. The result is a much smaller data set, containing only the rows that are left after the join with the lookup table: only personal data.
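The lookup-table join can be sketched in plain SQL. This example uses SQLite and invented table names (`facts` for the complete data set, `user_regions` as the lookup table mapping users to the slices they may see):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE facts (region TEXT, revenue REAL);
CREATE TABLE user_regions (username TEXT, region TEXT);  -- the lookup table
INSERT INTO facts VALUES ('EMEA', 100), ('AMER', 200), ('APAC', 300);
INSERT INTO user_regions VALUES ('alice', 'EMEA'), ('alice', 'APAC'), ('bob', 'AMER');
""")

# Join the complete data set against the current user's rows in the lookup
# table; only that user's personal slice remains after the join.
def personal_slice(user):
    return con.execute("""
        SELECT f.region, f.revenue
        FROM facts f
        JOIN user_regions u ON f.region = u.region
        WHERE u.username = ?
        ORDER BY f.region
    """, (user,)).fetchall()

print(personal_slice("alice"))  # [('APAC', 300.0), ('EMEA', 100.0)]
print(personal_slice("bob"))    # [('AMER', 200.0)]
```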
Note: This is a solution for the thousands of consumers using the Spotfire Consumer client on the Web Player. Users with authoring licenses using the Spotfire Analyst or Business Author clients can edit the analysis file's configuration and gain access to the full data set.
What options are available for write back of information into a data source from Spotfire?
Even though Spotfire by default is a read-only platform, it's very common to configure Spotfire to write back information into data sources. A common use case is to tag rows with comments or tags that group rows together. This is very useful, for example, when multiple teams perform different parts of an analysis. The first team can then tag the rows of interest, which makes it easy for the next team to know where to start the continued analysis.
For more information about different ways of implementing write back, please see the Spotfire community here.
What's the recommended way of connecting to star schemas?
Even though Information Links can be used to connect to star schemas it's much easier to connect using data connectors. The table below summarizes some of the differences.
| Capability | Information Links | Data Connectors |
| --- | --- | --- |
| Relations | Relations are (re)created manually in Information Designer. | Users configuring data connections can load relations from the data source with a couple of mouse clicks. The relations automatically become part of the (in-database) view to be analyzed. If data is imported into memory, the selected relations define a join that is performed in the data source, and the resulting data table is then loaded into memory. Since the result is only one data table, visual analytics becomes very easy, with no need for additional add columns operations or relations. |
| Drivers | JDBC drivers only. Drivers are needed on the Spotfire Server only. | No drivers, ODBC drivers, or ADO.NET drivers, depending on the connector used. Drivers are needed on the Node Manager/Web Player server and on the Windows clients running Spotfire Analyst. |
What can I do to improve Spotfire’s data loading speed?
First of all, do you need to extract all that data? Spotfire works with both live queries against external data/data engines and data extracts imported into Spotfire's in-memory engine. By combining these data loading types and visualizations in the same analysis file you can often reduce the number of rows you actually need to extract into Spotfire.
You are probably aware of the general factors impacting your data loading speed. Examples are hardware resources, database/cluster size, query engine performance, query complexity, network bandwidth, the number of concurrent users and the importance of using the latest drivers recommended by the database vendors.
But there are capabilities and settings in the Spotfire platform that are good to be aware of. These impact the time it takes for data to be loaded and ready for visual data discovery. Some of these capabilities and settings are highlighted below. They are divided into loading data using live queries and in-database data tables and loading data as extracts into Spotfire’s in-memory data engine.
Big data analytics using live queries and in-database data tables is directly dependent on how fast the SQL or MDX query engine can process Spotfire's queries.
Pages in analysis files often use multiple visualizations per page, filters on multiple columns, and brush linking for joint marking across visualizations, all of which generate queries against the data source.
If the use case is a dashboard, user interaction with the analysis file is often lower, with modest load on the data source's query engine. In this case, Spotfire's live query cache is extremely useful, as a dashboard tends to reuse the same queries more often, with no need to push queries down to the data source. Users are simply using the visualizations more or less as they were configured by the author of the analysis file.
If data in the data source is not updated live or during the use of the analysis file, or if it's not part of the use case to analyze live data, it's recommended to increase the default time of the live query cache to 60-120 minutes. This ensures that even sporadic interaction with the dashboard, and the resulting queries, will make use of the live query cache.
Spotfire's in-memory data engine ingests data as soon as it’s delivered by database drivers, the data source APIs or the file reader. There are however capabilities and settings to consider.
Data transformations that are part of the data loading step (part of the data source) or applied as a following data transformation step, take time to process. Reviewing data transformations might speed up the data loading step.
When loading star schemas, consider whether joining the tables you need in the data source, or ingesting them into Spotfire's data engine as separate tables and then working with Spotfire's relations, would be most effective. Connectors by default flatten/join star schemas in the data source before load, which makes connectors easy to work with but can be time consuming and can increase the size of the table that is loaded.
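As a sketch of what flattening means, assuming a tiny star schema with one fact table and one dimension table (names invented for the example), the connector-style flatten is simply a join performed in the data source, and the single resulting table is what gets loaded:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE fact_sales (product_id INTEGER, amount REAL);
CREATE TABLE dim_product (product_id INTEGER, product_name TEXT);
INSERT INTO fact_sales VALUES (1, 100), (1, 50), (2, 200);
INSERT INTO dim_product VALUES (1, 'Widget'), (2, 'Gadget');
""")

# Flatten the star schema in the data source: the dimension attributes are
# joined onto every fact row, producing one (wider) table to load.
flat = con.execute("""
    SELECT d.product_name, f.amount
    FROM fact_sales f
    JOIN dim_product d ON f.product_id = d.product_id
    ORDER BY d.product_name, f.amount
""").fetchall()
print(flat)
```

Note how each dimension value is repeated for every matching fact row; with many dimensions this repetition is what can make the flattened table larger and slower to load than the separate tables.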
Narrow data sets in general load more quickly and are more suitable for analytics. Wide data sets (thousands of columns) in general take more time to load. In addition, they often require a potentially time-consuming Unpivot data transformation to transform them into a narrow data set.
Is Spotfire limited to 10,000 rows when using live queries / in-database data tables?
No, this is a safety-belt feature only in the unaggregated table visualization, and it can easily be overridden.
Starting with Spotfire 7.8, administrators can configure the limit for how many rows users are allowed to load into a table plot using live queries.
Before 7.8, if more than 10,000 rows need to be listed in an analysis file, an additional in-memory data table is used to drive the table plot. Data is then loaded into the table plot on demand, driven by, for example, drill-down marking in an adjacent aggregated bar chart.
The reason for this safety-belt setting is that live queries are often used against billions of rows of data. With no limitation, it would be too easy for users to start loading huge numbers of rows into their analysis files' table plots by mistake, quickly filling up client and server memory.
How many rows can Spotfire extract into data tables?
In-memory data tables: this depends on how much RAM you have available, but the limit is often between 200M and 500M rows, depending on the number of columns.
In-database data tables: the number of rows and columns is limited to what is needed to draw visualizations and filters. The data tables hold only aggregated tables of data; the rows are kept in the data source.
Besides saving data tables as CSV files, which have limitations in file size, can users save big data tables in Spotfire file formats for later use?
If you need to save data to file then use the Spotfire Binary Data File format (SBDF). For more information see Working with large data volumes.
Can we save extracted data into databases like Amazon Redshift?
Spotfire is by default a read only application but with some scripting you can enable write back from analysis files to databases. Here are a few resources:
How many columns does Spotfire in-memory data tables support?
This depends on how much RAM you have available but typically multiple thousands of columns.
Does Spotfire support scripts, SQL commands and automation?
Yes, for more information please see for example these resources:
How can I improve Apache Hive query performance?
We recommend that you evaluate LLAP to speed up query performance. From Hortonworks: "Hive LLAP (Low Latency Analytical Processing) is Hive’s new architecture that delivers MPP performance at Hadoop scale through a combination of optimized in-memory caching and persistent query executors that scale elastically within YARN clusters."
Hortonworks provides official support for LLAP starting with HDP 2.6.
Useful LLAP resources.