How to Reprocess Solution Errors

By: Vladimir Almaev
Last updated: 10:50pm Aug 01, 2019

Overview

Failures, errors and outages are unavoidable parts of any technical system. Of course, as engineers we should do our best to design solutions with failures in mind. Despite our best intentions and planning, situations sometimes come up that we had not anticipated, which makes elegant recovery difficult. Sometimes all we can do is retry and hope that connectivity is restored. One example of such a situation is the so-called heisenbug.

The Connect capability of TIBCO Cloud Integration provides the ability to reprocess failed records. When an execution fails with record errors, a copy of each source record with an error is stored, either in the cloud or locally in the on-premise agent database. This gives us the ability to retry processing of these failed records.

In this article I will show you how we can automate reprocessing of solution errors with the help of the Scribe Platform API Connector.

Short on time? Check out this video on how to reprocess solution errors!

Use Case

Consider the case when you have an unstable connection to one of your source or target systems in a solution. We want to automate reprocessing of all failed records in this solution.

Prerequisites

As a prerequisite you should have one unstable solution. For demo purposes, let’s use a solution with a single map, as follows:

This map will succeed only about 50% of the time. Let’s see why:

  • We’re using a fictional entity called SelectOne from the Scribe Labs Tools Connector. It just provides a single row with the current datetime in it. It can be very handy if you just want to start the map without querying an external data source.
  • The If block checks the seconds part of the current datetime using the DATEPART function and compares it with 30 (this is where the roughly 50% success rate comes from)
  • In the Else clause we put an Execute command with a Dates entity, which will always fail because we put invalid values into the target connection fields
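
The branching logic of this demo map can be sketched outside the platform in Python. This is only an illustration: the function name `map_succeeds` and the direction of the comparison (seconds below 30 take the successful branch) are assumptions, since the real map is built visually.

```python
from datetime import datetime


def map_succeeds(now: datetime) -> bool:
    """Mimic the If/Else block by inspecting the seconds part of a datetime.

    Assumption: seconds below 30 take the successful branch; the other half
    falls into the Else clause, where the Execute block always fails.
    """
    return now.second < 30


# Over a full minute, half of the possible seconds values succeed.
outcomes = [map_succeeds(datetime(2019, 8, 1, 22, 50, s)) for s in range(60)]
success_rate = sum(outcomes) / len(outcomes)
print(success_rate)
```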

After you finish with the map, note the Id and OrganizationId of this solution (you can get them from the URI). In this article, I will use the following values:

  • OrganizationId = 3531
  • SolutionId = “6c6bac38-4447-4ce3-a841-8621a3f72f9b”

Also, I encourage you to check out the Scribe Labs Tools Connector. It provides other useful blocks such as SHA1, which can help with GDPR compliance in some cases.

Iteration #1: Getting solutions with errors

The execution history of the solution can be retrieved either from the API directly or from an external system, as shown in a previous article. For simplicity, I will use the first approach since it doesn’t require any additional connectors:

A few notes about the map above:

  • We want to reprocess only the latest solution history. That’s why:
    • The Query block sorts histories by the Start column in descending order
      • You can see the possible values for the ExecutionHistoryColumnSort and SortOrder columns in the API tester
    • We use a Map Exit block to guarantee that no more than one execution history is reprocessed
  • We want to reprocess only the histories that contain errors.
    • For this reason, we’re using an If/Else control block that filters histories by Result value
    • If you want to reprocess only fatal errors or only record errors, you can change the condition
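
The selection logic described in these notes can be sketched in Python over in-memory data. The field names (`id`, `start`, `result`) and the Result values in `ERROR_RESULTS` are hypothetical stand-ins chosen for illustration; check the API tester for the names and values the API actually returns.

```python
from typing import Dict, List, Optional

# Hypothetical Result values that we treat as "contains errors".
ERROR_RESULTS = {"FatalError", "RecordErrors"}


def latest_history_with_errors(
    histories: List[Dict[str, str]],
) -> Optional[Dict[str, str]]:
    """Return the most recent history if it contains errors, else None."""
    # Sort by start time, newest first (like the Query block's descending sort).
    # ISO 8601 timestamps in the same format sort correctly as plain strings.
    ordered = sorted(histories, key=lambda h: h["start"], reverse=True)
    if not ordered:
        return None
    latest = ordered[0]  # like Map Exit: never look past the first row
    return latest if latest["result"] in ERROR_RESULTS else None


sample = [
    {"id": "h1", "start": "2019-08-01T10:00:00Z", "result": "RecordErrors"},
    {"id": "h3", "start": "2019-08-01T12:00:00Z", "result": "FatalError"},
    {"id": "h2", "start": "2019-08-01T11:00:00Z", "result": "Success"},
]
selected = latest_history_with_errors(sample)
print(selected["id"] if selected else "nothing to reprocess")
```

Narrowing `ERROR_RESULTS` to a single value corresponds to changing the If/Else condition to reprocess only one kind of error.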

Iteration #2: Marking solution errors for reprocessing

To reprocess errors, we first need to mark them for reprocessing. The Scribe Platform API provides two REST resources to accomplish this task:

  • POST /v1/orgs/{orgId}/solutions/{solutionId}/history/{id}/mark
    • Marks all errors from a solution execution history for reprocessing
  • POST /v1/orgs/{orgId}/solutions/{solutionId}/history/{historyId}/errors/{id}/mark
    • Marks a particular error from a solution execution history for reprocessing

Currently the Scribe Platform API Connector supports only the first resource, via the MarkAllErrors command.
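
For readers who prefer to call the first resource directly, here is a minimal Python sketch that only builds the request and does not send it. The base URL `https://api.example.com` and the history id `42` are placeholders, not real values, and authentication is deliberately left out because it depends on your account setup.

```python
import urllib.request


def build_mark_all_request(
    base_url: str, org_id: int, solution_id: str, history_id: str
) -> urllib.request.Request:
    """Build (but do not send) the POST that marks all errors of one history."""
    url = (
        f"{base_url}/v1/orgs/{org_id}/solutions/{solution_id}"
        f"/history/{history_id}/mark"
    )
    return urllib.request.Request(url, method="POST")


request = build_mark_all_request(
    base_url="https://api.example.com",  # placeholder, replace with the real host
    org_id=3531,
    solution_id="6c6bac38-4447-4ce3-a841-8621a3f72f9b",
    history_id="42",  # placeholder history id
)
print(request.get_method(), request.full_url)
```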

Iteration #3: Reprocessing solution errors

The next step after marking all the errors is reprocessing. We will use the ReprocessAllErrors command block, which reprocesses all marked errors from a solution execution. An important note from the documentation: this command will be ignored if the solution is running.
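
The order of operations, including the "ignored if running" caveat, can be sketched as follows. The three callables are hypothetical stand-ins for whatever you use to check the solution status and to issue the two commands; they are not part of any real client library.

```python
from typing import Callable, List


def mark_and_reprocess(
    is_running: Callable[[], bool],
    mark_all_errors: Callable[[], None],
    reprocess_all_errors: Callable[[], None],
) -> bool:
    """Mark, then reprocess. Returns False if the solution was still running."""
    if is_running():
        # Reprocessing would be ignored anyway, so skip it and report that.
        return False
    mark_all_errors()
    reprocess_all_errors()
    return True


# Record the calls so we can see the order in which the commands are issued.
calls: List[str] = []
done = mark_and_reprocess(
    is_running=lambda: False,
    mark_all_errors=lambda: calls.append("MarkAllErrors"),
    reprocess_all_errors=lambda: calls.append("ReprocessAllErrors"),
)
print(done, calls)
```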

Iteration #4: Retries

If we want more than one attempt at resolving errors by reprocessing, we can add retry logic to the map itself. However, this requires refactoring our map a bit.

Notable changes:

  • We added a Loop with an If/Else control block, which uses the SEQNUM function as a retry counter
  • On every retry we want to work with the latest Execution History record. That’s why the initial root block was decomposed into two:
    • The new root query block which works with Solutions
    • Lookup History block which will retrieve the latest possible history record

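
The retry structure can be sketched in Python as a bounded loop that re-reads the latest history on every pass. All names here (`reprocess_with_retries`, the `result` field, the `"Success"` value) are hypothetical and only illustrate the control flow of the refactored map.

```python
from typing import Callable, Dict, Optional

History = Dict[str, str]


def reprocess_with_retries(
    get_latest_history: Callable[[], Optional[History]],
    reprocess: Callable[[History], None],
    max_retries: int = 3,
) -> int:
    """Reprocess the latest history until it is clean or retries run out.

    Returns the number of reprocessing attempts that were made.
    """
    attempts = 0  # plays the role of the SEQNUM-based retry counter
    while attempts < max_retries:
        history = get_latest_history()  # re-read on every retry
        if history is None or history["result"] == "Success":
            break
        reprocess(history)
        attempts += 1
    return attempts


# Simulated solution whose errors clear after two reprocessing runs.
results = ["RecordErrors", "RecordErrors", "Success"]
state = {"runs": 0}


def fake_latest() -> History:
    return {"id": str(state["runs"]), "result": results[state["runs"]]}


def fake_reprocess(history: History) -> None:
    state["runs"] += 1


attempts_made = reprocess_with_retries(fake_latest, fake_reprocess, max_retries=5)
print(attempts_made)
```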
Iteration #5: Truncated Exponential Backoff

On the other hand, straightforward retries can be a source of accidental Denial-of-Service. It’s a classic example of the “road to hell is paved with good intentions” anti-pattern.

To avoid this pitfall we can implement a truncated exponential backoff algorithm. It’s not as hard as it sounds. The idea is to exponentially increase the delay between retries until we reach either the maximum retry count or the maximum backoff time.

Optionally, we can add some randomness when we compute the delay time, but it’s not needed in our case.
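
As a rough illustration of the idea, the delay computation might look like this in Python. The base of 1 second, the cap of 32 seconds and the `jitter` parameter are assumptions picked for the example, not values taken from the map.

```python
import random


def backoff_delay(
    retry: int,
    base_seconds: float = 1.0,
    max_seconds: float = 32.0,
    jitter: float = 0.0,
) -> float:
    """Delay before the given retry (0-based), doubled each time and capped."""
    delay = min(base_seconds * 2 ** retry, max_seconds)  # truncate at the cap
    # Optional randomness; with jitter=0.0 this adds nothing.
    return delay + random.uniform(0, jitter)


delays = [backoff_delay(n) for n in range(7)]
print(delays)  # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 32.0]
```

The last two values are equal because the delay is truncated once it reaches the maximum backoff time.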

At the time of writing, the Connect capability of TIBCO Cloud Integration doesn’t support a POW function (you can check that here). But we can emulate it with precomputed Lookup Table Values, since we know all the possible retry counter values. This precomputed-table approach is closely related to memoization.
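
The lookup-table workaround can be sketched in Python with a plain dictionary. The table contents and the 32-second cap are assumptions for illustration; in the map, the equivalent values would live in a Lookup Table.

```python
# Precomputed powers of 2, keyed by retry counter (hypothetical table contents).
POWERS_OF_TWO = {0: 1, 1: 2, 2: 4, 3: 8, 4: 16, 5: 32}
MAX_BACKOFF_SECONDS = 32  # assumed cap on the delay


def lookup_delay(retry_counter: int) -> int:
    """Return the delay in seconds without calling any power function."""
    # Counters beyond the table fall back to the cap, which gives truncation.
    return POWERS_OF_TWO.get(retry_counter, MAX_BACKOFF_SECONDS)


table_delays = [lookup_delay(n) for n in range(8)]
print(table_delays)  # [1, 2, 4, 8, 16, 32, 32, 32]
```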

And here’s the updated map:

Notable Changes:

  • We used the Sleep block from the Scribe Labs Tools Connector to pause the map
  • The SEQNUM function was replaced by the SEQNUMN function
    • We created a named sequence, “RetryCounter”, which we can use in any subsequent map blocks
  • With the help of SEQNUMNGET we can peek at the current value of our named sequence without incrementing it (just as with a stack!)
  • The LookupTableValue2 function gets the precomputed power of 2 from the corresponding Lookup Table

Summary

In this article we learned:

  • How to mark and reprocess all errors from a particular solution execution with the help of Command blocks from the Scribe Platform API Connector
  • How to implement retries with exponential backoff to prevent accidental Denial-of-Service
    • The Sleep block helped us pause the solution
    • With Lookup Tables we overcame the absence of a POW function

First Published June 2018

About the Author

Vladimir Almaev - Vladimir is a software architect, polyglot developer and team lead at Aquiva Labs. His professional interests include programming languages and paradigms, software design and architecture, continuous integration, and hacks for personal productivity. He was lead developer on the Aquiva Labs team that partnered with Scribe on the Scribe Platform API Connector. While not working with software, Vladimir enjoys reading, drawing, listening to music, running, and practicing martial arts.