Overview
Failures, errors, and outages are unavoidable parts of any technical system. As engineers, we should do our best to design solutions with failures in mind. Despite our best intentions and planning, situations we had not anticipated sometimes come up, which makes elegant recovery difficult. Often all we can do is retry and hope that connectivity is restored. So-called heisenbugs are one example of this.
The Connect capability of TIBCO Cloud Integration provides the ability to reprocess failed records. When an execution fails with record errors, a copy of each source record with an error is stored, either in the cloud or locally in the on-premise agent database. This stored copy gives us the ability to retry the processing of these failed records.
In this article, we will show you how we can automate reprocessing of solution errors with the help of the Scribe Platform API Connector.
Short on time? Check out this video on how to reprocess solution errors!
Use Case
Consider the case when you have an unstable connection to one of your source or target systems in a solution. We want to automate reprocessing of all failed records in this solution.
Prerequisites
As a prerequisite, you should have one unstable solution. For demo purposes, let's use a solution with a single map as follows:
This map will only succeed in 50% of the cases. Let's see why:
- We're using a fictional entity called SelectOne from the Scribe Labs Tools Connector. It just provides a single row with the current datetime in it. It can be very handy if you just want to start the map without querying an external data source.
- IF block checks the seconds part of current datetime using DATEPART function and compares it with 30 (here we get 50% success rate)
- You can replace 30 with another value if you want a different success rate
- We're using the GETUTCDATETIME function to get the current datetime instead of the UtcNow property, because in the latter case TIBCO Cloud Integration will use the same datetime value during reprocessing. This leaves no chance of successful reprocessing. GETUTCDATETIME, however, always provides the current datetime.
- In the ELSE clause, we put an Execute command with a Dates entity, which will always fail because we pass invalid values to the target connection fields
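The 50% figure follows from counting which seconds values pass the check. Here is a minimal Python sketch of that reasoning. It assumes the IF block takes the success branch when the seconds part is below the threshold (the direction of the comparison is an assumption), and the function names are illustrative, not part of the platform.

```python
from datetime import datetime, timezone

# Assumption: the map succeeds when the seconds part is below the threshold.
THRESHOLD_SECONDS = 30


def map_succeeds(now: datetime, threshold: int = THRESHOLD_SECONDS) -> bool:
    """Mimic the IF block: compare the seconds part of the timestamp with the threshold."""
    return now.second < threshold


def expected_success_rate(threshold: int = THRESHOLD_SECONDS) -> float:
    """Fraction of the 60 possible seconds values that pass the check."""
    passing = sum(1 for second in range(60) if second < threshold)
    return passing / 60


if __name__ == "__main__":
    print(map_succeeds(datetime.now(timezone.utc)))
    print(expected_success_rate())    # threshold 30
    print(expected_success_rate(15))  # a lower threshold gives a lower success rate
```

Changing the threshold shifts the success rate proportionally, which is why replacing 30 with another value changes how often the map fails.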
After you finish with the map, note the Id and OrganizationId of this solution (you can get them from the URI). In this article, I will use the following values:
- OrganizationId = 3531
- SolutionId = "6c6bac38-4447-4ce3-a841-8621a3f72f9b"
Also, I encourage you to check out the Scribe Labs Tools Connector. It provides other useful blocks such as SHA1, which can help with GDPR compliance in some cases.
Iteration #1: Getting solutions with errors
The execution history of the solution can be retrieved either from the API directly or from an external system, as shown in a previous article. For simplicity, I will use the first approach since it doesn't require any additional connectors:
A few notes about the map above:
- We want to reprocess only the latest solution history, that's why:
- The Query block sorts histories by the Start column in descending order
- Possible values for the ExecutionHistoryColumnSort and SortOrder columns can be found in the API tester
- We use a Map Exit block to guarantee that no more than one execution history is reprocessed
- We want to reprocess only the histories that contain errors.
- For this reason, we're using an If/Else control block, which filters histories by the Result value
- If you want to reprocess only fatal errors, only record errors, or both, you can change the condition
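The selection logic of this map can be summarized outside the platform as "take the newest history, then keep it only if it ended with errors". The Python sketch below illustrates that reading. The field names (`start`, `result`) and the result labels are hypothetical placeholders; check the API tester for the real column names and values.

```python
from typing import Dict, List, Optional

# Hypothetical result labels; the real values should be taken from the API tester.
ERROR_RESULTS = {"FatalError", "RecordErrors"}


def latest_history_with_errors(histories: List[Dict[str, str]]) -> Optional[Dict[str, str]]:
    """Return the most recent history if it ended with errors, otherwise None.

    Timestamps are assumed to be ISO 8601 strings in the same format and time zone,
    so comparing them as strings orders them chronologically.
    """
    if not histories:
        return None
    # Equivalent to sorting by Start in descending order and stopping after the first row.
    latest = max(histories, key=lambda history: history["start"])
    return latest if latest["result"] in ERROR_RESULTS else None


if __name__ == "__main__":
    sample = [
        {"start": "2020-01-01T10:00:00Z", "result": "CompletedSuccessfully"},
        {"start": "2020-01-02T10:00:00Z", "result": "RecordErrors"},
    ]
    print(latest_history_with_errors(sample))
```

To reprocess only fatal errors or only record errors, narrowing `ERROR_RESULTS` plays the same role as changing the If/Else condition in the map.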
Iteration #2: Marking solution errors for reprocessing
To reprocess errors, first, we should mark all the errors for reprocessing. Scribe Platform API provides two REST resources to accomplish this task:
- POST /v1/orgs/{orgId}/solutions/{solutionId}/history/{id}/mark
- Mark all errors from the solution execution history for reprocessing
- POST /v1/orgs/{orgId}/solutions/{solutionId}/history/{historyId}/errors/{id}/mark
- Mark a particular error from the solution execution history for reprocessing
Currently, the Scribe Platform API connector supports only the first resource, via the MarkAllErrors command.
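For readers who want to see how the two resources are addressed, here is a small Python sketch that only builds the request paths from the templates listed above. The helper names are illustrative, the history and error ids in the usage example are made-up values, and the base URL and authentication are deliberately left out.

```python
def mark_all_errors_path(org_id: int, solution_id: str, history_id: int) -> str:
    """Path of the resource that marks all errors of one execution history."""
    return f"/v1/orgs/{org_id}/solutions/{solution_id}/history/{history_id}/mark"


def mark_single_error_path(org_id: int, solution_id: str, history_id: int, error_id: int) -> str:
    """Path of the resource that marks one error of one execution history."""
    return (
        f"/v1/orgs/{org_id}/solutions/{solution_id}"
        f"/history/{history_id}/errors/{error_id}/mark"
    )


if __name__ == "__main__":
    org_id = 3531
    solution_id = "6c6bac38-4447-4ce3-a841-8621a3f72f9b"
    # 42 and 7 are hypothetical history and error ids used only for illustration.
    print(mark_all_errors_path(org_id, solution_id, 42))
    print(mark_single_error_path(org_id, solution_id, 42, 7))
```

Both resources are called with POST, as listed above.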
Iteration #3: Reprocessing solution errors
The next step after marking all the errors is reprocessing. We will use the ReprocessAllErrors command block, which reprocesses all marked errors from a solution execution. An important note from the documentation: this command will be ignored if the solution is running.
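Because the reprocess command is ignored while the solution is running, it can help to think of the two steps as one guarded operation. The Python sketch below shows that ordering under stated assumptions: the three callables are hypothetical stand-ins for a status check, MarkAllErrors, and ReprocessAllErrors, and skipping the marking step while the solution is running is a design choice of this sketch, not documented platform behavior.

```python
from typing import Callable


def reprocess_if_idle(
    is_running: Callable[[], bool],
    mark_all_errors: Callable[[], None],
    reprocess_all_errors: Callable[[], None],
) -> bool:
    """Mark and then reprocess errors, but only when the solution is not running.

    Returns True if the reprocess command was issued, False if it was skipped.
    """
    if is_running():
        # The command would be ignored anyway, so skip and let the caller try again later.
        return False
    mark_all_errors()
    reprocess_all_errors()
    return True


if __name__ == "__main__":
    issued = reprocess_if_idle(
        is_running=lambda: False,
        mark_all_errors=lambda: print("mark all errors"),
        reprocess_all_errors=lambda: print("reprocess all errors"),
    )
    print(issued)
```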
Iteration #4: Retries
If we want more than one attempt at resolving errors by reprocessing, we can add retry logic to the map itself. However, this requires refactoring our map a bit.
Notable changes:
- We added a Loop with an If/Else control block, which uses the SEQNUM function as a retry counter
- As an alternative to the SEQNUM function, you can use the Scribe Labs Variables Connector
- On every retry, we want to work with the latest Execution History record. That's why the initial root block was decomposed into two:
- The new root query block which works with Solutions
- Lookup History block which will retrieve the latest possible history record
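The control flow of the refactored map can be sketched in Python as a bounded loop that looks up the latest history again on every pass. This is only an illustration of the idea: `fetch_latest_history` and `reprocess` are hypothetical callables, the `has_errors` field is a made-up name, and the retry limit of 3 is an arbitrary example value.

```python
from typing import Callable, Dict, Optional

MAX_RETRIES = 3  # arbitrary example value


def retry_reprocessing(
    fetch_latest_history: Callable[[], Optional[Dict[str, bool]]],
    reprocess: Callable[[Dict[str, bool]], None],
    max_retries: int = MAX_RETRIES,
) -> int:
    """Reprocess the latest history until it has no errors or the retry limit is hit.

    Returns the number of reprocessing attempts that were made.
    """
    attempts = 0  # plays the role of the SEQNUM-based retry counter
    while attempts < max_retries:
        history = fetch_latest_history()  # always work with the latest record
        if history is None or not history["has_errors"]:
            break
        reprocess(history)
        attempts += 1
    return attempts


if __name__ == "__main__":
    remaining_failures = {"count": 2}

    def fetch() -> Dict[str, bool]:
        return {"has_errors": remaining_failures["count"] > 0}

    def reprocess(history: Dict[str, bool]) -> None:
        remaining_failures["count"] -= 1

    print(retry_reprocessing(fetch, reprocess, max_retries=5))
```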
Iteration #5: Truncated Exponential Backoff
On the other hand, straightforward retries can be a source of accidental Denial-of-Service. It's a classic example of the "road to hell is paved with good intentions" anti-pattern.
To avoid this pitfall, we can implement the truncated exponential backoff algorithm. It's not as hard as it sounds. The idea is to exponentially increase the delay between retries until we reach the maximum retry count or the maximum backoff time.
Optionally, we can add some randomness when we compute the delay, but it's not needed in our case.
At the time of writing, the Connect capability of TIBCO Cloud Integration doesn't support a POW function (you can check that here). But we can emulate it with precomputed Lookup Table Values, since we know all the possible retry counter values. This technique resembles memoization: the results are computed once up front and then simply looked up.
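The delay computation can be sketched in a few lines of Python. The sketch mirrors the lookup-table workaround by reading powers of 2 from a precomputed dictionary instead of calling a power function. The base delay, the cap, and the table size are example values chosen for illustration, not values taken from the map.

```python
# Precomputed powers of 2, standing in for the Lookup Table in the map.
POWERS_OF_TWO = {0: 1, 1: 2, 2: 4, 3: 8, 4: 16, 5: 32}

BASE_DELAY_SECONDS = 1   # example value
MAX_DELAY_SECONDS = 20   # example cap: this is the "truncated" part


def backoff_delay(retry: int) -> int:
    """Delay before the given retry: base * 2^retry, capped at the maximum delay."""
    if retry < 0:
        raise ValueError("retry must be non-negative")
    # Retries beyond the table reuse the largest precomputed entry.
    capped_retry = min(retry, max(POWERS_OF_TWO))
    return min(BASE_DELAY_SECONDS * POWERS_OF_TWO[capped_retry], MAX_DELAY_SECONDS)


if __name__ == "__main__":
    for retry in range(7):
        print(retry, backoff_delay(retry))
```

With these example values the delays grow as 1, 2, 4, 8, 16 seconds and then stay at the 20-second cap.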
Notable Changes:
- We used the Sleep block from the Scribe Labs Tools Connector to pause the map
- The SEQNUM function was replaced by the SEQNUMN function
- We created a named sequence called "RetryCounter", which we can work with in any further map blocks
- With the help of SEQNUMNGET, we can peek at the current value of our named sequence without incrementing it (just like peeking at a stack!)
- The LookupTableValue2 function gets the precomputed power of 2 from the corresponding Lookup Table
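Putting these pieces together, the loop can be sketched in Python as follows. The `NamedSequences` class is a hypothetical stand-in for a named sequence with an increment operation and a peek operation; the exact semantics of SEQNUMN and SEQNUMNGET (for example, the starting value) are assumptions here. The delay table, the cap, and the injected `sleep` callable are likewise illustrative.

```python
from typing import Callable, Dict

COUNTER = "RetryCounter"
DELAY_TABLE = {1: 1, 2: 2, 3: 4, 4: 8}  # precomputed powers of 2, in seconds
MAX_DELAY_SECONDS = 5                    # example cap


class NamedSequences:
    """Hypothetical model of named sequences: next() increments, peek() only reads."""

    def __init__(self) -> None:
        self._values: Dict[str, int] = {}

    def next(self, name: str) -> int:
        self._values[name] = self._values.get(name, 0) + 1
        return self._values[name]

    def peek(self, name: str) -> int:
        return self._values.get(name, 0)


def run_with_backoff(
    attempt: Callable[[], bool],
    sleep: Callable[[float], None],
    max_attempts: int = 4,
) -> bool:
    """Call attempt() until it succeeds or max_attempts is reached, sleeping in between."""
    sequences = NamedSequences()
    while True:
        if attempt():
            return True
        failures = sequences.next(COUNTER)  # increment the retry counter
        if failures >= max_attempts:
            return False
        # Peek at the counter without incrementing it, then look up the delay.
        # Counter values beyond the table fall back to the cap.
        delay = DELAY_TABLE.get(sequences.peek(COUNTER), MAX_DELAY_SECONDS)
        sleep(min(delay, MAX_DELAY_SECONDS))


if __name__ == "__main__":
    import time

    outcomes = iter([False, False, True])
    print(run_with_backoff(lambda: next(outcomes), time.sleep))
```

Passing `sleep` in as a parameter keeps the sketch easy to exercise without real waiting; in the map itself, the Sleep block plays that role.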
Summary
In this article we learned:
- How to mark and reprocess all errors from a particular solution execution with the help of the Command block from the Scribe Platform API Connector
- How to implement retries with exponential backoff to prevent accidental Denial-of-Service
- Sleep block helped us with pausing the solution
- With Lookup Tables we overcame the absence of POW function