You can now choose to disable retrying commit or rollback for heuristic hazard transactions.
Heuristic hazard transactions can arise from network connectivity issues during the commit phase: if a resource receives a prepare request and then becomes unreachable during commit or rollback, the transaction goes into "heuristic hazard" mode. This means that commit will be retried a number of times, even if com.atomikos.icatch.oltp_max_retries is set to zero. The rationale is that it is better to terminate pending in-doubt transactions sooner rather than later, because of the locks they may be holding.
If you don't want this behaviour, you can now disable it and rely on the background recovery process to terminate these transactions instead (which also works, but only happens periodically). To disable it, set this new property to false:
com.atomikos.icatch.retry_on_heuristic_hazard=false
This is a new, optional startup property. If it is not present, it defaults to true to preserve compatibility with existing behaviour.
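As a minimal sketch, the property could also be set programmatically before the transaction service starts. This assumes that startup properties are picked up from JVM system properties (the equivalent of passing -D on the command line); the class and method names are hypothetical and not part of our API.

```java
class HazardRetrySettings {

    // Property name as documented above.
    static final String RETRY_ON_HAZARD = "com.atomikos.icatch.retry_on_heuristic_hazard";

    /**
     * Disables commit/rollback retries for heuristic hazard transactions.
     * Assumption: this must run before the transaction service initializes,
     * otherwise the value is read too late to have any effect.
     */
    static void disableHazardRetries() {
        System.setProperty(RETRY_ON_HAZARD, "false");
    }
}
```

Setting the same key in jta.properties, as shown above, achieves the same result without any code.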
You can now explicitly trigger recovery in your application, via our API.
    import com.atomikos.icatch.RecoveryService;
    import com.atomikos.icatch.config.Configuration;

    boolean lax = true; // false to force recovery, true to allow intelligent mode
    RecoveryService rs = Configuration.getRecoveryService();
    rs.performRecovery(lax);
In order for this to work, make sure to set (in jta.properties):
    # set to Long.MAX_VALUE so background recovery is disabled
    com.atomikos.icatch.recovery_delay=9223372036854775807L
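With background recovery disabled, the application decides when recovery runs. One possible arrangement, sketched here with standard java.util.concurrent classes only, is to schedule the recovery call yourself. The class name, thread name, and delay are assumptions made for illustration; the recovery call itself is passed in as a Runnable so the sketch compiles without our libraries on the classpath.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class ApplicationDrivenRecovery {

    /**
     * Runs recoveryPass repeatedly, waiting delayMillis between passes.
     * In an application, recoveryPass would be something like:
     *   () -> Configuration.getRecoveryService().performRecovery(true)
     */
    static ScheduledExecutorService start(Runnable recoveryPass, long delayMillis) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "app-recovery"); // hypothetical thread name
            t.setDaemon(true); // do not keep the JVM alive just for recovery
            return t;
        });
        scheduler.scheduleWithFixedDelay(() -> {
            try {
                recoveryPass.run();
            } catch (RuntimeException e) {
                // An uncaught exception would cancel all future passes,
                // so report it and let the next pass try again.
                System.err.println("Recovery pass failed: " + e);
            }
        }, 0, delayMillis, TimeUnit.MILLISECONDS);
        return scheduler;
    }
}
```

Call shutdown() on the returned scheduler when the application stops.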
We have added methods on an existing API interface, which does not break existing clients.
| Severity: | 2 |
|---|---|
| Affected version(s): | 5.0.x |
We no longer call XAResource.recover() when failures happen during regular commit or rollback, so the overhead for the backend is reduced.
For historical reasons we used to call the XA recovery routine on the backend whenever commit or rollback failed. The most common cause of such failures is a network glitch, meaning that big clusters with a short network problem would suddenly hit the backends with recovery for all active transactions. Since recovery can be an expensive operation, this would result in needless load on the backends.
The rationale behind this was to avoid needless commit retries (based on the value of com.atomikos.icatch.oltp_max_retries), but the overhead does not justify the possible benefit.
From now on we no longer do this, since it is either the recovery process (in the background) or the application (via our API) that controls when recovery happens.
Worst case, this can lead to needless commit retries, in which case the backend should respond with error code XAER_NOTA and our code will handle this gracefully. However, we have historical records where some older version of ActiveMQ did not behave like this. This would result in errors in the ActiveMQ log files, in turn leading to alerts for the operations team.
If you experience issues with this, then it suffices to set com.atomikos.icatch.oltp_max_retries to zero. That will disable regular commit retries and delegate to the recovery background process.
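The graceful handling of XAER_NOTA during a needless retry can be pictured roughly as follows. This is an illustrative sketch and not our actual implementation: CommitRetrySketch, CommitAttempt and commitWithRetries are hypothetical names, and only XAException and its error codes come from the standard javax.transaction.xa package.

```java
import javax.transaction.xa.XAException;

class CommitRetrySketch {

    /** Hypothetical stand-in for one commit attempt against a backend. */
    @FunctionalInterface
    interface CommitAttempt {
        void run() throws XAException;
    }

    /**
     * Tries once, then retries up to maxRetries more times.
     * Returns true if the transaction is known to be terminated in the backend,
     * false if it is still pending and must be left to recovery.
     */
    static boolean commitWithRetries(CommitAttempt attempt, int maxRetries) {
        for (int i = 0; i <= maxRetries; i++) {
            try {
                attempt.run();
                return true;
            } catch (XAException e) {
                if (e.errorCode == XAException.XAER_NOTA) {
                    // The backend no longer knows this transaction, so it was
                    // already terminated and the retry was needless. Not an error.
                    return true;
                }
                // Any other error: retry if attempts remain.
            }
        }
        return false;
    }
}
```

With maxRetries set to zero the loop makes a single attempt and then gives up, which mirrors the effect of setting com.atomikos.icatch.oltp_max_retries to zero.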
| Severity: | 2 |
|---|---|
| Affected version(s): | 5.0.x |
For releases 5.0 or higher, the maximum timeout should not be set to 0 or recovery will interfere with regular application-level commits.
The 5.0 release has a new recovery workflow that is incompatible with com.atomikos.icatch.max_timeout being zero. That is because recovery depends on the maximum timeout to perform rollback of pending (orphaned) prepared transactions in the backends. If the maximum timeout is zero then recovery (in the background) will roll back prepared transactions that are concurrently being committed in your application. This will result in heuristic exceptions and inconsistent transaction outcomes.
Keep in mind that the maximum timeout is also indicative of maximum lock duration in your databases, so choose it wisely! If you are / were depending on an unlimited maximum timeout then you are also allowing unlimited lock times.
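If you want to catch this misconfiguration early, a startup check along the following lines is one option. It is a hypothetical guard and not part of our product: the class and method names are invented for illustration, and it only inspects a Properties object that you load yourself (for instance from jta.properties).

```java
import java.util.Properties;

class MaxTimeoutGuard {

    // Property name as documented above.
    static final String MAX_TIMEOUT = "com.atomikos.icatch.max_timeout";

    /** Returns true only if the maximum timeout is explicitly set to zero. */
    static boolean isZeroMaxTimeout(Properties props) {
        String value = props.getProperty(MAX_TIMEOUT);
        if (value == null) {
            // Not set in these properties: the default applies,
            // which this sketch does not try to evaluate.
            return false;
        }
        return Long.parseLong(value.trim()) == 0L;
    }
}
```

An application could refuse to start, or at least log a warning, when this check returns true.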
| Severity: | 4 |
|---|---|
| Affected version(s): | 5.0.x |
You can now more easily determine when connections are reaped because of another connection timing out on network I/O or DB locks.
We already used to collect the stack trace of the thread that acquired a reaped connection. However, we now also collect the thread name to correlate reap situations with timeouts, for instance like this:
Before this fix, you would see a timeout + application's thread name + stack trace for step 2, and a stack trace for 3. The stack trace would show where in your application the connection was gotten in step 1, but not by which thread. Indeed, step 3 would log the stack trace within the context of the pool maintenance thread, not the original application thread in step 1.
With this fix you will now also see the application's thread name (i.e., the thread of step 2) in step 3 so you can easily correlate 1-2-3 and determine the timeout in 2 as the root cause for the reap.
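The idea can be illustrated with a small sketch: capture both the thread name and the stack trace at the moment a connection is acquired, so that a later log statement on a different thread can still report them. The class below is a hypothetical illustration, not our pool code, and the message format is invented.

```java
class BorrowerSnapshot {

    private final String threadName;
    private final StackTraceElement[] stackTrace;

    /** Call this on the application thread at the moment it acquires the connection. */
    BorrowerSnapshot() {
        Thread current = Thread.currentThread();
        this.threadName = current.getName();
        this.stackTrace = current.getStackTrace();
    }

    /**
     * Builds the reap message. This may run on any thread (such as a pool
     * maintenance thread) and still reports the acquiring thread's name.
     */
    String describeReap() {
        StringBuilder sb = new StringBuilder("Reaping connection acquired by thread '")
                .append(threadName)
                .append("' at:");
        for (StackTraceElement frame : stackTrace) {
            sb.append(System.lineSeparator()).append("\tat ").append(frame);
        }
        return sb.toString();
    }
}
```

Because the name is stored at acquisition time, it does not matter which thread eventually logs the reap.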
None.