Caché High Availability Guide
ECP and High Availability
InterSystems: The power behind what matters   
One of the most powerful and unique features of Caché is the ability to scale efficiently by distributing data and application logic among a number of server systems. The underlying technology behind this feature is the Enterprise Cache Protocol (ECP), a distributed data caching architecture that manages the distribution of data and locks across a heterogeneous network of server systems. ECP application servers are designed to preserve the state of the running application across a failover of the data server. Depending on the nature of the application activity, some users may experience a pause until failover completes, but can then continue operating without interrupting their workflow.

This chapter describes how the ECP architecture works to maintain high availability in the following topics:
For more detailed information about ECP, see the Caché Distributed Data Management Guide.
ECP Architecture for High Availability
The typical ECP architecture for high availability includes a set of application servers connected to a data server.The application load is distributed across the application server tier by external load balancing, while the data server maintains the database state and may host some application load locally (such as batch activity). ECP provides efficient scaling by allowing additional application servers to be added to meet growing demand. To adequately meet demand even if an application server fails, the application server tier should have at least one server more than the number strictly required. The data server is made highly available by utilizing any one of the system failover strategies discussed in System Failover Strategies, as shown in the following illustration.
Regardless of the failover strategy employed for the data server, the application servers reconnect and recover their states following a failover, allowing application processing to continue where it left off prior to the failure. While this application recovery capability is beneficial for environments using ECP, it is not typically a reason on its own to introduce ECP when not otherwise needed. An application server itself may fail, requiring its users to reconnect, and the introduction of application servers adds complexity to the environment.
ECP Recovery and Failover
On failure of an ECP application server, data server, or the network connecting them, ECP goes through a recovery protocol that differs depending on the nature of the failure. The result is that the connection is either recovered, allowing the application processes to continue as though nothing had happened, or the connection is reset, forcing transactions to rollback and the application processes to get rebuilt. The main principles are as follows:
ECP Recovery Detailed
The simplest case of ECP recovery is a temporary network interruption that is long enough to be noticed, but short enough that the underlying TCP connection stays active during the outage. During the outage, the application server notices that the connection is nonresponsive and blocks new network requests for that connection. Once the connection resumes, processes that were blocked are able to send their pending requests.
If the underlying TCP connection is reset, the data server waits for a reconnection for the Time interval for Troubled state setting (one minute by default). If the application server does not succeed in reconnecting during that interval, the data server resets its connection, rolls back its open transactions, and releases its locks. Any subsequent connection from that application server is converted into a request for a brand new connection and the application server is notified that its connection is reset.
The application server keeps a queue of locks to remove and transactions to roll back once the connection is reestablished. By keeping this queue, processes on the application server can always halt, whether or not the data server on which it has pending transactions and locks is currently available. ECP recovery completes any pending Set and Kill operations that had been queued for the data server before the network outage was detected, before it completes the release of locks.
Any time a data server learns that an application server has reset its own connection (due to application server restart, for example), even if it is still within the Time interval for Troubled state, the data server resets the connection immediately, rolling back transactions and releasing locks on behalf of that application server. Since the application server’s state was reset, there is no longer any state to be maintained by the data server on its behalf.
The final case is when the data server shut down, either gracefully or as a result of a crash. The application server maintains the application state and tries to reconnect to the database server for the Time to wait for recovery setting (20 minutes by default). The data server remembers the application server connections that were active at the time of the crash or shutdown; after restarting, it waits up to thirty seconds for those application servers to reconnect and recover their connections. Recovery involves several steps on the data server, some of which involve the data server journal file in very significant ways. The result of the several different steps is that:
During the recovery of an ECP-configured system, Caché guarantees a number of recoverable semantics which are described in detail in the ECP Recovery Guarantees section of the “ECP Recovery Guarantees and Limitations” appendix of the Caché Distributed Data Management Guide. Limitations to these guarantees are described in detail in the ECP Recovery Limitations section of the aforementioned appendix.
Help us improve this page