Developing Distributed Applications

This chapter discusses application development and design issues that are helpful if you would like to deploy your application using ECP, either as an option or as its primary configuration.

With Caché, the decision to deploy an application as a distributed (multiserver) system is primarily a runtime configuration issue (see Configuring Distributed Systems). Using the Caché configuration tools, map the logical names of your data (globals) and application logic (routines) to physical storage on the appropriate system.

This chapter discusses the following topics:

ECP Recovery

ECP is designed to automatically recover from interruptions in connectivity between the ECP application server and the ECP data server.

If the connection between an ECP application server and ECP data server is interrupted, the following happens:

The application server connection state is set to Trouble indicating that this connection is attempting to recover.
The ECP application server attempts to reestablish its connection with the ECP data server.
If the application server connection is reestablished within the Time interval for Troubled state (see ECP Timeout Values), the connection state changes to Normal. ECP restores all locks and open transactions to the state they were in prior to the interruption.
If the ECP application server cannot reestablish its connection with the ECP data server within the Time interval for Troubled state, it disables the connection, returns a <NETWORK> error to all processes waiting for remote activities, and then sets the application server connection state to Not Connected. The next reference establishes a new ECP connection to the data server.
If the connection is not reestablished within the Time to wait for recovery (see ECP Timeout Values), the ECP data server rolls back any open transactions involving the ECP application server and then releases all locks held on the server on behalf of the ECP application server. The state of the data server connection changes to Free.

By default, ECP uses the following timeout values:

ECP Timeout Values

Management Portal Setting	Default Value	Range
Time interval for Troubled state	60 seconds	20–65535 seconds
Time to wait for recovery	1200 seconds (20 minutes)	10–65535 seconds
Time between reconnections	5 seconds	1–60 seconds

Each is configurable from the ECP Settings page of the Management Portal (System Administration > Configuration > Connectivity > ECP Settings). They are described in the following:

Time Interval for Troubled State

The duration a connection stays in troubled state (in seconds). Once this period of time has elapsed, the data server declares the connection dead and presumes recovery is not possible.

The setting is in the This System as an ECP Data Server section; the default value is 60 seconds. The range is 20–65535 seconds.

Time to wait for recovery

How long an application server should keep trying to reestablish a connection before giving up or declaring the connection failed (in seconds).

The setting is in the This System as an ECP Application Server section; the default value is 1200 seconds (20 minutes). The value should be at least 10 seconds and can be a maximum of 65535 seconds. The application server continues reconnection attempts as scheduled by the reconnect interval until the full duration expires.

Time between reconnections

How long to wait between each reconnection attempt (in seconds), when a data server is not available.

The setting is in the This System as an ECP Application Server section; the default value is 5 seconds. The range is 1–60 seconds. The application server continues reconnection attempts at intervals scheduled by this interval until the full reconnect duration expires.

The default values are set so the data server gives up quicker because Caché does not want to tie up data server resources for an extended amount of time for an application server that is down. The application server waits for up to twenty minutes because when data servers crash and restart, Caché wants to give the application server a chance to complete recovery after the data server comes back up. There is an implicit assumption that a data server has something better to do with its time than wait for an application server to reconnect, but an application server does not.

ECP relies on the TCP physical connection to detect the health of the other side of the pipe. ECP ensures there is always a message going across without flooding the pipe. On most platforms you can adjust the TCP connection failure and detection behavior at the system level.

If no traffic is received from an application server for a while, the data server declares the connection non-responsive. In the non-responsive state, there is an active ECP data server daemon that is waiting for new requests to arrive on the connection, or for a new connection to be requested. If the old connection returns, it can immediately resume operation without recovery. Because of the underlying heartbeat mechanism, if the application server goes away completely, either because of application server failure or network failure, the underlying TCP connection is quickly reset. Thus, a long time in the non-responsive state on the data server generally indicates some kind of problem on the application server (a system hang, for example) that caused the application server to stop functioning, but without interfering with its connections.

If the underlying TCP connection is reset, the data server puts the connection in an “awaiting reconnection” state. In the “awaiting reconnection” state, there is no active ECP daemon on the data server. A new pair of ECP data server daemons will be created when the next incoming connection is requested by the application server system.

Collectively, the non-responsive state and the awaiting-recovery state are known as the data server-side Trouble state. The recovery required in both cases is very similar.

If the data server crashes or shuts down, it remembers the connections that were active at the time of the crash or shutdown. After restarting, the data server has a short window (usually 30 seconds) during which it places these interrupted connections in an “awaiting reconnection” state. In this state, the application server and data server can cooperate together to recover all the transaction and lock states as well as all the pending Set and Kill transactions from the moment of the data server shutdown.

During the recovery of an ECP-configured system, Caché guarantees a number of recoverable semantics which are described in detail in the ECP Recovery Guarantees section of the “ECP Recovery Guarantees and Limitations” appendix. There are limitations to these guarantees which are described in detail in the ECP Recovery Limitations section of the same appendix.

Forced Disconnects

By default, ECP automatically manages the connection between an application server system and a data server system. When an ECP-configured system is initially started, all connections between ECP application servers and data servers are in the Not Connected state (that is, the connection is defined, but not yet established). As soon as an ECP application server makes a request (for data or code) that requires a connection to an ECP data server, the connection is automatically established and the state changes to Normal. The network connection between the ECP application server and data server is kept open indefinitely.

In some applications, you may wish to close open ECP connections. For example, suppose you have a system, configured as an ECP application server, that periodically (a few times a day) needs to fetch data stored on a data server system, but does not need to keep the network connection with the data server open afterwards. In this case, the ECP application server system can issue a call to the ChangeToNotConnected method of the SYS.ECPOpens in a new tab class to force the state of the ECP connection to Not Connected.

For example:

 Do OperationThatUsesECP()
 Do SYS.ECP.ChangeToNotConnected("ConnectionName")

The ChangeToNotConnected method does the following:

Completes sending any data modifications to the data server and waits for acknowledgment from the data server.
Removes any locks on the ECP data server that were opened by the ECP application server.
Rolls back the data server side of any open transactions. The application server side of the transaction goes into a “rollback only” condition.
Completes pending requests with a <NETWORK> error.
Flushes all cached blocks.

After completion of the state change to Not Connected, the next request for data from the ECP data server automatically reestablishes the connection.

Performance and Programming Considerations

To achieve the highest performance and reliability from ECP applications, you should be aware of the following issues:

Avoid Transactions Spanning Multiple Data Servers

Restrict updates within a single transaction to a single server, whether it be a remote ECP data server or the local server. When a transaction includes updates to more than one server (including the local server) and the TCommit cannot complete successfully, some servers that are part of the transaction may have committed the updates while others may have rolled them back. See the Commit Guarantee section of “ECP Recovery Guarantees and Limitations” for details.

Note:

Updates to CACHETEMP are not considered part of the transaction for the purpose of rollback, and, as such, are not included in the restriction.

ZSync Command Nonfunctional

In ECP configurations, the ZSync command is not operable. ECP achieves an analogous effect within transactions by declaring the entire transaction as rollback-only whenever any Set or Kill operation receives an error.

Memory Use on Large ECP Systems

In addition to buffering the blocks that are served over ECP, ECP data servers use global buffers to store various ECP control structures. There are several factors that go into determining how much memory these structures might require, but the most significant is a function of the aggregate sizes of the clients' caches. To roughly approximate the requirements, so you can adjust the data server’s database caches if needed, use the following guidelines:

Database Block Size	Recommendation
8 KB	50 MB plus 1% of the sum of the sizes of all of the application servers’ 8 KB database caches
16 KB (if enabled)	0.5% of the sum of the sizes of all of the application servers’ 16 KB database caches
32 KB (if enabled)	0.25% of the sum of the sizes of all of the application servers’ 32 KB database caches
64 KB (if enabled)	0.125% of the sum of the sizes of all of the application servers’ 64 KB database caches

For example, if the 16 KB block size is enabled in addition to the default 8 MB block size, and there are six application servers, each with an 8 KB database cache of 2 GB and a 16 KB database cache of 4 GB, you should adjust the data server’s 8 KB database cache to ensure that 52 MB (50MB + [12 GB * .01]) is available for control structures, and the 16 KB cache to ensure that 2 MB (24 GB * .005) is available for control structures (rounding up in both cases).

For information about allocating memory to database caches, see Memory and Startup Settings in the “Configuring Caché” chapter of the Caché System Administration Guide.

Temporary Globals

Temporary (scratch) globals should be local to the application server, assuming they do not contain data that needs to be globally shared. Often, temporary globals are highly active and write-intensive. This may penalize other ECP application servers sharing the ECP connection if the temporary globals are located on the data server.

Multiple ECP Channels

InterSystems strongly discourages establishing multiple duplicate ECP channels between an application server and a database server to try to increase bandwidth. You run the dangerous risk of having locks and updates for a single logical transaction arrive out-of-sync on the database server, which may result in data inconsistency.

Load-balanced Application Servers

Connecting users to application servers in a round-robin or load-balancing scheme may diminish the benefit of caching on the application server. This is particularly true if users work in functional groups that have a tendency to read the same data. As these users are spread among application servers, each application server may end up requesting exactly the same data from the data server, which could lead to increased block invalidation as blocks are modified on one application server and refreshed across other application servers. This is somewhat subjective, but someone very familiar with the application characteristics should consider this possible condition.

Repeated References to Undefined Globals

Repeated references to a global that is not defined (for example, $Data(^x(1)) where ^x is not defined) always requires a network operation to test if the global is defined on the ECP data server.

By contrast, repeated references to undefined nodes within a defined global (for example, $Data(^x(1)) where any other node in ^x is defined) does not require a network operation once the relevant portion of the global (^x) is in the application server cache.

This behavior differs significantly from that of a non-networked application. With local data, repeated references to the undefined global are highly optimized to avoid unnecessary work. Designers porting an application to a networked environment may wish to review the use of globals that are sometimes defined and sometimes not. Often it is sufficient to make sure that some other node of the global is always defined.

The $Increment Function and Application Counters

A common operation in online transaction processing systems is generating a series of unique values for use as record numbers or the like. In a typical relational application, this is done by defining a table that contains a “next available” counter value. When the application needs a new identifier, it locks the row containing the counter, increments the counter value, and releases the lock. Even on a single-server system, this becomes a concurrency bottleneck: application processes spend more and more time waiting for the locks on this common counter to be released. In a networked environment, it is even more of a bottleneck at some point.

Caché addresses this by providing the $Increment function, which automatically increments a counter value (stored in a global) without any need of application-level locking. Concurrency for $Increment is built into the Caché database engine as well as ECP, making it very efficient for use in single-server as well as in distributed applications.

Applications built using the default structures provided by Caché objects (or SQL) automatically use $Increment to allocate object identifier values.

$Increment is a synchronous operation involving journal synchronization when executed over ECP. For this reason, $Increment over ECP is a relatively slow operation, especially compared to others which may or may not already have data cached (either in the application server buffer pool or the database server buffer pool). The impact of this may be even greater in a mirrored environment due to network latency between the failover members. For this reason, it may be useful to redesign an application to assign a batch of new values to each application server and use $Increment with that local batch within each application server, involving the database server only when a new batch of values is needed. (This approach cannot be used, however, when consecutive application counter values are required.) The $Sequence function can also be helpful in this context, as an alternative to or used in combination with $Increment.

ECP-related Errors

There are several runtime errors that can occur on a system using ECP. An ECP-related error may occur immediately after a command is executed or, in the case of commands that are asynchronous in nature, such as Kill, the error occurs a short time after the command completes.

<NETWORK> Errors

A <NETWORK> error indicates that an error has occurred that could not be handled by the normal ECP recovery mechanism.

In an application, it is always acceptable to halt a process or roll back any pending work whenever a <NETWORK> error is received. Some <NETWORK> errors are essentially fatal error conditions. Others indicate a temporary condition that might clear up soon. However, an expected programming practice is to always roll back any pending work in response to a <NETWORK> error and start the current transaction over again from the beginning.

A <NETWORK> error on a get-type request such as $Data or $Order can often be retried manually rather than simply rolling back the transaction immediately. ECP tries to avoid giving a <NETWORK> error that would lose data, but gives an error more freely for requests that are read-only.

Rollback Only Condition

The application-side rollback-only condition occurs if the data server detects a network failure during a transaction initiated by the application server. It enters a state where all network requests are met with errors until the transaction is rolled back.