
%HadoopGateway.Connection

This class represents a connection to a Hadoop instance via Hadoop Gateway. It provides methods for:

Connecting, disconnecting, and checking the state of a connection: %Connect(), %Disconnect(), %IsConnected(), %CurrentHadoopHome(), %CurrentHost(), %CurrentPort().

Creating an iterator to import a MapReduce result: %CreateMapReduceResult().

Executing commands on the Hadoop Namenode machine: %Execute().

Setting and getting transient config options for this connection: %SetOption(), %SetClassOption(), %SetNamespaceOption(), %GetOption(), %GetClassOption(), %GetNamespaceOption(). Transient values set by these set methods override the persistent values for the same options, set via the same-named class methods of %HadoopGateway.Config. They apply only to this connection, and persist across calls to %Connect() and %Disconnect(), for as long as this connection instance is in memory. They do not apply to a different connection instance that connects using the same "HadoopHome", "Host", and "Port" values. These get methods return the transient values of specified options if they have been set for this connection, else they return the persistent values that were set via the %HadoopGateway.Config set methods, or the defaults if no persistent values have been set. For options that can be specified both globally and for a class, if a transient value has not been set for a given class, %GetClassOption() returns the transient global value if one has been set, else it returns the value that would be returned by the %HadoopGateway.Config %GetClassOption class method. See %HadoopGateway.Config for details of available options.

Synchronizing data from Caché to Hadoop files: %StartSync(), %StopSync(), %Synchronize(), %SynchronizeAll(), %TimeSync(), %GetExportFilePathname(), %GetCurrentSyncJobID(), %GetSyncJobIDs(), %GetAllSyncJobIDs(), %GetSyncJobUpdateCount(), %GetSyncJobSyncCount(), %GetSyncJobStartTime(), %GetSyncJobLastTime(), %GetSyncJobErrors(), %GetSyncJobClassName(), %GetSyncJobUserName(), %GetSyncJobHadoopHome(), %GetSyncJobHost(), %SyncJobIsAlive(). These methods allow data from a Caché table to be synchronized to the Hadoop Distributed File System (HDFS), where it can be used as input for MapReduce analysis. Use %StartSync() to start near-real-time synchronization of a table to HDFS. Use %StopSync() to stop synchronizing a table. Use %Synchronize() to perform one-time synchronization of all not-yet-synchronized inserts, updates, and deletes to a table. Use %SynchronizeAll() to perform one-time synchronization of all data in a table, including data that existed prior to adding "DSTIME=AUTO" to the class definition. Get information about current and previously running background synchronization jobs by using the methods %GetCurrentSyncJobID(), %GetSyncJobIDs(), %GetAllSyncJobIDs(), %GetSyncJobUpdateCount(), %GetSyncJobSyncCount(), %GetSyncJobStartTime(), %GetSyncJobLastTime(), %GetSyncJobErrors(), %GetSyncJobClassName(), %GetSyncJobUserName(), %GetSyncJobHadoopHome(), %GetSyncJobHost(), and %SyncJobIsAlive().
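The lookup order described above for %GetClassOption() (transient class value, then transient global value, then the value from %HadoopGateway.Config) can be modeled in a few lines. The Python sketch below only illustrates that precedence; it is not the Caché implementation. The function name resolve_class_option, the dictionaries, the class names, and the demo_persistent stand-in are all hypothetical.

```python
from typing import Callable, Dict, Tuple


def resolve_class_option(
    class_name: str,
    option: str,
    transient_class: Dict[Tuple[str, str], str],
    transient_global: Dict[str, str],
    persistent_lookup: Callable[[str, str], str],
) -> str:
    """Model of the documented lookup order for a class-scoped option."""
    key = (class_name, option)
    # 1. A transient value set for this class on this connection wins.
    if key in transient_class:
        return transient_class[key]
    # 2. Otherwise fall back to the transient global value, if one was set.
    if option in transient_global:
        return transient_global[option]
    # 3. Otherwise use whatever the persistent configuration would return.
    return persistent_lookup(class_name, option)


def demo_persistent(class_name: str, option: str) -> str:
    # Hypothetical stand-in for the %HadoopGateway.Config lookup.
    return {"Delimiter": ","}.get(option, "")


# Hypothetical transient settings held by one connection instance.
transient_class = {("Demo.Patient", "Delimiter"): "|"}
transient_global = {"Delimiter": "\t"}
```

In this model, Demo.Patient resolves "Delimiter" to its own transient class value, while any other class falls through to the transient global value, and a connection with no transient settings at all gets the persistent value.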

Prerequisites:

The configuration option "HadoopHome" must be set to the pathname of the root of a Hadoop installation. (Use methods of %HadoopGateway.Config to set persistent values of this and other options, or use the methods %SetOption(), %SetClassOption(), and %SetNamespaceOption() of the present class to set transient option values. All other options have reasonable defaults, but "HadoopHome" defaults to "/hadoophome", which is unlikely to be the actual pathname.)

The Java Gateway must be running on the machine which is the "namenode" of the Hadoop instance. Start it in an operating system command shell with the command:

java -cp classpath com.intersys.gateway.JavaGateway port [logfile]

where:

classpath must include the Caché jar files cachejdbc.jar and cachegateway.jar.

port is the port number on which Java Gateway listens. Hadoop Gateway assumes a default Java Gateway port number of 56789, so if you use a different number you must set the configuration option "port" to that number.

logfile is an optional argument specifying the pathname of a file in which Java Gateway logs all messages and errors.
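As an illustration of how the pieces of this command line fit together, here is a small Python sketch that assembles the argument list. The helper name java_gateway_command and the jar directory used in the usage note are assumptions made for the example; only the main class name, the two jar file names, and the default port 56789 come from the text above.

```python
import os
from typing import List, Optional, Sequence


def java_gateway_command(
    jar_paths: Sequence[str],
    port: int = 56789,
    logfile: Optional[str] = None,
) -> List[str]:
    """Build the argument list for starting the Java Gateway."""
    # The classpath separator is platform dependent (":" or ";").
    classpath = os.pathsep.join(jar_paths)
    argv = ["java", "-cp", classpath, "com.intersys.gateway.JavaGateway", str(port)]
    # The log file argument is optional and comes last.
    if logfile:
        argv.append(logfile)
    return argv
```

For example, java_gateway_command(["/opt/cache/lib/cachejdbc.jar", "/opt/cache/lib/cachegateway.jar"]) yields a list that could be passed to subprocess.Popen on the namenode machine; the /opt/cache/lib directory is hypothetical and should be replaced with the actual location of the jar files.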

A table to be synchronized to Hadoop has these additional prerequisites:

Its class definition must specify the parameter setting DSTIME=AUTO.

Optionally, the class definition may include a class query called HadoopExport. This query is used to determine which columns of the table to export in which order, and any desired conversions or calculations. The query's WHERE clause must contain the predicate "%id=:pID", and the query definition must specify the keyword SqlProc. Here is an example of a HadoopExport query definition:

Query HadoopExport(pID As %String) As %SQLQuery [ SqlProc ]
{
   select patient, to_char(fromtime, 'YYYY-MM-DD'), 
          to_char(totime,'YYYY-MM-DD'), eventtype 
          from HSAA.EventCareProviderSite where %id=:pID
}

If no HadoopExport query is specified, the query select * from tablename where %id=:pID is used.

Data is exported in delimited text format, with delimiter defaulting to "," (comma) and line separator defaulting to LF (line feed character, $c(10)). These can be customized using configuration options "Delimiter" and "LineSeparator" respectively.
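To make the export format concrete, the following Python sketch renders rows as delimited text using the default delimiter and line separator named above. It is a model only: the function name format_export and the sample row are hypothetical, and the sketch assumes that values are written without quoting or escaping and that every line, including the last, ends with the line separator. The source does not state either detail.

```python
from typing import Iterable, Sequence


def format_export(
    rows: Iterable[Sequence[object]],
    delimiter: str = ",",
    line_separator: str = "\n",
) -> str:
    """Render rows as delimited text (assumed: no quoting, trailing separator)."""
    return "".join(
        delimiter.join(str(value) for value in row) + line_separator
        for row in rows
    )


# Hypothetical row shaped like the HadoopExport example query:
# patient, fromtime, totime, eventtype
sample_rows = [("P001", "2012-01-05", "2012-01-07", "Admit")]
```

Passing delimiter="|" or a different line_separator mirrors the effect of changing the "Delimiter" and "LineSeparator" options.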

Methods

method %Connect(pHadoopHome As %String, pHost As %String, pPort As %String)

Connects via Java Gateway to the Hadoop instance on the machine specified by pHost whose root directory is specified by pHadoopHome, connecting to Java Gateway using the port number specified by pPort. If any of the arguments is omitted, the value of the corresponding config option is used.

Values of any arguments specified are set as the transient values of the corresponding config options for this connection, so that if it is disconnected, it can re-connect to the same Hadoop instance by calling this method with no arguments.

The following methods of this class require this connection to be connected, and will throw an exception if called for a connection that is not connected: %StartSync(), %Synchronize(), %SynchronizeAll(), %TimeSync(), %CreateMapReduceResult(), %Execute().

method %CreateMapReduceResult(pDirPath As %String(MAXLEN=""), pLineSeparator As %String) as %HadoopGateway.MapReduceResult
Creates an instance of %HadoopGateway.MapReduceResult, which can be used to iterate over the lines of a Hadoop MapReduce application's result files, or to import the result data to a Caché table.

pDirPath specifies the pathname within HDFS of a directory containing one or more MapReduce result files, over the lines of which this MapReduceResult instance will iterate. pLineSeparator optionally specifies a sequence of one or more characters used to separate lines in the MapReduce result files. If not specified, it defaults to the value of the "lineseparator" configuration option.

This method throws an exception if the connection is not currently connected.

method %CurrentHadoopHome() as %String
Returns the home directory path of the Hadoop instance to which this connection is currently connected, or empty string if this connection is not currently connected.

method %CurrentHost() as %String
Returns the hostname of the Namenode machine of the Hadoop instance to which this connection is currently connected, or empty string if this connection is not currently connected.

method %CurrentPort() as %String
Returns the port number of the Java Gateway server to which this connection is currently connected, or empty string if this connection is not currently connected.

method %Disconnect()

Disconnects this connection. The connection retains any transient config options set via %SetOption(), %SetClassOption(), or %SetNamespaceOption(), and can be re-connected to the same Hadoop instance, or connected to a different Hadoop instance, by calling %Connect().

method %Execute(pCommandLine As %String(MAXLEN=""), pVerbose As %Boolean = 1, Output pErrorOutput As %String(MAXLEN="")) as %String
Executes the command specified by pCommandLine on the Hadoop Namenode machine to which this connection is connected. If pVerbose is non-zero, writes any command output to the console. Returns the standard output of the command, with lines separated by CR-LF ($c(13)_$c(10)). Returns the standard error output of the command in the reference argument pErrorOutput.

This method facilitates executing MapReduce jobs, starting and stopping the Hadoop instance, and performing other Hadoop management tasks from within Caché ObjectScript running on the client machine. Commands are executed via the Java Gateway, and execute with the operating system user, group, and privileges of the user who started the Java Gateway.

This method throws an exception if the connection is not currently connected.
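Because %Execute() returns standard output as a single string with lines separated by CR-LF, a caller that receives that string typically splits it into lines before processing. The Python sketch below shows one way to do this; the function name split_output_lines is hypothetical, and because the source does not say whether the string ends with a trailing CR-LF, the sketch tolerates both cases.

```python
from typing import List


def split_output_lines(output: str) -> List[str]:
    """Split command output whose lines are separated by CR-LF."""
    if output == "":
        return []
    lines = output.split("\r\n")
    # Drop the empty element produced by a trailing CR-LF, if there is one.
    if lines[-1] == "":
        lines.pop()
    return lines
```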

method %GetAllSyncJobIDs(pClassName As %String(MAXLEN="")) as %String
Returns a list of job IDs for all of the background synchronization jobs that have been run for the table specified by pClassName, since this Caché instance was started, for any Hadoop instance. The return value is in $list format. Returns empty string if no background synchronization jobs have been run for this table since this Caché instance was started.

method %GetClassOption(pClassName As %String, pOptionName As %String) as %String
Returns the transient value, for the current connection, of the option specified by pOptionName, within the scope of the class specified by pClassName. If no transient value for this option has been set at class scope, its transient global value is returned if it has one, else its persistent value at class scope. See %HadoopGateway.Config for details of available options, and information about setting and getting persistent option values.

method %GetCurrentSyncJobID(pClassName As %String(MAXLEN=""), pHadoopHome As %String, pHost As %String) as %Integer
Returns the job ID of the background synchronization job currently running for the table specified by pClassName, or 0 if no such job is running.

If the optional arguments pHadoopHome and pHost are not specified, returns the job ID of the background job connected to the same Hadoop instance as the current connection. If the current connection is not connected, the Hadoop instance is the one identified by the values that this connection's %GetOption() method returns for the options "HadoopHome" and "Host".

method %GetExportFilePathname(pClassName As %String(MAXLEN="")) as %String
Returns the full pathname within HDFS of the file to which data is synchronized from the class specified by pClassName. By default, the pathname is formed from the HDFS home directory of the user who started the Java Gateway instance used for synchronizing, followed by a directory named for the current Caché namespace (all-uppercase), followed by a filename consisting of the name of the synchronized class (including package name) with the suffix ".data". The pathname can be customized using the global configuration option "rootpath", the namespace configuration option "pathname", and the class configuration option "pathname" (see %HadoopGateway.Config for details).
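The default pathname rule described above can be sketched as follows. This Python model covers only the default case and ignores the "rootpath" and "pathname" options; the function name default_export_pathname, the HDFS home directory /user/cacheusr, and the namespace used in the usage note are hypothetical, and joining the components with "/" is an assumption about how the path is assembled.

```python
import posixpath


def default_export_pathname(hdfs_home: str, namespace: str, class_name: str) -> str:
    """Model the default HDFS export pathname for a synchronized class."""
    # HDFS home directory, then the namespace in uppercase,
    # then the full class name (package included) plus ".data".
    return posixpath.join(hdfs_home, namespace.upper(), class_name + ".data")
```

Under these assumptions, default_export_pathname("/user/cacheusr", "Samples", "HSAA.EventCareProviderSite") gives /user/cacheusr/SAMPLES/HSAA.EventCareProviderSite.data.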
method %GetNamespaceOption(pNamespace As %String, pOptionName As %String) as %String
Returns the transient value, for the current connection, of the option specified by pOptionName, within the scope of the namespace specified by pNamespace. If no transient value for this option has been set at namespace scope, its persistent value is returned. See %HadoopGateway.Config for details of available options, and information about setting and getting persistent option values.

method %GetOption(pOptionName As %String) as %String
Returns the transient value, for the current connection, of the option specified by pOptionName. If no transient value has been set for this option, its persistent value is returned. See %HadoopGateway.Config for details of available options, and information about setting and getting persistent option values.

method %GetSyncJobClassName(pJobID As %Integer) as %String
Returns the class name of the table synchronized by the background synchronization job specified by pJobID. Throws an exception if the specified job has not existed since this Caché instance was started.

method %GetSyncJobErrors(pJobID As %Integer) as %String
Returns a list of error statuses returned by synchronize passes of the background synchronization job specified by pJobID. The return value is in $list format. Returns empty string if no synchronize passes returned errors. Throws an exception if the specified job has not existed since this Caché instance was started.

method %GetSyncJobHadoopHome(pJobID As %Integer) as %String
Returns the Hadoop home pathname of the Hadoop instance to which data was synchronized by the background synchronization job specified by pJobID. Throws an exception if the specified job has not existed since this Caché instance was started.

method %GetSyncJobHost(pJobID As %Integer) as %String
Returns the hostname of the namenode of the Hadoop instance to which data was synchronized by the background synchronization job specified by pJobID. Throws an exception if the specified job has not existed since this Caché instance was started.

method %GetSyncJobIDs(pClassName As %String(MAXLEN=""), pHadoopHome As %String, pHost As %String) as %String
Returns a list of job IDs for the background synchronization jobs that have been run for the table specified by pClassName, since this Caché instance was started, for the Hadoop instance specified by pHadoopHome and pHost. The return value is in $list format. Returns empty string if no background synchronization jobs have been run for this table since this Caché instance was started.

If the optional arguments pHadoopHome and pHost are not specified, lists jobs that synchronized to the same Hadoop instance as the current connection. If the current connection is not connected, the Hadoop instance is the one identified by the values that this connection's %GetOption() method returns for the options "HadoopHome" and "Host".

method %GetSyncJobLastTime(pJobID As %Integer) as %String
Returns the last time, in $horolog format, that the background synchronization job specified by pJobID completed a synchronize pass. If the job is no longer running, this is the time the job completed. If the job has not yet completed any synchronize pass, this is the time the job started. Throws an exception if the specified job has not existed since this Caché instance was started.

method %GetSyncJobStartTime(pJobID As %Integer) as %String
Returns the start time, in $horolog format, of the background synchronization job specified by pJobID. Throws an exception if the specified job has not existed since this Caché instance was started.
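%GetSyncJobLastTime() and %GetSyncJobStartTime() return times in $horolog format: a string of the form "days,seconds", where days counts from December 31, 1840 (day 0) and seconds counts from midnight. If such a value is passed to code outside Caché, it can be converted as in the Python sketch below. The function name horolog_to_datetime is hypothetical, and the result is a naive datetime, since a $horolog value carries no time zone information.

```python
from datetime import datetime, timedelta

# Day 0 of the $horolog day count.
HOROLOG_EPOCH = datetime(1840, 12, 31)


def horolog_to_datetime(value: str) -> datetime:
    """Convert a "days,seconds" $horolog string to a naive datetime."""
    days_text, seconds_text = value.split(",")
    return HOROLOG_EPOCH + timedelta(days=int(days_text), seconds=int(seconds_text))
```

For example, horolog_to_datetime("58074,0") corresponds to midnight on January 1, 2000.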
method %GetSyncJobSyncCount(pJobID As %Integer) as %Integer
Returns the number of synchronize passes that have been made by the background synchronization job specified by pJobID. Throws an exception if the specified job has not existed since this Caché instance was started.

method %GetSyncJobUpdateCount(pJobID As %Integer) as %Integer
Returns the number of inserts, updates, and deletes that have been synchronized by the background synchronization job specified by pJobID. Throws an exception if the specified job has not existed since this Caché instance was started.

method %GetSyncJobUserName(pJobID As %Integer) as %String
Returns the user name of the user who started the background synchronization job specified by pJobID. Throws an exception if the specified job has not existed since this Caché instance was started.

method %IsConnected() as %Boolean
Returns 1 if this connection is currently connected, else returns 0.

method %SetClassOption(pClassName As %String, pOptionName As %String, pOptionValue As %String)
Sets the option specified by pOptionName to have the value specified by pOptionValue, within the scope of the class specified by pClassName, as a transient value for this connection only. If pOptionValue is not specified, the option reverts to its transient value at global scope, or to its persistent value at class scope if no transient value is set at global scope, or if it is an option that does not apply at global scope.

method %SetNamespaceOption(pNamespace As %String, pOptionName As %String, pOptionValue As %String)
Sets the option specified by pOptionName to have the value specified by pOptionValue, within the scope of the namespace specified by pNamespace, as a transient value for this connection only. If pOptionValue is not specified, the option reverts to its persistent value. See %HadoopGateway.Config for details of available options, and information about setting and getting persistent option values.

method %SetOption(pOptionName As %String, pOptionValue As %String)
Sets the option specified by pOptionName to have the value specified by pOptionValue, with global scope, as a transient value for this connection only. If pOptionValue is not specified, the option reverts to its persistent value. See %HadoopGateway.Config for details of available options, and information about setting and getting persistent option values.

method %StartSync(pClassName As %String(MAXLEN="")) as %Integer

Starts a background job to synchronize the table specified by pClassName. Returns the job ID of the newly started job, or 0 if no job was started because a job for the specified table was already running. All inserts, updates, and deletes which were not already synchronized to Hadoop prior to this call are synchronized to Hadoop in near-real time, until %StopSync() is called, or until this Caché instance is shut down.

Inserts, updates, or deletes which were performed prior to adding "DSTIME=AUTO" to the class definition are not synchronized. To synchronize these, call %SynchronizeAll().

This method throws an exception if the connection is not currently connected.

method %StopSync(pClassName As %String(MAXLEN=""), pHadoopHome As %String, pHost As %String) as %Integer
Stops synchronizing the table specified by pClassName.

If the optional arguments pHadoopHome and pHost are not specified, stops the background job connected to the same Hadoop instance as the current connection. If the current connection is not connected, the Hadoop instance is the one identified by the values that this connection's %GetOption() method returns for the options "HadoopHome" and "Host". A background job connected to a different Hadoop instance can be stopped by using the arguments pHadoopHome and pHost to specify the values for that instance.

Returns the job ID of the job that was stopped, or 0 if there was no such job running.

method %SyncJobIsAlive(pJobID As %Integer) as %Boolean
Returns 1 if the background synchronization job specified by pJobID is alive, else 0. Throws an exception if the specified job has not existed since this Caché instance was started.

method %Synchronize(pClassName As %String(MAXLEN=250), pVerbose As %Boolean = 0, Output pObjectsUpdated As %Integer)
Synchronizes the table specified by pClassName. All inserts, updates, and deletes which were not already synchronized to Hadoop prior to this call are synchronized to Hadoop.

Inserts, updates, or deletes which were performed prior to adding "DSTIME=AUTO" to the class definition are not synchronized. To synchronize these, call %SynchronizeAll().

This method throws an exception if the connection is not currently connected.

method %SynchronizeAll(pClassName As %String(MAXLEN=250), pVerbose As %Boolean = 0, Output pObjectsUpdated As %Integer)
Synchronizes all data in the table specified by pClassName to Hadoop. Use this method when you need to synchronize data that existed in the table prior to adding "DSTIME=AUTO" to the table's class definition. Otherwise, best practice is to use %Synchronize(), in order to avoid redundantly synchronizing data that has not changed since the most recent prior call to %Synchronize() or %SynchronizeAll().

This method throws an exception if the connection is not currently connected.

method %TimeSync(pClassName As %String(MAXLEN=""), pElapsedTime As %Numeric = 0, pUpdateCount As %Integer = 0, pThroughput As %Numeric = 0)
Benchmarks a one-time synchronization of the table specified by pClassName. Output argument pElapsedTime is set to the elapsed time in seconds (with fractional component). Output argument pUpdateCount is set to the number of inserts, updates, and deletes that are synchronized. Output argument pThroughput is set to the throughput in inserts/updates/deletes synchronized per second.

This method throws an exception if the connection is not currently connected.
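The relationship between the three values reported by %TimeSync() is simple arithmetic: throughput is the update count divided by the elapsed time in seconds. The Python sketch below restates it. The function name throughput is hypothetical, and returning 0 when the elapsed time is zero is an assumption, since the source does not describe that case.

```python
def throughput(update_count: int, elapsed_seconds: float) -> float:
    """Inserts, updates, and deletes synchronized per second."""
    if elapsed_seconds <= 0:
        # Assumed behavior for an unmeasurably short run.
        return 0.0
    return update_count / elapsed_seconds
```

For instance, 5000 synchronized changes in 2.5 seconds is a throughput of 2000 changes per second.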

