%HadoopGateway.MapReduceResult

This class is an iterator over the output of a MapReduce application, which may consist of one or more result files in HDFS. Result files are assumed to be text files with lines separated by the character sequence specified by the config option "LineSeparator" (which defaults to a single LINEFEED character). Each line consists of an application-defined key and value, separated by a TAB character. The key and value may each consist of one or more text fields, separated by an application-defined delimiter, which may be one or more characters long. A different delimiter may be used in the key and the value.
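The line format described above can be modelled outside Caché. The following Python sketch is only an illustration of the parsing rule, not part of %HadoopGateway; the function names are hypothetical. It assumes that the key ends at the first TAB and that empty lines (such as the one after a trailing separator) are skipped.

```python
def split_result_text(text, line_separator="\n"):
    """Split the text of one result file into lines.

    Assumption: empty lines (e.g. after a trailing separator) are dropped.
    """
    return [line for line in text.split(line_separator) if line != ""]


def parse_result_line(line, key_delimiter=None, value_delimiter=None):
    """Split one result line into (key_fields, value_fields).

    Assumption: the key ends at the first TAB character in the line.
    A delimiter of None means the portion is kept as a single field.
    """
    key, _, value = line.partition("\t")
    key_fields = key.split(key_delimiter) if key_delimiter else [key]
    value_fields = value.split(value_delimiter) if value_delimiter else [value]
    return key_fields, value_fields


if __name__ == "__main__":
    sample = "2015|01\t42,7\n2015|02\t13,9\n"
    for line in split_result_text(sample):
        # Key fields use "|" and value fields use ",": the two delimiters may differ.
        print(parse_result_line(line, key_delimiter="|", value_delimiter=","))
```

The delimiters may be longer than one character, which str.split handles without change.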

To iterate over a given MapReduce result, create a MapReduceResult instance by calling the %CreateMapReduceResult method of a connected instance of %HadoopGateway.Connection. Pass as the first argument the pathname within HDFS of the directory containing the result files. Optionally pass as the second argument a character sequence that overrides the "LineSeparator" config option value for the files of this MapReduce result. Use %ReadLine() to read the next line directly from HDFS. Use %ResetToFirst() to cause the next call to %ReadLine() to read the first line of the first result file. Use %Import() to import the entire MapReduce result into a new or existing Caché table.

Methods

method %Import(pClassName As %String, pKeyColumnDefs As %String, pValueColumnDefs As %String, pKeyDelimiter As %String, pValueDelimiter As %String)

Imports the entire MapReduce result into a new or existing table, depending on the value of the configuration option "ImportTableExists". If the option value is "error" (the default), creates a new table if no table of the name specified by pClassName exists, else throws an exception. If the option value is "drop", always creates a new table, dropping the existing table if it exists. If the option value is "reuse", creates a new table if a table of the specified name does not exist, otherwise imports into the existing table (whose definition must be compatible with the names and types of fields specified).
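The three "ImportTableExists" settings amount to a small decision table. The Python sketch below restates that table for illustration; plan_import and its return values are hypothetical names, not part of the %HadoopGateway API.

```python
def plan_import(table_exists, import_table_exists="error"):
    """Return the action described for %Import: 'create', 'drop_and_create', or 'reuse'.

    Raises RuntimeError where the documentation says an exception is thrown.
    """
    if import_table_exists not in ("error", "drop", "reuse"):
        raise ValueError("unknown ImportTableExists value: %r" % import_table_exists)
    if not table_exists:
        # All three settings create a new table when none exists.
        return "create"
    if import_table_exists == "error":
        raise RuntimeError("table already exists")
    if import_table_exists == "drop":
        return "drop_and_create"
    # "reuse": import into the existing table, whose definition must be compatible.
    return "reuse"


if __name__ == "__main__":
    for option in ("drop", "reuse"):
        print(option, "->", plan_import(True, option))
```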

If only pClassName is specified, creates a table with two string columns, called key and value, and stores the key and value portions of the result lines in those columns, respectively. If pKeyColumnDefs or pValueColumnDefs is specified, it defines the columns that hold the fields parsed from the key or value portion of the result lines, respectively. If the key portion is to be parsed into multiple fields, then pKeyDelimiter must be used to specify a delimiter of one or more characters to be used in parsing the key portion. If the value portion is to be parsed into multiple fields, then pValueDelimiter must be used to specify a delimiter of one or more characters to be used in parsing the value portion.

pKeyColumnDefs and pValueColumnDefs are each comma-separated lists of column definitions. Each column definition consists of a name, optionally followed by a type specification, separated from the name by white space. If there is no type specification, the type is assumed to be string (i.e. Caché %String / SQL VARCHAR). The type specification may use either Caché Object syntax (e.g. As %Integer) or SQL syntax (e.g. int), and may specify any Caché Object or SQL datatype, or any persistent class, but may not specify a serial or transient class.

In type specifications using Caché syntax, the type name may be followed, after white space, by a parenthesized, comma-separated list of parameters. Each parameter must be appropriate to the type and must be one of the following supported parameters: MAXLEN, MAXVAL, MINVAL, or SCALE. If other parameters need to be specified, define and compile the class prior to calling %Import, defining all necessary property parameters; set the config option "ImportTableExists" to "reuse"; and specify the same column names and types when calling %Import, to ensure that appropriate conversions are performed.

In type specifications using SQL syntax, the type name may be followed, when appropriate to the type, by a parenthesized length, precision, or precision and scale. If other column attributes (e.g. "UNIQUE", "NOT NULL") need to be specified, define the table using an SQL CREATE TABLE statement prior to calling %Import, defining all necessary column attributes; set the config option "ImportTableExists" to "reuse"; and specify the same column names and types when calling %Import, to ensure that appropriate conversions are performed.
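Because a type specification can itself contain commas, for example in a parameter list or in a precision and scale, a column definition list cannot be split on every comma. The Python sketch below shows one way to read such a list. It is an illustration only and not the parser used by %Import; the function names are hypothetical. It assumes that commas inside parentheses or single-quoted strings belong to the type specification.

```python
def split_top_level(text, separator=","):
    """Split text on separator, ignoring separators inside (...) or '...'."""
    parts, current, depth, in_quote = [], [], 0, False
    for ch in text:
        if ch == "'":
            in_quote = not in_quote
        elif not in_quote and ch == "(":
            depth += 1
        elif not in_quote and ch == ")":
            depth -= 1
        if ch == separator and depth == 0 and not in_quote:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append("".join(current).strip())
    return parts


def parse_column_defs(defs):
    """Return (name, type_spec) pairs; a missing type_spec defaults to '%String'."""
    columns = []
    for item in split_top_level(defs):
        if not item:
            continue  # tolerate an empty list or a stray comma
        pieces = item.split(None, 1)  # the name ends at the first white space
        name = pieces[0]
        type_spec = pieces[1] if len(pieces) > 1 else "%String"
        columns.append((name, type_spec))
    return columns


if __name__ == "__main__":
    defs = "word, total As %Integer, price numeric(10,2), amount As %Numeric (MAXVAL=100, SCALE=2)"
    for column in parse_column_defs(defs):
        print(column)
```

The type specification is returned as raw text, so the same sketch covers both the Caché Object syntax and the SQL syntax.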

If the type is %Date or %Timestamp, or an SQL type that maps to one of these, it can optionally be followed, after one or more white space characters, by a single-quoted format string (e.g. 'YYYY-MM-DD'), which is used as an argument of one of the SQL functions TO_DATE or TO_TIMESTAMP, depending on the type, to convert the MapReduce result field from text to the specified type. If no format string is specified, a MapReduce result field imported to a %Date column is assumed either already to be in Caché logical format, or to be in the default format of 'DD MON YYYY', and a MapReduce result field imported to a %Timestamp column is assumed to be in the default format 'DD MON YYYY HH:MI:SS'. Columns of type %Time do not require (or allow) a format string, and may be imported from MapReduce result fields that are either in Caché logical format (i.e. an integer representing the number of seconds since midnight), or in any of the display formats supported by $ztimeh and the %Time DisplayToLogical class method ('HH-MI-SS[.FFF]', 'HH-MI', 'HH-MI-SS[.FFF]A[M]/P[M]/N[OON]/M[IDNIGHT]', or 'HH-MI[M]/P[M]/N[OON]/M[IDNIGHT]').

If the type is any of the numeric datatypes (%BigInt, %Boolean, %Counter, %Currency, %Decimal, %Double, %Float, %Integer, %Numeric, %SmallInt, %TinyInt, or any SQL type that maps to one of these), then the value of the MapReduce result field is converted to the specified type using the SQL function TO_NUMBER. For all types other than %Date, %Timestamp, numeric types, and persistent classes, the value of the MapReduce result field is directly inserted in the import table column, following whatever conversion rules apply when performing SQL insert of a string literal to a column of the specified datatype.

If the type is a persistent class, then the cardinality keyword one may be specified after the type name, separated by whitespace, to indicate that this is the one side of a one-to-many relationship. This must be followed, after additional whitespace, by the name of the inverse relationship. Other relationship cardinalities (many, parent, and child) are not supported. For a relationship to be fully usable, the inverse relationship must be defined, with cardinality many, in the related class. This must be done after creating the import class, because compilation of the related class fails if it defines a relationship to the import class before the import class exists.

If the type is a persistent class, and it is not followed by keyword one and an inverse relationship name, then the column specification defines a reference to the specified class. In either case, values imported for this column are assumed to be ids of objects of the specified persistent class.

As each MapReduce result line is imported, its key and value portions are parsed into fields using the delimiters specified by pKeyDelimiter and pValueDelimiter, respectively. The fields are stored, in the left-to-right order in which they are encountered, in the columns defined by pKeyColumnDefs and pValueColumnDefs, respectively, in the left-to-right order in which those columns were defined. If there are fewer actual key fields than key columns, the extra key columns are set to null; if there are more actual key fields than key columns, the extra key fields are ignored. Similarly, if there are fewer actual value fields than value columns, the extra value columns are set to null; if there are more actual value fields than value columns, the extra value fields are ignored. This makes it possible to import MapReduce result files in which not every line has the same number of fields. However, best practice is to design your MapReduce application to store the same number of key and value fields in every line, and to specify an import table definition that has exactly those numbers of key and value columns, unless you are sure that extra fields can safely be ignored and missing fields can safely be imported as nulls.
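The padding and truncation rule can be illustrated with a short Python sketch. It models only the mapping of fields to columns; the function name is hypothetical and None stands in for an SQL null.

```python
def fields_to_columns(fields, column_names):
    """Assign fields to columns left to right.

    Missing fields become None (null); fields beyond the last column are ignored.
    """
    kept = list(fields[:len(column_names)])
    kept += [None] * (len(column_names) - len(kept))
    return dict(zip(column_names, kept))


if __name__ == "__main__":
    columns = ["year", "month"]
    print(fields_to_columns(["2015"], columns))              # fewer fields than columns
    print(fields_to_columns(["2015", "01", "x"], columns))   # more fields than columns
```

The same function applies to the key columns and to the value columns, each with its own list of fields.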

The columns specified by pKeyColumnDefs constitute a unique key. When a new table is created, an IDKEY index is defined for this key. If any imported rows have keys identical to those of existing rows, whether encountered during the initial import or during a subsequent import that reuses this table, an SQL exception is thrown reporting a violation of the 'IDKEY' constraint.
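Since duplicate keys cause the import to fail, it can be useful to check a result for repeated keys beforehand. The following Python sketch is a hypothetical helper, not part of %HadoopGateway. It compares the raw key text before the first TAB, which is a simplification: if extra key fields are ignored during import, two lines with different raw keys could still produce the same key columns.

```python
from collections import Counter


def find_duplicate_keys(lines):
    """Return the key portions that occur on more than one line, in first-seen order."""
    counts = Counter(line.partition("\t")[0] for line in lines)
    return [key for key, count in counts.items() if count > 1]


if __name__ == "__main__":
    result_lines = ["apple\t3", "pear\t1", "apple\t5"]
    print(find_duplicate_keys(result_lines))
```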

method %ReadLine(pKey As %String, pValue As %String) as %String

Reads and returns the next line of the MapReduce result. Returns empty string if there is no next line. If pKey is specified, sets it to the key portion of the line (i.e. the portion before the TAB separator). If pValue is specified, sets it to the value portion of the line (i.e. the portion after the TAB separator).

method %ResetToFirst()

Resets the read position so that the next call to %ReadLine() returns the first line of the MapReduce result.
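The read and reset behavior can be summarized with a small in-memory model. This Python sketch is a stand-in for illustration, not the real class: ResultReaderModel and its method names are hypothetical, and it holds all lines in memory, whereas the documented methods read directly from HDFS.

```python
class ResultReaderModel:
    """In-memory model of the documented read semantics across several result files."""

    def __init__(self, files, line_separator="\n"):
        # files: list of strings, each holding the full text of one result file.
        # Assumption: empty lines (e.g. after a trailing separator) are skipped.
        self._lines = [
            line
            for text in files
            for line in text.split(line_separator)
            if line != ""
        ]
        self._position = 0

    def read_line(self):
        """Return (line, key, value); line is '' when there is no next line."""
        if self._position >= len(self._lines):
            return "", "", ""
        line = self._lines[self._position]
        self._position += 1
        key, _, value = line.partition("\t")
        return line, key, value

    def reset_to_first(self):
        """Make the next read_line() return the first line of the first file."""
        self._position = 0


if __name__ == "__main__":
    reader = ResultReaderModel(["a\t1\nb\t2\n", "c\t3\n"])
    while True:
        line, key, value = reader.read_line()
        if line == "":
            break  # an empty string signals the end of the result
        print(key, "=>", value)
```

A loop that stops when the returned line is empty mirrors the documented end-of-result convention.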
