Collation

Collation specifies how values are ordered and compared, and is part of both Caché SQL and Caché Objects. There are two fundamental collations: numeric and string.

Numeric collation orders numbers based on the complete number in the following order: null, then negative numbers from largest to smallest, zero, then positive numbers from smallest to largest. This creates a sequence such as the following: –210, –185, –54, –34, -.02, 0, 1, 2, 10, 17, 100, 120.
String collation orders strings by collating on each sequential character. This creates an order such as the following: null, A, AA, AAA, AAB, AB, B. For numbers, this creates an order such as the following: –.02, –185, –210, –34, –54, 0, 1, 10, 100, 120, 17, 2.

The default string collation is SQLUPPER; this default is set for each namespace. SQLUPPER collation converts all letters to uppercase (for the purpose of collation), and appends a space character to the beginning of the string. This conversion is for the purposes of collation only; in Caché SQL strings are usually displayed in uppercase and lowercase letters, regardless of the collation applied, and the length of a string does not include the appended space character.

A timestamp is a string, and therefore follows the current string collation. However, because a timestamp is in ODBC format, the string collation is the same as chronological sequence, if leading zeros are specified.

A string expression (such as those using the scalar string functions LEFT or SUBSTR) makes its result collation EXACT.
Any comparison of two literals uses EXACT collation.

You can use the ObjectScript Sorts After operator to determine the relative collation sequence order of two values.

You can specify collation as follows:

Also see “SQL Collation and NLS Collations,” later in this chapter.

Collation Types

Collation can be specified as a keyword in the definition of a field/property or the definition of an index.

Collation can be specified by applying a collation function to a field name in a query clause. The % prefix is required when specifying a collation function.

Collation is in ascending ASCII/Unicode sequence, with the following transformations:

EXACT — Enforces case sensitivity for string data. Not recommended for use if your string data contains values in canonical numeric format (for example 123 or -.57).
MVR — (For compatibility with MultiValue database systems.) For a string containing both numeric and non-numeric characters. MVR collation divides the string into substrings, each substring containing either all numeric or all non-numeric characters. The numeric substrings are sorted in signed numeric order. The non-numeric substrings are sorted in case-sensitive ASCII collation sequence. (Note that this collation does not appear in the Studio New Index Wizard.)
SQLSTRING — Strips trailing whitespace (spaces, tabs, and so on), and adds one leading blank space to the beginning of the string. It collates any value containing only whitespace (spaces, tabs, and so on) as the SQL empty string. SQLSTRING supports an optional maxlen integer value.
SQLUPPER — Converts all alphabetic characters to uppercase, strips trailing whitespace (spaces, tabs, and so on), and then adds one leading space character to the beginning of the string. The reason this space character is appended is to force numeric values to be collated as strings (because the space character is not a valid numeric character). This transformation also causes SQL to collate the SQL empty string ('') value and any value containing only whitespace (spaces, tabs, and so on) as a single space character. SQLUPPER supports an optional maxlen integer value. Note that the SQLUPPER transform is not the same as the result of the SQL function UPPER.
TRUNCATE — Enforces case sensitivity for string data and (unlike EXACT) allows you to specify a length at which to truncate the value. This is useful when indexing exact data that is longer than what is supported for use in a subscript. It takes a positive integer argument, in the form %TRUNCATE(string,n), to truncate the string to the first n characters, which improves indexing and sorting on long strings. If you do not specify a length for TRUNCATE, it behaves identically to EXACT; while this behavior is supported. your definitions and code may be easier to maintain if you use TRUNCATE only when you have a length defined and EXACT when you do not.
PLUS — Makes the value numeric. A non-numeric string value is returned as 0.
MINUS — Makes the value numeric and changes its sign. A non-numeric string value is returned as 0.

Note:

There are also various legacy collation types, the use of which is not recommended.

In an SQL query, you can specify a collation function without parentheses %SQLUPPER Name or with parentheses %SQLUPPER(Name). If the collation function specifies truncation, the parentheses are required %SQLUPPER(Name,10).

Three collation types: SQLSTRING, SQLUPPER, and TRUNCATE support an optional maxlen integer value. If specified, maxlen truncates parsing of the string to the first n characters. This can be used to improve performance when indexing and sorting long strings. You can use maxlen in a query to sort on, group by, or return a truncated string value.

You can also perform collation type conversions using the %SYSTEM.Util.Collation()Opens in a new tab method.

Namespace-wide Default Collation

Each namespace has a current string collation setting. This string collation is defined for the data type in %Library.StringOpens in a new tab. The default is SQLUPPER. This default can be changed.

You can define the collation default on a per-namespace basis. By default, namespaces have no assigned collation, which means they use SQLUPPER collation. You can assign a different default collation to a namespace. This namespace default collation applies to all processes, and persists across Caché restarts until explicitly reset.

   SET stat=$$GetEnvironment^%apiOBJ("collation","%Library.String",.collval)
   WRITE "initial collation for ",$NAMESPACE,!
     ZWRITE collval
SetNamespaceCollation
  DO SetEnvironment^%apiOBJ("collation","%Library.String","SQLstring")
  SET stat=$$GetEnvironment^%apiOBJ("collation","%Library.String",.collnew)
    WRITE "user-assigned collation for ",$NAMESPACE,!
     ZWRITE collnew
ResetCollationDefault
  DO SetEnvironment^%apiOBJ("collation","%Library.String",.collval)
  SET stat=$$GetEnvironment^%apiOBJ("collation","%Library.String",.collreset)
     WRITE "restored collation default for ",$NAMESPACE,!
     ZWRITE collreset

Note that if you have never set the namespace collation default, $$GetEnvironment returns an undefined collation variable, such as .collval in this example. This undefined collation defaults to SQLUPPER.

Note:

If your data contains German text, uppercase collation may not be a desirable default. This is because the German eszett character ($CHAR(223)) has only a lowercase form. The uppercase equivalent is the two letters “SS”. SQL collations that convert to uppercase do not convert eszett, which remains unchanged as a single lowercase letter.

Table Field/Property Definition Collation

Within SQL, collation can be assigned as part of field/property definition. The data type used by a field determines its default collation. The default collation for string data types is SQLUPPER. Non-string data types do not support collation assignment.

You can specify collation for a field in CREATE TABLE and ALTER TABLE:

CREATE TABLE Sample.MyNames (
    LastName CHAR(30),
    FirstName CHAR(30) COLLATE SQLstring)

Note:

When specifying collation for a field using CREATE TABLE and ALTER TABLE, the % prefix is optional: COLLATE SQLstring or COLLATE %SQLstring.

You can specify collation for a property when defining a table using a persistent class definition:

Class Sample.MyNames Extends %Persistent [DdlAllowed]
{
Property LastName As %String;
Property FirstName As %String(COLLATION = "SQLstring");
}

Note:

When specifying collation for class definitions and class methods do not use the % prefix for collation type names.

In these examples, the LastName field takes default collation (SQLUPPER, which is not case-sensitive), the FirstName field is defined with SQLSTRING collation, which is case-sensitive.

If you change the collation for a class property and you already have stored data for that class, any indices on the property become invalid. You must rebuild all indices based on this property.

Index Definition Collation

The CREATE INDEX command cannot specify an index collation type. The index uses the same collation as the field being indexed.

An index defined as part of class definition can specify a collation type. By default, an index on a given property (or properties) uses the collation type of the property data. For example, suppose you have defined a property Name of type %StringOpens in a new tab:

Class MyApp.Person Extends %Persistent [DdlAllowed]
{
Property Name As %String;
Index NameIDX On Name;
}

The collation for Name is SQLUPPER (the default for %StringOpens in a new tab). Suppose that the Person table contains the following data:

ID	Name
1	Jones
2	JOHNSON
3	Smith
4	jones
5	SMITH

Then an index on Name will contain the following entries:

Name	ID(s)
JOHNSON	2
JONES	1, 4
SMITH	3, 5

The SQL Engine can use this index directly for ORDER BY or comparison operations using the Name field.

You can override the default collation used for an index by adding an As clause to the index definition:

Class MyApp.Person Extends %Persistent [DdlAllowed]
{
Property Name As %String;
Index NameIDX On Name As SQLstring;
}

In this case the NameIDX index will now store values in SQLSTRING (case-sensitive) form. Using the data from the above example:

Name	ID(s)
JOHNSON	2
Jones	1
jones	4
SMITH	5
Smith	3

In this case, the SQL Engine can take advantage of this index for any queries requiring case-sensitive collation.

In general, you should not have to change the collations of indices. If you want to use a different collation, it is better to define it at the property level and let any indices on the property pick up the correct collation.

If you are performing a property comparison using an indexed property, the property as specified in the comparison should have the same collation type as the corresponding index. For example, the Name property in the WHERE clause of a SELECT or in the ON clause of a JOIN should have the same collation as the index defined for the Name property. If there is a mismatch between the property collation and the index collation, the index may be less effective or may not be used at all. For further details, refer to Index Collation in the “Defining and Building Indices” chapter of the Caché SQL Optimization Guide.

If your index is defined to use multiple properties, you can specify the collation of each individually:

Index MyIDX On (Name As SQLstring, Code As Exact);

Query Collation

Caché SQL provides collation functions that can be used to change the collation or display of a field.

select-item Collation

Applying a collation function to a query select-item changes the display of that item.

Letter Case: By default, a query displays strings with uppercase and lowercase letters. The exceptions to this are the DISTINCT or GROUP BY operations on a field of collation type SQLUPPER. These operations display that field in all uppercase letters. You can use the %EXACT collation function to reverse this letter case transformation and display the field in uppercase and lowercase letters. You should not use an %SQLUPPER collation function in the select-item list to display a field in all uppercase letters. This is because %SQLUPPER adds a space character to the length of the string. Use the UPPER function instead:
```
SELECT TOP 5 Name,$LENGTH(Name) AS NLen,
             %SQLUPPER(Name) AS UpCollN,$LENGTH(%SQLUPPER(Name)) AS UpCollLen,
             UPPER(Name) AS UpN,$LENGTH(UPPER(Name)) AS UpLen
FROM Sample.Person
```
String Truncation: You can use the %TRUNCATE collation function to limit the length of the string data you wish to display. %TRUNCATE is preferable to %SQLUPPER, which adds a space character to the length of the string.
```
SELECT TOP 5 Name,$LENGTH(Name) AS NLen,
             %TRUNCATE(Name,8) AS TruncN,$LENGTH(%TRUNCATE(Name,8)) AS TruncLen
FROM Sample.Person
```
Note that you cannot nest collation functions or case-transformation functions.
WHERE clause comparisons: Most WHERE clause predicate condition comparisons use the collation type of the field/property. Because string fields default to SQLUPPER, these comparisons are commonly not case-sensitive. You can use the %EXACT collation function to make them case-sensitive:
The following example returns Home_City string matches regardless of letter case:
```
SELECT Home_City FROM Sample.Person WHERE Home_City = 'albany'
```
The following example returns Home_City string matches that are case-sensitive:
```
SELECT Home_City FROM Sample.Person WHERE %EXACT(Home_City) = 'albany'
```
The SQL Follows operator ( ] ) uses the field/property collation type.
However, the SQL Contains operator ( [ ) uses EXACT collation, regardless of the collation type of the field/property:
```
SELECT Home_City FROM Sample.Person WHERE Home_City [ 'c' 
ORDER BY Home_City
```
The %MATCHES and %PATTERN predicate conditions use EXACT collation, regardless of the collation type of the field/property. The %PATTERN predicate provides both case-sensitive wildcards and a wildcard (‘A’) which is not case-sensitive.
ORDER BY clause: The ORDER BY clause uses the namespace default collation to order string values. Therefore, ORDER BY does not order based on lettercase. You can use %EXACT collation to order strings based on lettercase.

DISTINCT and GROUP BY Collation

By default, these operation use the current namespace collation. The default namespace collation is SQLUPPER.

DISTINCT: The DISTINCT keyword uses the namespace default collation to eliminate duplicate values. Therefore, DISTINCT Name returns values in all uppercase letters. You can use EXACT collation to return values in mixed uppercase and lowercase. DISTINCT eliminates duplicates that differ only in letter case. To preserve duplicates that differ in case, but eliminate exact duplicates, use EXACT collation. The following example eliminates exact duplicates (but not lettercase variants) and returns all values in mixed uppercase and lowercase:
```
SELECT DISTINCT %EXACT(Name) FROM Sample.Person
```
A UNION involves an implicit DISTINCT operation.
GROUP BY: The GROUP BY clause uses the namespace default collation to eliminate duplicate values. Therefore, GROUP BY Name returns values in all uppercase letters. You can use EXACT collation to return values in mixed uppercase and lowercase. GROUP BY eliminates duplicates that differ only in letter case. To preserve duplicates that differ in case, but eliminate exact duplicates, you must specify the %EXACT collation function on the GROUP BY clause, not the select-item.
The following example returns values in mixed uppercase and lowercase; the GROUP BY eliminates duplicates, including those that differ in lettercase:
```
SELECT %EXACT(Name) FROM Sample.Person GROUP BY Name
```
The following example returns values in mixed uppercase and lowercase; the GROUP BY eliminates exact duplicates (but not lettercase variants):
```
SELECT Name FROM Sample.Person GROUP BY %EXACT(Name)
```

Legacy Collation Types

Caché SQL supports several legacy collation types. These are deprecated and not recommended for use with new code, as their purpose is to provide continued support for legacy systems. They are:

ALPHAUP — Removes all punctuation characters except question marks (“?”) and commas (“,”), and translates all the lowercase letters to uppercase. Used mostly for mapping legacy globals. Replaced by SQLUPPER.
STRING — Converts a logical value to uppercase, strips all punctuation and white space (except for commas), and adds one leading blank space to the beginning of the string. It collates any value containing only whitespace (spaces, tabs, and so on) as the SQL empty string. Replaced by SQLUPPER.
UPPER — Translates all lowercase letters into uppercase letters. Used mostly for mapping legacy globals. Replaced by SQLUPPER.
SPACE — SPACE collation appends a single leading space to a value, forcing it to be evaluated as a string. To establish SPACE collation, CREATE TABLE provides a SPACE collation keyword, and ObjectScript provides a SPACE option in the Collation()Opens in a new tab method of the %SYSTEM.UtilOpens in a new tab class. There is no corresponding SQL collation function.

Note:

If a string data type field is defined with EXACT, UPPER, or ALPHAUP collation, and a query applies a %STARTSWITH condition on this field, inconsistent behavior may result. If the substring you specify to %STARTSWITH is a canonical number (especially a negative and/or fractional number), %STARTSWITH may give different results depending on whether the field is indexed. The %STARTSWITH should perform as expected if the column is not indexed. If the column is indexed, unexpected results may occur.

SQL and NLS Collations

The SQL collations described above should not be confused with the Caché NLS collation feature, which provides subscript-level encoding that adhere to particular national language collation requirements. These are two separate systems of providing collations, and they work at different levels of the product.

Caché NLS collations can have a process-level collation for the current process, and different collations for specific globals.

To ensure proper functioning when using Caché SQL, it is a requirement that the process-level NLS collation matches exactly the NLS collation of all globals involved, including globals used by the tables and globals used for temporary files such as process private globals and for CACHETEMP globals; otherwise, different processing plans devised by the Query Processor might give different results. In situations where sorting occurs, such as an ORDER BY clause or a range condition, the Query Processor selects the most efficient sorting strategy. It may use an index, use a temporary file in a process-private global, sort within a local array, or use a "]]" (Sorts After) comparison. All these are subscript-type comparisons that adhere to the Caché NLS collation that is in effect, which is why it is necessary that all these types of globals use the exact same NLS collation.

The system creates a global with the data base default collation. You can use the Create()Opens in a new tab method of the %Library.GlobalEditOpens in a new tab class to create a global with a different collation. The only requirement is that the specified collation be either built-in (such as the Caché standard) or one of the national collations available in the current locale. See “Using %Library.GlobalEdit to Set Collation For A Global” in Caché Specialized System Tools and Utilities.