FILTERNOISEWORDS controls whether common-word (noise word) filtering is enabled.
Specifying a list of noise words can greatly reduce the size of a text index and the associated
index update time. However, noise words must then also be removed from the search pattern
when performing a text search, which can produce counter-intuitive results; see the note below.
Setting up noise word filtering is a two-step process. First, enable noise word filtering by
setting FILTERNOISEWORDS=1. Second, populate the noise word dictionary by calling the
ExcludeCommonTerms() method with the desired number of noise words for the corresponding
DICTIONARY. ExcludeCommonTerms() purges the previous set of noise words, so it may be called
any number of times; however, all text indexes on the corresponding properties must be rebuilt
whenever the list of noise words changes.
Note: The SQL query:
SELECT myDocument FROM table t WHERE myDocument %CONTAINS ('to be or not to be')
will not find any qualifying rows if 'to', 'be', 'or', and 'not' are all noise words. However,
if any of these terms is not a noise word, then only the non-noise words participate in the
matching process.
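The effect described in the note can be modeled with a short Python sketch. This is only an illustration of the filtering rule, not the %Text implementation; the function name filter_noise_words and the whitespace-based word splitting are assumptions made for the example.

```python
def filter_noise_words(pattern, noise_words):
    """Return the terms of `pattern` that are not noise words, in order.

    Hypothetical helper: words are split on whitespace and compared in
    lower case, which is a simplification of real tokenization.
    """
    return [w for w in pattern.lower().split() if w not in noise_words]


# Every term is a noise word: nothing is left to match against.
print(filter_noise_words("to be or not to be", {"to", "be", "or", "not"}))

# 'be' is not a noise word: only 'be' takes part in matching.
print(filter_noise_words("to be or not to be", {"to", "or", "not"}))
```

In the first call the filtered pattern is empty, which corresponds to the case where no rows qualify; in the second, only the occurrences of 'be' remain.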
parameter MINWORDLEN = 1;
MINWORDLEN specifies the minimum length of a word that will be retained.
The limit does not apply to ngram words or to post-stemmed words.
MINWORDLEN provides a simple means of excluding terms based on their
length, since short words such as 'a', 'to', and 'an' are usually
connectives that carry little information. The length refers to the number of
characters in the original document. Note that if stemming or thesaurus translation is
enabled, then a term stored in a text index may have fewer than MINWORDLEN
characters.
Note: MINWORDLEN should typically be set to 3 or less when STEMMING=1,
since otherwise a word stem could be classified as a noise word even though alternate forms of the
word would not be classified as a noise word. For example, with MINWORDLEN=5 "jump" would be discarded
as a noise word, whereas "jumps" would not.
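The interaction between MINWORDLEN and stemming can be sketched in Python as follows. This is a simplified model, not the %Text implementation: toy_stem is a hypothetical stand-in for a real stemmer (it only strips one trailing "s"), and index_terms is a hypothetical helper that applies the length check to the original word before stemming, as the description above states.

```python
def toy_stem(word):
    # Hypothetical stand-in for a real stemmer: strips one trailing "s".
    return word[:-1] if word.endswith("s") and len(word) > 1 else word


def index_terms(words, min_word_len):
    """Drop words shorter than min_word_len (measured on the original
    word, before stemming), then stem the words that remain."""
    return [toy_stem(w) for w in words if len(w) >= min_word_len]


# With a limit of 5, "jump" is discarded but "jumps" is kept and stemmed,
# so the index ends up holding a 4-character term.
print(index_terms(["jump", "jumps"], 5))

# With a limit of 3, both forms are kept and map to the same stem.
print(index_terms(["jump", "jumps"], 3))
```

With the limit set to 5, a document containing only "jump" contributes nothing to the index while one containing "jumps" contributes the stem "jump", which is the inconsistency the note warns about.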
parameter NGRAMLEN = 1;
NGRAMLEN is the maximum number of consecutive words that will be regarded as a single
search term. When NGRAMLEN=2, two-word combinations will be added to any
index, in addition to single words. Noise words are excluded when determining which words are consecutive.
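As a rough illustration, n-gram term generation can be modeled in Python as shown below. This sketch rests on two assumptions that the text above does not spell out: noise words are removed before consecutive words are combined, and the words of an n-gram are joined with a single space. The function name ngram_terms is hypothetical.

```python
def ngram_terms(words, ngram_len, noise_words=frozenset()):
    """Return single words plus all runs of up to ngram_len consecutive
    words, where 'consecutive' is judged after dropping noise words."""
    kept = [w for w in words if w not in noise_words]
    terms = []
    for n in range(1, ngram_len + 1):
        for i in range(len(kept) - n + 1):
            terms.append(" ".join(kept[i:i + n]))
    return terms


# NGRAMLEN=1 yields single words only; NGRAMLEN=2 adds two-word terms.
print(ngram_terms(["the", "quick", "brown", "fox"], 1, {"the"}))
print(ngram_terms(["the", "quick", "brown", "fox"], 2, {"the"}))
```

Because every additional n-gram length adds roughly one more term per word, larger NGRAMLEN values increase index size accordingly.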
parameter WORDCHARS = $%;
WORDCHARS specifies the characters, other than alphabetic characters, that may
appear in a word. For example, to regard hyphenated words as terms, include "-" in WORDCHARS.
Note that characters that are neither numbers nor letters are ignored for the purpose of comparison
with the %CONTAINS operator. Therefore, the search pattern "off-hand" will match "off hand"
if WORDCHARS="", but not if WORDCHARS="-"; conversely, "off-hand" will match "offhand" if
WORDCHARS="-", but not if WORDCHARS="".
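The matching behavior above can be reproduced with a small Python model. It is a sketch under stated assumptions, not the %Text implementation: text is restricted to ASCII letters and digits, tokens are maximal runs of letters, digits, and WORDCHARS characters, and non-alphanumeric characters are stripped from each token before comparison. The function name comparison_terms is hypothetical.

```python
import re


def comparison_terms(text, wordchars=""):
    """Split text into words (letters, digits, and any characters in
    wordchars), then strip non-alphanumerics from each word so that
    they are ignored when terms are compared."""
    token_pattern = "[A-Za-z0-9" + re.escape(wordchars) + "]+"
    tokens = re.findall(token_pattern, text.lower())
    normalized = (re.sub(r"[^a-z0-9]", "", t) for t in tokens)
    return [t for t in normalized if t]


for wc in ("", "-"):
    pattern = comparison_terms("off-hand", wc)
    print(repr(wc),
          pattern == comparison_terms("off hand", wc),  # matches "off hand"?
          pattern == comparison_terms("offhand", wc))   # matches "offhand"?
```

With WORDCHARS="" the hyphen acts as a separator, so "off-hand" yields the two terms "off" and "hand"; with WORDCHARS="-" it yields one term that compares equal to "offhand".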
Methods
classmethod SeparateWords(rawText As %String) as %String
This is a copy of the default routine from %Text.Japanese,
except that it groups characters more carefully. For example, some ASCII
strings are kept together. Groups of katakana characters are kept together
and are separated only by the middle dot (U+30FB). Diacritical marks are also
removed from words.
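The katakana grouping rule alone can be illustrated with the Python sketch below. It is not the SeparateWords routine: it covers only katakana runs, ignores ASCII grouping and diacritic removal, and returns a list rather than a string. The function name katakana_groups and the exact code point ranges are assumptions chosen for the example.

```python
import re

# Katakana block U+30A1..U+30FF, leaving out the middle dot (U+30FB),
# which acts as the separator between groups.
KATAKANA_RUN = re.compile("[\u30A1-\u30FA\u30FC-\u30FF]+")


def katakana_groups(text):
    """Return maximal runs of katakana, split only at the middle dot
    or at non-katakana characters."""
    return KATAKANA_RUN.findall(text)


# "Computer science" written in katakana with a middle dot between words.
print(katakana_groups("コンピュータ・サイエンス"))
```

Each run on either side of the middle dot is kept intact as a single group instead of being broken into individual characters.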