* CUSTOM TOKENIZERS
*
* Applications may also register custom tokenizer types. A tokenizer
* is registered by providing fts5 with a populated instance of the
* following structure. All structure methods must be defined; setting
* any member of the fts5_tokenizer struct to NULL leads to undefined
* behaviour. The structure methods are expected to function as follows:
*
* xCreate:
*   This function is used to allocate and initialize a tokenizer instance.
*   A tokenizer instance is required to actually tokenize text.
*
*   The first argument passed to this function is a copy of the (void*)
*   pointer provided by the application when the fts5_tokenizer object
*   was registered with FTS5 (the third argument to xCreateTokenizer()).
*   The second and third arguments are an array of nul-terminated strings
*   and the number of entries in that array. These contain the tokenizer
*   arguments, if any, specified following the tokenizer name as part of
*   the CREATE VIRTUAL TABLE statement used to create the FTS5 table.
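As a sketch of what this means for xCreate(), suppose the table is declared with a hypothetical tokenizer named "mytok" followed by two option words. The table name, column name, and option values below are assumptions for illustration only:

```c
#include <assert.h>
#include <string.h>

/* Sketch (hypothetical values): for a declaration such as
**
**   CREATE VIRTUAL TABLE ft USING fts5(x, tokenize = 'mytok case sensitive');
**
** FTS5 invokes the xCreate() of the tokenizer registered as "mytok" with
** the whitespace-separated words that follow the tokenizer name: */
static const char *azArg[] = { "case", "sensitive" };
static const int nArg = 2;

/* A tokenizer typically scans these to configure the new instance. */
static int hasOption(const char *zOpt){
  int i;
  for(i=0; i<nArg; i++){
    if( strcmp(azArg[i], zOpt)==0 ) return 1;
  }
  return 0;
}
```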
*
*   The final argument is an output variable. If successful, (*ppOut)
*   should be set to point to the new tokenizer handle and SQLITE_OK
*   returned. If an error occurs, some value other than SQLITE_OK should
*   be returned. In this case, fts5 assumes that the final value of *ppOut
*   is undefined.
*
* xDelete:
*   This function is invoked to delete a tokenizer handle previously
*   allocated using xCreate(). Fts5 guarantees that this function will
*   be invoked exactly once for each successful call to xCreate().
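A minimal xCreate()/xDelete() pair might look like the sketch below. The typedefs and the SQLITE_OK/SQLITE_NOMEM constants are stand-ins for the real fts5.h and sqlite3.h declarations, and the case-sensitivity option is a hypothetical example, not part of any built-in tokenizer:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SQLITE_OK    0      /* stand-in for sqlite3.h value */
#define SQLITE_NOMEM 7      /* stand-in for sqlite3.h value */

/* Stand-in for the opaque Fts5Tokenizer handle declared in fts5.h. */
typedef struct MyTokenizer {
  int bCaseSensitive;            /* Hypothetical per-instance option */
} MyTokenizer;
typedef MyTokenizer Fts5Tokenizer;

/* xCreate: allocate and initialize a tokenizer instance. */
static int myCreate(
  void *pCtx,                    /* Pointer passed to xCreateTokenizer() */
  const char **azArg, int nArg,  /* Args from the CREATE VIRTUAL TABLE stmt */
  Fts5Tokenizer **ppOut          /* OUT: new tokenizer handle */
){
  MyTokenizer *p = (MyTokenizer*)malloc(sizeof(MyTokenizer));
  (void)pCtx;
  if( p==0 ) return SQLITE_NOMEM;     /* on error, *ppOut is left undefined */
  p->bCaseSensitive = (nArg>0 && strcmp(azArg[0], "case")==0);
  *ppOut = p;
  return SQLITE_OK;
}

/* xDelete: invoked exactly once per successful xCreate(). */
static void myDelete(Fts5Tokenizer *p){
  free(p);
}
```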
*
* xTokenize:
*   This function is expected to tokenize the nText byte string indicated
*   by argument pText. pText may or may not be nul-terminated. The first
*   argument passed to this function is a pointer to an Fts5Tokenizer object
*   returned by an earlier call to xCreate().
*
*   The second argument indicates the reason that FTS5 is requesting
*   tokenization of the supplied text. This is always one of the following
*   four values:
*
*        FTS5_TOKENIZE_DOCUMENT - A document is being inserted into
*            or removed from the FTS table. The tokenizer is being invoked to
*            determine the set of tokens to add to (or delete from) the
*            FTS index.
*
*        FTS5_TOKENIZE_QUERY - A MATCH query is being executed
*            against the FTS index. The tokenizer is being called to tokenize
*            a bareword or quoted string specified as part of the query.
*
*        (FTS5_TOKENIZE_QUERY | FTS5_TOKENIZE_PREFIX) - Same as
*            FTS5_TOKENIZE_QUERY, except that the bareword or quoted string is
*            followed by a "*" character, indicating that the last token
*            returned by the tokenizer will be treated as a token prefix.
*
*        FTS5_TOKENIZE_AUX - The tokenizer is being invoked to
*            satisfy an fts5_api.xTokenize() request made by an auxiliary
*            function. Or an fts5_api.xColumnSize() request made by the same
*            on a columnsize=0 database.
*
*   For each token in the input string, the supplied callback xToken() must
*   be invoked. The first argument to it should be a copy of the pointer
*   passed as the second argument to xTokenize(). The third and fourth
*   arguments are a pointer to a buffer containing the token text, and the
*   size of the token in bytes. The fifth and sixth arguments are the byte
*   offsets of the first byte of, and the first byte immediately following,
*   the text from which the token is derived within the input.
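To make the offset convention concrete, the helper below (illustrative only, not part of the fts5 API) computes the iStart/iEnd values that would accompany a given token:

```c
#include <assert.h>
#include <string.h>

/* Illustrative helper (not part of the fts5 API): compute the iStart and
** iEnd byte offsets that xToken() expects for the first occurrence of
** zTok within zInput. iStart is the offset of the first byte of the
** token; iEnd is the offset of the first byte past it. */
static void tokenOffsets(
  const char *zInput, const char *zTok, int *piStart, int *piEnd
){
  const char *z = strstr(zInput, zTok);   /* assumes zTok occurs in zInput */
  *piStart = (int)(z - zInput);
  *piEnd = *piStart + (int)strlen(zTok);
}
```

For the input "I won first place", the token "first" yields iStart==6 and iEnd==11, matching the convention described above.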
*
*   The second argument passed to the xToken() callback ("tflags") should
*   normally be set to 0. The exception is if the tokenizer supports
*   synonyms. In this case see the discussion below for details.
*
*   FTS5 assumes the xToken() callback is invoked for each token in the
*   order that they occur within the input text.
*
*   If an xToken() callback returns any value other than SQLITE_OK, then
*   the tokenization should be abandoned and the xTokenize() method should
*   immediately return a copy of the xToken() return value. Or, if the
*   input buffer is exhausted, xTokenize() should return SQLITE_OK. Finally,
*   if an error occurs with the xTokenize() implementation itself, it
*   may abandon the tokenization and return any error code other than
*   SQLITE_OK or SQLITE_DONE.
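As a sketch of the contract above, here is an xTokenize() implementation that splits on ASCII whitespace. The typedef and SQLITE_OK constant are stand-ins for the real fts5.h/sqlite3.h declarations, and the counting callback exists only for demonstration (a real tokenizer would also case-fold and handle UTF-8):

```c
#include <assert.h>
#include <ctype.h>

#define SQLITE_OK 0                          /* stand-in for sqlite3.h value */

typedef struct Fts5Tokenizer Fts5Tokenizer;  /* opaque stand-in */

/* Sketch: tokenize pText/nText by splitting on ASCII whitespace,
** invoking xToken() once per token and abandoning on error. */
static int myTokenize(
  Fts5Tokenizer *pTok,                 /* From an earlier xCreate() */
  void *pCtx,                          /* Passed through to xToken() */
  int flags,                           /* FTS5_TOKENIZE_* mask (unused here) */
  const char *pText, int nText,        /* Input; may not be nul-terminated */
  int (*xToken)(void*, int, const char*, int, int, int)
){
  int i = 0;
  (void)pTok; (void)flags;
  while( i<nText ){
    int iStart, rc;
    while( i<nText && isspace((unsigned char)pText[i]) ) i++;
    iStart = i;
    while( i<nText && !isspace((unsigned char)pText[i]) ) i++;
    if( i==iStart ) break;             /* only trailing whitespace remained */
    rc = xToken(pCtx, 0, &pText[iStart], i-iStart, iStart, i);
    if( rc!=SQLITE_OK ) return rc;     /* abandon tokenization on error */
  }
  return SQLITE_OK;                    /* input exhausted */
}

/* Demonstration callback: counts the tokens emitted. */
static int countToken(void *pCtx, int tflags, const char *pTok,
                      int nTok, int iStart, int iEnd){
  (void)tflags; (void)pTok; (void)nTok; (void)iStart; (void)iEnd;
  ++*(int*)pCtx;
  return SQLITE_OK;
}
```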
*
* SYNONYM SUPPORT
*
*   Custom tokenizers may also support synonyms. Consider a case in which a
*   user wishes to query for a phrase such as "first place". Using the
*   built-in tokenizers, the FTS5 query 'first + place' will match instances
*   of "first place" within the document set, but not alternative forms
*   such as "1st place". In some applications, it would be better to match
*   all instances of "first place" or "1st place" regardless of which form
*   the user specified in the MATCH query text.
*
*   There are several ways to approach this in FTS5:
*
*        (1) By mapping all synonyms to a single token. In this case, using
*            the above example, this means that the tokenizer returns the
*            same token for inputs "first" and "1st". Say that token is in
*            fact "first", so that when the user inserts the document "I won
*            1st place" entries are added to the index for tokens "i", "won",
*            "first" and "place". If the user then queries for '1st + place',
*            the tokenizer substitutes "first" for "1st" and the query works
*            as expected.
*
*        (2) By querying the index for all synonyms of each query term. In
*            this case, when tokenizing query text, the tokenizer may
*            provide multiple synonyms for a single term within the document.
*            FTS5 then queries the index for each synonym individually. For
*            example, faced with the query:
*
*   
*     ... MATCH 'first place'
*
*            the tokenizer offers both "1st" and "first" as synonyms for the
*            first token in the MATCH query and FTS5 effectively runs a query
*            similar to:
*
*   
*     ... MATCH '(first OR 1st) place'
*
*            except that, for the purposes of auxiliary functions, the query
*            still appears to contain just two phrases - "(first OR 1st)"
*            being treated as a single phrase.
*
*        (3) By adding multiple synonyms for a single term to the FTS index.
*            Using this method, when tokenizing document text, the tokenizer
*            provides multiple synonyms for each token. So that when a
*            document such as "I won first place" is tokenized, entries are
*            added to the FTS index for "i", "won", "first", "1st" and
*            "place".
*
*            This way, even if the tokenizer does not provide synonyms
*            when tokenizing query text (it should not - to do so would be
*            inefficient), it doesn't matter if the user queries for
*            'first + place' or '1st + place', as there are entries in the
*            FTS index corresponding to both forms of the first token.
*
*   Whether it is parsing document or query text, any call to xToken that
*   specifies a tflags argument with the FTS5_TOKEN_COLOCATED bit set
*   is considered to supply a synonym for the previous token. For example,
*   when parsing the document "I won first place", a tokenizer that supports
*   synonyms would call xToken() 5 times, as follows:
*
*   
*       xToken(pCtx, 0, "i",                      1,  0,  1);
*       xToken(pCtx, 0, "won",                    3,  2,  5);
*       xToken(pCtx, 0, "first",                  5,  6, 11);
*       xToken(pCtx, FTS5_TOKEN_COLOCATED, "1st", 3,  6, 11);
*       xToken(pCtx, 0, "place",                  5, 12, 17);
*
*
*   It is an error to specify the FTS5_TOKEN_COLOCATED flag the first time
*   xToken() is called. Multiple synonyms may be specified for a single token
*   by making multiple calls to xToken(FTS5_TOKEN_COLOCATED) in sequence.
*   There is no limit to the number of synonyms that may be provided for a
*   single token.
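The call sequence above might be produced by logic along the following lines. In this sketch the FTS5_TOKEN_COLOCATED value mirrors fts5.h, the recording callback is for demonstration only, and the single hard-coded "first"/"1st" synonym pair is an assumption for illustration:

```c
#include <assert.h>
#include <string.h>

#define SQLITE_OK 0                  /* stand-in for sqlite3.h value */
#define FTS5_TOKEN_COLOCATED 0x0001  /* value from fts5.h */

/* Sketch: emit a token, then emit any synonym for it as a colocated
** token with the same iStart/iEnd as the original. */
static int emitWithSynonyms(
  int (*xToken)(void*, int, const char*, int, int, int),
  void *pCtx, const char *pTok, int nTok, int iStart, int iEnd
){
  int rc = xToken(pCtx, 0, pTok, nTok, iStart, iEnd);
  if( rc==SQLITE_OK && nTok==5 && memcmp(pTok, "first", 5)==0 ){
    rc = xToken(pCtx, FTS5_TOKEN_COLOCATED, "1st", 3, iStart, iEnd);
  }
  return rc;
}

/* Demonstration callback: records the tflags of each xToken() call in
** a[1..], with the call count kept in a[0]. */
static int recordFlags(void *pCtx, int tflags, const char *pTok,
                       int nTok, int iStart, int iEnd){
  int *a = (int*)pCtx;
  (void)pTok; (void)nTok; (void)iStart; (void)iEnd;
  a[++a[0]] = tflags;
  return SQLITE_OK;
}
```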
*
*   In many cases, method (1) above is the best approach. It does not add
*   extra data to the FTS index or require FTS5 to query for multiple terms,
*   so it is efficient in terms of disk space and query speed. However, it
*   does not support prefix queries very well. If, as suggested above, the
*   token "first" is substituted for "1st" by the tokenizer, then the query:
*
*   
*     ... MATCH '1s*'
*
*   will not match documents that contain the token "1st" (as the tokenizer
*   will probably not map "1s" to any prefix of "first").
*
*   For full prefix support, method (3) may be preferred. In this case,
*   because the index contains entries for both "first" and "1st", prefix
*   queries such as 'fi*' or '1s*' will match correctly. However, because
*   extra entries are added to the FTS index, this method uses more space
*   within the database.
*
*   Method (2) offers a midpoint between (1) and (3). Using this method,
*   a query such as '1s*' will match documents that contain the literal
*   token "1st", but not "first" (assuming the tokenizer is not able to
*   provide synonyms for prefixes). However, a non-prefix query like '1st'
*   will match against "1st" and "first". This method does not require
*   extra disk space, as no extra entries are added to the FTS index.
*   On the other hand, it may require more CPU cycles to run MATCH queries,
*   as separate queries of the FTS index are required for each synonym.
*
*   When using methods (2) or (3), it is important that the tokenizer only
*   provide synonyms when tokenizing document text (method (3)) or query
*   text (method (2)), not both. Doing so will not cause any errors, but
*   is inefficient.
*   
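A tokenizer can make that distinction using the reason flags passed to xTokenize(). The sketch below (the constants mirror the fts5.h values) gates synonym emission so that a method (3) tokenizer expands only document text and a method (2) tokenizer expands only query text:

```c
#include <assert.h>

/* Values mirror the FTS5_TOKENIZE_* constants in fts5.h. */
#define FTS5_TOKENIZE_QUERY    0x0001
#define FTS5_TOKENIZE_PREFIX   0x0002
#define FTS5_TOKENIZE_DOCUMENT 0x0004
#define FTS5_TOKENIZE_AUX      0x0008

/* Sketch: return non-zero if synonyms should be emitted for this call.
** Method (3) emits synonyms only while indexing documents; method (2)
** emits them only while tokenizing query text. */
static int shouldEmitSynonyms(int flags, int bUseMethod3){
  if( bUseMethod3 ){
    return (flags & FTS5_TOKENIZE_DOCUMENT)!=0;   /* method (3) */
  }
  return (flags & FTS5_TOKENIZE_QUERY)!=0;        /* method (2) */
}
```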