* CUSTOM TOKENIZERS
*
* Applications may also register custom tokenizer types. A tokenizer
* is registered by providing fts5 with a populated instance of the
* following structure. All structure methods must be defined; setting
* any member of the fts5_tokenizer struct to NULL leads to undefined
* behaviour. The structure methods are expected to function as follows:
*
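* (For reference, the fts5_tokenizer structure and the opaque Fts5Tokenizer
* handle type are declared in fts5.h along the following lines; see fts5.h
* itself for the authoritative declarations.)
*
*   typedef struct Fts5Tokenizer Fts5Tokenizer;
*   typedef struct fts5_tokenizer fts5_tokenizer;
*   struct fts5_tokenizer {
*     int  (*xCreate)(void*, const char **azArg, int nArg, Fts5Tokenizer **ppOut);
*     void (*xDelete)(Fts5Tokenizer*);
*     int  (*xTokenize)(Fts5Tokenizer*,
*         void *pCtx,
*         int flags,
*         const char *pText, int nText,
*         int (*xToken)(void *pCtx, int tflags, const char *pToken, int nToken,
*                       int iStart, int iEnd)
*     );
*   };
*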
* xCreate:
* This function is used to allocate and initialize a tokenizer instance.
* A tokenizer instance is required to actually tokenize text.
*
* The first argument passed to this function is a copy of the (void*)
* pointer provided by the application when the fts5_tokenizer object
* was registered with FTS5 (the third argument to xCreateTokenizer()).
* The second argument is an array of nul-terminated strings containing the
* tokenizer arguments, if any, specified following the tokenizer name as
* part of the CREATE VIRTUAL TABLE statement used to create the FTS5
* table, and the third argument is the number of entries in that array.
*
* The final argument is an output variable. If successful, (*ppOut)
* should be set to point to the new tokenizer handle and SQLITE_OK
* returned. If an error occurs, some value other than SQLITE_OK should
* be returned. In this case, fts5 assumes that the final value of *ppOut
* is undefined.
*
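* As an illustration, a minimal xCreate() for a hypothetical "simple"
* tokenizer might look like the sketch below. SimpleTokenizer,
* simpleCreate() and the "fold" option are invented for this example and
* are not part of FTS5:
*
*   typedef struct SimpleTokenizer SimpleTokenizer;
*   struct SimpleTokenizer { int bFold; };
*
*   static int simpleCreate(
*     void *pCtx, const char **azArg, int nArg, Fts5Tokenizer **ppOut
*   ){
*     SimpleTokenizer *p = sqlite3_malloc(sizeof(SimpleTokenizer));
*     if( p==0 ) return SQLITE_NOMEM;
*     p->bFold = (nArg>0 && 0==sqlite3_stricmp(azArg[0], "fold"));
*     *ppOut = (Fts5Tokenizer*)p;
*     return SQLITE_OK;
*   }
*
* With such an xCreate(), the "fold" argument could be supplied as part of
* the CREATE VIRTUAL TABLE statement, for example:
* CREATE VIRTUAL TABLE t USING fts5(x, tokenize = 'simple fold');
*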
* xDelete:
* This function is invoked to delete a tokenizer handle previously
* allocated using xCreate(). Fts5 guarantees that this function will
* be invoked exactly once for each successful call to xCreate().
*
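* Continuing the hypothetical example above, the matching xDelete() simply
* frees the allocation made by xCreate():
*
*   static void simpleDelete(Fts5Tokenizer *pTok){
*     sqlite3_free(pTok);
*   }
*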
* xTokenize:
* This function is expected to tokenize the nText byte string indicated
* by argument pText. pText may or may not be nul-terminated. The first
* argument passed to this function is a pointer to an Fts5Tokenizer object
* returned by an earlier call to xCreate().
*
* The flags argument indicates the reason that FTS5 is requesting
* tokenization of the supplied text. This is always one of the following
* four values:
*
* - FTS5_TOKENIZE_DOCUMENT - A document is being inserted into
* or removed from the FTS table. The tokenizer is being invoked to
* determine the set of tokens to add to (or delete from) the
* FTS index.
*
* - FTS5_TOKENIZE_QUERY - A MATCH query is being executed
* against the FTS index. The tokenizer is being called to tokenize
* a bareword or quoted string specified as part of the query.
*
* - (FTS5_TOKENIZE_QUERY | FTS5_TOKENIZE_PREFIX) - Same as
* FTS5_TOKENIZE_QUERY, except that the bareword or quoted string is
* followed by a "*" character, indicating that the last token
* returned by the tokenizer will be treated as a token prefix.
*
* - FTS5_TOKENIZE_AUX - The tokenizer is being invoked to
* satisfy an fts5_api.xTokenize() request made by an auxiliary
* function, or an fts5_api.xColumnSize() request made by the same
* function on a columnsize=0 database.
*
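* These values are defined as distinct bits in fts5.h, and
* FTS5_TOKENIZE_PREFIX only ever appears OR-ed with FTS5_TOKENIZE_QUERY, so
* an implementation may test the bits individually. A minimal illustrative
* sketch (the variable names are invented):
*
*   int bQuery  = (flags & FTS5_TOKENIZE_QUERY)!=0;   // query, or query+prefix
*   int bPrefix = (flags & FTS5_TOKENIZE_PREFIX)!=0;  // last token is a prefix
*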
* For each token in the input string, the supplied callback xToken() must
* be invoked. The first argument to it should be a copy of the pointer
* passed as the second argument to xTokenize(). The third and fourth
* arguments are a pointer to a buffer containing the token text, and the
* size of the token in bytes. The fifth and sixth arguments are the byte
* offsets of the first byte of, and the first byte immediately following,
* the text from which the token is derived within the input.
*
* The second argument passed to the xToken() callback ("tflags") should
* normally be set to 0. The exception is if the tokenizer supports
* synonyms. In this case see the discussion below for details.
*
* FTS5 assumes the xToken() callback is invoked for each token in the
* order that they occur within the input text.
*
* If an xToken() callback returns any value other than SQLITE_OK, then
* the tokenization should be abandoned and the xTokenize() method should
* immediately return a copy of the xToken() return value. Or, if the
* input buffer is exhausted, xTokenize() should return SQLITE_OK. Finally,
* if an error occurs with the xTokenize() implementation itself, it
* may abandon the tokenization and return any error code other than
* SQLITE_OK or SQLITE_DONE.
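*
* As a sketch of how the pieces fit together, the hypothetical
* simpleTokenize() below treats each run of ASCII alphanumeric bytes as a
* token. It is illustrative only: it assumes <ctype.h> is available,
* ignores the flags argument and performs no case folding:
*
*   static int simpleTokenize(
*     Fts5Tokenizer *pTok,    // instance from simpleCreate() (unused here)
*     void *pCtx,
*     int flags,              // mask of FTS5_TOKENIZE_* flags (unused here)
*     const char *pText, int nText,
*     int (*xToken)(void*, int, const char*, int, int, int)
*   ){
*     int i = 0;
*     while( i<nText ){
*       // skip bytes that separate tokens
*       while( i<nText && !isalnum((unsigned char)pText[i]) ) i++;
*       int iStart = i;
*       while( i<nText && isalnum((unsigned char)pText[i]) ) i++;
*       if( i>iStart ){
*         int rc = xToken(pCtx, 0, &pText[iStart], i-iStart, iStart, i);
*         if( rc!=SQLITE_OK ) return rc;
*       }
*     }
*     return SQLITE_OK;
*   }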
*
* SYNONYM SUPPORT
*
* Custom tokenizers may also support synonyms. Consider a case in which a
* user wishes to query for a phrase such as "first place". Using the
* built-in tokenizers, the FTS5 query 'first + place' will match instances
* of "first place" within the document set, but not alternative forms
* such as "1st place". In some applications, it would be better to match
* all instances of "first place" or "1st place" regardless of which form
* the user specified in the MATCH query text.
*
* There are several ways to approach this in FTS5:
*
* - By mapping all synonyms to a single token. In the above example,
* this means that the tokenizer returns the same token for inputs
* "first" and "1st". Say that token is in fact "first", so that when
* the user inserts the document "I won 1st place" entries are added
* to the index for tokens "i", "won", "first" and "place". If the
* user then queries for '1st + place', the tokenizer substitutes
* "first" for "1st" and the query works as expected.
*
* - By querying the index for all synonyms of each query term
* separately. In this case, when tokenizing query text, the tokenizer
* may provide multiple synonyms for a single term within the document.
* FTS5 then queries the index for each synonym individually. For
* example, faced with the query:
*
* ... MATCH 'first place'
*
* the tokenizer offers both "1st" and "first" as synonyms for the
* first token in the MATCH query and FTS5 effectively runs a query
* similar to:
*
* ... MATCH '(first OR 1st) place'
*
* except that, for the purposes of auxiliary functions, the query
* still appears to contain just two phrases - "(first OR 1st)"
* being treated as a single phrase.
*
* - By adding multiple synonyms for a single term to the FTS index.
* Using this method, when tokenizing document text, the tokenizer
* provides multiple synonyms for each token. So that when a
* document such as "I won first place" is tokenized, entries are
* added to the FTS index for "i", "won", "first", "1st" and
* "place".
*
* This way, even if the tokenizer does not provide synonyms
* when tokenizing query text (it should not - to do so would be
* inefficient), it doesn't matter if the user queries for
* 'first + place' or '1st + place', as there are entries in the
* FTS index corresponding to both forms of the first token.
*
* Whether it is parsing document or query text, any call to xToken that
* specifies a tflags argument with the FTS5_TOKEN_COLOCATED bit set
* is considered to supply a synonym for the previous token. For example,
* when parsing the document "I won first place", a tokenizer that supports
* synonyms would call xToken() 5 times, as follows:
*
* xToken(pCtx, 0, "i", 1, 0, 1);
* xToken(pCtx, 0, "won", 3, 2, 5);
* xToken(pCtx, 0, "first", 5, 6, 11);
* xToken(pCtx, FTS5_TOKEN_COLOCATED, "1st", 3, 6, 11);
* xToken(pCtx, 0, "place", 5, 12, 17);
*
* It is an error to specify the FTS5_TOKEN_COLOCATED flag the first time
* xToken() is called. Multiple synonyms may be specified for a single token
* by making multiple calls to xToken(FTS5_TOKEN_COLOCATED) in sequence.
* There is no limit to the number of synonyms that may be provided for a
* single token.
*
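* In code, emitting the synonym from the example above from within a
* tokenizer's xTokenize() implementation might look like the following
* fragment, where rc, iStart and iEnd are assumed to be local variables
* holding the callback result and the offsets of the original token:
*
*   rc = xToken(pCtx, 0, "first", 5, iStart, iEnd);
*   if( rc==SQLITE_OK ){
*     rc = xToken(pCtx, FTS5_TOKEN_COLOCATED, "1st", 3, iStart, iEnd);
*   }
*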
* In many cases, method (1) above is the best approach. It does not add
* extra data to the FTS index or require FTS5 to query for multiple terms,
* so it is efficient in terms of disk space and query speed. However, it
* does not support prefix queries very well. If, as suggested above, the
* token "first" is substituted for "1st" by the tokenizer, then the query:
*
* ... MATCH '1s*'
*
* will not match documents that contain the token "1st" (as the tokenizer
* will probably not map "1s" to any prefix of "first").
*
* For full prefix support, method (3) may be preferred. In this case,
* because the index contains entries for both "first" and "1st", prefix
* queries such as 'fi*' or '1s*' will match correctly. However, because
* extra entries are added to the FTS index, this method uses more space
* within the database.
*
* Method (2) offers a midpoint between (1) and (3). Using this method,
* a query such as '1s*' will match documents that contain the literal
* token "1st", but not "first" (assuming the tokenizer is not able to
* provide synonyms for prefixes). However, a non-prefix query like '1st'
* will match against "1st" and "first". This method does not require
* extra disk space, as no extra entries are added to the FTS index.
* On the other hand, it may require more CPU cycles to run MATCH queries,
* as separate queries of the FTS index are required for each synonym.
*
* When using methods (2) or (3), it is important that the tokenizer only
* provide synonyms when tokenizing document text (method (3)) or query
* text (method (2)), not both. Doing so will not cause any errors, but is
* inefficient.
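*
* Finally, once the three methods have been implemented, the tokenizer is
* registered by passing a populated fts5_tokenizer structure to
* fts5_api.xCreateTokenizer(), along with the name under which it should
* be made available to CREATE VIRTUAL TABLE statements. A sketch, reusing
* the hypothetical simpleCreate/simpleDelete/simpleTokenize functions from
* above (pApi is assumed to be an fts5_api pointer obtained from the
* database connection):
*
*   fts5_tokenizer tok = { simpleCreate, simpleDelete, simpleTokenize };
*   int rc = pApi->xCreateTokenizer(pApi, "simple", 0, &tok, 0);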