Struct Fts5Tokenizer
CUSTOM TOKENIZERS

Applications may also register custom tokenizer types. A tokenizer is registered by providing fts5 with a populated instance of the following structure. All structure methods must be defined; setting any member of the fts5_tokenizer struct to NULL leads to undefined behaviour. The structure methods are expected to function as follows:

xCreate:
This function is used to allocate and initialize a tokenizer instance. A tokenizer instance is required to actually tokenize text.

The first argument passed to this function is a copy of the (void*) pointer provided by the application when the fts5_tokenizer object was registered with FTS5 (the third argument to xCreateTokenizer()). The second and third arguments are an array of nul-terminated strings and the number of entries in that array. The strings are the tokenizer arguments, if any, specified following the tokenizer name as part of the CREATE VIRTUAL TABLE statement used to create the FTS5 table.

The final argument is an output variable. If successful, (*ppOut) should be set to point to the new tokenizer handle and SQLITE_OK returned. If an error occurs, some value other than SQLITE_OK should be returned. In this case, fts5 assumes that the final value of *ppOut is undefined.

xDelete:
This function is invoked to delete a tokenizer handle previously allocated using xCreate(). Fts5 guarantees that this function will be invoked exactly once for each successful call to xCreate().

xTokenize:
This function is expected to tokenize the nText byte string indicated by argument pText. pText may or may not be nul-terminated. The first argument passed to this function is a pointer to an Fts5Tokenizer object returned by an earlier call to xCreate().

The second argument indicates the reason that FTS5 is requesting tokenization of the supplied text. This is always one of the following four values:

- FTS5_TOKENIZE_DOCUMENT - A document is being inserted into or removed from the FTS table. The tokenizer is being invoked to determine the set of tokens to add to (or delete from) the FTS index.
- FTS5_TOKENIZE_QUERY - A MATCH query is being executed against the FTS index. The tokenizer is being called to tokenize a bareword or quoted string specified as part of the query.
- (FTS5_TOKENIZE_QUERY | FTS5_TOKENIZE_PREFIX) - Same as FTS5_TOKENIZE_QUERY, except that the bareword or quoted string is followed by a "*" character, indicating that the last token returned by the tokenizer will be treated as a token prefix.
- FTS5_TOKENIZE_AUX - The tokenizer is being invoked to satisfy an fts5_api.xTokenize() request made by an auxiliary function, or an fts5_api.xColumnSize() request made by an auxiliary function on a columnsize=0 database.
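
The lifecycle described above (xCreate, xTokenize, xDelete) can be sketched as follows. This is a minimal illustration and not the FTS5 implementation. So that it compiles without sqlite3.h, it uses local stand-ins: the `Demo*` types, the `DEMO_*` result codes, the "nofold" tokenizer argument and the `DemoSink` collector are all hypothetical names invented for this sketch. The token callback has the same shape as the one FTS5 passes to xTokenize, and the `flags` argument is accepted but ignored.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Stand-ins for SQLITE_OK, SQLITE_ERROR and SQLITE_NOMEM (assumption:
** real code would include sqlite3.h and use those names instead). */
#define DEMO_OK    0
#define DEMO_ERROR 1
#define DEMO_NOMEM 7

/* Hypothetical tokenizer handle: the state allocated by xCreate. */
typedef struct DemoTokenizer {
  int bFoldCase;                /* True to fold ASCII upper case to lower */
} DemoTokenizer;

/* Same shape as the token callback that FTS5 passes to xTokenize. */
typedef int (*DemoTokenCb)(void *pCtx, int tflags, const char *pToken,
                           int nToken, int iStart, int iEnd);

/* xCreate-style constructor. azArg/nArg carry the tokenizer arguments. */
static int demoCreate(void *pUserData, const char **azArg, int nArg,
                      DemoTokenizer **ppOut){
  DemoTokenizer *p = malloc(sizeof(*p));
  int i;
  (void)pUserData;
  if( p==NULL ) return DEMO_NOMEM;       /* *ppOut is left untouched */
  p->bFoldCase = 1;
  for(i=0; i<nArg; i++){
    if( strcmp(azArg[i], "nofold")==0 ) p->bFoldCase = 0;
  }
  *ppOut = p;
  return DEMO_OK;
}

/* xDelete-style destructor: called once per successful create. */
static void demoDelete(DemoTokenizer *p){ free(p); }

/* xTokenize-style function. It never reads past pText[nText-1], because
** pText need not be nul-terminated. Tokens are runs of ASCII alphanumerics
** and are truncated to 64 bytes in this sketch. */
static int demoTokenize(DemoTokenizer *p, void *pCtx, int flags,
                        const char *pText, int nText, DemoTokenCb xToken){
  int i = 0;
  (void)flags;
  assert( p!=NULL && xToken!=NULL );
  while( i<nText ){
    char aBuf[64];
    int iStart, n = 0, rc;
    while( i<nText && !isalnum((unsigned char)pText[i]) ) i++;
    if( i>=nText ) break;
    iStart = i;
    while( i<nText && isalnum((unsigned char)pText[i]) ){
      if( n<(int)sizeof(aBuf) ){
        unsigned char c = (unsigned char)pText[i];
        aBuf[n++] = (char)(p->bFoldCase ? tolower(c) : c);
      }
      i++;
    }
    rc = xToken(pCtx, 0, aBuf, n, iStart, i);  /* byte offsets [iStart,i) */
    if( rc!=DEMO_OK ) return rc;               /* abandon on callback error */
  }
  return DEMO_OK;
}

/* Hypothetical collector used to observe the tokens: joins them with '|'. */
typedef struct DemoSink { char z[256]; int n; } DemoSink;

static int demoSinkToken(void *pCtx, int tflags, const char *pToken,
                         int nToken, int iStart, int iEnd){
  DemoSink *s = (DemoSink*)pCtx;
  (void)tflags; (void)iStart; (void)iEnd;
  if( s->n + nToken + 2 > (int)sizeof(s->z) ) return DEMO_ERROR;
  if( s->n>0 ) s->z[s->n++] = '|';
  memcpy(&s->z[s->n], pToken, (size_t)nToken);
  s->n += nToken;
  s->z[s->n] = '\0';
  return DEMO_OK;
}
```

In a real tokenizer the three functions would take and return the types declared in the FTS5 header, and would be supplied to FTS5 through the members of an fts5_tokenizer structure.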

SYNONYM SUPPORT

Custom tokenizers may also support synonyms. Consider a user who queries for the phrase "first place" and expects to match documents containing "1st place" as well. There are several ways to approach this in FTS5:

- By mapping all synonyms to a single token. Using the above example, this means that the tokenizer returns the same token for inputs "first" and "1st". Say that token is in fact "first", so that when the user inserts the document "I won 1st place", entries are added to the index for tokens "i", "won", "first" and "place". If the user then queries for '1st + place', the tokenizer substitutes "first" for "1st" and the query works as expected.

- By querying the index for all synonyms of each query term separately. In this case, when tokenizing query text, the tokenizer may provide multiple synonyms for a single term. FTS5 then queries the index for each synonym individually. For example, faced with the query:

      ... MATCH 'first place'

  the tokenizer offers both "1st" and "first" as synonyms for the first token in the MATCH query and FTS5 effectively runs a query similar to:

      ... MATCH '(first OR 1st) place'

  except that, for the purposes of auxiliary functions, the query still appears to contain just two phrases - "(first OR 1st)" being treated as a single phrase.

- By adding multiple synonyms for a single term to the FTS index. Using this method, when tokenizing document text, the tokenizer provides multiple synonyms for each token. So when a document such as "I won first place" is tokenized, entries are added to the FTS index for "i", "won", "first", "1st" and "place".

  This way, even if the tokenizer does not provide synonyms when tokenizing query text (it should not, since doing so would be inefficient), it doesn't matter if the user queries for 'first + place' or '1st + place', as there are entries in the FTS index corresponding to both forms of the first token.

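
The document-time approach can be sketched as follows. This is a hedged illustration under stated assumptions, not the FTS5 implementation. The synonym table, the `syn*` helpers and the `SynLog` recorder are hypothetical names made up for this sketch. FTS5_TOKEN_COLOCATED is a flag from the FTS5 header that the text above does not describe; it marks a token as occupying the same position as the previous one. The `#ifndef` fallbacks only exist so the sketch compiles without sqlite3.h.

```c
#include <string.h>

/* Fallback definitions so the sketch compiles standalone. With sqlite3.h
** included first, the header's own definitions are used instead. */
#ifndef FTS5_TOKENIZE_DOCUMENT
# define FTS5_TOKENIZE_DOCUMENT 0x0004
#endif
#ifndef FTS5_TOKEN_COLOCATED
# define FTS5_TOKEN_COLOCATED   0x0001
#endif

/* Same shape as the token callback that FTS5 passes to xTokenize. */
typedef int (*SynTokenCb)(void *pCtx, int tflags, const char *pToken,
                          int nToken, int iStart, int iEnd);

/* Hypothetical one-pair synonym table: "first" <-> "1st". */
static const char *synLookup(const char *pToken, int nToken){
  if( nToken==5 && memcmp(pToken, "first", 5)==0 ) return "1st";
  if( nToken==3 && memcmp(pToken, "1st", 3)==0 ) return "first";
  return NULL;
}

/* Emit one token. When tokenizing document text, also emit its synonym
** (if any) at the same position, so both forms land in the index. Query
** text gets no synonyms, as the text above recommends. */
static int synEmit(void *pCtx, int flags, const char *pToken, int nToken,
                   int iStart, int iEnd, SynTokenCb xToken){
  int rc = xToken(pCtx, 0, pToken, nToken, iStart, iEnd);
  if( rc==0 && (flags & FTS5_TOKENIZE_DOCUMENT) ){
    const char *zSyn = synLookup(pToken, nToken);
    if( zSyn ){
      rc = xToken(pCtx, FTS5_TOKEN_COLOCATED, zSyn, (int)strlen(zSyn),
                  iStart, iEnd);
    }
  }
  return rc;
}

/* Hypothetical recorder: counts tokens and colocated tokens seen. */
typedef struct SynLog { int nToken; int nColocated; } SynLog;

static int synRecord(void *pCtx, int tflags, const char *pToken,
                     int nToken, int iStart, int iEnd){
  SynLog *pLog = (SynLog*)pCtx;
  (void)pToken; (void)nToken; (void)iStart; (void)iEnd;
  pLog->nToken++;
  if( tflags & FTS5_TOKEN_COLOCATED ) pLog->nColocated++;
  return 0;
}
```

Calling `synEmit()` with FTS5_TOKENIZE_DOCUMENT for the token "first" reports two tokens to the callback, the second flagged as colocated. Calling it with a query flag reports only the original token.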
struct Fts5Tokenizer;