Struct Fts5Tokenizer

CUSTOM TOKENIZERS

Applications may also register custom tokenizer types. A tokenizer is registered by providing fts5 with a populated instance of the following structure. All structure methods must be defined; setting any member of the fts5_tokenizer struct to NULL leads to undefined behaviour. The structure methods are expected to function as follows:

xCreate:

This function is used to allocate and initialize a tokenizer instance. A tokenizer instance is required to actually tokenize text.

The first argument passed to this function is a copy of the (void*) pointer provided by the application when the fts5_tokenizer object was registered with FTS5 (the third argument to xCreateTokenizer()). The second and third arguments are an array of nul-terminated strings, and the number of entries in that array, containing the tokenizer arguments, if any, specified following the tokenizer name as part of the CREATE VIRTUAL TABLE statement used to create the FTS5 table.

The final argument is an output variable. If successful, (*ppOut) should be set to point to the new tokenizer handle and SQLITE_OK returned. If an error occurs, some value other than SQLITE_OK should be returned. In this case, fts5 assumes that the final value of *ppOut is undefined.

xDelete:

This function is invoked to delete a tokenizer handle previously allocated using xCreate(). Fts5 guarantees that this function will be invoked exactly once for each successful call to xCreate().

xTokenize:

This function is expected to tokenize the nText byte string indicated by argument pText. pText may or may not be nul-terminated. The first argument passed to this function is a pointer to an Fts5Tokenizer object returned by an earlier call to xCreate().

The third argument indicates the reason that FTS5 is requesting tokenization of the supplied text. This is always one of the following four values:

For each token in the input string, the supplied callback xToken() must be invoked. The first argument to it should be a copy of the pointer passed as the second argument to xTokenize(). The third and fourth arguments are a pointer to a buffer containing the token text, and the size of the token in bytes. The fifth and sixth arguments are the byte offsets, within the input, of the first byte of the text from which the token is derived and of the first byte immediately following it.

The second argument passed to the xToken() callback ("tflags") should normally be set to 0. The exception is if the tokenizer supports synonyms. In this case see the discussion below for details.

FTS5 assumes the xToken() callback is invoked for each token in the order that they occur within the input text.

If an xToken() callback returns any value other than SQLITE_OK, then the tokenization should be abandoned and the xTokenize() method should immediately return a copy of the xToken() return value. Or, if the input buffer is exhausted, xTokenize() should return SQLITE_OK. Finally, if an error occurs within the xTokenize() implementation itself, it may abandon the tokenization and return any error code other than SQLITE_OK or SQLITE_DONE.

SYNONYM SUPPORT

Custom tokenizers may also support synonyms. Consider a case in which a user wishes to query for a phrase such as "first place". Using the built-in tokenizers, the FTS5 query 'first + place' will match instances of "first place" within the document set, but not alternative forms such as "1st place". In some applications, it would be better to match all instances of "first place" or "1st place" regardless of which form the user specified in the MATCH query text.

There are several ways to approach this in FTS5:

  1. By mapping all synonyms to a single token. In the above example, this means that the tokenizer returns the same token for inputs "first" and "1st". Say that token is in fact "first", so that when the user inserts the document "I won 1st place", entries are added to the index for tokens "i", "won", "first" and "place". If the user then queries for '1st + place', the tokenizer substitutes "first" for "1st" and the query works as expected.

  2. By querying the index for all synonyms of each query term separately. In this case, when tokenizing query text, the tokenizer may provide multiple synonyms for a single term within the document. FTS5 then queries the index for each synonym individually. For example, faced with the query:

         ... MATCH 'first place'

     the tokenizer offers both "1st" and "first" as synonyms for the first token in the MATCH query and FTS5 effectively runs a query similar to:

         ... MATCH '(first OR 1st) place'

     except that, for the purposes of auxiliary functions, the query still appears to contain just two phrases, with "(first OR 1st)" being treated as a single phrase.

  3. By adding multiple synonyms for a single term to the FTS index. Using this method, when tokenizing document text, the tokenizer provides multiple synonyms for each token. So when a document such as "I won first place" is tokenized, entries are added to the FTS index for "i", "won", "first", "1st" and "place".

     This way, even if the tokenizer does not provide synonyms when tokenizing query text (it should not, as doing so would be inefficient), it doesn't matter if the user queries for 'first + place' or '1st + place', as there are entries in the FTS index corresponding to both forms of the first token.

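Method (1) can be sketched as a small lookup step that a tokenizer would run on each token before reporting it. This is only an illustration: the table contents and the function name `canonical_token` are hypothetical and are not part of FTS5.

```c
#include <string.h>

/* Hypothetical synonym table: each row maps a variant spelling to the
** single token that is stored in, and queried from, the index. */
static const char *const aSynonym[][2] = {
  { "1st", "first" },
  { "2nd", "second" },
};

/* Return the canonical form of the nToken-byte token at zToken, or NULL
** if the table has no entry for it. zToken need not be nul-terminated. */
static const char *canonical_token(const char *zToken, int nToken){
  size_t i;
  for(i=0; i<sizeof(aSynonym)/sizeof(aSynonym[0]); i++){
    const char *zFrom = aSynonym[i][0];
    if( strlen(zFrom)==(size_t)nToken
     && memcmp(zFrom, zToken, (size_t)nToken)==0
    ){
      return aSynonym[i][1];
    }
  }
  return NULL;
}
```

Because the same mapping would run on both document and query text, "1st" and "first" end up as the same index token. A partial token such as "1s" has no entry in the table, which is one way to see why a prefix query like '1s*' is hard to support with this method.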
Whether it is parsing document or query text, any call to xToken that specifies a tflags argument with the FTS5_TOKEN_COLOCATED bit set is considered to supply a synonym for the previous token. For example, when parsing the document "I won first place", a tokenizer that supports synonyms would call xToken() 5 times, as follows:

    xToken(pCtx, 0, "i", 1, 0, 1);
    xToken(pCtx, 0, "won", 3, 2, 5);
    xToken(pCtx, 0, "first", 5, 6, 11);
    xToken(pCtx, FTS5_TOKEN_COLOCATED, "1st", 3, 6, 11);
    xToken(pCtx, 0, "place", 5, 12, 17);

It is an error to specify the FTS5_TOKEN_COLOCATED flag the first time xToken() is called. Multiple synonyms may be specified for a single token by making multiple calls to xToken(FTS5_TOKEN_COLOCATED) in sequence. There is no limit to the number of synonyms that may be provided for a single token.

In many cases, method (1) above is the best approach. It does not add extra data to the FTS index or require FTS5 to query for multiple terms, so it is efficient in terms of disk space and query speed. However, it does not support prefix queries very well. If, as suggested above, the token "first" is substituted for "1st" by the tokenizer, then the query:

    ... MATCH '1s*'

will not match documents that contain the token "1st" (as the tokenizer will probably not map "1s" to any prefix of "first").

For full prefix support, method (3) may be preferred. In this case, because the index contains entries for both "first" and "1st", prefix queries such as 'fi*' or '1s*' will match correctly. However, because extra entries are added to the FTS index, this method uses more space within the database.

Method (2) offers a midpoint between (1) and (3). Using this method, a query such as '1s*' will match documents that contain the literal token "1st", but not "first" (assuming the tokenizer is not able to provide synonyms for prefixes). However, a non-prefix query like '1st' will match against "1st" and "first". This method does not require extra disk space, as no extra entries are added to the FTS index. On the other hand, it may require more CPU cycles to run MATCH queries, as separate queries of the FTS index are required for each synonym.

When using methods (2) or (3), it is important that the tokenizer only provide synonyms when tokenizing query text (method (2)) or document text (method (3)), not both. Providing them in both cases will not cause any errors, but is inefficient.
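The callback contract and method (3) can be illustrated with a minimal, self-contained sketch. It does not include the SQLite headers: `SKETCH_OK` and `SKETCH_TOKEN_COLOCATED` are stand-ins for SQLITE_OK and FTS5_TOKEN_COLOCATED, the callback type mirrors the six-argument xToken() calls shown above, and `sketch_tokenize`, `sketch_record` and `SketchLog` are hypothetical names. Case folding is omitted, so the first token is reported as "I" and not "i".

```c
#include <string.h>

#define SKETCH_OK 0                  /* stand-in for SQLITE_OK */
#define SKETCH_TOKEN_COLOCATED 0x01  /* stand-in for FTS5_TOKEN_COLOCATED */

/* Callback shape taken from the six-argument xToken() calls above. */
typedef int (*sketch_xToken)(
  void *pCtx, int tflags, const char *pToken, int nToken,
  int iStart, int iEnd
);

/* Split the nText bytes at pText on spaces and report each token in
** order. pText is never assumed to be nul-terminated. */
static int sketch_tokenize(
  void *pCtx, const char *pText, int nText, sketch_xToken xToken
){
  int i = 0;
  while( i<nText ){
    int iStart, rc;
    while( i<nText && pText[i]==' ' ) i++;   /* skip separators */
    if( i>=nText ) break;
    iStart = i;
    while( i<nText && pText[i]!=' ' ) i++;   /* i is one past the token */
    rc = xToken(pCtx, 0, &pText[iStart], i-iStart, iStart, i);
    if( rc==SKETCH_OK && (i-iStart)==5
     && memcmp(&pText[iStart], "first", 5)==0
    ){
      /* Synonym for the previous token, reported with the same offsets. */
      rc = xToken(pCtx, SKETCH_TOKEN_COLOCATED, "1st", 3, iStart, i);
    }
    if( rc!=SKETCH_OK ) return rc;           /* propagate callback errors */
  }
  return SKETCH_OK;                          /* input exhausted */
}

/* Example callback: counts calls and colocated tokens. */
typedef struct SketchLog { int nCall; int nColocated; int iLastEnd; } SketchLog;

static int sketch_record(
  void *pCtx, int tflags, const char *pToken, int nToken,
  int iStart, int iEnd
){
  SketchLog *p = (SketchLog*)pCtx;
  (void)pToken; (void)nToken; (void)iStart;
  p->nCall++;
  if( tflags & SKETCH_TOKEN_COLOCATED ) p->nColocated++;
  p->iLastEnd = iEnd;
  return SKETCH_OK;
}
```

Running `sketch_tokenize()` over the 17-byte text "I won first place" with `sketch_record` as the callback produces five callback invocations, one of them colocated, matching the call sequence listed earlier apart from case folding. A tokenizer following method (3) would emit the colocated call only when tokenizing document text.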

struct Fts5Tokenizer;
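Since the handle type is opaque, an implementation typically keeps its own state behind the pointer and casts in xCreate and xDelete. The following is a sketch under stated assumptions, not SQLite's own code: the `SKETCH_*` constants are stand-ins for the corresponding SQLite result codes, `SketchTokenizer` and its "fold" argument are hypothetical, and the C library allocator is used only to keep the example self-contained.

```c
#include <stdlib.h>
#include <string.h>

#define SKETCH_OK    0   /* stand-in for SQLITE_OK */
#define SKETCH_ERROR 1   /* stand-in for SQLITE_ERROR */
#define SKETCH_NOMEM 7   /* stand-in for SQLITE_NOMEM */

typedef struct Fts5Tokenizer Fts5Tokenizer;   /* opaque handle type */

/* Hypothetical per-instance state hidden behind the handle. */
typedef struct SketchTokenizer {
  void *pUserData;   /* pointer supplied when the tokenizer was registered */
  int bFoldCase;     /* set by the hypothetical "fold" argument */
} SketchTokenizer;

/* xCreate-style function: allocate an instance and parse its arguments. */
static int sketch_create(
  void *pUserData, const char **azArg, int nArg, Fts5Tokenizer **ppOut
){
  int i;
  SketchTokenizer *p = (SketchTokenizer*)calloc(1, sizeof(*p));
  if( p==NULL ) return SKETCH_NOMEM;
  p->pUserData = pUserData;
  for(i=0; i<nArg; i++){
    if( strcmp(azArg[i], "fold")==0 ){
      p->bFoldCase = 1;
    }else{
      free(p);
      return SKETCH_ERROR;   /* unknown argument; *ppOut is not written */
    }
  }
  *ppOut = (Fts5Tokenizer*)p;
  return SKETCH_OK;
}

/* xDelete-style function: release an instance made by sketch_create(). */
static void sketch_delete(Fts5Tokenizer *pTok){
  free(pTok);
}
```

On failure the sketch frees its allocation before returning, so the single xDelete call that follows each successful xCreate is the only other place memory is released.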