Zen: Tsearch V2 in Brief

Tsearch2 configuration: there are four tables in the system catalog.

=# \d pg_ts_cfg 
  Table "public.pg_ts_cfg"
  Column  | Type | Modifiers 
----------+------+-----------
 ts_name  | text | not null
 prs_name | text | not null
 locale   | text | 
Indexes: pg_ts_cfg_pkey primary key btree (ts_name)

	Stores tsearch2 configurations. The locale column specifies which
	configuration is used for the current locale.

=# \d pg_ts_dict
     Table "public.pg_ts_dict"
     Column      | Type | Modifiers 
-----------------+------+-----------
 dict_name       | text | not null
 dict_init       | oid  | 
 dict_initoption | text | 
 dict_lexize     | oid  | not null
 dict_comment    | text | 
Indexes: pg_ts_dict_pkey primary key btree (dict_name)

	Stores dictionaries. The dict_init field holds the Oid of the function
	that initializes the dictionary. dict_init takes one argument, the text
	value from dict_initoption, and should return the internal
	representation (structure) of the dictionary. The structure must be
	malloc'ed or palloc'ed in TopMemoryContext. dict_init is called only
	once per process.
	The dict_lexize field holds the Oid of the function that lemmatizes a
	lexeme. Input: the dictionary structure, a pointer to the string, and
	its length. Output: a pointer to an array of pointers to C strings; the
	last pointer in the array must be NULL. Returning NULL means that the
	dictionary can't resolve the word, whereas returning an empty array
	means that the dictionary knows the word but considers it a stop word.
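
The three possible results of a lexize function can be sketched in
standalone C. This is only an illustration of the return contract described
above, not the tsearch2 source: ToyDict, toy_lexize and the word lists are
hypothetical names, plain malloc stands in for the PostgreSQL memory
routines, and the function is called directly rather than through an Oid.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical dictionary structure, as a dict_init-style function might build it. */
typedef struct {
    const char **stopwords; /* NULL-terminated list of stop words */
    const char **known;     /* NULL-terminated list of (form, lemma) pairs */
} ToyDict;

/* True if the NUL-terminated entry equals the first len bytes of word. */
static int same_word(const char *entry, const char *word, size_t len) {
    return strlen(entry) == len && memcmp(entry, word, len) == 0;
}

/*
 * Result contract (following the description above):
 *   NULL            the dictionary cannot resolve the word
 *   { NULL }        the word is known but is a stop word
 *   { lemma, NULL } the word is known; lemma is its normalized form
 * Assumption of this sketch: an allocation failure is also reported as NULL.
 */
char **toy_lexize(const ToyDict *d, const char *word, size_t len) {
    assert(d != NULL && word != NULL);

    for (size_t i = 0; d->stopwords[i] != NULL; i++) {
        if (same_word(d->stopwords[i], word, len)) {
            char **res = malloc(sizeof(char *));
            if (res != NULL)
                res[0] = NULL;          /* empty array: stop word */
            return res;
        }
    }

    for (size_t i = 0; d->known[i] != NULL; i += 2) {
        if (same_word(d->known[i], word, len)) {
            const char *lemma = d->known[i + 1];
            char **res = malloc(2 * sizeof(char *));
            if (res == NULL)
                return NULL;
            res[0] = malloc(strlen(lemma) + 1);
            if (res[0] == NULL) {
                free(res);
                return NULL;
            }
            strcpy(res[0], lemma);
            res[1] = NULL;              /* terminator required by the contract */
            return res;
        }
    }

    return NULL;                        /* unknown word */
}
```

A caller distinguishes the cases by testing the returned pointer first and
then its first element.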

=# \d pg_ts_parser
   Table "public.pg_ts_parser"
    Column     | Type | Modifiers 
---------------+------+-----------
 prs_name      | text | not null
 prs_start     | oid  | not null
 prs_nexttoken | oid  | not null
 prs_end       | oid  | not null
 prs_headline  | oid  | not null
 prs_lextype   | oid  | not null
 prs_comment   | text | 
Indexes: pg_ts_parser_pkey primary key btree (prs_name)

	Stores parsers. prs_start holds the Oid of the function that
	initializes the parser. Arguments: a pointer to the string and its
	length. It returns the internal structure of the parser, which must be
	malloc'ed or palloc'ed in TopMemoryContext.
	prs_nexttoken holds the Oid of the function that returns the next
	lexeme. Input: the parser structure, a pointer to a pointer to char,
	and a pointer to int4. It returns the type of the lexeme; a type of 0
	means that the whole input has been parsed. The returned lexeme is
	stored through the last two pointers.
	prs_end holds the Oid of the function that finishes the parse session.
	Input: the parser structure.
	prs_headline holds the Oid of the function that generates a headline;
	it works on parsed text stored in an HLPRSTEXT structure (ts_cfg.h).
	Arguments: a pointer to HLPRSTEXT, a pointer to the query, and a
	pointer to the options (as the PostgreSQL text type).
	prs_lextype holds the Oid of the function that returns an array of
	LexDescr (see wparser.h) describing the lexeme types that the parser
	can return.
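
The start / nexttoken / end cycle described above can be sketched as a toy
whitespace tokenizer in standalone C. All names here (ToyParser,
toy_prs_start, toy_prs_nexttoken, toy_prs_end, TOK_WORD, TOK_SPACE) are
hypothetical, int stands in for int4, and the real functions are invoked
through the Oids stored in pg_ts_parser rather than called directly.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/* Hypothetical lexeme types; 0 is reserved for "everything is parsed". */
enum { TOK_END = 0, TOK_WORD = 1, TOK_SPACE = 2 };

/* Hypothetical internal structure of the parser. */
typedef struct {
    const char *buf; /* text being parsed (not copied) */
    int len;         /* length of the text in bytes */
    int pos;         /* offset of the next unread byte */
} ToyParser;

/* prs_start analogue: takes the string and its length, returns parser state. */
ToyParser *toy_prs_start(const char *str, int len) {
    ToyParser *p = malloc(sizeof(ToyParser));
    if (p == NULL)
        return NULL;
    p->buf = str;
    p->len = len;
    p->pos = 0;
    return p;
}

/*
 * prs_nexttoken analogue: returns the lexeme type and stores the lexeme
 * through the last two pointers. Returns 0 when the input is exhausted,
 * in which case *tok and *toklen are left untouched.
 */
int toy_prs_nexttoken(ToyParser *p, const char **tok, int *toklen) {
    assert(p != NULL && tok != NULL && toklen != NULL);
    if (p->pos >= p->len)
        return TOK_END;

    int start = p->pos;
    int space = isspace((unsigned char) p->buf[start]) != 0;

    /* Consume a run of bytes of the same class (space or non-space). */
    while (p->pos < p->len &&
           (isspace((unsigned char) p->buf[p->pos]) != 0) == space)
        p->pos++;

    *tok = p->buf + start;
    *toklen = p->pos - start;
    return space ? TOK_SPACE : TOK_WORD;
}

/* prs_end analogue: finishes the parse session and releases the state. */
void toy_prs_end(ToyParser *p) {
    free(p);
}
```

A caller would loop on toy_prs_nexttoken until it returns 0 and then call
toy_prs_end.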

=# \d pg_ts_cfgmap
  Table "public.pg_ts_cfgmap"
  Column   |  Type  | Modifiers 
-----------+--------+-----------
 ts_name   | text   | not null
 tok_alias | text   | not null
 dict_name | text[] | 
Indexes: pg_ts_cfgmap_pkey primary key btree (ts_name, tok_alias)

	Stores, for each configuration (ts_name) and lexeme type (tok_alias),
	the dictionaries (dict_name) used for lexemes of that type.
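
Because dict_name is an array, several dictionaries can be listed for one
lexeme type. The sketch below assumes they are tried in array order and that
the first dictionary returning a non-NULL result wins, which fits the
NULL-means-unresolved contract of dict_lexize but is an assumption of this
example rather than something stated above. All names (lexize_fn,
toy_stop_dict, toy_simple_dict, toy_lexize_chain) are hypothetical, and the
dictionary structure argument is omitted for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical simplified lexize signature (no dictionary structure). */
typedef const char **(*lexize_fn)(const char *word, size_t len);

/* Toy dictionary that only knows the stop word "the". */
const char **toy_stop_dict(const char *word, size_t len) {
    static const char *empty[] = { NULL };
    if (len == 3 && memcmp(word, "the", 3) == 0)
        return empty;                /* empty array: stop word */
    return NULL;                     /* unknown to this dictionary */
}

/* Toy dictionary that accepts any short word unchanged. */
const char **toy_simple_dict(const char *word, size_t len) {
    static char buf[64];
    static const char *res[2];
    if (len >= sizeof buf)
        return NULL;
    memcpy(buf, word, len);
    buf[len] = '\0';
    res[0] = buf;
    res[1] = NULL;
    return res;
}

/* Try each dictionary in order; the first non-NULL answer is final. */
const char **toy_lexize_chain(const lexize_fn *dicts, size_t ndicts,
                              const char *word, size_t len) {
    assert(dicts != NULL && word != NULL);
    for (size_t i = 0; i < ndicts; i++) {
        const char **res = dicts[i](word, len);
        if (res != NULL)
            return res;
    }
    return NULL;                     /* no dictionary resolved the word */
}
```

With the chain { toy_stop_dict, toy_simple_dict }, a stop word never reaches
the second dictionary, while any other word falls through to it.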

Limitations

	13.1 A lexeme is limited to 2048 bytes.
	13.2 ts_vector is limited to about 1Mb. The exact value depends on
		the amount of positional information. If there is no
		positional information, the sum of the lengths of the lexemes
		must be less than 1Mb; otherwise, the sum of the lengths of
		the lexemes and the positional information must be less than
		1Mb. Positional information uses 2 bytes per position and
		2 bytes per lexeme that has positional information. The
		number of lexemes is limited by 2^32, so in practice it's
		unlimited.
	13.3 ts_query:
		The number of entries (nodes, i.e. the sum of lexemes and
		operations) is limited: the internal representation is in
		Polish notation and the position of an operand is stored as
		an int2, so this is a rather soft limit.
		In any case, the lower bound of the limit is 32768 nodes.
		Notice: ts_query isn't designed for storing in a table and
		is optimized for speed, not for size.
	13.4 Positional information in ts_vector:
		13.4.1 A position value must be less than 2^14 (16384);
		       any greater value is replaced by 16383.
		13.4.2 Only 256 positions are stored per lexeme.
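
The positional limits and the size arithmetic above can be sketched in
standalone C. The names (ToyPositions, toy_add_pos, toy_entry_bytes) are
hypothetical, and dropping positions beyond the 256th is an assumption of
this sketch; the text above only says that 256 are kept.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_POS  16383 /* largest storable position, 2^14 - 1 */
#define TOY_MAX_NPOS 256   /* positions kept per lexeme */

/* Hypothetical per-lexeme position list. */
typedef struct {
    uint16_t pos[TOY_MAX_NPOS];
    int npos;
} ToyPositions;

/*
 * Record one position. Values above the limit are replaced by 16383.
 * Returns 1 if the position was stored, 0 if the list was already full
 * (assumption: extra positions are silently dropped).
 */
int toy_add_pos(ToyPositions *tp, long position) {
    assert(tp != NULL && position >= 0);
    if (tp->npos >= TOY_MAX_NPOS)
        return 0;
    tp->pos[tp->npos++] =
        (uint16_t) (position > TOY_MAX_POS ? TOY_MAX_POS : position);
    return 1;
}

/*
 * Bytes one lexeme contributes toward the ts_vector limit, using only the
 * figures quoted above: its length, plus 2 bytes per position and 2 more
 * bytes if it carries any positions. Other overhead is ignored here.
 */
size_t toy_entry_bytes(size_t lexeme_len, int npos) {
    return lexeme_len + (npos > 0 ? 2 + 2 * (size_t) npos : 0);
}
```

For example, a 5-byte lexeme with 3 positions counts as 5 + 2 + 6 = 13 bytes
in this estimate.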

Notes for programmers

	ts_vector:
		There is one unused byte per lexeme in the positional
		information (because of alignment).
	ts_query:
		There are one unused byte and one unused bit per node.

	I don't know what these bytes could be used for.
	Any ideas on how to use them are welcome!
