2008-12-16
Stable to_tsquery
result of to_tsquery() can't be used as a cache key, since ts_query() does preserve an order, which isn't good for cacheing. For example,
=# select to_tsquery('cache & query') = to_tsquery('query & cache'); ?column? ---------- f
Little function (kudos to Teodor) helps:
CREATE OR REPLACE FUNCTION stable_ts_query(tsquery) RETURNS tsquery AS $$ SELECT ts_rewrite( $1 , 'dummy_word', 'dummy_word'); $$ LANGUAGE SQL RETURNS NULL ON NULL INPUT IMMUTABLE;
Of course, you should remember about search configuraton to have really good cache key !
2008-11-17
2600 times performance jump !
Very useful but slow ts_stat() function, which gathers words statistics from tsvectors, will be surprisingly fast in 8.4. I tested dataset with geonames, total 5793013 rows with 2404197 unique names, CVS HEAD. I used query
=# select * into ts_stat2 from ts_stat('select fts from spots');
Before it takes 66405972.737 ms, now - only 25506.736 ms ! Very nice !
Enabling using internal data type in aggregate provides machinery for this win.
2008-10-04
Find documents for specific token type
Someone asked us about how to find documents, which contain emails. One possible solution is to write function
CREATE OR REPLACE FUNCTION document_token_types(text) RETURNS _text AS $$ SELECT ARRAY ( SELECT DISTINCT alias FROM ts_token_type('default') AS tt, ts_parse('default', $1) AS tp WHERE tt.tokid = tp.tokid ); $$ LANGUAGE SQL immutable; arxiv=# select document_token_types(title) from papers limit 10; document_token_types --------------------------------------------------------------- {asciihword,asciiword,blank,hword_asciipart} {asciiword,blank} {asciiword,blank} {asciiword,blank} {asciiword,blank} {asciiword,blank,float,host} {asciiword,blank} {asciihword,asciiword,blank,hword_asciipart,int,numword,uint} {asciiword,blank} {asciiword,blank} (10 rows)
Now we can create functional GIN index on document (which can be any combination of text fields) to speedup search.
create index fts_types_idx on papers using gin( document_token_types (comment) );
Find all documents with urls
explain analyze select comment from papers where document_token_types(comment) && '{url}'; Bitmap Heap Scan on papers (cost=40.80..8135.97 rows=2098 width=67) (actual time=6.811..27.679 rows=15483 loops=1) Recheck Cond: (document_token_types(comment) && '{url}'::text[]) -> Bitmap Index Scan on fts_types_idx (cost=0.00..40.28 rows=2098 width=0) (actual time=3.820..3.820 rows=15483 loops=1) Index Cond: (document_token_types(comment) && '{url}'::text[]) Total runtime: 33.124 ms
The list of available token types, supported by parser is available as
arxiv=# select * from ts_token_type('default');
2008-08-26
MapReduce and PostgreSQL
Yesterday I was informed about AsterDB (asterdata.com), also PostgreSQL based (?, rumours from Neil Conway) commercial database, which implemented MapReduce technology inside PostgreSQL. Another one, is the Greenplum database (PostgreSQL + MPP), see http://www.greenplum.com/resources/mapreduce/ for details.
A half of year ago I also thought about MapReduce as a possible way to speedup building of inverted index in the PostgreSQL.
DBMS2 has a post for MapReduce and SQL - http://www.dbms2.com/2008/08/26/why-mapreduce-matters-to-sql-data-warehousing/
2008-07-16
Aggregate for tsvector
Teodor today has come with a problem, which can be solved using aggregate for tsvector:
CREATE AGGREGATE tsvector_sum(tsvector) ( SFUNC = tsvector_concat, STYPE = tsvector, INITCOND = '' );
Below is my demonstration:
=# select tsvector_sum( t.fts) from ( select ('1 2 ' || generate_series(3,10,1))::tsvector as fts ) as t; tsvector_sum ------------------------------------------ '1' '2' '3' '4' '5' '6' '7' '8' '9' '10'
2008-07-10
pg_trgm becames faster !
As title says, pg_trgm are now faster for long strings (cache trigrams). We observed speedup about 6 times. Changes commited to CVS HEAD (8.4).
Thanks to jfg:://networks guys.
2008-06-14
Some news
It was very nice to knew that Yahoo China uses PostgreSQL and even more - it widely uses our extensions like ltree and pg_trgm. I got message from their developers about UTF-8 support (we planned to work TODO) they want to implement. We got their first patch for ltree, but Teodor pointed out several problems in their approach, so waiting for the next versions.
Another interesting message is about standalone GiST. One guy wants to wrap it as Perl module, so anybody could write custom indexes using perl. Neat ! We did that once for one company, so we can't share code and only assure guy it's doable. Hope this guy is persistent enough.
I'm diving into Cloud Computing, Cloud Databases specifically. We expect petabyte size database in the next few year from project 'Lira' and I seriously think about storing them in the Cloud. Quick calculations of expenses supports this.
2008-03-18
Capital letters in smartphones
I was so dumb, I didn't know how to input capital letters in smartphones !!! Shame ! Just press '*' and then letter. Pressing '**' works like Caps Lock.
btw, to unlock HP Ipaq 514 ( latest firmware) one need to use http://ceregeditor.mdsoft.pl/ registry editor (regeditSTG2.exe, resco 6.15, total commander will not work !). I just came through unlocking while trying run WM5Storage to export my 4Gb Kingston MicroSDHC flash to linux box.
2008-01-20
MapReduce Investigated by Database expert
Recently, David DeWitt in his Database column discussed Google's MapReduce technology, which stands behind almost all Google's products.
See also comments on HighScalability and Typical Programmer.
2007-12-20
Supernovae in empty space
Just read Cosmic explosion detonates in empty space in NewScientist about GRB 070125. We have a list of candidates to outlier supernovae Outlier Supernovae compiled at 1998 year. I wrote that time "Myth or Reality ?".