2008-12-16

Stable to_tsquery

result of to_tsquery() can't be used as a cache key, since ts_query() does preserve an order, which isn't good for cacheing. For example,

=# select to_tsquery('cache & query') = to_tsquery('query & cache');
 ?column?
----------
 f

Little function (kudos to Teodor) helps:

CREATE OR REPLACE FUNCTION stable_ts_query(tsquery)
RETURNS tsquery AS
$$
   SELECT ts_rewrite( $1 , 'dummy_word', 'dummy_word');
$$
LANGUAGE SQL RETURNS NULL ON NULL INPUT IMMUTABLE;

Of course, you should remember about search configuraton to have really good cache key !

0 Comments on this page

2008-11-17

2600 times performance jump !

Very useful but slow ts_stat() function, which gathers words statistics from tsvectors, will be surprisingly fast in 8.4. I tested dataset with geonames, total 5793013 rows with 2404197 unique names, CVS HEAD. I used query

=# select * into ts_stat2 from ts_stat('select fts from spots');

Before it takes 66405972.737 ms, now - only 25506.736 ms ! Very nice !

Enabling using internal data type in aggregate provides machinery for this win.

0 Comments on this page

2008-10-04

Find documents for specific token type

Someone asked us about how to find documents, which contain emails. One possible solution is to write function

CREATE OR REPLACE FUNCTION document_token_types(text)
RETURNS _text AS
$$

SELECT ARRAY (
    SELECT
        DISTINCT alias
    FROM
        ts_token_type('default') AS tt,
        ts_parse('default', $1) AS tp
    WHERE
        tt.tokid = tp.tokid
    );
$$ LANGUAGE SQL immutable;

arxiv=# select document_token_types(title) from papers limit 10;
                     document_token_types
---------------------------------------------------------------
 {asciihword,asciiword,blank,hword_asciipart}
 {asciiword,blank}
 {asciiword,blank}
 {asciiword,blank}
 {asciiword,blank}
 {asciiword,blank,float,host}
 {asciiword,blank}
 {asciihword,asciiword,blank,hword_asciipart,int,numword,uint}
 {asciiword,blank}
 {asciiword,blank}
(10 rows)

Now we can create functional GIN index on document (which can be any combination of text fields) to speedup search.

create index fts_types_idx on papers using gin( document_token_types (comment) );

Find all documents with urls

explain analyze select comment from papers where document_token_types(comment) && '{url}';


 Bitmap Heap Scan on papers  (cost=40.80..8135.97 rows=2098 width=67) (actual time=6.811..27.679 rows=15483 loops=1)
   Recheck Cond: (document_token_types(comment) && '{url}'::text[])
   ->  Bitmap Index Scan on fts_types_idx  (cost=0.00..40.28 rows=2098 width=0) (actual time=3.820..3.820 rows=15483 loops=1)
         Index Cond: (document_token_types(comment) && '{url}'::text[])
 Total runtime: 33.124 ms

The list of available token types, supported by parser is available as

arxiv=# select * from  ts_token_type('default');

0 Comments on this page

2008-08-26

MapReduce and PostgreSQL

Yesterday I was informed about AsterDB (asterdata.com), also PostgreSQL based (?, rumours from Neil Conway) commercial database, which implemented MapReduce technology inside PostgreSQL. Another one, is the Greenplum database (PostgreSQL + MPP), see http://www.greenplum.com/resources/mapreduce/ for details.

A half of year ago I also thought about MapReduce as a possible way to speedup building of inverted index in the PostgreSQL.

DBMS2 has a post for MapReduce and SQL - http://www.dbms2.com/2008/08/26/why-mapreduce-matters-to-sql-data-warehousing/

0 Comments on this page

2008-07-16

Aggregate for tsvector

Teodor today has come with a problem, which can be solved using aggregate for tsvector:

CREATE AGGREGATE tsvector_sum(tsvector) (
    SFUNC = tsvector_concat,
    STYPE = tsvector,
    INITCOND = ''
);

Below is my demonstration:

=# select tsvector_sum( t.fts) 
from ( 
select ('1 2 ' || generate_series(3,10,1))::tsvector as fts 
) as t;
               tsvector_sum
------------------------------------------
 '1' '2' '3' '4' '5' '6' '7' '8' '9' '10'

0 Comments on this page

2008-07-10

pg_trgm becames faster !

As title says, pg_trgm are now faster for long strings (cache trigrams). We observed speedup about 6 times. Changes commited to CVS HEAD (8.4).

Thanks to jfg:://networks guys.

0 Comments on this page

2008-06-14

Some news

It was very nice to knew that Yahoo China uses PostgreSQL and even more - it widely uses our extensions like ltree and pg_trgm. I got message from their developers about UTF-8 support (we planned to work TODO) they want to implement. We got their first patch for ltree, but Teodor pointed out several problems in their approach, so waiting for the next versions.

Another interesting message is about standalone GiST. One guy wants to wrap it as Perl module, so anybody could write custom indexes using perl. Neat ! We did that once for one company, so we can't share code and only assure guy it's doable. Hope this guy is persistent enough.

I'm diving into Cloud Computing, Cloud Databases specifically. We expect petabyte size database in the next few year from project 'Lira' and I seriously think about storing them in the Cloud. Quick calculations of expenses supports this.

0 Comments on this page

2008-03-18

Capital letters in smartphones

I was so dumb, I didn't know how to input capital letters in smartphones !!! Shame ! Just press '*' and then letter. Pressing '**' works like Caps Lock.

btw, to unlock HP Ipaq 514 ( latest firmware) one need to use http://ceregeditor.mdsoft.pl/ registry editor (regeditSTG2.exe, resco 6.15, total commander will not work !). I just came through unlocking while trying run WM5Storage to export my 4Gb Kingston MicroSDHC flash to linux box.

0 Comments on this page

2008-01-20

MapReduce Investigated by Database expert

Recently, David DeWitt in his Database column discussed Google's MapReduce technology, which stands behind almost all Google's products.

See also comments on HighScalability and Typical Programmer.

0 Comments on this page

2007-12-20

Supernovae in empty space

Just read Cosmic explosion detonates in empty space in NewScientist about GRB 070125. We have a list of candidates to outlier supernovae Outlier Supernovae compiled at 1998 year. I wrote that time "Myth or Reality ?".

0 Comments on this page

More...

Journal

Projects

Links

2008-12-16

Stable to_tsquery

2008-11-17

2600 times performance jump !

2008-10-04

Find documents for specific token type

2008-08-26

MapReduce and PostgreSQL

2008-07-16

Aggregate for tsvector

2008-07-10

pg_trgm becames faster !

2008-06-14

Some news

2008-03-18

Capital letters in smartphones

2008-01-20

MapReduce Investigated by Database expert

2007-12-20

Supernovae in empty space