Postgresql GIN索引

您所在的位置：网站首页 › gin索引compare_partial › Postgresql GIN索引

Postgresql GIN索引

#Postgresql GIN索引| 来源: 网络整理| 查看: 265

GIN概念介绍：

GIN是Generalized Inverted Index的缩写。就是所谓的倒排索引。它处理的数据类型的值不是原子的，而是由元素构成。我们称之为复合类型。如(‘hank’, ‘15:3 21:4’)中，表示hank在15:3和21:4这两个位置出现过,下面会从具体的例子更加清晰的认识GIN索引。

全文搜索

GIN的主要应用领域是加速全文搜索，所以，这里我们使用全文搜索的例子介绍一下GIN索引。

如下，建一张表,doc_tsv是文本搜索类型，可以自动排序并消除重复的元素：

postgres=# create table ts(doc text, doc_tsv tsvector); postgres=# insert into ts(doc) values ('Can a sheet slitter slit sheets?'), ('How many sheets could a sheet slitter slit?'), ('I slit a sheet, a sheet I slit.'), ('Upon a slitted sheet I sit.'), ('Whoever slit the sheets is a good sheet slitter.'), ('I am a sheet slitter.'), ('I slit sheets.'), ('I am the sleekest sheet slitter that ever slit sheets.'), ('She slits the sheet she sits on.'); postgres=# update ts set doc_tsv = to_tsvector(doc); postgres=# create index on ts using gin(doc_tsv);

postgresql tsvector 文档链接：http://www.postgres.cn/docs/9.6/datatype-textsearch.html

该GIN索引结构如下，黑色方块是TID编号，白色为单词,注意这里是单向链表，不同于B-tree的双向链表：

posgresql tid ，ctid 参考链接：

https://blog.csdn.net/weixin_34372728/article/details/90591262

https://help.aliyun.com/document_detail/181315.html

GIN索引在物理存储上包含如下内容：

1. Entry：GIN索引中的一个元素，可以认为是一个词位，也可以理解为一个key 2. Entry tree：在Entry上构建的B树 3. posting list：一个Entry出现的物理位置(heap ctid, 堆表行号)的链表 4. posting tree：在一个Entry出现的物理位置链表(heap ctid, 堆表行号)上构建的B树，所以posting tree的KEY是ctid，而entry tree的KEY是被索引的列的值 5. pending list：索引元组的临时存储链表，用于fastupdate模式的插入操作

参考链接：https://www.cnblogs.com/flying-tiger/p/6704931.html

由上可见，sheet,slit,slitter出现在多行之中，所有会有多个TID，这样就会生成一个TID列表，并为之生成一棵单独的B-tree。

以下语句可以找出多少行出现过该单词。

所以执行以下语句，可以走用到GIN索引：

--这里由于数据量较小，所以禁用全表扫描 hank=# set enable_seqscan TO off; SET hank=# explain(costs off) select doc from ts where doc_tsv @@ to_tsquery('many & slitter'); QUERY PLAN --------------------------------------------------------------------- Bitmap Heap Scan on ts Recheck Cond: (doc_tsv @@ to_tsquery('many & slitter'::text)) -> Bitmap Index Scan on ts_doc_tsv_idx Index Cond: (doc_tsv @@ to_tsquery('many & slitter'::text)) (4 rows) hank=# select amop.amopopr::regoperator, amop.amopstrategy from pg_opclass opc, pg_opfamily opf, pg_am am, pg_amop amop where opc.opcname = 'tsvector_ops' and opf.oid = opc.opcfamily and am.oid = opf.opfmethod and amop.amopfamily = opc.opcfamily and am.amname = 'gin' and amop.amoplefttype = opc.opcintype; amopopr | amopstrategy -----------------------+-------------- @@(tsvector,tsquery) | 1 matching search query @@@(tsvector,tsquery) | 2 synonym for @@ (for backward compatibility) (2 rows)

索引扫描方式参考链接： https://blog.csdn.net/qq_35260875/article/details/106084392?utm_medium=distribute.pc_relevant.none-task-blog-baidujs_title-6&spm=1001.2101.3001.4242

下图分别找mani和slittermani — (0,2).slitter — (0,1), (0,2), (1,2), (1,3), (2,2).

最后，看一下找到的相关行，并且条件是and,所以只能返回(0,2)。

| | | consistency | | | function TID | mani | slitter | slit & slitter -------+------+---------+---------------- (0,1) | f | T | f (0,2) | T | T | T (1,2) | f | T | f (1,3) | f | T | f (2,2) | f | T | f postgres=# select doc from ts where doc_tsv @@ to_tsquery('many & slitter'); doc --------------------------------------------- How many sheets could a sheet slitter slit? (1 row)

文本搜索运算符参考文档： http://www.postgres.cn/docs/9.6/functions-textsearch.html

更新缓慢

GIN索引中的数据插入或更新非常慢。因为每行通常包含许多要索引的单词元素。因此，当添加或更新一行时，我们必须大量更新索引树。另一方面，如果同时更新多个行，它们的某些单词元素可能是相同的，所以总的代价小于一行一行单独更新文档时的代价。

GIN索引具有 fastupdate 存储参数，我们可以在创建索引时指定它，并在以后更新：

postgres=# create index on ts using gin(doc_tsv) with (fastupdate = true);

fastupdate 参考链接： https://www.cnblogs.com/flying-tiger/p/6704931.html

启用此参数后，更新将累积在单独的无序列表中。当此列表足够大时或vacuum期间，所有累积的更新将立即对索引操作。这个“足够大”的列表由“ gin_pending_list_limit”配置参数或创建索引时同名的存储参数确定。

部分匹配搜索

查询包含slit打头的doc

hank=# select doc from ts where doc_tsv @@ to_tsquery('slit:*'); doc -------------------------------------------------------- Can a sheet slitter slit sheets? How many sheets could a sheet slitter slit? I slit a sheet, a sheet I slit. Upon a slitted sheet I sit. Whoever slit the sheets is a good sheet slitter. I am a sheet slitter. I slit sheets. I am the sleekest sheet slitter that ever slit sheets. She slits the sheet she sits on. (9 rows)

同样可以使用索引加速：

postgres=# explain (costs off) select doc from ts where doc_tsv @@ to_tsquery('slit:*'); QUERY PLAN ------------------------------------------------------------- Bitmap Heap Scan on ts Recheck Cond: (doc_tsv @@ to_tsquery('slit:*'::text)) -> Bitmap Index Scan on ts_doc_tsv_idx Index Cond: (doc_tsv @@ to_tsquery('slit:*'::text)) (4 rows)

频翻和不频繁

制造一些数据,下载地址： http://oc.postgrespro.ru/index.php/s/fRxTZ0sVfPZzbmd/download

fts=# alter table mail_messages add column tsv tsvector; fts=# update mail_messages set tsv = to_tsvector(body_plain); fts=# create index on mail_messages using gin(tsv); --这里不使用unnest统计单词出现在行的次数，因为数据量比较大，我们使用ts_stat函数来进行计算 fts=# select word, ndoc from ts_stat('select tsv from mail_messages') order by ndoc desc limit 3; word | ndoc -------+-------- re | 322141 wrote | 231174 use | 176917 (3 rows)

例如我们查询邮件信息里很少出现的单词，如“tattoo”：

fts=# select word, ndoc from ts_stat('select tsv from mail_messages') where word = 'tattoo'; word | ndoc --------+------ tattoo | 2 (1 row)

两个单词同一行出现的次数，wrote和tattoo同时出现的行只有一行

fts=# select count(*) from mail_messages where tsv @@ to_tsquery('wrote & tattoo'); count ------- 1 (1 row)

我们来看看是如何执行的，如上所述，如果我们要获得两个词的TID列表，则搜索效率显然很低下：因为将必须遍历20多万个值，而只取一个值。但是通过统计信息，该算法可以了解到“wrote”经常出现，而“ tattoo”则很少出现。因此，将执行不经常使用的词的搜索，然后从检索到的两行中检查是否存在“wrote”。这样就可以快速得出查询结果：

fts=# \timing on fts=# select count(*) from mail_messages where tsv @@ to_tsquery('wrote & tattoo'); count ------- 1 (1 row) Time: 0,959 ms

查询wrote将话费更长的时间

fts=# select count(*) from mail_messages where tsv @@ to_tsquery('wrote'); count -------- 231174 (1 row) Time: 2875,543 ms (00:02,876)

这种优化当然不只是两个单词元素搜索有效，其他更复杂的搜索也有效。

限制查询结果

GIN的一个特点是，结果总是以位图的形式返回：该方法不能按TID返回所需数据的TID。因此，本文中的所有查询计划都使用位图扫描。

因此，使用LIMIT子句限制索引扫描结果的效率不是很高。注意操作的预计成本（“limit”节点的“cost”字段）：

fts=# explain (costs off) select * from mail_messages where tsv @@ to_tsquery('wrote') limit 1; QUERY PLAN ----------------------------------------------------------------------------------------- Limit (cost=1283.61..1285.13 rows=1) -> Bitmap Heap Scan on mail_messages (cost=1283.61..209975.49 rows=137207) Recheck Cond: (tsv @@ to_tsquery('wrote'::text)) -> Bitmap Index Scan on mail_messages_tsv_idx (cost=0.00..1249.30 rows=137207) Index Cond: (tsv @@ to_tsquery('wrote'::text)) (5 rows)

估计成本为1285.13，比构建整个位图1249.30的成本（“Bitmap Index Scan”节点的“cost”字段）稍大。

因此，索引具有限制结果数量的功能。该阈值gin_fuzzy_search_limit配置参数中指定，并且默认情况下等于零（没有限制）。但是我们可以设置阈值：

fts=# set gin_fuzzy_search_limit = 1000; fts=# select count(*) from mail_messages where tsv @@ to_tsquery('wrote'); count ------- 5746 (1 row) fts=# set gin_fuzzy_search_limit = 10000; fts=# select count(*) from mail_messages where tsv @@ to_tsquery('wrote'); count ------- 14726 (1 row)

我们可以看到，查询返回的行数对于不同的参数值是不同的（如果使用索引访问）。限制并不严格：可以返回多于指定行的行，这证明参数名称的“模糊”部分是合理的。

GIN索引比较小，不会占用太多空间。首先，如果在多行中出现相同的单词，则它仅在索引中存储一次。其次，TID以有序的方式存储在索引中，这使我们能够使用一种简单的压缩方式：列表中的下一个TID实际上与上一个TID是不同的；这个数字通常很小，与完整的六字节TID相比，所需的位数要小得多。

为了了解其大小，我们从消息文本构建B树：

GIN建立在不同的数据类型（“ tsvector”而不是“ text”）上，该数据类型较小同时，B树的消息大小必须缩短到大约2 KB。 fts=# create index mail_messages_btree on mail_messages(substring(body_plain for 2048));

创建一个gist索引：

fts=# create index mail_messages_gist on mail_messages using gist(tsv);

分别看一下gin,gist,btree的大小：

fts=# select pg_size_pretty(pg_relation_size('mail_messages_tsv_idx')) as gin, pg_size_pretty(pg_relation_size('mail_messages_gist')) as gist, pg_size_pretty(pg_relation_size('mail_messages_btree')) as btree; gin | gist | btree --------+--------+-------- 179 MB | 125 MB | 546 MB (1 row)

由于GIN索引更节省空间，我们从Oracle迁移到postgresql过程中可以使用GIN索引来代替位图索引。通常，位图索引用于唯一值很少的字段，这对于GIN也是非常有效的。而且，PostgreSQL可以基于任何索引（包括GIN）动态构建位图。

使用GiST还是GIN一般来说，GIN在准确性和搜索速度上均胜过GiST。如果数据更新不频繁并且需要快速搜索，则可以选择GIN。另一方面，如果对数据进行密集更新，则更新GIN的开销成本可能太大。在这种情况下，我们将不得不比较这两种索引，并选择其相关特征更适合的索引。

数组

使用GIN的另一个示例是数组的索引。在这种情况下，数组元素进入索引，这可以加快对数组的许多操作：

postgres=# select amop.amopopr::regoperator, amop.amopstrategy from pg_opclass opc, pg_opfamily opf, pg_am am, pg_amop amop where opc.opcname = 'array_ops' and opf.oid = opc.opcfamily and am.oid = opf.opfmethod and amop.amopfamily = opc.opcfamily and am.amname = 'gin' and amop.amoplefttype = opc.opcintype; amopopr | amopstrategy -----------------------+-------------- &&(anyarray,anyarray) | 1 intersection @>(anyarray,anyarray) | 2 contains array Bitmap Index Scan on routes_t_days_of_week_idx Index Cond: (days_of_week = '{2,4,7}'::integer[]) (4 rows)

可以看到出现六趟航班：

demo=# select flight_no, departure_airport_name, arrival_airport_name, days_of_week from routes_t where days_of_week = ARRAY[2,4,7]; flight_no | departure_airport_name | arrival_airport_name | days_of_week -----------+------------------------+----------------------+-------------- PG0005 | Domodedovo | Pskov | {2,4,7} PG0049 | Vnukovo | Gelendzhik | {2,4,7} PG0113 | Naryan-Mar | Domodedovo | {2,4,7} PG0249 | Domodedovo | Gelendzhik | {2,4,7} PG0449 | Stavropol | Vnukovo | {2,4,7} PG0540 | Barnaul | Vnukovo | {2,4,7} (6 rows)

该查询的执行步骤分析：

首先从数组中取出元素2，4，7、在元素树中，找到提取的键，并为每个键选择TID列表在找到的TID中，从中选择与运算符匹配的TID。对于=运算符，只有那些TID匹配出现在所有三个列表中的TID（换句话说，初始数组必须包含所有元素）。但这还不够：数组还需要不包含任何其他值，并且我们无法使用索引检查此条件。因此，在这种情况下，访问方法要求索引引擎重新检查与表一起返回的所有TID。

但是有些策略（例如，“包含在数组中”）无法检查任何内容，而必须重新检查在表中找到的所有TID。（原博主的这个“例如”我没看懂。。。）

但是，如果我们需要知道周二，周四和周日从莫斯科起飞的航班怎么办？索引不支持附加条件，该条件将进入“filter”。

demo=# explain (costs off) select * from routes_t where days_of_week = ARRAY[2,4,7] and departure_city = 'Moscow'; QUERY PLAN ----------------------------------------------------------- Bitmap Heap Scan on routes_t Recheck Cond: (days_of_week = '{2,4,7}'::integer[]) Filter: (departure_city = 'Moscow'::text) -> Bitmap Index Scan on routes_t_days_of_week_idx Index Cond: (days_of_week = '{2,4,7}'::integer[]) (5 rows)

在这里可以（索引只选择六行），但是在增加了其他条件选择能力的情况下，我们希望也同样支持。但是，我们不能直接创建联合索引：

demo=# create index on routes_t using gin(days_of_week,departure_city); ERROR: data type text has no default operator class for access method "gin" HINT: You must specify an operator class for the index or define a default operator class for the data type.

这个时候可以使用btree_gin来帮助我们,它添加了GIN运算符来模拟常规B树来工作

demo=# create extension btree_gin; demo=# create index on routes_t using gin(days_of_week,departure_city); demo=# explain (costs off) select * from routes_t where days_of_week = ARRAY[2,4,7] and departure_city = 'Moscow'; QUERY PLAN --------------------------------------------------------------------- Bitmap Heap Scan on routes_t Recheck Cond: ((days_of_week = '{2,4,7}'::integer[]) AND (departure_city = 'Moscow'::text)) -> Bitmap Index Scan on routes_t_days_of_week_departure_city_idx Index Cond: ((days_of_week = '{2,4,7}'::integer[]) AND (departure_city = 'Moscow'::text)) (4 rows)

JSONB

内置GIN支持的复合数据类型的另一个示例是JSON。为了使用JSON值，PG定义了许多运算符和函数，其中一些可以使用索引加快访问速度：

postgres=# select opc.opcname, amop.amopopr::regoperator, amop.amopstrategy as str from pg_opclass opc, pg_opfamily opf, pg_am am, pg_amop amop where opc.opcname in ('jsonb_ops','jsonb_path_ops') and opf.oid = opc.opcfamily and am.oid = opf.opfmethod and amop.amopfamily = opc.opcfamily and am.amname = 'gin' and amop.amoplefttype = opc.opcintype; opcname | amopopr | str ----------------+------------------+----- jsonb_ops | ?(jsonb,text) | 9 top-level key exists jsonb_ops | ?|(jsonb,text[]) | 10 some top-level key exists jsonb_ops | ?&(jsonb,text[]) | 11 all top-level keys exist jsonb_ops | @>(jsonb,jsonb) | 7 JSON value is at top level jsonb_path_ops | @>(jsonb,jsonb) | 7 (5 rows)

可见有两类运算符jsonb_ops和jsonb_path_ops。默认情况下，使用第一个运算符jsonb_ops。所有的键，值和数组元素都将作为初始JSON文档的元素到达索引。属性将会添加到每个元素中，指定该元素是否为键（“存在”策略需要此属性，以区分键和值）。

demo=# create table routes_jsonb as select to_jsonb(t) route from ( select departure_airport_name, arrival_airport_name, days_of_week from routes order by flight_no limit 4 ) t; demo=# select ctid, jsonb_pretty(route) from routes_jsonb; ctid | jsonb_pretty -------+------------------------------------------------- (0,1) | { + | "days_of_week": [ + | 1 + | ], + | "arrival_airport_name": "Surgut", + | "departure_airport_name": "Ust-Ilimsk" + | } (0,2) | { + | "days_of_week": [ + | 2 + | ], + | "arrival_airport_name": "Ust-Ilimsk", + | "departure_airport_name": "Surgut" + | } (0,3) | { + | "days_of_week": [ + | 1, + | 4 + | ], + | "arrival_airport_name": "Sochi", + | "departure_airport_name": "Ivanovo-Yuzhnyi"+ | } (0,4) | { + | "days_of_week": [ + | 2, + | 5 + | ], + | "arrival_airport_name": "Ivanovo-Yuzhnyi", + | "departure_airport_name": "Sochi" + | } (4 rows) demo=# create index on routes_jsonb using gin(route);

索引结构如下：

以下示例可以使用索引：

demo=# explain (costs off) select jsonb_pretty(route) from routes_jsonb where route @> '{"days_of_week": [5]}'; QUERY PLAN --------------------------------------------------------------- Bitmap Heap Scan on routes_jsonb Recheck Cond: (route @> '{"days_of_week": [5]}'::jsonb) -> Bitmap Index Scan on routes_jsonb_route_idx Index Cond: (route @> '{"days_of_week": [5]}'::jsonb) (4 rows)

从JSON文档的根位置开始，@> 运算符检查是否发生了指定的路由（“ days_of_week”：[5]）。以下查询将返回一行：

jsonb 运算符文档：http://postgres.cn/docs/9.6/functions-json.html

demo=# select jsonb_pretty(route) from routes_jsonb where route @> '{"days_of_week": [5]}'; jsonb_pretty ------------------------------------------------ { + "days_of_week": [ + 2, + 5 + ], + "arrival_airport_name": "Ivanovo-Yuzhnyi",+ "departure_airport_name": "Sochi" + } (1 row)

这个查询执行过程如下：

搜索查询（“ days_of_week”：[5]）提取元素（搜索关键字）：«days_of_week»和«5»。在元素的树中找到提取的键，并为每个键选择TID列表：对于5对应的TID为（0,4），对于days_of_week对应的TID为（0,1），（0,2 ），（0,3），（0,4）。在已经找到的TID中，一致性函数从查询中选择与运算符匹配的TID。对于@>运算符，肯定不能包含不包含搜索查询中所有元素的文档，因此仅保留（0,4）。但是，我们仍然需要重新检查保留的TID，因为从索引中无法清楚找到的元素在JSON文档中的出现顺序。内部结构

使用 pageinspect 查看内部情况：

fts=# create extension pageinspect;

meta页面显示了常规的统计信息：

fts=# select * from gin_metapage_info(get_raw_page('mail_messages_tsv_idx',0));

页面的结构提供了一个特殊的区域，这个区域存放了访问方法的存储信息。对于普通的程序，如vacuum，则该区域“不透明”。 gin_page_opaque_info 函数可以显示GIN数据。如，我们可以了解索引页面的集合：

fts=# select flags, count(*) from generate_series(1,22967) as g(id), -- n_total_pages gin_page_opaque_info(get_raw_page('mail_messages_tsv_idx',g.id)) group by flags; flags | count ------------------------+------- {meta} | 1 meta page {} | 133 internal page of element B-tree {leaf} | 13618 leaf page of element B-tree {data} | 1497 internal page of TID B-tree {data,leaf,compressed} | 7719 leaf page of TID B-tree (5 rows) fts=# select flags, count(*) from generate_series(1,22967) as g(id), -- n_total_pages gin_page_opaque_info(get_raw_page('mail_messages_tsv_idx',g.id)) group by flags; flags | count ------------------------+------- {meta} | 1 meta page {} | 133 internal page of element B-tree {leaf} | 13618 leaf page of element B-tree {data} | 1497 internal page of TID B-tree {data,leaf,compressed} | 7719 leaf page of TID B-tree (5 rows)

gin_leafpage_items 函数可以展示页面（data,leaf,compressed）上TID的信息：

fts=# select * from gin_leafpage_items(get_raw_page('mail_messages_tsv_idx',2672)); -[ RECORD 1 ]--------------------------------------------------------------------- first_tid | (239,44) nbytes | 248 tids | {"(239,44)","(239,47)","(239,48)","(239,50)","(239,52)","(240,3)",... -[ RECORD 2 ]--------------------------------------------------------------------- first_tid | (247,40) nbytes | 248 tids | {"(247,40)","(247,41)","(247,44)","(247,45)","(247,46)","(248,2)",... ...

GIN的一些属性

GIN的访问方法如下：

这个可以看到，GIN支持创建多列索引。但是，与常规B树不同，多列索引仍将存储单个元素，而不是复合键，并且会为每个元素指示列号。

索引层面的属性：

注意，不支持按TID（索引扫描）返回结果的TID；因为GIN只能进行位图扫描。（这句话没太明白。。。）

以下是列层面的属性：

其他的一些插件也支持类似GIN的功能：

如pg_trgm可以模糊匹配。它支持各种运算符，包括通过LIKE和正则表达式进行比较。我们可以使用此插件和GIN配合使用。 btree_gin上面介绍过，可以支持GIN创建多列复合索引。

参考链接：

https://blog.csdn.net/dazuiba008/article/details/103985791

【本文地址】

公司简介

联系我们