PostgreSQL索引(7) – GIN – 数据库内核研究

您所在的位置：网站首页 › gin索引创建比较慢 › PostgreSQL索引(7) – GIN – 数据库内核研究

PostgreSQL索引(7) – GIN – 数据库内核研究

2023-10-04 22:16| 来源: 网络整理| 查看: 265

本文翻译自Egor Rogov的英文博客，且已征得Egor Rogov的同意。译者做了少量修改。

审校：纪昀红，中国人民大学信息学院在读博士生。

在前面几篇文章中，我们讨论了通用索引引擎、AM的接口和四种AM（Hash索引、B-tree索引、GiST索引、SP-GiST索引）。这篇文章将介绍GIN索引。

GIN是Generalized Inverted Index的缩写，就是所谓的倒排索引。它所操作的数据类型的值都不是原子的，而是由多个元素组成。我们把这些类型称为复合类型。我们不是对原始的复合类型建立索引，而是对复合类型中的单个元素建立索引。每个元素指向复合类型的值。

它像一本书结尾部分的“词汇出现列表”，列表列出每个单词在书中出现的页码。GIN索引必须保证可以快速检索被索引的元素。因此，这些元素被存在类似B-tree的结构中（使用了更简单的结构）。一个有序的集合指向基表的行，行中包含复合元素的值。有序，对数据检索来说并不重要，但是对索引的内部存储方式非常重要。

GIN索引永远不会删除它索引的元素。复合类型的值可能会被删除、更改，但是所有元素的集合可能变化并不大。这个方法大大简化了GIN索引的并发控制。

如果TID列表很小，可以和元素放在一个页面内（被称为posting list）。但是，如果TID列表很大，就需要使用更高效的数据结构 —— B-tree。这颗B-tree存储在单独的页面中，被称为posting tree。

所以，GIN索引包含一个B-tree，它存储了元素。还包含另外一些B-tree或列表，它们用来存储TID列表。

和之前讨论的GiST、SP-GiST一样，GIN也为开发人员提供了一个接口，用来支持符合类型上的各种操作。

1. 全文检索

GIN索引的主要用途就是加速全文检索。

我们在介绍GiST索引时已经对全文检索进行了简要介绍，下面直入正题。这种场景下，复合类型就是文档，元素就是文档中的词素。

考虑GiST索引中的例子：

create table ts(doc text, doc_tsv tsvector); insert into ts(doc) values ('Can a sheet slitter slit sheets?'), ('How many sheets could a sheet slitter slit?'), ('I slit a sheet, a sheet I slit.'), ('Upon a slitted sheet I sit.'), ('Whoever slit the sheets is a good sheet slitter.'), ('I am a sheet slitter.'), ('I slit sheets.'), ('I am the sleekest sheet slitter that ever slit sheets.'), ('She slits the sheet she sits on.'); update ts set doc_tsv = to_tsvector(doc); create index on ts using gin(doc_tsv);

GIN索引树的结构可能如下所示：

和之前的图片不同，TID使用黑色背景的值（页面号，页面内偏移）表示。

在上面的例子中，除了sheet、slit和slitter之外，其它元素所指向的TID可以存放在常规的页面中。它们三个在太多文档中出现过，因此它们所指向的TID列表都存储在单独的B-tree中。

顺便提一句，我们如何知道一个词素被多少文档包含呢？对小表来说，可以直接用下面的查询，但是对大表，需要用其它更高效的方法。

值得注意的是，和常规的B-tree不同，GIN索引中的页面使用的是单向的指针，而不是双向指针。这已经足够了，因为只会对它进行前向遍历。

1.1 查询示例

下面的查询如何执行呢？

explain(costs off) select doc from ts where doc_tsv @@ to_tsquery('many & slitter'); QUERY PLAN --------------------------------------------------------------------- Bitmap Heap Scan on ts Recheck Cond: (doc_tsv @@ to_tsquery('many & slitter'::text)) -> Bitmap Index Scan on ts_doc_tsv_idx Index Cond: (doc_tsv @@ to_tsquery('many & slitter'::text)) (4 rows)

首先从查询中抽取出每个词素（查询key）：mani和slitter。这由特殊的API完成，它会考虑数据类型和操作符的策略：

select amop.amopopr::regoperator, amop.amopstrategy from pg_opclass opc, pg_opfamily opf, pg_am am, pg_amop amop where opc.opcname = 'tsvector_ops' and opf.oid = opc.opcfamily and am.oid = opf.opfmethod and amop.amopfamily = opc.opcfamily and am.amname = 'gin' and amop.amoplefttype = opc.opcintype; amopopr | amopstrategy -----------------------+-------------- @@(tsvector,tsquery) | 1 matching search query @@@(tsvector,tsquery) | 2 synonym for @@ (for backward compatibility) (2 rows)

在包含词素的B-tree中，找出两个key对应的TID列表：

mani：(0,2) slitter：(0,1), (0,2), (1,2), (1,3), (2,2)

最后，对每个找出的TID，依次调用一致性函数。这个一致性函数可以确定TID指向的行是否满足查询条件。因为查询中的词素用and连接，因此返回的行仅仅是(0,2)。

| | | consistency | | | function TID | mani | slitter | slit & slitter -------+------+---------+---------------- (0,1) | f | T | f (0,2) | T | T | T (1,2) | f | T | f (1,3) | f | T | f (2,2) | f | T | f

结果是：

select doc from ts where doc_tsv @@ to_tsquery('many & slitter'); doc --------------------------------------------- How many sheets could a sheet slitter slit? (1 row)

如果把GIN和GiST进行对比，GIN在全文检索方面的优势显而易见。但是，GIN索引也有缺陷。

1.2 更新速度慢的问题

GIN索引的其中一个问题是：插入和更新很慢。一篇文章通常包含很多需要被索引的词素。因此，即使仅仅添加或更新一篇文章，也需要对索引进行大量更新。

在另一方面，如果多个文档被同时更新，它们的某些词素可能相同，对它们进行批量更新的总工作量会比依次更新每个文档工作量的和要小很多。

GIN索引有一个参数 —— fastupdate，可以在创建索引时指定：

create index on ts using gin(doc_tsv) with (fastupdate = true);

打开这个参数时，更新会被延迟，这些更新会先被存储在一个单独的无序列表中（在单独的连续页面中）。当这个列表足够大时，或者做vacuum时，这些更新会被同时更新到索引中。多大是足够大呢？这由gin_pending_list_limit参数控制，这个参数也可以在创建索引时指定。

但是这个方法也有缺点：首先，查询会变慢（除了查找tree，还要查找这个无序列表）；其次，下一次更新可能会耗费大量的时间（触发了批量更新）。

1.3 部分匹配

我们可以在全文检索中使用部分匹配，例如：

select doc from ts where doc_tsv @@ to_tsquery('slit:*'); doc -------------------------------------------------------- Can a sheet slitter slit sheets? How many sheets could a sheet slitter slit? I slit a sheet, a sheet I slit. Upon a slitted sheet I sit. Whoever slit the sheets is a good sheet slitter. I am a sheet slitter. I slit sheets. I am the sleekest sheet slitter that ever slit sheets. She slits the sheet she sits on. (9 rows)

这个查询将找出所有以slit开头的词素，在这个例子中，就是slit和slitter

这个查询，即使没有索引，查询可以正常工作，但是GIN索引可以加速查询：

explain (costs off) select doc from ts where doc_tsv @@ to_tsquery('slit:*'); QUERY PLAN ------------------------------------------------------------- Bitmap Heap Scan on ts Recheck Cond: (doc_tsv @@ to_tsquery('slit:*'::text)) -> Bitmap Index Scan on ts_doc_tsv_idx Index Cond: (doc_tsv @@ to_tsquery('slit:*'::text)) (4 rows)

这个例子中，会首先查找所有以slit为前缀的词素，然后使用OR连接起来。

1.4 词素的频繁程度

我们继续使用pgsql-hacker邮件列表的真实数据集进行研究，这个数据集在GiST索引中介绍过。它包含356125条消息，每条消息包含发送日期、主题、作者和全文。

alter table mail_messages add column tsv tsvector; update mail_messages set tsv = to_tsvector(body_plain); NOTICE: word is too long to be indexed DETAIL: Words longer than 2047 characters are ignored. ... UPDATE 356125 create index on mail_messages using gin(tsv);

首先，找出出现频率比较高的词素。此时不能使用unnest，因为数据量太大，我们需要使用ts_stat函数，它提供了很多信息，如在多少文档中出现过、总的出现次数等。

select word, ndoc from ts_stat('select tsv from mail_messages') order by ndoc desc limit 3; word | ndoc -------+-------- re | 322141 wrote | 231174 use | 176917 (3 rows)

我们使用wrote这个词。然后再选一个出现不频繁的词，tattoo：

select word, ndoc from ts_stat('select tsv from mail_messages') where word = 'tattoo'; word | ndoc --------+------ tattoo | 2 (1 row)

有文档同时包含这两个词素吗？看起来，好像有：

select count(*) from mail_messages where tsv @@ to_tsquery('wrote & tattoo'); count ------- 1 (1 row)

这个查询是如何执行的呢？如果我们首先拿到这两个词素对应的TID列表，这会非常低效，因为需要去处理20万个TID，最终只有一个符合条件。幸运的是，优化器借助统计信息，可以知道wrote出的频率很高，tatoo出现频率较低。因此，Query首先查找不频繁的词素，拿到两个文档，然后再检查文档中是否有wrote。这种方法会高效很多：

\timing on select count(*) from mail_messages where tsv @@ to_tsquery('wrote & tattoo'); count ------- 1 (1 row) Time: 0,959 ms

查找wrote本身耗时就会很长：

select count(*) from mail_messages where tsv @@ to_tsquery('wrote'); count -------- 231174 (1 row) Time: 2875,543 ms (00:02,876)

优化器当然不只能处理两个词素，还能处理更复杂的场景。

1.5 Limit子句

GIN索引的一个特点是：它只能返回bitmap，不能依次返回每个TID。所以，本文中所有的计划都使用了bitmap scan。

因此，使用LIMIT子句时效率不高。注意谓词操作的代价（LIMIT算子的cost字段）：

explain (costs off) select * from mail_messages where tsv @@ to_tsquery('wrote') limit 1; QUERY PLAN ----------------------------------------------------------------------------------------- Limit (cost=1283.61..1285.13 rows=1) -> Bitmap Heap Scan on mail_messages (cost=1283.61..209975.49 rows=137207) Recheck Cond: (tsv @@ to_tsquery('wrote'::text)) -> Bitmap Index Scan on mail_messages_tsv_idx (cost=0.00..1249.30 rows=137207) Index Cond: (tsv @@ to_tsquery('wrote'::text)) (5 rows)

这个查询的总代价为1285.13，只比建立位图的代价（1249.30）稍微大一点。

因此，GIN索引有一个特殊的能力来限制查询结果的数量。这个阈值使用gin_fuzzy_search_limit参数控制，它的默认值为0（无限制）。我们可以手工设置这个值：

set gin_fuzzy_search_limit = 1000; select count(*) from mail_messages where tsv @@ to_tsquery('wrote'); count ------- 5746 (1 row) set gin_fuzzy_search_limit = 10000; select count(*) from mail_messages where tsv @@ to_tsquery('wrote'); count ------- 14726 (1 row)

可见，参数不同，使用GIN时返回的结果数量不同。这个限制不十分精确：返回的结果行数稍多一些。这就是fuzzy的意思。

1.6 压缩方式

GIN的其中一个优势是，它具有压缩的特性。首先，如果一个词素在多个文档中出现（通常是这样的），它只会被存储一次。其次，TID在索引中按顺序存储，这可以使得我们使用一个简单的压缩算法：只存储每个TID与前一个TID的差异；这通常来说是一个比较小的数，占用的空间数少于TID的6字节。

为了直观感受一下体积，我们对上面的邮件列表建立B-tree索引。当然，绝对的公平是不可能的： GIN把索引建立在其它数据类型上（tsvector，而不是text），体积会小一些；另一方面，B-tree会把消息截断到2K以内。

create index mail_messages_btree on mail_messages(substring(body_plain for 2048)); create index mail_messages_gist on mail_messages using gist(tsv); select pg_size_pretty(pg_relation_size('mail_messages_tsv_idx')) as gin, pg_size_pretty(pg_relation_size('mail_messages_gist')) as gist, pg_size_pretty(pg_relation_size('mail_messages_btree')) as btree; gin | gist | btree --------+--------+-------- 179 MB | 125 MB | 546 MB (1 row)

因为GIN索引的压缩特性，当迁移Oracle的位图索引时，可以使用GIN索引（此处不详细展开，可以参见Lewis的博客）。Oracle的位图用在唯一值较少的列上，这也很适合GIN。PG可以为任何索引生成位图，包括GIN。

1.7 使用GiST还是GIN呢？

对很多数据类型来说，操作符类可以同时适用于GiST和GIN，我们应该如何选择呢？

作为一个原则，GIN的精度和速度更高。如果更新不频繁，又对查找速度有要求，GIN是其中一个选择。

另一方面，如果数据更新频繁，更新GIN索引的代价就会很大，这种场景下，我们需要比较二者，选择更适合的应用场景的那个。

2. 数组类型

另一个使用GIN索引的例子是数组。这种场景下，对数组元素建立索引，可以加速数组上的一些操作：

select amop.amopopr::regoperator, amop.amopstrategy from pg_opclass opc, pg_opfamily opf, pg_am am, pg_amop amop where opc.opcname = 'array_ops' and opf.oid = opc.opcfamily and am.oid = opf.opfmethod and amop.amopfamily = opc.opcfamily and am.amname = 'gin' and amop.amoplefttype = opc.opcintype; amopopr | amopstrategy -----------------------+-------------- &&(anyarray,anyarray) | 1 intersection @>(anyarray,anyarray) | 2 contains array Bitmap Index Scan on routes_t_days_of_week_idx Index Cond: (days_of_week = '{2,4,7}'::integer[]) (4 rows)

共有6个航班：

select flight_no, departure_airport_name, arrival_airport_name, days_of_week from routes_t where days_of_week = ARRAY[2,4,7]; flight_no | departure_airport_name | arrival_airport_name | days_of_week -----------+------------------------+----------------------+-------------- PG0005 | Domodedovo | Pskov | {2,4,7} PG0049 | Vnukovo | Gelendzhik | {2,4,7} PG0113 | Naryan-Mar | Domodedovo | {2,4,7} PG0249 | Domodedovo | Gelendzhik | {2,4,7} PG0449 | Stavropol | Vnukovo | {2,4,7} PG0540 | Barnaul | Vnukovo | {2,4,7} (6 rows)

这个查询是如何执行的呢？和上面的描述相同：

首先从条件中的数组中，找出元素，即：2、4、7。

在包含元素的树中，找出每个key，对每个key找出他们的TID。对所有找到的TID，使用一致性函数选出符合条件的元素。对等号来说，只有同时包含这三个key，并且不包含其它key才行。索引扫描无法使用“不等”条件进行扫描，因此GIN索引需要要求通用索引引擎重新检查它返回的TID是否满足条件。

有意思的是，有一些策略（比如，被数组包含）无法检查任何东西，通用搜索引擎需要把它返回的所有TID都进行重新检查。

如果我们要找周二、周四、周日从Moscow起飞的航班呢？GIN索引只能支持一个条件，另外一个条件，就称为过滤条件（Filter字段）。

explain (costs off) select * from routes_t where days_of_week = ARRAY[2,4,7] and departure_city = 'Moscow'; QUERY PLAN ----------------------------------------------------------- Bitmap Heap Scan on routes_t Recheck Cond: (days_of_week = '{2,4,7}'::integer[]) Filter: (departure_city = 'Moscow'::text) -> Bitmap Index Scan on routes_t_days_of_week_idx Index Cond: (days_of_week = '{2,4,7}'::integer[]) (5 rows)

上面的例子中，只有6行结果，使用这种方法可行，如果条件的选择率很高，返回很多结果时，就需要把过滤条件下推到扫描扫描中，成为索引扫描的扫描条件。但是，我们不能创建这样的索引：

create index on routes_t using gin(days_of_week,departure_city); ERROR: data type text has no default operator class for access method "gin" HINT: You must specify an operator class for the index or define a default operator class for the data type.

btree_gin可以提供这种能力，它增加了一些操作符类模拟B-tree的功能。

create extension btree_gin; create index on routes_t using gin(days_of_week,departure_city); explain (costs off) select * from routes_t where days_of_week = ARRAY[2,4,7] and departure_city = 'Moscow'; QUERY PLAN --------------------------------------------------------------------- Bitmap Heap Scan on routes_t Recheck Cond: ((days_of_week = '{2,4,7}'::integer[]) AND (departure_city = 'Moscow'::text)) -> Bitmap Index Scan on routes_t_days_of_week_departure_city_idx Index Cond: ((days_of_week = '{2,4,7}'::integer[]) AND (departure_city = 'Moscow'::text)) (4 rows) 3. JSONB类型

GIN本身支持JSON类型。为了支持JSON值，它实现了一系列操作符和函数，用来加速索引的查询：

select opc.opcname, amop.amopopr::regoperator, amop.amopstrategy as str from pg_opclass opc, pg_opfamily opf, pg_am am, pg_amop amop where opc.opcname in ('jsonb_ops','jsonb_path_ops') and opf.oid = opc.opcfamily and am.oid = opf.opfmethod and amop.amopfamily = opc.opcfamily and am.amname = 'gin' and amop.amoplefttype = opc.opcintype; opcname | amopopr | str ----------------+------------------+----- jsonb_ops | ?(jsonb,text) | 9 top-level key exists jsonb_ops | ?|(jsonb,text[]) | 10 some top-level key exists jsonb_ops | ?&(jsonb,text[]) | 11 all top-level keys exist jsonb_ops | @>(jsonb,jsonb) | 7 JSON value is at top level jsonb_path_ops | @>(jsonb,jsonb) | 7 (5 rows)

可以看到，两个操作符类可以使用: jsonb_ops和jsonb_path_ops。

默认使用第一个操作符类jsonb_ops。所有的key、值、数组元组都会被当成初始文档的元素。每个元素上增加了一个属性，表示这个元素是否为一个key（这在exist策略中有用，它区分key和值）。

例如，我们使用routes中的几行表示成JSON格式：

create table routes_jsonb as select to_jsonb(t) route from ( select departure_airport_name, arrival_airport_name, days_of_week from routes order by flight_no limit 4 ) t; select ctid, jsonb_pretty(route) from routes_jsonb; ctid | jsonb_pretty -------+------------------------------------------------- (0,1) | { + | "days_of_week": [ + | 1 + | ], + | "arrival_airport_name": "Surgut", + | "departure_airport_name": "Ust-Ilimsk" + | } (0,2) | { + | "days_of_week": [ + | 2 + | ], + | "arrival_airport_name": "Ust-Ilimsk", + | "departure_airport_name": "Surgut" + | } (0,3) | { + | "days_of_week": [ + | 1, + | 4 + | ], + | "arrival_airport_name": "Sochi", + | "departure_airport_name": "Ivanovo-Yuzhnyi"+ | } (0,4) | { + | "days_of_week": [ + | 2, + | 5 + | ], + | "arrival_airport_name": "Ivanovo-Yuzhnyi", + | "departure_airport_name": "Sochi" + | } (4 rows) create index on routes_jsonb using gin(route);

索引可能像下面这样：

现在，下面这样的查询，可以使用索引：

explain (costs off) select jsonb_pretty(route) from routes_jsonb where route @> '{"days_of_week": [5]}'; QUERY PLAN --------------------------------------------------------------- Bitmap Heap Scan on routes_jsonb Recheck Cond: (route @> '{"days_of_week": [5]}'::jsonb) -> Bitmap Index Scan on routes_jsonb_route_idx Index Cond: (route @> '{"days_of_week": [5]}'::jsonb) (4 rows)

从JSON文档的根节点开始，@>操作符检查特定的route(“days_of_week”: [5])是否存在。查询会返回一行：

select jsonb_pretty(route) from routes_jsonb where route @> '{"days_of_week": [5]}'; jsonb_pretty ------------------------------------------------ { + "days_of_week": [ + 2, + 5 + ], + "arrival_airport_name": "Ivanovo-Yuzhnyi",+ "departure_airport_name": "Sochi" + } (1 row)

查询按下面的方式执行：

查询中的元素被提取出来：days_of_week和5。从包含元素的树中找出被抽取的key，找出它们所对应的TID列表。对5来说是（0，4），对days_of_week来说是(0,1), (0,2), (0,3), (0,4)。

对所有找到的TID，使用一致性函数找出所有满足查询操作符的行。对@>操作符来说，不包含所有元素的文档肯定不符合条件，因此值留下了（0，4）。但是我们仍然需要重新检查剩下的TID，因为我们还不知道它们在文档中是否按期望的顺序出现。

读者可以阅读文档了解其它操作符。

除了常用的支持JSON的操作符，jsquery插件也已经可用很久了，它定义了一种具有丰富能力的查询语言（当然包含对GIN索引的支持）。另外，在2016年，新的SQL标准被发布，它定义了自己的操作符和查询语言SQL/JSON path。SQL/JSON path在PG12中已经支持，其它部分可能在PG13种实现。

4. 内部实现

我们可以使用pageinspect插件查看GIN索引的内部。

create extension pageinspect;

元页面的统计信息：

页面提供了一个special区域，存储了AM自己的信息。这个区域对普通程序（比如vacuum）来说是opaque。gin_page_opaque_info展示了GIN索引的这些数据。例如，我们可以知道索引的页面：

select flags, count(*) from generate_series(1,22967) as g(id), -- n_total_pages gin_page_opaque_info(get_raw_page('mail_messages_tsv_idx',g.id)) group by flags; flags | count ------------------------+------- {meta} | 1 meta page {} | 133 internal page of element B-tree {leaf} | 13618 leaf page of element B-tree {data} | 1497 internal page of TID B-tree {data,leaf,compressed} | 7719 leaf page of TID B-tree (5 rows)

gin_leafpage_items函数提供了TID中的信息{data,leaf,compressed}:

select * from gin_leafpage_items(get_raw_page('mail_messages_tsv_idx',2672)); -[ RECORD 1 ]--------------------------------------------------------------------- first_tid | (239,44) nbytes | 248 tids | {"(239,44)","(239,47)","(239,48)","(239,50)","(239,52)","(240,3)",... -[ RECORD 2 ]--------------------------------------------------------------------- first_tid | (247,40) nbytes | 248 tids | {"(247,40)","(247,41)","(247,44)","(247,45)","(247,46)","(248,2)",... ...

值得注意的是，TID树的叶子页面中实际存放了很小的压缩列表，而不是单独的指针。

5. 属性

我们看一下GIN AM的属性（查询见本系列第二篇文章）：

有趣的是，GIN支持创建多列索引。然而，和常规的B-tree不同，多列索引仍然存储单个的元素，而不是存储复合的key，每个元素都有一个列编号表示。

索引层面的属性：

值得注意的是，不支持依次返回每个TID，只支持位图扫描。不支持反向扫描：这对索引扫描很有用，但是对位图扫描没用。

下面是列层面的属性：

所有属性的都不支持，

6. 其它数据类型

其它几个插件，可以为其它数据类型提供支持。

hstore实现了K-V存储。不过由于已经有了jsonb，hstore的使用场景并不多。

intarray扩展了int数组类型的功能。它支持GiST索引和GIN索引（gin__int_ops操作符类）。

下面两个插件已经被提过了：

btree_gin为常规的数据类型增加了GIN索引的能力。

jsquery定义了一种新的查询语言用来处理JSON查询。

【本文地址】

公司简介

联系我们

今日新闻

推荐新闻

专题文章