PostgreSQL Indexes: Using hash Indexes in Detail

Posted on 2021-02-02 by WalkonNet

os: ubuntu 16.04

postgresql: 9.6.8

IP plan

192.168.56.102 node2 postgresql

help create index

postgres=# \h create index
Command:   CREATE INDEX
Description: define a new index
Syntax:
CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] name ] ON table_name [ USING method ]
  ( { column_name | ( expression ) } [ COLLATE collation ] [ opclass ] [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [, ...] )
  [ WITH ( storage_parameter = value [, ... ] ) ]
  [ TABLESPACE tablespace_name ]
  [ WHERE predicate ]

[ USING method ]

method

The name of the index method to use. Choices are btree, hash, gist, spgist, gin, and brin. The default method is btree.

hash

A hash index can only handle simple equality comparisons:

postgres=# drop table tmp_t0;
DROP TABLE
postgres=# create table tmp_t0(c0 varchar(100),c1 varchar(100));
CREATE TABLE
postgres=# insert into tmp_t0(c0,c1) select md5(id::varchar),md5((id+id)::varchar) from generate_series(1,100000) as id;
INSERT 0 100000
postgres=# create index idx_tmp_t0_1 on tmp_t0 using hash(c0);
CREATE INDEX
postgres=# \d+ tmp_t0
                                   Table "public.tmp_t0"
 Column |          Type          | Collation | Nullable | Default | Storage  | Stats target | Description 
--------+------------------------+-----------+----------+---------+----------+--------------+-------------
 c0     | character varying(100) |           |          |         | extended |              | 
 c1     | character varying(100) |           |          |         | extended |              | 
Indexes:
    "idx_tmp_t0_1" hash (c0)

postgres=# explain select * from tmp_t0 where c0 = 'd3d9446802a44259755d38e6d163e820';
                                 QUERY PLAN                                 
----------------------------------------------------------------------------
 Index Scan using idx_tmp_t0_1 on tmp_t0  (cost=0.00..8.02 rows=1 width=66)
   Index Cond: ((c0)::text = 'd3d9446802a44259755d38e6d163e820'::text)
(2 rows)

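Caveats that the official documentation for this version (9.6) specifically emphasizes:

Hash index operations are not presently WAL-logged, so a hash index might need to be rebuilt with REINDEX after a database crash if there were unwritten changes.

Likewise, changes to hash indexes are not replicated over streaming or file-based replication after the initial base backup, so they give wrong answers to queries that subsequently use them.

For these reasons, the use of hash indexes is presently discouraged.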

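Supplement: an introduction to PostgreSQL hash indexes

Structure of a hash index

When a value is inserted into the index, a hash function is computed for the index key. Hash functions in PostgreSQL always return the "integer" type, so there are 2^32 (about 4 billion) possible values. The number of buckets starts at two and grows dynamically to fit the data size. The bucket number is derived from the hash code with bit arithmetic, and that bucket is where the TID is stored.

That alone is not enough, because TIDs matching different index keys can land in the same bucket. The source value of the key could be stored in the bucket alongside the TID, but that would increase the index size. To save space, the bucket stores only the hash code of the index key, not the key itself.

When searching through the index, we compute the hash function for the index key and obtain the bucket number. It then remains to walk through the contents of the bucket and return only the TIDs whose hash codes match. Because the stored "hash code – TID" pairs are ordered, this can be done efficiently.

However, two different index keys may not only fall into the same bucket but also have the same four-byte hash code. The index access method therefore asks the indexing engine to verify each TID by rechecking the condition against the table row.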
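The lookup path described above can be sketched in a few lines of Python. This is only an illustrative toy under stated assumptions, not PostgreSQL's implementation: `ToyHashIndex` and `hash32` are hypothetical names, CRC32 stands in for PostgreSQL's own hash functions, TIDs are plain non-negative integers, and the number of buckets is fixed rather than growing dynamically.

```python
import bisect
import zlib


def hash32(key: str) -> int:
    """Stand-in 32-bit hash (CRC32); PostgreSQL uses its own hash functions."""
    return zlib.crc32(key.encode("utf-8")) & 0xFFFFFFFF


class ToyHashIndex:
    def __init__(self, heap, nbuckets=2):
        # nbuckets must be a power of two so that a bit mask selects the bucket.
        assert nbuckets > 0 and nbuckets & (nbuckets - 1) == 0
        self.heap = heap  # tid -> key, plays the role of the table
        self.mask = nbuckets - 1
        self.buckets = [[] for _ in range(nbuckets)]
        for tid, key in heap.items():
            self.insert(key, tid)

    def insert(self, key, tid):
        code = hash32(key)
        # Only the (hash code, TID) pair is stored, never the key itself.
        # insort keeps each bucket ordered by hash code.
        bisect.insort(self.buckets[code & self.mask], (code, tid))

    def lookup(self, key):
        code = hash32(key)
        bucket = self.buckets[code & self.mask]
        # Binary search to the first pair carrying this hash code.
        i = bisect.bisect_left(bucket, (code, -1))
        tids = []
        while i < len(bucket) and bucket[i][0] == code:
            tid = bucket[i][1]
            # Recheck against the table row: equal hash codes
            # do not guarantee equal keys.
            if self.heap[tid] == key:
                tids.append(tid)
            i += 1
        return tids
```

For example, with `heap = {0: "PG0001", 1: "PG0002", 2: "PG0001"}`, `ToyHashIndex(heap, nbuckets=4).lookup("PG0001")` returns the TIDs 0 and 2, and a key that is not in the table returns an empty list even if its hash code happens to collide with a stored one.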

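Mapping the data structure to pages

Meta page – page number 0, which holds information about the internals of the index

Bucket pages – the main pages of the index, which store "hash code – TID" pairs

Overflow pages – structured the same way as bucket pages and used when a single page is not enough for a bucket

Bitmap pages – keep track of overflow pages that are currently free and can be reused for other buckets

Note that a hash index cannot shrink. Even if some indexed rows are deleted, pages that were once allocated are not returned to the operating system; they are only reused for new data after VACUUM. The only way to reduce the index size is to rebuild it from scratch with REINDEX or VACUUM FULL.

Next, let's see how to create a hash index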

demo=# create index on flights using hash(flight_no);
demo=# explain (costs off) select * from flights where flight_no = 'PG0001';
                     QUERY PLAN                     
----------------------------------------------------
 Bitmap Heap Scan on flights
   Recheck Cond: (flight_no = 'PG0001'::bpchar)
   ->  Bitmap Index Scan on flights_flight_no_idx
         Index Cond: (flight_no = 'PG0001'::bpchar)
(4 rows)

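Note: before version 10, hash index changes were not written to the WAL, so a hash index could not be recovered after a crash and, consequently, could not be replicated either. From version 10 onward hash indexes have been improved: they are WAL-logged, and creating one no longer produces a warning.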
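The version boundary can be expressed as a small check. The sketch below assumes the caller has already obtained the integer reported by `SHOW server_version_num` (for example 90608 for 9.6.8 and 100000 for 10.0); the function name `hash_index_is_wal_logged` is made up for illustration.

```python
def hash_index_is_wal_logged(server_version_num: int) -> bool:
    """Return True if hash index changes are WAL-logged on this server.

    server_version_num is the integer form of the server version,
    e.g. 90608 for 9.6.8 and 100000 for 10.0.
    """
    # WAL logging for hash indexes arrived in PostgreSQL 10.
    return server_version_num >= 100000
```

On the 9.6.8 server used at the start of this article the check returns False, so a hash index there would still need a REINDEX after a crash.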

Viewing the support functions associated with the hash access method

demo=# select  opf.opfname as opfamily_name,
     amproc.amproc::regproc AS opfamily_procedure
from   pg_am am,
     pg_opfamily opf,
     pg_amproc amproc
where  opf.opfmethod = am.oid
and   amproc.amprocfamily = opf.oid
and   am.amname = 'hash'
order by opfamily_name,
     opfamily_procedure;
  
   opfamily_name  |  opfamily_procedure  
--------------------+-------------------------
 abstime_ops    | hashint4extended
 abstime_ops    | hashint4
 aclitem_ops    | hash_aclitem
 aclitem_ops    | hash_aclitem_extended
 array_ops     | hash_array
 array_ops     | hash_array_extended
 bool_ops      | hashcharextended
 bool_ops      | hashchar
 bpchar_ops     | hashbpcharextended
 bpchar_ops     | hashbpchar
 bpchar_pattern_ops | hashbpcharextended
 bpchar_pattern_ops | hashbpchar
 bytea_ops     | hashvarlena
 bytea_ops     | hashvarlenaextended
 char_ops      | hashcharextended
 char_ops      | hashchar
 cid_ops      | hashint4extended
 cid_ops      | hashint4
 date_ops      | hashint4extended
 date_ops      | hashint4
 enum_ops      | hashenumextended
 enum_ops      | hashenum
 float_ops     | hashfloat4extended
 float_ops     | hashfloat8extended
 float_ops     | hashfloat4
 float_ops     | hashfloat8
 ...

These functions can be used to compute the hash code of a value of the corresponding type

hank=# select hashtext('zhang');
  hashtext   
-------------
 -1172392837
(1 row)

hank=# select hashint4(10);
  hashint4   
-------------
 -1547814713
(1 row)

Properties of hash indexes

hank=# select a.amname, p.name, pg_indexam_has_property(a.oid,p.name)
hank-# from pg_am a,
hank-#   unnest(array['can_order','can_unique','can_multi_col','can_exclude']) p(name)
hank-# where a.amname = 'hash'
hank-# order by a.amname;
 amname |     name      | pg_indexam_has_property 
--------+---------------+-------------------------
 hash   | can_order     | f
 hash   | can_unique    | f
 hash   | can_multi_col | f
 hash   | can_exclude   | t
(4 rows)

hank=# select p.name, pg_index_has_property('hank.idx_test_name'::regclass,p.name)
hank-# from unnest(array[
hank(#    'clusterable','index_scan','bitmap_scan','backward_scan'
hank(#   ]) p(name);
     name      | pg_index_has_property 
---------------+-----------------------
 clusterable   | f
 index_scan    | t
 bitmap_scan   | t
 backward_scan | t
(4 rows)

hank=# select p.name,
hank-#   pg_index_column_has_property('hank.idx_test_name'::regclass,1,p.name)
hank-# from unnest(array[
hank(#    'asc','desc','nulls_first','nulls_last','orderable','distance_orderable',
hank(#    'returnable','search_array','search_nulls'
hank(#   ]) p(name);
        name        | pg_index_column_has_property 
--------------------+------------------------------
 asc                | f
 desc               | f
 nulls_first        | f
 nulls_last         | f
 orderable          | f
 distance_orderable | f
 returnable         | f
 search_array       | f
 search_nulls       | f
(9 rows)

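Because a hash function does not preserve any particular ordering, an ordinary hash index supports only equality lookups; the data dictionary query below shows that every operator is "=". A hash index also does not handle NULL values, so it does not index them (search_nulls is f above). Because it stores only hash codes and not the index keys, it cannot be used for index-only scans (returnable is f). Multi-column hash indexes are not supported either (can_multi_col is f).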
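A small Python sketch shows why ordering is lost. It uses a Knuth-style multiplicative hash purely as a stand-in (an assumption for illustration; PostgreSQL's hashint4 is a different function), and the name `toy_hash` is hypothetical.

```python
def toy_hash(k: int) -> int:
    """Knuth-style multiplicative hash, truncated to 32 bits (illustrative only)."""
    return (k * 2654435761) % 2**32


keys = [1, 2, 3]
by_key = sorted(keys)
by_hash = sorted(keys, key=toy_hash)

# The keys 2 and 3 are adjacent in key order, but the hash code of 1
# falls between their hash codes, so a "range" over hash codes would
# pick up a key that lies outside the requested key range.
lo, hi = sorted((toy_hash(2), toy_hash(3)))
keys_in_hash_range = [k for k in keys if lo <= toy_hash(k) <= hi]

print(by_key)              # [1, 2, 3]
print(by_hash)             # [2, 1, 3]
print(keys_in_hash_range)  # [1, 2, 3]
```

Only an exact match on the hash code is meaningful, which is why "=" is the only operator a hash index can serve.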

hank=# select  opf.opfname AS opfamily_name,
hank-#     amop.amopopr::regoperator AS opfamily_operator
hank-# from   pg_am am,
hank-#     pg_opfamily opf,
hank-#     pg_amop amop
hank-# where  opf.opfmethod = am.oid
hank-# and   amop.amopfamily = opf.oid
hank-# and   am.amname = 'hash'
hank-# order by opfamily_name,
hank-#     opfamily_operator;
  opfamily_name  |           opfamily_operator           
--------------------+------------------------------------------------------------
 abstime_ops    | =(abstime,abstime)
 aclitem_ops    | =(aclitem,aclitem)
 array_ops     | =(anyarray,anyarray)
 bool_ops      | =(boolean,boolean)
 bpchar_ops     | =(character,character)
 bpchar_pattern_ops | =(character,character)
 bytea_ops     | =(bytea,bytea)
 char_ops      | =("char","char")
 cid_ops      | =(cid,cid)
 date_ops      | =(date,date)
 enum_ops      | =(anyenum,anyenum)
 float_ops     | =(real,real)
 float_ops     | =(double precision,double precision)
 float_ops     | =(real,double precision)
 float_ops     | =(double precision,real)
 hash_hstore_ops  | =(hstore,hstore)
 integer_ops    | =(integer,bigint)
 integer_ops    | =(smallint,smallint)
 integer_ops    | =(integer,integer)
 integer_ops    | =(bigint,bigint)
 integer_ops    | =(bigint,integer)
 integer_ops    | =(smallint,integer)
 integer_ops    | =(integer,smallint)
 integer_ops    | =(smallint,bigint)
 integer_ops    | =(bigint,smallint)
 interval_ops    | =(interval,interval)
 jsonb_ops     | =(jsonb,jsonb)
 macaddr8_ops    | =(macaddr8,macaddr8)
 macaddr_ops    | =(macaddr,macaddr)
 name_ops      | =(name,name)
 network_ops    | =(inet,inet)
 numeric_ops    | =(numeric,numeric)
 oid_ops      | =(oid,oid)
 oidvector_ops   | =(oidvector,oidvector)
 pg_lsn_ops     | =(pg_lsn,pg_lsn)
 range_ops     | =(anyrange,anyrange)
 reltime_ops    | =(reltime,reltime)
 text_ops      | =(text,text)
 text_pattern_ops  | =(text,text)
 time_ops      | =(time without time zone,time without time zone)
 timestamp_ops   | =(timestamp without time zone,timestamp without time zone)
 timestamptz_ops  | =(timestamp with time zone,timestamp with time zone)
 timetz_ops     | =(time with time zone,time with time zone)
 uuid_ops      | =(uuid,uuid)
 xid_ops      | =(xid,xid)

Starting with version 10, the pageinspect extension can be used to look inside a hash index

Install the extension

create extension pageinspect;

Inspect page 0

hank=# select hash_page_type(get_raw_page('hank.idx_test_name',0));
 hash_page_type 
----------------
 metapage
(1 row)

View the number of rows in the index and the highest bucket number in use

hank=# select ntuples, maxbucket
hank-# from hash_metapage_info(get_raw_page('hank.idx_test_name',0));  
 ntuples | maxbucket 
---------+-----------
    1000 |         3
(1 row)

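Page 1 is a bucket page. The number of live and dead tuples on this bucket page, in other words its degree of bloat, can be checked in order to maintain the index: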

hank=# select hash_page_type(get_raw_page('hank.idx_test_name',1));
 hash_page_type 
----------------
 bucket
(1 row)

hank=# select live_items, dead_items
hank-# from hash_page_stats(get_raw_page('hank.idx_test_name',1));
 live_items | dead_items 
------------+------------
        407 |          0
(1 row)
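One simple way to turn the two counters into a bloat figure is the share of dead tuples on the page. The helper below is a hypothetical sketch, not part of pageinspect, and it assumes the `live_items` and `dead_items` values have already been read from `hash_page_stats`.

```python
def dead_fraction(live_items: int, dead_items: int) -> float:
    """Share of dead tuples on a page, from 0.0 (none dead) to 1.0 (all dead)."""
    total = live_items + dead_items
    if total == 0:
        # An empty page has nothing to clean up.
        return 0.0
    return dead_items / total
```

For the page shown above, `dead_fraction(407, 0)` is 0.0, meaning the bucket page currently carries no dead tuples.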

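The above is based on personal experience; I hope it serves as a useful reference, and I hope you will continue to support WalkonNet. If anything is wrong or incomplete, corrections are welcome.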
