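Working around PostgreSQL INSERT INTO ... SELECT not using parallel query
This article is based on PG 13.1.
Parallel query has been supported since PG 9.6. Since PG 11, CREATE TABLE … AS, SELECT INTO and CREATE MATERIALIZED VIEW can also use parallel query.
The conclusion first:
Use CREATE TABLE AS or SELECT INTO instead, or export the result and import it.
First, look at the execution plan of the following query: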
select count(*) from test t1,test1 t2 where t1.id = t2.id ;

postgres=# explain analyze select count(*) from test t1,test1 t2 where t1.id = t2.id ;
                                        QUERY PLAN
-------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=34244.16..34244.17 rows=1 width=8) (actual time=683.246..715.324 rows=1 loops=1)
   ->  Gather  (cost=34243.95..34244.16 rows=2 width=8) (actual time=681.474..715.311 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate  (cost=33243.95..33243.96 rows=1 width=8) (actual time=674.689..675.285 rows=1 loops=3)
               ->  Parallel Hash Join  (cost=15428.00..32202.28 rows=416667 width=0) (actual time=447.799..645.689 rows=333333 loops=3)
                     Hash Cond: (t1.id = t2.id)
                     ->  Parallel Seq Scan on test t1  (cost=0.00..8591.67 rows=416667 width=4) (actual time=0.025..74.010 rows=333333 loops=3)
                     ->  Parallel Hash  (cost=8591.67..8591.67 rows=416667 width=4) (actual time=260.052..260.053 rows=333333 loops=3)
                           Buckets: 131072  Batches: 16  Memory Usage: 3520kB
                           ->  Parallel Seq Scan on test1 t2  (cost=0.00..8591.67 rows=416667 width=4) (actual time=0.032..104.804 rows=333333 loops=3)
 Planning Time: 0.420 ms
 Execution Time: 715.447 ms
(13 rows)
The plan shows that two workers were planned and launched.
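The article does not show how its tables were created. The following is a minimal sketch for reproducing plans of the same shape, assuming two single-column integer tables of one million rows each (the row counts and column widths in the plans suggest this, but the actual data is not given). The column name `cnt` on `va` is hypothetical.

```sql
-- Assumed setup: table names come from the article, the data itself is a guess.
CREATE TABLE test  AS SELECT g AS id FROM generate_series(1, 1000000) AS g;
CREATE TABLE test1 AS SELECT g AS id FROM generate_series(1, 1000000) AS g;

-- Target table for the INSERT examples; "cnt" is a hypothetical column name.
CREATE TABLE va (cnt integer);

-- Refresh planner statistics so row estimates are realistic.
ANALYZE test;
ANALYZE test1;
```

Exact costs and timings will differ from the ones shown, depending on hardware and configuration.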
Now look at insert into select:
postgres=# explain analyze insert into va select count(*) from test t1,test1 t2 where t1.id = t2.id ;
                                        QUERY PLAN
-------------------------------------------------------------------------------------------
 Insert on va  (cost=73228.00..73228.02 rows=1 width=4) (actual time=3744.179..3744.187 rows=0 loops=1)
   ->  Subquery Scan on "*SELECT*"  (cost=73228.00..73228.02 rows=1 width=4) (actual time=3743.343..3743.352 rows=1 loops=1)
         ->  Aggregate  (cost=73228.00..73228.01 rows=1 width=8) (actual time=3743.247..3743.254 rows=1 loops=1)
               ->  Hash Join  (cost=30832.00..70728.00 rows=1000000 width=0) (actual time=1092.295..3511.301 rows=1000000 loops=1)
                     Hash Cond: (t1.id = t2.id)
                     ->  Seq Scan on test t1  (cost=0.00..14425.00 rows=1000000 width=4) (actual time=0.030..421.537 rows=1000000 loops=1)
                     ->  Hash  (cost=14425.00..14425.00 rows=1000000 width=4) (actual time=1090.078..1090.081 rows=1000000 loops=1)
                           Buckets: 131072  Batches: 16  Memory Usage: 3227kB
                           ->  Seq Scan on test1 t2  (cost=0.00..14425.00 rows=1000000 width=4) (actual time=0.021..422.768 rows=1000000 loops=1)
 Planning Time: 0.511 ms
 Execution Time: 3745.633 ms
(11 rows)
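There is no Gather node and no Workers line in this plan, so parallel query was not used; execution took about 3.7 seconds instead of about 0.7 seconds.
Even with forced parallel mode turned on, the statement still does not get a parallel plan.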
postgres=# set force_parallel_mode =on;
SET
postgres=# explain analyze insert into va select count(*) from test t1,test1 t2 where t1.id = t2.id ;
                                        QUERY PLAN
-------------------------------------------------------------------------------------------
 Insert on va  (cost=73228.00..73228.02 rows=1 width=4) (actual time=3825.042..3825.049 rows=0 loops=1)
   ->  Subquery Scan on "*SELECT*"  (cost=73228.00..73228.02 rows=1 width=4) (actual time=3824.976..3824.984 rows=1 loops=1)
         ->  Aggregate  (cost=73228.00..73228.01 rows=1 width=8) (actual time=3824.972..3824.978 rows=1 loops=1)
               ->  Hash Join  (cost=30832.00..70728.00 rows=1000000 width=0) (actual time=1073.587..3599.402 rows=1000000 loops=1)
                     Hash Cond: (t1.id = t2.id)
                     ->  Seq Scan on test t1  (cost=0.00..14425.00 rows=1000000 width=4) (actual time=0.034..414.965 rows=1000000 loops=1)
                     ->  Hash  (cost=14425.00..14425.00 rows=1000000 width=4) (actual time=1072.441..1072.443 rows=1000000 loops=1)
                           Buckets: 131072  Batches: 16  Memory Usage: 3227kB
                           ->  Seq Scan on test1 t2  (cost=0.00..14425.00 rows=1000000 width=4) (actual time=0.022..400.624 rows=1000000 loops=1)
 Planning Time: 0.577 ms
 Execution Time: 3825.923 ms
(11 rows)
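Before blaming the INSERT itself, it may be worth ruling out configuration as the cause. The sketch below only inspects standard settings and re-checks the plain SELECT; it assumes the same `test` and `test1` tables as above.

```sql
-- Maximum workers per Gather node; 0 disables parallel query (default 2).
SHOW max_parallel_workers_per_gather;

-- Upper bound on workers available to parallel queries (default 8).
SHOW max_parallel_workers;

-- Upper bound on all background worker processes (default 8).
SHOW max_worker_processes;

-- If this plan contains a Gather node, the settings are not the obstacle
-- and the missing parallelism comes from the INSERT wrapper.
EXPLAIN SELECT count(*) FROM test t1, test1 t2 WHERE t1.id = t2.id;
```

In this article the plain SELECT already runs with two workers, so the settings are evidently fine and the restriction lies with the data-modifying statement.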
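The reason is given in the official documentation, which lists this as one of the conditions under which no parallel plan is generated: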
The query writes any data or locks any database rows. If a query contains a data-modifying operation either at the top level or within a CTE, no parallel plans for that query will be generated. As an exception, the commands CREATE TABLE … AS, SELECT INTO, and CREATE MATERIALIZED VIEW which create a new table and populate it can use a parallel plan.
There are three workarounds:
1. select into
postgres=# explain analyze select count(*) into vaa from test t1,test1 t2 where t1.id = t2.id ;
                                        QUERY PLAN
-------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=34244.16..34244.17 rows=1 width=8) (actual time=742.736..774.923 rows=1 loops=1)
   ->  Gather  (cost=34243.95..34244.16 rows=2 width=8) (actual time=740.223..774.907 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate  (cost=33243.95..33243.96 rows=1 width=8) (actual time=731.408..731.413 rows=1 loops=3)
               ->  Parallel Hash Join  (cost=15428.00..32202.28 rows=416667 width=0) (actual time=489.880..700.830 rows=333333 loops=3)
                     Hash Cond: (t1.id = t2.id)
                     ->  Parallel Seq Scan on test t1  (cost=0.00..8591.67 rows=416667 width=4) (actual time=0.033..87.479 rows=333333 loops=3)
                     ->  Parallel Hash  (cost=8591.67..8591.67 rows=416667 width=4) (actual time=266.839..266.840 rows=333333 loops=3)
                           Buckets: 131072  Batches: 16  Memory Usage: 3520kB
                           ->  Parallel Seq Scan on test1 t2  (cost=0.00..8591.67 rows=416667 width=4) (actual time=0.058..106.874 rows=333333 loops=3)
 Planning Time: 0.319 ms
 Execution Time: 783.300 ms
(13 rows)
2. create table as
postgres=# explain analyze create table vb as select count(*) from test t1,test1 t2 where t1.id = t2.id ;
                                        QUERY PLAN
-------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=34244.16..34244.17 rows=1 width=8) (actual time=540.120..563.733 rows=1 loops=1)
   ->  Gather  (cost=34243.95..34244.16 rows=2 width=8) (actual time=537.982..563.720 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate  (cost=33243.95..33243.96 rows=1 width=8) (actual time=526.602..527.136 rows=1 loops=3)
               ->  Parallel Hash Join  (cost=15428.00..32202.28 rows=416667 width=0) (actual time=334.532..502.793 rows=333333 loops=3)
                     Hash Cond: (t1.id = t2.id)
                     ->  Parallel Seq Scan on test t1  (cost=0.00..8591.67 rows=416667 width=4) (actual time=0.018..57.819 rows=333333 loops=3)
                     ->  Parallel Hash  (cost=8591.67..8591.67 rows=416667 width=4) (actual time=189.502..189.503 rows=333333 loops=3)
                           Buckets: 131072  Batches: 16  Memory Usage: 3520kB
                           ->  Parallel Seq Scan on test1 t2  (cost=0.00..8591.67 rows=416667 width=4) (actual time=0.023..77.786 rows=333333 loops=3)
 Planning Time: 0.189 ms
 Execution Time: 565.448 ms
(13 rows)
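When the rows must end up in an existing table such as `va`, one possible way to apply the first two workarounds is to stage the expensive query in a new table and then copy the staged rows over. This is a sketch rather than the author's method. The staging table name `va_stage` and the column alias `cnt` are hypothetical, and `va` is assumed to have a single integer column.

```sql
BEGIN;

-- The expensive join runs here; CREATE TABLE AS is allowed to use a parallel plan.
CREATE TABLE va_stage AS
SELECT count(*) AS cnt
FROM test t1, test1 t2
WHERE t1.id = t2.id;

-- This INSERT is still planned serially, but it only reads the small staged result.
INSERT INTO va SELECT cnt FROM va_stage;

DROP TABLE va_stage;

COMMIT;
```

Prefixing the CREATE TABLE AS statement with EXPLAIN shows whether a Gather node is actually chosen for a given query.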
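3. Or export the query result to a file and import it again. Note that COPY ... FROM 'file' reads the file on the server side, so the path must be accessible to the server process. For example: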
psql -h localhost -d postgres -U postgres -c "select count(*) from test t1,test1 t2 where t1.id = t2.id " -o result.csv -A -t -F ","
psql -h localhost -d postgres -U postgres -c "COPY va FROM 'result.csv' WITH (FORMAT CSV, DELIMITER ',', HEADER FALSE, ENCODING 'windows-1252')"
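In some scenarios this is also faster than the non-parallel insert.
Addendum: PostgreSQL: can SELECT INTO not be used in a dynamic SQL statement?
My database version is PostgreSQL 8.4.7. Below is the stored procedure that fails: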
CREATE or Replace FUNCTION func_getnextid(
    tablename varchar(240),
    idname varchar(20) default 'id')
RETURNS integer AS $funcbody$
Declare
    sqlstring varchar(240);
    currentId integer;
Begin
    sqlstring:= 'select max("' || idname || '") into currentId from "' || tablename || '";';
    EXECUTE sqlstring;
    if currentId is NULL or currentId = 0 then
        return 1;
    else
        return currentId + 1;
    end if;
End;
$funcbody$ LANGUAGE plpgsql;
Running it produces this error:
SQL error:
ERROR: EXECUTE of SELECT … INTO is not implemented
CONTEXT: PL/pgSQL function “func_getnextbigid” line 6 at EXECUTE statement
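Moving INTO out of the SQL string and onto the EXECUTE statement fixes it. Inside the string, INTO is not treated as an assignment to a PL/pgSQL variable: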
CREATE or Replace FUNCTION func_getnextid(
    tablename varchar(240),
    idname varchar(20) default 'id')
RETURNS integer AS $funcbody$
Declare
    sqlstring varchar(240);
    currentId integer;
Begin
    sqlstring:= 'select max("' || idname || '") from "' || tablename || '";';
    EXECUTE sqlstring into currentId;
    if currentId is NULL or currentId = 0 then
        return 1;
    else
        return currentId + 1;
    end if;
End;
$funcbody$ LANGUAGE plpgsql;
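On newer servers the same function can be written more compactly. The sketch below assumes PostgreSQL 9.1 or later, because format() is not available in the 8.4.7 release used above. The function name `func_getnextid_fmt` is hypothetical. The `%I` specifier quotes identifiers, replacing the manual double-quote concatenation.

```sql
-- Assumes PostgreSQL 9.1+ for format(); the function name is hypothetical.
CREATE OR REPLACE FUNCTION func_getnextid_fmt(
    tablename text,
    idname text DEFAULT 'id')
RETURNS integer AS $funcbody$
DECLARE
    current_id integer;
BEGIN
    -- %I quotes each argument as an SQL identifier.
    -- INTO belongs to EXECUTE, not to the dynamic query string.
    EXECUTE format('SELECT max(%I) FROM %I', idname, tablename)
        INTO current_id;

    -- NULL (empty table) and 0 both yield 1, as in the original function.
    RETURN coalesce(current_id, 0) + 1;
END;
$funcbody$ LANGUAGE plpgsql;
```

It could then be called as `SELECT func_getnextid_fmt('test', 'id');`.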