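Go Standard Library net/http vs. fasthttp: A Scenario Analysis of Server-Side Performance

1. Background

After writing the classic "hello, world" program, Go beginners are often eager to try out Go's powerful standard library, for example by writing a fully functional web server in just a few lines of code, like the example below:

// From https://tip.golang.org/pkg/net/http/#example_ListenAndServe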
package main
import (
    "io"
    "log"
    "net/http"
)
func main() {
    helloHandler := func(w http.ResponseWriter, req *http.Request) {
        io.WriteString(w, "Hello, world!\n")
    }
    http.HandleFunc("/hello", helloHandler)
    log.Fatal(http.ListenAndServe(":8080", nil))
}

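Go's net/http package is a well-balanced, general-purpose implementation that covers more than 90% of the scenarios most gophers encounter. It has the following advantages:

  • It is part of the standard library, so no third-party dependency is needed;
  • It conforms well to the HTTP specification;
  • It delivers relatively high performance without any tuning;
  • It supports HTTP proxies;
  • It supports HTTPS;
  • It supports HTTP/2 seamlessly.

However, precisely because net/http is a "balanced" general-purpose implementation, its performance may fall short in domains with strict performance requirements, and it leaves little room for tuning. In such cases we turn to third-party HTTP server frameworks.

Among third-party HTTP server frameworks, fasthttp, a framework that lives up to its name, is one of the most frequently mentioned and adopted. Its official site claims that it is ten times as fast as net/http (based on go test benchmark results).

fasthttp applies many performance best practices, especially the reuse of in-memory objects: it makes heavy use of sync.Pool to reduce pressure on the Go GC.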
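To make the object-reuse idea concrete, here is a minimal sketch of sync.Pool used to recycle buffers between calls. It is not taken from fasthttp; the render helper and the bufPool variable are hypothetical names chosen for illustration.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool hands out reusable *bytes.Buffer values so that each call
// does not have to allocate a fresh buffer. (Hypothetical name.)
var bufPool = sync.Pool{
	New: func() interface{} { return new(bytes.Buffer) },
}

// render builds a response body using a pooled buffer.
// This helper is illustrative only and is not part of fasthttp.
func render(name string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()            // a pooled buffer may still hold old data
	defer bufPool.Put(buf) // hand the buffer back for the next caller

	buf.WriteString("Hello, ")
	buf.WriteString(name)
	buf.WriteString("!")
	return buf.String() // String copies the bytes, so reusing buf is safe
}

func main() {
	fmt.Println(render("Go"))
}
```

Objects in a sync.Pool may be dropped by the garbage collector at any time, so the pool is a cache for temporary objects rather than a place to keep state.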

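So how much faster than net/http is fasthttp in a real environment? I happen to have two reasonably powerful servers at hand, so in this article we measure the actual performance of both in that real environment.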

2. Performance Testing

We implement two nearly "zero-business-logic" test targets, one with net/http and one with fasthttp:

  • nethttp:
// github.com/bigwhite/experiments/blob/master/http-benchmark/nethttp/main.go
package main
import (
    _ "expvar"
    "log"
    "net/http"
    _ "net/http/pprof"
    "runtime"
    "time"
)
func main() {
    go func() {
        for {
            log.Println("current goroutine count:", runtime.NumGoroutine())
            time.Sleep(time.Second)
        }
    }()

    http.Handle("/", http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        w.Write([]byte("Hello, Go!"))
    }))

    log.Fatal(http.ListenAndServe(":8080", nil))
}
  • fasthttp:
// github.com/bigwhite/experiments/blob/master/http-benchmark/fasthttp/main.go
package main
import (
    "fmt"
    "log"
    "net/http"
    "runtime"
    "time"
    _ "expvar"
    _ "net/http/pprof"
    "github.com/valyala/fasthttp"
)
type HelloGoHandler struct {
}
func fastHTTPHandler(ctx *fasthttp.RequestCtx) {
    fmt.Fprintln(ctx, "Hello, Go!")
}
func main() {
    go func() {
        http.ListenAndServe(":6060", nil)
    }()
    go func() {
        for {
            log.Println("current goroutine count:", runtime.NumGoroutine())
            time.Sleep(time.Second)
        }
    }()
    s := &fasthttp.Server{
        Handler: fastHTTPHandler,
    }
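    // Report a startup failure (for example, port already in use) instead of exiting silently.
    log.Fatal(s.ListenAndServe(":8081"))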
}

The load-generating client is based on hey, an HTTP load-testing tool. To make it easy to adjust the load level, we wrap hey in the following shell script (it only runs on Linux):

# github.com/bigwhite/experiments/blob/master/http-benchmark/client/http_client_load.sh
# ./http_client_load.sh 3 10000 10 GET http://10.10.195.181:8080
echo "$0 task_num count_per_hey conn_per_hey method url"
task_num=$1
count_per_hey=$2
conn_per_hey=$3
method=$4
url=$5
start=$(date +%s%N)
for((i=1; i<=$task_num; i++)); do {
    tm=$(date +%T.%N)
    echo "$tm: task $i start"
    hey -n $count_per_hey -c $conn_per_hey -m $method $url > hey_$i.log
    tm=$(date +%T.%N)
    echo "$tm: task $i done"
} & done
wait
end=$(date +%s%N)
count=$(( $task_num * $count_per_hey ))
runtime_ns=$(( $end - $start ))
runtime=`echo "scale=2; $runtime_ns / 1000000000" | bc`
echo "runtime: "$runtime
speed=`echo "scale=2; $count / $runtime" | bc`
echo "speed: "$speed

A sample run of the script looks like this:

bash http_client_load.sh 8 1000000 200 GET http://10.10.195.134:8080
http_client_load.sh task_num count_per_hey conn_per_hey method url
16:58:09.146948690: task 1 start
16:58:09.147235080: task 2 start
16:58:09.147290430: task 3 start
16:58:09.147740230: task 4 start
16:58:09.147896010: task 5 start
16:58:09.148314900: task 6 start
16:58:09.148446030: task 7 start
16:58:09.148930840: task 8 start
16:58:45.001080740: task 3 done
16:58:45.241903500: task 8 done
16:58:45.261501940: task 1 done
16:58:50.032383770: task 4 done
16:58:50.985076450: task 7 done
16:58:51.269099430: task 5 done
16:58:52.008164010: task 6 done
16:58:52.166402430: task 2 done
runtime: 43.02
speed: 185960.01

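Judging from the arguments, the script starts 8 tasks in parallel (each task launches one hey process). Each task opens 200 concurrent connections to http://10.10.195.134:8080 and sends 1,000,000 HTTP GET requests.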
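The speed figure printed by the script is simply the total number of requests divided by the wall-clock time of the slowest task. The short Go sketch below redoes that arithmetic with the values from the sample run; the throughput function is a hypothetical helper, not part of the script.

```go
package main

import "fmt"

// throughput returns requests per second for taskNum parallel hey
// processes that each send countPerHey requests, given the overall
// wall-clock time in seconds. (Hypothetical helper.)
func throughput(taskNum, countPerHey int, runtimeSec float64) float64 {
	return float64(taskNum*countPerHey) / runtimeSec
}

func main() {
	// Values from the sample run: 8 tasks, 1,000,000 requests each, 43.02 s.
	fmt.Printf("total requests: %d\n", 8*1000000)
	fmt.Printf("speed: %.2f req/s\n", throughput(8, 1000000, 43.02))
}
```

The script computes the same quotient with bc at scale=2, which truncates instead of rounding, so the last digit may differ from what Printf shows.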

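We use two servers, one hosting the test target and one hosting the load script:

  • Server hosting the test target: 10.10.195.181 (physical machine, Intel x86-64 CPU, 40 logical CPUs, 128 GB RAM, CentOS 7.6)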
$ cat /etc/redhat-release
CentOS Linux release 7.6.1810 (Core) 

$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                40
On-line CPU(s) list:   0-39
Thread(s) per core:    2
Core(s) per socket:    10
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz
Stepping:              4
CPU MHz:               800.000
CPU max MHz:           2201.0000
CPU min MHz:           800.0000
BogoMIPS:              4400.00
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              14080K
NUMA node0 CPU(s):     0-9,20-29
NUMA node1 CPU(s):     10-19,30-39
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 intel_pt ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke spec_ctrl intel_stibp flush_l1d
  • Server hosting the load tool: 10.10.195.133 (physical machine, Kunpeng arm64 CPU, 96 cores, 80 GB RAM, CentOS 7.9)
# cat /etc/redhat-release
CentOS Linux release 7.9.2009 (AltArch)

# lscpu
Architecture:          aarch64
Byte Order:            Little Endian
CPU(s):                96
On-line CPU(s) list:   0-95
Thread(s) per core:    1
Core(s) per socket:    48
Socket(s):             2
NUMA node(s):          4
Model:                 0
CPU max MHz:           2600.0000
CPU min MHz:           200.0000
BogoMIPS:              200.00
L1d cache:             64K
L1i cache:             64K
L2 cache:              512K
L3 cache:              49152K
NUMA node0 CPU(s):     0-23
NUMA node1 CPU(s):     24-47
NUMA node2 CPU(s):     48-71
NUMA node3 CPU(s):     72-95
Flags:                 fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm

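I use dstat to monitor resource usage on the host running the test target (dstat -tcdngym), CPU load in particular. memstats is monitored with expvarmon; since there is no business logic, memory usage is very low. go tool pprof is used to rank the target program's consumption of each kind of resource.

The table below was compiled from multiple test runs:

Figure: test data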

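3. Brief Analysis of the Results

Given the specific scenario, the precision of the test tool and script, and the load-testing environment, the results above have their limitations, but they do reflect the real performance trend of the test targets. Under the same load, fasthttp does not deliver 10 times the performance of net/http; in this particular scenario it does not even reach twice the performance of net/http. In the test cases where CPU usage on the target host approaches 70%, fasthttp is only about 30% to 70% faster than net/http.

So why does fasthttp fall short of expectations? To answer that, we need to look at how net/http and fasthttp each work. Let's start with a diagram of how net/http works:

Figure: how net/http works

As a server, the http package works in a very simple way: after accepting a connection (conn), it hands the conn to a worker goroutine, which lives until the conn's life cycle ends, that is, until the connection is closed.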
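The goroutine-per-connection model can be sketched in a few lines. The example below is a simplified illustration over raw TCP with a line-based echo protocol instead of HTTP; serve, handleConn, and demo are hypothetical names, not net/http internals.

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"net"
)

// handleConn serves one connection until the peer closes it.
// The goroutine running it lives exactly as long as the connection.
func handleConn(c net.Conn) {
	defer c.Close()
	r := bufio.NewReader(c)
	for {
		line, err := r.ReadString('\n')
		if err != nil {
			return // connection closed or broken: the goroutine exits
		}
		fmt.Fprintf(c, "echo: %s", line)
	}
}

// serve accepts connections and hands each one to its own goroutine.
func serve(ln net.Listener) {
	for {
		c, err := ln.Accept()
		if err != nil {
			return // listener closed
		}
		go handleConn(c)
	}
}

// demo starts the server on a free loopback port and performs one
// request/response round trip as a client.
func demo() (string, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", err
	}
	defer ln.Close()
	go serve(ln)

	c, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		return "", err
	}
	defer c.Close()
	if _, err := fmt.Fprintln(c, "ping"); err != nil {
		return "", err
	}
	return bufio.NewReader(c).ReadString('\n')
}

func main() {
	reply, err := demo()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Print(reply) // prints "echo: ping"
}
```

With N open connections this model always has N connection goroutines, regardless of how busy each connection is.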

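Below is a diagram of how fasthttp works:

Figure: how fasthttp works

fasthttp, by contrast, is designed to reuse goroutines as much as possible instead of creating a new goroutine every time. After the fasthttp Server accepts a conn, it tries to take a channel from the ready slice in the workerpool; each channel corresponds one-to-one to a worker goroutine. Once a channel is taken, the accepted conn is written to it, and the worker goroutine at the other end of the channel handles the reads and writes on that conn. When the conn has been fully handled, the worker goroutine does not exit; it puts its channel back into the workerpool's ready slice and waits to be picked again.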
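The mechanism described above can be approximated with the following sketch. It is a simplified model written for this article, not fasthttp's actual code: the conn type, the field names, and the demo function are all hypothetical, and details such as cleaning up idle workers are left out.

```go
package main

import (
	"fmt"
	"sync"
)

// conn stands in for an accepted network connection (hypothetical type).
type conn struct{ id int }

// workerPool keeps idle workers in ready, each represented by the channel
// it reads from. All names are illustrative, not fasthttp's API.
type workerPool struct {
	mu         sync.Mutex
	ready      []chan conn    // channels of idle workers
	workers    int            // goroutines started (they never exit here)
	maxWorkers int            // upper bound on worker goroutines
	handle     func(conn)     // per-connection work
	wg         sync.WaitGroup // connections still being processed
}

// serve hands c to an idle worker, starting a new one if none is idle.
// It reports false when the pool is at its limit and c cannot be served.
func (wp *workerPool) serve(c conn) bool {
	wp.mu.Lock()
	var ch chan conn
	if n := len(wp.ready); n > 0 {
		ch = wp.ready[n-1] // reuse an idle worker
		wp.ready = wp.ready[:n-1]
	} else if wp.workers < wp.maxWorkers {
		ch = make(chan conn, 1)
		wp.workers++
		go wp.worker(ch)
	}
	wp.mu.Unlock()
	if ch == nil {
		return false
	}
	wp.wg.Add(1)
	ch <- c
	return true
}

// worker handles connections from ch; after each one it puts ch back
// into ready instead of exiting.
func (wp *workerPool) worker(ch chan conn) {
	for c := range ch {
		wp.handle(c)
		wp.mu.Lock()
		wp.ready = append(wp.ready, ch)
		wp.mu.Unlock()
		wp.wg.Done()
	}
}

// demo returns the workers needed for 4 sequential short connections,
// and the workers started and connections rejected for 4 simultaneous
// long-lived connections with a limit of 3 workers.
func demo() (sequential, concurrent, rejected int) {
	short := &workerPool{maxWorkers: 3, handle: func(conn) {}}
	for i := 0; i < 4; i++ {
		short.serve(conn{id: i})
		short.wg.Wait() // connection done, worker is back in ready
	}
	sequential = short.workers

	release := make(chan struct{}) // closed when the "test" ends
	long := &workerPool{maxWorkers: 3, handle: func(conn) { <-release }}
	for i := 0; i < 4; i++ {
		if !long.serve(conn{id: i}) {
			rejected++
		}
	}
	concurrent = long.workers
	close(release)
	long.wg.Wait()
	return sequential, concurrent, rejected
}

func main() {
	s, c, r := demo()
	fmt.Printf("4 short connections in sequence: %d worker(s)\n", s)
	fmt.Printf("4 long-lived connections at once: %d worker(s), %d rejected\n", c, r)
}
```

In this model, short connections that arrive one after another share a single worker, while connections that stay open at the same time each pin a worker until they close, and a connection arriving after the limit is reached is turned away.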

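The intent behind fasthttp's goroutine reuse strategy is good, but it has little effect in this test scenario. The test results show that under the same client concurrency and load, net/http and fasthttp use almost the same number of goroutines. This is caused by the test model: in our test, the hey process in each task opens a fixed number of persistent (keep-alive) connections to the target and then sends "saturating" requests on each connection. As a result, once a goroutine in the fasthttp workerpool receives a conn, it can only be returned to the pool after the communication on that conn ends, and the conn is not closed until the test finishes. Such a scenario effectively makes fasthttp "degrade" to the net/http model and inherit its "defect": once the number of goroutines grows large, the overhead of the Go runtime's own scheduling can no longer be ignored and may even exceed the share of resources consumed by business logic. Below are the CPU profiles of fasthttp with 200, 8000, and 16000 persistent connections:

200 persistent connections: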

(pprof) top -cum
Showing nodes accounting for 88.17s, 55.35% of 159.30s total
Dropped 150 nodes (cum <= 0.80s)
Showing top 10 nodes out of 60
      flat  flat%   sum%        cum   cum%
     0.46s  0.29%  0.29%    101.46s 63.69%  github.com/valyala/fasthttp.(*Server).serveConn
         0     0%  0.29%    101.46s 63.69%  github.com/valyala/fasthttp.(*workerPool).getCh.func1
         0     0%  0.29%    101.46s 63.69%  github.com/valyala/fasthttp.(*workerPool).workerFunc
     0.04s 0.025%  0.31%     89.46s 56.16%  internal/poll.ignoringEINTRIO (inline)
    87.38s 54.85% 55.17%     89.27s 56.04%  syscall.Syscall
     0.12s 0.075% 55.24%     60.39s 37.91%  bufio.(*Writer).Flush
         0     0% 55.24%     60.22s 37.80%  net.(*conn).Write
     0.08s  0.05% 55.29%     60.21s 37.80%  net.(*netFD).Write
     0.09s 0.056% 55.35%     60.12s 37.74%  internal/poll.(*FD).Write
         0     0% 55.35%     59.86s 37.58%  syscall.Write (inline)
(pprof) 

8000 persistent connections:

(pprof) top -cum
Showing nodes accounting for 108.51s, 54.46% of 199.23s total
Dropped 204 nodes (cum <= 1s)
Showing top 10 nodes out of 66
      flat  flat%   sum%        cum   cum%
         0     0%     0%    119.11s 59.79%  github.com/valyala/fasthttp.(*workerPool).getCh.func1
         0     0%     0%    119.11s 59.79%  github.com/valyala/fasthttp.(*workerPool).workerFunc
     0.69s  0.35%  0.35%    119.05s 59.76%  github.com/valyala/fasthttp.(*Server).serveConn
     0.04s  0.02%  0.37%    104.22s 52.31%  internal/poll.ignoringEINTRIO (inline)
   101.58s 50.99% 51.35%    103.95s 52.18%  syscall.Syscall
     0.10s  0.05% 51.40%     79.95s 40.13%  runtime.mcall
     0.06s  0.03% 51.43%     79.85s 40.08%  runtime.park_m
     0.23s  0.12% 51.55%     79.30s 39.80%  runtime.schedule
     5.67s  2.85% 54.39%     77.47s 38.88%  runtime.findrunnable
     0.14s  0.07% 54.46%     68.96s 34.61%  bufio.(*Writer).Flush

16000 persistent connections:

(pprof) top -cum
Showing nodes accounting for 239.60s, 87.07% of 275.17s total
Dropped 190 nodes (cum <= 1.38s)
Showing top 10 nodes out of 46
      flat  flat%   sum%        cum   cum%
     0.04s 0.015% 0.015%    153.38s 55.74%  runtime.mcall
     0.01s 0.0036% 0.018%    153.34s 55.73%  runtime.park_m
     0.12s 0.044% 0.062%       153s 55.60%  runtime.schedule
     0.66s  0.24%   0.3%    152.66s 55.48%  runtime.findrunnable
     0.15s 0.055%  0.36%    127.53s 46.35%  runtime.netpoll
   127.04s 46.17% 46.52%    127.04s 46.17%  runtime.epollwait
         0     0% 46.52%       121s 43.97%  github.com/valyala/fasthttp.(*workerPool).getCh.func1
         0     0% 46.52%       121s 43.97%  github.com/valyala/fasthttp.(*workerPool).workerFunc
     0.41s  0.15% 46.67%    120.18s 43.67%  github.com/valyala/fasthttp.(*Server).serveConn
   111.17s 40.40% 87.07%    111.99s 40.70%  syscall.Syscall
(pprof)

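Comparing the profiles above, we find that as the number of persistent connections grows (that is, as the number of goroutines in the workerpool grows), the share of Go runtime scheduling increases steadily. At 16000 connections, runtime scheduling functions occupy the top 4 positions.

4. Optimization Paths

From the test results above, we see that fasthttp's model is not well suited to scenarios where connections, once established, carry continuous "saturating" requests. It is better suited to short-lived connections, or to persistent connections without continuous saturating requests; only in those scenarios can its goroutine reuse model show its full benefit.

But even when it "degrades" to the net/http model, fasthttp still performs somewhat better than net/http. Why? The gain comes mainly from fasthttp's optimization tricks at the memory allocation level, such as heavy use of sync.Pool and avoiding conversions between []byte and string.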
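As an illustration of the second trick, the sketch below builds a header line directly in a reused []byte with append and strconv.AppendInt, so no intermediate string is created. It is a generic example of the technique, not code from fasthttp; appendContentLength is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strconv"
)

// appendContentLength appends a Content-Length header line to dst and
// returns the extended slice. Staying in []byte throughout avoids
// building intermediate strings. (Hypothetical helper.)
func appendContentLength(dst []byte, n int) []byte {
	dst = append(dst, "Content-Length: "...)
	dst = strconv.AppendInt(dst, int64(n), 10)
	return append(dst, "\r\n"...)
}

func main() {
	buf := make([]byte, 0, 64) // allocated once, reused for every call
	for _, n := range []int{10, 2048} {
		buf = appendContentLength(buf[:0], n) // buf[:0] keeps the capacity
		fmt.Printf("%q\n", buf)
	}
}
```

By comparison, a version built on fmt.Sprintf would allocate a new string for each call and would need another copy to turn it into []byte for writing.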

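So, under continuous "saturating" requests, how can we keep the number of goroutines in the fasthttp workerpool from growing linearly with the number of conns? The fasthttp project does not offer an answer, but one path worth considering is OS-level I/O multiplexing (implemented as epoll on Linux), the same mechanism the Go runtime netpoll uses. With multiplexing, each goroutine in the workerpool can handle multiple connections at the same time, so we can size the workerpool according to business scale instead of letting the number of goroutines grow almost without bound as it does today. Of course, introducing epoll at the user level may bring problems of its own, such as a larger share of system calls and higher response latency. Whether this path is viable depends on the concrete implementation and test results.

Note: the Concurrency field of fasthttp.Server can be used to limit the number of goroutines processing concurrently in the workerpool. However, because each goroutine handles only one connection, setting Concurrency too low may cause fasthttp to refuse service to later connections. For this reason, fasthttp's default Concurrency is: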

const DefaultConcurrency = 256 * 1024
