Using Splash with Python 3

Posted on 2021-08-09 by WalkonNet

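Splash is a JavaScript rendering service with an HTTP API. Its main features:

Process multiple web pages in parallel
Get the HTML source or take screenshots
Turn off images or use Adblock Plus rules to make rendering faster
Execute custom JavaScript in the page context
Control the rendering process with Lua scripts
Develop Splash Lua scripts in Splash-Jupyter notebooks
Get detailed rendering information in HAR format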

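1. Installing Scrapy-Splash

Installing Scrapy-Splash has two parts. The first is the Splash service itself, which is installed and run through Docker; once started, it exposes an HTTP interface that loads JavaScript-rendered pages. The second is the scrapy-splash Python library, which lets Scrapy use the Splash service. The three steps below cover Docker, the Splash service, and the Python package.

(1) Install Docker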

# Install the required packages:
yum install -y yum-utils device-mapper-persistent-data lvm2
# Set up the stable repository:
yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
# Install Docker CE:
yum install docker-ce
# Start Docker:
systemctl start docker
# Check that the installation works:
docker run hello-world

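(2) Install the Splash service

Pull the scrapinghub/splash image with Docker, then start a container to run the Splash service:

docker pull scrapinghub/splash
docker run -d -p 8050:8050 scrapinghub/splash
# Open port 8050 in a browser to check that the service is running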
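
The browser check can also be scripted. The sketch below uses only the Python standard library and assumes that Splash listens on http://localhost:8050 and that its `_ping` endpoint answers with a JSON object containing a `status` field, as recent Splash versions do; the helper names are hypothetical.

```python
import json
import urllib.request


def ping_url(base_url):
    """Build the URL of Splash's _ping endpoint for a given base URL."""
    return base_url.rstrip("/") + "/_ping"


def splash_is_up(base_url="http://localhost:8050", timeout=5):
    """Return True if the Splash service answers its health check."""
    try:
        with urllib.request.urlopen(ping_url(base_url), timeout=timeout) as resp:
            # Assumption: a healthy instance reports {"status": "ok", ...}
            return json.loads(resp.read().decode("utf-8")).get("status") == "ok"
    except (OSError, ValueError):
        # Connection refused, timeout, or a body that is not JSON
        return False
```

Calling `splash_is_up()` right after `docker run` may return False for a moment while the container is still starting.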

(3) Install the scrapy-splash Python package

pip3 install scrapy-splash
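
Installing the package does not by itself connect a Scrapy project to Splash. The fragment below is a sketch of the settings.py entries described in the scrapy-splash README, assuming the Splash service runs on localhost:8050; change SPLASH_URL to match your own host.

```python
# settings.py of a Scrapy project (sketch; SPLASH_URL is an assumption)
SPLASH_URL = "http://localhost:8050"

# Route requests through Splash and keep cookies consistent
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

# Avoid storing duplicate Splash arguments with every request
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Make duplicate filtering and HTTP caching aware of Splash arguments
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
HTTPCACHE_STORAGE = "scrapy_splash.SplashAwareFSCacheStorage"
```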

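2. Splash Lua scripts

Once the Splash service is running, open port 8050 in a browser, for example http://localhost:8050, to see its web interface.

The page has an input box that defaults to http://google.com. Replace it with the page you want to render, for example https://www.baidu.com, and click the Render me button. The result includes a screenshot of the rendered page, HAR loading statistics, and the page source.

The HAR data shows that Splash performed the whole rendering process, including loading CSS and JavaScript. The three parts of the result correspond to the three values in the return section of the script shown below the input box: html, png, and har: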

function main(splash, args)
  assert(splash:go(args.url))
  assert(splash:wait(0.5))
  return {
    html = splash:html(),
    png = splash:png(),
    har = splash:har(),
  }
end

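The script is written in Lua. It loads the page with go(), waits with wait(), then returns the page source, a screenshot, and the HAR information.

Now modify the script to visit www.baidu.com and return the page title through JavaScript, then run it: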

function main(splash, args)
  assert(splash:go("https://www.baidu.com"))
  assert(splash:wait(0.5))
  local title = splash:evaljs("document.title")
  return {
    title = title
  }
end

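# Result:
Splash Response: Object

title: "百度一下，你就知道"

This shows that Splash renders a page by running this entry script, so we can modify the script to analyse the page and return whatever results we need. The function must be named main(). It can return either a table (shown as an object) or a string: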

function main(splash)
  return {
    hello="world"
  }
end

# Result:
Splash Response: Object
hello: "world"


function main(splash)
  return "world"
end

# Result:
Splash Response: "world"
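
The two return forms also look different to an HTTP client. The sketch below assumes that Splash sends a Lua table as a JSON body with an application/json content type and a Lua string as plain text; the function name is hypothetical and the sample bodies are hand-written, not captured from a server.

```python
import json


def decode_execute_result(content_type, body):
    """Decode the body of a script response.

    Assumes a Lua table arrives as JSON and a Lua string as plain text.
    """
    text = body.decode("utf-8")
    if content_type.split(";")[0].strip() == "application/json":
        return json.loads(text)
    return text


# Hand-written stand-ins for the two results shown above
print(decode_execute_result("application/json", b'{"hello": "world"}'))
print(decode_execute_result("text/plain; charset=utf-8", b"world"))
```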

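3. Properties and methods of the splash object

In the examples above, the first parameter of main() is splash. This object is similar to the WebDriver object in Selenium: its properties and methods control the loading process. Some commonly used properties:

splash.args: holds the parameters the script was called with, such as the URL. For a GET request it contains the query parameters; for a POST request it contains the submitted data. splash.args is also available through the optional second parameter of main(), args: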

function main(splash,args)
    local url = args.url
end

# The second parameter args is the same as the splash.args property, so the following code is equivalent to the code above

function main(splash)
   local url=splash.args.url
end
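
To see where the values in splash.args come from, it helps to look at the client side. The sketch below builds a POST request for the execute endpoint using only the standard library; every JSON field other than lua_source is expected to show up in splash.args. The base URL, the helper name, and the argument values are assumptions for illustration.

```python
import json
import urllib.request

LUA_SOURCE = """
function main(splash, args)
  assert(splash:go(args.url))
  assert(splash:wait(args.wait))
  return {title = splash:evaljs("document.title")}
end
"""


def build_execute_request(splash_base, lua_source, **script_args):
    """Build a POST request for /execute; extra fields become splash.args."""
    payload = dict(script_args, lua_source=lua_source)
    return urllib.request.Request(
        splash_base.rstrip("/") + "/execute",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_execute_request(
    "http://localhost:8050", LUA_SOURCE, url="https://www.baidu.com", wait=0.5
)
# To send it to a running Splash instance:
# with urllib.request.urlopen(request) as resp:
#     print(resp.read().decode("utf-8"))
```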

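splash.js_enabled: enables or disables execution of JavaScript embedded in the page. The default is true (JavaScript is executed).

splash.resource_timeout: sets the default timeout for network requests, in seconds. Setting it to 0 or nil means no timeout: splash.resource_timeout=nil

splash.images_enabled: enables or disables image loading. Images are loaded by default: splash.images_enabled=true

splash.plugins_enabled: enables or disables browser plugins. They are disabled by default: splash.plugins_enabled=false

splash.scroll_position: gets or sets the current scroll position of the main window: splash.scroll_position={x=50,y=600}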

function main(splash, args)
  assert(splash:go('https://www.toutiao.com'))
  splash.scroll_position={y=400}
  return {
    png = splash:png()
  }
end

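# This scrolls down 400 pixels before taking the screenshot

splash.html5_media_enabled: enables or disables HTML5 media, including HTML5 video and audio (for example, playback of <video> elements)

Methods of the splash object:

splash:go(): requests a URL. It can issue both GET and POST requests and accepts request headers, form data, and other options:

ok, reason = splash:go{url, baseurl=nil, headers=nil, http_method="GET", body=nil, formdata=nil}

Parameters: url is the URL to request. baseurl is optional and sets the base URL used to resolve relative resource paths. headers is optional and sets the request headers. http_method is a string naming the HTTP method and defaults to GET. body is the raw request body string sent with a POST request; set a matching Content-type header (for example application/json) yourself. formdata defaults to nil and holds the form data for a POST request, sent with Content-type application/x-www-form-urlencoded.

The method returns the pair ok and reason. If ok is nil, the page failed to load and reason contains the error information.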

function main(splash, args)
  local ok, reason = splash:go{"http://httpbin.org/post", http_method="POST", body="name=Germey"}
  if ok then
        return splash:html()
  end
end
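
The body and formdata parameters differ only in how the payload is encoded. As a rough illustration in Python, outside Splash, here is the same field encoded both ways; the variable names are made up for this sketch.

```python
import json
from urllib.parse import urlencode

fields = {"name": "Germey"}

# URL-encoded form, the format used for formdata
# (application/x-www-form-urlencoded); this is also what the
# body string "name=Germey" in the example above contains
form_body = urlencode(fields)

# A JSON payload, as it might be passed through body together
# with a Content-type: application/json header
json_body = json.dumps(fields)

print(form_body)  # name=Germey
print(json_body)  # {"name": "Germey"}
```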

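splash:wait(): controls how long to wait on the page

ok, reason = splash:wait{time, cancel_on_redirect=false, cancel_on_error=true}

time is the number of seconds to wait. cancel_on_redirect, false by default, stops the wait when a redirect happens and returns the redirect result. cancel_on_error, true by default, stops the wait when a loading error occurs.

The result is again the pair ok and reason.

function main(splash, args)
  splash:go("https://www.toutiao.com")
  local ok, reason = splash:wait(1)
  return {
    ok=ok,
    reason=reason
  }
end

# ok is true when the wait completed successfully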

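splash:jsfunc()
lua_func = splash:jsfunc(func)
This method turns a JavaScript function into a function callable from Lua. The JavaScript source is passed as a string, here enclosed in double square brackets (a Lua long string). Global JavaScript functions can be wrapped directly.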

function main(splash, args)
  local get_div_count = splash:jsfunc([[
  function () {
    var body = document.body;
    var divs = body.getElementsByTagName('div');
    return divs.length;
  }
  ]])
  splash:go("https://www.baidu.com")
  return ("There are %s DIVs"):format(
    get_div_count())
end

# Result:
Splash Response: "There are 21 DIVs"

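splash:evaljs(): executes a JavaScript snippet in the page context and returns the result of the last statement

local title = splash:evaljs("document.title")

# Returns the page title

splash:runjs(): runs JavaScript code in the page context. It is similar to evaljs(), but is intended for performing actions or declaring functions rather than returning a value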

function main(splash, args)
  splash:go("https://www.baidu.com")
  splash:runjs("foo = function() { return 'bar' }")
  local result = splash:evaljs("foo()")
  return result
end

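splash:autoload(): sets JavaScript to be loaded automatically on every page load

ok, reason = splash:autoload{source_or_url, source=nil, url=nil}

Parameters:

source_or_url – a string with JavaScript source code, or a URL to load the JavaScript code from;
source – a string with JavaScript source code;
url – a URL to load the JavaScript source code from

This method only loads JavaScript code or libraries; it does not perform any action. To perform actions, call evaljs() or runjs().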

function main(splash, args)
  splash:autoload([[
    function get_document_title(){
      return document.title;
    }
  ]])
  splash:go("https://www.baidu.com")
  return splash:evaljs("get_document_title()")
end


# Load a JS library file
function main(splash, args)
  assert(splash:autoload("https://code.jquery.com/jquery-2.1.3.min.js"))
  assert(splash:go("https://www.taobao.com"))
  local version = splash:evaljs("$.fn.jquery")
  return 'JQuery version: ' .. version
end

splash:call_later(): schedules a callback to run after a delay

timer = splash:call_later(callback, delay): callback is the function to run, delay is the delay in seconds

function main(splash, args)
  local snapshots = {}
  local timer = splash:call_later(function()
    snapshots["a"] = splash:png()
    splash.scroll_position={y=500}
    splash:wait(1.0)
    snapshots["b"] = splash:png()
  end, 2)
  splash:go("https://www.toutiao.com")
  splash:wait(3.0)
  return snapshots
end

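# After 2 seconds the callback takes a screenshot, scrolls down, waits 1 more second, and takes a second screenshot

splash:http_get(): sends an HTTP GET request and returns the response

response = splash:http_get{url, headers=nil, follow_redirects=true}: url is the URL to load, headers adds HTTP headers, follow_redirects controls whether redirects are followed automatically and defaults to true

local reply = splash:http_get("http://example.com")

# Returns a response object; the result is not loaded into the browser window

splash:http_post(): sends an HTTP POST request

response = splash:http_post{url, headers=nil, follow_redirects=true, body=nil}

body specifies the request body (for example, form data)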

function main(splash, args)
  local treat = require("treat")
  local json = require("json")
  local response = splash:http_post{"http://httpbin.org/post",
    body=json.encode({name="Germey"}),
    headers={["content-type"]="application/json"}
  }
  return {
    html=treat.as_string(response.body),
    url=response.url,
    status=response.status
  }
end

# Result:
html:{"args":{},"data":"{\"name\": \"Germey\"}","files":{},"form":{},"headers":{"Accept-Encoding":"gzip, deflate","Accept-Language":"en,*","Connection":"close","Content-Length":"18","Content-Type":"application/json","Host":"httpbin.org","User-Agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/602.1 (KHTML, like Gecko) splash Version/9.0 Safari/602.1"},"json":{"name":"Germey"},"origin":"221.218.181.223","url":"http://httpbin.org/post"}
status: 200
url: http://httpbin.org/post

splash:set_content(): sets the content of the current page

ok, reason = splash:set_content{data, mime_type="text/html; charset=utf-8", baseurl=""}

function main(splash)
    assert(splash:set_content("<html><body><h1>hello</h1></body></html>"))
    return splash:png()
end

splash:html(): returns the source of the page as a string

function main(splash, args)
  splash:go("https://httpbin.org/get")
  return splash:html()
end

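splash:png(): returns a screenshot of the page in PNG format

splash:jpeg(): returns a screenshot of the page in JPEG format

splash:har(): returns a description of the page loading process in HAR format

splash:url(): returns the URL currently being visited

splash:get_cookies(): returns the cookies of the current page

splash:add_cookie(): adds a cookie for the current page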

function main(splash)
    splash:add_cookie{"sessionid", "237465ghgfsd", "/", domain="http://example.com"}
    splash:go("http://example.com/")
    return splash:get_cookies()
end

# Result:
Splash Response: Array[1]
0: Object
domain: "http://example.com"
httpOnly: false
name: "sessionid"
path: "/"
secure: false
value: "237465ghgfsd"
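
A cookie list like this one can be reused from Python, for instance to replay the session with an ordinary HTTP client. A minimal sketch, assuming the list has already been decoded from the JSON response into Python dictionaries with the fields shown above; the helper name is hypothetical.

```python
def cookies_to_dict(splash_cookies):
    """Map cookie names to values from a list of Splash cookie objects."""
    return {cookie["name"]: cookie["value"] for cookie in splash_cookies}


# Sample data copied from the result above
sample = [
    {
        "domain": "http://example.com",
        "httpOnly": False,
        "name": "sessionid",
        "path": "/",
        "secure": False,
        "value": "237465ghgfsd",
    }
]
print(cookies_to_dict(sample))  # {'sessionid': '237465ghgfsd'}
```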

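splash:clear_cookies(): clears all cookies

splash:delete_cookies{name=nil, url=nil}: deletes the specified cookies

splash:get_viewport_size(): returns the size (width and height) of the current browser viewport

splash:set_viewport_size(width, height): sets the size (width and height) of the browser viewport

splash:set_viewport_full(): resizes the browser viewport to fit the whole page

splash:set_user_agent(): overrides the User-Agent request header

splash:set_custom_headers(headers): sets custom request headers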

function main(splash)
  splash:set_custom_headers({
     ["User-Agent"] = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36",
     ["Site"] = "httpbin.org",
  })
  splash:go("http://httpbin.org/get")
  return splash:html()
end

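splash:on_request(callback): registers a function to be called before each HTTP request

splash:get_version(): returns Splash version information

splash:mouse_press(): triggers a mouse press event

splash:mouse_release(): triggers a mouse release event

splash:send_keys(): sends keyboard events to the page context, for example the Enter key: splash:send_keys("key_Enter")

splash:send_text(): sends text to the page context

splash:select(): selects the first node matching a CSS selector; if several nodes match, only the first is returned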

function main(splash)
  splash:go("https://www.baidu.com/")
  input = splash:select("#kw")
  input:send_text('Splash')
  splash:wait(3)
  return splash:png()
end

splash:select_all(): selects all nodes matching a CSS selector

function main(splash)
  local treat = require('treat')
  assert(splash:go("https://www.zhihu.com"))
  assert(splash:wait(1))
  local texts = splash:select_all('.ContentLayout-mainColumn .ContentItem-title')
  local results = {}
  for index, text in ipairs(texts) do
    results[index] = text.node.textContent
  end
  return treat.as_array(results)
end

# Returns the text content of all matching nodes

splash:mouse_click(): triggers a mouse click event

function main(splash)
  splash:go("https://www.baidu.com/")
  input = splash:select("#kw")
  input:send_text('Splash')
  submit = splash:select('#su')
  submit:mouse_click()
  splash:wait(3)
  return splash:png()
end

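For the other properties and methods available in Splash scripts, see the official documentation: http://splash.readthedocs.io/en/latest/scripting-ref.html

4. Response objects

Response objects are returned by splash methods such as splash:http_get() and splash:http_post(), and are passed to the callbacks registered with splash:on_response and splash:on_response_headers. They expose the following information:

response.url: the URL of the response

response.status: the HTTP status code of the response

response.ok: true on success, false otherwise

response.headers: the HTTP headers of the response

response.info: a table with the response data in HAR response format

response.body: the raw response body as a binary object; use treat.as_string to convert it to a string

response.request: the request object that produced the response

response:abort(): aborts the response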

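5. Element objects

Element objects wrap JavaScript DOM nodes. They are created whenever a method returns any kind of DOM node (Node, Element, HTMLElement, and so on); splash:select and splash:select_all return element objects.

element:mouse_click() triggers a mouse click event on the element

element:mouse_hover() triggers a mouse hover event on the element

element:styles() returns the computed styles of the element

element:bounds() returns the bounding client rectangle of the element

element:png() returns a screenshot of the element in PNG format

element:jpeg() returns a screenshot of the element in JPEG format

element:visible() checks whether the element is visible

element:focused() checks whether the element has focus

element:text() gets the text of the element

element:info() gets detailed information about the element

element:field_value() gets the value of a field element such as input, select, textarea, or button

element:form_values(values='auto'/'list'/'first') if the element is a form, returns a table with the form's values; the values argument selects one of three result formats

element:fill(values) fills in the form with the provided values

element:send_keys(keys) sends keyboard events to the element, for example Enter with send_keys('key_Enter'); for other keys see http://doc.qt.io/qt-5/qt.html

element:send_text() sends a string to the element

element:submit() submits the form element

element:exists() checks whether the element exists in the DOM

Element properties:

element.node exposes all the DOM methods and properties of the element, but not the methods and properties defined by Splash

element.inner_id the internal ID of the element

DOM properties supported through inheritance (some are read-only):

Properties inherited from HTMLElement: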

accessKey
accessKeyLabel (read-only)
contentEditable
isContentEditable (read-only)
dataset (read-only)
dir
draggable
hidden
lang
offsetHeight (read-only)
offsetLeft (read-only)
offsetParent (read-only)
offsetTop (read-only)
spellcheck
style – a table with styles which can be modified
tabIndex
title
translate

Properties inherited from Element:

attributes (read-only) – a table with attributes of the element
classList (read-only) – a table with class names of the element
className
clientHeight (read-only)
clientLeft (read-only)
clientTop (read-only)
clientWidth (read-only)
id
innerHTML
localeName (read-only)
namespaceURI (read-only)
nextElementSibling (read-only)
outerHTML
prefix (read-only)
previousElementSibling (read-only)
scrollHeight (read-only)
scrollLeft
scrollTop
scrollWidth (read-only)
tabStop
tagName (read-only)

Properties inherited from Node:

baseURI (read-only)
childNodes (read-only)
firstChild (read-only)
lastChild (read-only)
nextSibling (read-only)
nodeName (read-only)
nodeType (read-only)
nodeValue
ownerDocument (read-only)
parentNode (read-only)
parentElement (read-only)
previousSibling (read-only)
rootNode (read-only)
textContent

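6. Calling the Splash HTTP API

Splash can be controlled over an HTTP API using GET requests or POST data. Each endpoint returns a different kind of result depending on the parameters passed with the request. The endpoints are described below.

(1) render.html returns the HTML of the JavaScript-rendered page

Parameters:

url: the URL to render, a string

baseurl: the base URL used to render the page

timeout: the rendering timeout in seconds, 30 by default

resource_timeout: the timeout for individual network requests

wait: time to wait for updates after the page has loaded, 0 by default

proxy: a proxy profile name or a proxy URL in the format [protocol://][user:password@]proxyhost[:port]

js: the name of a JavaScript profile

js_source: JavaScript code to execute in the page context

filters: a comma-separated list of request filter names

allowed_domains: a list of allowed domain names

images: 1 to download images, 0 to skip them; 1 by default

headers: HTTP headers to set, as a JSON array

body: the data to send with a POST request

http_method: the HTTP method, GET by default

html5_media: 1 to enable HTML5 media, 0 to disable it; 0 by default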

import requests
url='http://172.16.32.136:8050/'
response=requests.get(url+'render.html?url=https://www.baidu.com&wait=3&images=0')
print(response.text)  # the source of the rendered page
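
Pasting the target URL straight into the query string works only while that URL has no query string of its own. Once it does, Splash may read parameters such as page=1 as its own. The sketch below builds the request URL with urlencode so that the target is escaped as a single value; the helper name and the localhost base URL are assumptions.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def render_url(splash_base, endpoint, target_url, **options):
    """Build a Splash render.* URL with the target URL safely encoded."""
    query = urlencode(dict(options, url=target_url))
    return "{}/{}?{}".format(splash_base.rstrip("/"), endpoint, query)


url = render_url(
    "http://localhost:8050",
    "render.html",
    "https://search.jd.com/Search?keyword=python&page=1",
    wait=3,
    images=0,
)
print(url)

# Round trip: the target URL comes back as one intact parameter
print(parse_qs(urlsplit(url).query)["url"][0])
```

With requests, passing the same values as a dict through the params argument of requests.get() has the same effect.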

(2) render.png returns a screenshot of the page in PNG format

import requests
url='http://172.16.32.136:8050/'
# Set the image width and height
response=requests.get(url+'render.png?url=https://www.taobao.com&wait=5&width=1000&height=700&render_all=1')
with open('taobao.png','wb') as f:
    f.write(response.content)
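
When rendering fails, for example on a timeout, the response body is an error message rather than an image, and writing it to disk unchecked leaves a corrupt .png file. A small guard, sketched here with hypothetical helper names, is to check the 8-byte PNG signature before saving.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def is_png(data):
    """Return True if the bytes start with the 8-byte PNG signature."""
    return data[:8] == PNG_SIGNATURE


def save_if_png(data, path):
    """Write data to path only when it looks like a PNG image."""
    if not is_png(data):
        return False
    with open(path, "wb") as f:
        f.write(data)
    return True
```

In the example above this would be called as save_if_png(response.content, 'taobao.png').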

(3) render.jpeg returns a screenshot in JPEG format

import requests
url='http://172.16.32.136:8050/'

response=requests.get(url+'render.jpeg?url=https://www.taobao.com&wait=5&width=1000&height=700&render_all=1')
with open('taobao.jpeg','wb') as f:
    f.write(response.content)

(4) render.har returns the HAR data recorded while the page loaded

import requests
url='http://172.16.32.136:8050/'
response=requests.get(url+'render.har?url=https://www.jd.com&wait=5')

print(response.text)
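
The HAR text is JSON, so it is usually more useful parsed than printed. The sketch below counts responses per HTTP status code, relying on the standard HAR layout (log, entries, response, status). The sample is a tiny hand-written fragment, not real Splash output, and the function name is hypothetical.

```python
import json
from collections import Counter


def status_counts(har_text):
    """Count responses per HTTP status code in a HAR document."""
    har = json.loads(har_text)
    return Counter(entry["response"]["status"] for entry in har["log"]["entries"])


# A tiny hand-written HAR fragment, not real Splash output
sample_har = json.dumps({
    "log": {
        "entries": [
            {"request": {"url": "https://www.jd.com/"}, "response": {"status": 200}},
            {"request": {"url": "https://www.jd.com/a.js"}, "response": {"status": 200}},
            {"request": {"url": "https://www.jd.com/missing.png"}, "response": {"status": 404}},
        ]
    }
})
print(status_counts(sample_har))
```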

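(5) render.json combines the functionality of the endpoints above and returns the result as JSON

Parameters:

html: whether to include the HTML in the output; 1 to include it, 0 to omit it; 0 by default

png: whether to include a PNG screenshot; 1 to include it, 0 to omit it; 0 by default

jpeg: whether to include a JPEG screenshot; 1 to include it, 0 to omit it; 0 by default

iframes: whether to include information about child frames in the output; 0 by default

script: whether to include the result of the executed JavaScript in the output

console: whether to include JavaScript console messages in the output

history: whether to include the history of requests and responses for the page's main frame

har: whether to include HAR data in the output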

import requests
url='http://172.16.32.136:8050/'
response=requests.get(url+'render.json?url=https://httpbin.org&html=1&png=1&history=1&har=1')

print(response.text)
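
In a render.json result the screenshot is not raw bytes: it arrives as a base64-encoded string inside the JSON object. The sketch below decodes it back into image bytes. The sample is a hand-made stand-in for a response, and the function name is hypothetical.

```python
import base64
import json


def extract_png(render_json_text):
    """Return the decoded PNG bytes from a render.json response body."""
    result = json.loads(render_json_text)
    return base64.b64decode(result["png"])


# Hand-made stand-in for a render.json response, not real Splash output
fake_png = b"\x89PNG\r\n\x1a\n" + b"demo"
sample = json.dumps({
    "png": base64.b64encode(fake_png).decode("ascii"),
    "html": "<html></html>",
})
print(extract_png(sample)[:8])
```

With a real response, extract_png(response.text) gives bytes that can be written to a file opened in 'wb' mode.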

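(6) execute runs a Lua script, which makes it possible to interact with the page

Parameters:

lua_source: the source of the Lua script, as a string

timeout: the timeout

allowed_domains: a list of allowed domain names

proxy: the proxy to use

filters: the request filters to apply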

import requests
from urllib.parse import quote
lua='''
function main(splash)
    return 'hello'
end
'''
url='http://172.16.32.136:8050/execute?lua_source='+quote(lua)
response=requests.get(url)
print(response.text)

Use a Lua script to get the body, URL, and status code of a page:

import requests
from urllib.parse import quote
lua='''
function main(splash,args)
    local treat=require("treat")
    local response=splash:http_get("http://httpbin.org/get")
    return {
        html=treat.as_string(response.body),
        url=response.url,
        status=response.status
    }
end
'''
url='http://172.16.32.136:8050/execute?lua_source='+quote(lua)
response=requests.get(url)
print(response.text)

# Result:
{"status": 200, "html": "{\"args\":{},\"headers\":{\"Accept-Encoding\":\"gzip, deflate\",\"Accept-Language\":\"en,*\",\"Connection\":\"close\",\"Host\":\"httpbin.org\",\"User-Agent\":\"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/602.1 (KHTML, like Gecko) splash Version/9.0 Safari/602.1\"},\"origin\":\"221.218.181.223\",\"url\":\"http://httpbin.org/get\"}\n", "url": "http://httpbin.org/get"}

7. Example

Scraping Python book data from JD:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time    : 2018/7/9 13:33
# @Author  : Py.qi
# @File    : JD.py
# @Software: PyCharm
import re

import requests
import pymongo
from pyquery import PyQuery as pq

client=pymongo.MongoClient('localhost',port=27017)
db=client['JD']

def page_parse(html):
    # Yield one dict per product found in the rendered search page
    doc=pq(html,parser='html')
    items=doc('#J_goodsList .gl-item').items()
    for item in items:
        # Lazily loaded images keep their address in data-lazy-img
        if item('.p-img img').attr('src'):
            image=item('.p-img img').attr('src')
        else:
            image=item('.p-img img').attr('data-lazy-img')
        texts={
            'image':'https:'+image,
            'price':item('.p-price').text()[:6],
            'title':re.sub('\n','',item('.p-name').text()),
            'commit':item('.p-commit').text()[:-3],
        }
        yield texts

def save_to_mongo(data):
    # insert_one() replaces Collection.insert(), which PyMongo 4 removed
    if db['jd_collection'].insert_one(data).acknowledged:
        print('Saved to MongoDB',data)
    else:
        print('MongoDB storage error',data)

def main(number):
    # Pass the target URL through params so requests encodes it; otherwise
    # Splash would read &page=... as one of its own parameters
    target='https://search.jd.com/Search?keyword=python&page={}'.format(number)
    params={'url':target,'wait':1,'images':0}
    response=requests.get('http://192.168.146.140:8050/render.html',params=params)
    data=page_parse(response.text)
    for i in data:
        save_to_mongo(i)
        #print(i)

if __name__ == '__main__':
    for number in range(1,200,2):
        print('Fetching page {}'.format(number))
        main(number)

For more information, see the official documentation: http://splash.readthedocs.io/en/stable/
