Scraping Qichacha company data with Python: automating the Qichacha login with Selenium

Posted on 2021-04-08 by WalkonNet

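I recently took on a small project that needed to batch-search companies on Qichacha (qcc.com) and save selected fields to an Excel file. Qichacha only shows the full search results to logged-in users, so the first step is to automate the login.

The hardest part of logging in to Qichacha from Python is dragging the slider of the verification widget automatically.

First, the tools and libraries used in the project.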

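The Python selenium library:

A tool for testing web applications. Selenium drives a real browser and simulates user actions in it, just as a real user would perform them.

Official documentation: https://www.selenium.dev/selenium/docs/api/py/index.html

Chrome browser:

Google's browser; no further introduction needed.

Chromedriver:

The driver that lets Selenium control Chrome. When downloading Chromedriver, make sure the first three parts of its version number match those of the installed Chrome.

Official download page: http://chromedriver.storage.googleapis.com/index.html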

To get the complete project code, follow the WeChat official account “python客棧” and reply “qcc”.

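Step 1: Download and set up Chromedriver

This assumes the latest Chrome is already installed (install it first if it is not). Download the Chromedriver build that matches your operating system and your Chrome version; as noted above, the version numbers must agree, at least in their first three parts.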
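
The version rule can be checked in code before the browser is started. A minimal sketch, assuming both version strings are already known (for example from Chrome's About page and from running chromedriver --version); the function name and the sample version numbers are illustrative and do not come from the project:

```python
def versions_match(chrome_version, driver_version, parts=3):
    """Return True when the first `parts` dot-separated fields agree."""
    chrome_fields = chrome_version.strip().split(".")[:parts]
    driver_fields = driver_version.strip().split(".")[:parts]
    # Both strings must actually contain that many fields.
    return len(chrome_fields) == parts and chrome_fields == driver_fields


if __name__ == "__main__":
    # Hypothetical version numbers, for illustration only.
    print(versions_match("89.0.4389.114", "89.0.4389.23"))  # True
    print(versions_match("89.0.4389.114", "90.0.4430.24"))  # False
```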

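The download from the official site is a zip archive; extract Chromedriver.exe from it.

Many articles online say that Chromedriver.exe only works once it has been added to the system environment variables. That works, but it is the more cumbersome route.

To keep the project portable and easy to run, this project does something else: copy the whole Application directory from the Chrome installation directory into the project directory. The project can then be moved to a new development environment without touching that machine's environment variables.

Copy the extracted Chromedriver.exe into that copied Application directory inside the project, so that Chromedriver.exe sits in the same directory as chrome.exe.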
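
With this layout the driver path can be resolved relative to the project instead of the current working directory. A minimal sketch, assuming the Application folder sits directly inside the project directory; locate_bundle is a hypothetical helper and not part of the project:

```python
from pathlib import Path


def locate_bundle(project_dir):
    """Return (chromedriver_path, chrome_path) inside project_dir/Application."""
    app_dir = Path(project_dir) / "Application"
    driver_path = app_dir / "chromedriver.exe"
    chrome_path = app_dir / "chrome.exe"
    # Fail early, with a readable message, if either file was not copied.
    for path in (driver_path, chrome_path):
        if not path.is_file():
            raise FileNotFoundError(f"expected file is missing: {path}")
    return driver_path, chrome_path
```

The returned driver path could then be passed as str(driver_path) wherever the article uses "Application/chromedriver.exe".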

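Step 2: Install the selenium library

Installing selenium with pip. The code in this article uses Selenium 3 calls (find_element_by_xpath, executable_path, desired_capabilities) that later Selenium 4 releases removed, so install a 3.x release. If the driver then fails to start with a urllib3 timeout ValueError, pinning urllib3 below 2 as well may help.

pip install "selenium<4"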

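Installing selenium from PyCharm

In the PyCharm menu bar, click [File] -> [Settings].

In the dialog that opens, go to the project's Python Interpreter page, click +, search for selenium and install it (the original post showed these steps in a screenshot).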

Step 3: Write the Python code that logs in to Qichacha automatically

First, import the modules that are needed

import json  # used later to parse the performance log
import time
from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

Initialize the webdriver with its basic options

    options = webdriver.ChromeOptions()
    # options.add_argument('--headless')  # enable headless mode
    options.add_argument('--disable-gpu')  # disable the GPU; works around some odd issues
    options.add_argument('blink-settings=imagesEnabled=false')  # do not load images, for speed
    options.add_argument('--disable-infobars')  # hide the "controlled by automated test software" infobar
    options.add_argument('--start-maximized')
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_experimental_option('useAutomationExtension', False)
    d = DesiredCapabilities.CHROME.copy()  # copy, so the shared class-level dict is not modified
    d['goog:loggingPrefs'] = {'performance': 'ALL'}  # required for reading headers from the performance log
    driver = webdriver.Chrome(options=options, executable_path="Application/chromedriver.exe", desired_capabilities=d)
    # Hide window.navigator.webdriver from the verification widget;
    # the script runs before every new document is loaded
    driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
        "source": """
        Object.defineProperty(navigator, 'webdriver', {
          get: () => undefined
        })
      """
    })
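
Recent Selenium 4 releases no longer accept executable_path or desired_capabilities in webdriver.Chrome. A rough equivalent of the setup above for Selenium 4, offered as a sketch only: the driver path goes through a Service object and the logging preference is set on the options. build_driver and the three constants are names made up for this example, and the function has not been run against Qichacha.

```python
CHROME_ARGS = [
    "--disable-gpu",
    "blink-settings=imagesEnabled=false",
    "--disable-infobars",
    "--start-maximized",
]
LOGGING_PREFS = {"performance": "ALL"}  # needed to read headers from the performance log
HIDE_WEBDRIVER_JS = (
    "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
)


def build_driver(driver_path="Application/chromedriver.exe"):
    """Create a Chrome driver with the article's options, using Selenium 4 calls."""
    # Imported here so this module can be loaded without selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    options = webdriver.ChromeOptions()
    for arg in CHROME_ARGS:
        options.add_argument(arg)
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_experimental_option("useAutomationExtension", False)
    # Replaces the DesiredCapabilities dict of the Selenium 3 version.
    options.set_capability("goog:loggingPrefs", LOGGING_PREFS)

    # Replaces the executable_path keyword of the Selenium 3 version.
    driver = webdriver.Chrome(service=Service(driver_path), options=options)
    driver.execute_cdp_cmd(
        "Page.addScriptToEvaluateOnNewDocument", {"source": HIDE_WEBDRIVER_JS}
    )
    return driver
```

Element lookups change in the same way: driver.find_element(By.XPATH, ...) takes the place of find_element_by_xpath(...).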

Simulate the user's actions on the page

    driver.implicitly_wait(2)  # implicit wait for element lookups
    driver.set_window_size(width=800, height=600)
    driver.get("https://www.QCC.com/")
    driver.find_element_by_xpath('//a[@class="navi-btn"][1]').click()
    locator = (By.ID, "dom_id_two")
    try:
        WebDriverWait(driver, 20, 0.5).until(EC.presence_of_element_located(locator))
    except Exception:
        # The login panel never appeared: quit and re-raise instead of
        # carrying on with a closed browser
        driver.quit()
        raise
    # WebDriverWait(driver,20,0.5).until(lambda driver:driver.find_element_by_xpath('//span[@class="nc_iconfont btn_slide"]'))
    # Locate the account input box
    driver.find_element_by_xpath('//input[@id="nameVerify"]').send_keys('your phone number')

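Drag the verification slider automatically and verify

The verification widget detects whether the browser is driven by webdriver: its JS checks the value of window.navigator.webdriver.

That value therefore has to be overridden before the page loads, which is what this call from the setup code does:

    driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
        "source": """
        Object.defineProperty(navigator, 'webdriver', {
          get: () => undefined
        })
      """
    })

Once window.navigator.webdriver has been overridden, simulate dragging the slider of the verification widget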

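    # Locate the slider handle
    start = driver.find_element_by_xpath('//span[@class="nc_iconfont btn_slide"]')
    action = ActionChains(driver)
    # drag_and_drop_by_offset already presses, moves and releases,
    # so the separate click_and_hold call is not needed
    action.drag_and_drop_by_offset(start, 308, 0).perform()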

Check whether the verification succeeded

    time.sleep(2)
    # Overlay a status banner on the login panel: #tt shows a message, #ts a counter
    style = 'position:absolute;top:0;left:0;width:100%;z-index:999;font-size:40px;line-height:100px;background:rgba(255,217,0,90%);height:100%;text-align:center;color:#000;'
    driver.execute_script(
        'var htm=document.getElementsByClassName("login-sao-panel")[0];htm.innerHTML+="<div style={style}><b id=tt></b><b id=ts></b></div>"'.format(
            style=style))

    ts = driver.find_element_by_id('ts')
    tt = driver.find_element_by_id('tt')

    try:
        # An "errloading" element means the slider check failed
        driver.find_element_by_xpath('//div[@class="errloading"][1]')
        set_id_att(driver, 'tt', 'innerHTML', 'Please verify manually')
    except Exception:
        tr = driver.find_element_by_xpath('//span[@class="nc-lang-cnt"][1]')
        # '验证通过' ("verification passed") is the text the page shows on success,
        # so it must stay in the site's Simplified Chinese
        if tr.text != '验证通过':
            set_id_att(driver, 'tt', 'innerHTML', 'Please verify manually')
            # for i in range(1, 6):
            #    if tr.text == '验证通过':
            #        break
            #    set_id_att(driver, 'ts', 'innerHTML', i)
            #    time.sleep(1)
    try:
        # Request the SMS verification code
        driver.find_element_by_xpath('//a[@class="text-primary vcode-btn get-mobile-code"]').click()
    except Exception:
        pass
    # code=driver.find_element_by_xpath('//input[@id="vcodeNormal"]')
    set_id_att(driver, 'tt', 'innerHTML', 'Please enter the SMS verification code')
    # rjs='const callback = arguments[arguments.length - 1];callback({v:document.getElementById("vcodeNormal").value})'
    rjs = 'return document.getElementById("vcodeNormal").value'
    locator = (By.CLASS_NAME, "nav-user")
    but = driver.find_element_by_xpath('//form[@id="user_login_verify"]/button')
    # Poll once a second until the user has typed the code. The original
    # range(1, 1) was empty, so the loop never ran; 60 s is an assumed limit.
    for i in range(1, 61):
        # code = driver.execute_async_script(rjs)
        code = driver.execute_script(rjs)
        if len(code) == 6:  # the code is checked for a length of six characters
            but.click()
            # WebDriverWait(driver, 5, 0.5).until(EC.presence_of_element_located(locator))
            break
        set_id_att(driver, 'ts', 'innerHTML', i)
        time.sleep(1)
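
Waiting for the user to type the code is a general polling pattern and can be factored out. A minimal sketch; poll_until is a hypothetical helper, and the 60-second default is an assumption rather than a value from the project:

```python
import time


def poll_until(read_value, is_ready, timeout=60.0, interval=1.0, on_tick=None):
    """Call read_value() every `interval` seconds until is_ready(value) is true.

    Returns the accepted value, or None if `timeout` seconds pass first.
    """
    deadline = time.monotonic() + timeout
    tick = 0
    while True:
        value = read_value()
        if is_ready(value):
            return value
        if time.monotonic() >= deadline:
            return None
        tick += 1
        if on_tick is not None:
            on_tick(tick)  # e.g. update the counter shown on the page
        time.sleep(interval)
```

With the names used in the article, the loop would become roughly code = poll_until(lambda: driver.execute_script(rjs), lambda v: len(v) == 6, on_tick=lambda i: set_id_att(driver, 'ts', 'innerHTML', i)), followed by but.click() when code is not None.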

The code above adds a few status elements to the page, together with the JS that inserts them

    style = 'position:absolute;top:0;left:0;width:100%;z-index:999;font-size:40px;line-height:100px;background:rgba(255,217,0,90%);height:100%;text-align:center;color:#000;'
    driver.execute_script(
        'var htm=document.getElementsByClassName("login-sao-panel")[0];htm.innerHTML+="<div style={style}><b id=tt></b><b id=ts></b></div>"'.format(
            style=style))

Wrap the page-element access in functions so it is easy to reuse later

def set_id_att(bor, id, att, val):
    bor.execute_script('document.getElementById("{a}").{b}="{c}"'.format(a=id, b=att, c=val))

def set_class_att(bor, classs, id, att, val):
    bor.execute_script('document.getElementsByClassName("{a}")[{d}].{b}="{c}"'.format(a=classs, b=att, c=val, d=id))
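
Building the script with str.format breaks as soon as a value contains a double quote. Selenium's execute_script also accepts extra arguments, which the script reads from its arguments array, so the values need no quoting at all. A sketch of that variant; the function and constant names are made up for this example:

```python
SET_BY_ID_JS = "document.getElementById(arguments[0])[arguments[1]] = arguments[2];"
SET_BY_CLASS_JS = (
    "document.getElementsByClassName(arguments[0])[arguments[1]]"
    "[arguments[2]] = arguments[3];"
)


def set_attr_by_id(driver, element_id, attr, value):
    """Set a property on the element with the given id."""
    driver.execute_script(SET_BY_ID_JS, element_id, attr, value)


def set_attr_by_class(driver, class_name, index, attr, value):
    """Set a property on the index-th element that has the given class."""
    driver.execute_script(SET_BY_CLASS_JS, class_name, index, attr, value)
```

A call such as set_attr_by_id(driver, 'tt', 'innerHTML', 'He said "wait"') then works even though the message contains quotes.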

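After a successful login, the page's headers and cookies are still needed so that the requests library can use them later

Getting the page's headers with selenium

import json

def getheader(browser):
    # Each performance-log entry wraps a DevTools event as a JSON string
    for responseReceived in browser.get_log('performance'):
        try:
            response = json.loads(responseReceived['message'])['message']['params']['response']
            if response['url'] == browser.current_url:
                return response['requestHeaders']
        except (KeyError, TypeError, ValueError):
            pass  # not a response event, or no request headers recorded
    return None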
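
Each entry returned by get_log('performance') is a dict whose 'message' field is a JSON string wrapping a DevTools event, which is what the function above unpacks. Separating that parsing from the driver makes it possible to test the lookup without a browser. A sketch along those lines; find_request_headers is a hypothetical name and the sample entry is made up for illustration:

```python
import json


def find_request_headers(log_entries, url):
    """Return the requestHeaders of the first logged response for `url`, else None."""
    for entry in log_entries:
        try:
            event = json.loads(entry["message"])["message"]
            response = event["params"]["response"]
        except (KeyError, TypeError, ValueError):
            continue  # not a response event, or not valid JSON
        if not isinstance(response, dict):
            continue
        if response.get("url") == url and "requestHeaders" in response:
            return response["requestHeaders"]
    return None


if __name__ == "__main__":
    # A made-up entry shaped like a Network.responseReceived event.
    sample = [{
        "message": json.dumps({
            "message": {
                "method": "Network.responseReceived",
                "params": {"response": {
                    "url": "https://example.com/",
                    "requestHeaders": {"user-agent": "demo"},
                }},
            }
        })
    }]
    print(find_request_headers(sample, "https://example.com/"))  # {'user-agent': 'demo'}
```

Against a live driver the call would be find_request_headers(driver.get_log('performance'), driver.current_url).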

Getting the post-login cookies with selenium

# headers is the dict returned by getheader(driver)
cookie = [item["name"] + "=" + item["value"] for item in driver.get_cookies()]
headers['cookie'] = '; '.join(cookie)  # cookie pairs are separated by "; "
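
The same join can be written as a small function, which also makes it easy to check. A sketch, assuming the input is the list of dicts returned by driver.get_cookies(); cookie_header is a hypothetical name and the sample values are made up:

```python
def cookie_header(cookies):
    """Build a Cookie header value from Selenium-style cookie dicts."""
    return "; ".join(f"{c['name']}={c['value']}" for c in cookies)


if __name__ == "__main__":
    sample = [
        {"name": "sid", "value": "abc", "domain": ".example.com"},
        {"name": "lang", "value": "zh"},
    ]
    print(cookie_header(sample))  # sid=abc; lang=zh
```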

The complete code follows

import json
import time
from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

def getheader(browser):
    # Each performance-log entry wraps a DevTools event as a JSON string
    for responseReceived in browser.get_log('performance'):
        try:
            response = json.loads(responseReceived['message'])['message']['params']['response']
            if response['url'] == browser.current_url:
                return response['requestHeaders']
        except (KeyError, TypeError, ValueError):
            pass  # not a response event, or no request headers recorded
    return None

def set_id_att(bor, id, att, val):
    bor.execute_script('document.getElementById("{a}").{b}="{c}"'.format(a=id, b=att, c=val))

def set_class_att(bor, classs, id, att, val):
    bor.execute_script('document.getElementsByClassName("{a}")[{d}].{b}="{c}"'.format(a=classs, b=att, c=val, d=id))

def login():
    options = webdriver.ChromeOptions()
    # options.add_argument('--headless')  # enable headless mode
    options.add_argument('--disable-gpu')  # disable the GPU; works around some odd issues
    options.add_argument('blink-settings=imagesEnabled=false')  # do not load images, for speed
    options.add_argument('--disable-infobars')  # hide the "controlled by automated test software" infobar
    options.add_argument('--start-maximized')
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_experimental_option('useAutomationExtension', False)
    d = DesiredCapabilities.CHROME.copy()  # copy, so the shared class-level dict is not modified
    d['goog:loggingPrefs'] = {'performance': 'ALL'}  # required for reading headers from the performance log
    driver = webdriver.Chrome(options=options, executable_path="Application/chromedriver.exe", desired_capabilities=d)
    # Hide window.navigator.webdriver from the verification widget
    driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
        "source": """
        Object.defineProperty(navigator, 'webdriver', {
          get: () => undefined
        })
      """
    })
    driver.implicitly_wait(2)  # implicit wait for element lookups
    driver.set_window_size(width=800, height=600)
    driver.get("https://www.QCC.com/")
    driver.find_element_by_xpath('//a[@class="navi-btn"][1]').click()
    locator = (By.ID, "dom_id_two")
    try:
        WebDriverWait(driver, 20, 0.5).until(EC.presence_of_element_located(locator))
    except Exception:
        # The login panel never appeared: quit and re-raise instead of
        # carrying on with a closed browser
        driver.quit()
        raise
    # WebDriverWait(driver,20,0.5).until(lambda driver:driver.find_element_by_xpath('//span[@class="nc_iconfont btn_slide"]'))
    # Locate the account input box
    driver.find_element_by_xpath('//input[@id="nameVerify"]').send_keys('your phone number')
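    # Locate the slider handle
    start = driver.find_element_by_xpath('//span[@class="nc_iconfont btn_slide"]')
    action = ActionChains(driver)
    # drag_and_drop_by_offset already presses, moves and releases,
    # so the separate click_and_hold call is not needed
    action.drag_and_drop_by_offset(start, 308, 0).perform()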
    time.sleep(2)
    # Overlay a status banner on the login panel: #tt shows a message, #ts a counter
    style = 'position:absolute;top:0;left:0;width:100%;z-index:999;font-size:40px;line-height:100px;background:rgba(255,217,0,90%);height:100%;text-align:center;color:#000;'
    driver.execute_script(
        'var htm=document.getElementsByClassName("login-sao-panel")[0];htm.innerHTML+="<div style={style}><b id=tt></b><b id=ts></b></div>"'.format(
            style=style))

    ts = driver.find_element_by_id('ts')
    tt = driver.find_element_by_id('tt')

    try:
        # An "errloading" element means the slider check failed
        driver.find_element_by_xpath('//div[@class="errloading"][1]')
        set_id_att(driver, 'tt', 'innerHTML', 'Please verify manually')
    except Exception:
        tr = driver.find_element_by_xpath('//span[@class="nc-lang-cnt"][1]')
        # '验证通过' ("verification passed") is the text the page shows on success,
        # so it must stay in the site's Simplified Chinese
        if tr.text != '验证通过':
            set_id_att(driver, 'tt', 'innerHTML', 'Please verify manually')
            # for i in range(1, 6):
            #    if tr.text == '验证通过':
            #        break
            #    set_id_att(driver, 'ts', 'innerHTML', i)
            #    time.sleep(1)
    try:
        # Request the SMS verification code
        driver.find_element_by_xpath('//a[@class="text-primary vcode-btn get-mobile-code"]').click()
    except Exception:
        pass
    # code=driver.find_element_by_xpath('//input[@id="vcodeNormal"]')
    set_id_att(driver, 'tt', 'innerHTML', 'Please enter the SMS verification code')
    # rjs='const callback = arguments[arguments.length - 1];callback({v:document.getElementById("vcodeNormal").value})'
    rjs = 'return document.getElementById("vcodeNormal").value'
    locator = (By.CLASS_NAME, "nav-user")
    but = driver.find_element_by_xpath('//form[@id="user_login_verify"]/button')
    # Poll once a second until the user has typed the code. The original
    # range(1, 1) was empty, so the loop never ran; 60 s is an assumed limit.
    for i in range(1, 61):
        # code = driver.execute_async_script(rjs)
        code = driver.execute_script(rjs)
        if len(code) == 6:  # the code is checked for a length of six characters
            but.click()
            # WebDriverWait(driver, 5, 0.5).until(EC.presence_of_element_located(locator))
            break
        set_id_att(driver, 'ts', 'innerHTML', i)
        time.sleep(1)

    headers = getheader(driver)  # read the request headers from the performance log
    ip = "202.121.178.244"
    if headers:
        # Collect the cookies and store them in the headers
        cookie = [item["name"] + "=" + item["value"] for item in driver.get_cookies()]
        headers['cookie'] = '; '.join(cookie)
        # Drop the HTTP/2 pseudo-headers; pop() does not fail when one is absent
        for key in (':authority', ':method', ':path', ':scheme'):
            headers.pop(key, None)
        headers['X-Forwarded-For'] = ip
        headers['X-Remote-IP'] = ip
        headers['X-Originating-IP'] = ip
        headers['X-Remote-Addr'] = ip
        headers['X-Client-IP'] = ip
    return headers

headers = login()  # log in automatically and get the post-login headers, cookies included

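To get the complete project code (Selenium login to Qichacha, plus automatic searching with the requests library and saving the selected fields to Excel), follow the WeChat official account “python客棧” mentioned above and reply “qcc”.

This article showed how to log in to Qichacha with Python's selenium library: saving the cookies and headers with Selenium, passing the slider verification automatically, and a few of Selenium's ways of manipulating page elements.

The next article will cover using the requests library to search Qichacha for companies automatically and extract the selected fields.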
