Scraping Meme Images with Python and Baidu AI

Posted on 2021-06-27 by WalkonNet

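This article first scrapes meme images from the web, then uses Baidu AI to recognize the caption text on each image and renames the file with that text. When you want to send a meme, you no longer need to open the images one by one; you can pick one directly by its file name.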

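1. Applying for a Baidu AI Open Platform Key

This example uses the Baidu AI API for text recognition, so you first need to apply for access to that API. The steps are as follows: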

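In a web browser (Chrome or Firefox, for example), enter ai.baidu.com in the address bar to open the Baidu Cloud AI site, then click the "Console" (控制台) button in the top-right corner of the page.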

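On the login page, enter your Baidu account and password. If you do not have an account, click the "Register now" (立即注册) link to create one.

After logging in you land on the console page. Click "Products and Services" (产品服务) in the left navigation to expand the list. The "Artificial Intelligence" (人工智能) category is at the lower right of that list; choose "Image Recognition" (图像识别) there, or go straight to "Text Recognition" (文字识别).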

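This opens the "Image Recognition - Overview" page. To use the Baidu Cloud AI API you need permission, and before you can request permission you must create your own application, so click the "Create Application" (创建应用) button.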

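On the "Create Application" page, enter an application name, choose an application type, and select the interfaces. Note that you can select more interfaces than you need right now, including any you might use later, so that they are already available when you build other examples. After selecting the interfaces, set the text recognition package name to "Not needed" (不需要), enter a description, and click the "Create now" (立即创建) button.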

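Once the application is created, click "Back to application list" (返回应用列表). The list page shows the new application together with the AppID, API Key, and Secret Key that Baidu Cloud assigned to it. These values are different for every application, so store them safely for use during development.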
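One way to keep these three values out of the source code is to read them from environment variables. The sketch below is only an illustration: the variable names (BAIDU_AIP_APP_ID and so on) and the helper load_baidu_config are hypothetical, not something Baidu prescribes. The returned dictionary uses the same keys (appId, apiKey, secretKey) as the config dictionaries in the scripts later in this article.

```python
import os

# Hypothetical environment variable names; pick whatever fits your setup.
ENV_NAMES = {
    "appId": "BAIDU_AIP_APP_ID",
    "apiKey": "BAIDU_AIP_API_KEY",
    "secretKey": "BAIDU_AIP_SECRET_KEY",
}


def load_baidu_config(env=None):
    """Build the config dict from environment variables.

    Raises KeyError naming every variable that is missing or empty.
    """
    env = os.environ if env is None else env
    config = {key: env.get(name, "") for key, name in ENV_NAMES.items()}
    missing = [ENV_NAMES[key] for key, value in config.items() if not value]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return config
```

With the variables set in your shell, the result of load_baidu_config() could be used in place of a hard-coded config dictionary.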

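2. Scraping Memes from a Tieba Thread

This example uses a Baidu Tieba thread that contains some homemade memes: https://tieba.baidu.com/p/5522091060
The goal is to download every image in it. The steps are as follows: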

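First, use the browser's Network panel to check whether the response matches what the Elements panel shows, that is, whether the HTML response already contains the data we want rather than having it loaded through JavaScript tricks. Copy the link of the first image and search for it in the Response tab of the Network panel.

On the first page the link is present in the response, and the capture shows no sign of data being loaded dynamically via Ajax.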
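The same check can be done in code instead of by eye. The following is a minimal sketch that assumes you already have the raw response body as a string; it uses only the standard library's HTMLParser, and the helper names are made up for illustration. The CSS class BDE_Image is the one the full script in this section looks for.

```python
from html.parser import HTMLParser


class ImageSrcCollector(HTMLParser):
    """Collect the src of every <img> tag that carries a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if self.css_class in classes and attr_map.get("src"):
            self.sources.append(attr_map["src"])


def image_sources(html, css_class="BDE_Image"):
    """Return the src values of all matching <img> tags in the HTML text."""
    collector = ImageSrcCollector(css_class)
    collector.feed(html)
    return collector.sources


def is_in_static_html(html, image_url):
    """True if image_url already appears as an <img src> in the raw HTML."""
    return image_url in image_sources(html)
```

If is_in_static_html returns False for an image you can see in the Elements panel, the image is probably being loaded by a later request.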

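Clicking through to the second page does reveal an Ajax request in the capture.

Searching that response for the URL of the first image on the page finds it there as well.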

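The request carries three parameters. Guessing that pn stands for page_number, replay the request with Postman or with your own code, remembering to include the Host and X-Requested-With headers, and check whether pn=1 returns the data for the first page. It does, which means the data for every page can be fetched through this interface.

So the plan is: load the thread once to find the number of the last page, loop over all the pages, parse each one to extract the image URLs, and write the URLs to a file. The images are then downloaded with multiple threads. The full code is as follows.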
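To make the shape of that request concrete, here is a small sketch that builds it with the standard library, without sending anything. The parameter names pn, ajax, and t are the ones used by the full script in this section; the function name and the short User-Agent placeholder are assumptions for illustration.

```python
import time
from urllib.parse import urlencode
from urllib.request import Request

THREAD_URL = "https://tieba.baidu.com/p/5522091060"


def build_page_request(page_number, thread_url=THREAD_URL):
    """Build (but do not send) the Ajax-style request for one page."""
    query = urlencode({
        "pn": page_number,       # guessed to mean "page number"
        "ajax": "1",
        "t": int(time.time()),   # current Unix timestamp
    })
    headers = {
        "Host": "tieba.baidu.com",
        "X-Requested-With": "XMLHttpRequest",
        "User-Agent": "Mozilla/5.0",  # placeholder; use a real browser string
    }
    return Request(thread_url + "?" + query, headers=headers)
```

Passing the result to urllib.request.urlopen would actually send it; comparing the response for pn=1 with the first page in the browser is the verification described above.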

# Scrape all the images in a Baidu Tieba thread
import requests
import time
import threading
import queue
from bs4 import BeautifulSoup
import chardet
import os

tiezi_url = "https://tieba.baidu.com/p/5522091060"
headers = {
    'Host': 'tieba.baidu.com',
    # The header value must not repeat the "User-Agent:" name itself
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KH'
                  'TML, like Gecko) Chrome/90.0.4430.212 Safari/537.36',
}
pic_save_dir = 'tiezi_pic/'
os.makedirs(pic_save_dir, exist_ok=True)  # create the folder if it does not exist

pic_urls_file = 'tiezi_pic_urls.txt'
download_q = queue.Queue()  # download queue


# Get the number of pages in the thread (None if it cannot be determined)
def get_page_count():
    try:
        resp = requests.get(tiezi_url, headers=headers, timeout=5)
        resp.raise_for_status()
        resp.encoding = chardet.detect(resp.content)['encoding']
        soup = BeautifulSoup(resp.text, 'lxml')
        a_s = soup.find("ul", attrs={'class': 'l_posts_num'}).find_all("a")
        for a in a_s:
            # The "last page" link. Tieba serves Simplified Chinese,
            # so accept '尾页' as well as the Traditional form.
            if a.get_text() in ('尾页', '尾頁'):
                return a['href'].split('=')[1]
    except Exception as e:
        print(str(e))
    return None


# Download thread
class PicSpider(threading.Thread):
    def __init__(self, t_name, func):
        super().__init__(name=t_name)
        self.func = func

    def run(self):
        self.func()


# Collect the URLs of all images on one page
def get_pics(count):
    params = {
        'pn': count,
        'ajax': '1',
        't': int(time.time())
    }
    try:
        resp = requests.get(tiezi_url, headers=headers, timeout=5, params=params)
        resp.raise_for_status()
        resp.encoding = chardet.detect(resp.content)['encoding']
        soup = BeautifulSoup(resp.text, 'lxml')
        imgs = soup.find_all('img', attrs={'class': 'BDE_Image'})
        # Open the URL file once per page and append one URL per line
        with open(pic_urls_file, 'a') as fout:
            for img in imgs:
                print(img['src'])
                fout.write(img['src'] + '\n')
    except Exception as e:
        # Report the failure instead of silently ignoring it
        print("Failed to parse page", count, ":", e)


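# Function run by each download thread
def down_pics():
    while True:
        try:
            # get_nowait() cannot block forever if another thread takes
            # the last item between an empty() check and get()
            data = download_q.get_nowait()
        except queue.Empty:
            break
        download_pic(data)
        download_q.task_done()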


# Download one image
def download_pic(img_url):
    img_url = img_url.strip()  # drop the trailing newline read from the URL file
    try:
        resp = requests.get(img_url, headers=headers, timeout=10)
        if resp.status_code == 200:
            print("Downloading image: " + img_url)
            pic_name = img_url.split("/")[-1]
            with open(pic_save_dir + pic_name, "wb") as f:
                f.write(resp.content)
    except Exception as e:
        print(e)


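if __name__ == '__main__':
    print("Checking whether the URL file exists:")
    if not os.path.exists(pic_urls_file):
        print("Not found, parsing the thread...")
        page_count = get_page_count()
        if page_count is not None:
            headers['X-Requested-With'] = 'XMLHttpRequest'
            for page in range(1, int(page_count) + 1):
                get_pics(page)
        print("Finished parsing the links!")
        # pop() with a default avoids a KeyError if the header was never set
        headers.pop('X-Requested-With', None)
    else:
        print("Found")
    print("Starting image download...")
    # The images are not served from tieba.baidu.com, so drop the fixed
    # Host header and let requests derive it from each image URL
    headers.pop('Host', None)
    pic_list = []
    if os.path.exists(pic_urls_file):
        with open(pic_urls_file, "r") as fo:
            pic_list = fo.readlines()

    threads = []
    for pic in pic_list:
        download_q.put(pic)
    # A small fixed pool is enough; one thread per image is wasteful
    for i in range(min(8, len(pic_list))):
        t = PicSpider(t_name='thread' + str(i), func=down_pics)
        t.daemon = True
        t.start()
        threads.append(t)
    download_q.join()
    for t in threads:
        t.join()
    print("All images downloaded")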

Running the script above saves the images from the thread into the tiezi_pic/ folder.
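As an aside, the hand-written thread class and queue could also be replaced by the standard library's ThreadPoolExecutor. This is a sketch of that alternative, not the approach the script above takes. The fetch argument is a stand-in for any function that downloads one URL and raises an exception on failure; note that download_pic as written catches its own exceptions, so it would always be reported as successful here.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(urls, fetch, max_workers=8):
    """Run fetch(url) for every URL on a bounded pool of worker threads.

    Returns a dict mapping each URL to True on success or False on failure.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Remember which future belongs to which URL
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                future.result()  # re-raises any exception from fetch
                results[url] = True
            except Exception as exc:
                print("failed:", url, exc)
                results[url] = False
    return results
```

The pool size caps the number of simultaneous connections, and the returned dictionary makes it easy to retry only the URLs that failed.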

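Next, OCR is used to pull the caption text out of each meme and name the image after it, so that a file search for a keyword quickly finds the meme you need. Tesseract, the open-source OCR engine backed by Google, is a poor fit for images like these, which are large with only a little small text: its recognition rate is too low, and sometimes it recognizes nothing at all. Baidu Cloud OCR is a better choice here because it locates the text regions in the image automatically and returns all the text it finds.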

3. Using Baidu-aip

Once you have the application keys from Baidu AI, install Baidu-aip on your local system with the following command:

pip install baidu-aip

First, recognize a single image to see how well it works:

from aip import AipOcr

# Create an AipOcr object
config = {
    'appId': 'your appId',
    'apiKey': 'your apiKey',
    'secretKey': 'your secretKey'
}
client = AipOcr(**config)


# Recognize the text in an image
def img_to_str(image_path):
    # Read the image
    with open(image_path, 'rb') as fp:
        image = fp.read()

    # Call general text recognition, passing the bytes of the local image
    result = client.basicGeneral(image)
    # Join the recognized lines; return an empty string if there are none
    if 'words_result' in result:
        return '\n'.join([w['words'] for w in result['words_result']])
    return ''


if __name__ == '__main__':
    print(img_to_str('tiezi_pic/5c0ddb1e4134970aebd593e29ecad1c8a5865dbd.jpg'))

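Run the program. Baidu AI returns JSON-formatted data, which arrives as a dictionary like the one below. It contains three keys: log_id, words_result_num, and words_result. words_result_num is the number of text lines recognized, and words_result is a list with one entry per recognized line; each entry is itself a dictionary whose words key holds the recognized text.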

{'words_result': [{'words': 'o。o'}, {'words': '6226-16:59'}, {'words': '絕望jpg'}], 'log_id': 1393611954748129280, 'words_result_num': 3}
o。o
6226-16:59
絕望jpg
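A small helper can make the structure of this response explicit. The sketch below works on the sample dictionary shown above. One assumption is labeled in the code: that a failed call returns a dictionary with error_code and error_msg keys instead of words_result. The helper name extract_words is made up for illustration.

```python
def extract_words(result):
    """Return the recognized text lines from an OCR response dict."""
    # Assumption: a failed call carries 'error_code' / 'error_msg' keys.
    if "error_code" in result:
        raise RuntimeError(
            "OCR failed: {} {}".format(
                result["error_code"], result.get("error_msg", "")
            )
        )
    return [item["words"] for item in result.get("words_result", [])]


sample = {
    "words_result": [
        {"words": "o。o"},
        {"words": "6226-16:59"},
        {"words": "絕望jpg"},
    ],
    "log_id": 1393611954748129280,
    "words_result_num": 3,
}
lines = extract_words(sample)
print("\n".join(lines))
```

Raising on an error response, rather than returning an empty list, keeps a quota or authentication problem from looking like an image with no text.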

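An image may contain several pieces of text, such as a date in a watermark, and some special symbols may be misrecognized. What we want to keep is the Chinese characters and letters. Because there may be several lines of such text, this example picks the line containing the most Chinese characters and names the file after the Chinese characters and letters in that line. The complete example code is as follows: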

# Recognize the text in each image and batch-rename the images

import os
from aip import AipOcr
import re
import datetime

# Create an AipOcr object
config = {
    'appId': 'your appId',
    'apiKey': 'your apiKey',
    'secretKey': 'your secretKey'
}
client = AipOcr(**config)

pic_dir = r"tiezi_pic/"


# Read an image file
def get_file_content(file_path):
    with open(file_path, 'rb') as fp:
        return fp.read()


# Recognize the text in an image and rename the file after it
def img_to_str(image_path):
    image = get_file_content(image_path)
    # Call general text recognition, passing the bytes of the local image
    result = client.basicGeneral(image)
    words_list = [w['words'] for w in result.get('words_result', [])]
    if not words_list:
        return
    file_name = str(get_longest_str(words_list)).replace("/", "")
    print(file_name)
    if not file_name:  # no Chinese characters or letters were recognized
        return
    ext = os.path.splitext(image_path)[1] or '.jpg'  # keep the original extension
    file_dir_name = pic_dir + file_name + ext
    if os.path.exists(file_dir_name):  # handle duplicate file names
        sec = datetime.datetime.now().microsecond  # microsecond part of the current time
        file_dir_name = pic_dir + file_name + str(sec) + ext
    try:
        os.rename(image_path, file_dir_name)
    except Exception:
        print("Rename failed:", image_path, "=>", file_name)


# Pick the line with the most Chinese characters, then keep only
# the Chinese characters and letters in it
def get_longest_str(str_list):
    pat = re.compile(r'[\u4e00-\u9fa5A-Za-z]+')
    longest = max(str_list, key=hanzi_len)  # avoid shadowing the built-in str
    return ''.join(pat.findall(longest))


# Count the Chinese characters in a string
def hanzi_len(item):
    return len(re.findall(r'[\u4e00-\u9fa5]', item))


# List the path of every file in a folder
def query_picture(dir_path):
    pic_path_list = []
    for filename in os.listdir(dir_path):
        pic_path_list.append(os.path.join(dir_path, filename))
    return pic_path_list


if __name__ == '__main__':
    pic_list = query_picture(pic_dir)
    for i in pic_list:
        img_to_str(i)

After the program runs, each image in tiezi_pic/ whose text was recognized is renamed after that text.
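The script resolves name clashes by appending the microsecond part of the current time. A counter-based suffix is another option that gives predictable names and cannot collide. The sketch below shows that variant; unique_path is a hypothetical helper, not part of the script above.

```python
import os


def unique_path(directory, stem, ext=".jpg"):
    """Return a path inside directory that does not exist yet.

    Tries 'stem.ext' first, then 'stem_1.ext', 'stem_2.ext', and so on.
    """
    candidate = os.path.join(directory, stem + ext)
    counter = 1
    while os.path.exists(candidate):
        candidate = os.path.join(
            directory, "{}_{}{}".format(stem, counter, ext)
        )
        counter += 1
    return candidate
```

In the renaming step, the target path could then be computed as unique_path(pic_dir, file_name) before calling os.rename.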
