Extracting text from HTML with Python

Posted on 2021-05-20 by WalkonNet

When working on NLP problems, you sometimes need a large collection of text. The internet is the largest source of text, but unfortunately, extracting text from arbitrary HTML pages is a hard and painful task.
Suppose we need to extract the full text from a variety of web pages and strip all HTML tags. The usual default is the get_text method from the BeautifulSoup package, used here with lxml as the underlying parser. It is a well-tested solution, but it can be very slow when processing many thousands of HTML documents.
By replacing BeautifulSoup with selectolax, you can get a 5-30x speedup almost for free! Here is a simple benchmark that parses 10,000 HTML pages from commoncrawl (https://commoncrawl.org/):

# coding: utf-8

from time import time

import warc
from bs4 import BeautifulSoup
from selectolax.parser import HTMLParser


def get_text_bs(html):
    tree = BeautifulSoup(html, 'lxml')

    body = tree.body
    if body is None:
        return None

    for tag in body.select('script'):
        tag.decompose()
    for tag in body.select('style'):
        tag.decompose()

    text = body.get_text(separator='\n')
    return text


def get_text_selectolax(html):
    tree = HTMLParser(html)

    if tree.body is None:
        return None

    for tag in tree.css('script'):
        tag.decompose()
    for tag in tree.css('style'):
        tag.decompose()

    text = tree.body.text(separator='\n')
    return text


def read_doc(record, parser=get_text_selectolax):
    url = record.url
    text = None

    if url:
        payload = record.payload.read()
        header, html = payload.split(b'\r\n\r\n', maxsplit=1)
        html = html.strip()

        if len(html) > 0:
            text = parser(html)

    return url, text


def process_warc(file_name, parser, limit=10000):
    warc_file = warc.open(file_name, 'rb')
    t0 = time()
    n_documents = 0
    for i, record in enumerate(warc_file):
        url, doc = read_doc(record, parser)

        if not doc or not url:
            continue

        n_documents += 1

        if i > limit:
            break

    warc_file.close()
    print('Parser: %s' % parser.__name__)
    print('Parsing took %s seconds and produced %s documents\n' % (time() - t0, n_documents))

>>> ! wget https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2018-05/segments/1516084886237.6/warc/CC-MAIN-20180116070444-20180116090444-00000.warc.gz
>>> file_name = "CC-MAIN-20180116070444-20180116090444-00000.warc.gz"
>>> process_warc(file_name, get_text_selectolax, 10000)
Parser: get_text_selectolax
Parsing took 16.170367002487183 seconds and produced 3317 documents
>>> process_warc(file_name, get_text_bs, 10000)
Parser: get_text_bs
Parsing took 432.6902508735657 seconds and produced 3283 documents

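Clearly, this is not the best way to benchmark something, but it gives an idea: selectolax can sometimes be close to 30 times faster than BeautifulSoup with lxml (about 27 times in the run above, 16.2 versus 432.7 seconds).
selectolax is best suited to stripping HTML down to plain text. Suppose I have more than 10,000 HTML snippets that need to be indexed into Elasticsearch as plain text. (Elasticsearch has an html_strip character filter, but it is not what I want or need in this context.) It turns out that stripping HTML to plain text at this scale is actually quite inefficient. So what is the most efficient way to do it?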

PyQuery

from pyquery import PyQuery as pq

text = pq(html).text()

selectolax

from selectolax.parser import HTMLParser

text = HTMLParser(html).text()

Regular expressions

import re

clean_regex = re.compile(r'<.*?>')
text = clean_regex.sub('', html)

Results

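I wrote a script to measure the time; it loops over 10,000 files containing HTML snippets. Note! These snippets are not complete <html> documents (with <head>, <body> and so on), just small blobs of HTML. The average size is 10,314 bytes (median 5,138 bytes). The results are as follows: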

pyquery
  SUM:    18.61 seconds
  MEAN:   1.8633 ms
  MEDIAN: 1.0554 ms
selectolax
  SUM:    3.08 seconds
  MEAN:   0.3149 ms
  MEDIAN: 0.1621 ms
regex
  SUM:    1.64 seconds
  MEAN:   0.1613 ms
  MEDIAN: 0.0881 ms

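I have run this many times and the results are very stable. The bottom line: selectolax is about 6 times faster than PyQuery (3.08 versus 18.61 seconds in total).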
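The timing script itself is not shown in the post. A minimal sketch of how such a measurement could be made follows, using only the standard library. The names `time_parser`, `report` and `strip_with_regex`, and the in-memory `documents` list, are assumptions made for illustration; the original run read 10,000 snippet files from disk.

```python
import re
import statistics
import time


def time_parser(func, documents):
    """Call func on every document and return the per-call timings in seconds."""
    timings = []
    for html in documents:
        t0 = time.perf_counter()
        func(html)
        timings.append(time.perf_counter() - t0)
    return timings


def report(name, timings):
    """Format timings as SUM / MEAN / MEDIAN, mirroring the layout of the results."""
    return (
        f"{name}\n"
        f"  SUM:    {sum(timings):.2f} seconds\n"
        f"  MEAN:   {statistics.mean(timings) * 1000:.4f} ms\n"
        f"  MEDIAN: {statistics.median(timings) * 1000:.4f} ms"
    )


# Hypothetical stand-in data, not the snippets used in the post.
documents = ["<p>Hello <b>world</b></p>"] * 1000

clean_regex = re.compile(r"<.*?>")


def strip_with_regex(html):
    """Regex-based stripping, as in the 'Regular expressions' variant."""
    return clean_regex.sub("", html)


timings = time_parser(strip_with_regex, documents)
print(report("regex", timings))
```

Any function that takes an HTML string and returns text can be passed as `func`, so the PyQuery and selectolax variants could be timed with the same helper.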

Regular expressions work? Really?

For the most basic HTML blobs, it probably works fine. However, if the HTML is <p>Foo &amp; Bar</p>, I expect the plain-text conversion to be Foo & Bar, not Foo &amp; Bar.
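The entity problem is easy to reproduce. The sketch below applies the same tag-stripping pattern to a one-line snippet and then, as an extra step that is not part of the original benchmark, decodes entities with `html.unescape` from the standard library. The variable names are chosen for illustration only.

```python
import html
import re

clean_regex = re.compile(r"<.*?>")

snippet = "<p>Foo &amp; Bar</p>"

# Removing the tags leaves the character entity untouched.
stripped = clean_regex.sub("", snippet)
print(stripped)  # Foo &amp; Bar

# Decoding entities afterwards gives the expected plain text.
decoded = html.unescape(stripped)
print(decoded)  # Foo & Bar
```

This repairs entities only; it does not make the regex approach aware of document structure.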
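More importantly, PyQuery and selectolax support something very specific that matters for my use case: before continuing, I need to remove certain tags (along with their content). For example: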

<h4 class="warning">This should get stripped.</h4>
<p>Please keep.</p>
<div style="display: none">This should also get stripped.</div>

A tag-stripping regular expression will never be able to do that.
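To see why, here is a small sketch that runs the tag-stripping pattern from the earlier regex variant over the three-line example. It removes the markup but keeps the text inside the elements that were supposed to disappear.

```python
import re

clean_regex = re.compile(r"<.*?>")

snippet = (
    '<h4 class="warning">This should get stripped.</h4>\n'
    '<p>Please keep.</p>\n'
    '<div style="display: none">This should also get stripped.</div>'
)

# Only the tags are removed; the unwanted text survives.
text = clean_regex.sub("", snippet)
print(text)
```

All three sentences remain in `text`, because the pattern has no notion of which element a piece of text belongs to.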

Version 2.0

My requirements may well change, but basically I want to remove certain tags, for example <div class="warning">, <div class="hidden"> and <div style="display: none">. So let's implement that:

PyQuery

import re

from pyquery import PyQuery as pq

# Matches an inline style that hides the element, e.g. "display: none"
_display_none_regex = re.compile(r'display:\s*none')

doc = pq(html)
doc.remove('div.warning, div.hidden')
for div in doc('div[style]').items():
    style_value = div.attr('style')
    if _display_none_regex.search(style_value):
        div.remove()
text = doc.text()

selectolax

import re

from selectolax.parser import HTMLParser

# Matches an inline style that hides the element, e.g. "display: none"
_display_none_regex = re.compile(r'display:\s*none')

tree = HTMLParser(html)
for tag in tree.css('div.warning, div.hidden'):
    tag.decompose()
for tag in tree.css('div[style]'):
    style_value = tag.attributes['style']
    if style_value and _display_none_regex.search(style_value):
        tag.decompose()
text = tree.body.text()

This actually works. When I now run the same benchmark over the 10,000 snippets, the new results are:

pyquery
  SUM:    21.70 seconds
  MEAN:   2.1701 ms
  MEDIAN: 1.3989 ms
selectolax
  SUM:    3.59 seconds
  MEAN:   0.3589 ms
  MEDIAN: 0.2184 ms
regex
  Skip

Again, selectolax beats PyQuery by a factor of about 6.
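Both version 2.0 snippets rely on the same `_display_none_regex` to decide whether a style attribute hides an element. The sketch below checks what that pattern does and does not match. The sample style strings are made up for illustration.

```python
import re

_display_none_regex = re.compile(r'display:\s*none')

# Hypothetical style attribute values.
samples = [
    'display: none',
    'display:none',
    'color: red; display:   none;',
    'DISPLAY: NONE',
    'display: block',
]

results = {}
for style_value in samples:
    # search() finds the pattern anywhere in the attribute value.
    results[style_value] = bool(_display_none_regex.search(style_value))
    print(f'{style_value!r}: {results[style_value]}')
```

The pattern tolerates any amount of whitespace after the colon, but it is case-sensitive, so an uppercase declaration is not detected. If that matters for your data, compiling the pattern with `re.IGNORECASE` would be one option.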

Conclusion

Regular expressions are fast but limited in what they can do. The efficiency of selectolax is impressive.
