Implementation steps for Chinese word segmentation and word-frequency counting in Python

Posted on 2022-06-11 by WalkonNet

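Preface

This article records part of a text-processing workflow in Python, along with the code.

I. Importing the text

I prepared a text file named abstract.txt.

Next, I downloaded stopword.txt from the internet (the stop-word list used during jieba segmentation).

Some of its entries are ones I added myself because I found them unhelpful.

I also built my own dictionary, extraDict.txt.

With the preparation done, let's see how to use them!

II. Steps

1. Import the libraries

The code is as follows: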

import jieba
from jieba.analyse import extract_tags
from sklearn.feature_extraction.text import TfidfVectorizer

2. Load the data

The code is as follows:

jieba.load_userdict('extraDict.txt')  # load the custom dictionary
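For readers who have not built a user dictionary before: according to the jieba README, each line of the file holds one word, optionally followed by a frequency and a part-of-speech tag, separated by spaces. The sketch below writes a tiny file in that layout. The entries and the helper name `write_userdict` are made up for illustration; they are not the contents of the author's extraDict.txt.

```python
import os
import tempfile

def write_userdict(path, entries):
    """Write (word, freq, tag) tuples in jieba's user-dictionary layout.

    freq and tag may be None; both are optional in the file format.
    """
    with open(path, "w", encoding="utf-8") as f:
        for word, freq, tag in entries:
            parts = [word]
            if freq is not None:
                parts.append(str(freq))
            if tag is not None:
                parts.append(tag)
            f.write(" ".join(parts) + "\n")

# Hypothetical entries, for illustration only.
example_entries = [
    ("自然語言處理", 10, "n"),   # word, frequency, part-of-speech tag
    ("詞頻統計", None, None),    # word only
]

demo_path = os.path.join(tempfile.mkdtemp(), "extraDict_demo.txt")
write_userdict(demo_path, example_entries)

# Loading it would then be: jieba.load_userdict(demo_path)
```

Adding domain terms this way should keep jieba from splitting them into smaller pieces, which matters when the goal is to count them as single words.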

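3. Load the stop-word list

def stopwordlist():
    # The file name must match your stop-word file (called stopword.txt earlier in this article)
    with open('chinesestopwords.txt', encoding='UTF-8') as f:
        stopwords = [line.strip() for line in f]
    # --- Extra stop words; adjust to your own data ---
    # Here the numbers 10 through 28 are added as strings
    for i in range(19):
        stopwords.append(str(10 + i))
    # ----------------------

    return stopwords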
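Because every segmented word is later checked against the stop-word list, a set may be a better container than a list: membership tests on a set take roughly constant time, while a list is scanned from the start. A minimal sketch, assuming one stop word per line; `load_stopwords` and its `extra` parameter are hypothetical names, not part of the original code.

```python
def load_stopwords(path, extra=()):
    """Read one stop word per line into a set, plus any extra entries."""
    with open(path, encoding="utf-8") as f:
        # Skip blank lines so the empty string never becomes a stop word.
        words = {line.strip() for line in f if line.strip()}
    words.update(extra)
    return words

# Same supplement as in the function above: the numbers 10..28 as strings.
extra_numbers = [str(n) for n in range(10, 29)]

# Usage (the file name is whatever your stop-word file is called):
# stopwords = load_stopwords("chinesestopwords.txt", extra_numbers)
```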

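4. Segment and remove stop words (at this point you can count word frequencies directly with Python's built-in functions)

def seg_word(line):
    # seg = jieba.cut_for_search(line.strip())
    seg = jieba.cut(line.strip())
    temp = ""
    counts = {}
    wordstop = stopwordlist()  # reloaded on every call; cache it if the input is large
    for word in seg:
        if word not in wordstop:
            if word != ' ':
                temp += word
                temp += '\n'
                counts[word] = counts.get(word, 0) + 1  # count occurrences of each word
    return temp  # return the segmented words, one per line
    # To return the 20 most frequent words and their counts instead:
    # return str(sorted(counts.items(), key=lambda x: x[1], reverse=True)[:20])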
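The frequency count hinted at in the heading can also be written with `collections.Counter` from the standard library. The sketch below is one possible shape, not the author's method: `count_words` is a hypothetical helper, and the token list is made up so that the example runs without jieba installed. In practice the tokens would come from something like `jieba.cut(line)`.

```python
from collections import Counter

def count_words(tokens, stopwords, top_n=20):
    """Return the top_n most frequent tokens that are not stop words."""
    counts = Counter(
        tok for tok in tokens
        if tok.strip() and tok not in stopwords  # drop whitespace and stop words
    )
    return counts.most_common(top_n)

# Made-up tokens standing in for the output of a segmenter.
demo_tokens = ["我", "喜歡", "分詞", "，", "我", "喜歡", " ", "詞頻"]
demo_stopwords = {"，", "我"}
demo_top = count_words(demo_tokens, demo_stopwords, top_n=2)
```

Note that counting per line, as `seg_word` does, yields per-line frequencies; for whole-document frequencies, feed all tokens of the file into a single `Counter`.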

5. Write the segmented, stop-word-filtered words to a txt file

def output(inputfilename, outputfilename):
    # The with statement closes both files even if an error occurs
    with open(inputfilename, encoding='UTF-8', mode='r') as inputfile, \
         open(outputfilename, encoding='UTF-8', mode='w') as outputfile:
        for line in inputfile:
            line_seg = seg_word(line)
            outputfile.write(line_seg)
    return outputfilename  # return the path; the file object is already closed

6. Call the functions

if __name__ == '__main__':
    print("__name__", __name__)
    inputfilename = 'abstract.txt'
    outputfilename = 'a1.txt'
    output(inputfilename, outputfilename)

7. Result

Appendix: given a passage, count how many times each word occurs

First, the idea.

Take the following passage as an example:

Love is more than a word
it says so much.
When I see these four letters,
I almost feel your touch.
This is only happened since
I fell in love with you.
Why this word does this,
I haven’t got a clue.

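To count how many times each word occurs, the idea is simple. Split the string into words and define an empty dictionary, count_dict. For each word, check whether it is already a key in count_dict. If it is not, add it as a key with the value 1; if it is, add 1 to its value.

Here is the code: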

# Define the string (triple quotes are handy for long, multi-line strings).
# Keep comments outside the quotes: text after the opening quotes becomes part of the string.
sentences = """
Love is more than a word
it says so much.
When I see these four letters,
I almost feel your touch.
This is only happened since
I fell in love with you.
Why this word does this,
I haven't got a clue.
"""
# Implementation
# Remove the commas; to strip many kinds of punctuation, use a loop instead
sentences = sentences.replace(',', '')
sentences = sentences.replace('.', '')  # remove the periods
sentences = sentences.split()           # split into a list of individual words
# print(sentences)
count_dict = {}
for sentence in sentences:
    if sentence not in count_dict:     # first occurrence of this word
        count_dict[sentence] = 1
    else:                              # word already counted: increment
        count_dict[sentence] += 1
for key, value in count_dict.items():
    print(f"{key} appears {value} time(s)")

The output lists each word followed by its count, one word per line.
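The same idea can be expressed more compactly with `collections.Counter`. The sketch below makes two choices that differ from the code above and are my assumptions rather than the author's: it lowercases every word, so "Love" and "love" are counted together, and it strips all leading and trailing punctuation instead of only commas and periods. The name `word_counts` is hypothetical.

```python
import string
from collections import Counter

def word_counts(text):
    """Count words case-insensitively, ignoring surrounding punctuation."""
    # strip() only removes punctuation at the ends, so "haven't" stays intact.
    words = (w.strip(string.punctuation).lower() for w in text.split())
    return Counter(w for w in words if w)

poem = """Love is more than a word
it says so much.
I fell in love with you.
"""
poem_counts = word_counts(poem)

# Print the words from most to least frequent.
for word, count in poem_counts.most_common():
    print(f"{word} appears {count} time(s)")
```

Whether to fold case depends on the task: for a rough frequency table it is usually what you want, but it would merge proper nouns with ordinary words.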

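Summary

That is all for today. This article gave only a brief introduction to Chinese word segmentation and word-frequency counting in Python.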
