Extractive Summarization of Chinese Articles with bert-extractive-summarizer

Overview

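bert-extractive-summarizer is a model that performs extractive summarization by combining BERT with clustering. For the underlying method and implementation details, see the author's GitHub repository, which links to the paper. Because the examples there are in English, applying it to Chinese text needs a few changes, mainly loading Chinese models.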

GitHub: https://github.com/dmmiller612/bert-extractive-summarizer
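To make the "BERT plus clustering" idea concrete, here is a minimal sketch of the selection step only: cluster the sentence vectors, then keep the sentence closest to each cluster centre. It is an illustration under stated assumptions, not the library's code. The 2-D vectors are made-up stand-ins for real BERT sentence embeddings, and `kmeans` and `pick_summary` are hypothetical helper names.

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means on lists of floats; returns the final centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        # Assign every point to its nearest centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            groups[nearest].append(p)
        # Move each centroid to the mean of its group (keep it if the group is empty).
        for c, group in enumerate(groups):
            if group:
                centroids[c] = [sum(dim) / len(group) for dim in zip(*group)]
    return centroids

def pick_summary(sentences, embeddings, k):
    """For each centroid, pick the sentence whose vector lies closest to it."""
    centroids = kmeans(embeddings, k)
    chosen = set()
    for centroid in centroids:
        idx = min(range(len(embeddings)),
                  key=lambda i: math.dist(embeddings[i], centroid))
        chosen.add(idx)
    # Return the picked sentences in their original document order.
    return [sentences[i] for i in sorted(chosen)]

# Toy 2-D vectors standing in for BERT sentence embeddings (hypothetical values).
sentences = ["A1", "A2", "A3", "B1", "B2", "B3"]
embeddings = [[0.0, 0.0], [0.2, 0.0], [0.1, 0.0],
              [5.0, 5.0], [5.2, 5.0], [5.1, 5.0]]
summary = pick_summary(sentences, embeddings, k=2)
print(summary)
```

With two well-separated groups of vectors, one sentence from each group ends up in the summary, which is the intuition behind using clustering to cover the main topics of an article.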

Install the required packages

    # Install the required packages
    pip install bert-extractive-summarizer
    pip install spacy==2.3.1
    pip install transformers
    pip install neuralcoref

    # Download the Chinese spaCy model
    python -m spacy download zh_core_web_lg

Python

Load the model

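The pretrained model can be given either as the name of any model available on https://huggingface.co or as the path to a model you trained yourself. This example uses bert-base-chinese; swap in whichever model you normally use. Once the model is loaded, you can run summarization directly.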

    # Load the Chinese spaCy model
    import spacy
    import zh_core_web_lg
    import neuralcoref

    nlp = zh_core_web_lg.load()
    neuralcoref.add_to_pipe(nlp)

    # Load the Chinese model for the summarizer
    from summarizer import Summarizer
    from summarizer.sentence_handler import SentenceHandler
    from spacy.lang.zh import Chinese
    from transformers import AutoConfig, AutoModel, AutoTokenizer

    # Load model, model config and tokenizer via Transformers
    modelName = "bert-base-chinese"  # replace with the model you normally use
    custom_config = AutoConfig.from_pretrained(modelName)
    custom_config.output_hidden_states = True
    custom_tokenizer = AutoTokenizer.from_pretrained(modelName)
    custom_model = AutoModel.from_pretrained(modelName, config=custom_config)

    model = Summarizer(
        custom_model=custom_model,
        custom_tokenizer=custom_tokenizer,
        sentence_handler=SentenceHandler(language=Chinese),
    )

Use the model


    body = "Text of the article to summarize"
    result = model(body)
    full = ''.join(result)
    print(full)  # the sentences selected for the summary


    model(
        body: str           # The text that you want to summarize
        ratio: float        # The ratio of sentences to keep in the final summary
        min_length: int     # Removes sentences shorter than this length (the original note says 40 characters)
        max_length: int     # Removes sentences longer than this length
        num_sentences: int  # Number of sentences to use. Overrides ratio if supplied.
    )
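The following sketch illustrates how the parameters above could interact. It is one reading of their descriptions, not the library's implementation. `split_sentences`, `filter_by_length` and `summary_size` are hypothetical helpers, and the punctuation-based splitter is a simplification of real sentence segmentation.

```python
import re

def split_sentences(text):
    """Split on Chinese/ASCII sentence-ending punctuation, keeping the punctuation."""
    parts = re.findall(r"[^。！？!?]+[。！？!?]?", text)
    return [p.strip() for p in parts if p.strip()]

def filter_by_length(sentences, min_length, max_length):
    """Keep sentences whose character count lies within [min_length, max_length]."""
    return [s for s in sentences if min_length <= len(s) <= max_length]

def summary_size(n_sentences, ratio, num_sentences=None):
    """Number of sentences to keep: num_sentences wins over ratio when supplied."""
    if num_sentences is not None:
        return min(num_sentences, n_sentences)
    return max(int(n_sentences * ratio), 1)

text = "今天天氣很好。我們去公園散步吧！你覺得呢？"
sents = split_sentences(text)

# With a 40-character minimum, every one of these short sentences is dropped.
kept_strict = filter_by_length(sents, min_length=40, max_length=600)

# A smaller minimum keeps the two longer sentences.
kept_loose = filter_by_length(sents, min_length=6, max_length=600)

print(sents, kept_strict, kept_loose)
print(summary_size(10, ratio=0.2), summary_size(10, ratio=0.2, num_sentences=3))
```

If the length filter counts characters as described, a threshold tuned for English can discard many Chinese sentences, which tend to use far fewer characters. It may be worth trying a lower `min_length` when a Chinese summary comes back empty or very short.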






