
English Text Tokenization (NLTK)


Contents
1. Installing NLTK
2. Word and Sentence Tokenization with NLTK
3. Removing Punctuation After Tokenization
4. Removing Stop Words After Tokenization
5. Part-of-Speech Tagging After Tokenization
6. Stemming After Tokenization
7. Lemmatization After Tokenization

1. Installing NLTK

First, open a terminal (Anaconda Prompt) and install nltk:

pip install nltk

Then open a Python console (or Anaconda's Spyder) and run the following to download the NLTK data packages:

import nltk
nltk.download()

Note: for detailed steps or alternative installation methods, see "Anaconda3安装jieba库和NLTK库" (Installing the jieba and NLTK Libraries with Anaconda3).
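
If the interactive downloader window is inconvenient (for example, on a remote server), you can instead fetch only the data packages this tutorial uses. A minimal sketch; the resource names below are the standard NLTK data identifiers, though exact names can vary slightly between NLTK versions:

import nltk

# Fetch only the data packages used in this tutorial
for resource in ['punkt',                       # models for word_tokenize / sent_tokenize
                 'stopwords',                   # English stop-word list
                 'averaged_perceptron_tagger',  # tagger behind pos_tag
                 'wordnet']:                    # lexicon behind WordNetLemmatizer
    nltk.download(resource)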

2. Word and Sentence Tokenization with NLTK

Since English sentences are essentially made up of words, spaces, and punctuation, tokenization only needs to split the text on spaces and punctuation marks, which makes it comparatively simple.

(1) Word tokenization:

from nltk import word_tokenize  # split the text into word tokens

paragraph = "The first time I heard that song was in Hawaii on radio. I was just a kid, and loved it very much! What a fantastic song!"
words = word_tokenize(paragraph)
print(words)

Output:

['The', 'first', 'time', 'I', 'heard', 'that', 'song', 'was', 'in', 'Hawaii', 'on', 'radio', '.', 'I', 'was', 'just', 'a', 'kid', ',', 'and', 'loved', 'it', 'very', 'much', '!', 'What', 'a', 'fantastic', 'song', '!']

(2) Sentence tokenization:

from nltk import sent_tokenize  # split the text into sentences

sentences = "The first time I heard that song was in Hawaii on radio. I was just a kid, and loved it very much! What a fantastic song!"
sentence = sent_tokenize(sentences)
print(sentence)

Output:

['The first time I heard that song was in Hawaii on radio.', 'I was just a kid, and loved it very much!', 'What a fantastic song!']

Note: both word and sentence tokenization in NLTK return ordinary Python lists, as shown below.
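
Because the results are plain lists, the usual list operations apply directly; a small illustrative sketch reusing the words and sentence variables from above:

print(len(words))          # number of word tokens (30 for the paragraph above)
print(words[0])            # first token: 'The'
print(' '.join(sentence))  # rejoin the sentence list into a single string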

3. Removing Punctuation After Tokenization

from nltk import word_tokenize

paragraph = "The first time I heard that song was in Hawaii on radio. I was just a kid, and loved it very much! What a fantastic song!".lower()
cutwords1 = word_tokenize(paragraph)  # tokenize
print('[NLTK tokenization result:]')
print(cutwords1)

interpunctuations = [',', '.', ':', ';', '?', '(', ')', '[', ']', '&', '!', '*', '@', '#', '$', '%']  # punctuation list
cutwords2 = [word for word in cutwords1 if word not in interpunctuations]  # remove punctuation
print('\n[Result after removing punctuation:]')
print(cutwords2)

Output:

[NLTK tokenization result:]
['the', 'first', 'time', 'i', 'heard', 'that', 'song', 'was', 'in', 'hawaii', 'on', 'radio', '.', 'i', 'was', 'just', 'a', 'kid', ',', 'and', 'loved', 'it', 'very', 'much', '!', 'what', 'a', 'fantastic', 'song', '!']

[Result after removing punctuation:]
['the', 'first', 'time', 'i', 'heard', 'that', 'song', 'was', 'in', 'hawaii', 'on', 'radio', 'i', 'was', 'just', 'a', 'kid', 'and', 'loved', 'it', 'very', 'much', 'what', 'a', 'fantastic', 'song']

4. Removing Stop Words After Tokenization

from nltk import word_tokenize
from nltk.corpus import stopwords

paragraph = "The first time I heard that song was in Hawaii on radio. I was just a kid, and loved it very much! What a fantastic song!".lower()
cutwords1 = word_tokenize(paragraph)  # tokenize
print('[NLTK tokenization result:]')
print(cutwords1)

interpunctuations = [',', '.', ':', ';', '?', '(', ')', '[', ']', '&', '!', '*', '@', '#', '$', '%']  # punctuation list
cutwords2 = [word for word in cutwords1 if word not in interpunctuations]  # remove punctuation
print('\n[Result after removing punctuation:]')
print(cutwords2)

stops = set(stopwords.words("english"))
cutwords3 = [word for word in cutwords2 if word not in stops]  # remove stop words
print('\n[Result after removing stop words:]')
print(cutwords3)

Output:

[NLTK tokenization result:]
['the', 'first', 'time', 'i', 'heard', 'that', 'song', 'was', 'in', 'hawaii', 'on', 'radio', '.', 'i', 'was', 'just', 'a', 'kid', ',', 'and', 'loved', 'it', 'very', 'much', '!', 'what', 'a', 'fantastic', 'song', '!']

[Result after removing punctuation:]
['the', 'first', 'time', 'i', 'heard', 'that', 'song', 'was', 'in', 'hawaii', 'on', 'radio', 'i', 'was', 'just', 'a', 'kid', 'and', 'loved', 'it', 'very', 'much', 'what', 'a', 'fantastic', 'song']

[Result after removing stop words:]
['first', 'time', 'heard', 'song', 'hawaii', 'radio', 'kid', 'loved', 'much', 'fantastic', 'song']

5. Part-of-Speech Tagging After Tokenization

from nltk import word_tokenize, pos_tag
from nltk.corpus import stopwords

paragraph = "The first time I heard that song was in Hawaii on radio. I was just a kid, and loved it very much! What a fantastic song!".lower()
cutwords1 = word_tokenize(paragraph)  # tokenize
print('[NLTK tokenization result:]')
print(cutwords1)

interpunctuations = [',', '.', ':', ';', '?', '(', ')', '[', ']', '&', '!', '*', '@', '#', '$', '%']  # punctuation list
cutwords2 = [word for word in cutwords1 if word not in interpunctuations]  # remove punctuation
print('\n[Result after removing punctuation:]')
print(cutwords2)

stops = set(stopwords.words("english"))
cutwords3 = [word for word in cutwords2 if word not in stops]  # remove stop words
print('\n[Result after removing stop words:]')
print(cutwords3)

print('\n[POS tagging after stop-word removal:]')
print(pos_tag(cutwords3))

Output:

[NLTK tokenization result:]
['the', 'first', 'time', 'i', 'heard', 'that', 'song', 'was', 'in', 'hawaii', 'on', 'radio', '.', 'i', 'was', 'just', 'a', 'kid', ',', 'and', 'loved', 'it', 'very', 'much', '!', 'what', 'a', 'fantastic', 'song', '!']

[Result after removing punctuation:]
['the', 'first', 'time', 'i', 'heard', 'that', 'song', 'was', 'in', 'hawaii', 'on', 'radio', 'i', 'was', 'just', 'a', 'kid', 'and', 'loved', 'it', 'very', 'much', 'what', 'a', 'fantastic', 'song']

[Result after removing stop words:]
['first', 'time', 'heard', 'song', 'hawaii', 'radio', 'kid', 'loved', 'much', 'fantastic', 'song']

[POS tagging after stop-word removal:]
[('first', 'JJ'), ('time', 'NN'), ('heard', 'NN'), ('song', 'NN'), ('hawaii', 'NN'), ('radio', 'NN'), ('kid', 'NN'), ('loved', 'VBD'), ('much', 'JJ'), ('fantastic', 'NN'), ('song', 'NN')]

Explanation: the second element of each tuple is that word's part-of-speech tag. To see what each tag means, run nltk.help.upenn_tagset() or consult the NLTK POS tagging documentation.
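
For example, you can look up an individual tag; a quick sketch (this requires NLTK's 'tagsets' data package, a one-time download):

import nltk

nltk.download('tagsets')        # tag documentation data, needed once
nltk.help.upenn_tagset('VBD')   # explains: verb, past tense
nltk.help.upenn_tagset('NN.*')  # a regular expression lists all matching noun tags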

6. Stemming After Tokenization

Stemming removes affixes from a word and returns the stem (root). Search engines apply this technique when indexing pages, so that searches using different forms of the same word all return the same pages related to that stem. There are many stemming algorithms:

# Porter stemming algorithm
from nltk.stem.porter import PorterStemmer
print(PorterStemmer().stem('leaves'))

# Lancaster stemming algorithm
from nltk.stem.lancaster import LancasterStemmer
print(LancasterStemmer().stem('leaves'))

# Snowball stemming algorithm
from nltk.stem import SnowballStemmer
print(SnowballStemmer('english').stem('leaves'))

Output:

leav
leav
leav
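
All three stemmers happen to agree on 'leaves', but they can disagree elsewhere; Lancaster in particular is more aggressive than Porter or Snowball. A small comparison sketch (the word list is just an illustrative choice):

from nltk.stem.porter import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem import SnowballStemmer

stemmers = {'Porter': PorterStemmer(),
            'Lancaster': LancasterStemmer(),
            'Snowball': SnowballStemmer('english')}

for word in ['running', 'maximum', 'presumably', 'crying']:
    # print every stemmer's output for the same input word
    print(word, {name: stemmer.stem(word) for name, stemmer in stemmers.items()})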

The most commonly used algorithm is the Porter stemmer; NLTK's PorterStemmer class implements it:

from nltk import word_tokenize, pos_tag
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

paragraph = "The first time I heard that song was in Hawaii on radio. I was just a kid, and loved it very much! What a fantastic song!".lower()
cutwords1 = word_tokenize(paragraph)  # tokenize
print('[NLTK tokenization result:]')
print(cutwords1)

interpunctuations = [',', '.', ':', ';', '?', '(', ')', '[', ']', '&', '!', '*', '@', '#', '$', '%']  # punctuation list
cutwords2 = [word for word in cutwords1 if word not in interpunctuations]  # remove punctuation
print('\n[Result after removing punctuation:]')
print(cutwords2)

stops = set(stopwords.words("english"))
cutwords3 = [word for word in cutwords2 if word not in stops]  # keep only words not in the stop-word list
print('\n[Result after removing stop words:]')
print(cutwords3)

print('\n[POS tagging after stop-word removal:]')
print(pos_tag(cutwords3))  # POS tagging

print('\n[Stemming result:]')
cutwords4 = []
for cutword in cutwords3:
    cutwords4.append(PorterStemmer().stem(cutword))  # stem each word
print(cutwords4)

Output:

[NLTK tokenization result:]
['the', 'first', 'time', 'i', 'heard', 'that', 'song', 'was', 'in', 'hawaii', 'on', 'radio', '.', 'i', 'was', 'just', 'a', 'kid', ',', 'and', 'loved', 'it', 'very', 'much', '!', 'what', 'a', 'fantastic', 'song', '!']

[Result after removing punctuation:]
['the', 'first', 'time', 'i', 'heard', 'that', 'song', 'was', 'in', 'hawaii', 'on', 'radio', 'i', 'was', 'just', 'a', 'kid', 'and', 'loved', 'it', 'very', 'much', 'what', 'a', 'fantastic', 'song']

[Result after removing stop words:]
['first', 'time', 'heard', 'song', 'hawaii', 'radio', 'kid', 'loved', 'much', 'fantastic', 'song']

[POS tagging after stop-word removal:]
[('first', 'JJ'), ('time', 'NN'), ('heard', 'NN'), ('song', 'NN'), ('hawaii', 'NN'), ('radio', 'NN'), ('kid', 'NN'), ('loved', 'VBD'), ('much', 'JJ'), ('fantastic', 'NN'), ('song', 'NN')]

[Stemming result:]
['first', 'time', 'heard', 'song', 'hawaii', 'radio', 'kid', 'love', 'much', 'fantast', 'song']

7. Lemmatization After Tokenization

Lemmatization is similar to stemming, but whereas stemming often produces non-words (such as 'fantast' above), lemmatization always returns an actual word.

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('playing'))

Output:

playing

By default, NLTK lemmatizes words as nouns, which is why 'playing' is unchanged above. To lemmatize as a verb, specify the part of speech like this:

from nltk.stem import WordNetLemmatizer  # lemmatization

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('playing', pos="v"))  # lemmatize as a verb

Output:

play

Note: different words need different part-of-speech information to recover the correct base form, so it is recommended to specify the POS when lemmatizing; one way to derive it automatically from the tagger output is sketched after the final example below.

from nltk import word_tokenize, pos_tag    # tokenization, POS tagging
from nltk.corpus import stopwords          # stop words
from nltk.stem import PorterStemmer        # stemming
from nltk.stem import WordNetLemmatizer    # lemmatization

paragraph = "I went to the gymnasium yesterday , when I had finished my homework !".lower()
cutwords1 = word_tokenize(paragraph)  # tokenize
print('[NLTK tokenization result:]')
print(cutwords1)

interpunctuations = [',', ' ', '.', ':', ';', '?', '(', ')', '[', ']', '&', '!', '*', '@', '#', '$', '%']  # punctuation list
cutwords2 = [word for word in cutwords1 if word not in interpunctuations]  # remove punctuation
print('\n[Result after removing punctuation:]')
print(cutwords2)

stops = set(stopwords.words("english"))
cutwords3 = [word for word in cutwords2 if word not in stops]  # keep only words not in the stop-word list
print('\n[Result after removing stop words:]')
print(cutwords3)

print('\n[POS tagging after stop-word removal:]')
print(pos_tag(cutwords3))  # POS tagging

print('\n[Stemming result:]')
cutwords4 = []
for cutword1 in cutwords3:
    cutwords4.append(PorterStemmer().stem(cutword1))  # stem each word
print(cutwords4)

print('\n[Lemmatization result:]')
cutwords5 = []
for cutword2 in cutwords4:
    cutwords5.append(WordNetLemmatizer().lemmatize(cutword2, pos='v'))  # lemmatize as a verb
print(cutwords5)

Output:

[NLTK tokenization result:]
['i', 'went', 'to', 'the', 'gymnasium', 'yesterday', ',', 'when', 'i', 'had', 'finished', 'my', 'homework', '!']

[Result after removing punctuation:]
['i', 'went', 'to', 'the', 'gymnasium', 'yesterday', 'when', 'i', 'had', 'finished', 'my', 'homework']

[Result after removing stop words:]
['went', 'gymnasium', 'yesterday', 'finished', 'homework']

[POS tagging after stop-word removal:]
[('went', 'VBD'), ('gymnasium', 'NN'), ('yesterday', 'NN'), ('finished', 'VBD'), ('homework', 'NN')]

[Stemming result:]
['went', 'gymnasium', 'yesterday', 'finish', 'homework']

[Lemmatization result:]
['go', 'gymnasium', 'yesterday', 'finish', 'homework']
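
Hard-coding pos='v' happens to suit this sentence but would mishandle nouns and adjectives. A common pattern is to derive the WordNet POS from the Penn Treebank tags that pos_tag produces. A minimal sketch; the penn_to_wordnet helper is our own illustrative function, not part of NLTK:

from nltk import word_tokenize, pos_tag
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

def penn_to_wordnet(tag):
    # Map a Penn Treebank tag to the corresponding WordNet POS,
    # defaulting to noun when there is no clear match.
    if tag.startswith('J'):
        return wordnet.ADJ
    if tag.startswith('V'):
        return wordnet.VERB
    if tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

lemmatizer = WordNetLemmatizer()
tokens = word_tokenize("i went to the gymnasium yesterday when i had finished my homework")
lemmas = [lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))
          for word, tag in pos_tag(tokens)]
print(lemmas)  # each word is lemmatized with its own inferred part of speech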

That concludes the basics of English text preprocessing; a follow-up article will cover English keyword extraction and analysis. Thanks for reading!



      CopyRight 2018-2019 实验室设备网 版权所有