python

2024-07-10 17:41 | Source: compiled from the web

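During text preprocessing, regular expressions can be used to strip punctuation and other symbols from the text. The main tool is re.sub: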

re.sub(pattern, repl, string, count=0, flags=0)
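As a minimal sketch of how the count and flags parameters behave (the sample string and variable names are made up for illustration):

```python
import re

text = "Hello, World! Hello, Regex!"

# count=0 (the default) replaces every match.
all_replaced = re.sub(r"[,!]", " ", text)

# count=1 replaces only the first match.
first_only = re.sub(r"[,!]", " ", text, count=1)

# flags changes how the pattern is interpreted, e.g. case-insensitive matching.
no_hello = re.sub(r"hello", "Bye", text, flags=re.IGNORECASE)

print(all_replaced)  # Hello  World  Hello  Regex 
print(first_only)    # Hello  World! Hello, Regex!
print(no_hello)      # Bye, World! Bye, Regex!
```

Passing count and flags by keyword keeps the call readable and avoids mixing up their positions.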

The code for removing ordinary symbols is as follows:

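import re
# A-Za-z0-9_ also strips letters, digits and underscores; "-" is escaped so "+-=" is not read as a range.
r = r"[A-Za-z0-9_.!+\-=——,$%^，。？、~@#￥%……&*《》「」{}【】()/]"
sentence = re.sub(r, ' ', sentence)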
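One thing worth checking in a long character class is the hyphen: placed between two characters it defines a range rather than a literal "-". The sketch below, using a made-up probe string, shows what the sequence "+-=" matches with and without escaping:

```python
import re

probe = "1:2;3<4"

# "+-=" is a range from "+" (U+002B) to "=" (U+003D), which covers
# , - . / 0-9 : ; < as well as the two endpoints.
as_range = re.findall(r"[+-=]", probe)

# Escaping the hyphen leaves exactly three literals: + - =
as_literals = re.findall(r"[+\-=]", probe)

print(as_range)     # ['1', ':', '2', ';', '3', '<', '4']
print(as_literals)  # []
```

Escaping the hyphen, or moving it to the very end of the class, makes the intent explicit.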

To also remove the characters [ ] ' " and \, they have to be escaped inside the pattern:

# A raw string passes the backslashes to the regex engine unchanged.
r = r'[\\\[\]\'"]'
sentence = re.sub(r, ' ', sentence)
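If hand-written escapes feel error-prone, one alternative (not used in the original text) is to let re.escape build the class. A minimal sketch, with an invented sample sentence:

```python
import re

# The five characters to remove: [ ] ' " \
specials = "[]'\"\\"

# re.escape adds backslashes wherever the regex syntax needs them.
pattern = "[" + re.escape(specials) + "]"

sentence = 'path\\to ["file"] it\'s'
cleaned = re.sub(pattern, " ", sentence)

print(cleaned)  # path to   file   it s
```

This keeps the list of characters readable as ordinary text and leaves the escaping to the library.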

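Note: [A-Za-z0-9_] matches letters, digits, and the underscore. For ASCII text it is equivalent to \w; in Python 3, \w on a str pattern also matches Unicode word characters such as Chinese unless re.ASCII is set. It is not equivalent to [\w+]: inside a character class, + is a literal plus sign. Outside a class, + matches the preceding subexpression one or more times. \ is the escape character.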
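The differences between these forms are easy to check directly. A small sketch, with an arbitrary sample string:

```python
import re

sample = "abc_123+中文"

# Explicit ASCII class: letters, digits, underscore.
ascii_class = re.findall(r"[A-Za-z0-9_]", sample)

# \w on a str pattern is Unicode-aware by default, so it also matches 中 and 文.
unicode_w = re.findall(r"\w", sample)

# With re.ASCII, \w behaves like [A-Za-z0-9_].
ascii_w = re.findall(r"\w", sample, flags=re.ASCII)

# Inside a class, "+" is a literal plus sign, not a quantifier.
class_with_plus = re.findall(r"[\w+]", sample, flags=re.ASCII)

# Outside a class, "+" means "one or more", so whole runs are returned.
runs = re.findall(r"\w+", sample, flags=re.ASCII)

print(ascii_class)      # ['a', 'b', 'c', '_', '1', '2', '3']
print(unicode_w)        # ['a', 'b', 'c', '_', '1', '2', '3', '中', '文']
print(class_with_plus)  # ['a', 'b', 'c', '_', '1', '2', '3', '+']
print(runs)             # ['abc_123']
```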

Combining the two patterns:

r = r'[A-Za-z0-9_.!+\-=——,$%^，。？、~@#￥%……&*《》「」{}【】()/\\\[\]\'"]'
sentence = "9_.!+-=——,$%^，。？、~@#￥%……&*《》「」{}【】()/\\[]'\""
sentence = re.sub(r, ' ', sentence)

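Result: every character in the test sentence is matched and replaced with a space, so none of the original symbols remain.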
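Because each removed symbol becomes a space, the cleaned text can contain long runs of whitespace. A minimal sketch of wrapping the substitution in a helper that also collapses those runs; strip_symbols and SYMBOLS are hypothetical names, and the symbol set is a shortened, illustrative version of the one above:

```python
import re

# Hypothetical, shortened symbol set; extend it to suit your data.
SYMBOLS = re.compile(r"""[.!+\-=,$%^，。？、~@#￥&*《》「」{}【】()/\\\[\]'"]""")

def strip_symbols(sentence: str) -> str:
    """Replace listed symbols with spaces, then collapse repeated whitespace."""
    spaced = SYMBOLS.sub(" ", sentence)
    # split() with no arguments drops empty pieces, so runs of spaces collapse.
    return " ".join(spaced.split())

cleaned = strip_symbols("你好，世界。(hello) [world]...")
print(cleaned)  # 你好 世界 hello world
```

Compiling the pattern once is a convenience when the same cleaning step runs over many sentences.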

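The ultimate weapon against troublesome characters: keep only Chinese characters, letters, digits, underscores, and whitespace, and drop everything else, emoji included (the Unicode range used here for Chinese is [0x4E00, 0x9FA5]).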

# No "|" inside the class: it would be a literal pipe there, not alternation.
sen_text = re.compile(r'[\u4E00-\u9FA5\s\w]').findall(sen)
sentence = "".join(sen_text)
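The same whitelist can be written either as "find what to keep and join it" or as "substitute away everything else" with a negated class. The sketch below assumes the intent is to keep only Chinese in the range above plus ASCII letters, digits, underscores, and whitespace, so it spells out A-Za-z0-9_ instead of \w (which in Python 3 would also keep other Unicode letters). The sample sentence is invented:

```python
import re

sen = "今天天气不错 :) nice day!! \U0001F600 #2024"

# Whitelist approach: collect every allowed character and join them.
keep = re.compile(r"[\u4E00-\u9FA5A-Za-z0-9_\s]")
kept = "".join(keep.findall(sen))

# Negated-class approach: delete every character that is not allowed.
drop = re.compile(r"[^\u4E00-\u9FA5A-Za-z0-9_\s]")
dropped = drop.sub("", sen)

print(kept)             # 今天天气不错  nice day  2024
print(kept == dropped)  # True
```

Both forms give the same result; the negated class avoids building an intermediate list.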

Greedy and non-greedy matching
