Python爬取豆瓣电影短评

您所在的位置:网站首页 豆瓣短评爬虫 Python爬取豆瓣电影短评

Python爬取豆瓣电影短评

2024-07-09 02:11:32| 来源: 网络整理| 查看: 265

        豆瓣是比较难爬取的网站之一,主要因为豆瓣默认如果不登录账号的话只能爬取10页的评论。所以我就带着cookie去爬取,而且设置了一个用户代理池,尽可能的伪装成浏览器。然而当我爬了三四次,一共几十页评论之后的第二天,我的豆瓣账号就被封了。。。

        还是老规矩,F12审查元素,点击network并刷新页面,然后惊人的发现json文件中没有找到有关评论的信息。豆瓣的影评没有存到json文件里,而是直接放在HTML静态页面上,那我们就直接打开HTML文件的头部信息,拿到URL和cookie,然后直接上代码。

通用信息中的URL:

请求头中的cookie:       

# -*-coding:utf-8-*- import urllib.request from bs4 import BeautifulSoup import random import time def getHtml(url): """获取url页面""" user_agents = [ 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60', 'Opera/8.0 (Windows NT 5.1; U; en)', 'Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50', 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0', 'Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2 ', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36', 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11', 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36', 'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER', 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)', 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0', 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0) ', "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)", "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)", "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)", "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)", "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)", "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)", "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)", "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)", "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6", "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1", "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0", "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5", "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20", "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52" ] headers = { 'Cookie': 'douban-profile-remind=1; trc_cookie_storage=taboola%2520global%253Auser-id%3De1c60947-a390-4686-856c-f46c4a32e761-tuct3ee994a; __utmv=30149280.16715; _vwo_uuid_v2=D86904745E1B8CFBCE3B86C645D26D73E|da7b9bc06f5d01692e863e3ef93a79aa; ll="118106"; bid=ritmPjTqVZ0; __yadk_uid=w5qthAeH1m9Q5sKuM5jcaIJc5jtoQbk9; __gads=ID=4c78ae3f2ae997a9:T=1580449446:S=ALNI_MblBbT9dgASYCycz7yeVP4uMK6hpA; push_doumail_num=0; dbcl2="167155432:fpzNKjDvwrU"; push_noty_num=0; ct=y; ck=tMAk; _pk_ref.100001.4cf6=%5B%22%22%2C%22%22%2C1583240130%2C%22https%3A%2F%2Fcn.bing.com%2F%22%5D; _pk_ses.100001.4cf6=*; ap_v=0,6.0; __utma=30149280.1642366477.1541387570.1581926084.1583240131.43; __utmb=30149280.0.10.1583240131; __utmc=30149280; __utmz=30149280.1583240131.43.3.utmcsr=cn.bing.com|utmccn=(referral)|utmcmd=referral|utmcct=/; __utma=223695111.778911602.1541387570.1581926084.1583240131.40; __utmb=223695111.0.10.1583240131; __utmc=223695111; __utmz=223695111.1583240131.40.1.utmcsr=cn.bing.com|utmccn=(referral)|utmcmd=referral|utmcct=/; _pk_id.100001.4cf6=dd777e506416d181.1541387571.39.1583240394.1581926693.', 'User-Agent': str(random.choice(user_agents)), 'Referer': 'https: // movie.douban.com / subject / 26752088 / comments?status = P', 'Connection': 'keep-alive' } req = urllib.request.Request(url,headers=headers) req = urllib.request.urlopen(req) content = req.read().decode('utf-8') return content def getComment(url): """解析HTML页面""" html = getHtml(url) soupComment = BeautifulSoup(html, 'html.parser') comments = soupComment.findAll('span', 'short') onePageComments = [] for comment in comments: # print(comment.getText()+'\n') onePageComments.append(comment.getText()+'\n') return onePageComments if __name__ == '__main__': f = open('我不是药神.txt', 'a', encoding='utf-8') j = 0 for page in range(15): # 豆瓣爬取多页评论需要验证。 url = 'https://movie.douban.com/subject/26752088/comments?start=' + str(20*page) + '&limit=20&sort=new_score&status=P' print('第%s页的评论:' % (page)) print(url + '\n') for i in getComment(url): f.write(i) print(j,i) j += 1 time.sleep(10) print('\n')

        这里分析页面使用了bs4库中的html.paeser,这是个解析HTML的工具。在页面中我们不难找到评论信息存储在属性为‘short’的span标签中:



【本文地址】

公司简介

联系我们

今日新闻


点击排行

实验室常用的仪器、试剂和
说到实验室常用到的东西,主要就分为仪器、试剂和耗
不用再找了,全球10大实验
01、赛默飞世尔科技(热电)Thermo Fisher Scientif
三代水柜的量产巅峰T-72坦
作者:寞寒最近,西边闹腾挺大,本来小寞以为忙完这
通风柜跟实验室通风系统有
说到通风柜跟实验室通风,不少人都纠结二者到底是不
集消毒杀菌、烘干收纳为一
厨房是家里细菌较多的地方,潮湿的环境、没有完全密
实验室设备之全钢实验台如
全钢实验台是实验室家具中较为重要的家具之一,很多

推荐新闻


图片新闻

实验室药品柜的特性有哪些
实验室药品柜是实验室家具的重要组成部分之一,主要
小学科学实验中有哪些教学
计算机 计算器 一般 打孔器 打气筒 仪器车 显微镜
实验室各种仪器原理动图讲
1.紫外分光光谱UV分析原理:吸收紫外光能量,引起分
高中化学常见仪器及实验装
1、可加热仪器:2、计量仪器:(1)仪器A的名称:量
微生物操作主要设备和器具
今天盘点一下微生物操作主要设备和器具,别嫌我啰嗦
浅谈通风柜使用基本常识
 众所周知,通风柜功能中最主要的就是排气功能。在

专题文章

    CopyRight 2018-2019 实验室设备网 版权所有 win10的实时保护怎么永久关闭