Scraping Your Watched Movies from Douban with Python

2024-07-04 08:29 | Source: collected from the web

Here is the Python code:

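# coding=utf-8
import re

import requests
from requests.exceptions import RequestException
import xlwt


def get_one_page(url):
    try:
        response = requests.get(url)  # fetch the page
        if response.status_code == 200:  # 200 means the response is OK
            return response.text  # return the page source
        return None  # any other status: return nothing
    except RequestException as e:  # on any request error, print it and return nothing
        print(e)
        return None


n = 1  # next spreadsheet row to write; row 0 holds the headers


def parse_one_page(html):
    global n
    if html is None:  # the request failed, so there is nothing to parse
        return
    # NOTE: the HTML tags that anchored this pattern appear to have been stripped
    # when the post was published, so the pattern is incomplete as shown. The code
    # below reads item[0]..item[4], so the original presumably had five capture groups.
    pattern = re.compile('.*?(.*?).*?(.*?).*?.*?(.*?).*?(.*?)', re.S)
    items = re.findall(pattern, html)  # list of every match, one tuple per movie
    for item in items:
        sheet.write(n, 0, str(item[0]))
        sheet.write(n, 1, str(item[2]))
        sheet.write(n, 2, str(item[3]))
        sheet.write(n, 3, str(item[4]))
        # item[1] is the '/'-separated info field; write one piece per column
        cut = item[1].split('/')
        i = 4
        for j in cut:
            sheet.write(n, i, str(j))
            i = i + 1
        n = n + 1
        print(n)


def main(start):
    url = ('https://movie.douban.com/people/7847299/collect?start=' + str(start)
           + '&sort=time&rating=all&filter=all&mode=grid')
    html = get_one_page(url)
    parse_one_page(html)


try:
    book = xlwt.Workbook(encoding='utf-8', style_compression=0)
    sheet = book.add_sheet('看过的电影', cell_overwrite_ok=True)  # "Watched movies"
    sheet.write(0, 0, '电影名')  # title
    sheet.write(0, 1, '评分')  # rating
    sheet.write(0, 2, '看过的时间')  # date watched
    sheet.write(0, 3, '评价')  # comment
    for b in range(0, 60):
        m = b * 15  # advance the start offset by 15 entries per page
        try:
            main(m)
            book.save(r'C:\Users\Administrator\Desktop\movie.xls')
        except Exception as e:
            print(e)
except Exception as e:
    print(e)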
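The script builds each page URL by string concatenation, stepping the start offset by 15. As a minimal sketch of the same paging, the URLs could also be generated with urllib.parse.urlencode. The helper names page_url and page_urls and the constants below are made up for this illustration; only the URL and its query parameters come from the script.

```python
from urllib.parse import urlencode

# Base address and query parameters are taken from the script above.
BASE_URL = 'https://movie.douban.com/people/7847299/collect'
PAGE_SIZE = 15  # the script advances `start` by 15 per page


def page_url(start):
    """Build the collection-page URL for a given start offset (hypothetical helper)."""
    query = urlencode({
        'start': start,
        'sort': 'time',
        'rating': 'all',
        'filter': 'all',
        'mode': 'grid',
    })
    return BASE_URL + '?' + query


def page_urls(pages):
    """Return the URLs of the first `pages` pages (hypothetical helper)."""
    return [page_url(i * PAGE_SIZE) for i in range(pages)]
```

Calling page_urls(60) gives the same 60 addresses that the loop in the script requests one by one.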

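The result is an Excel sheet (the screenshot is not preserved here) with one row per movie: title, rating, date watched and comment, followed by the pieces of the movie's '/'-separated info field, one per column.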

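Because I did not use a word segmentation package, and the release date, country, director, cast and so on all sit in a single field, this part could not be separated cleanly.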
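Even without a segmentation package, part of that field can be sorted out with a simple rule. The sketch below pulls out the pieces that look like release dates and leaves the rest untouched. It assumes the pieces are separated by '/' and that dates look like 2010-09-01, optionally followed by a region in parentheses. The sample string and the name split_info are made up for this illustration and are not real Douban data. Telling directors apart from cast or countries would still need a name list or a segmentation tool.

```python
import re

# Assumed date shape: YYYY-MM-DD, optionally followed by "(region)".
DATE_RE = re.compile(r'^\d{4}-\d{2}-\d{2}(\(.*\))?$')


def split_info(info):
    """Split a '/'-separated info string into (release dates, everything else)."""
    parts = [p.strip() for p in info.split('/') if p.strip()]
    dates = [p for p in parts if DATE_RE.match(p)]
    others = [p for p in parts if not DATE_RE.match(p)]
    return dates, others


# Hypothetical sample, only to show the shape of the input.
sample = '2010-09-01(China) / 2010-07-16(USA) / Director A / Actor B / USA / 148 min'
dates, others = split_info(sample)
print(dates)
print(others)
```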

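Also, when a movie is marked as watched but has no comment, the comment cannot be matched to the right movie: the pattern picks up the next movie's comment instead. This only affects that single movie, so it can mostly be ignored.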
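One way to avoid this mismatch is to cut the page into one block per movie first and then search for each field inside its own block, so a missing comment becomes an empty string instead of borrowing the next movie's. The sketch below shows the idea on made-up markup: the tags and class names are hypothetical, not Douban's actual HTML, and parse_items is an illustrative name. Real pages have nested tags, where an HTML parser would be more robust than these patterns.

```python
import re

# Hypothetical markup: each movie sits in its own <div class="item"> block.
ITEM_RE = re.compile(r'<div class="item">(.*?)</div>', re.S)
TITLE_RE = re.compile(r'<em>(.*?)</em>', re.S)
COMMENT_RE = re.compile(r'<span class="comment">(.*?)</span>', re.S)


def parse_items(html):
    """Return (title, comment) pairs; comment is '' when an item has none."""
    rows = []
    for block in ITEM_RE.findall(html):
        # Each search is confined to one movie's block, so it cannot
        # run past the block and match the next movie's comment.
        title = TITLE_RE.search(block)
        comment = COMMENT_RE.search(block)
        rows.append((
            title.group(1).strip() if title else '',
            comment.group(1).strip() if comment else '',
        ))
    return rows


# Made-up sample: the first movie has no comment, the second has one.
sample_html = (
    '<div class="item"><em>Movie A</em></div>'
    '<div class="item"><em>Movie B</em><span class="comment">Great</span></div>'
)
print(parse_items(sample_html))
```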
