python爬虫开发之使用Python爬虫库requests多线程抓取猫眼电影TOP100实例

2023-11-11 04:15| 来源: 网络整理| 查看: 265

这篇文章主要介绍了python爬虫开发之使用Python爬虫库requests多线程抓取猫眼电影TOP100实例,需要的朋友可以参考下使用Python爬虫库requests多线程抓取猫眼电影TOP100思路：

查看网页源代码抓取单页内容正则表达式提取信息猫眼TOP100所有信息写入文件多线程抓取运行平台：windows Python版本：Python 3.7. IDE:Sublime Text 浏览器：Chrome浏览器 1.查看猫眼电影TOP100网页原代码按F12查看网页源代码发现每一个电影的信息都在“

”标签之中。

点开之后，信息如下：在这里插入图片描述

2.抓取单页内容在浏览器中打开猫眼电影网站，点击“榜单”，再点击“TOP100榜”如下图：在这里插入图片描述

接下来通过以下代码获取网页源代码：

#-*-coding:utf-8-*- import requests from requests.exceptions import RequestException #猫眼电影网站有反爬虫措施，设置headers后可以爬取 headers = { 'Content-Type': 'text/plain; charset=UTF-8', 'Origin':'https://maoyan.com', 'Referer':'https://maoyan.com/board/4', 'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36' } #爬取网页源代码 def get_one_page(url,headers): try: response =requests.get(url,headers =headers) if response.status_code == 200: return response.text return None except RequestsException: return None def main(): url = "https://maoyan.com/board/4" html = get_one_page(url,headers) print(html) if __name__ == '__main__': main()

执行结果如下：在这里插入图片描述 3.正则表达式提取信息上图标示信息即为要提取的信息，代码实现如下：

#-*-coding:utf-8-*- import requests import re from requests.exceptions import RequestException #猫眼电影网站有反爬虫措施，设置headers后可以爬取 headers = { 'Content-Type': 'text/plain; charset=UTF-8', 'Origin':'https://maoyan.com', 'Referer':'https://maoyan.com/board/4', 'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36' } #爬取网页源代码 def get_one_page(url,headers): try: response =requests.get(url,headers

【本文地址】

公司简介

联系我们