Still using BeautifulSoup for your Python crawler? You have better options!

2024-07-15 04:54 | Source: compiled from the web

1. Introduction

1.1 Fetching a web page

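This article uses examples to show three ways of extracting data from a web page: regular expressions, BeautifulSoup, and lxml. For details on the code that fetches the page content, see "Python网络爬虫-你的第一个爬虫" (Python Web Crawler: Your First Crawler). The function below downloads an entire page.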

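    import requests


    def download(url, num_retries=2, user_agent='wswp', proxies=None):
        '''Download the given URL and return the page content.

        Args:
            url (str): the URL to fetch.

        Keyword args:
            user_agent (str): user agent (default: wswp).
            proxies (dict): proxy mapping; keys are 'http' / 'https',
                values are strings of the form 'http(s)://IP'.
            num_retries (int): how many times to retry on a 5xx error (default: 2).
                A 5xx status is a server error: the server failed to
                fulfil an apparently valid request.
                https://zh.wikipedia.org/wiki/HTTP%E7%8A%B6%E6%80%81%E7%A0%81
        '''
        print('==========================================')
        print('Downloading:', url)
        # Set the header explicitly; the default one is sometimes
        # rejected by a site's anti-scraping checks.
        headers = {'User-Agent': user_agent}
        try:
            # Plain and simple: requests.get(url)
            resp = requests.get(url, headers=headers, proxies=proxies)
            html = resp.text  # page content as a string
            if resp.status_code >= 400:
                # Error handling: a 4xx client error returns None
                print('Download error:', resp.text)
                html = None
                # The source listing is cut off after "if num_retries and 500".
                # Everything from this condition onward is a reconstruction
                # based on the docstring, not the author's original text.
                if num_retries and 500 <= resp.status_code < 600:
                    # Retry 5xx errors, keeping the same user agent and proxies
                    return download(url, num_retries - 1, user_agent, proxies)
        except requests.exceptions.RequestException as e:
            print('Download error:', e)
            html = None
        return html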
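The three extraction approaches named in the introduction can be compared on a small inline fragment, without any network access. The sketch below is only an illustration: the HTML fragment, the `area_row` id, the `value` class, and the three function names are hypothetical and do not come from the original article. The regular-expression version needs only the standard library; the other two assume the third-party packages `beautifulsoup4` and `lxml` are installed, so they are imported lazily inside their functions.

```python
import re

# Hypothetical HTML fragment, used only for this illustration.
HTML = """
<table>
  <tr id="area_row"><td class="label">Area: </td><td class="value">244,820 square kilometres</td></tr>
</table>
"""


def area_with_regex(html):
    """Extract the cell text with a regular expression (standard library only)."""
    pattern = r'<tr id="area_row">.*?<td class="value">(.*?)</td>'
    match = re.search(pattern, html, re.DOTALL)
    return match.group(1) if match else None


def area_with_bs4(html):
    """Extract the cell text with BeautifulSoup (assumes beautifulsoup4 is installed)."""
    from bs4 import BeautifulSoup  # lazy import: optional dependency

    soup = BeautifulSoup(html, "html.parser")
    row = soup.find("tr", id="area_row")
    cell = row.find("td", class_="value") if row else None
    return cell.text if cell else None


def area_with_lxml(html):
    """Extract the cell text with lxml and XPath (assumes lxml is installed)."""
    from lxml.html import fromstring  # lazy import: optional dependency

    tree = fromstring(html)
    cells = tree.xpath('//tr[@id="area_row"]/td[@class="value"]')
    return cells[0].text_content() if cells else None


if __name__ == "__main__":
    print(area_with_regex(HTML))
```

The regular expression is tied to the exact attribute order and quoting of the markup, so it breaks on small layout changes. The two parser-based versions locate the cell through the document structure, which is usually the more robust choice.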
