Python Web Scraping

2024-07-10 17:13 | Source: compiled from the web

Python Web Scraping - Data Extraction

Analyzing a web page means understanding its structure. Why does that matter for web scraping? This chapter looks at the question in detail.

Web Page Analysis

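Web page analysis matters because, without it, we cannot know what form of data (structured or unstructured) we will receive from the page after extraction. A web page can be analyzed in the following ways: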

Viewing the Page Source

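This approach examines a page's source code to understand its structure. Right-click the page and choose the View page source option; the browser then shows the page's HTML, which contains the data we are interested in. The main drawback is that the raw source keeps its original whitespace and formatting, which can make it hard to read.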

Inspecting the Page Source with the Inspect Element Option

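This is another way to analyze a web page. The difference is that it avoids the formatting and whitespace problems of the raw source. Right-click the page and choose Inspect or Inspect element from the menu; the browser then shows information about that particular region or element of the page.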
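The element tree that a browser's inspector displays can be approximated in a few lines of Python. The sketch below is only an illustration, not how browsers do it: it uses the standard library's html.parser to print one indented line per opening tag, and the SAMPLE markup, the OutlineParser class, and the outline function are hypothetical names introduced for this example.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not change the depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class OutlineParser(HTMLParser):
    """Collects one indented line per opening tag."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # Show the id attribute when present, to make elements easy to spot.
        attr_map = dict(attrs)
        label = tag + ("#" + attr_map["id"] if attr_map.get("id") else "")
        self.lines.append("  " * self.depth + label)
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1


def outline(html_text):
    """Return the indented tag outline of html_text as a list of lines."""
    parser = OutlineParser()
    parser.feed(html_text)
    parser.close()
    return parser.lines


# Hypothetical page source, squeezed onto one line as raw source often is.
SAMPLE = '<html><body><div id="main"><h1>Title</h1><p>Text<br>more</p></div></body></html>'

for line in outline(SAMPLE):
    print(line)
```

Reading the indented outline makes the nesting of div, h1, and p visible at a glance, which is the information needed later to write a pattern or path that targets one element.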

Different Ways to Extract Data from a Web Page

The following methods are the ones most commonly used to extract data from a web page:

Regular Expressions

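Regular expressions are a small, highly specialized programming language embedded in Python and made available through the re module. They are also called REs, regexes, or regex patterns. With a regular expression we specify rules for the set of possible strings we want to match in the data.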
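As a small illustration of "specifying a rule for a set of strings", the sketch below defines a rule for six-digit codes and applies it in two ways: validating whole strings and finding matches inside longer text. The sample strings are made up for this example.

```python
import re

# Rule: the whole string must be exactly six digits.
postal_code = re.compile(r"^\d{6}$")

candidates = ["110001", "1100", "11000A", "560034"]
valid = [c for c in candidates if postal_code.match(c)]
print(valid)

# findall returns every non-overlapping match of a rule inside a longer text.
# \b marks a word boundary, so longer digit runs are not matched.
text = "Offices: New Delhi 110001, Bengaluru 560034, ref 42."
found = re.findall(r"\b\d{6}\b", text)
print(found)
```

The same rule, written once, decides both which candidates are valid and which substrings of a document are worth extracting.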

Example

The following example scrapes data about India from http://example.webscraping.com by matching the page content against a regular expression.

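import re
import urllib.request
# Download the page and decode the response bytes into text
response = urllib.request.urlopen('http://example.webscraping.com/places/default/view/India-102')
html = response.read()
text = html.decode()
# Assumption: the value cells on this page are <td class="w2p_fw"> elements;
# the capture group returns the content of each such cell
print(re.findall('<td class="w2p_fw">(.*?)</td>', text))

Output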

The corresponding output looks like this:

[ '', '3,287,590 square kilometres', '1,173,108,018', 'IN', 'India', 'New Delhi', 'AS', '.in', 'INR', 'Rupee', '91', '######', '^(\\d{6})$', 'enIN,hi,bn,te,mr,ta,ur,gu,kn,ml,or,pa,as,bh,sat,ks,ne,sd,kok,doi,mni,sit,sa,fr,lus,inc', ' CN NP MM BT PK BD ' ]

The output above shows the details about India that the regular expression extracted.
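A flat list of values is hard to use without knowing which value is which. One possible refinement, sketched below on a made-up HTML fragment rather than on the live site, captures a label together with each value and builds a dictionary. The class names "label" and "value" and the SAMPLE text are assumptions for illustration only.

```python
import re

# Hypothetical fragment: each row holds a label cell and a value cell.
SAMPLE = """
<table>
<tr><td class="label">Area:</td><td class="value">3,287,590 square kilometres</td></tr>
<tr><td class="label">Capital:</td><td class="value">New Delhi</td></tr>
<tr><td class="label">Currency:</td>
    <td class="value">Rupee</td></tr>
</table>
"""

# Two capture groups: the label text before the colon, then the value text.
# \s* tolerates whitespace or a line break between the two cells.
ROW = re.compile(
    r'<td class="label">(.*?):</td>\s*<td class="value">(.*?)</td>',
    re.DOTALL,
)

# With two groups, findall yields (label, value) tuples.
details = {label.strip(): value.strip() for label, value in ROW.findall(SAMPLE)}
print(details)
```

Note that a pattern like this depends on the exact markup: a changed attribute order or an extra attribute would stop it from matching, which is one reason the parser-based methods in the following sections are often preferred.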

BeautifulSoup

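Suppose we want to collect all the hyperlinks from a web page. For that we can use a parser called BeautifulSoup, which is documented in detail at https://www.crummy.com/software/BeautifulSoup/bs4/doc/. In short, BeautifulSoup is a Python library for pulling data out of HTML and XML files. It cannot fetch a web page by itself, so it is commonly paired with requests: it needs an input (a document or the text fetched from a URL) to create a soup object. The Python script in this section retrieves the title of a web page; hyperlinks can be collected from the same soup object.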
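To show the idea of collecting a title and hyperlinks without installing anything or touching the network, the sketch below uses the standard library's html.parser as a stand-in for BeautifulSoup. The TitleAndLinks class and the SAMPLE markup are hypothetical and exist only for this illustration.

```python
from html.parser import HTMLParser


class TitleAndLinks(HTMLParser):
    """Records the <title> text and the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            # Anchors without an href attribute are skipped.
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only text that appears inside <title> is kept.
        if self._in_title:
            self.title += data


# Hypothetical page used instead of a live request.
SAMPLE = """<html><head><title>Sample Page</title></head>
<body><a href="/about">About</a> <a name="top">anchor without href</a>
<a href="https://example.com/contact">Contact</a></body></html>"""

parser = TitleAndLinks()
parser.feed(SAMPLE)
parser.close()
print(parser.title)
print(parser.links)
```

With BeautifulSoup installed, the same result is usually written more compactly, for example soup.title.text for the title and [a['href'] for a in soup.find_all('a', href=True)] for the links.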

Installing beautifulsoup

With the pip command, beautifulsoup can be installed either in a virtual environment or in the global installation.

(base) D:\ProgramData>pip install bs4
Collecting bs4
  Downloading https://files.pythonhosted.org/packages/10/ed/7e8b97591f6f456174139ec089c769f89a94a1a4025fe967691de971f314/bs4-0.0.1.tar.gz
Requirement already satisfied: beautifulsoup4 in d:\programdata\lib\site-packages (from bs4) (4.6.0)
Building wheels for collected packages: bs4
  Running setup.py bdist_wheel for bs4 ... done
  Stored in directory: C:\Users\gaurav\AppData\Local\pip\Cache\wheels\a0\b0\b2\4f80b9456b87abedbc0bf2d52235414c3467d8889be38dd472
Successfully built bs4
Installing collected packages: bs4
Successfully installed bs4-0.0.1

Example

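Note that this example extends the earlier one implemented with the requests Python module. We use r.text to create a soup object, which is then used to fetch details such as the web page title.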

First, import the required Python modules:

import requests
from bs4 import BeautifulSoup

The following line uses requests to make an HTTP GET request to the URL https://authoraditiagarwal.com/:

r = requests.get('https://authoraditiagarwal.com/')

Next, create a soup object as follows:

soup = BeautifulSoup(r.text, 'lxml')
# soup.title is the whole <title> element; .text is only its text
print(soup.title)
print(soup.title.text)

Output

The corresponding output looks like this:

<title>Learn and Grow with Aditi Agarwal</title>
Learn and Grow with Aditi Agarwal

lxml

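The other Python library for web scraping discussed here is lxml. It is a high-performance HTML and XML parsing library that is comparatively fast and straightforward. You can read more at https://lxml.de/.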

Installing lxml

With the pip command, lxml can be installed either in a virtual environment or in the global installation.

(base) D:\ProgramData>pip install lxml
Collecting lxml
  Downloading https://files.pythonhosted.org/packages/b9/55/bcc78c70e8ba30f51b5495eb0e3e949aa06e4a2de55b3de53dc9fa9653fa/lxml-4.2.5-cp36-cp36m-win_amd64.whl (3.6MB)
    100% |████████████████████████████████| 3.6MB 64kB/s
Installing collected packages: lxml
Successfully installed lxml-4.2.5

Example: Data Extraction Using lxml and requests

The following example uses lxml and requests to scrape a specific element of a web page from authoraditiagarwal.com:

First, import requests, and import html from the lxml library:

import requests
from lxml import html

Next, provide the URL of the web page to scrape:

url = 'https://authoraditiagarwal.com/leadershipmanagement/'

Then provide the path (XPath) to the specific element of that page:

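path = '//*[@id="panel-836-0-0-1"]/div/div/p[1]'
# Fetch the page and parse the raw bytes into an element tree
response = requests.get(url)
byte_string = response.content
source_code = html.fromstring(byte_string)
# xpath() returns a list of the elements that match the path
tree = source_code.xpath(path)
print(tree[0].text_content())

Output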

The corresponding output looks like this:

The Sprint Burndown or the Iteration Burndown chart is a powerful tool to communicate daily progress to the stakeholders. It tracks the completion of work for a given sprint or an iteration. The horizontal axis represents the days within a Sprint. The vertical axis represents the hours remaining to complete the committed work.
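To see how a path of this shape selects an element, it can be tried offline on a small fragment. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in for lxml; it supports only a subset of XPath and requires well-formed markup, so the path starts with ".//" and the SAMPLE fragment is a hypothetical one that merely mimics the nesting the path expects.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment with the nesting id -> div -> div -> p.
SAMPLE = """
<html><body>
  <div id="panel-836-0-0-1">
    <div><div>
      <p>First <b>paragraph</b> of the panel.</p>
      <p>Second paragraph.</p>
    </div></div>
  </div>
</body></html>
""".strip()

root = ET.fromstring(SAMPLE)

# Any element with the given id, then div/div, then the first p child.
path = ".//*[@id='panel-836-0-0-1']/div/div/p[1]"
matches = root.findall(path)

# itertext() walks the element's text including nested tags,
# comparable to text_content() in lxml.
first_text = "".join(matches[0].itertext())
print(first_text)
```

Reading the path from left to right: the id predicate anchors the search on one container, each /div step descends one level, and p[1] picks the first paragraph at that level. Real pages are rarely well-formed XML, which is why lxml's HTML parser is used for live sites.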
