A Targeted Python Crawler for Site Search Results (Very Detailed, for Beginners)



2024-07-13 20:42:05 | Source: compiled from the web | Views: 265

Contents: Features; Preparation; Module overview; Code; Code walkthrough (getHtml, parsePage, printlist); Results; Summary

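Features

Target site: https://www.hellohuanxuan.top/
This is a targeted crawler: it fetches only the given URL and does not follow links to other pages.
The crawler submits a query to the site's search box and scrapes the results page.
Required libraries: requests, bs4

Preparation

First, find the search box on the page and search for anything to see what happens.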

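Note the URL of the results page at this point.

From it we can infer that the search parameter is "?s=". Next, press F12 to inspect the page source: all of the data we want, such as each post's publication time, title, and link, sits inside a main tag.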
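The "?s=" parameter inferred above can be assembled programmatically rather than by plain string concatenation, which matters once the keyword contains spaces or Chinese characters. A minimal sketch, assuming the site accepts a standard URL-encoded query string; the helper name build_search_url is made up for illustration and is not part of the original post:

```python
from urllib.parse import urlencode

BASE_URL = "https://www.hellohuanxuan.top/"


def build_search_url(keyword, base_url=BASE_URL):
    """Return the search URL for `keyword` (hypothetical helper).

    urlencode percent-encodes non-ASCII characters and turns spaces into "+",
    so the keyword is safe to place in the query string.
    """
    return base_url + "?" + urlencode({"s": keyword})


print(build_search_url("1"))               # https://www.hellohuanxuan.top/?s=1
print(build_search_url("python crawler"))  # https://www.hellohuanxuan.top/?s=python+crawler
```

requests can perform the same encoding itself if the keyword is passed as requests.get(BASE_URL, params={"s": keyword}).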

Now let's start writing the code.

Module overview

The crawler is split into three modules, one function each:
getHtml(url, header): sends the request and returns the page source.
parsePage(ulist, html): parses the source, extracts the useful fields, and appends them to a list (the key part of the whole program).
printlist(ulist): prints the list in a formatted layout.

Code

    import requests
    from bs4 import BeautifulSoup
    import bs4


    def getHtml(url, header):
        try:
            r = requests.get(url, headers=header)
            r.raise_for_status()
            print(r.request.headers)
            # r.encoding = r.apparent_encoding  # enable only if needed
            return r.text
        except requests.RequestException:
            print("Fetch failed!")
            return " "


    def parsePage(ulist, html):
        soup = BeautifulSoup(html, "html.parser")
        main_tag = soup.find('main', {'class': 'site-main'})
        if main_tag is None:
            # Happens, for example, when the fetch failed and html is " "
            print("Data missing!")
            return
        for i in main_tag.children:
            try:
                if isinstance(i, bs4.element.Tag):
                    psrc = i('div', {'class': 'p-time'})
                    title = i('h1', {'class': 'entry-title'})
                    # print(psrc[0].text)
                    # print(title[0].string)
                    # print(title[0].a.attrs['href'])
                    ulist.append([psrc[0].text, title[0].string, title[0].a.attrs['href']])
            except (IndexError, AttributeError, KeyError):
                print("Data missing!")


    def printlist(ulist):
        print("{:10}\t{:10}\t{:8}".format("Date", "Title", "Link"))
        for i in ulist:
            print("{:10}\t{:10}\t{:8}".format(i[0], i[1], i[2]))


    def main():
        header = {
            "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
        }
        worlds = '1'  # search keyword
        ulist = []
        url = "https://www.hellohuanxuan.top/?s=" + worlds
        html = getHtml(url, header)
        parsePage(ulist, html)
        printlist(ulist)


    if __name__ == "__main__":
        main()

Code walkthrough

getHtml

    try:
        # Fetch the page source with the get method of requests
        r = requests.get(url, headers=header)
        # Raise an exception if the response status is an error (4xx or 5xx)
        r.raise_for_status()
        # Print the request headers for a quick look; can be commented out
        print(r.request.headers)
        # r.encoding = r.apparent_encoding
        # Enable the line above only if needed. For my site leave it
        # commented out, otherwise the Chinese text comes out garbled.
        return r.text
    except requests.RequestException:
        print("Fetch failed!")
        return " "

parsePage

    # Parse the HTML with BeautifulSoup
    soup = BeautifulSoup(html, "html.parser")
    # Find the main tag whose class is 'site-main'
    main_tag = soup.find('main', {'class': 'site-main'})
    # find() returns None when the tag is absent, so stop early in that case
    if main_tag is None:
        print("Data missing!")
        return
    # Loop over the child nodes of that main tag
    for i in main_tag.children:
        # try/except catches entries that lack the expected fields
        try:
            # isinstance checks whether i is a bs4 Tag
            # (children also include plain text nodes such as whitespace)
            if isinstance(i, bs4.element.Tag):
                # Get the div tags whose class is 'p-time'
                psrc = i('div', {'class': 'p-time'})
                # Get the h1 tags whose class is 'entry-title'
                title = i('h1', {'class': 'entry-title'})
                # print(psrc[0].text)
                # print(title[0].string)
                # print(title[0].a.attrs['href'])
                # Append the values to the list
                ulist.append([psrc[0].text, title[0].string, title[0].a.attrs['href']])
        except (IndexError, AttributeError, KeyError):
            print("Data missing!")

printlist

    # Print the list in a formatted layout
    print("{:10}\t{:10}\t{:8}".format("Date", "Title", "Link"))
    for i in ulist:
        print("{:10}\t{:10}\t{:8}".format(i[0], i[1], i[2]))

Results

[Screenshot of the program output]
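One detail that can affect how the output looks: a format spec such as {:10} pads by character count, while Chinese characters usually occupy two terminal columns each, so rows with Chinese titles may not line up. A minimal sketch of width-aware padding using only the standard library; the helper names display_width, pad, and format_row and the column widths are illustrative assumptions, not part of the original code:

```python
import unicodedata


def display_width(text):
    """Approximate terminal width: wide and fullwidth characters count as 2."""
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        for ch in text
    )


def pad(text, width):
    """Right-pad `text` with spaces up to `width` terminal columns."""
    return text + " " * max(0, width - display_width(text))


def format_row(date, title, link, date_width=12, title_width=30):
    """Build one output row with columns aligned by display width."""
    return pad(date, date_width) + pad(title, title_width) + link


print(format_row("Date", "Title", "Link"))
print(format_row("2020-07-01", "标题 example", "https://example.com/post-1"))
```

This is only an approximation: actual alignment still depends on the terminal and font in use.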

Summary

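Please don't all point your crawlers at my site; a student server can't take much abuse. (sigh)

Finally, I recommend a MOOC video course: the Python crawler course by Song Tian (嵩天) of Beijing Institute of Technology, which is clear and thorough. Bilibili link: python网络爬虫与信息提取 (Python Web Crawling and Information Extraction)

I am still learning Python crawling, so if you spot anything that could be improved, corrections are welcome.

Reposted from my own small site: My Blog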
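In the spirit of going easy on the author's server, the parsing logic can be exercised offline against a small hand-written snippet instead of live requests. A sketch under stated assumptions: the HTML below is invented to match the structure described in this post (a main tag with class site-main whose children each contain a div with class p-time and an h1 with class entry-title), so the real page may differ; parse_results is a hypothetical variant of parsePage that returns a list and silently skips incomplete entries instead of printing an error.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Invented sample markup; the structure is assumed from the description above.
SAMPLE_HTML = """
<main class="site-main">
  <article>
    <div class="p-time">2020-07-01</div>
    <h1 class="entry-title"><a href="https://example.com/post-1">First post</a></h1>
  </article>
  <article>
    <div class="p-time">2020-07-02</div>
    <h1 class="entry-title"><a href="https://example.com/post-2">Second post</a></h1>
  </article>
  <nav>pagination</nav>
</main>
"""


def parse_results(html):
    """Return a list of (date, title, link) tuples found in `html`."""
    soup = BeautifulSoup(html, "html.parser")
    main_tag = soup.find("main", {"class": "site-main"})
    if main_tag is None:  # e.g. the fetch failed and html is " "
        return []
    rows = []
    for child in main_tag.children:
        if not isinstance(child, Tag):  # skip whitespace text nodes
            continue
        time_div = child.find("div", {"class": "p-time"})
        title = child.find("h1", {"class": "entry-title"})
        if time_div is None or title is None or title.a is None:
            continue  # not a search result entry (e.g. the nav block)
        rows.append((
            time_div.get_text(strip=True),
            title.get_text(strip=True),
            title.a.get("href", ""),
        ))
    return rows


rows = parse_results(SAMPLE_HTML)
for date, title, link in rows:
    print(date, title, link, sep="\t")
```

Returning a new list rather than appending to a caller-supplied one is a design choice made here to keep the function easy to test; the original ulist style works equally well.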
