Scraping the Zhihu Yanxuan Column Hot List with Python: A Worked Example

2024-05-09 05:40 | Source: compiled from the web

Background

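Hi everyone. After a clickbait answer got me thoroughly hooked and I had no choice but to pay for a Zhihu Yanxuan membership, the more I thought about it, the more the loss gnawed at me.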

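I'm not saying I'm unwilling to pay for knowledge. In fact, I've spent a fair amount on 7点 and jj. But paying 9 yuan on Zhihu just to read the ending of a story somehow feels like betraying the proletariat.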

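So I, a proletarian comrade whose scraping experience is limited to feel-good novels on 7点 and tragic romances on jj, decided to challenge myself and scrape the Zhihu Yanxuan hot list.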

Scraping the ranking list

First, consider scraping the hot list itself, that is, obtaining the list's table of contents.

Zhihu Yanxuan ranking list URL: https://www.zhihu.com/xen/market/ranking-list/salt

Open it, though, and you'll find that the page loads dynamically, so the whole table of contents can't be scraped in one go.

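Then I noticed that scrolling down triggers a request. Its Request URL has the following form, where the 20/20 in the link (the limit and offset values) changes:

https://api.zhihu.com/market/rank_list?type=hottest&sku_type=salt_all&limit=20&offset=20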
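To make the limit/offset pattern concrete, here is a minimal sketch that builds such request URLs. The endpoint and query parameters are copied from the Request URL above; the helper names `build_rank_list_url` and `page_urls` are hypothetical, and whether the server accepts arbitrary page sizes is not something this sketch verifies.

```python
from typing import List
from urllib.parse import urlencode

# Endpoint copied from the Request URL observed in the article.
RANK_LIST_API = "https://api.zhihu.com/market/rank_list"


def build_rank_list_url(limit: int, offset: int) -> str:
    """Build one rank-list request URL for the given page window."""
    params = {
        "type": "hottest",
        "sku_type": "salt_all",
        "limit": limit,
        "offset": offset,
    }
    return f"{RANK_LIST_API}?{urlencode(params)}"


def page_urls(total: int, page_size: int = 20) -> List[str]:
    """Hypothetical helper: URLs that together cover the first `total` entries."""
    return [build_rank_list_url(page_size, offset)
            for offset in range(0, total, page_size)]


if __name__ == "__main__":
    for url in page_urls(60):
        print(url)
```

Each URL could then be fetched with `requests.get(url, headers=headers)` in a loop, as an alternative to requesting one very large page.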

The scraping code is as follows:

    import json

    import jsonpath
    import pandas as pd
    import requests

    cookie = '/your cookie/'
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36',
               'Cookie': cookie}

    # Request the whole list in a single page by using a large limit
    url = 'https://api.zhihu.com/market/rank_list?type=hottest&sku_type=salt_all&limit=1000&offset=0'
    r = requests.get(url, headers=headers)
    html = r.content.decode()
    a = json.loads(html)

    # Pull the fields of interest out of the JSON response
    title = jsonpath.jsonpath(a, "$..title")
    author = jsonpath.jsonpath(a, "$..author")
    media_type = jsonpath.jsonpath(a, "$..media_type")
    button_text = jsonpath.jsonpath(a, "$..button_text")

    data0 = {'名称': title, '作者': author, '类型': media_type, '价格': button_text}
    data1 = pd.DataFrame(data=data0)

    # Inspect the scraped table of contents
    print(data1)

    # Write the table of contents to a file
    outputpath = './zhihu_concent.xlsx'
    data1.to_excel(outputpath, index=False, header=True)

This produces a table-of-contents .xlsx file.
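For readers unfamiliar with expressions like `$..title` in the listing above: `..` is JSONPath's recursive descent, which collects the values stored under a key at any depth. The following is a rough pure-Python sketch of that idea, not the `jsonpath` library's implementation. The function name `collect_key` and the sample data are made up for illustration, and the real library may differ in ordering and in what it returns when nothing matches.

```python
def collect_key(node, key):
    """Return every value stored under `key`, at any depth, in document order."""
    found = []
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key:
                found.append(value)
            # Keep descending: the value itself may contain further matches.
            found.extend(collect_key(value, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_key(item, key))
    return found


if __name__ == "__main__":
    # Made-up sample shaped loosely like a ranking response.
    sample = {"data": [{"title": "A", "author": "x"},
                       {"title": "B", "author": "y"}]}
    print(collect_key(sample, "title"))
```

One consequence worth keeping in mind: because the search is recursive, a nested object that happens to contain the same key contributes extra values, so the collected columns are not guaranteed to have equal lengths.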

Getting the work links

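A table of contents obviously isn't enough; we also need the actual content of each work. First, try to get the jump links from the ranking page.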

But that means solving the problem we just ran into: how do we deal with a dynamically loaded page?

Consider using selenium.

    pip3 install selenium

After placing the Chrome driver (chromedriver) in the Scripts directory, we can continue scraping.

P.S. Note that the code below keeps only the "Yanxuan column" (盐选专栏) links; links to e-books and audio are simply discarded.
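As an isolated illustration of the filter described in this note, here is a small sketch that keeps only links beginning with the paid-column prefix. The prefix is the one used in the article's code; the function name `keep_paid_columns` and the sample links (including their ids and the non-column path) are hypothetical.

```python
# Prefix used in the article's code to recognise Yanxuan column links.
PAID_COLUMN_PREFIX = "https://www.zhihu.com/xen/market/remix/paid_column/"


def keep_paid_columns(links):
    """Return only the links that point at paid columns, preserving order."""
    return [link for link in links if link.startswith(PAID_COLUMN_PREFIX)]


if __name__ == "__main__":
    # Hypothetical sample links; the ids and the second path are invented.
    sample = [
        PAID_COLUMN_PREFIX + "1111111111",
        "https://www.zhihu.com/xen/market/remix/something_else/2222222222",
    ]
    print(keep_paid_columns(sample))
```

Using `str.startswith` avoids having to count the prefix length by hand, which is what a slice comparison such as `link[:51] == ...` requires.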

    # Now scrape the Yanxuan column links in earnest
    import json
    import time

    import jsonpath
    import requests
    from bs4 import BeautifulSoup
    from selenium import webdriver

    cookie = '/your cookie/'
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36',
               'Cookie': cookie}
    urls = 'https://www.zhihu.com/xen/market/ranking-list/salt'

    # Use selenium to simulate a person visiting the page and collect the data
    def spider_jd(url):
        # ChromeOptions() holds Chrome's configuration
        options = webdriver.ChromeOptions()
        # Tell Chrome to run in headless mode
        options.add_argument('--headless')
        # Create the Chrome browser object (the options must be passed in to take effect)
        driver = webdriver.Chrome(options=options)
        # Open the browser and go to the given URL
        driver.get(url)
        # Simulate scrolling down 10 times, because the page loads dynamically
        for i in range(0, 10):
            driver.execute_script("window.scrollTo(0,document.body.scrollHeight);")
            # Pause 2 seconds so the page can finish loading; without this,
            # the page source obtained in some browsers is incomplete
            time.sleep(2)
        # Roughly equivalent to requests.get(url, headers=headers)
        source = driver.page_source
        bsobj = BeautifulSoup(source, 'html.parser')
        a = bsobj.find('div', attrs={'class': 'App-wrap-rM9XT'})
        links = []
        for k in a.find_all('a', attrs={'class': 'ProductCell-root-3LLcu RankingListItem-productCell-o4KL2'}):
            link = k['href']
            # Keep only columns (drop Yanxuan items in audio or e-book form)
            if link.startswith('https://www.zhihu.com/xen/market/remix/paid_column/'):
                links.append(link)
        driver.close()  # Close the browser once scraping is done
        return links

    # Call the function above to get the URLs of the Yanxuan columns
    article_link = spider_jd(urls)

Scraping the Yanxuan columns

At this step I first tried logging in to my account through selenium, but it failed.

    # Failed example
    url = article_link[0]
    driver = webdriver.Chrome()
    driver.get(url)

    login_button = driver.find_element_by_class_name('ShelfTopNav-login-p5mr5')
    login_button.click()
    time.sleep(1)

    login_tag = driver.find_element_by_class_name('LoginActions-actions-aKPz7')
    login_tag.click()
    time.sleep(1)

    username = driver.find_element_by_name('username')
    username.send_keys('/your account/')
    time.sleep(1)

    password = driver.find_element_by_name('password')
    password.send_keys('/your password/')
    time.sleep(1)

    # This is the step that raises the error
    login_submit = driver.find_element_by_class_name('Button SignFlow-submitButton Button--primary Button--blue')
    login_submit.click()

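The specific error was (the message roughly means "abnormal request parameters; please upgrade the client and retry"):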

    error: {code: 10001, message: "10001:请求参数异常，请升级客户端后重试"}
    code: 10001
    message: "10001:请求参数异常，请升级客户端后重试"

Inspecting the request details showed:

    Failed to load resource: the server responded with a status of 403 ()

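So the server refused my access. I didn't really understand why, but after some reading and reasoning, I think it is probably the server's anti-Selenium scraping mechanism.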

I searched everywhere for what to do and found an answer that looked somewhat reliable:

https://zhidao.baidu.com/question/751432606829473612.html

So I opened Baidu and searched for "selenium接管chrome" (selenium taking over Chrome).

That led me to this article:

https://www.cnblogs.com/pu369/p/12407996.html

The steps I took, following that tutorial, are briefly summarized below:

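1. Add the directory containing chrome.exe to the PATH environment variable.

2. Open cmd and run the following command:

    chrome.exe --remote-debugging-port=9222 --user-data-dir="C:\selenum\AutomationProfile"

   This opens a browser window, which we treat as an already-running browser.

3. Take over that browser by running the following Python code:

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    chrome_options = Options()
    # Attach to the Chrome instance listening on the debugging port
    chrome_options.add_experimental_option("debuggerAddress", "127.0.0.1:9222")
    chrome_driver = "chromedriver.exe"
    # Selenium 3 call style, as used throughout this article
    driver = webdriver.Chrome(chrome_driver, chrome_options=chrome_options)
    print(driver.title)
    driver.get("https://intoli.com/blog/not-possible-to-block-chrome-headless/chrome-headless-test.html")

At this point you have taken over that browser. For example, if you had just opened the Baidu home page in it, running the code above would print: 百度一下，你就知道~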
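Before attaching selenium, it can help to confirm that the Chrome instance really is listening on the debugging port. The sketch below is only an illustration under stated assumptions: `build_chrome_debug_command` and `is_port_open` are hypothetical helper names, the flags are the two from the command above, and the port check merely tests that something accepts TCP connections on that port, not that it is Chrome.

```python
import socket


def build_chrome_debug_command(port, user_data_dir, chrome="chrome.exe"):
    """Return the argument list for starting Chrome with remote debugging."""
    return [
        chrome,
        f"--remote-debugging-port={port}",
        f"--user-data-dir={user_data_dir}",
    ]


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    print(build_chrome_debug_command(9222, r"C:\selenum\AutomationProfile"))
    print("debug port reachable:", is_port_open("127.0.0.1", 9222))
```

The argument list could be handed to `subprocess.Popen` to start the browser from Python instead of from cmd; if `is_port_open` returns False, attaching via `debuggerAddress` will not work.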

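Damn it. I then tried logging in inside the taken-over browser with the same click-click-click approach as before, and it failed again. Not only did it fail, I also spotted a recruitment ad from the Zhihu front-end team in the F12 developer tools. What can I say? The proletariat never gives up. If I can't log in by technical means, can't I just scan the QR code to get in? Heh.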

So I logged in by scanning the QR code.

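The rest of the process was dull: it was nothing more than running wild on the strength of my Yanxuan membership. The main task was to merge the chapters of each column into a single txt file.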

The code is as follows:

    import json
    import time

    import jsonpath
    import requests
    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from w3lib.html import remove_tags

    # `article_link` and `headers` come from the earlier listings
    chrome_options = Options()
    chrome_options.add_experimental_option("debuggerAddress", "127.0.0.1:9222")
    chrome_driver = "chromedriver.exe"
    browser = webdriver.Chrome(chrome_driver, chrome_options=chrome_options)

    for i in range(len(article_link)):
        url = article_link[i]
        browser.get(url)
        time.sleep(3)
        source = browser.page_source
        bsobj = BeautifulSoup(source, 'html.parser')

        # Get the book title
        a = bsobj.find('div', attrs={'class': 'AlbumColumnMagazineWebPage-title-wN4vV'})
        book_title = remove_tags(str(a))
        print(book_title)

        # Get the status
        a = bsobj.find('div', attrs={'class': 'SectionCount-root-8G3SV AlbumColumnMagazineWebPage-sectionCount-a2s6F'})
        book_status = remove_tags(str(a))
        print(book_status)

        # Create the output file with a header line (the 知乎/ directory must already exist)
        dstFile = '知乎/' + '【' + book_status + '】' + book_title + '.txt'
        with open(dstFile, "w", encoding='utf-8') as f:
            f.write('【' + book_status + '】' + '《' + book_title + '》' + '\n')
            f.write('\n')

        # Get the catalog
        article_id = article_link[i].replace('https://www.zhihu.com/xen/market/remix/paid_column/', '')
        content_js = 'https://api.zhihu.com/remix/well/' + article_id + '/catalog?'
        r = requests.get(content_js, headers=headers)
        html = r.content.decode()
        a = json.loads(html)
        chapter_link = jsonpath.jsonpath(a, "$..url")

        # Get the chapter ids (loop variable renamed so it no longer shadows the outer `i`)
        chapter_id = []
        for k in range(len(chapter_link)):
            chapter_id_temp = str(chapter_link[k]).replace('https://www.zhihu.com/market/manuscript?business_id=', '').replace(article_id, '').replace('&track_id=', '').replace('&sku_type=paid_column', '')
            chapter_id.append(chapter_id_temp)

        for j in range(len(chapter_link)):
            # Build the URL of the corresponding chapter
            url = 'https://www.zhihu.com/market/paid_column/' + article_id + '/section/' + chapter_id[j]
            print(url)
            browser.get(url)
            time.sleep(2)
            source = browser.page_source
            bsobj = BeautifulSoup(source, 'html.parser')

            # Get the chapter title
            title = bsobj.find('h1', attrs={'class': 'ManuscriptTitle-root-vhZzG'})
            title = remove_tags(str(title))

            # Get the body text
            a = bsobj.find('div', attrs={'class': 'ManuscriptIntro-root-ighpP'})
            b = remove_tags(str(a))

            # Append the chapter to the book's txt file
            with open(dstFile, "a", encoding='utf-8') as f:
                f.write('第' + str(j+1) + '章·' + title + '\n')
                f.write(' ' + '\n')
                f.write(b + '\n')
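The chained `replace` calls in the listing strip everything from a chapter link except the chapter id. As a hedged alternative, the id could be read from the query string instead. This sketch assumes, based only on the strings being removed in the listing, that the chapter id is carried in either `business_id` or `track_id` and that the other parameter holds the column's `article_id`; the function names and sample ids are hypothetical.

```python
from urllib.parse import parse_qs, urlparse


def chapter_id_from_url(chapter_url, article_id):
    """Return the query value (business_id or track_id) that is not the article id."""
    query = parse_qs(urlparse(chapter_url).query)
    for name in ("business_id", "track_id"):
        for value in query.get(name, []):
            if value != article_id:
                return value
    return None  # Assumption violated: no separate chapter id in the link


def section_url(article_id, chapter_id):
    """Build the chapter page URL in the form used by the article's code."""
    return ("https://www.zhihu.com/market/paid_column/"
            f"{article_id}/section/{chapter_id}")


if __name__ == "__main__":
    # Invented ids, for illustration only.
    link = ("https://www.zhihu.com/market/manuscript"
            "?business_id=111&track_id=222&sku_type=paid_column")
    cid = chapter_id_from_url(link, "111")
    print(section_url("111", cid))
```

Unlike string replacement, this does not silently produce a wrong id when the article id happens to appear as a substring of the chapter id.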

With that, scraping the Zhihu Yanxuan hot list is complete.

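I'm not very experienced with scraping, and the code isn't pretty. If any teachers or fellow students reading this have comments or suggestions, feel free to contact me. Thanks!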
