scrapy爬取晋江免费小说（章节）+ cookie爬vip章节

2024-06-30 16:17| 来源: 网络整理| 查看: 265

思路：先打开晋江任意一篇小说的第一章，然后爬取该章节的名字、内容，以及该小说的名字，下一章节的链接；利用下一章节的链接实现重复的爬取，其中章节的名字、内容、小说名字存储在item字典中；最后将爬取到的内容进行整理写入txt文件。其实也可以在目录页提取各个章节的链接进行爬取，实现的是前一种方法。

1.创建项目

创建Scrapy项目，在shell中使用scrapy startproject命令：

scrapy startproject jjnovel cd jjnovel scrapy genspider book "www.jjwxc.net/onebook.php?novelid=3921595&chapterid=1"

在这里插入图片描述

2.分析页面 1.1

查看网页源码，分别找到小说名字、章节名字、下一章节链接所在的标签，可直接右键复制xpath节点地址在这里插入图片描述以页面的url地址为参数运行 scrapy shell命令：

scrapy shell "http://www.jjwxc.net/onebook.php?novelid=3921595&chapterid=1"

在shell中利用复制的xpath节点提取相应内容：

novel = response.xpath('/html/body/table[1]/tr[1]/td[1]/h1/a/span/text()').extract_first()[38:58]#小说名，[38:58]是根据网站实际情款写的 chapter = response.xpath('/html/body/table[1]/tr[2]/td[1]/div/div[2]/h2/text()').extract_first()#章节 content = response.xpath('/html/body/table[1]/tr[2]/td[1]/div/text()').extract()#内容 3.代码实现

1.settings.py中启用Item Pipeline（将注释去掉即可），将ROBOTSTXT_OBEY改为False，有时会报错：Ignoring response HTTP status code is not handled or not allowed，则可以伪装成常规浏览器，将USER_AGENT注释去掉。

USER_AGENT = 'Mozilla/5.0 (X11;Linux x86_64) Chrome/42.0.2311.90 Safari/537.36'

(参考：https://www.cnblogs.com/QW-lzm/p/9461375.html)

2.spiders目录下的book.py：

# -*- coding: utf-8 -*- import scrapy class BookSpider(scrapy.Spider): name = 'book' # allowed_domains = ['http://www.jjwxc.net/onebook.php?novelid=3921595&chapterid=1'] start_urls = ['http://www.jjwxc.net/onebook.php?novelid=3921595&chapterid=1/'] def parse(self, response): novel = response.xpath('/html/body/table[1]/tr[1]/td[1]/h1/a/span/text()').extract_first()[38:58] # 小说名 chapter = response.xpath('/html/body/table[1]/tr[2]/td[1]/div/div[2]/h2/text()').extract_first() # 章节 content = response.xpath('/html/body/table[1]/tr[2]/td[1]/div/text()').extract() # 内容 # i.replace(u'\u3000',u' ') yield { 'novel': novel, 'chapter': chapter, 'content': content, } next_url0 = response.xpath('/html/body/table[1]/tr[5]/td[1]/span/a/@href').extract_first() # 第一章只有一个标签 next_url = response.xpath('/html/body/table[1]/tr[5]/td[1]/span/a[2]/@href').extract_first() # 后面有上、下章节 if next_url: # 如果不是第一章 next_url = response.urljoin(next_url) # next_url的绝对地址 yield scrapy.Request(next_url, callback=self.parse) else: # 如果是第一章 if next_url0: # 如果找到下一页的URL,得到绝对路径，构造新的Request对象 next_url = response.urljoin(next_url0) # next_url的绝对地址 yield scrapy.Request(next_url, callback=self.parse)

3.格式化文本并写入txt文件，pipelines.py

# -*- coding: utf-8 -*- class JjnovelPipeline(object): def process_item(self, item, spider): novelname = '' for i in item['novel']: if i != ' ': # 爬取时截了20的字长，现在去掉空格 novelname += i self.filename = novelname + '.txt' self.file = open(self.filename, 'a', encoding='utf-8') self.file.write('\n' + item['chapter']) for i in range(len(item['content']) - 3): # 前三行是空串 kk = item['content'][i + 3] if i == 0: # 第四行比较特殊'\r\n \u3000\u3000...' self.file.write(kk + '\n') else: # 之后去掉行'\u3000\u3000'以及空行 if (len(kk) > 2) & (kk[2:3] != ' '): self.file.write(kk + '\n') self.file.close() return item

4.items.py

import scrapy class JjnovelItem(scrapy.Item): # define the fields for your item here like: name = scrapy.Field()

5.爬虫运行文件，在工程目录下新建crawl.py:

# -*- coding: utf-8 -*- from scrapy.crawler import CrawlerProcess from scrapy.utils.project import get_project_settings #运行爬虫的方法 if __name__=='__main__': # setting=get_project_settings() process = CrawlerProcess(settings=get_project_settings()) # 'credit'替换成你自己的爬虫名 process.crawl('book') process.start() # the script will block here until the crawling is finished

运行结果：在这里插入图片描述

只爬取28章节，29章需要vip。在这里插入图片描述爬虫注意事项： 1.xpath一直返回空列表。原因：浏览器会对html文本进行一定的规范化，所以会自动在路径中加入tbody，导致读取失败，直接在路径中去除tbody即可。直接复制xpath地址时含tbody，需要去掉，否则返回空，可在scrapy shell中测试。参考：https://www.cnblogs.com/hailong88/p/10565762.html

补充根据cookie爬取vip章节。晋江小说的普通章节和vip章节的url是不一样的，所以爬完一整部小说需要先爬完普通章节，再换url爬vip章节（已经购买的）。修改book.py

# start_urls = ['http://www.jjwxc.net/onebook.php?novelid=3921595&chapterid=1'] headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36', 'cookie':'此处为登录账号后的网站cookie' } def start_requests(self): url = '更改为vip章节的url' yield scrapy.Request(url, headers=self.headers) def parse(self, response): ... if next_url: #如果不是第一章 next_url = response.urljoin(next_url) # next_url的绝对地址 yield scrapy.Request(next_url, callback=self.parse,headers=self.headers) else:#如果是第一章 if next_url0: # 如果找到下一页的URL,得到绝对路径，构造新的Request对象 next_url = response.urljoin(next_url0) # next_url的绝对地址 yield scrapy.Request(next_url,callback=self.parse,headers=self.headers)

`settings.py’

COOKIES_ENABLED = False

【本文地址】

公司简介

联系我们