
Scraping Bilibili 番剧 (anime) info with the Scrapy framework


It feels like a long time since I last wrote a crawler. Browsing Bilibili today, I noticed the site has plenty of things worth scraping, such as the homepage ranking lists. The 番剧 (anime) section looked like an easy place to trace the data source, so I picked the homepage's anime listing to practice on.

Target URL: https://www.bilibili.com/anime/index/#season_version=-1&area=-1&is_finish=-1&copyright=-1&season_status=-1&season_month=-1&year=-1&style_id=-1&order=3&st=1&sort=0&page=1

Watching the requests the page makes and dropping the query parameters that don't affect the response leaves https://api.bilibili.com/pgc/season/index//result?page=1&season_type=1&pagesize=20&type=1. From there, changing only the page value returns successive pages of results, with page going up to 153. The data isn't especially useful in itself, but the code is written out below anyway.
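Before wiring that endpoint into Scrapy, it's worth sanity-checking it directly. Here's a minimal sketch with requests, assuming only the data -> list JSON layout described above (the browser-like User-Agent header is a precaution on my part, not something the API is known to require):

import requests

# One page of the listing API; only the page parameter needs to change.
url = ('https://api.bilibili.com/pgc/season/index//result'
       '?page=1&season_type=1&pagesize=20&type=1')
resp = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})  # precautionary UA
resp.raise_for_status()

# Entries live under data -> list; print a couple of fields per entry.
for entry in resp.json()['data']['list']:
    print(entry['title'], entry['link'])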

A main entry point that runs the spider programmatically, so you don't have to type scrapy crawl name on the command line every time:

# -*- coding: utf-8 -*-
#@Project filename:PythonDemo dramaMain.py
#@IDE    :IntelliJ IDEA
#@Author :ganxiang
#@Date   :2020/03/02 0002 19:16
from scrapy.cmdline import execute
import os
import sys

# Put the project root on sys.path, then run "scrapy crawl drama".
sys.path.append(os.path.dirname(os.path.abspath(__file__)))
execute(['scrapy', 'crawl', 'drama'])

The spider, dramaSeries.py:

# -*- coding: utf-8 -*-
import scrapy
import json
from ..items import DramaseriesItem


class DramaSpider(scrapy.Spider):
    name = 'drama'
    # allowed_domains takes bare domains, not full URLs.
    allowed_domains = ['api.bilibili.com']
    i = 1  # running row number across all pages
    start_urls = ['https://api.bilibili.com/pgc/season/index//result'
                  '?page=%s&season_type=1&pagesize=20&type=1' % s
                  for s in range(1, 101)]

    def parse(self, response):
        # The API returns JSON; the entries live under data -> list.
        drama = json.loads(response.text)
        data_list = drama['data']['list']
        for field in data_list:
            # Create a fresh item per entry instead of reusing one instance.
            item = DramaseriesItem()
            item['number'] = self.i
            item['badge'] = field['badge']
            item['cover_img'] = field['cover']
            item['index_show'] = field['index_show']
            item['link'] = field['link']
            item['media_id'] = field['media_id']
            item['order_type'] = field['order_type']
            item['season_id'] = field['season_id']
            item['title'] = field['title']
            self.i += 1
            yield item
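For reference, the spider assumes each page of JSON is shaped roughly like the sketch below, abridged to the keys parse actually reads. The values are illustrative placeholders, not real API output:

# Abridged shape of one API page, with placeholder values.
sample_page = {
    'data': {
        'list': [
            {
                'badge': '会员专享',        # corner label shown on the cover
                'cover': 'http://i0.hdslb.com/....jpg',
                'index_show': '全12话',     # episode-count blurb
                'link': 'https://www.bilibili.com/bangumi/media/md0000',
                'media_id': 0,             # placeholder id
                'order_type': 'follow_count',
                'season_id': 0,            # placeholder id
                'title': '示例番剧',
            },
            # ...19 more entries per page (pagesize=20)
        ],
    },
}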

items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class DramaseriesItem(scrapy.Item):
    # One field per column in the exported spreadsheet.
    number = scrapy.Field()
    badge = scrapy.Field()
    cover_img = scrapy.Field()
    index_show = scrapy.Field()
    link = scrapy.Field()
    media_id = scrapy.Field()
    order_type = scrapy.Field()
    season_id = scrapy.Field()
    title = scrapy.Field()

pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
from openpyxl import Workbook


class DramaseriesPipeline(object):

    def open_spider(self, spider):
        # Create the workbook and write the header row when the crawl starts.
        self.excelBook = Workbook()
        self.activeSheet = self.excelBook.active
        header = ['number', 'title', 'link', 'media_id', 'season_id',
                  'index_show', 'cover_img', 'badge']
        self.activeSheet.append(header)

    def process_item(self, item, spider):
        # One spreadsheet row per item, in the same order as the header.
        row = [item['number'], item['title'], item['link'], item['media_id'],
               item['season_id'], item['index_show'], item['cover_img'],
               item['badge']]
        self.activeSheet.append(row)
        return item

    def close_spider(self, spider):
        # Save once at the end instead of rewriting the file per item.
        self.excelBook.save('./drama.xlsx')
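Keeping the save in close_spider means openpyxl's in-memory workbook is written to disk exactly once, rather than rewriting the whole .xlsx after every item, which starts to matter once the sheet grows to a few thousand rows.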

The changes to enable in settings.py:

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

ITEM_PIPELINES = {
    'dramaSeries.pipelines.DramaseriesPipeline': 300,
}
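Two more settings worth keeping in mind, sketched below as precautionary assumptions (this crawl worked without them): a browser-like User-Agent in case the API starts rejecting Scrapy's default one, and a small download delay to stay polite.

# Optional, precautionary settings (not required for this crawl):
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
              'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0 Safari/537.36')
DOWNLOAD_DELAY = 0.5  # seconds between requests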

The run collected a little over two thousand records (100 pages at 20 entries each); a lot more could be scraped, since the spider only requests the first 100 of the 153 available pages.


