Python crawler: scraping Lianjia housing prices with the Scrapy framework and storing them in MongoDB

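1. Target page: https://dg.lianjia.com/ershoufang/
2. Fields scraped: ① title ② total price ③ community name ④ district name ⑤ detailed info ⑥ the floor area contained in the detailed info
3. Storage: MongoDB

The link above lists second-hand homes in Dongguan. To scrape a different city or district, just change the URL, because the page structure is the same:

https://bj.lianjia.com/ershoufang/ (second-hand homes in Beijing)
https://gz.lianjia.com/ershoufang/ (second-hand homes in Guangzhou)
https://gz.lianjia.com/ershoufang/tianhe (second-hand homes in Tianhe district, Guangzhou)
…

The code follows.

ershoufang_spider.py: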

    import scrapy
    from lianjia_dongguan.items import LianjiaDongguanItem  # the class defined in items.py


    class lianjiadongguanSpider(scrapy.Spider):
        name = "ershoufang"  # the spider's name, needed later to run it
        global start_page
        start_page = 1
        start_urls = ["https://gz.lianjia.com/ershoufang/haizhu/pg" + str(start_page)]

        def parse(self, response):
            # Each listing on the page sits in a <div class="info clear"> block.
            for item in response.xpath('//div[@class="info clear"]'):
                yield {
                    "title": item.xpath('.//div[@class="title"]/a/text()').extract_first().strip(),
                    "Community": item.xpath('.//div[@class="positionInfo"]/a[1]/text()').extract_first(),
                    "district": item.xpath('.//div[@class="positionInfo"]/a[2]/text()').extract_first(),
                    "price": item.xpath('.//div[@class="totalPrice"]/span/text()').extract_first().strip(),
                    # Raw string so that \d and \| reach the regex engine unchanged.
                    "area": item.xpath('.//div[@class="houseInfo"]/text()').re(r"\d室\d厅 \| (.+)平米")[0],
                    "info": item.xpath('.//div[@class="houseInfo"]/text()').extract_first().replace("平米", "㎡").strip()
                }
            i = 1
            while i
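Two details of the spider are easy to get wrong: building the listing URL for a given city, district and page, and pulling the area out of the houseInfo text. The `[0]` index on the regex result raises an IndexError whenever a listing does not match the pattern. Below is a minimal sketch of two helpers that cover both points. The names `page_url` and `parse_area` are hypothetical, the sample strings are made up for illustration, and the city-wide `pg` suffix is assumed to work the same way as the district URL used in the spider.

```python
import re

# Same pattern as in the spider, but non-greedy so it stops at the first "平米".
AREA_RE = re.compile(r"\d室\d厅 \| (.+?)平米")


def page_url(city, district="", page=1):
    """Build a listing URL such as https://gz.lianjia.com/ershoufang/haizhu/pg1."""
    url = "https://{}.lianjia.com/ershoufang/".format(city)
    if district:
        url += district + "/"
    return url + "pg" + str(page)


def parse_area(house_info):
    """Return the floor area as a float, or None when the text does not match."""
    if not house_info:
        return None
    match = AREA_RE.search(house_info)
    if match is None:
        return None
    try:
        return float(match.group(1))
    except ValueError:
        # The captured text was not a plain number.
        return None


if __name__ == "__main__":
    print(page_url("gz", "haizhu", 1))
    # Hypothetical houseInfo text, shaped to fit the spider's regex.
    print(parse_area("2室1厅 | 75.3平米 | 南"))
```

Inside `parse`, the "area" field could then call `parse_area(...)` on the extracted houseInfo text, so that an unusual listing yields None instead of stopping the crawl.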

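For the MongoDB storage step, Scrapy's usual mechanism is an item pipeline. What follows is only a sketch of how such a pipeline might look, not the original author's code. The class name `MongoPipeline`, the setting names `MONGO_URI` and `MONGO_DATABASE`, the database name `lianjia` and the collection name `ershoufang` are all assumptions. It relies on pymongo's `MongoClient` and `insert_one`.

```python
class MongoPipeline:
    """Store each scraped listing as one document in a MongoDB collection."""

    def __init__(self, mongo_uri="mongodb://localhost:27017",
                 mongo_db="lianjia", collection_name="ershoufang"):
        # All three defaults are placeholders; adjust them to your setup.
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db
        self.collection_name = collection_name
        self.client = None
        self.collection = None

    @classmethod
    def from_crawler(cls, crawler):
        # MONGO_URI and MONGO_DATABASE are assumed names for entries in settings.py.
        return cls(
            mongo_uri=crawler.settings.get("MONGO_URI", "mongodb://localhost:27017"),
            mongo_db=crawler.settings.get("MONGO_DATABASE", "lianjia"),
        )

    def open_spider(self, spider):
        # Imported here so the module still loads where pymongo is not installed.
        import pymongo

        self.client = pymongo.MongoClient(self.mongo_uri)
        self.collection = self.client[self.mongo_db][self.collection_name]

    def close_spider(self, spider):
        if self.client is not None:
            self.client.close()

    def process_item(self, item, spider):
        # dict(item) works for plain dicts and for scrapy.Item objects alike.
        self.collection.insert_one(dict(item))
        return item
```

To enable it, the pipeline would be registered in settings.py, for example `ITEM_PIPELINES = {"lianjia_dongguan.pipelines.MongoPipeline": 300}`, assuming the class lives in the project's pipelines.py.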