Java Crawler + Spring Boot + WeChat Mini Program in Practice

2024-04-04 04:42 | Source: compiled from the web

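I built a small project that uses a WeChat Mini Program to display data from a website. The back end is written in Java and uses Spring Boot + WebMagic as an all-in-one solution: every time the front end refreshes, the back end starts the crawler and returns the data to the front end right away. There is no persistence layer.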

WebMagic is an open-source Java crawler framework. Official documentation: http://webmagic.io/docs/zh/

I. The crawler

1. Create a Spring Boot project and add the WebMagic dependencies to the pom:

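<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-core</artifactId>
    <version>0.7.3</version>
</dependency>
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-extension</artifactId>
    <version>0.7.3</version>
    <exclusions>
        <exclusion>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
        </exclusion>
    </exclusions>
</dependency>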

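2. Write the regular expressions and XPath expressions for the page you want to crawl. Regular expressions are mainly used to parse URLs; XPath is used to extract information from HTML tags.
Regex quick start: http://deerchao.net/tutorials/regex/regex.htm
XPath introduction: https://www.w3school.com.cn/xpath/index.asp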
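As a rough illustration of the two tools named in step 2, the sketch below matches the city-page URL with a regular expression and reads a value out of a tag with XPath. It uses only the JDK (java.util.regex and javax.xml.xpath) rather than WebMagic's own selectors. The markup string, the value 57, and the names RegexAndXPathSketch, citySlug and evalXPath are made up for the example. The JDK evaluator also needs well-formed markup, which real pages often are not.

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class RegexAndXPathSketch {

    // Matches city pages such as http://www.pm25.com/city/xian.html and
    // captures the city slug ("xian") in group 1.
    private static final Pattern CITY_URL =
            Pattern.compile("http://www\\.pm25\\.com/city/(\\w+)\\.html");

    /** Returns the city slug from a city-page URL, or null if the URL does not match. */
    static String citySlug(String url) {
        Matcher m = CITY_URL.matcher(url);
        return m.matches() ? m.group(1) : null;
    }

    /** Evaluates an XPath expression against a well-formed XML/XHTML string. */
    static String evalXPath(String xml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath or malformed markup", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(citySlug("http://www.pm25.com/city/xian.html")); // prints "xian"

        // Hypothetical markup, shaped like the element the crawler targets.
        String html = "<div><a class='cbol_aqi_num' href='/news/385.html'>57</a></div>";
        System.out.println(evalXPath(html,
                "//a[@class='cbol_aqi_num' and @href='/news/385.html']/text()")); // prints "57"
    }
}
```

The XPath expression in main is the same one the crawler uses for the AQI field, so this is a quick way to check an expression against a saved snippet before running the spider.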

3. Write the Java code that crawls the page. Page URL: http://www.pm25.com/city/xian.html

package com.example.reptile;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.pipeline.ConsolePipeline;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.selector.Selectable;

public class MyReptileDemo implements PageProcessor {

    // Retry a failed download up to 3 times; sleep 100 ms between requests.
    private Site site = Site.me().setRetryTimes(3).setSleepTime(100);

    public Site getSite() {
        return site;
    }

    public void process(Page page) {
        // City-wide readings for Xi'an, each selected with an XPath expression.
        page.putField("西安_AQI", page.getHtml().xpath("//a[@class='cbol_aqi_num' and @href='/news/385.html']/text()").toString());
        page.putField("西安_pollutant", page.getHtml().xpath("//a[@class=\"cbol_wuranwu_num \" and @href=\"/news/387.html\"]/text()").toString());
        page.putField("西安_pm25", page.getHtml().xpath("//a[@class=\"cbol_nongdu_num \" and @href=\"/news/386.html\"]/span/text()").toString());
        page.putField("西安_level", page.getHtml().xpath("//div[@class=\"cbor_gauge\"]/span/text()").toString());

        // Monitoring stations listed on the page.
        String[] positions = {"高压开关厂", "兴城小区", "纺织城", "市人民体育场", "高新西区", "经开区", "长安区", "阎良区", "临潼区", "草滩", "曲江文化产业集团", "广运潭"};
        // The source text breaks off here at "for(int i=1;i"; the body of the
        // loop over the stations is missing.
    }

    public static void main(String[] args) {
        Spider.create(new MyReptileDemo())
                .addUrl("http://www.pm25.com/city/xian.html")
                .addPipeline(new ConsolePipeline())
                .run();
    }
}

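The crawled data is printed straight to the console. That is not what we want, though, so next we customize the Pipeline interface to hand the results to the MVC layer.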

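II. The web layer

On customizing the Pipeline interface: http://webmagic.io/docs/zh/posts/ch6-custom-componenet/pipeline.html

In short, the approach is to override the process method and save the results in a static variable, then start the crawler inside a Controller method and return that variable to the page as a JSON string. Code: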

package com.example.reptile;

import ...

public class MyReptileDemo implements PageProcessor {
    ... // same as above, omitted
}

@Controller
@RequestMapping("test")
class ReptilePipeline implements Pipeline {

    public ReptilePipeline() {}

    // static, so that the Pipeline instance created in getReptile() and the
    // controller bean managed by Spring share the same result map
    private static Map<String, Object> mapResults;

    @Override
    public void process(ResultItems resultItems, Task task) {
        mapResults = resultItems.getAll();
    }

    @RequestMapping("/reptile")
    @ResponseBody // serializes the returned object to JSON in the HTTP response
    public Map<String, Object> getReptile() {
        // Start the crawler from the web request; run() blocks until the crawl finishes.
        Spider.create(new MyReptileDemo())
                .addUrl("http://www.pm25.com/city/xian.html")
                .addPipeline(new ReptilePipeline())
                .run();
        return mapResults;
    }
}
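To make the shape of the response concrete, here is a small sketch of how a flat result map turns into the JSON the Mini Program receives. In the real project Spring does this through its configured message converter (Jackson by default in Spring Boot); the hand-written serializer below is only a stand-in, and the class name JsonShapeSketch, its methods, and the sample values are invented for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class JsonShapeSketch {

    /** Wraps a string in quotes and escapes characters JSON does not allow raw. */
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.append('"').toString();
    }

    /** Serializes a flat map: null becomes JSON null, any other value a JSON string. */
    static String toJson(Map<String, Object> fields) {
        StringJoiner joiner = new StringJoiner(",", "{", "}");
        for (Map.Entry<String, Object> e : fields.entrySet()) {
            Object v = e.getValue();
            joiner.add(quote(e.getKey()) + ":" + (v == null ? "null" : quote(v.toString())));
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // Hypothetical values; the keys are the field names set with putField.
        Map<String, Object> fields = new LinkedHashMap<>();
        fields.put("西安_AQI", "57");
        fields.put("西安_level", null); // an XPath that matches nothing yields null
        System.out.println(toJson(fields));
    }
}
```

The keys in the JSON are exactly the names passed to putField, so the front end has to read the values under those same names.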

Result: (screenshot not preserved)

III. The front end

The front end is developed with WeChat DevTools and consists of just WXML + JS. The event handler goes in the JS file:

onLoad: function (e) {
  var th = this;
  wx.request({
    url: 'http://localhost:8080/test/reptile',
    method: 'GET',
    header: {
      'content-type': 'application/json'
    },
    success(result) {
      console.log(result.data);
      th.setData({
        map: result.data
      });
    }
  });
}

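The returned JSON is stored in map as key-value pairs. In the WXML file, an expression along the lines of {{map.xian_api}} is all it takes to read a value, where the key is one of the field names returned by the back end. The other parts of the page were not my work. The final result looks like this: (screenshot not preserved)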
