Scraping Baidu Tieba
Next, send the request and fetch the response:
def get_data(self, url):
    response = requests.get(url, headers=self.headers)
    # Save a local copy so the page structure can be inspected offline
    with open("temp.html", "wb") as f:
        f.write(response.content)
    return response.content
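Baidu serves "advanced" browsers a page with the thread list wrapped in HTML comments, which is why the parsing step below strips the comment markers before handing the text to lxml. A minimal sketch of the effect on a toy page (the HTML snippet and its ids are invented for illustration):

```python
from lxml import etree

# Toy page imitating Tieba: the thread list sits inside an HTML comment,
# so lxml sees no <a> elements until the comment markers are removed.
raw = ('<html><body><ul id="thread_list">'
       '<!--<li><a href="/p/1">hello</a></li>-->'
       '</ul></body></html>')

hidden = etree.HTML(raw).xpath('//*[@id="thread_list"]//a/text()')
print(hidden)  # [] -- the link is invisible inside the comment

visible = etree.HTML(
    raw.replace("<!--", "").replace("-->", "")
).xpath('//*[@id="thread_list"]//a/text()')
print(visible)  # ['hello']
```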
Then create an element object and extract the posts. The code is as follows:

def parse_data(self, data):
    data = data.decode().replace("<!--", "").replace("-->", "")
    html = etree.HTML(data)
    el_list = html.xpath('//*[@id="thread_list"]/li/div/div[2]/div[1]/div[1]/a')

The second line is needed because "advanced" browser user agents (Baidu's, Google Chrome's, and so on) receive the data wrapped inside HTML comments, which the browser itself then unwraps; stripping the comment markers exposes the data to lxml.

The last line is the XPath of a post, obtained as follows: inspect a post in the browser's developer tools and copy its XPath, then verify that the XPath is accurate enough (check: the first result should indeed be a post). To get an XPath matching all posts, find the index node in the copied path and delete it (if you can't tell which node it is, delete them one at a time until it works). The result is shown in the screenshot in the original post.

Build a dictionary for each post and get the URL of the next page. The code first:

    data_list = []
    for el in el_list:
        temp = {}
        temp['title'] = el.xpath("./text()")[0]
        temp['link'] = 'https://tieba.baidu.com' + el.xpath("./@href")[0]
        data_list.append(temp)
    try:
        # Get the next page; the quoted string is the XPath of the "next page" link
        next_url = 'https:' + html.xpath('//a[contains(text(),"下一页>")]/@href')[0]
    except IndexError:
        next_url = None
    return data_list, next_url

Note that the absolute XPath of the "next page" element differs from page to page, so use the method above to write a relative XPath for it instead: //a[contains(text(),"下一页>")]/@href.

Save the data:

def save_data(self, data_list):
    for data in data_list:
        print(data)

The comments in the run method describe what each step does:

def run(self):
    next_url = self.url
    while True:
        # Send the request and get the response
        data = self.get_data(next_url)
        # Extract the data from the response (the posts and the next-page URL)
        data_list, next_url = self.parse_data(data)
        self.save_data(data_list)
        print(next_url)
        # Decide whether to stop
        if next_url is None:
            break

Finally, run it:

if __name__ == '__main__':
    tieba = Tieba("塞纳河")
    tieba.run()

Partial run results are shown in the screenshot in the original post.

The complete code is below; fill in the items marked with # and it will run:

import requests
from lxml import etree

class Tieba(object):
    def __init__(self, name):
        self.url = "#网址".format(name)  # fill in the URL template; it should contain {} for the forum name
        self.headers = {
            "User-Agent": "#User-Agent"  # fill in your browser's User-Agent string
        }

    def get_data(self, url):
        response = requests.get(url, headers=self.headers)
        with open("temp.html", "wb") as f:
            f.write(response.content)
        return response.content

    def parse_data(self, data):
        data = data.decode().replace("<!--", "").replace("-->", "")
        html = etree.HTML(data)
        el_list = html.xpath('#所爬内容的xpath')  # XPath of the content to scrape
        data_list = []
        for el in el_list:
            temp = {}
            temp['title'] = el.xpath("./text()")[0]
            temp['link'] = 'https://tieba.baidu.com' + el.xpath("./@href")[0]
            data_list.append(temp)
        try:
            next_url = 'https:' + html.xpath('//a[contains(text(),"下一页>")]/@href')[0]
        except IndexError:
            next_url = None
        return data_list, next_url

    def save_data(self, data_list):
        for data in data_list:
            print(data)

    def run(self):
        next_url = self.url
        while True:
            data = self.get_data(next_url)
            data_list, next_url = self.parse_data(data)
            self.save_data(data_list)
            print(next_url)
            if next_url is None:
                break

if __name__ == '__main__':
    tieba = Tieba("#吧名")  # put the forum name in the quotes
    tieba.run()
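The pagination trick can be checked offline: because the XPath keys on the link's text rather than its position, it finds the "next page" link regardless of page layout and fails cleanly on the last page. A small sketch, using toy pages invented for illustration:

```python
from lxml import etree

# Two toy pages: one with a "下一页>" (next page) link, and a last page without one.
page1 = '<html><body><div><a href="//tieba.baidu.com/f?pn=50">下一页></a></div></body></html>'
last_page = '<html><body><div><a href="//tieba.baidu.com/f?pn=0">上一页</a></div></body></html>'

def next_url_of(page_html):
    html = etree.HTML(page_html)
    try:
        # Same relative XPath as in parse_data: match on the link text
        return 'https:' + html.xpath('//a[contains(text(),"下一页>")]/@href')[0]
    except IndexError:
        return None  # no match: we are on the last page, the loop should stop

print(next_url_of(page1))      # https://tieba.baidu.com/f?pn=50
print(next_url_of(last_page))  # None
```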