Python Crawler in Practice: Scraping Every Song on QQ Music
Background
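Lately everyone has been listening to more music, and the "NetEase Cloud blues" (网抑云) meme is everywhere. Let's give QQ Music's popularity a boost as well.

Let's scrape it!

Target: the singer list
Task: save the songs of every singer, from A to Z and across all pages, to local disk and to a database

Inspecting the page URL structure

When we first open the singer list, it loads as a plain HTML page with no query parameters.

Finding what we need

Look for what changes. Clicking the letter A changes the URL: the index parameter appears to be the initial letter, and the page parameter the page number.

As usual, open the browser's Inspect panel and check which parameters the XHR request for a given initial needs. Clicking through A-Z makes one XHR request show up each time, and it returns a JSON data set. Opening the responses shows the fields for every singer. That is a small half of the job done!

Now that we have the XHR endpoint, the next step is to work out which parameters it needs (the full code at the end sends them as a GET query string). Of all the fields, sign and data look odd and unlike ordinary parameters, so we have found the encrypted parameter.

Search the page sources for sign. Since sign should be a variable, appending = to the search term (sign=) makes the search more precise. One hit is inside a JavaScript file, so that should be where the value is generated. Open it!

Calling that code several times shows that data is passed in successfully.

We actually had the singer count from the start; it just was not pointed out. Attentive readers will have spotted it already: the JSON response from earlier has a total field giving the number of singers for the current letter.

From this we can work out the following.

The letter is selected by index. A to Z plus the trailing # makes 27 values, so we need to pass index from 1 to 27.
The page is selected by sin. The first page is 0, the second 80, the third 160, so it is an arithmetic sequence starting at 0 with a common difference of 80. The 80 should mean that each page holds 80 singers.
cur_page should be the current page number, so we change it together with sin.
We now have the total and the page size. Any leftover singers still need a page of their own, so the page count is total / 80 rounded up (equivalently, the floor of total / 80 plus one extra page when there is a remainder).

Getting singer names and IDs

With a crawler written along those lines we can fetch the JSON response. Each singer entry in it has five fields, of which singer_mid and singer_name are the two we need. With those two values we can open the singer's page and download that singer's songs.

Click into any singer and look through the XHR requests again. The one we want is the getSingerSong request. It returns the available data for each of the singer's songs.

Now start playing a song and look for the m4a links to see which parameters they carry. There are seven matching requests, but their sizes show that the first few are not the song itself: a song cannot be only a few KB. Without hesitation, open the last one.

You may have noticed one field in the earlier response: it is also a very long string, but it is special in that it matches the m4a URL.

So let's see which parameters the vkey request needs. The other parameters are the same as before; the one that still has to be filled in is data. Here is what data requires.

guid is not needed for anything.
songmid is the song's mid, which we have already obtained.
uin must be a QQ number for the request to succeed; when not logged in it defaults to 0.
All other parameters are fixed.
An m4a file is binary, so the code must write it in binary mode.

Code optimizations

The data volume is large, so the records are stored in a database.
The download volume is large, so the crawl uses multiple processes: one task for each of A-Z and #.
A lock prevents the threads from writing to the database at the same time.

Full code

crawl.py

#Python3.7
#encoding = utf-8
import math
import os
import threading
from urllib import parse
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

import execjs
import requests

from db import SQLsession, Song

lock = threading.Lock()
headers = {
    'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36',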
    'Referer':'https://y.qq.com/portal/singer_list.html',
}
session = SQLsession()


def get_sign(data):
    # Compile get_sign.js with PyExecJS and call its get_sign function on the payload.
    with open('./QQ音乐/get_sign.js', 'r', encoding='utf-8') as f:
        text = f.read()
    js_data = execjs.compile(text)
    sign = js_data.call('get_sign', data)
    return sign


def myProcess():
    # Singers are split into 27 groups by initial (A-Z and #).
    # One task is submitted per group; the pool runs them on 2 worker processes.
    with ProcessPoolExecutor(max_workers=2) as p:
        for i in range(1, 28):
            p.submit(get_singer_mid, i)


def get_singer_mid(index):
    # index = 1..27
    # Open the singer list page, read singerList and the total number of singers,
    # divide by 80 to get the page count, then fetch every page of singers.
    # The mid of each singer is used for the singer detail page.
    data = '{"comm":{"ct":24,"cv":0},"singerList":'\
        '{"module":"Music.SingerListServer","method":"get_singer_list","param":'\
        '{"area":-100,"sex":-100,"genre":-100,"index":%s,"sin":0,"cur_page":1}}}'%(str(index))
    sign = get_sign(data)
    url = 'https://u.y.qq.com/cgi-bin/musics.fcg?-=getUCGI6720748185279282&g_tk=5381'\
        '&sign={}'\
        '&loginUin=0&hostUin=0&format=json&inCharset=utf8&outCharset=utf-8'\
        '&notice=0&platform=yqq.json&needNewCode=0'\
        '&data={}'.format(sign, parse.quote(data))
    html = requests.get(url, headers=headers).json()
    total = html['singerList']['data']['total']  # number of singers for this initial
    # Round up so a partial last page is still fetched, without requesting an
    # empty extra page when total is an exact multiple of 80.
    pages = math.ceil(int(total) / 80)
    # max_workers must be at least 1, even when there are no pages.
    Thread = ThreadPoolExecutor(max_workers=max(1, pages))
    sin = 0
    # Iterate over every page of singers under this initial.
    for page in range(1, pages + 1):
        data = '{"comm":{"ct":24,"cv":0},"singerList":{"module":"Music.SingerListServer","method":"get_singer_list","param":{"area":-100,"sex":-100,"genre":-100,"index":%s,"sin":%s,"cur_page":%s}}}'%(str(index), str(sin), str(page))
        sign = get_sign(data)
        url = 'https://u.y.qq.com/cgi-bin/musics.fcg?-=getUCGI6720748185279282&g_tk=5381'\
            '&sign={}'\
            '&loginUin=0&hostUin=0&format=json&inCharset=utf8&outCharset=utf-8'\
            '&notice=0&platform=yqq.json&needNewCode=0'\
            '&data={}'.format(sign, parse.quote(data))
        html = requests.get(url, headers=headers).json()
        sings = html['singerList']['data']['singerlist']
        for sing in sings:
            singer_name = sing['singer_name']  # singer name
            mid = sing['singer_mid']  # singer mid
            Thread.submit(get_singer_data, mid=mid,
                          singer_name=singer_name)
        sin += 80


# Fetch a singer's songs.
def get_singer_data(mid, singer_name):
    # Use the singer mid to open the singer detail page, i.e. the page that
    # lists that singer's songs, and read the song information from it.
    data = '{"comm":{"ct":24,"cv":0},"singerSongList":{"method":"GetSingerSongList","param":'\
        '{"order":1,"singerMid":"%s","begin":0,"num":10}'\
        ',"module":"musichall.song_list_server"}}'%(str(mid))
    sign = get_sign(data)
    url = 'https://u.y.qq.com/cgi-bin/musics.fcg?-=getSingerSong4707786209273719'\
        '&g_tk=5381&sign={}&loginUin=0'\
        '&hostUin=0&format=json&inCharset=utf8&outCharset=utf-8&notice=0&platform=yqq.json&needNewCode=0'\
        '&data={}'.format(sign, parse.quote(data))
    html = requests.get(url, headers=headers).json()
    songs_num = html['singerSongList']['data']['totalNum']
    for number in range(0, songs_num, 100):
        data = '{"comm":{"ct":24,"cv":0},"singerSongList":{"method":"GetSingerSongList","param":'\
            '{"order":1,"singerMid":"%s","begin":%s,"num":%s}'\
            ',"module":"musichall.song_list_server"}}'%(str(mid), str(number), str(songs_num))
        sign = get_sign(data)
        url = 'https://u.y.qq.com/cgi-bin/musics.fcg?-=getSingerSong4707786209273719'\
            '&g_tk=5381&sign={}&loginUin=0'\
            '&hostUin=0&format=json&inCharset=utf8&outCharset=utf-8&notice=0&platform=yqq.json&needNewCode=0'\
            '&data={}'.format(sign, parse.quote(data))
        html = requests.get(url, headers=headers).json()
        datas = html['singerSongList']['data']['songList']
        for d in datas:
            sing_name = d['songInfo']['title']
            song_mid = d['songInfo']['mid']
            # "with lock" releases the lock even if the commit raises.
            with lock:
                try:
                    session.add(Song(song_name=sing_name,
                                     song_singer=singer_name,
                                     song_mid=song_mid))
                    session.commit()
                    print('commit')
                except Exception:
                    session.rollback()
                    print('rollback')
            print('Singer: {}\tSong: {}\tSong ID: {}'.format(singer_name, sing_name, song_mid))
            download(song_mid, sing_name, singer_name)


def download(song_mid, sing_name, singer_name):
    qq_number = 'put your QQ number here'
    try:
        qq_number = str(int(qq_number))
    except ValueError:
        # Raising a plain string is a TypeError in Python 3; raise a real exception.
        raise ValueError('QQ number not filled in')
    data = '{"req":{"module":"CDN.SrfCdnDispatchServer","method":"GetCdnDispatch"'\
        ',"param":{"guid":"4803422090","calltype":0,"userip":""}},'\
        '"req_0":{"module":"vkey.GetVkeyServer","method":"CgiGetVkey",'\
        '"param":{"guid":"4803422090","songmid":["%s"],"songtype":[0],'\
        '"uin":"%s","loginflag":1,"platform":"20"}},"comm":{"uin":%s,"format":"json","ct":24,"cv":0}}'%(str(song_mid), str(qq_number), str(qq_number))
    sign = get_sign(data)
    url = 'https://u.y.qq.com/cgi-bin/musics.fcg?-=getplaysongvkey27494207511290925'\
        '&g_tk=1291538537&sign={}&loginUin={}'\
        '&hostUin=0&format=json&inCharset=utf8&outCharset=utf-8&notice=0'\
        '&platform=yqq.json&needNewCode=0&data={}'.format(sign, qq_number, parse.quote(data))
    html = requests.get(url, headers=headers).json()
    try:
        purl = html['req_0']['data']['midurlinfo'][0]['purl']
        url = 'http://119.147.228.27/amobile.music.tc.qq.com/{}'.format(purl)
        html = requests.get(url, headers=headers)
        html.encoding = 'utf-8'
        sing_file_name = '{} -- {}'.format(sing_name, singer_name)
        filename = './QQ音乐/歌曲'
        # exist_ok avoids a race when several threads create the folder at once.
        os.makedirs(filename, exist_ok=True)
        # m4a is binary data, so open the file in 'wb' mode.
        with open('./QQ音乐/歌曲/{}.m4a'.format(sing_file_name), 'wb') as f:
            print('\nDownloading {} .....\n'.format(sing_file_name))
            f.write(html.content)
    except Exception:
        print('Permission check failed, or no matching song was found')


if __name__ == "__main__":
    myProcess()

db.py

from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import sessionmaker, scoped_session
from sqlalchemy.ext.declarative import declarative_base

# The pymysql driver is not used here.
# Install the driver with: pip install mysql-connector-python
# "mysqlconnector" in the engine URL is the official MySQL driver.
engine = create_engine('mysql+mysqlconnector://root:root@localhost:3306/test?charset=utf8',
                       max_overflow=500,  # max connections allowed beyond the pool size
                       pool_size=100,  # connection pool size
                       echo=False,  # whether to print debug output
                       )
Base = declarative_base()


class Song(Base):
    __tablename__ = 'song'
    song_id = Column(Integer, primary_key=True, autoincrement=True)
    song_name = Column(String(64))
    song_ablum = Column(String(64))
    song_mid = Column(String(50))
    song_singer = Column(String(50))


Base.metadata.create_all(engine)
DBsession = sessionmaker(bind=engine)
SQLsession = scoped_session(DBsession)

get_sign.js

this.window = this; var sign = null; !function(n, t) { "object" == typeof exports && "undefined" != typeof module ? module.exports = t() : "function" == typeof define && define.amd ? define(t) : (n = n || self).getSecuritySign = t() } (this, function() { "use strict"; var n = function() { if ("undefined" != typeof self) return self; if ("undefined" != typeof window) return window; if ("undefined" != typeof global) return global; throw new Error("unable to locate global object") } (); n.__sign_hash_20200305 = function(n) { function l(n, t) { var o = (65535 & n) + (65535 & t); return (n >> 16) + (t >> 16) + (o >> 16) > 32 - r, o); var i, r } function g(n, t, o, e, u, p, i) { return r(t & o | ~t & e, n, t, u, p, i) } function a(n, t, o, e, u, p, i) { return r(t & e | o & ~e, n, t, u, p, i) } function s(n, t, o, e, u, p, i) { return r(t ^ o ^ e, n, t, u, p, i) } function v(n, t, o, e, u, p, i) { return r(o ^ (t | ~e), n, t, u, p, i) } function t(n) { return function(n) { var t, o = ""; for (t = 0; t < 32 * n.length; t += 8) o += String.fromCharCode(n[t >> 5] >>> t % 32 & 255); return o } (function(n, t) { n[t >> 5] |= 128 >> 9 > 2) - 1] = void 0, t = 0; t < o.length; t += 1) o[t] = 0; for (t = 0; t < 8 * n.length; t += 8) o[t >> 5] |= (255 & n.charCodeAt(t / 8)) >> 4 & 15) + e.charAt(15 & t); return u } (o(n)) }, function r(f, h, c, l, g) { g = g || [[this], [{}]]; for (var t = [], o = null, n = [function() { return !
0 }, function() {}, function() { g.length = c[h++] }, function() { g.push(c[h++]) }, function() { g.pop() }, function() { var n = c[h++], t = g[g.length - 2 - n]; g[g.length - 2 - n] = g.pop(), g.push(t) }, function() { g.push(g[g.length - 1]) }, function() { g.push([g.pop(), g.pop()].reverse()) }, function() { g.push([l, g.pop()]) }, function() { g.push([g.pop()]) }, function() { var n = g.pop(); g.push(n[0][n[1]]) }, function() { g.push(g[g.pop()[0]][0]) }, function() { var n = g[g.length - 2]; n[0][n[1]] = g[g.length - 1] }, function() { g[g[g.length - 2][0]][0] = g[g.length - 1] }, function() { var n = g.pop(), t = g.pop(); g.push([t[0][t[1]], n]) }, function() { var n = g.pop(); g.push([g[g.pop()][0], n]) }, function() { var n = g.pop(); g.push(delete n[0][n[1]]) }, function() { var n = []; for (var t in g.pop()) n.push(t); g.push(n) }, function() { g[g.length - 1].length ? g.push(g[g.length - 1].shift(), !0) : g.push(void 0, !1) }, function() { var n = g[g.length - 2], t = Object.getOwnPropertyDescriptor(n[0], n[1]) || { configurable: !0, enumerable: !0 }; t.get = g[g.length - 1], Object.defineProperty(n[0], n[1], t) }, function() { var n = g[g.length - 2], t = Object.getOwnPropertyDescriptor(n[0], n[1]) || { configurable: !0, enumerable: !0 }; t.set = g[g.length - 1], Object.defineProperty(n[0], n[1], t) }, function() { h = c[h++] }, function() { var n = c[h++]; g[g.length - 1] && (h = n) }, function() { throw g[g.length - 1] }, function() { var n = c[h++], t = n ? g.slice( - n) : []; g.length -= n, g.push(g.pop().apply(l, t)) }, function() { var n = c[h++], t = n ? g.slice( - n) : []; g.length -= n; var o = g.pop(); g.push(o[0][o[1]].apply(o[0], t)) }, function() { var n = c[h++], t = n ? g.slice( - n) : []; g.length -= n, t.unshift(null), g.push(new(Function.prototype.bind.apply(g.pop(), t))) }, function() { var n = c[h++], t = n ? 
g.slice( - n) : []; g.length -= n, t.unshift(null); var o = g.pop(); g.push(new(Function.prototype.bind.apply(o[0][o[1]], t))) }, function() { g.push(!g.pop()) }, function() { g.push(~g.pop()) }, function() { g.push(typeof g.pop()) }, function() { g[g.length - 2] = g[g.length - 2] == g.pop() }, function() { g[g.length - 2] = g[g.length - 2] === g.pop() }, function() { g[g.length - 2] = g[g.length - 2] > g.pop() }, function() { g[g.length - 2] = g[g.length - 2] >= g.pop() }, function() { g[g.length - 2] = g[g.length - 2] > g.pop() }, function() { g[g.length - 2] = g[g.length - 2] >>> g.pop() }, function() { g[g.length - 2] = g[g.length - 2] + g.pop() }, function() { g[g.length - 2] = g[g.length - 2] - g.pop() }, function() { g[g.length - 2] = g[g.length - 2] * g.pop() }, function() { g[g.length - 2] = g[g.length - 2] / g.pop() }, function() { g[g.length - 2] = g[g.length - 2] % g.pop() }, function() { g[g.length - 2] = g[g.length - 2] | g.pop() }, function() { g[g.length - 2] = g[g.length - 2] & g.pop() }, function() { g[g.length - 2] = g[g.length - 2] ^ g.pop() }, function() { g[g.length - 2] = g[g.length - 2] in g.pop() }, function() { g[g.length - 2] = g[g.length - 2] instanceof g.pop() }, function() { g[g[g.length - 1][0]] = void 0 === g[g[g.length - 1][0]] ? [] : g[g[g.length - 1][0]] }, function() { for (var e = c[h++], u = [], n = c[h++], t = c[h++], p = [], o = 0; o < n; o++) u[c[h++]] = g[c[h++]]; for (var i = 0; i < t; i++) p[i] = c[h++]; g.push(function n() { var t = u.slice(0); t[0] = [this], t[1] = [arguments], t[2] = [n]; for (var o = 0; o < p.length && o < arguments.length; o++) 0 < p[o] && (t[p[o]] = [arguments[o]]); return r(f, e, c, l, t) }) }, function() { t.push([c[h++], g.length, c[h++]]) }, function() { t.pop() }, function() { return !! 
o }, function() { o = null }, function() { g[g.length - 1] += String.fromCharCode(c[h++]) }, function() { g.push("") }, function() { g.push(void 0) }, function() { g.push(null) }, function() { g.push(!0) }, function() { g.push(!1) }, function() { g.length -= c[h++] }, function() { g[g.length - 1] = c[h++] }, function() { var n = g.pop(), t = g[g.length - 1]; t[0][t[1]] = g[n[0]][0] }, function() { var n = g.pop(), t = g[g.length - 1]; t[0][t[1]] = n[0][n[1]] }, function() { var n = g.pop(), t = g[g.length - 1]; g[t[0]][0] = g[n[0]][0] }, function() { var n = g.pop(), t = g[g.length - 1]; g[t[0]][0] = n[0][n[1]] }, function() { g[g.length - 2] = g[g.length - 2] < g.pop() }, function() { g[g.length - 2] = g[g.length - 2] |
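
The paging rule from the walkthrough (sin starts at 0 and grows by 80 per page, cur_page counts from 1, and leftover singers need one more page) can be sketched in a few lines. This is only an illustration: the helper name page_requests and the constant PAGE_SIZE are made up for this example and do not appear in the original crawler.

```python
import math

PAGE_SIZE = 80  # singers per page, as observed in the walkthrough


def page_requests(total, page_size=PAGE_SIZE):
    """Return the (sin, cur_page) pair for every page needed to cover `total` singers.

    Hypothetical helper for illustration; not part of the original crawl.py.
    """
    pages = math.ceil(total / page_size)  # a partial last page still counts
    return [(page * page_size, page + 1) for page in range(pages)]


if __name__ == "__main__":
    for sin, cur_page in page_requests(250):
        print(sin, cur_page)
```

Rounding up means a total that is an exact multiple of 80 does not trigger a request for an extra, empty page, and a total below 80 still yields one page.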
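
The data parameter for the singer list is a compact JSON string, and the same string is both signed and URL-encoded into the query. One possible way to build it without hand-written string formatting is json.dumps with compact separators, which should reproduce the same text as the template used in crawl.py. The function name build_singer_list_data is an assumption made for this sketch; whether the server accepts any other key order or spacing has not been checked here.

```python
import json
from urllib import parse


def build_singer_list_data(index, sin, cur_page):
    """Build the compact JSON payload for the singer list request.

    Hypothetical helper for illustration; field names and values are copied
    from the payload string in crawl.py.
    """
    payload = {
        "comm": {"ct": 24, "cv": 0},
        "singerList": {
            "module": "Music.SingerListServer",
            "method": "get_singer_list",
            "param": {
                "area": -100, "sex": -100, "genre": -100,
                "index": index, "sin": sin, "cur_page": cur_page,
            },
        },
    }
    # separators=(",", ":") removes the spaces json.dumps adds by default.
    return json.dumps(payload, separators=(",", ":"))


if __name__ == "__main__":
    data = build_singer_list_data(1, 0, 1)
    print(data)
    print(parse.quote(data))  # the form placed after &data= in the URL
```

Because the signature is computed over the payload text, the string passed to get_sign and the string passed to parse.quote should be the same object, as they are in crawl.py.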