Downloading Danmaku (Bullet Comments) from Major Video Sites

2023-10-31 04:23 | Source: compiled from the web

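Most major video sites today (bilibili, Tencent Video, iQiyi, Youku, Mango TV, and others) support danmaku, the comments that scroll across the video. This article shows how to download a video's danmaku as an .xml file and convert it to an .ass subtitle file for local playback.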

XML danmaku format

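Bilibili was one of the earliest danmaku sites and its tooling is mature: danmaku can be downloaded directly as XML, which is very convenient. All danmaku downloaded in this article are therefore saved in a simplified version of Bilibili's XML format.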

    <?xml version="1.0" encoding="UTF-8"?>
    <i>
    <d p="...">This is a danmaku</d>
    ...
    </i>

The p attribute of each danmaku is a comma-separated list with the following fields:

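1. Time at which the danmaku appears, relative to the start of the video, in seconds
2. Danmaku type: 1-3 scrolling, 4 bottom, 5 top, 6 reverse, 7 positioned, 8 advanced
3. Font size: 25 is medium and 18 is small. Bilibili only has these two sizes; for local playback, 20 works well (on a 1920*1080 screen)
4. Color: the RGB color converted to a decimal value; 16777215 is white
5. Time at which the danmaku was sent, as a Unix timestamp
6. Danmaku pool: 0 normal, 1 subtitle, 2 special
7. Sender id
8. Danmaku id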

In practice only the first four fields are needed.
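As an illustration of the format above, the sketch below builds and parses a `<d>` element using only the first four `p` fields. The helper names (`rgb_to_decimal`, `make_danmaku`, `parse_p`) are hypothetical and not part of any site's API. Note one deliberate difference from the scripts in this article: the sketch escapes special characters with the standard-library `xml.sax.saxutils.escape`, whereas the article's code simply drops comments that contain them.

```python
from xml.sax.saxutils import escape


def rgb_to_decimal(r, g, b):
    """Pack 8-bit RGB channels into the decimal colour value used in `p`."""
    return (r << 16) | (g << 8) | b


def make_danmaku(time_s, mode, size, color, text):
    """Build one <d> element from the first four `p` fields plus the text."""
    p = f"{time_s},{mode},{size},{color}"
    return f'<d p="{p}">{escape(text)}</d>'


def parse_p(p):
    """Return (time, type, size, colour) from a `p` attribute string."""
    time_s, mode, size, color = p.split(",")[:4]
    return float(time_s), int(mode), int(size), int(color)


white = rgb_to_decimal(255, 255, 255)
print(make_danmaku(12.5, 1, 25, white, "hello"))
# <d p="12.5,1,25,16777215">hello</d>
print(parse_p("12.5,1,25,16777215,1698700000,0,abc,123"))
# (12.5, 1, 25, 16777215)
```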

In Python, page content can be fetched with the standard-library urllib.request module:

    import urllib.request

    def get_response(url):
        # Send a browser-like User-Agent header with the request.
        req = urllib.request.Request(url)
        req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36')
        response = urllib.request.urlopen(req).read().decode('utf-8')
        return response

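When writing the XML danmaku file, each comment must be checked for characters that are illegal in XML. A blacklist of unwanted words can also be applied: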

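    filename = 'XML/' + title + '.xml'
    contents = []            # comments already written, used to skip duplicates
    black_list = ['word']    # comments containing any of these words are skipped
    with open(filename, 'w', encoding='utf-8') as fout:
        fout.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        fout.write('<i>\n')
        # Site-specific loop: j is one parsed comment; timepoint, ct, size and
        # color are filled in from the site's own fields (see the sections below).
        for j in danmu['comments']:
            content = j['content']
            # Skip comments containing characters that are illegal in XML.
            if any(char in content for char in ['<', '>', '&', '\u0000', '\b']):
                continue
            if content not in contents and all(word not in content for word in black_list):
                contents.append(content)
                fout.write('<d p="' + str(timepoint) + ',' + str(ct) + ',' + str(size) + ',' + str(color) + '">' + content + '</d>\n')
        fout.write('</i>')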
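The three checks described here (illegal characters, blacklist, duplicates) can be isolated into one small function that is easy to test without touching the network. This is only a sketch: `filter_danmaku` is a hypothetical helper, and the illegal-character set is assumed to be `<`, `>`, `&`, NUL, and backspace.

```python
# Characters assumed to be unsafe when text is written into XML unescaped.
ILLEGAL_CHARS = ('<', '>', '&', '\u0000', '\b')


def filter_danmaku(texts, black_list=()):
    """Return texts in order, dropping illegal, blacklisted and repeated ones."""
    seen = set()
    kept = []
    for text in texts:
        if any(ch in text for ch in ILLEGAL_CHARS):
            continue  # would break the XML file
        if any(word in text for word in black_list):
            continue  # contains a blacklisted word
        if text in seen:
            continue  # duplicate of an earlier comment
        seen.add(text)
        kept.append(text)
    return kept


sample = ['good', 'a < b', 'good', 'spoiler: he dies', 'nice']
print(filter_danmaku(sample, black_list=['spoiler']))
# ['good', 'nice']
```

Using a set for `seen` keeps the duplicate check fast on long videos, where a list lookup per comment becomes noticeably slower.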

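Many tools available online (such as the danmaku-to-ASS converter) can convert an XML danmaku file into an ASS subtitle file. The settings below customize that converter: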

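    // Settings suited to watching the video at 2x speed
    var config = {
      'playResX': 1440,           // screen resolution width (pixels)
      'playResY': 810,            // screen resolution height (pixels)
      'fontlist': [               // fonts (the first available one is used)
        '黑体',
        'Microsoft YaHei UI',
        'Microsoft YaHei',
        '文泉驿正黑',
        'STHeitiSC',
      ],
      'font_size': 1.2,           // font size (ratio)
      'r2ltime': 20,              // duration of right-to-left danmaku (seconds)
      'fixtime': 5,               // duration of fixed danmaku (seconds)
      'opacity': 0.8,             // opacity (ratio)
      'space': 0,                 // minimum horizontal gap between danmaku (pixels)
      'max_delay': 6,             // maximum delay allowed before a danmaku appears (seconds)
      'bottom': 0,                // space reserved at the bottom for subtitles (pixels)
      'use_canvas': true,         // measure text width with canvas (boolean; defaults to false in Firefox on Linux, true elsewhere; Firefox bug #561361)
      'debug': false,             // print debug information
    };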
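Any XML-to-ASS conversion has to translate two of the `p` fields into ASS notation: the time in seconds becomes an `H:MM:SS.cc` timestamp, and the decimal RGB color becomes a `&HBBGGRR&` value (ASS stores the channels in blue-green-red order). The sketch below shows just those two conversions; it is not the converter's own code, and the function names are made up for illustration.

```python
def ass_time(seconds):
    """Format seconds as an ASS timestamp H:MM:SS.cc (cc = centiseconds)."""
    total_cs = int(round(seconds * 100))
    hours, rem = divmod(total_cs, 360000)
    minutes, rem = divmod(rem, 6000)
    secs, cs = divmod(rem, 100)
    return f"{hours}:{minutes:02d}:{secs:02d}.{cs:02d}"


def ass_color(decimal_rgb):
    """Convert a decimal RGB colour to the ASS &HBBGGRR& form."""
    r = (decimal_rgb >> 16) & 0xFF
    g = (decimal_rgb >> 8) & 0xFF
    b = decimal_rgb & 0xFF
    return f"&H{b:02X}{g:02X}{r:02X}&"


print(ass_time(75.5))        # 0:01:15.50
print(ass_color(16777215))   # &HFFFFFF&
print(ass_color(0xFF0000))   # &H0000FF&  (pure red)
```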

Downloading danmaku from Tencent Video

Open a Tencent Video page in a desktop browser and look for the VIDEO_INFO field in the page source:

    var VIDEO_INFO = { ... };

With the values omitted, the object has these keys: publish_date, leading_actor_id, duration, guests, race_teams_id, type_name, tag, singer_id, episode, race_stars_id, srcsite_name, type, title, leading_actor, show_type, singer_name, danmu_status, second_title, positive_trailer, athlete, mv_stars, trytime_second, c_full, update_flag, first_recommand, desc, pioneer_tag, begin_time, upload_qq, category_map, is_trailer, stars_name, pic_640_360, c_title_segment, guests_id, presenter_id, upload_src, athlete_id, sec_recommand, costar_id, relative_stars_id, relative_stars, drm, modify_time, tail_time, valid_tag_id, vid, pic_url, costar, race_teams_name, c_title_output, director_id, title_en, stars, danmu, mv_stars_id, playright, presenter, race_stars, view_all_count, c_tags_flag, c_has_adv_danmu, head_time, state, copyright_id, pic160x90, director, famous_id, pioneer_tag_ids, trytime, famous_actor, video_checkup_time, isFull.

The fields needed are duration, title, and vid. Next, use vid to look up targetid: http://bullet.video.qq.com/fcgi-bin/target/regist?otype=json&vid=(%vid%). Opening this link returns:

    QZOutputJson = {
        "danmukey": "bubble_flag=&targetid=&vid=&type=",
        "display": ,
        "is_has_adv": ,
        "is_has_bubble": ,
        "open": ,
        "returncode": ,
        "returnmsg": ,
        "targetid": ,
        "userstatus":
    }

The danmaku can then be fetched with targetid: http://mfm.video.qq.com/danmu?timestamp=(%timestamp%)&target_id=(%targetid%), where timestamp starts at 0 and increases in steps of 30. Opening this link returns (only the first danmaku is shown):

    {
        "err_code": ,
        "err_msg": ,
        "peroid": ,
        "target_id": ,
        "count": ,
        "tol_up": ,
        "single_max_count": ,
        "session_key": ,
        "comments": [
            {
                "commentid": ,
                "content": ,
                "upcount": ,
                "isfriend": ,
                "isop": ,
                "isself": ,
                "timepoint": ,
                "headurl": ,
                "opername": ,
                "bb_bcolor": ,
                "bb_head": ,
                "bb_level": ,
                "bb_id": ,
                "rich_type": ,
                "uservip_degree": ,
                "content_style": "{\"color\":\"\",\"position\":}"
            }
        ]
    }

The timepoint field, the color inside content_style, and the content field can be combined into the XML danmaku format. The full Python code is:

    import json
    import os

    import requests

    ILLEGAL_XML_CHARS = ['<', '>', '&', '\u0000', '\b']

    def getHTMLText(url):
        """Return the page body as text, or '' if the request fails."""
        try:
            r = requests.get(url, timeout=30)
            r.raise_for_status()
            r.encoding = 'utf-8'
            return r.text
        except Exception as e:
            print(e)
            return ''

    def extract_json(text):
        """Parse the {...} object embedded in text such as 'var X = {...};'."""
        return json.loads(text[text.index('{'):text.rindex('}') + 1])

    def get_tencent_danmu(url):
        # VIDEO_INFO is expected on a single line of the page source.
        info_line = [s for s in getHTMLText(url).split('\n') if 'VIDEO_INFO' in s][0]
        video_info = extract_json(info_line)
        duration = video_info['duration']
        title = video_info['title']
        vid = video_info['vid']
        targetid = extract_json(getHTMLText('http://bullet.video.qq.com/fcgi-bin/target/regist?otype=json&vid=' + vid))['targetid']

        os.makedirs('XML', exist_ok=True)
        filename = 'XML/' + title + '.xml'
        contents = []            # comments already written, used to skip duplicates
        black_list = ['word']    # comments containing any of these words are skipped
        print('\n' + title + ': ', end='')
        with open(filename, 'w', encoding='utf-8') as fout:
            fout.write('<?xml version="1.0" encoding="UTF-8"?>\n')
            fout.write('<i>\n')
            for i in range(int(duration) // 30 + 1):
                timestamp = i * 30    # each request covers a 30-second window
                print(i / 2, end='min, ')
                response = getHTMLText('http://mfm.video.qq.com/danmu?timestamp=' + str(timestamp) + '&target_id=' + str(targetid))
                if response == '':
                    continue
                try:
                    danmu = json.loads(response, strict=False)
                    for j in danmu['comments']:
                        content = j['content']        # danmaku text
                        if any(char in content for char in ILLEGAL_XML_CHARS):
                            continue
                        timepoint = j['timepoint']    # time within the video
                        ct = 1                        # danmaku type (scrolling)
                        size = 20                     # font size
                        # The color is a hex string inside the content_style JSON.
                        if 'color' in j['content_style']:
                            color = int(json.loads(j['content_style'])['color'], 16)
                        else:
                            color = 16777215
                        # If the text contains ':', keep the segment after the first one.
                        if ':' in content:
                            content = content.split(':')[1].strip()
                        if content not in contents and all(word not in content for word in black_list):
                            contents.append(content)
                            fout.write('<d p="' + str(timepoint) + ',' + str(ct) + ',' + str(size) + ',' + str(color) + '">' + content + '</d>\n')
                except Exception:
                    continue
            fout.write('</i>')
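The Tencent flow has three small pieces of logic that can be checked offline: unwrapping the `QZOutputJson=...;` response, listing the timestamps to request, and reading the hex color out of `content_style`. The sketch below isolates them. The function names and the sample values are illustrative assumptions, not real API output.

```python
import json


def strip_jsonp(text):
    """Extract the JSON object from a wrapper such as 'QZOutputJson={...};'."""
    return json.loads(text[text.index('{'):text.rindex('}') + 1])


def window_starts(duration, step=30):
    """Timestamps to request: 0, 30, 60, ... up to the video duration."""
    return list(range(0, int(duration) + 1, step))


def parse_color(content_style, default=16777215):
    """Return the decimal colour from a content_style JSON string."""
    if not content_style:
        return default
    try:
        style = json.loads(content_style)
    except ValueError:
        return default
    if not isinstance(style, dict) or not style.get('color'):
        return default
    return int(style['color'], 16)  # hex string -> decimal


# Hypothetical sample values, for illustration only.
print(strip_jsonp('QZOutputJson={"targetid":"123"};')['targetid'])  # 123
print(window_starts(95))                                            # [0, 30, 60, 90]
print(parse_color('{"color":"ff0000","position":1}'))               # 16711680
```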

Downloading danmaku from iQiyi

Open an iQiyi page in a desktop browser and look for the page-info field in the page source:

With the values omitted, the page-info object has these keys: albumId, albumName, imageUrl, tvId, vid, cid, isSource, contentType, vType, pType, pageNo, pageType, userId, pageUrl, tvName, isfeizhengpian, categoryName, categories, downloadAllowed, publicLevel, payMark, payMarkUrl, vipType (a list), qiyiProduced, exclusive, tvYear, duration (a string of the form "::"), wallId, rewardAllowed, commentAllowed, heatShowTypes, videoTemplate, issueTime.

The fields needed are duration, tvName, albumId, tvId, and cid. duration is converted from 'hours:minutes:seconds' to seconds:

    duration_str = page_info['duration'].split(':')
    duration = 0
    for i in range(len(duration_str) - 1):
        duration = (duration + int(duration_str[i])) * 60
    duration = duration + int(duration_str[-1])

The danmaku can then be fetched with albumId, tvId, and cid: http://cmts.iqiyi.com/bullet/(%tvId[-4:-2]%)/(%tvId[-2:]%)/(%tvId%)_300_(%page%).z?rn=0.(%16 random digits%)&business=danmu&is_iqiyi=true&is_video_page=true&tvid=(%tvid%)&albumid=(%albumid%)&categoryid=(%cid%)&qypid=01010021010000000000. The path needs two slices of tvId: the fourth- and third-from-last digits, and the last two digits. page starts at 1 and increases by 1. Opening this link returns a file named (%tvId%)_300_(%page%).z, a compressed byte stream that must be decompressed.

In Python, use the zlib library: dec = zlib.decompressobj(32 + zlib.MAX_WBITS) and b = dec.decompress(<bytes of the .z file>).decode("utf-8") give the danmaku in XML format. The XML sample is not reproduced here; each comment is a bulletInfo element.

The showTime, color, and content fields can be combined into the XML danmaku format (color must be converted from hexadecimal to decimal). The full Python code is:

    import json
    import os
    import re
    import zlib
    import xml.etree.ElementTree as ET
    from random import randint

    import requests

    ILLEGAL_XML_CHARS = ['<', '>', '&', '\u0000', '\b']

    def getHTMLText(url, encode):
        """Return the response as text (encode='utf-8') or raw bytes (encode='byte')."""
        try:
            r = requests.get(url, timeout=30)
            r.raise_for_status()
            if encode == 'utf-8':
                r.encoding = 'utf-8'
                return r.text
            elif encode == 'byte':
                return r.content
        except Exception as e:
            print(e)
            return ''

    def get_iqiyi_danmu(url):
        page_info = json.loads(re.search(r'page-info=\'(.*)\'( *):video-info', getHTMLText(url, 'utf-8')).group(1))
        # Convert 'hours:minutes:seconds' to seconds.
        duration_str = page_info['duration'].split(':')
        duration = 0
        for i in range(len(duration_str) - 1):
            duration = (duration + int(duration_str[i])) * 60
        duration = duration + int(duration_str[-1])
        title = page_info['tvName']
        albumid = page_info['albumId']
        tvid = str(page_info['tvId'])
        categoryid = page_info['cid']

        os.makedirs('XML', exist_ok=True)
        filename = 'XML/' + title + '.xml'
        contents = []            # comments already written, used to skip duplicates
        black_list = ['word']    # comments containing any of these words are skipped
        with open(filename, 'w', encoding='utf-8') as fout:
            fout.write('<?xml version="1.0" encoding="UTF-8"?>\n')
            fout.write('<i>\n')
            for i in range(duration // (60 * 5) + 1):    # one page per 300 seconds
                page_url = ('http://cmts.iqiyi.com/bullet/' + tvid[-4:-2] + '/' + tvid[-2:] + '/' + tvid + '_300_' + str(i + 1)
                            + '.z?rn=0.' + ''.join(str(randint(0, 9)) for _ in range(16))
                            + '&business=danmu&is_iqiyi=true&is_video_page=true&tvid=' + tvid
                            + '&albumid=' + str(albumid) + '&categoryid=' + str(categoryid)
                            + '&qypid=01010021010000000000')
                try:
                    dec = zlib.decompressobj(32 + zlib.MAX_WBITS)
                    b = dec.decompress(getHTMLText(page_url, 'byte'))
                    root = ET.fromstring(b.decode('utf-8'))
                    print('page: ' + str(i + 1))
                except Exception as e:
                    # Skip pages that are missing or cannot be decoded.
                    print('page not found: ' + str(i + 1), e)
                    continue
                for bulletInfo in root.iter('bulletInfo'):
                    timepoint = bulletInfo[3].text         # showTime
                    ct = 1                                 # danmaku type (scrolling)
                    size = 20                              # font size
                    color = int(bulletInfo[5].text, 16)    # color, hex to decimal
                    content = bulletInfo[1].text           # danmaku text
                    if not content or any(char in content for char in ILLEGAL_XML_CHARS):
                        continue
                    if content not in contents and all(word not in content for word in black_list):
                        contents.append(content)
                        fout.write('<d p="' + str(timepoint) + ',' + str(ct) + ',' + str(size) + ',' + str(color) + '">' + content + '</d>\n')
            fout.write('</i>')
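The iQiyi steps that do not need the network can be sketched and tested on their own: converting the duration string, building the path portion of the bullet URL from `tvId`, and inflating the `.z` payload. In the sketch below the function names and the sample `tvId` are hypothetical. The value `32 + zlib.MAX_WBITS` tells zlib to detect a zlib or gzip header automatically, which is why the round-trip check works with both.

```python
import gzip
import zlib


def hms_to_seconds(text):
    """Convert 'hh:mm:ss' or 'mm:ss' to a number of seconds."""
    seconds = 0
    for part in text.split(':'):
        seconds = seconds * 60 + int(part)
    return seconds


def bullet_path(tv_id, page):
    """Path part of the bullet URL: two slices of tvId, then the file name."""
    tv_id = str(tv_id)
    return f"{tv_id[-4:-2]}/{tv_id[-2:]}/{tv_id}_300_{page}.z"


def inflate(data):
    """Decompress bytes, auto-detecting a zlib or gzip header."""
    return zlib.decompressobj(32 + zlib.MAX_WBITS).decompress(data)


print(hms_to_seconds('01:02:03'))        # 3723
print(bullet_path(1234567800, 1))        # 78/00/1234567800_300_1.z
print(inflate(zlib.compress(b'<danmu/>')))  # b'<danmu/>'
```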

Downloading danmaku from Youku

Open a Youku page in a desktop browser and look for the window.PageConfig field in the page source:

    window.PageConfig = { ... };
    var str = "&ct=c&cs=&td=&s=&v=&u=&paid=&tt=";

With the values omitted, window.PageConfig has these keys: transfer_mode, isDRM, videoCategoryId, isSimple, videoId, newVersion, isDebug, pid, homeHost, youku_homeurl, catId, playmode, videoOwner, videoOwner_en, videoId2, currentEncodeVid, catName, seconds, bullet, transfer, panorama, folderId, fpos, forder, ftotalpos, showid_en, showid, cp, paid, showtype, tabs, singerId, loadinglogo, lottery_open_sidetool, lottery_id_sidetool, lottery_sidetool, page (an object with type, isdatetype, year, firstMon, lastMon, currMon, episodeLast, parentvideoid, compeleted), copytoclip, playerUrl.

The fields needed are seconds, tt, and videoId. The danmaku can then be fetched with videoId: https://service.danmu.youku.com/list?mat=(%mat%)&ct=1001&iid=(%videoId%), where mat starts at 0 and increases by 1. Opening this link returns (only the first danmaku is shown):

    {
        "count": ,
        "filtered": ,
        "result": [{
            "aid": ,
            "content": "",
            "createtime": ,
            "ct": ,
            "extFields": {
                "voteUp":
            },
            "id": ,
            "iid": ,
            "ipaddr": ,
            "level": ,
            "lid": ,
            "mat": ,
            "ouid": ,
            "playat": ,
            "propertis": "{\"pos\":,\"size\":,\"effect\":,\"color\":,\"dmfid\":}",
            "status": ,
            "type": ,
            "uid": ,
            "ver":
        }],
        "scm": "0"
    }

The playat field, the color inside propertis, and the content field can be combined into the XML danmaku format. The full Python code is:

    import json
    import os
    import re
    import urllib.request

    ILLEGAL_XML_CHARS = ['<', '>', '&', '\u0000', '\b']

    def get_response(url):
        req = urllib.request.Request(url)
        req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36')
        response = urllib.request.urlopen(req).read().decode('utf-8')
        return response

    def get_youku_danmu(url):
        res = get_response(url)
        title = re.search(r'<title>(.*)</title>', res).group(1).split('—')[0]
        iid = re.search(r'videoId: \'(\d*)\'', res).group(1)
        duration = float(re.search(r'seconds: \'(.*)\',', res).group(1))

        os.makedirs('XML', exist_ok=True)
        filename = 'XML/' + title.split('集 ')[0] + '.xml'
        contents = []            # comments already written, used to skip duplicates
        black_list = ['word']    # comments containing any of these words are skipped
        with open(filename, 'w', encoding='utf-8') as fout:
            fout.write('<?xml version="1.0" encoding="UTF-8"?>\n')
            fout.write('<i>\n')
            for mat in range(int(duration) // 60 + 1):    # one request per minute
                response = get_response('https://service.danmu.youku.com/list?mat=' + str(mat) + '&ct=1001&iid=' + iid)
                danmu = json.loads(response)
                print(str(mat) + '\tresult:' + str(len(danmu['result'])))
                for item in danmu['result']:
                    content = item['content']         # danmaku text
                    if any(char in content for char in ILLEGAL_XML_CHARS):
                        continue
                    playat = item['playat'] / 1000    # time within the video, ms to s
                    ct = 1                            # danmaku type (scrolling)
                    size = 20                         # font size
                    # The color is stored inside the propertis JSON string.
                    if 'color' in item['propertis']:
                        color = json.loads(item['propertis'])['color']
                    else:
                        color = 16777215
                    if content not in contents and all(word not in content for word in black_list):
                        contents.append(content)
                        fout.write('<d p="' + str(playat) + ',' + str(ct) + ',' + str(size) + ',' + str(color) + '">' + content + '</d>\n')
            fout.write('</i>')
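The mapping from one Youku result item to the four XML fields can be written as a pure function, which makes the millisecond-to-second conversion and the color fallback easy to verify. This is a sketch only: `to_xml_fields` and `minute_indices` are hypothetical names, the sample item is invented, and the type (1) and size (20) are the fixed values used throughout this article.

```python
import json


def to_xml_fields(item, default_color=16777215):
    """Map one result item to (time_s, type, size, colour, text)."""
    props = json.loads(item.get('propertis') or '{}')
    color = int(props.get('color', default_color))
    time_s = item['playat'] / 1000  # playat is in milliseconds
    return time_s, 1, 20, color, item['content']


def minute_indices(seconds):
    """Values of `mat` to request: one per minute of video."""
    return range(int(seconds) // 60 + 1)


# Invented sample item, for illustration only.
item = {'playat': 61500, 'content': 'hi',
        'propertis': '{"color":16711680,"size":1}'}
print(to_xml_fields(item))              # (61.5, 1, 20, 16711680, 'hi')
print(list(minute_indices(125.7)))      # [0, 1, 2]
```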

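Downloading danmaku from Mango TV

Open a Mango TV page in a desktop browser. Splitting its URL (for example https://www.mgtv.com/b/9015/4828668.html) on /, the second-to-last segment is cid and the last segment is vid. The title can be read from the title element in the page source (霸王别姬 - 视频在线观看 - 霸王别姬 - 芒果TV). The danmaku can then be fetched with cid and vid: https://galaxy.bz.mgtv.com/rdbarrage?vid=(%vid%)&cid=(%cid%)&time=(%time%), where time starts at 0 and the next value of time is taken from the response. Opening this link returns (only the first danmaku is shown):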

    {
        "status": ,
        "msg": "操作成功",
        "seq": "",
        "data": {
            "next": ,
            "interval": ,
            "items": [
                {
                    "id": ,
                    "type": ,
                    "uid": ,
                    "content": ,
                    "time":
                }
            ]
        }
    }

The time and content fields can be combined into the XML danmaku format. The full Python code is:

    import io
    import json
    import os
    import sys
    import urllib.request

    # Force UTF-8 console output so that Chinese text prints correctly.
    sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf8')

    ILLEGAL_XML_CHARS = ['<', '>', '&', '\u0000', '\b']

    def get_response(url):
        req = urllib.request.Request(url)
        req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36')
        response = urllib.request.urlopen(req).read().decode('utf-8')
        return response

    def get_mangguo_danmu(url):
        cid = url.split('/')[4]
        vid = url.split('/')[5].split('.')[0]    # drop the '.html' suffix
        # Look up the title for this video (not a hard-coded vid/cid pair).
        video_info = json.loads(get_response('https://pcweb.api.mgtv.com/video/info?vid=' + vid + '&cid=' + cid))
        title = video_info['data']['info']['videoName']

        os.makedirs('XML', exist_ok=True)
        filename = 'XML/' + title + '.xml'
        contents = []            # comments already written, used to skip duplicates
        black_list = ['word']    # comments containing any of these words are skipped
        with open(filename, 'w', encoding='utf-8') as fout:
            fout.write('<?xml version="1.0" encoding="UTF-8"?>\n')
            fout.write('<i>\n')
            time = 0
            while True:
                page_url = 'https://galaxy.bz.mgtv.com/rdbarrage?version=2.0.0&vid=' + vid + '&cid=' + cid + '&time=' + str(time)
                print(page_url)
                danmu = json.loads(get_response(page_url))
                if danmu['data']['items'] is None:    # no more danmaku
                    break
                for j in danmu['data']['items']:
                    content = j['content']        # danmaku text
                    if any(char in content for char in ILLEGAL_XML_CHARS):
                        continue
                    timepoint = j['time'] / 1000  # time within the video, ms to s
                    ct = 1                        # danmaku type (scrolling)
                    size = 20                     # font size
                    color = 16777215              # danmaku color (white)
                    if content not in contents and all(word not in content for word in black_list):
                        contents.append(content)
                        fout.write('<d p="' + str(timepoint) + ',' + str(ct) + ',' + str(size) + ',' + str(color) + '">' + content + '</d>\n')
                time = danmu['data']['next']      # cursor for the next request
            fout.write('</i>')
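Two parts of the Mango TV flow are worth isolating: pulling cid and vid out of the page URL, and following the `next` cursor until a page has no items. The sketch below takes the fetch step as a function argument so the loop can be exercised with fake pages. `parse_mgtv_url` and `fetch_all` are hypothetical names, the fake responses are invented, and the guard against a repeated cursor is an added safety measure that the original script does not have.

```python
from urllib.parse import urlparse


def parse_mgtv_url(url):
    """Return (cid, vid) from a URL like https://www.mgtv.com/b/<cid>/<vid>.html."""
    parts = urlparse(url).path.strip('/').split('/')
    cid, vid = parts[-2], parts[-1]
    if vid.endswith('.html'):
        vid = vid[:-len('.html')]
    return cid, vid


def fetch_all(fetch_page, start=0):
    """Collect items by following data.next until a page has no items."""
    items, time, seen = [], start, set()
    while time not in seen:          # stop if the cursor ever repeats
        seen.add(time)
        data = fetch_page(time)['data']
        if not data.get('items'):
            break
        items.extend(data['items'])
        time = data['next']
    return items


# Fake responses keyed by the `time` parameter, for illustration only.
pages = {
    0: {'data': {'next': 60000, 'items': [{'time': 1000, 'content': 'a'}]}},
    60000: {'data': {'next': 120000, 'items': None}},
}
print(parse_mgtv_url('https://www.mgtv.com/b/9015/4828668.html'))  # ('9015', '4828668')
print(fetch_all(pages.__getitem__))  # [{'time': 1000, 'content': 'a'}]
```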

Downloading videos

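You-Get is a command-line program that provides a convenient way to download media from the web. With you-get you can:
1. Download audio and video from popular sites (see the full list of supported sites)
2. Watch online video in a media player, without the browser and its ads
3. Download images from web pages you like
4. Download any non-HTML content, such as binary files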

you-get mainly targets Linux and other open-source platforms. Since most home computers run Windows, the installation steps for Windows are as follows:

Download the required packages

The following dependencies are required and must be installed separately, unless you use the pre-packaged build on Windows:
1. Python 3
2. FFmpeg or Libav (https://libav.org/)

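Install via pip

The official release of you-get is distributed on PyPI and can be installed from a PyPI mirror with the pip package manager. Make sure to use the Python 3 version of pip:

    $ pip3 install you-get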

Git clone

    $ git clone https://github.com/soimort/you-get.git

The source can be placed in any directory.

Upgrade

Depending on how you-get was installed, upgrade with:

    $ pip3 install --upgrade you-get

or download the latest version:

    $ you-get https://github.com/soimort/you-get/archive/master.zip

Using you-get

Go to the extracted you-get-develop folder and open Windows PowerShell in that directory. Run python you-get followed by the video URL to download the video, which is saved in the you-get-develop directory.
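When several videos need to be downloaded, the command can be driven from Python. The sketch below assumes you-get was installed with pip, so that the `you-get` command is on the PATH (unlike the source-directory invocation described above), and uses its `-o` (`--output-dir`) option to choose the save location. The helper names are hypothetical.

```python
import subprocess


def you_get_command(url, output_dir=None):
    """Build the argument list for the installed `you-get` command."""
    cmd = ['you-get']
    if output_dir:
        cmd += ['-o', output_dir]  # -o / --output-dir: where to save the file
    cmd.append(url)
    return cmd


def download(url, output_dir=None):
    """Run you-get; raises CalledProcessError if it exits with an error."""
    subprocess.run(you_get_command(url, output_dir), check=True)


print(you_get_command('https://example.com/video', 'Videos'))
# ['you-get', '-o', 'Videos', 'https://example.com/video']
```

Passing the arguments as a list rather than a single string avoids shell quoting problems with URLs that contain `&` or `?`.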

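Downloading Tencent Video

Open a Tencent Video playback page, open the developer console (F12), and search the Network tab for "ts.m3u8". Look for a URL similar to the following: https://apd-(32-character string).v.smtcdns.com/moviets.tc.qq.com/(44-character string)/uwMROfz0r5xgoaQXGdGnC2df64hwtZlCglRDKOjEZ_qQW-eC/(160-character string)/(vid).(number).ts.m3u8?ver=4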

This m3u8 file lists the relative addresses of the ts segments:

    #EXTM3U
    #EXT-X-VERSION:
    #EXT-X-MEDIA-SEQUENCE:
    #EXT-X-TARGETDURATION:
    #EXT-X-PLAYLIST-TYPE:
    #EXTINF:(duration),
    0(#)_(vid).(number).(#).ts?index=(number)&start=(number)&end=(number)&brs=(number)&bre=(number)&ver=4

The following code downloads the ts segments and merges them into one file:

    import urllib.request

    def get_response(url):
        # Returns the raw bytes of the response.
        req = urllib.request.Request(url)
        req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36')
        response = urllib.request.urlopen(req).read()
        return response

    url_m3u8 = '(m3u8 URL)'
    EXTM3U = get_response(url_m3u8).decode('utf-8').split('\n')
    # Segment entries are the non-tag lines that name a .ts file.
    ts = [i.strip() for i in EXTM3U if '.ts' in i and not i.startswith('#')]
    # Segment addresses are relative to the directory of the m3u8 file.
    url_header = url_m3u8.rsplit('/', 1)[0]
    with open('(file name).ts', 'wb') as f:
        for i in ts:
            url_ts = url_header + '/' + i
            f.write(get_response(url_ts))
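Resolving the segment addresses is the step most likely to go wrong, so it helps to test it against a small playlist before downloading anything. The sketch below uses the standard-library `urllib.parse.urljoin`, which resolves each relative entry against the m3u8 URL and discards that URL's own query string. `segment_urls`, the host name, and the playlist text are made up for illustration.

```python
from urllib.parse import urljoin


def segment_urls(m3u8_url, playlist_text):
    """Return absolute URLs for every segment line in an m3u8 playlist."""
    urls = []
    for line in playlist_text.splitlines():
        line = line.strip()
        if line and not line.startswith('#'):   # skip tags and blank lines
            urls.append(urljoin(m3u8_url, line))
    return urls


# Invented playlist, for illustration only.
playlist = """#EXTM3U
#EXT-X-VERSION:3
#EXTINF:10.0,
00_abc.1.ts?index=0&ver=4
#EXTINF:10.0,
01_abc.1.ts?index=1&ver=4
#EXT-X-ENDLIST
"""
for u in segment_urls('https://host.example/a/b/abc.1.ts.m3u8?ver=4', playlist):
    print(u)
# https://host.example/a/b/00_abc.1.ts?index=0&ver=4
# https://host.example/a/b/01_abc.1.ts?index=1&ver=4
```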

Batch conversion of danmaku to ASS

Install selenium:

    pip install selenium

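If you use Chrome, check its version number (for example, Chromium 72.0.3626.121) and look up the matching ChromeDriver release at https://chromedriver.storage.googleapis.com/LATEST_RELEASE_72.0.3626. Download the corresponding win32 build from https://chromedriver.storage.googleapis.com/index.html?path=72.0.3626.69/, unzip it, and place it in the Python root directory.

Then modify common.js, changing

    startDownload('\ufeff' + ass, name.replace(/\.[^.]*$/, '') + '.ass');

to

    return ass;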
