I wanted to scrape information about the hot videos in each of Bilibili's partitions over the Chinese New Year holiday, as a small exercise in learning web scraping.


First, check robots.txt: https://www.bilibili.com/robots.txt

```
User-agent: Yisouspider
Allow: /
User-agent: Applebot
Allow: /
User-agent: bingbot
Allow: /
User-agent: Sogou inst spider
Allow: /
User-agent: Sogou web spider
Allow: /
User-agent: 360Spider
Allow: /
User-agent: Googlebot
Allow: /
User-agent: Baiduspider
Allow: /
User-agent: Bytespider
Allow: /
User-agent: PetalBot
Allow: /
User-agent: *
Disallow: /
```

This means Bilibili allows Yisouspider, Applebot, bingbot, Sogou inst spider, Sogou web spider, 360Spider, Googlebot, Baiduspider, Bytespider, and PetalBot to access all of its URLs, and forbids every other crawler from accessing any of its content.

So, out of respect for the robots.txt protocol, I'll only scrape a small amount of data (?)
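These rules can also be checked programmatically with the standard library's urllib.robotparser; a minimal sketch (the /video/ path is just an arbitrary example):

```python
from urllib import robotparser

# Fetch and parse the rules quoted above
rp = robotparser.RobotFileParser("https://www.bilibili.com/robots.txt")
rp.read()

# A generic crawler falls under the catch-all "User-agent: *" Disallow rule
print(rp.can_fetch("*", "https://www.bilibili.com/video/"))          # False

# Whitelisted crawlers such as Googlebot may fetch everything
print(rp.can_fetch("Googlebot", "https://www.bilibili.com/video/"))  # True
```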

After some digging through the network requests, I found the one that returns the relevant data.

Examining the URL:

https://s.search.bilibili.com/cate/search?main_ver=v3&search_type=video&view_type=hot_rank&copy_right=-1&new_web_tag=1&order=click&cate_id=24&page=1&pagesize=30&time_from=20230215&time_to=20230222

Pasting the response into https://www.bejson.com to view the formatted JSON:

(screenshot: JSON view of the response)
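The same inspection can be done locally rather than through an online viewer; a minimal sketch, reusing the URL above:

```python
import json
import requests

url = ("https://s.search.bilibili.com/cate/search?main_ver=v3&search_type=video"
       "&view_type=hot_rank&copy_right=-1&new_web_tag=1&order=click&cate_id=24"
       "&page=1&pagesize=30&time_from=20230215&time_to=20230222")

data = requests.get(url).json()
# Pretty-print the first result to see which fields are available
print(json.dumps(data["result"][0], ensure_ascii=False, indent=2))
```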

From observation and a few small tests, I drew the following conclusions (a quick verification sketch follows the list):

  1. cate_id in the URL identifies a sub-partition under one of the main partitions
  2. the videos in the response are sorted by play count (the play field) in descending order
  3. page is the page number of the search
  4. pagesize is the maximum number of videos per page; testing shows the server caps it at 100
  5. time_from and time_to bound the search time range
  6. each video entry includes: upload time (pubdate), cover image (pic), tags (tag), uploader (author), comment count (review), description (description), danmaku count (video_review), favorite count (favorites), video URL (arcurl), BV id (bvid), title (title), and more
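Conclusions 2 and 4 are easy to re-check; a small sketch, passing the query parameters as a dict so that requests builds the URL:

```python
import requests

params = {
    "main_ver": "v3", "search_type": "video", "view_type": "hot_rank",
    "copy_right": -1, "new_web_tag": 1, "order": "click",
    "cate_id": 24,        # the MAD/AMV sub-partition
    "page": 1,
    "pagesize": 100,      # 100 appears to be the server-side cap
    "time_from": "20230120", "time_to": "20230127",
}
resp = requests.get("https://s.search.bilibili.com/cate/search", params=params)
videos = resp.json()["result"]

print(len(videos))  # no more than 100 entries come back per page
plays = [v["play"] for v in videos]
print(plays == sorted(plays, reverse=True))  # True: descending by play count
```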

With the above information we can write the Python scraper:

```python
import requests
import json
import csv

# Create the CSV file and write the header row
f = open("E:\\bili_csv\\bili.csv", mode="a", encoding="utf-8-sig", newline="")
csv_writer = csv.DictWriter(f, fieldnames=["title", "author", "play", "partition", "tag", "pic", "pubdate", "arcurl", "review", "video_review", "favorites"])
csv_writer.writeheader()

# cate_id of each sub-partition
cate_id = [24, 25, 47, 210, 86, 253, 27, 22, 26, 126, 216, 127, 20, 198, 199, 200, 154, 156, 71, 241, 242, 137, 95, 230, 231, 232, 233, 76, 212, 213, 214, 215, 17, 171, 172, 65, 173, 121, 136, 19, 28, 31, 59, 30, 29, 193, 243, 244, 130, 182, 183, 85, 184, 201, 124, 228, 207, 208, 209, 229, 122, 203, 204, 205, 206, 138, 254, 250, 251, 239, 161, 162, 21, 218, 219, 222, 221, 220, 75, 245, 246, 247, 248, 240, 227, 176, 157, 252, 158, 159, 235, 249, 164, 236, 237, 238]
# Main partition that each cate_id belongs to
partitions = ["动画", "动画", "动画", "动画", "动画", "动画", "动画", "鬼畜", "鬼畜", "鬼畜", "鬼畜", "鬼畜", "舞蹈", "舞蹈", "舞蹈", "舞蹈", "舞蹈", "舞蹈", "娱乐", "娱乐", "娱乐", "娱乐", "科技", "科技", "科技", "科技", "科技", "美食", "美食", "美食", "美食", "美食", "游戏", "游戏", "游戏", "游戏", "游戏", "游戏", "游戏", "游戏", "音乐", "音乐", "音乐", "音乐", "音乐", "音乐", "音乐", "音乐", "音乐", "影视", "影视", "影视", "影视", "知识", "知识", "知识", "知识", "知识", "知识", "知识", "知识", "资讯", "资讯", "资讯", "资讯", "生活", "生活", "生活", "生活", "生活", "生活", "生活", "生活", "动物圈", "动物圈", "动物圈", "动物圈", "动物圈", "动物圈", "汽车", "汽车", "汽车", "汽车", "汽车", "汽车", "汽车", "时尚", "时尚", "时尚", "时尚", "运动", "运动", "运动", "运动", "运动", "运动"]
# Sub-partition name for each cate_id
tags = ["MAD/AMV", "MMD/3D", "短片/手书/配音", "手办/模玩", "特摄", "动漫杂谈", "动画综合", "鬼畜调教", "音MAD", "人力VOCALOID", "鬼畜剧场", "教程演示", "宅舞", "街舞", "明星舞蹈", "中国舞", "舞蹈综合", "舞蹈教程", "综艺", "娱乐杂谈", "粉丝创作", "明星综合", "数码", "软件应用", "计算机技术", "科工机械", "极客DIY", "美食制作", "美食侦探", "美食测评", "田园美食", "美食记录", "单机游戏", "电子竞技", "手机游戏", "网络游戏", "桌游棋牌", "GMV", "音游", "Mugen", "原创音乐", "翻唱", "演奏", "VOCALOID", "音乐现场", "MV", "乐评盘点", "音乐教学", "音乐综合", "影视杂谈", "影视剪辑", "小剧场", "预告/资讯", "科学科普", "社科/法律心理", "人文历史", "财经商业", "校园学习", "职业职场", "设计/创意", "野生技能协会", "热点", "环球", "社会", "资讯综合", "搞笑", "亲子", "出行", "三农", "家居房产", "手工", "绘画", "日常", "喵星人", "汪星人", "小宠异宠", "野生动物", "动物二创", "动物综合", "赛车", "改装玩车", "新能源车", "房车", "摩托车", "购车攻略", "汽车生活", "美妆护肤", "仿妆cos", "穿搭", "时尚潮流", "篮球", "足球", "健身", "竞技体育", "运动文化", "运动综合"]

for idx in range(0, 96):  # 96 sub-partitions in total
    partition = partitions[idx]
    cid = cate_id[idx]
    tag = tags[idx]
    for page in range(1, 2):  # scrape one page (100 entries) per sub-partition
        # Fill in cate_id, page, pagesize, time_from & time_to
        url = "https://s.search.bilibili.com/cate/search?main_ver=v3&search_type=video&view_type=hot_rank&copy_right=-1&new_web_tag=1&order=click&cate_id=" + str(cid) + "&page=" + str(page) + "&pagesize=100&time_from=20230120&time_to=20230127"
        response = requests.get(url)
        d0 = json.loads(response.content)
        for i in range(0, 100):
            try:
                print(tag + " " + str(page) + " " + str(i))
                play = d0["result"][i]["play"]                  # play count
                author = d0["result"][i]["author"]              # uploader
                title = d0["result"][i]["title"]                # title
                review = d0["result"][i]["review"]              # comment count
                pubdate = d0["result"][i]["pubdate"]            # upload date
                favorites = d0["result"][i]["favorites"]        # favorite count
                arcurl = d0["result"][i]["arcurl"]              # video URL
                video_review = d0["result"][i]["video_review"]  # danmaku count
                pic = d0["result"][i]["pic"]                    # cover image
                data = {"title": title,
                        "author": author,
                        "play": play,
                        "partition": partition,
                        "tag": tag,
                        "pic": pic,
                        "pubdate": pubdate,
                        "arcurl": arcurl,
                        "review": review,
                        "video_review": video_review,
                        "favorites": favorites}
                csv_writer.writerow(data)
            except IndexError:
                print("out of limit")  # this page had fewer than 100 results
                continue

f.close()
```



The resulting data:

(screenshot: the scraped data in the CSV)
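A quick way to sanity-check the output is to read the CSV back and count rows per partition (assuming the same path used in the script):

```python
import csv
from collections import Counter

with open("E:\\bili_csv\\bili.csv", encoding="utf-8-sig", newline="") as f:
    rows = list(csv.DictReader(f))

print(len(rows))  # total number of videos collected
print(Counter(row["partition"] for row in rows).most_common())
```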

References:

Scraping detailed video information from Bilibili partitions (like/coin/favorite counts, play counts, tags), etc.

Course assignment | Scraping Bilibili's 2022 "Weekly Must-Watch" video data (video on bilibili.com)