Scraping Genshin Impact Wiki Data from Miyoushe

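I recently got hooked on a Serenitea Pot furnishing set, but tallying the required materials and gathering them up front is a painful process, and any miscalculation means going back and forth, which wastes a lot of time. So I wanted to put together a materials table covering every item, and after looking around, scraping the data from the official Miyoushe site turned out to be the simplest route. The rough idea: after logging in, find the request URL for the data in the browser's developer console, fetch the data with Python, parse HTML with BeautifulSoup, parse the data with the json module, and organize and export it with pandas. Below is some of the scraping code: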

import pandas as pd
import requests
import json
import re
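# Request URL found in the browser's developer console
url = 'https://api-static.mihoyo.com/common/blackboard/ys_obc/v1/home/content/list?app_sn=ys_obc&channel_id=189'
# Send a browser-like User-Agent with the request
headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'}
response = requests.get(url, headers=headers)
# The response body is JSON text; decode it into Python objects
rep = json.loads(response.text)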
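
Before digging into the decoded JSON, it can help to check that the response is a usable payload at all. The sketch below is a minimal, hedged example: the helper name `unwrap_payload` is made up for illustration, and the `retcode` / `message` fields are an assumption about the response envelope, not something shown in the code above. The sample body is invented and only mirrors the `data` → `list` → `children` shape used in this post.

```python
import json

def unwrap_payload(text):
    """Parse a response body and return its 'data' field, or raise ValueError."""
    body = json.loads(text)
    # Assumption: a non-zero 'retcode' signals an API-level error.
    if body.get('retcode', 0) != 0:
        raise ValueError('API error %s: %s' % (body.get('retcode'), body.get('message')))
    if 'data' not in body:
        raise ValueError('response has no "data" field')
    return body['data']

# Made-up sample body; the real response is much larger
sample = '{"retcode": 0, "message": "OK", "data": {"list": [{"children": []}]}}'
data = unwrap_payload(sample)
print(data['list'][0]['children'])
```

With the request above this would be called as `unwrap_payload(response.text)`. Calling `response.raise_for_status()` first and passing a `timeout=` argument to `requests.get` would additionally surface HTTP-level failures instead of letting them show up later as confusing key errors.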

The data we need can then be picked out of this JSON:

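# Top-level wiki categories sit under the first entry of data.list
datadb = rep['data']['list'][0]['children']
# Index 13 is the furnishing category used below; the position depends on the current API response
fur = datadb[13]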
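
A hard-coded index such as `datadb[13]` breaks silently if the site adds or reorders categories. As a hedged alternative, the category could be looked up by its label instead. In the sketch below the helper `find_child`, the `name` key, and the category label `'摆设'` are all assumptions for illustration; inspect the real `children` entries to see which key and label actually apply.

```python
def find_child(children, name, key='name'):
    """Return the first child dict whose `key` equals `name`, or None."""
    for child in children:
        if child.get(key) == name:
            return child
    return None

# Made-up sample; the real entries and their field names may differ
sample_children = [
    {'name': '角色', 'list': []},
    {'name': '摆设', 'list': [{'title': 'example'}]},
]
fur_sample = find_child(sample_children, '摆设')
print(fur_sample['list'])
```

Returning `None` for a missing category makes the failure explicit at the lookup, rather than quietly scraping the wrong category.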

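Furnishing entries are not uniform: some fields are missing for certain items, so each lookup is wrapped in try/except to detect the gap and report the error.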

# Labels looked up in each item's filter text, in output column order.
# Every entry in that text has the form "label/value".
labels = ['区域', '类型', '品质', '图纸获取方式', '摆设获取方式']

res = []
for n, i in enumerate(fur['list']):
    title = i['title']
    summary = i['summary']
    ext = json.loads(i['ext'])
    row = [title, summary]
    for label in labels:
        try:
            # Capture the value that follows "label/" inside the quotes
            value = re.search(r'"%s/(\S+?)"' % label, ext['c_130']['filter']['text']).group(1)
        except (KeyError, TypeError, AttributeError):
            # Field missing for this item: mark it and report where it happened
            value = "Error"
            print(n, title, summary, label, value)
        row.append(value)
    res.append(row)
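
The extraction step in the loop above can be tried in isolation, without hitting the network. The sketch below is a minimal example under stated assumptions: the helper name `extract_field` is hypothetical, and `sample_text` is a made-up filter string that only imitates the `"label/value"` entries the regular expressions in this post look for. It uses a capture group instead of slicing the matched string.

```python
import re

def extract_field(filter_text, label, default='Error'):
    """Return the value following 'label/' inside a quoted entry, else default."""
    match = re.search(r'"%s/(\S+?)"' % re.escape(label), filter_text)
    return match.group(1) if match else default

# Made-up filter text imitating the "label/value" entries
sample_text = '["区域/蒙德","类型/室内摆设","品质/三星"]'
print(extract_field(sample_text, '区域'))          # prints 蒙德
print(extract_field(sample_text, '图纸获取方式'))  # prints Error
```

Returning a default for a missing label keeps the row aligned with the output columns, while the `"Error"` marker still makes the gap easy to spot in the final table.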
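# Assemble the collected rows into a table and write it out as CSV
result = pd.DataFrame(res, columns=['名称','注释','区域','分类','品质','图纸获取方式','摆设获取方式'])
# utf-8-sig writes a BOM so spreadsheet tools such as Excel detect the encoding of the Chinese text
result.to_csv('家具表.csv', encoding='utf-8-sig')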