Scraping Genshin Impact wiki data from Miyoushe (米游社)
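I recently got hooked on a Serenitea Pot furnishing set, but tallying the required materials and gathering them early on is painful: a single miscalculation means going back and forth, which wastes a lot of time. So I wanted to put together a table covering every material, and after looking around, scraping the data from the official Miyoushe site turned out to be the simplest option. The rough idea: after logging in, find the request URL for the data in the browser console, fetch the data with Python, parse HTML with BeautifulSoup, parse the data with the json module, and tidy and export it with pandas. Below is some of the scraping code: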
Imports
import pandas as pd
import requests
import json
import re
Requesting the data
url = 'https://api-static.mihoyo.com/common/blackboard/ys_obc/v1/home/content/list?app_sn=ys_obc&channel_id=189'
headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'}
response = requests.get(url,headers=headers)
rep = json.loads(response.text)
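The request above assumes the call succeeds and never times out. A minimal sketch of a slightly more defensive version follows; the function name `fetch_json` and its `get` parameter are hypothetical (not from the original code), and `get` exists only so the function can be tried without network access. It relies on `requests.get` accepting `headers` and `timeout`, and on the response's `raise_for_status()` and `json()` methods.

```python
def fetch_json(url, headers=None, get=None, timeout=10):
    """Fetch `url` and return the decoded JSON body.

    `get` is a hypothetical injection point; it defaults to requests.get.
    """
    if get is None:
        import requests  # imported lazily so a stand-in `get` needs no network stack
        get = requests.get
    response = get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return response.json()       # equivalent to json.loads(response.text)


class FakeResponse:
    """Stand-in response used only to demonstrate the call offline."""

    def raise_for_status(self):
        pass

    def json(self):
        return {"data": {"list": []}}


def fake_get(url, headers=None, timeout=None):
    return FakeResponse()


rep_demo = fetch_json("https://example.invalid/list", get=fake_get)
```

With the real endpoint, `rep = fetch_json(url, headers=headers)` would take the place of the `requests.get` and `json.loads` lines.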
Organizing the data
The data we need can be parsed out of this JSON.
datadb = rep['data']['list'][0]['children']
fur = datadb[13]
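The hard-coded index `13` will silently point at a different category if the site reorders its list. As a sketch of an alternative, the category could be looked up by name. This assumes each entry in `children` carries a `name` key and that the furniture category is called `摆设`; neither is shown in the original, so check both against the actual response. The sample data below is made up to mimic the shape.

```python
def find_child(children, name):
    """Return the first child dict whose 'name' field equals `name`."""
    for child in children:
        if child.get("name") == name:
            return child
    raise KeyError(f"no category named {name!r}")


# Hypothetical sample mimicking rep['data']['list'][0]['children']
sample_children = [
    {"name": "角色", "list": []},
    {"name": "摆设", "list": [{"title": "example"}]},
]
fur_demo = find_child(sample_children, "摆设")
```

Raising `KeyError` when nothing matches makes a layout change fail loudly instead of scraping the wrong category.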
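Furniture entries are not uniform: some fields are missing for some items, so each lookup is wrapped in try/except, and any failure is reported.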
res = []
n = 0
for i in fur['list']:
    title = i['title']
    summary = i['summary']
    ext = json.loads(i['ext'])
    # Each regex matches a quoted "label/value" token in the filter text;
    # the slice drops the leading '"label/' and the trailing '"'.
    try:
        region = re.compile(r'\"区域\/\S+?\"').findall(ext['c_130']['filter']['text'])[0][4:-1]
    except Exception:
        region = "Error"
        # Report only values that already exist; later fields are not assigned yet.
        print(n, title, summary, 'region missing')
    try:
        # Named `category` rather than `type` to avoid shadowing the builtin.
        category = re.compile(r'\"类型\/\S+?\"').findall(ext['c_130']['filter']['text'])[0][4:-1]
    except Exception:
        category = "Error"
        print(n, title, summary, 'type missing')
    try:
        quality = re.compile(r'\"品质\/\S+?\"').findall(ext['c_130']['filter']['text'])[0][4:-1]
    except Exception:
        quality = "Error"
        print(n, title, summary, 'quality missing')
    try:
        blueprintAccess = re.compile(r'\"图纸获取方式\/\S+?\"').findall(ext['c_130']['filter']['text'])[0][8:-1]
    except Exception:
        blueprintAccess = "Error"
        print(n, title, summary, 'blueprint source missing')
    try:
        furnitureAccess = re.compile(r'\"摆设获取方式\/\S+?\"').findall(ext['c_130']['filter']['text'])[0][8:-1]
    except Exception:
        furnitureAccess = "Error"
        print(n, title, summary, 'furnishing source missing')
    n = n + 1
    res.append([title, summary, region, category, quality, blueprintAccess, furnitureAccess])
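The five near-identical try/except blocks above could be collapsed into one helper. The sketch below is a refactoring suggestion, not the original method: `extract_field` and `parse_item` are hypothetical names, a capture group replaces the hand-counted slices, and the sample item is invented to resemble the `"label/value"` tokens that the regexes in the loop imply.

```python
import json
import re

# Labels in the same order as the columns collected in the loop above
FIELDS = ["区域", "类型", "品质", "图纸获取方式", "摆设获取方式"]


def extract_field(text, label, default="Error"):
    """Return the value of a quoted "label/value" token, or `default` if absent."""
    match = re.search(r'"' + re.escape(label) + r'/(\S+?)"', text)
    return match.group(1) if match else default


def parse_item(item):
    """Turn one raw list entry into a row: title, summary, then the five fields."""
    ext = json.loads(item["ext"])
    try:
        text = ext["c_130"]["filter"]["text"]
    except (KeyError, TypeError):
        text = ""  # no filter block at all: every field falls back to the default
    return [item["title"], item["summary"]] + [extract_field(text, f) for f in FIELDS]


# Hypothetical entry; the real payload may differ in detail
sample_item = {
    "title": "松木桌",
    "summary": "",
    "ext": json.dumps(
        {"c_130": {"filter": {"text": '["区域/室内","类型/大型家具","品质/三星"]'}}}
    ),
}
row = parse_item(sample_item)
```

With this in place the loop body would reduce to `res.append(parse_item(i))`, and rows containing `"Error"` can be filtered afterwards to see which entries had gaps.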
Saving the data
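# Columns: name, note, region, category, quality, blueprint source, furnishing source
result = pd.DataFrame(res, columns=['名称','注释','区域','分类','品质','图纸获取方式','摆设获取方式'])
# Writes the table to 家具表.csv ("furniture table") in the working directory
result.to_csv('家具表.csv')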