Admittedly this could be hand-rolled without a framework, but that would defeat the purpose of getting to know Scrapy.
Let's set a small goal first and then learn through actual use.
Preface

- What I want to achieve
  Record the stargazer accounts of a given GitHub repo (BilibiliDown is used as the example here) once a day, and upload the result to a given path (GithubStargazers/BilibiliDown).
- Implementation steps
  - Getting the star information
  - Uploading the file to GitHub
  - Periodic runs via GitHub workflow
Getting the star information
- Custom items.py
  Only a serial number and a username are recorded, so this one is simple:

  ```python
  import scrapy


  class GithubstarerItem(scrapy.Item):
      # define the fields for your item here like:
      # name = scrapy.Field()
      # serial number
      serial_number = scrapy.Field()
      # username
      user_name = scrapy.Field()
  ```
- Custom spider logic
  GitHub already provides a ready-made API that returns at most 30 stargazers per page, so we only need to vary the paging parameter ?page=%d and stop once an empty list comes back (a quick hand check of that behaviour is sketched right after this list).

  ```python
  import json

  import scrapy

  from GithubStarer.items import GithubstarerItem


  class BilibilidownSpider(scrapy.Spider):
      name = 'BilibiliDown'
      allowed_domains = ['api.github.com']
      start_urls = ['https://api.github.com/repos/nICEnnnnnnnLee/BilibiliDown/stargazers']
      page = 1

      def parse(self, response):
          print(response.text)
          result = json.loads(response.text)
          # an empty page means we have walked past the last stargazer
          if len(result) == 0:
              return
          i = self.page * 30 - 30
          for i_user in result:
              starer = GithubstarerItem()
              starer['serial_number'] = i
              starer['user_name'] = i_user['login']
              i += 1
              print(starer)
              yield starer
          # request the next page
          self.page += 1
          next_link = 'https://api.github.com/repos/nICEnnnnnnnLee/BilibiliDown/stargazers?page=%d' % self.page
          yield scrapy.Request(next_link, callback=self.parse)
  ```
- Custom pipelines.py
  Save the collected items to a plain-text file (a quick manual run of this pipeline is also sketched after the list):

  ```python
  import os


  class GithubstarerPipeline(object):
      def process_item(self, item, spider):
          # current working directory
          base_dir = os.getcwd()
          filename = base_dir + '/starers.txt'
          # open the file in append mode and write out the item
          with open(filename, 'a') as f:
              f.write(str(item['serial_number']) + '\t')
              f.write(item['user_name'] + ' \r\n')
          return item
  ```

  The corresponding configuration:

  ```python
  ITEM_PIPELINES = {
      'GithubStarer.pipelines.GithubstarerPipeline': 300,
  }
  ```
- Other odds and ends
  The difficulty here is low, so no special care is taken over HTTP headers and the like, and settings.py and middlewares.py are left almost untouched.
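
Before wiring the spider into a scheduled crawl, the paging behaviour can be checked by hand. The sketch below is not part of the Scrapy project; it assumes the requests package is installed and simply walks the same stargazers endpoint, stopping on the first empty page exactly as the spider does.

```python
import requests

# Quick manual check of the stargazers API used by the spider above.
# Assumes `requests` is installed; not part of the Scrapy project itself.
url = 'https://api.github.com/repos/nICEnnnnnnnLee/BilibiliDown/stargazers'
page = 1
while True:
    result = requests.get(url, params={'page': page}).json()
    if len(result) == 0:
        # empty page: past the last stargazer, same stop condition as the spider
        break
    for user in result:
        print(user['login'])
    page += 1
```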
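Likewise, the pipeline can be exercised without running a crawl at all. This is only an illustrative sketch: the username is made up, and it assumes the script is run from the project root so that the GithubStarer package is importable.

```python
from GithubStarer.items import GithubstarerItem
from GithubStarer.pipelines import GithubstarerPipeline

# Feed a single hand-built item through the pipeline; 'example-user' is a made-up name.
item = GithubstarerItem(serial_number=0, user_name='example-user')
GithubstarerPipeline().process_item(item, spider=None)
# starers.txt in the current directory should now end with a "0<TAB>example-user" line.
```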
Uploading the file to GitHub

A ready-made implementation already exists from earlier work; see FileUploader4Github.
The script call used for reference:

```bash
# format the date
cur_date=$(date "+%Y-%m-%d")
# upload path
upload_path="https://api.github.com/repos/nICEnnnnnnnLee/GithubStargazers/contents/BilibiliDown/$cur_date.txt"
# call the existing implementation to upload starers.txt
java -jar tool/FileUploadTool.jar $upload_path starers.txt ${{ secrets.AUTH_TOKEN }}
```
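
If you would rather not depend on the Java tool, the same upload can also be done against GitHub's contents API (a PUT to /repos/{owner}/{repo}/contents/{path} with base64-encoded content). The sketch below is only an alternative for illustration, not what the workflow uses: it assumes the requests package is available and that the token is exposed to the script as an AUTH_TOKEN environment variable.

```python
import base64
import datetime
import os

import requests

# Alternative upload path, for illustration only; the workflow itself uses FileUploadTool.jar.
token = os.environ['AUTH_TOKEN']  # assumed to be passed in as an environment variable
cur_date = datetime.date.today().strftime('%Y-%m-%d')
url = ('https://api.github.com/repos/nICEnnnnnnnLee/GithubStargazers'
       '/contents/BilibiliDown/%s.txt' % cur_date)

with open('starers.txt', 'rb') as f:
    content = base64.b64encode(f.read()).decode('ascii')

# Creating a new file only needs message + content; updating an existing one would also need its sha.
resp = requests.put(
    url,
    headers={'Authorization': 'token ' + token},
    json={'message': 'stargazers for %s' % cur_date, 'content': content},
)
resp.raise_for_status()
```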
Periodic runs via GitHub workflow

Nothing really difficult here either. The one thing to note is that AUTH_TOKEN must be configured under Settings -> Secrets; it is the token that authorizes the file upload to the target repo.
```yaml
name: CI
on:
  schedule:
    - cron: '1 0 * * *' # run at 00:01 every day (UTC)
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      # check out the project
      - uses: actions/checkout@v2
      # set up the Java environment
      - name: Set up JDK 1.8
        uses: actions/setup-java@v1
        with:
          java-version: 1.8
      # set up the Python environment
      - name: Set up Python 3.8
        uses: actions/setup-python@v1
        with:
          python-version: 3.8
      # install scrapy
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install scrapy
      # crawl the stargazers and save them to starers.txt
      - name: Get GithubStargazers
        run: |
          rm -rf starers.txt
          scrapy crawl BilibiliDown
      # upload starers.txt to the target repo
      - name: Upload GithubStargazers
        run: |
          cur_date=$(date "+%Y-%m-%d")
          upload_path="https://api.github.com/repos/nICEnnnnnnnLee/GithubStargazers/contents/BilibiliDown/$cur_date.txt"
          echo $upload_path
          java -jar tool/FileUploadTool.jar $upload_path starers.txt ${{ secrets.AUTH_TOKEN }}
```