Scraping Toutiao's Ajax Requests - 侯体宗's Blog

Scraping Toutiao's Ajax Requests
Frontend / Posted by admin 6 years ago 415

Site: https://www.toutiao.com/

Search Toutiao for a keyword, here 街拍 (street photography).

This leads to the following URL:

https://www.toutiao.com/search/?keyword=%E8%A1%97%E6%8B%8D
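The keyword value in that URL is simply the percent-encoded UTF-8 form of 街拍. A minimal sketch using only the standard library shows the round trip; nothing here is specific to Toutiao.

```python
from urllib.parse import quote, unquote

keyword = '街拍'

# Percent-encode the keyword the same way the browser does for the query string
encoded = quote(keyword)
print(encoded)

# Decoding the value from the URL gives the original keyword back
decoded = unquote('%E8%A1%97%E6%8B%8D')
print(decoded == keyword)
```

This is why the scraper can pass the plain keyword to urlencode and let it handle the encoding.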

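Inspect the page with the browser's developer tools:

Searching the page's HTML response for the text shown on the page finds nothing, so we can tentatively conclude that the content is loaded via Ajax and rendered afterwards. Switching the network panel to the XHR filter confirms that these are indeed Ajax requests.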
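The same check can be done in code: if text that is visible in the browser does not occur in the raw HTML, the content is most likely filled in later by JavaScript. The sketch below uses two made-up HTML strings (both hypothetical, for illustration only) rather than a real response.

```python
def appears_in_html(html: str, snippet: str) -> bool:
    """Return True if a snippet of visible text occurs in the raw HTML source."""
    return snippet in html


# Hypothetical responses, not taken from the real site
static_page = "<html><body><h1>街拍 street style</h1></body></html>"
ajax_page = "<html><body><div id='app'></div><script src='app.js'></script></body></html>"

print(appears_in_html(static_page, "street style"))
print(appears_in_html(ajax_page, "street style"))
```

A miss like the second case is only a hint, which is why the XHR filter in the developer tools is used to confirm it.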

Looking at the requests, the only parameter that changes is offset, and it increases by 20 each time.
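As a rough sketch of that pattern, the request URL for any page can be built by multiplying the page index by 20. The endpoint and parameter names are the ones used in this article's code; the helper name build_page_url and the constant names are assumptions made for illustration, and the endpoint may no longer respond as it did when the article was written.

```python
from urllib.parse import urlencode, urlparse, parse_qs

SEARCH_API = 'https://www.toutiao.com/search_content/?'  # endpoint as used in this article
PAGE_SIZE = 20  # offset grows by 20 per request


def build_page_url(page: int, keyword: str = '街拍') -> str:
    """Build the Ajax URL for a zero-based page index (hypothetical helper)."""
    params = {
        'offset': page * PAGE_SIZE,
        'format': 'json',
        'keyword': keyword,
        'autoload': 'true',
        'count': str(PAGE_SIZE),
        'cur_tab': '1',
        'from': 'search_tab',
    }
    return SEARCH_API + urlencode(params)


# Print the URLs for the first three pages (offsets 0, 20, 40)
for page in range(3):
    print(build_page_url(page))
```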

We can use offset to page through the data and then download the images. The code is as follows:

import os
from hashlib import md5
from multiprocessing.pool import Pool
from urllib.parse import urlencode

import requests
from requests import codes


def get_page(offset):
    """Request one page of search results; return the parsed JSON, or None on failure."""
    params = {
        'offset': offset,
        'format': 'json',
        'keyword': '街拍',
        'autoload': 'true',
        'count': '20',
        'cur_tab': '1',
        'from': 'search_tab',
    }
    url = 'https://www.toutiao.com/search_content/?' + urlencode(params)
    try:
        response = requests.get(url)
        if response.status_code == codes.ok:
            return response.json()
    except requests.ConnectionError:
        pass
    return None


def get_images(data):
    """Yield the title and URL of every image found in one page of results."""
    if not data or not data.get('data'):
        return
    for item in data.get('data'):
        # Skip entries that carry a cell_type field
        if item.get('cell_type') is not None:
            continue
        title = item.get('title')
        images = item.get('image_list') or []
        if not title:
            continue
        for image in images:
            yield {
                'title': title,
                'image': 'https:' + image.get('url'),
            }


def save_image(item):
    """Download one image into img/<title>/, named after the MD5 hash of its content."""
    img_path = os.path.join('img', item.get('title'))
    os.makedirs(img_path, exist_ok=True)
    try:
        resp = requests.get(item.get('image'))
        if resp.status_code == codes.ok:
            # MD5 is a hash function: identical image bytes always give the same file name
            file_name = md5(resp.content).hexdigest() + '.jpg'
            file_path = os.path.join(img_path, file_name)
            if not os.path.exists(file_path):
                # Write the image's binary data to the file
                with open(file_path, 'wb') as f:
                    f.write(resp.content)
                print('Downloaded image path is %s' % file_path)
            else:
                print('Already Downloaded', file_path)
    except requests.ConnectionError:
        print('Failed to Save Image, item %s' % item)


def main(offset):
    data = get_page(offset)
    for item in get_images(data):
        print(item)
        save_image(item)


GROUP = 0
GROUP_END = 2

if __name__ == '__main__':
    pool = Pool()
    groups = [x * 20 for x in range(GROUP, GROUP_END)]
    pool.map(main, groups)  # pass each offset in groups to main
    pool.close()
    pool.join()  # wait for the worker processes to finish before continuing
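The script names each file after the MD5 digest of the image bytes, which is what lets it detect an image it has already saved. The sketch below isolates that idea with made-up byte strings; the helper name image_file_name is an assumption for illustration and is not part of the script.

```python
from hashlib import md5


def image_file_name(content: bytes, suffix: str = 'jpg') -> str:
    """Derive a file name from the MD5 hex digest of the content (hypothetical helper)."""
    return '{}.{}'.format(md5(content).hexdigest(), suffix)


# Made-up byte strings standing in for downloaded image data
first = image_file_name(b'fake image bytes')
again = image_file_name(b'fake image bytes')
other = image_file_name(b'different image bytes')

print(first == again)  # identical content maps to the same name
print(first == other)  # different content maps to a different name
```

Because the name depends only on the content, running the scraper twice does not write the same image again; it takes the "Already Downloaded" branch instead.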

Summary

That is all for this article. I hope it is a useful reference for your study or work. Thank you for your support.
