Python Crawler: Scraping a Job Site with the Scrapy Framework and Storing the Data in MongoDB - 侯体宗's Blog

Python / Posted by admin 8 years ago 185

Create the project

scrapy startproject zhaoping

Create the spider

cd zhaoping
scrapy genspider hr zhaopingwang.com

Directory structure

items.py

import scrapy

class TencentItem(scrapy.Item):
    title = scrapy.Field()
    position = scrapy.Field()
    publish_date = scrapy.Field()

pipelines.py

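from pymongo import MongoClient

# Connect once at import time; host and port point at the author's MongoDB server
mongoclient = MongoClient(host='192.168.226.150', port=27017)
collection = mongoclient['zhaoping']['hr']


class TencentPipeline(object):
    def process_item(self, item, spider):
        print(item)
        # The item must be converted to a dict before inserting.
        # insert_one replaces Collection.insert, which was removed in PyMongo 4.
        collection.insert_one(dict(item))
        return item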
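Scrapy only calls a pipeline that has been enabled in the project settings, a step the article does not show. A minimal sketch, assuming the pipeline class lives in zhaoping/pipelines.py (the module path follows from the project name used with startproject above; the priority 300 is just a conventional middle value in the 0-1000 range):

```python
# settings.py
# Assumption: the project package is "zhaoping" and the class is the
# TencentPipeline defined in pipelines.py above.
# The number sets the order in which pipelines run (lower runs first).
ITEM_PIPELINES = {
    "zhaoping.pipelines.TencentPipeline": 300,
}
```

Without this entry the spider still yields items, but nothing is written to MongoDB.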

spiders/hr.py

# parse() is a method of the spider class in spiders/hr.py;
# the file also needs "import scrapy" and an import of TencentItem from items.py
def parse(self, response):
    # Skip the first and last rows of the table
    tr_list = response.xpath("//table[@class='tablelist']/tr")[1:-1]
    for tr in tr_list:
        item = TencentItem()
        # XPath indexes start at 1
        item["title"] = tr.xpath("./td[1]/a/text()").extract_first()
        item["position"] = tr.xpath("./td[2]/text()").extract_first()
        item["publish_date"] = tr.xpath("./td[5]/text()").extract_first()
        yield item
    next_url = response.xpath("//a[@id='next']/@href").extract_first()
    # Build the next-page URL; stop when there is no link or it is the "javascript:;" placeholder
    if next_url and next_url != "javascript:;":
        print(next_url)
        next_url = "https://hr.tencent.com/" + next_url
        yield scrapy.Request(url=next_url, callback=self.parse)
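The URL-building step in parse() concatenates strings, which only works when the href is a bare relative path. A minimal sketch of a more forgiving variant using the standard library's urljoin is shown here; the helper name build_next_url and the sample href are hypothetical, not part of the original spider. Inside a spider, Scrapy's response.urljoin(href) performs the same kind of join against the current page URL.

```python
from urllib.parse import urljoin

# Base URL taken from the spider above
BASE_URL = "https://hr.tencent.com/"


def build_next_url(href, base=BASE_URL):
    """Return an absolute next-page URL, or None when there is no next page.

    Hypothetical helper: treats a missing href and "javascript:" placeholder
    links as the end of pagination.
    """
    if not href or href.startswith("javascript:"):
        return None
    # urljoin handles relative paths, leading slashes and absolute URLs alike
    return urljoin(base, href)


if __name__ == "__main__":
    # "position.php?start=10" is a made-up sample href for illustration
    print(build_next_url("position.php?start=10"))
    print(build_next_url("javascript:;"))
```

Returning None for the terminal case lets the caller write a single check (if next_url: ...) before yielding the next request.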

It is that simple: with these pieces in place, the data is collected.

That is all for this article. I hope it helps with your learning, and thank you for your support.
