A Multithreaded Web Crawler Example in Python - 侯体宗的博客

Python / Posted by admin 7 years ago 194

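Python supports multithreading, mainly through the thread and threading modules (thread was renamed _thread in Python 3). This article shares a multithreaded web crawler implemented in Python.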

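Threads are generally used in one of two ways. The first is to write the function the thread should run and pass it to a Thread object, which executes it. The second is to subclass Thread directly and put the thread's code in the new class.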
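The two patterns can be compared side by side in a minimal Python 3 sketch. The names fetch, FetchThread, and the page labels are hypothetical stand-ins for real download work and are not part of the crawler in this article.

```python
import threading

results = []  # shared list that both threads append to


def fetch(name):
    """Hypothetical stand-in for real work such as downloading a page."""
    results.append("function:" + name)


class FetchThread(threading.Thread):
    """Pattern 2: subclass Thread and put the work in run()."""

    def __init__(self, name_to_fetch):
        super().__init__()
        self.name_to_fetch = name_to_fetch

    def run(self):
        results.append("class:" + self.name_to_fetch)


# Pattern 1: pass a function to a Thread object as its target.
t1 = threading.Thread(target=fetch, args=("page-a",))
# Pattern 2: instantiate the subclass.
t2 = FetchThread("page-b")

for t in (t1, t2):
    t.start()
for t in (t1, t2):
    t.join()

print(sorted(results))
```

Subclassing tends to be convenient when each thread carries its own state (a URL, a file name, a thread id), which is probably why the crawler in this article takes that route.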

The crawler here combines multiple threads with a lock to implement a breadth-first web crawler.
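As a rough illustration of the locking part, the Python 3 sketch below has many threads update shared state under one lock. The names record, visited, and total and the example.com URLs are assumptions made for this sketch; it uses a plain threading.Lock, which is one reasonable choice and not necessarily the one a given crawler uses.

```python
import threading

lock = threading.Lock()
visited = []  # shared state, analogous to a list of already-crawled URLs
total = 0     # shared counter, analogous to a count of downloaded pages


def record(url):
    global total
    # Hold the lock while touching shared state so updates do not interleave.
    with lock:
        visited.append(url)
        total += 1


urls = ["http://example.com/%d" % i for i in range(100)]
threads = [threading.Thread(target=record, args=(u,)) for u in urls]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(total, len(set(visited)))
```

Without the lock, the read-modify-write in total += 1 could in principle lose updates when two threads run it at the same moment.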

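Here is a brief outline of my approach.

A web crawler that downloads pages in breadth-first order works like this:

1. Download the first page from the given entry URL.

2. Extract all new page addresses from the first page and put them in the download list.

3. Download every new page whose address is in the download list.

4. From all the newly downloaded pages, find the addresses that have not been downloaded yet and update the download list.

5. Repeat steps 3 and 4 until the updated download list is empty.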
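The five steps can be tried without any network access by running them over a small in-memory link graph. This is a single-threaded Python 3 sketch; LINKS and crawl_breadth_first are hypothetical names invented for the illustration, and a real crawler would replace the dictionary lookup with a page download plus link extraction.

```python
# Hypothetical link graph: each key is a page, each value the links found on it.
LINKS = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["d"],
    "d": [],
}


def crawl_breadth_first(entry):
    downloaded = []            # pages in the order they were "downloaded"
    seen = {entry}             # every address that was ever queued
    queue = [entry]            # step 1: start from the entry address
    while queue:               # step 5: stop once the download list is empty
        next_queue = []
        for page in queue:     # step 3: download everything in the list
            downloaded.append(page)
            for link in LINKS.get(page, []):  # steps 2 and 4: keep unseen links
                if link not in seen:
                    seen.add(link)
                    next_queue.append(link)
        queue = next_queue     # the updated download list for the next depth
    return downloaded


print(crawl_breadth_first("a"))
```

Each pass of the outer loop corresponds to one depth level, so all pages at depth n are handled before any page at depth n+1.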

The Python code is as follows:

#!/usr/bin/env python3
# coding=utf-8
import re
import threading
import urllib.request

g_mutex = threading.Condition()
g_pages = []      # downloaded page contents, parsed for URL links
g_queueURL = []   # URLs waiting to be crawled
g_existURL = []   # URLs that have already been crawled
g_failedURL = []  # URLs that failed to download
g_totalcount = 0  # number of pages downloaded so far


class Crawler:
    def __init__(self, crawlername, url, threadnum):
        self.crawlername = crawlername
        self.url = url
        self.threadnum = threadnum
        self.threadpool = []
        self.logfile = open("log.txt", "w", encoding="utf-8")

    def craw(self):
        global g_queueURL
        g_queueURL.append(self.url)
        depth = 0
        print(self.crawlername + " starting...")
        # Each pass of this loop handles one depth level of the crawl.
        while len(g_queueURL) != 0:
            depth += 1
            print("Searching depth", depth, "...\n\n")
            self.logfile.write("URL:" + g_queueURL[0] + "........")
            self.downloadAll()
            self.updateQueueURL()
            content = "\n>>>Depth " + str(depth) + ":\n"
            self.logfile.write(content)
            for i, queued in enumerate(g_queueURL):
                content = str(g_totalcount + i) + "->" + queued + "\n"
                self.logfile.write(content)
        self.logfile.close()

    def downloadAll(self):
        global g_queueURL
        global g_totalcount
        i = 0
        while i < len(g_queueURL):
            # Start at most threadnum download threads per batch.
            j = 0
            while j < self.threadnum and i + j < len(g_queueURL):
                g_totalcount += 1
                threadresult = self.download(
                    g_queueURL[i + j], str(g_totalcount) + ".html", j)
                if threadresult is not None:
                    print("Thread started:", i + j,
                          "--File number =", g_totalcount)
                j += 1
            i += j
            # Wait up to 30 seconds for each thread in the batch.
            for thread in self.threadpool:
                thread.join(30)
            self.threadpool = []
        g_queueURL = []

    def download(self, url, filename, tid):
        crawthread = CrawlerThread(url, filename, tid)
        self.threadpool.append(crawthread)
        crawthread.start()
        return crawthread

    def updateQueueURL(self):
        global g_queueURL
        newUrlList = []
        for content in g_pages:
            newUrlList += self.getUrl(content)
        # Keep only URLs that have not been crawled yet.
        g_queueURL = list(set(newUrlList) - set(g_existURL))

    def getUrl(self, content):
        # Match double-quoted http:// links in the page text.
        reg = r'"(http://.+?)"'
        regob = re.compile(reg, re.DOTALL)
        urllist = regob.findall(content)
        return urllist


class CrawlerThread(threading.Thread):
    def __init__(self, url, filename, tid):
        threading.Thread.__init__(self)
        self.url = url
        self.filename = filename
        self.tid = tid

    def run(self):
        try:
            page = urllib.request.urlopen(self.url)
            html = page.read()
            with open(self.filename, "wb") as fout:
                fout.write(html)
        except Exception as e:
            # Record the failure under the lock so the URL is not retried.
            with g_mutex:
                g_existURL.append(self.url)
                g_failedURL.append(self.url)
            print("Failed downloading and saving", self.url)
            print(e)
            return None
        # Publish the page text and mark the URL as crawled under the lock.
        with g_mutex:
            g_pages.append(html.decode("utf-8", errors="replace"))
            g_existURL.append(self.url)


if __name__ == "__main__":
    url = input("Enter the entry URL:\n")
    threadnum = int(input("Set the number of threads:"))
    crawlername = "小小爬虫"
    crawler = Crawler(crawlername, url, threadnum)
    crawler.craw()

That is everything for this multithreaded Python crawler example. I hope it serves as a useful reference.
