Python Crawler Package BeautifulSoup Example (Part 3) - 侯体宗's Blog

Python / Posted by admin 7 years ago 162

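This post builds a crawler step by step to scrape jokes from Qiushibaike (糗事百科). The listings are written for Python 2: they use urllib2 and the print statement.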

First, parsing without the BeautifulSoup package

Step 1: Request the URL and fetch the page source

# -*- coding: utf-8 -*-
# @Author: HaonanWu
# @Date:  2016-12-22 16:16:08
# @Last Modified by:  HaonanWu
# @Last Modified time: 2016-12-22 20:17:13
import urllib
import urllib2
import re
import os

if __name__ == '__main__':
    # Request the URL and fetch the page source
    url = 'http://www.qiushibaike.com/textnew/page/1/?s=4941357'
    user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'
    headers = {'User-Agent': user_agent}
    try:
        request = urllib2.Request(url=url, headers=headers)
        response = urllib2.urlopen(request)
        content = response.read()
    except urllib2.HTTPError as e:
        print e
        exit()
    except urllib2.URLError as e:
        print e
        exit()
    print content.decode('utf-8')
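For readers on Python 3, where urllib2 no longer exists, the same request can be sketched with urllib.request. This is only a sketch and not the author's code: the helper names build_request and fetch are hypothetical, and the 10-second timeout is an assumption.

```python
import urllib.error
import urllib.request

USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36')


def build_request(url):
    """Build a request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={'User-Agent': USER_AGENT})


def fetch(url, timeout=10):
    """Return the decoded page source, or None if the request fails."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
            return response.read().decode('utf-8')
    except urllib.error.URLError as e:  # HTTPError is a subclass of URLError
        print(e)
        return None
```

Calling fetch() with the URL from the listing above performs a real network request; build_request() alone does not touch the network.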

Step 2: Extract the information with regular expressions

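First, inspect the page source to find where the content you need sits and how it can be identified.
Then write a regular expression that matches and captures it.
Note that by default . in a regular expression does not match \n, so the match mode has to be set (re.S) to make it do so.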
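To make the note about . and \n concrete, here is a small sketch in Python 3 syntax. The HTML snippet is hypothetical; it is only shaped like the markup the article's pattern targets.

```python
import re

# Hypothetical snippet with newlines between and inside the tags.
html = '''<div class="content">
<span>
first line<br/>second line
</span>
</div>'''

pattern = '<div class="content">.*?<span>(.*?)</span>.*?</div>'

# Without re.S, "." cannot cross a newline, so the pattern finds nothing.
without_flag = re.findall(pattern, html)

# With re.S (DOTALL), "." also matches "\n" and the span body is captured.
with_flag = re.findall(pattern, html, re.S)

print(without_flag)
print(with_flag)
```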

# -*- coding: utf-8 -*-
# @Author: HaonanWu
# @Date:  2016-12-22 16:16:08
# @Last Modified by:  HaonanWu
# @Last Modified time: 2016-12-22 20:17:13
import urllib
import urllib2
import re
import os

if __name__ == '__main__':
    # Request the URL and fetch the page source
    url = 'http://www.qiushibaike.com/textnew/page/1/?s=4941357'
    user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'
    headers = {'User-Agent': user_agent}
    try:
        request = urllib2.Request(url=url, headers=headers)
        response = urllib2.urlopen(request)
        content = response.read()
    except urllib2.HTTPError as e:
        print e
        exit()
    except urllib2.URLError as e:
        print e
        exit()

    # Extract the data
    # Mind the line breaks: re.S lets . match newlines too
    regex = re.compile('<div class="content">.*?<span>(.*?)</span>.*?</div>', re.S)
    items = re.findall(regex, content)
    for item in items:
        print item

Step 3: Clean up the data and save it to files

# -*- coding: utf-8 -*-
# @Author: HaonanWu
# @Date:  2016-12-22 16:16:08
# @Last Modified by:  HaonanWu
# @Last Modified time: 2016-12-22 21:41:32
import urllib
import urllib2
import re
import os

if __name__ == '__main__':
    # Request the URL and fetch the page source
    url = 'http://www.qiushibaike.com/textnew/page/1/?s=4941357'
    user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'
    headers = {'User-Agent': user_agent}
    try:
        request = urllib2.Request(url=url, headers=headers)
        response = urllib2.urlopen(request)
        content = response.read()
    except urllib2.HTTPError as e:
        print e
        exit()
    except urllib2.URLError as e:
        print e
        exit()

    # Extract the data
    # Mind the line breaks: re.S lets . match newlines too
    regex = re.compile('<div class="content">.*?<span>(.*?)</span>.*?</div>', re.S)
    items = re.findall(regex, content)

    path = './qiubai'
    if not os.path.exists(path):
        os.makedirs(path)

    count = 1
    for item in items:
        # Tidy the data: drop \n, then turn <br/> into \n
        item = item.replace('\n', '').replace('<br/>', '\n')
        filepath = path + '/' + str(count) + '.txt'
        with open(filepath, 'w') as f:
            f.write(item)
        count += 1
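The cleanup and numbering logic of this step can be tried in isolation, without downloading anything. The sketch below uses Python 3 syntax; clean_item and save_items are hypothetical helper names, the sample strings stand in for real regex captures, and a temporary directory is used so nothing is written next to the script.

```python
import os
import tempfile


def clean_item(item):
    """Drop raw newlines, then turn <br/> tags into real line breaks."""
    return item.replace('\n', '').replace('<br/>', '\n')


def save_items(items, path):
    """Write each cleaned item to path/1.txt, path/2.txt, ... and return the paths."""
    os.makedirs(path, exist_ok=True)
    saved = []
    for count, item in enumerate(items, start=1):
        filepath = os.path.join(path, str(count) + '.txt')
        with open(filepath, 'w', encoding='utf-8') as f:
            f.write(clean_item(item))
        saved.append(filepath)
    return saved


# Hypothetical captured groups, shaped like what the regex returns.
sample_items = ['\nfirst line<br/>second line\n', '\nanother joke\n']
output_dir = os.path.join(tempfile.mkdtemp(), 'qiubai')
saved_paths = save_items(sample_items, output_dir)
```

The order of the two replace() calls matters: removing the raw newlines first keeps the line breaks that come from the <br/> tags.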

Step 4: Scrape the content from multiple pages

# -*- coding: utf-8 -*-
# @Author: HaonanWu
# @Date:  2016-12-22 16:16:08
# @Last Modified by:  HaonanWu
# @Last Modified time: 2016-12-22 20:17:13
import urllib
import urllib2
import re
import os

if __name__ == '__main__':
    # Prepare the output directory, request headers and pattern once
    path = './qiubai'
    if not os.path.exists(path):
        os.makedirs(path)
    user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'
    headers = {'User-Agent': user_agent}
    regex = re.compile('<div class="content">.*?<span>(.*?)</span>.*?</div>', re.S)

    count = 1
    for cnt in range(1, 35):
        print 'Round ' + str(cnt)
        # Request the URL and fetch the page source
        url = 'http://www.qiushibaike.com/textnew/page/' + str(cnt) + '/?s=4941357'
        try:
            request = urllib2.Request(url=url, headers=headers)
            response = urllib2.urlopen(request)
            content = response.read()
        except urllib2.HTTPError as e:
            print e
            exit()
        except urllib2.URLError as e:
            print e
            exit()
        # print content

        # Extract the data
        # Mind the line breaks: re.S lets . match newlines too
        items = re.findall(regex, content)

        # Save the information
        for item in items:
            # print item
            # Tidy the data: drop \n, then turn <br/> into \n
            item = item.replace('\n', '').replace('<br/>', '\n')
            filepath = path + '/' + str(count) + '.txt'
            with open(filepath, 'w') as f:
                f.write(item)
            count += 1
    print 'Done'
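The key point of the multi-page version is that the file counter keeps running across pages while the URL changes per page. The sketch below (Python 3 syntax) isolates that loop. The names page_url, crawl and fake_fetch are hypothetical, and fake_fetch returns made-up markup in place of a real download so the loop can be run offline.

```python
import re

REGEX = re.compile('<div class="content">.*?<span>(.*?)</span>.*?</div>', re.S)


def page_url(page):
    """Build the URL of one listing page, following the article's URL pattern."""
    return 'http://www.qiushibaike.com/textnew/page/' + str(page) + '/?s=4941357'


def crawl(pages, fetch):
    """Collect (running_number, text) pairs from every page.

    `fetch` is any callable mapping a URL to page source; passing it in
    lets the loop be exercised without network access.
    """
    results = []
    count = 1
    for page in pages:
        content = fetch(page_url(page))
        for item in REGEX.findall(content):
            results.append((count, item.replace('\n', '').replace('<br/>', '\n')))
            count += 1
    return results


def fake_fetch(url):
    """Hypothetical stand-in for a real download: two items per page."""
    return ('<div class="content">\n<span>\nA from ' + url + '\n</span>\n</div>'
            '<div class="content">\n<span>\nB from ' + url + '\n</span>\n</div>')


results = crawl(range(1, 3), fake_fetch)
```

Swapping fake_fetch for a function that really downloads the page turns this into the crawler from the listing above.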

Parsing the source with BeautifulSoup

# -*- coding: utf-8 -*-
# @Author: HaonanWu
# @Date:  2016-12-22 16:16:08
# @Last Modified by:  HaonanWu
# @Last Modified time: 2016-12-22 21:34:02
import urllib
import urllib2
import re
import os
from bs4 import BeautifulSoup

if __name__ == '__main__':
    url = 'http://www.qiushibaike.com/textnew/page/1/?s=4941357'
    user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'
    headers = {'User-Agent': user_agent}
    request = urllib2.Request(url=url, headers=headers)
    response = urllib2.urlopen(request)
    # print response.read()
    soup_packetpage = BeautifulSoup(response, 'lxml')
    items = soup_packetpage.find_all("div", class_="content")

    for item in items:
        try:
            content = item.span.string
        except AttributeError as e:
            print e
            exit()

        if content:
            print content + "\n"
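The `if content:` guard in the listing is needed because .string returns None whenever a tag has more than one child, which is the case for a span containing <br/>. The sketch below shows this on hypothetical inline markup, in Python 3 syntax. It uses the built-in 'html.parser' instead of 'lxml' purely to avoid an extra dependency, and get_text() is shown as one possible alternative, not as what the author used.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the second span contains a <br/> tag.
html = '''
<div class="content"><span>a one-line joke</span></div>
<div class="content"><span>first line<br/>second line</span></div>
<div class="author">not a joke</div>
'''

# 'html.parser' ships with Python; the article's listing uses 'lxml' instead.
soup = BeautifulSoup(html, 'html.parser')
items = soup.find_all('div', class_='content')

# .string is None when a tag has more than one child, as with the <br/> above.
strings = [item.span.string for item in items]

# get_text() joins all text pieces, here with a newline between them.
texts = [item.span.get_text('\n') for item in items]

print(strings)
print(texts)
```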

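The following code uses BeautifulSoup to scrape book titles and their prices.
Comparing it with the listing above shows how bs4 reads tags and the content inside them.
(I have not studied this part properly yet, so for now I can only write it by imitation.)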

# -*- coding: utf-8 -*-
# @Author: HaonanWu
# @Date:  2016-12-22 20:37:38
# @Last Modified by:  HaonanWu
# @Last Modified time: 2016-12-22 21:27:30
import urllib2
import urllib
import re

from bs4 import BeautifulSoup

url = "https://www.packtpub.com/all"
try:
    html = urllib2.urlopen(url)
except urllib2.HTTPError as e:
    print e
    exit()

soup_packtpage = BeautifulSoup(html, 'lxml')
all_book_title = soup_packtpage.find_all("div", class_="book-block-title")

price_regexp = re.compile(u"\s+\$\s\d+\.\d+")

for book_title in all_book_title:
    try:
        print "Book's name is " + book_title.string.strip()
    except AttributeError as e:
        print e
        exit()
    book_price = book_title.find_next(text=price_regexp)
    try:
        print "Book's price is " + book_price.strip()
    except AttributeError as e:
        print e
        exit()
    print ""
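How find_next() pairs each title with the price that follows it can be seen on a small inline document. This is a sketch in Python 3 syntax under stated assumptions: the markup is hypothetical (only the book-block-title class name comes from the listing), 'html.parser' replaces 'lxml', and the string= keyword is used, which is the newer spelling of the text= argument in the listing (Beautiful Soup 4.4.0 and later).

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup; only the "book-block-title" class comes from the article.
html = '''
<div class="book-block-title">  Learning Python  </div>
<div class="book-block-price">
  $ 39.99
</div>
<div class="book-block-title">  Mastering Regex  </div>
<div class="book-block-price">
  $ 24.50
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
price_regexp = re.compile(r'\s+\$\s\d+\.\d+')

books = []
for title in soup.find_all('div', class_='book-block-title'):
    # find_next() walks forward in document order from the title tag and
    # returns the first text node that the regex matches.
    price = title.find_next(string=price_regexp)
    books.append((title.string.strip(), price.strip()))

print(books)
```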

That is all for this article. I hope it helps with your studies, and thank you for your support.
