侯体宗的博客

A Summary of Several Ways to Parse HTML in Python 3

Python  /  Posted by administrator 7 years ago   136

Parsing HTML is an important data-processing step after a crawler has fetched a page. This post records several ways to parse HTML.

First, a basic helper function that fetches the HTML and prints the parsed result.

    import gzip
    import urllib.request


    # The parse function is passed in as an argument so it is easy to swap below.
    # bs4_paraser and lxml_parser are defined in the sections that follow and
    # must exist before this code runs.
    def get_html(url, paraser=bs4_paraser):
        headers = {
            'Accept': '*/*',
            # Only gzip is advertised because only gzip is handled below.
            'Accept-Encoding': 'gzip',
            'Accept-Language': 'zh-CN,zh;q=0.8',
            'Host': 'www.360kan.com',
            'Proxy-Connection': 'keep-alive',
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
        }
        request = urllib.request.Request(url, headers=headers)
        response = urllib.request.urlopen(request)
        if response.status != 200:
            return []
        data = response.read()
        # Decompress only when the server actually gzip-compressed the body.
        if response.headers.get('Content-Encoding') == 'gzip':
            data = gzip.decompress(data)
        return paraser(data)  # for a saved copy: open('E:/h5/haPkY0osd0r5UB.html').read()


    value = get_html('http://www.360kan.com/m/haPkY0osd0r5UB.html', paraser=lxml_parser)
    for row in value:
        print(row)
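The gzip step of the helper can be tried in isolation, without any network access. The following is a minimal Python 3 sketch, assuming the server reports compression through the Content-Encoding header and that the page is UTF-8; `decode_body` and its parameters are hypothetical names used only for illustration.

```python
import gzip


def decode_body(raw, content_encoding=None, charset='utf-8'):
    """Return the response body as text, decompressing gzip when needed.

    decode_body is a hypothetical helper for this sketch, not part of the post.
    """
    if content_encoding == 'gzip':
        raw = gzip.decompress(raw)
    return raw.decode(charset)


# Simulate a gzip-compressed response body without touching the network.
sample = '<div class="item">review</div>'
compressed = gzip.compress(sample.encode('utf-8'))
print(decode_body(compressed, 'gzip') == sample)
print(decode_body(sample.encode('utf-8')) == sample)
```

Checking the header before decompressing avoids an error when a server ignores the Accept-Encoding request header and sends the body uncompressed.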

1. Parsing with lxml (etree.HTML plus XPath)

The lxml XML toolkit is a Pythonic binding for the C libraries libxml2 and libxslt. It is unique in that it combines the speed and XML feature completeness of these libraries with the simplicity of a native Python API, mostly compatible but superior to the well-known ElementTree API. The latest release works with all CPython versions from 2.6 to 3.5. See the introduction for more information about background and goals of the lxml project. Some common questions are answered in the FAQ. [Official site](http://lxml.de/)

    import re

    from lxml import etree


    def lxml_parser(page):
        data = []
        doc = etree.HTML(page)
        all_div = doc.xpath('//div[@class="yingping-list-wrap"]')
        for row in all_div:
            # Get every review, i.e. each review item
            all_div_item = row.xpath('.//div[@class="item"]')  # find_all('div', attrs={'class': 'item'})
            for r in all_div_item:
                value = {}
                # Get the title block of the review
                title = r.xpath('.//div[@class="g-clear title-wrap"][1]')
                value['title'] = title[0].xpath('./a/text()')[0]
                value['title_href'] = title[0].xpath('./a/@href')[0]
                score_text = title[0].xpath('./div/span/span/@style')[0]
                score_text = re.search(r'\d+', score_text).group()
                value['score'] = int(score_text) / 20
                # Time
                value['time'] = title[0].xpath('./div/span[@class="time"]/text()')[0]
                # Number of people who liked the review
                value['people'] = int(
                    re.search(r'\d+', title[0].xpath('./div[@class="num"]/span/text()')[0]).group())
                data.append(value)
        return data
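The pattern used above (select a container, then run relative queries on each item) can be tried without installing lxml. The sketch below swaps in the standard library's xml.etree.ElementTree, which the lxml description says its API is mostly compatible with. It is a stand-in, not lxml: it accepts only well-formed markup and supports a subset of XPath (no `text()` or `@href` steps), so `.text` and `.get()` are used instead. The sample markup and the name `parse_reviews` are assumptions modelled on the class names in the post.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample modelled on the class names used above.
SAMPLE = """
<div class="yingping-list-wrap">
  <div class="item">
    <div class="g-clear title-wrap">
      <a href="/review/1">Great film</a>
      <div class="num"><span>12 likes</span></div>
    </div>
  </div>
  <div class="item">
    <div class="g-clear title-wrap">
      <a href="/review/2">Not bad</a>
      <div class="num"><span>3 likes</span></div>
    </div>
  </div>
</div>
"""


def parse_reviews(markup):
    root = ET.fromstring(markup.strip())
    rows = []
    # Relative path: search only below the current node, as './/' does in lxml.
    for item in root.findall('.//div[@class="item"]'):
        link = item.find('./div[@class="g-clear title-wrap"]/a')
        likes = item.find('.//div[@class="num"]/span')
        rows.append({
            'title': link.text,
            'title_href': link.get('href'),
            'people': int(re.search(r'\d+', likes.text).group()),
        })
    return rows


for row in parse_reviews(SAMPLE):
    print(row)
```

Real pages are rarely well-formed XML, which is one reason to prefer lxml's HTML parser or BeautifulSoup for actual crawling.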

2. Using BeautifulSoup. I will not go into detail here; plenty of material is available online.

    import re

    from bs4 import BeautifulSoup


    def bs4_paraser(html):
        all_value = []
        soup = BeautifulSoup(html, 'html.parser')
        # Get the review section
        all_div = soup.find_all('div', attrs={'class': 'yingping-list-wrap'}, limit=1)
        for row in all_div:
            # Get every review, i.e. each review item
            all_div_item = row.find_all('div', attrs={'class': 'item'})
            for r in all_div_item:
                value = {}
                # Get the title block of the review
                title = r.find_all('div', attrs={'class': 'g-clear title-wrap'}, limit=1)
                if title:
                    value['title'] = title[0].a.string
                    value['title_href'] = title[0].a['href']
                    score_text = title[0].div.span.span['style']
                    score_text = re.search(r'\d+', score_text).group()
                    value['score'] = int(score_text) / 20
                    # Time
                    value['time'] = title[0].div.find_all('span', attrs={'class': 'time'})[0].string
                    # Number of people who liked the review
                    value['people'] = int(
                        re.search(r'\d+', title[0].find_all('div', attrs={'class': 'num'})[0].span.string).group())
                # print(r)
                all_value.append(value)
        return all_value
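Both parsers above derive the score by pulling the first run of digits out of a `style` attribute and dividing by 20. The sketch below isolates that step. It assumes the style looks like `width:80%` (the post does not show the page markup, so this value is a guess), and `score_from_style` is a hypothetical name.

```python
import re


def score_from_style(style, scale=20):
    """Turn a style string such as 'width:80%' into a score.

    Returns None when the style contains no digits instead of raising.
    """
    match = re.search(r'\d+', style)
    if match is None:
        return None
    return int(match.group()) / scale


print(score_from_style('width:80%'))   # 4.0
print(score_from_style('color:red'))   # None
```

In Python 3 the `/` operator always returns a float, so 80 becomes 4.0; use `//` if an integer score is wanted, which is what the same expression produced under Python 2.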

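3. Using SGMLParser. It works mainly through start-tag and end-tag callbacks. The parsing process is fairly easy to follow, but it is somewhat tedious, and this use case is not a good fit for the approach (haha). Note that the sgmllib module that provides SGMLParser exists only in Python 2; it was removed in Python 3.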

    import re

    from sgmllib import SGMLParser  # Python 2 only; sgmllib was removed in Python 3


    class CommentParaser(SGMLParser):
        def __init__(self):
            SGMLParser.__init__(self)
            self.__start_div_yingping = False
            self.__start_div_item = False
            self.__start_div_gclear = False
            self.__start_div_ratingwrap = False
            self.__start_div_num = False
            # a
            self.__start_a = False
            # Current span state (0 = none, 1 = rating, 2 = time, 3 = score, 4 = like count)
            self.__span_state = 0
            # Data
            self.__value = {}
            self.data = []

        def start_div(self, attrs):
            for k, v in attrs:
                if k == 'class' and v == 'yingping-list-wrap':
                    self.__start_div_yingping = True
                elif k == 'class' and v == 'item':
                    self.__start_div_item = True
                elif k == 'class' and v == 'g-clear title-wrap':
                    self.__start_div_gclear = True
                elif k == 'class' and v == 'rating-wrap g-clear':
                    self.__start_div_ratingwrap = True
                elif k == 'class' and v == 'num':
                    self.__start_div_num = True

        def end_div(self):
            # Close the innermost open section first, then work outwards.
            if self.__start_div_yingping:
                if self.__start_div_item:
                    if self.__start_div_gclear:
                        if self.__start_div_num or self.__start_div_ratingwrap:
                            if self.__start_div_num:
                                self.__start_div_num = False
                            if self.__start_div_ratingwrap:
                                self.__start_div_ratingwrap = False
                        else:
                            self.__start_div_gclear = False
                    else:
                        # End of one review item: store it and start a new one.
                        self.data.append(self.__value)
                        self.__value = {}
                        self.__start_div_item = False
                else:
                    self.__start_div_yingping = False

        def start_a(self, attrs):
            if self.__start_div_yingping and self.__start_div_item and self.__start_div_gclear:
                self.__start_a = True
                for k, v in attrs:
                    if k == 'href':
                        self.__value['href'] = v

        def end_a(self):
            if self.__start_div_yingping and self.__start_div_item and self.__start_div_gclear and self.__start_a:
                self.__start_a = False

        def start_span(self, attrs):
            if self.__start_div_yingping and self.__start_div_item and self.__start_div_gclear:
                if self.__start_div_ratingwrap:
                    if self.__span_state != 1:
                        for k, v in attrs:
                            if k == 'class' and v == 'rating':
                                self.__span_state = 1
                            elif k == 'class' and v == 'time':
                                self.__span_state = 2
                    else:
                        for k, v in attrs:
                            if k == 'style':
                                # Set the score only when a style attribute is present.
                                score_text = re.search(r'\d+', v).group()
                                self.__value['score'] = int(score_text) / 20
                        self.__span_state = 3
                elif self.__start_div_num:
                    self.__span_state = 4

        def end_span(self):
            self.__span_state = 0

        def handle_data(self, data):
            if self.__start_a:
                self.__value['title'] = data
            elif self.__span_state == 2:
                self.__value['time'] = data
            elif self.__span_state == 4:
                score_text = re.search(r'\d+', data).group()
                self.__value['people'] = int(score_text)


    def sgl_parser(html):
        parser = CommentParaser()
        parser.feed(html)
        return parser.data
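Because sgmllib is not available in Python 3, the per-tag `start_xxx` / `end_xxx` callback style shown above can be approximated on top of html.parser. The following is a minimal sketch of that idea, not the author's code: `TagDispatchParser` and `LinkCollector` are hypothetical classes, and the sample markup is made up.

```python
from html.parser import HTMLParser


class TagDispatchParser(HTMLParser):
    """Route tags to start_<tag>/end_<tag> methods, SGMLParser-style.

    TagDispatchParser is a hypothetical helper, not part of the standard library.
    """

    def handle_starttag(self, tag, attrs):
        handler = getattr(self, 'start_' + tag, None)
        if handler is not None:
            handler(attrs)

    def handle_endtag(self, tag):
        handler = getattr(self, 'end_' + tag, None)
        if handler is not None:
            handler()


class LinkCollector(TagDispatchParser):
    """Collect the href of every <a> tag through a start_a callback."""

    def __init__(self):
        super().__init__()
        self.links = []

    def start_a(self, attrs):
        for k, v in attrs:
            if k == 'href':
                self.links.append(v)


collector = LinkCollector()
collector.feed('<div><a href="/a">A</a><span>x</span><a href="/b">B</a></div>')
print(collector.links)
```

Tags without a matching method are simply ignored, so a subclass only defines callbacks for the tags it cares about.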

4. HTMLParser. The principle is similar to method 3; only the callback methods differ, and most of the logic can be shared.

    import re

    from html.parser import HTMLParser


    class CommentHTMLParser(HTMLParser):
        def __init__(self):
            super().__init__()
            self.__start_div_yingping = False
            self.__start_div_item = False
            self.__start_div_gclear = False
            self.__start_div_ratingwrap = False
            self.__start_div_num = False
            # a
            self.__start_a = False
            # Current span state (0 = none, 1 = rating, 2 = time, 3 = score, 4 = like count)
            self.__span_state = 0
            # Data
            self.__value = {}
            self.data = []

        def handle_starttag(self, tag, attrs):
            if tag == 'div':
                for k, v in attrs:
                    if k == 'class' and v == 'yingping-list-wrap':
                        self.__start_div_yingping = True
                    elif k == 'class' and v == 'item':
                        self.__start_div_item = True
                    elif k == 'class' and v == 'g-clear title-wrap':
                        self.__start_div_gclear = True
                    elif k == 'class' and v == 'rating-wrap g-clear':
                        self.__start_div_ratingwrap = True
                    elif k == 'class' and v == 'num':
                        self.__start_div_num = True
            elif tag == 'a':
                if self.__start_div_yingping and self.__start_div_item and self.__start_div_gclear:
                    self.__start_a = True
                    for k, v in attrs:
                        if k == 'href':
                            self.__value['href'] = v
            elif tag == 'span':
                if self.__start_div_yingping and self.__start_div_item and self.__start_div_gclear:
                    if self.__start_div_ratingwrap:
                        if self.__span_state != 1:
                            for k, v in attrs:
                                if k == 'class' and v == 'rating':
                                    self.__span_state = 1
                                elif k == 'class' and v == 'time':
                                    self.__span_state = 2
                        else:
                            for k, v in attrs:
                                if k == 'style':
                                    # Set the score only when a style attribute is present.
                                    score_text = re.search(r'\d+', v).group()
                                    self.__value['score'] = int(score_text) / 20
                            self.__span_state = 3
                    elif self.__start_div_num:
                        self.__span_state = 4

        def handle_endtag(self, tag):
            if tag == 'div':
                # Close the innermost open section first, then work outwards.
                if self.__start_div_yingping:
                    if self.__start_div_item:
                        if self.__start_div_gclear:
                            if self.__start_div_num or self.__start_div_ratingwrap:
                                if self.__start_div_num:
                                    self.__start_div_num = False
                                if self.__start_div_ratingwrap:
                                    self.__start_div_ratingwrap = False
                            else:
                                self.__start_div_gclear = False
                        else:
                            # End of one review item: store it and start a new one.
                            self.data.append(self.__value)
                            self.__value = {}
                            self.__start_div_item = False
                    else:
                        self.__start_div_yingping = False
            elif tag == 'a':
                if self.__start_div_yingping and self.__start_div_item and self.__start_div_gclear and self.__start_a:
                    self.__start_a = False
            elif tag == 'span':
                self.__span_state = 0

        def handle_data(self, data):
            if self.__start_a:
                self.__value['title'] = data
            elif self.__span_state == 2:
                self.__value['time'] = data
            elif self.__span_state == 4:
                score_text = re.search(r'\d+', data).group()
                self.__value['people'] = int(score_text)


    def html_parser(html):
        # HTMLParser.feed() expects text in Python 3, so decode raw bytes first.
        if isinstance(html, bytes):
            html = html.decode('utf-8')
        parser = CommentHTMLParser()
        parser.feed(html)
        return parser.data
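The boolean flags above have to be reset in exactly the right order as nested `div` tags close. One alternative for event-based parsers, sketched here as an illustration rather than as the author's method, is to count nesting depth. `ItemTextParser` and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class ItemTextParser(HTMLParser):
    """Collect the text inside each <div class="item">, nested divs included."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # how many <div> levels deep inside the current item
        self.parts = []       # text fragments of the current item
        self.items = []       # finished items

    def handle_starttag(self, tag, attrs):
        if tag != 'div':
            return
        if self.depth > 0:
            # Already inside an item: one more nested div.
            self.depth += 1
        elif dict(attrs).get('class') == 'item':
            self.depth = 1

    def handle_endtag(self, tag):
        if tag != 'div' or self.depth == 0:
            return
        self.depth -= 1
        if self.depth == 0:
            # The item's own closing tag: store its text and reset.
            self.items.append(' '.join(self.parts))
            self.parts = []

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.parts.append(data.strip())


parser = ItemTextParser()
parser.feed('<div class="item"><div><a>Title</a></div><div>12 likes</div></div>'
            '<div class="item">Second</div>')
print(parser.items)
```

A single counter knows when the item's own closing tag arrives regardless of how many inner `div` elements it contains, which removes the need for one flag per nesting level.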

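Methods 3 and 4 are really not a good fit for this use case. I am writing them down while I have some free time, for learning purposes.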




    Copyright © 2019 侯体宗. All rights reserved. 粤ICP备20027696号
