python模块之HTMLParser


 


对于python我只是个初学者,由于实践的需要,发现python这个东西对网页的处理,网络编程,http协议测试都非常方便,还有就是web应用的开发框架dijango。刚刚学到HTMLParser这个模块,对于解析html标签非常好用,这里做个小总结吧,共学习参考。

1.基础api介绍
        HTMLParser是python用来解析html的模块。它可以分析出html里面的标签、数据等等,是一种处理html的简便途径。 HTMLParser采用的是一种事件驱动的模式,当TMLParser找到一个特定的标记时,它会去调用一个用户定义的函数,以此来通知程序处理。它 主要的用户回调函数的命名都是以handler_开头的,都是HTMLParser的成员函数。当我们使用时,就从HTMLParser派生出新的类,然 后重新定义这几个以handler_开头的函数即可。
handle_startendtag 处理开始标签和结束标签
handle_starttag     处理开始标签,比如<xx>
handle_endtag       处理结束标签,比如</xx>
handle_charref      处理特殊字符串,就是以&#开头的,一般是内码表示的字符
handle_entityref    处理一些特殊字符,以&开头的,比如 &nbsp;
handle_data         处理数据,就是<xx>data</xx>中间的那些数据
handle_comment      处理注释
handle_decl         处理<!开头的,比如<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
handle_pi           处理形如<?instruction>的东西

 

[python] view plaincopyprint?
>>> help(HTMLParser.HTMLParser.handle_endtag) 
Help on method handle_endtag in module HTMLParser: 
 
handle_endtag(self, tag) unbound HTMLParser.HTMLParser method 
    # Overridable -- handle end tag  
 
>>> help(HTMLParser.HTMLParser.handle_data) 
Help on method handle_data in module HTMLParser: 
 
handle_data(self, data) unbound HTMLParser.HTMLParser method 
    # Overridable -- handle data  
 
>>> help(HTMLParser.HTMLParser.handle_charref) 
Help on method handle_charref in module HTMLParser: 
 
handle_charref(self, name) unbound HTMLParser.HTMLParser method 
    # Overridable -- handle character reference  
 
>>> help(HTMLParser.HTMLParser.handle_decl) 
Help on method handle_decl in module HTMLParser: 
 
handle_decl(self, decl) unbound HTMLParser.HTMLParser method 
    # Overridable -- handle declaration  
 
>>> help(HTMLParser.HTMLParser.handle_startendtag) 
Help on method handle_startendtag in module HTMLParser: 
 
handle_startendtag(self, tag, attrs) unbound HTMLParser.HTMLParser method 
    # Overridable -- finish processing of start+end tag: <tag.../> 

>>> help(HTMLParser.HTMLParser.handle_endtag)
Help on method handle_endtag in module HTMLParser:

handle_endtag(self, tag) unbound HTMLParser.HTMLParser method
    # Overridable -- handle end tag

>>> help(HTMLParser.HTMLParser.handle_data)
Help on method handle_data in module HTMLParser:

handle_data(self, data) unbound HTMLParser.HTMLParser method
    # Overridable -- handle data

>>> help(HTMLParser.HTMLParser.handle_charref)
Help on method handle_charref in module HTMLParser:

handle_charref(self, name) unbound HTMLParser.HTMLParser method
    # Overridable -- handle character reference

>>> help(HTMLParser.HTMLParser.handle_decl)
Help on method handle_decl in module HTMLParser:

handle_decl(self, decl) unbound HTMLParser.HTMLParser method
    # Overridable -- handle declaration

>>> help(HTMLParser.HTMLParser.handle_startendtag)
Help on method handle_startendtag in module HTMLParser:

handle_startendtag(self, tag, attrs) unbound HTMLParser.HTMLParser method
    # Overridable -- finish processing of start+end tag: <tag.../>

 

2.简单代码入门

先给个简单的示例进入一下状态吧


[python] view plaincopyprint?
#!/usr/bin/env python   
#coding=utf-8   
#python version:2.7.4  
#system:windows xp  
 
import HTMLParser 
 
class MyParser(HTMLParser.HTMLParser): 
    def __init__(self): 
        HTMLParser.HTMLParser.__init__(self)         
         
    def handle_starttag(self, tag, attrs): 
        # 这里重新定义了处理开始标签的函数  
        if tag == 'a': 
            # 判断标签<a>的属性  
            for name,value in attrs: 
                if name == 'href': 
                    print value 
         
if __name__ == '__main__': 
    a = '<html><head><title>test</title><body><a href="http://www.163.com">链接到163</a></body></html>' 
     
    my = MyParser() 
    # 传入要分析的数据,是html的。  
    my.feed(a) 

#!/usr/bin/env python
#coding=utf-8
#python version:2.7.4
#system:windows xp

import HTMLParser

class MyParser(HTMLParser.HTMLParser):
    def __init__(self):
        HTMLParser.HTMLParser.__init__(self)       
       
    def handle_starttag(self, tag, attrs):
        # 这里重新定义了处理开始标签的函数
        if tag == 'a':
            # 判断标签<a>的属性
            for name,value in attrs:
                if name == 'href':
                    print value
       
if __name__ == '__main__':
    a = '<html><head><title>test</title><body><a href="http://www.163.com">链接到163</a></body></html>'
   
    my = MyParser()
    # 传入要分析的数据,是html的。
    my.feed(a)

我相信是搞计算机的调试一下就应该OK了吧,里面还有部分注释,应该没问题

3.oschina实际示例
再来看看我写的一个关于网页分析的示例,完整的,我拿http://www.oschina.net/blog/more?p=1这个网站做为示例,当然在oschina上有写过类似的代码,但我觉得不怎么好,我是用栈来实现 ,自已感觉比较清晰,分享一下哈

代码任务:
抓取最新博客的相关信息(标题、作者、概要信息)网页源码如下图所示:

 

 

 

示例代码:

[python] view plaincopyprint?
#!/usr/bin/env python   
#coding=utf-8   
#python version:2.7.4  
#system:windows xp  
 
import sys 
import httplib2 
import HTMLParser 
def getPageContent(url): 
    '''''
    使用httplib2用编程的方式根据url获取网页内容
    将bytes形式的内容转换成utf-8的字符串
    ''' 
    #使用ie9的user-agent,如果不设置user-agent将会得到403禁止访问  
    headers={'user-agent':'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)', 
            'cache-control':'no-cache'} 
    if url: 
         response,content = httplib2.Http().request(url,headers=headers) 
          
         if response.status == 200 : 
            return content 
 
class stack: 
    def __init__(self,size=100,list=None): 
        self.contain=[] 
        self.msize=size; 
        self.top = 0; 
    def getTop(self): 
        if(self.top>0): 
            return self.contain[self.top-1] 
        else: 
            return None 
    def getLength(self): 
        return len(self.contain); 
    def push(self,data): 
        if(self.top==self.msize): 
            return -1 
        self.contain.append(data) 
        self.top=self.top +1 
    def pop(self): 
        try: 
            res=self.contain.pop() 
            if(self.top>0): 
                self.top=self.top-1 
            return res; 
        except IndexError: 
            return None 
         
 
class ParserOschinaNew(HTMLParser.HTMLParser): 
    def __init__(self): 
        HTMLParser.HTMLParser.__init__(self) 
        self.st=stack(size=1000) 
        self.st.push('over') 
         
    def handle_starttag(self,tag,attrs): 
        stack_size = self.st.getLength() 
        if stack_size==1 and tag=='ul': 
            for name,value in attrs: 
                if (name=='class' and value=='BlogList'): 
                    self.st.push('ul') 
                    break; 
        if (stack_size==2 and tag=='li' ): 
            self.st.push('li') 
        if (stack_size==3 and tag=='h3' ): 
            self.st.push('h3') 
            text = '博客标题:'.decode('utf-8').encode('gb2312','ignore') 
            print '%s'%text 
        if (stack_size==3 and tag=='p' ): 
            self.st.push('p') 
            text = '正文部分:'.decode('utf-8').encode('gb2312','ignore') 
            print '%s'%text 
 
        if (stack_size==3 and tag=='div' ): 
            for name,value in attrs: 
                if (name=='class' and value=='date'): 
                    self.st.push('div') 
                    text = '作者:'.decode('utf-8').encode('gb2312','ignore') 
                    print '%s'%text 
                     
             
    def handle_data(self ,data): 
        stack_size = self.st.getLength() 
        if (stack_size==4): 
            print data.decode('utf-8').encode('gb2312','ignore')   
 
    def handle_endtag(self,tag): 
        stack_size = self.st.getLength() 
        stack_tag = self.st.getTop() 
        if ('h3'==tag and 'h3'==stack_tag): 
            self.st.pop() 
        if ('p'==tag and 'p'==stack_tag): 
            self.st.pop() 
        if ('div'==tag and 'div'==stack_tag): 
            self.st.pop() 
         
        if ('li'==tag and 'li'==stack_tag): 
            self.st.pop() 
        if ('ul'==tag and 'ul'==stack_tag): 
            self.st.pop() 
            if('over'==self.st.getTop()):           
                print "this is end!" 
         
 
 
if __name__ == '__main__': 
    pageC = getPageContent('http://www.oschina.net/blog/more?p=1') 
 #   pageC = pageC.decode('utf-8').encode('gb2312','ignore')     
    my = ParserOschinaNew() 
    my.feed(pageC) 

#!/usr/bin/env python
#coding=utf-8
#python version:2.7.4
#system:windows xp

import sys
import httplib2
import HTMLParser
def getPageContent(url):
    '''
    使用httplib2用编程的方式根据url获取网页内容
    将bytes形式的内容转换成utf-8的字符串
    '''
    #使用ie9的user-agent,如果不设置user-agent将会得到403禁止访问
    headers={'user-agent':'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
            'cache-control':'no-cache'}
    if url:
         response,content = httplib2.Http().request(url,headers=headers)
        
         if response.status == 200 :
            return content

class stack:
    def __init__(self,size=100,list=None):
        self.contain=[]
        self.msize=size;
        self.top = 0;
    def getTop(self):
        if(self.top>0):
            return self.contain[self.top-1]
        else:
            return None
    def getLength(self):
        return len(self.contain);
    def push(self,data):
        if(self.top==self.msize):
            return -1
        self.contain.append(data)
        self.top=self.top +1
    def pop(self):
        try:
            res=self.contain.pop()
            if(self.top>0):
                self.top=self.top-1
            return res;
        except IndexError:
            return None
       

class ParserOschinaNew(HTMLParser.HTMLParser):
    def __init__(self):
        HTMLParser.HTMLParser.__init__(self)
        self.st=stack(size=1000)
        self.st.push('over')
       
    def handle_starttag(self,tag,attrs):
        stack_size = self.st.getLength()
        if stack_size==1 and tag=='ul':
            for name,value in attrs:
                if (name=='class' and value=='BlogList'):
                    self.st.push('ul')
                    break;
        if (stack_size==2 and tag=='li' ):
            self.st.push('li')
        if (stack_size==3 and tag=='h3' ):
            self.st.push('h3')
            text = '博客标题:'.decode('utf-8').encode('gb2312','ignore')
            print '%s'%text
        if (stack_size==3 and tag=='p' ):
            self.st.push('p')
            text = '正文部分:'.decode('utf-8').encode('gb2312','ignore')
            print '%s'%text

        if (stack_size==3 and tag=='div' ):
            for name,value in attrs:
                if (name=='class' and value=='date'):
                    self.st.push('div')
                    text = '作者:'.decode('utf-8').encode('gb2312','ignore')
                    print '%s'%text
                   
           
    def handle_data(self ,data):
        stack_size = self.st.getLength()
        if (stack_size==4):
            print data.decode('utf-8').encode('gb2312','ignore') 

    def handle_endtag(self,tag):
        stack_size = self.st.getLength()
        stack_tag = self.st.getTop()
        if ('h3'==tag and 'h3'==stack_tag):
            self.st.pop()
        if ('p'==tag and 'p'==stack_tag):
            self.st.pop()
        if ('div'==tag and 'div'==stack_tag):
            self.st.pop()
       
        if ('li'==tag and 'li'==stack_tag):
            self.st.pop()
        if ('ul'==tag and 'ul'==stack_tag):
            self.st.pop()
            if('over'==self.st.getTop()):         
                print "this is end!"
       


if __name__ == '__main__':
    pageC = getPageContent('http://www.oschina.net/blog/more?p=1')
 #   pageC = pageC.decode('utf-8').encode('gb2312','ignore')  
    my = ParserOschinaNew()
    my.feed(pageC)

 

相关内容

    暂无相关文章

评论关闭