Python Show-Me-the-Code, Problem 0009: Extracting Hyperlinks from a Web Page

Shared by Byrx.net on 2019-03-22 02:03:48. Comments (268)

Problem 0009: Given an HTML file, find the links in it.

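Approach: To extract the hyperlinks from a web page, first read the page content, then parse it with BeautifulSoup, which is fairly convenient. I ran into one problem, though: taking the href of every a tag directly also picks up values such as javascript:xxx and #xxx, so these need special handling.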
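The special cases mentioned above can be told apart with a small helper. This is only a minimal sketch: the function name `classify_href` and the sample values are hypothetical and chosen for illustration. It matches the full `javascript:` prefix rather than just a leading `j`, so an ordinary link such as `jobs.html` is not discarded by accident.

```python
def classify_href(href):
    """Return 'script', 'fragment', or 'link' for a raw href value."""
    value = href.strip()
    # javascript: pseudo-links run code instead of navigating
    if value.lower().startswith('javascript:'):
        return 'script'
    # A leading '#' is an anchor inside the current page
    if value.startswith('#'):
        return 'fragment'
    return 'link'


# Illustrative sample values, not taken from a real page
samples = ['javascript:void(0)', '#comment-text', '/feed.html', 'jobs.html']
for s in samples:
    print(s, '->', classify_href(s))
```

Values classified as 'script' would be dropped, while 'fragment' values would be completed with the URL of the page they came from.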

0009.提取网页中的超链接.py

#!/usr/bin/env python3
# coding: utf-8
from urllib.parse import urlsplit
from urllib.request import urlopen

from bs4 import BeautifulSoup

# URL of the page to analyze
url = 'http://www.ruanyifeng.com/blog/2015/05/co.html'


def findAllLink(url):
    '''
    Extract the hyperlinks from a web page.
    '''
    # Get the scheme and the domain
    parts = urlsplit(url)
    proto, domain = parts.scheme, parts.netloc
    # Read the page content
    html = urlopen(url).read()
    # Extract only the <a> tags that actually have an href attribute
    a = BeautifulSoup(html, 'html.parser').find_all('a', href=True)
    # Filter out empty hrefs and javascript: pseudo-links
    alist = [i['href'] for i in a
             if i['href'] and not i['href'].lower().startswith('javascript:')]
    # Complete an anchor such as #comment-text to
    # http://www.ruanyifeng.com/blog/2015/05/co.html#comment-text, and a
    # root-relative path such as /feed.html to http://www.ruanyifeng.com/feed.html
    alist = [proto + '://' + domain + i if i[0] == '/'
             else url + i if i[0] == '#'
             else i
             for i in alist]
    return alist


if __name__ == '__main__':
    for i in findAllLink(url):
        print(i)

I tested it on an article from Ruan Yifeng's blog; the result is shown below:
[Image: screenshot of the output]
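Since the problem statement speaks of an HTML file rather than a live URL, the same idea can also be sketched with the standard library alone. This is an alternative under stated assumptions, not the approach used in the listing above: it swaps BeautifulSoup for `html.parser.HTMLParser` and completes links with `urllib.parse.urljoin`, which also resolves plain relative paths. The names `LinkCollector` and `find_links` and the sample markup are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from the href attributes of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href')
        # Skip missing or empty hrefs
        if not href or not href.strip():
            return
        href = href.strip()
        # Skip javascript: pseudo-links
        if href.lower().startswith('javascript:'):
            return
        # urljoin resolves fragments, root-relative paths and relative paths
        self.links.append(urljoin(self.base_url, href))


def find_links(html_text, base_url):
    """Return the absolute links found in html_text, in document order."""
    parser = LinkCollector(base_url)
    parser.feed(html_text)
    parser.close()
    return parser.links


# Illustrative markup; only the base URL comes from the article
sample = '''
<a href="#comment-text">comments</a>
<a href="/feed.html">feed</a>
<a href="async.html">next</a>
<a href="javascript:void(0)">menu</a>
<a name="top">no href</a>
'''
base = 'http://www.ruanyifeng.com/blog/2015/05/co.html'
for link in find_links(sample, base):
    print(link)
```

For a file on disk, the text passed to `find_links` would be read from the file first, with the base URL being whatever address the file was originally served from.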
