Using a regex to extract the link addresses in a web page and check whether each one is an HTTP address (web page link addresses, Python 2.7.3)

Shared by Byrx.net on 2019-03-23 08:03:09. Comments (539)


Python 2.7.3 (default, Apr 10 2012, 23:31:26) [MSC v.1500 32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> ================================ RESTART ================================
>>>
http://www.baidu.com/gaoji/preferences.html
https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F
https://passport.baidu.com/v2/?reg&regType=1&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F
http://news.baidu.com
http://tieba.baidu.com
http://zhidao.baidu.com
http://mp3.baidu.com
http://image.baidu.com
http://video.baidu.com
http://map.baidu.com
# not an HTTP address
# not an HTTP address
# not an HTTP address
http://baike.baidu.com
http://wenku.baidu.com
http://www.hao123.com
http://www.baidu.com/more/
/ not an HTTP address
http://www.baidu.com/cache/sethelp/index.html
http://www.baidu.com/search/baidukuijie_mp.html
http://e.baidu.com/?refer=888
http://top.baidu.com
http://home.baidu.com
http://ir.baidu.com
/duty/ not an HTTP address
http://www.miibeian.gov.cn
>>>
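In the sample run above, the entries "#", "/" and "/duty/" are flagged because they are relative references rather than absolute addresses. As a minimal sketch that is not part of the original script, the standard library's urlparse and urljoin could classify such links and resolve them against the page address. The helper names is_http_url and resolve are hypothetical, and the imports are written so that the snippet should run on both Python 2.7 and Python 3.

```python
# Sketch only: classify hrefs by scheme and resolve relative ones.
try:
    from urllib.parse import urljoin, urlparse   # Python 3
except ImportError:
    from urlparse import urljoin, urlparse       # Python 2.7


def is_http_url(href):
    """Return True when href is an absolute http or https address."""
    return urlparse(href).scheme in ('http', 'https')


def resolve(base, href):
    """Turn a relative href such as '/duty/' into an absolute address."""
    return urljoin(base, href)


base = 'http://www.baidu.com'   # the page fetched in the sample run
for href in ['http://news.baidu.com', '#', '/', '/duty/']:
    if is_http_url(href):
        print(href)
    else:
        # Relative link: show what it points to on the fetched page
        print('%s -> %s' % (href, resolve(base, href)))
```

Checking the parsed scheme instead of searching for a substring also avoids misjudging links that merely contain "http://" somewhere inside a query string.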

正则获取网页中的链接地址.py (regex extraction of the link addresses in a web page)

##################################################
# qq:316118740
# BLOG:http://hi.baidu.com/alalmn
# Regex: extract the link addresses in a web page and check whether each is an HTTP address
# Beginner code; please excuse the rough edges
##################################################
import re
import urllib2

def URL_STR(data):
    # Return 1 when data is NOT an absolute http:// or https:// address, else 0.
    # The original tested data.find(...) for truthiness, which only worked because
    # find() returns 0 (false) for a match at position 0 and -1 (true) for no match.
    if data.startswith('http://') or data.startswith('https://'):
        return 0  # HTTP prefix found
    return 1  # HTTP prefix not found
##################################################
def URL_DZ(URL):  # walk through the link addresses on the page
    s = urllib2.urlopen(URL)   #s = urllib2.urlopen(r"http://www.163.com")
    ss = s.read()
    # Build and compile the regular expressions
    p = re.compile(r'<a.+?href=.+?>.+?</a>')
    pname = re.compile(r'(?<=>).*?(?=</a>)')
    phref = re.compile(r'(?<=href\=\").*?(?=\")')
    sarr = p.findall(ss)  # find the <a>...</a> tags one by one
    i = 0
    for every in sarr:
        if i > 1000:
            print "More than 1000 URLs found; stopping the scan\n"
            break
        i += 1
        sname = pname.findall(every)  # link text
        shref = phref.findall(every)  # link address (double-quoted href only)
        # Look both up on every pass so a stale shref is never reused
        if sname and shref:
            sname = sname[0]
            shref = shref[0]
            #print sname.decode('gbk'), "\n"  # print the link text
            #print shref  # print the URL
            if URL_STR(shref):
                print shref, "not an HTTP address"
            else:
                print shref  # a proper HTTP address was found
        # The steps above match the text and the address inside each <a></a>
##################################################
URL_DZ("http://www.baidu.com")
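The href pattern in the script only matches attribute values written in double quotes, so links such as href='/duty/' are skipped. As a hedged alternative, not the author's method, the standard library's HTMLParser can collect href values however they are quoted. The names LinkCollector and extract_links and the sample markup are made up for illustration, and the imports are written so that the snippet should run on both Python 2.7 and Python 3.

```python
# Sketch only: collect href values from <a> tags with the stdlib HTML parser.
try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2.7


class LinkCollector(HTMLParser):
    """Remember the href of every <a> start tag that is fed to the parser."""

    def __init__(self):
        # Explicit base call: HTMLParser is an old-style class on Python 2
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        for name, value in attrs:
            # value is None for attributes written without a value
            if name == 'href' and value is not None:
                self.links.append(value)


def extract_links(html):
    """Return the href values of all <a> tags in html, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical markup mixing double quotes, single quotes and a tag with no href
sample = ('<a href="http://news.baidu.com">News</a> '
          "<a href='/duty/'>Terms</a> "
          '<a name="top">Top</a>')
print(extract_links(sample))
```

The list returned by extract_links can then be passed to the same HTTP check that URL_STR performs in the script.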
