A simple Python crawler
Shared by Byrx.net on 2019-03-23 05:03:43
import re
import urllib.request
from collections import deque

queue = deque()   # URLs waiting to be crawled
visited = set()   # URLs already crawled, used to skip repeats

url = "http://news.dbanotes.net"  # entry page
queue.append(url)
count = 1

while queue:
    url = queue.popleft()  # take the URL at the head of the queue
    visited |= {url}       # record it in the set for the check below

    urlop = urllib.request.urlopen(url)
    # Only keep crawling if the response is HTML (the header may be missing)
    if 'html' not in (urlop.getheader('Content-Type') or ''):
        continue
    try:
        data = urlop.read().decode('utf-8')
    except Exception:
        continue  # skip pages that cannot be read or decoded as UTF-8

    value = re.findall(r'href="(.+?)"', data)
    for x in value:
        if 'http' in x and x not in visited:
            print("Added to queue: " + x)
            queue.append(x)
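The crawl loop above only queues links whose `href` contains `http`, so relative links such as `/about` are skipped, and the same link can be queued more than once. The sketch below shows one possible way to extract links that handles both cases. The helper name `extract_links` and the sample HTML are hypothetical and not part of the original script; only `re` and `urllib.parse` from the standard library are used.

```python
import re
from urllib.parse import urljoin, urlparse

# Same pattern as the crawler above: capture the value of href="..."
HREF_RE = re.compile(r'href="(.+?)"')


def extract_links(html, base_url):
    """Return absolute http(s) links found in html, in order, without duplicates.

    Hypothetical helper for illustration; not part of the original script.
    """
    links = []
    seen = set()
    for href in HREF_RE.findall(html):
        # Resolve relative links such as "/about" against the page URL
        absolute = urljoin(base_url, href)
        # Drop non-web schemes such as mailto: or javascript:
        if urlparse(absolute).scheme not in ("http", "https"):
            continue
        if absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links


# Made-up sample page used only to exercise the helper
sample = (
    '<a href="/about">About</a> '
    '<a href="http://example.com/x">X</a> '
    '<a href="mailto:a@b.c">Mail</a> '
    '<a href="/about">Again</a>'
)
print(extract_links(sample, "http://news.dbanotes.net/"))
```

In the crawler, a helper like this could replace the `re.findall` call, with each returned link checked against `visited` before it is appended to `queue`.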