Python code snippet for scraping keywords
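Python is a good fit for data scraping and mining. Below is a code snippet that uses Python to scrape Baidu search results for configured keywords.
编橙之家's earlier Python crawler video tutorial series also covers data scraping with Python; interested readers may want to take a look.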
# -*- coding: UTF-8 -*-
# Python, UTF-8 source (Python 2 syntax)
# key.txt is the scraping configuration file
import cgi, urllib  # URL reading
import re  # regular expression matching
import MySQLdb  # MySQL
import datetime  # dates and times
#import time,thread  # multithreading

"""
MySQL table structure

CREATE TABLE `baidu` (
  `id` int(10) unsigned NOT NULL auto_increment,
  `url` varchar(200) NOT NULL,
  `title` varchar(600) NOT NULL,
  `keys` varchar(100) NOT NULL,
  `bdurl` varchar(200) NOT NULL,
  `date` date NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 AUTO_INCREMENT=1 ;
"""

def Yang_Config():
    fp = open('key.txt', 'r')
    for line in fp.read().split('@'):
        word = line.split(',')  # word is a list: [site, keyword]
        #for item in word:
        #print item.encode("UTF-8")
        #print '------'
        if len(word) > 1:
            yang_u = word[0]
            yang_k = word[1]
            Yang_Spider(yang_u, yang_k)

# Start scraping the page
def Yang_Spider(yang_u, yang_k):
    url = 'http://www.baidu.com/s?wd=%s+site:%s&&rn=100' % (yang_k, yang_u)
    print url
    fp = urllib.urlopen(url).read()
    #print fp
    #re.search
    # Groups: 1 = result id, 2 = optional <font> prefix, 3 = href, 4 = title, 5-8 = date.
    # If the page is GBK, as the comments below assume, the two relative-time
    # alternatives only match when this pattern is encoded the same way.
    pattern = (
        r"<table cellpadding=\"0\" cellspacing=\"0\" class=\"result\" id=\"(\d+)\"\s*?>"
        r"<tr><td class=f><h3 class=\"t\">(<font.*?<\/font>)?"
        r"<a.*?href=\"(.*?)\"\s*?target=\"_blank\">(.*?)<\/a>\s*?<\/h3>"
        r"<font size=\-1>.*?<span class=\"g\">.*? "
        r"((\d{4}\-\d{1,2}\-\d{1,2})|(\d+小时前)|(\d+分钟前)) "
        r".*?<\/span>.*?<br><\/font><\/td><\/tr><\/table>"
    )
    m = re.findall(pattern, fp)
    if m:
        #print m
        for s in m:
            # The matched tuples arrive GBK-encoded; decode('gbk') followed by
            # encode("UTF-8") converts the Chinese text to UTF-8 before the insert
            print str(s[3])
            #print '~~~'.join(s)  # join the tuple
            Yang_MySQL(yang_k, yang_u, s)  # store in the database
        #for i, s in enumerate(m.group(3)):
        #print i,s
    else:
        print 'not search'

def Yang_MySQL(k, u, s):
    global cursor, d
    cursor.execute("set names utf8")
    key_unicode = s[3].decode('gb2312')  # the title arrives as gb2312
    key_utf8 = key_unicode.encode('utf-8')
    # Parameterized query, so a quote inside a title cannot break the statement
    SQL = ("INSERT INTO `baidukey`.`baidu` (`url`, `title`, `keys`, `bdurl`, `date`) "
           "VALUES (%s, %s, %s, %s, %s)")
    insert = cursor.execute(SQL, (s[2], key_utf8, k, u, d))
    #print SQL

# www.iplaypy.com
# Run the scraping functions
conn = MySQLdb.connect(host="localhost", user="phper", passwd="123456", db="baidukey")
cursor = conn.cursor()
t = datetime.datetime.now()
d = t.strftime('%Y-%m-%d')  # %H:%M:%S
# Clear any rows already stored for today before scraping again
Del = "DELETE FROM `baidukey`.`baidu` WHERE date = %s"
cursor.execute(Del, (d,))
Yang_Config()
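The configuration parsing and URL construction steps of the script above can be tried in isolation. The following is a minimal Python 3 sketch, not the author's code. It assumes that key.txt holds entries of the form site,keyword separated by @, which is what Yang_Config appears to expect. The helper names parse_config and build_search_url and the sample sites and keywords are hypothetical, and urlencode is swapped in for the original string formatting so that keywords containing spaces or non-ASCII characters are escaped. The sketch only builds URLs and does not fetch anything.

```python
from urllib.parse import urlencode


def parse_config(text):
    """Split 'site,keyword@site,keyword' text into (site, keyword) pairs."""
    pairs = []
    for entry in text.split('@'):
        # Unlike the original script, surrounding whitespace is stripped here.
        fields = [field.strip() for field in entry.split(',')]
        # Keep only entries that have both a site and a keyword.
        if len(fields) > 1 and fields[0] and fields[1]:
            pairs.append((fields[0], fields[1]))
    return pairs


def build_search_url(site, keyword, results=100):
    """Build a Baidu query restricted to one site, with escaped parameters."""
    query = urlencode({'wd': '%s site:%s' % (keyword, site), 'rn': results})
    return 'http://www.baidu.com/s?' + query


if __name__ == '__main__':
    # Hypothetical sample configuration, standing in for the contents of key.txt.
    sample = 'example.com,python@example.org,web scraping'
    for site, keyword in parse_config(sample):
        print(build_search_url(site, keyword))
```

Keeping these two steps as pure functions makes them easy to check without a network connection or a database, before wiring them into the fetch and insert steps.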