Python crawler: scraping classical Chinese poetry


Goals


1. Crawl Tang poems and Song lyrics (ci) from a classical poetry website.

2. Store the results in a local database.
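For goal 2, the script later in this post writes each poem through `MongoBase().add_new_row(...)`, imported from a `utils` module whose source is not included here. The following is only a sketch of what such a helper might look like, assuming pymongo and a MongoDB instance on localhost; the `database`, `uri` and `name` parameters and the `FakeCollection` class are hypothetical and exist so the example can run without a server.

```python
class MongoBase(object):
    """Hypothetical stand-in for the utils.MongoBase helper (not the author's code)."""

    def __init__(self, database=None, uri='mongodb://localhost:27017', name='spider'):
        if database is None:
            # Imported lazily so the class can be exercised without pymongo installed.
            from pymongo import MongoClient
            database = MongoClient(uri)[name]
        self.database = database

    def add_new_row(self, collection, row):
        """Insert one document (a dict) into the named collection."""
        return self.database[collection].insert_one(dict(row))


class FakeCollection(list):
    """In-memory substitute for a pymongo collection, for offline experiments."""

    def insert_one(self, row):
        self.append(row)


# Any mapping of collection name -> object with insert_one() works as a database.
fake_db = {'shigeSpider': FakeCollection()}
db = MongoBase(database=fake_db)
db.add_new_row('shigeSpider', {'title': 'Quiet Night Thought', 'author': 'Li Bai'})
```

Calling `MongoBase()` with no arguments would connect to the assumed local MongoDB instead of the in-memory fake.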

Page analysis


Use Firebug to locate the target elements on the page:



Locate the same elements in the page source:


Locate the div tags with lxml's etree:

    # Parse the page with lxml (data holds the downloaded HTML)
    response = etree.HTML(data)
    # Locate the div that wraps each poem
    for row in response.xpath('//div[@class="left"]/div[@class="sons"]'):
        # Title
        title = row.xpath('div[@class="cont"]/p/a/b/text()')[0] if row.xpath('div[@class="cont"]/p/a/b/text()') else ''
        # Dynasty: first text node of the "source" paragraph
        dynasty = row.xpath('div[@class="cont"]/p[@class="source"]//text()')[0] if row.xpath('div[@class="cont"]/p[@class="source"]//text()') else ''
        # Poet: last text node of the "source" paragraph
        author = row.xpath('div[@class="cont"]/p[@class="source"]//text()')[-1] if row.xpath('div[@class="cont"]/p[@class="source"]//text()') else ''
        # Body text, with padding spaces and newlines removed
        content = ''.join(row.xpath('div[@class="cont"]/div[@class="contson"]//text()')).replace('  ', '').replace('\n', '') if row.xpath('div[@class="cont"]/div[@class="contson"]//text()') else ''
        # Tags, joined with commas
        tag = ','.join(row.xpath('div[@class="tag"]/a/text()')) if row.xpath('div[@class="tag"]/a/text()') else ''
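To see what these XPath expressions return, it can help to run them against a small inline document. The sketch below assumes a simplified markup, `SAMPLE`, that only mimics the div/class structure the expressions expect; the real page may differ. The `parse_poems` function is a hypothetical wrapper that evaluates each expression once instead of twice.

```python
from lxml import etree

# Hypothetical markup shaped like the structure the XPath expressions target.
SAMPLE = '''
<div class="left">
  <div class="sons">
    <div class="cont">
      <p><a href="/view_1.aspx"><b>Quiet Night Thought</b></a></p>
      <p class="source"><a href="#">Tang</a><span>:</span><a href="#">Li Bai</a></p>
      <div class="contson">line one,<br/>line two.</div>
    </div>
    <div class="tag"><a href="#">homesick</a><a href="#">moon</a></div>
  </div>
</div>
'''


def parse_poems(html):
    """Return one dict per poem block found in html."""
    poems = []
    for row in etree.HTML(html).xpath('//div[@class="left"]/div[@class="sons"]'):
        titles = row.xpath('div[@class="cont"]/p/a/b/text()')
        source = row.xpath('div[@class="cont"]/p[@class="source"]//text()')
        body = row.xpath('div[@class="cont"]/div[@class="contson"]//text()')
        poems.append({
            'title': titles[0] if titles else '',
            'dynasty': source[0] if source else '',   # first text node
            'author': source[-1] if source else '',   # last text node
            'content': ''.join(body).replace('\n', '').strip(),
            'tag': ','.join(row.xpath('div[@class="tag"]/a/text()')),
        })
    return poems


poems = parse_poems(SAMPLE)
print(poems[0]['title'], poems[0]['dynasty'], poems[0]['author'])
```

Because `//text()` collects every text node under the "source" paragraph, taking the first and last entries yields the dynasty and the poet only as long as the paragraph keeps that order.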

Script source

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    '''
    @Date: 2017/12/21 11:12
    @Author: kaiqing.huang(kaiqing.huang@ubtrobot.com)
    @Contact: kaiqing.huang@ubtrobot.com
    @File: shigeSpider.py
    '''
    # MySpider and MongoBase come from the author's own utils module (not shown here).
    from utils import MySpider, MongoBase
    from datetime import date
    from lxml import etree
    import sys


    class shigeSpider():
        def __init__(self):
            self.db = MongoBase()
            self.spider = MySpider()

        def download(self, url):
            # Remember the host so relative pagination links can be completed.
            self.domain = url.split('/')[2]
            data = self.spider.get(url)
            if data:
                self.parse(data)

        def parse(self, data):
            response = etree.HTML(data)
            for row in response.xpath('//div[@class="left"]/div[@class="sons"]'):
                title = row.xpath('div[@class="cont"]/p/a/b/text()')[0] if row.xpath('div[@class="cont"]/p/a/b/text()') else ''
                dynasty = row.xpath('div[@class="cont"]/p[@class="source"]//text()')[0] if row.xpath('div[@class="cont"]/p[@class="source"]//text()') else ''
                author = row.xpath('div[@class="cont"]/p[@class="source"]//text()')[-1] if row.xpath('div[@class="cont"]/p[@class="source"]//text()') else ''
                content = ''.join(row.xpath('div[@class="cont"]/div[@class="contson"]//text()')).replace('  ', '').replace('\n', '') if row.xpath('div[@class="cont"]/div[@class="contson"]//text()') else ''
                tag = ','.join(row.xpath('div[@class="tag"]/a/text()')) if row.xpath('div[@class="tag"]/a/text()') else ''
                # Save one document per poem.
                self.db.add_new_row('shigeSpider', {'title': title, 'dynasty': dynasty, 'author': author, 'content': content, 'tag': tag, 'createTime': str(date.today())})
                print('Title:{}'.format(title))
            # Follow the last link in the pager; each page adds a level of recursion.
            if response.xpath('//div[@class="pages"]/a/@href'):
                self.download('http://' + self.domain + response.xpath('//div[@class="pages"]/a/@href')[-1])


    if __name__ == '__main__':
        # download() and parse() call each other once per page, so raise the limit.
        sys.setrecursionlimit(100000)
        url = 'http://so.gushiwen.org/type.aspx?p=501'
        do = shigeSpider()
        do.download(url)
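The script follows the pager by having `download` and `parse` call each other, which is why it raises the recursion limit. One possible alternative, sketched here for Python 3 and not taken from the original post, is to follow the links in a loop. The `crawl`, `fetch` and `handle_page` names, the `max_pages` cap and the `FAKE_SITE` data are all hypothetical; `handle_page` stands in for a `parse` method that returns the next href instead of recursing.

```python
from urllib.parse import urljoin


def crawl(start_url, fetch, handle_page, max_pages=1000):
    """Follow 'next page' links with a loop instead of recursion.

    fetch(url) returns the page body, or a falsy value on failure.
    handle_page(body) stores the poems and returns the href of the
    next page, or None on the last page.
    """
    url, visited = start_url, []
    while url and url not in visited and len(visited) < max_pages:
        visited.append(url)
        body = fetch(url)
        if not body:
            break
        next_href = handle_page(body)
        # urljoin resolves relative hrefs against the current page URL.
        url = urljoin(url, next_href) if next_href else None
    return visited


# Offline stand-in for the site: each "body" is just the next href, or 'END'.
FAKE_SITE = {
    'http://example.com/type.aspx?p=1': '/type.aspx?p=2',
    'http://example.com/type.aspx?p=2': '/type.aspx?p=3',
    'http://example.com/type.aspx?p=3': 'END',
}


def handle(body):
    return None if body == 'END' else body


visited = crawl('http://example.com/type.aspx?p=1', FAKE_SITE.get, handle)
print(len(visited))
```

The `url not in visited` check also stops the crawl if the last link in the pager points back to a page that was already fetched.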

Execution result:

