Python: Scraping the Letter List of the Beijing Municipal Government Portal "首都之窗" (Window of the Capital) [Follow-up]
Date: 2020.01.23
Blog post No. 131
Thursday
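If you want to see the groundwork for this post, please read blog post No. 130!
[If you want to use the code in this post, please leave a comment below first (just let me know).]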
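I tried rewriting the crawler script and ran it. The first run got stuck on page 27, because page 27 contained a letter of type "投诉" (complaint), which I had not anticipated! To solve this I rewrote the code. The new class differs from the previous overall class in the following ways: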
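1. Because the next page's data has to be fetched by calling JS, I downloaded the Firefox driver (geckodriver) from its website.
Installation reference blog: https://www.cnblogs.com/nuomin/p/8486963.html
This is Selenium + WebDriver in action: it can simulate a person operating the browser (and of course it can execute JS code!)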
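# Import the package
from selenium import webdriver

# Use the driver to open the browser automatically and load the page
profile = webdriver.Firefox()
profile.get('http://www.beijing.gov.cn/hudong/hdjl/com.web.search.mailList.flow')

# Have the browser execute JS code
# ----[pageTurning(3) is the page's own "next page" JS call (no JS file or function needs to be imported)]
profile.execute_script("pageTurning(3);")

# Close the browser (a raw WebDriver has no __close__(); quit() is the method that ends the session)
profile.quit()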
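2. Because the crawler has to move from page to page, the JS code must be executed every time it finishes collecting the links on a page!
3. Because a consultation-type question may not have been answered by anyone, the corresponding fields are filled with a single space instead (the exception that gets thrown is caught).
4. The first record of the first page has to be written to the file in "w+" mode; everything after that is written in append mode, "a+".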
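Point 4 can be sketched as a small helper. This is only an illustration under my own assumptions: `write_mode` and `append_record` are hypothetical names that do not appear in the classes below, and the sketch uses the built-in `open` where the post's `Bean` class uses `codecs.open`.

```python
def write_mode(is_first_record):
    """Return "w+" (truncate the file) for the very first record, "a+" (append) otherwise."""
    return "w+" if is_first_record else "a+"


def append_record(path, line, is_first_record):
    """Write one record line to path, starting a fresh file only for the first record."""
    # utf-8 keeps the Chinese text in the letters intact.
    with open(path, write_mode(is_first_record), encoding="utf-8") as f:
        f.write(line + "\n")
```

With this shape, re-running the crawler from page 1 replaces the old file, while every later record is appended to it.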
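First, the modified Bean class. (New: I need to record whether each letter is a "投诉" (complaint), a "建议" (suggestion) or a "咨询" (consultation), so I added a method that takes this string as a parameter and writes it to the file [at the start of each line].)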
import codecs


# [ Saved data format ]
class Bean:

    # Constructor
    def __init__(self, asker, responser, askTime, responseTime, title, questionSupport,
                 responseSupport, responseUnsupport, questionText, responseText):
        self.asker = asker
        self.responser = responser
        self.askTime = askTime
        self.responseTime = responseTime
        self.title = title
        self.questionSupport = questionSupport
        self.responseSupport = responseSupport
        self.responseUnsupport = responseUnsupport
        self.questionText = questionText
        self.responseText = responseText

    # Print the result to the console (for testing)
    def display(self):
        print("Asker: " + self.asker)
        print("Responder: " + self.responser)
        print("Question time: " + self.askTime)
        print("Response time: " + self.responseTime)
        print("Question title: " + self.title)
        print("Question support count: " + self.questionSupport)
        print("Response likes: " + self.responseSupport)
        print("Response dislikes: " + self.responseUnsupport)
        print("Question text: " + self.questionText)
        print("Response text: " + self.responseText)

    # Join all fields into one tab-separated line
    def toString(self):
        return "\t".join([
            self.asker, self.responser, self.askTime, self.responseTime, self.title,
            self.questionSupport, self.responseSupport, self.responseUnsupport,
            self.questionText, self.responseText,
        ])

    # Append the record to the file
    def addToFile(self, fpath, model):
        with codecs.open(fpath, model, 'utf-8') as f:
            f.write(self.toString() + "\n")

    # Append the record to the file, prefixed with the letter type (kind)
    def addToFile_s(self, fpath, model, kind):
        with codecs.open(fpath, model, 'utf-8') as f:
            f.write(kind + "\t" + self.toString() + "\n")

    # --------------------[Basic data]
    # Asker
    asker = ""
    # Responder
    responser = ""
    # Question time
    askTime = ""
    # Response time
    responseTime = ""
    # Question title
    title = ""
    # Question support count
    questionSupport = ""
    # Response likes
    responseSupport = ""
    # Response dislikes
    responseUnsupport = ""
    # Question text
    questionText = ""
    # Response text
    responseText = ""

Bean.py
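Each line written by `addToFile_s` is the letter type followed by the ten Bean fields, all separated by tabs. A minimal sketch of reading such a line back might look like the following; `FIELDS` and `parse_record` are hypothetical names of mine, and the sketch assumes that no field itself contains a tab or a line break.

```python
# Column order: the kind prefix, then the fields in the order Bean.toString() joins them.
FIELDS = [
    "kind", "asker", "responser", "askTime", "responseTime", "title",
    "questionSupport", "responseSupport", "responseUnsupport",
    "questionText", "responseText",
]


def parse_record(line):
    """Split one saved line into a dict keyed by field name."""
    parts = line.rstrip("\r\n").split("\t")
    if len(parts) != len(FIELDS):
        # A stray tab or newline inside a field would end up here.
        raise ValueError("expected %d fields, got %d" % (len(FIELDS), len(parts)))
    return dict(zip(FIELDS, parts))
```

Unanswered letters keep their single-space placeholders, so a reader can detect them with `record["responseText"].strip() == ""`.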
Next is the overall processing class:
import time

import parsel
from selenium import webdriver

# DetailConnector (the per-letter scraper) is part of the complete listing further down.


# [ Object that crawls consecutive pages ]
class WebPerConnector:
    profile = ""

    isAccess = True

    # ---[Constructor]
    def __init__(self):
        self.profile = webdriver.Firefox()
        self.profile.get('http://www.beijing.gov.cn/hudong/hdjl/com.web.search.mailList.flow')

    # ---[Release method]
    def __close__(self):
        self.profile.quit()

    # Get the HTML source of the current page
    def getHTMLText(self):
        return self.profile.page_source

    # Get the basic links on the first page
    def getFirstChanel(self):
        index_html = self.getHTMLText()
        index_sel = parsel.Selector(index_html)
        links = index_sel.css('div #mailul').css("a[onclick]").extract()
        for seat in range(len(links)):
            # Get the corresponding <a> tag
            pe = str(links[seat])
            # Position of the first '>' character
            seat_turol = pe.find('>')
            # Position of the first '"' character
            seat_stnvs = pe.find('"')
            # Remove spaces
            pe = pe[seat_stnvs:seat_turol].replace(" ", "")
            # Keep only the arguments of the onclick call
            pe = pe[14:len(pe) - 2]
            pe = pe.replace("'", "")
            # Split into the pieces that are needed: [type, id]
            mor = pe.split(",")
            # ---[ Build the URL ]
            url_get_item = ""
            # Decide by the first item (the letter type)
            if mor[0] == "咨询":    # consultation
                url_get_item = "http://www.beijing.gov.cn/hudong/hdjl/com.web.consult.consultDetail.flow?originalId="
            elif mor[0] == "建议":  # suggestion
                url_get_item = "http://www.beijing.gov.cn/hudong/hdjl/com.web.suggest.suggesDetail.flow?originalId="
            elif mor[0] == "投诉":  # complaint
                url_get_item = "http://www.beijing.gov.cn/hudong/hdjl/com.web.complain.complainDetail.flow?originalId="
            url_get_item = url_get_item + mor[1]

            # Only the very first record starts a fresh file
            model = "a+"
            if seat == 0:
                model = "w+"

            dc = DetailConnector(url_get_item)
            dc.getBean().addToFile_s("../testFile/emails.txt", model, mor[0])
        self.getChannel()

    # Get the basic links on every page after the first
    def getNoFirstChannel(self):
        index_html = self.getHTMLText()
        index_sel = parsel.Selector(index_html)
        links = index_sel.css('div #mailul').css("a[onclick]").extract()
        for seat in range(len(links)):
            # Get the corresponding <a> tag
            pe = str(links[seat])
            # Position of the first '>' character
            seat_turol = pe.find('>')
            # Position of the first '"' character
            seat_stnvs = pe.find('"')
            # Remove spaces
            pe = pe[seat_stnvs:seat_turol].replace(" ", "")
            # Keep only the arguments of the onclick call
            pe = pe[14:len(pe) - 2]
            pe = pe.replace("'", "")
            # Split into the pieces that are needed: [type, id]
            mor = pe.split(",")
            # ---[ Build the URL ]
            url_get_item = ""
            # Decide by the first item (the letter type)
            if mor[0] == "咨询":    # consultation
                url_get_item = "http://www.beijing.gov.cn/hudong/hdjl/com.web.consult.consultDetail.flow?originalId="
            elif mor[0] == "建议":  # suggestion
                url_get_item = "http://www.beijing.gov.cn/hudong/hdjl/com.web.suggest.suggesDetail.flow?originalId="
            elif mor[0] == "投诉":  # complaint
                url_get_item = "http://www.beijing.gov.cn/hudong/hdjl/com.web.complain.complainDetail.flow?originalId="
            url_get_item = url_get_item + mor[1]

            dc = DetailConnector(url_get_item)
            dc.getBean().addToFile_s("../testFile/emails.txt", "a+", mor[0])
        self.getChannel()

    # Turn to the next page, then keep crawling (note: this calls getNoFirstChannel recursively)
    def getChannel(self):
        try:
            self.profile.execute_script("pageTurning(3);")
            time.sleep(1)
        except Exception:
            print("-# END #-")
            # Must be set on self; a bare "isAccess = False" only creates a local variable
            self.isAccess = False

        if self.isAccess:
            self.getNoFirstChannel()
        else:
            self.__close__()

WebPerConnector.py
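The fixed-offset slicing above (`pe[14:len(pe) - 2]`) depends on the exact length of the onclick function name. A more forgiving sketch uses a regular expression plus a lookup table for the three detail URLs. Note the assumptions: the tag shape `onclick="someFunction('type','id')"` is inferred from the slicing code, not confirmed against the site, and `DETAIL_URLS`, `ONCLICK_RE` and `detail_url` are hypothetical names.

```python
import re

BASE = "http://www.beijing.gov.cn/hudong/hdjl/"

# Letter type -> detail page prefix (the same three URLs used in the class above).
DETAIL_URLS = {
    "咨询": BASE + "com.web.consult.consultDetail.flow?originalId=",    # consultation
    "建议": BASE + "com.web.suggest.suggesDetail.flow?originalId=",     # suggestion
    "投诉": BASE + "com.web.complain.complainDetail.flow?originalId=",  # complaint
}

# Assumed shape: onclick="name('type','id')"; whitespace around the arguments is tolerated.
ONCLICK_RE = re.compile(r"""onclick\s*=\s*"\w+\(\s*'([^']*)'\s*,\s*'([^']*)'\s*\)""")


def detail_url(a_tag_html):
    """Return (kind, url) for one <a> tag, or None if it cannot be parsed or the type is unknown."""
    match = ONCLICK_RE.search(a_tag_html)
    if match is None:
        return None
    kind, original_id = match.groups()
    base = DETAIL_URLS.get(kind)
    if base is None:
        return None
    return kind, base + original_id
```

Returning `None` for an unknown type makes a surprise like the "投诉" letters on page 27 visible to the caller instead of silently producing a URL without a prefix.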
The corresponding code to run it (this time it is not just a test):
wpc = WebPerConnector()
wpc.getFirstChanel()
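In my own run it got as far as page 486 and then the page crashed; judging by the error message it was some kind of overflow... At that point there were 2,910 records and the file was 3,553 KB.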
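One plausible explanation, and this is my assumption since the post only says the error looked like an overflow, is Python's recursion limit. `getChannel` and `getNoFirstChannel` call each other, so every page adds two stack frames, and CPython's default limit of 1000 frames would be reached at roughly page 500. A sketch of the same control flow as a plain loop, with hypothetical callables `scrape_page` and `go_next` standing in for the real methods:

```python
def crawl_all(scrape_page, go_next, max_pages=None):
    """Scrape page after page in a loop instead of by mutual recursion.

    scrape_page: hypothetical callable that processes the page currently shown.
    go_next:     hypothetical callable that turns the page and returns False
                 when there is no further page.
    Returns the number of pages scraped.
    """
    pages = 0
    while max_pages is None or pages < max_pages:
        scrape_page()
        pages += 1
        if not go_next():
            break
    return pages
```

Because the loop keeps the stack depth constant, the number of pages no longer matters.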
I will keep reworking the code...
Rework finished (the complete set of Python code that makes the crawler work):
import parsel
from urllib import request
import codecs
from selenium import webdriver
import time


# [ Collection of special string-handling methods ]
class StrSpecialDealer:
    # Strip spaces, take the text between the first '>' and the last '<', drop tabs and line breaks
    @staticmethod
    def getReaction(stri):
        strs = str(stri).replace(" ", "")
        strs = strs[strs.find('>') + 1:strs.rfind('<')]
        strs = strs.replace("\t", "")
        strs = strs.replace("\r", "")
        strs = strs.replace("\n", "")
        return strs


# [ Saved data format ]
class Bean:

    # Constructor
    def __init__(self, asker, responser, askTime, responseTime, title, questionSupport,
                 responseSupport, responseUnsupport, questionText, responseText):
        self.asker = asker
        self.responser = responser
        self.askTime = askTime
        self.responseTime = responseTime
        self.title = title
        self.questionSupport = questionSupport
        self.responseSupport = responseSupport
        self.responseUnsupport = responseUnsupport
        self.questionText = questionText
        self.responseText = responseText

    # Print the result to the console (for testing)
    def display(self):
        print("Asker: " + self.asker)
        print("Responder: " + self.responser)
        print("Question time: " + self.askTime)
        print("Response time: " + self.responseTime)
        print("Question title: " + self.title)
        print("Question support count: " + self.questionSupport)
        print("Response likes: " + self.responseSupport)
        print("Response dislikes: " + self.responseUnsupport)
        print("Question text: " + self.questionText)
        print("Response text: " + self.responseText)

    # Join all fields into one tab-separated line
    def toString(self):
        return "\t".join([
            self.asker, self.responser, self.askTime, self.responseTime, self.title,
            self.questionSupport, self.responseSupport, self.responseUnsupport,
            self.questionText, self.responseText,
        ])

    # Append the record to the file
    def addToFile(self, fpath, model):
        with codecs.open(fpath, model, 'utf-8') as f:
            f.write(self.toString() + "\n")

    # Append the record to the file, prefixed with the letter type (kind)
    def addToFile_s(self, fpath, model, kind):
        with codecs.open(fpath, model, 'utf-8') as f:
            f.write(kind + "\t" + self.toString() + "\n")

    # --------------------[Basic data]
    # Asker
    asker = ""
    # Responder
    responser = ""
    # Question time
    askTime = ""
    # Response time
    responseTime = ""
    # Question title
    title = ""
    # Question support count
    questionSupport = ""
    # Response likes
    responseSupport = ""
    # Response dislikes
    responseUnsupport = ""
    # Question text
    questionText = ""
    # Response text
    responseText = ""


# [ Object that crawls consecutive pages ]
class WebPerConnector:
    profile = ""

    isAccess = True

    # ---[Constructor]
    def __init__(self):
        self.profile = webdriver.Firefox()
        self.profile.get('http://www.beijing.gov.cn/hudong/hdjl/com.web.search.mailList.flow')

    # ---[Release method]
    def __close__(self):
        self.profile.quit()

    # Get the HTML source of the current page
    def getHTMLText(self):
        return self.profile.page_source

    # Get the basic links on the first page
    def getFirstChanel(self):
        index_html = self.getHTMLText()
        index_sel = parsel.Selector(index_html)
        links = index_sel.css('div #mailul').css("a[onclick]").extract()
        for seat in range(len(links)):
            # Get the corresponding <a> tag
            pe = str(links[seat])
            # Position of the first '>' character
            seat_turol = pe.find('>')
            # Position of the first '"' character
            seat_stnvs = pe.find('"')
            # Remove spaces
            pe = pe[seat_stnvs:seat_turol].replace(" ", "")
            # Keep only the arguments of the onclick call
            pe = pe[14:len(pe) - 2]
            pe = pe.replace("'", "")
            # Split into the pieces that are needed: [type, id]
            mor = pe.split(",")
            # ---[ Build the URL ]
            url_get_item = ""
            # Decide by the first item (the letter type)
            if mor[0] == "咨询":    # consultation
                url_get_item = "http://www.beijing.gov.cn/hudong/hdjl/com.web.consult.consultDetail.flow?originalId="
            elif mor[0] == "建议":  # suggestion
                url_get_item = "http://www.beijing.gov.cn/hudong/hdjl/com.web.suggest.suggesDetail.flow?originalId="
            elif mor[0] == "投诉":  # complaint
                url_get_item = "http://www.beijing.gov.cn/hudong/hdjl/com.web.complain.complainDetail.flow?originalId="
            url_get_item = url_get_item + mor[1]

            # Only the very first record starts a fresh file
            model = "a+"
            if seat == 0:
                model = "w+"

            dc = DetailConnector(url_get_item)
            dc.getBean().addToFile_s("../testFile/emails.txt", model, mor[0])

    # Get the basic links on every page after the first
    def getNoFirstChannel(self):
        index_html = self.getHTMLText()
        index_sel = parsel.Selector(index_html)
        links = index_sel.css('div #mailul').css("a[onclick]").extract()
        for seat in range(len(links)):
            # Get the corresponding <a> tag
            pe = str(links[seat])
            # Position of the first '>' character
            seat_turol = pe.find('>')
            # Position of the first '"' character
            seat_stnvs = pe.find('"')
            # Remove spaces
            pe = pe[seat_stnvs:seat_turol].replace(" ", "")
            # Keep only the arguments of the onclick call
            pe = pe[14:len(pe) - 2]
            pe = pe.replace("'", "")
            # Split into the pieces that are needed: [type, id]
            mor = pe.split(",")
            # ---[ Build the URL ]
            url_get_item = ""
            # Decide by the first item (the letter type)
            if mor[0] == "咨询":    # consultation
                url_get_item = "http://www.beijing.gov.cn/hudong/hdjl/com.web.consult.consultDetail.flow?originalId="
            elif mor[0] == "建议":  # suggestion
                url_get_item = "http://www.beijing.gov.cn/hudong/hdjl/com.web.suggest.suggesDetail.flow?originalId="
            elif mor[0] == "投诉":  # complaint
                url_get_item = "http://www.beijing.gov.cn/hudong/hdjl/com.web.complain.complainDetail.flow?originalId="
            url_get_item = url_get_item + mor[1]

            dc = DetailConnector(url_get_item)
            try:
                bea = dc.getBean()
                bea.addToFile_s("../testFile/emails.txt", "a+", mor[0])
            except Exception:
                # Skip a letter whose detail page cannot be fetched or parsed
                pass

    # Jump to a page (2-5624)
    def turn(self, seat):
        seat = seat - 1
        self.profile.execute_script("beforeTurning(" + str(seat) + ");")
        time.sleep(1)


# [ Node that scrapes one letter's details ]
class DetailConnector:
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36'}
    basicURL = ""

    # ---[Constructor]
    def __init__(self, url):
        self.basicURL = url

    # Get the HTML source behind the url
    def getHTMLText(self):
        req = request.Request(url=self.basicURL, headers=self.headers)
        r = request.urlopen(req).read().decode()
        return r

    # Get the basic data
    def getBean(self):
        index_html = self.getHTMLText()
        index_sel = parsel.Selector(index_html)
        container_div = index_sel.css('div .container')[0]
        container_strong = index_sel.css('div strong')[0]
        container_retire = index_sel.css('div div div div')[5]

        # Default values: a single space stands in for anything that is missing
        title = " "
        question_toBuilder = " "
        question_time = " "
        support_quert = " "
        quText = " "
        answer_name = " "
        answer_time = " "
        answer_text = " "
        num_supp = " "
        num_unsupp = " "

        # ------------------------------------------------------------ Question
        # Question title
        title = str(container_strong.extract())
        title = title.replace("<strong>", "")
        title = title.replace("</strong>", "")

        # Letter writer
        container_builder = container_retire.css("div div")
        question_toBuilder = str(container_builder.extract()[0])
        question_toBuilder = StrSpecialDealer.getReaction(question_toBuilder)
        question_toBuilder = question_toBuilder.replace("来信人:", "")

        # Question time
        question_time = str(container_builder.extract()[1])
        question_time = StrSpecialDealer.getReaction(question_time)
        question_time = question_time.replace("时间:", "")

        # Number of supporters among site users
        support_quert = str(container_builder.extract()[2])
        support_quert = support_quert[support_quert.find('>') + 1:support_quert.rfind('<')]
        support_quert = StrSpecialDealer.getReaction(support_quert)

        # Question text
        quText = str(index_sel.css('div div div div').extract()[9])
        if "input" in quText:
            quText = str(index_sel.css('div div div div').extract()[10])
        quText = quText.replace("<p>", "")
        quText = quText.replace("</p>", "")
        quText = StrSpecialDealer.getReaction(quText)

        # ------------------------------------------------------------ Response
        try:
            # Response likes
            num_supp = str(index_sel.css('div a span').extract()[0])
            num_supp = StrSpecialDealer.getReaction(num_supp)
            # Response dislikes
            num_unsupp = str(index_sel.css('div a span').extract()[1])
            num_unsupp = StrSpecialDealer.getReaction(num_unsupp)
            # Responder
            answer_name = str(container_div.css("div div div div div div div").extract()[1])
            answer_name = answer_name.replace("<strong>", "")
            answer_name = answer_name.replace("</strong>", "")
            answer_name = answer_name.replace("</div>", "")
            answer_name = answer_name.replace(" ", "")
            answer_name = answer_name.replace("\t", "")
            answer_name = answer_name.replace("\r", "")
            answer_name = answer_name.replace("\n", "")
            answer_name = answer_name[answer_name.find('>') + 1:len(answer_name)]
            # --------------------- Remove these lines if you want to keep the markers
            answer_name = answer_name.replace("[官方回答]:", "")
            answer_name = answer_name.replace("<span>", "")
            answer_name = answer_name.replace("</span>", "")

            # Response time
            answer_time = str(index_sel.css('div div div div div div div div')[2].extract())
            answer_time = StrSpecialDealer.getReaction(answer_time)
            answer_time = answer_time.replace("答复时间:", "")
            # Response text
            answer_text = str(index_sel.css('div div div div div div div')[4].extract())
            answer_text = StrSpecialDealer.getReaction(answer_text)
            answer_text = answer_text.replace("<p>", "")
            answer_text = answer_text.replace("</p>", "")
        except Exception:
            # Unanswered letters have no response block; the response fields keep their defaults
            pass

        bean = Bean(question_toBuilder, answer_name, question_time, answer_time, title,
                    support_quert, num_supp, num_unsupp, quText, answer_text)

        return bean


wpc = WebPerConnector()
wpc.getFirstChanel()
for i in range(2, 5625):
    wpc.turn(i)
    wpc.getNoFirstChannel()
wpc.__close__()

WebConnector.py
Resulting file:
Size: 40,673 KB
Records: 33,746
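The crawl took quite a while! It started at 2:30 in the morning and ran until a little after 7:30, which is 5 hours!
The crawl stopped 4 times along the way: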
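1. Page 2565: when the process got here the browser froze and stopped responding. I don't know why; I modified the code a little and resumed the crawl.
2. Page 3657: same cause as above.
3. Page 3761: the system reported a file read/write problem, saying that appending to the file had reached its limit, so I moved the data file aside and created a new empty file to hold the data from page 3761 onward.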
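4. Page 4449: this page contains an item titled "关于亦庄轻轨的小红门站和旧宫东站" (about Xiaohongmen Station and Jiugong East Station on the Yizhuang light rail; a consultation-type letter). If you open that link, the page comes back with HTTP status 500. When the crawler got here, the error made the program exit, as shown in the figure below:
[Figure: screenshot of the error]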
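The page 4449 failure is the kind of error that can be caught per letter, so that one broken detail page does not end the whole run. Below is a minimal sketch under my own assumptions: `fetch_or_none` is a hypothetical helper, and `fetch` stands for any callable that downloads a URL with `urllib` (for example a thin wrapper around `DetailConnector.getHTMLText`).

```python
from urllib import error


def fetch_or_none(fetch, url, failed):
    """Call fetch(url); on an HTTP error, record (url, status code) in failed and return None."""
    try:
        return fetch(url)
    except error.HTTPError as exc:
        # exc.code holds the HTTP status, e.g. 500 for the letter on page 4449.
        failed.append((url, exc.code))
        return None
```

Compared with a bare `except: pass`, this keeps a list of the letters that were skipped, so they can be retried or inspected afterwards.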