How do I make requests with cookies in Python Scrapy? (python, scrapy)
A plain Scrapy request to Xueqiu always errors out. I know you have to visit Xueqiu once first, because the real link only opens with the right cookie. Scrapy supposedly handles cookies for you and fetches them automatically. Following this link I already enabled cookies in the middleware: http://stackoverflow.com/ques..., but it still returns a 404. I've been searching for days without finding an answer. Frustrating. Could someone share a simple code example of how to access it? Thanks.
class XueqiuSpider(scrapy.Spider):
    name = "xueqiu"
    start_urls = "https://xueqiu.com/stock/f10/finmainindex.json?symbol=SZ000001&page=1&size=1"
    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Language": "zh-CN,zh;q=0.8",
        "Connection": "keep-alive",
        "Host": "www.zhihu.com",  # note: this Host header names zhihu.com, not the xueqiu.com target
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.109 Safari/537.36"
    }

    def __init__(self, url=None):
        self.user_url = url

    def start_requests(self):
        yield scrapy.Request(
            url=self.start_urls,
            headers=self.headers,
            meta={'cookiejar': 1},
            callback=self.request_captcha
        )

    def request_captcha(self, response):
        print response
Error log:
2017-03-04 12:42:02 [scrapy.core.engine] INFO: Spider opened
2017-03-04 12:42:02 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-03-04 12:42:02 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
********Current UserAgent:Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6************
2017-03-04 12:42:12 [scrapy.downloadermiddlewares.cookies] DEBUG: Received cookies from: <200 https://xueqiu.com/robots.txt>
Set-Cookie: aliyungf_tc=AQAAAGFYbBEUVAQAPSHDc8pHhpYZKUem; Path=/; HttpOnly
2017-03-04 12:42:12 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://xueqiu.com/robots.txt> (referer: None)
********Current UserAgent:Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6************
2017-03-04 12:42:12 [scrapy.downloadermiddlewares.cookies] DEBUG: Received cookies from: <404 https://xueqiu.com/stock/f10/finmainindex.json?symbol=SZ000001&page=1&size=1>
Set-Cookie: aliyungf_tc=AQAAAPTfyyJNdQUAPSHDc8KmCkY5slST; Path=/; HttpOnly
2017-03-04 12:42:12 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://xueqiu.com/stock/f10/finmainindex.json?symbol=SZ000001&page=1&size=1> (referer: None)
2017-03-04 12:42:12 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <404 https://xueqiu.com/stock/f10/finmainindex.json?symbol=SZ000001&page=1&size=1>: HTTP status code is not handled or not allowed
2017-03-04 12:42:12 [scrapy.core.engine] INFO: Closing spider (finished)
2017-03-04 12:42:12 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
Tried it again... you actually don't need to log in. I was overthinking it. Just request xueqiu.com once first, and after you have the cookie, request the API address and it works. So that was it.
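The two-step flow described above can be sketched with requests; the same idea carries over to Scrapy with the cookiejar meta key. This is a minimal sketch, not code from the thread: the api_url helper and fetch_finmainindex wrapper are names I made up for illustration.

```python
import requests

# Hypothetical helper (not from the thread) that builds the finmainindex API URL.
def api_url(symbol, page, size=1):
    return ("https://xueqiu.com/stock/f10/finmainindex.json"
            "?symbol=%s&page=%d&size=%d" % (symbol, page, size))

def fetch_finmainindex(symbol="SZ000001", page=1):
    session = requests.Session()
    # The site rejects the default library User-Agent, so set a browser-like one.
    session.headers["User-Agent"] = "Mozilla/5.0"
    # Step 1: hit the homepage so the Session stores whatever cookies it sets.
    session.get("https://xueqiu.com")
    # Step 2: the API request now carries those cookies automatically.
    return session.get(api_url(symbol, page))

if __name__ == "__main__":
    print(fetch_finmainindex().text)
```

The point is that a requests.Session (like Scrapy's cookie middleware) persists Set-Cookie headers between requests, so the second request arrives with the cookie the first one collected.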
============== shameful divider =============
I've verified it: you do need to log in...
import scrapy
import hashlib
from scrapy.http import FormRequest, Request

class XueqiuScrapeSpider(scrapy.Spider):
    name = "xueqiu_scrape"
    allowed_domains = ["xueqiu.com"]

    def start_requests(self):
        m = hashlib.md5()
        m.update(b"your password")  # fill in your password here
        password = m.hexdigest().upper()
        form_data = {
            "telephone": "your account",  # fill in your account here
            "password": password,
            "remember_me": str(),
            "areacode": "86",
        }
        print(form_data)
        return [FormRequest(
            url="https://xueqiu.com/snowman/login",
            formdata=form_data,
            meta={"cookiejar": 1},
            callback=self.loged_in
        )]

    def loged_in(self, response):
        # print(response.url)
        return [Request(
            url="https://xueqiu.com/stock/f10/finmainindex.json?symbol=SZ000001&page=1&size=1",
            meta={"cookiejar": response.meta["cookiejar"]},
            callback=self.get_result,
        )]

    def get_result(self, response):
        print(response.body)
Also, the site does check the User-Agent. You can set it in settings.py, or write it into the spider file yourself. The password is sent as an MD5-hashed string.
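The password handling in the spider above boils down to an uppercase MD5 hex digest; here is that step pulled out on its own (the hash_password name is mine, for illustration):

```python
import hashlib

def hash_password(plain):
    # The login form above sends the MD5 hex digest in upper case.
    return hashlib.md5(plain.encode("utf-8")).hexdigest().upper()

print(hash_password("password"))  # → 5F4DCC3B5AA765D61D8327DEB882CF99
```

For the User-Agent, the global Scrapy way is a USER_AGENT line in settings.py, which the downloader then applies to every request unless a spider overrides it.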
Oh, one more thing: I registered with a phone number, so form_data has these particular fields. If you signed up another way, just open Chrome DevTools, look at which parameters the POST request carries, and adjust form_data accordingly.
Haha, thanks, that cleared up days of confusion. I had also done it with requests before, without logging in; here's the code:
import requests

session = requests.Session()
session.headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'
}
session.get('https://xueqiu.com')
for page in range(1, 100):
    url = 'https://xueqiu.com/stock/f10/finmainindex.json?symbol=SZ000001&page=%s&size=1' % page
    print url
    r = session.get(url)
    # print r.json().list
    a = r.text