Installing the Python Crawling Framework Scrapy on Linux

http://scrapy.org/

Scrapy is one of the easiest-to-use crawling frameworks for Python. Requirement: Python 2.7.x.
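Before installing anything, it may help to confirm which interpreter is actually in use. The following is a minimal sketch, not part of Scrapy; the helper name is_python27 is hypothetical and chosen only for illustration.

```python
import sys


def is_python27(version_info=None):
    """Return True if the given (or the running) interpreter is Python 2.7.x."""
    if version_info is None:
        version_info = sys.version_info
    # Only the major and minor numbers matter for the 2.7.x requirement.
    return tuple(version_info[:2]) == (2, 7)


print("Python %d.%d.%d" % tuple(sys.version_info[:3]))
print("matches 2.7.x: %s" % is_python27())
```

Run it with the same interpreter that will run Scrapy (for example python2.7 on RHEL), since a system may have several Python versions installed.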

1. Ubuntu 14.04

1.1 Check whether pip is already installed

    $ pip --version

If pip is not installed, install it:

    $ sudo apt-get install python-pip

1.2 Install Scrapy

Import the GPG key used to sign Scrapy packages into APT keyring:

    $ sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 627220E7

Create /etc/apt/sources.list.d/scrapy.list file using the following command:

    $ echo 'deb http://archive.scrapy.org/ubuntu scrapy main' | sudo tee /etc/apt/sources.list.d/scrapy.list

Update package lists and install the scrapy package:

    $ sudo apt-get update && sudo apt-get install scrapy
    $ pip install service_identity --timeout 10000

Install pyasn1-0.1.8:

    $ wget "https://pypi.python.org/packages/source/p/pyasn1/pyasn1-0.1.8.tar.gz#md5=7f6526f968986a789b1e5e372f0b7065"
    $ tar -zxvf pyasn1-0.1.8.tar.gz
    $ cd pyasn1-0.1.8
    $ sudo python setup.py install
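One way to confirm that the packages installed above can be imported is a short check like the one below. This is a sketch under assumptions: the helper name installed_version is hypothetical, and it only catches ImportError, so a package that is present but broken may still raise a different exception.

```python
import importlib


def installed_version(module_name):
    """Return the module's __version__, 'unknown' if it has none,
    or None if the module cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, "__version__", "unknown")


# Module names match the packages installed in this section.
for name in ("scrapy", "service_identity", "pyasn1"):
    version = installed_version(name)
    print("%-18s %s" % (name, version if version is not None else "NOT INSTALLED"))
```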


2. RHEL 6.4

2.1 Install pip

    # wget "https://pypi.python.org/packages/source/p/pip/pip-1.5.4.tar.gz#md5=834b2904f92d46aaa333267fb1c922bb" --no-check-certificate
    # tar -xzvf pip-1.5.4.tar.gz
    # cd pip-1.5.4
    # python2.7 setup.py install

2.2 Install Scrapy

    # pip install scrapy --timeout 10000

TODO: The download is very slow. This section will be completed once the download has finished.


3. Example

3.1 Create a spider, stackoverflow.py

    #!/usr/bin/python2.7
    # -*- coding: UTF-8 -*-
    # stackoverflow.py
    #
    import scrapy


    class StackOverflowSpider(scrapy.Spider):
        name = 'stackoverflow'
        start_urls = ['http://stackoverflow.com/questions?sort=votes']

        def parse(self, response):
            # Follow the link of every question on the listing page.
            for href in response.css('.question-summary h3 a::attr(href)'):
                full_url = response.urljoin(href.extract())
                yield scrapy.Request(full_url, callback=self.parse_question)

        def parse_question(self, response):
            # Emit one item (a plain dict) per question page.
            yield {
                'title': response.css('h1 a::text').extract()[0],
                'votes': response.css('.question .vote-count-post::text').extract()[0],
                'body': response.css('.question .post-text').extract()[0],
                'tags': response.css('.question .post-tag::text').extract(),
                'link': response.url,
            }
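The spider calls response.urljoin because the href attributes on the listing page are usually relative paths. As a rough illustration of that resolution step, the standard-library urljoin behaves as sketched below; the question path used here is a made-up example, and Scrapy's own method may additionally honor a base tag in the page.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2.7

# The listing page the spider starts from.
base = "http://stackoverflow.com/questions?sort=votes"
# Hypothetical relative link, as it might appear in an href attribute.
href = "/questions/12345/example-question"

full_url = urljoin(base, href)
print(full_url)  # http://stackoverflow.com/questions/12345/example-question
```

Because the href starts with a slash, it replaces the whole path and query of the base URL and keeps only the scheme and host.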

3.2 Run the spider

    $ scrapy runspider stackoverflow.py -o top-ques.json

3.3 Inspect the output

Paste the contents of top-ques.json into http://www.json.cn/ to see what the spider collected.
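As a local alternative to an online viewer, the output could also be read with a few lines of Python. This is a minimal sketch, assuming top-ques.json is in the current directory and holds a JSON array of items with the title and votes fields produced by the spider in 3.1; the helper name summarize is hypothetical.

```python
import io
import json
import os


def summarize(items):
    """Return one 'votes  title' line per scraped item."""
    lines = []
    for item in items:
        # Fall back to '?' when a field is missing from an item.
        lines.append(u"%s  %s" % (item.get("votes", "?"), item.get("title", "?")))
    return lines


path = "top-ques.json"
if os.path.exists(path):
    with io.open(path, encoding="utf-8") as handle:
        for line in summarize(json.load(handle)):
            print(line)
else:
    print("%s not found; run the spider first" % path)
```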

Enjoy!


Copyright notice: This is an original article by the blog author and may not be reproduced without the author's permission.
