How to Fetch Internet Resources Using the urllib Package

Shared by Byrx.net on 2019-03-24 12:03:59


Translated by 编橙之家 - 鸭梨山大 and proofread by 艾凌风. Reproduction without permission is prohibited.
Original English source: docs.python.org.

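Introduction

You may also find the following article useful when fetching web resources with Python:

Basic Authentication: a tutorial on Basic Authentication, with examples in Python.

urllib.request is a Python module for fetching URLs (Uniform Resource Locators). It offers a very simple interface, in the form of the urlopen function, which can fetch URLs using a variety of protocols. It also offers a slightly more complex interface for handling common situations such as basic authentication, cookies and proxies. These are provided by objects called handlers and openers.

urllib.request supports fetching URLs for many "URL schemes" (identified by the string before the ":" in the URL; for example, "ftp" is the URL scheme of "ftp://python.org/") using their associated network protocols (e.g. FTP, HTTP). This tutorial focuses on the most common case, HTTP.

For straightforward situations urlopen is very easy to use. But as soon as you encounter errors or non-trivial cases when opening HTTP URLs, you will need some understanding of the HyperText Transfer Protocol. The most comprehensive and authoritative reference to HTTP is RFC 2616. It is a technical document and not intended to be easy to read. This HOWTO aims to illustrate using urllib, with enough detail about HTTP to help you through. It is not intended to replace the urllib.request documentation, but to supplement it.

Fetching URLs

The simplest way to use urllib.request is as follows: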

Python

import urllib.request
response = urllib.request.urlopen('http://python.org/')
html = response.read()

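If you wish to retrieve a resource via its URL and store it in a temporary location, you can do so with the urlretrieve() function:

Python

import urllib.request
local_filename, headers = urllib.request.urlretrieve('http://python.org/')
html = open(local_filename)

Many uses of urllib will be that simple (note that instead of an "http:" URL we could have used a URL starting with "ftp:", "file:", and so on). However, the purpose of this tutorial is to explain the more complicated cases, concentrating on HTTP.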
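As a small illustration of the point that urlopen is not limited to "http:" URLs, the sketch below reads a "data:" URL and a "file:" URL, neither of which needs a network connection. The helper name fetch and the sample file name are made up for this example.

```python
import pathlib
import tempfile
import urllib.request


def fetch(url):
    """Return the raw bytes behind any URL scheme urllib knows how to open."""
    # The response is a context manager, so it is closed when the block ends.
    with urllib.request.urlopen(url) as response:
        return response.read()


# A data: URL carries its payload inline, so no server is involved.
inline_body = fetch("data:text/plain;charset=utf-8,hello%20urllib")

# A file: URL reads from the local filesystem.
with tempfile.TemporaryDirectory() as tmp:
    path = pathlib.Path(tmp) / "sample.txt"  # hypothetical file name
    path.write_bytes(b"local file contents")
    file_body = fetch(path.as_uri())

print(inline_body)  # b'hello urllib'
print(file_body)    # b'local file contents'
```

In both cases read() returns bytes, exactly as it does for an HTTP response.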

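HTTP is based on requests and responses: the client makes requests and servers send responses. urllib.request mirrors this with a Request object that represents the HTTP request you are making. In its simplest form you create a Request object that specifies the URL you want to fetch. Calling urlopen with this Request object returns a response object for the URL requested. This response is a file-like object, which means you can, for example, call .read() on it:

Python

import urllib.request

req = urllib.request.Request('http://www.voidspace.org.uk')
response = urllib.request.urlopen(req)
the_page = response.read()

Note that urllib.request uses the same Request interface to handle all URL schemes. For example, you can make an FTP request like so:

Python

req = urllib.request.Request('ftp://example.com/')

In the case of HTTP, there are two extra things that Request objects allow you to do. First, you can pass data to be sent to the server. Second, you can pass extra information ("metadata") about the data or about the request itself to the server; this information is sent as HTTP "headers". Let's look at each of these in turn.

Data

Sometimes you want to send data to a URL (often the URL will refer to a CGI (Common Gateway Interface) script or another web application). With HTTP, this is often done using a POST request. This is what your browser does when you submit an HTML form that you filled in on the web. Not all POSTs have to come from forms: you can use a POST to transmit arbitrary data to your own application. In the common case of HTML forms, the data needs to be encoded in a standard way and then passed to the Request object as the data argument. The encoding is done using a function from the urllib.parse library.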

Python

import urllib.parse
import urllib.request

url = 'http://www.someserver.com/cgi-bin/register.cgi'
values = {'name': 'Michael Foord', 'location': 'Northampton', 'language': 'Python'}

data = urllib.parse.urlencode(values)
data = data.encode('utf-8')  # data should be bytes
req = urllib.request.Request(url, data)
response = urllib.request.urlopen(req)
the_page = response.read()

Note that other encodings are sometimes required (e.g. for file upload from HTML forms; see HTML Specification, Form Submission for more details).
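To see what the encoding step produces without contacting any server, the following sketch builds two Request objects and inspects them. It assumes Python 3.7 or later, where dictionaries keep insertion order; the variable names encoded, get_req and post_req are illustrative only.

```python
import urllib.parse
import urllib.request

url = "http://www.someserver.com/cgi-bin/register.cgi"
values = {"name": "Michael Foord", "location": "Northampton", "language": "Python"}

# urlencode() returns a str; Request expects the body as bytes.
encoded = urllib.parse.urlencode(values).encode("utf-8")

get_req = urllib.request.Request(url)                 # no data
post_req = urllib.request.Request(url, data=encoded)  # with data

print(encoded)                # b'name=Michael+Foord&location=Northampton&language=Python'
print(get_req.get_method())   # GET
print(post_req.get_method())  # POST
```

Nothing is sent until the Request is passed to urlopen, so this is a convenient way to check a request body before submitting it.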

If you do not pass the data argument, urllib uses a GET request. One way in which GET and POST requests differ is that POST requests often have "side effects": they change the state of the system in some way (for example by placing an order with the website for a hundredweight of tinned spam to be delivered to your door). Though the HTTP standard makes it clear that POSTs are intended to always cause side effects, and GET requests never to cause side effects, nothing prevents a GET request from having side effects, nor a POST request from having none. Data can also be passed in an HTTP GET request by encoding it in the URL itself.

This is done as follows:

Python

>>> import urllib.request
>>> import urllib.parse
>>> data = {}
>>> data['name'] = 'Somebody Here'
>>> data['location'] = 'Northampton'
>>> data['language'] = 'Python'
>>> url_values = urllib.parse.urlencode(data)
>>> print(url_values)  # The order may differ from below.
name=Somebody+Here&language=Python&location=Northampton
>>> url = 'http://www.example.com/example.cgi'
>>> full_url = url + '?' + url_values
>>> data = urllib.request.urlopen(full_url)

Notice that the full URL is created by adding a ? to the URL, followed by the encoded values.
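The construction of the full URL can be checked offline by splitting it apart again. This is only a sketch: base_url and params reuse the values from the session above, and parse_qs is one of several urllib.parse functions that could be used to decode the query.

```python
import urllib.parse

base_url = "http://www.example.com/example.cgi"
params = {"name": "Somebody Here", "location": "Northampton", "language": "Python"}

# Build the GET URL: base URL, a "?", then the encoded values.
full_url = base_url + "?" + urllib.parse.urlencode(params)

# Split the URL again and decode the query string back into a dictionary.
parts = urllib.parse.urlsplit(full_url)
decoded = urllib.parse.parse_qs(parts.query)

print(full_url)
print(decoded)
```

parse_qs maps each name to a list of values, because a query string may repeat a name.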

Headers

We'll discuss one particular HTTP header here, to illustrate how to add headers to your HTTP request.

Some websites [1] dislike being browsed by programs, or send different versions to different browsers [2]. By default urllib identifies itself as Python-urllib/x.y (where x and y are the major and minor version numbers of the Python release, e.g. Python-urllib/2.5), which may confuse the site, or just plain not work. A browser identifies itself through the User-Agent header [3]. When you create a Request object you can pass in a dictionary of headers. The following example makes the same request as above, but identifies itself as a version of Internet Explorer [4].

Python

import urllib.parse
import urllib.request

url = 'http://www.someserver.com/cgi-bin/register.cgi'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {'name': 'Michael Foord',
          'location': 'Northampton',
          'language': 'Python'}
headers = {'User-Agent': user_agent}

data = urllib.parse.urlencode(values)
data = data.encode('utf-8')
req = urllib.request.Request(url, data, headers)
response = urllib.request.urlopen(req)
the_page = response.read()
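Headers can also be inspected or added after the Request has been created. The sketch below sends nothing; it only shows how the Request object stores header names. One detail worth knowing is that Request normalises names with str.capitalize(), so "User-Agent" is stored as "User-agent". The extra Accept-Language header is an arbitrary choice for the example.

```python
import urllib.request

user_agent = "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)"
req = urllib.request.Request(
    "http://www.someserver.com/cgi-bin/register.cgi",
    headers={"User-Agent": user_agent},
)

# add_header() attaches a further header to an existing Request.
req.add_header("Accept-Language", "en")

# Header names are stored capitalised: first letter upper case, rest lower case.
stored_names = sorted(name for name, _ in req.header_items())

print(stored_names)                    # ['Accept-language', 'User-agent']
print(req.get_header("User-agent"))    # the user agent string
print(req.get_header("User-Agent"))    # None: lookups use the stored spelling
```

The spelling only matters when querying the Request object; HTTP header names themselves are case-insensitive on the wire.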

The response also has two useful methods. See the section on info and geturl, which comes after we have looked at what happens when things go wrong.

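Handling Exceptions

urlopen raises URLError when it cannot handle a response (though, as usual with Python APIs, built-in exceptions such as ValueError and TypeError may also be raised). HTTPError is the subclass of URLError raised in the specific case of HTTP URLs. The exception classes are exported from the urllib.error module.

Often, URLError is raised because there is no network connection (no route to the specified server), or because the specified server doesn't exist. In this case, the exception raised will have a "reason" attribute, which is a tuple containing an error code and a text error message. For example:

Python

>>> req = urllib.request.Request('http://www.pretend_server.org')
>>> try: urllib.request.urlopen(req)
... except urllib.error.URLError as e:
...    print(e.reason)
...
(4, 'getaddrinfo failed')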
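The same pattern can be tried without a network connection by asking for a URL scheme that no handler supports, which also raises URLError. In the sketch below, describe_failure and the "nosuchscheme" URL are invented for illustration. Note that in current Python 3 releases the reason attribute is documented as either a message string or another exception instance, so it is safest to treat it as something printable and not to unpack it as a tuple.

```python
import urllib.error
import urllib.request


def describe_failure(url):
    """Return a short description of why `url` could not be opened, or None."""
    try:
        with urllib.request.urlopen(url, timeout=5):
            return None
    except urllib.error.URLError as e:
        # e.reason may be a string or an exception instance; both format cleanly.
        return "failed: %s" % (e.reason,)


message = describe_failure("nosuchscheme://example.invalid/")
print(message)  # failed: unknown url type: nosuchscheme
```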

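HTTPError

Every HTTP response from the server contains a numeric "status code". Sometimes the status code indicates that the server is unable to fulfil the request. The default handlers will handle some of these responses for you (for example, if the response is a "redirection" that asks the client to fetch the document from a different URL, urllib will handle that for you). For those it can't handle, urlopen will raise an HTTPError. Typical errors include "404" (page not found), "403" (request forbidden), and "401" (authentication required).

See section 10 of RFC 2616 for a reference on all the HTTP error codes.

The HTTPError instance raised will have an integer "code" attribute, which corresponds to the error sent by the server.

Error Codes

Because the default handlers handle redirects (codes in the 300 range), and codes in the 100-299 range indicate success, you will usually only see error codes in the 400-599 range. http.server.BaseHTTPRequestHandler.responses is a useful dictionary of response codes that shows all the response codes used by RFC 2616. The dictionary is reproduced here for convenience: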

Python

# Table mapping response codes to messages; entries have the
# form {code: (shortmessage, longmessage)}.
responses = {
    100: ('Continue', 'Request received, please continue'),
    101: ('Switching Protocols',
          'Switching to new protocol; obey Upgrade header'),

    200: ('OK', 'Request fulfilled, document follows'),
    201: ('Created', 'Document created, URL follows'),
    202: ('Accepted',
          'Request accepted, processing continues off-line'),
    203: ('Non-Authoritative Information', 'Request fulfilled from cache'),
    204: ('No Content', 'Request fulfilled, nothing follows'),
    205: ('Reset Content', 'Clear input form for further input.'),
    206: ('Partial Content', 'Partial content follows.'),

    300: ('Multiple Choices',
          'Object has several resources -- see URI list'),
    301: ('Moved Permanently', 'Object moved permanently -- see URI list'),
    302: ('Found', 'Object moved temporarily -- see URI list'),
    303: ('See Other', 'Object moved -- see Method and URL list'),
    304: ('Not Modified',
          'Document has not changed since given time'),
    305: ('Use Proxy',
          'You must use proxy specified in Location to access this '
          'resource.'),
    307: ('Temporary Redirect',
          'Object moved temporarily -- see URI list'),

    400: ('Bad Request',
          'Bad request syntax or unsupported method'),
    401: ('Unauthorized',
          'No permission -- see authorization schemes'),
    402: ('Payment Required',
          'No payment -- see charging schemes'),
    403: ('Forbidden',
          'Request forbidden -- authorization will not help'),
    404: ('Not Found', 'Nothing matches the given URI'),
    405: ('Method Not Allowed',
          'Specified method is invalid for this server.'),
    406: ('Not Acceptable', 'URI not available in preferred format.'),
    407: ('Proxy Authentication Required', 'You must authenticate with '
          'this proxy before proceeding.'),
    408: ('Request Timeout', 'Request timed out; try again later.'),
    409: ('Conflict', 'Request conflict.'),
    410: ('Gone',
          'URI no longer exists and has been permanently removed.'),
    411: ('Length Required', 'Client must specify Content-Length.'),
    412: ('Precondition Failed', 'Precondition in headers is false.'),
    413: ('Request Entity Too Large', 'Entity is too large.'),
    414: ('Request-URI Too Long', 'URI is too long.'),
    415: ('Unsupported Media Type', 'Entity body in unsupported format.'),
    416: ('Requested Range Not Satisfiable',
          'Cannot satisfy request range.'),
    417: ('Expectation Failed',
          'Expect condition could not be satisfied.'),

    500: ('Internal Server Error', 'Server got itself in trouble'),
    501: ('Not Implemented',
          'Server does not support this operation'),
    502: ('Bad Gateway', 'Invalid responses from another server/proxy.'),
    503: ('Service Unavailable',
          'The server cannot process the request due to a high load'),
    504: ('Gateway Timeout',
          'The gateway server did not receive a timely response'),
    505: ('HTTP Version Not Supported', 'Cannot fulfill request.'),
}
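You don't need to copy this table into your own code, since the standard library exposes it. The sketch below looks up a code in BaseHTTPRequestHandler.responses and in the shorter http.client.responses mapping; the helper name explain and its fallback text are assumptions made for the example.

```python
import http.client
from http.server import BaseHTTPRequestHandler


def explain(code):
    """Return 'code short-message: long-message' for an HTTP status code."""
    short, long_ = BaseHTTPRequestHandler.responses.get(
        code, ("Unknown", "No description available")
    )
    return "%d %s: %s" % (code, short, long_)


summary = explain(404)
phrase = http.client.responses[404]  # short phrases only

print(summary)
print(phrase)  # Not Found
```

The exact long descriptions may differ between Python versions, so code should rely on the numeric status and treat the text as informational.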

When an error is raised, the server responds by returning an HTTP error code and an error page. You can use the HTTPError instance as a response object for the page returned. This means that as well as the code attribute, it also has read, geturl, and info methods, as returned by the urllib.response module:

Python

>>> req = urllib.request.Request('http://www.python.org/fish.html')
>>> try:
...     urllib.request.urlopen(req)
... except urllib.error.HTTPError as e:
...     print(e.code)
...     print(e.read())
...
404
b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n\n\n<html
  ...
  <title>Page Not Found</title>\n
  ...
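To experiment with HTTPError without depending on an external site, one option is to run a throwaway server on the local machine. The following sketch assumes it is acceptable to bind a port on 127.0.0.1; the AlwaysMissing handler and the fetch helper are invented for the example, and ProxyHandler({}) is used so that a proxy configured in the environment does not intercept the local request.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class AlwaysMissing(BaseHTTPRequestHandler):
    """Hypothetical test handler: answers every GET with a 404 error page."""

    def do_GET(self):
        self.send_error(404, "Not Found")

    def log_message(self, format, *args):
        pass  # keep the example quiet


def fetch(opener, url):
    """Return (status code, body bytes), whether or not the server reports an error."""
    try:
        with opener.open(url, timeout=5) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as e:
        # The exception doubles as a response, so the error page can be read.
        body = e.read()
        e.close()
        return e.code, body


server = HTTPServer(("127.0.0.1", 0), AlwaysMissing)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
try:
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    port = server.server_address[1]
    code, body = fetch(opener, "http://127.0.0.1:%d/fish.html" % port)
finally:
    server.shutdown()
    server.server_close()

print(code)  # 404
```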

Wrapping it Up

If you want to be prepared for HTTPError or URLError, there are two basic approaches. I prefer the second.

Number 1

Python

from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError

someurl = 'http://www.python.org/'  # placeholder: substitute the URL you want to fetch
req = Request(someurl)
try:
    response = urlopen(req)
except HTTPError as e:
    print("The server couldn't fulfill the request.")
    print('Error code: ', e.code)
except URLError as e:
    print('We failed to reach a server.')
    print('Reason: ', e.reason)
else:
    # everything is fine
    pass

Note: except HTTPError must come first, otherwise except URLError will also catch an HTTPError.

Number 2

Python

from urllib.request import Request, urlopen
from urllib.error import URLError

someurl = 'http://www.python.org/'  # placeholder: substitute the URL you want to fetch
req = Request(someurl)
try:
    response = urlopen(req)
except URLError as e:
    # Check for 'code' first: in Python 3 an HTTPError also has a 'reason' attribute.
    if hasattr(e, 'code'):
        print("The server couldn't fulfill the request.")
        print('Error code: ', e.code)
    elif hasattr(e, 'reason'):
        print('We failed to reach a server.')
        print('Reason: ', e.reason)
else:
    # everything is fine
    pass

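info and geturl

The response returned by urlopen (or the HTTPError instance) has two useful methods, info() and geturl(), and is defined in the module urllib.response.

geturl – this returns the real URL of the page fetched. This is useful because urlopen (or the opener object used) may have followed a redirect. The URL of the page fetched may not be the same as the URL requested.

info – this returns a dictionary-like object that describes the page fetched, particularly the headers sent by the server. It is currently an http.client.HTTPMessage instance. Typical headers include "Content-length", "Content-type", and so on. See the Quick Reference to HTTP Headers for a useful listing of HTTP headers with brief explanations of their meaning and use.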
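The difference between the requested URL and the one returned by geturl() is easiest to see with a redirect. The sketch below is one way to try it locally, assuming a port can be bound on 127.0.0.1: a hypothetical Redirecting handler sends /old to /new, and the default redirect handling follows it. The paths and body text are made up for the example.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class Redirecting(BaseHTTPRequestHandler):
    """Hypothetical test handler: /old redirects to /new, which serves text."""

    def do_GET(self):
        if self.path == "/old":
            self.send_response(302)
            self.send_header("Location", "/new")
            self.send_header("Content-Length", "0")
            self.end_headers()
        else:
            body = b"final page"
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


server = HTTPServer(("127.0.0.1", 0), Redirecting)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
try:
    # ProxyHandler({}) keeps an environment proxy from intercepting the request.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    requested = "http://127.0.0.1:%d/old" % server.server_address[1]
    with opener.open(requested, timeout=5) as response:
        final_url = response.geturl()                  # URL after the redirect
        content_type = response.info()["Content-Type"]  # header lookup
        body = response.read()
finally:
    server.shutdown()
    server.server_close()

print(requested)
print(final_url)
print(content_type)
```

Header lookups on the object returned by info() are case-insensitive, so "content-type" would work just as well.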

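Openers and Handlers

When you fetch a URL you use an opener (an instance of the perhaps confusingly named urllib.request.OpenerDirector). Normally we have been using the default opener, via urlopen, but you can create custom openers. Openers use handlers. All the "heavy lifting" is done by the handlers. Each handler knows how to open URLs for a particular URL scheme (http, ftp, etc.), or how to handle one aspect of URL opening, for example HTTP redirections or HTTP cookies.

You will want to create openers if you want to fetch URLs with specific handlers installed, for example to get an opener that handles cookies, or one that does not handle redirections.

To create an opener, instantiate an OpenerDirector and then call .add_handler(some_handler_instance) repeatedly. Alternatively, you can use build_opener, a convenience function that creates an opener object with a single function call. build_opener adds several handlers by default, but provides a quick way to add more and/or override the default handlers.

Other sorts of handlers you might want can handle proxies, authentication, and other common but slightly specialised situations.

install_opener can be used to make an opener object the (global) default opener. This means that calls to urlopen will use the opener you have installed.

Opener objects have an open method, which can be called directly to fetch URLs in the same way as the urlopen function: there's no need to call install_opener, except as a convenience.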
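The two ways of creating an opener can be compared side by side. The sketch below builds one opener by hand with a single handler and one with build_opener, then uses the open method directly without calling install_opener. It uses a "data:" URL so that no network access is needed; the variable names are illustrative.

```python
import urllib.request

# 1. By hand: an OpenerDirector only understands the handlers you add to it.
manual = urllib.request.OpenerDirector()
manual.add_handler(urllib.request.DataHandler())
with manual.open("data:text/plain,hi") as response:
    manual_body = response.read()

# 2. With build_opener: the usual default handlers are added for you.
default = urllib.request.build_opener()
default_names = sorted(type(handler).__name__ for handler in default.handlers)
with default.open("data:text/plain,hi%20again") as response:
    default_body = response.read()

print(manual_body)    # b'hi'
print(default_body)   # b'hi again'
print(default_names)  # includes HTTPHandler, HTTPRedirectHandler, FTPHandler, ...
```

The exact contents of the handlers list depend on the Python version and on the environment (for example whether SSL support or proxy settings are present).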

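Basic Authentication

To illustrate creating and installing a handler we will use the HTTPBasicAuthHandler. For a more detailed discussion of this subject, including an explanation of how Basic Authentication works, see the Basic Authentication Tutorial. When authentication is required, the server sends a header (as well as the 401 error code) requesting authentication. This specifies the authentication scheme and a "realm". The header looks like this:

Python

WWW-Authenticate: Basic realm="cPanel Users"

The client should then retry the request with the appropriate name and password for the realm included as a header in the request. This is "basic authentication". To simplify this process we can create an instance of HTTPBasicAuthHandler and an opener that uses this handler.

The HTTPBasicAuthHandler uses an object called a password manager to handle the mapping of URLs and realms to passwords and usernames. If you know what the realm is (from the authentication header sent by the server), then you can use an HTTPPasswordMgr. Frequently one doesn't care what the realm is. In that case, it is convenient to use HTTPPasswordMgrWithDefaultRealm. This allows you to specify a default username and password for a URL, which is supplied when you have not provided an alternative combination for a specific realm. We indicate this by passing None as the realm argument to the add_password method.

The top-level URL is the first URL that requires authentication. URLs "deeper" than the URL you pass to .add_password() will also match.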

Python

import urllib.request

# Create a password manager.
password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()

# Add the username and password.
# If we knew the realm, we could use it instead of None.
# username and password are placeholders for your own credentials.
top_level_url = "http://example.com/foo/"
password_mgr.add_password(None, top_level_url, username, password)

handler = urllib.request.HTTPBasicAuthHandler(password_mgr)

# Create "opener" (OpenerDirector instance).
opener = urllib.request.build_opener(handler)

# Use the opener to fetch a URL (a_url is a placeholder for the URL you want).
opener.open(a_url)

# Install the opener.
# Now all calls to urllib.request.urlopen use our opener.
urllib.request.install_opener(opener)

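Note: in the above example we only supplied our HTTPBasicAuthHandler to build_opener. By default openers have the handlers for normal situations: ProxyHandler (if a proxy setting such as an http_proxy environment variable is set), UnknownHandler, HTTPHandler, HTTPDefaultErrorHandler, HTTPRedirectHandler, FTPHandler, FileHandler, DataHandler, HTTPErrorProcessor.

top_level_url is in fact either a full URL (including the "http:" scheme component and the hostname and optionally the port number), e.g. "http://example.com/", or an "authority" (i.e. the hostname, optionally including the port number), e.g. "example.com" or "example.com:8080" (the latter includes a port number). The authority, if present, must NOT contain the "userinfo" component; for example "joe:password@example.com" is not correct.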
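The matching rule for "deeper" URLs can be checked directly on the password manager, without any server. In this sketch the credentials demo_user and demo_password, and the URLs being looked up, are invented for illustration; find_user_password is the method the authentication handler itself calls to look up credentials.

```python
import urllib.request

password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()

# Passing None as the realm registers these credentials for any realm.
password_mgr.add_password(None, "http://example.com/foo/", "demo_user", "demo_password")

# A URL below the registered one matches, whatever realm the server names.
deeper = password_mgr.find_user_password(
    "cPanel Users", "http://example.com/foo/bar/page.html"
)

# A URL outside the registered path does not match.
elsewhere = password_mgr.find_user_password(
    "cPanel Users", "http://example.com/other/"
)

print(deeper)     # ('demo_user', 'demo_password')
print(elsewhere)  # (None, None)
```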

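Proxies

urllib will auto-detect your proxy settings and use them. This is done through the ProxyHandler, which is part of the normal handler chain when a proxy setting is detected. Normally that's a good thing, but there are occasions when it may not be helpful [5]. One way to avoid it is to set up our own ProxyHandler with no proxies defined. This is done using steps similar to setting up a Basic Authentication handler:

Python

>>> proxy_support = urllib.request.ProxyHandler({})
>>> opener = urllib.request.build_opener(proxy_support)
>>> urllib.request.install_opener(opener)

Note: currently urllib.request does not support fetching https locations through a proxy. However, this can be enabled by extending urllib.request as shown in the recipe [6].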
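To see which proxy settings urllib would pick up on a given machine, and to build an opener that ignores them, something like the following sketch can be used. The proxy address proxy.example.invalid:3128 is a made-up placeholder, and the opener is exercised with a "data:" URL only so that the example runs without network access.

```python
import urllib.request

# Proxy settings detected from the environment (may be an empty dictionary).
detected = urllib.request.getproxies()

# An empty mapping means "use no proxies at all", overriding the detection.
no_proxy = urllib.request.ProxyHandler({})
direct_opener = urllib.request.build_opener(no_proxy)

# An explicit mapping from scheme to proxy URL overrides the detection too.
explicit = urllib.request.ProxyHandler({"http": "http://proxy.example.invalid:3128"})

with direct_opener.open("data:text/plain,direct") as response:
    body = response.read()

print(detected)
print(no_proxy.proxies)   # {}
print(explicit.proxies)   # {'http': 'http://proxy.example.invalid:3128'}
```

Calling direct_opener.open(...) affects only that opener; the global default used by urlopen changes only if install_opener is called.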

Sockets and Layers

The Python support for fetching resources from the web is layered. urllib uses the http.client library, which in turn uses the socket library. As of Python 2.3 you can specify how long a socket should wait for a response before timing out. This can be useful in applications that have to fetch web pages. By default the socket module has no timeout and can hang. In Python 3, urlopen also accepts an optional timeout argument for a single call. To set the default timeout globally for all sockets, use:

Python

import socket
import urllib.request

# timeout in seconds
timeout = 10
socket.setdefaulttimeout(timeout)

# this call to urllib.request.urlopen now uses the default timeout
# we have set in the socket module
req = urllib.request.Request('http://www.voidspace.org.uk')
response = urllib.request.urlopen(req)
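Because setdefaulttimeout changes a process-wide setting, it can be worth restoring the previous value afterwards, or passing a timeout to a single urlopen call instead. The sketch below shows both; the values 10 and 2.5 are arbitrary, and a "data:" URL is used so that it runs without a network connection.

```python
import socket
import urllib.request

previous = socket.getdefaulttimeout()  # usually None, meaning "no timeout"
socket.setdefaulttimeout(10)           # global default, in seconds
try:
    inside = socket.getdefaulttimeout()

    # A per-call timeout applies to this request only.
    with urllib.request.urlopen("data:text/plain,ok", timeout=2.5) as response:
        body = response.read()
finally:
    # Put the global default back so other code is not affected.
    socket.setdefaulttimeout(previous)

print(inside)  # 10.0
print(body)    # b'ok'
```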

Footnotes

This document was reviewed and revised by John Lee.

[1]	Google, for example. The proper way to use Google from a program is to use PyGoogle, of course.

[2]	Browser sniffing is a very bad practice for website design; building sites using web standards is much more sensible. Unfortunately a lot of sites still send different versions to different browsers.

[3]	The user agent for MSIE 6 is 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)'

[4]	For details of more HTTP request headers, see Quick Reference to HTTP Headers.

[5]	In my case I have to use a proxy to access the internet at work. If you attempt to fetch localhost URLs through this proxy it blocks them. IE is set to use the proxy, which urllib picks up on. In order to test scripts with a localhost server, I have to prevent urllib from using the proxy.

[6]	urllib opener for SSL proxy (CONNECT method): ASPN Cookbook Recipe.
