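Scraping Website Images with Python
Open
http://desk.zol.com.cn/bizhi/6429_79089_2.html
The page shows many images, and clicking "next page" jumps to the following page. How can we use Python to scrape the image resources on this page?
The Python requests library
Official site of the requests library: http://docs.python-requests.org/en/master/
Installation
pip(3) install requests
Usage
Using the requests library requires some familiarity with HTTP concepts such as status codes, request headers, response headers, methods, fields, and parameters.
After importing the library, we send a GET request: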
>>> import requests
>>> r = requests.get("http://desk.zol.com.cn/bizhi/6429_79089_2.html")
>>> r.status_code  # 200 means the request succeeded
200
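In the signature requests.get(url, params=None, **kwargs):
url: the URL of the page to fetch
params: extra query-string parameters appended to the URL. Request headers are not passed here but through the headers keyword argument. For example, some sites use hotlink protection and reject requests that lack a specific Referer or User-Agent, so we send customized headers to gain access.
**kwargs: 12 optional arguments that control the request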
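To make the difference between params and headers concrete, here is a minimal sketch that builds a request without sending it, so the resulting URL and headers can be inspected. The example.com URL, the query values, and the header strings are placeholders chosen for illustration, not values required by the site above.

```python
import requests

# Placeholder values for illustration only.
url = "http://example.com/search"
query = {"q": "wallpaper", "page": 2}
headers = {
    "User-Agent": "Mozilla/5.0 (compatible; demo-crawler)",
    "Referer": "http://example.com/",
}

# Build the request without sending it, to see what would go on the wire.
prepared = requests.Request("GET", url, params=query, headers=headers).prepare()

print(prepared.url)                 # params end up in the query string
print(prepared.headers["Referer"])  # headers travel separately from the URL

# Sending it for real would look like this:
# r = requests.get(url, params=query, headers=headers, timeout=10)
```

Preparing the request first is only a convenience for inspection; in a crawler the single requests.get call in the last comment is usually enough.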
The response headers are available through r.headers:
>>> type(r)
<class 'requests.models.Response'>
>>> r.headers
{'Content-Length': '27571', 'Via': 'http/1.1 zats-other1 (zcache-other1 [cRs f ])', 'Age': '3983', 'Expires': 'Mon, 10 Jul 2017 05:59:40 GMT', 'Vary': 'Accept-Encoding', 'Server': 'ngx_openresty', 'Last-Modified': 'Mon, 10 Jul 2017 03:59:40 GMT', 'Connection': 'keep-alive', 'Cache-Control': 'max-age=7200', 'Date': 'Mon, 10 Jul 2017 05:06:03 GMT', 'nnCoection': 'close', 'Content-Type': 'text/html; charset=GBK'}
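Attributes of the Response object
Attribute | Description |
---|---|
r.status_code | HTTP status code of the response; 200 means success, while 4xx and 5xx codes indicate errors |
r.text | Response body as a string, i.e. the content of the page at the URL |
r.encoding | Response encoding guessed from the HTTP headers |
r.apparent_encoding | Response encoding inferred from the content itself |
r.content | Response body in binary form |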
>>> r.encoding
'GBK'
>>> r.content
'<!DOCTYPE HTML>\r\n<html>\r\n<head>\r\n<meta http-equiv="Content-Type" content="text/html; charset=gb2312">\r\n<meta name="applicable-device" content="pc">\r\n<title>\xba\xab\xb9\xfa\xd0\xa1\xc7\xe5\xd0\xc2\xc4\xcf\xb9\xe7\xc0\xf6\xbf\xed\xc6\xc1\xb1\xda\xd6\xbd-ZOL\xd7\xc0\xc3\xe6\xb1\xda\xd6\xbd</title>\r\n <meta name="keywords" content="" />\r\n <meta name="description" content=""/><meta property="og:type" content="image"/>\n<meta property="og:image" content="http://desk.fd.zol-img.com.cn/t_s120x90c5/g5/M00/0B/05/ChMkJlcgdH2IVmv2AAYP2zcB7GQAAQr3gJjQtUABg_z016.jpg!awen)"/>\n\r\n<link href="http://s.zol-img.com.cn/d/Desk/Desk_bizhi_detail.css?v=1028" rel="stylesheet" type="text/css" />\r\n\r\n<script src="http://p.zol-img.com.cn/desk/detail.js" type="text/javascript"></script>\r\n<script src="http://icon.zol-img.com.cn/public/js/swfobject.js" type="text/javascript"></script>\r\n<script src="http://icon.zol-img.com.cn/getcook.js?1312" type="text/javascript"></script>\r\n<script>\r\n\tdocument.domain = "zol.com.cn";\r\n\tvar userid = get_cookie(\'zol_userid\');\r\n\t\tvar deskPicArr \t\t= 
{"list":[{"picId":"79089","oriSize":"2560x1600","resAll":["2560x1600","1920x1080","1680x1050","1600x900","1440x900","1366x768","1280x1024","1280x800","1024x768"],"imgsrc":"http:\\/\\/desk.fd.zol-img.com.cn\\/t_s##SIZE##\\/g5\\/M00\\/0B\\/05\\/ChMkJlcgdH2IVmv2AAYP2zcB7GQAAQr3gJjQtUABg_z016.jpg!awen)"},{"picId":"79087","oriSize":"2560x1600","resAll":["2560x1600","1920x1080","1680x1050","1600x900","1440x900","1366x768","1280x1024","1280x800","1024x768"],"imgsrc":"http:\\/\\/desk.fd.zol-img.com.cn\\/t_s##SIZE##\\/g5\\/M00\\/0B\\/05\\/ChMkJ1cgdHmIaBQ5AAxkE0uQNWQAAQr3QO8WIIADGQr480.jpg!awen)"},{"picId":"79088","oriSize":"2560x1600","resAll":["2560x1600","1920x1080","1680x1050","1600x900","1440x900","1366x768","1280x1024","1280x800","1024x768"],"imgsrc":"http:\\/\\/desk.fd.zol-img.com.cn\\/t_s##SIZE##\\/g5\\/M00\\/0B\\/05\\/ChMkJ1cgdHuIZqqBAA-e_5SjTQMAAQr3gD3HtsAD58X749.jpg!awen)"},{"picId":"79090","oriSize":"2560x1600","resAll":["2560x1600","1920x1080","1680x1050","1600x900","1440x900","1366x768","1280x1024","1280x800","1024x768"],"imgsrc":"http:\\/\\/desk.fd.zol-img.com.cn\\/t_s##SIZE##\\/g5\\/M00\\/0B\\/05\\/ChMkJlcgdH6IJHoZAAxppAZiw2UAAQr3gMT8GkADGm8468.jpg!awen)"},{"picId":"79091","oriSize":"2560x1600","resAll":["2560x1600","1920x1080","1680x1050","1600x900","1440x900","1366x768","1280x1024","1280x800","1024x768"],"imgsrc":"http:\\/\\/desk.fd.zol-img.com.cn\\/t_s##SIZE##\\/g5\\/M00\\/0B\\/05\\/ChMkJlcgdICIFA7wAAwW8vbAcEYAAQr3wBYmyYADBcK439.jpg!awen)"},{"picId":"79092","oriSize":"2560x1600","resAll":["2560x1600","1920x1080","1680x1050","1600x900","1440x900","1366x768","1280x1024","1280x800","1024x768"],"imgsrc":"http:\\/\\/desk.fd.zol-img.com.cn\\/t_s##SIZE##\\/g5\\/M00\\/0B\\/05\\/ChMkJ1cgdIKIHmjLAAW0KFEpkQUAAQr3wGS5wkABbRA556.jpg!awen)"},{"picId":"79093","oriSize":"2560x1600","resAll":["2560x1600","1920x1080","1680x1050","1600x900","1440x900","1366x768","1280x1024","1280x800","1024x768"],"imgsrc":"http:\\/\\/desk.fd.zol-img.com.cn\\/t_s##SIZE##\\/g5\\/M00\\/0B\\/05\\
/ChMkJ1cgdISId2OQAA3bomjVkl0AAQr3wI0BO4ADdu6814.jpg!awen)"}]};\r\n\t/***********\xc8\xab\xbe\xd6\xb1\xe4\xc1\xbf\xb5\xc4\xc9\xf9\xc3\xf7*************/\r\n\tvar $deskGlobalConfig = {\r\n\t\t\tgroupId\t\t\t: 6429,\t\t\t//\xd7\xe9\xcd\xbcID\r\n\t\t\tp
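If r.text comes out garbled, set the encoding to the one detected from the content:
>>> r.apparent_encoding
'GB2312'
>>> r.encoding = r.apparent_encoding
If the response headers do not contain a charset, the encoding is assumed to be ISO-8859-1.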
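A small sketch of why the wrong encoding produces garbled text, using only the standard library. The bytes are copied from the title portion of the r.content output shown earlier; decoding them as ISO-8859-1 is what happens when no charset is declared, and decoding them as GBK is what the page actually needs.

```python
# Bytes taken from the <title> in the r.content output: "ZOL" plus four GBK characters.
raw = b"ZOL\xd7\xc0\xc3\xe6\xb1\xda\xd6\xbd"

garbled = raw.decode("iso-8859-1")  # one character per byte, so the Chinese text turns into mojibake
readable = raw.decode("gbk")        # two bytes per Chinese character

print(len(raw), len(garbled), len(readable))
print(garbled)
print(readable)
```

This is the same choice that r.encoding controls: r.content holds the raw bytes, and r.text is those bytes decoded with r.encoding.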
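Handling request exceptions
Exception | Description |
---|---|
requests.ConnectionError | Network connection error, such as a DNS lookup failure or a refused connection |
requests.HTTPError | HTTP error |
requests.URLRequired | Missing URL |
requests.TooManyRedirects | The maximum number of redirects was exceeded |
requests.Timeout | The request to the URL timed out |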
#!/usr/bin/python
# -*- coding: utf-8 -*-
import requests

def getHTMLIMG(url):
    """Fetch the page at url and return its text, or an error message on failure."""
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raises HTTPError for 4xx/5xx status codes
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return "Request failed"

if __name__ == "__main__":
    url = "http://desk.zol.com.cn/bizhi/6429_79089_2.html"
    print(getHTMLIMG(url))
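Once the page source is available, the image addresses still have to be pulled out of it. The r.content output shown earlier embeds them in a JavaScript variable named deskPicArr, so one possible approach is to cut that JSON out with a regular expression. The sketch below runs on a shortened stand-in for the page source; the file names a.jpg and b.jpg, the helper name extract_image_urls, and the idea that replacing ##SIZE## with a resolution yields a downloadable URL are all assumptions that have not been checked against the live site.

```python
import json
import re

# Shortened stand-in for the page source; the real page embeds a much longer list.
html = (
    'var deskPicArr = {"list":[{"picId":"79089","oriSize":"2560x1600",'
    '"imgsrc":"http:\\/\\/desk.fd.zol-img.com.cn\\/t_s##SIZE##\\/g5\\/a.jpg"},'
    '{"picId":"79087","oriSize":"2560x1600",'
    '"imgsrc":"http:\\/\\/desk.fd.zol-img.com.cn\\/t_s##SIZE##\\/g5\\/b.jpg"}]};\r\n'
)


def extract_image_urls(page_source, size="1920x1080"):
    """Return the image URLs listed in deskPicArr, with ##SIZE## filled in (hypothetical helper)."""
    match = re.search(r"deskPicArr\s*=\s*(\{.*?\});", page_source, re.S)
    if match is None:
        return []
    pics = json.loads(match.group(1))["list"]  # json.loads turns "\/" into "/"
    return [pic["imgsrc"].replace("##SIZE##", size) for pic in pics]


urls = extract_image_urls(html)
for image_url in urls:
    print(image_url)

# Saving an image would then mean writing the binary body to a file, for example:
# with open("79089.jpg", "wb") as f:
#     f.write(requests.get(urls[0], timeout=30).content)
```

Reading the embedded JSON avoids parsing the HTML at all, but it depends on the page keeping that variable name, so a crawler built this way would need a fallback when the match fails.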
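HTTP methods in requests
Method | Description |
---|---|
requests.request() | Constructs a request; the base that all the methods below build on |
requests.get() | GET request; retrieves the resource content |
requests.head() | Retrieves only the headers |
requests.post() | Submits a POST request |
requests.put() | Submits a PUT request |
requests.patch() | Submits a partial-modification (PATCH) request |
requests.delete() | Submits a DELETE request |
Crawlers use the GET method most of the time.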
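As a rough illustration of how GET and POST differ, the following sketch prepares one request of each kind without sending it. The example.com URL and the field names are placeholders.

```python
import requests

url = "http://example.com/form"  # placeholder URL

# GET carries its data in the query string; POST carries it in the body.
get_req = requests.Request("GET", url, params={"page": 2}).prepare()
post_req = requests.Request("POST", url, data={"name": "demo"}).prepare()

print(get_req.method, get_req.url, get_req.body)
print(post_req.method, post_req.url, post_req.body)
print(post_req.headers["Content-Type"])
```

In everyday use requests.get(url, params=...) and requests.post(url, data=...) do the same preparation and then send the request.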
Beautiful Soup
Installation
pip install beautifulsoup4
Official documentation
https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html
Usage
➜ ~ python
Python 2.7.13 (v2.7.13:a06454b1afa1, Dec 17 2016, 12:39:47)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> demo = r.text
>>> demo
u'<html><head><title>This is a python demo page</title></head>\r\n<body>\r\n<p class="title"><b>The demo python introduces several python courses.</b></p>\r\n<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>\r\n</body></html>'
Import bs4
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo,"html.parser")
>>> print(soup.prettify())
<html>
<head>
<title>
This is a python demo page
</title>
</head>
<body>
<p class="title">
<b>
The demo python introduces several python courses.
</b>
</p>
<p class="course">
Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
Basic Python
</a>
and
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
Advanced Python
</a>
.
</p>
</body>
</html>
>>>
Show the title
>>> soup.title
<title>This is a python demo page</title>
Print the a tag
>>> tag = soup.a
>>> tag
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
Accessing a tag by name this way returns only the first matching tag.
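To collect every matching tag rather than just the first, Beautiful Soup provides find_all. A minimal sketch on a shortened copy of the demo markup (the "Courses:" text is abbreviated for the example):

```python
from bs4 import BeautifulSoup

# Shortened copy of the demo page's second paragraph.
html = (
    '<p class="course">Courses: '
    '<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and '
    '<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>'
)
soup = BeautifulSoup(html, "html.parser")

# soup.a gives only the first <a>; find_all returns all of them as a list.
links = [(a["id"], a["href"]) for a in soup.find_all("a")]
for link_id, href in links:
    print(link_id, href)
```

The same call with "img" instead of "a" is the usual starting point for gathering image addresses from a page.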
## Getting tag names
>>> soup.a.name
u'a'
>>> soup.a.parent.name
u'p'
>>> soup.a.parent.parent.name
u'body'
Getting a tag's attributes
>>> tag = soup.a
>>> tag.attrs
{u'href': u'http://www.icourse163.org/course/BIT-268001', u'class': [u'py1'], u'id': u'link1'}
>>> tag.attrs['class']
[u'py1']
>>> tag.attrs['href']
u'http://www.icourse163.org/course/BIT-268001'
Checking the types
>>> type(tag.attrs)
<type 'dict'>
>>> type(tag)
<class 'bs4.element.Tag'>
Getting the text inside a tag
>>> soup.a.string
u'Basic Python'
>>> soup.p.string
u'The demo python introduces several python courses.'
Note that the text of the first p tag is actually wrapped in a b tag:
<p class="title">
<b>
The demo python introduces several python courses.
</b>
</p>
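This shows that .string can reach through several levels of nesting, as long as each tag along the way has a single child.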
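The single-child condition matters in practice. The sketch below, on a shortened copy of the demo markup, shows .string returning None for a tag with several children, and get_text() as one way to collect the text anyway.

```python
from bs4 import BeautifulSoup

html = (
    '<p class="title"><b>The demo python introduces several python courses.</b></p>'
    '<p class="course">Python courses: <a id="link1">Basic Python</a> and '
    '<a id="link2">Advanced Python</a>.</p>'
)
soup = BeautifulSoup(html, "html.parser")

title_p = soup.find("p", class_="title")
course_p = soup.find("p", class_="course")

title_text = title_p.string        # single <b> child, so .string looks inside it
course_string = course_p.string    # several children, so .string is None
course_text = course_p.get_text()  # concatenates all text inside the tag

print(title_text)
print(course_string)
print(course_text)
```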
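Beautiful Soup elements
Traversal
Traversal comes in three kinds: downward, upward, and sibling (parallel). Downward traversal starts from a tag's children: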
>>> soup.head
<head><title>This is a python demo page</title></head>
>>> soup.head.contents
[<title>This is a python demo page</title>]
>>> soup.body.contents
[u'\n', <p class="title"><b>The demo python introduces several python courses.</b></p>, u'\n', <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>, u'\n']
>>> len(soup.body.contents)
5
>>> soup.body.contents[1]
<p class="title"><b>The demo python introduces several python courses.</b></p>
Attributes
- .contents
- .children and .descendants, which are iterators and need to be used with a for loop
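A minimal sketch of looping over .children and .descendants, on a shortened version of the demo markup. Both iterators also yield text nodes, so the example filters for Tag objects.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = (
    '<body><p class="title"><b>Title</b></p>'
    '<p class="course"><a id="link1">Basic</a> and <a id="link2">Advanced</a>.</p></body>'
)
soup = BeautifulSoup(html, "html.parser")

# .children yields only the direct children of <body>.
child_tags = [child.name for child in soup.body.children if isinstance(child, Tag)]

# .descendants walks the whole subtree in document order.
descendant_tags = [node.name for node in soup.body.descendants if isinstance(node, Tag)]

print(child_tags)       # ['p', 'p']
print(descendant_tags)  # ['p', 'b', 'p', 'a', 'a']
```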
Upward traversal
>>> soup.body.parent
<html><head><title>This is a python demo page</title></head>\n<body>\n<p class="title"><b>The demo python introduces several python courses.</b></p>\n<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>\n</body></html>
Upward traversal of the tag tree
Attributes
- .parent
- .parents
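While .parent returns one node, .parents is an iterator over every ancestor. A small sketch on a reduced version of the demo page:

```python
from bs4 import BeautifulSoup

html = "<html><body><p><a id='link1'>Basic Python</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Walk from the <a> tag up to the BeautifulSoup object at the root of the tree.
ancestor_names = [parent.name for parent in soup.a.parents]
print(ancestor_names)  # ['p', 'body', 'html', '[document]']
```

The last entry is the BeautifulSoup object itself, whose name is "[document]".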
Sibling traversal
Get the next sibling node
>>> soup.a.next_sibling
u' and '
>>> soup.a.next_sibling.next_sibling
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
Get the previous sibling node
>>> soup.a.previous_sibling
u'Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n'
Attributes
- .next_sibling
- .previous_sibling
- .next_siblings
- .previous_siblings
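The plural forms are iterators. Because siblings include text nodes such as ' and ', a loop often filters for Tag objects, as in this sketch on a shortened copy of the demo paragraph:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = '<p>Courses: <a id="link1">Basic Python</a> and <a id="link2">Advanced Python</a>.</p>'
soup = BeautifulSoup(html, "html.parser")
first_link = soup.a

# Everything after the first <a> at the same level: text, tag, text.
following = list(first_link.next_siblings)

# Keep only the tags among the following siblings.
following_tag_ids = [node["id"] for node in first_link.next_siblings if isinstance(node, Tag)]

# Everything before the first <a> at the same level.
preceding = list(first_link.previous_siblings)

print(following)
print(following_tag_ids)
print(preceding)
```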
Displaying content in a more readable way
>>> soup.prettify()
u'<html>\n <head>\n <title>\n This is a python demo page\n </title>\n </head>\n <body>\n <p class="title">\n <b>\n The demo python introduces several python courses.\n </b>\n </p>\n <p class="course">\n Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\n <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">\n Basic Python\n </a>\n and\n <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">\n Advanced Python\n </a>\n .\n </p>\n </body>\n</html>'
>>> print(soup.prettify())
<html>
<head>
<title>
This is a python demo page
</title>
</head>
<body>
<p class="title">
<b>
The demo python introduces several python courses.
</b>
</p>
<p class="course">
Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
Basic Python
</a>
and
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
Advanced Python
</a>
.
</p>
</body>
</html>
>>> print(soup.a.prettify())
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
Basic Python
</a>
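# Basic elements of the bs4 library
- Tag: a tag
- Name: the tag's name
- Attributes: the tag's attributes
- NavigableString: the string between a tag's opening and closing tags
- Comment: a comment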
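The element types listed above can be told apart with isinstance. A minimal sketch, using a small made-up snippet rather than the demo page:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Made-up markup containing one comment and one ordinary string.
markup = '<b><!--This is a comment--></b><p class="title">This is not a comment</p>'
soup = BeautifulSoup(markup, "html.parser")

tag = soup.p
tag_name = tag.name          # Name
tag_attrs = tag.attrs        # Attributes, a dict
plain_string = tag.string    # NavigableString
comment_node = soup.b.string # Comment, a subclass of NavigableString

print(isinstance(tag, Tag), tag_name, tag_attrs)
print(type(plain_string).__name__, plain_string)
print(type(comment_node).__name__, comment_node)
```

Since a Comment is also a NavigableString, code that prints .string may want to check for Comment first to avoid treating comments as page text.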
The content of this article comes from NetEase Cloud Classroom (网易云课堂).