Python web scraping

Scraping images from a website with Python. If we open

http://desk.zol.com.cn/bizhi/6429_79089_2.html

we find that the page contains many images, and clicking "next page" jumps to the next one. So how do we use Python to scrape the image resources on this page?

The Python requests library

The official documentation for the requests library: http://docs.python-requests.org/en/master/

Installation

pip(3) install requests

Usage

To use the requests library you need some familiarity with the HTTP protocol: concepts such as status codes, request headers, response headers, methods, fields, and parameters.

After importing the library, we send a GET request:

>>> import requests
>>> r = requests.get("http://desk.zol.com.cn/bizhi/6429_79089_2.html")
>>> r.status_code  # a status code of 200 means the request succeeded
200

In the call requests.get(url, params=None, **kwargs) above:

  • url: the URL of the page to fetch

  • params: extra parameters added to the URL's query string. Custom request headers are passed separately through the headers keyword argument (one of the **kwargs below); for example, some sites use anti-hotlink protection and only serve requests that carry a particular Referer or User-Agent, so we can supply those headers ourselves (see the sketch after this list).

  • **kwargs: 12 additional keyword arguments that control the request, such as headers, timeout, cookies, and proxies
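
As a minimal sketch of the difference (the Referer and User-Agent values below are placeholders, not headers this site is known to require): query-string parameters go in params, while custom request headers go in headers.

import requests

# Hypothetical header values, purely for illustration
headers = {
    "User-Agent": "Mozilla/5.0",
    "Referer": "http://desk.zol.com.cn/",
}
params = {"p": 2}  # appended to the URL as a query string: ...?p=2

r = requests.get("http://desk.zol.com.cn/bizhi/6429_79089_2.html",
                 params=params, headers=headers, timeout=10)
print(r.status_code)
print(r.url)  # the final URL with the query string attached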

The response headers can be inspected through r.headers:

>>> type(r)
<class 'requests.models.Response'>
>>> r.headers
{'Content-Length': '27571', 'Via': 'http/1.1 zats-other1 (zcache-other1 [cRs f ])', 'Age': '3983', 'Expires': 'Mon, 10 Jul 2017 05:59:40 GMT', 'Vary': 'Accept-Encoding', 'Server': 'ngx_openresty', 'Last-Modified': 'Mon, 10 Jul 2017 03:59:40 GMT', 'Connection': 'keep-alive', 'Cache-Control': 'max-age=7200', 'Date': 'Mon, 10 Jul 2017 05:06:03 GMT', 'nnCoection': 'close', 'Content-Type': 'text/html; charset=GBK'}

Attributes of the Response object

Attribute            Description
r.status_code        HTTP status code of the response; 200 means success, anything else indicates a problem
r.text               The response body as a string, i.e. the page content corresponding to the URL
r.encoding           The response encoding guessed from the HTTP headers
r.apparent_encoding  The response encoding inferred from the content itself
r.content            The response body in binary (bytes) form

>>> r.encoding
'GBK'

>>> r.content
'<!DOCTYPE HTML>\r\n<html>\r\n<head>\r\n<meta http-equiv="Content-Type" content="text/html; charset=gb2312">\r\n<meta name="applicable-device" content="pc">\r\n<title>\xba\xab\xb9\xfa\xd0\xa1\xc7\xe5\xd0\xc2\xc4\xcf\xb9\xe7\xc0\xf6\xbf\xed\xc6\xc1\xb1\xda\xd6\xbd-ZOL\xd7\xc0\xc3\xe6\xb1\xda\xd6\xbd</title>\r\n              <meta name="keywords" content="" />\r\n              <meta name="description" content=""/><meta property="og:type" content="image"/>\n<meta property="og:image" content="http://desk.fd.zol-img.com.cn/t_s120x90c5/g5/M00/0B/05/ChMkJlcgdH2IVmv2AAYP2zcB7GQAAQr3gJjQtUABg_z016.jpg!awen)"/>\n\r\n<link href="http://s.zol-img.com.cn/d/Desk/Desk_bizhi_detail.css?v=1028" rel="stylesheet" type="text/css" />\r\n\r\n<script src="http://p.zol-img.com.cn/desk/detail.js" type="text/javascript"></script>\r\n<script src="http://icon.zol-img.com.cn/public/js/swfobject.js" type="text/javascript"></script>\r\n<script src="http://icon.zol-img.com.cn/getcook.js?1312" type="text/javascript"></script>\r\n<script>\r\n\tdocument.domain = "zol.com.cn";\r\n\tvar userid = get_cookie(\'zol_userid\');\r\n\t\tvar deskPicArr \t\t= {"list":[{"picId":"79089","oriSize":"2560x1600","resAll":["2560x1600","1920x1080","1680x1050","1600x900","1440x900","1366x768","1280x1024","1280x800","1024x768"],"imgsrc":"http:\\/\\/desk.fd.zol-img.com.cn\\/t_s##SIZE##\\/g5\\/M00\\/0B\\/05\\/ChMkJlcgdH2IVmv2AAYP2zcB7GQAAQr3gJjQtUABg_z016.jpg!awen)"},{"picId":"79087","oriSize":"2560x1600","resAll":["2560x1600","1920x1080","1680x1050","1600x900","1440x900","1366x768","1280x1024","1280x800","1024x768"],"imgsrc":"http:\\/\\/desk.fd.zol-img.com.cn\\/t_s##SIZE##\\/g5\\/M00\\/0B\\/05\\/ChMkJ1cgdHmIaBQ5AAxkE0uQNWQAAQr3QO8WIIADGQr480.jpg!awen)"},{"picId":"79088","oriSize":"2560x1600","resAll":["2560x1600","1920x1080","1680x1050","1600x900","1440x900","1366x768","1280x1024","1280x800","1024x768"],"imgsrc":"http:\\/\\/desk.fd.zol-img.com.cn\\/t_s##SIZE##\\/g5\\/M00\\/0B\\/05\\/ChMkJ1cgdHuIZqqBAA-e_5SjTQMAAQr3gD3HtsAD58X749.jpg!awen)"},{"picId":"79090","oriSize":"2560x1600","resAll":["2560x1600","1920x1080","1680x1050","1600x900","1440x900","1366x768","1280x1024","1280x800","1024x768"],"imgsrc":"http:\\/\\/desk.fd.zol-img.com.cn\\/t_s##SIZE##\\/g5\\/M00\\/0B\\/05\\/ChMkJlcgdH6IJHoZAAxppAZiw2UAAQr3gMT8GkADGm8468.jpg!awen)"},{"picId":"79091","oriSize":"2560x1600","resAll":["2560x1600","1920x1080","1680x1050","1600x900","1440x900","1366x768","1280x1024","1280x800","1024x768"],"imgsrc":"http:\\/\\/desk.fd.zol-img.com.cn\\/t_s##SIZE##\\/g5\\/M00\\/0B\\/05\\/ChMkJlcgdICIFA7wAAwW8vbAcEYAAQr3wBYmyYADBcK439.jpg!awen)"},{"picId":"79092","oriSize":"2560x1600","resAll":["2560x1600","1920x1080","1680x1050","1600x900","1440x900","1366x768","1280x1024","1280x800","1024x768"],"imgsrc":"http:\\/\\/desk.fd.zol-img.com.cn\\/t_s##SIZE##\\/g5\\/M00\\/0B\\/05\\/ChMkJ1cgdIKIHmjLAAW0KFEpkQUAAQr3wGS5wkABbRA556.jpg!awen)"},{"picId":"79093","oriSize":"2560x1600","resAll":["2560x1600","1920x1080","1680x1050","1600x900","1440x900","1366x768","1280x1024","1280x800","1024x768"],"imgsrc":"http:\\/\\/desk.fd.zol-img.com.cn\\/t_s##SIZE##\\/g5\\/M00\\/0B\\/05\\/ChMkJ1cgdISId2OQAA3bomjVkl0AAQr3wI0BO4ADdu6814.jpg!awen)"}]};\r\n\t/***********\xc8\xab\xbe\xd6\xb1\xe4\xc1\xbf\xb5\xc4\xc9\xf9\xc3\xf7*************/\r\n\tvar $deskGlobalConfig = {\r\n\t\t\tgroupId\t\t\t: 6429,\t\t\t//\xd7\xe9\xcd\xbcID\r\n\t\t\tp

If the text comes back garbled, check the encoding detected from the content and reset r.encoding:

>>> r.apparent_encoding
'GB2312'
>>> r.encoding = r.apparent_encoding

If the HTTP headers do not contain a charset, requests assumes the encoding is ISO-8859-1, which is why r.apparent_encoding, inferred from the content itself, is usually the more reliable value to assign back to r.encoding.

Handling request exceptions

Exception                   Description
requests.ConnectionError    Network connection error, such as a DNS lookup failure or a refused connection
requests.HTTPError          HTTP error (raised, for example, by raise_for_status() for a non-2xx response)
requests.URLRequired        A valid URL is required but missing
requests.TooManyRedirects   The request exceeded the maximum number of redirects
requests.ConnectTimeout     Timed out while connecting to the remote server
requests.Timeout            The request for the URL timed out

The script below wraps the request in try/except and uses raise_for_status() to turn any non-200 response into an HTTPError:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import requests


def getHTMLIMG(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raise requests.HTTPError if the status code is not 200
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return "request failed"


if __name__ == "__main__":
    url = "http://desk.zol.com.cn/bizhi/6429_79089_2.html"
    print(getHTMLIMG(url))
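
The script above reports every failure with the same message. To react differently to the individual exception types listed in the table, you could catch them separately; this is only a sketch of that pattern, not part of the original script:

import requests


def fetch(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # non-2xx status codes raise requests.HTTPError
        r.encoding = r.apparent_encoding
        return r.text
    except requests.Timeout:
        return "the request timed out"
    except requests.TooManyRedirects:
        return "too many redirects"
    except requests.ConnectionError:
        return "network error (DNS failure, refused connection, ...)"
    except requests.HTTPError as e:
        return "HTTP error: %s" % e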

The HTTP methods in requests

Method              Description
requests.request()  Constructs a request; the basis underlying each of the methods below
requests.get()      Sends a GET request to retrieve a page or resource
requests.head()     Retrieves only the response headers
requests.post()     Sends a POST request
requests.put()      Sends a PUT request
requests.patch()    Sends a PATCH (partial modification) request
requests.delete()   Sends a DELETE request

When writing scrapers, the get method is what we use most of the time.
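
For completeness, here is a small sketch of two of the other methods: requests.head() downloads only the headers, and requests.post() submits form data. httpbin.org is a public request-echo service, used here purely for illustration.

import requests

# HEAD: fetch only the response headers, no body is downloaded
r = requests.head("http://desk.zol.com.cn/bizhi/6429_79089_2.html")
print(r.headers.get("Content-Type"))

# POST: submit form data; httpbin.org echoes back what it received
r = requests.post("http://httpbin.org/post", data={"key": "value"})
print(r.status_code)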

Beautiful Soup

Installation

pip install beautifulsoup4

Official documentation

https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html

Usage

➜  ~ python
Python 2.7.13 (v2.7.13:a06454b1afa1, Dec 17 2016, 12:39:47)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> demo = r.text
>>> demo
u'<html><head><title>This is a python demo page</title></head>\r\n<body>\r\n<p class="title"><b>The demo python introduces several python courses.</b></p>\r\n<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>\r\n</body></html>'

Import bs4:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo,"html.parser")
>>> print(soup.prettify())
<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>
>>>

Show the title:

>>> soup.title
<title>This is a python demo page</title>

Print an a tag:

>>> tag = soup.a
>>> tag
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>

The call above only returns the first matching tag.
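
soup.a (and likewise soup.title, soup.p, ...) always gives just the first match. To collect every a tag you would normally use find_all(); the snippet below is not part of the original session, but on the same demo page it behaves like this:

>>> for link in soup.find_all('a'):
...     print(link.get('href'))
...
http://www.icourse163.org/course/BIT-268001
http://www.icourse163.org/course/BIT-1001870001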

Get the name

>>> soup.a.name
u'a'
>>> soup.a.parent.name
u'p'
>>> soup.a.parent.parent.name
u'body'

Get the tag's attribute data:

>>> tag = soup.a
>>> tag.attrs
{u'href': u'http://www.icourse163.org/course/BIT-268001', u'class': [u'py1'], u'id': u'link1'}
>>> tag.attrs['class']
[u'py1']
>>> tag.attrs['href']
u'http://www.icourse163.org/course/BIT-268001'

Check the types:

>>> type(tag.attrs)
<type 'dict'>
>>> type(tag)
<class 'bs4.element.Tag'>

Get the string inside a tag:

>>> soup.a.string
u'Basic Python'
>>> soup.p.string
u'The demo python introduces several python courses.'

As the prettified output shows, the content of the p tag actually contains a b tag:

<p class="title">
     <b>
      The demo python introduces several python courses.
     </b>
    </p>

which shows that .string can reach down through a nested tag.

Beautiful Soup elements

Traversal

Traversal comes in three kinds: downward, upward, and sideways (sibling) traversal.

>>> soup.head
<head><title>This is a python demo page</title></head>
>>> soup.head.contents
[<title>This is a python demo page</title>]
>>> soup.body.contents
[u'\n', <p class="title"><b>The demo python introduces several python courses.</b></p>, u'\n', <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>, u'\n']
>>> len(soup.body.contents)
5
>>> soup.body.contents[1]
<p class="title"><b>The demo python introduces several python courses.</b></p>

Downward traversal methods

  • .contents
  • .children and .descendants, which are iterators and need to be used in a for loop (see the example after this list)
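
For example (an illustrative snippet, not from the original session), iterating over .children visits each direct child of body in turn, including the newline strings between the tags:

>>> for child in soup.body.children:
...     print(type(child).__name__)
...
NavigableString
Tag
NavigableString
Tag
NavigableString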

Upward traversal

>>> soup.body.parent
<html><head><title>This is a python demo page</title></head>\n<body>\n<p class="title"><b>The demo python introduces several python courses.</b></p>\n<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>\n</body></html>

Upward traversal of the tag tree

Methods

  • .parent
  • .parents (an iterator; see the example after this list)
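
For example (again not from the original session), .parents walks from a tag all the way up to the document object:

>>> for parent in soup.a.parents:
...     print(parent.name)
...
p
body
html
[document]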

Sibling (parallel) traversal

Get the next sibling:

>>> soup.a.next_sibling
u' and '

>>> soup.a.next_sibling.next_sibling
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>

Get the previous sibling:

>>> soup.a.previous_sibling
u'Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n'

Methods

  • .next_sibling
  • .previous_sibling
  • .next_siblings
  • .previous_siblings

Display the content in a friendlier way

>>> soup.prettify()
u'<html>\n <head>\n  <title>\n   This is a python demo page\n  </title>\n </head>\n <body>\n  <p class="title">\n   <b>\n    The demo python introduces several python courses.\n   </b>\n  </p>\n  <p class="course">\n   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\n   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">\n    Basic Python\n   </a>\n   and\n   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">\n    Advanced Python\n   </a>\n   .\n  </p>\n </body>\n</html>'
>>> print(soup.prettify())
<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>
>>> print(soup.a.prettify())
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
 Basic Python
</a>

Basic elements of the bs4 library

  • Tag: a tag
  • Name: the tag's name
  • Attributes: the tag's attributes
  • NavigableString: the string between tags
  • Comment: a comment in the document
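
Coming back to the question at the start of the article, here is a rough sketch that combines requests and Beautiful Soup to download images from the wallpaper page. Note that, as the r.content dump earlier shows, the full-resolution wallpaper URLs on this particular page live inside a JavaScript variable (deskPicArr), so the simple img-tag approach below only grabs the image URLs that appear directly in the HTML; treat it as a starting point rather than a finished scraper. The User-Agent value and the output directory name are placeholders.

#!/usr/bin/python
# -*- coding: utf-8 -*-

import os

import requests
from bs4 import BeautifulSoup


def download_images(page_url, out_dir="images"):
    headers = {"User-Agent": "Mozilla/5.0"}  # placeholder UA, in case the default one is rejected
    r = requests.get(page_url, headers=headers, timeout=30)
    r.raise_for_status()
    r.encoding = r.apparent_encoding

    soup = BeautifulSoup(r.text, "html.parser")
    if not os.path.exists(out_dir):
        os.makedirs(out_dir)

    for img in soup.find_all("img"):
        src = img.get("src")
        if not src or not src.startswith("http"):
            continue  # skip missing or relative image sources
        resp = requests.get(src, headers=headers, timeout=30)
        if resp.status_code != 200:
            continue
        filename = os.path.join(out_dir, src.split("/")[-1])
        with open(filename, "wb") as f:
            f.write(resp.content)  # .content is the binary form, right for image data
        print("saved %s" % filename)


if __name__ == "__main__":
    download_images("http://desk.zol.com.cn/bizhi/6429_79089_2.html")

Getting the original 2560x1600 files would additionally require pulling the deskPicArr data out of the page's script block and substituting a resolution into its ##SIZE## placeholder, which is left as an exercise.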

The content of this article is based on a course from 网易云课堂 (NetEase Cloud Classroom).