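User-Agent
When a client requests a page from a server, it can attach extra information about itself (such as its operating system and browser) so that the server can respond more appropriately, for example by returning a different page for each kind of device. This extra information travels in request headers (HTTP request headers). RFC 2616 defines the following request header fields: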
request-header = Accept                   ; Section 14.1
               | Accept-Charset           ; Section 14.2
               | Accept-Encoding          ; Section 14.3
               | Accept-Language          ; Section 14.4
               | Authorization            ; Section 14.8
               | Expect                   ; Section 14.20
               | From                     ; Section 14.22
               | Host                     ; Section 14.23
               | If-Match                 ; Section 14.24
               | If-Modified-Since        ; Section 14.25
               | If-None-Match            ; Section 14.26
               | If-Range                 ; Section 14.27
               | If-Unmodified-Since      ; Section 14.28
               | Max-Forwards             ; Section 14.31
               | Proxy-Authorization      ; Section 14.34
               | Range                    ; Section 14.35
               | Referer                  ; Section 14.36
               | TE                       ; Section 14.39
               | User-Agent               ; Section 14.43
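The last field, User-Agent, carries a characteristic string that identifies the client's application type, operating system, and software version. A request sent with the requests library carries headers like these: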
Host: httpbin.org
User-Agent: python-requests/2.24.0
Accept-Encoding: gzip, deflate
Accept: */*
Connection: keep-alive
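Here the user agent is python-requests (with the standard-library urllib it would be Python-urllib). That is a telltale sign of a web crawler, so a server can easily recognise it and refuse service.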
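The default identity can be inspected locally without sending anything. The sketch below uses only the standard library. The /headers path and the replacement User-Agent string are placeholders chosen for illustration, not values from the article.

```python
import urllib.request

# The opener that urllib builds by default carries a "User-agent" header
# of the form "Python-urllib/<version>".
default_headers = dict(urllib.request.build_opener().addheaders)
default_ua = default_headers["User-agent"]
print(default_ua)

# Passing a headers dict replaces that default for one request.
# No network traffic happens here; the Request object is only built.
custom_ua = "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0"  # placeholder value
req = urllib.request.Request(
    "http://httpbin.org/headers",  # placeholder URL
    headers={"User-Agent": custom_ua},
)

# urllib normalises header names with str.capitalize(), hence "User-agent".
print(req.get_header("User-agent"))
```

Calling urllib.request.urlopen(req) would then send the request with the overridden header.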
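Blocking abnormal User-Agents with Nginx
Nginx can restrict access based on the User-Agent header, so the User-Agents of scrapers and crawlers can be blocked in the configuration file. The steps are as follows:
Add the following to the Nginx configuration file, inside a server or location block: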
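# Reject requests whose User-Agent matches a known scraper or crawler.
# The pattern is written inline because nginx does not expand variables
# inside a regular expression; "^$" catches an empty or missing header.
if ($http_user_agent ~* "curl|HttpClient|python-requests|BOT for JCE|ApacheBench|Python-urllib|ZmEu|WinHttp|^jaunty|FeedDemon|Jullo|JikeSpider|Indy Library|Alexa Toolbar|AskTbFXTV|CoolpadWebkit|Feedly|UniversalFeedParser|Microsoft URL Control|Swiftbot|oBot|YandexBot|EasouSpider|heritrix|LinkpadBot|Ezooms|^$") {
    return 403;
}
The if block tests the client's User-Agent header against the pattern with ~*, which is a case-insensitive regular-expression match. If any of the alternatives matches, Nginx returns 403 (access denied). The alternatives are regular-expression fragments rather than shell-style globs, so no trailing * is needed: a fragment matches wherever it appears in the header.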
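The same kind of case-insensitive match can be tried out offline before touching a live server. The sketch below is a Python stand-in, not Nginx itself. BLOCKED_UA and is_blocked are hypothetical names, and the pattern is a shortened subset of the blocklist.

```python
import re

# A shortened subset of the blocklist, written as a regular expression.
BLOCKED_UA = re.compile(
    r"curl|python-requests|python-urllib|ApacheBench|^$",
    re.IGNORECASE,  # plays the role of the ~* operator
)


def is_blocked(user_agent: str) -> bool:
    """Return True if this User-Agent would be rejected."""
    return BLOCKED_UA.search(user_agent) is not None


print(is_blocked("python-requests/2.24.0"))                     # True
print(is_blocked("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))  # False
print(is_blocked(""))                                           # True, "^$" matches an empty value
```

Running a handful of real header values through such a function is a cheap way to catch a pattern that blocks too much, or nothing at all.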
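After changing the Nginx configuration file, reload it so that the change takes effect. The following command checks the configuration and then reloads it: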
sudo nginx -t && sudo systemctl reload nginx
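Implementing a User-Agent check in Python Flask
- In the Flask application, use the @app.before_request decorator to register a function that runs before every request, and check the User-Agent request header in that function.
- Read the value of the User-Agent field. Create an antispider folder in the app directory and add an HTML file to it; its content is up to you.
- In check_user_agent(), inspect user_agent: if it contains the name of a common browser, let the request through; otherwise return an error response. For example: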
from flask import Blueprint, jsonify, request
from flask import render_template

antispider = Blueprint('antispider', __name__, url_prefix='/as')


@antispider.before_request
def check_user_agent():
    # The header may be absent, so fall back to an empty string
    # instead of None (which would raise a TypeError below).
    user_agent = request.headers.get("User-Agent", "")
    print(user_agent)
    if "Mozilla" in user_agent:
        # Let the request through.
        return None
    else:
        return "Bad request", 403
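The decision itself can be pulled out into a plain function, which makes it testable without a running Flask application. This is a minimal sketch; is_allowed_user_agent is a hypothetical name, and the rule is the same "contains Mozilla" test used in the handler.

```python
from typing import Optional


def is_allowed_user_agent(user_agent: Optional[str]) -> bool:
    """Allow only values that contain "Mozilla"."""
    # A client may omit the header entirely, in which case the lookup
    # yields None; treat that the same as an empty string.
    return "Mozilla" in (user_agent or "")


print(is_allowed_user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))
print(is_allowed_user_agent("python-requests/2.24.0"))
print(is_allowed_user_agent(None))
```

A check this simple only filters clients that keep their default identity. Any client that sends a browser-like string passes it.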
Bypassing anti-crawler checks with a random User-Agent
import random

USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
    "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0",
    "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; InfoPath.3; rv:11.0) like Gecko",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
    "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11",
    "Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TheWorld)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Avant Browser)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
    "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
    "Mozilla/5.0 (iPod; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
    "Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
    "Mozilla/5.0 (Linux; U; Android 2.3.7; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Opera/9.80 (Android 2.3.4; Linux; Opera Mobi/build-1107180945; U; en-GB) Presto/2.8.149 Version/11.10",
    "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
    "Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; en) AppleWebKit/534.1+ (KHTML, like Gecko) Version/6.0.0.337 Mobile Safari/534.1+",
    "Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.0; U; en-US) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/233.70 Safari/534.6 TouchPad/1.0",
    "Mozilla/5.0 (SymbianOS/9.4; Series60/5.0 NokiaN97-1/20.0.019; Profile/MIDP-2.1 Configuration/CLDC-1.1) AppleWebKit/525 (KHTML, like Gecko) BrowserNG/7.1.18124",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0; HTC; Titan)",
    "UCWEB7.0.2.37/28/999",
    "NOKIA5700/ UCWEB7.0.2.37/28/999",
    "Openwave/ UCWEB7.0.2.37/28/999",
    "Mozilla/4.0 (compatible; MSIE 6.0;) Opera/UCWEB7.0.2.37/28/999",
    "Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376e Safari/8536.25",
]

# headers = {'User-Agent': random.choice(USER_AGENTS)}


# Pick a random User-Agent string from the list.
def get_user_agent():
    return random.choice(USER_AGENTS)
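A random pick only helps once it is attached to an outgoing request. The sketch below shows one way to do that with the standard library, under a few assumptions: build_request is a hypothetical helper, the URL is a placeholder, and two entries stand in for the full list.

```python
import random
import urllib.request

# Two entries stand in for the full list; both are taken from it.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0",
    "Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
]


def get_user_agent() -> str:
    """Pick one User-Agent string at random."""
    return random.choice(USER_AGENTS)


def build_request(url: str) -> urllib.request.Request:
    """Create a request whose User-Agent is chosen at random (hypothetical helper)."""
    return urllib.request.Request(url, headers={"User-Agent": get_user_agent()})


req = build_request("http://httpbin.org/headers")  # placeholder URL
# urllib stores header names capitalised as "User-agent".
print(req.get_header("User-agent"))
```

Building a fresh request for each URL gives each one an independently chosen header. Passing req to urllib.request.urlopen would actually send it.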
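If maintaining a list like the one above is too much trouble, the fake_useragent or anti-useragent package can be used instead. Both are very simple to use:
Example: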
from fake_useragent import UserAgent

ua = UserAgent()

# Get a random browser user-agent string
print(ua.random)

# Or get user-agent string from a specific browser
print(ua.chrome)
# Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36
print(ua.google)
# Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/537.13 (KHTML, like Gecko) Chrome/24.0.1290.1 Safari/537.13
print(ua['google chrome'])
# Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36
print(ua.firefox)
# Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0
print(ua.ff)
# Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Firefox/102.0
print(ua.safari)
# Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.2 Safari/605.1.15
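Because the package is a third-party dependency, a crawler may want a fallback for environments where it is missing or fails to load its data. The following is a minimal sketch of that idea. random_user_agent and FALLBACK_USER_AGENTS are hypothetical names, and the fallback strings are copied from the sample outputs above.

```python
import random

# Used only when fake_useragent is unavailable.
FALLBACK_USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
]


def random_user_agent() -> str:
    """Return a random User-Agent, preferring fake_useragent when it works."""
    try:
        from fake_useragent import UserAgent
        return UserAgent().random
    except Exception:
        # Covers a missing package as well as errors raised while the
        # package loads its browser data.
        return random.choice(FALLBACK_USER_AGENTS)


print(random_user_agent())
```

The broad except is deliberate here: for this purpose any failure is handled the same way, by falling back to the local list.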
User-Agents of major search engines
# Baidu
Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)
Mozilla/5.0 (compatible; Baiduspider-render/2.0; +http://www.baidu.com/search/spider.html)
Mozilla/5.0 (iPhone; CPU iPhone OS 9_1 like Mac OS X) AppleWebKit/601.1.46 (KHTML, like Gecko) Version/9.0 Mobile/13B143 Safari/601.1 (compatible; Baiduspider-render/2.0; +http://www.baidu.com/search/spider.html)
Baiduspider-image (+http://www.baidu.com/search/spider.htm)

# 360
Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0); 360spider (http://webscan.360.cn)
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36; 360Spider
Mozilla/5.0 (Linux; U; Android 4.0.2; en-us; Galaxy Nexus Build/ICL53F) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30; 360Spider
Mozilla/5.0 (Linux; U; Android 4.0.2; en-us; Galaxy Nexus Build/ICL53F) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30; HaosouSpider

# Google
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (compatible; Googlebot-Image/1.0; +http://www.google.com/bot.html)
AdsBot-Google-Mobile (+http://www.google.com/mobile/adsbot.html) Mozilla (iPhone; U; CPU iPhone OS 30 like Mac OS X) AppleWebKit (KHTML, like Gecko) Mobile Safari

# Bing
Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534+ (KHTML, like Gecko) BingPreview/1.0b
Mozilla/5.0 (Linux; Android 8.0.0; MHA-AL00 Build/HUAWEIMHA-AL00; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/68.0.3440.91 Mobile Safari/537.36 BingWeb/6.9.6
Mozilla/5.0 (Linux; Android 8.0.0; MI6 Build/OPR1.170623.027; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/70.0.3538.110 Mobile Safari/537.36 BingWeb/6.9.6
Mozilla/5.0 (Linux; Android 8.0.0; ONEPLUS A3010 Build/OPR1.170623.032; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/67.0.3396.87 Mobile Safari/537.36 BingWeb/6.9.0
Mozilla/5.0 (iPhone; CPU iPhone OS 7_0 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) Version/7.0 Mobile/11A465 Safari/9537.53 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
Mozilla/5.0 (iPhone; CPU iPhone OS 7_0 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) Version/7.0 Mobile/11A465 Safari/9537.53 BingPreview/1.0b

# Tencent Soso
Sosospider (+http://help.soso.com/webspider.htm)
Sosoimagespider (+http://help.soso.com/soso-image-spider.htm)

# Yahoo
Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
Mozilla/5.0 (compatible; Yahoo! Slurp China; http://misc.yahoo.com.cn/help.html)

# Sogou
Sogou Pic Spider/3.0 (+http://www.sogou.com/docs/help/webmasters.htm#07)
Sogou web spider/4.0 (+http://www.sogou.com/docs/help/webmasters.htm#07)
Sogou inst spider/4.0 (+http://www.sogou.com/docs/help/webmasters.htm#07)
Sogou spider (+http://www.sogou.com/docs/help/webmasters.htm#07)
Sogou wap spider (+http://www.sogou.com/docs/help/webmasters.htm#07)
Sogou News Spider/4.0 (+http://www.sogou.com/docs/help/webmasters.htm#07)
Sogou Pic Spider/3.0 (+http://www.sogou.com/docs/help/webmasters.htm#07)
Sogou Video Spider/3.0 (+http://www.sogou.com/docs/help/webmasters.htm#07)
Sogou Push Spider/3.0 (+http://www.sogou.com/docs/help/webmasters.htm#07)

# NetEase Youdao
Mozilla/5.0 (compatible; YoudaoBot/1.0; http://www.youdao.com/help/webmaster/spider/;)

# ByteDance
Mozilla/5.0 (compatible; Bytespider; https://zhanzhang.toutiao.com/) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.0.0 Safari/537.36
Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; Bytespider; https://zhanzhang.toutiao.com/)
Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit/537.36 (KHTML, like Gecko) Version/7.0 Mobile Safari/537.36 (compatible; Bytespider; https://zhanzhang.toutiao.com/)

# Applebot
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Applebot/0.1)

# Shenma Search
Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.81 YisouSpider/5.0 Safari/537.36
Mozilla/5.0 (iPhone; CPU iPhoneOS 10_3 like Mac OS X) AppleWebKit/602.1.50 (KHTML, like Gecko) CriOS/56.0.2924.75 Mobile/14E5239e YisouSpider/5.0 Safari/602.1
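A site that blocks unknown crawlers usually still wants to let search engines in, which means recognising strings like these. The sketch below maps a distinctive token from each listed User-Agent to an engine name. CRAWLER_TOKENS and identify_crawler are hypothetical names, and the token choice is an assumption based only on the strings listed here.

```python
from typing import Optional

# Tokens taken from the User-Agent strings listed in this section.
# Keys are lower-case because the listed strings vary in case
# (both "360spider" and "360Spider" occur).
CRAWLER_TOKENS = {
    "baiduspider": "Baidu",
    "360spider": "360",
    "haosouspider": "360",
    "googlebot": "Google",
    "adsbot-google": "Google",
    "bingbot": "Bing",
    "bingpreview": "Bing",
    "sosospider": "Tencent Soso",
    "sosoimagespider": "Tencent Soso",
    "yahoo! slurp": "Yahoo",
    "sogou": "Sogou",
    "youdaobot": "NetEase Youdao",
    "bytespider": "ByteDance",
    "applebot": "Apple",
    "yisouspider": "Shenma",
}


def identify_crawler(user_agent: str) -> Optional[str]:
    """Return the engine name if a known crawler token appears, else None."""
    lowered = user_agent.lower()
    for token, engine in CRAWLER_TOKENS.items():
        if token in lowered:
            return engine
    return None


print(identify_crawler("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"))
print(identify_crawler("Sogou web spider/4.0 (+http://www.sogou.com/docs/help/webmasters.htm#07)"))
print(identify_crawler("python-requests/2.24.0"))
```

The header is supplied by the client, so a match shows only what the client claims to be, not who it is.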
References:
- fake-useragent/fake-useragent: Up-to-date simple useragent faker with realworld database (github.com)
- ihandmine/anti-useragent: fake pc or app browser useragent, anti useragent, and other awesome tools (github.com)
- alecxe/scrapy-fake-useragent: Random User-Agent middleware based on fake-useragent (github.com)