Notes on Scraping WeChat Official Accounts

Preface

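I recently needed timely notifications of new posts from a few authoritative security-focused WeChat official accounts, so I looked into ways of scraping them. The most common approach is to capture WeChat's own requests and analyze them. While searching, I found that the query interface of Sogou WeChat search can be called directly, so I went with that; the drawback is that it only shows the ten most recent articles. Another method uses a personal WeChat subscription account: when creating new material, the article lookup feature can find every article of an account. I did not need that here, so I have not used it yet.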

Interface

Taking 长亭安全课堂 as an example, I successfully retrieved the content of its ten most recent articles.
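
As a small illustration of the interface, the search URL can be built with the standard library so that a Chinese account name is percent-encoded explicitly. This is only a sketch: the parameter names and values are copied from the URL used in the code below, and the helper name build_search_url is my own.

```python
from urllib.parse import urlencode

SOGOU_SEARCH = 'https://weixin.sogou.com/weixin'

def build_search_url(account_name):
    """Return the Sogou WeChat search URL for one account name."""
    # Same parameters, in the same order, as the hand-built URL in the script.
    params = {
        'type': 1,
        's_from': 'input',
        'query': account_name,
        'ie': 'utf8',
        '_sug_': 'n',
        '_sug_type_': '',
    }
    return SOGOU_SEARCH + '?' + urlencode(params)

print(build_search_url('360CERT'))
```

The result can be passed to get_info() in place of the concatenated string.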

Result screenshot

Code

#-*- coding:utf-8 -*-
__author__ = "Poochi"
__Date__ = "2018/12/6"

import requests
from bs4 import BeautifulSoup as bs
import js2xml
from lxml import etree
import time
import datetime
from email.mime.text import MIMEText
from email.header import Header
import smtplib


def send_email(content):
    """Send the given content as an HTML email."""
    sender = '*****'
    password = '****'
    receiver = ['***@qq.com']
    message = MIMEText(content, 'html', 'utf-8')
    message['From'] = sender
    message['To'] = ",".join(receiver)
    subject = 'The latest vulns info from Wechat'
    message['Subject'] = Header(subject, 'utf-8')
    try:
        smtpObj = smtplib.SMTP('smtp.mxhichina.com')
        smtpObj.login(sender, password)
        smtpObj.sendmail(sender, receiver, message.as_string())
        smtpObj.quit()
        print('send success')
    except smtplib.SMTPException:
        print('send failed')

def get_info(url):
    # Search Sogou for the account, then follow the link to its profile page.
    r = requests.get(url=url)
    soup_1 = bs(r.text, 'html.parser')
    href = soup_1.find('div', {'class': 'img-box'}).find('a')['href']
    res = requests.get(url=href)
    soup = bs(res.text, 'html.parser')
    try:
        # The article data sits in the 8th <script> tag under <body>.
        script = soup.select('body script')[7].string
    except Exception as e:
        # Without the script there is nothing to parse, so stop here.
        print(e)
        return
    # Convert the JavaScript to XML so the fields can be located with XPath.
    src_text = js2xml.parse(script, debug=False)
    src_tree = js2xml.pretty_print(src_text)
    selector = etree.HTML(src_tree)
    content = selector.xpath("//property[@name = 'title']/string/text()")[0]
    media = selector.xpath("//var[@name = 'name']/binaryoperation/right/string/text()")[0]
    send_time = selector.xpath("//property[@name = 'datetime']/number/@value")[0]
    s_time = int(send_time)
    print(media)
    print(content)
    print(send_time)
    now_time = int(time.time())
    print(now_time)
    # Age of the article in seconds.
    ava_time = now_time - s_time
    print(ava_time)
    print_time = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time()))
    msg = content + ' from ' + media
    # Log every run; mail only articles published within the last 30 minutes.
    with open('log.txt', 'a+') as f:
        f.write(print_time + '\n')
    if ava_time < 1800:
        send_email(msg)

def main():
    table = ['长亭安全课堂', '360CERT', '猎户攻防实验室']
    for name in table:
        url = 'https://weixin.sogou.com/weixin?type=1&s_from=input&query=' + name + '&ie=utf8&_sug_=n&_sug_type_='
        get_info(url)

if __name__ == '__main__':
    main()

A few notes

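Sogou also rate-limits requests, apparently per IP: too many requests within the same period trigger a captcha prompt, which this script does not handle yet. For the email push, the crawler is scheduled to run once every 30 minutes.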
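
The 30-minute schedule pairs with the 1800-second comparison in the script: an article is mailed only if it was published within the last run interval, so each article is sent at most once. A minimal sketch of that check follows. The names RUN_INTERVAL and is_new_article are hypothetical, and the lower bound (ignoring timestamps in the future) is an extra guard that the original script does not have.

```python
import time

# Seconds between crawler runs; matches the 1800 used in the script.
RUN_INTERVAL = 30 * 60

def is_new_article(publish_ts, now=None, window=RUN_INTERVAL):
    """Return True if the article was published within the last `window` seconds."""
    if now is None:
        now = int(time.time())
    age = now - int(publish_ts)
    # Reject future timestamps as well as articles older than the window.
    return 0 <= age < window

print(is_new_article(int(time.time()) - 60))
```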

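The code uses the js2xml library to process the script text in the response and then XPath to locate the elements. If there is a more convenient way, all the better.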
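
One possibly simpler alternative, sketched here under a strong assumption: if the value assigned in the script happens to be valid JSON, the standard json module can decode it straight out of the JavaScript text without js2xml. The variable name articleData and the sample string are hypothetical, chosen only to resemble the title and datetime fields that the XPath expressions look for; I have not checked this against the real page.

```python
import json
import re

def extract_js_object(script_text, var_name):
    """Return the JSON value assigned in `var <var_name> = ...`, or None."""
    match = re.search(r'var\s+' + re.escape(var_name) + r'\s*=\s*', script_text)
    if match is None:
        return None
    try:
        # raw_decode parses one JSON value starting at the given index
        # and ignores whatever JavaScript follows it.
        obj, _end = json.JSONDecoder().raw_decode(script_text, match.end())
    except json.JSONDecodeError:
        return None
    return obj

# Hypothetical sample shaped like an embedded article list.
sample = ('var name = "demo" || ""; '
          'var articleData = {"list": [{"title": "t1", "datetime": 1544080000}]}; '
          'var x = 1;')
data = extract_js_object(sample, 'articleData')
print(data['list'][0]['title'], data['list'][0]['datetime'])
```

This fails (returns None) for object literals that are not strict JSON, such as unquoted keys or single-quoted strings, in which case js2xml remains the safer choice.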

Anti-scraping

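Testing showed that Tencent has an anti-scraping policy in place, so the script cannot keep pushing on a schedule for long. Requests work again after switching IP or after entering a captcha.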

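As for a proxy pool, rotating through free proxies found online is not stable, so I did not try that at this point. Instead I wanted to recognize the captcha and pass the verification automatically. My tests showed that the recognition rate can reach 100%. I first tried Python's pytesseract and the results were poor; it would of course be possible to train a custom character set and then change the default data file it loads. The other option was the recognition interface of the 万能英数 software, which was very accurate.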

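However, the POST request that submits the captcha carries a cert parameter, a timestamp tied to the captcha, and I could not find its value in the link, the response, or anywhere else. So even though recognition was easy, I shelved this approach for the time being.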

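Finally I went back to my actual requirement: I only need the latest article. It turns out that the previous request, the Sogou search results page, already provides it, and it did not trigger the anti-scraping captcha.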

import re  # needed in addition to the imports above

def get_info(url, table, a):
    # Send the request through a local proxy.
    proxies = {'http': 'http://127.0.0.1:1080', 'https': 'http://127.0.0.1:1080'}
    r = requests.get(url=url, proxies=proxies)
    soup = bs(r.text, 'html.parser')
    # Title of the latest article shown on the search results page.
    content = soup.find('a', {'uigs': 'account_article_0'}).string
    # The publish time is embedded in a script as timeConvert('<unix timestamp>').
    send_time = soup.find_all('span')[-1].find('script').string
    re_time = re.findall(r"timeConvert\('(.*?)'\)", send_time)[0]
    print(re_time)
    print(content)
    now_time = int(time.time())
    print_time = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time()))
    # Age of the article in seconds.
    ava_time = now_time - int(re_time)
    print(ava_time)
    msg = content + ' from ' + table[a]
    # Log every run; mail only articles published within the last 30 minutes.
    with open('log.txt', 'a+') as f:
        f.write(print_time + '\n')
    if ava_time < 1800:
        send_email(msg)

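Only this function was changed slightly; other ways of getting around the anti-scraping measures are still being tested. I saw earlier articles that bypass it by making the first request through webdriver, but that did not work when I tried it, so for now I am using this approach.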
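
The timestamp extraction in the modified function can be tried on its own. The sketch below reuses the same regular expression; the sample script string is hypothetical and only shaped to match that pattern, and parse_publish_time is an illustrative name. Unlike the function above, it returns None instead of raising when the pattern is missing.

```python
import re

# Same pattern as in get_info(): captures the argument of timeConvert('...').
TIME_RE = re.compile(r"timeConvert\('(.*?)'\)")

def parse_publish_time(script_text):
    """Return the Unix timestamp inside timeConvert('...'), or None if absent."""
    found = TIME_RE.findall(script_text or '')
    if not found or not found[0].isdigit():
        return None
    return int(found[0])

# Hypothetical sample of the inline script text.
sample = "document.write(timeConvert('1544080000'))"
ts = parse_publish_time(sample)
print(ts)
```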

Update

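The text above was written earlier. Scraping Sogou WeChat over a long period inevitably runs into the anti-scraping policy: even when fetching only one article, the IP eventually gets banned and the script stops working. I first tried proxy pools. I found some free ones, scraped their proxy lists, and tested which proxies were usable. Some of the free proxies worked, but almost none of their IPs could fetch Sogou content; the requests were answered with the anti-scraping verification page instead.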

Looking at the ajax request that submits the captcha to lift the ban, I found the following JS:

$(function() {
	function c() {
		$("#seccodeImage").attr("src", "util/seccode.php?tc=" + (new Date).getTime())
	}
	function l(a) {
		var b = a.code,
			d = $("#error-tips"),
			e = new Date,
			f = location.hostname,
			g = "sogou.com"; - 1 < f.indexOf("sogo.com") ? g = "sogo.com" : "snapshot.sogoucdn.com" === f && (g = "snapshot.sogoucdn.com");
		if (0 === b || 1 === b) getCookie("SUV") || setCookie("SUV", 1E3 * e.getTime() + Math.round(1E3 * Math.random()), "Sun, 29 July 2046 00:00:00 UTC", g, "/"), e.setTime(e.getTime() + 31536E6), setCookie("SNUID", a.id, e.toGMTString(), g, "/");

After the captcha is submitted, the IP is unblocked and a new SNUID is generated, so requests can continue with the new cookie.
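
To make the cookie logic in that JS easier to follow, here is a Python sketch that mirrors it: SUV is a millisecond timestamp times 1000 plus a random 0 to 1000, and the SNUID expiry is set 31536E6 milliseconds (365 days) ahead. The helper names are my own, and whether Sogou accepts an SUV generated on the client side like this is an untested assumption; the SNUID itself must still come from the server after a successful captcha.

```python
import random
import time

# 31536E6 ms in the JS: the SNUID cookie expires 365 days after it is set.
ONE_YEAR_MS = 31536e6

def make_suv(now_ms=None):
    """Mirror the JS: 1E3 * e.getTime() + Math.round(1E3 * Math.random())."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return 1000 * now_ms + round(1000 * random.random())

def build_cookies(snuid, suv=None):
    """Build a cookie dict that could be passed as requests.get(..., cookies=...)."""
    if suv is None:
        suv = make_suv()
    return {'SNUID': snuid, 'SUV': str(suv)}

print(build_cookies('EXAMPLE-SNUID'))
```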

The code is as follows:

from selenium import webdriver
from selenium.webdriver.common.by import By
from PIL import Image
# FateadmApi comes from the captcha-solving platform's Python SDK and must be imported as well.

def cookie_init():
    """Solve the anti-scraping captcha in headless Chrome and return the new SNUID."""
    retries = 1
    while retries < 3:
        cookie = {}
        chrome_options = webdriver.ChromeOptions()
        chrome_options.add_argument('--headless')
        chrome_options.add_argument('--disable-gpu')
        chrome_options.add_argument('--no-sandbox')
        chrome_options.add_argument('--disable-dev-shm-usage')
        client = webdriver.Chrome(options=chrome_options)
        try:
            client.get("https://weixin.sogou.com/antispider/?from=%2fweixin%3Ftype%3d2%26query%3d360CERT")
            path = '/Users/victor/Documents/Python Scripts/proxyPool/1.png'
            imgpath = '/Users/victor/Documents/Python Scripts/proxyPool/yzm.png'
            # Screenshot the whole page, then crop out the captcha image.
            client.get_screenshot_as_file(path)
            im = Image.open(path)
            box = (705, 598, 900, 680)  # region to crop
            region = im.crop(box)
            region.save(imgpath)
            capt = client.find_element(By.XPATH, '//*[@id="seccodeInput"]')
            test = FateadmApi('','',',')   # client for the captcha-solving platform
            code = test.PredictFromFile('30600', imgpath)
            print(code)
            # Type the recognized code and submit the form.
            capt.send_keys(code)
            time.sleep(1)
            client.find_element(By.XPATH, '//*[@id="submit"]').click()
            time.sleep(2)
            for item in client.get_cookies():
                cookie[item["name"]] = item["value"]
        finally:
            # Always close the browser, even if a step above fails.
            client.quit()
        if 'SNUID' not in cookie:
            print("Unlock failed. Retries left: {0:d}".format(2 - retries))
            retries += 1
            continue
        print(cookie['SNUID'])
        time.sleep(5)
        return cookie['SNUID']
    # Every attempt failed.
    return None
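
A sketch of how the unlock step could be wired into the crawler: fetch the page, and only when the request ends up on the anti-scraping page, obtain a fresh SNUID and try again. Everything here is hypothetical glue code. It assumes that a blocked request ends at a URL containing /antispider/ (as in the address opened by cookie_init), and it takes the fetch and unlock steps as plain callables so the retry logic stays independent of requests and Selenium.

```python
def looks_blocked(final_url):
    """Assumption: blocked requests end up on an /antispider/ URL."""
    return '/antispider/' in final_url

def fetch_with_unlock(fetch, unlock, max_unlocks=2):
    """Call fetch(snuid); when blocked, get a new SNUID from unlock() and retry.

    fetch(snuid) must return (final_url, html); unlock() returns an SNUID or None.
    Returns the html on success, or None if the block could not be lifted.
    """
    snuid = None
    for attempt in range(max_unlocks + 1):
        final_url, html = fetch(snuid)
        if not looks_blocked(final_url):
            return html
        if attempt == max_unlocks:
            break
        snuid = unlock()
        if snuid is None:
            break
    return None

# Demo with stand-in callables: blocked without a cookie, fine with one.
def demo_fetch(snuid):
    if snuid is None:
        return 'https://weixin.sogou.com/antispider/?from=x', ''
    return 'https://weixin.sogou.com/weixin?type=1', '<html>ok</html>'

print(fetch_with_unlock(demo_fetch, lambda: 'NEW-SNUID'))
```

In real use, fetch could wrap requests.get(url, cookies={'SNUID': snuid}) and return (r.url, r.text), and unlock could be cookie_init.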

Finally

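It's the weekend again. Happy, aren't you? I don't know how you'll be spending it, but I'm already frozen stiff, so I'm off to sleep, sleep, sleep.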