Notes on Scraping WeChat Official Accounts

Preface

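I recently needed to receive timely push notifications from a few authoritative security-focused WeChat official accounts, so I looked into ways of scraping them. The most common approach is to capture and analyze the requests the WeChat client makes. While searching, I found that Sogou's WeChat search offers a query interface, so I used that instead; the drawback is that it only shows the ten most recent articles. Another method uses a personal WeChat subscription account: when creating new material, the article lookup can find an account's full article history. I did not need that here, so I have not used it.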

Interface

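Taking 长亭安全课堂 (Chaitin Security Classroom) as an example, I successfully retrieved the content of its ten most recent articles.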
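As a minimal sketch of how the search URL for this interface can be assembled: the parameter names and values below are copied from the script later in this post, while the helper name `build_search_url` is my own. Unlike plain string concatenation, `urlencode` percent-encodes the Chinese account name.

```python
from urllib.parse import urlencode

SOGOU_SEARCH = "https://weixin.sogou.com/weixin"


def build_search_url(account_name):
    """Return the Sogou WeChat search URL for one account name.

    The parameter set mirrors the URL used by the script in this post;
    what each parameter means on Sogou's side is not documented here.
    """
    params = {
        "type": 1,
        "s_from": "input",
        "query": account_name,
        "ie": "utf8",
        "_sug_": "n",
        "_sug_type_": "",
    }
    return SOGOU_SEARCH + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_search_url("长亭安全课堂"))
```

Building the query with `urlencode` avoids surprises if an account name ever contains `&`, spaces, or other reserved characters.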

Result screenshot

Code

# -*- coding: utf-8 -*-
__author__ = "Poochi"
__Date__ = "2018/12/6"

import requests
from bs4 import BeautifulSoup as bs
import js2xml
from lxml import etree
import time
import datetime
from email.mime.text import MIMEText
from email.header import Header
import smtplib


def send_email(content):
    """Send the notification text as an HTML email."""
    sender = '*****'
    password = '****'
    receiver = ['***@qq.com']
    message = MIMEText(content, 'html', 'utf-8')
    message['From'] = sender
    message['To'] = ",".join(receiver)
    subject = 'The latest vulns info from Wechat'
    message['Subject'] = Header(subject, 'utf-8')
    try:
        smtpObj = smtplib.SMTP('smtp.mxhichina.com')
        smtpObj.login(sender, password)
        smtpObj.sendmail(sender, receiver, message.as_string())
        print('send success')
    except smtplib.SMTPException:
        print('send failed')


def get_info(url):
    # Step 1: search page -> link to the account's article list
    r = requests.get(url=url)
    soup_1 = bs(r.text, 'html.parser')
    href = soup_1.find('div', {'class': 'img-box'}).find('a')['href']
    # Step 2: the article list page embeds its data in a <script> block
    res = requests.get(url=href)
    soup = bs(res.text, 'html.parser')
    try:
        script = soup.select('body script')[7].string
    except Exception as e:
        # Without the script block there is nothing to parse
        print(e)
        return
    # Convert the JavaScript source to XML so it can be queried with XPath
    src_text = js2xml.parse(script, debug=False)
    src_tree = js2xml.pretty_print(src_text)
    selector = etree.HTML(src_tree)
    content = selector.xpath("//property[@name = 'title']/string/text()")[0]
    media = selector.xpath("//var[@name = 'name']/binaryoperation/right/string/text()")[0]
    send_time = selector.xpath("//property[@name = 'datetime']/number/@value")[0]
    s_time = int(send_time)
    print(media)
    print(content)
    print(send_time)
    now_time = int(time.time())
    print(now_time)
    # Age of the newest article in seconds
    ava_time = now_time - s_time
    print(ava_time)
    print_time = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time()))
    msg = content + ' from ' + media
    with open('log.txt', 'a+') as f:
        f.write(print_time + '\n')
    # Only notify for articles published within the last 30 minutes
    if ava_time < 1800:
        send_email(msg)


def main():
    table = ['长亭安全课堂', '360CERT', '猎户攻防实验室']
    for i in table:
        url = 'https://weixin.sogou.com/weixin?type=1&s_from=input&query=' + i + '&ie=utf8&_sug_=n&_sug_type_='
        get_info(url)


if __name__ == '__main__':
    main()

A few notes

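Sogou's site also rate-limits requests, apparently per IP: too many requests within a short period trigger a CAPTCHA prompt, which this script does not handle yet. For the email push, the crawler is scheduled to run once every 30 minutes.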
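The 30-minute schedule is why the script compares the article's age against 1800 seconds: each run only reports articles published since the previous run. A small sketch of that check follows; the function name `is_new_article` and the injectable `now` argument are my own additions for illustration.

```python
import time

CRAWL_INTERVAL = 1800  # seconds; matches the 30-minute schedule


def is_new_article(publish_ts, now=None, window=CRAWL_INTERVAL):
    """Return True if the article was published less than `window` seconds ago.

    `publish_ts` is a Unix timestamp (int or numeric string).
    `now` can be passed explicitly, which makes the check easy to test.
    """
    if now is None:
        now = int(time.time())
    age = now - int(publish_ts)
    return age < window


if __name__ == "__main__":
    print(is_new_article(int(time.time()) - 60))    # published a minute ago
    print(is_new_article(int(time.time()) - 7200))  # published two hours ago
```

If the schedule ever changes, the window should change with it; otherwise articles are either reported twice or missed.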

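The code uses the js2xml library to parse the script text in the response, then locates the elements with XPath. If there is a more convenient way, so much the better.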
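One possibly simpler route, sketched here under assumptions: if the script block assigns a plain JSON object to a variable, a regular expression plus the standard `json` module can replace js2xml and XPath. The variable name `msgList`, the nesting of the keys, and the sample text below are all hypothetical; only the `title` and `datetime` property names come from the XPath expressions above. Check the real page source before relying on this.

```python
import json
import re

# Hypothetical sample of the embedded script; the real layout may differ.
SAMPLE_SCRIPT = '''
var name = "" || "Demo Account";
var msgList = {"list": [{"comm_msg_info": {"datetime": 1544080000}, "app_msg_ext_info": {"title": "Demo title"}}]};
'''


def extract_latest(script_text):
    """Return (title, datetime) of the first entry, or None if not found."""
    # Grab the object literal assigned to msgList (assumed to sit on one line)
    match = re.search(r"var\s+msgList\s*=\s*(\{.*\});", script_text)
    if match is None:
        return None
    data = json.loads(match.group(1))
    first = data["list"][0]
    return first["app_msg_ext_info"]["title"], first["comm_msg_info"]["datetime"]


if __name__ == "__main__":
    print(extract_latest(SAMPLE_SCRIPT))
```

This only works when the assigned value is strict JSON; for arbitrary JavaScript object literals, the js2xml approach remains the more robust one.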

Anti-scraping

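Testing showed that Tencent has anti-scraping measures in place, so the script cannot keep pushing on a schedule for long. Switching IP allows requests again, as does entering the CAPTCHA.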

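As for a proxy pool, rotating through free proxies found online is not stable either, so I did not try it. Instead I wanted to recognize the CAPTCHA and pass the verification automatically. My tests showed that recognition accuracy can reach 100%. I first tried Python's pytesseract, which performed poorly out of the box; it could be improved by training a custom character set and replacing the default trained-data file. The other option was the recognition interface of the 万能英数 software, which was very accurate.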

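However, the POST request that submits the CAPTCHA includes a cert parameter, a timestamp tied to the CAPTCHA, whose value I could not find in the URL, the response, or anywhere else. So even though recognition was easy, this approach was shelved for the time being.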

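Then I went back to my actual requirement: I only need the latest article. It turns out the preceding request (the search results page) already provides that, and it did not trigger the anti-scraping CAPTCHA.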

import re  # needed for the timeConvert() regex below


def get_info(url, table, a):
    # Requests go through a local proxy
    proxies = {'http': 'http://127.0.0.1:1080', 'https': 'http://127.0.0.1:1080'}
    r = requests.get(url=url, proxies=proxies)
    soup = bs(r.text, 'html.parser')
    # Title of the newest article, taken directly from the search results page
    content = soup.find('a', {'uigs': 'account_article_0'}).string
    # The publish time sits inside an inline script as timeConvert('<timestamp>')
    send_time = soup.find_all('span')[-1].find('script').string
    re_time = re.findall(r"timeConvert\('(.*?)'\)", send_time)[0]
    print(re_time)
    print(content)
    now_time = int(time.time())
    print_time = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time()))
    # Age of the newest article in seconds
    ava_time = now_time - int(re_time)
    print(ava_time)
    msg = content + ' from ' + table[a]
    with open('log.txt', 'a+') as f:
        f.write(print_time + '\n')
    # Only notify for articles published within the last 30 minutes
    if ava_time < 1800:
        send_email(msg)

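Only this function was changed slightly; other ways of bypassing the anti-scraping measures are still being tested. An earlier article I came across got past the first request with webdriver; I tried that and it did not work, so I am using this approach for now.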
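The timestamp extraction in that function can be tried in isolation. The sketch below runs the same kind of regular expression against a sample fragment; the fragment itself is an assumption about what the inline script looks like, and the helper name `extract_publish_ts` is hypothetical. Unlike the indexing with `[0]` in the function, it returns `None` instead of raising when no timestamp is present.

```python
import re

# Assumed shape of the inline script on the search results page
SAMPLE = "<span><script>document.write(timeConvert('1544080000'))</script></span>"


def extract_publish_ts(html_fragment):
    """Return the Unix timestamp passed to timeConvert(), or None."""
    match = re.search(r"timeConvert\('(\d+)'\)", html_fragment)
    if match is None:
        return None
    return int(match.group(1))


if __name__ == "__main__":
    print(extract_publish_ts(SAMPLE))
```

Returning `None` lets the caller skip an account whose page layout has changed rather than crash the whole run.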

Update

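The text above was written earlier. Scraping Sogou WeChat over a long period inevitably runs into the anti-scraping policy: even fetching a single article eventually gets the IP banned, and the script stops working. I first tried a proxy pool: I collected some free proxy lists, crawled them, and tested each proxy for usability. Some of the free proxies worked in general, but almost none of them could fetch Sogou content. The response was as follows: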

Looking at the AJAX request that submits the CAPTCHA to lift the ban, the page contains the following JS:

$(function() {
    function c() {
        $("#seccodeImage").attr("src", "util/seccode.php?tc=" + (new Date).getTime())
    }
    function l(a) {
        var b = a.code,
            d = $("#error-tips"),
            e = new Date,
            f = location.hostname,
            g = "sogou.com";
        -1 < f.indexOf("sogo.com") ? g = "sogo.com" : "snapshot.sogoucdn.com" === f && (g = "snapshot.sogoucdn.com");
        if (0 === b || 1 === b) getCookie("SUV") || setCookie("SUV", 1E3 * e.getTime() + Math.round(1E3 * Math.random()), "Sun, 29 July 2046 00:00:00 UTC", g, "/"), e.setTime(e.getTime() + 31536E6), setCookie("SNUID", a.id, e.toGMTString(), g, "/");

Once the CAPTCHA is submitted, the IP is unblocked and a new SNUID is generated, so requests can continue with the new cookie.
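To make the cookie logic in that JS easier to follow, here is a rough Python rendering of what it computes when the response code is 0 or 1. It is a reading of the excerpt above, not Sogou's documented behavior; the function name `build_unblock_cookies` and its arguments are hypothetical. `response_id` stands for `a.id` in the script.

```python
import random
import time

ONE_YEAR_MS = 31536 * 10**6  # 31536E6 ms in the script, i.e. 365 days


def build_unblock_cookies(response_id, now_ms=None, existing_suv=None):
    """Mirror the SUV/SNUID handling from the CAPTCHA page script.

    Returns (cookies, snuid_expiry_ms). SUV is only generated when none
    exists yet; SNUID is always replaced by the id from the response.
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    suv = existing_suv
    if suv is None:
        # 1E3 * e.getTime() + Math.round(1E3 * Math.random())
        suv = str(1000 * now_ms + round(1000 * random.random()))
    cookies = {"SUV": suv, "SNUID": response_id}
    # e.setTime(e.getTime() + 31536E6): SNUID expires one year later
    return cookies, now_ms + ONE_YEAR_MS


if __name__ == "__main__":
    cookies, expiry = build_unblock_cookies("EXAMPLE_ID")
    print(cookies, expiry)
```

The practical takeaway is that the new SNUID comes from the server's response, so a client only has to capture it after a successful CAPTCHA submission and send it along with later requests.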

The code is as follows:

import time
from PIL import Image
from selenium import webdriver
from selenium.webdriver.common.by import By
# FateadmApi comes from the CAPTCHA-solving platform's SDK (import not shown in the original)


def cookie_init():
    """Solve the anti-spider CAPTCHA in headless Chrome and return the new SNUID."""
    retries = 1
    while retries < 3:
        cookie = {}
        chrome_options = webdriver.ChromeOptions()
        chrome_options.add_argument('--headless')
        chrome_options.add_argument('--disable-gpu')
        chrome_options.add_argument('--no-sandbox')
        chrome_options.add_argument('--disable-dev-shm-usage')
        client = webdriver.Chrome(options=chrome_options)
        client.get("https://weixin.sogou.com/antispider/?from=%2fweixin%3Ftype%3d2%26query%3d360CERT")
        path = '/Users/victor/Documents/Python Scripts/proxyPool/1.png'
        imgpath = '/Users/victor/Documents/Python Scripts/proxyPool/yzm.png'
        # Screenshot the whole page, then crop out the CAPTCHA image
        client.get_screenshot_as_file(path)
        im = Image.open(path)
        box = (705, 598, 900, 680)  # region to crop (the CAPTCHA)
        region = im.crop(box)
        region.save(imgpath)
        capt = client.find_element(By.XPATH, '//*[@id="seccodeInput"]')
        test = FateadmApi('','',',')  # call the CAPTCHA-solving platform
        code = test.PredictFromFile('30600', imgpath)
        #code = '123456'
        print(code)
        capt.send_keys(code)
        time.sleep(1)
        client.find_element(By.XPATH, '//*[@id="submit"]').click()
        time.sleep(2)
        # Collect the cookies set after the CAPTCHA was submitted
        for item in client.get_cookies():
            cookie[item["name"]] = item["value"]
        client.quit()
        try:
            print(cookie['SNUID'])
        except Exception:
            retries += 1
            print("Unblock failed. Retries left: {0:d}".format(3 - retries))
            continue
        time.sleep(5)
        return cookie['SNUID']

Finally

The weekend is here again. Happy? I don't know how you will be spending it, but I am frozen stiff, so I am going to sleep, sleep, sleep.
