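Web Scraping Techniques
Capturing Requests
- Press F12 to open the browser developer tools, find the request in the Network panel, and copy it as cURL (Bash).
- Paste the command into a scraper code generator to produce equivalent scraping code in a variety of programming languages.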
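To make the conversion step less of a black box, here is a minimal sketch of what such a generator does with a copied cURL command. The function name `curl_to_request` is hypothetical, and the sketch only handles a small, common subset of curl flags (`-H`, `-X`, `-b`, `-d`/`--data-raw`); it is not a full curl parser.

```python
import shlex


def curl_to_request(curl_command: str) -> dict:
    """Extract method, URL, headers and body from a simple `curl` command."""
    # Browsers emit multi-line commands joined by a backslash + newline.
    tokens = shlex.split(curl_command.replace("\\\n", " "))
    if not tokens or tokens[0] != "curl":
        raise ValueError("not a curl command")

    request = {"method": "GET", "url": None, "headers": {}, "data": None}
    i = 1
    while i < len(tokens):
        token = tokens[i]
        if token in ("-H", "--header"):
            name, _, value = tokens[i + 1].partition(":")
            request["headers"][name.strip()] = value.strip()
            i += 2
        elif token in ("-b", "--cookie"):
            request["headers"]["Cookie"] = tokens[i + 1]
            i += 2
        elif token in ("-X", "--request"):
            request["method"] = tokens[i + 1].upper()
            i += 2
        elif token in ("-d", "--data", "--data-raw"):
            request["data"] = tokens[i + 1]
            if request["method"] == "GET":
                request["method"] = "POST"  # curl switches to POST when a body is given
            i += 2
        elif token.startswith("-"):
            i += 1  # value-less flags such as --compressed are ignored
        else:
            if request["url"] is None:
                request["url"] = token  # first positional argument is the URL
            i += 1
    return request
```

The resulting dictionary maps directly onto `requests.request(method, url, headers=..., data=...)`.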
Example
|
import requests

# query (search keyword) and pageNum (paging parameter) are supplied by the caller;
# parse_info and save_image are helper functions that are not shown here.
headers = {
    'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36 Edg/113.0.1774.42',
}
params = {
    'tn': 'resultjson_com',
    'word': query,
    'pn': pageNum,
}
response = requests.get('https://image.baidu.com/search/acjson', params=params, headers=headers).json()
info = response['data']
parsed_info = parse_info(info)
for url in parsed_info:
    save_image(url['url'])
|
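The example calls a `parse_info` helper that is not shown. The following is only a sketch of what such a helper could look like, given that the loop reads `url['url']` from each result. The field name `thumbURL` and the presence of empty entries in the `data` list are assumptions about the response, not something the original states; inspect the captured JSON and adjust `url_key` accordingly.

```python
def parse_info(items, url_key="thumbURL"):
    """Keep only entries that carry an image URL, normalised to {'url': ...}.

    url_key is an assumed field name; check the real response before relying on it.
    """
    results = []
    for item in items:
        # Skip anything that is not a dict or has no (non-empty) URL field.
        url = item.get(url_key) if isinstance(item, dict) else None
        if url:
            results.append({"url": url})
    return results
```

Filtering at this stage keeps the download loop simple, since every element it receives is guaranteed to have a usable `url` key.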
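Parsing with BeautifulSoup
- Fetch the page directly with requests.get(url), adding headers so the request looks like it comes from a browser.
- Parse the response with bs4.BeautifulSoup and extract the tags and text you need.
Problems
- The specific data you want may be missing from the response.
- The page may require login; in that case, try JS reverse engineering.
- Alternatively, use selenium to drive a real browser and click through the page; this works in the vast majority of cases.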
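Before choosing between these options, it helps to check why the data is missing. The sketch below is a rough heuristic only: the function name `diagnose_response`, the login markers, and the three result labels are all assumptions made for illustration, and real pages may need different checks.

```python
def diagnose_response(html: str, expected_text: str,
                      login_markers=("login", "登录")) -> str:
    """Guess why a fetched page does or does not contain the wanted data."""
    if expected_text in html:
        return "ok"  # plain requests + BeautifulSoup is enough
    lowered = html.lower()
    if any(marker.lower() in lowered for marker in login_markers):
        return "login_required"  # consider JS reverse engineering of the login
    # The text is absent and there is no login hint: the page is probably
    # filled in by JavaScript after loading, so a real browser may be needed.
    return "probably_rendered_by_js"
```

Pass it the `text` of a `requests` response together with a string you can see in the browser, for example a title shown on the page.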
Example
|
from bs4 import BeautifulSoup


def parse_single_html(html):
    """Parse one list page and return a list of dicts, one per item."""
    soup = BeautifulSoup(html, 'html.parser')
    article_items = (
        soup.find("div", class_="article")
        .find("ol", class_="grid_view")
        .find_all("div", class_="item")
    )
    data_list = []
    for article_item in article_items:
        rank = article_item.find("div", class_="pic").find("em").get_text()
        info = article_item.find("div", class_="info")
        title = info.find("div", class_="hd").find("span", class_="title").get_text()
        stars = info.find("div", class_="bd").find("div", class_="star").find_all("span")
        # the first CSS class of the first span encodes the star rating
        rating_star = stars[0]["class"][0]
        rating_num = stars[1].get_text()
        comments = stars[3].get_text()
        data_list.append({
            "rank": rank,
            "title": title,
            # strip the "rating" prefix and "-t" suffix from the class name
            "rating_star": rating_star.replace("rating", "").replace("-t", ""),
            "rating_num": rating_num,
            # strip the "人评价" ("people rated") suffix, leaving only the count
            "comments": comments.replace("人评价", ""),
        })
    return data_list
|
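Every field returned by `parse_single_html` is still a string. As a small follow-up sketch, the records can be converted to numeric types before sorting or saving them. The helper name `normalize_record` and the sample values below are made up for illustration; the sketch assumes `rank` and `comments` contain only digits and `rating_num` is a decimal number.

```python
def normalize_record(record: dict) -> dict:
    """Return a copy of one parsed record with numeric fields converted."""
    return {
        "rank": int(record["rank"]),
        "title": record["title"].strip(),
        "rating_star": record["rating_star"],  # left as-is: its encoding is site-specific
        "rating_num": float(record["rating_num"]),
        "comments": int(record["comments"]),
    }


# Made-up sample in the shape produced by parse_single_html
sample = {"rank": "1", "title": " Example ", "rating_star": "5",
          "rating_num": "9.7", "comments": "12345"}
normalized = normalize_record(sample)
```

With numeric fields, expressions such as `sorted(records, key=lambda r: r["rating_num"])` behave as expected instead of comparing strings.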
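JS Reverse Engineering
JS reverse engineering is mainly used when you need to simulate a login and the login endpoint encrypts its parameters.
Steps
- Capture the login request with F12.
- If the request is encrypted, read the site's source code to find the encryption logic.
- Run the JS with the execjs library, or rewrite it in Python, to reproduce the encryption.
- Log in successfully with the encrypted value.
Example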
|
import execjs

# public_exponent, modulus and password must be defined beforehand
with open('../security.js') as f:
    js_data = f.read()
ctx = execjs.compile(js_data)
encrypted_pwd = ctx.call('getEncrypedPwd', public_exponent, modulus, password)
|
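The other route mentioned in the steps is to rewrite the JS in Python. The exponent and modulus arguments in the example suggest RSA, so here is a sketch of unpadded "textbook" RSA using only the standard library. This is an illustration, not the site's actual algorithm: real login scripts usually add a padding scheme and a specific output encoding, both of which have to be read from the JS. The function name `rsa_encrypt_hex` and the hex input/output format are assumptions.

```python
def rsa_encrypt_hex(plaintext: str, exponent_hex: str, modulus_hex: str) -> str:
    """Textbook RSA without padding: c = m^e mod n, returned as a hex string.

    Assumes the exponent and modulus are given as hex strings.
    """
    e = int(exponent_hex, 16)
    n = int(modulus_hex, 16)
    # Interpret the UTF-8 bytes of the plaintext as one big-endian integer.
    m = int.from_bytes(plaintext.encode("utf-8"), "big")
    if m >= n:
        raise ValueError("plaintext is too long for this modulus")
    return format(pow(m, e, n), "x")
```

Compare the output against the value the browser actually sends for the same password; if they differ, the JS most likely pads or encodes the data differently.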
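Tips
- Send requests through a requests.Session object rather than calling requests.get or requests.post directly; the session updates cookies automatically, so you do not have to maintain them by hand.
- Modify the JS where needed so that it exposes a clear entry function that is easy to call from Python.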
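The first tip can be sketched as follows. The helper names `make_session` and `login_and_fetch`, the form fields, and both URLs are placeholders chosen for illustration; only `requests.Session`, its `headers`, and its `post`/`get` methods come from the library.

```python
import requests


def make_session(user_agent: str) -> requests.Session:
    """Create a session whose headers are sent with every request."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session


def login_and_fetch(session: requests.Session, login_url: str,
                    form: dict, target_url: str) -> str:
    """Log in, then fetch a page that needs the login cookies."""
    resp = session.post(login_url, data=form, timeout=10)
    resp.raise_for_status()
    # Cookies set by the login response are now stored in session.cookies
    # and are attached to the next request automatically.
    return session.get(target_url, timeout=10).text
```

Because the cookie jar lives on the session, an encrypted password produced by the JS step can be posted once and every later request reuses the resulting login state.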
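Cracking Slider CAPTCHAs
- To crack a slider CAPTCHA, first obtain the URLs of the gap (puzzle piece) image and the original background image, download both, and run OpenCV template matching on them.
- The template matching must be tuned so that it is robust.
- Use selenium to simulate dragging the slider by hand.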
Example
|
import json
import time

import cv2
import numpy as np
import requests
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# username, password and calculate_left_margin (the template matching helper) must be defined beforehand
options = Options()
options.add_argument('--headless')
driver = webdriver.Firefox(options=options)  # pass options, otherwise --headless is ignored
driver.get('xxxx.html')  # open the login page with selenium
slider = driver.find_element(By.CLASS_NAME, 'zfdun_slider_bar_btn')
# locate the slider images, which are needed to work out the slider position
slider_bg = driver.find_element(By.CLASS_NAME, 'zfdun_bgimg_jigsaw')
large_bg = driver.find_element(By.CLASS_NAME, 'zfdun_bgimg_img')
template_img_url = large_bg.get_attribute('src')
img_url = slider_bg.get_attribute('src')
# collect all cookies of the current page so that requests can download the images
cookies = {cookie['name']: cookie['value'] for cookie in driver.get_cookies()}
template_res = requests.get(template_img_url, cookies=cookies)
img_res = requests.get(img_url, cookies=cookies)
# convert the binary data to numpy arrays and decode them with OpenCV
template = cv2.imdecode(np.frombuffer(template_res.content, np.uint8), cv2.IMREAD_COLOR)
img = cv2.imdecode(np.frombuffer(img_res.content, np.uint8), cv2.IMREAD_COLOR)
# compute the left margin, i.e. how far the slider has to move
left_margin = calculate_left_margin(template, img)
print("left_margin", left_margin)
# enter the user name
username_input = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, 'yhm')))
username_input.send_keys(username)
# enter the password
password_input = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, 'mm')))
password_input.send_keys(password)
# read the slider's initial position and size
slider_width = slider.size['width']
slider_start = slider.location['x']
print(slider_width)
print(large_bg.size['width'])
# simulate dragging the slider
ActionChains(driver).click_and_hold(slider).perform()
ActionChains(driver).move_by_offset(left_margin, 0).perform()
time.sleep(0.5)  # adjust the wait to suit the page
ActionChains(driver).release().perform()
# click the login button
login_button = driver.find_element(By.ID, 'dl')
login_button.click()
# save the cookies of the logged-in session for later requests
cookies = {cookie['name']: cookie['value'] for cookie in driver.get_cookies()}
print("cookies:", cookies)
with open('cookies.json', 'w') as f:
    json.dump(cookies, f)
driver.quit()
|
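The example moves the slider in a single jump. To make the drag look closer to a manual one, the distance can be split into many small steps that start fast and slow down near the gap. The helper `build_track` below is a hypothetical sketch using a simple ease-out curve; whether a given CAPTCHA accepts such a track is not guaranteed.

```python
def build_track(distance: int, steps: int = 20) -> list:
    """Split `distance` pixels into per-step offsets that decelerate.

    The offsets always sum to exactly `distance`.
    """
    # Ease-out curve: position(t) = distance * (1 - (1 - t)^2) for t in [0, 1]
    positions = [round(distance * (1 - (1 - i / steps) ** 2))
                 for i in range(steps + 1)]
    # Turn absolute positions into relative moves, dropping zero-length ones.
    return [b - a for a, b in zip(positions, positions[1:]) if b != a]
```

Each offset can be passed to `ActionChains(driver).move_by_offset(offset, 0).perform()` between `click_and_hold` and `release`, optionally with a short random pause after each step.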