1. Technical Challenges of Scraping Zhihu Comments
Zhihu's comment data is loaded dynamically via Ajax, which means `requests` + `BeautifulSoup` alone cannot retrieve the full data. Zhihu also deploys several anti-scraping mechanisms, including:
- Request header validation (e.g. `User-Agent`, `Referer`)
- Cookie/Session checks (anonymous users only get partial data)
- Rate limiting (frequent requests can get an IP banned)
Therefore we need to:
- Simulate browser requests, carrying Headers and Cookies (see the sketch below)
- Parse the dynamic API endpoints instead of the static HTML
- Speed up crawling with multithreading or async IO
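As a minimal sketch of the first point, this is what a browser-like request carrying Headers and Cookies looks like (the cookie keys shown are placeholders for illustration, not verified values; the URL reuses the placeholder question link from the next section):

```python
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Referer": "https://www.zhihu.com/",
    # Placeholder: copy the real Cookie string from your logged-in browser session
    "Cookie": "z_c0=...; d_c0=...",
}

resp = requests.get("https://www.zhihu.com/question/xxxxxx", headers=headers)
print(resp.status_code)
```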
2. Analyzing the Zhihu Comment API
(1) Locating the comment API
Open any Zhihu question (e.g. `https://www.zhihu.com/question/xxxxxx`), press `F12` to open the developer tools, switch to the `Network` tab, and filter for `XHR` requests.
(2) Parsing the comment data structure
The comments are nested under the `data` field, with a structure like this:

```json
{
  "data": [
    {
      "content": "comment text",
      "author": { "name": "username" },
      "created_time": 1620000000
    }
  ],
  "paging": { "is_end": false, "next": "URL of the next page" }
}
```
To collect all comments we follow `paging.next` page by page until `is_end` becomes `true`, as sketched below.
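A minimal pagination sketch under the structure above (field names taken from the sample JSON; error handling omitted):

```python
import requests

def fetch_all_comments(start_url, headers):
    """Follow paging.next until paging.is_end is true, collecting every comment."""
    comments = []
    url = start_url
    while url:
        data = requests.get(url, headers=headers).json()
        for item in data["data"]:
            comments.append(item["content"])
        paging = data.get("paging", {})
        # Stop when the API says we are on the last page
        url = None if paging.get("is_end") else paging.get("next")
    return comments
```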
3. Three Ways to Scrape Zhihu Comments in Python
(1) Single-threaded scraper (baseline)
Use the `requests` library to call the API directly, page by page:

```python
import requests
import time

def fetch_comments(question_id, max_pages=5):
    base_url = f"https://www.zhihu.com/api/v4/questions/{question_id}/answers"
    headers = {
        "User-Agent": "Mozilla/5.0",
        "Cookie": "your Cookie here"  # copy it from the browser after logging in
    }
    comments = []
    for page in range(max_pages):
        url = f"{base_url}?offset={page * 10}&limit=10"
        resp = requests.get(url, headers=headers).json()
        for answer in resp["data"]:
            comments.append(answer["content"])
        time.sleep(1)  # avoid requesting too fast
    return comments

start_time = time.time()
comments = fetch_comments("12345678")  # replace with a real question ID
print(f"Single-threaded scrape finished in {time.time() - start_time:.2f}s")
```
Drawback: pages are requested one after another, so it is slow (at roughly 1 second per page, 10 pages take about 10 seconds).
(2) Multi-threaded scraper (ThreadPoolExecutor)
Use `concurrent.futures` to issue requests concurrently from a thread pool:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_page(page, question_id):
    url = f"https://www.zhihu.com/api/v4/questions/{question_id}/answers?offset={page * 10}&limit=10"
    headers = {"User-Agent": "Mozilla/5.0"}
    resp = requests.get(url, headers=headers).json()
    return [answer["content"] for answer in resp["data"]]

def fetch_comments_multi(question_id, max_pages=5, threads=4):
    with ThreadPoolExecutor(max_workers=threads) as executor:
        futures = [executor.submit(fetch_page, page, question_id)
                   for page in range(max_pages)]
        comments = []
        for future in futures:
            comments.extend(future.result())
    return comments

start_time = time.time()
comments = fetch_comments_multi("12345678", threads=4)
print(f"Multi-threaded scrape finished in {time.time() - start_time:.2f}s")
```
What this buys us:
- The thread pool caps concurrency (reducing the risk of a ban)
- Roughly 3-4x faster than single-threaded (4 threads fetch 10 pages in about 2-3 seconds); a variant using `as_completed` is sketched below
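As a variation, `concurrent.futures.as_completed` processes each page as soon as its request returns rather than in submission order, which helps when some pages are slow (a sketch reusing `fetch_page` from above; note that comment order is no longer guaranteed):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_comments_as_completed(question_id, max_pages=5, threads=4):
    comments = []
    with ThreadPoolExecutor(max_workers=threads) as executor:
        futures = [executor.submit(fetch_page, page, question_id)
                   for page in range(max_pages)]
        for future in as_completed(futures):
            # Extend with each page's comments as soon as it finishes
            comments.extend(future.result())
    return comments
```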
(3) Asynchronous scraper (asyncio + aiohttp)
Use `aiohttp` to issue asynchronous HTTP requests and push throughput further. Note that `aiohttp` takes the proxy per request (`session.get(..., proxy=..., proxy_auth=...)`); `TCPConnector` does not accept proxy arguments:

```python
import aiohttp
import asyncio
import time

# Proxy configuration
proxyHost = "www.16yun.cn"
proxyPort = "5445"
proxyUser = "16QMSOML"
proxyPass = "280651"

PROXY_URL = f"http://{proxyHost}:{proxyPort}"
PROXY_AUTH = aiohttp.BasicAuth(proxyUser, proxyPass)

async def fetch_page_async(session, page, question_id):
    url = f"https://www.zhihu.com/api/v4/questions/{question_id}/answers?offset={page * 10}&limit=10"
    headers = {"User-Agent": "Mozilla/5.0"}
    # The proxy is passed per request in aiohttp
    async with session.get(url, headers=headers,
                           proxy=PROXY_URL, proxy_auth=PROXY_AUTH) as resp:
        data = await resp.json()
        return [answer["content"] for answer in data["data"]]

async def fetch_comments_async(question_id, max_pages=5):
    connector = aiohttp.TCPConnector(
        limit=20,               # cap on concurrent connections
        force_close=True,
        enable_cleanup_closed=True,
    )
    async with aiohttp.ClientSession(connector=connector) as session:
        tasks = [fetch_page_async(session, page, question_id)
                 for page in range(max_pages)]
        pages = await asyncio.gather(*tasks)
        return [comment for page in pages for comment in page]

if __name__ == "__main__":
    start_time = time.time()
    comments = asyncio.run(fetch_comments_async("12345678"))  # replace with a real question ID
    print(f"Async scrape finished in {time.time() - start_time:.2f}s")
    print(f"Fetched {len(comments)} comments")
```
Advantages:
- A single event loop avoids per-thread overhead (the GIL is not a bottleneck for IO-bound work), so it often beats a thread pool
- Well suited to high-concurrency, IO-bound tasks such as scraping (see the semaphore sketch below for keeping that concurrency in check)
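Since high concurrency is exactly what trips the rate limits from section 1, a common pattern is to cap in-flight requests with `asyncio.Semaphore`; a sketch wrapping `fetch_page_async` from above (the limit of 10 is an assumption, tune it to what Zhihu tolerates):

```python
import asyncio

sem = asyncio.Semaphore(10)  # assumed cap: at most 10 requests in flight

async def fetch_page_limited(session, page, question_id):
    # Wait for a free slot before issuing the request
    async with sem:
        return await fetch_page_async(session, page, question_id)
```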
4. Performance Comparison and Optimization Tips
| Method | Time for 10 pages (s) | Best suited for |
| --- | --- | --- |
| Single-threaded | ~10 | Small jobs, simple scraping |
| Multi-threaded (4 threads) | ~2.5 | Medium scale, controlled concurrency |
| Async (asyncio) | ~1.8 | Large-scale, high-concurrency scraping |
Optimization tips
- Cap concurrency: stay below the anti-scraping threshold (10-20 concurrent requests is a reasonable range).
- Random delays: `time.sleep(random.uniform(0.5, 2))` to mimic human pacing (see the combined sketch below).
- Proxy IP pool: rotate IPs to avoid bans (e.g. `requests` + `ProxyPool`).
- Storage: write results to a database asynchronously (e.g. `MongoDB` or `MySQL`).
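A sketch combining the delay and proxy tips with `requests` (the proxy address is a placeholder; a real `ProxyPool` service would supply it dynamically):

```python
import random
import time
import requests

def polite_get(url, headers, proxy=None):
    """GET with a human-like random delay and an optional proxy."""
    time.sleep(random.uniform(0.5, 2))  # mimic human pacing
    proxies = {"http": proxy, "https": proxy} if proxy else None
    return requests.get(url, headers=headers, proxies=proxies, timeout=10)

# Usage with a placeholder proxy address:
# resp = polite_get(url, headers, proxy="http://user:pass@host:port")
```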