Python Web Scraping Tutorial for Beginners

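I. Scraper Basics

What is a web scraper?

A web scraper (also called a crawler) is a program that imitates a browser to fetch web page data automatically. It is commonly used for data collection, information monitoring, and similar tasks.

The basic workflow of a scraper:

1. Send a request to fetch the page content

2. Parse the content and extract the data

3. Store the data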
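The three steps above can be sketched end to end without any network access. This is only an illustrative outline: the sample HTML, the stubbed fetch function, and the names TitleParser, fetch, parse, store, and run_pipeline are assumptions made for this example, and it uses the standard library's html.parser rather than the third-party libraries introduced later.

```python
import json
import tempfile
from html.parser import HTMLParser
from pathlib import Path

# Hypothetical page content standing in for the body of a real HTTP response.
SAMPLE_HTML = "<html><head><title>Demo Page</title></head><body><p>Hello</p></body></html>"


class TitleParser(HTMLParser):
    """Collects the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def fetch(url):
    # Step 1 (stubbed): a real scraper would issue an HTTP request here.
    return SAMPLE_HTML


def parse(html):
    # Step 2: extract the piece of data we care about.
    parser = TitleParser()
    parser.feed(html)
    return {"title": parser.title.strip()}


def store(record, path):
    # Step 3: persist the extracted record as JSON.
    Path(path).write_text(json.dumps(record, ensure_ascii=False), encoding="utf-8")


def run_pipeline(url, path):
    record = parse(fetch(url))
    store(record, path)
    return record


if __name__ == "__main__":
    out_file = Path(tempfile.mkdtemp()) / "page.json"
    print(run_pipeline("https://example.com/", out_file))
```

Keeping the three steps in separate functions means the stubbed fetch can later be swapped for a real HTTP call without touching the parsing or storage code.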

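II. Environment Setup

1. Install Python: Python 3.8+ is recommended. Download the installer from the official website and follow the prompts; on Windows, remember to tick the "Add to PATH" option.

2. Install the required libraries:

- requests: sends HTTP requests (pip install requests)

- BeautifulSoup: parses HTML/XML data (pip install beautifulsoup4)

- lxml: a fast parsing library (pip install lxml; BeautifulSoup can use it as its parser)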
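After installing, it can be handy to confirm that the libraries are importable. The snippet below is a small sketch of one way to do that; the helper name find_missing and the REQUIRED_MODULES list are made up for this example. Note that the import names differ from the pip package names in one case: beautifulsoup4 is imported as bs4.

```python
import importlib.util

# Import names (not pip package names): beautifulsoup4 is imported as "bs4".
REQUIRED_MODULES = ["requests", "bs4", "lxml"]


def find_missing(modules):
    """Return the top-level modules that cannot be found in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are installed.")
```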

III. A First Scraper: Fetching Titles from a Page

The example below fetches movie titles from the Douban Movie home page:

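import requests
from bs4 import BeautifulSoup

# 1. Send the request. Douban tends to reject requests that lack a
#    browser-like User-Agent, so one is set here.
url = "https://movie.douban.com/"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"}
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # stop early on an HTTP error status

# 2. Handle the encoding (avoids garbled Chinese text)
response.encoding = response.apparent_encoding

# 3. Parse the page
soup = BeautifulSoup(response.text, 'lxml')

# 4. Extract data: every <span class="title"> element
#    (this selector depends on the page's current markup)
movie_titles = soup.find_all('span', class_='title')

# 5. Print the results
print("Some titles from the Douban Movie home page:")
for title in movie_titles:
    # Skip entries containing "·" (a rough filter for ads and other noise)
    if "·" not in title.text:
        print(title.text)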

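Code walkthrough:

- requests.get sends a GET request and returns the page content

- BeautifulSoup parses the HTML using the lxml parser

- find_all('span', class_='title') selects elements by tag name and class name

- The filter skips any entry containing "·", a rough heuristic for keeping non-movie entries (such as ads) out of the output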

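IV. Going Further: Dynamic Pages (Douban Short Reviews as an Example)

Dynamic pages usually load their data from an API endpoint, so you need to inspect the page's network requests to find the real data URL: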

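import requests

# Short-review API for the movie "Oppenheimer" on Douban.
# Copy the actual endpoint from the browser's developer tools (Network tab).
api_url = "https://movie.douban.com/j/chart/top_list_comments"
params = {
    "movie_id": "35477223",  # movie ID
    "start": 0,              # index of the first comment
    "limit": 20,             # comments per page
}
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"}

# Send the request with query parameters
response = requests.get(api_url, params=params, headers=headers, timeout=10)
response.raise_for_status()
comments_data = response.json()  # parse the JSON body

# Extract and print the comments (assumes the response is a list of
# objects with 'author' and 'content' fields)
print("Short reviews of Oppenheimer:")
for comment in comments_data:
    print(f"User {comment['author']}: {comment['content'][:50]}...")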
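APIs that take start and limit parameters are usually read page by page. The sketch below shows one possible paging loop under that assumption. The function iter_pages, the fetch_page callable, and the fake data are all hypothetical; in a real scraper, fetch_page would wrap the HTTP request and return the decoded list of items.

```python
def iter_pages(fetch_page, limit=20, max_pages=50):
    """Yield items page by page until an empty page or max_pages is reached.

    fetch_page(start, limit) must return a list of items; it is a hypothetical
    callable that would wrap the real HTTP request.
    """
    for page in range(max_pages):
        items = fetch_page(page * limit, limit)
        if not items:
            break
        yield from items


# Stand-in for a real API: 45 fake comments served in slices.
FAKE_COMMENTS = [{"author": f"user{i}", "content": f"comment {i}"} for i in range(45)]


def fake_fetch(start, limit):
    return FAKE_COMMENTS[start:start + limit]


if __name__ == "__main__":
    comments = list(iter_pages(fake_fetch, limit=20))
    print(f"Collected {len(comments)} comments")
```

The max_pages cap is a deliberate safety net: if the server never returns an empty page, the loop still terminates.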

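V. Scraping Precautions (Avoiding IP Bans)

1. Set request headers: imitate a browser by sending a User-Agent and similar headers

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml"
}
response = requests.get(url, headers=headers)

2. Limit the request rate: add a delay so requests are not sent too frequently

import time
time.sleep(1)  # wait 1 second between requests

3. Follow the site's rules: check the site's robots.txt (for example, Douban tolerates reasonable crawling but blocks high-frequency requests)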
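Points 2 and 3 can be combined into a small "polite crawling" helper. The following is a minimal sketch, not a complete solution: the robots.txt content is a made-up sample, and the Throttle class and build_robot_parser function are names invented for this example. urllib.robotparser is part of the Python standard library.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from
# the site's /robots.txt (RobotFileParser.set_url() followed by read()).
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
""".splitlines()


def build_robot_parser(lines):
    """Build a parser from robots.txt lines that are already in memory."""
    parser = RobotFileParser()
    parser.parse(lines)
    return parser


class Throttle:
    """Enforces a minimum interval between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


if __name__ == "__main__":
    robots = build_robot_parser(SAMPLE_ROBOTS)
    throttle = Throttle(min_interval=1.0)
    for page_url in ["https://example.com/public/a", "https://example.com/private/b"]:
        if not robots.can_fetch("my-crawler", page_url):
            print("Skipping (disallowed):", page_url)
            continue
        throttle.wait()
        print("Would fetch:", page_url)
```

Passing the clock and sleep functions in as parameters makes the throttle easy to test without actually waiting.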

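VI. Hands-On Exercise: Scraping Novel Chapters

The code skeleton below scrapes the chapters of a novel from a placeholder site; the URL and the CSS class names must be adapted to the real site: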

import os
import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

# Novel index page (placeholder URL)
novel_url = "https://example.com/novel"

# 1. Get the chapter list
def get_chapter_list(url):
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'lxml')
    chapters = soup.find_all('a', class_='chapter-link')
    # Resolve relative links against the index page URL
    return [(chapter.text.strip(), urljoin(url, chapter['href'])) for chapter in chapters]

# 2. Get the content of one chapter
def get_chapter_content(chapter_url):
    response = requests.get(chapter_url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'lxml')
    content = soup.find('div', class_='content').text
    return content

# 3. Save the content to a file
def save_to_file(chapter_name, content, novel_name):
    os.makedirs(novel_name, exist_ok=True)
    file_path = os.path.join(novel_name, f"{chapter_name}.txt")
    with open(file_path, 'w', encoding='utf-8') as f:
        f.write(content)
    print(f"Saved: {chapter_name}")

# Main flow
if __name__ == "__main__":
    novel_name = "Novel Title"
    chapters = get_chapter_list(novel_url)
    for i, (chapter_name, chapter_url) in enumerate(chapters):
        print(f"Scraping chapter {i+1}/{len(chapters)}: {chapter_name}")
        content = get_chapter_content(chapter_url)
        save_to_file(chapter_name, content, novel_name)
        time.sleep(2)  # wait 2 seconds to avoid sending requests too frequently
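One practical detail when saving chapters: chapter titles often contain characters such as "/" or "?" that are not valid in file names on some systems. A possible helper for cleaning them is sketched below. The function name safe_filename and its defaults are assumptions for this example, and the character set targets Windows restrictions, which are the strictest common case.

```python
import re

# Characters that are not allowed in Windows file names, plus control characters.
_UNSAFE = re.compile(r'[\\/:*?"<>|\x00-\x1f]')


def safe_filename(name, fallback="untitled", max_length=100):
    """Replace unsafe characters with "_" and fall back if nothing is left."""
    cleaned = _UNSAFE.sub("_", name).strip().rstrip(".")
    return cleaned[:max_length] or fallback


if __name__ == "__main__":
    print(safe_filename("Chapter 1: Start/End?"))
```

The result could be passed as the chapter_name argument when saving, so that a title like the one above does not create an unintended subdirectory or fail to open.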

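VII. Further Learning Resources

- Books: 《Python爬虫开发与项目实战》, 《精通Python网络爬虫》

- Online courses:

  - Liao Xuefeng's Python tutorial

  - Scraping in practice: collecting Bilibili video information

- Recommended tools:

  - Browser developer tools (F12): inspect network requests

  - Postman: debug API requests

With the steps above, you can build a basic web scraper.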
