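Tutorial: Generating a sitemap.xml for Your Website

To generate a sitemap.xml file, a crawler has to visit the site and collect all of its valid links. A complete solution follows: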

Step 1: Install the required Python libraries

pip install requests beautifulsoup4 lxml

Step 2: Create the Python crawler script (`sitemap_generator.py`)

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse
import xml.etree.ElementTree as ET
from datetime import datetime


def get_all_links(base_url):
    # Track visited URLs and the queue of URLs still to crawl
    visited = set()
    queue = [base_url]
    all_links = set()

    while queue:
        url = queue.pop(0)
        if url in visited:
            continue
        # Mark as visited up front so a failing URL is not requested again
        visited.add(url)

        try:
            headers = {
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
            }
            response = requests.get(url, headers=headers, timeout=10)
            if response.status_code != 200:
                continue

            # Only pages that returned 200 go into the sitemap
            all_links.add(url)
            print(f"Crawled: {url}")

            # Parse the HTML to discover new links
            soup = BeautifulSoup(response.text, 'lxml')
            for link in soup.find_all('a', href=True):
                href = link['href'].strip()
                # Resolve relative links and drop any #fragment
                full_url = urljoin(url, href).split('#')[0]

                # Filter out invalid links
                parsed = urlparse(full_url)
                if parsed.scheme not in ('http', 'https'):
                    continue
                if not parsed.netloc.endswith('91kaiye.cn'):  # same-site links only
                    continue

                # Add to the crawl queue
                if full_url not in visited:
                    queue.append(full_url)

        except Exception as e:
            print(f"Error crawling {url}: {e}")

    return all_links


def create_sitemap(links, filename='sitemap.xml'):
    root = ET.Element('urlset', xmlns='http://www.sitemaps.org/schemas/sitemap/0.9')

    for link in sorted(links):
        url_elem = ET.SubElement(root, 'url')
        ET.SubElement(url_elem, 'loc').text = link
        # lastmod is the date of this run, not the page's real modification date
        ET.SubElement(url_elem, 'lastmod').text = datetime.now().strftime('%Y-%m-%d')
        ET.SubElement(url_elem, 'changefreq').text = 'daily'
        ET.SubElement(url_elem, 'priority').text = '0.8'

    tree = ET.ElementTree(root)
    tree.write(filename, encoding='utf-8', xml_declaration=True)
    print(f"\nSitemap generated: {filename} with {len(links)} URLs")


if __name__ == '__main__':
    base_url = 'https://www.91kaiye.cn/'
    print("Starting crawl...")
    links = get_all_links(base_url)
    create_sitemap(links)

Step 3: Run the script

python sitemap_generator.py

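How it works:

Crawler logic:
- Starts at the home page https://www.91kaiye.cn/ and crawls breadth-first
- Skips off-site links and non-HTTP(S) URLs, and strips #fragment anchors
- Writes the date the script was run as each page's lastmod (it does not detect the page's real modification date)
- Sets changefreq to daily and priority to 0.8 for every URL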
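The filtering rules above can be pulled out into a small helper that is easy to test on its own. The following is a minimal sketch, not part of the original script: the name `normalize_link` and the `SITE_DOMAIN` constant are hypothetical, and the host check is deliberately stricter than the script's `endswith('91kaiye.cn')`, which would also accept a look-alike host such as `evil91kaiye.cn`.

```python
from urllib.parse import urldefrag, urljoin, urlparse

SITE_DOMAIN = "91kaiye.cn"  # hypothetical constant; domain taken from the script


def normalize_link(page_url, href, site_domain=SITE_DOMAIN):
    """Return an absolute, fragment-free same-site URL, or None if filtered out."""
    # Resolve relative links against the page they were found on.
    absolute = urljoin(page_url, href.strip())
    # Drop the "#anchor" part so one page is not queued under several names.
    absolute, _fragment = urldefrag(absolute)
    parsed = urlparse(absolute)
    if parsed.scheme not in ("http", "https"):
        return None  # mailto:, javascript:, tel: and similar
    host = parsed.hostname or ""
    # Accept the bare domain and its subdomains, but not look-alike hosts.
    if host != site_domain and not host.endswith("." + site_domain):
        return None
    return absolute


print(normalize_link("https://www.91kaiye.cn/a/", "b.html#top"))
print(normalize_link("https://www.91kaiye.cn/a/", "mailto:someone@example.com"))
```

Inside the crawl loop, a `None` result would simply mean "skip this link".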

Output file:

The generated sitemap.xml has the following structure (indented here for readability):

<?xml version='1.0' encoding='utf-8'?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.91kaiye.cn/page1</loc>
    <lastmod>2023-10-05</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
  ...
</urlset>
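A quick way to sanity-check the result is to parse it back and list the `<loc>` entries. The sketch below uses only the standard library; the helper name `read_sitemap_locs` and the inline sample document are assumptions made for illustration. Note that every element lives in the sitemap namespace, so the lookup has to be namespace-aware.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI map for namespace-aware lookups; the "sm" prefix is arbitrary.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def read_sitemap_locs(xml_text):
    """Return the <loc> values of a sitemap document given as a string."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.91kaiye.cn/page1</loc>
    <lastmod>2023-10-05</lastmod>
  </url>
</urlset>"""

print(read_sitemap_locs(sample))
```

For a file on disk, `ET.parse('sitemap.xml').getroot()` yields the same kind of root element.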

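Notes:

Anti-crawling measures:
- If the site has anti-crawling protection, you may need to:
  - Add a delay between requests, e.g. time.sleep(1)
  - Use proxy IPs
  - Send more realistic request headers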
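As one possible way to apply the request delay, the sketch below wraps the idea of `time.sleep(1)` in a small helper that sleeps only for the time still missing since the previous request. The `Throttle` class is hypothetical and not part of the original script; the `clock` and `sleep` parameters exist only so the behaviour can be checked without actually waiting.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first one

    def wait(self):
        """Sleep just long enough to keep min_interval between calls."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


throttle = Throttle(min_interval=1.0)
throttle.wait()  # first call returns immediately
```

In the crawler, `throttle.wait()` would be called right before each `requests.get(...)`.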
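Dynamic content:
- For pages rendered by JavaScript (e.g. Vue/React), switch to Selenium or Playwright
Optimization tips:
- Run the script on the server on a schedule (e.g. once a week)
- Submit the sitemap to Google Search Console
- Add the following to robots.txt: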
```
Sitemap: https://www.91kaiye.cn/sitemap.xml
```
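To confirm that the `Sitemap:` line is picked up, the standard-library `urllib.robotparser` module can parse a robots.txt and report the sitemaps it declares (`site_maps()` requires Python 3.8 or later). In this sketch the robots.txt content is inlined, and the `Disallow: /admin/` rule is a made-up example rather than something taken from the real site.

```python
from urllib.robotparser import RobotFileParser

# Inline robots.txt for illustration; the Disallow rule is hypothetical.
robots_txt = """\
User-agent: *
Disallow: /admin/
Sitemap: https://www.91kaiye.cn/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Sitemap URLs declared in the file (None if there are none).
print(parser.site_maps())
# Whether a generic crawler may fetch a given URL under these rules.
print(parser.can_fetch("*", "https://www.91kaiye.cn/admin/login"))
```

The same `can_fetch` check could be used in the crawler to skip URLs that robots.txt disallows.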

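Alternative: use an existing tool

If you would rather not run code, these tools can generate the sitemap for you:

XML-Sitemaps.com
Screaming Frog SEO Spider (desktop tool)

Once generated, upload sitemap.xml to the site's root directory and submit it through the Baidu or Google webmaster tools.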
