The Complete Guide to Configuring robots.txt for AI Crawlers

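robots.txt is no longer just a tool for managing Googlebot. In the age of AI search, you need an access policy for a dozen or more AI crawlers. This guide walks you through it.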

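robots.txt Basics

robots.txt is a text file placed in the root directory of a website (https://yoursite.com/robots.txt) that tells web crawlers which pages they may fetch and which they may not. It follows the Robots Exclusion Protocol (REP), standardized as RFC 9309.

Basic syntax: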

User-agent: *
Allow: /
Disallow: /private/

Sitemap: https://yoursite.com/sitemap.xml

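Key rules:

User-agent names the crawler a group applies to; * matches all crawlers
Allow permits crawling of the given path
Disallow forbids crawling of the given path
More specific rules take precedence over general ones: when several rules match a path, the one with the longest matching path wins
robots.txt is advisory, not enforced, although mainstream crawlers generally honor it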
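
The precedence rule above can be illustrated with a small sketch. It is not how any particular crawler is implemented: it assumes plain path prefixes (no * or $ wildcards), and the helper name is_allowed is hypothetical. Among the rules that match a path, the longest one wins, and a tie goes to Allow.

```python
def is_allowed(rules, path):
    """Return True if `path` may be crawled under `rules`.

    `rules` is a list of (directive, prefix) tuples such as ("allow", "/").
    Simplified longest-match evaluation: wildcards are NOT handled.
    """
    best_len = -1
    best_allow = True  # a path matched by no rule is allowed by default
    for directive, prefix in rules:
        if prefix and path.startswith(prefix):
            is_allow = directive.lower() == "allow"
            # A longer prefix is more specific; on equal length, Allow wins.
            if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
                best_len = len(prefix)
                best_allow = is_allow
    return best_allow


# The rules from the basic syntax example.
rules = [("allow", "/"), ("disallow", "/private/")]
print(is_allowed(rules, "/blog/post"))      # True
print(is_allowed(rules, "/private/notes"))  # False
```

Here /private/notes matches both rules, but Disallow: /private/ is the longer match, so it wins over Allow: /.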

AI Crawler User-Agent List

These are the AI crawler User-Agent tokens to watch in 2026:

OpenAI

GPTBot: collects content for model training and improvement
ChatGPT-User: used for ChatGPT's real-time search feature
OAI-SearchBot: used for OpenAI's search products

Anthropic

ClaudeBot: gathers knowledge for Claude
anthropic-ai: Anthropic's general-purpose crawler

Google

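Google-Extended: a control token, managed separately from Googlebot, that governs whether your content may be used for Gemini. It is not a separate crawler, and AI Overviews are part of Google Search, so they follow your Googlebot rules instead.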

Others

PerplexityBot: Perplexity AI search
Bytespider: ByteDance AI products
cohere-ai: Cohere AI
Applebot-Extended: Apple Intelligence
meta-externalagent: Meta AI
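
To see which of these crawlers actually visit your site, you can match the tokens above against the User-Agent header recorded in your access logs. The following is a minimal sketch: the helper name identify_ai_crawler and the sample header strings are illustrative assumptions, not values taken from any vendor's documentation. Google-Extended and Applebot-Extended are left out on the assumption that they act as robots.txt control tokens and do not appear as request User-Agents.

```python
# Tokens from the list above, mapped to the company that operates them.
AI_CRAWLER_TOKENS = {
    "GPTBot": "OpenAI",
    "ChatGPT-User": "OpenAI",
    "OAI-SearchBot": "OpenAI",
    "ClaudeBot": "Anthropic",
    "anthropic-ai": "Anthropic",
    "PerplexityBot": "Perplexity",
    "Bytespider": "ByteDance",
    "cohere-ai": "Cohere",
    "meta-externalagent": "Meta",
}


def identify_ai_crawler(user_agent_header):
    """Return (token, vendor) if the header contains a known token, else None.

    Uses a simple substring match that ignores case.
    """
    header = user_agent_header.lower()
    for token, vendor in AI_CRAWLER_TOKENS.items():
        if token.lower() in header:
            return token, vendor
    return None


# Illustrative header strings (assumed for this example).
bot_header = "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"
browser_header = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0 Safari/537.36"

print(identify_ai_crawler(bot_header))
print(identify_ai_crawler(browser_header))
```

A User-Agent header can be spoofed, so treat a match as a hint about who is crawling and not as proof.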

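Allow or Block AI Crawlers?

This is a strategic decision that depends on your business goals:

Reasons to allow AI crawlers

AI search traffic: crawler access is a prerequisite for being cited by AI search
Brand exposure: being cited in AI answers raises brand awareness
Traffic growth: AI search is becoming a significant traffic source

Reasons to block AI crawlers

Content protection: you do not want your content used for AI training
Paid content: content behind a paywall should not be obtainable for free
Server load: AI crawlers can consume a lot of bandwidth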

Configuration Examples

Allow everything (recommended for most sites)

# AI crawlers: allow all
User-agent: ChatGPT-User
Allow: /

User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: Applebot-Extended
Allow: /

Allow search, block training

# Allow ChatGPT search
User-agent: ChatGPT-User
Allow: /

# Block GPTBot (training)
User-agent: GPTBot
Disallow: /

# Allow Perplexity search
User-agent: PerplexityBot
Allow: /

# Block Google AI training
User-agent: Google-Extended
Disallow: /
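
If you maintain policies for many crawlers, generating the file from a single mapping keeps the allow and block decisions in one place. The sketch below is one possible approach, not a standard tool: the function name render_robots_txt and the policy dictionary format are assumptions made for this example.

```python
def render_robots_txt(policy, sitemap_url=None):
    """Render robots.txt text from a {token: allowed} mapping.

    True renders a group with "Allow: /", False renders "Disallow: /".
    """
    blocks = []
    for token, allowed in policy.items():
        rule = "Allow: /" if allowed else "Disallow: /"
        blocks.append(f"User-agent: {token}\n{rule}")
    if sitemap_url:
        blocks.append(f"Sitemap: {sitemap_url}")
    # Groups are separated by a blank line; the file ends with a newline.
    return "\n\n".join(blocks) + "\n"


# The "allow search, block training" policy from the example above.
policy = {
    "ChatGPT-User": True,
    "GPTBot": False,
    "PerplexityBot": True,
    "Google-Extended": False,
}

print(render_robots_txt(policy, "https://yoursite.com/sitemap.xml"))
```

Adding a newly announced crawler then means adding one entry to the mapping and regenerating the file.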

Partial access

# Let AI crawlers read the blog and docs; block paid content and the API
User-agent: GPTBot
Allow: /blog/
Allow: /docs/
Disallow: /premium/
Disallow: /api/

User-agent: ChatGPT-User
Allow: /blog/
Allow: /docs/
Disallow: /premium/
Disallow: /api/
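
The two groups above are identical, and the REP allows several User-agent lines to share one group. The sketch below renders such a shared group and checks it with Python's standard urllib.robotparser module. The helper name render_group is hypothetical, and the path /pricing is an assumed example of a URL that no rule mentions.

```python
from urllib.robotparser import RobotFileParser


def render_group(user_agents, allow=(), disallow=()):
    """Render one robots.txt group shared by several user-agent tokens."""
    lines = [f"User-agent: {ua}" for ua in user_agents]
    lines += [f"Allow: {path}" for path in allow]
    lines += [f"Disallow: {path}" for path in disallow]
    return "\n".join(lines) + "\n"


robots_txt = render_group(
    ["GPTBot", "ChatGPT-User"],
    allow=["/blog/", "/docs/"],
    disallow=["/premium/", "/api/"],
)

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Check one allowed path, one blocked path, and one path no rule mentions.
for path in ["/blog/post", "/premium/report", "/pricing"]:
    print(path, parser.can_fetch("GPTBot", path))
```

Note that /pricing comes back as allowed: a path that matches no rule is crawlable by default, so this configuration blocks only /premium/ and /api/ and does not restrict crawlers to the blog and docs.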

Testing Your robots.txt

After configuring the file, always test that the rules take effect:

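Google Search Console: use the robots.txt report to confirm that Google can fetch your file and to see any parsing errors or warnings (the older standalone robots.txt tester has been retired)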
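Online validators: run the file through a robots.txt validator to catch syntax errors
Manual check: open https://yoursite.com/robots.txt directly and confirm the content is what you expect
SEO智检: automatically detects the AI crawler configuration in your robots.txt and suggests improvements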
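
You can also script a quick check with Python's standard urllib.robotparser module. The sketch below parses robots.txt text and reports, for each crawler token and path, whether access is allowed. The function name check_access is hypothetical, and the sample text is the "allow search, block training" configuration from earlier. One caveat: this parser applies the first matching rule in file order instead of the longest match, so its answers can differ from a crawler's when Allow and Disallow rules overlap within one group.

```python
from urllib.robotparser import RobotFileParser


def check_access(robots_txt, agents, paths):
    """Return {(agent, path): allowed} for every agent/path combination."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        (agent, path): parser.can_fetch(agent, path)
        for agent in agents
        for path in paths
    }


SAMPLE = """\
User-agent: ChatGPT-User
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Disallow: /
"""

agents = ["ChatGPT-User", "GPTBot", "PerplexityBot", "Google-Extended"]
results = check_access(SAMPLE, agents, ["/", "/blog/post"])

for (agent, path), allowed in results.items():
    print(f"{agent:16} {path:12} {'allowed' if allowed else 'blocked'}")
```

To check a live site instead of a string, RobotFileParser also provides set_url() and read(), which download the file before you call can_fetch().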

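Common Pitfalls

Wildcard collateral damage: User-agent: * with Disallow: / blocks every crawler, AI crawlers included, unless a crawler has its own more specific group
CDN caching: after you change robots.txt, the CDN may keep serving the old version. Remember to purge the cache
WAF blocking: a WAF such as Cloudflare may block AI crawlers before they ever reach robots.txt. Check your firewall rules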
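Case sensitivity: under RFC 9309, crawlers match User-agent tokens case-insensitively, but the paths in Allow and Disallow rules are case-sensitive. Make sure tokens are spelled correctly and paths use the exact case of your URLs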
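Missing new crawlers: new AI crawlers keep appearing, so review and update your configuration regularly
Crawl-delay: some AI crawlers honor the Crawl-delay directive, which can be used to throttle crawl frequency. It is not part of the REP standard and Googlebot ignores it, so do not rely on it alone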
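
The wildcard and Crawl-delay items can be checked locally with Python's standard urllib.robotparser module. This is a minimal sketch under stated assumptions: the helper name load is hypothetical, and the Crawl-delay value of 5 is arbitrary. The code only shows the parser reading the value; whether a given crawler honors it is up to that crawler.

```python
from urllib.robotparser import RobotFileParser


def load(robots_txt):
    """Parse robots.txt text and return the parser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Pitfall: a blanket rule blocks AI crawlers along with everything else.
blanket = load("User-agent: *\nDisallow: /\n")
print(blanket.can_fetch("PerplexityBot", "/blog/"))  # False

# A crawler with its own group follows that group instead of the * group.
override = load(
    "User-agent: *\n"
    "Disallow: /\n"
    "\n"
    "User-agent: PerplexityBot\n"
    "Allow: /\n"
    "Crawl-delay: 5\n"
)
print(override.can_fetch("PerplexityBot", "/blog/"))  # True
print(override.can_fetch("GPTBot", "/blog/"))         # False
print(override.crawl_delay("PerplexityBot"))          # 5
```

In the second file, GPTBot has no group of its own, so it still falls under the blanket Disallow.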

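Check Your robots.txt AI Crawler Configuration in One Click

Use SEO智检's free AI crawler configurator to analyze your robots.txt automatically and generate optimization suggestions.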
