Google's robots.txt

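As we know, a robots.txt file can be used to keep search engine crawlers or robots away from parts of your own site (though not with absolute certainty, since compliance is voluntary). I happened to take a look at Google's robots.txt file.

Here is its content: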

User-agent: *
Disallow: /search
Disallow: /groups
Disallow: /images
Disallow: /catalogs
Disallow: /catalog_list
Disallow: /news
Disallow: /pagead/
Disallow: /relpage/
Disallow: /imgres
Disallow: /keyword/
Disallow: /u/
Disallow: /univ/
Disallow: /cobrand
Disallow: /custom
Disallow: /advanced_group_search
Disallow: /advanced_search
Disallow: /googlesite
Disallow: /preferences
Disallow: /setprefs
Disallow: /swr
Disallow: /url
Disallow: /wml
Disallow: /hws
Disallow: /bsd?
Disallow: /linux?
Disallow: /mac?
Disallow: /microsoft?
Disallow: /unclesam?
Disallow: /answers/search?q=
Disallow: /local
Disallow: /froogle?
Disallow: /froogle_
Disallow: /print?
Disallow: /scholar?
Disallow: /palm
Disallow: /complete

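As you can see, Google disallows crawling of most of its own entry points. It seems to worry about a fire in its own backyard too :) The /cobrand entry is unfamiliar to me; what is behind it?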
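To see how a well-behaved crawler might read rules like these, here is a minimal sketch using Python's standard urllib.robotparser module. It only loads a handful of the Disallow lines from the listing above; the crawler name ExampleBot and the paths /newsletter and /about.html are made up for illustration and do not come from Google's file.

```python
from urllib.robotparser import RobotFileParser

# A small subset of the rules quoted above, not the full file.
RULES = """\
User-agent: *
Disallow: /search
Disallow: /groups
Disallow: /images
Disallow: /news
"""


def build_parser(text):
    """Parse robots.txt text that is already in memory (no network access)."""
    rp = RobotFileParser()
    rp.parse(text.splitlines())
    return rp


parser = build_parser(RULES)

# "ExampleBot", "/newsletter" and "/about.html" are hypothetical.
for path in ["/search?q=robots", "/news", "/newsletter", "/about.html"]:
    verdict = "allowed" if parser.can_fetch("ExampleBot", path) else "blocked"
    print(path, verdict)
```

Rules are matched as path prefixes, so in this sketch the made-up /newsletter path is also caught by Disallow: /news, while /about.html matches nothing and stays allowed. The file is only a request, and a crawler that ignores it is not stopped by anything here.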

Back in 2000, Google's interface was even simpler. While we are at it, here is Baidu's robots.txt:

User-agent: Baiduspider
Disallow: /baidu
User-agent: *
Disallow: /shifen/dqzd.html

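What is the /shifen/dqzd.html page for? Opening it shows that it is a list of regional core agents and regional general agents for paid ranking. That hardly counts as important information, yet it is kept tucked away.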
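Baidu's file has two groups, one for Baiduspider and one for everyone else. The sketch below feeds those four lines to Python's standard urllib.robotparser to show which group applies to which crawler. ExampleBot is a made-up crawler name, and the result reflects how this particular parser reads the file, which other crawlers are not guaranteed to follow exactly.

```python
from urllib.robotparser import RobotFileParser

# The four lines quoted above, with no blank line between the groups.
BAIDU_RULES = """\
User-agent: Baiduspider
Disallow: /baidu
User-agent: *
Disallow: /shifen/dqzd.html
"""

baidu = RobotFileParser()
baidu.parse(BAIDU_RULES.splitlines())


def verdict(agent, path):
    """Return 'allowed' or 'blocked' for the given crawler name and path."""
    return "allowed" if baidu.can_fetch(agent, path) else "blocked"


# "ExampleBot" is a hypothetical crawler that only matches the "*" group.
for agent in ("Baiduspider", "ExampleBot"):
    for path in ("/baidu", "/shifen/dqzd.html"):
        print(agent, path, verdict(agent, path))
```

With this parser, a crawler that matches a named group uses only that group. So Baiduspider is kept out of /baidu but is not affected by the /shifen/dqzd.html rule, which applies only to the other crawlers.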

Now for MSN Search's:

# robots.txt for http://search.msn.com
User-agent: *
Disallow: /results
Disallow: /keepalive/
Disallow: /static/
Disallow: /w3c/
Disallow: /cfgs/
Disallow: /schema/
Disallow: /kids/
Disallow: /Kidz/
Disallow: /pass/

These are virtual directories, and most of them cannot be accessed.

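At robotstxt.org you can find almost everything about robots.txt, including a database of robots on the Internet. (Sadly, it has almost no information about Chinese search engines. Does that also say something?)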

3 thoughts on “Google's robots.txt”

rw 2006/03/06 at 11:06 PM

good

dearsatan 2006/10/10 at 1:59 AM

Heh, I learned something.
 http://www.findfun.cn

software download 2006/12/09 at 4:06 PM

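My Google Sitemaps report actually shows this error, which is driving me crazy. What is going on?
HTTP error (1)
HTTP error / domain name not found
DNS may not have been resolved correctly. We were able to communicate with the DNS server but could not find the domain name.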
