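`urllib.robotparser` --- Parser for robots.txt

Source code: Lib/urllib/robotparser.py

This module provides a single class, RobotFileParser, which answers questions about whether or not a particular user agent can fetch a given URL on a website that publishes a robots.txt file. For more details on the structure of robots.txt files, see RFC 9309.

class urllib.robotparser.RobotFileParser(url='')

This class provides methods to read, parse and answer questions about the robots.txt file at url.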

set_url(url): Sets the URL referring to a robots.txt file.

read(): Reads the robots.txt URL and feeds it to the parser.

parse(lines): Parses the lines argument.
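Because parse() accepts the lines directly, rules can be loaded without any network access. The following is a minimal sketch, assuming the robots.txt content is already available as a list of strings; the rules and the example.com URLs are made up for illustration.

```python
import urllib.robotparser

# Hypothetical robots.txt content, one directive per list item.
ROBOTS_LINES = [
    "User-agent: *",
    "Disallow: /private/",
]

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_LINES)  # no set_url() or read() needed in this case

allowed_home = rp.can_fetch("*", "https://example.com/")
allowed_private = rp.can_fetch("*", "https://example.com/private/data.html")
print(allowed_home, allowed_private)  # prints: True False
```

This approach may be convenient in tests, or when the file has been downloaded by some other HTTP client.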

can_fetch(useragent, url): Returns True if the useragent is allowed to fetch the url according to the rules contained in the parsed robots.txt file.
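As an illustration of how can_fetch() distinguishes user agents, the sketch below parses a made-up rule set with one group for a hypothetical crawler named ExampleBot and a catch-all group for everyone else. The agent names and URLs are assumptions chosen for the example.

```python
import urllib.robotparser

# Hypothetical rules: ExampleBot is blocked everywhere,
# all other agents are only blocked from /admin/.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /admin/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

blocked_bot = rp.can_fetch("ExampleBot", "https://example.com/index.html")
other_index = rp.can_fetch("OtherBot", "https://example.com/index.html")
other_admin = rp.can_fetch("OtherBot", "https://example.com/admin/users")
print(blocked_bot, other_index, other_admin)
```

An agent with no group of its own falls back to the `*` group, which is why OtherBot can fetch the index page but not the admin page.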

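mtime(): Returns the time the robots.txt file was last fetched. This is useful for long-running web spiders that need to check for new robots.txt files periodically.

modified(): Sets the time the robots.txt file was last fetched to the current time.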
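One way a long-running crawler might use these two methods is to compare mtime() against the current time and reload the file once it is older than some threshold. The sketch below assumes a hypothetical helper `is_stale` and a one-hour refresh interval; neither is part of the module.

```python
import time
import urllib.robotparser

MAX_AGE_SECONDS = 3600  # hypothetical refresh interval


def is_stale(parser, max_age=MAX_AGE_SECONDS, now=None):
    """Return True if the parser's data is older than max_age seconds."""
    now = time.time() if now is None else now
    return now - parser.mtime() > max_age


rp = urllib.robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /tmp/"])
rp.modified()  # explicitly record "now" as the last fetch time

stale_now = is_stale(rp)
# Simulate a check two hours later without actually waiting.
stale_later = is_stale(rp, now=time.time() + 2 * MAX_AGE_SECONDS)
print(stale_now, stale_later)
```

When `is_stale` reports True, a crawler would typically call read() again before making further can_fetch() checks.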

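crawl_delay(useragent): Returns the value of the Crawl-delay parameter from robots.txt for the useragent in question. If there is no such parameter, or it doesn't apply to the useragent specified, or the robots.txt entry for this parameter has invalid syntax, returns None.

Added in version 3.6.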
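Since crawl_delay() may return None, callers usually need a fallback. The following sketch assumes a hypothetical default of one second and made-up rules; it only shows the None handling, not a complete crawler.

```python
import urllib.robotparser

DEFAULT_DELAY = 1  # hypothetical fallback, in seconds


def effective_delay(parser, useragent):
    """Return the Crawl-delay for useragent, or DEFAULT_DELAY if unset."""
    delay = parser.crawl_delay(useragent)
    return DEFAULT_DELAY if delay is None else delay


with_delay = urllib.robotparser.RobotFileParser()
with_delay.parse(["User-agent: *", "Crawl-delay: 2", "Disallow: /tmp/"])

without_delay = urllib.robotparser.RobotFileParser()
without_delay.parse(["User-agent: *", "Disallow: /tmp/"])

delay_a = effective_delay(with_delay, "ExampleBot")
delay_b = effective_delay(without_delay, "ExampleBot")
print(delay_a, delay_b)
```

A crawler could then pass the result to time.sleep() between requests to the same host.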

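request_rate(useragent): Returns the contents of the Request-rate parameter from robots.txt as a named tuple RequestRate(requests, seconds). If there is no such parameter, or it doesn't apply to the useragent specified, or the robots.txt entry for this parameter has invalid syntax, returns None.

Added in version 3.6.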
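The named tuple can be turned into a minimum interval between requests by dividing seconds by requests. This is a small sketch under the assumption that the site publishes `Request-rate: 3/20` (3 requests per 20 seconds); the rules are made up for illustration.

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Request-rate: 3/20", "Disallow: /tmp/"])

rate = rp.request_rate("ExampleBot")
if rate is not None:
    # Spread the allowed requests evenly over the time window.
    min_interval = rate.seconds / rate.requests
else:
    min_interval = 0.0  # no limit published; assumption for this sketch

print(rate.requests, rate.seconds, round(min_interval, 2))
```

Spacing requests evenly is only one possible interpretation; a crawler could also send the allowed requests in a burst and then wait for the rest of the window.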

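site_maps(): Returns the contents of the Sitemap parameter from robots.txt in the form of a list(). If there is no such parameter, or the robots.txt entry for this parameter has invalid syntax, returns None.

Added in version 3.8.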
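Because site_maps() returns None rather than an empty list when no Sitemap entries exist, iterating over the result directly can fail. A minimal sketch of one way to handle this, using made-up sitemap URLs:

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /tmp/",
    "Sitemap: https://example.com/sitemap.xml",
    "Sitemap: https://example.com/news-sitemap.xml",
])

empty = urllib.robotparser.RobotFileParser()
empty.parse(["User-agent: *", "Disallow: /tmp/"])

# "or []" turns a None result into an empty list that is safe to loop over.
sitemaps = rp.site_maps() or []
no_sitemaps = empty.site_maps() or []

for sitemap_url in sitemaps:
    print(sitemap_url)
```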

The following example demonstrates basic use of the RobotFileParser class:

>>> import urllib.robotparser
>>> rp = urllib.robotparser.RobotFileParser()
>>> rp.set_url("http://www.pythontest.net/robots.txt")
>>> rp.read()
>>> rrate = rp.request_rate("*")
>>> rrate.requests
1
>>> rrate.seconds
1
>>> rp.crawl_delay("*")
6
>>> rp.can_fetch("*", "http://www.pythontest.net/")
True
>>> rp.can_fetch("*", "http://www.pythontest.net/no-robots-here/")
False

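© Copyright 2001 Python Software Foundation.
This page is licensed under the Python Software Foundation License Version 2.
Examples, recipes, and other code in the documentation are additionally licensed under the Zero Clause BSD License.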

Last updated on Jun 12, 2026 (18:24 UTC).