urllib.robotparser - Parser for robots.txt in Python
Web site owners use the /robots.txt file to give instructions about their site to web robots; this convention is called the Robots Exclusion Protocol. The file is a simple text-based access-control mechanism for programs that automatically access web resources, commonly called spiders or crawlers. It lists a user agent identifier followed by the URL paths that agent may not access.
Example robots.txt File
# robots.txt
Sitemap: https://example.com/sitemap.xml

User-agent: *
Disallow: /admin/
Disallow: /downloads/
Disallow: /media/
Disallow: /static/
This file is usually put in the top-level directory of your web server.
Python's urllib.robotparser module provides the RobotFileParser class. It answers questions about whether or not a particular user agent can fetch a URL on the web site that published the robots.txt file.
RobotFileParser Methods
set_url(url)
This method sets the URL referring to the robots.txt file.
read()
This method reads the robots.txt URL and feeds it to the parser.
parse(lines)
This method parses the given list of lines from a robots.txt file.
can_fetch(useragent, url)
This method returns True if the given user agent is allowed to fetch url according to the rules contained in the parsed robots.txt file.
mtime()
This method returns the time the robots.txt file was last fetched.
modified()
This method sets the time the robots.txt file was last fetched to the current time.
crawl_delay(useragent)
This method returns the value of the Crawl-delay parameter from robots.txt for the user agent in question, or None if there is no such parameter or it does not apply to that user agent.
request_rate(useragent)
This method returns the contents of the Request-rate parameter as a named tuple RequestRate(requests, seconds), or None if there is no such parameter or it does not apply to that user agent.
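The freshness-tracking methods, mtime() and modified(), are easy to overlook since the later examples do not use them. Here is a minimal sketch (using hypothetical in-memory rules rather than a live site) of how a long-running crawler might use them to decide when to re-read robots.txt:

```python
import time
from urllib import robotparser

parser = robotparser.RobotFileParser()
print(parser.mtime())        # 0 -- the file has never been fetched

# Feed hypothetical in-memory rules instead of calling read()
parser.parse(["User-agent: *", "Disallow: /tmp/"])
parser.modified()            # record "fetched now"

fetched_at = parser.mtime()  # a Unix timestamp

# Re-read the file once the local copy is over an hour old
if time.time() - fetched_at > 3600:
    parser.read()
```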
Basic Example
Here's how to create and use a RobotFileParser instance:
from urllib import parse
from urllib import robotparser
AGENT_NAME = 'PyMOTW'
URL_BASE = 'https://example.com/'
# Create parser instance
parser = robotparser.RobotFileParser()
parser.set_url(parse.urljoin(URL_BASE, 'robots.txt'))
# Read the robots.txt file
try:
    parser.read()
    print("robots.txt file loaded successfully")
except Exception as e:
    print(f"Error reading robots.txt: {e}")
If reading fails, the except branch prints the error. Note, however, that read() handles HTTP status errors such as 404 internally: a missing robots.txt file is treated as allowing all access, and no exception is raised. The exception branch fires only for lower-level failures such as a DNS or connection error.
Checking URL Access Permissions
You can check whether a URL can be fetched by a specific user agent:
from urllib import robotparser
# Create a simple in-memory robots.txt content
robots_content = """
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /public/
User-agent: GoogleBot
Allow: /
"""
# Parse the content directly
parser = robotparser.RobotFileParser()
parser.set_url('http://example.com/robots.txt')
# Parse lines directly instead of reading from URL
lines = robots_content.strip().split('\n')
parser.parse(lines)
# Check permissions for different user agents
print("Can '*' fetch '/admin/'?", parser.can_fetch('*', '/admin/'))
print("Can '*' fetch '/public/'?", parser.can_fetch('*', '/public/'))
print("Can 'GoogleBot' fetch '/admin/'?", parser.can_fetch('GoogleBot', '/admin/'))
Can '*' fetch '/admin/'? False
Can '*' fetch '/public/'? True
Can 'GoogleBot' fetch '/admin/'? True
Working with Crawl Delays
A robots.txt file can specify crawl delays for different user agents:
from urllib import robotparser
robots_with_delay = """
User-agent: *
Crawl-delay: 1
Disallow: /admin/
User-agent: SlowBot
Crawl-delay: 5
Disallow:
"""
parser = robotparser.RobotFileParser()
parser.set_url('http://example.com/robots.txt')
lines = robots_with_delay.strip().split('\n')
parser.parse(lines)
# Check crawl delays
print("Crawl delay for '*':", parser.crawl_delay('*'))
print("Crawl delay for 'SlowBot':", parser.crawl_delay('SlowBot'))
print("Crawl delay for 'FastBot':", parser.crawl_delay('FastBot'))
Crawl delay for '*': 1
Crawl delay for 'SlowBot': 5
Crawl delay for 'FastBot': 1

Note that 'FastBot' has no entry of its own, so it falls back to the default 'User-agent: *' group and inherits its crawl delay of 1. crawl_delay() returns None only when neither a matching entry nor a default entry defines the parameter.
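The related Request-rate directive (nonstandard, like Crawl-delay) can be queried with request_rate(); here is a small sketch using hypothetical in-memory rules:

```python
from urllib import robotparser

# Hypothetical robots.txt using the nonstandard Request-rate directive
robots_with_rate = """
User-agent: *
Request-rate: 3/20
Disallow: /admin/
"""

parser = robotparser.RobotFileParser()
parser.parse(robots_with_rate.strip().split('\n'))

rate = parser.request_rate('*')
if rate is not None:
    # RequestRate(requests=3, seconds=20): at most 3 requests per 20 seconds
    print(f"At most {rate.requests} requests every {rate.seconds} seconds")
```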
Conclusion
The urllib.robotparser module provides a simple way to parse and query robots.txt files. Use can_fetch() to check permissions and crawl_delay() to respect website crawling policies when building web scrapers or crawlers.
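Putting the pieces together, a polite fetch loop might look like the following sketch. The agent name, URLs, and rules are placeholder assumptions, and the rules are fed in-memory with parse() so the example runs offline; a real crawler would call set_url() and read() instead.

```python
import time
from urllib import robotparser

AGENT = 'MyCrawler'  # hypothetical agent name

parser = robotparser.RobotFileParser()
parser.parse([       # hypothetical rules; a real crawler would use read()
    "User-agent: *",
    "Crawl-delay: 1",
    "Disallow: /private/",
])

# Fall back to no delay when robots.txt specifies none
delay = parser.crawl_delay(AGENT) or 0

for url in ['http://example.com/page.html',
            'http://example.com/private/data.html']:
    if parser.can_fetch(AGENT, url):
        print("fetching", url)   # a real crawler would request the URL here
        time.sleep(delay)        # respect Crawl-delay between requests
    else:
        print("skipping", url)
```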
