Google Reminds Websites To Use Robots.txt To Block Action URLs

In a recent LinkedIn post, Gary Illyes, an Analyst at Google, reinforced long-standing advice for website owners: use the robots.txt file to block web crawlers from URLs that trigger actions such as adding items to carts or wishlists.

Illyes highlighted a frequent complaint: excessive crawler traffic overwhelming servers, often because search engine bots are crawling URLs intended for user interactions.

He stated:

“Upon examining the content we crawl from the sites in question, it’s all too common to find action URLs like ‘add to cart’ and ‘add to wishlist.’ These hold no value for crawlers and are likely not intended for indexing.”

To mitigate this unnecessary server strain, Illyes recommended restricting access in the robots.txt file for URLs containing parameters like “?add_to_cart” or “?add_to_wishlist.”

Illustrating with an example, he proposed:

“If your URLs resemble:

It’s advisable to include a disallow rule for them in your robots.txt file.”
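As a rough sketch of what such a rule could look like (the parameter names here follow Illyes' examples; the exact patterns would depend on your site's URL structure), a robots.txt file blocking these action URLs for all crawlers might read:

```txt
User-agent: *
Disallow: /*?*add_to_cart
Disallow: /*?*add_to_wishlist
```

Note that the `*` wildcard inside a path is a Google extension to the original robots.txt syntax; not every crawler supports it.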

While using the HTTP POST method can also deter crawling of such URLs, Illyes cautioned that crawlers can still make POST requests, so robots.txt remains relevant.

Reinforcing Decades-Old Best Practices

Alan Perkins, contributing to the discussion, highlighted that this guidance reflects web standards established in the 1990s for analogous reasons.

Quoting from a 1993 document titled “A Standard for Robot Exclusion”:

“In 1993 and 1994, there were instances where robots visited WWW servers where they weren’t welcome for various reasons… robots traversed sections of WWW servers that weren’t suitable, such as very deep virtual trees, duplicated information, temporary data, or cgi-scripts with side-effects (like voting).”

The robots.txt standard, which proposes rules to limit the access of well-behaved crawlers, emerged as a “consensus” solution among web stakeholders in 1994.

Obedience & Exceptions

Illyes confirmed that Google’s crawlers consistently adhere to robots.txt rules, with rare exceptions meticulously documented for cases involving “user-triggered or contractual fetches.”

This commitment to the robots.txt protocol has long been a cornerstone of Google’s web crawling policies.

Why SEJ Cares

Although the guidance may appear basic, the resurgence of this decades-old best practice emphasizes its continued relevance.

By using the robots.txt standard, websites can rein in overly enthusiastic crawlers and prevent them from consuming bandwidth with wasted requests.

How This Can Help You

Whether you manage a modest blog or a bustling e-commerce site, following Google’s recommendation to use robots.txt to restrict crawler access to action URLs can yield several advantages:

  1. Reduced Server Strain: By blocking crawlers from accessing URLs triggering actions like adding items to carts or wishlists, you can curtail unnecessary server requests and conserve bandwidth.
  2. Enhanced Crawler Efficiency: Providing clearer directives in your robots.txt file about which URLs crawlers should avoid can streamline the crawling process, ensuring that the pages and content you prioritize for indexing and ranking receive proper attention.
  3. Elevated User Experience: By directing server resources towards genuine user interactions rather than wasted crawler hits, you can potentially enhance load times and operational smoothness, contributing to a more satisfying experience for visitors.
  4. Compliance with Standards: Implementing these guidelines aligns your site with the widely accepted robots.txt protocol standards, established as industry best practices for decades.
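Before deploying such directives, it can help to sanity-check that a pattern blocks the action URLs without blocking the product pages themselves. Google matches Disallow patterns with a `*` wildcard and an optional `$` end anchor; the sketch below is a simplified illustration of that matching logic (not Google’s actual implementation, and the URL paths are hypothetical):

```python
import re

def blocked(path_pattern: str, url_path: str) -> bool:
    """Return True if url_path matches a Google-style robots.txt
    Disallow pattern ('*' matches any run of characters; a trailing
    '$' anchors the pattern to the end of the URL)."""
    anchored = path_pattern.endswith("$")
    pattern = path_pattern.rstrip("$")
    # Escape literal parts, join them with '.*' for each wildcard,
    # and anchor the match to the start of the path.
    regex = "^" + ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    return re.search(regex, url_path) is not None

# The parameterized action URL is blocked...
print(blocked("/*?*add_to_cart", "/product/candle?add_to_cart=1"))  # True
# ...while the plain product page stays crawlable.
print(blocked("/*?*add_to_cart", "/product/candle"))                # False
```

Python’s standard-library `urllib.robotparser` is another option for checking rules, but note that it implements only the original prefix-matching syntax, not Google’s wildcard extensions.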

Reassessing robots.txt directives could serve as a straightforward yet impactful measure for websites seeking to exert greater control over crawler behavior.

Illyes’ messaging underscores the enduring relevance of these age-old robots.txt rules in our contemporary web landscape.

Original news from SearchEngineJournal