
Google Answers A Question About A Crawl Budget Issue

A Reddit user raised a concern about their “crawl budget” problem, wondering if numerous 301 redirects leading to 410 error responses were draining Googlebot’s crawl capacity. In response, John Mueller from Google provided insight into the possible reasons behind the subpar crawling activity and clarified aspects of crawl budgets in general.

Crawl Budget

The concept of a crawl budget, widely embraced within the SEO community, was introduced to account for instances where certain websites aren’t crawled as extensively as desired. It suggests that each site is allocated a finite amount of crawling, which caps how many of its pages Googlebot will fetch.

Understanding where the crawl budget concept came from helps clarify what it actually means. Google has consistently denied that a single “crawl budget” exists, although Googlebot’s crawling behavior can make it look as if one does.

Matt Cutts, a prominent Google engineer at the time, hinted at this complexity surrounding the crawl budget in a 2010 interview. He clarified that the notion of an indexation cap, often assumed by SEO practitioners, isn’t entirely accurate:

“Firstly, there isn’t really a concept of an indexation cap. Many believed that a domain would only have a certain number of pages indexed, but that’s not the case.

Furthermore, there isn’t a strict limit imposed on our crawling activities.”

In 2017, Google published a detailed explanation of crawl budget, consolidating various crawling-related facts under the umbrella term the SEO community had been using. That documentation offers considerably more clarity than the broad label “crawl budget” on its own. (The Google crawl budget document was summarized by Search Engine Journal.)

The key points regarding crawl budget can be summarized as follows:

  1. Crawl Rate: The number of URLs Google can crawl depends on the server’s ability to serve the requested URLs. A shared server hosting multiple websites may, for instance, expose hundreds of thousands or even millions of URLs, so Google paces its crawling according to the server’s capacity to handle page requests.
  2. Duplicate and Low-Value Pages: Pages that duplicate others (such as faceted navigation) or provide low value can consume server resources, limiting the number of pages that the server can provide to Googlebot for crawling.
  3. Page Weight: Lightweight pages are easier for Google to crawl in greater numbers.
  4. Soft 404 Pages: Soft 404 pages (pages that return a 200 status even though their content is effectively “not found”) may divert Google’s attention towards low-value pages instead of pages of significance; a sketch illustrating the distinction follows this list.
  5. Link Patterns: Both inbound and internal linking structures can influence which pages are prioritized for crawling.
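To make the “soft 404” idea concrete, here is a minimal sketch in Python (using the third-party requests library) that flags URLs answering with a 200 status even though their content reads like an error page. The URLs and phrases are placeholder assumptions for illustration only, not anything taken from Google’s documentation.

```python
# Minimal sketch: flag likely "soft 404" pages, i.e. URLs that answer
# with HTTP 200 but whose body reads like an error page.
# The URLs and phrases below are placeholders for illustration only.
import requests

NOT_FOUND_PHRASES = ("page not found", "no longer available", "does not exist")

def looks_like_soft_404(url: str) -> bool:
    response = requests.get(url, timeout=10)
    if response.status_code != 200:
        return False  # a real 404 or 410 is not a *soft* 404
    body = response.text.lower()
    return any(phrase in body for phrase in NOT_FOUND_PHRASES)

if __name__ == "__main__":
    for url in ("https://example.com/old-page", "https://example.com/current-page"):
        print(url, "-> possible soft 404" if looks_like_soft_404(url) else "-> looks fine")
```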

Reddit Question About Crawl Rate

The Reddit user’s question is whether what they perceive as low-value URLs are eating into Google’s crawl budget. Specifically, they describe a scenario in which a request for the non-secure (HTTP) URL of a page that no longer exists is redirected to the secure (HTTPS) version of that missing page, which then returns a 410 (Gone) response indicating the page has been permanently removed.

Their inquiry is as follows:

“I’m attempting to stop Googlebot from crawling certain very old non-HTTPS URLs that have been crawled for six years. I’ve implemented a 410 response on the HTTPS side for these outdated URLs.

When Googlebot encounters one of these URLs, it first encounters a 301 redirect (from HTTP to HTTPS), followed by a 410 error.

Two questions: Is Googlebot content with this combination of 301 and 410 responses?

I’m facing ‘crawl budget’ issues, and I’m unsure if these two responses are contributing to Googlebot’s exhaustion.

Is the 410 response effective? In other words, should I skip the initial 301 redirect and return the 410 error directly?”
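Although the Reddit post contains no code, the redirect chain it describes is straightforward to inspect. The sketch below (Python with the third-party requests library; the URL is a placeholder, not one from the thread) follows an old HTTP URL and prints each hop, which in the scenario described should show a 301 to the HTTPS version followed by a final 410.

```python
# Sketch: inspect the redirect chain for a retired URL.
# The URL below is a placeholder, not one from the Reddit thread.
import requests

response = requests.get("http://example.com/old-page", allow_redirects=True, timeout=10)

for hop in response.history:  # each intermediate redirect response
    print(hop.status_code, hop.url, "->", hop.headers.get("Location"))

print(response.status_code, response.url)  # expected final status here: 410
```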

John Mueller from Google responded to the Reddit user’s query:

Regarding the combination of 301 redirects and 410 error responses, Mueller stated that it is acceptable.

In terms of crawl budget, Mueller clarified that it primarily becomes a concern for exceptionally large websites. He directed the user to Google’s documentation on managing crawl budget for large sites. If the user is experiencing crawl budget issues despite not having a massive site, Mueller suggested that Google might simply not perceive much value in crawling more of the site’s content, emphasizing that it’s not necessarily a technical problem.
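Since Mueller confirmed that the 301-plus-410 combination is acceptable, removing the extra hop is optional rather than a fix. For readers curious about the alternative the user raised (answering 410 directly, with no intermediate redirect), here is a minimal standard-library sketch; the paths and hostname are placeholders, and in practice this logic would usually live in the web server or CDN configuration rather than in application code.

```python
# Sketch: answer retired URLs with 410 Gone directly, skipping the
# HTTP-to-HTTPS 301 hop. Paths and hostname are placeholders.
from http.server import BaseHTTPRequestHandler, HTTPServer

RETIRED_PATHS = {"/old-page", "/discontinued-product"}

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path in RETIRED_PATHS:
            self.send_response(410)  # Gone: the page was permanently removed
            self.end_headers()
        else:
            self.send_response(301)  # everything else still upgrades to HTTPS
            self.send_header("Location", "https://example.com" + self.path)
            self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8080), Handler).serve_forever()
```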

Reasons For Not Getting Crawled Enough

Mueller suggested that Google “probably” doesn’t see the benefit in crawling additional webpages. This implies that those webpages may require scrutiny to determine why Google deems them unworthy of crawling.

Some prevalent SEO strategies tend to result in the creation of low-value webpages lacking original content. For instance, a common approach involves analyzing top-ranked webpages to discern the factors contributing to their ranking, then replicating those elements to enhance one’s own pages.

While this approach seems logical, it fails to generate genuine value. If we simplify the choice into binary terms, where “zero” represents existing search results and “one” signifies originality, mimicking existing content merely perpetuates zeros—producing websites that offer nothing beyond what’s already in the search engine results pages (SERPs).

Undoubtedly, technical issues like server health can impact crawl rates, among other factors. However, regarding the concept of crawl budget, Google has consistently stated that it primarily pertains to large-scale websites, rather than smaller to medium-sized ones.

Original news from SearchEngineJournal
