Google’s 15 Reasons for Not Indexing Pages Explained & How to Get More Pages Indexed. Plus explainer chart for all the reasons.
I’ve worked on some big websites – some with more than a million URLs – and often get asked why Google isn’t indexing all of a site’s pages. This usually comes up after viewing the Google Search Console Index Coverage report.
I’m regularly asked these two questions:
- Why isn’t Google indexing all of our pages?
- How do we get them indexed?
While it’s possible to find out which of a site’s pages are indexed by doing a full site crawl using proxies, this can take a long time for large sites. But there is a more immediate and accurate way to get this information: Google Search Console.
Search Console is how Google communicates directly to websites and provides data not available elsewhere: about a site’s technical health, traffic, keywords, and various notifications. GSC data is especially valuable because it comes from how Google sees a site.
Free Excluded Reasons Cheatsheet
I’ve made a nice chart of Google’s reasons for excluding pages that explains their causes and how to resolve them. Get your copy here. Just make your own copy so you can follow along at home!
I’ll dig into the different reasons for excluded pages and how to fix them.
But first two big questions:
Will Google index all of a site’s pages?
Not necessarily. Google states indexing is not guaranteed and that it may not index all of a site’s pages.
This needs to be communicated to stakeholders in order to manage expectations. If your CEO is waiting for all one million pages to be indexed, you have to explain that Google says they may not ever be.
What are Excluded Pages?
Excluded pages are URLs that Google chooses not to index for a variety of reasons, i.e. they’re excluded from the index. I used to think it was because Google just didn’t like me, but it turns out this isn’t the case (as far as I know).
There are three general reasons causing excluded pages:
- Pages are blocked from being crawled or indexed, usually through ‘noindex’ or robots.txt.
- Google can’t find them because of poor site architecture and internal linking, or server issues. It can also be a combination of these.
- Low content quality: Google is blunt: “The quality of the content on a page is low.”
Where does Google show excluded pages?
You can find excluded pages in Google Search Console’s Page Indexing Report.
What is the Page Indexing Report?
This is where Google shows which of a site’s pages are indexed or not and the reasons why.
The report groups pages as:
- Included (indexed)
- Excluded (not indexed)
There are four categories of indexation status, which include indexed and non-indexed pages:
- Errors: Googlebot can’t even reach these URLs to consider them for indexing because of, well, errors.
- Valid with Warnings: These pages are indexed but have issues that need fixing.
- Valid: These pages are healthy and indexed.
- Excluded: These pages are not indexed, i.e. the excluded pages.
Why Does Google Exclude Pages from its Index?
Within the Excluded bucket, Google has 15 specific reasons for not indexing pages. Not every site is subject to all the reasons; it’s usually a combination.
Google’s excluded pages fall into four general groups:
- Content Related: These encompass content quality issues, though content may not be the sole reason Google has excluded these pages.
- Status Related: These involve pages with non-200 status codes.
- Canonical Related: Errors involving canonicals and canonicalization.
- Blocking Related: These are due to URLs being blocked from Googlebot.
Google’s 15 Reasons for Excluding Pages
Group: Content Related
Note: Crawled Not Indexed and Discovered Not Indexed differ from Google’s other reasons because they encompass content-related issues in addition to technical ones. Therefore, resolving these errors may require substantial effort to improve the amount and quality of content on pages in addition to fixing any technical errors.
1. URL Marked ‘noindex’
- What it means: These pages have a ‘noindex’ directive applied to tell the search engines not to index them.
- How to fix: Evaluate these pages to decide whether or not you want them indexed. If you want them included, remove the directives from those pages and test to make sure they’re now allowed before resubmitting to Google.
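If you have a long list of these URLs, a quick script can confirm whether ‘noindex’ has actually been removed before you resubmit. Here’s a minimal sketch in Python using requests and BeautifulSoup; the URL list is a placeholder for your own GSC export, and a full crawler is more reliable at scale.

```python
import requests
from bs4 import BeautifulSoup

# Placeholder list of URLs exported from the Page Indexing report
urls = [
    "https://www.example.com/page-1",
    "https://www.example.com/page-2",
]

for url in urls:
    resp = requests.get(url, timeout=10)
    # 'noindex' can arrive via the X-Robots-Tag HTTP header...
    header_noindex = "noindex" in resp.headers.get("X-Robots-Tag", "").lower()
    # ...or via a robots meta tag in the HTML (name matched case-insensitively)
    soup = BeautifulSoup(resp.text, "html.parser")
    meta = soup.find("meta", attrs={"name": lambda v: v and v.lower() == "robots"})
    meta_noindex = bool(meta) and "noindex" in meta.get("content", "").lower()
    print(f"{url}: header noindex={header_noindex}, meta noindex={meta_noindex}")
```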
2. Crawled Not Indexed
What it means: Google has crawled these pages and knows about them, but has decided not to index them.
Crawled Not Indexed usually requires more work to fix, and can be a top action item since it could include important pages. This happened on a site I worked on, where around 30% of the site’s pages weren’t getting indexed, some of which were “money” pages.
What Causes Crawled Not Indexed?
There are two main reasons for pages getting the Crawled Not Indexed treatment:
- Low-quality content: either duplicate and/or thin.
- Poor Internal Linking and Site Architecture make the pages too hard for Googlebot to find. (Googlebot doesn’t like doing a lot of work to find pages).
How to fix Crawled Not Indexed:
First, determine whether the cause is content quality, site architecture, or both. Then:
- Content quality: Evaluate whether the excluded pages are too thin, i.e. there’s not enough on the page, or whether the content is duplicated elsewhere on the site (a quick word-count check like the sketch after this list can help).
- Site architecture and internal linking: Crawl the site to find out whether the affected pages follow your internal linking practices, and implement a proper internal linking strategy to ensure Googlebot will crawl them. An example could be important product pages buried under a subcategory with no links from key pages such as the homepage.
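To get a rough sense of whether thin content is the culprit, you can compare the visible word count of the excluded pages against your healthy, indexed ones. A minimal sketch, assuming Python with requests and BeautifulSoup; the 300-word threshold is an arbitrary starting point of mine, not a number from Google.

```python
import requests
from bs4 import BeautifulSoup

# Placeholder sample of URLs from the Crawled - currently not indexed bucket
excluded_urls = [
    "https://www.example.com/product/widget-a",
    "https://www.example.com/product/widget-b",
]

WORD_THRESHOLD = 300  # arbitrary cut-off for "possibly thin"; tune it for your site

for url in excluded_urls:
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    # Drop scripts and styles so only visible text gets counted
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    words = len(soup.get_text(separator=" ").split())
    flag = "possibly thin" if words < WORD_THRESHOLD else "ok"
    print(f"{url}: {words} words ({flag})")
```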
Resources on Crawled Not Indexed
- Onely has an excellent guide to fixing it.
- Moz has a nice primer on Crawled Not Indexed.
3. Discovered Not Indexed
What it means: Google knows about these URLs but hasn’t crawled them yet. There is some overlap between Crawled and Discovered Not Indexed.
Causes of Discovered Not Indexed
- Server Overload: This was a root cause on a site I worked on, where the servers did not have enough horsepower to handle being crawled without getting overloaded. Google will back off from crawling if it senses this could impact server load.
- Content Overload: A site may simply have more pages and content than Google wants to crawl.
- Crawl budget problems, including server issues: a site’s crawl budget may be exhausted by what has already been crawled, leaving nothing left over for these pages. These issues in particular can eat into the crawl budget:
- Pages aren’t included in the sitemap and get left out of the crawl budget.
- Excessive redirects: Too many redirects and redirect chains send Googlebot around in circles instead of giving it a direct path to URLs. It doesn’t like that.
- Poor content quality: Google can predict that pages are likely to be thin based on ones it has already crawled and decided not to index.
How to Fix Discovered Not Indexed
- Increase Server Horsepower: this can make a huge difference and is relatively straightforward to get done.
- Determine if you may have too many pages for Google to crawl, which can be very subjective. If you think you do, look into consolidating pages around similar topics.
- Optimize the crawl budget. Start by checking GSC’s Crawl Stats Report as well as server health. Pay particular attention to average response time and 5XX status code responses.
- Point Googlebot towards the pages you most want crawled, which may mean weighing which pages can be sacrificed.
- Check the sitemap: if pages aren’t included in the sitemap, Google’s not going to think they’re important, so add any that are missing (the sketch after this list shows a quick way to check).
- Fix Internal Linking and Redirects, especially redirect chains and loops.
- Speed Up the Site: Increase server power and speed, and optimize the site content for faster speed.
- Block pages and content that don’t need to be crawled. The crawl budget gets wasted on pages like paginated URLs, canonicalized duplicates, and tag pages.
- Improve content quality, the same way it was done for Crawled Not Indexed.
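For the sitemap check above, it’s easy to diff the URLs in your XML sitemap against the excluded URLs exported from GSC. A minimal sketch, assuming a single sitemap.xml (a sitemap index would need an extra loop) and a plain-text export with one URL per line; both names are placeholders.

```python
import requests
import xml.etree.ElementTree as ET

SITEMAP_URL = "https://www.example.com/sitemap.xml"   # placeholder
EXCLUDED_EXPORT = "discovered_not_indexed.txt"        # placeholder: one URL per line from GSC

# Collect every <loc> entry from the sitemap
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(requests.get(SITEMAP_URL, timeout=10).content)
sitemap_urls = {loc.text.strip() for loc in root.findall(".//sm:loc", ns)}

# Load the excluded URLs and find the ones missing from the sitemap
with open(EXCLUDED_EXPORT) as f:
    excluded = {line.strip() for line in f if line.strip()}

missing = excluded - sitemap_urls
print(f"{len(missing)} excluded URLs are not in the sitemap:")
for url in sorted(missing):
    print(url)
```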
Fun Fact: Despite addressing the causes, there’s no guarantee that excluded pages will indeed get indexed. Lo and behold, Google recently announced that Discovered Not Indexed can last forever. Yay!
Related Resources
- AHREFS has a great guide to Discovered Not Indexed.
- Adaptive has one for optimizing the crawl budget.
- My own look at fixing site speed.
Group: Status Code Related
4. Server Error 500
What it means: The page returned a 500-level server error code when Googlebot attempted to crawl it. This means there’s something wrong with your server(s) that needs to be addressed.
How to fix: Diagnose the server problem and implement the needed fixes. Google’s documentation on Fixing Server Errors is a good place to start. Specific fixes depend on your particular server configuration and will likely require some non-SEO expertise, which is what I had to bring in.
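While the real diagnosis happens at the server level, a quick probe will at least tell you which of the flagged URLs still return 5XX responses and how slowly they respond. A minimal sketch; the URL list is a placeholder, and response times measured this way are only a rough proxy for what Googlebot sees.

```python
import requests

# Placeholder list of URLs flagged as Server error (5xx) in GSC
urls = [
    "https://www.example.com/slow-page",
    "https://www.example.com/another-page",
]

for url in urls:
    try:
        resp = requests.get(url, timeout=15)
        note = "still erroring" if resp.status_code >= 500 else "ok now"
        print(f"{url}: {resp.status_code} in {resp.elapsed.total_seconds():.2f}s ({note})")
    except requests.RequestException as exc:
        print(f"{url}: request failed ({exc})")
```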
5. Not Found 404
What it means: The pages in question are broken, hence the 404 error code.
How to fix: There’s a multitude of reasons a URL returns a 404, but there are two general approaches:
- Fix these pages so they work – are “live” and return a 200 status code.
- If they’re not supposed to be live, implement a 301 permanent redirect to an appropriate page.
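Once you’ve fixed or redirected the broken URLs, it’s worth verifying the results in bulk. A minimal sketch, assuming a hypothetical redirect_map.csv with one old_url,target_url pair per row and no header row.

```python
import csv
import requests

# Hypothetical mapping file: each row is old_url,target_url
with open("redirect_map.csv") as f:
    for old_url, target_url in csv.reader(f):
        resp = requests.get(old_url, timeout=10, allow_redirects=True)
        ok = (
            bool(resp.history)                                  # at least one redirect happened
            and resp.history[0].status_code == 301              # first hop is a permanent redirect
            and resp.url.rstrip("/") == target_url.rstrip("/")  # it landed where intended
            and resp.status_code == 200                         # and the destination is live
        )
        print(f"{old_url} -> {resp.url} [{resp.status_code}]: {'ok' if ok else 'check this'}")
```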
6. Soft 404
What it means: Soft 404s are kind of a weird concept. They only exist internally at Google and aren’t a “real” page status. In other words, while these pages don’t return a true 404 status code, Google treats them that way because something’s not right with them.
There are several reasons a page can be designated as a Soft 404, including thin content, broken code, or a broken resource connection. Check these URLs in GSC’s URL Inspection Tool, which shows how Googlebot sees the page.
How to fix: Evaluate each URL to decide whether the page really should be dead; if so, return a true 404 or 410, or redirect it to an appropriate page. If it should be live, fix whatever is making it look empty or broken.
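You can’t query Google’s soft-404 flag in bulk, but a rough heuristic of your own helps with triage: flag pages that return 200 yet have almost no body text or contain “not found” style wording. A minimal sketch; the phrase list and 100-word cut-off are my own assumptions, not Google’s criteria.

```python
import requests
from bs4 import BeautifulSoup

urls = ["https://www.example.com/maybe-soft-404"]  # placeholder list from GSC

NOT_FOUND_PHRASES = ["not found", "no longer available", "0 results"]

for url in urls:
    resp = requests.get(url, timeout=10)
    text = BeautifulSoup(resp.text, "html.parser").get_text(separator=" ").lower()
    too_short = len(text.split()) < 100                             # near-empty body
    looks_404 = any(phrase in text for phrase in NOT_FOUND_PHRASES)  # "dead end" wording
    if resp.status_code == 200 and (too_short or looks_404):
        print(f"{url}: possible soft 404 candidate")
```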
Related Resources
- Google’s documentation on Soft 404s is a good place to start.
- Onely has a good guide.
7. Redirect Errors
What it means: Something’s wrong with the redirects, such as redirect chains that are too long, redirect loops, or a bad URL in a redirect chain.
How to fix: Since there could be more redirects than what GSC is reporting, do a full site crawl (Screaming Frog works great for this) or use a debugging tool like Lighthouse. There are also Chrome plugins like Link Redirect Trace that work for ad hoc testing.
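For ad hoc checks, a short script that follows each hop manually will show the full chain, its length, and any loops. A minimal sketch; the ten-hop limit and the example URL are arbitrary placeholders.

```python
import requests

def trace_redirects(url, max_hops=10):
    """Follow a redirect chain one hop at a time, flagging loops and long chains."""
    seen = []
    while len(seen) < max_hops:
        if url in seen:
            return seen + [url], "loop detected"
        seen.append(url)
        resp = requests.get(url, allow_redirects=False, timeout=10)
        if resp.status_code in (301, 302, 307, 308):
            # Location can be relative, so resolve it against the current URL
            url = requests.compat.urljoin(url, resp.headers["Location"])
        else:
            return seen, f"final status {resp.status_code}"
    return seen, "too many hops"

chain, outcome = trace_redirects("https://www.example.com/old-page")  # placeholder URL
print(" -> ".join(chain), f"({outcome})")
```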
Related Resources
- Audit redirects with Screaming Frog
- SEMRush has a primer on redirects
- Onely has an awesome guide to fixing redirect errors in GSC
Group: Canonical Related
8. Duplicate Without User-Selected Canonical
What it means: This is just what it says: the page is a duplicate of another one but hasn’t had a canonical version declared, so Google has gone ahead and chosen one. These pages won’t be indexed.
How to fix: You don’t have to do anything, unless you think another page should be the canonical one. If so, add a canonical tag to explicitly tell Google which one you want.
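To see which canonical each duplicate actually declares, and whether it matches the page you want indexed, you can pull the rel=canonical out of the HTML. A minimal sketch with a hypothetical mapping of duplicates to intended canonicals.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical duplicates flagged in GSC, mapped to the canonical you intend
expected = {
    "https://www.example.com/widget?color=blue": "https://www.example.com/widget",
}

for url, wanted in expected.items():
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    link = soup.find("link", rel="canonical")
    declared = link["href"] if link else None
    verdict = "ok" if declared == wanted else "mismatch"
    print(f"{url}: declares {declared}, expected {wanted} -> {verdict}")
```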
9. Duplicate, Google Chose a Different Canonical than User
What it means: Even though the page already has a canonical specified, Google thinks a different page would be better and has chosen that one as the canonical instead.
How to fix: Nothing needs to be done, unless you want a different canonical from the one Google has chosen. In that case, add your desired one into the rel=canonical.
10. Alternate Page with Proper Canonical Tag
What it means: The page in question is an alternate of the canonical (indexed) version and points to that one.
How to fix: You don’t have to do anything. Nice, right?
Group: Blocking Related
11. URLs Blocked by Robots.txt
What it means: Search engine crawlers can’t even get to these pages because they’re blocked in robots.txt. If crawlers can’t access these pages at all, they can’t consider indexing them to show to users in the search results.
How to fix:
- Pull a full list of blocked URLs from GSC and also do a full site crawl to see which URLs are being blocked.
- Evaluate each blocked URL to decide whether you actually want it blocked.
- Make changes to your robots.txt file to allow any URLs you want Google to crawl.
- Test the blocked pages in a tool such as Google’s old-school robots.txt Testing Tool, in a crawl (the Frog is great at this), or with a quick script like the sketch below.
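You can test a batch of URLs against your live robots.txt with Python’s built-in urllib.robotparser. A minimal sketch; note the parser follows the original robots.txt standard, so a few Google-specific nuances may not be reflected, and the URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_URL = "https://www.example.com/robots.txt"  # placeholder
urls = [
    "https://www.example.com/category/widgets",
    "https://www.example.com/cart",
]

# Fetch and parse the live robots.txt
parser = RobotFileParser()
parser.set_url(ROBOTS_URL)
parser.read()

for url in urls:
    allowed = parser.can_fetch("Googlebot", url)
    print(f"{url}: {'allowed' if allowed else 'blocked'} for Googlebot")
```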
12. Blocked by Page Removal Tool
What it means: A page was blocked with the Page Removal Tool in Search Console. Pages can be blocked this way for up to 90 days, and this should really only be done in ad hoc situations, like when a developer isn’t available. One example: a duplicate version of a key page goes live and you need to take it down immediately.
How to fix: Add ‘noindex’ to these pages if you want them to be permanently removed from Google’s index. If a page needs only temporary blocking, go ahead and leave it blocked. Just remember the 90-day limit clock is ticking until the page becomes available to Google again.
13. Blocked Due to Unauthorized Request 401
What it means: Googlebot’s authorization to reach this page was blocked by a 401 code; it didn’t have permission.
How to fix: If you want Google to be able to reach the page, remove the authorization requirement.
14. Blocked Due to Access Forbidden 403
What it means: Ooh, “forbidden access!” Sounds exciting. But not really. Similar to the 401 problem, pages returning a 403 status code require access permissions, like an account sign-in.
How to fix:
- Decide whether to allow non-signed-in users (including Googlebot) to visit the page, and grant access accordingly.
- Choose to allow Googlebot access even without authentication. How-to Geek has a nice guide to this.
15. URL Blocked Due to Other 4XX Issue
What it means: There’s some other kind of 4XX error happening that isn’t covered above.
How to fix: Google suggests debugging using the URL Inspection Tool.
Wrapping Up
I hope this helps make sense of the confusing world of excluded pages and why Google may or may not index them.
Do you have too many excluded pages or other indexation problems?
If you’re stuck trying to figure out why Google won’t index all of your pages, get in touch and I can help.