Welcome to our next installment of Inseev’s Video+Article series! Today we are focusing on a topic that many SEOs are confused by and don’t fully understand.
Removing Pages from Google’s Index
There are several ways to remove a page from Google’s index. From the tools offered by Google to robots directives, we will review all of the different ways to get pages out of the index. Check them out below!
What is Thin Content?
First, it is important that we define thin, or low-quality, content. Simply put, thin content is a page on your site that adds little to no value to the user. It could be an empty page or a page that has broken text or plenty of errors. Duplicated content is also a type of thin content as a duplicated page is not adding any additional value to the user.
Thin content is very important for Google’s Panda Algorithm. Panda is now part of the core algorithm and this is not a very complicated update. Poor or low-quality pages are bad for your site. Period.
How To Find Low Quality Pages
Low-quality pages can be found with several tools. This post is not about identifying these pages, and we have a separate post on content audits for SEO here.
Checking to see if the subfolders are indexed is easy. You can simply type “site:domain.com inurl:subfolder” into Google. Make sure to use lowercase letters. For a hypothetical site at example.com with a blog subfolder, the query would be “site:example.com inurl:blog”.
So you have a lot of thin pages? What can we do? What should we not do? Let’s start with the wrong way first.
The wrong way to remove pages from Google’s Index: Blocking pages in the robots.txt file
One of the main things we see when we tell a developer, “Hey, we found these pages,” is that they go into the robots.txt (which I will link here) and put a block directive on the pages, or on the directory if there’s a directory pattern they want to block. That is the worst thing you could do.
If you have 2900 pages that have no content on them, and they are very, very thin pages (which is lowering the overall Google quality score from a site perspective), and you go into your robots.txt and block all of those pages from being crawled, Googlebot is still going to follow the internal links it originally found on the site to reach those pages. Then it is just going to hit that robots.txt. It’s the first thing the bot checks when it starts crawling a site, and it’s going to turn around. It’s never going to actually go to the page and remove it from the index. It’s just going to realize it can’t crawl the page, and it’s going to drop it out of the index slowly, after a year or two.
When you put them in the robots.txt, all you’re doing is perpetuating the problem, not fixing it. That’s incorrect, and that’s what you want to avoid.
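To see why a robots.txt block works against you, here is a minimal sketch using Python’s standard `urllib.robotparser`. The robots.txt content, the `/thin-pages/` directory, and the example.com URLs are all hypothetical. It only shows that a rule-following crawler is not allowed to fetch the blocked URL at all, which means it never gets to see a 301 or a noindex on that page.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks a directory of thin pages.
ROBOTS_TXT = """\
User-agent: *
Disallow: /thin-pages/
"""


def is_crawlable(robots_txt: str, url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The blocked page can never be fetched, so any noindex tag or
# redirect placed on it stays invisible to the crawler.
blocked_ok = is_crawlable(ROBOTS_TXT, "https://example.com/thin-pages/page-1")
other_ok = is_crawlable(ROBOTS_TXT, "https://example.com/about")
print(blocked_ok, other_ok)
```

If you run a check like this against your own robots.txt, any URL you are trying to get deindexed should come back as crawlable.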
The Right Way(s) to Remove Pages from the Index
Ultimately, what you want to do is to try and force Google to register that you are removing the pages in some way.
301 the page to another URL and force Google to crawl it.
Let’s say I just 301 them all to a single page. Now, if Google wants to crawl that 301, they are going to slowly say, Okay. This is the new page. They don’t want this. They don’t want me to have this page in my index anymore. I’m going to drop it out. That’s the first solution. Again, that’s not always the right solution.
I’m not going to get into the details of why it may or may not be the solution but trying to understand what the solution you need is also very important.
Sometimes a website will have pages that haven’t been crawled in three years and Google still has them in their index. We know they’re old pages because we know how many results are on them. The company no longer even services these pages, doesn’t even merge the pages. If we put the 301 in place, they might be linking to this page from a very, very old blog post. Google crawls that blog post once every six months. They crawl it and then it takes them six months to crawl it again. When they finally crawl it, they see the 301.
Let’s say we actually want to get pages out of the index quickly—then we have to force Google to crawl through the 301s.
One way to do that is to create a fake HTML sitemap, put all your links in it, and force Google to crawl that. The other way is to create a static HTML page with all the links on it, and then submit that page via Search Console.
That’s a quick way that you can have Google crawl through all of the 301s that you need them to process instantly. Hopefully, you could get some of those out within the next week or two weeks. If you keep doing that and you keep forcing Google to crawl the same file every day, they’ll keep dropping them out in a week or two.
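A static page of links like this is easy to generate. The sketch below is one possible way to do it in Python; the function name `build_crawl_page` and the example.com URLs are made up for illustration, and it only builds the HTML. Uploading the file and submitting it in Search Console are still manual steps.

```python
from html import escape


def build_crawl_page(urls, title="Crawl list"):
    """Build a bare HTML page that links to every URL in urls."""
    items = "\n".join(
        f'  <li><a href="{escape(u, quote=True)}">{escape(u)}</a></li>'
        for u in urls
    )
    return (
        "<!DOCTYPE html>\n"
        "<html>\n"
        f"<head><title>{escape(title)}</title></head>\n"
        "<body>\n"
        "<ul>\n"
        f"{items}\n"
        "</ul>\n"
        "</body>\n"
        "</html>\n"
    )


# Hypothetical URLs that have been 301 redirected.
old_urls = [
    "https://example.com/old-1",
    "https://example.com/old-2",
]
page = build_crawl_page(old_urls)
print(page)
```

Write the returned string to a file on the site, then delete that file once the URLs have dropped out.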
Add a noindex tag to the page’s meta tag and force crawl
The other thing you can do is obviously add a noindex to the URL’s meta tag. If I had 2900 pages that I wanted to get rid of, but I didn’t want to 301 them—they are marketing pages or something and I couldn’t 301 them—then I could add a noindex tag to them. That’s what I’m actually suggesting in the scenario in the video.
If you really want to get them out faster, add them to an HTML sitemap, an XML sitemap, or just a blank page, whatever you need to do. Once they are out of the index, remove that page, or remove the links from wherever you added them, and you will be good to go.
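Before you force a crawl, it is worth confirming that the noindex actually made it into the page. Here is a rough sketch of a checker built on Python’s standard `html.parser`. The class and function names are hypothetical, and it only looks at the robots meta tag in HTML you pass in; it does not fetch pages or check HTTP headers.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Records whether a robots (or googlebot) meta tag contains noindex."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() in ("robots", "googlebot"):
            directives = [d.strip().lower() for d in attr.get("content", "").split(",")]
            if "noindex" in directives:
                self.noindex = True


def has_noindex(html_text: str) -> bool:
    """Return True if the HTML carries a noindex robots meta tag."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    return parser.noindex


# Hypothetical page source with the tag in place.
sample = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(has_noindex(sample))
```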
Canonicalize the URL to another base URL
Canonicalizing is when you want to tell Google there’s an alternate version of the page, or the primary version of the page, that they should be indexing. Then, they should consolidate all of the signals from the duplicate URLs onto that single URL. The point is not simply to get a URL out of the index, although it does accomplish that. It does get rid of URLs that have parameters or URLs that are very, very similar. In general, you want to use canonicals when you’re dealing with things like URL parameters.
Don’t use a canonical tag to get things out of the index unless you absolutely have to. We would use canonical tags when a company would come to us and say something like, We don’t have the noindex solution. We can’t do that on our website. If you can’t do that on your website, we’ve got to figure out something. If you can use canonicals, maybe we can at least try to force it. Google’s pretty good at that. The canonical directive is not as strong. Google will sometimes ignore canonicals. They will never ignore our robots.txt and they will never ignore a noindex. Those are way stronger directives. Those are the ones we want to use.
Here’s an example of canonical tagging placed right. This is how canonicals should function when you use a parameter to adjust the content on a page. The parameter creates a new URL that we don’t want indexed. If there is no canonical tag, Google will index it as a separate URL, a duplicate page, so we just want to canonicalize it back to the base version as you see above.
Again, canonical is not always the solution. It is the right one when you have parameters. In general, if you’re trying to remove pages quickly from the index, it’s probably not because you have parameters. It’s probably because you did something stupid like leaving a staging subdomain indexed, or you found a ton of blank pages.
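As a rough illustration of canonicalizing a parameterized URL back to its base version, the Python sketch below strips the query string and builds the matching tag. The function name and the example.com URL are hypothetical, and it assumes every parameter on the URL is non-canonical, which will not be true on every site.

```python
from urllib.parse import urlsplit, urlunsplit


def canonical_link_tag(url: str) -> str:
    """Build a canonical tag pointing at url without its query string or fragment."""
    parts = urlsplit(url)
    # Assumption: all parameters only adjust the content and are safe to drop.
    base = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    return f'<link rel="canonical" href="{base}">'


# Hypothetical parameterized URL.
tag = canonical_link_tag("https://example.com/shoes?color=red&sort=price")
print(tag)
```

The tag goes in the head of the parameterized page, so Google is pointed at the base URL whenever it crawls a variant.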
If you have 2900 pages, I don’t think you can even remove them by hand anymore. The old Google Search Console interface, which may have been sunsetted by the time you’re watching this video, is still live as of February 2019. If you click in and you actually want to remove something from the index, you can do that there. It says the removal is temporary: if Google re-crawls the page, they might re-index it. But I can just request that it be removed, which in this situation I should, because there’s no content on this page. Again, it’s a blank page.
For 2900 URLs, it’s not going to work. I’m pretty sure it times out after you do about 50, and you can’t remove any more. It’s a much better solution for a couple of URLs. What I like to do is use one of the solutions above. I like to take the page that I put all the URLs on and remove it by hand. I need to make sure I get rid of my page, since it’s just a bunch of links (it’s just a blank page to Google), and then force a crawl.
Let the pages 404
The last thing you could do is let all of these pages 404, and Google will drop them out naturally. Again, this is not the right solution if you are trying to have Google drop them out quickly.
If you think this is a performance problem (say 75% of your site is blank and Google is now seeing it, your client just got crushed by an algorithm update or has been having a lot of performance problems, and your perception is that Google thinks the site is pretty much a blank website), don’t do this.
Don’t let the pages 404 and drop them out naturally. That could take 6 to 12 months before you see any sort of improvement.
I would suggest going in and trying to noindex those pages first, get as many as you can out with a force crawl—then go let them all 404 and you can start to improve the rest of the pages’ quality over time.
There’s a bunch of different ways you can do this. It all comes down to the technical recommendation and the situation you’re in. There will be more videos on canonicalization and how it works down the line. You can also do additional research if you need it yourself. In general, this is the basic framework for removing pages from the index.
Looking for more SEO support? Our team does a fantastic job of identifying low-quality content as part of our SEO audit service.