Every seasoned marketer knows the importance of a well-thought-out prospecting list, whether it's for a press release on your client's latest innovation, a social campaign targeting industry influencers, or simply a means of reaching out to others in your niche to earn links. (For background on why those links matter, see our post What is Link Juice.) However, building a well-formed, thorough prospect list takes time and a keen eye. Usually, this means lots of sifting through irrelevant sites that just happened to use your keyword in one out-of-context post. But what if there were a way to save time and cut out at least some of the hard work?
What is Scrapebox?
When you prospect, you're probably doing some sort of manual Google search, either completely by hand or with the help of a tool like the MozBar, which lets you export your SERPs (search engine results pages) into a handy Excel spreadsheet. Either way, manual prospecting takes a fair amount of time, which is always valuable, no matter the project.
Fortunately, there’s a way to change that.
Scrapebox, among other uses, is a tool that does all of your Google searching for you, turning your desired keywords into a list of potential prospects in a matter of minutes.
How Do I Use It?
First off, you’ll need to provide your keywords. If you only have a few, you can just go ahead and type them directly into the keyword box. However, if you have a pre-researched list, you can import it directly into Scrapebox via a .txt file and the Import button.
Now, let’s say you have your list of 50 keywords that, in addition to being relevant to your prospecting needs, could potentially return a lot of irrelevant results, like e-commerce listings. Scrapebox’s merge tool (displayed as an M button) allows you to import a second set of keyword modifiers – also via a .txt file – that will join each modifier with each keyword, saving you the time of re-creating your keyword list in Excel or manually editing and adding in each keyword.
In this example, we’ve merged our list of book keywords with the terms “book blog” and “book reviews.”
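Under the hood, the merge step is just a cross product: every keyword gets paired with every modifier. If you'd rather build the merged list yourself before importing it, a minimal Python sketch does the same thing (the keywords and modifiers below are made-up examples; in practice you'd load them from your .txt files):

```python
# Hypothetical sample lists; in practice, read these from your keyword
# and modifier .txt files (one entry per line).
keywords = ["fantasy books", "young adult novels"]
modifiers = ["book blog", "book reviews"]

# Pair every keyword with every modifier, just like Scrapebox's M button.
merged = [f"{kw} {mod}" for kw in keywords for mod in modifiers]

for line in merged:
    print(line)
```

Two keywords and two modifiers produce four merged queries; fifty keywords and two modifiers would produce a hundred, with no Excel wrangling required.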
Depending on how many keywords you're attempting to scrape, it may be wise to invest in some proxies. This is due to the nature of Scrapebox: simultaneously running multiple searches is an easy way to get your scrape blocked by a captcha, similar to the one shown below:
The use of proxies will create the illusion that the multiple searches are occurring in various locations all over the world.
The use of proxies is not mandatory, but it is highly recommended. We use this service for our proxies: http://buyproxies.org/
If you do choose to use proxies, the next step is to import them into Scrapebox. To do so, click the Manage button in the proxies section, load the .txt file containing your proxies, input any associated information (usernames, passwords, etc.), and you're all set on that front.
Normally, you’d see your list of fully-loaded proxies here. However, we didn’t feel like sharing ours today.
We’re almost ready to go!
If you have a moderately-sized list of keywords without any advanced operators (under 50 usually seems to run smoothly without error), all you need to do is make sure Custom Harvester is selected and set your number of results (100-200 is generally sufficient, as most results beyond that generate lower-quality results).
Once you click the Start Harvesting button, you’ll be given the option to choose which search engines you’d like to use. After those are selected, just click Start, and Scrapebox will start gathering all of your URLs. Generally, this will only take a minute or two, but if you have a longer list of keywords, you might have to wait a few minutes more.
Managing Your Data
Once your scrape is finished running, Scrapebox will give you the option to export your list of completed (and occasionally non-completed) keywords. After that, you'll be left with your harvested list of URLs. If your prospecting needs depend more on the domain than on the actual content or context of the page, you can remove duplicate domains right in Scrapebox under the Remove/Filter tab on the right-hand side. However, if your prospecting requires a more in-depth look at the specific results of your queries, it's best to leave in multiple URL results per domain to maximize your chance of finding relevant results.
Once you’ve trimmed everything down, go to the Export URL List and export to the file type of your choice, save it, and you’re done!
This is what a finished scrape looks like, in case you were curious.
Help! I’m Having Some Problems
My Scrapes Aren’t Returning Any Results/Are Giving Me a Lot of Error Codes
If you start running scrapes consisting of dozens, hundreds, or even thousands of advanced search queries, you’ll soon discover that your scrapes aren’t returning any results and are instead resulting in error codes. This is caused when Google is able to detect that your proxies are running automated searches. How does this happen? As I mentioned earlier, Google is able to detect when an IP address is conducting multiple searches in a short period of time and asks for captcha verification to prevent this kind of automated searching.
Fortunately (or unfortunately, depending on who you ask), your automated scrape can't solve these captchas, so Google is able to temporarily ban your proxies from conducting any further searches. This captcha response is triggered more quickly when advanced queries are used in quick succession than when simple ones are, meaning that a large scrape with advanced queries requires a time delay to keep the proxies from being burned through too quickly.
To add a delay, simply check Detailed Harvester in the Harvester and Proxies section. Once you click Start Harvesting, you'll see a screen similar to the Custom Harvester, but with an option in the lower left-hand corner to add a time delay. A smaller list of 50-100 advanced queries should be fine with a 30-to-60-second delay, but a longer list can require up to 300 seconds (and yes, this also means you might need to let scrapes run overnight, or occasionally for a few days, if you plan on running everything in one go). This delay feature is also useful if multiple people in your office are sharing a set of proxies and running Scrapebox frequently, since heavy shared use otherwise gives Google more opportunities to ban your proxies.
I’m Getting a Lot of Bloggers From Düsseldorf in My Results
Sometimes, you’ll notice that you’re getting a large amount of results with websites that are of no use to your prospecting due to their location. This is often caused by many proxies being distributed internationally, and thus occasionally returning results from their “home” location. If you have a large number of proxies, you are able to remedy this by temporarily disabling selected proxies in Scrapebox.
To do so, click the Manage button in the proxies section. This will load a complete list of your proxies, as well as their information, including location. You can select proxies from areas that are returning a large number of irrelevant results by using Control + Click, and remove them by selecting Remove the Selected Proxies under the Filter tab. Just remember to reload your entire list of proxies for future prospecting!
I Want to See Which Keywords Returned Which Results
Once in a while, it might be helpful to see which keywords returned which results, such as in an early round of prospecting where you're still determining which keywords and operators are working for you and which are returning irrelevant results. Fortunately (you guessed it), Scrapebox has a way of letting you do so.
First, select “Connections, Timeout and Other Settings” from the Settings drop-down at the top.
Next, under the “More Harvester Settings” tab, check the box labeled “Save additionally keywords with URLs.” Go ahead and exit out by clicking OK and run your scrape as normal. Once your scrape is finished, you’ll be able to see and export your list of URLs as usual.
To find your keywords, you’ll need to open your Harvester_Sessions folder (which should be hiding wherever you installed Scrapebox, if you haven’t had to access it before), where you’ll find a .txt file titled “kw_urls_MM-DD-YYYY_HH-MM-SS”. All you need to do from there is paste it into Excel, work some text-to-columns magic, and you’ll have a list of URLs and their associated keywords!
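If you'd rather skip the Excel step, a short script can do the same split and write a ready-to-use CSV. This is only a sketch: the sample lines and the "keyword | URL" delimiter are assumptions on my part, so open your export first and adjust the separator to match what your version of Scrapebox actually writes.

```python
import csv

# Hypothetical lines from a kw_urls export; in practice, read the actual
# .txt file from your Harvester_Sessions folder. The " | " delimiter is
# an assumption; check your file and adjust if needed.
sample_lines = [
    "book reviews | https://example-bookblog.com/reviews/novel-one",
    "book blog | https://another-reviewer.net/about",
]

# Split each line into a (keyword, URL) pair.
rows = []
for line in sample_lines:
    keyword, _, url = line.partition(" | ")
    rows.append((keyword.strip(), url.strip()))

# Write out a CSV you can open directly in Excel or Google Sheets.
with open("kw_urls.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["keyword", "url"])
    writer.writerows(rows)
```

From there, sorting or filtering by the keyword column makes it easy to spot which queries are pulling in irrelevant results.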
Jessi Carr is a Digital Marketing and PR Specialist at Inseev Interactive and the Managing Editor of 365BusinessTips. In her spare time, she enjoys perusing the San Diego craft beer scene and fervently rewatching ’90s NBC sitcoms.