Web crawling is pretty fascinating, but it's typically explained in a super boring way.
This article is for non-developers who need to understand why web crawlers matter because they want their websites to get better visibility, attract more organic traffic, and make more money.
I wholeheartedly promise to make it not boring. And to prove that to you, I will start it off with a semi-relevant joke:
I wonder what my parents did to fight boredom before the internet? I asked my 15 brothers and sisters and they didn’t know either.
In this article we cover:
- The definition of web crawlers
- How web crawlers work
- The importance of web crawlers for SEO (and the livelihood of your business)
- More jokes
What are Web Crawlers?
Let’s start by defining web crawler.
Web crawlers (also called ‘spiders’, ‘bots’, ‘spiderbots’, etc.) are software applications whose primary directive in life is to navigate (crawl) around the internet and collect information, most commonly for the purpose of indexing that information somewhere.
They’re called “web crawlers” because crawling is actually the technical term for automatically accessing a website to obtain data using software. Essentially, a crawler is kind of like a virtual librarian. It looks for info on the internet, and then sends it to a database for organizing, cataloguing, etc. so that the crawled information is quickly & easily retrievable by search engines when needed (like when you perform a search).
Most people call web crawlers either crawlers, bots, or spiders.
I really don’t think many people call them spiderbots, but it’s fun to say.
Spiderbot! Your mission, should you choose to accept it, is to never-endingly roam the extraordinarily large (and continuously expanding) internet and collect all its information, and put it into our index. Now go forth, acquire & extract!
That’s pretty much how it works.
Now, who the F in their right mind would want to go through the internet and catalogue all that information? It sounds like the punishment that would be given to the utmost of sinners, by Lucifer himself.
Enter: Googlebot.
That’s Googlebot: he/she/it is a tireless robot (piece of software) that runs around the internet, takes your information (the information from your website, the information you load onto social media websites, and any other publicly available page it can reach), and sends it into the Google index. This is how search engines work. (Google also collects data through products like Gmail and Google Home, but that happens through those products, not through crawling.)
Google isn’t the only one though. Other search engine companies (like Yahoo, Bing, etc.) make their money by providing information to us people glued to our computers and phones, searching things 24/7. But they need to acquire that information somehow, and they do this with web crawlers.
How do Web Crawlers Work?
The primary goal of a web crawler is to create an index (more on this later) and to learn what every web page on the internet is about, so the information can be retrieved by search engines and provided to you (the searcher) extremely quickly, and with great accuracy — meaning providing you results that answer the search intent of whatever it is you typed (or spoke) into the search engine.
The internet is like a continuously growing library with billions of books (websites), but no official/central filing system. So, search engine companies use internet-based software known as web crawlers to discover publicly available webpages — like your website.
An overview of the process of web crawling
Web crawlers systematically browse the internet to find websites. But how do they find all different websites? And how do they find all the pages on your website?
- Crawling Links: Web crawlers follow hyperlinks, much like we humans do when we are browsing the web ourselves, to go from page to page — or one website to another. These could be internal links to go from page to page on one site, or backlinks to go from website A to website B.
- Crawling Sitemaps: Web crawlers also take a look at your website’s sitemap to understand all the pages it needs to visit and index.
- Manual Submission: You can manually submit your website, and a list of its pages, to search engines using tools like Google Search Console, Bing Webmaster Tools, etc.
Then, they copy the information on the web pages they find (text, HTML, hyperlinks, metadata, etc.) and send it to their search engine mothership (the web crawler company's servers), which downloads the webpages into enormous databases and organizes/indexes the information in a way that can be searched and referenced very quickly.
Was that definitely 100% technically accurate? I don’t know. I’m not a web developer. But it’s close enough for you to get the overall idea of how it works without you having to re-read the definition 17 times and still be confused.
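If you're curious what that looks like under the hood (totally optional reading for non-developers), here's a toy crawler sketch in Python. This is a minimal illustration, not how Googlebot actually works; the starting URL and page limit are placeholders, and it uses the third-party requests and beautifulsoup4 libraries:

```python
# A toy web crawler: fetch a page, store its text, follow its links.
# Requires: pip install requests beautifulsoup4
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(start_url, max_pages=10):
    index = {}                      # our tiny "search index": url -> page text
    to_visit = deque([start_url])   # frontier of URLs we still need to crawl
    seen = {start_url}

    while to_visit and len(index) < max_pages:
        url = to_visit.popleft()
        try:
            response = requests.get(url, timeout=5)
        except requests.RequestException:
            continue                # skip pages that fail to load

        soup = BeautifulSoup(response.text, "html.parser")
        index[url] = soup.get_text(" ", strip=True)  # "download" the content

        # Follow hyperlinks, just like the article describes.
        for link in soup.find_all("a", href=True):
            next_url = urljoin(url, link["href"])
            if next_url.startswith("http") and next_url not in seen:
                seen.add(next_url)
                to_visit.append(next_url)

    return index

pages = crawl("https://example.com")  # placeholder URL
print(f"Crawled {len(pages)} pages")
```

That's the whole loop: fetch, store, follow links, repeat. Real crawlers add politeness rules, robots.txt checks, and massive distributed infrastructure on top of this, but the core idea is that simple.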
We aren’t trying to be that website where you read something like:
“When crawlers find a webpage, the search engine’s systems render the page content, taking note of key elements like keywords, and keep track of everything in the Search index.”
Anyways: web crawlers send information into Google’s database in a way that it can be accessed by you (searchers) very quickly.
This process is called “indexing”.
Historically, Google’s entire search engine index/algorithm ran on keywords; it used them to understand, index, organize, and serve pages (when someone performed a search).
That’s why when you search for something on Google it can somehow return 4,220,000,000 results in less than half a second… Absolute insanity.
Note: this process of visiting pages, crawling all of the links, downloading the information, etc. all happens on your website, which means your web server (aka web host) has to process those requests. That uses your resources, which web hosts will charge you for.
So, not only is Google making you spend money to essentially “steal and organize” your information, they then make you pay them to advertise if you want your website to show up at the top of the search page. Think about that for a second…
That’s why at my agency SERP, we provide SEO Services with pride — and see it as a battle against the giants. A way for us to help the little guys take back what is theirs — organic search engine real estate, where you don’t have to PAY for clicks.
I digress. Back to web crawling.
Improved Web Crawling, Improved Indexing
Now, however, Google is evolving and is able to build a more sophisticated and complex understanding of information.
Instead of simply organizing information on webpages by keywords, it is now able to understand entities, the same way we humans do.
For example, the keyword phrase “nicolas cage” used to be, to Google, simply a string of 11 letters separated by a space.
Now, Google understands more about this keyword: the reasons people search for it, and that Nicolas Cage is an entity, specifically a person entity.
So when you search for nicolas cage, you are provided with more information about him, as a person.
You can read more about this process in our article about SERP features.
Web Crawler Policies
Since web crawlers are software, they follow rules, known as policies.
- Selection policies — tell crawlers which pages to download, and which ones not to
- Re-visit policies — tell crawlers when to go back and check pages for changes.
- Politeness policies — tell crawlers how to avoid overloading websites. Hint: you have some power here by giving them instructions in your robots.txt file (more on that below).
There are more policies but … I’m already getting bored talking about them, so let’s get back to what’s important here.
The importance of web crawlers for SEO
Search engine optimization (SEO) is the practice of preparing content for search engine indexing so your website shows up higher in SERP results, and you get more clicks, traffic, sales, etc.
Without web crawlers, your site would never be found, and thus could never be presented on search engines.
Selectivity
Most crawlers don’t attempt to crawl the entirety of the internet, because let’s face it: some websites are more important than others, and the internet is just way too big.
Web crawlers (remember, they are software) require resources (aka money) to run, so companies want to make sure they are using those resources as efficiently as possible, which means they must be selective.
These bots decide which pages to crawl first based on factors they deem to be important:
- How popular is the site? Crawlers need to keep re-crawling popular sites so the updated information those sites publish stays accessible from the search engine.
- Popularity itself is determined by hundreds of factors, but the main ones are traffic, the number of links to the page, etc.
Crawl Budget
A web crawler’s “crawl budget” is basically the number of pages it will crawl (and index) on any given website during a given timeframe.
What does this mean for you? If your site is too slow, too hard to crawl, deemed not important enough, etc., the crawler will use up its budget and leave. It will miss pages, and those pages won’t be indexed in search engines.
So, as an SEO specialist, you want to make sure you optimize your website to make the most of your crawl budget.
Do this by having:
- A good sitemap (see the example after this list)
- Good website architecture
- Good Page Speed
- Lots of backlinks
- Good internal linking
- A properly set up robots.txt file
- Not a lot of broken pages (404s, etc.)
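Since a good sitemap tops that list, here's what a bare-bones XML sitemap looks like (the URLs and dates are placeholder examples). It's just a machine-readable list of the pages you want crawlers to find:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/</loc>
    <lastmod>2023-01-15</lastmod>
  </url>
  <url>
    <loc>https://www.example.com/about</loc>
    <lastmod>2023-01-10</lastmod>
  </url>
</urlset>
```

You list every page you care about, and crawlers use it as a map of your site instead of having to discover everything through links alone.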
Robots.txt
Your robots.txt file is a file on your website that crawlers look at for directives — you can invite the spiders in, or keep them out — the choice is yours.
We have an article extensively covering robots.txt, but just to recap — you may not want bots visiting certain pages (maximize your crawl budget on your more important website sections) or maybe you just want to block certain bots.
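To make that concrete, here's a simple robots.txt sketch (the paths and the bot name are made-up examples). It welcomes every crawler except one, keeps bots out of an admin area, and points them at the sitemap:

```
# Allow all crawlers, but keep them out of /admin/
User-agent: *
Disallow: /admin/

# Block one specific bot entirely (name is illustrative)
User-agent: BadBot
Disallow: /

# Tell crawlers where to find your sitemap
Sitemap: https://www.example.com/sitemap.xml
```

Well-behaved crawlers (like Googlebot) read this file first and respect it. Badly behaved ones often ignore it, which brings us to the next point.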
What certain bots you ask? The bad kind of bots.
Good Bots vs. Bad Bots
So we want our website to be found by Google, Bing, Yahoo, etc. so our business can be found by customers and grow. Great.
And now we know that in order for our website to be found, we must make sure these crawler bots can find it. Great.
But not all web crawlers are programs created by the search engine companies, and not all bots are deployed around the internet to INDEX content — some are here to scrape content.
What are scraper bots?
Ever gotten spam phone calls? Spam emails? How did these people get your contact information? Well, one way is that it was scraped off your website, or some other website, on the internet.
Ever wonder how your business/personal information ends up on websites where you know for certain you didn’t add it? Might have been scraped.
Bots can scrape anything posted publicly on the Internet. Anything includes text, images, HTML, CSS, etc.
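To show how little effort this takes, here's a toy scraper in Python that pulls email addresses off a page (the URL is a placeholder, and it uses the same third-party requests library as the earlier crawler sketch):

```python
# A toy email scraper: fetch a page and regex out anything that looks
# like an email address. This is how contact info leaks off public pages.
import re

import requests

EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def scrape_emails(url):
    response = requests.get(url, timeout=5)
    return set(EMAIL_PATTERN.findall(response.text))

print(scrape_emails("https://example.com/contact"))  # placeholder URL
```

Point that at thousands of pages instead of one and you have a spam list. That's the whole business model.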
Malicious bots can collect all sorts of information that hackers/attackers use for a variety of purposes:
- Text-based content can be reused on another website to steal the first website’s SERP rankings
- Attackers could use your website’s entire HTML & CSS code to create a duplicate of your site and try to deceive users into inputting their usernames, passwords, credit card information, etc.
- etc.
Personal information can be scraped in bulk to collect databases of people in a specific cohort, and used for marketing purposes. Admittedly, this is not nearly as malicious as the previous examples but it still illustrates the point — not all bots are here to index your content for search engines.
Fun Fact: Bots are believed to make up over 40% of all internet traffic!
Not only is that a staggering amount of bot-related activity, it has real implications for you as a website owner: it affects your analytics, your server resources, etc.
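If you're curious how much of your own traffic is bots, one rough first pass is to scan your server's access log for crawler user-agent strings. Here's a minimal sketch in Python, assuming a standard nginx-style log location (the path and signature list are assumptions, and real bot detection is much more involved than this):

```python
# Rough estimate of bot traffic from a standard web server access log.
# Real bot detection is far more involved; this just matches user agents.
BOT_SIGNATURES = ["googlebot", "bingbot", "bot", "spider", "crawler"]

total = bots = 0
with open("/var/log/nginx/access.log") as log:   # path is an assumption
    for line in log:
        total += 1
        if any(sig in line.lower() for sig in BOT_SIGNATURES):
            bots += 1

if total:
    print(f"{bots}/{total} requests ({100 * bots / total:.1f}%) look like bots")
```

Even this crude count is usually eye-opening the first time you run it on a real log.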
Since this article isn’t about malicious bots (we could have an entire series on that) I will stop there with it.
Final Thoughts
Hate it or love it, bots are everywhere. They make up almost half of all internet traffic, so to be a responsible business owner, website owner, SEO consultant, etc., it is critical we understand them and keep learning what we can do to let the good ones in and keep the bad ones out.
Want to learn more? Do this:
- Subscribe here: devin.to/youtube
- Sign up here: devin.to/email
- Reddit user? Join here: https://www.reddit.com/r/SERPUniversity/
- Facebook user? Join here: https://www.facebook.com/serpuniversity