Do you have a startup? This is what you need to know about how the Google search engine works

How does the Google search engine work: crawling, indexing and ranking?

Learn how the Google search engine finds, crawls and ranks the trillions of web pages that exist to offer you the results you see when you search on Google.

Although the details of the process are quite complex, knowing the basic (non-technical) concepts of crawling, indexing and ranking can help you better understand the methods behind an SEO strategy.

A complex task

Google says it knows of more than 130 trillion pages on the web. The real number is probably much higher, because Google keeps many pages out of the crawling, indexing and ranking process for various reasons.

To keep results as relevant as possible for users, the Google search engine has a well-defined process that identifies the best web pages for each search.

This process keeps evolving as Google works to further improve its search results.

How do Google search results work?

This process involves the following steps:

Crawling: follow links to discover the most important web pages.
Indexing: store the information from all the retrieved pages for later retrieval.
Ranking: determine what each page is about and how it should rank for relevant queries.

Crawling websites

Search engines have crawlers that “crawl” the World Wide Web to discover the pages that exist, which helps identify the best web pages to evaluate when a query comes in.

Web crawlers move from page to page by following the links they find.

These links connect the pages of a single website and the web as a whole, creating a path for the crawlers to reach the trillions of interconnected web pages that exist.

Each time crawlers look at a web page, they read its “Document Object Model” (or “DOM”) to see what’s in it.

The DOM represents the page as built from its HTML and JavaScript code, and crawlers search it for links to other pages. This allows the search engine to discover new pages on the web, and each new link found is added to a queue that the crawler will visit later.

Crawling the whole web every day would be too much work. For this reason, Google usually spreads its crawl over several weeks. In addition, the Google search engine does not crawl each and every web page that exists.

Web crawlers start with trusted websites that serve as a reference when evaluating other websites. From there, the crawlers expand their crawl across the web by following the links they find on the pages they visit.
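
The crawl loop described above can be sketched in a few lines of Python. This is only an illustration and not how Google’s crawler is built: the `crawl` function, the `LinkExtractor` class and the three in-memory pages are hypothetical, and the sketch reads raw HTML rather than a rendered DOM, so links added by JavaScript would be missed.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl: start from the seed URLs and queue every new link."""
    queue = deque(seeds)          # pages waiting to be visited
    seen = set(seeds)             # every URL ever queued, to avoid repeats
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)         # fetch() returns the page HTML, or None
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)   # resolve relative links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical three-page site held in memory instead of fetched over HTTP.
PAGES = {
    "https://example.com/": '<a href="/about">About</a> <a href="/blog">Blog</a>',
    "https://example.com/about": '<a href="/">Home</a>',
    "https://example.com/blog": '<a href="/about">About</a>',
}

print(crawl(["https://example.com/"], PAGES.get))
# ['https://example.com/', 'https://example.com/about', 'https://example.com/blog']
```

The `max_pages` limit is a crude stand-in for the fact that a crawler does not visit every page that exists.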

Indexing a website on Google

Indexing is the act of adding information about a web page to the index of a search engine. The index is a collection of web pages, a database, which includes information on the pages crawled by the search engine.

The index catalogs and organizes, among other things:

Detailed data on the nature of the content and the topical relevance of each web page.
A map of all the pages that each page links to.
The clickable text (anchor text) of each link.
Other information about the links: whether or not they are ads, where they are on the page, aspects of the context of the link and what that implies about the page that receives the link.
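
As a rough illustration of such an index, the sketch below stores three of the items listed above: which pages contain each word, which pages each page links to, and the anchor text pointing at each page. The `build_index` function and its input format are assumptions made for this example. A real search index holds far more than this.

```python
import re
from collections import defaultdict


def build_index(pages):
    """pages: {url: {"text": str, "links": [(target_url, anchor_text), ...]}}"""
    words = defaultdict(set)      # word -> URLs whose text contains it
    outlinks = defaultdict(set)   # URL -> URLs it links to
    anchors = defaultdict(list)   # target URL -> anchor texts pointing at it
    for url, page in pages.items():
        for word in re.findall(r"[a-z0-9]+", page["text"].lower()):
            words[word].add(url)
        for target, anchor_text in page["links"]:
            outlinks[url].add(target)
            anchors[target].append(anchor_text)
    return {"words": words, "outlinks": outlinks, "anchors": anchors}


# Hypothetical two-page site.
PAGES = {
    "/coffee": {"text": "How to brew coffee", "links": [("/beans", "buy coffee beans")]},
    "/beans": {"text": "Coffee beans for sale", "links": []},
}

index = build_index(PAGES)
print(sorted(index["words"]["coffee"]))   # ['/beans', '/coffee']
print(index["anchors"]["/beans"])         # ['buy coffee beans']
```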

The index is the database in which search engines such as Google store data, and from which they retrieve it when a user types a query into the search engine.

Before deciding which web pages to show from the index and in what order, search engines apply algorithms to help rank those web pages.

Ranking results

To provide results to a user, search engines must carry out three essential steps:

Interpret the intention of the user’s query
Identify web pages in the index related to the query
Sort and return those web pages in order of relevance and importance
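
The three steps can be sketched as a toy search function. Everything here is a simplifying assumption: “interpreting” the query just means dropping a few stop words, relevance is a plain count of matching words, and the `importance` scores are supplied by hand. Real search engines rely on hundreds of signals.

```python
import re

STOP_WORDS = {"a", "an", "the", "for", "of", "to", "how", "in"}


def interpret(query):
    """Step 1 (very crude): reduce the query to its meaningful terms."""
    return [w for w in re.findall(r"[a-z0-9]+", query.lower()) if w not in STOP_WORDS]


def search(query, pages, importance):
    terms = interpret(query)
    results = []
    for url, text in pages.items():
        words = re.findall(r"[a-z0-9]+", text.lower())
        relevance = sum(words.count(term) for term in terms)
        if relevance > 0:         # Step 2: keep pages related to the query
            results.append((relevance, importance.get(url, 0), url))
    # Step 3: sort by relevance first, then by importance
    results.sort(key=lambda r: (-r[0], -r[1], r[2]))
    return [url for _, _, url in results]


# Hypothetical pages and hand-assigned importance scores.
PAGES = {
    "/pasta": "Easy pasta recipe with tomato sauce",
    "/pizza": "Pizza dough recipe and pizza toppings",
    "/contact": "Contact us",
}
IMPORTANCE = {"/pasta": 5, "/pizza": 2}

print(search("a recipe for pizza", PAGES, IMPORTANCE))   # ['/pizza', '/pasta']
print(search("recipe", PAGES, IMPORTANCE))               # ['/pasta', '/pizza']
```

In the second query both pages match equally well, so the more important page comes first.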

Ranking is one of the main areas in which search engine optimization comes in.

Effective SEO helps establish the relevance and importance of your web pages for related queries.

So, what do relevance and importance mean?

Relevance: the degree to which the content of a web page matches the intention of the searcher (the intention is what the searcher is trying to accomplish with that search, which is no small task for search engines, or SEOs, to work out).
Importance: web pages are considered more important the more they are mentioned elsewhere. Traditionally, these mentions take the form of links from other websites to that web page.
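
A very simple way to picture importance is to count how many different websites link to a page. The `inbound_link_counts` function below is a hypothetical sketch of that idea only. It is not Google’s actual formula, which weighs links in far more sophisticated ways.

```python
from collections import defaultdict
from urllib.parse import urlparse


def inbound_link_counts(links):
    """links: iterable of (source_url, target_url) pairs.

    Returns {target_url: number of distinct other sites linking to it}.
    """
    sources = defaultdict(set)
    for source, target in links:
        source_site = urlparse(source).netloc
        if source_site != urlparse(target).netloc:   # skip a site's links to itself
            sources[target].add(source_site)
    return {target: len(sites) for target, sites in sources.items()}


# Hypothetical link data.
LINKS = [
    ("https://a.example/post", "https://shop.example/"),
    ("https://a.example/other", "https://shop.example/"),
    ("https://b.example/", "https://shop.example/"),
    ("https://shop.example/about", "https://shop.example/"),
    ("https://b.example/", "https://a.example/post"),
]

print(inbound_link_counts(LINKS))
# {'https://shop.example/': 2, 'https://a.example/post': 1}
```

Two links from the same site count once here, and a site’s links to its own pages are ignored, since the text speaks of mentions from other websites.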

To assign relevance and importance, search engines use complex algorithms that take hundreds of signals into account for any given web page.

These algorithms often change because search engines work to improve their methods and give better results to users. Although they are constantly modified, some of the fundamentals of what the search engines look for are well known.

Although the complete list of signals that search engines use in their algorithms will probably never be known, search engines have revealed some basic ones through knowledge sharing with the web publishing community. Use that knowledge to create lasting SEO strategies.

How do search engines evaluate content?

As part of the ranking process, a search engine must understand the nature of the content of each web page it crawls. In fact, Google gives considerable weight to the content of a web page when ranking it.

In 2016, Google confirmed that content is among its top three ranking factors for web pages.

To understand what a page is about, search engines analyze the words and phrases that appear on it, and then build a map of that data, known as a “concept map”. This concept map helps define the relationships between the concepts on a web page.
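
How search engines actually build such a map is not public. As a loose illustration of the idea, the sketch below counts how often two words appear in the same sentence, so that words that keep appearing together are treated as related concepts. The `co_occurrences` function and the four-letter minimum word length are assumptions made for this example.

```python
import re
from collections import Counter
from itertools import combinations


def co_occurrences(text):
    """Count how often each pair of words shares a sentence."""
    pairs = Counter()
    for sentence in re.split(r"[.!?]+", text.lower()):
        # Keep distinct words of four letters or more, in alphabetical order
        words = sorted(set(re.findall(r"[a-z]{4,}", sentence)))
        pairs.update(combinations(words, 2))
    return pairs


TEXT = "Espresso needs finely ground coffee. Ground coffee loses aroma quickly."

pairs = co_occurrences(TEXT)
print(pairs[("coffee", "ground")])     # 2
print(pairs[("coffee", "espresso")])   # 1
```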

What search engines can “see” on a web page

To evaluate content, search engines analyze the data found on a web page to make sense of it. Since search engines are software programs, they “see” web pages in a very different way than we do.

Search engine crawlers see web pages in the form of the DOM. As a human, if you want to see what the search engines see, one thing you can do is look at the source code of the page.

To do this, right-click on the web page in your browser and choose “View page source”. This shows you the source code of the web page.

The difference between the source code and the DOM is that the source does not show the effect of JavaScript execution, but you can still use it to learn a lot about the content of the page. Body content on a web page can often be found in the source code.

In addition to the unique content on the page, there are other elements on a web page that search engine crawlers find that help search engines understand what the page is about.

This includes things like:

The metadata of the web page, including the title tag and the meta description tag, found in the HTML code. Although they cannot easily be seen on the page that humans see, these tags serve as the title and description of the page in the search results, and should be maintained by website owners.
The alt attributes of the images on a web page. These are descriptions that website owners should maintain to say what each image is about. Since search engines cannot “see” images, alt text helps them better understand the content of the page. It also plays an important role for people with disabilities who use screen readers to have the content of a web page described to them.
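
To check these elements on your own pages, a small script can pull them out of the HTML source. The sketch below uses Python’s standard `html.parser` module. The `PageSummary` class and the sample page are hypothetical, and the script only reads the raw source, not the rendered DOM.

```python
from html.parser import HTMLParser


class PageSummary(HTMLParser):
    """Collects the title, meta description and image alt texts of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = None
        self.image_alts = []      # one entry per <img>; None when alt is missing
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = attrs.get("content")
        elif tag == "img":
            self.image_alts.append(attrs.get("alt"))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


# Hypothetical page source.
HTML = """<html><head>
<title>Blue Widgets | Example Shop</title>
<meta name="description" content="Hand-made blue widgets.">
</head><body>
<img src="widget.jpg" alt="A blue widget on a desk">
<img src="banner.jpg">
</body></html>"""

summary = PageSummary()
summary.feed(HTML)
print(summary.title)         # Blue Widgets | Example Shop
print(summary.description)   # Hand-made blue widgets.
print(summary.image_alts)    # ['A blue widget on a desk', None]
```

A `None` in `image_alts` marks an image with no alt attribute, which is worth fixing for both search engines and screen readers.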

What search engines cannot “see” on a web page

It is important to understand which elements of a web page search engines cannot see. That way, you can adapt the content of your website so that crawlers understand it better.

Items that search engines cannot see include:

Flash files: Google can extract certain information from Adobe Flash files, but it is difficult because Flash is software designed to create animations.

When web designers use Flash, they generally do not insert text that helps explain what appears in the files. Many designers opt for HTML5 as an alternative to Adobe Flash because it is compatible with search engines.

Audio and video: Like images, it is difficult for search engines to understand what audio or video is about without context.

There are some exceptions in which search engines can extract limited data, such as the ID3 tags of MP3 files. This is one of the reasons why many publishers add transcripts alongside the audio and video on a web page. These transcripts give search engines more context.

Content within a program: this includes AJAX and other JavaScript methods that dynamically load content on a web page.

Google can read JavaScript, although it still has limitations. Google arguably executes most JavaScript, but problems can still arise from the way it is executed.

Iframes: an iframe tag is normally used to add content from another part of your own website to the current web page, or to add content from another site on your webpage.

Google may not treat this content as part of your page, especially if it is obtained from a third-party website. Historically, Google has ignored content within an iframe, but there may be exceptions to that general rule.
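
To find this kind of content on one of your own pages, you can scan the HTML source for the tags that usually carry it. The sketch below is a hypothetical helper and not a standard tool. The set of tags is an illustrative choice based on the items above, and content loaded by JavaScript will not show up in this kind of scan at all.

```python
from html.parser import HTMLParser

# Tags that usually carry content crawlers struggle with (an illustrative set).
OPAQUE_TAGS = {"iframe", "audio", "video", "object", "embed"}


class OpaqueContentFinder(HTMLParser):
    """Records each hard-to-read element and the file or URL it points to."""

    def __init__(self):
        super().__init__()
        self.found = []           # (tag, source) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag in OPAQUE_TAGS:
            attrs = dict(attrs)
            self.found.append((tag, attrs.get("src") or attrs.get("data")))


def find_opaque_content(html):
    finder = OpaqueContentFinder()
    finder.feed(html)
    return finder.found


# Hypothetical page source.
HTML = (
    '<p>Intro</p>'
    '<iframe src="https://other.example/widget"></iframe>'
    '<video src="demo.mp4"></video>'
    '<object data="intro.swf"></object>'
)

print(find_opaque_content(HTML))
# [('iframe', 'https://other.example/widget'), ('video', 'demo.mp4'), ('object', 'intro.swf')]
```

Each element found is a candidate for supporting text on the page itself, such as a transcript or a short description.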

Summary

Search engines seem very simple: type a query in the search box and the results appear. However, these instant results are based on a complex set of processes that help identify the most relevant data for users, so that they can do things like find a recipe, search for a product or get an answer to a question.

Why should you care?

Knowing the fundamental principles of crawling, indexing and ranking helps site owners adapt their websites so that search engines can read and understand them more easily, which in turn helps the sites appear in the right search results and rank better in Google and other search engines.