How do search engines work?

The Fundamentals of Search Engines:

  1. INTRODUCTION:
  • The basic technique that every search engine relies on is CRAWLING. Search engines crawl webpages using their own programs, known as spiders, web crawlers or search engine bots, to access information.
  • They usually download the webpages they crawl and follow the links present on each page to extend the crawl. In short, by crawling one webpage and following its links, a crawler discovers new webpages.
  • THE SEARCH ENGINE INDEX is a data structure that holds all the discovered webpages and their URLs. Along with each URL, the following key factors are stored:
  • Keywords: The keywords that describe what the webpage is about.
  • Content: The type of content that was crawled.
  • Freshness: Whether, and how recently, the crawled webpage was updated.
  • User engagement: How previous users interacted with that webpage or domain.
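The crawl-and-discover loop described at the start of this section can be sketched as a breadth-first traversal. This is a minimal sketch, not how any particular search engine is built: the `FAKE_WEB` dictionary is a made-up stand-in for the internet so that the example runs without network access, and the names `LinkExtractor` and `crawl` are hypothetical.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical stand-in for the web: URL -> HTML body.
FAKE_WEB = {
    "https://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "https://example.com/a": '<a href="/b">B</a> <a href="/c">C</a>',
    "https://example.com/b": '<a href="/">home</a>',
    "https://example.com/c": "no links here",
}


def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl from seed; returns URLs in the order they were downloaded."""
    seen = {seed}
    queue = deque([seed])
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # the page could not be downloaded
            continue
        order.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return order


discovered = crawl("https://example.com/", FAKE_WEB.get)
print(discovered)
```

Starting from a single seed URL, the crawler reaches every page in the toy web by following links, which is why a page that nothing links to is hard for a crawler to find.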

Use of Search Engine Algorithm:

  • The basic purpose of this algorithm is to return relevant, fast and high-quality results for the queries posted by the user. For example: When you search for ‘Apples in India’ on Google, the search engine immediately returns a set of webpages with relevant keywords and high-quality content.
  • This improves user engagement. User satisfaction is also taken into account.
  • From the list of webpages displayed on the SERP (search engine results page), the user selects one or a few. These choices in turn influence how the quality of a webpage is judged, and by observing them, webpage owners learn how to improve their pages to earn a higher search rank.

What happens when a search is performed?

  • Even before a user starts a search, the search engine has already crawled almost all public webpages and stored them in an index.
  • When a user makes a search, the search engine retrieves the relevant webpages from the index and uses an algorithm to order them by rank. These ranks may vary between search engines.
  • Besides the search query/keywords, search engines also use other data to return relevant webpages on the SERP:
  • Location: When you type ‘hospitals near me’ into the search engine, a list of hospitals near your current location is displayed. Here, the search engine performs a location-dependent search.
  • Language detected: When the search engine detects the user’s language, it shows results in that language.
  • Previous search history: Depending on what the user has previously searched for, the search engine may return different results for the same query.
  • Device: Different sets of search results may be returned depending on the device used for the search. For example: Google can return different results for mobile searches and desktop searches.

Sometimes a few webpages/URLs might not be indexed by the search engine, for reasons such as:

  • A ‘robots.txt’ file tells crawlers which webpages to crawl and which not to.
  • A ‘noindex’ tag on a webpage tells the search engine not to index that particular webpage.
  • A ‘canonical’ tag can point the search engine to another, similar page to index instead.
  • A webpage with thin or low-quality content may not be indexed, as per the search engine algorithm, even if all its tags allow indexing. Duplicate or plagiarised content leads to the same outcome.
  • A URL that returns an error page, or that does not return any valuable output, will not be indexed by the search engine.
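To make the ‘noindex’ and ‘canonical’ signals from the list above concrete, here is a minimal sketch of how a crawler could read them out of a page’s HTML using Python’s standard `html.parser` module. The sample page, the example.com URL and the names `IndexingHints` and `indexing_hints` are hypothetical; a real search engine also considers HTTP headers and many other signals.

```python
from html.parser import HTMLParser


class IndexingHints(HTMLParser):
    """Reads the robots meta tag and the canonical link from a page's HTML."""

    def __init__(self):
        super().__init__()
        self.noindex = False
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            directives = (attrs.get("content") or "").lower().split(",")
            if "noindex" in [d.strip() for d in directives]:
                self.noindex = True
        elif tag == "link" and (attrs.get("rel") or "").lower() == "canonical":
            self.canonical = attrs.get("href")


def indexing_hints(html):
    """Return (noindex flag, canonical URL or None) for the given HTML."""
    parser = IndexingHints()
    parser.feed(html)
    return parser.noindex, parser.canonical


# Made-up page that asks not to be indexed and names a preferred URL.
page = """
<html><head>
  <meta name="robots" content="noindex, follow">
  <link rel="canonical" href="https://example.com/shoes">
</head><body>Duplicate listing</body></html>
"""
noindex, canonical = indexing_hints(page)
print(noindex, canonical)
```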
  2. SEARCH ENGINE CRAWLING:
  • In this process, search engines use their spiders to crawl webpages, download them, and extract their links to identify additional webpages to crawl.
  • These crawled webpages/URLs are then indexed so that a fast, high-quality SERP can be produced.
  • Crawling happens periodically so that the index reflects the latest version of each webpage. For example: If a site owner adds information to an already crawled webpage, Google will find that update when it crawls the webpage again and will update its index to show the updated version in the SERP.
  • As the first step, a search engine downloads the ‘robots.txt’ file of a website to understand which webpages to crawl and which not to.
  • A Sitemap referenced in ‘robots.txt’ additionally lists the webpages that should be crawled.
  • Search engines also use an algorithm to decide how frequently to crawl a webpage. For instance: if a webpage updates its content daily, the search engine may crawl it daily, whereas a webpage that updates once a month will be crawled less often.
  • You can identify which web crawlers have crawled your webpages, and how frequently, by examining the User-agent field in your web server’s logs.
  • Non-text files such as images carry little text for the search engine to crawl. When a title/filename and metadata/description are added to an image, the search engine can crawl that information. Though not much data is crawled and indexed from non-text files, the search engine still indexes whatever content they have, ranks them accordingly and generates traffic for them.
  • Search engines also look for new webpages by re-crawling webpages they already know. The links to newly found webpages are stored so that they can be downloaded later. This is how search engines expand their coverage to almost all public webpages, and it is why backlinks, internal links and outbound links matter.
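The point above about checking server logs can be sketched as follows. This assumes the common "combined" access log format, in which the User-agent is the last quoted field on each line. The log lines are made up for illustration, and the list of crawler tokens is illustrative rather than exhaustive. Note that a User-agent string can be spoofed, so a match is only a hint that the visitor was that crawler.

```python
import re
from collections import Counter

# Substrings that commonly appear in crawler User-agent strings (illustrative list).
CRAWLER_TOKENS = ["Googlebot", "bingbot", "YandexBot", "Baiduspider"]

# In the combined log format, the User-agent is the last quoted field.
USER_AGENT_RE = re.compile(r'"([^"]*)"\s*$')

# Made-up sample log lines using documentation-only IP addresses.
SAMPLE_LOG = """\
192.0.2.10 - - [10/Oct/2023:13:55:36 +0000] "GET / HTTP/1.1" 200 2326 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
198.51.100.4 - - [10/Oct/2023:14:02:10 +0000] "GET /about HTTP/1.1" 200 1180 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
203.0.113.7 - - [10/Oct/2023:14:05:44 +0000] "GET / HTTP/1.1" 200 2326 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
192.0.2.10 - - [10/Oct/2023:15:20:01 +0000] "GET /blog HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
"""


def count_crawler_hits(log_text):
    """Count requests per known crawler token found in the User-agent field."""
    hits = Counter()
    for line in log_text.splitlines():
        match = USER_AGENT_RE.search(line)
        if not match:
            continue
        user_agent = match.group(1).lower()
        for token in CRAWLER_TOKENS:
            if token.lower() in user_agent:
                hits[token] += 1
                break
    return hits


crawler_hits = count_crawler_hits(SAMPLE_LOG)
print(crawler_hits)
```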

What are SITEMAPS?

A sitemap is a file that lists the URLs of a website that should be crawled by the search engine spiders. Through it, even the deepest or most rarely visited webpages of a website can be crawled by the search engines and offered to users in the SERP.
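A sitemap is commonly written as an XML file that follows the sitemaps.org protocol. The sketch below builds one with Python’s standard `xml.etree.ElementTree` module. The page path and the date are made up for illustration, and `build_sitemap` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """pages: iterable of (absolute_url, last_modified 'YYYY-MM-DD' or None)."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        if lastmod:
            ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical pages, including one that is deep in the site.
sitemap_xml = build_sitemap([
    ("https://www.tipse.com/", "2024-01-15"),
    ("https://www.tipse.com/deep/rarely-linked-page", None),
])
print(sitemap_xml)
```

Each `<url>` entry holds one absolute URL in `<loc>`, with an optional `<lastmod>` date that tells the crawler when the page last changed.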

  • Apart from the methods mentioned above, you can manually ask the search engine to crawl a particular webpage by entering its URL in the search engine’s relevant interface. This is mostly done to speed up crawling and indexing after you update your webpages.
  • However, you can make only 10 page submissions per day.
  • For large volumes of content, use a sitemap in XML format for better crawling by Google.
  • In most cases, the response time for sitemaps and page submissions is the same.
  3. SEARCH ENGINE INDEXING:
  • Search engines store most of the webpages/URLs that they crawl in their INDEX.
  • This has reduced the response time so much that results appear almost instantly.
  • In other words, indexing is the process through which search engines organise the content they have crawled before a user makes a search. When a search is made, the relevant URLs from the index are displayed in the SERP.
  • To produce instant search results, the search engine uses a technique called INVERTED INDEXING (also called a REVERSE INDEX). Rather than listing all the words found in each crawled document, the search engine lists, for each word, the documents that contain it. This simplifies lookups and reduces the response time, as the search engine only needs to scan the index entries for the words in the query. This is why KEYWORDS are necessary.

For example: when the user searches for ‘Apples in India’, the search engine will display those websites that contain the words Indian Apples, Apple in India, or a similar combination of words.
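The inverted index idea can be sketched in a few lines. This is a simplified illustration: the three documents and their example.com URLs are made up, words are matched exactly after lowercasing, and there is no stemming, so ‘Apple’ and ‘apples’ count as different words here. Real search engines go much further.

```python
from collections import defaultdict

# Hypothetical crawled documents: URL -> text.
DOCUMENTS = {
    "https://example.com/fruit": "Apples in India are grown in Kashmir",
    "https://example.com/export": "India exports mangoes and apples",
    "https://example.com/tech": "Apple opens a store in India",
}


def tokenize(text):
    """Split text on whitespace and lowercase each word."""
    return [word.lower() for word in text.split()]


def build_inverted_index(documents):
    """Map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in documents.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every word of the query."""
    word_sets = [index.get(word, set()) for word in tokenize(query)]
    if not word_sets:
        return set()
    return set.intersection(*word_sets)


index = build_inverted_index(DOCUMENTS)
results = search(index, "Apples in India")
print(results)
```

Answering the query only requires looking up three index entries and intersecting them, rather than scanning the full text of every document.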

  • Apart from indexing, a search engine also caches a highly compressed, text-only version of each webpage, including its HTML and metadata. The cached copy is the latest version of the webpage that the search engine has seen. You can access it by clicking the ‘green down arrow’ next to each search result. Bing does not currently support this.
  • PAGE RANK plays a vital role in determining which webpage/URL should come first on the SERP. Google uses the PageRank method as part of its own algorithm, which looks at the backlinks a webpage has to determine its pagerank. Google no longer displays pagerank publicly, so you can use third-party tools, such as those from Moz.com, to estimate it.

NOTE: Each of these tools uses its own algorithm, which need not be the same as Google’s.

  • When a link is marked rel="nofollow", it tells the search engine not to count that link, so it does not add to the pagerank of the page it points to.

For example: If a webpage is linked from 5 different webpages, each followed link contributes to its pagerank, while a nofollow link does not.

  • The pagerank is determined by how many backlinks a page has and also by the rank of the pages that give those backlinks.
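The idea that a page’s rank depends on both the number of backlinks and the rank of the linking pages can be illustrated with the textbook PageRank iteration. This is a sketch of the classic published formulation with the commonly cited damping factor of 0.85. It is an assumption that it resembles what any search engine runs today, and the four-page link graph is made up.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: page -> list of pages it links to. Returns page -> score."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank equally among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical link graph: three pages link to "home"; "home" links to "about".
GRAPH = {
    "home": ["about"],
    "about": ["home"],
    "blog": ["home"],
    "shop": ["home"],
}
ranks = pagerank(GRAPH)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

In this toy graph "home" scores highest because it has the most backlinks, and "about" scores well with a single backlink because that link comes from the highly ranked "home". In this model, a nofollow link would simply be left out of the `links` dictionary.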
  4. DIFFERENT SEARCH ENGINES AND THEIR DIFFERENCES
ATTRIBUTES COMPARED FOR GOOGLE, BING, YANDEX AND BAIDU:

ORIGIN
  • Google: By Google.
  • Bing: By Microsoft.
  • Yandex: Famous in Russia.
  • Baidu: Famous in China.

DEVICE INDEXING
  • Google: Planning to launch mobile-first indexing, which gives more importance to the mobile version than to the desktop version. Also planning to factor a page’s loading time on mobile into its rank.
  • Bing: No plans of having a mobile-first index like Google.
  • Yandex: In 2016, launched its mobile-friendly algorithm, which marks each page as mobile-friendly or not.
  • Baidu: Also focuses on the mobile-friendly features of a webpage. For webpages that are not mobile-friendly, it uses its own algorithm to convert them into mobile-friendly ones.

BACKLINKS FOR RANKING
  • Google: Originally focused on the quantity of backlinks. Because of the many low-quality backlinks, Google now focuses on both the quality and the quantity of backlinks.
  • Bing: Focuses on the quality of the backlinks.
  • Yandex: Though it sees the quality of backlinks as an important factor, it does not allow it to affect its ranking algorithm.
  • Baidu: Backlinks from Chinese websites have more value than those from others.

SOCIAL MEDIA FOR RANKING
  • Google: Does not use social media as a ranking factor, to avoid misleading and incomplete data.
  • Bing: Allows social signals to affect your rank. When you are popular on social media, you will gain more rank on Bing.
  • Yandex: Takes only a few ranking signals from social media.
  • Baidu: Does not use social media signals for ranking pages.

5. CRAWLING BUDGET:

This refers to the number of webpages/URLs in a website that a search engine will crawl over in a given time period.

Put together, the crawl rate and the crawl demand determine the crawl budget that a search engine allots to a website.

  • Limits are placed on the crawl budget to avoid overloading the website’s server, which would degrade performance for users if left unchecked.
  • A website on a dedicated server tends to have a larger crawl budget, while one on an outsourced or shared server tends to have a smaller crawl budget.
  • Response speed also matters: a website hosted on a shared server that responds quickly will get more crawl budget than a website that responds slowly even on a dedicated server. (Crawl health matters.)
  • No matter what, keep your content new, unique and engaging for better page rank and SERP results.

CRAWL RATE refers to the number of URLs that a search engine will attempt to crawl per second. It is proportional to the number of HTTP connections that the crawler opens simultaneously.

CRAWL RATE LIMIT refers to the maximum number of fetches that can be made without affecting the user experience.

CRAWL DEMAND:

  • The crawl rate varies according to the crawl demand for each page.
  • Crawl demand refers to the demand that users show for pages that are already indexed.
  • When many users want to see certain webpages, those webpages are treated as popular and are crawled more frequently.
  • New or regularly updated webpages also have more demand from users than old, static pages.
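As a toy illustration of how demand could decide which pages get crawled first when the budget is limited, the sketch below scores pages by popularity and freshness and picks the top few. The scoring formula, the field names and the sample numbers are all assumptions made up for this example. Search engines do not publish their actual formulas.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    daily_visits: int        # stand-in for popularity (assumption)
    days_since_change: int   # stand-in for freshness (assumption)


def crawl_demand(page):
    """Toy score: popular and recently changed pages score higher."""
    freshness = 1.0 / (1 + page.days_since_change)
    return page.daily_visits * freshness


def plan_crawl(pages, crawl_budget):
    """Pick the crawl_budget pages with the highest demand score."""
    ranked = sorted(pages, key=crawl_demand, reverse=True)
    return [page.url for page in ranked[:crawl_budget]]


# Made-up pages of one website.
PAGES = [
    Page("/news", daily_visits=900, days_since_change=0),
    Page("/pricing", daily_visits=400, days_since_change=30),
    Page("/blog/new-post", daily_visits=50, days_since_change=1),
    Page("/archive/2009", daily_visits=5, days_since_change=4000),
]
plan = plan_crawl(PAGES, crawl_budget=2)
print(plan)
```

With a budget of two pages, the popular and freshly updated pages are chosen, while the old archive page waits for a later crawl.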

Managing the crawl budget is necessary, especially for large websites that hold many webpages. Its main aim is to get webpages re-crawled faster so that new information reaches the search engine’s index.

Avoid low-value URLs so that a large portion of the crawl budget is not wasted on useless links. To do this, you can disallow crawling of duplicate pages, pages with thin content, etc. This is where ‘robots.txt’ is put into action.
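As a sketch of how such rules behave, the example below feeds a small robots.txt to Python’s standard `urllib.robotparser` module and checks which URLs a crawler may fetch. The rules and the page paths are hypothetical, chosen to represent low-value internal search and print pages.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: keep crawlers out of internal search results and print views.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /print/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, user_agent="Googlebot"):
    """True if the rules above let user_agent fetch url."""
    return parser.can_fetch(user_agent, url)


print(allowed("https://www.tipse.com/articles/seo-basics"))
print(allowed("https://www.tipse.com/search?q=apples"))
print(allowed("https://www.tipse.com/print/seo-basics"))
```

A `Disallow` rule matches by path prefix, so `/search` also covers `/search?q=apples`. Keep in mind that these rules control crawling, which is not the same as a ‘noindex’ instruction.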

Also avoid URL sprawl by creating URLs only for high-quality and important webpages.

Broken links also consume crawl budget, so they should be avoided. Check for such links regularly and keep them to a minimum.

Avoid giving more than one URL to a single webpage, and avoid URL redirects, to keep your crawl budget under control.
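One way to spot a webpage that is reachable under several URLs is to normalise each URL and group the ones that collapse to the same address. The sketch below uses Python’s standard `urllib.parse` module. The list of ignored parameters and the rule of dropping trailing slashes are assumptions for this example, and the URLs are made up; whether two URLs really serve the same page depends on the website.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Query parameters assumed not to change the page content (hypothetical list).
IGNORED_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid"}


def normalize(url):
    """Lowercase the host, drop the fragment and ignored parameters, sort the rest."""
    parts = urlsplit(url)
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in IGNORED_PARAMS
    )
    path = parts.path.rstrip("/") or "/"  # assumption: trailing slash is insignificant
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), "")
    )


def group_duplicates(urls):
    """Return normalized URL -> original URLs, for groups with more than one URL."""
    groups = {}
    for url in urls:
        groups.setdefault(normalize(url), []).append(url)
    return {key: value for key, value in groups.items() if len(value) > 1}


URLS = [
    "https://www.tipse.com/shoes/",
    "https://www.tipse.com/shoes?utm_source=newsletter",
    "https://WWW.tipse.com/shoes#reviews",
    "https://www.tipse.com/shoes?color=red",
]
duplicates = group_duplicates(URLS)
print(duplicates)
```

The first three URLs collapse to one address, so a crawler that fetched all of them would spend three requests on a single page.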

ROBOTS.TXT:

  • This file tells web spiders not to crawl the specified webpages, both to avoid wasting crawl budget and for privacy purposes.
  • It has to be a plain text file encoded in UTF-8.
  • Keep the file under the size limit, as each search engine has its own limit. For Google, it is 500 KB.
  • Always place ‘robots.txt’ in the root of the domain.

For example: https://www.tipse.com/robots.txt

A ‘robots.txt’ file applies only to the protocol and host it is served from. The file above does not affect http://www.tipse.com or other hosts, as each of these needs its own ‘robots.txt’.
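A short sketch of this scoping rule: the robots.txt that governs a page is found by keeping the page’s protocol and host and replacing everything else with `/robots.txt`. The helper name `robots_txt_url`, the page paths and the `shop.tipse.com` subdomain are hypothetical.

```python
from urllib.parse import urlsplit


def robots_txt_url(page_url):
    """Return the robots.txt URL that governs page_url."""
    parts = urlsplit(page_url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


print(robots_txt_url("https://www.tipse.com/blog/post?id=7"))
print(robots_txt_url("http://www.tipse.com/blog/post"))
print(robots_txt_url("https://shop.tipse.com/cart"))
```

The three pages map to three different robots.txt files, because they differ in protocol or host.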

  • Use ‘robots.txt’ rules sparingly, so that more of your website is crawled and you benefit from it.
  • Keep the architecture of your website clean and simple, for easy crawling.
  • Block low-quality webpages with ‘robots.txt’ only when you cannot fix them in a short period.
  • As per Google’s recommendations, you should block crawling only when you have server issues or crawl efficiency issues.

There are a few kinds of pages which you may not want crawled, and they are:

  1. Pages that are not sorted properly.
  2. User-generated content.
  3. Pages that have sensitive information.
  4. Large numbers of internal search pages.

Avoid relying on ‘robots.txt’ in these cases:

  1. Let the search engine crawl your website thoroughly so that it can understand the pages’ layouts, designs, CSS, code, etc. Do not block the crawling of JavaScript files, as this can lead to algorithmic penalties.
  2. To keep parameterised pages from being crawled, either use Google Search Console or place the parameters in a URL fragment (/page#sort=price). Also add rel="nofollow" to links that carry URL parameters.
  3. When you use ‘robots.txt’, you might lose the page value that backlinks give you.
  4. A blocked webpage can still get indexed, so blocking its contents only wastes them.
  5. Do not block pages that are shared on social media, as this will affect the formation of your page’s snippet.

USE OF SITEMAPS IN ROBOTS.TXT:

  • The sitemap in robots.txt will help the search engine find all URLs of a website.
  • Always use absolute URLs instead of relative URLs when placing sitemaps in robots.txt.

Eg: https://www.tipse.com/sitemap.xml

  • Sitemaps can be placed on external domains too.
  • However, in Google Search Console, these sitemaps will not be available unless they are manually submitted.
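A sitemap is referenced in robots.txt with a `Sitemap:` line that carries the absolute URL. As a sketch, the example below reads such a line back with Python’s standard `urllib.robotparser` module. The `site_maps()` method needs Python 3.8 or later, and the `Disallow` rule is a hypothetical addition for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that references the sitemap by its absolute URL.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search

Sitemap: https://www.tipse.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns the listed sitemap URLs, or None if there are none.
sitemaps = parser.site_maps()
print(sitemaps)
```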
