
Monday 26 March 2018

Why is the Deep Web not indexable?

There are several reasons why web pages cannot be indexed by traditional search engines. I have categorized them for your reference below.
  • Contextual Web: Pages whose content varies depending on the access context (for example, the client's IP address or previous navigation sequence).
  • Dynamic content: Pages which are returned in response to a submitted query or accessed only through a form, especially if open-domain input elements (such as text fields) are used; such fields are hard to navigate without domain knowledge (see the form-submission sketch after this list).
  • Limited access content: Sites that limit access to their pages in a technical way (e.g., using the Robots Exclusion Standard, CAPTCHAs, or the no-store directive, which prohibit search engines from browsing them and creating cached copies); see the robots.txt sketch after this list.
  • Non-HTML/text content: Textual content encoded in multimedia (image or video) files or specific file formats not handled by search engines.
  • Private Web: Sites that require registration and login (password-protected resources).
  • Scripted content: Pages that are only accessible through links produced by JavaScript as well as content dynamically downloaded from Web servers via Flash or Ajax solutions.
  • Software: Certain content is intentionally hidden from the regular Internet and is accessible only with special software such as Tor, I2P, or other darknet tools. For example, Tor allows users to access websites anonymously via .onion addresses, hiding their IP address.
  • Unlinked content: Pages that are not linked to by any other page, which may prevent web crawling programs from discovering them. Such pages are said to have no backlinks (also known as inlinks). In addition, search engines do not always detect all backlinks on the pages they do crawl (see the toy crawler sketch after this list).
  • Web archives: Web archival services such as the Wayback Machine enable users to see archived versions of web pages across time, including websites which have become inaccessible, and are not indexed by search engines such as Google.
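
To make the dynamic-content point concrete, below is a minimal Python sketch of what fetching a form-backed result page would involve. The endpoint and field names are invented for illustration; the point is that each result page only exists in response to a specific query, and a crawler has no way to guess meaningful values for open-ended input fields.

```python
# Hypothetical example: a results page that only exists in response to a
# submitted form. The endpoint and field names below are made up.
import urllib.parse
import urllib.request

params = urllib.parse.urlencode({"flight_from": "OSL", "flight_to": "JFK"})
request = urllib.request.Request(
    "https://example.com/search",    # hypothetical search endpoint
    data=params.encode("ascii"),     # sent as a POST body, not as a plain link
    method="POST",
)

# A crawler that only follows <a href="..."> links never builds a request
# like this, because it cannot enumerate meaningful values for open-ended
# input fields; the result pages therefore remain part of the deep web.
print(request.full_url, request.data)
```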
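
For the Robots Exclusion Standard, here is a minimal sketch (using Python's standard-library robots.txt parser) of the check a well-behaved crawler performs before fetching a page; the URLs are placeholders.

```python
# A polite crawler consults robots.txt before fetching anything. If the
# site disallows crawling, the page is never fetched and never indexed.
# The URLs below are placeholders.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # download and parse the site's robots.txt

page = "https://example.com/private/report.html"
if rp.can_fetch("MyCrawler", page):
    print("Allowed to crawl:", page)
else:
    print("robots.txt forbids crawling:", page)
```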
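
Finally, the unlinked-content case can be illustrated with a toy crawler over an invented in-memory link graph: a page that nothing links to is never discovered, no matter how long the crawl runs.

```python
# A toy link-following crawler over an in-memory "web". The page names and
# link graph are invented purely for illustration: "orphan.html" has no
# inlinks, so a crawler that only follows links never reaches it.
from collections import deque

LINKS = {
    "index.html": ["about.html", "blog.html"],
    "about.html": ["index.html"],
    "blog.html": ["post1.html"],
    "post1.html": [],
    "orphan.html": ["index.html"],  # links out, but nothing links to it
}

def crawl(start):
    seen, queue = {start}, deque([start])
    while queue:
        page = queue.popleft()
        for target in LINKS.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen

reached = crawl("index.html")
print(reached)                 # orphan.html is never reached
print(set(LINKS) - reached)    # -> {'orphan.html'}
```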
