What's the deep web and how do you use it?

Here’s a quick guide on using the deep web including a number of sites you might find interesting, such as The Wayback Machine and Elephind.

Do you use the deep web, and if so, which sites do you use the most?

7 Likes

A useful article that deals with the confusion many people have over the deep vs. the dark web. A couple of issues:

  1. I would be incredibly surprised if the dark web were larger than the ‘surface’ web, yet the article refers to them as constituting 6% and 4% respectively of all web content.
  2. No localisation? Trove (National Library of Australia) appears in a screenshot but not in the article proper. You can also search the National Archives of Australia for records, along with other Commonwealth, state and territory government websites.
  3. A warning for anyone who has a reason to use the dark web (Tor). Governments have increasingly worked on ways to identify specific users on the Tor network, mainly by owning routers within the network and using some fancy maths. If you are likely to draw serious government attention, then there are ways to assist in obfuscating your use. Just remember that if someone wants to find you badly enough, they probably can (Dread Pirate Roberts is a case in point).
  4. Nerd point here, but I suggest it would be useful to describe how search engines work (‘bots’ crawling the web and indexing what they find, except where a page says ‘do not index’) to give readers a better understanding of the end results. A toy sketch follows below.
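For the curious, here is a minimal sketch of that crawl-and-index loop in Python, assuming nothing beyond the standard library. A real search engine is vastly more sophisticated, but the shape is the same:

```python
# A toy sketch of the crawl-and-index loop using only the standard
# library. A real bot also honours <meta name="robots" content="noindex">,
# rate limits, politeness delays, and much more.
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, max_pages=10):
    robots = RobotFileParser()
    robots.set_url(urljoin(start_url, "/robots.txt"))
    robots.read()                       # bots are *asked* to honour this file

    index = {}                          # url -> raw page text (a toy 'index')
    queue, seen = [start_url], set()
    while queue and len(index) < max_pages:
        url = queue.pop(0)
        if url in seen or not url.startswith(("http://", "https://")):
            continue
        if not robots.can_fetch("*", url):
            continue                    # respect Disallow: rules
        seen.add(url)
        html = urlopen(url).read().decode("utf-8", errors="replace")
        index[url] = html               # a real engine tokenises and ranks
        extractor = LinkExtractor()
        extractor.feed(html)
        queue.extend(urljoin(url, link) for link in extractor.links)
    return index
```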
6 Likes

I’d add another point that could be mentioned.

Some commercial sales sites add lots of keyword metadata, tags, and text to their pages in ‘surface’ web findable directories; these pages are then inspected by the search bots and indexed for the search engines.

The site then shows up prominently in a search for that product. Sometimes the company pays to be listed prominently. Google does this, but at least tells you when a site has paid to be listed near the top of the search results.

With others, you go to the site and then find no such product on their ‘deep’ web pages.
A misleading and annoying practice, nearly as bad as clickbaiting.

4 Likes

An entire industry has been built out of being ‘findable’ online, and appearing on ‘page one’.

1 Like

Nice post.
It is interesting to see that the Choice Community website attempts to block SEO-type bots, crawlers, and spiders, but gives more open access to the Google bot.
The robots.txt for this site is openly viewable, as is sitemap.xml.
One for the nerds to take a look at.
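If you'd rather check programmatically than eyeball the file, the Python standard library can do it. The forum URL below is assumed, and the bot names are just common examples:

```python
# Check which bots this forum's robots.txt welcomes, and where.
# The forum URL is assumed; swap in any site you like.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://community.choice.com.au/robots.txt")
rp.read()

# Googlebot vs. a couple of well-known SEO crawlers vs. everyone else.
for agent in ("Googlebot", "AhrefsBot", "SemrushBot", "*"):
    print(f"{agent}: can fetch '/' -> {rp.can_fetch(agent, '/')}")
```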

1 Like

At least one Tor user was caught because … the Tor browser was based on an out-of-date set of browser sources. In other words, the mainstream browser was half a dozen or more versions ahead and had had some important bugs fixed. (Presumably the solution to this is for a mainstream browser to adopt Tor support as core functionality.)

It is ironic to note that Tor was invented by government (specifically, the US government).

A crawler is, of course, free to ignore robots.txt, so one has to ask whether the relevant web server attempts to enforce it (in the case of publicly accessible content).

Indeed. It is important to realise therefore that what we see in the surface web is not actually the surface web but the surface web as seen through a distorting lens.


Regarding uncrawlable content, I believe there is an interface available into search engines whereby the site indexes its own content and then hands the index over to the search engine. This deals with, for example, content behind a paywall. So this makes it ambiguous as to whether that particular content forms part of the surface web or the deep web.

Paywall operators obviously do this in order to bring in new customers, and when you visit the page from the search results you are typically prevented from seeing the whole page or in some way restricted in access unless you sign up and pay up.

There are also sites that do that but which don’t require payment in monetary terms and instead insist on your signing up so that you pay up in privacy terms.

Personally, I wish they wouldn’t do this and that restricted content instead remained strictly in the deep web, rather than search engines producing results that I won’t be able to access.
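For what it's worth, one documented flavour of this handover is schema.org's ‘paywalled content’ markup, where a page gets indexed but declares that part of it is not free to read. The sketch below just prints the JSON-LD using Python; the property names follow schema.org, but the headline and CSS selector are made up:

```python
# A sketch of schema.org 'paywalled content' markup: the page is indexed,
# but declares that part of it is not free to read. Property names follow
# schema.org; the headline and CSS selector here are invented examples.
import json

markup = {
    "@context": "https://schema.org",
    "@type": "NewsArticle",
    "headline": "Example paywalled article",
    "isAccessibleForFree": False,
    "hasPart": {
        "@type": "WebPageElement",
        "isAccessibleForFree": False,
        "cssSelector": ".paywalled-body",   # the section behind the paywall
    },
}

# JSON-LD that would be embedded in the page inside a <script> tag.
print(json.dumps(markup, indent=2))
```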

While on the subject of crawlers, I note from my web server logs that Facebook operates a “crawler”. So if anyone ever posts on Facebook a link to a page on your web site then Facebook will immediately and periodically thereafter access that page, even after some years. It is possible that other social media companies do the same.

3 Likes

Sure. Same with sitemap.xml.
Both are files set up by site admins to help search bots find the information that the admins want users to find, and to indicate which bots are welcome.
If a search bot wants to make a pest of itself by ignoring the directions in these files, then it is the job of the web server to deal with it.
So if FB submits an HTTP request using its usual user agent of ‘facebookexternalhit’, then send it nothing in the web server processing.
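As a sketch of that idea, here it is as WSGI middleware in Python, though in practice you'd more likely do this in nginx or Apache config. The blocked name is the user agent mentioned above:

```python
# Drop requests by User-Agent, written as WSGI middleware.
BLOCKED_AGENTS = ("facebookexternalhit",)

def block_crawlers(app):
    """Wrap a WSGI app so that blocked user agents get nothing back."""
    def middleware(environ, start_response):
        agent = environ.get("HTTP_USER_AGENT", "")
        if any(bot in agent for bot in BLOCKED_AGENTS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b""]                # send it nothing
        return app(environ, start_response)
    return middleware

def app(environ, start_response):       # a trivial app to wrap
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello"]

application = block_crawlers(app)       # hand this to any WSGI server
```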

1 Like

Back at the turn of the millennium, I submitted a number of autobiographical short stories to a site in the United States called anotherstory.com.

Its business model was to allow anyone to read anything on the site for free – and almost everything on the site was fiction, although an unkind reviewer might say the same thing about my autobiography – but there was a modest charge to download/save or to print, the proceeds of which were to be shared between the owners of the site and the author.

The site started throwing a 404 error about three weeks later, which coincided with the general carnage commonly called the Tech Wreck.

Unfortunately, one of those short stories was lost to me as a local digital file, but when I first heard about the Wayback Machine a decade later I assumed it would be able to find that missing story on some archived version of anotherstory.com.

For reasons I don’t understand, it only finds anotherstory.com from the end of 2007, by which stage it seems to have morphed into a set of links to the likes of audible.com, with nothing by way of stories to read on the site, let alone anything from 2000.

Any thoughts?

1 Like

Your story site disappeared before Internet archiving really got going.
Wayback was just a hobby site at the time of the ‘tech wreck’ in the early 2000s.
Even Google didn’t really get going until the mid 2000s.

2 Likes

Thanks for that news, Gregr, disappointing as it is.

Google was around in 2000, since I occasionally used it when even my favourite search engine aggregator, Copernic, wasn’t finding me what I wanted.

It frustrated me that Google refused to be part of Copernic, but it seems they had the better business model, since not only does nobody use Copernic any longer for searching the web, but neither does anybody use any of the search engines it used to aggregate (except perhaps for Yahoo, if you think today’s Yahoo Search is a direct descendant).

1 Like