How Internet Search Engines work
I published a short article on how search engines work back in 2002 and thought it would be a good idea to present it here as well. The information it contains is still valid, so take your time, read through the article, and post your comments and questions in the usual place.
1. Search engine structure
A search engine consists of eight interwoven elements:
- Url Server, Crawler, Parser, Store Server, Lexicon, Hit List, Repository, Searchers
The Url Server manages a list of unverified urls. New urls are added to the Url Server in different ways, for example through a submission form on the search engine's website, or when a visited page contains links to urls that have not been verified yet. Each url gets a so-called docID, which is easier to archive than the full url.
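The docID mapping described above can be sketched in a few lines of Python. The class and method names here are my own illustration, not part of any real engine:

```python
class UrlServer:
    """Keeps a queue of unverified urls and assigns each a compact docID."""

    def __init__(self):
        self.doc_ids = {}   # url -> docID
        self.pending = []   # unverified urls waiting for the Crawler

    def add_url(self, url):
        # Assign a new docID only if we have never seen this url before.
        if url not in self.doc_ids:
            self.doc_ids[url] = len(self.doc_ids)
            self.pending.append(url)
        return self.doc_ids[url]

    def next_url(self):
        # Hand the oldest unverified url to the Crawler.
        return self.pending.pop(0) if self.pending else None

server = UrlServer()
server.add_url("http://www.example.com/")   # gets docID 0
server.add_url("http://www.example.com/a")  # gets docID 1
server.add_url("http://www.example.com/")   # already known, keeps docID 0
```

The integer docID is what the other components pass around; the full url is only needed when the Crawler actually fetches the page.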
The Crawler gets unverified urls from the Url Server and resolves each url to an ip address using DNS. As soon as the ip address is available it opens a HTTP connection to that ip. If this is successful it issues a GET request to retrieve the page's contents (the source). The page content is then transferred to the Store Server, which compresses it.
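The fetch-and-compress step can be sketched with the Python standard library; this is a simplified illustration of the flow described above, not how Google's crawler is actually implemented:

```python
import gzip
import socket
from urllib.parse import urlparse
from urllib.request import urlopen

def crawl(url):
    """Resolve the hostname via DNS, then GET the page source,
    as the Crawler in the article does."""
    ip = socket.gethostbyname(urlparse(url).hostname)  # DNS lookup
    with urlopen(url) as resp:                         # HTTP GET
        return ip, resp.read()

def store(source):
    """The Store Server compresses the raw page source before archiving."""
    return gzip.compress(source)
```

A real crawler would also respect robots.txt, handle timeouts and redirects, and fetch many pages in parallel; those details are left out here.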
The Parser decompresses the sources retrieved from the Store Server. The source is then analysed in the following way: first the parser looks for words that are not yet in its Lexicon; any new word is added to the Lexicon. Words that already exist in the Lexicon are added to the Hit List with a note on how often they occur in the source. Additionally, information such as the title and part or all of the text is saved in the Repository.
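The parsing step ties the three data structures together, which a small sketch makes concrete. The dictionaries below are stand-ins for the Lexicon, Hit List and Repository, chosen for illustration only:

```python
import gzip
import re

lexicon = {}     # word -> wordID (pointer into hit_lists)
hit_lists = {}   # wordID -> {docID: occurrence count}
repository = {}  # docID -> stored page information

def parse(doc_id, compressed_source, title):
    """Decompress a page and index its words, as the Parser does."""
    text = gzip.decompress(compressed_source).decode("utf-8")
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        if word not in lexicon:            # new word: extend the Lexicon
            lexicon[word] = len(lexicon)
        word_id = lexicon[word]
        hits = hit_lists.setdefault(word_id, {})
        hits[doc_id] = hits.get(doc_id, 0) + 1   # count occurrences
    repository[doc_id] = {"title": title, "text": text[:200]}
```

Storing per-document occurrence counts in the Hit List is what later lets the Searcher rank pages without re-reading their full text.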
The Lexicon contains all words that the parser has found in all processed urls; each word has a pointer to that word's hit list.
The Hit List has pointers to the Repository. This makes it possible for the search engine to present results pretty fast. All information about the site that is stored in the repository is presented in the search result window. (normally title, url and the first line(s) of the page)
The Searcher is the link between the user and the search engine. Users enter search phrases in the searcher, hit enter and the searcher uses the Lexicon and the Hit List to present results.
Example:
----------
User types http://www.google.com/ into his web browser, then enters the search phrase "security website" into the search field. (which is the Searcher component)
The Searcher looks up both words in the Lexicon and follows their pointers into the Hit List. It takes the first 10 entries from the Hit List, follows their pointers into the Repository, and builds a results page containing those 10 entries, each with the title and a few lines of the page taken from the Repository.
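The lookup chain Lexicon → Hit List → Repository can be shown in code. The index contents and docIDs below are made up for the example:

```python
# Toy index: the Lexicon points at hit lists, which point into the Repository.
lexicon = {"security": 0, "website": 1}
hit_lists = {0: {10: 3, 11: 1}, 1: {10: 2, 12: 5}}   # wordID -> {docID: hits}
repository = {
    10: {"title": "Security Website", "snippet": "A site about security..."},
    11: {"title": "Security News", "snippet": "Daily security news..."},
    12: {"title": "My Website", "snippet": "Personal homepage..."},
}

def search(phrase, limit=10):
    """Follow Lexicon -> Hit List -> Repository for every query word."""
    doc_sets = []
    for word in phrase.lower().split():
        word_id = lexicon.get(word)
        if word_id is None:
            return []                      # unknown word: no results
        doc_sets.append(set(hit_lists[word_id]))
    docs = set.intersection(*doc_sets)     # pages containing all words
    return [repository[d] for d in sorted(docs)][:limit]

results = search("security website")       # only docID 10 contains both words
```

Because the heavy work was done at indexing time, answering a query is just a handful of dictionary and set lookups, which is why results appear so quickly.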
The most important element of a search engine is its rating system. Search engines rate urls to determine which results are displayed first when a user starts a search. Since every search engine rates differently, I've chosen Google as the representative example.
Google awards points to every url; the more points a url has, the higher it ranks in the Hit List. Important elements that add points are the url itself, the title, the keywords, the content, headings and so on. Additionally, urls get points for every link that points to them from other urls. Links from urls that have a high rating themselves are worth more than links from urls with low ratings.
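The "links from highly rated pages are worth more" idea is the core of PageRank, which can be sketched as a simple iteration. The damping factor 0.85 is the commonly published value, and the link graph in the example is made up:

```python
def rank(links, iterations=50, d=0.85):
    """Iteratively give each page points from the pages linking to it;
    links from highly rated pages carry more weight."""
    pages = set(links) | {t for ts in links.values() for t in ts}
    score = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: (1 - d) / len(pages) for p in pages}
        for page, targets in links.items():
            if targets:
                # A page splits its own score evenly among its outgoing links.
                share = d * score[page] / len(targets)
                for t in targets:
                    new[t] += share
        score = new
    return score

# Hypothetical graph: a and c both link to b, b links back to a.
scores = rank({"a": ["b"], "b": ["a"], "c": ["b"]})
```

Page b ends up with the highest score because two pages link to it, and a outranks c because its single inbound link comes from the well-rated b. Real ranking combines such a link score with the on-page signals (title, headings, keywords) mentioned above.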
2. Advanced search methods:
Search Engines use Boolean operators for advanced searches. The operators AND, OR and NOT are used by every known search engine.
word1 AND word2 means that the search engine looks for urls that contain both word1 and word2; urls with only one of the words are not displayed in the results
word1 OR word2 means that the search engine displays all urls that contain word1 or word2, or both
word1 NOT word2 means that the search engine looks for pages that contain word1 but not word2
Examples: (google)
----------------------
"Clinton President" - displays results with urls that contain both Clinton and President; pages where one of them is missing are not shown
"Clinton AND President" - same result as above
"Clinton +President" - same result as above
"Clinton OR President" - displays results with urls that contain either Clinton or President or both
"Clinton NOT President" - displays results with urls that contain Clinton but NOT President
"Clinton -President" - same results as above
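On an inverted index, the three Boolean operators are just set operations. The index contents below are made up for the example:

```python
# Toy inverted index: word -> set of docIDs containing it (made-up data).
index = {
    "clinton": {1, 2, 3},
    "president": {2, 3, 4},
}

def AND(w1, w2):
    """Documents containing both words (intersection)."""
    return index.get(w1, set()) & index.get(w2, set())

def OR(w1, w2):
    """Documents containing either word (union)."""
    return index.get(w1, set()) | index.get(w2, set())

def NOT(w1, w2):
    """Documents containing the first word but not the second (difference)."""
    return index.get(w1, set()) - index.get(w2, set())
```

This is why Boolean queries are cheap to answer: the engine never scans page text at query time, it only combines precomputed docID sets.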
Google uses additional features, those are explained below.
"allintitle" - (allintitle:deny security), displays urls that contain all terms in their title
"allinurl" - (allinurl:security advise), displays urls that contain all terms in their url.
"cache" - (cache:www.deny.de), displays a cached version of the requested url
"date" - (google advanced search, no shortcut), presents results of a specific period
"filetype" - (deny filetype:pdf), displays urls that contain the search terms and the specified file type.
"info" - (info:www.deny.de), Google lists stored information about the requested website
"intitle" - (intitle:security advise), displays urls that have the first term in their title and the others anywhere on the page.
"inurl" - (inurl:security advise), displays urls that contain the first term in their url and the other terms anywhere on the page
"languages" - (google advanced search, no shortcut), displays only results in a specified language
"link" - (link:www.deny.de), displays all urls that link to the page
"occurrences" - (google advanced search, no shortcut), specifies where the search term has to occur on the page
"phrase searching" - ("to keep an eye on"), only displays urls where the phrase appears exactly like this. It won't show results where only part of the phrase is present
"related" - (related:www.deny.de), displays similar pages only
"safe search" - (google advanced search, no shortcut), filters out webpages unsuitable for minors, for example pornographic pages
"site" - (security site:www.securityadvise.de), searches only the specified site for the terms
"spell" - (spell:advise), spell checks the term
"stocks" - (stocks:YHOO), displays financial information about companies. You need the company's ticker code to search for it; you can look that code up at Yahoo (http://finance.yahoo.com/)