Discourse search engine

When searching on the community it seems inconsistent and unnecessarily difficult to search for short strings alone if they can be substrings of other words, which many are.

If you search for bot you get all the articles including words starting with bot, i.e. bottle, botany, both, bother etc., as well as bot. But it does not return robot.

Why does the software assume that if I type the start of a word I could want all possible words that follow? Aren’t too many spurious results just as bad at obscuring information as failing to return the correct matches? Why apply this rule only when the substring starts the word? This is not predictive text; it saves nothing and only adds cost.

If you add one or more trailing spaces to your term they are truncated, so bot[space] returns the same list as above+++. In effect you cannot say the t at the end is the end of the word.

If you enclose the search term in quotes they are ignored, so “bot” returns the same as bot. In effect it is the same as the last case.

However, if you do both and search for “bot[space]” it works and returns only the articles with the string bot.

Why are quotes ignored sometimes and not others?

As there is no documentation it is assumed (like so much modern software) that the user will learn by trial and error and the occasional hint. How does such inconsistent behaviour make that easy?


+++ I am going to type [space] when I mean the character that is a space.

6 Likes

The search facility used doesn’t seem to support any form of substrings using wildcards. Nor does it seem to support ANDing or NOTting of multiple search words.

Search features that have been in Google since day one.

I’m in a few forums using the same software and generally have found the search function a waste of time.

Not defending Discourse search but.

While it does not perform the way ‘search users’ might be accustomed to, and certainly has shortcomings documented as ‘known and addressed’ that are still open after a decade, many seem unaware of the hamburger menu shown when the search window is opened.

[image]

It yields a list of options, some of which have their own further pulldowns.

3 Likes

Following on from my bot search example I have been reading about Discourse. It seems that the reason all words starting with bot are returned in a search is stemming. Stemming is the process of reducing forms of words to their stem, so affecting, affection and affectedly are all reduced to the stem word affect. Note that these are different parts of speech for the same, or a very similar meaning. This means the user does not have to choose the correct form to get a match. This is all good but:

Firstly, I would be most surprised if any of the words bottle, botany, both, bother are linguistically derived from bot. The matches are algorithmically derived. So the software is applying simplistic rules based on truncating common suffixes, not looking for stems that are related in meaning. So we get the kinds of stupid matches that have no connection to the user’s intent.

Secondly, none of the above explains why using quotes, “bot”, does not turn this imitation stemming off and search for the actual string as typed.
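To make the stemming idea above concrete, here is a minimal sketch of a suffix-stripping stemmer in Python. The suffix list and the `naive_stem` name are my own assumptions for illustration; this is not the code Discourse runs, and real stemmers use far more rules. Under these assumed rules the three forms of affect collapse to one stem, while bottle, botany, both and bother are left alone, so plain suffix stripping would not by itself explain why they match bot.

```python
# Toy suffix-stripping stemmer. The suffix list is an assumption chosen
# for this example, not taken from any real search engine.
SUFFIXES = ("edly", "ing", "ion", "ed", "s")


def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem letters."""
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= min_stem:
            return stem
    return word


for word in ("affecting", "affection", "affectedly",
             "bot", "bots", "bottle", "botany", "both", "bother"):
    print(word, "->", naive_stem(word))
```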

I don’t think stemming is used. I think it is just that an index is searched rather than the actual content.

Indexes are arranged in a sorted tree structure, and will match and return links to topics that match the text entered.

Bot, bots, bottle, etc. will all be in the same branch of the tree to be searched, whereas robot, from which bot is derived, will be in a completely different branch leading from the letter ‘r’.
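A rough sketch of that idea, assuming the index behaves like a sorted list of terms searched by prefix. The term list and the `prefix_matches` helper are made up for illustration, and I have not checked how Discourse actually stores its index.

```python
from bisect import bisect_left


def prefix_matches(sorted_terms, prefix):
    """Return every term in a sorted list that starts with prefix."""
    # Binary search jumps to the first term that is >= prefix.
    start = bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        # Terms sharing the prefix sit next to each other, so stop at the first miss.
        if not term.startswith(prefix):
            break
        matches.append(term)
    return matches


index_terms = sorted(["robot", "bottle", "bot", "bother", "botany", "both", "bots"])
print(prefix_matches(index_terms, "bot"))
# ['bot', 'botany', 'both', 'bother', 'bots', 'bottle']
```

Robot never shows up because it sorts under ‘r’, far away from the run of terms beginning with bot.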

Once the topic list is determined by the index search, then filters can be used to narrow down what is found.

Supposedly, the search engine used in Discourse has the ability to do a full-text search (FTS), which is slower, but I have been unable to make it work.

I am not sure if this is the latest or even ‘it’ but feel free to wade through the code and report back.

Yikes…Ruby. No thanks.

1 Like

I don’t know but this suggests that it does. In any event it is not very useful.

I have yet to find an online forum that has a good search function, so Discourse isn’t alone there.

I usually find I can get better results by using Google search, if the information is public. You can tell Google to limit the search to (eg) this site by adding “site:choice.community” to the search string.
[Edit 21/08/2023 19:05: just realised the typo in the original of that example! Adding “-” before “site” would exclude that site. Oops! :flushed:]
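If you want to build such a search link yourself, something like the following should work. It is only a sketch: the `google_site_search_url` helper is a name I made up, and it assumes the usual `https://www.google.com/search?q=...` URL form.

```python
from urllib.parse import urlencode


def google_site_search_url(terms, site, exclude=False):
    """Build a Google search URL limited to (or excluding) one site."""
    # "site:" limits results to the site; "-site:" excludes it.
    operator = "-site:" if exclude else "site:"
    query = f"{terms} {operator}{site}"
    return "https://www.google.com/search?" + urlencode({"q": query})


print(google_site_search_url("bot", "choice.community"))
print(google_site_search_url("bot", "choice.community", exclude=True))
```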

This article has some useful information for those unfamiliar with Google’s search operators.

5 Likes

Quite right, not very useful in some situations.

Typically, stemming by stripping common suffixes like ‘s’, ‘es’ or ‘ed’ to make root words for a search is clumsy and inaccurate.

What would be the situation with ‘stares’, for instance?

The better approach would be lemmatization to derive root words.
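Here is a small sketch of the ‘stares’ problem. The suffix lists and the tiny `LEMMAS` dictionary are invented for this example; a real lemmatizer relies on a full vocabulary and part-of-speech information rather than a three-entry lookup.

```python
def strip_suffix(word, suffixes):
    """Remove the first suffix in the list that matches, if enough letters remain."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# Hypothetical lookup table standing in for a real lemmatizer's vocabulary.
LEMMAS = {"stares": "stare", "stared": "stare", "stars": "star"}


def lemmatize(word):
    """Look the word up; fall back to the word itself when it is unknown."""
    return LEMMAS.get(word, word)


print(strip_suffix("stares", ["es", "ed", "s"]))  # star
print(strip_suffix("stares", ["s", "es", "ed"]))  # stare
print(lemmatize("stares"))                        # stare
```

The stripped result depends only on which rule is tried first, so ‘stares’ can end up grouped with ‘star’, whereas the lookup returns the word it is actually a form of.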

See also: How to search for whole words?

For my purposes, I find that the Discourse search engine works well but its limitations are well known. It doesn’t purport to be a fantastic, all-singing all-dancing, search engine. (That is a pity because only searching in local, structured content opens up a lot more possibilities than a general search engine that operates across the text of the whole public web.)

Bottom line: It’s open source so y’all feel free to submit a patch that adds more advanced search functionality. :wink:

Also, as I wrote last time:

A forum would only enable regex functionality for users with consideration and care, because an accidental or malicious regex could chew up lots of server-side CPU without achieving much, or anything at all.

(The regex may have been supplied explicitly by the more advanced user or implicitly from the functionality offered by the search function.)
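For anyone who has not seen this happen, a minimal sketch using Python’s `re` module follows. The pattern `(a+)+$` is a well-known example of nested quantifiers that backtrack heavily on a near-miss input; the `time_failed_match` helper is a made-up name, and the timings will vary by machine, so treat the numbers as indicative only.

```python
import re
import time

# Nested quantifiers: many different ways to split a run of "a"s.
PATTERN = re.compile(r"(a+)+$")


def time_failed_match(n):
    """Time a match attempt against n letters 'a' followed by a 'b'."""
    text = "a" * n + "b"  # the trailing "b" makes the overall match fail
    start = time.perf_counter()
    result = PATTERN.match(text)
    elapsed = time.perf_counter() - start
    return result, elapsed


for n in (12, 16, 20):
    result, elapsed = time_failed_match(n)
    print(f"n={n:2d} matched={result is not None} seconds={elapsed:.4f}")
```

Each extra few characters of input should make the failed match noticeably slower, which is the kind of cost a public search box has to guard against.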

I would be happy with some logical consistency and a whiff of documentation of its existing functions.

2 Likes