Discourse search engine

syncretic · 19 August 2023 23:04

When searching on the community it seems inconsistent and unnecessarily difficult to search for short strings alone if they can be substrings of other words, which many are.

If you search for bot you get all the articles containing words that start with bot, e.g. bottle, botany, both, bother, as well as bot itself. But it does not return robot.

Why does the software assume that if I type the start of a word I could want all possible words that follow? Aren’t too many spurious results just as bad at obscuring information as failing to return the correct matches? Why apply this rule only when the substring starts the word? This is not predictive text; it saves nothing at all and only adds cost.

If you add one or more trailing spaces to your term they are truncated, so bot[space] returns the same list as above+++. In effect you cannot say the t at the end is the end of the word.

If you enclose the search term in quotes they are ignored, so “bot” returns the same as bot. In effect it is the same as the last case.

However, if you do both and search for “bot[space]” it works and returns only the articles with the string bot.

Why are quotes ignored sometimes and not others?

As there is no documentation it is assumed (like so much modern software) that the user will learn by trial and error and the occasional hint. How does such inconsistent behaviour make that easy?

+++ I am going to type [space] when I mean the character that is a space.

Gregr · 20 August 2023 04:05

The search facility used doesn’t seem to support any form of substrings using wildcards. Nor does it seem to support ANDing or NOTting of multiple search words.

Search features that have been in Google since day one.

SueW · 20 August 2023 21:20

I’m in a few forums using the same software and generally have found the search function a waste of time.

PhilT · 20 August 2023 21:28

Not defending Discourse search but.

While it does not perform the way ‘search users’ might be accustomed to, and certainly has shortcomings documented as being ‘known and addressed’ that are still open after a decade, many seem unaware of the hamburger menu when opening the search window.

It yields a list of search options, some of which have their own further pulldowns.

syncretic · 20 August 2023 23:34

Following on from my bot search example I have been reading about Discourse. It seems that the reason all words starting with bot are returned in a search is stemming. Stemming is the process of reducing forms of words to their stem, so affecting, affection and affectedly are all reduced to the stem word affect. Note that these are different parts of speech for the same, or a very similar meaning. This means the user does not have to choose the correct form to get a match. This is all good but:

Firstly, I would be most surprised if any of the words bottle, botany, both, bother are linguistically derived from bot. The matches are algorithmically derived. So the software is applying simplistic rules based on truncating common suffixes, not looking for stems that are related in meaning. So we get the kinds of stupid matches that have no connection to the user’s intent.

Secondly, none of the above explains why using quotes, “bot”, does not turn this imitation stemming off and search for the actual string as typed.
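
To make the stemming idea above concrete, here is a minimal Ruby sketch of suffix stripping. The suffix list and the `naive_stem` helper are invented for illustration only; nothing here is taken from Discourse or its database, and real stemmers use far more rules.

```ruby
# Hypothetical suffix list, longest first so "edly" is tried before "ed".
SUFFIXES = %w[edly ing ion ed s].freeze

# Strip the first matching suffix, keeping at least three letters of stem.
def naive_stem(word)
  w = word.downcase
  SUFFIXES.each do |suffix|
    if w.end_with?(suffix) && w.length - suffix.length >= 3
      return w[0, w.length - suffix.length]
    end
  end
  w
end

%w[affecting affection affectedly affect].each do |word|
  puts "#{word} -> #{naive_stem(word)}"
end
```

With these particular rules all four words come out as affect, while bottle, botany, both and bother are left unchanged rather than being reduced to bot.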

Gregr · 21 August 2023 03:53

I don’t think stemming is used. I think it is just that an index is searched rather than the actual content.

Indexes are arranged in a sorted tree structure, and will match and return links to topics that match the text entered.

Bot, bots, bottle, etc, will all be in the same branch structure to be searched, whereas robot, from which bot comes, will be in a completely different branch path leading from the letter ‘r’.

Once the topic list is determined by the index search, then filters can be used to narrow down what is found.

Supposedly, the search engine used in Discourse has the ability to do a full text search, FTS, which is slower, but I have been unable to make it work.
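
A rough Ruby sketch of the sorted-index idea described above, assuming the index is simply a sorted list of words. The word list and the `prefix_matches` helper are made up for illustration, and this is a guess at the general principle, not Discourse's actual index.

```ruby
# Hypothetical index: a sorted list of the words that occur in posts.
INDEX = %w[both bother bot botany bottle bots robot robots].sort.freeze

# Return every indexed word that starts with the given prefix.
def prefix_matches(prefix)
  # Binary search for the first word that sorts at or after the prefix.
  start = INDEX.bsearch_index { |word| word >= prefix }
  return [] if start.nil?
  # Words sharing the prefix sit next to each other in sorted order.
  INDEX[start..].take_while { |word| word.start_with?(prefix) }
end

p prefix_matches("bot")
```

Here a lookup for bot walks through bot, botany, both, bother, bots and bottle, and never reaches robot, because robot sorts under ‘r’.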

PhilT · 21 August 2023 04:07

I am not sure if this is the latest or even ‘it’ but feel free to wade through the code and report back.

github.com

discourse/discourse/blob/94cd5ac0b1b654ba55028c3cdead1bfb40af2991/app/controllers/search_controller.rb

# frozen_string_literal: true

class SearchController < ApplicationController

  before_action :cancel_overloaded_search, only: [:query]
  skip_before_action :check_xhr, only: :show
  after_action :add_noindex_header

  def self.valid_context_types
    %w{user topic category private_messages tag}
  end

  def show
    permitted_params = params.permit(:q, :page)
    @search_term = permitted_params[:q]

    # a q param has been given but it's not in the correct format
    # eg: ?q[foo]=bar
    if params[:q].present? && !@search_term.present?
      raise Discourse::InvalidParameters.new(:q)

This file has been truncated.

Gregr · 21 August 2023 04:14

Yikes…Ruby. No thanks.

syncretic · 21 August 2023 05:02

I don’t know but this suggests that it does. In any event it is not very useful.

isopeda · 21 August 2023 05:33

I have yet to find an online forum that has a good search function, so Discourse isn’t alone there.

I usually find I can get better results by using Google search, if the information is public. You can tell Google to limit the search to (eg) this site by adding “site:choice.community” to the search string.
[Edit 21/08/2023 19:05: just realised the typo in the original of that example! Adding “-” before “site” would exclude that site. Oops! ]

This article has some useful information for those unfamiliar with Google’s search operators.
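
For anyone who wants to script the site-restricted search mentioned above, a small Ruby sketch follows. The `site_search_url` helper is a made-up name; it only builds the URL and assumes Google's usual `q` query parameter.

```ruby
require "uri"

# Build a Google search URL restricted to one site via the site: operator.
def site_search_url(terms, site)
  query = URI.encode_www_form(q: "#{terms} site:#{site}")
  "https://www.google.com/search?#{query}"
end

puts site_search_url("bot", "choice.community")
```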

Gregr · 21 August 2023 14:48

Quite right, not very useful in some situations.

Stemming by stripping common suffixes like ‘s’, ‘es’ or ‘ed’ to make root words for a search is clumsy and inaccurate.

What would happen with ‘stares’, for instance?

The better approach would be lemmatization to derive root words.
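
A small Ruby sketch of the ‘stares’ problem, contrasting suffix stripping with a dictionary lookup. The `strip_suffix` and `lemma` helpers and the three-entry `LEMMAS` table are hypothetical; a real lemmatizer relies on a full dictionary and part-of-speech information.

```ruby
# Strip the first matching suffix from the list above ('es' is tried before 's').
def strip_suffix(word)
  %w[es ed s].each do |suffix|
    return word.delete_suffix(suffix) if word.end_with?(suffix)
  end
  word
end

# Hypothetical hand-written lemma table standing in for a real dictionary.
LEMMAS = { "stares" => "stare", "stared" => "stare", "stars" => "star" }.freeze

# Look the word up; fall back to the word itself if it is unknown.
def lemma(word)
  LEMMAS.fetch(word, word)
end

%w[stares stars].each do |word|
  puts "#{word}: stripped=#{strip_suffix(word)} lemma=#{lemma(word)}"
end
```

In this sketch stripping turns both stares and stars into star, so a search for one would match the other, while the lookup keeps stare and star apart.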

person · 22 August 2023 05:15

See also: How to search for whole words?

For my purposes, I find that the Discourse search engine works well but its limitations are well known. It doesn’t purport to be a fantastic, all-singing all-dancing, search engine. (That is a pity because only searching in local, structured content opens up a lot more possibilities than a general search engine that operates across the text of the whole public web.)

Bottom line: It’s open source so y’all feel free to submit a patch that adds more advanced search functionality.

Also, as I wrote last time:

A forum would only enable regex functionality for users with consideration and care because an accidental or malicious regex could chew up lots of server-side CPU but without achieving much or anything at all.

(The regex may have been supplied explicitly by the more advanced user or implicitly from the functionality offered by the search function.)
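
One cautious way to offer whole-word matching without accepting arbitrary regexes is to treat the user's term as a literal and build the pattern on the server. A minimal Ruby sketch, where `whole_word_pattern` and the length cap are invented for illustration and are not part of Discourse:

```ruby
MAX_TERM_LENGTH = 50 # arbitrary cap chosen for this sketch

# Build a case-insensitive whole-word pattern from untrusted input.
def whole_word_pattern(term)
  raise ArgumentError, "term too long" if term.length > MAX_TERM_LENGTH
  # Regexp.escape neutralises metacharacters, so the term matches literally.
  /\b#{Regexp.escape(term)}\b/i
end

pattern = whole_word_pattern("bot")
%w[bot bottle robot].each do |word|
  puts "#{word}: #{pattern.match?(word)}"
end
```

Because the user never supplies regex syntax directly, the pattern stays a simple literal between two word boundaries, which limits the scope for an expensive expression.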

syncretic · 22 August 2023 05:39

I would be happy with some logical consistency and a whiff of documentation of its existing functions.
