Ticket #2733 (closed enhancement: fixed)

Opened 2 years ago

Last modified 1 year ago

FTS3 fulltext search implementation

Reported by: mithrandi Assigned to: mithrandi
Priority: highest Milestone:
Component: Mantissa Severity: normal
Keywords: Cc:
Estimated Completion (YYYY/MM/DD): Branch: branches/fts3-2733-3
Author: mithrandi

Description

SQLite has this FTS3 fulltext search thing, Mantissa should be able to use it.

Change History

09/23/08 15:46:48 changed by mithrandi

  • branch set to branches/fts3-2733.
  • author set to mithrandi.

(In [16722]) Branching to 'fts3-2733'

09/23/08 20:18:27 changed by mithrandi

  • keywords set to review.
  • owner changed from mithrandi to exarkun.
  • priority changed from normal to highest.

Well, this is a bit rough around the edges, and it doesn't implement the full API, but it's usable.

10/14/08 12:13:16 changed by exarkun

  • keywords deleted.
  • owner changed from exarkun to mithrandi.

On IRC, we talked about using a second table to implement the keyword feature.

10/15/08 02:06:39 changed by mithrandi

  • keywords set to review.
  • owner changed from mithrandi to exarkun.

I don't think using a second table is actually possible. Keyword values need to be fulltext indexed, which would mean the second table would need to be an FTS table as well; but since there would need to be multiple rows (one for each keyword name/value pair), we can't use the real docid column to store the docid, and you can't create an index on any user-defined columns in an FTS virtual table, so joining on the keyword table would perform a full table scan.

I also tried creating a new column for each keyword on the fly, but you can't alter a virtual table after it has been created.

Any other suggestions?

11/10/08 16:30:41 changed by exarkun

  • keywords deleted.
  • owner changed from exarkun to mithrandi.
  • completion_estimate_date changed.

Okay. I propose forgetting about keyword support in the FTS3 implementation. That's really sad, but the choice at this point seems to be between that and dropping the entire FTS3 implementation. Perhaps at some later time someone will discover a way to support this feature with FTS3, or maybe a glorious FTS4 is just over the horizon.

So, that in mind:

  1. test_fulltext.py
    1. merging conflicts with trunk, so please merge the branch forward
    2. the change to testDigitSearch isn't entirely great, but I see the motivation for it. So, instead of just removing this test, please also add another test which just covers the behavior testDigitSearch no longer covers. The FTS3 implementation can skip this since it won't work. The functionality should remain covered in general though, as the interface does intend to promise this behavior.
    3. Can test_DifficultTokens be made to pass if we do a little bit of preprocessing in Python to make these strings tokenize properly? Or maybe someday we can write a new tokenize for SQLite (in C, presumably :/)
  2. fulltext.py
    1. I guess it makes sense that indexing something with keywords silently drops the keywords but searching with keywords raises a noisy exception. The only other thing I would suggest here is that the NotImplementedError actually be handled by some UI somewhere, rather than it turning into a traceback for an end user.
    2. Lots of SQL statements here end with ;, but they don't need to.
    3. Some classes are missing docstrings, that should be fixed
    4. I'm not sure why someone decided to have all those global FOO_INDEX_DIR variables, but I don't see any reason to continue the tradition here. Can you just make that a class attribute or something?

11/10/08 21:43:22 changed by mithrandi

  • branch changed from branches/fts3-2733 to branches/fts3-2733-2.

(In [16997]) Branching to 'fts3-2733-2'

11/10/08 21:59:44 changed by mithrandi

  • keywords set to review.
  • owner changed from mithrandi to exarkun.
  1. test_fulltext.py
    1. Merged forward and resolved conflicts.
    2. The test that I removed from testDigitSearch seems to be a rather uninteresting test of keyword searching, which is tested in more detail by, say, testKeywordIndexing. Am I missing something here?
    3. The problem here is not the tokenization of the input, but what happens during searching, As far as I can tell, making test_DifficultTokens means reimplementing sqlite's tokenizing algorithm in Python, so that it can be applied to search terms; in this particular case, the search for u'718-555' needs to be transformed into something like u'718 555'. I'm not sure if this is a road we want to go down, but I'd like to hear your thoughts.
  2. fulltext.py
    1. I'm not sure what sort of UI you're talking about; I wasn't aware of any search UI in Mantissa itself.
    2. Removed the trailing semicolons in the SQL statements.
    3. Missing class docstrings added.
    4. Got rid of SQLITE_INDEX_DIR.

12/03/08 14:11:26 changed by exarkun

  • keywords deleted.
  • owner changed from exarkun to mithrandi.

Oops, you're right. The UI I was thinking of comes from Quotient.

Okay, the only other thing is that the FTS3 tests should be skipped if FTS3 isn't available (and we probably want to get a buildbot set up that actually has FTS3 so we can keep testing this, particularly since I think most people won't have FTS3 available for a while).

01/27/09 17:57:34 changed by mithrandi

(In [17169]) Skip FTS3 tests if FTS3 is not available. (refs #2733)

01/27/09 18:16:07 changed by mithrandi

  • keywords set to review.
  • owner changed from mithrandi to exarkun.

05/13/09 17:33:43 changed by glyph

  • keywords deleted.
  • owner changed from exarkun to mithrandi.

Looks almost good to land, except for the fact that this does apparently break the quotient search UI. The UI actually is in Mantissa (xmantissa.search), it's just the thing that actually glues it to full-text indexing that's in quotient (xquotient.quotientapp.MessageSearchProvider), and that should be in Mantissa anyway. I would maintain the same contract (not raise NotImplementedError) and just make keywords silently dropped on both sides, index and search. If you want, make it a better exception and actually handle it in Quotient.

I also have a few new tickets I'd like you to file.

  1. As far as I can tell, everything has been dealt with except point 1.3 and keyword indexing. I'm happy for this to go in without keyword indexing, but it should be done eventually.
  2. Regarding point 1.3, if the indexes generated by this branch can eventually support the "difficult tokens", then let's go ahead and defer that. When support is added, it should be broken up into multiple tests, hopefully ones with coding-standard-compliant name ;-), to illustrate the specific "difficult" cases and what exactly they're supposed to be.
  3. _SQLiteResultWrapper and _SQLiteIndex should be implementing some interface. But that's not a new failure in this branch, their analogues for other indexers don't implement interfaces either; they should all implement those interfaces.

05/14/09 14:55:05 changed by mithrandi

  • branch changed from branches/fts3-2733-2 to branches/fts3-2733-3.

(In [17279]) Branching to 'fts3-2733-3'

05/14/09 15:22:25 changed by mithrandi

(In [17283]) Make keyword searches fail silently instead of with an exception. (refs #2733)

05/14/09 15:28:27 changed by mithrandi

  • keywords set to review.
  • owner changed from mithrandi to glyph.

The NotImplementedError is gone now; I snuck in a test suite fix for the PyLucene?-not-available problem; this should probably have been on a separate ticket, but it's only 2 lines of code...

Also, I created tickets #2877, #2878, and #2879 to cover the issues you raised.

06/11/09 11:47:55 changed by glyph

  • keywords deleted.
  • owner changed from glyph to mithrandi.

OK. Looks pretty good. Please adjust VWS in the new code to have 2 blank lines between methods, 3 between classes, and merge :).

06/11/09 15:54:27 changed by mithrandi

  • status changed from new to closed.
  • resolution set to fixed.

(In [17455]) Add SQLite FTS3 fulltext search backend.

Fixes #2733 Author: mithrandi Revier: glyph, exarkun

jethro@divmod.org