Go Back   Rhinocerus > Newsgroup > Newsgroup comp.databases.mysql

Reply
 
Thread Tools Display Modes
  #1 (permalink)  
Old 07-16-2012, 09:15 AM
Jason C
Guest
 
Posts: n/a
Default Short words

Do any of you have a way around the "short word feature", other than modifying ft_min_word_len? I'm getting more and more complaints that my "search engine sucks" because it can't tell the difference between (for example) "9mm pistol" and "pistol", but I don't really want to slow it down by modifying ft_min_word_len, either.

I read one solution of having a second column, and adding __ to every word,then modifying the query to add __ to each of the query words to make it match the word in the table. But of course, that doubles the size of the table, and since one of my tables are 1.3G, I'm not sure that doubling it is very wise.

If I did lower ft_min_word_len to 2, would I REALLY notice that much of a query speed difference? Have any of you benchmarked it? Are we talking abouta difference of 0.007 to 0.009 (which is acceptable), or 0.007 to 2.0 (which is not)?

TIA,

Jason
Reply With Quote
Alt Today
Advertising
 
and become member of Rhinocerus
Standard Sponsored Links

  #2 (permalink)  
Old 07-16-2012, 11:17 AM
Axel Schwenke
Guest
 
Posts: n/a
Default Re: Short words

Jason C <jwcarlton@gmail.com> wrote:

> Do any of you have a way around the "short word feature", other than
> modifying ft_min_word_len? I'm getting more and more complaints that
> my "search engine sucks" because it can't tell the difference between
> (for example) "9mm pistol" and "pistol", but I don't really want to
> slow it down by modifying ft_min_word_len, either.


Congratulations! You have just found why the MySQL fulltext search
sucks. And any other search engines without "term preprocessor" suck
as well. If you want terms like "9mm", "C++" or "AS/400" indexable
(and thus: findable) then you must transform them, such that the
search engine recognizes them as words.

> I read one solution of having a second column, and adding __ to
> every word, then modifying the query to add __ to each of the query
> words to make it match the word in the table.


That's the basic idea, but I rather suggest

"9mm" -> nine_millimeter
"C++" -> C_plus_plus

etc. pp.


XL
Reply With Quote
  #3 (permalink)  
Old 07-16-2012, 11:03 PM
Jason C
Guest
 
Posts: n/a
Default Re: Short words

On Monday, July 16, 2012 7:17:12 AM UTC-4, Axel Schwenke wrote:
> That's the basic idea, but I rather suggest
>
> &quot;9mm&quot; -&gt; nine_millimeter
> &quot;C++&quot; -&gt; C_plus_plus
>
> etc. pp.


That makes sense, but I'm not entirely sure how to do that on the fly. There are millions of potential combinations that could be queried, so I'm not sure I would write a program to know when it should be "momma" or "momillimetera" (replacing all "mm" with "millimeter").

Would it be practical to convert all characters to hex code? This is probably crazy, but is there a way to query MySQL as hex, or would I still have to have a second column with everything converted to hex?

Or, would it simply be faster to use LIKE, convert all whitespaces in the query to "%", and just forget about MATCH...AGAINST?
Reply With Quote
  #4 (permalink)  
Old 07-17-2012, 12:36 AM
Jerry Stuckle
Guest
 
Posts: n/a
Default Re: Short words

On 7/16/2012 7:03 PM, Jason C wrote:
> On Monday, July 16, 2012 7:17:12 AM UTC-4, Axel Schwenke wrote:
>> That's the basic idea, but I rather suggest
>>
>> &quot;9mm&quot; -&gt; nine_millimeter
>> &quot;C++&quot; -&gt; C_plus_plus
>>
>> etc. pp.

>
> That makes sense, but I'm not entirely sure how to do that on the fly. There are millions of potential combinations that could be queried, so I'm not sure I would write a program to know when it should be "momma" or "momillimetera" (replacing all "mm" with "millimeter").
>
> Would it be practical to convert all characters to hex code? This is probably crazy, but is there a way to query MySQL as hex, or would I still have to have a second column with everything converted to hex?
>
> Or, would it simply be faster to use LIKE, convert all whitespaces in the query to "%", and just forget about MATCH...AGAINST?
>


If you want to search for "millions of potential combinations", then
your best bet is to just change your ft_min_word_len to 2. Yes, it will
slow down searches - but it will still be better than trying to keep up
with the millions of short words.

This is a problem with any database, not just mysql. Full text searches
are not handled well with any of the databases.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================


Reply With Quote
  #5 (permalink)  
Old 07-18-2012, 12:48 AM
Jonas Smith
Guest
 
Posts: n/a
Default Re: Short words

On Mon, 16 Jul 2012 19:03:59 -0400, Jason C wrote
(in article <31d1a18c-295c-4826-b4ec-c9ef9624f444@googlegroups.com>):

> On Monday, July 16, 2012 7:17:12 AM UTC-4, Axel Schwenke wrote:
>> That's the basic idea, but I rather suggest
>>
>> &quot;9mm&quot; -&gt; nine_millimeter
>> &quot;C++&quot; -&gt; C_plus_plus
>>
>> etc. pp.

>
> That makes sense, but I'm not entirely sure how to do that on the fly. There
> are millions of potential combinations that could be queried, so I'm not sure


> I would write a program to know when it should be "momma" or "momillimetera"
> (replacing all "mm" with "millimeter").
>
> Would it be practical to convert all characters to hex code? This is probably


> crazy, but is there a way to query MySQL as hex, or would I still have to
> have a second column with everything converted to hex?
>
> Or, would it simply be faster to use LIKE, convert all whitespaces in the
> query to "%", and just forget about MATCH...AGAINST?


Try Sphinx search. I was created specifically for this problem.

http://sphinxsearch.com/

--
Jonas

Reply With Quote
 
Reply

Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are Off
Pingbacks are Off
Refbacks are Off




All times are GMT. The time now is 10:57 AM.


Copyright ©2009

LinkBacks Enabled by vBSEO 3.3.0 RC2 © 2009, Crawlability, Inc.