Lab notes

Tags, MySQLJune 21, 2005 6:37 pm

Philipp Keller tackles tags and MySQL again, this time running a benchmark against four different schemas, one of which uses fulltext search. My conclusion was slightly different, though: fulltext search is quick (and simple) on the insert, and mysterious on the select. I just couldn’t make it scale consistently, but I’m willing to give it another try.

Bonus point, he makes the benchmark source code available under the LGPL, so you can see exactly what he’s testing and how he’s testing it.

Anyway, go here to get all the deets.

Lab notes, MySQLJune 6, 2005 6:58 pm

A few more days of playing with the fulltext search engine landed me in performance dead-end.

The test case consists of three tables. The primary table holds most of the information and decides on the tag of each record. It’s the biggest table with the largest number of records. The secondary tables hold aggregates or grouped information, and that’s where I was hoping to put the tags and search with a fulltext index. The record length is bigger, but the number of records is much smaller.

The first problem came up early in the performance tests. There’s not a lot of information on how MySQL manages fulltext queries internally, so I could only assume it’s equivalent to anything I’d write, only leaner and meaner. Maybe some overhead, but I have less code to write and maintain, so some overhead is an acceptable tradeoff. My rule is generally, if it’s up to 20% slower, but half the complexity, than I’d rather buy a bigger box than write more code.

(For the record I’m talking about complexity, not lines of code. Those are two different metrics. Adding trivial code to squeeze out more performance is a no-brainer, but increasing complexity is a bad tradeoff. You get to live with that complexity, instead of taking a free ride on the Moore train)

The problem I encountered had to do with queries that looked into the fulltext index, but also narrowed down records by other indexes, e.g. the user identifier or a date range. Mixing indexes led to extremely bad performance, my only guess is that MySQL uses one key to build a temporary table and then uses the other key to narrow down the search. Everything fits in memory, there’s no disk I/O, but performance is still too slow to be reasonable.

But even without that complexity, I couldn’t get fulltext queries to perform fast enough. It turns out that the primary table has 340x more records than the secondary one. I planned around a 1:250 difference. Given that ratio, the fulltext search ran substantially faster than any queries I could do on the primary table. At that step in the testing I was optimistic about using fulltext search.

But then I increased the secondary table size to about a third the record count of the primary table, and ended up with queries that on average were three times slower. That’s not only bad, it messes with my capacity planning. I don’t have an upper limit in mind for the primary table, but I can guestimate how well it will performance at different orders of magnitude. With fulltext, I seem to have an upper limit on the secondary table, after which I need a different strategy.

Not what I had in mind.

Lab notes, Tags, MySQLMay 30, 2005 7:11 am

I’ve been looking at different schemas that can support tagging, which I realized that full text search is the perfect application for tagging.

A scheme where each tag (applied to an object by a user) occupies a row in the database is simple and efficient in terms of storage size and retrieval. With the right set of indexes (at least three are required), queries are extremely efficient. But the code for managing the tags is either complex or restrictive. To begin with, tagging each object requires multiple inserts, and changing the object’s tag list is a combination of inserts and deletes. Renaming a tag is just as tricky, thought MySQL’s replace statement keeps it under check.

Querying on a single tag is easy, querying on two tags slightly more complicated, but each extension increases the level of complexity, that even reasonable search capabilities become unreasonably complicated to implement. Users who need or searches, or searches that include one tag but exclude others, are out of luck.

Using fulltext search simplifies everything, and brings up new and interesting options. The scheme holds all tags applied to an object by a user in a single field, using a fulltext index on that field. Tagging an object requires a single insert, changing the object’s tags requires a single update. Renaming a tag is an update with a bit of string manipulation. But queries is where fulltext search takes us to a whole new level.

With the in boolean mode option you can query for interesting combination of tags. For example, you can query all records that have the tags foo and bar ("+foo +bar"), or all records that have the tags foo or bar ("foo bar"). You can query all records that have the tags foo or bar, but excluding any records that have the tag bar ("(foo bar) -baz"). So we have mandatory, optional and negative tags, as well as and an or relations.

The default configuration for MySQL will not index any word shorter than 4 letters, and will not include any word that appears in the stopword list (e.g. common words like “with” and “after”). Unfortunately, these two restrictions are bad for tags. A meaningful tag may fall in the stopword list, and many meaningful tags are shorted than 4 letters (e.g. "osx"). It’s possible to change the minimum word length and the stopword list, but doing so will mess with other applications of fulltext search.

There is a simple solution around this problem, which relies on the use of underscores to prefix tags. A tag prefixed with an underscore is not recognized by the stopword list, and several underscores can be used to get around the minimum word length limitation. If this sounds like a complication, consider that the scheme requires some text manipulation on all tags before they are stored. In fact, the scheme I use stores the tag lists in two separate fields. The first field holds the tags as entered by the user, e.g. “SF” and “San Francisco”. The second field holds the tags in a searchable format, and is used for fulltext searches, so it would hold "__sf" and "_sanfrancisco".

You may want to read Philipp Keller’s posts on various schemes used for tagging data, and the MySQL fulltext search documentation.