Channel: NoSQL – codecentric AG Blog

MongoDB Text Search Explained

The upcoming MongoDB release 2.4 will include initial, experimental support for full text search (FTS). This feature was requested early in MongoDB's history, as you can see from the JIRA ticket SERVER-380. FTS first became available in the developer release 2.3.2.

Full Text Search 101

Before looking at how MongoDB implemented its initial full text search, we need to learn a little bit about the basics. There are (at least) two important concepts you need in order to understand full text search:

Stop Words

Stop words are words that are irrelevant for searching and are therefore filtered out. Examples are is, at, the, etc. Let’s have a look at the following sentence …

I am your father, Luke

… and these stop words: am, I, your. After applying the stop words, that’s what’s left of our sentence:

father Luke

The remaining words are processed in the next step. Please note that stop words are language dependent and may also vary from domain to domain.
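
To make the filtering step concrete, here is a minimal sketch in plain JavaScript. It only uses the three stop words from the example above; the function name `removeStopWords` and the simple tokenizing regex are assumptions for illustration and say nothing about how MongoDB tokenizes internally.

```javascript
// Stop words taken from the example above; real lists are much longer.
const STOP_WORDS = new Set(["am", "i", "your"]);

// Split on anything that is not a letter or apostrophe, then drop stop words.
// The comparison is case-insensitive, the original spelling is kept.
function removeStopWords(sentence) {
  return sentence
    .split(/[^A-Za-z']+/)
    .filter((word) => word.length > 0)
    .filter((word) => !STOP_WORDS.has(word.toLowerCase()));
}

console.log(removeStopWords("I am your father, Luke").join(" "));
// prints "father Luke"
```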

Stemming

Stemming is the process of reducing words to their root, base or .. well .. stem. Remember things like declension and conjugation? These change the form of a word, and stemming undoes that. For example,

waiting, waited, waits

all have the same stem: wait. This processing is also language dependent. Implementations for stemming are called stemmers.
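
As a rough illustration of what a stemmer does, the following toy suffix stripper handles the three example words. It is deliberately naive: the suffix list and the name `toyStem` are made up for this sketch, and a real stemmer applies far more rules per language.

```javascript
// Toy suffix stripper for illustration only; this is NOT a real stemming algorithm.
const SUFFIXES = ["ing", "ed", "s"];

function toyStem(word) {
  const lower = word.toLowerCase();
  for (const suffix of SUFFIXES) {
    // Keep at least three characters so short words are left alone.
    if (lower.endsWith(suffix) && lower.length - suffix.length >= 3) {
      return lower.slice(0, lower.length - suffix.length);
    }
  }
  return lower;
}

console.log(["waiting", "waited", "waits"].map(toyStem));
// prints [ 'wait', 'wait', 'wait' ]
```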

The following diagram sums up the whole process:

[Diagram: mongo_fts_2]

So let’s see how we can use MongoDB for full text search.

Enable Text Search

For now, text search is disabled by default. You have to enable it at server start with the following command line option:

$ ./mongod --setParameter textSearchEnabled=true

Create a text index

First of all, you define a special kind of index on a field, similar to geospatial indexes:

db.txt.ensureIndex( {txt: "text"} )

Language settings are important with FTS. MongoDB uses the open source stemmer Snowball and a custom set of stop words for every language supported by that stemmer. The default language is English.

If you have a look at the indexes, our special text index shows up:

> db.txt.getIndices()
[
        {
                "v" : 1,
                "key" : {
                        "_id" : 1
                },
                "ns" : "txt.txt",
                "name" : "_id_"
        },
        {
                "v" : 0,
                "key" : {
                        "_fts" : "text",
                        "_ftsx" : 1
                },
                "ns" : "txt.txt",
                "name" : "txt_text",
                "weights" : {
                        "txt" : 1
                },
                "default_language" : "english",
                "language_override" : "language"
        }
]
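
The `default_language` entry in the listing above can be set when the index is created. The sketch below wraps the call in a function so it can be tried with any `db` handle; the collection name `txt_de` is hypothetical and only used here to avoid clashing with the English index on `txt`.

```javascript
// Sketch for the mongo shell: pass the shell's `db` object.
// `txt_de` is a hypothetical collection name used for illustration.
function createGermanTextIndex(db) {
  // The second argument carries index options; default_language is the
  // same key that shows up in the getIndices() output.
  return db.txt_de.ensureIndex({ txt: "text" }, { default_language: "german" });
}
```

In the shell this would be called as `createGermanTextIndex(db)`.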

Insert documents

If you insert a document into the above collection, MongoDB applies stop word filtering and stemming to the content of the indexed text field. Each stem is added to the index, pointing to the current document.

db.txt.insert( {txt: "I am your father, Luke"} )

You can easily see that the stop word filtering happened, because there are only 2 keys in the index txt.txt.$txt_text:

> db.txt.validate()
{
        "ns" : "txt.txt",
         ...
        "nIndexes" : 2,
        "keysPerIndex" : {
                "txt.txt.$_id_" : 1,
                "txt.txt.$txt_text" : 2
        },
        ...
}
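
Why two keys? A toy inverted index in plain JavaScript shows the idea. This is a simplified model under stated assumptions (the three stop words from the earlier example, no stemming, a `Map` from term to document ids), not a description of MongoDB's actual index format.

```javascript
// Toy inverted index: maps each indexed term to the ids of documents containing it.
// The stop word list is the one from the earlier example, not MongoDB's.
const stopWords = new Set(["am", "i", "your"]);

// Lowercase the text, split it into words and drop the stop words.
function indexTerms(text) {
  return text
    .toLowerCase()
    .split(/[^a-z']+/)
    .filter((w) => w.length > 0 && !stopWords.has(w));
}

// Register every remaining term of `text` under the given document id.
function addToIndex(index, docId, text) {
  for (const term of indexTerms(text)) {
    if (!index.has(term)) index.set(term, new Set());
    index.get(term).add(docId);
  }
  return index;
}

const index = addToIndex(new Map(), 1, "I am your father, Luke");
console.log([...index.keys()]); // prints [ 'father', 'luke' ]
console.log(index.size);        // prints 2
```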

Search

If you want to perform a full text search, you run a command on the collection holding the text index:

db.txt.runCommand( "text", { search : "father" } )

Again, the language (this time the language of the search phrase) defaults to English.

The result looks like this:

> db.txt.runCommand("text", {search: "father"} )
{
        "queryDebugString" : "father||||||",
        "language" : "english",
        "results" : [
                {
                        "score" : 0.75,
                        "obj" : {
                                "_id" : ObjectId("50e820689068856d0ac6a801"),
                                "txt" : "I am your father, Luke"
                        }
                }
        ],
        "stats" : {
                "nscanned" : 1,
                "nscannedObjects" : 0,
                "n" : 1,
                "timeMicros" : 114
        },
        "ok" : 1
}

We have one hit for “father” using the index. The ObjectId of the document is returned along with the full text.
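
The language default mentioned above can be overridden per query by passing a `language` field to the text command. A small sketch, with the helper name `searchInLanguage` being an assumption of this example:

```javascript
// Sketch for the mongo shell: `collection` is something like db.txt.
function searchInLanguage(collection, phrase, language) {
  // `language` selects the stop words and stemmer applied to the search phrase.
  return collection.runCommand("text", { search: phrase, language: language });
}
```

Calling `searchInLanguage(db.txt, "father", "english")` should behave like the plain query shown earlier, since English is the default anyway.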

This doesn’t feel like rocket science? Ok, then try a more advanced example:

> db.txt.insert({txt: "I'm still waiting"})
> db.txt.insert({txt: "I waited for hours"})
> db.txt.insert({txt: "He waits"})
> db.txt.runCommand("text", {search: "wait"})
{
        "queryDebugString" : "wait||||||",
        "language" : "english",
        "results" : [
                {
                        "score" : 1,
                        "obj" : {
                                "_id" : ObjectId("50e82dc9c95b73b63ec5f5aa"),
                                "txt" : "He waits"
                        }
                },
                {
                        "score" : 0.75,
                        "obj" : {
                                "_id" : ObjectId("50e82db5c95b73b63ec5f5a9"),
                                "txt" : "I waited for hours"
                        }
                },
                {
                        "score" : 0.6666666666666666,
                        "obj" : {
                                "_id" : ObjectId("50e82dabc95b73b63ec5f5a8"),
                                "txt" : "I'm still waiting"
                        }
                }
        ],
        "stats" : {
                "nscanned" : 3,
                "nscannedObjects" : 0,
                "n" : 3,
                "timeMicros" : 148
        },
        "ok" : 1
}

That’s pretty cool, isn’t it? All three documents are found because waits, waited and waiting share the stem wait. As you can see, the resulting documents are sorted in descending order by score, a relevance metric that rates how well each document matches the search term.
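
If you only care about scores and texts, the response is easy to post-process. The helper below is a sketch (the name `summarizeHits` is made up); it is fed with a trimmed copy of the output above so that it also runs outside the mongo shell.

```javascript
// `response` mirrors the shape of the text command output shown above.
function summarizeHits(response) {
  return response.results.map((hit) => hit.score.toFixed(2) + "  " + hit.obj.txt);
}

// Trimmed copy of the output above (ObjectIds and stats omitted).
const response = {
  results: [
    { score: 1, obj: { txt: "He waits" } },
    { score: 0.75, obj: { txt: "I waited for hours" } },
    { score: 0.6666666666666666, obj: { txt: "I'm still waiting" } },
  ],
  ok: 1,
};

summarizeHits(response).forEach((line) => console.log(line));
// prints:
// 1.00  He waits
// 0.75  I waited for hours
// 0.67  I'm still waiting
```

In the shell, the same helper could be applied directly to the result of `db.txt.runCommand("text", {search: "wait"})`.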

Examples

All examples can be found on GitHub. Try them yourself.

Summary

Of course, this implementation of full text search won’t enable MongoDB to compete with search engines like Apache Solr or Elasticsearch, but it is a step in the right direction. I think there are many use cases where this kind of FTS is absolutely sufficient. And don’t forget: this is the first release. We will probably see other interesting features in the future.

If I had to write a wish list, I would write the following:

  • Enable users to provide their own stop word lists (without recompiling). This could be done via a command line option pointing to a file or a new system collection like system.fts.stopwords
  • Use a stemmer implementation that supports more languages than those currently supported. What about all the Asian languages?
  • Introduce the concept of a dictionary in order to handle
    • synonyms,
    • irregular words and
    • compound words that are common in various European languages, something like the German words Volltextsuche (full text search) or Erdbeermarmeladenglas (jar of strawberry jam).

What’s next

In my next blog article I will have a closer look at more advanced features and non-English languages.

In the meantime: try text search yourself, especially if you have huge product data sets. Report any errors or suggestions to the Mongo JIRA.

