Nov 26, 2010

Basic inverted index in MongoDB

i m attracted by Ward Bekker's article Simple ranked text search for MongoDB and his ruby code, he used mapreduce to create inverted index for a set of documents, and ranked the search result by another mapreduce. i modified a bit his code and implement in my PHP application.

Consider a business collection as below
{
  name: 'Old Novice Solutions Ltd.',
  category: 'IT',
  keyword: [
    {'word': 'old', 'weight': 10},
    {'word': 'novice', 'weight': 10},
    {'word': 'solutions', 'weight': 10},
    {'word': 'it', 'weight': 5},
    {'word': 'ltd', 'weight': 1}
  ]
}

Keywords are generated (the generation is not covered here) based on name and category, according to some rules
  • All keywords are stored in lower case.
  • For the weight, the words from name are the highest, those from category are lower, 'company', 'co', 'limited', 'ltd' are the lowest.
  • No duplicated keywords, only the one with the highest weight is stored.

The first mapreduce creates inverted index for existing documents and stores in keyword collection. The index data affected by new documents and modified documents is not covered in this post.
db.runCommand({
    mapreduce: 'business',
    map: function() {
        var keyword = this.keyword;
        var business = {'_id': this._id, 'name': this.name, 'cat': this.cat, 'weight': 0};
        for (var i = 0; i < keyword.length; i++) {
            if (keyword[i] != '') {
                business.weight = keyword[i]['weight'];
                emit(keyword[i]['word'], {biz: [business]});
            }
        }
    },
    reduce: function(key, values) {
        var biz = [];
        for (var i = 0; i < values.length; i++) {
            biz = biz.concat(values[i].biz);
        }
        return {biz: biz};
    },
    out: 'keyword',
});

keyword collection is something like this
{
  '_id': 'old',
  'value': {'biz': [
      {
        'id': ObjectId('...'),
        'name': 'Old Novice Solutions Ltd',
        'cat': 'IT',
        'weight': 10
      },
      {
        'id': ObjectId('...'),
        'name': 'Old Trafford Stadium',
        'cat': 'Stadium',
        'weight': 10
      }
    ...
    ]
  }
}

The second mapreduce handles search query and returns ranked output in PHP.
<?php

$query = array('old', 'novice');
$businesses = search($query);
foreach ($businesses as $business) {
    echo $business['value']['doc']['name'] . ' ' .
         $business['value']['doc']['cat'];
    echo '<br />';
}

function search($query) {
    $map = new MongoCode( <<<JSCODE
function() {
    for (var i = 0; i < this.value.biz.length; i++) {
        emit(this.value.biz[i]['_id'], {count: 1, doc: this.value.biz[i]});
    }
}
JSCODE
    );
    $reduce = new MongoCode( <<<JSCODE
function(key, values) {
    var totalCount = 0;
    var totalWeight = 0;
    var doc = values[0].doc;
    for (var i = 0; i < values.length; i++) {
        totalCount += values[i].count;
        totalWeight +=values[i].doc.weight;
    }
    doc.weight = totalWeight;
    return {count: totalCount, doc: doc};
}
JSCODE
    );

    $conn = new Mongo();
    $db = $conn->selectDB('test');
    $result = $db->command(array(
        'mapreduce' => 'keyword',
        'map' => $map,
        'reduce' => $reduce,
        'query' => array('_id' => array('$in' => $query))));
    $biz = $db->selectCollection($result['result'])
              ->find(array('value.count' => array('$gte' =>count($query))))
              ->sort(array('value.doc.weight' => -1));
    return $biz;
}

Thanks Ward Bekker.

No comments:

Post a Comment