Friday, March 22, 2013

Mongodb group by count, update

The new Aggregation Framework from mongodb (available in version 2.1 and above) is pretty useful...rather than write mapreduce functions, you can use their aggregation framework to do simple group by, count(*), max and sum type of queries.


--This is how you insert into a collection based on query on another collection
 db.basecollection.aggregate( [  
{ $group: { _id: "$c1",              
   rowcount: { $sum: 1 },  
                    lastupdated: { $max: "$c2" } } }
 ], {$sort:{'lastupdated':1}} ).result.forEach(function(doc){
   db.aggcollection.insert(doc)});

The equivalent sql would be something like:
insert into your_aggregate_table (category, rowcount, last_updated)
select category, count(*), max(update_date)
from your_base_table
group by category


--update (insert if don't exist) a collection based on query on another collection
db.basecollection.aggregate( [  
{ $group: { _id: "$c1",              
   rowcount: { $sum: 1 },  
                    lastupdated: { $max: "$c2" } } }
 ]).result.forEach(function(doc){
   db.aggcollection.update({_id: doc._id}, { $set: {rowcount: doc.rowcount, lastupdated: doc.lastupdated}}, {upsert:true})
   });


Thursday, March 21, 2013

useful unix commands for data processing

I use IBM Datastage for ETL at work. Datastage has an "Execute Command Activity" which allows us to issue an command on the operating system, in our case it's a linux.

Below are the commands that have come in very handy and efficient in helping me process the data or augment my data workflow.

To convert a single column into a comma separated row of values:
paste -s -d, your_input_file


To get last row:
tail -1 your_input_file