krausshalt

Async iterators with MongoDB

Asynchronous batched iterables for (Mongo) cursors. When one is not enough and all is too much.

Async iterators first landed in Node.js v10 as an experimental feature, and since then they have made their way into the current stable and LTS releases of Node.js.

For a long time I never had a real use case for them. But recently we had to process a lot of data from MongoDB without dumping the whole collection into memory.

The first attempt just uses the regular MongoDB cursor and its forEach method to process the documents one after the other:

collection.find(...).forEach(document => process(document)
    .then(...) // forEach does not wait for the promise, so sequential processing isn't possible
    .catch(...)
)

This works, but it neither supports promises nor is it very efficient. The second approach was my very first use of the new async iterator syntax, just iterating over the cursor:

for await (const document of collection.find(...)) {
  await process(document)
}

Surprisingly, this just works out of the box, because the MongoDB cursor exposes an asynchronous next() method, which is all that's required to loop through the collection. That's a nice solution, but it has a drawback: what if process(document) takes some time and slows down the processing of those documents?
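To see why this works out of the box, here is a minimal sketch of the protocol involved. fakeCursor below is a mock, not the real driver API: any object whose Symbol.asyncIterator method returns an object with an async next() resolving to { value, done } can be consumed with for await...of.

```javascript
// Mock cursor implementing the async iterator protocol — a stand-in
// for a real MongoDB cursor, which implements the same protocol.
function fakeCursor(documents) {
  let index = 0
  return {
    [Symbol.asyncIterator]() {
      return {
        // An asynchronous next() is all that for await...of needs.
        next: async () =>
          index < documents.length
            ? { value: documents[index++], done: false }
            : { value: undefined, done: true }
      }
    }
  }
}

async function main() {
  const seen = []
  for await (const document of fakeCursor(['a', 'b', 'c'])) {
    seen.push(document)
  }
  return seen // → ['a', 'b', 'c']
}
```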

Now I need a batch of documents that can be processed in parallel, but not all of them at the same time, so memory doesn't blow up. To the ...

Bat(ch) Mobile

I wrote a small Node module that does exactly that: it fetches N items from a cursor (which doesn't have to be a MongoDB cursor), yields the batch of items, and awaits its processing. And so on, until the cursor is exhausted. The example from above now looks like this:

const { getBatchedIterableFromCursor } = require('batch-mobile')

const cursor = collection.find(...)
for await (const batchOfItems of getBatchedIterableFromCursor(cursor, 100)) {
  await process(batchOfItems) // this is now an array of items
}

The only changes are that document is now an array of documents, and that there is one additional function call, which takes the batch size as its second argument (the default is 200).
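For the curious, a batching helper like this can be sketched as an async generator. This is just an illustration, not batch-mobile's actual implementation, and it assumes the cursor's async next() resolves to null (or undefined) once exhausted, as the MongoDB driver's cursor does:

```javascript
// Sketch of a batching helper: collects up to batchSize items from a
// cursor and yields them as an array. The generator is suspended at
// each yield, so the next batch is only fetched after the consumer
// has finished with the current one (natural backpressure).
async function* getBatchedIterable(cursor, batchSize = 200) {
  let batch = []
  let item
  while ((item = await cursor.next()) != null) {
    batch.push(item)
    if (batch.length === batchSize) {
      yield batch
      batch = []
    }
  }
  if (batch.length > 0) yield batch // flush the final, partial batch
}

// Tiny mock cursor over an array, standing in for a real driver cursor:
function arrayCursor(items) {
  let index = 0
  return { next: async () => (index < items.length ? items[index++] : null) }
}

async function demo() {
  const batches = []
  for await (const batch of getBatchedIterable(arrayCursor([1, 2, 3, 4, 5]), 2)) {
    batches.push(batch)
  }
  return batches // → [[1, 2], [3, 4], [5]]
}
```

Inside the consuming loop you could then process each batch with bounded parallelism, for example via Promise.all(batch.map(process)), which was the whole point of batching in the first place.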