Taking VerneMQ replication to the next level

TL;DR: VerneMQ now ships with a new metadata store based on the Server-Wide Clocks framework, which considerably improves replication performance in terms of processing power and bandwidth usage, and supports real distributed deletes.

Over the last year we've been working hard on taking VerneMQ to the next level. Besides implementing the full MQTT 5.0 spec, we've also built a new replication mechanism to make VerneMQ even more robust and scalable. The new replication algorithm has been merged to master, and in VerneMQ 1.7, released a few days ago, you can enable it by simply setting metadata_plugin = vmq_swc in the VerneMQ configuration file.

The current approach

Until now we've been using the Plumtree replication algorithm, which was originally built to distribute metadata between the nodes of a Riak cluster. It has served VerneMQ and its users well; see for example the KubeCon 2018 talk Our Journey to Service 5 Million Messaging Connections on Kubernetes.

That said, Plumtree has some limitations. The biggest functional one is that it doesn't support distributed deletes; deletions are therefore emulated by setting a special value that marks the key as deleted - this value is called a tombstone. VerneMQ uses Plumtree to store subscriber data and retained messages, keyed by client-id and topic respectively. This means VerneMQ keeps a tombstone around for every client that has disconnected and left no state behind, as well as for every topic where a retained message was once stored.
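
To illustrate what that emulation means, here is a minimal sketch - not VerneMQ's actual code; the module, the map-based store and the marker value are made up for the example. A delete simply writes a marker that replicates like any other value, and reads have to filter it out:

    %% Minimal sketch, not VerneMQ internals: emulating a delete in a
    %% replicated store that has no distributed deletes. The module name,
    %% the map-based store and the marker value are invented for illustration.
    -module(tombstone_sketch).
    -export([delete/2, get/2]).

    -define(TOMBSTONE, '$deleted').

    %% "Deleting" a key really means writing a marker value that replicates
    %% like any other write - the entry itself never disappears.
    delete(Key, Store) ->
        maps:put(Key, ?TOMBSTONE, Store).

    %% Reads filter the marker out, but every deleted client-id or cleared
    %% retained topic still leaves a tombstone behind in the store.
    get(Key, Store) ->
        case maps:get(Key, Store, undefined) of
            ?TOMBSTONE -> undefined;
            Value      -> Value
        end.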

Another thing we and our users have experienced with Plumtree is that it is quite resource intensive in terms of processing power and bandwidth when replicating large amounts of data, for instance when new nodes are added or nodes return to the cluster after a long-running net-split.

Moving forward

So, while Plumtree did what it was supposed to do and did it well, we were always on the lookout for something better. At some point we came across the paper Concise Server-Wide Causality Management for Eventually Consistent Data Stores, which introduced a set of novel ideas called Server-Wide Clocks (SWC) that, at least in theory, make anti-entropy much more efficient than Merkle trees while also supporting distributed deletes. An experimental evaluation was later published as part of the doctoral thesis of Ricardo Gonçalves, Multi-Value Distributed Key-Value Stores (there the term Node-Wide Intra-Object Causality Management is used instead of Server-Wide Clocks), which confirms that SWC is much more efficient than an equivalent Merkle tree implementation.
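
For a rough intuition of why this can be cheaper, here is a deliberately simplified sketch of the general idea - not the vmq_swc implementation and not the full algorithm from the paper; all names below are invented. Every local write is tagged with a dot, a pair of the originating node and a per-node counter, and each node summarises what it has seen in one small clock. Anti-entropy then boils down to comparing clocks and shipping only the entries the other side hasn't covered yet, instead of hashing and exchanging Merkle trees over the whole key space:

    %% Simplified sketch of the server-wide clock intuition; not vmq_swc code.
    %% Store :: #{Key => {Dot :: {node(), pos_integer()}, Value}}
    %% Clock :: #{node() => pos_integer()}   %% highest counter seen per origin
    -module(swc_sketch).
    -export([write/4, missing/2]).

    %% A local write bumps this node's counter and tags the entry with the
    %% resulting dot {Node, Counter}.
    write(Node, Key, Value, {Store, Clock}) ->
        Counter = maps:get(Node, Clock, 0) + 1,
        {Store#{Key => {{Node, Counter}, Value}},
         Clock#{Node => Counter}}.

    %% Anti-entropy: given the peer's clock, return only the entries whose
    %% dot the peer has not covered yet - one clock comparison instead of
    %% exchanging Merkle tree levels for the whole key space.
    missing(Store, PeerClock) ->
        maps:filter(
          fun(_Key, {{Origin, Counter}, _Value}) ->
                  Counter > maps:get(Origin, PeerClock, 0)
          end,
          Store).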

So we decided this was interesting enough to implement SWC in VerneMQ. While building it we ran various tests at different scales and durations, and the results so far have been very encouraging: more predictable memory usage, lower CPU usage in general, (much) faster anti-entropy rounds and real, distributed deletes.

We believe SWC is getting ready for prime time, but we'd be grateful for feedback and experiences from running it in the wild, so we can learn how it performs and behaves in scenarios and use cases we haven't tested or thought about. To switch to the SWC replicated store, all you have to do is set metadata_plugin = vmq_swc in the vernemq.conf file on all the nodes, as shown below.
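
For example, the relevant line in each node's vernemq.conf would look like this:

    metadata_plugin = vmq_swc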

From the entire VerneMQ team we'd like to thank Ricardo Gonçalves, Paulo Sérgio Almeida, Carlos Baquero, Victor Fonte and all the other researchers for moving the state of the art of distributed systems to the next level!

Looking forward to hearing from you!