The Past, Present and Future of Message Queue 2

December 18, 2022 18 min read
Lucas Lee

Abstract: This is the second part of “The Past, Present and Future of Message Queue”. See the previous article for Part 1.

IV. At Present: The Era of Microservices

As the Internet continues to advance, more Internet users than ever are attracted by democratic access to information and the convenient lifestyle that comes with it. For Internet companies, a growing user base means continuous, rapid business growth and new waves in a booming stock market. In 2011, Google had a market capitalization of USD 180 billion and employed more than 24,000 people [44]. After ten years of growth, its market value approached USD 2 trillion in 2021 [45], more than ten times that of 2011, with more than 150,000 employees. Application software, however, faces mounting challenges: to cope with access by hundreds of millions of Internet users, back-end architectures must be refactored so that websites remain responsive as traffic keeps growing.

In this context, many developers turned their attention to microservices. As with most techniques, microservices attracted attention only when they were needed, not when they were created. Although the idea of microservices [46] was proposed back in 2005, it was only after 2014 that microservices were widely applied to refactor the back-end systems of websites to handle ever higher concurrency. RocketMQ was created as one of the solutions to such needs. To support the massive traffic of the Double Eleven shopping festival, Alibaba rebuilt its websites, including Taobao and Tmall, with microservices. Before then, its e-commerce systems consisted of only a few large back-end service programs, leaving the websites inadequate for high concurrency, until the back-end services were refactored into hundreds of microservices. Before 2010, service communication at Taobao was handled by Napoli, an ActiveMQ-based messaging platform, with messages persisted to a database through Notify [47].

As Alibaba’s e-commerce system went through its microservice transformation, equipping thousands of microservices with asynchronous communication became the main concern. Because Napoli stores data in a database, its throughput is very limited, unable to support asynchronous communication across large-scale microservice clusters. Alibaba therefore switched to Kafka after that platform went open source, hoping it could support large-scale asynchronous microservice interactions. However, despite its high throughput and persistence capability, Kafka was still insufficient for large-scale microservice scenarios [48], mainly for three reasons. First, such scenarios require good support for message features like transactional messages and delayed messages. For instance, a user usually pays shortly after submitting an order, and if the order remains unpaid for too long, it must be canceled. This requires delayed messages, which Kafka does not support. Second, in microservice scenarios, every single message must be delivered reliably. If a message is lost, the order data it carries is lost too, which is unacceptable for transactions, yet such losses happened too often with Kafka. Kafka was originally designed to maximize throughput through batch delivery, which is unsafe for individual messages; in big data scenarios, losing a single message is tolerable because it rarely affects the conclusions of the analysis. Third, Kafka becomes unstable when many topics are created, which seriously impacts the throughput of the entire system. This stems from its storage model design.
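To make the batching trade-off concrete, here is a minimal sketch of a Kafka producer configured for per-message safety rather than raw throughput, using the standard Java client. The broker address, topic, and payload are placeholders; this illustrates the trade-off in general, not Alibaba’s actual setup:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class OrderProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");

        // Throughput-oriented defaults batch many records per request; a crash
        // before a batch is flushed can lose buffered messages. For
        // transaction-grade delivery, trade throughput for safety:
        props.put(ProducerConfig.ACKS_CONFIG, "all");              // wait for all in-sync replicas
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true); // no duplicates on retry
        props.put(ProducerConfig.LINGER_MS_CONFIG, 0);             // do not wait to build batches

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("orders", "order-1001", "{\"state\":\"CREATED\"}"),
                    (metadata, exception) -> {
                        if (exception != null) {
                            // Surface the failure instead of silently dropping the order event.
                            exception.printStackTrace();
                        }
                    });
            producer.flush(); // block until the record is acknowledged
        }
    }
}
```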

To deal with these problems, Alibaba decided to develop a message queue that could meet the needs of large-scale microservice scenarios. In 2012, the company built such a product and named it MetaQ, then renamed it RocketMQ when it went open source in 2013. In simple terms, RocketMQ is a combination of Kafka and RabbitMQ: it implements message features including transactional messages, delayed messages, and dead letter queues, and it adopts Kafka’s storage model and sequential read/write scheme to ensure high throughput, but for the sake of message reliability it sends and receives messages singly instead of in batches. Furthermore, RocketMQ improves on Kafka’s data storage mechanism to solve Kafka’s instability under massive-topic scenarios. Kafka divides every topic into several partitions, and every partition writes to one storage file at a time, starting the next file only when the current one is full (e.g., 1 GB). The storage directory of Kafka is illustrated below [49].

[Figure: Kafka storage directory layout]

This storage mechanism works well with a small number of topics, which is all that big data scenarios typically require. When it comes to large-scale microservices, however, the business may require hundreds or even thousands of topics. With hundreds of topics, Kafka’s storage model leads each broker node to create hundreds of files. As these files are read, parts of their data are loaded into the operating system’s page cache, and the resulting page-cache pressure leaves the system extremely unstable, hence Kafka’s poor performance in multi-topic scenarios. RocketMQ removes this shortcoming by storing all partitions (called message queues in RocketMQ) of each broker in a single commit log file, which makes reading data a bit more complex but neatly solves the performance problem.
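Returning to the message features mentioned earlier: delayed delivery, which the order-timeout scenario requires, is a first-class API in RocketMQ. Below is a minimal sketch against a RocketMQ 4.x broker; the name server address, topic, and producer group are placeholders:

```java
import org.apache.rocketmq.client.producer.DefaultMQProducer;
import org.apache.rocketmq.client.producer.SendResult;
import org.apache.rocketmq.common.message.Message;

public class DelayedOrderCheck {
    public static void main(String[] args) throws Exception {
        DefaultMQProducer producer = new DefaultMQProducer("order-producer-group");
        producer.setNamesrvAddr("localhost:9876"); // placeholder name server
        producer.start();

        Message msg = new Message("ORDER_TIMEOUT_CHECK",
                "order-1001: cancel if still unpaid".getBytes());
        // RocketMQ 4.x supports fixed delay levels; with the default level table
        // (1s 5s 10s 30s 1m 2m ... 20m 30m 1h 2h), level 16 means 30 minutes.
        msg.setDelayTimeLevel(16);

        SendResult result = producer.send(msg);
        System.out.printf("queued delayed check: %s%n", result.getMsgId());
        producer.shutdown();
    }
}
```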

As a solution for message queues in large-scale microservice scenarios, RocketMQ has received a lot of attention from Internet companies since it went open source. In 2019, it took first place in the vote for the most popular open source software in China [50]. A large number of Internet companies, including Didi Chuxing, WeBank, Tongcheng Travel (formerly Tongcheng-Elong), and Kuaishou, had applied Kafka to both big data analysis and large-scale microservice interactions before gradually switching to the open source RocketMQ for the latter [51]. The following graph presents comparison tests Didi ran between RocketMQ and Kafka with varied message sizes and numbers of topics [52].

[Figure: Didi’s throughput comparison of RocketMQ and Kafka across message sizes and topic counts]

As seen from the tests, the first group of data at the top used Kafka with consumption turned on and 2,048-byte messages; as the number of topics increased to 256, its throughput dropped dramatically. The second group, using RocketMQ, was only slightly impacted by the growing number of topics. The third and fourth groups repeated the first two with consumption turned off, leading to similar conclusions, only with slightly higher throughput.

V. At Present: From Cloud Computing to Cloud Native

Beyond democratic access to information, the Internet has also given the world cloud computing. In 2000, Amazon, the well-known American online shopping platform, launched Merchant.com, an e-commerce service [53] that helped third-party merchants build online shopping websites on top of Amazon’s e-commerce engine. To move the project forward, Andy Jassy led the team in restructuring Amazon’s internal systems around APIs so that third-party merchants, as well as Amazon’s internal teams, could easily work with the e-commerce engine. Three years later, Andy Jassy [54] proposed building an Internet operating system that would give more companies access to online computing, online storage, and other software infrastructure for easier system building. In 2004, AWS released its first infrastructure service, Simple Queue Service (SQS) [55], followed by S3 and EC2 in 2006. The basic framework of Amazon’s web services had taken shape, initiating a market where hundreds of billions of dollars are now traded every year [56]. The era of cloud computing had come.

Cloud computing brings great benefits to companies that need to build software systems. On one hand, platforms like AWS provide virtual machines that users can rent to build software systems on, saving them from purchasing their own hardware. On the other hand, AWS also provides commonly used components such as message queues, databases, and storage, which makes construction easy. Since 2010, demand for cloud computing has grown stronger alongside the exponential growth of Internet companies: their businesses are built on software, and they crave responsiveness and cost-effectiveness, so cloud computing is a natural choice. AWS’s financial statements bear this out: in 2006, AWS’s revenue was only USD 21 million, but by 2021 it had soared past USD 62 billion, driven by the rise of the Internet and global companies’ growing enthusiasm for digital transformation; Amazon’s market capitalization peaked accordingly, approaching USD 2 trillion.

[Figure: AWS revenue growth]

As massive numbers of companies adopt cloud computing, problems are also surfacing, and cost is the most critical one. After building their software on cloud services, many Internet companies have realized that doing so costs much more than buying servers as they previously did: paying for server hardware is a one-time cost, while paying for cloud services recurs over time. For example, an octa-core 16 GB hardware server costs about RMB 17,000 [58], while renting an octa-core 16 GB cloud server costs about RMB 18,000 per year [59], roughly the hardware’s one-time price paid every year. Cloud native emerged against this background to address the cost issue and let companies truly enjoy the dividends of cloud computing. It is not only a set of techniques but also a philosophy: build software systems the way the cloud natively works, scalable and running on demand. In this vision, cloud-based software runs only when it is needed; a website should consume next to nothing when it has no visitors and activate larger clusters when vast numbers of users arrive. Since virtual machines are billed for as long as they run, a website that keeps its virtual machines up even with no user access still pays for them. What is needed, therefore, is a technique that lets a company’s software capacity follow its traffic. Serverless is that technique: it gives companies the key capability of using cloud computing natively, i.e., software that runs only while it is needed.

The term “serverless” was put forward by Iron.io in 2012 [60], and, just like cloud computing, serverless products existed before the term did: Google released one back in 2008, the Google App Engine [61]. As mentioned before, a technique attracts no attention until it is needed rather than when it is created, and this holds for both microservices and serverless. As Internet companies using cloud computing pushed for cost reduction, more and more of them adopted Lambda, the serverless product AWS rolled out in 2014; mainstream cloud providers including Azure and GCP followed with serverless products of their own. The University of California, Berkeley, which successfully forecasted the era of cloud computing, published another report, “Cloud Programming Simplified: A Berkeley View on Serverless Computing” [62], predicting that serverless is where cloud computing will go in the next decade. The report puts the definition simply: serverless computing = FaaS (Function as a Service) + BaaS (Backend as a Service), and a serverless service must scale automatically with no need for explicit provisioning and be billed based on usage. By 2020, half of AWS users had adopted Lambda to build their business systems, and according to a Datadog survey, Lambda invocations in 2021 had increased by 350% compared with two years earlier.

[Figure: Growth of AWS Lambda usage (Datadog)]
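To ground the FaaS half of the Berkeley definition above, here is a minimal AWS Lambda handler in Java using the official aws-lambda-java-core interface; the class name and payload shape are illustrative:

```java
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import java.util.Map;

// A function-as-a-service unit: no server to provision, invoked (and billed)
// only when an event arrives, scaled out by the platform as load grows.
public class HelloHandler implements RequestHandler<Map<String, String>, String> {
    @Override
    public String handleRequest(Map<String, String> event, Context context) {
        context.getLogger().log("invoked with: " + event);
        return "Hello, " + event.getOrDefault("name", "world");
    }
}
```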

VI. In the Future: A Brand New World

Even though serverless has come a long way, this is only the beginning. With its revolutionary mode of operation, serverless has led to tremendous changes in software architecture and brought great opportunities and challenges to the building of software systems. It is bound to play a leading role in cloud computing over the next decade, because its on-demand operation not only fulfills the original promise of cloud computing but also meets the essential user demand for lower costs and higher efficiency. A new era is on the horizon. Let us examine some of the changes it may bring.

1. In the Future, EDA will replace SOA to become the common design paradigm for software architecture

Event-Driven Architecture (EDA) [63] is nothing new. As stated earlier, technologies get noticed only when they are needed rather than when they are created, and no era needs EDA more than the Serverless Era. Owing to the on-demand nature of serverless, triggering serverless programs through EDA will become the dominant mode. Tracing the evolution of software architecture, the design paradigm based on Service-Oriented Architecture (SOA) [64] has always taken center stage, with Remote Procedure Call (RPC) as the default communication pattern of SOA-based software; consequently, most communication among programs has been synchronous. The emergence of serverless, however, makes it possible for asynchronous communication to rise to the top. The figure below is from Datadog’s 2022 report “The State of Serverless” [65].

[Figure: Top services invoking Lambda functions (Datadog, The State of Serverless, 2022)]

Among the top four services invoking Lambda functions, three, namely SQS, EventBridge, and SNS, operate asynchronously; only API Gateway is synchronous. It is foreseeable that message systems will become the main trigger source for serverless, and that building software systems with event-driven programming will become the new normal for enterprises.
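As a sketch of what message-triggered serverless looks like in practice, the handler below consumes SQS records pushed into a Lambda function, using the standard aws-lambda-java-events types (the queue-to-function wiring lives in AWS configuration, not in code; names are illustrative):

```java
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.SQSEvent;

// The queue is the trigger source: Lambda polls SQS on the function's behalf
// and pushes batches of records into this handler; no service waits on a reply.
public class OrderEventHandler implements RequestHandler<SQSEvent, Void> {
    @Override
    public Void handleRequest(SQSEvent event, Context context) {
        for (SQSEvent.SQSMessage record : event.getRecords()) {
            // Each record is one asynchronous message from the producer side.
            context.getLogger().log("processing: " + record.getBody());
        }
        return null;
    }
}
```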

2. Serverless will connect everything from cloud to ecosystem in an event-driven manner

On one hand, RPC, a synchronous communication pattern, is quite distinct from asynchronous messaging. Like a phone call, RPC-based communication is bi-directional: once one side sends a request, the other must respond, or the communication fails. Moreover, a program can serve only a limited number of RPCs at a time, so other callers end up waiting. Asynchronous communication, on the other hand, is more like sending someone a message on Facebook: the recipient’s availability does not affect delivery, and it is possible to communicate with many people at the same time. Serverless benefits greatly from asynchronous communication, which significantly lowers the cost of software integration. In the past, an SOA-based system could rarely contain even a thousand microservices; the more services a system has, the more complicated it becomes, causing a tremendous drop in availability. An event-driven system, by contrast, can support hundreds of thousands of serverless programs and is far less affected by physical location. The reason is that RPC-based communication between two services that cross clouds or data centers can fail due to timeouts, network malfunctions, and so on, problems that asynchronous communication absorbs.
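The decoupling can be illustrated with a deliberately tiny in-process sketch: the sender enqueues and moves on immediately, whether or not any consumer is currently running, which is exactly what an RPC caller cannot do. In a real system, a durable message queue sits where the in-memory queue is here:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class AsyncDecoupling {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();

        queue.put("event-1"); // "send and forget" -- returns at once
        queue.put("event-2"); // no consumer exists yet, nothing fails

        // A consumer can come online later and drain at its own pace.
        Thread consumer = new Thread(() -> {
            try {
                while (true) {
                    System.out.println("handled " + queue.take());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.setDaemon(true);
        consumer.start();
        Thread.sleep(200); // let the consumer drain before the demo exits
    }
}
```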

The advantages of asynchronous communication make it possible to build large-scale, serverless-based business systems that cross clouds and regions, opening more possibilities for digitalized companies, including large-scale AI, IoT, and autonomous driving systems in the future. Future IoT systems, for example, will almost certainly span clouds, edges, and terminal devices, responsible respectively for big data analysis, data aggregation and real-time analysis, and data reporting and instruction execution. Such communication also gives companies more freedom in how they build business systems: where user systems were once confined to a single cloud, systems built across multiple clouds protect users from vendor lock-in and bring costs down, improving competitiveness. Imagine a user who needs to build a serverless-based image processing business. Storing images in AWS S3 is convenient and cost-efficient, yet suppose the image processing service on Google Cloud offers higher precision and speed. The best solution is then an event-driven, cross-cloud scheme: every time an image is stored in AWS S3, an event triggers a serverless program that invokes the Google Cloud image service to process it.
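A hedged sketch of that scenario: a Lambda function subscribed to S3 “object created” events forwards each object reference to an external image-processing endpoint. The endpoint URL below is a made-up placeholder, not a real Google Cloud API:

```java
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.S3Event;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical cross-cloud flow: an S3 event triggers this function, which
// forwards the object reference to an image-processing service hosted elsewhere.
public class CrossCloudImageHandler implements RequestHandler<S3Event, Void> {
    private static final HttpClient HTTP = HttpClient.newHttpClient();

    @Override
    public Void handleRequest(S3Event event, Context context) {
        event.getRecords().forEach(record -> {
            String bucket = record.getS3().getBucket().getName();
            String key = record.getS3().getObject().getKey();
            HttpRequest request = HttpRequest.newBuilder()
                    // Placeholder URL standing in for a cross-cloud image service.
                    .uri(URI.create("https://images.example-gcp-service.com/process"))
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(
                            "{\"bucket\":\"" + bucket + "\",\"key\":\"" + key + "\"}"))
                    .build();
            try {
                HTTP.send(request, HttpResponse.BodyHandlers.discarding());
            } catch (Exception e) {
                context.getLogger().log("forwarding failed: " + e.getMessage());
            }
        });
        return null;
    }
}
```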

3. In the future, CloudEvents will become a new de facto standard for application communication

A new era needs a new standard. In the Serverless Era, the business systems users build feature larger scale, broader reach, and more diversified services. The first two features have been discussed above. Systems using traditional SOA architecture mainly contain a handful of services developed by the user and are relatively simple. Business systems in the Serverless Era, however, comprise many serverless services, many cloud services that may come from different cloud providers, and even IoT devices. Frequent communication among numerous cross-cloud systems with different architectures requires a unified data standard, and CloudEvents is the best option. Initiated by the CNCF, CloudEvents is a specification aimed at function portability across cloud providers and interoperability in event stream processing. It naturally fits serverless and is supported by various serverless products such as Knative, OpenFaaS, Serverless.com [66], and IBM Cloud Code Engine. At present, no message queue product on the market fully supports the CloudEvents specification.
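For a feel of the standard, the sketch below builds a minimal event with the CNCF CloudEvents Java SDK (io.cloudevents); the id, source, type, and payload are illustrative. The required attributes travel with the payload, so any compliant receiver can route the event without knowing which cloud or product produced it:

```java
import io.cloudevents.CloudEvent;
import io.cloudevents.core.builder.CloudEventBuilder;
import java.net.URI;
import java.util.UUID;

public class BuildEvent {
    public static void main(String[] args) {
        // id, source, specversion, and type are the required CloudEvents attributes.
        CloudEvent event = CloudEventBuilder.v1()
                .withId(UUID.randomUUID().toString())
                .withSource(URI.create("/order-service"))
                .withType("com.example.order.created")
                .withDataContentType("application/json")
                .withData("{\"orderId\":\"1001\"}".getBytes())
                .build();
        System.out.println(event);
    }
}
```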

Undoubtedly, message systems will be the core fundamental component of EDA-based software systems in the Serverless Era. But today’s MQs face great challenges in receiving and sending events in serverless scenarios. Most of the popular MQs we see now emerged around 2010, developed for big data and large-scale microservice scenarios; serverless was not yet on their designers’ radar. As a result, serverless applications built on these MQ architectures run into many issues.

There are four main aspects. The first issue is that traditional MQs cannot directly trigger serverless products such as Lambda to run. The primary reason is that current MQs’ consumer model is based on pulling, whereas serverless needs a push model, so there is no way to send messages to a serverless system directly. In 2020, Gartner presciently pointed out this issue in its report on choosing the right event brokers [67].

The second issue is that current message systems limit the scalability of serverless. Take Kafka as an example: under its storage model, each topic is divided into several partitions. The number of consumers in a group consuming a single topic cannot usefully exceed the number of that topic’s partitions, because each partition can be consumed by only one consumer in the group. This works fine in traditional ETL scenarios, but a serious problem arises when Kafka’s downstream is a serverless cluster: no matter how many consumer instances the serverless platform spins up under its scaling strategy, any consumers beyond the partition count receive no messages at all [68].
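The sketch below, using the standard Kafka Java client with placeholder names, illustrates both this point and the previous one: consumption is an explicit poll loop (the pull model), and group membership beyond the partition count buys nothing:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class PullWorker {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "image-workers");           // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            // Pull model: the client must keep asking the broker for data.
            // If "orders" has 16 partitions, at most 16 members of the
            // "image-workers" group get an assignment; the 17th and later
            // instances that a serverless autoscaler starts would sit idle.
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d value=%s%n", record.partition(), record.value());
                }
            }
        }
    }
}
```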

The third issue is that serverless needs to handle a large number of cloud events, which may require filtering, transformation, and similar processing in transit, but the mainstream message queues are weak at such processing. Kafka’s broker has no inherent event processing ability, its message filtering being delegated to Kafka Connect, while RabbitMQ, ActiveMQ, and other MQs cannot process messages at all.

The fourth issue is the rebalancing mechanism that popular message queues adopt for load balancing. A rebalance is triggered whenever a consumer joins or leaves the consuming cluster, and it forces every client in that cluster to rebalance its load. Since serverless systems start and stop instances frequently, rebalancing fires again and again, which leads to huge overhead.
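The effect is visible from the client side: Kafka’s consumer API exposes rebalance callbacks, and every membership change fires them across the entire group. A minimal sketch, written as a helper that could wrap the consumer from the previous example:

```java
import java.util.Collection;
import java.util.List;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class RebalanceLogger {
    // Every consumer join/leave pauses consumption group-wide while
    // partitions are revoked and reassigned across all members.
    static void subscribeWithLogging(KafkaConsumer<String, String> consumer) {
        consumer.subscribe(List.of("orders"), new ConsumerRebalanceListener() {
            @Override
            public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                System.out.println("rebalance: revoked " + partitions);
            }

            @Override
            public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                System.out.println("rebalance: assigned " + partitions);
            }
        });
    }
}
```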

Based on the above analysis, to better support serverless scenarios, future message systems should offer at least the following features:

  • Message delivery should adopt the push model, so that messages can be pushed to serverless systems to trigger functions.
  • CloudEvents [69] should be natively supported. It is the event interaction standard of the cloud native era, already adopted by the main open source serverless platforms such as Knative and OpenFaaS; native support allows standard events to be sent to those platforms directly.
  • More diversified event-handling abilities, such as filtering and transformation, should be provided.
  • The rebalancing issues of current message queues should be solved.

Presently, few message queue products provide full support for serverless scenarios. The EventBridge-style services offered by cloud vendors (AWS EventBridge, or Azure’s counterpart Event Grid) may be an option: they can push events, directly triggering serverless programs, and can filter events to some extent. However, these vendor services share a common issue: they interact best with their own companies’ cloud products, while support for other vendors’ products and for open source software is much weaker. Companies that adopted them for communication among applications have therefore found themselves locked in. History repeats itself in amazing ways; this closely resembles the situation before the emergence of the AMQP protocol. This time, a new open source message queue product is rising to fulfill the mission of the Serverless Era.

References

  44. Alphabet: Number of Employees 2010-2022 | GOOG. Macrotrends. 2022.
  45. Market cap history of Alphabet (Google) from 2014 to 2022. CompaniesMarketCap. June 2022.
  46. Microservices. Wikipedia.
  47. Von Gosling, Wang Xiaorui. Apache RocketMQ - Trillion Messaging in Practice. YouTube. May 22, 2017.
  48. Jia Feng. Outlook on Apache RocketMQ 5.0 | The Past, Present and Future of RocketMQ. Likes. 2018.
  49. https://img-blog.csdnimg.cn/785b9ba8492e45d9841193de2960c784.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBA5py65pm65YW1,size_20,color_FFFFFF,t_70,g_se,x_16
  50. Most popular open source software in China 2019. OSChina.
  51. Apache RocketMQ.
  52. Haiting Jiang. Practice of Didi Chuxing on building enterprise message queue service based on RocketMQ. CSDN. November 6, 2018.
  53. Ron Miller. How AWS came to be. TechCrunch. July 2, 2016.
  54. Kevin McLaughlin. Andy Jassy: Amazon’s $6 Billion Man. CRN. August 4, 2015.
  55. Jeff Barr. The AWS Blog: The First Five Years. AWS Website. November 9, 2009.
  56. Guoqiang Yang. IDC: Cloud computing market values over USD90 billion in 2021, AliCloud taking the third largest share worldwide. STCN. April 28, 2022.
  57. Timothy Prickett Morgan. Navigating the Revenue Streams and Profit Pools of AWS. The Next Platform. February 5, 2018.
  58. Parameters of Dell PowerEdge R750xs server. PConline.
  59. Cloud Server CVM. Tencent Cloud.
  60. Luo Hao. An Article to Know the Origin, Development and Implementation of Serverless. CSDN. January 18, 2022.
  61. Serverless computing. Wikipedia.
  62. University of California, Berkeley. Cloud Programming Simplified: A Berkeley View on Serverless Computing. February 10, 2019.
  63. Event-driven architecture. Wikipedia.
  64. Service-oriented architecture. Wikipedia.
  65. The State of Serverless. Datadog. June 2022.
  66. Serverless.com.
  67. Gary Olliffe. Choosing Event Brokers: The Foundation of Your Event-Driven Architecture. Gartner. June 22, 2020. p. 12.
  68. Best message queue for HPA on Kubernetes. Reddit. 2022.
  69. Why CloudEvents?. CloudEvents.