By Janie Chang
February 17, 2015 9:00 AM PT
Many real-time services such as e-commerce, online gaming, and social media depend on cloud-computing platforms. Designing those services to be scalable and reliable, however, is a challenge.
On January 23, 2015, Microsoft released Orleans, an open-source platform that provides a straightforward approach to building high-scale distributed computing applications. Orleans, available for download from GitHub, simplifies programming for developers by removing the need to learn and apply complex concurrency logic or scaling patterns. It was designed for use in the cloud and has been used extensively on Microsoft Azure.
The main benefits of Orleans are developer productivity and transparent scalability, two of the most prominent challenges when developing services for the cloud. Sergey Bykov of Microsoft Research has experienced those challenges firsthand from his days with Microsoft's Online Services Division.
"The lifecycles of high-scale cloud services seem to follow a pattern," Bykov says. "The service is designed to scale up to a certain initial limit. Then the service hits a brick wall at, say, a hundred thousand concurrent users. At that point, the developers have to re-architect completely, because you can't achieve significant scalability with just a few incremental changes. Just to face the same challenge at a new scale level. Another brick wall, another re-engineering project."
What's worse, these bottlenecks occur just as the user base is growing.
"It's like doing a heart transplant on your system," Bykov says. "It's expensive and risky. I found this a really interesting problem, so when the opportunity came up, I decided to join. I liked the group's focus on building technologies that prove the research. I joined Project Orleans on day one."
Bykov is passionate about Orleans because of how it simplifies programming in the cloud. He also enjoys solving problems that include many pieces, thanks to an appreciation for physics.
"People tend to equate computer science with mathematics, but I think of programming as more similar to physics," he explains. "During my early school years in Russia, I had an exceptional physics instructor who had coached hundreds of students to compete in the Physics Olympiads. I learned that when you have a problem in physics, first you must understand what pieces are involved, how they connect, and how they influence each other. You build a mental model, and only then would you try to solve the problem, and eventually apply math.
"When you program systems, it's the same. You identify the pieces involved, how they interact, build a model, and then you apply mathematics and programming. In reality, you spend a very small percentage of time writing code and more time figuring out why things don't work as intended."
It's a mindset that's been an advantage for Bykov throughout Project Orleans.
“Orleans solved many of the issues the Halo team had been investigating.”
— Sergey Bykov
The project had two main goals. The first was to improve programmer productivity by making cloud-services development much simpler and faster. The second was reliable scalability. If Orleans could enable a developer trained for single-machine desktop applications to build services for the cloud, then Orleans would deliver huge productivity gains.
Project Orleans found that existing tools for building in the cloud didn't do much to simplify application development. Experienced developers still had to write code to manage systems resources. The researchers felt there was room for huge programmer-productivity gains if they could remove the need to code lower-level ‘plumbing’ that is difficult to get right.
The development team at 343 Industries for the popular game “Halo 4” (Xbox 360) felt the same way. In 2011, they were looking at building their own tools to simplify programming in the cloud.
"It was serendipity," Bykov says. "Tamir Melamed, development manager with the Halo services team, attended an internal presentation on a completely unrelated topic. He fell into conversation with Ravi Pandya, at the time an architect on Project Orleans. After some in-depth discussions between our groups, everyone realized that Orleans solved many of the issues the Halo development team at 343 Industries had been investigating. We decided to join forces."
The Project Orleans team had a lot to prove. Development for “Halo: Combat Evolved Anniversary” (Xbox 360) was already in progress, and the game's release date, Nov. 15, 2011, was just three months away.
"Fortunately, the Project Orleans team at Microsoft Research is filled with talented systems engineers who combine principled, academic rigor with a pragmatic understanding of the value in developer productivity," says Hoop Somuah, principal software engineer for the Halo services team. "This was evident not only in the features of Orleans, but also in our interaction with the team. They really understood what it meant to work to tight timelines."
The two teams worked closely to meet the release date. They held daily progress reviews, resolved issues, and measured performance. The Halo team completed the release of “Halo: Combat Evolved Anniversary” (Xbox 360) with a couple of weeks to spare—a testament to the productivity gains and reliability that Orleans delivered.
As further testament, the Halo team decided to build all the services for the 2012 release of “Halo 4” (Xbox 360) using Orleans.
Game developers achieved dramatic productivity gains, but as Bykov notes, the research team profited as well. The researchers gained insights that led to a significant contribution for Orleans: the concept of the ‘virtual actor’, an abstraction implemented by the Orleans programming model.
“We were able to build new services with ease and confidence in their scalability.”
— Hoop Somuah
The notion of actors, which can be described as concurrent computational entities, has been around since the 1970s. When programming within an actor model, developers must write code that manages actors explicitly: look up whether particular actors are already activated, activate them if necessary, and deactivate them.
"When we began working on scenarios for ‘Halo: Combat Evolved Anniversary’ and ‘Halo 4’ for the Xbox 360," Bykov recalls, "we discovered that in a fast-paced environment such as this, where actors can be game sessions which are always coming and going, it was very challenging to write code to look up whether actors are activated, then activate or deactivate them. It was possible, but you ended up writing a lot of code.
"That's when we had our ‘aha’ moment. What if programmers could write code with the assumption that all the actors are always there? Then they wouldn't have to check whether certain actors have been activated, or create them, or deal with a situation when someone else tries to create the same actor at the same time. That would make application coding far simpler. We didn't come up with the term ‘virtual actor’ until later, but we developed the notion during our collaboration with the ‘Halo 4’ team."
This approach, described in the technical paper “Orleans: Distributed Virtual Actors for Programmability and Scalability” by Philip A. Bernstein, Bykov, Alan Geller, Gabriel Kliot, and Jorgen Thelin, solves a number of complex distributed-systems problems. The virtual-actor abstraction is a novel contribution that enables a simplified programming model that is both efficient and scalable.
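The difference between managing actors explicitly and assuming that "all the actors are always there" can be illustrated with a small sketch. The Python below is not the Orleans API (Orleans is a .NET framework); it is a toy, single-process model, and every name in it (`GameSession`, `VirtualActorRuntime`, `get_actor`, `deactivate`) is a hypothetical stand-in chosen for illustration. It shows only the core idea: application code addresses an actor by ID, and a runtime decides when an instance is created or reclaimed.

```python
import threading


class GameSession:
    """Hypothetical actor: one logical instance per game-session ID."""

    def __init__(self, session_id):
        self.session_id = session_id
        self.players = []

    def join(self, player):
        self.players.append(player)
        return len(self.players)


class VirtualActorRuntime:
    """Toy stand-in for a runtime that activates actors on demand."""

    def __init__(self, actor_class):
        self._actor_class = actor_class
        self._active = {}              # actor ID -> live instance
        self._lock = threading.Lock()  # guards concurrent activation

    def get_actor(self, actor_id):
        # Callers never check whether the actor exists. The runtime
        # activates it on first use and returns the same instance after,
        # even if two callers ask for the same ID at the same time.
        with self._lock:
            if actor_id not in self._active:
                self._active[actor_id] = self._actor_class(actor_id)
            return self._active[actor_id]

    def deactivate(self, actor_id):
        # Invoked by the runtime, not by application code, to reclaim
        # memory. A real system would persist state first; this sketch
        # simply discards it.
        with self._lock:
            self._active.pop(actor_id, None)

    def active_count(self):
        with self._lock:
            return len(self._active)


runtime = VirtualActorRuntime(GameSession)
# Application code addresses actors purely by ID.
first = runtime.get_actor("session-42").join("alice")
second = runtime.get_actor("session-42").join("bob")
```

In this sketch the lookup, activation, and race handling that Bykov describes as "a lot of code" live in one place inside the runtime, so the two `join` calls at the bottom need no lifecycle logic at all.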
“This simplicity ultimately also made these services easier to deploy, operate, and maintain on Azure.”
— Hoop Somuah
Anyone who has played “Halo 4” (Xbox 360) has relied on Orleans as part of the gaming experience. Members of the game’s team confirm that they could not have achieved the scale and level of user interactivity they needed without Project Orleans.
"Orleans enabled us to build, launch, and operate the dozens of services needed to support ‘Halo 4’ (Xbox 360) with a relatively small engineering team," Somuah says. "The programming model was very approachable, and the virtual-actor concept provided a clean separation between our application logic and network-topology details. This, in turn, enabled us to easily scale up our services to meet our user load efficiently and with cost efficacy.
"We were able to build new services with ease and confidence in their scalability and even rewrote some of our existing services to achieve greater scale. This simplicity ultimately also made these services easier to deploy, operate, and maintain on Azure."
Project Orleans has been used in production at Microsoft since 2011. Although it is best known as the basis of the “Halo 4” (Xbox 360) services architecture, Orleans actually was designed for general-purpose cloud programming, and it has been used for other cloud-based applications within Microsoft.
Bykov cites the example of a single developer who built a real-time analytics-aggregation service in four days. The Project Orleans team worked with the developer for two weeks to test and tune the service, after which the service was approved for production use.
Bykov and his team felt ready for feedback from a wider audience.
“You don’t need hundreds of servers to start reaping developer-productivity gains.”
— Sergey Bykov
During the Build 2014 conference in April, Microsoft announced a public preview of Orleans. The goal was to solicit feedback on developer productivity, especially comments regarding the usability of the Orleans application-programming interface and the virtual-actor model.
Positive feedback from the public preview led to the decision to release Orleans as open source. Businesses that run on cloud services require 24x7 uptime. Their developers and IT managers need the assurance of being able to access code and fix problems themselves, instead of waiting for vendor assistance. For this reason, open source has become the norm for cloud-computing tools. Furthermore, Project Orleans benefits from scaling out development to technical and academic communities, because users will have access to a diversity of innovations that Microsoft could not achieve on its own.
Bykov emphasizes that developers should be able to benefit from Orleans the moment they go from a single-server setup to a distributed system.
"The distributed runtime takes care of server failures, messaging, routing, single-threaded execution, and other system-level guarantees," Bykov explains. "You don't need hundreds of servers to start reaping developer-productivity gains." As a result, there has been a lot of positive feedback about the Orleans public preview.
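One of the guarantees Bykov lists, single-threaded execution, can be sketched in a few lines. The Python below is an illustrative assumption about what such a guarantee means, not how Orleans implements it: the `Counter` actor and `Mailbox` dispatcher are hypothetical names, and an `asyncio.Lock` stands in for whatever scheduling a real runtime uses. The point is that the actor's own code contains no locking, yet concurrent callers cannot interleave inside it.

```python
import asyncio


class Counter:
    """Hypothetical actor; its code is written with no locking of its own."""

    def __init__(self):
        self.value = 0

    async def increment(self):
        current = self.value
        await asyncio.sleep(0)   # yield point where interleaving could occur
        self.value = current + 1
        return self.value


class Mailbox:
    """Toy dispatcher: lets only one message run on the actor at a time."""

    def __init__(self, actor):
        self._actor = actor
        self._turn = asyncio.Lock()   # one "turn" at a time per actor

    async def send(self, method_name):
        async with self._turn:
            return await getattr(self._actor, method_name)()


async def main():
    counter = Counter()
    mailbox = Mailbox(counter)
    # 100 concurrent callers; the mailbox serializes their messages.
    await asyncio.gather(*(mailbox.send("increment") for _ in range(100)))
    return counter.value


result = asyncio.run(main())
```

Because every message passes through the mailbox, no increments are lost and `result` ends up at 100. Calling `counter.increment()` directly from 100 concurrent tasks would let the read and the write interleave and lose updates.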
"A programming model is how you think about a problem, the way you model a system, and only then the way you code," Bykov says. "Adopting a new programming model is inherently different than just using a new library to solve a specific standalone problem."
As the shift to cloud computing continues to change the way developers design and code, Bykov says it's rewarding to hear that users appreciate how Orleans is helping in that transition. "They recognize the value of not writing a lot of complicated distributed system code and compliment the simplicity, elegance, and power of the Orleans programming model."