5 Years of Building a Cloud: Interview with LivePerson OpenStack Cloud Leader

Koby Holzer has 17 years of experience working with large infrastructure environments, the last 4.5 of them at LivePerson as the director of cloud engineering, focusing specifically on OpenStack. His past experience includes working on technological infrastructure at prominent Israeli telecom companies. I have known Koby personally for the past few years, through discussing, lecturing and enjoying the great cloud and DevOps community in Israel.


Though unfortunately I didn’t have the pleasure of being at the last OpenStack Summit in April, I was thrilled to see Holzer pictured taking part in the keynote session at one of the most important global cloud events. What follows is the result of another great interview with a true cloud leader.

Ofir: Let’s start with a simple question. How was the OpenStack Summit?

Koby: The summit was amazing, and it’s getting better every year. It was my fourth one, and the most exciting one for me, as I had the opportunity to talk about LivePerson, OpenStack, containers, and all the new and exciting stuff we are doing with the LivePerson cloud.
It was the biggest convention yet with over 7,500 people in attendance. And it just continues to get bigger every year in every aspect — logistics, keynote sessions, educational tracks.
On the educational side, I particularly like the practical real-life cases. It’s very interesting to see how other companies are tackling OpenStack. My focus was on two specific tracks: the case studies, including those from AT&T and Comcast, and the containers track, which was very popular at the conference. The one session that I remember as particularly meaningful was a panel that included experts from the most popular container technologies in the market (i.e. Kubernetes, Docker, Mesos).

Ofir: How would you summarize the evolution of LivePerson over the past five years?

Koby: In early 2012 we started learning and evaluating OpenStack, which was at the Diablo version at that time. We started to play with it, building our cloud labs, and decided to go to production with a small portion of LivePerson (LP) services during the middle of that year. By the time we reached production we were already using Essex. Towards the end of 2012, with Essex in production, we decided to rewrite LivePerson’s service from one big monolithic service to many microservices.
The next step was adopting Puppet, which accelerated the consumption of our private cloud. R&D moved from consuming physical servers to virtual OpenStack instances. By the end of 2013 we had already created a large cluster with more than 2,000 instances across 4 data centers, and from then on it just continued to evolve.
In 2013-2014 we were dealing with the OpenStack upgrade challenge and managed to move to Folsom, then Havana and Icehouse. We try to upgrade as often as possible; however, the bigger our cloud gets, the more difficult that is.
In 2015 we reached a point where we had finished rewriting the service, and our new “LiveEngage” service was ready for production and real users. Today we have something like 8,000 instances across 7 data centers, running on more than 20,000 physical cores. 2016 is the year for us to migrate to containers and Kubernetes, something which we expect to span well into 2017.


Ofir: If you were to look over these last 4.5 years, what would you say have been, or are still, your main challenges?

Koby: I manage cloud engineering, and there is a rather large team of software developers here. We were very lucky that the R&D organization decided to move to microservices at the same time that we introduced OpenStack and Puppet. Looking back, I am not sure if it was planned, but the timing was just perfect. While development built a modern microservices-based service, ops adopted and implemented cloud and automation.
In terms of management challenges, I will just narrow it down and talk about the challenges that I see for 2016. Migrating 150 services to containers is something that my team cannot accomplish alone. We are in a continuous effort to maintain the partnership with R&D and create a joint effort when it comes to educating ourselves and being able to optimally use the new technologies. That includes moving from continuous integration to continuous delivery, and building a strong delivery pipeline.
The operations goal is to build an environment that enables R&D to own the service end-to-end, not only to develop it but also to be able to support a quality and robust production environment.


Ofir: Can you point to any specific challenge that you faced and overcame throughout your cloud journey?

Koby: One big challenge was the deployment and adoption across the organization of Puppet. If only the cloud production and operations team was using it, it wouldn’t have been enough. We needed our software developers to adopt Puppet as well and use it as a standard delivery method. And making 200 developers use a new technology doesn’t happen overnight as you can imagine.
I learned that you can’t easily convince that number of smart people just by saying “guys, this is great technology and it’s the only way we can deal with delivery”. We learned our lesson from that, and now we work much more closely with R&D, taking decisions together from the start.
Remember that this was almost 4 years ago. It took a management decision from the very top, LivePerson’s general manager, for everyone to understand that this was the way forward. Our entire R&D was instructed that all new updates would go to production using Puppet. A small team of DevOps experts was brought in to support and train the R&D teams and to make sure Puppet was being used on a daily basis. This team carried out workshops and were the people to go to if any questions were raised. It took around a year to bring everyone up to speed, and today Puppet is the main delivery tool.
Another challenge, which is common with OpenStack, is upgrades, at least with older versions. Even after 4 years of practice, an upgrade takes one engineer up to three months. This was the story for every upgrade until now. The most recent upgrade has been the biggest so far, mainly because our cloud has grown significantly and we also needed to upgrade the hosts in tandem.
Upgrading thousands of physical servers while maintaining the service uptime is no simple task. In order to do this we need to take a group of servers, run a live migration of workloads to the other servers, then upgrade and ensure nothing was harmed before bringing the group back into the pool. There are lots of considerations and activities behind this, including understanding and segmenting the sensitive workloads.
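Editor's illustration: the group-by-group procedure Koby describes (drain a group, live-migrate its workloads, upgrade, verify, return the group to the pool) can be sketched roughly as below. This is a minimal in-memory simulation, not LivePerson's tooling; the `Host` class, the `rolling_upgrade` function and the `health_check` hook are hypothetical names, and "live migration" here is just moving items between lists.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """Hypothetical stand-in for a physical compute host."""
    name: str
    version: str
    vms: list = field(default_factory=list)
    in_pool: bool = True


def rolling_upgrade(hosts, group_size, new_version, health_check=lambda host: True):
    """Upgrade hosts one group at a time without dropping any workload."""
    for start in range(0, len(hosts), group_size):
        group = hosts[start:start + group_size]

        # 1. Take the group out of the scheduling pool.
        for host in group:
            host.in_pool = False

        # 2. Move every VM to the least-loaded host still in the pool
        #    (simulated live migration).
        targets = [h for h in hosts if h.in_pool]
        if not targets:
            raise RuntimeError("no hosts left in the pool to receive workloads")
        for host in group:
            while host.vms:
                vm = host.vms.pop()
                min(targets, key=lambda h: len(h.vms)).vms.append(vm)

        # 3. Upgrade, verify, and only then return the host to the pool.
        for host in group:
            host.version = new_version
            if not health_check(host):
                raise RuntimeError(f"{host.name} failed verification after upgrade")
            host.in_pool = True
    return hosts
```

In a real cloud the sensitive workloads Koby mentions would need extra handling, for example by restricting which target hosts they may land on; the sketch leaves that out.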

The LivePerson team at OpenStack Day Israel. Photo: Lauren Sell.

Ofir: How do you manage to keep transparency throughout the upgrade process?

Koby: We built a smart cloud discovery solution which updates automatically. Transparency is key and we have complete control over each individual VM and service. The system records all activities and can be accessed using an API and UI.

Ofir: What 3 takeaway tips can you give from your experience?

Koby: 1. As the operations manager you should be able to build an efficient and professional team, the size of which obviously depends on the size of your OpenStack cloud. For a cloud that consists of thousands of hosts, you need at least 2 network professionals, 3 talented operations/engineering guys who are responsible for automating everything, and one storage guy. This team does not include the teams that operate the daily tasks and use the OpenStack resources for the LivePerson service.
In addition, you need to think of every management aspect. Security is not part of our team, although ideally it should be. We are supported by our R&D team’s security experts. When building your private cloud team, remember that your R&D doesn’t much care whether it’s OpenStack, physical servers, VMware, whatever. They just need the resources and the flexibility that the cloud and DevOps promise.
2. Learn from the past. With the Puppet challenge, it was like us telling R&D “we demand that you deliver with Puppet”, but as an IT leader you need to understand and market the value of the new technology. It never ends, but once you have done it, the next time will be easier, as I see with our current move to Docker and Kubernetes. Eventually, we want to work together as equals, with everyone adopting the technology together, learning together and coping with all challenges together.
In order to accomplish that you need to create a “feature team” that includes representation from all parties involved, including the architects, leading developers, operations, network and security.
Although that might be challenging, I strongly suggest educating the other parties, not only on the touch points between dev and ops but also on what happens behind the scenes of your cloud, so that they have the knowledge they need to use the OpenStack/Kubernetes APIs in particular. This is something that we are still working on with our R&D team. Together with containers, our developers will be able to enjoy real independence in provisioning and consuming resources. Connecting the software with the infrastructure and letting the developers decide what they need and when is the flexibility IT operations are responsible for.
3. Everyone should adopt the DevOps approach. R&D and Production are both developers, each with their own place in the delivery system. Although I am proud that LP is a cloud pioneer, we still have a way to go on that matter, and that’s exciting. Becoming Netflix or Google doesn’t happen overnight. The good news is that this road never ends and there is always something new to learn, adopt and do better.

Ofir: What are your future thoughts about the private cloud/cloud market landscape for the next 5 years?

Koby: I’m not sure about the next 5, so let’s start with the next 2 and move on. I think that in 2 years we will see hybrid clouds big time; this is also what we are aiming for. By using Kubernetes we will be able to use all the public clouds, and our own private one, in the same way, with the same teams and tools. What I want to see in LP is a very dynamic multi-cloud environment. For example, let’s say that Amazon just changed their prices and I know in real time that I can get a better price with Google. I will want all workloads and traffic to seamlessly move to GCP, and if it changes again the day after, I will want them to automatically move to a third public cloud. The workload migration will be based on a price/performance equation while taking into consideration the SLA of each workload.
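Editor's illustration: one simple reading of a "price/performance equation" constrained by each workload's SLA is to discard the clouds that cannot meet the SLA and then pick the lowest price per unit of performance among the rest. The sketch below assumes exactly that; `CloudOffer`, `choose_cloud` and all the numbers are hypothetical and do not describe any real provider or LivePerson's actual scheduler.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CloudOffer:
    """Hypothetical snapshot of what one cloud currently offers."""
    name: str
    price_per_hour: float   # assumed cost of one instance-hour
    performance: float      # assumed benchmark score, higher is better
    availability: float     # advertised availability, e.g. 0.999


def choose_cloud(offers, required_availability):
    """Return the cheapest offer per unit of performance that meets the SLA."""
    eligible = [o for o in offers if o.availability >= required_availability]
    if not eligible:
        raise ValueError("no cloud satisfies the workload's SLA")
    return min(eligible, key=lambda o: o.price_per_hour / o.performance)


# Made-up offers: re-running choose_cloud after a price change
# is what would trigger a migration in this model.
offers = [
    CloudOffer("private", price_per_hour=0.10, performance=1.0, availability=0.9999),
    CloudOffer("cloud-a", price_per_hour=0.08, performance=1.0, availability=0.999),
    CloudOffer("cloud-b", price_per_hour=0.06, performance=1.0, availability=0.995),
]
```

With these made-up figures, a workload that tolerates 99% availability would land on `cloud-b`, while one that requires 99.99% would stay on the private cloud.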
With regard to OpenStack, there is no doubt that today the compute, network and storage components are much better than they were 4 years ago, even enterprise ready. I think that those core components will keep improving and support larger scales, and so it will be easier and easier to upgrade seamlessly. The second priority is to have OpenStack integrate better with public clouds, with the burst workload, DR and backup projects supporting us everywhere: on our OpenStack private cloud, in EC2, Google and Azure. For example, Trove working the same way in the private cloud, on EC2 instances, on Google Cloud, etc. Since the future is hybrid, it just makes sense to have those extra cool projects work for me everywhere I choose. I think it will make OpenStack much stronger.
