Using neural nets to help system administrators operate large OpenStack instances
Big software is a new class of software composed of so many moving pieces that humans cannot design, deploy or operate it unaided. OpenStack, Hadoop and container-based architectures are all examples of big software. The only way to address this complexity is with automatic, AI-powered analytics.
Canonical and Skymind are working together to help system administrators operate large OpenStack instances. With the growth of cloud computing, the volume of operational data has surpassed what humans can process. Overwhelming amounts of data make it difficult to identify patterns such as the signals that precede server failure. Using deep learning, Skymind enables OpenStack to discover patterns automatically, predict server failure and take preventative actions.
Canonical, the company behind Ubuntu, was founded in March 2004 and launched its Linux distribution six months later. Amazon created AWS, the first public cloud, shortly thereafter, and Canonical worked to make Ubuntu the easiest option for AWS and, later, for other public cloud computing platforms.
In 2010, OpenStack was created as the open-source alternative to the public cloud. Quickly, the complexity of deploying and running OpenStack at cloud scale showed that traditional configuration management, which focuses on instances (i.e. machines, servers) rather than running micro-service architectures, was not the right approach. This was the beginning of what Canonical named the Era of Big Software.
Big Software is a class of software made up of so many moving pieces that humans cannot design, deploy and operate it alone. The term is meant to evoke big data, defined initially as data that cannot be stored on a single machine. OpenStack, Hadoop and container-based architectures are all big software.
The first challenge of big software is to create a service model for successful deployment - to find a way to support immediate and successful installations of software on the first day. Canonical has created several tools to streamline this process. Those tools help map software to available resources:
MAAS (Metal as a Service), a provisioning API for bare-metal servers.
Landscape, a policy and governance tool for large fleets of OS instances.
Juju, service modeling software used to model and deploy big software.
Big Software is hard to model and deploy and even harder to operate, which means Day 2 operations also require a new approach.
Traditional monitoring and logging tools were designed for operators who only had to oversee data generated by fewer than 100 servers. They would find patterns manually, create SQL queries to catch harmful events, and receive notifications when they needed to act. When NoSQL became available, the situation improved marginally, since queries could scale.
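To make the traditional approach concrete, here is a minimal Python sketch of hand-written alerting rules, the in-memory equivalent of a query that catches harmful events. The metric names, thresholds and host names are hypothetical and are not taken from any Canonical tool.

```python
from dataclasses import dataclass


@dataclass
class Metric:
    host: str
    name: str
    value: float


# Hand-written rules: metric name -> threshold above which an operator is notified.
# Both the metric names and the thresholds are hypothetical.
RULES = {"disk_used_pct": 90.0, "cpu_temp_c": 85.0}


def alerts(metrics):
    """Return one message per metric that exceeds its hand-written threshold."""
    out = []
    for m in metrics:
        limit = RULES.get(m.name)
        if limit is not None and m.value > limit:
            out.append(f"{m.host}: {m.name}={m.value} exceeds {limit}")
    return out


sample = [
    Metric("node-1", "disk_used_pct", 93.5),
    Metric("node-2", "disk_used_pct", 41.0),
    Metric("node-2", "cpu_temp_c", 88.0),
    Metric("node-3", "net_errors", 12.0),  # no rule exists, so this is never flagged
]

for line in alerts(sample):
    print(line)
```

The weakness is visible in the last sample: a signal nobody wrote a rule for is silently ignored, and every rule has to be discovered and maintained by a person.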
But that does not solve the core problem today. With Big Software, there is so much data that no human can cope with it, let alone find the patterns of behavior that result in server failure.
This is where AI comes in. Canonical believes that deep learning is the future of Day 2 operations. Neural nets can learn from massive amounts of data to find the needles in almost any haystack. Those nets are a tool that vastly extends the power of traditional system administrators, transforming their role.
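As a toy illustration of learning a failure pattern from data instead of writing it by hand, the sketch below trains a single logistic neuron in plain Python. It is a deliberately tiny stand-in, not a deep network and not DL4J, and the two features (a normalized disk error rate and a normalized temperature) and the synthetic training data are assumptions made up for the example.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(samples, labels, epochs=1000, lr=1.0):
    """Fit one logistic neuron with batch gradient descent on log loss."""
    n_features = len(samples[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(samples, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        w = [wi - lr * g / len(samples) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(samples)
    return w, b


def predict(w, b, x):
    """Estimated probability that a server with features x is about to fail."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


# Synthetic, hypothetical telemetry: (disk error rate, temperature), both scaled to 0..1.
rng = random.Random(0)
healthy = [(rng.uniform(0.0, 0.4), rng.uniform(0.0, 0.4)) for _ in range(20)]
failing = [(rng.uniform(0.6, 1.0), rng.uniform(0.6, 1.0)) for _ in range(20)]
samples = healthy + failing
labels = [0] * len(healthy) + [1] * len(failing)

w, b = train(samples, labels)
print("risk for a hot, error-prone server:", round(predict(w, b, (0.9, 0.9)), 3))
print("risk for a cool, quiet server:", round(predict(w, b, (0.1, 0.1)), 3))
```

Nobody tells the model where the threshold lies; it is inferred from labeled examples. A deep network applies the same idea to far more signals and far less obvious patterns.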
Initially, neural nets will be a tool to triage logs, surface interesting patterns and predict hardware failure. As humans react to these events and label data (confirming AI predictions), the power to make certain operational decisions will be given to the AI directly: e.g. scale this service in/out, kill this node, move these containers, etc. Finally, as AI learns, self-healing data centers will become standard. AI will eventually be able to modify code to improve and remodel the infrastructure as it discovers better models adapted to the resources at hand.
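The progression from advisory predictions to delegated decisions can be sketched as a simple gate: operators confirm or reject each prediction, and the model is only allowed to act on its own once enough of its predictions have been confirmed. The class name `ActionGate` and both thresholds are hypothetical, chosen only to illustrate the idea.

```python
from dataclasses import dataclass


@dataclass
class ActionGate:
    """Tracks operator feedback and decides when the model may act unaided."""

    min_reviews: int = 50        # hypothetical: feedback required before any autonomy
    min_precision: float = 0.95  # hypothetical: required share of confirmed predictions
    confirmed: int = 0
    rejected: int = 0

    def record(self, operator_confirmed: bool) -> None:
        """Store one operator verdict on a model prediction."""
        if operator_confirmed:
            self.confirmed += 1
        else:
            self.rejected += 1

    @property
    def precision(self) -> float:
        total = self.confirmed + self.rejected
        return self.confirmed / total if total else 0.0

    def may_act_autonomously(self) -> bool:
        total = self.confirmed + self.rejected
        return total >= self.min_reviews and self.precision >= self.min_precision


gate = ActionGate(min_reviews=5, min_precision=0.75)
for verdict in [True, True, False, True]:
    gate.record(verdict)
print(gate.may_act_autonomously())  # only 4 reviews so far, below min_reviews

gate.record(True)
print(gate.may_act_autonomously())  # 5 reviews, 4 confirmed
```

In practice such a gate would probably be kept per action type, so that a model trusted to scale a service is not automatically trusted to kill a node.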
The first-generation deep-learning solution looks like this: HDFS + Mesos + Spark + DL4J + Spark Notebook. It is an enablement model, so that anyone can do deep learning, but using Skymind on OpenStack is just the beginning.
Ultimately, Canonical wants every piece of software to be scrutinized and learned from in order to build the best architectures and operating tools.