What if collecting data center telemetry was a snap?

Believe it or not, this is the first true big blog post since I announced I was joining Intel. That is 1.5 years of being buried up to my neck in all the cool things Intel does in the data center. I could write a whole separate post just on how differently I see the world of computing now, but I will save that for another day.

Today is a special day, because in my time at Intel I have been busy: busy learning, building a team, and working on the problem space around the next generation of cloud computing within our data centers.

I am part of a team at Intel that has a specific vision for where we believe the next level of cloud evolution emerges. We call this Intelligent Resource Orchestration (IRO).

IRO is the idea that the cloud components running workloads and consuming hardware can do so in a highly automated way. Less human interaction, more use of modern patterns to achieve higher density, scale, and agility. This model is made up of a few primary domains that interact with each other:

  • Watching – The collector of information. The idea that all of the hardware and software state of the components and services can be consumed. This is not necessarily a single thing, but a practice of making data about the resources (like servers) available for wide consumption.
  • Deciding – The decision maker. This domain can have a multitude of purposes but serves them in a specific pattern: something happens, and it decides whether something else needs to happen. If a resource is requested, a system dies, or the load pattern on one cluster changes, then something may need to happen. It is the idea that a computer has the data provided by the Watcher and the context to make a decision to change the state of the systems. This is where things like schedulers, decision engines, and orchestration policies live.
  • Acting – The doer. This is the concept that any state that can change should be changeable in an automated fashion. Exposing good APIs with good patterns is key here. The idea is to reduce human intervention and expose the ability to change things directly to the Decider, which chooses what happens next.
  • Learning – The iterator. This is a scary new concept to some. But with the great expanse of computing power and modern innovations in both machine and deep learning, we are approaching a period where the computer is a useful tool for recognizing patterns in large sets of data. We need a specific domain wired to the Watch, Decide, and Act domains that looks for opportunities to evolve this loop the next time around. This keeps with the concept of removing the human from the loop. What if the computer could recognize that things are changing and make recommendations to improve the next decision, the telemetry to watch, or which APIs should be called upon for action? A rough sketch of how these domains might wire together follows below.
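
To make the shape of that loop a little more concrete, here is a minimal sketch in Go of how the four domains might fit together. To be clear, this is my own hypothetical illustration for this post — the type and interface names are invented and are not from any Intel product:

    package iro

    import "log"

    // Observation is one piece of telemetry about a resource.
    type Observation struct {
        Resource string
        Metric   string
        Value    float64
    }

    // Action is a state change the Actor knows how to apply via an API.
    type Action struct {
        Target  string
        Command string
    }

    // Watcher makes hardware and software state available for consumption.
    type Watcher interface {
        Observe() []Observation
    }

    // Decider turns observations plus context into intended actions.
    type Decider interface {
        Decide(obs []Observation) []Action
    }

    // Actor applies state changes in an automated fashion.
    type Actor interface {
        Act(a Action) error
    }

    // Learner reviews each pass of the loop and tunes the next one.
    type Learner interface {
        Review(obs []Observation, acts []Action)
    }

    // RunLoop is one turn of the watch -> decide -> act -> learn cycle.
    func RunLoop(w Watcher, d Decider, a Actor, l Learner) {
        obs := w.Observe()
        acts := d.Decide(obs)
        for _, act := range acts {
            if err := a.Act(act); err != nil {
                log.Printf("action %q on %s failed: %v", act.Command, act.Target, err)
            }
        }
        l.Review(obs, acts)
    }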

Here at Intel we are looking for ways to push each of the domains above closer to the reality of today’s data centers.

As we did this we recognized one key thing: all of the above sits on a foundation of data. You cannot make good decisions about where to place the next workload if the Decider is blind to the workloads already running. Or maybe it can see them but is blind to how those workloads actually interact, and so it makes poorer scheduling decisions. Valuable, accurate, and consumable data from the hardware and software is the key first step to the decisions, iteration, and change needed for IRO.

Internally we looked around at the data Intel Architecture provides, and it is vast. We also recognized that we should expand this even more and look for ways to extend into new areas with more data. But there are better questions for right now. We have a lot of telemetry: specific, contextual data about our systems. But is it easy to consume for an IRO model? Could consuming this telemetry from devices, servers, and more be even easier? Would that help in building better decisioning systems for cloud?

We looked at a lot of existing open source tools, and we loved some of them for specific things. We looked at a lot of internal Intel telemetry tools and loved them too (did you know Intel runs some of the largest engineering clusters in the world?). But each tool was focused on a subset of the problem space or used mechanisms that, while friendly on a few nodes, were less friendly on thousands of nodes. None of them seemed to answer the question for what we wanted to achieve with IRO. Which brings up why I came to Intel in the first place: cloud innovation.

The big dirty secret to innovation is failure. It always starts with a question like “What if?” or “Is it possible?”. Sometimes it just ends with “No, I can’t” or “This is not important right now”. But every once in a while it ends with “Yes, we can”.

My job at Intel is running a smart and collaborative group of engineers working on the emerging edge of IRO. Specifically, we are working on how to push the needle on cloud orchestration, scheduling, and telemetry. Part of the way we run this team is the idea of "fail fast". Things in software move so quickly now that it can be difficult to build successful solutions if they take too long to emerge from incubation. Inside Intel SDI we decided that a percentage of our work should be trying something with the idea that failure is ok. Instead of spending 5-7 years to research, design, build, and release, we would try smaller things quickly and get them in front of people who would care. If they work, great! If they don't, start over! This is not something you want to do on a large scale, but examples of this are out there, from Google to Netflix to GitHub. The idea that creativity sometimes spawns better ideas than heavy planning is important. And making smaller risky bets alongside the big ones might be a good idea.

Based on the idea of failing fast, the question the team asked ourselves was: “What would an operational framework focused on making the consumption of telemetry much easier look like?”

This question was the genesis of the project we took on in the first part of this year. We set off down the path of building a new telemetry framework with the purpose of making this problem easier for a model like IRO. It was a hard road, and one we were blazing for the first time within our organization. At the end of our alpha we took this project and opened it to Intel internal departments. We call this step an "internal open sourcing". Here was our moment of truth: did this approach make sense, or was it a "fail fast" project?

Well, I am sure you can guess I would not be writing a blog post about a failed project. Our internal "open sourcing" resulted in internal projects and teams talking to us about integrating it into their solutions, an Intel Labs group contributing and using it in their research, and some very positive responses from a few customers. So we moved a little further along and decided that this little innovation might be worth putting out into the open for everyone. Because of this, it is my privilege to introduce you to a new software framework from Intel we call snap.

Snap is a telemetry framework written in Golang for the purpose of making the consumption of data center telemetry easier. Today, Intel is open sourcing snap under an Apache 2 license for everyone. You can find snap on GitHub: https://github.com/intelsdi-x/snap

Let's get to the good part: what snap does.
Snap provides the ability to do a few key things:
  • Define telemetry workflows and run them on a schedule
  • Provide an open plugin model that decouples the actions in a workflow from the running of the workflow
  • Offer several operational improvements inspired by modern DevOps tool sets
  • Expose all state and commands through an API

The point of snap is to get something out of a system and sink that data somewhere it is needed. A key concept to this is the idea that the telemetry is often reused. Obtaining telemetry like VM saturation or CPU usage is valuable to operational teams, systems concerned with accounting and chargeback, and schedulers looking to place the next VM. This idea of reuse was influential in how we implemented the snap plugin model.
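
To give a feel for what a workflow looks like, here is a sketch of a snap task manifest: collect a couple of load metrics, smooth them through a processor, and publish to InfluxDB. I am writing this from memory of our current format, so treat the exact field names as illustrative (the plugin names and metric namespaces are just examples; the docs in the repo have the authoritative schema):

    {
      "version": 1,
      "schedule": {
        "type": "simple",
        "interval": "5s"
      },
      "workflow": {
        "collect": {
          "metrics": {
            "/intel/psutil/load/load1": {},
            "/intel/psutil/load/load5": {}
          },
          "process": [
            {
              "plugin_name": "movingaverage",
              "publish": [
                {
                  "plugin_name": "influx",
                  "config": {
                    "host": "127.0.0.1",
                    "database": "snap_metrics"
                  }
                }
              ]
            }
          ]
        }
      }
    }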

In snap every important system action is performed by a plugin. You have three types of plugins:

  1. Collector – This collects telemetry from *something* and forwards it on.
  2. Processor – This transforms the telemetry in some way. Encrypting, changing an object model, serializing, machine learning at a node level, or even allowing a policy engine to be injected.
  3. Publisher – This is the plugin that sinks the data into another system that consumes the telemetry. That system can be common messaging or a database like RabbitMQ, Kafka, MySQL, or InfluxDB. It could also be things like email, files, or custom publishing to a private API.

What is important in the snap plugin model is that each plugin operates independently and that the snap framework allows you to wire them together in multiple ways. You can use collector plugins to grab specific sets of telemetry and forward it through a processor that *learns* what normal is and filters out noise, which in turn publishes the filtered data into RabbitMQ for pickup by another system. And at the same time you can forward the same collector telemetry directly to InfluxDB to populate an operational dashboard. The goal with snap is to make the description of all of this declarative — the task manifest sketched above is exactly that description.

Plugins in snap are also loaded dynamically at runtime. This was a big requirement we wanted to meet to make snap more operationally friendly and to enable the clustering I write about a bit further down. In snap, all operations on plugins are completely dynamic. You can load new collectors, processors, or publishers at any time; likewise, you can unload any of them at runtime. No restarting the service, and no configuration management layered on top of your telemetry daemon.
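
To make that concrete, here is roughly what the day-to-day looks like with snapctl. I am sketching from memory, so take the exact argument order as illustrative; `snapctl --help` has the authoritative syntax:

    # Load a newly built collector into a running daemon - no restart.
    $ snapctl plugin load snap-plugin-collector-psutil

    # See what is currently loaded.
    $ snapctl plugin list

    # Unload at runtime too (type, name, version).
    $ snapctl plugin unload collector psutil 7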

Part of the way we accomplish this is with the concept of the metric catalog. When you load a collector into snap, its metrics (unique telemetry items) are added to a single catalog. It is this catalog you select against when you create a task and run a workflow. This abstraction means you are not selecting the "Intel CPU" plugin but the specific Intel metrics you want to collect. This is important because we support dynamically upgrading plugins while a cluster of snap daemons is running: metric selections in your workflow will automatically use the newest plugin version implementing that metric. As plugin creators (Intel is one; hopefully others will join) release new versions, customers downstream can upgrade without service disruption. The same applies to processor and publisher plugins. And we make this even easier with the tribe management I walk through below.
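
And the metric catalog side of it, again as a rough sketch of the CLI:

    # Browse the single catalog built from every loaded collector.
    $ snapctl metric list

    # Load a newer build of the same collector. Running tasks that
    # select its metrics start using the newest version automatically.
    $ snapctl plugin load snap-plugin-collector-psutil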

Snap already has a host of plugins available with today's release, including:

Collector:

Ceph
Docker
Facter
Libvirt
Intel NodeManager
Intel PCM
Linux Perfevents
PSUtil
Intel SMART (disk)

Processor:

Movingaverage

Publisher:

SAP HANA
InfluxDB
Kafka
MySQL
OpenTSDB
PostgreSQL
RabbitMQ
Riemann

And we have plugins in flight right now for Ethtool, IOstat, Nova, Open vSwitch, and OSv. For a complete list, see our Plugin Catalog, which we will keep updated as things develop.

Plugin authoring itself is built to be easy to accomplish with some Golang savvy. See our authoring guide and our best practices document.

Right now plugins are normally written in native Golang, but we also support a JSON-RPC interface for writing plugins in any language. We have plans for plugin client libraries in Java, Python, Ruby, and C++ soon. Plugins are written and compiled separately from snap itself, which means you can choose your own license or even keep your plugins private or proprietary if you prefer – we prefer open sourcing 🙂
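
To give a taste of authoring, here is a heavily abbreviated sketch of what a collector looks like. I am deliberately paraphrasing rather than quoting the real interface (the authoring guide above has the actual types and registration call), and the "widget" device and its metric are invented for illustration:

    // Package widget is a hypothetical, simplified view of what a
    // snap collector implements. See the authoring guide for the
    // real interface and how plugins register with the framework.
    package widget

    import "time"

    // Metric is one item of telemetry, addressed by a namespace
    // like /acme/widget/temperature.
    type Metric struct {
        Namespace []string
        Data      interface{}
        Timestamp time.Time
    }

    // WidgetCollector gathers telemetry from an imaginary device.
    type WidgetCollector struct{}

    // GetMetricTypes advertises what this plugin can collect; snap
    // merges this into the metric catalog at load time.
    func (w WidgetCollector) GetMetricTypes() []Metric {
        return []Metric{
            {Namespace: []string{"acme", "widget", "temperature"}},
        }
    }

    // CollectMetrics is called on each task interval with the
    // metrics a workflow selected from the catalog.
    func (w WidgetCollector) CollectMetrics(requested []Metric) ([]Metric, error) {
        out := make([]Metric, 0, len(requested))
        for _, m := range requested {
            m.Data = readWidgetTemperature() // talk to the *something*
            m.Timestamp = time.Now()
            out = append(out, m)
        }
        return out, nil
    }

    // readWidgetTemperature stands in for real device access.
    func readWidgetTemperature() float64 { return 42.0 }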

Controlling snap is just as important as, if not more important than, what it can do. To start, we decided that all operations and data from the snap daemon would be exposed over a REST API. Anything snap can do can be controlled via this API. We provide a CLI tool called snapctl that wraps the snapd REST API. The choice of REST was important because we want snap to be something another control system can manage, and something that integrates easily into existing customer solutions. Snap does not require complex configuration management to control service restarts or changes; everything is dynamic and available over the API.
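
Since everything snapctl does is REST underneath, you can drive snapd from anything that speaks HTTP. A quick sketch, assuming snapd is listening on its default port (8181 in our builds — check your flags) and the v1 paths as I remember them:

    # Everything snapctl does goes through calls like these.
    $ curl http://localhost:8181/v1/plugins   # loaded plugins
    $ curl http://localhost:8181/v1/metrics   # the metric catalog
    $ curl http://localhost:8181/v1/tasks     # scheduled tasks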

We also made a strong effort to secure snap for this first release. We provide the ability to cryptographically sign compiled plugins and verify the signatures on the snap daemon before loading or running. We encrypt the communication channel between plugins and the daemon. And we provide the option to secure the REST API endpoint for snap.

But the CLI and API are not the only tricks in the bag when it comes to snap. One of the key needs of the IRO model is that as the size of the resources grows, the work to manage and maintain them does not become too cumbersome. To this end we planned from day one to use novel ways to control snap and make management easier.

Within snap we implement this operational automation with a feature we call tribe. Tribe allows you to cluster a group of snap nodes into a "tribe". The tribe can then implement a feature we call an "agreement": specific behavior the tribe agrees on, like running the same plugins or the same set of tasks. This allows an end user to take an entire compute farm of snap-enabled nodes, group them into a tribe, and implement agreements so that they all run the same tasks and plugins. If a user loads a new plugin into any member of the tribe, the other members recognize that they need to load the plugin too and begin to share the plugin around. The same goes for running a task: creating a new task to run a specific workflow against any member of the tribe implements that task against all of them. There is no master, so requests can go to any node in the tribe.
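
A rough sketch of what this looks like in practice — the exact flag and subcommand names may shift before the next release, so treat these as illustrative and see the tribe docs for the real syntax:

    # Start snapd with tribe enabled, seeding from an existing member.
    $ snapd --tribe --tribe-seed 10.0.0.1

    # Create an agreement and join nodes to it; plugins and tasks
    # applied to any member now propagate to all of them.
    $ snapctl agreement create all-compute-nodes
    $ snapctl agreement join all-compute-nodes node-017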

The end result is that the operational cost of loading a plugin or creating a workflow task for one snap node is close to the same as it is for 100 or 1,000 snap nodes. For more info we have specific tribe examples and information here. We are looking to expand tribe agreements to also contain things like configuration, logging, and more.

There is no possible way I could go over all the features of snap in one blog post. I could try to cover stuff like the extensible scheduling, workflow routing, metric query support, and more. But the team has done an amazing job trying to capture all of this in the documentation for snap.

I will mention that this is just the start for us. There are several core features that have not been implemented yet in this beta but are on the roadmap for the next big release. These include:

  • Windows Support
    • A big reason to choose Golang was for its cross-compilation abilities. We want to extend all the goodness of snap to the Windows world and not just Linux and OSX.
  • Distributed Workflows
    • Right now the workflows are performed on a single node at a time (collect=>process=>publish). We have the ability with tribe to discover and allow workflows to operate across snap nodes. When we built each component of snap we heavily decoupled core modules allowing us to enable specific roles later. We can eventually stand up clusters of just collectors that send their information to a cluster of processors which in turn send to a small group of nodes that publish into a system. This flexibility means we can reduce the impact on workload nodes and fully utilize specific hardware for things like encryption or machine learning.
  • Event Subscription
    • Right now the telemetry collected by plugins is gathered on a schedule. We want to add the ability for events to trigger the same workflows rather than running them on a schedule. This is an important feature for more performant monitoring.
  • Routing expansion
    • Under the covers of snap is the ability to load balance multiple plugins. We have the ability to use this for enabling a greater scale for future snap plugins.

In addition, we will be releasing a host of plugins for exposing Intel Architecture telemetry. We have our sights set on powerful CPU, memory, networking, and specific workload metrics. Intel and the SDI team are committed to exposing as much as we can in 2016. We already have internal customers at Intel looking to utilize it for their own needs.

With this open source release snap is in beta. We are looking for a few things now that we have it in the open:

  1. Comments/issues/feedback/bugs – If you find a bug we will fix it. If you want a feature we will look into it.
  2. Maintainers – Long term, we would prefer this project be maintained by a mix of people trying to solve the problems in this space. We want you to help. If you are interested in becoming a maintainer and have the chops, reach out to one of the current maintainers listed in the README.
  3. Plugins – Build something. If it works, tell us about it and we can add your repo to our Plugin Catalog. What if snap could collect from storage arrays or VMware clusters, or sink data into New Relic and others? We can do a lot, but the ecosystem can do so much more.
  4. Examples/Blog Posts/Demos – If you do something cool, we will link to it.

So that is it. It is out there and ready for you to play with at home or work 🙂

We are excited to try and be a part of enabling the Intelligent Resource Orchestration model for our customers. And this is just the beginning. We are already down the path on some new questions around the IRO problem space so stay tuned for more things in 2016. And of course we are hiring. If this project or something like this would be interesting to you and you like working in healthy collaborative teams of good people, give us a ping at sdirecruiting@intel.com.

.nick


