Wednesday, May 22, 2013

A Proposal for Determining If A VM Is Used or Unused (work-in-progress)

Motivation

As virtualization technology has taken over the information services landscape, the cost - both in terms of money and effort - of deploying a new server has fallen dramatically. Since the commoditization of PC architecture servers in the 2000s, organizations have typically deployed one application per a server to isolate each application for ease of maintenance and security. If an application did not use all of its server's resources all of the time, the remaining computing resources went to waste. To recover that waste, many organizations have recently replaced servers, each running one application, with many virtual machines, also each running one application. These virtual machines all run on a few large physical servers. If an application is idle, then other applications can use the remaining resources. With this shift, VMware estimates that the infrastructure cost of one application, and thus one VM, is now down to $1774.00 [VMware 2012]. In addition to the lower infrastructure cost, automation has driven the deployment cost of a new virtual machine to near zero. With this dramatic cost drop, organizations have witnessed their population of virtual machines balloon. With these ballooning populations of virtual machines, comes ever greater potential for virtual machines to simply linger, unused, as their applications fall out of use, and as the human structure of an organization shifts over time.

For many organizations, the cost of having a human constantly review the virtual machine population for unused virtual machines would cost prohibitive. Review would be possible, however, if the majority of in-use virtual machines could be filtered out of the population, leaving only virtual machines with a high chance of being unused for review by a human. This article seeks to develop a classifier for flagging virtual machines as potentially unused using a statistical method.

Background


A systems administrator can use their experience to determine if a virtual machine is still being used by looking at various properties of that virtual machine and then making a judgement call regarding its level of use. The explicit process looks something like:

If,
  • Property 1 = a
  • Property 2 = b
  • Property 3 = c
  • Property 4 = d
then,
  • in my professional judgment, I believe this virtual machine to be unused.
For an administrator to make this judgement across hundreds of virtual machines would take many hours, so it would be best to automate this process.  To automate it, we need to capture it into computer code; and to capture it into computer code, we must express it mathematically.  So the question becomes,
Is there an existing mathematical model that captures this process, and thus approximates the systems administrator's expertise?
In fact, there is such a model.

Bayesian Classifiers 

Consider "this email is spam". It's boolean, that is true or false, and has some chance of being true. We denote that chance:



Think of a large area where each point is a possible email.  Area(spam) is the area of points that contains all of the emails you consider spam.  If Area(spam) has an area of zero, then no emails are spam. If Area(spam) equals one, then all emails are spam. Usually Area(spam) is some value between 0 and 1. [Moore n.d.] In fact, for all of the emails sent today, Area(spam) is .665. [Gudkova 2013]

To have computer determine whether or not an email is spam, the computer must use the properties of the email available to it to determine the overall probability that the email is spam - namely the words in the email.

The area of all email can be sliced up into partitions, where each partition contains all the emails that have a certain word like "rolex" in them. These partitions overlap the two larger partitions that split the area into spam emails and non-spam emails.

Using the above description, to determine if an email is spam, we ask, if an email has rolex in its corpus of words, what is the probability that the email is spam? [Graham 2002] That is,  out of all the emails containing rolex, how many are in the spam partition?

Conditional probabilities provide a model for this type of question:

(1)


the probability of condition U, given the probability of datum V.

The mathematicians Laplace and Bayes give a concise formula, Bayes Theorem, that relates conditional probabilities and overall probabilities for the condition and datum:

(2)

where, for our purposes, V is a measurable property of a virtual machine and U is the classification "unused".

There are many things that we can measure on each virtual machine. It would be good to be able combine them to find the probability that a particular virtual machine is unused. Thus, we would like to find a way to use the above equation to determine whether a virtual machine is "in use" or "unused", given several measurable properties. [Larsen 2001]

Buckets of Marbles


Consider two buckets, U and I, full of marbles of eight different colors. We want to be able to pull a marble at random from one of those two buckets and then estimate the probability that it was in one bucket or the other. That is, we want to use the color of the marble to estimate the classification of the marble, U or I. In terms of the areas we described above, imagine all of the marbles laid out flat with each color grouped together. Each color is an distinct area. Overlapping those distinct areas are two larger areas, U and I. We want to estimate how likely an arbitrary marble picked off the plane would have been plucked from under the U area, based on its color.

To this, we start by pulling a sample from each bucket, U and I. We count the different numbers of each color marble in the sample from each bucket. We count the total numbers of each color marble across both bucket samples. Finally, we count the total number of marbles in both samples.

Say we want to know the probability of a marble being from the U bucket given that it's red.  We can use the above relationship. Using the count of red marbles from the sample from U, we can estimate the number of red marbles in U.  This is P(red|U). Using the number of marbles in the U sample and the total number of marbles in both samples, we can estimate the number of marbles in the U bucket vs. the total number of marbles in both buckets. This P(U). And by counting the number of red marbles in both samples, we can estimate the total number of red marbles across both buckets. This P(red). The above equation says that we can estimate the probability that a red marble comes from the U bucket using this relationship between those values:

(3)

But what if the sample from one of the buckets has none of a particular color? This would make our count zero, and would not give us any estimate of the number of that color in the original bucket. Laplace gives us a slightly more complex estimator to "smooth" over that zero count [Smith 2009]:

Instead of count of a color divided by the number of marbles in the sample, we can use:

(4)

This smooths out the zero by assuming slightly lower likelihood of that color in the bucket than 1/total marbles in the sample.  This estimator gives us a way to estimate the likelihood of picking color from each bucket, even if the sample does not contain that color.

VM Classification

For this proof of concept, we need analyze a sample population of virtual machines, and then assign each VM a colored marble, based our analysis of that virtual machine. I propose doing this in the following way.

For simplicity, we will use eight colors, as we did above. To get those eight colors, we will measure three metrics, percent free memory, disk blocks transferred yesterday, and average daily logins. We will take a sample from the population of all of the VMs in the infrastructure and then divide them into two buckets, U (unused) and I (in use) using our experience. These represent samples from the two larger buckets of virtual machines that contain between them all of the virtual machines in the infrastructure. This manual division also represents the "expert knowledge" we want to approximate programmatically  For each metric, we will calculate the mean of that metric in each of the two sample sets. Then we will mark a particular virtual machine as above the mean for that metric (A), or below (B).  Thus each virtual machine in the two sample groups will have a triplet associated with it (A/B)(A/B)(A/B). This is the "color" of the virtual machine.  There are 8 combinations representing 8 marble colors:

AAA
AAB
ABB
BAA
BBA
BBB
BAB
ABA

(5)

I chose 130 virtual machines from the overall population. I determined 13 of those virtual machines to be unused, based on my professional experience.  I then determined the triplet for each virtual machine in the "in use" group and each virtual machine in the "unused" group. Then I counted the number of each "color" virtual machine in each group. Here the Laplace estimator applied. The sample size of the "unused" virtual machines was so small (only 13 unused virtual machines), that I needed the Laplace estimator to estimate the likelihood of colors in the larger "unused" bucket that didn't appear in the sample. Indeed, even in the larger "used" sample, some colors were missing, the Laplace estimator applied in that case as well. I then made overall estimates across both samples for overall unused virtual machines and the overall occurrence of each color.

Thus, for each virtual machine "color" in the two sample sets, I estimated:
  • P(a triplet IF the virtual machine was unused) = P(C|U)
  • P(unused virtual machines) = P(U)
  • P(a triplet) = P(C)
So, by Bayes Theorem, I was able to approximate the probability that a virtual machine was unused, given that it had a particular color:
(6)
I then calculated P(U|C) for each color:

P(U|AAA) =.66666666666666666623
P(U|AAB) =.07407407407407407400
P(U|ABB) =.33333333333333333311
P(U|BBB) =.26666666666666666645
P(U|BBA) =.66666666666666666652
P(U|BAA) =.66666666666666666666
P(U|ABA) =.66666666666666666660
P(U|BAB) =.00666666666666666666

(7)

With these values, I can evaluate any virtual machine in the environment for the probability it's unused.

For an arbitrary virtual machine, I would first take measurements of each of the three metrics. Then I would determine the triplet using the mean for each metric determined from the unused virtual machine sample. I would then look up the virtual machine's "color" in the above list then to get the estimate of the probability that the virtual machine is unused.

The above process may not be completely perfect, but it makes a much cheaper first pass at finding unused virtual machines than having a system administrator evaluate each virtual machine by hand.  A system administrator need only evaluate machines with colors that have .666 probability of being unused for example.

References


Graham, Paul. "A Plan for Spam." A Plan for Spam. Http://www.paulgraham.com/, Aug. 2002. Web. 23 May 2013. .

Gudkova, Darya. "Spam in Q1 2013." Securelist.com. Kaspersky Lab ZAO., 8 May 2013. Web. 22 May 2013.

Larsen, Richard J., and Morris L. Marx. An Introduction to Mathematical Statistics and Its Applications. 3rd ed. Upper Saddle River, NJ: Prentice Hall, 2001. Print.

Moore, Andrew W. "Probabilistic and Bayesian Analytics." Probability for Data Miners. Andrew W. Moore, n.d. Web. 22 May 2013.

Smith, David. "Estimation - Maximum Likelihood and Smoothing." Introduction to Natural Language Processing (http://people.cs.umass.edu/~dasmith/inlp2009/). University of Massachusetts, Amherst, Sept. 2009. Web. 22 June 2013. .

VMWare, Inc. "Determine True Total Cost of Ownership." Get Low Total-Cost-of-Ownership (TCO) with Maximum Virtual Machine Density. VMWare Inc., Sept. 2012. Web. 23 May 2013. .

Appendix - Code

Since this blog is somewhat about doing things in bash that really should not be done in bash, I did all the math to calculate each P(U|C) in a bash script. I will include it here. It's not optimized by any means and includes some creative abuses of bash. Note that the format of the original data file is virtual_machine_name,free_mem_percentage,blocks_transferred_yesterday,average_daily_logins. As I said above, I manually broke the sample list into used and unused virtual machines, and proceeded from there.


#!/bin/bash

set -u
#set -e
#set -x

status () {
  echo -n " |$*| "
}

get_unused_training_data () {

  for virtual_machine in `cat ${unused_VMs} `; do grep ${virtual_machine} ${training_data} ; done | sort | uniq 
  unset virtual_machine

}

get_in_use_training_data () {

  for virtual_machine in `cat ${in_use_VMs} `; do grep ${virtual_machine} ${training_data} ; done | sort | uniq 
  unset virtual_machine

}

get_unused_training_data_count () {

  get_unused_training_data | wc -l

}

get_in_use_training_data_count () {

  get_in_use_training_data | wc -l

}

get_unused_average_for_free_memory () {

  get_unused_training_data | awk -F, '{lines=$lines+1;sum=+$2} END {print sum/lines } '

}  

get_in_use_average_for_free_memory () {

  get_in_use_training_data | awk -F, '{lines=$lines+1;sum=+$2} END {print sum/lines } '

}

get_unused_average_for_block_transfer () {

  get_unused_training_data | awk -F, '{lines=$lines+1;sum=+$3} END {print sum/lines } '

}

get_in_use_average_for_block_transfer () {

  get_in_use_training_data | awk -F, '{lines=$lines+1;sum=+$3} END {print sum/lines } '

}

get_unused_average_for_logins () {

  get_unused_training_data | awk -F, '{lines=$lines+1;sum=+$4} END {print sum/lines } '

}

get_in_use_average_for_logins () {

  get_in_use_training_data | awk -F, '{lines=$lines+1;sum=+$4} END {print sum/lines } '

}

return_above_or_below_group_average () {

  a_or_b_value="`echo ${1} | awk -F. '{print $1}'`"
  a_or_b_average="`echo ${2} | awk -F. '{print $1}'`"

  if [ ${a_or_b_value} -gt ${a_or_b_average} ]
  then
    echo "A"
  fi

  if [ ${a_or_b_value} -lt ${a_or_b_average} ]
  then
    echo "B"
  fi

  if [ ${a_or_b_value} -eq ${a_or_b_average} ]
  then
    echo "A"
  fi

  unset a_or_b_value
  unset a_or_b_average

}

get_in_use_triplets () {

  get_in_use_triplets_free_memory_average="`get_in_use_average_for_free_memory`"

  get_in_use_triplets_block_transfer_average="`get_in_use_average_for_block_transfer`"

  get_in_use_triplets_logins_average="`get_in_use_average_for_logins`"

  for virtual_machine_line in `get_in_use_training_data`
  do
    virtual_machine_name="`echo ${virtual_machine_line} | awk -F, '{print $1}' `"
    
    get_in_use_triplets_local_value="`echo ${virtual_machine_line} | awk -F, '{print $2}' `"
    free_memory_a_or_b="`return_above_or_below_group_average ${get_in_use_triplets_local_value} ${get_in_use_triplets_free_memory_average}`"

    get_in_use_triplets_local_value="`echo ${virtual_machine_line} | awk -F, '{print $3}' `"
    block_transfer_a_or_b="`return_above_or_below_group_average ${get_in_use_triplets_local_value} ${get_in_use_triplets_block_transfer_average}`"

    get_in_use_triplets_local_value="`echo ${virtual_machine_line} | awk -F, '{print $2}' `"
    logins_a_or_b="`return_above_or_below_group_average ${get_in_use_triplets_local_value} ${get_in_use_triplets_logins_average}`"

    echo "${virtual_machine_name},${free_memory_a_or_b}${block_transfer_a_or_b}${logins_a_or_b}"

    unset virtual_machine_name
    unset free_memory_a_or_b
    unset block_transfer_a_or_b
    unset logins_a_or_b

  done

}

get_unused_triplets () {

  get_unused_triplets_free_memory_average="`get_unused_average_for_free_memory`"

  get_unused_triplets_block_transfer_average="`get_unused_average_for_block_transfer`"

  get_unused_triplets_logins_average="`get_unused_average_for_logins`"

  for virtual_machine_line in `get_unused_training_data`
  do
    virtual_machine_name="`echo ${virtual_machine_line} | awk -F, '{print $1}' `"
    
    get_unused_triplets_local_value="`echo ${virtual_machine_line} | awk -F, '{print $2}' `"
    free_memory_a_or_b="`return_above_or_below_group_average ${get_unused_triplets_local_value} ${get_unused_triplets_free_memory_average}`"

    get_unused_triplets_local_value="`echo ${virtual_machine_line} | awk -F, '{print $3}' `"
    block_transfer_a_or_b="`return_above_or_below_group_average ${get_unused_triplets_local_value} ${get_unused_triplets_block_transfer_average}`"

    get_unused_triplets_local_value="`echo ${virtual_machine_line} | awk -F, '{print $2}' `"
    logins_a_or_b="`return_above_or_below_group_average ${get_unused_triplets_local_value} ${get_unused_triplets_logins_average}`"

    echo "${virtual_machine_name},${free_memory_a_or_b}${block_transfer_a_or_b}${logins_a_or_b}"

    unset virtual_machine_name
    unset free_memory_a_or_b
    unset block_transfer_a_or_b
    unset logins_a_or_b
    unset get_unused_triplets_local_value

  done

}

get_l_of_triplets_if_unused () {

  for property_triplet in AAA AAB ABB BBB BBA BAA ABA BAB
  do

    number_of_unused_members_with_triplet="`get_unused_triplets | grep ${property_triplet} | wc -l `"
    number_of_in_use_members_with_triplet="`get_in_use_triplets | grep ${property_triplet} | wc -l `"
    number_of_unused_members="`get_unused_training_data | wc -l `"
    number_of_in_use_members="`get_in_use_training_data | wc -l `"
    number_of_combinations="8"

  # Laplace smoothed likelyhood estimator: http://people.cs.umass.edu/~dasmith/inlp2009/lect5-cs585.pdf
  # ((number of group members with a certain triplet)+1)/((number of combinations)+(# of members in bucket))

    export l_of_${property_triplet}_if_unused="`echo \(${number_of_unused_members_with_triplet}+1\)/\(${number_of_combinations}+${number_of_unused_members}\) | bc -l `"
  #    export l_of_${property_triplet}_if_in_use="`echo \(${number_of_in_use_members_with_triplet}+1\)/\(${number_of_combinations}+${number_of_in_use_members}\) | bc -l `"

  #  echo "`env | grep l_of_${property_triplet}_if_unused | awk -F= '{print $2}'`"
  #    echo "`env | grep l_of_${property_triplet}_if_in_use | awk -F= '{print $2}'`"
      
  done 

}

get_overall_l_of_triplets () {

  total_virtual_machines="`cat ${training_data} | wc -l`"
  number_of_combinations="8"

  for property_triplet in AAA AAB ABB BBB BBA BAA ABA BAB
  do

    export count_of_${property_triplet}_for_in_use="`get_in_use_triplets | grep ${property_triplet} | wc -l `"
    export count_of_${property_triplet}_for_unused="`get_unused_triplets | grep ${property_triplet} | wc -l `"
    count_of_in_use="`env | grep count_of_${property_triplet}_for_in_use | awk -F= '{print $2}'`"
    count_of_unused="`env | grep count_of_${property_triplet}_for_unused | awk -F= '{print $2}'`"
    export overall_l_of_triplet_${property_triplet}="`echo \(${count_of_in_use}+${count_of_unused}+1\)/\(${number_of_combinations}+${total_virtual_machines}\) | bc -l `"
    # echo "`env | grep overall_l_of_triplet_${property_triplet} | awk -F= '{print $2}'`"
    # echo $((${count_of_in_use}+${count_of_unused}))
  done 
}

get_p_unused_if_triplet () {

# P(u|triplet) = ( p(triplet|unused) * p(unused) ) / p(triplet)

  get_l_of_triplets_if_unused
  get_overall_l_of_triplets

  
  count_of_unused="`get_unused_training_data | wc -l `"
  total_virtual_machines="`cat ${training_data} | wc -l`"
  number_of_combinations="8"

  l_of_unused="`echo \(${count_of_unused}+1\)/\(${number_of_combinations}+${total_virtual_machines}\) | bc -l `"

  for property_triplet in AAA AAB ABB BBB BBA BAA ABA BAB
  do
      l_of_triplet_if_unused="`env | grep l_of_${property_triplet}_if_unused | awk -F= '{print $2}'`"
      overall_l_of_triplet="`env | grep overall_l_of_triplet_${property_triplet} | awk -F= '{print $2}'`"

#echo ${l_of_unused}
#echo ${l_of_triplet_if_unused}
#echo ${overall_l_of_triplet}

      export p_unused_if_triplet_${property_triplet}="`echo \(${l_of_triplet_if_unused}\*${l_of_unused}\)/${overall_l_of_triplet} | bc -l `"
      echo "p_unused_if_triplet_${property_triplet}=`env | grep p_unused_if_triplet_${property_triplet} | awk -F= '{print $2}'`"

  done

}

#Local data files
training_data="./training_data"
unused_VMs="./unused_VMs"
in_use_VMs="./in_use_VMs"

get_p_unused_if_triplet

exit 0