Wednesday, September 19, 2012

An algorithm for automatically flagging an unused physical server or virtual machine for retirement

by Adam Keck (incorporating helpful suggestions from Carl Friend and Tyler Yip)

Abstract


If you have a large number of virtual machines and physical servers, you need some method to automatically determine when to retire a machine after it is no longer used.  Otherwise, you have to rely on the business owner of each physical server or virtual machine to tell you when they no longer need the resource.

Below is an algorithm for automatically flagging a physical server or virtual machine for retirement due to lack of usage. In some environments, with proper tuning, this algorithm could even trigger automatic retirement.


Automatic Server Retirement Algorithm (ASRA)

  • Record these data for the lifetime of a server (physical or virtual machine):
    • disk writes and reads
    • network transmits and receives
    • CPU cycles used
  • Over the first M days of usage, calculate the arithmetic mean and the RMS (root mean square) of the daily values for each data type. M depends on the length of your business cycles. We suggest M = truncate(365.2/4) = 91 days (i.e., one quarter of a year, or about three months).
  • Calculate the standard deviation about the arithmetic mean over the same first M days for each data type.
  • Continue recording daily total values for the above data.
  • Every M days thereafter, calculate the M-day arithmetic mean (and, for methods A and B, the RMS) of the daily values for each of the above data types.
  • Use one of the following methods to flag the server for retirement (or to retire it automatically).
    • A: Flag the server for removal if, for all data types, the current period's arithmetic mean or RMS is less than N% of the arithmetic mean or RMS from the first M days. N is dependent on your site's requirements and server behavior. Carl suggests N=50.
    • B: Flag the server for removal if, for all data types, the current period's arithmetic mean or RMS is less than N% of a moving arithmetic mean or RMS, maintained as follows. Whenever a period's arithmetic mean or RMS is higher than the first M days' value, or lower than it by less than K%, recalculate the moving arithmetic mean or RMS from that period's data, the first M days' data, and the data from every prior period that met the same condition. N is dependent on your site's requirements and server behavior. Carl suggests N=50. I suggest K=(1/4)(1-N%), which is 12.5% when N=50.
    • C: Flag the server for removal if, for all data types, the data type's arithmetic mean for the current period is more than N standard deviations below the first M days' arithmetic mean. N is dependent on your site's requirements and server behavior. I suggest N=1.
    • D: Flag the server for removal if, for all data types, the data type's arithmetic mean for the current period is more than N standard deviations below a moving arithmetic mean, maintained as follows. Whenever a period's arithmetic mean is higher than the first M days' value, or lower than it by less than (1/K)(N standard deviations), recalculate the moving arithmetic mean from that period's data, the first M days' data, and the data from every prior period that met the same condition. N is dependent on your site's requirements and server behavior. I suggest N=1 and K=4.
  • Methods B and D take into account the case where a server reaches its normal workload only after the initial M-day period.
  • Methods A and B will probably be easier to program than C and D. C and D, however, are driven by a specific server's usage patterns, because their thresholds come from that server's own measured variability.
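
The steps above can be sketched in Python. This is only a minimal illustration under stated assumptions, not a reference implementation: the function names, the dictionary layout (one list of daily totals per data type), and the use of the population standard deviation are all my own choices and are not specified by the algorithm. The moving mean for method B is computed from period means rather than raw daily data, which gives the same result only because every period has the same length M.

```python
import math
from statistics import mean, pstdev


def rms(values):
    """Root mean square of a sequence of daily totals."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def baseline(daily, m=91):
    """Mean, RMS and standard deviation of the first m days, per data type.

    Assumption: population standard deviation (pstdev); the text does not
    say whether the population or sample form is intended.
    """
    return {
        name: {"mean": mean(s[:m]), "rms": rms(s[:m]), "std": pstdev(s[:m])}
        for name, s in daily.items()
    }


def period_means(daily, start, m=91):
    """Arithmetic mean of the m days beginning at index start, per data type."""
    return {name: mean(s[start:start + m]) for name, s in daily.items()}


def flag_method_a(base, current, n_percent=50):
    """Method A: every data type's period mean is below N% of its baseline mean."""
    return all(current[k] < base[k]["mean"] * n_percent / 100 for k in base)


def flag_method_c(base, current, n_sigma=1):
    """Method C: every data type's period mean is more than N standard
    deviations below its baseline mean."""
    return all(current[k] < base[k]["mean"] - n_sigma * base[k]["std"] for k in base)


def moving_mean(first_mean, later_period_means, k_percent=12.5):
    """Moving mean for method B, for one data type.

    Pools the first period with every later period whose mean was higher
    than the first period's, or lower by less than K%.
    """
    floor = first_mean * (1 - k_percent / 100)
    return mean([first_mean] + [p for p in later_period_means if p > floor])


# Hypothetical server: busy for one quarter, then nearly idle for the next.
busy = [90, 110] * 45 + [100]   # 91 daily totals averaging 100
idle = [5, 15] * 45 + [10]      # 91 daily totals averaging 10
daily = {"disk_io": busy + idle, "network_io": busy + idle, "cpu": busy + idle}

base = baseline(daily)
current = period_means(daily, start=91)
print(flag_method_a(base, current), flag_method_c(base, current))
```

With this made-up data, both methods flag the server, since the second quarter's mean of 10 is far below both 50% of the baseline mean and one standard deviation under it. Method D could be sketched the same way by replacing the percentage band in moving_mean with a band of (1/K)(N standard deviations).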