Sunday, April 13, 2014

Two Trends Worth Mentioning...

I usually shy away from predictions, but I think two trends I am seeing more of recently are worth mentioning: post-materialist culture and self-navigating quadro/multi-copters.

Post-materialism


For the past three or four years, I've been seeing a slowly strengthening trickle of articles and personal stories on topics like the "tyranny of stuff" and "paying for experiences instead of things". The first well-constructed expression of this trend I read was Bruce Sterling's "Last Viridian Note". While such lifestyle changes were once driven by external reasoning like "save the earth", this discourse has a more personal flavor. Up until the Great Recession (and a little after), there was in the U.S.A., and certain other emerging and developed economies, a relentless cultural drive to own luxurious things and, often, more than one of a particular luxurious thing. Doing so signaled high status.

Today, however, in some sub-cultures of the West, the highest status belongs to owning almost nothing but a few exceedingly high-quality items while having the complete freedom of time and wealth to pursue endless interesting experiences. This shift is fascinating because, with a little skill, such a life requires much less wealth than a life of accumulating and maintaining a large collection of possessions. In an era of stagnant real income, eliminating the ongoing costs of possessions becomes ever more attractive. Experiences happen, often make us happier, and then leave our lives. Unless we suffer some injury or ailment, the costs of an experience stop when it stops. People are also starting to notice the "time cost" of possessions: everything you own must, at some point, be maintained or curated, and we lose that time forever. Between the reduction in available real income and the time spent on curation, shifting to a life of interesting experiences that make us happier leaves us with both more money and more time.

Quadro/multi-copters


Much of the startup chatter today is about disrupting this or that. Usually, either the disruption is of mundane things, or the basic business math does not hold up in the long run, or the disruption stays entirely within our online lives ("We're going to bring 'social' to ordering fast food online"). Actual "in real life" disruption, like the shift from horse and buggy to automobile, does not come around often, because of the great costs involved in developing a new technology. Two such technologies are now emerging and merging: quadro/multi-copters and self-navigation. I think they will change our daily experience of transportation before 2020. In my understanding, while a multi-copter has more rotor units than a helicopter, the whole system is simpler to manage and easier to fix.

Like some past deep shifts in technology, multi-copter technology started in universities and the toy industry. Toys have notoriously razor-thin profit margins, so getting the complex flight behavior, efficiency, and reliability of a multi-copter into a profitable toy-priced package bodes well for future scaling. Toy multi-copters have already been imbued with self-navigation and self-organization behaviors, driven by the sharply falling costs of GPS technology, model-based design, and the ongoing concurrent trends of miniaturization and power reduction for computers. This means that a particularly difficult aspect of a new technology, the mathematical models that run it, is already simple enough and mature enough to sell toys profitably.

Consider a multi-copter harness around a single standard shipping container, or around a locked-together block of containers. This would enable air delivery of the products inside with much less airport infrastructure, especially if the flight is fully automated. Multi-copters need about as little infrastructure as helicopters, but their software-model-driven multitude of direct-driven rotors can recover from the failure of a single rotor unit far better than a helicopter can. Then consider, with enough safety engineering, the equivalent of an automated aerial train system without the need to dedicate large tracts of land to airports. It would take less infrastructure to build out such a system than an equivalent rail system. Such an infrastructure is certainly a strong candidate for enabling passenger mobility in rural regions of the world with limited rail infrastructure, such as parts of Africa and Siberia.

Saturday, February 1, 2014

Lean - A Primer

A Brief Primer on LEAN

A living document of my overall understanding of LEAN.
LEAN - A Primer by Adam S. Keck is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Based on a work at http://www.bashedupbits.com/2014/02/lean-primer.html.
Permissions beyond the scope of this license may be available at https://plus.google.com/+AdamKeckLeanLeadershipIT/posts.


Four Capabilities of Accelerating Organizations

  • Specify precise approach to every task with built-in tests to detect failure and stop on error.
  • Swarm to solve problems – immediate root cause analysis (RCA).
  • Share new knowledge throughout organization.
  • Leaders lead by developing above capabilities in every single employee.

Work Flow and System Design

  • LEAN is a framework for successfully designing complex systems and work flows. A system is a collection of related work flows. A company can be viewed as a single complex system that provides value to customers in return for money.
  • LEAN is a way of working: Every employee uses LEAN work flow every day to both follow and improve the processes for which they are responsible.
  • LEAN is a specific implementation of Deming’s “Plan, Do, Check, Act” (PDCA)
  • Work flows should stop on error or self-regulate (“autonomation”): Each step has built in tests for verification.

On failure, stop the process and trigger an immediate RCA (Autonomation)

  • Start from the delivery of the correct work flow output to its consumer. Each downstream need paces and specifies work upstream in the process (Kanban is a specific method of achieving this goal).
  • Develop the work flow by working backward from the output that exactly fulfills the needs and requirements of final customer (i.e., what Raving Fans calls the Ideal).
  • Mistakes: Human error is generally considered the cause only when a person does not follow the current written process or does not verify that the output of each step is correct. All other errors are considered defects in the process that allowed them to occur. This philosophy drives the "swarm to RCA each failure" capability.
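
As a concrete illustration, here is a minimal bash sketch of an autonomation-style work flow (the step scripts and their verification commands are hypothetical stand-ins): each step runs its built-in test, and the line stops for RCA at the first failure.

#!/bin/bash
# Minimal autonomation sketch: run each step, verify its output with a
# built-in test, and stop the whole work flow on the first failure.
set -u

run_step () {
  step_name="$1"
  step_command="$2"
  verify_command="$3"

  if ! ${step_command} || ! ${verify_command}
  then
    # Stop on error: halt the line and flag the step for immediate RCA.
    echo "STOP: step '${step_name}' failed its built-in test - swarm for RCA" 1>&2
    exit 1
  fi
}

# Hypothetical three-step work flow, each step paired with its test.
run_step "extract"   "./extract_input.sh" "test -s ./work/input.csv"
run_step "transform" "./transform.sh"     "./check_row_counts.sh"
run_step "load"      "./load_output.sh"   "./confirm_delivery.sh"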

Work Flow Creation Framework (In order)

  • Specify outputs – What does the work flow have to deliver, to whom, when, and what does it mean for the work flow to be successful?
  • Design pathways – Flow of materials, information, and services. Who is specifically responsible for each step in a pathway?
  • Design step connections – Linkages between adjacent process steps
  • Specify task methods – How exactly is each step in each process accomplished successfully?

Work Flow Creation Tools

  • Checklist
  • Automation code
  • Flow chart
  • Input/Output/Handoff chart

Problem Solving (Iterate)

Ideal

  • Defect-free work flow
  • On-demand work flow
  • Work flow provides only exact output needed by client process or customer.
  • Immediate fulfillment of needed output.
  • Work flow runs without waste
  • Work flow is safe and secure (personnel not harmed, and information and assets secure) 

General

  • The commonly cited “A3 process” is a specific work flow and presentation format that implements the elements below in a way that ensures customer buy-in at each step.
  • Use graphical elements to efficiently present and confirm information with customers.

Elements

  • Background: Why is this problem important?
  • Current condition: Measurements and metrics. Get and confirm information directly.
  • Gap analysis: How do the process and its outputs differ from the ideal (see the previous section)?
  • Root Cause Analysis: Swarm on failure; analyze gaps from ideal
  • Develop Countermeasures (rapid prototyping = experiments to find the right solutions)
  • Specify target condition: Desired new process with countermeasures in place
  • Measure actual outcome of new process: repeat measurements and metrics
  • Gap analysis and further RCA

Sharing Knowledge

  • Organization-wide sharing accelerates productivity. Everyone follows documented work flows. 
  • See one, Show one, Do one (from hospital LEAN efforts).
  • Codify discoveries for wide dissemination (e.g. Toyota “Lesson learned books” specify what’s feasible/cost-effective for a certain type of output or process)
  • Knowledge-base that stores documented work flows = “company memory”

Training

  • Regularly practice system design and problem solving
  • Develop LEAN skills in all employees, at every level.
  • Practice following work flows with verification of each step to prevent defects.

Leading

  • Everyone works using LEAN principles every day. Bottom-up (sometimes guided) work flow improvement.
  • Everyone who knows the skills above leads those who are new by teaching them the above skills.
  • Learn to see and solve problems with rapid prototype iterations. Practice this skill.
  • Work flow improvers get and confirm information directly. Nothing is assumed.

Monday, November 18, 2013

How to play a video on a Raspberry Pi Desktop by double-clicking on a file...

This article describes how to open video, audio, and other media files from the Raspberry Pi desktop (the LXDE file manager) using the GPU-based player program.





Does double-clicking on a video file in Raspbian result in slow, blocky playback in SMPlayer and VLC on your Raspberry Pi?

The short answer is that those video players do not work because, at this time (Nov. 2013), they do not make use of the GPU on the Raspberry Pi. You need to use the hardware-accelerated player, omxplayer, that is used in XBMC Live and OpenELEC. The problem is that omxplayer is a command-line player designed to be embedded in the XBMC-based distributions. Below, I present a way to make it play videos when you double-click them in the Raspbian desktop. Others have presented this method, but I've added a little abstraction to make management easier. To start, open LXTerminal and then follow the process below.

Step One - Get rid of the CPU-based media players


sudo aptitude remove vlc smplayer


Step Two - Install omxplayer and xterm


sudo aptitude install omxplayer xterm


I'm installing xterm because its command-line syntax is clear. To have keyboard control when omxplayer runs, it must be run from an open terminal. I don't know why this is, but this is what works. Simply setting omxplayer as the application that opens a media file works, but you lose keyboard control. This means, for example, that you can't quit omxplayer in the middle of a video.


Step Three - Make a wrapper script with a simple name to start omxplayer in an xterm


sudo nano /usr/local/bin/vplay


Add the following contents to the file:


#!/bin/bash
exec xterm -fullscreen -fg black -bg black -e omxplayer -o hdmi -r "$1"


The "-o hdmi" forces omxplayer to pipe audio through the HDMI cable. Leave this option out if you have your Pi configured to use the headphone jack.

Save the file and quit, then make it executable:


sudo chmod 755 /usr/local/bin/vplay


Step Four - Make the "vplay" script the default handler for each video file type


We will use "mp4" files as an example.

Find a video file with the "mp4" file extension.  Right-click on it and select "Open with...". Click the "Custom Command Line" tab. Type "vplay %f" into the "Command line to execute:" box. Check the box at the bottom of the screen with the label "Set selected application as default action for this file type".

Click "OK"

If everything is correct, the file will now play in omxplayer. Press "q" to quit the program.

From this point forward, double-clicking any "mp4" file in the LXDE file manager will automatically play the file in omxplayer.  Spacebar pauses. The arrow keys skip forward and back. "2" speeds up the playback. To stop the sped up playback, press the spacebar twice.

Repeat step four for any other file extensions you want to automatically play.

If you make a mistake in the last step, you can clean up your bad attempt by deleting the "user-*" files in ~/.local/share/applications/.
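
For example, this removes all of the custom associations you have created (not just the bad one), so use it only if that is what you want:

rm ~/.local/share/applications/user-*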

Using this wrapper script technique means that you can modify the omxplayer options at any time without having to redo the changes for each file extension in the LXDE file manager. Just edit /usr/local/bin/vplay.

Note that this script works for audio files as well; they play with a black screen. It makes for a lightweight way to play audio files without opening a full application like Clementine.

Hope this helps!

-Adam

Friday, November 1, 2013

PowerShell One-Liners


Introduction



PowerShell is Microsoft's shell for its product lines, and it is now at version 3.0. If you miss the power of the command line while using Windows on either your laptop or your servers, PowerShell provides that power.


Important concepts:


  • Almost all aspects of the Microsoft ecosystem are objects within an overarching structure. You query and manipulate this structure and its objects with PowerShell. This includes all aspects of SharePoint, Active Directory, and Exchange. 
  • This "object nature" means that PowerShell pipes pass objects and properties, not just text. 
  • Variables store data-structures of objects. 

One-liners



Note: Unwrap the code lines before you use them.

Get Help



Get the usage of the command "Select-Object":

Get-Help Select-Object

Built-in examples for the command "Select-Object":

Get-Help Select-Object -examples | more

Get the list of all commands and sort it:

Get-Command | select-object name | sort name | more

Get the list of help topics for other parts of PowerShell:

Get-Help about*


Opening Files and Programs



The PowerShell equivalent of Apple's Mac OS X command "open" is "Invoke-Item".

Start firefox.exe:

Invoke-Item "C:\Program Files (x86)\Mozilla Firefox\firefox.exe"

Open the file "Document.pdf" that is located the current directory:

Invoke-Item Document.pdf


Invoke-Item "\\myserver\c\Files\Document.pdf"

Manage Processes


To pattern match on an object list, use "Where-Object". The current object being processed is referred to by the special variable "$_", and members are accessed via the "." operator:

Get-Process | Where-Object {$_.processname -match "powershell" } | Select-Object processname,CPU,VM

Select the process name and virtual memory size for every process, then sort by virtual memory size:

Get-Process | Select-Object processname,virtualmemorysize | sort virtualmemorysize

Find the busiest Google Chrome process (sorted by CPU time, busiest last):

Get-Process chrome* | Select-Object processname,ID,CPU | sort CPU

Store the list of process objects:

$ListOfProcessObjects = Get-Process

Print the process name and virtual memory size from the stored process objects and sort by virtual memory size:

$ListOfProcessObjects | Select-Object processname,VM | sort VM

Print the chrome processes and sort by virtual memory size:

$ListOfProcessObjects | Where-Object { $_.processname -match "chrome" } | select-object processname,VM | sort VM

Find the Google Chrome process with the largest VM size:

Get-Process chrome* | sort VM | Select-Object processname,ID,VM -last 1

Find the Google Chrome process with the smallest VM size:

Get-Process chrome* | sort VM | Select-Object processname,ID,VM -first 1

Stop all Chrome processes:

Stop-Process -processname chrome*

Working on file systems


Find all "exe" files in a tree, list their full path, and sort "fullname":

Get-ChildItem 'C:\Tree\Of\Files\' -recurse -include *.exe | select-object fullname | sort fullname | more

Find all mp3s and sort by ascending size:

Get-ChildItem 'C:\Tree\Of\Files\' -recurse -include *.mp3 | select-object fullname,length | sort length

Find all mkvs and sort by ascending lastaccesstime:

Get-ChildItem 'C:\Tree\Of\Files\' -recurse -include *.mkv | select-object fullname,lastaccesstime | sort lastaccesstime

To get a list of all of an object's properties, use Where-Object to narrow the list of file system objects down to a single object, and then pipe that object to: Select-Object * | more

Get-ChildItem 'C:\Tree\Of\Files\' -recurse -include *.pdf | Where-Object { $_.fullname -match ".*Q1_Report.pdf" } | Select-Object * | more

Get pdfs that were last accessed by Windows in 2008, get their fullname, length, and last access time, then finally sort by length in ascending order:

Get-ChildItem 'C:\Tree\Of\Files\' -recurse -include *.pdf | Where-Object {$_.LastAccessTime -match "2008" } | Select-Object fullname,length,LastAccessTime | sort length

You can output a command's results to CSV with "Export-Csv". This cmdlet requires a filename as an argument:

Get-ChildItem 'C:\Tree\Of\Files\' -recurse -include *.pdf | Select-Object fullname,lastaccesstime,length | sort length | Export-Csv C:\Files\list.csv

Load the above results into the clipboard as a list:

Get-ChildItem 'C:\Tree\Of\Files\' -recurse -include *.pdf | Select-Object fullname,lastaccesstime,length | sort length | Format-list | clip

New directory:

New-Item c:\Files\Log_Data -type directory

New directory on a server:

New-Item \\myserver\c\Files\Log_Data -type directory

New empty file:

New-Item c:\Files\Log_Data\logoutput.txt -type file

Create a new file on a server:

New-Item \\myserver\c\Files\logoutput.txt -type file

Rename a file:

Rename-Item c:\Files\Log_Data\logoutput.txt logoutput.new.txt

Rename a file on a server:

Rename-Item \\myserver\c\Files\logoutput.txt logoutput.new.txt

Delete a file:

Remove-Item C:\Files\Log_Data\logoutput.new.txt

Delete a file on a server:

Remove-Item \\myserver\c\Log_Files\logoutput.txt

Delete a directory:

Remove-Item C:\Files\Log_Files

Delete a directory on a server:

Remove-Item \\myserver\c\Log_Files

Write text to a file. This replaces the contents of the file:

Set-Content c:\Files\Log_Files\logoutput.txt.new -value "Line 1","Line 2","Line 3"

Log Processing


Store list of log objects from the event log "System":

$SystemLogs = Get-EventLog System

Get all the log entries of entrytype "Error" from the stored system logs, and then sort by "Message":

$SystemLogs | Where-Object {$_.entrytype -match "error" } | select-object message,entrytype | sort message | more

Get all the log entries of entrytype "Error" from the stored system logs, and then return a sorted list of unique log messages:

$SystemLogs | Where-Object {$_.entrytype -match "error" } | select-object message | sort message | Get-Unique -asstring | more

Get list of logging providers:

$ListOfProviders = get-winevent -listprovider *

Looking at Hotfixes


Get the IDs of all installed hotfixes and their install times and then sort by install time:

Get-HotFix | select-object hotfixid,installedon | sort installedon | more

Get the hotfix list from a remote machine (replace someservername with the name of a server in your environment). The account from which you run this needs admin rights to that machine:

Get-HotFix -computername someservername | Select-Object hotfixid,installedon | sort installedon | more

Using PowerShell on remote machines


Start an interactive PowerShell session on the remote computer myserver:

Enter-PsSession myserver

Stop an interactive PowerShell session:

Exit-PsSession

Run a command on a list of remote machines:

Invoke-Command -computername myserver1, myserver2, myserver3 {get-Process}

Run a remote script on a list of remote machines:

Invoke-Command -computername myserver1,myserver2,myserver3 -filepath \\scriptserver\c\scripts\script.ps1

Operate interactively on a list of machines by setting up a "session" of open connections:

$InteractiveSession = new-pssession -computername myserver1, myserver2, myserver3

Run a remote command on the new session. This runs it on all the connections in the session:

Invoke-Command -session $InteractiveSession {Get-Process} 

Run the remote command on the session, but report only certain objects:

invoke-command -session $InteractiveSession {Get-Process | select-object name,VM,CPU }

Groups and Users


Get all of the user objects in "Data-Center-Team":

Get-ADGroupMember -Identity "Data-Center-Team"

Suppose the group "IT-Team" contains the group "Data-Center-Team" and other teams. To list the members of "IT-Team" (the nested groups, plus any direct users):

Get-ADGroupMember -Identity "IT-Team"

To list all of the users in "IT-Team", including the members of its nested groups:

Get-ADGroupMember -Identity "IT-Team" -Recursive

Add user "thomasd" to the Data-Center-Team group:

Add-ADGroupMember -Identity "Data-Center-Team" -Members "thomasd"

Remove user "thomasd" from the "Data-Center-Team" group:

Remove-ADGroupMember -Identity “Group-A” -Members "thomasd"

Add the members of "London-Office" group to the "IT-Group" group:

Get-ADGroupMember -Identity "London-Office" -Recursive | Get-ADUser | ForEach-Object {Add-ADGroupMember -Identity "IT-Group" -Members $_}

Remove the members of the "London-Office" group from the "IT-Group" group:

Get-ADGroupMember -Identity "London-Office" -Recursive | Get-ADUser | ForEach-Object {Remove-ADGroupMember -Identity "IT-Group" -Members $_}

Get all of the user objects in groups beginning with "Development-":

Get-ADGroup -LDAPFilter "(name=Development-*)" | Get-ADGroupMember | Get-ADUser

Get all of the users in groups beginning with "Development-" that are disabled:

Get-ADGroup -LDAPFilter "(name=Development-*)" | Get-ADGroupMember | Get-ADUser | Where-Object {$_.Enabled -eq $False }

Find all of the users in groups beginning with "Development-" that are disabled and add them to the "Development-Disabled" group:

Get-ADGroup -LDAPFilter "(name=Development-*)" | Get-ADGroupMember | Get-ADUser | Where-Object {$_.Enabled -eq $False} | ForEach-Object { Add-ADGroupMember -Identity "Development-Disabled" -Members $_ -Confirm:$False }

Get all members of the "Development-" groups with their enabled status and put them in a CSV file in C:\Files\ :

Get-ADGroup -LDAPFilter "(name=Development-*)" | Get-ADGroupMember | Get-ADUser | Select-Object Enabled,SamAccountName | sort Enabled | Export-Csv C:\Files\Development-Group-Users.csv

Other


Reset your network connections:

"release", "renew", "flushdns" | %{ipconfig /$_}
Get a list of the domain controllers in your domain:

[System.DirectoryServices.ActiveDirectory.Domain]::GetCurrentDomain() | Select-Object DomainControllers


-Adam (a0f29b982)

Wednesday, June 26, 2013

Programmatically named variables in bash.

Suppose you wanted to do the following in bash:

for label in a b c d e f
do
  variable_${label}=${label}
done

The intent is to set a series of variables:
variable_a
variable_b
variable_c
variable_d
variable_e
variable_f

(As it happens, plain bash rejects even the assignment above, because the text to the left of "=" must be a literal variable name rather than an expansion. The export trick below fixes that, too.) And what if you want to dereference them programmatically?
for label in a b c d e f
do
  echo ${variable_${label}}
done
is not acceptable bash syntax.
But there is a way...
We can abuse export and env. We set them with:
for label in a b c d e f
do
  export variable_${label}=${label}
done
We can then programmatically dereference the variables by searching for them in the output of env and using awk to extract their values.
for label in a b c d e f
do
  echo "`env | grep variable_${label} | awk -F= '{print $2}'`"
done
How's that for bash abuse?
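
(For the record, newer versions of bash can do the same job without the env/grep round trip, using declare for the assignment and ${!name} indirect expansion for the dereference. A cleaner, if less entertaining, sketch:)

for label in a b c d e f
do
  declare "variable_${label}=${label}"   # assignment with a computed name
done

for label in a b c d e f
do
  name="variable_${label}"
  echo "${!name}"                        # indirect expansion: the value of variable_a, etc.
done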

Wednesday, May 22, 2013

A Proposal for Determining If A VM Is Used or Unused (work-in-progress)

Motivation

As virtualization technology has taken over the information services landscape, the cost - both in money and effort - of deploying a new server has fallen dramatically. Since the commoditization of PC-architecture servers in the 2000s, organizations have typically deployed one application per server to isolate each application for ease of maintenance and security. If an application did not use all of its server's resources all of the time, the remaining computing resources went to waste. To recover that waste, many organizations have recently replaced physical servers, each running one application, with virtual machines, also each running one application, all hosted on a few large physical servers. If one application is idle, the others can use the remaining resources. With this shift, VMware estimates that the infrastructure cost of one application, and thus one VM, is now down to $1774.00 [VMware 2012]. In addition to the lower infrastructure cost, automation has driven the deployment cost of a new virtual machine to near zero. With this dramatic cost drop, organizations have watched their populations of virtual machines balloon, and with ballooning populations comes ever greater potential for virtual machines to simply linger, unused, as their applications fall out of use and as the human structure of an organization shifts over time.

For many organizations, having a human constantly review the virtual machine population for unused virtual machines would be cost prohibitive. Review would be possible, however, if the majority of in-use virtual machines could be filtered out of the population, leaving for human review only the virtual machines with a high chance of being unused. This article seeks to develop a classifier that flags potentially unused virtual machines using a statistical method.

Background


A systems administrator can use their experience to determine whether a virtual machine is still being used by looking at various properties of that virtual machine and then making a judgment call regarding its level of use. Made explicit, the process looks something like:

If,
  • Property 1 = a
  • Property 2 = b
  • Property 3 = c
  • Property 4 = d
then,
  • in my professional judgment, I believe this virtual machine to be unused.
For an administrator to make this judgment across hundreds of virtual machines would take many hours, so it would be best to automate the process. To automate it, we need to capture it in computer code; and to capture it in computer code, we must express it mathematically. So the question becomes,
Is there an existing mathematical model that captures this process, and thus approximates the systems administrator's expertise?
In fact, there is such a model.

Bayesian Classifiers 

Consider "this email is spam". It's boolean, that is true or false, and has some chance of being true. We denote that chance:



Think of a large area where each point is a possible email. Area(spam) is the region containing all of the emails you consider spam. If Area(spam) is zero, then no emails are spam. If Area(spam) equals one, then all emails are spam. Usually Area(spam) is some value between 0 and 1. [Moore n.d.] In fact, of all the email sent today, Area(spam) is 0.665. [Gudkova 2013]

To have a computer determine whether or not an email is spam, the computer must use the properties of the email available to it, namely the words in the email, to determine the overall probability that the email is spam.

The area of all email can be sliced up into partitions, where each partition contains all the emails that have a certain word like "rolex" in them. These partitions overlap the two larger partitions that split the area into spam emails and non-spam emails.

Using the above description, to determine if an email is spam, we ask: if an email has "rolex" in its corpus of words, what is the probability that the email is spam? [Graham 2002] That is, out of all the emails containing "rolex", how many are in the spam partition?

Conditional probabilities provide a model for this type of question:

\[
P(U \mid V) \qquad (1)
\]

the probability of condition U, given datum V.

The mathematicians Laplace and Bayes give us a concise formula, Bayes' Theorem, that relates conditional probabilities to the overall probabilities of the condition and the datum:

\[
P(U \mid V) = \frac{P(V \mid U)\,P(U)}{P(V)} \qquad (2)
\]

where, for our purposes, V is a measurable property of a virtual machine and U is the classification "unused".

There are many things that we can measure on each virtual machine, and it would be good to combine them to find the probability that a particular virtual machine is unused. Thus, we would like a way to use the above equation to determine whether a virtual machine is "in use" or "unused", given several measurable properties. [Larsen 2001]

Buckets of Marbles


Consider two buckets, U and I, full of marbles of eight different colors. We want to be able to pull a marble at random from one of those two buckets and then estimate the probability that it came from one bucket or the other. That is, we want to use the color of the marble to estimate the classification of the marble, U or I. In terms of the areas described above, imagine all of the marbles laid out flat, with each color grouped together. Each color is a distinct area. Overlapping those distinct areas are two larger areas, U and I. We want to estimate how likely it is that an arbitrary marble picked off the plane came from under the U area, based on its color.

To do this, we start by pulling a sample from each bucket, U and I. We count the number of marbles of each color in the sample from each bucket. We count the total number of marbles of each color across both samples. Finally, we count the total number of marbles in both samples.

Say we want to know the probability that a marble came from the U bucket, given that it is red. We can use the above relationship. Using the count of red marbles in the sample from U, we can estimate the fraction of the marbles in U that are red. This is P(red|U). Using the number of marbles in the U sample and the total number of marbles in both samples, we can estimate the fraction of all marbles that sit in the U bucket. This is P(U). And by counting the red marbles in both samples, we can estimate the fraction of red marbles across both buckets. This is P(red). The above equation says that we can estimate the probability that a red marble came from the U bucket using the relationship between those values:

\[
P(U \mid \mathrm{red}) = \frac{P(\mathrm{red} \mid U)\,P(U)}{P(\mathrm{red})} \qquad (3)
\]
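
To make that concrete with made-up counts: suppose the sample from U has 10 marbles, 4 of them red, and the sample from I has 90 marbles, 6 of them red. Then P(red|U) is about 4/10, P(U) about 10/100, and P(red) about 10/100, so

\[
P(U \mid \mathrm{red}) \approx \frac{(4/10)(10/100)}{10/100} = 0.4
\]

which matches the direct count: of the 10 red marbles across both samples, 4 came from the U sample.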

But what if the sample from one of the buckets has none of a particular color? The count would be zero, which gives us no estimate of the presence of that color in the original bucket. Laplace gives us a slightly more complex estimator to "smooth" over that zero count [Smith 2009]:

Instead of the count of a color divided by the number of marbles in the sample, we can use:

\[
\hat{P}(\mathrm{color} \mid \mathrm{bucket}) = \frac{(\text{count of that color in the sample}) + 1}{(\text{number of colors}) + (\text{number of marbles in the sample})} \qquad (4)
\]

This smooths out the zero by assuming a slightly lower likelihood for that color than 1/(total marbles in the sample). The estimator gives us a way to estimate the likelihood of picking each color from each bucket, even when the sample does not contain that color.
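
Plugging in the numbers used later in this article (a sample of 13 unused virtual machines and 8 colors), a color counted once in the unused sample gets

\[
\hat{P}(\mathrm{color} \mid U) = \frac{1+1}{8+13} \approx 0.095
\]

while a color that never appears in the sample still gets the non-zero estimate (0+1)/(8+13), or about 0.048.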

VM Classification

For this proof of concept, we need to analyze a sample population of virtual machines and then assign each VM a colored marble based on our analysis of that virtual machine. I propose doing this in the following way.

For simplicity, we will use eight colors, as we did above. To get those eight colors, we will measure three metrics: percent free memory, disk blocks transferred yesterday, and average daily logins. We will take a sample from the population of all of the VMs in the infrastructure and divide it into two buckets, U (unused) and I (in use), using our experience. These represent samples from the two larger buckets that, between them, contain every virtual machine in the infrastructure. This manual division also captures the "expert knowledge" we want to approximate programmatically. For each metric, we will calculate the mean within each of the two sample sets, and then mark a particular virtual machine as above the mean for that metric (A) or below it (B). Each virtual machine in the two sample groups thus gets a triplet, (A/B)(A/B)(A/B); this is the "color" of the virtual machine. There are 8 combinations, representing 8 marble colors:

AAA
AAB
ABB
BAA
BBA
BBB
BAB
ABA

(5)

I chose 130 virtual machines from the overall population and, based on my professional experience, determined 13 of them to be unused. I then determined the triplet for each virtual machine in the "in use" group and each virtual machine in the "unused" group, and counted the number of each "color" of virtual machine in each group. Here the Laplace estimator came into play. The sample of "unused" virtual machines was so small (only 13) that I needed the estimator to gauge the likelihood of colors in the larger "unused" bucket that never appeared in the sample. Indeed, even in the larger "in use" sample some colors were missing, so the estimator applied there as well. I then made overall estimates across both samples for the overall share of unused virtual machines and the overall occurrence of each color.

Thus, for each virtual machine "color" in the two sample sets, I estimated:
  • P(a triplet IF the virtual machine was unused) = P(C|U)
  • P(unused virtual machines) = P(U)
  • P(a triplet) = P(C)
So, by Bayes Theorem, I was able to approximate the probability that a virtual machine was unused, given that it had a particular color:
\[
P(U \mid C) = \frac{P(C \mid U)\,P(U)}{P(C)} \qquad (6)
\]
I then calculated P(U|C) for each color:

P(U|AAA) =.66666666666666666623
P(U|AAB) =.07407407407407407400
P(U|ABB) =.33333333333333333311
P(U|BBB) =.26666666666666666645
P(U|BBA) =.66666666666666666652
P(U|BAA) =.66666666666666666666
P(U|ABA) =.66666666666666666660
P(U|BAB) =.00666666666666666666

(7)

With these values, I can evaluate any virtual machine in the environment for the probability it's unused.

For an arbitrary virtual machine, I would first measure each of the three metrics, then determine the triplet using the per-metric means from the unused virtual machine sample, and finally look up the machine's "color" in the list above to get the estimated probability that the machine is unused.

The above process is not perfect, but it makes a much cheaper first pass at finding unused virtual machines than having a system administrator evaluate each virtual machine by hand. An administrator need only evaluate, for example, the machines whose colors carry a 0.666 probability of being unused.

References


Graham, Paul. "A Plan for Spam." http://www.paulgraham.com/, Aug. 2002. Web. 23 May 2013.

Gudkova, Darya. "Spam in Q1 2013." Securelist.com. Kaspersky Lab ZAO, 8 May 2013. Web. 22 May 2013.

Larsen, Richard J., and Morris L. Marx. An Introduction to Mathematical Statistics and Its Applications. 3rd ed. Upper Saddle River, NJ: Prentice Hall, 2001. Print.

Moore, Andrew W. "Probabilistic and Bayesian Analytics." Probability for Data Miners. Andrew W. Moore, n.d. Web. 22 May 2013.

Smith, David. "Estimation - Maximum Likelihood and Smoothing." Introduction to Natural Language Processing (http://people.cs.umass.edu/~dasmith/inlp2009/). University of Massachusetts, Amherst, Sept. 2009. Web. 22 June 2013.

VMware, Inc. "Determine True Total Cost of Ownership." Get Low Total-Cost-of-Ownership (TCO) with Maximum Virtual Machine Density. VMware, Inc., Sept. 2012. Web. 23 May 2013.

Appendix - Code

Since this blog is somewhat about doing things in bash that really should not be done in bash, I did all the math to calculate each P(U|C) in a bash script, included below. It is not optimized by any means and includes some creative abuses of bash. Note that the format of the original data file is virtual_machine_name,free_mem_percentage,blocks_transferred_yesterday,average_daily_logins. As I said above, I manually split the sample list into in-use and unused virtual machines and proceeded from there.
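
For reference, a couple of hypothetical rows in that format (the machine names and numbers are made up for illustration):

vm-web-01,12,524288,6.5
vm-old-test-07,91,0,0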


#!/bin/bash

set -u
#set -e
#set -x

status () {
  echo -n " |$*| "
}

get_unused_training_data () {

  for virtual_machine in `cat ${unused_VMs} `; do grep ${virtual_machine} ${training_data} ; done | sort | uniq 
  unset virtual_machine

}

get_in_use_training_data () {

  for virtual_machine in `cat ${in_use_VMs} `; do grep ${virtual_machine} ${training_data} ; done | sort | uniq 
  unset virtual_machine

}

get_unused_training_data_count () {

  get_unused_training_data | wc -l

}

get_in_use_training_data_count () {

  get_in_use_training_data | wc -l

}

get_unused_average_for_free_memory () {

  get_unused_training_data | awk -F, '{lines+=1; sum+=$2} END {print sum/lines}'

}  

get_in_use_average_for_free_memory () {

  get_in_use_training_data | awk -F, '{lines+=1; sum+=$2} END {print sum/lines}'

}

get_unused_average_for_block_transfer () {

  get_unused_training_data | awk -F, '{lines+=1; sum+=$3} END {print sum/lines}'

}

get_in_use_average_for_block_transfer () {

  get_in_use_training_data | awk -F, '{lines+=1; sum+=$3} END {print sum/lines}'

}

get_unused_average_for_logins () {

  get_unused_training_data | awk -F, '{lines+=1; sum+=$4} END {print sum/lines}'

}

get_in_use_average_for_logins () {

  get_in_use_training_data | awk -F, '{lines+=1; sum+=$4} END {print sum/lines}'

}

return_above_or_below_group_average () {

  a_or_b_value="`echo ${1} | awk -F. '{print $1}'`"
  a_or_b_average="`echo ${2} | awk -F. '{print $1}'`"

  if [ ${a_or_b_value} -gt ${a_or_b_average} ]
  then
    echo "A"
  fi

  if [ ${a_or_b_value} -lt ${a_or_b_average} ]
  then
    echo "B"
  fi

  if [ ${a_or_b_value} -eq ${a_or_b_average} ]
  then
    echo "A"
  fi

  unset a_or_b_value
  unset a_or_b_average

}

get_in_use_triplets () {

  get_in_use_triplets_free_memory_average="`get_in_use_average_for_free_memory`"

  get_in_use_triplets_block_transfer_average="`get_in_use_average_for_block_transfer`"

  get_in_use_triplets_logins_average="`get_in_use_average_for_logins`"

  for virtual_machine_line in `get_in_use_training_data`
  do
    virtual_machine_name="`echo ${virtual_machine_line} | awk -F, '{print $1}' `"
    
    get_in_use_triplets_local_value="`echo ${virtual_machine_line} | awk -F, '{print $2}' `"
    free_memory_a_or_b="`return_above_or_below_group_average ${get_in_use_triplets_local_value} ${get_in_use_triplets_free_memory_average}`"

    get_in_use_triplets_local_value="`echo ${virtual_machine_line} | awk -F, '{print $3}' `"
    block_transfer_a_or_b="`return_above_or_below_group_average ${get_in_use_triplets_local_value} ${get_in_use_triplets_block_transfer_average}`"

    get_in_use_triplets_local_value="`echo ${virtual_machine_line} | awk -F, '{print $4}' `"
    logins_a_or_b="`return_above_or_below_group_average ${get_in_use_triplets_local_value} ${get_in_use_triplets_logins_average}`"

    echo "${virtual_machine_name},${free_memory_a_or_b}${block_transfer_a_or_b}${logins_a_or_b}"

    unset virtual_machine_name
    unset free_memory_a_or_b
    unset block_transfer_a_or_b
    unset logins_a_or_b

  done

}

get_unused_triplets () {

  get_unused_triplets_free_memory_average="`get_unused_average_for_free_memory`"

  get_unused_triplets_block_transfer_average="`get_unused_average_for_block_transfer`"

  get_unused_triplets_logins_average="`get_unused_average_for_logins`"

  for virtual_machine_line in `get_unused_training_data`
  do
    virtual_machine_name="`echo ${virtual_machine_line} | awk -F, '{print $1}' `"
    
    get_unused_triplets_local_value="`echo ${virtual_machine_line} | awk -F, '{print $2}' `"
    free_memory_a_or_b="`return_above_or_below_group_average ${get_unused_triplets_local_value} ${get_unused_triplets_free_memory_average}`"

    get_unused_triplets_local_value="`echo ${virtual_machine_line} | awk -F, '{print $3}' `"
    block_transfer_a_or_b="`return_above_or_below_group_average ${get_unused_triplets_local_value} ${get_unused_triplets_block_transfer_average}`"

    get_unused_triplets_local_value="`echo ${virtual_machine_line} | awk -F, '{print $4}' `"
    logins_a_or_b="`return_above_or_below_group_average ${get_unused_triplets_local_value} ${get_unused_triplets_logins_average}`"

    echo "${virtual_machine_name},${free_memory_a_or_b}${block_transfer_a_or_b}${logins_a_or_b}"

    unset virtual_machine_name
    unset free_memory_a_or_b
    unset block_transfer_a_or_b
    unset logins_a_or_b
    unset get_unused_triplets_local_value

  done

}

get_l_of_triplets_if_unused () {

  for property_triplet in AAA AAB ABB BBB BBA BAA ABA BAB
  do

    number_of_unused_members_with_triplet="`get_unused_triplets | grep ${property_triplet} | wc -l `"
    number_of_in_use_members_with_triplet="`get_in_use_triplets | grep ${property_triplet} | wc -l `"
    number_of_unused_members="`get_unused_training_data | wc -l `"
    number_of_in_use_members="`get_in_use_training_data | wc -l `"
    number_of_combinations="8"

  # Laplace smoothed likelihood estimator: http://people.cs.umass.edu/~dasmith/inlp2009/lect5-cs585.pdf
  # ((number of group members with a certain triplet)+1)/((number of combinations)+(# of members in bucket))

    export l_of_${property_triplet}_if_unused="`echo \(${number_of_unused_members_with_triplet}+1\)/\(${number_of_combinations}+${number_of_unused_members}\) | bc -l `"
  #    export l_of_${property_triplet}_if_in_use="`echo \(${number_of_in_use_members_with_triplet}+1\)/\(${number_of_combinations}+${number_of_in_use_members}\) | bc -l `"

  #  echo "`env | grep l_of_${property_triplet}_if_unused | awk -F= '{print $2}'`"
  #    echo "`env | grep l_of_${property_triplet}_if_in_use | awk -F= '{print $2}'`"
      
  done 

}

get_overall_l_of_triplets () {

  total_virtual_machines="`cat ${training_data} | wc -l`"
  number_of_combinations="8"

  for property_triplet in AAA AAB ABB BBB BBA BAA ABA BAB
  do

    export count_of_${property_triplet}_for_in_use="`get_in_use_triplets | grep ${property_triplet} | wc -l `"
    export count_of_${property_triplet}_for_unused="`get_unused_triplets | grep ${property_triplet} | wc -l `"
    count_of_in_use="`env | grep count_of_${property_triplet}_for_in_use | awk -F= '{print $2}'`"
    count_of_unused="`env | grep count_of_${property_triplet}_for_unused | awk -F= '{print $2}'`"
    export overall_l_of_triplet_${property_triplet}="`echo \(${count_of_in_use}+${count_of_unused}+1\)/\(${number_of_combinations}+${total_virtual_machines}\) | bc -l `"
    # echo "`env | grep overall_l_of_triplet_${property_triplet} | awk -F= '{print $2}'`"
    # echo $((${count_of_in_use}+${count_of_unused}))
  done 
}

get_p_unused_if_triplet () {

# P(u|triplet) = ( p(triplet|unused) * p(unused) ) / p(triplet)

  get_l_of_triplets_if_unused
  get_overall_l_of_triplets

  
  count_of_unused="`get_unused_training_data | wc -l `"
  total_virtual_machines="`cat ${training_data} | wc -l`"
  number_of_combinations="8"

  l_of_unused="`echo \(${count_of_unused}+1\)/\(${number_of_combinations}+${total_virtual_machines}\) | bc -l `"

  for property_triplet in AAA AAB ABB BBB BBA BAA ABA BAB
  do
      l_of_triplet_if_unused="`env | grep l_of_${property_triplet}_if_unused | awk -F= '{print $2}'`"
      overall_l_of_triplet="`env | grep overall_l_of_triplet_${property_triplet} | awk -F= '{print $2}'`"

#echo ${l_of_unused}
#echo ${l_of_triplet_if_unused}
#echo ${overall_l_of_triplet}

      export p_unused_if_triplet_${property_triplet}="`echo \(${l_of_triplet_if_unused}\*${l_of_unused}\)/${overall_l_of_triplet} | bc -l `"
      echo "p_unused_if_triplet_${property_triplet}=`env | grep p_unused_if_triplet_${property_triplet} | awk -F= '{print $2}'`"

  done

}

#Local data files
training_data="./training_data"
unused_VMs="./unused_VMs"
in_use_VMs="./in_use_VMs"

get_p_unused_if_triplet

exit 0





Friday, April 12, 2013

Fixing SSH connection problems in EGit in Eclipse


Note: I posted a version of this on Stack Overflow.
Errors can occur when there is an underlying SSH authentication issue, such as having the wrong public key on the git remote server, or when the git remote server has changed its SSH host key.
Often the SSH error will appear as: "Invalid remote: origin: Invalid remote: origin"

Eclipse will use the .ssh directory you specify in Preferences -> General -> Network Connections -> SSH2 for its SSH configuration. Set it to "{your default user directory}\.ssh".
To fix things, first you need to determine which SSH client you are using for Git. This is stored in the GIT_SSH environment variable. Right-click on "Computer" (Windows 7), then choose Properties -> Advanced System Settings -> Environment Variables.
If GIT_SSH contains a path to plink.exe, you are using the PuTTY stack.
  • To get your public key, open PuTTYgen.exe and then load your private key file (*.ppk). The listed public key should match the public key on the git remote server.
  • To get the new host key, open a new PuTTY.exe session, and then connect to git@{git repo host}.
  • Click OK and say yes to store the new key.
  • Once you get a login prompt, you can close the PuTTY window. The new key has been stored.
  • Restart Eclipse.
If GIT_SSH contains a path to "ssh.exe" in your "Git for Windows" tree, you are using Git for Windows's OpenSSH.
  • Set %HOME% to your default user directory (as listed in Eclipse; see above).
  • Set %HOMEDRIVE% to the drive letter of your default user directory.
  • Set %HOMEPATH% to the path to your default user directory on %HOMEDRIVE%.
  • To get your public key, open the file %HOMEDRIVE%%HOMEPATH%/.ssh/id_rsa.pub (or id_dsa.pub) in a text editor. The listed public key should match the public key on the git remote server.
  • To get the new host key, run: cmd.exe
  • Run Git Bash
  • Ctrl-C
  • At the bash prompt, run /c/path/to/git/for/windows/bin/ssh.exe git@{git remote host}.
  • Type yes to accept the new key.
  • Once you have a login prompt, type: ctrl-c
  • Close the cmd.exe window
  • Restart Eclipse.
Finally, if you are still having trouble with your external SSH client, delete the GIT_SSH environment variable and set the HOME environment variable to your default user directory on Windows. Without the GIT_SSH variable, EGit will use its internal SSH client (written in Java). It will use the .ssh directory you specified above as its SSH configuration directory.
Note: If you have Git for Windows, you can use its tools to create an SSH key pair in your .ssh directory:
  • Set %HOME% to your default user directory (as listed in Eclipse).
  • Set %HOMEDRIVE% to the drive letter of your default user directory.
  • Set %HOMEPATH% to the path to your default user directory on %HOMEDRIVE%.
  • Run Git Bash
  • Ctrl-C
  • Run: ssh-keygen.exe -t rsa -b 2048
  • Save to the default filenames
  • Choose a passphrase or save without one. If you save with a passphrase, Eclipse will prompt you for it each time you push or pull from your git remote server.
  • Close Git Bash
You can also use the GUI in the SSH2 Preference pane in Eclipse to manage hosts and keys.

Friday, November 30, 2012

Where does Kindle Reader for Mac OS X from the App Store store my books?




Here:


/Users/yourusername/Library/Containers/com.amazon.Kindle/Data/Library/Application Support/Kindle/My Kindle Content/

Thursday, October 25, 2012

Straight to Voicemail, If Unknown - A simple free method for blocking scam calls and robocalls


Problem: How do I block scam calls and robocalls?

Premise: 

If the call is important, the caller will leave a voicemail.

Solution:


We put every number we know into our caller ID systems.

If a number we do not recognize calls, or the caller blocks caller ID, we always let the call go to voicemail. Callers who really want or need to talk with us will leave a voicemail. If we are not interested, we delete the message.

This process initially upset some of our parents, but we have not had to deal with a robocall or scam call in some time, since the calling computers almost never leave a voicemail. Our parents are now used to it and leave messages; sometimes we pick up as soon as they start talking. Our friends mostly communicate via Facebook, internet chat tools, and email these days, so they are used to asynchronous communication and don't mind leaving a message. The political parties and charities we support do leave messages, and we call them back to donate or express our support.

It's simple, effective, and free.

Friday, October 12, 2012

A script to split a file tree into separate trees - one per file extension present in the original tree

Purpose

Have you ever had a tree of files from which you needed only certain types of file? For example, I had an iTunes library that mixed some Apple files from another iTunes account with a large number of MP3s, and I wanted to pull out the tree of MP3s only. You can build such a tree by passing a combination of flags to rsync that makes it do an exclusive include.

How?

Pass the following flags to rsync to make it do an exclusive include of files matching a certain globbing pattern. Fill in the variables, of course, if you want to use this line on its own.

In particular, this rsync line:

rsync -av --include '*/' --include "*.${extension}" --exclude '*' ${source_directory}/ ${top_directory_of_results}/${extension}/

The script:

==========================================================

This tool reads a directory of files that have extensions and then copies each type of file to its own tree.

The location of each file in the subtree matches that file's location in the original tree.

Usage:

 ./split_by_file_extension.sh \
{-s source directory|--source-dir=source directory }\
{-t top directory of results|--top-directory-of-results=top directory of results}\
{-e comma,separated,list,of,extensions | --extensions=comma,separated,list,of,extensions}


==========================================================

#!/bin/bash

set -e
set -u


usage () {

 echo "=========================================================="
 echo "This tool reads a directory of files that have extensions"
 echo "and then copies each type of file to its own tree."
 echo ""
 echo "The location of each file in the subtree matches that"
 echo "file's location in the original tree."
 echo ""
 echo "Usage: $0 {-s source directory|--source-dir=source directory} \ "
 echo "          {-t top directory of results|--top-directory-of-results=top directory of results} \ "
 echo "          {-e comma,separated,list,of,extensions | --extensions=comma,separated,list,of,extensions} "
 echo "=========================================================="
}

are_these_the_same_path () {

 original_directory="`pwd`"
 cd "$1"
 first_directory="`pwd`"
 cd "${original_directory}"
 cd "$2"
 second_directory="`pwd`"
 cd "${original_directory}"

 if [ "${first_directory}" = "${second_directory}" ]
 then
  echo true
 else
  echo false
 fi

}

if [ $# -eq 0 ]
then
 usage
 exit 1
fi

needed_number_of_arguments_set=0

while [ $# -gt 0 ]
do
 case $1 in
  -s|--source-dir=*)
   if [ "$1" = "-s" ]
   then
    shift
    source_directory="$1"
    shift
   else
    source_directory="`echo $1| sed s,--source-dir=,,`"
    shift
   fi
   echo "Source Directory: ${source_directory}"
   if [ ! -d ${source_directory} ]
   then
    echo""
    echo "source_directory is not a directory."
    echo ""
    usage
    exit 1
   fi
   needed_number_of_arguments_set="`echo ${needed_number_of_arguments_set} + 1| bc`"
  ;;
  -e|--extensions=*)
   if [ "$1" = "-e" ]
   then
    shift
    extensions="$1"
    shift
   else
    extensions="`echo $1| sed s#--extensions=##`"
    shift
   fi
   echo "Extensions: ${extensions}"
   needed_number_of_arguments_set="`echo ${needed_number_of_arguments_set} + 1| bc`"
  ;;
  -t|--top-directory-of-results=*)
   if [ "$1" = "-t" ]
   then
    shift
    top_directory_of_results="$1"
    shift
   else
    top_directory_of_results="`echo $1| sed s,--top-directory-of-results=,,`"
    shift
   fi
   echo "Target Directory: ${top_directory_of_results}"
   if [ ! -d ${top_directory_of_results} ]
   then
    echo""
    echo "top_directory_of_results is not a directory."
    echo ""
    usage
    exit 1
   fi
   needed_number_of_arguments_set="`echo ${needed_number_of_arguments_set} + 1| bc`"
  ;;
  -h|--help)
   usage
   exit 0
  ;;
  *)
   echo ""
   echo "Unrecognized flag." 1>&2
   usage
   exit 1
  ;;
 esac
done

if [ "${needed_number_of_arguments_set}" -ne "3" ]
then
 echo""
 echo "All of the options must be set." 1>&2
 usage
 exit 1
fi

are_source_directory_and_top_directory_of_results_the_same="`are_these_the_same_path ${source_directory} ${top_directory_of_results}`"

if [ "${are_source_directory_and_top_directory_of_results_the_same}" = true ]
then
 echo ""
 echo "source_directory and top_directory_of_results cannot be the same." 1>&2
 echo ""
 usage
 exit 1
fi

#######################################
#
# Main Process.
#
# For each extension, make a directory under the target directory.
# Then copy that extension's files from the source tree into the
# extension's directory with rsync, preserving their paths.
#
#######################################

for extension in `echo "${extensions}" | sed s/,/\ /g`
do
  if [ ! -d ${top_directory_of_results}/${extension} ]
  then
     mkdir ${top_directory_of_results}/${extension}
  fi
done

for extension in `echo "${extensions}" | sed s/,/\ /g`
do
  rsync -av --include '*/' --include "*.${extension}" --exclude '*' ${source_directory}/ ${top_directory_of_results}/${extension}/
done
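
For example, to pull the MP3s and M4As out of an iTunes tree into per-extension trees under /tmp/split (both the source and target directories must already exist), the invocation would look like:

./split_by_file_extension.sh -s ~/Music/iTunes -t /tmp/split -e mp3,m4a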