Learning about Hadoop on Azure

May 18, 2012

To learn about Hadoop on Azure, I crunched 5GB of FTP logs and counted the top 100 messages.  This is a variation on the code sample that counts words in the Leonardo da Vinci Project Gutenberg eBook.  There is a similar Hadoop application that extracts information from IIS logs.

 

1.       Sign in using your access code and provision a cluster.

·         NB: This takes approximately 20 minutes.  The cluster appears to be short-lived (about 48 hours).

·         Reference: Windows Azure Deployment of Hadoop-based services on the Elastic Map Reduce (EMR) Portal

2.       Load the analysis data into Azure Blob Storage (ABS).

·         Once the data is loaded you can list it in ABS using the command #ls asv://ftplog/sample

 

[21] Sat 15Oct11 00:00:31 – (013831) 220 Serv-U FTP Server v11.0 ready…

[02] Sat 15Oct11 00:00:31 – (013831) Closed session

 

·         NB: Transferring 5GB to ABS took approximately 8 hours

·         Reference: Setup Azure Blob Store for Hadoop on Azure CTP

3.       Copy the data from ABS to Hadoop (Optional)

·         NB: This process took about 60 seconds

·         Reference: How to transfer data between different HDFS clusters.

4.       Execute the counter job

·         Upload "FTPLogMessageCounter.js" using fs.put("bin")

 

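// Map: emit each log message with a count of 1.
var map = function (key, value, context) {
    if (!value) {
        return;
    }
    // The first 37 characters hold the fixed-width prefix (bracketed code,
    // date, time, session id); the message text follows.
    var words = value.substring(37);
    context.write(words, 1);
};

// Reduce: sum the counts emitted for each distinct message.
var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next(), 10);
    }
    context.write(key, sum);
};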

 

·         Run the query interactively using the command: pig.from("asv://ftplog/sample").mapReduce("bin/FTPLogMessageCounter.js", "word, count:long").orderBy("count DESC").take(100).to("log_output")

·         View the results using the command: file = fs.read("log_output")

·         NB: Make sure to include the “pig.” prefix.  Many of the Hadoop on Azure docs leave this part out.

·         Reference: Running a JavaScript Map/Reduce Job from Interactive JavaScript Console
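
Before uploading, the map and reduce logic can be exercised locally. The following is a minimal sketch for Node.js and is not part of the Hadoop on Azure tooling: `runLocally`, the mock `context` objects, and the `hasNext()`/`next()` wrapper are hypothetical helpers that only imitate what the job appears to pass in, and the third sample line is made up so that one message repeats.

```javascript
// Same map/reduce logic as FTPLogMessageCounter.js.
var map = function (key, value, context) {
    if (!value) {
        return;
    }
    context.write(value.substring(37), 1);
};

var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next(), 10);
    }
    context.write(key, sum);
};

// Hypothetical local harness: group the map output by key, then feed each
// group to reduce through an object exposing hasNext()/next().
function runLocally(lines) {
    var grouped = {};
    lines.forEach(function (line, i) {
        map(i, line, {
            write: function (k, v) {
                (grouped[k] = grouped[k] || []).push(v);
            }
        });
    });

    var counts = {};
    Object.keys(grouped).forEach(function (k) {
        var vals = grouped[k];
        var pos = 0;
        var iterator = {
            hasNext: function () { return pos < vals.length; },
            next: function () { return vals[pos++]; }
        };
        reduce(k, iterator, {
            write: function (key, sum) { counts[key] = sum; }
        });
    });
    return counts;
}

var sampleLines = [
    "[21] Sat 15Oct11 00:00:31 - (013831) 220 Serv-U FTP Server v11.0 ready...",
    "[02] Sat 15Oct11 00:00:31 - (013831) Closed session",
    "[02] Sat 15Oct11 00:00:45 - (013832) Closed session" // made-up line
];

var counts = runLocally(sampleLines);
console.log(counts);
```

With these three lines, "Closed session" is counted twice and the server banner once, which is a quick way to confirm that the 37-character offset lines up with your own log format.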

 

General Reference:

·         Hadoop-based Services on Windows Azure How To Guide

·         Introduction to Hadoop on Windows Azure (video)

·         Hadoop on Azure (video)

·         Hadoop on Azure Yahoo Message Group

 


What do you do when demand for your application outgrows the capabilities of an RDBMS?

February 11, 2012

Relational Database Management Systems (RDBMS) date back to the early 1970s and are characterized by a fixed schema, SQL (Structured Query Language), and ACID compliance. (Atomicity: a transaction either completes fully or not at all; Consistency: a transaction leaves the system in a known state; Isolation: a transaction does not affect others in flight; Durability: once committed, changes are permanent.) Data is organized in rows and columns like a spreadsheet, and relationships can be built between different tables. In general, relational databases work incredibly well until you hit “big data.” Big data is often defined using the three V’s: volume, variety, and velocity.

  • Volume – multiple terabytes or more
  • Variety – numbers, audio, video, text, streams
  • Velocity – how fast the data is collected and requested

[Figure. Source: Montis.com]

With the volume of data they need to process every second, it is easy to see why Google, Twitter, Facebook, and many other sites like them had to look for options beyond RDBMS. Over the past 10 years a technology known as NoSQL has emerged, largely in response to the challenge of solving big data problems. Generally speaking, NoSQL solutions have the following characteristics:

  • No schema required upfront
  • Can store massive amounts of data
  • Auto-sharding / elasticity (spreading data over multiple machines)
  • Distributed query support (ability to execute a query on sharded data)
  • BASE (basically available, soft state, eventually consistent) not ACID

There is a raft of NoSQL solutions available for a variety of different business problems. The following figure from the 451 Group is an interesting illustration of the current marketplace. The acronym SPRAIN refers to:

  • Scalability – hardware economics
  • Performance – MySQL limitations
  • Relaxed consistency – CAP theorem (*)
  • Agility – polyglot persistence (use the right database for given problem)
  • Intricacy – big data, total data
  • Necessity – open source

* CAP (or Brewer’s) theorem says that a distributed computer system can only simultaneously satisfy two of three guarantees: consistency, availability, and partition tolerance.

[Figure: the NoSQL marketplace. Source: 451 Group]

Having spent time putting a NoSQL (Mongo) database into place, I can attest that it’s harder than it looks (hard to query directly, limited support, limited reporting, and a steep learning curve). That said, the products and documentation are getting better, developers are becoming more comfortable with the technology, and companies like Couchbase are sprouting up to provide support. Today NoSQL is a great thing if you have the opportunity to start from a clean sheet of paper. In particular, the absence of an upfront schema works really well with the MVC programming model (e.g., Rails, ASP.Net MVC3).

Many organizations don’t have the luxury of starting over. They have RDBMS-based applications that face performance or scalability challenges, and for business reasons (time, cost, skill, etc.) rather than technical ones, those challenges cannot be addressed with a new database. As Couchbase points out, RDBMS developers have resorted to “sharding” (spreading data across multiple servers), denormalizing data (adding redundant columns to the schema to optimize performance), and adding a memory cache to optimize query performance. None of these solutions is really the answer for an organization facing a big data problem.
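
To make the sharding idea concrete, here is a minimal sketch of hash-based routing, where the application decides which server holds a given row. The shard names, the `hashKey` function, and `shardFor` are all illustrative assumptions, not anything from Couchbase or a particular RDBMS.

```javascript
// Hypothetical shard list; in practice each entry would be a database server.
const SHARDS = ["db-0", "db-1", "db-2", "db-3"];

// Simple deterministic string hash (a djb2 variant). Illustrative only.
function hashKey(key) {
  let hash = 5381;
  for (let i = 0; i < key.length; i++) {
    // Mix in each character and keep the value as an unsigned 32-bit integer.
    hash = ((hash * 33) ^ key.charCodeAt(i)) >>> 0;
  }
  return hash;
}

// Route a record key to one shard. The same key always maps to the same shard.
function shardFor(key) {
  return SHARDS[hashKey(key) % SHARDS.length];
}

console.log(shardFor("customer:1001"));
console.log(shardFor("customer:1002"));
```

The catch with this simple modulo scheme is that changing the number of shards remaps most keys, which is part of why hand-rolled sharding on top of an RDBMS tends to be painful to operate.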

Another way to optimize the performance of a database-bound application is to minimize query time by loading the database itself onto an SSD or flash memory. The “catch,” so to speak, is that flash memory is not cheap, but on the other hand it is much less expensive than rewriting software. FusionIO makes a flash memory product called ioDrive. The ioDrive is a PCI card that effectively behaves like an SSD. Because FusionIO has designed the card to integrate at the memory tier, they have managed to minimize I/O bottlenecks. “Spinning disks” have a read rate of approximately 200-300 IOPS (I/O operations per second); ioDrive can achieve a rate of almost 100K IOPS.
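
A quick back-of-the-envelope calculation shows what that IOPS gap means. This sketch assumes a hypothetical workload of one million random reads and uses 250 IOPS as the midpoint of the spinning-disk range quoted above; it ignores caching, queue depth, and sequential access, so treat the numbers as rough.

```javascript
// Seconds needed to complete a number of random reads at a given IOPS rate.
function secondsForReads(reads, iops) {
  return reads / iops;
}

const reads = 1000000;    // hypothetical workload
const diskIops = 250;     // midpoint of the 200-300 IOPS range
const flashIops = 100000; // "almost 100K IOPS"

const diskSeconds = secondsForReads(reads, diskIops);
const flashSeconds = secondsForReads(reads, flashIops);

console.log("spinning disk:", diskSeconds, "seconds");
console.log("flash:", flashSeconds, "seconds");
console.log("speedup:", diskSeconds / flashSeconds);
```

Under these assumptions the same read workload drops from roughly an hour to a matter of seconds.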



Hosting an MVC3 (with membership) application on EC2

February 4, 2012

One of my side projects was to get an MVC3 application that uses the Razor View Engine and Membership hosted on EC2 running Linux. I found some amazingly helpful resources along the way – particularly from Nathan Bridgewater at Integrated Web Systems.

Step one of the project is to get an EC2 instance prepped and ready.  Basically I followed the cookbook instructions on Bridgewater’s site: Get Started with Amazon EC2, Run Your .Net MVC3 (RAZOR) Site in the Cloud with Linux Mono.

The exact commands I used:

Launch a new instance from AMI ID ami-ccf405a5 and associate an Elastic IP (xx.xx.xx.xx)
sudo apt-get update && sudo apt-get dist-upgrade -y
wget http://badgerports.org/directhex.ppa.asc
sudo apt-key add directhex.ppa.asc
sudo apt-get install python-software-properties
sudo add-apt-repository 'deb http://ppa.launchpad.net/directhex/ppa/ubuntu lucid main'
sudo apt-get update
sudo apt-get install mono-apache-server4 mono-devel libapache2-mod-mono
cd /srv
sudo mkdir www; cd www
sudo mkdir default
sudo chown www-data:www-data default
sudo chmod 755 default
cd /etc/apache2/sites-available/
sudo vi mono-default (see mono-default, change IP address)
cd /etc/apache2/sites-enabled
sudo rm 000-default
sudo ln -s /etc/apache2/sites-available/mono-default 000-mono
sudo mv /var/www/index.html /srv/www/default
sudo vi /srv/www/default/index.html
sudo apt-get install apache2
sudo service apache2 restart
Test in a browser via IP address (you should see the default apache page)

My mono default:

# xx.xx.xx.xx is my Elastic IP address
<VirtualHost *:80>
  ServerName xx.xx.xx.xx
  ServerAdmin myemail@domain.com
  DocumentRoot /srv/www/default
  MonoServerPath xx.xx.xx.xx "/usr/bin/mod-mono-server4"
  MonoDebug xx.xx.xx.xx true
  MonoSetEnv xx.xx.xx.xx MONO_IOMAP=all
  MonoApplications xx.xx.xx.xx "/:/srv/www/default"
  <Location "/">
    Allow from all
    Order allow,deny
    MonoSetServerAlias xx.xx.xx.xx
    SetHandler mono
    SetOutputFilter DEFLATE
    SetEnvIfNoCase Request_URI "\.(?:gif|jpe?g|png)$" no-gzip dont-vary
  </Location>
  <IfModule mod_deflate.c>
    AddOutputFilterByType DEFLATE text/html text/plain text/xml text/javascript
  </IfModule>
</VirtualHost>

Step two is to test Mono with a simple ASP.NET page.  Put this file into /srv/www/default.  Edit with sudo and view via browser at http://xx.xx.xx.xx/test.aspx.

<%@ Page Language="C#" %>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>ASP.Net Test page</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<script runat="server">
private void Page_Load(Object sender, EventArgs e)
{
lblTest.Text = "This is a successful test.";
}
</script>
</head>
<body>
<h1>
This is a test page</h1>
<asp:Label runat="server" ID="lblTest"></asp:Label>
</body>
</html>

If problems are encountered, check the logs in /var/log/apache2/access.log or /var/log/apache2/error.log.

Step three is to get MySQL installed and tested with this simple application.

sudo apt-get install mysql-server
sudo apt-get install libmysql6.1-cil
mysql -u root -p
CREATE DATABASE sample; USE sample;
CREATE TABLE test (id INT AUTO_INCREMENT PRIMARY KEY, name VARCHAR(25));
INSERT INTO sample.test VALUES (null, 'Lucy');
INSERT INTO sample.test VALUES (null, 'Ivan');
INSERT INTO sample.test VALUES (null, 'Nicole');
INSERT INTO sample.test VALUES (null, 'Ursula');
INSERT INTO sample.test VALUES (null, 'Xavier');
CREATE USER 'testuser'@'localhost' IDENTIFIED BY 'somepassword';
GRANT ALL PRIVILEGES ON sample.* TO 'testuser'@'localhost';
FLUSH PRIVILEGES;

Put this file into /srv/www/default. Edit with sudo and view via browser at

<%@ Page Language="C#" %>
<%@ Import Namespace="System.Data" %>
<%@ Import Namespace="MySql.Data.MySqlClient" %>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>ASP and MySQL Test Page</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<script runat="server">
private void Page_Load(Object sender, EventArgs e)
{
string connectionString = "Server=127.0.0.1;Database=sample;User ID=testuser;Password=somepassword;Pooling=false;";
MySqlConnection dbcon = new MySqlConnection(connectionString);
dbcon.Open();

MySqlDataAdapter adapter = new MySqlDataAdapter("SELECT * FROM test", dbcon);
DataSet ds = new DataSet();
adapter.Fill(ds, "result");

dbcon.Close();
dbcon = null;

SampleControl.DataSource = ds.Tables["result"];
SampleControl.DataBind();
}
</script>
</head>
<body>
<h1>Testing Sample Database</h1>
<asp:DataGrid runat="server" ID="SampleControl" />
</body>
</html>

Step four is to get the simplest possible MVC3 Razor application functioning on Ubuntu / EC2.  Again Bridgewater has a more detailed explanation of what to do at his website linked here.

  1. Go into Visual Studio 2010 and create a new MVC3 / Razor project, making no changes to the default project template.
  2. Build it and run it locally.
  3. Ensure that these references are set to “copy local”: System.Web.Mvc, System.Web.Helpers, and System.Web.Routing
  4. Copy System.Web.Razor, System.Web.WebPages, System.Web.WebPages.Razor, System.Web.WebPages.Deployment into your application’s bin directory.  You will find these files in C:\Program Files (x86)\Microsoft ASP.NET\ASP.NET Web Pages\v1.0\Assemblies
  5. Publish the application to a scratch directory
  6. Copy the published application to your EC2 machine.  I used git bash to tar (tar -zcvf aws.tar.gz *) the files as Bridgewater recommends but could not get scp to work, so I ftp’d the file over.
  7. On the EC2 machine: cd /srv/www/default; sudo mv /home/ubuntu/aws.tar.gz .; sudo tar -zxvf *.gz; sudo chown -R www-data:www-data *; sudo chmod 755 *; sudo service apache2 restart
  8. Confirm it is working from a browser by checking the default IP address http://xx.xx.xx.xx
  9. NB: I had to hit refresh several times before the application would work.

Step five is to implement membership using MySQL.

  1. On your Windows machine, edit the default controller and decorate it with the [Authorize] attribute.
  2. Edit your web.config as shown below.  This is where it can get hairy.  If you want to run this locally on Windows you need to install the MySQL connector for .Net and Mono http://dev.mysql.com/downloads/connector/net/.  Make sure that you reference system.web.  On Ubuntu the application uses system.data.  The trick is to add them both so you can run the same code on Ubuntu and Windows.  Also notice that I’ve left the database password in clear text.  As Nathan notes, this is not a good practice.
  3. On the Ubuntu machine, go into MySQL and create a database called membership.
  4. Deploy the application to EC2 and test the application using step 4.

<?xml version="1.0"?>

<!--
 For more information on how to configure your ASP.NET application, please visit
 http://go.microsoft.com/fwlink/?LinkId=152368
 -->

<configuration>
 <connectionStrings>
 <add name="Default"
 connectionString="data source=127.0.0.1;user id=aspnet_user;
 password=secret_password;database=membership;"
 providerName="MySql.Data.MySqlClient" />
 </connectionStrings>

<system.web>
 <compilation debug="true" targetFramework="4.0">
 <assemblies>
 <add assembly="System.Web.Abstractions, Version=4.0.0.0, Culture=neutral, PublicKeyToken=31BF3856AD364E35" />
 <add assembly="System.Web.Routing, Version=4.0.0.0, Culture=neutral, PublicKeyToken=31BF3856AD364E35" />
 <add assembly="System.Web.Mvc, Version=2.0.0.0, Culture=neutral, PublicKeyToken=31BF3856AD364E35" />
 </assemblies>
 </compilation>

<authentication mode="Forms">
 <forms loginUrl="~/Account/LogOn" path="/" timeout="2880" />
 </authentication>

<!--NOTE that "hashed" isn't supported with the public release of MySql.Web 6.3.5 under
 Mono runtime. But I can't bring myself to share sample code that doesn't hash the
 passwords by default. ;) The version included with this sample project is slightly modified to
 allow hashed passwords in Mono. I highly recommend checking out the latest version of
 MySql .NET Connector. http://dev.mysql.com

 Also, I found that you have to rebuild MySql.Data and MySql.Web
 using .NET 4.0 profile if you want it to work with Asp.Net 4.0 under Mono. This is a known bug and should
 be published in upcoming versions of the connector. -->
 <membership defaultProvider="MySqlMembershipProvider">
 <providers>
 <clear/>
 <add name="MySqlMembershipProvider"
 type="MySql.Web.Security.MySQLMembershipProvider, mysql.web"
 connectionStringName="Default"
 enablePasswordRetrieval="false"
 enablePasswordReset="true"
 requiresQuestionAndAnswer="false"
 requiresUniqueEmail="true"
 passwordFormat="hashed"
 maxInvalidPasswordAttempts="5"
 minRequiredPasswordLength="6"
 minRequiredNonalphanumericCharacters="0"
 passwordAttemptWindow="10"
 applicationName="/"
 autogenerateschema="true"/>
 </providers>
 </membership>

<roleManager enabled="true" defaultProvider="MySqlRoleProvider">
 <providers>
 <clear/>
 <add connectionStringName="Default"
 applicationName="/"
 name="MySqlRoleProvider"
 type="MySql.Web.Security.MySQLRoleProvider, mysql.web"
 autogenerateschema="true"/>
 </providers>
 </roleManager>

<profile>
 <providers>
 <clear/>
 <add type="MySql.Web.Security.MySqlProfileProvider, mysql.web"
 name="MySqlProfileProvider"
 applicationName="/"
 connectionStringName="Default"
 autogenerateschema="true"/>
 </providers>
 </profile>

<pages>
 <namespaces>
 <add namespace="System.Web.Mvc" />
 <add namespace="System.Web.Mvc.Ajax" />
 <add namespace="System.Web.Mvc.Html" />
 <add namespace="System.Web.Routing" />
 </namespaces>
 </pages>

<!--Don't forget to update this... I left it open to make it easier to debug.-->
 <customErrors mode="Off"/>
 </system.web>

<system.data>
 <DbProviderFactories>
 <clear/>
 <add name="MySQL Data Provider"
 description="ADO.Net driver for MySQL"
 invariant="MySql.Data.MySqlClient"
 type="MySql.Data.MySqlClient.MySqlClientFactory, MySql.Data"/>
 </DbProviderFactories>
 </system.data>

<system.webServer>
 <validation validateIntegratedModeConfiguration="false"/>
 <modules runAllManagedModulesForAllRequests="true"/>
 </system.webServer>

<runtime>
 <assemblyBinding xmlns="urn:schemas-microsoft-com:asm.v1">
 <dependentAssembly>
 <assemblyIdentity name="System.Web.Mvc" publicKeyToken="31bf3856ad364e35" />
 <bindingRedirect oldVersion="1.0.0.0" newVersion="2.0.0.0" />
 </dependentAssembly>
 </assemblyBinding>
 </runtime>
</configuration>


HTML5 Validation using Yepnope

November 23, 2011

In this month’s MSDN Magazine there is an interesting article on Browser and Feature Detection.  What really caught my eye was the piece on Modernizr, the JavaScript library that implements browser feature detection.

Modernizr has built-in detection for most HTML5 and CSS3 features that’s very easy to use in your code. It’s very widely adopted and constantly enhanced. Both Modernizr and jQuery are shipped with the ASP.NET MVC tools.

As HTML5 and CSS3 become more and more prevalent, feature detection is increasingly relevant.

A growing number of ready-made “fallbacks” for many HTML5 features, known as shims and polyfills, can ease that burden. These come in the form of CSS and JavaScript libraries or sometimes even as Flash or Silverlight controls that you can use in your project, adding missing HTML5 features to browsers that don’t otherwise support them. The difference between shims and polyfills is that shims only mimic a feature and each has its own proprietary API, while polyfills emulate both the HTML5 feature itself and its exact API. So, generally speaking, using a polyfill saves you the hassle of having to adopt a proprietary API.  The HTML5 Cross Browser Polyfills collection on github contains a growing list of available shims and polyfills.

As .Net developers we are usually insulated from browser incompatibility issues; however, there may be situations where you are not using the .Net framework.  There is a good example of using yepnope (a conditional resource loader) for HTML5 Form Validation at CSSKarma.  I tweaked and re-implemented the example using a .Net MVC application, which you can see below.  A couple of “gotchas”:

· Modernizr is part of the .Net MVC tools that come from Microsoft; however, yepnope is not in the release of the MVC tools that I have on my machine.  You will need to download version 2.0 from modernizr.com to get yepnope support.

· The jQuery syntax for working with yepnope takes some getting used to.  Make sure to load jQuery first.

· Visual Studio 2010 does not seem to know about HTML5 and will raise a validation warning when you use HTML5 attributes. See Figure 2.
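
For readers who want the general shape of the technique in text form, here is a minimal sketch of feature detection plus conditional loading. It is not the code from the figures: `supportsInputAttribute`, `validationLoaderConfig`, and the fallback script path are hypothetical names, and the only yepnope features assumed are its documented `test` and `nope` options.

```javascript
// Feature-detect an HTML5 input attribute by checking whether a freshly
// created <input> element exposes the corresponding property.
function supportsInputAttribute(doc, attribute) {
  return attribute in doc.createElement("input");
}

// Build a yepnope-style test object: load the fallback script only when the
// browser lacks native support. The script path is a hypothetical example.
function validationLoaderConfig(doc) {
  return {
    test: supportsInputAttribute(doc, "required"),
    nope: ["scripts/validation-fallback.js"]
  };
}

// In a browser with yepnope loaded, hand the config to the loader.
// Outside a browser (for example under Node.js) this block is skipped.
if (typeof document !== "undefined" && typeof yepnope === "function") {
  yepnope(validationLoaderConfig(document));
}
```

Passing the document in as a parameter keeps the detection logic testable without a browser; in a page that already includes Modernizr you would more likely use its own `Modernizr.input.required` flag as the test.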

Figure 1 – MyValidation.js – based on the example at CSSKarma

Figure 2 – HTML5 Source

Figure 3 – Output

References:

http://yepnopejs.com/

http://haz.io/

http://modernizr.com


Best practices for adding scalability

November 11, 2011

My thesis is that you can’t have a good SaaS application that doesn’t scale.  By definition the need for scalability is driven by customer demand, but there is demand and there is DEMAND. A handful of lucky organizations (Google, Twitter, Facebook) are faced with industrial-strength volume every minute of every day. Organizations with this type of DEMAND can afford to have entire divisions dedicated to managing scalability. Most people are dealing with optimizing their resources for linear growth or the happy situation where their application (Instagram) catches fire (in some cases overnight). A scalable architecture makes it possible to expand to cloud services such as EC2 and Azure or even locally hosted capacity. Absent a scalable architecture, an organization is faced with curating a collection of tightly coupled servers and overseeing a maintenance nightmare.

Scalability is the ability to handle additional load by adding more computational resources.  Performance is not scalability; however, improving system performance mitigates to some degree the need for scalability.  Performance is the number of operations per unit of time that a system can handle (e.g., words / second, pages served / day, etc.).  There are two types of scalability: vertical and horizontal.

Vertical scalability is achieved by adding more power (more RAM, faster CPU) to a single machine.  Vertical scalability typically results in incremental improvements.  Horizontal scalability is accommodating more load by distributing processing to multiple computers.  Where vertical scalability is relatively trivial to implement, horizontal scalability is much more complex.  In exchange, horizontal scalability offers theoretically unlimited capacity.  Google is the classic example of horizontal scalability at enormous scale, using thousands of low-cost commodity servers.
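
The simplest form of distributing load across machines is round-robin dispatch. The sketch below is only an illustration of that idea: `createRoundRobin` and the server names are hypothetical, and a real load balancer would also track server health and capacity.

```javascript
// Round-robin dispatcher: each call hands the request to the next server
// in the pool, wrapping around at the end.
function createRoundRobin(servers) {
  let next = 0;
  return function pick() {
    const server = servers[next];
    next = (next + 1) % servers.length;
    return server;
  };
}

const pick = createRoundRobin(["web-1", "web-2", "web-3"]); // hypothetical hosts

const assignments = [];
for (let i = 0; i < 6; i++) {
  assignments.push(pick());
}
console.log(assignments);
```

Adding capacity then means adding an entry to the pool, which only works cleanly if the servers do not hold per-user state.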

If you have the luxury of working off of a blank sheet of paper or have the flexibility to implement a major new technology stack, some of the better solutions for implementing scalability include ActiveMQ and Hadoop. Microsoft’s AppFabric Service Bus promises capability in this area for Azure-hosted applications. Many times scalability was considered when an application was first created but has proven to be inadequate for current demand.  The following are suggestions for improving an existing application’s scalability.

Microsoft’s Five Commandments of Designing for Scalability

  • Do Not Wait – A process should never wait longer than necessary.
  • Do Not Fight for Resources – Acquire resources as late as possible and then release them as soon as possible.
  • Design for Commutability – Two or more operations are said to be commutative if they can be applied in any order and still obtain the same result.
  • Design for Interchangeability – Manage resources such that they can be interchangeable (e.g., database connections).  Keep server-side components as stateless as possible.
  • Partition Resources and Activities – Minimize relationships between resources and between activities.
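
Commutability is easiest to see with a small example. In this sketch (all function names are illustrative), two increments can be applied in either order with the same result, while an increment and a doubling cannot:

```javascript
// Apply a list of operations to an initial value, in order.
function applyAll(initial, operations) {
  return operations.reduce((state, op) => op(state), initial);
}

const add5 = (n) => n + 5;
const add7 = (n) => n + 7;
const double = (n) => n * 2;

// Increments commute: both orders give 22.
const incrementsAB = applyAll(10, [add5, add7]);
const incrementsBA = applyAll(10, [add7, add5]);

// Mixing an increment with a doubling does not: 30 versus 25.
const mixedAB = applyAll(10, [add5, double]);
const mixedBA = applyAll(10, [double, add5]);

console.log(incrementsAB, incrementsBA, mixedAB, mixedBA);
```

Operations that commute can be processed by different servers in whatever order they arrive, which removes the need to coordinate ordering between them.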

Microsoft’s Best Practices for Scalability

  • Use clustering technologies such as load balancers, message brokers, and other solutions that implement a decoupled architecture.
  • Consider logical vs. physical tiers such as the model view controller (MVC) architecture.
  • Isolate transactional methods such that components that implement transactions are distinct from those that do not.
  • Eliminate business layer state such that wherever possible server-side objects are stateless.

Shahzad Bhatti’s Ten Commandments for Scalable Architecture

  1. Divide and conquer – Design a loosely coupled and shared nothing architecture.
  2. Use messaging oriented middleware (ESB) to communicate with the services.
  3. Resource management – Manage http sessions and remove them for static contents, close all resources after usage such as database connections.
  4. Replicate data – For write intensive systems use master-master scheme to replicate database and for read intensive systems use master-slave configuration.
  5. Partition data (Sharding) – Use multiple databases to partition the data.
  6. Avoid single points of failure – Identify any single points of failure in hardware, software, network, and power supply.
  7. Bring processing closer to the data – Instead of transmitting large amounts of data over the network, bring the processing closer to the data.
  8. Design for service failures and crashes – Write your services as idempotent so that retries can be done safely.
  9. Dynamic Resources – Design service frameworks so that resources can be removed or added automatically and clients can automatically discover them.
  10. Smart Caching – Cache expensive operations and contents as much as possible.
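
Item 8 is worth a concrete illustration. One common way to make an operation idempotent is to have the caller attach a unique request ID and have the service remember which IDs it has already processed. The following is a minimal in-memory sketch under that assumption; `createDepositService` and its fields are hypothetical and are not taken from Bhatti's article.

```javascript
// A deposit service that is safe to retry: a repeated request ID returns
// the stored result instead of applying the deposit a second time.
function createDepositService() {
  const processed = new Map(); // requestId -> result of the first attempt
  let balance = 0;

  return {
    deposit(requestId, amount) {
      if (processed.has(requestId)) {
        return processed.get(requestId); // retry: no side effects
      }
      balance += amount;
      const result = { requestId, balance };
      processed.set(requestId, result);
      return result;
    },
    getBalance() {
      return balance;
    }
  };
}

const service = createDepositService();
service.deposit("req-1", 100);
service.deposit("req-1", 100); // client retry after a timeout
service.deposit("req-2", 50);
console.log(service.getBalance());
```

Because the retry of "req-1" is recognized, the balance ends at 150 rather than 250. A production service would keep the processed IDs in durable shared storage rather than in memory.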



Software Due Diligence – Part III

September 29, 2011

Software due diligence is a bit like having a home inspection done when purchasing a house.  Some problems are more serious than others.  For example, if you find that there is mold or asbestos in the basement, that might be a reason to walk away.  As with a home inspection, in most cases the diligence does not reveal problems so serious that you will want to back out of the deal entirely.  More typically, you may want to reconsider your valuation or take steps to manage the transition.

Red Flags

  • Architecture that will not scale (possible walk away)
  • Application cannot be re-built / run from source control checkout
  • Inability to engender a sense of confidence that the solution really works
  • Lack of forethought (this is where I’d like to go)
  • Architecture that cannot be cleanly expressed
  • Prima donna developers
  • Absence of technical leadership
  • Reliance on obsolete technology (e.g., Delphi)
  • Business logic consistently found in the presentation layer
  • Absence of any documentation whatsoever
  • Critical code that no one owns (e.g., code that was developed by abc who isn’t here anymore)
  • Serious ethical breakdowns

Reasonable expectations

  • Absence of perfect documentation (even the best organizations are challenged to have up to date documentation)
  • At least one thing that impresses you as “world-class” (the more the better)
  • Good code
  • Finding that there are one or two go-to people
  • People wear many hats (product manager, QA, developer, etc.)
  • Insufficient infrastructure
  • Reliance on free services
  • Lack of published standards, metrics, or formal process
  • Informal bug tracking
  • Out of date off-the-shelf software
  • Limited requirements documents

Things to give you pause

  • Non-homogeneous technology configuration
  • Bleeding edge / specialized technology (e.g., Cassandra, assembly)
  • Dependence on a service provided for free (Google Translate API)
  • Sloppy code
  • Lack of appreciation of the competition
  • Insufficient knowledge of best practices (jQuery vs. JavaScript)
  • How well will the system under examination complement (or conflict with) existing systems?
  • Are some areas more complete than others? (some code is more battle tested)
  • Is there something that should be patented?
  • Poor user interface
  • General sense of mediocrity

Software Due Diligence – Part II

September 29, 2011

Operations

  • Review of current architecture via documentation or whiteboard.
  • Watch the application run from an OS console. (e.g., top, perfmon)
  • Watch the application run from purpose-built administrator tools.
  • Describe the hosting architecture.
  • Where is the system hosted?
  • How redundant is the system?
  • How is the system monitored?
  • What are the biggest bottlenecks in the system?
  • Has your system ever been compromised?
  • Characterize the reliability of your system.
  • Have you done any vulnerability or penetration testing?
  • How would you handle 10X volume, 100X volume, 1000X volume? (This is a big one.)
  • Inventory of hardware and software (technology) assets.
  • Where is your source code stored?

Software

  • Review the source code (looking for good coding practices, clean architecture, exception handling, etc.).
  • Review database schema and query the live database (or copy of live).
  • Inventory custom components and software license agreements.
  • Are there any public or private APIs?
  • Review (developer) documentation associated with the code.
  • Review user-facing documentation and/or training materials.
  • Build all applications from source code and deploy to hosting environment.
  • Is there a debug / development interface?
  • Is there a database of customer feature requests or open issues?
  • Are any obsolete technologies (e.g., Delphi) in use?

People

  • Meet key employees and get to know their backgrounds.
  • Who is the go-to person?
  • How do people collaborate?
  • Describe your SDLC.
  • Where do requirements come from?
  • How is the software tested?

Product

  • See a demo of all products, utilities, and supporting software.
  • Product Roadmap: Recent, Past, Present, Future.
  • Review current business model and sales process.
  • Are there any prototypes or product concepts that we should see or discuss? (These can be hidden gems)

Software Due Diligence – Part I

September 29, 2011

I was recently asked about what goes into software due diligence.  This is the first of three posts on this topic.  In this post I outline my thoughts on the process itself.  Part II is my working list of questions.  Finally, Part III offers some thoughts about what to expect and when to walk away.

There are a bunch of good checklists out there for buying an entire company.  See references below.  Most of these checklists talk about software diligence relatively generically. After looking at a number of different organizations for a variety of different reasons, I’ve built myself a checklist that may be a useful starting point for others. I think my checklist is most applicable for medium-sized applications (~1MM SLOC) built by teams of 3-10 people. Larger applications probably warrant a more sophisticated approach.

This post is not about valuing the business, assessing the product, or anything to do with market position. It is all about looking at the code, how the code is hosted, and assessing the technical assets of a business. If you are doing a project for a VC, they typically want to know if there are any “red flags.” There are two types of red flags: those that are correctable and those that are not. A correctable red flag is something like a lack of off-site backups or a non-redundant server. An uncorrectable red flag (which typically means walk away) is something like prima donna developers, limited / no documentation, or an architecture that cannot scale. If you are doing a project for a business that is trying to integrate a property with their own systems, they want to know about the red flags, but they also want an understanding of what they are going to be inheriting and what it’s going to take to make it useful as fast as possible.

Invariably the technical examination will overlap with looking at the business itself. For example, when buying a product that claims to have a million users, it’s prudent to query the database and see that there are at least a million email addresses in the database. Similarly, the business development folks are often after any information that they can use to value the business or close the deal.

I think that much of technology due diligence is common sense. The good news is that, if it’s done right, you quickly get a feel for the goodness of the product.  A process that has served me well is to sit in on the preliminary conversations (which are often over the phone) with the management team. I may / may not ask some general questions during that meeting. I will then follow-up with a call to the CTO (or technology lead) to get into more detail. The goal of the technology call is to confirm my understanding of the technology stack and to set expectations for an on-site visit.

I am not a big fan of questionnaires.  I believe an on-site visit is critical to getting a good understanding of the technology. More recently I’ve been challenged to find a place to visit, as many of the principals work remotely from one another. I think it is important to meet the people face to face. My primary objective is to learn as much as I can about the technology. My secondary objective is to meet the development team and form an opinion about their respective competencies. By its nature diligence is the process of looking for problems. Rarely do you come back from looking at a business thinking that you underestimated how good it is. On the other hand, I can think of many occasions where I’ve come away blown away by the people: their technical acumen, tenacity, and single-minded determination to make something work.

Picking who goes on-site is a particularly important consideration. I think the minimum number is two – one subject matter expert on the business and one person who is proficient in the language used by the business being acquired. (You do not want a .Net person doing a Ruby on Rails evaluation.) If budget permits, an IT person is a very nice resource to have. Their perspective often complements that of the business and development people.



Comparing IE9, Firefox 5, and Chrome 12

July 31, 2011

Since the release of IE9 in March of 2011 Microsoft has been claiming that Internet Explorer is the best browser for business. They recently commissioned Forrester Research to help justify and quantify this statement. The actual study, “The Total Economic Impact of Windows Internet Explorer 9,” assessed the ROI six large organizations experienced upgrading from IE8 to IE9. Firefox and Chrome are never mentioned in the Forrester study. I accept the argument that IE is better optimized for homogeneous enterprise environments than Firefox and Chrome. I don’t think it’s fair or even accurate to imply that IE is the best browser.

I regularly use all three mainstream browsers and have a pretty good sense of their strengths and weaknesses. I did a bunch of real-world testing: trying various sites, configuring the layout of the browsers, and looking at different features. In the end, with the exception of things that matter only to a relatively small audience, I came away feeling that all three are more than adequate for day-to-day business usage.

Internet Explorer

IE9 has come a very long way. Its predecessors were buggy, full of security problems, and lacked standards compliance. IE9 is a fine, feature-rich browser. I mainly use IE to access internal sites based on SharePoint (to take advantage of the tight integration with the operating system) and to access email remotely via Outlook Web Access (using an ActiveX control). I don’t think IE is as good as either Chrome or Firefox (largely because of the lack of a rich extension / add-on marketplace) but the gap is rapidly closing.

IE9

Strengths                                | Weaknesses
ActiveX Controls                         | Lack of rich extensions – possibly because it’s harder (big thing)
OS-level integration (i.e., SharePoint)  | Lack of spell checker (small thing)


Firefox

I am a long-time Firefox user; it has been my main browser for as long as I can remember. Initially I started using Firefox because IE was so problematic. Five years ago Firefox had unique and innovative features like tabs and extensions, and it was much more secure than IE. What really hooked me on FF was the Adblock Plus extension. I cannot remember the last time I saw an ad in Firefox, never mind clicked on one. Over the past year I’ve become more conscious of how long it takes for Firefox to start up and load web pages, and of how much memory it consumes. Part of my speed and memory problems are no doubt related to the extensions I have running.

FF5

Strengths                    | Weaknesses
Rich extensions marketplace  | Memory Management

Chrome

Over the past year I’ve started to use Chrome more and Firefox less. What first grabs you about Chrome is the out-of-the-box speed. The application loads nearly instantly, pages render quickly, and in general the application feels zippy. Chrome has many of the nice features of Firefox but without the bloat. Most importantly for me, most (but not all) extensions I use regularly are now available for Chrome. The extensions I use are Adblock Plus, AutoCopy, Firebug, Instapaper, LastPass, StumbleUpon, UserAgentSwitcher, and Xmarks. One really nice under-the-hood feature that is unique to Chrome is that you don’t need to restart the browser after installing an extension. The other thing I’ve noticed is that Chrome seems to do a good job of memory management and distributing applications into separate processes. In my unscientific test of IE, Firefox, and Chrome, IE uses the least amount of memory. On the other hand, IE doesn’t have the add-ons/extensions that Firefox and Chrome do, so the test is not really fair.

Besides the inertia of switching from one browser to another, what has held me back from hopping on the Chrome bandwagon is the overhead of learning its development tools. I’ve become very attached to Firebug, which is Firefox-specific, and I fairly regularly use the UserAgentSwitcher add-on, which is not available for Chrome. Interestingly John Barton, the lead Firebug developer, has recently joined the Google Chrome team.

Chrome 12

Strengths  | Weaknesses
Fast       | Memory Management
           | Lack of user agent switcher add-on

Summary

For the moment I am “stuck” using all three browsers. Stuck really isn’t the right word for it. Until IE gets a richer extension model I probably won’t be using it unless I have to. Unless something unexpected happens I do expect that Chrome will become my everyday browser.


State of software development – 2011

July 13, 2011

The three big trends influencing computing in 2011 are mobile, social, and Software as a Service (SaaS). Start-ups and established companies alike are pushing themselves to deliver products and services that address these markets.  If you are a professional software developer you very likely use:

  • C or C++ if you work on an application where an interpreted language is not acceptable – performance-intensive applications (CPU or memory), security, hardware, etc.;
  • Java if you work for IBM, Sun, Salesforce.com, or some other big organization / platform that does not embrace Microsoft’s technology stack;
  • .Net (probably C#) if you work for a (typically mid-to-large) organization that embraces the Microsoft stack;
  • Java if you write Android applications; Objective-C if you write iPhone/iPad applications;
  • PHP, Perl, or Python if you write web applications on a LAMP platform that is more than 3 years old; or,
  • Ruby on Rails if you write web applications on a LAMP platform that is less than 3 years old.

If you code in something else you are in the minority.  This can be a double-edged sword.  The upside is that you are by definition a specialist and can command higher compensation.  The downside is that there may not be demand for your particular skill.

Platform as a Service

Platform as a Service (PaaS), a component of SaaS in which the hosting platform is moved to the cloud, is clearly maturing.  If an application takes off, organizations now have multiple viable options for quickly adding capacity.  Amazon’s EC2 and Microsoft’s Azure are two of the better-known platforms.  For example, Mashable documents how mobile phone app Instagr.am went from 0 to 3M users in less than six months with EC2.

The concept of PaaS is somewhat game-changing for start-ups.  Essentially this means that you only need to invest in the computing power that you need at the moment.  Where 10 years ago a major component of the cost of starting a web business would be the data center, today you only pay for what you use.  If I were to bootstrap a company like Instagr.am I would start with a small number of physical servers in a low-end colo facility.  (See askwebhosting.com to see how cheaply this can be accomplished.)  Only once growth justified it would I move to EC2 (LAMP) or Azure (.Net).

I do believe that Azure is potentially game-changing for Microsoft.  Prior to Azure you would never (and I do mean never) hear of start-ups using Microsoft’s technology.  At worst Microsoft’s development ecosystem is as good as anything available in the LAMP stack – many people find it superior.  The problem was that the cost of deploying Windows/SQL Server was prohibitive for all but the largest organizations.  It will be interesting to see if Azure changes the equation.

Mobile

Today mobile application development requires you to be either a Hatfield or a McCoy.  You are either in the iPhone camp or the Android camp. Mobile application development is dominated by applications written in Objective-C for Apple devices and Java for Android devices. I am aware of one cross-platform tool, Titanium.  While I have not used it personally, I do not hear good things about it.  My sense is that you can get away with using Titanium if your mobile app is very straightforward and / or doesn’t take advantage of any device-specific functionality.

Last summer I posted a link to TIOBE Software’s index of most popular programming languages. I went back and revisited the numbers for this year and see only incremental change from a developer’s point of view. TIOBE updates its rankings each month – this is the July report. It uses data from Google, Bing, Yahoo!, Wikipedia, YouTube, and Baidu to calculate its ratings. I found another site, langpop.com, which uses a variety of different mechanisms to calculate popularity. Broadly speaking the two charts correlate. I like the TIOBE formatting and that they update their results each month.


Source: TIOBE Programming Community Index for July 2011

In addition to the languages listed above, a professional software developer who works on the web also needs to know:

  • HTML, XML, JSON, JavaScript, and rudimentary CSS (1)
  • Basic SQL skills – MySQL, SQL Server, or PL/SQL
  • The jQuery JavaScript library
  • How to implement AJAX in their language of choice

(1) Developers rarely have advanced CSS / Photoshop knowledge. Typically commercial designers are used for this type of work.


Source: LangPop Language Popularity Normalized Comparison – Updated April 13, 2011

