Thursday, December 29, 2011

Eng: Great Engineering Comes From Great Infrastructure

A great engineering culture starts with a great engineering infrastructure. Good infrastructure lets people stack technologies on top of one another, compounding their leverage to enormously high levels. Without that infrastructure and culture in place, a team struggles with mundane issues such as integration, expertise segmentation, communication problems, redundant system administration efforts, and a siloed [undebuggable] code base.[1]

Many tech pundits say that Google is by far one of the worst product-oriented tech companies in Silicon Valley. Wave sucked donkeys. Buzz was marginally better. G+ is a premature baby. The first few versions of Android were terrible in terms of usability/security/developer APIs. The first few versions of Google search were mediocre at best. The first few versions of Gmail were a big turn-off for non-techies. And the early Google AdSense? Horrible contextual matching.

This is not to say that Google is a horrible company as a whole, but that it is a very misunderstood one. What Google lacks as a product company, it makes up for with a superb backend infrastructure and very effective engineering practices. Led by some of the world's most famous luminaries from both academia and industry, Google built some of the most amazing infrastructure around: the homogeneous data center/Borg (EC2), GFS (S3), MapReduce (Hadoop), SSTable/BigTable (HBase), Service-Oriented Architecture (SoA), Protocol Buffers (Thrift), Mondrian (code review), Sawzall, and the SRE team & Borgmon (monitoring/management). These foundations were and still are ahead of their time; even today Amazon and Hadoop are still trying to catch up. This foundation allows Google developers to build, integrate, and iterate a zillion different crappy products quickly. For example, while Facebook has had just one product for the last few years, Google used a minimal number of developers to create Wave (failure), Buzz (failure), and Google+ within a few quarters.

So while Google's end-user products are not exactly exemplary (relative to what Apple has cranked out), its backend infrastructure is highly prized and coveted. Here are some simple engineering tenets that companies can replicate to create a Googly infrastructure of their own:

1) Use homogeneous systems. To minimize versioning problems, developers all use the same OS version as the production machines. Imagine the old days back in the 90s: a heterogeneous environment with NT4.0 + HPUX + SunOS + SCO + BSD + Irix + DEC, which silos people into different areas of expertise and invariably creates integration headaches. Since a significant amount of developer time is already spent on integration and debugging, why create more problems with a highly heterogeneous environment?

At Google, there are only about 1.5 versions of the OS, "Goobuntu". Developers use the exact same version (and package dependencies) as the production systems. "Versionitis" is no more.
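
This sort of check can be enforced mechanically. Below is a minimal sketch in Python (not a Google tool) of a "versionitis" check: it diffs two package manifests, one dumped on a developer workstation and one on a production machine, and reports anything that differs. The manifest file names and the dpkg-based dump format are assumptions made for the sake of the example.

    # compare_manifests.py -- minimal versionitis check (a sketch, not a Google tool).
    # Assumes each environment dumps "name version" lines, e.g. via
    #   dpkg-query -W -f='${Package} ${Version}\n' > packages.txt
    # The manifest file names passed on the command line are hypothetical.
    import sys

    def load_manifest(path):
        """Parse 'name version' lines into a {name: version} dict."""
        packages = {}
        with open(path) as f:
            for line in f:
                name, _, version = line.strip().partition(" ")
                if name:
                    packages[name] = version
        return packages

    def diff(dev, prod):
        """Return packages whose versions differ between dev and prod."""
        mismatches = []
        for name in sorted(set(dev) | set(prod)):
            d, p = dev.get(name, "<missing>"), prod.get(name, "<missing>")
            if d != p:
                mismatches.append((name, d, p))
        return mismatches

    if __name__ == "__main__":
        dev = load_manifest(sys.argv[1])     # e.g. dev_packages.txt
        prod = load_manifest(sys.argv[2])    # e.g. prod_packages.txt
        for name, d, p in diff(dev, prod):
            print("%s: dev=%s prod=%s" % (name, d, p))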

2) Make the code base universally shared and accessible in a uniform version control system. This empowers any developer to look at and fix any other developer's code, which minimizes unnecessary communication (e.g. "Hello Joe, I can't see your code and it is doing something weird, can you look into it?"). A developer at Google can read the entire Google source tree under google3/... and check in fixes for anyone else. How is that for transparency?

3) Pick a few core [production] languages, and stick to them. With just a few core languages, developers from different groups can read each other's code and jump from one group to another with ease.

The worst scenario is when a company starts to use various exotic languages (Haskell, Erlang, OCaml), or [as in the case of Twitter] when there are simply too many languages for one product (LISP + Perl + Python + PHP + Ruby + Lua). You really don't want emails going back and forth that look like "Hey Joe, do you know Lua? I think there is a bug in the re-Tweet feature and the person who wrote this Lua code left."

At Google, the 4 main languages are C++ (backend), Java (backend and middle-tier), Python (middle-tier and frontend), and JavaScript (frontend). Almost every developer at Google [who has been around] is well versed in these 4. There are also many experimental, non-production languages/frameworks. For example, GWT has been out since 2007, and Go/Dart are just experimental languages that no one touches.

4) Scale horizontally by building Service-Oriented Architecture (SoA) systems. Instead of a monolithic, bloated code base, SoA allows the system to be spawned, replicated, and distributed. It also allows components/services to be tested individually. SoA requires communication, so unify the method of communication between the services (pick one: Protocol Buffers, Thrift) and the different core languages can then talk to each other with ease. With a horizontally scaled solution, you simply add more of the same machines to handle more load. Look at this slide for more info.
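
To make the SoA idea concrete, here is a minimal sketch in Python of one small service speaking one well-defined message over HTTP+JSON. The transport and schema here are stand-ins for a real RPC layer with Protocol Buffers or Thrift, and the service name and fields are hypothetical; run several identical copies behind a load balancer and you get the horizontal scaling described above.

    # soa_sketch.py -- a minimal sketch of the SoA idea, not Google's stack:
    # one small service, one agreed-upon message schema, spoken over HTTP+JSON
    # (stand-ins here for an RPC layer using Protocol Buffers or Thrift).
    # The "profile" service and its fields are hypothetical examples.
    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    # The "schema" both sides agree on -- in real SoA this would be a
    # Protocol Buffers / Thrift definition compiled for every core language.
    def make_profile(user_id):
        return {"user_id": user_id, "display_name": "user-%d" % user_id}

    class ProfileService(BaseHTTPRequestHandler):
        """Answers GET /profile?id=N with a serialized profile message."""
        def do_GET(self):
            user_id = int(self.path.rsplit("=", 1)[-1])
            body = json.dumps(make_profile(user_id)).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        # Run N identical copies behind a load balancer to scale horizontally.
        HTTPServer(("127.0.0.1", 8080), ProfileService).serve_forever()

A client written in any of the core languages only needs to know the message schema and the endpoint, e.g. curl 'http://127.0.0.1:8080/profile?id=42'.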

4b) If you're building to scale, then don't scale vertically. It is tempting to build a crappy traditional architecture and then just upgrade to better hardware to scale -- more RAM, a faster machine, more processors, SSDs, exotic SANs, Oracle, etc. -- DON'T DO IT! Vertical scaling is a one-time process with a hard bound. Your crappily written server may handle 2000 QPS today, and superior hardware may boost it to 3000 QPS. You've maxed out your hardware configuration to get that 3000 QPS, now what? You're stuck! Superior hardware is never going to cure problems caused by shitware, for the same reason that high-rises can never be built on a muddy foundation.
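
The arithmetic is worth spelling out. Using the QPS figures from the paragraph above (the per-machine number for the horizontal case is a hypothetical assumption, not a benchmark), a vertically scaled box hits a hard ceiling while a horizontal fleet keeps growing roughly linearly:

    # scaling_math.py -- back-of-envelope numbers only, using the QPS figures
    # from the paragraph above; the per-machine figure for the horizontal case
    # is an assumed round number, not a measurement.
    base_qps = 2000           # crappy server on today's hardware
    vertical_ceiling = 3000   # best hardware money can buy, then you're stuck

    per_machine_qps = 2000    # assumed capacity of one commodity machine
    for machines in (1, 2, 4, 8, 16):
        horizontal_qps = machines * per_machine_qps   # ignores coordination overhead
        print("%2d machines -> ~%d QPS (vertical tops out at %d)"
              % (machines, horizontal_qps, vertical_ceiling))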

4c) Do not use vendor-specific solutions. I've seen so many startups get locked in by .NET or Oracle or federated MySQL because they used some exotic feature that is not available anywhere else, so they end up stuck scaling vertically (see 4b). One example is a bunch of old-school programmers putting every piece of logic they can think of into SQL stored procedures. When it comes time to scale, they want to shard their databases (or even move to distributed processing), but they can't: they've invested too much in those vendor-specific procedures, and transitioning becomes very difficult if not impossible without rewriting the entire system. People who insist on piles of SQL stored procedures (logic in centralized computation) are usually inexperienced folks who have never worked with a homogeneous, distributed (horizontal) computation model.
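
A concrete way to stay out of the vendor's hands is to keep routing logic in application code, so the data layer can be split across many plain databases later. Below is a minimal Python sketch of that idea; the shard count, host names, and the user_id key are hypothetical, not a prescription.

    # sharding_sketch.py -- a minimal sketch of application-level sharding;
    # the shard count, placeholder host names, and event shape are hypothetical.
    import zlib

    NUM_SHARDS = 4
    SHARD_DSNS = ["db%d.internal" % i for i in range(NUM_SHARDS)]  # placeholder hosts

    def shard_for(user_id):
        """Route a key to a shard. A stable hash (crc32) is used so every
        process maps the same key to the same shard. Because routing lives in
        application code rather than in vendor-specific stored procedures,
        adding shards later means changing this one function, not rewriting
        database-resident logic."""
        return SHARD_DSNS[zlib.crc32(user_id.encode("utf-8")) % NUM_SHARDS]

    def save_event(user_id, event):
        dsn = shard_for(user_id)
        # In real code: open a connection to `dsn` and INSERT the row there.
        print("would write %r for user %s to %s" % (event, user_id, dsn))

    if __name__ == "__main__":
        for uid in ("alice", "bob", "carol"):
            save_event(uid, {"type": "login"})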

5) Separate system administrators from developers. Developers dislike mundane system administration tasks, and system administrators in general are not the best developers. Let the system administrators worry about system allocation, load, deployment, monitoring, and costs. Let the few elite architects worry about an architecture that provides reliability and fault tolerance, and let the developers worry about the code base, scalability (via SoA), extensibility, and usability.

6) Measure, measure, measure. Every system call, every RPC, everything should be instrumented so you can monitor, improve, and scale. Google does this with varz (similar to stats pages), and the SRE teams (elite sysadmins) run tight monitoring and alerting systems.
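
Here is a minimal Python sketch in the spirit of an exported-stats page like varz (not Google's implementation): server code bumps in-process counters, and an HTTP endpoint exposes them for a monitoring system to scrape and alert on. The counter names and port are hypothetical.

    # varz_sketch.py -- a minimal sketch of an exported-stats page in the spirit
    # of varz, not Google's implementation; counter names and the port are
    # hypothetical. Servers bump counters in code and a monitoring system
    # scrapes the page periodically to graph and alert on the values.
    import json
    import threading
    from http.server import BaseHTTPRequestHandler, HTTPServer

    _lock = threading.Lock()
    _counters = {"rpc_requests": 0, "rpc_errors": 0}   # hypothetical metric names

    def incr(name, delta=1):
        """Call this from request-handling code every time something happens."""
        with _lock:
            _counters[name] = _counters.get(name, 0) + delta

    class StatsPage(BaseHTTPRequestHandler):
        """Serves the current counters as JSON for scraping by monitoring."""
        def do_GET(self):
            with _lock:
                body = json.dumps(_counters).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        incr("rpc_requests")   # a real server would call incr() on every RPC
        HTTPServer(("127.0.0.1", 9090), StatsPage).serve_forever()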

7) Foster a culture of computer scientists, not hackers. Hackers can crank out demos fast (case in point: Yahoo), but they also create too many job opportunities for people who enjoy carrying pagers.

8) Hire, hire, hire. Hire based on ability instead of "he is my friend." This topic alone deserves a few pages of discussion.
