Thursday, December 29, 2011

Eng: Great Engineering Comes From Great Infrastructure

A great engineering culture starts with a great engineering infrastructure. A great infrastructure lets people stack technologies on top of each other, compounding their leverage to enormously high levels. On the other hand, without infrastructure and culture in place, a team struggles with mundane issues such as integration, expertise segmentation, communication problems, redundant system administration efforts, and a silo’ed [undebuggable] code base.[1]

Many tech pundits say that Google is by far one of the worst product-oriented tech companies in Silicon Valley. Wave sucked donkeys. Buzz was marginally better. G+ is a premature baby. The first few versions of Android were terrible in terms of usability/security/developer APIs. The first few versions of Google search were mediocre at best. The first few versions of Gmail were a big turn-off for non-techies. And the early Google AdSense? Horrible contextual match.

This is not to say that Google is a horrible company as a whole, but that it is a very misunderstood company. What Google lacks as a product company, it makes up for with a superb backend infrastructure and very effective engineering practices. Led by some of the world's most famous luminaries from both academia and industry, Google built some of the most amazing infrastructure, including the homogeneous data center/Borg (EC2), GFS (S3), MapReduce (Hadoop), SSTable/BigTable (HBase), Service Oriented Architecture (SoA), Protocol Buffer (Thrift), Mondrian (code review), Sawzall, and the SRE team & Borgmon (monitoring/management). These infrastructure foundations were and still are ahead of their time, and even today Amazon and Hadoop are still trying to catch up. These amazing foundations allow Google developers to build, integrate, and iterate on a zillion different crappy products quickly. For example, while Facebook has had just one product over the last few years, Google used minimal developers to create Wave (failure), Buzz (failure), and Google+ within a short period of a few quarters.

So while Google end-user products are not exactly exemplary (relative to what Apple has cranked out), its backend infrastructure is something that is highly prized and coveted. Here are some of the simple engineering tenets that companies can replicate to also create a Googly infrastructure:

1) Use homogeneous systems. To minimize versioning problems, developers all use the same version of OS as the production machines. Imagine the old days back in the 90s: heterogeneous environment with NT4.0 + HPUX + SunOS + SCO + BSD + Irix + DEC, which creates a silo of people with different expertise, and invariably creates integration headaches. Because a significant amount of developer time is spent on integration and debugging, why create more problems by having a highly heterogeneous environment?

At Google, there are only 1.5 versions of the OS, “Goobuntu”. The developer uses the exact same version (and package dependencies) as the production system. “Versionitis” is no more.

2) Make the code base universally shared and accessible on a uniform version control system. This empowers any developer to look at and fix any other developer’s code, which minimizes unnecessary communication (e.g. “Hello Joe, I can't see your code and it is doing something weird, can you look into it?”). A developer at Google can read the entire Google source code in google3/... as well as check in fixes for anyone else. How is that for transparency?

3) Pick a few core [production] languages, and stick to them. By having just a few core languages, developers from different groups can read each other’s code, and be able to jump from one group to another with ease.

The worst scenario is when a company starts to use various exotic languages (Haskell, Erlang, OCaml), or [as in the case of Twitter] when there are simply too many languages for one product (LISP + Perl + Python + PHP + Ruby + Lua). You really don’t want emails going back and forth that look like “Hey Joe, do you know Lua? I think there is a bug in the re-Tweet feature and the person who wrote this Lua code left.”

At Google, the 4 main languages are: C++ (backend), Java (backend and middle-tier), Python (middle-tier and frontend), and JavaScript (frontend). Almost every developer at Google [who has been around] is well versed in these 4. There are many experimental but non-production languages/frameworks. For example, GWT has been out since 2007, and Go/Dart are just experimental languages that no one touches.

4) Scale horizontally by building on a Service-Oriented Architecture (SoA). Instead of creating a monolithic, bloated code base, SoA allows the system to be spawned, replicated, and distributed. It also allows components/services to be tested individually. SoA requires communication between services, so unify the method of communication (e.g. pick one: Protocol Buffer or Thrift) so that services written in the different core languages can talk to each other with ease. With a horizontally scaled solution, you simply add more of the same machines to handle more load. Look at this slide for more info.
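
A minimal sketch of the idea, assuming a stateless service: because every replica is identical, adding capacity is just adding another replica to a pool. The names make_echo_replica and RoundRobinPool are made up for illustration, and plain in-process function calls stand in for real RPCs.

```python
import itertools
from typing import Callable, Dict, List

Request = Dict[str, str]
Response = Dict[str, str]
Replica = Callable[[Request], Response]


def make_echo_replica(name: str) -> Replica:
    """Build a hypothetical stateless replica; any replica can serve any request."""
    def handle(request: Request) -> Response:
        return {"served_by": name, "echo": request["message"]}
    return handle


class RoundRobinPool:
    """Spread requests evenly over identical replicas (illustrative only)."""

    def __init__(self, replicas: List[Replica]) -> None:
        self._replicas = list(replicas)
        self._cycle = itertools.cycle(self._replicas)

    def add_replica(self, replica: Replica) -> None:
        # Scaling out: append one more identical replica and restart the rotation.
        self._replicas.append(replica)
        self._cycle = itertools.cycle(self._replicas)

    def call(self, request: Request) -> Response:
        # Hand the request to the next replica in the rotation.
        return next(self._cycle)(request)


pool = RoundRobinPool([make_echo_replica("a"), make_echo_replica("b")])
print([pool.call({"message": "hi"})["served_by"] for _ in range(4)])
```

A real deployment would put a load balancer and a wire protocol between caller and replica; the point here is only that nothing in the service itself has to change when a replica is added.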

4b) If you're building to scale, then don't scale vertically. It is tempting to build a crappy traditional architecture and then just upgrade to better hardware to scale, like buying more RAM, a faster machine, more processors, SSDs, exotic SANs, Oracle, etc -- DON'T DO IT! Vertical scaling is a one-time process and has an upper bound. Your crappily written server may handle 2000 QPS today, and adding superior hardware may boost it to 3000 QPS. You've maxed out your hardware configuration to get that 3000 QPS, now what? You're stuck! Superior hardware is never going to cure problems caused by shitware, for the same reason that highrises can never be built on a muddy foundation.
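
To put rough numbers on it: the 2000 and 3000 QPS figures come from the example above, while the 10,000 QPS target is an assumed number picked purely for illustration. The sketch also assumes load spreads evenly across identical machines, which a real system only approximates.

```python
import math


def machines_needed(target_qps: int, qps_per_machine: int) -> int:
    """Identical machines required to serve target_qps, rounding up."""
    return math.ceil(target_qps / qps_per_machine)


def vertical_can_serve(target_qps: int, max_qps_after_upgrade: int) -> bool:
    """A single upgraded box either covers the load or it does not."""
    return target_qps <= max_qps_after_upgrade


QPS_PER_MACHINE = 2000        # from the example above
VERTICAL_CEILING_QPS = 3000   # from the example above
TARGET_QPS = 10_000           # assumed target, for illustration only

print(vertical_can_serve(TARGET_QPS, VERTICAL_CEILING_QPS))  # False
print(machines_needed(TARGET_QPS, QPS_PER_MACHINE))          # 5
```

The vertical path has no answer once the target passes the ceiling, while the horizontal path always has one: buy more of the same machine.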

4c) Do not use vendor-specific solutions. I've seen so many startups get locked in by .NET or Oracle or Federated MySQL and such because they used some exotic feature that is not available anywhere else, so they end up stuck and scaling vertically (see 4b). One example is a bunch of old-school programmers putting every piece of logic they can think of into SQL stored procedures. When it comes time to scale, they want to shard the databases (or even move to distributed processing), but they can't, because they've invested too much time writing those vendor-specific procedures, and transitioning is very difficult, if not impossible, without rewriting the entire software. People who insist on using a bunch of SQL stored procedures (logic in centralized computation) are usually inexperienced folks who have never worked with horizontal scaling -- a homogeneous and distributed (horizontal) computation model.
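
As a rough sketch of the alternative, the business logic can live in application code and the storage layer can stay dumb, which is what makes sharding possible later. Everything here is hypothetical: shard_for, ShardedStore, and add_credit are made-up names, and in-memory dicts stand in for separate database servers.

```python
import hashlib
from typing import Dict, List, Optional


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedStore:
    """Key-value store split across shards; each dict stands in for a database."""

    def __init__(self, num_shards: int) -> None:
        self._shards: List[Dict[str, int]] = [{} for _ in range(num_shards)]

    def get(self, key: str) -> Optional[int]:
        return self._shards[shard_for(key, len(self._shards))].get(key)

    def put(self, key: str, value: int) -> None:
        self._shards[shard_for(key, len(self._shards))][key] = value


def add_credit(store: ShardedStore, user_id: str, amount: int) -> int:
    """Business logic in application code rather than in a stored procedure."""
    balance = (store.get(user_id) or 0) + amount
    store.put(user_id, balance)
    return balance


store = ShardedStore(num_shards=4)
add_credit(store, "alice", 10)
print(add_credit(store, "alice", 5))
```

Simple modulo sharding like this reshuffles keys when the shard count changes; it is used here only to keep the sketch short.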

5) Separate system administrators from developers. First, developers dislike mundane system administration tasks, and system administrators in general are not the best developers. Let the system administrators worry about system allocation, load, deployment, monitoring, costs. Let the few elite architects worry about the architecture that provides reliability and fault tolerance, and let the developers worry about code base, scalability (using SoA), extensibility and usability.

6) Measure measure measure. Every system call, every RPC, everything should be logged so that you can monitor, improve, and scale. Google does this by implementing varz (similar to stats), and the SRE teams (elite sysadmins) run tight monitoring and alerting systems.
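
A minimal sketch of what "measure everything" can look like inside one process. This is not Google's varz; COUNTERS, TOTAL_SECONDS, measured, and lookup_user are hypothetical names, and a real setup would export these numbers to a monitoring system instead of keeping them in module-level dicts.

```python
import time
from collections import defaultdict
from functools import wraps
from typing import Any, Callable, DefaultDict

# Hypothetical in-process registry of call counts and cumulative latency.
COUNTERS: DefaultDict[str, int] = defaultdict(int)
TOTAL_SECONDS: DefaultDict[str, float] = defaultdict(float)


def measured(name: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Decorator that counts calls, counts errors, and sums elapsed time."""
    def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
        @wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception:
                COUNTERS[name + ".errors"] += 1
                raise
            finally:
                # Runs on both success and failure.
                COUNTERS[name + ".calls"] += 1
                TOTAL_SECONDS[name] += time.perf_counter() - start
        return wrapper
    return decorator


@measured("lookup_user")
def lookup_user(user_id: int) -> dict:
    """Stand-in for an RPC handler."""
    if user_id < 0:
        raise ValueError("negative id")
    return {"id": user_id}


lookup_user(1)
lookup_user(2)
print(dict(COUNTERS))
```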

7) Foster a culture of computer scientists, not hackers. Hackers can crank out demos fast (case in point Yahoo) but they also create too many job opportunities for people who enjoy carrying pagers.

8) Hire hire hire. Hire based on abilities instead of “he is my friend.” This topic alone deserves a few pages of discussion.

Tuesday, December 20, 2011

Thrift IDL (protocol)

After the horrible experience with Avro, I considered using Protocol Buffer and Thrift for the company. Protocol Buffer's strongest point is that it is stable (not much has changed in the past few years). It is used in every single possible service in Google, it has gone through a very stringent code-review process, it has been written by the world's most seasoned and anal engineers, and thus it has been well battle tested. However, I consciously passed over the opportunity to suggest Protocol Buffer for the company, partly because I'm considered a biased party, and suggesting it would simply reinforce the idea that "Kevin is a Googler so he's obviously biased. He thinks everything coming out of Google is amazing." To be fair, I really think that Google cranks out shit end-user products most of the time (Wave, Buzz, G+, Location, Google Base, Android, etc etc...). Sometimes Google happens to make good end-user products only because Google throws a billion darts in the dark and occasionally one of the darts hits the bullseye. That's all.

I tested Thrift, and it is acceptable. In terms of features, it is very similar to Protocol Buffer. The first thing I tested was message backward and forward compatibility. There was no problem in either case. Whereas Avro returns an error saying that the message format is different, the Thrift server gracefully (and correctly) disregards fields it does not know about and accepts older messages that lack the newer fields.

In Java Thrift, you can set your Thrift objects using getters and setters, which is great because if a field changes (name or type), the Java compiler will give you an error immediately. In Python Thrift, you can also set your Thrift objects using the constructor, and the runtime system will catch name errors. In contrast, Avro does not do any of this, so your program will just run along happily even though you're setting my_integer="Not an integer", and somewhere down the line your program crashes and you're scratching your head.
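
The difference is easy to see in plain Python. The Person class below is a hand-written stand-in, not actual Thrift compiler output, and its field names are assumptions for illustration. A constructor with explicit keyword parameters rejects a misspelled field at the call site, while a free-form dict accepts anything.

```python
from typing import Optional


class Person:
    """Hand-written stand-in for a generated message class (hypothetical)."""

    def __init__(self, firstname: Optional[str] = None,
                 lastname: Optional[str] = None) -> None:
        self.firstname = firstname
        self.lastname = lastname


# A correctly spelled field is accepted.
ok = Person(firstname="Ada", lastname="Lovelace")
print(ok.lastname)

# A misspelled field name fails immediately, at the line that made the mistake.
try:
    Person(lastnam="Lovelace")
except TypeError as err:
    print("caught:", err)

# A free-form dict silently accepts the typo; the bug surfaces much later.
loose = {"lastnam": "Lovelace"}
print(loose.get("lastname"))
```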

One last thing I love about Thrift: there is an asynchronous transport!!! This is exactly what powers AdSense, and allows people to easily prototype distributed computation architectures.
http://blog.rapleaf.com/dev/2010/06/23/fully-async-thrift-client-in-java/

There are a few Thrift "bugs" that should be fixed. For example, suppose you set the following as message definition:
2: string lastname = "last_default",
7: string lastname = "HO",
...

The above should signal a compiler error (e.g. "Duplicate field name not allowed."). There are many other mistakes that should signal an error but do not. I guess either they are too busy, too lazy, or just expect the downstream compiler (either C or Java) to catch the error.
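
The check being asked for is small. Here is a sketch of it in Python, operating on already-parsed fields given as (id, type, name) tuples; it is not part of Thrift, and the function name and error strings are made up for illustration.

```python
from collections import Counter
from typing import List, Tuple

Field = Tuple[int, str, str]  # (field id, type name, field name)


def find_duplicates(fields: List[Field]) -> List[str]:
    """Return one error string per duplicated field name or field id."""
    name_counts = Counter(name for _, _, name in fields)
    id_counts = Counter(field_id for field_id, _, _ in fields)
    errors = [f"duplicate field name: {name}"
              for name, count in name_counts.items() if count > 1]
    errors += [f"duplicate field id: {field_id}"
               for field_id, count in id_counts.items() if count > 1]
    return errors


# The two fields from the definition above, with their defaults omitted.
fields = [(2, "string", "lastname"), (7, "string", "lastname")]
print(find_duplicates(fields))
```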

One other minor difference between Protocol Buffer and Thrift: in Thrift, there is no deprecation keyword. In Protocol Buffer, a deprecated field carries over into the generated Java code, so the Java compiler will tell you the field is deprecated and programmers know to update. It's not a big deal for most, but it may be a big deal for companies that keep updating contracts between two services.

In the end, my take on Avro vs. Thrift is like this. Avro is like the Microsoft Zune. The Zune has all the bells and whistles-- FM radio, recorder, more buttons, higher display resolution, external HD, blah blah blah. The iPod, on the other hand, just does one thing. On paper, the Zune is superior to the iPod. On paper, Avro is superior to Thrift. But in the end, Avro just doesn't work well (no forward/backward compatibility, buggy buggy buggy, and the developers don't even respond to my bug report). What looks good on paper isn't necessarily good in practice. You can't trust everything you read. You have to play with it.

Monday, December 19, 2011

Avro, what a complete waste of time

I'm responsible for evaluating the different IDLs (Protocol Buffer, Avro, and Thrift) as a unified form of communication between different services in the company. A key feature of today's IDLs is backward and forward message compatibility. For example, if the client adds one more field to a message, the server should be able to take the new message and process it (while ignoring the new field). The opposite should also work: if the server expects additional fields that the client does not send, the server should just assume that those fields are empty.
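
The behavior described above can be sketched with numbered fields. The wire format here is a plain dict from field tag to value, which is an assumption for illustration and not the actual encoding used by Protocol Buffer, Thrift, or Avro; decode, SCHEMA_V1, and SCHEMA_V2 are made-up names.

```python
from typing import Any, Dict, Tuple

# Schema: field tag -> (field name, default value).
Schema = Dict[int, Tuple[str, Any]]

SCHEMA_V1: Schema = {1: ("to", ""), 2: ("body", "")}
SCHEMA_V2: Schema = {1: ("to", ""), 2: ("body", ""), 3: ("priority", 0)}


def decode(wire: Dict[int, Any], schema: Schema) -> Dict[str, Any]:
    """Decode a tagged message, tolerating schema drift in both directions."""
    # Start from defaults so fields missing on the wire still have a value.
    message = {name: default for name, default in schema.values()}
    for tag, value in wire.items():
        if tag in schema:
            name, _default = schema[tag]
            message[name] = value
        # Unknown tags come from a newer sender and are skipped on purpose.
    return message


# Newer client, older server: the extra field (tag 3) is ignored.
print(decode({1: "joe", 2: "hello", 3: 5}, SCHEMA_V1))

# Older client, newer server: the missing field falls back to its default.
print(decode({1: "joe", 2: "hello"}, SCHEMA_V2))
```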

I started with Avro because I had high hopes for it. It had great features that neither PB nor Thrift had (no need for field deprecation, no need for an IDL compiler), and it's built into Hadoop's MapReduce. My experience with Avro began with downloading the package (version 1.6.1, the latest). I tried out some example code (phunt-avro-rpc-quickstart-avro-release-1.2.0-9-gce46e91.zip), which included two small Python scripts, start_server.py and send_message.py (client). Both of them used the same IDL (mail.avpr). I got the client to send a message to the server with ease. Then I tried the most important aspect of IDLs-- forward and backward message compatibility. I expected the server to gracefully accept old and new messages, but instead got something completely unexpected:

PATH=~/code/avro-example/avro-1.6.1/src ./send_message.py AA BB MSG
Traceback (most recent call last):
  File "./send_message.py", line 56, in <module>
    print("Result: " + requestor.request("myecho", {"mymessage": message}))
  File "/home/vm42/code/avro-example/avro-1.6.1/src/avro/ipc.py", line 145, in request
    return self.issue_request(call_request, message_name, request_datum)
  File "/home/vm42/code/avro-example/avro-1.6.1/src/avro/ipc.py", line 260, in issue_request
    call_response_exists = self.read_handshake_response(buffer_decoder)
  File "/home/vm42/code/avro-example/avro-1.6.1/src/avro/ipc.py", line 204, in read_handshake_response
    handshake_response.get('serverProtocol'))
  File "/home/vm42/code/avro-example/avro-1.6.1/src/avro/ipc.py", line 120, in set_remote_protocol
    REMOTE_PROTOCOLS[self.transceiver.remote_name] = self.remote_protocol
  File "/home/vm42/code/avro-example/avro-1.6.1/src/avro/ipc.py", line 475, in <lambda>
    remote_name = property(lambda self: self.sock.getsockname())
AttributeError: 'NoneType' object has no attribute 'getsockname'


A well designed IDL should at least show a warning message indicating that the field is unknown (or new, etc). Nope! Avro returns with a weird socket-related error. Upon looking at the Avro library (avro-1.6.1/src/avro/ipc.py), line ~474 yields:

# read-only properties
sock = property(lambda self: self.conn.sock)
remote_name = property(lambda self: self.sock.getsockname())
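
The failure mode in the traceback can be reproduced with stand-in classes. Nothing below is Avro code: FakeConnection, Transceiver, and PatchedTransceiver are hypothetical, and the assumption is simply that the connection's sock attribute is None when remote_name is read. The patched variant mirrors the kind of workaround described in the next paragraph, storing the name as a plain attribute instead of computing it from the socket on every access.

```python
from typing import Optional, Tuple

Address = Tuple[str, int]


class FakeConnection:
    """Stand-in for a connection whose socket is not (or no longer) open."""

    def __init__(self) -> None:
        self.sock: Optional[object] = None


class Transceiver:
    """Same shape as the two read-only properties quoted above."""

    def __init__(self, conn: FakeConnection) -> None:
        self.conn = conn

    sock = property(lambda self: self.conn.sock)
    remote_name = property(lambda self: self.sock.getsockname())


class PatchedTransceiver:
    """Workaround sketch: remember the name up front as a plain attribute."""

    def __init__(self, conn: FakeConnection, remote_name: Address) -> None:
        self.conn = conn
        self.remote_name = remote_name


try:
    Transceiver(FakeConnection()).remote_name
except AttributeError as err:
    print("caught:", err)

print(PatchedTransceiver(FakeConnection(), ("127.0.0.1", 9090)).remote_name)
```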


So, I'm no Python expert, but the error makes it clear that self.sock is None at that point, so I manually set remote_name in the constructor __init__ (meaning it's not a read-only property anymore, but who cares) and voilà, it works! Who the heck checked in this code anyway? My next attempt was the reverse: the server expects a newer message and the client sends an older one. Here's my very useful Avro message:

/code/avro-example/avro-1.6.1/src ./send_message.py AA BB MSG
Traceback (most recent call last):
  File "./send_message.py", line 56, in <module>
    print("Result: " + requestor.request("myecho", {"mymessage": message}))
  File "/home/vm42/code/avro-example/avro-1.6.1/src/avro/ipc.py", line 145, in request
    return self.issue_request(call_request, message_name, request_datum)
  File "/home/vm42/code/avro-example/avro-1.6.1/src/avro/ipc.py", line 264, in issue_request
    return self.request(message_name, request_datum)
  File "/home/vm42/code/avro-example/avro-1.6.1/src/avro/ipc.py", line 145, in request
    return self.issue_request(call_request, message_name, request_datum)
  File "/home/vm42/code/avro-example/avro-1.6.1/src/avro/ipc.py", line 262, in issue_request
    return self.read_call_response(message_name, buffer_decoder)
  File "/home/vm42/code/avro-example/avro-1.6.1/src/avro/ipc.py", line 222, in read_call_response
    response_metadata = META_READER.read(decoder)
  File "/home/vm42/code/avro-example/avro-1.6.1/src/avro/io.py", line 445, in read
    return self.read_data(self.writers_schema, self.readers_schema, decoder)
  File "/home/vm42/code/avro-example/avro-1.6.1/src/avro/io.py", line 486, in read_data
    return self.read_map(writers_schema, readers_schema, decoder)
  File "/home/vm42/code/avro-example/avro-1.6.1/src/avro/io.py", line 615, in read_map
    block_count = decoder.read_long()
  File "/home/vm42/code/avro-example/avro-1.6.1/src/avro/io.py", line 184, in read_long
    b = ord(self.read(1))
TypeError: ord() expected a character, but string of length 0 found


Alright, I've had enough. What I just attempted was a very, very common test case, and if there were any decent number of unit tests, this problem would never have existed. My guess now is that there are no unit tests whatsoever, and that there isn't much of a user base, because I can't find this complaint anywhere via Googling (and I can't find much Avro documentation in the first place)! I sent an email to the Avro developer team this weekend and I've yet to receive a response. I am utterly unimpressed so far.

Conclusion:
1) Save yourself some time by using something else that is battle tested. Bleeding edge (in this case) is a waste of time.
2) The only thing that matters is your hands-on experience. Marketing and bias make Avro look amazing (dynamic features, flexibility, maintenance free, language support, ...), but it doesn't matter if it does not work TODAY.

Friday, February 25, 2011

Recommending a fellow Googler to a startup

Once in a while I get this question: "Hey Kevin, thank you for recommending a fellow Googler. What do you think about X's technical skills?"

My long response is this:
The Google technical interview process is one of the most challenging interviews one can get. There's the resume screening (only one out of 1000 resumes passes through), then email screening, then phone screening, a possible secondary phone screening, on-site screening, and finally the hiring committee (from Mountain View) reviews long and very detailed written feedback from 7-10 interviewers. If someone makes it in as an engineer, you can be sure that person is way above average among the millions and millions of people who send in their resumes to Google each year. Think about this: if you're an engineer at a less-than-stellar company that doesn't value core engineering (Fox, Yahoo, Citysearch, AT&T Interactive, MySpace) and you think you can do better, you have already at some point in your career applied to Google. It's only human to want to do better. Look. Chances are, an engineer you already know (who never went to Google) already applied, and chances are he/she failed. I realize what I'm saying is really harsh, but this is harsh reality. For this reason, even a really bad Googler is still above the tech industry average (especially compared to what I see in Los Angeles). Secondly, if a person survives the Google culture for a few years, you can be sure that person is at least average amongst Googlers, because the below-average Googlers get kicked out very, very fast; 2 to 3 quarters and you're out. I personally know a few who didn't survive a year-- usually they were super smart but unmotivated and/or had other issues.

That said, technically, almost everyone I know at Google can kick the industry-average programmer's ass. Googlers tend to come from top-tier schools or top-tier companies. They made it into the system. They are hardcore, trained under the stringent Google Code Readability process. People strive to get badges on their Moma page by being Googly-- being technically good. I am not exaggerating or bragging; I'm just saying this after observing different people from different backgrounds, and relative to Google, the average tech standard is a pathetically low bar.

Anyway, if I vouch for someone from Google, then that person is almost certainly more than technically adequate. But then again, so are 90% of the other Googlers. There are of course distinctions amongst this group of Special Forces. Some people are slow but precise (they like to work on mission-critical code). Some people are fast but sloppy (they like to work on social networking sites). Some like Java. Some like Python. Some like JavaScript. Some like C++. Some people are smart, and some people are simply mind-blowingly brilliant. The Google gene pool isn't all homogeneous.

In the end, you should not have to worry about a Googler's technical skills. You may, however, have to worry about many other things, like being able to give them a challenging enough task, making them feel like they're making a big impact on the world, and providing enough incentives and rewards to keep them; believe me, everyone is getting poached here and there these days with ridiculous packages. Keep in mind, there's a reason why Google managers tend to come and go very fast-- an ex-manager once commented to me that it's really, really hard to motivate and manage someone who is clearly much smarter than you are. I wasn't a manager at Google, but I can understand why. Some of the smartest people I've met in the world are people I met at Google, and a few are a total pain in the ass to work with.

Tuesday, February 15, 2011

Product Managers

My startup is hiring a product manager because none of the engineers want to spend day after day doing product research, testing/trying out competitors' products, meeting solution providers and clients, blogging, product evangelism, PR, writing white papers, doing patent research, and other things that engineers feel are too un-intellectual to do. While I have a nice Google system for interviewing elite engineers, I don't have a good system for hiring a good PM. In the end, I think there is so much variability in PMs that I don't think you can actually design a process for it. I do, however, think the most important thing is that the PM's style should match the overall vision of the company. For example, if the company is creating a business product, then the PM should be of the "spec hunting" type (more on this later). If the company is creating an end-user product, then the PM should be of the Steve Jobs type. In a hypothetical world where there are only two extreme types of PMs, they are described as follows:

1) The most common PMs are the "spec hunters." They will write down the specs that their competitors have, and try to compete on specs. They will list out a long matrix of features and check them off one by one. Case in point: a few years ago, the Microsoft Zune was on paper much more feature rich than anything out there. It had a voice recorder, FM radio, more storage, OLED, 720p, 33 hr play time, and so on and so forth. The iPod did not have a voice recorder or an FM radio, and it had smaller storage, an older/lower-resolution display, and 30 hr play time. It didn't do much. Almost all Microsoft products are spec'ed products, designed by a committee with a long feature list that each committee member checks off. On paper, the Zune is clearly superior to the iPod. However, a "spec hunting" PM is not a good match for consumer products; we all know the story of Zune vs. iPod today. Zune is dead.

Other examples of "spec hunting" PMs: America Online and Y! [homepage] are committee designed -- the idea with those products is that the more stuff you slap on a page, the happier the committee. It is no surprise that AOL looks like a mess. Ditto with Yahoo. Dell is yet another example. The Dell laptop on paper is superior to the Mac-- brighter screen, more HD, faster processor, bigger capacity battery, 1/2 the price. The list goes on and on and on.


2) The less common PMs are the minimalists. Steve Jobs is one such PM who designs with minimal features. Other examples include Porsche, Ferrari, and other European sports cars; all these cars have minimal features, simply run fast and look nice, and none have fancy OnStar or XM radio or GPS display or voice-activated commands built in. Dropbox is on this spectrum too... it just does one thing-- let people drop files into the file system. Google search is another example of a minimalist page; you don't do anything on the main page except search. http://www.google.com/corporate/tenthings.html Read point #2. Here is the classic example: the Apple iPod only does one thing (whereas Zune and competitors do 100 other things), yet it still slaughtered the competition. Zune is dead and the spirit of the iPod lives on in the form of the iPhone and iPad.

The minimalist PMs understand the power of looks, feel, navigation, intuitiveness, cohesiveness, and consistency. Whereas the "spec hunters" think they have taste, the minimalist PMs actually have taste.


In the end, I see that most companies are "run by committees" ruled by "spec hunting" PMs. I think the reason is clear-- minimalists who have a sense of taste are as rare as diamonds. Of course, there's nothing wrong with that. Despite Microsoft's extremely tasteless designs (case in point: Microsoft Bob and its descendant, The Useless Paperclip Helper), it is still one of the most successful companies in the world.