Thursday, December 29, 2011

Eng: Great Engineering Comes From Great Infrastructure

A great engineering culture starts with a great engineering infrastructure. A great infrastructure lets people stack technologies on top of one another, compounding their leverage to enormously high levels. Without that infrastructure and culture in place, a team struggles with mundane issues such as integration, expertise segmentation, communication problems, redundant system administration efforts, and a siloed [undebuggable] code base.[1]

Many tech pundits say that Google is by far one of the worst product-oriented tech companies in Silicon Valley. Wave sucked donkeys. Buzz was marginally better. G+ is a premature baby. The first few versions of Android were terrible in terms of usability/security/developer APIs. The first few versions of Google search were mediocre at best. The first few versions of Gmail were a big turn-off for non-techies. And the early Google AdSense? Horrible contextual matching.

This is not to say that Google is a horrible company as a whole, but that it is a very misunderstood company. What Google lacks as a product company, it makes up for with a superb backend infrastructure and very effective engineering practices. Led by some of the world's most famous luminaries from both academia and industry, Google built some of the most amazing infrastructure, including the homogeneous data center/Borg (EC2), GFS (S3), MapReduce (Hadoop), SSTable/BigTable (HBase), Service Oriented Architecture (SoA), Protocol Buffer (Thrift), Mondrian (code review), Sawzall, and the SRE team & Borgmon (monitoring/management). These infrastructure foundations were and still are ahead of their time, and even today Amazon and Hadoop are still trying to catch up. These foundations allow Google developers to build, integrate, and iterate on a zillion different crappy products quickly. For example, while Facebook has had just one product in the last few years, Google used minimal developers to create Wave (failure), Buzz (failure), and Google+ within a short period of a few quarters.

So while Google's end-user products are not exactly exemplary (relative to what Apple has cranked out), its backend infrastructure is highly prized and coveted. Here are some of the simple engineering tenets that companies can replicate to create a Googly infrastructure of their own:

1) Use homogeneous systems. To minimize versioning problems, developers all use the same version of the OS as the production machines. Imagine the old days back in the 90s: a heterogeneous environment with NT4.0 + HPUX + SunOS + SCO + BSD + Irix + DEC, which created silos of people with different expertise and invariably created integration headaches. Because a significant amount of developer time is spent on integration and debugging, why create more problems by having a highly heterogeneous environment?

At Google, there are only 1.5 versions of the OS: “Goobuntu”. The developer uses the exact same version (and package dependencies) as the production system. “Versionitis” is no more.

2) Make the code base universally shared and accessible on a uniform version control system. This empowers any developer to look at and fix any other developer’s code, which minimizes unnecessary communication (e.g. “Hello Joe, I can't see your code and it is doing something weird, can you look into it?”). A developer at Google can read the entire Google source code in google3/... as well as check in fixes for anyone else. How is that for transparency?

3) Pick a few core [production] languages, and stick to them. By having just a few core languages, developers from different groups can read each other’s code, and be able to jump from one group to another with ease.

The worst scenario is when a company starts to use various exotic languages (Haskell, Erlang, OCaml), or [as in the case of Twitter] when there are simply too many languages for one product (LISP + Perl + Python + PHP + Ruby + Lua). You really don’t want emails going back and forth that look like “Hey Joe, do you know Lua? I think there is a bug in the re-Tweet feature and the person who wrote this Lua code left.”

At Google, the 4 main languages are: C++ (backend), Java (backend and middle-tier), Python (middle-tier and frontend), and JavaScript (frontend). Almost every developer at Google [who has been around] is well versed in these 4. There are many experimental but non-production languages/frameworks. For example, GWT has been out since 2007, and Go/Dart are just experimental languages that no one touches.

4) Scale horizontally by creating Service-Oriented Architecture (SoA) systems. Instead of creating a monolithic, bloated code base, SoA allows the system to be spawned, replicated, and distributed. It also allows components/services to be tested individually. SoA requires communication between services, so unify the method of communication (e.g. pick one: Protocol Buffer, Thrift) and services written in the different core languages can talk to each other with ease. With a horizontally scaled solution, you simply need to add more of the same machines to handle more load. Look at this slide for more info.

4b) If you're building to scale, then don't scale vertically. It is tempting to build a crappy traditional architecture and then just upgrade to better hardware to scale, like buying more RAM, a faster machine, more processors, SSDs, exotic SANs, Oracle, etc -- DON'T DO IT! Vertical scaling is a one-time process and has an upper bound. Your crappily written server may handle 2000 QPS today, and adding superior hardware may boost it to 3000 QPS. You've maxed out your hardware configuration to get that 3000 QPS, now what? You're stuck! Superior hardware is never going to cure problems caused by shitware, for the same reason that highrises can never be built on a muddy foundation.

4c) Do not use vendor-specific solutions. I've seen so many startups get locked in by .NET or Oracle or Federated MySQL and such because they used some exotic feature that is not available anywhere else, so they end up stuck and scaling vertically (see 4b). One example is a bunch of old-school programmers putting every piece of logic they can think of in SQL stored procedures. When it comes time to scale, they want to shard the databases (or even move to distributed processing), but they can't, because they've invested too much time writing those vendor-specific procedures, and the transition is very difficult if not impossible without rewriting the entire software. People who insist on using a bunch of SQL stored procedures (logic in centralized computation) are usually inexperienced folks who have never worked with horizontal scaling, that is, a homogeneous and distributed computation model.
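
To make the contrast in 4) through 4c) concrete, here is a minimal Python sketch of horizontal scaling: identical replicas behind a round-robin dispatcher, where capacity grows by adding machines rather than by upgrading one. The names `replicas_needed` and `RoundRobinPool` are made up for illustration, and the 2000 QPS per machine figure is just the example number from 4b.

```python
import itertools
import math


def replicas_needed(target_qps, per_machine_qps):
    """Number of identical machines needed to serve target_qps."""
    return math.ceil(target_qps / per_machine_qps)


class RoundRobinPool:
    """Spread requests evenly over interchangeable replicas (hypothetical)."""

    def __init__(self, replicas):
        self._replicas = list(replicas)
        self._cycle = itertools.cycle(self._replicas)

    def pick(self):
        # Every replica runs the same service, so any of them can answer.
        return next(self._cycle)


# With 2000 QPS boxes, serving 10000 QPS is a matter of running five of them;
# doubling the load means doubling the count, not redesigning the server.
pool = RoundRobinPool(f"server-{i}" for i in range(replicas_needed(10000, 2000)))
picks = [pool.pick() for _ in range(6)]
```

The sketch only works because the replicas are interchangeable, which is the same argument made for homogeneous systems in 1).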

5) Separate system administrators from developers. First, developers dislike mundane system administration tasks, and system administrators in general are not the best developers. Let the system administrators worry about system allocation, load, deployment, monitoring, costs. Let the few elite architects worry about the architecture that provides reliability and fault tolerance, and let the developers worry about code base, scalability (using SoA), extensibility and usability.

6) Measure measure measure. Every system call, every RPC, everything should be logged so you can monitor, improve, and scale. Google does this by implementing varz (similar to stats), and the SRE teams (elite sysadmins) run tight monitoring and alerting systems.
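
Google's varz is internal and not shown here, so the following is only a sketch of the general idea in Python: a process-wide table of counters that gets bumped on every call. `STATS`, `measured`, and `lookup` are hypothetical names.

```python
import collections
import functools
import time

# Process-wide counters, in the spirit of an exported stats page (hypothetical).
STATS = collections.Counter()


def measured(func):
    """Count calls, errors, and cumulative latency for one handler."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        name = func.__name__
        STATS[name + ".calls"] += 1
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        except Exception:
            STATS[name + ".errors"] += 1
            raise
        finally:
            # Store whole microseconds so every entry stays an integer.
            STATS[name + ".micros"] += int((time.perf_counter() - start) * 1e6)

    return wrapper


@measured
def lookup(key):
    if key is None:
        raise ValueError("key required")
    return key.upper()


lookup("a")
lookup("b")
try:
    lookup(None)
except ValueError:
    pass
```

In a real service the counters would be exported for a monitoring system to scrape; here they simply live in memory.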

7) Foster a culture of computer scientists, not hackers. Hackers can crank out demos fast (case in point Yahoo) but they also create too many job opportunities for people who enjoy carrying pagers.

8) Hire hire hire. Hire based on abilities instead of “he is my friend.” This topic alone deserves a few pages of discussion.

Tuesday, December 20, 2011

Thrift IDL (protocol)

After the horrible experience with Avro, I considered using Protocol Buffer and Thrift for the company. Protocol Buffer's strongest point is that it is stable (not much has changed in the past few years). It is used in every single possible service in Google, it has gone through a very stringent code-review process, it has been written by the world's most seasoned and anal engineers, and thus has been well battle tested. However, I consciously passed over the opportunity to suggest Protocol Buffer for the company, partly because I'm considered a biased party, and suggesting it would simply reinforce the idea that "Kevin is a Googler so he's obviously biased. He thinks everything coming out of Google is amazing." To be fair, I really think that Google cranks out shit end-user products most of the time (Wave, Buzz, G+, Location, Google Base, Android, etc etc...). Sometimes Google happens to make good end-user products only because Google throws a billion darts in the dark and occasionally one of the darts hits the bullseye. That's all.

I tested Thrift, and it is acceptable. In terms of features, it is very similar to Protocol Buffer. The first thing I tested was backward and forward message compatibility. There was no problem in either case. Whereas Avro returns an error saying that the message format is different, the Thrift server gracefully (and correctly) disregards new message types or ignores old messages.

In Java Thrift, you can set your Thrift objects using getters and setters, which is great because if the message definition changes (name or type), the Java compiler will give you an error immediately. In Python Thrift, you can also set your Thrift objects using the constructor, and the runtime system will catch name errors. In contrast, Avro does not do any of this, so your program will just run along happily even though you're setting my_integer="Not an integer", and somewhere down the line your program crashes and you're scratching your head.
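
The difference can be sketched in plain Python. The `Person` class below is a hand-written stand-in that assumes generated message classes take their fields as keyword arguments; it is not real Thrift compiler output, and the field names are made up.

```python
class Person:
    """Hand-written stand-in for a generated message class (hypothetical)."""

    def __init__(self, firstname=None, lastname=None, age=None):
        self.firstname = firstname
        self.lastname = lastname
        self.age = age


def build_with_class(**fields):
    """Return (message, error); a misspelled field name fails right here."""
    try:
        return Person(**fields), None
    except TypeError as exc:
        return None, str(exc)


# Typo in the field name: the constructor rejects it immediately.
msg, err = build_with_class(firstname="Joe", lastnme="Smith")

# A bare dict (the style the Avro Python example uses for requests) accepts
# anything; the mistake only surfaces later, far from where it was made.
loose = {"firstname": "Joe", "lastnme": "Smith"}
```

Note that this only catches wrong names. Catching a wrong type, like the my_integer case, needs either a compiler (as in Java) or an explicit validation step.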

One last thing I love about Thrift: there is an asynchronous transport!!! This is exactly what powers AdSense, and allows people to easily prototype distributed computation architectures.
http://blog.rapleaf.com/dev/2010/06/23/fully-async-thrift-client-in-java/

There are a few Thrift "bugs" that should be fixed. For example, suppose you write the following in a message definition:
2: string lastname = "last_default",
7: string lastname = "HO",
...

The above should signal a compiler error (e.g. "Same type name not allowed."), because two fields share the same name. There are many other mistakes that should signal an error but do not. I guess the Thrift developers are either too busy, too lazy, or just expect the downstream compiler (either C or Java) to catch the error.
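
The kind of check being asked for is cheap to write. As a sketch only (this is not how the Thrift compiler is structured), suppose the fields of one struct have already been parsed into `(field_id, type, name)` tuples; `find_duplicate_names` is a hypothetical helper that reports names used more than once.

```python
import collections


def find_duplicate_names(fields):
    """Map each repeated field name to the list of field ids that use it."""
    ids_by_name = collections.defaultdict(list)
    for field_id, _type, name in fields:
        ids_by_name[name].append(field_id)
    return {name: ids for name, ids in ids_by_name.items() if len(ids) > 1}


# The two fields from the definition above, already parsed (assumed shape).
fields = [
    (2, "string", "lastname"),
    (7, "string", "lastname"),
]
duplicates = find_duplicate_names(fields)
for name, ids in duplicates.items():
    print(f"error: field name '{name}' is used by ids {ids}")
```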

One other minor difference between Protocol Buffer and Thrift: in Thrift, there is no deprecation keyword. In Protocol Buffer, a field marked as deprecated carries over into the generated Java, and the compiler will tell you the field is deprecated so that programmers can update. It's not a big deal for most, but it may be a big deal for companies that keep updating contracts between two services.

In the end, my take on Avro vs. Thrift is this. Avro is like the Microsoft Zune. The Zune has all the bells and whistles: AM radio, recorder, more buttons, higher display resolution, external HD, blah blah blah. The iPod, on the other hand, just does one thing. On paper, the Zune is superior to the iPod. On paper, Avro is superior to Thrift. But in the end, Avro just doesn't work well (no forward/backward compatibility, buggy buggy buggy, and the developers don't even respond to my bug report). What looks good on paper isn't necessarily good in practice. You can't trust everything you read. You have to play with it.

Monday, December 19, 2011

Avro, what a complete waste of time

I'm responsible for evaluating the different IDLs (Protocol Buffer, Avro, and Thrift) as a unified form of communication between different services in the company. A key feature of today's IDLs is backward and forward message compatibility. For example, if the client adds one more field to a message, the server should be able to take the new message and process it (while ignoring the new field). The opposite should also hold: if the server expects additional fields in the message that the client does not send, the server should just assume that those fields are empty.
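
As a toy model of that requirement (not any real wire format), treat a message as a dict keyed by numeric field id and give each side its own schema mapping ids to a name and a default. The schemas and field names below are made up for illustration.

```python
def decode(message, schema):
    """Decode a {field_id: value} message against a reader's schema.

    Ids the reader does not know are skipped (forward compatibility); ids the
    sender did not set fall back to the reader's default (backward
    compatibility).
    """
    decoded = {}
    for field_id, (name, default) in schema.items():
        decoded[name] = message.get(field_id, default)
    return decoded


# Hypothetical schemas: v2 adds field 3 on top of v1.
SCHEMA_V1 = {1: ("to", ""), 2: ("body", "")}
SCHEMA_V2 = {1: ("to", ""), 2: ("body", ""), 3: ("priority", 0)}

new_message = {1: "joe", 2: "hi", 3: 5}   # written by a v2 client
old_message = {1: "joe", 2: "hi"}         # written by a v1 client

seen_by_old_server = decode(new_message, SCHEMA_V1)
seen_by_new_server = decode(old_message, SCHEMA_V2)
```

Neither direction raises an error: the old server simply never sees `priority`, and the new server sees it with its default value.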

I started with Avro because I had high hopes for it. It had great features that neither PB nor Thrift had (no need for field deprecation, no need to get an IDL compiler), and it's built in to Hadoop's MapReduce. My experience with Avro began with downloading the package (version 1.6.1, the latest). I tried out some example code (phunt-avro-rpc-quickstart-avro-release-1.2.0-9-gce46e91.zip), which included two small Python scripts, start_server.py and send_message.py (client). Both of them used the same IDL (mail.avpr). I got the client to send a message to the server with ease. Then I tried the most important aspect of an IDL: forward and backward message compatibility. I expected the server to gracefully accept old and new messages, but instead got something completely unexpected:

PATH=~/code/avro-example/avro-1.6.1/src ./send_message.py AA BB MSG
Traceback (most recent call last):
  File "./send_message.py", line 56, in <module>
    print("Result: " + requestor.request("myecho", {"mymessage": message}))
  File "/home/vm42/code/avro-example/avro-1.6.1/src/avro/ipc.py", line 145, in request
    return self.issue_request(call_request, message_name, request_datum)
  File "/home/vm42/code/avro-example/avro-1.6.1/src/avro/ipc.py", line 260, in issue_request
    call_response_exists = self.read_handshake_response(buffer_decoder)
  File "/home/vm42/code/avro-example/avro-1.6.1/src/avro/ipc.py", line 204, in read_handshake_response
    handshake_response.get('serverProtocol'))
  File "/home/vm42/code/avro-example/avro-1.6.1/src/avro/ipc.py", line 120, in set_remote_protocol
    REMOTE_PROTOCOLS[self.transceiver.remote_name] = self.remote_protocol
  File "/home/vm42/code/avro-example/avro-1.6.1/src/avro/ipc.py", line 475, in <lambda>
    remote_name = property(lambda self: self.sock.getsockname())
AttributeError: 'NoneType' object has no attribute 'getsockname'


A well-designed IDL should at least show a warning message indicating that the field is unknown (or new, etc). Nope! Avro returns a weird socket-related error. Looking at the Avro library (avro-1.6.1/src/avro/ipc.py), line ~474 reads:

# read-only properties
sock = property(lambda self: self.conn.sock)
remote_name = property(lambda self: self.sock.getsockname())
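
The failure can be reproduced outside Avro. The classes below are made-up stand-ins that keep only the two property lines quoted above; they assume, as the traceback suggests, that the connection's `sock` is `None` at the moment `remote_name` is read. The guarded variant is one possible defensive rewrite, not the fix the Avro project shipped.

```python
class FakeConnection:
    """Stand-in for a connection whose socket is not open (hypothetical)."""
    sock = None


class Transceiver:
    def __init__(self, conn):
        self.conn = conn

    # Same shape as the two property lines quoted above.
    sock = property(lambda self: self.conn.sock)
    remote_name = property(lambda self: self.sock.getsockname())


class GuardedTransceiver(Transceiver):
    @property
    def remote_name(self):
        # Fail soft: report None instead of crashing on a missing socket.
        sock = self.sock
        return sock.getsockname() if sock is not None else None


try:
    Transceiver(FakeConnection()).remote_name
    error = None
except AttributeError as exc:
    error = str(exc)

guarded = GuardedTransceiver(FakeConnection()).remote_name
```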


So, I'm no Python expert, but it's clear that self.sock does not exist, so I manually set remote_name in the constructor __init__ (meaning it's not a read-only property anymore, but who cares) and voilà, it works! Who the heck checked in this code anyway? My next attempt was the reverse: the server expects a newer message and the client sends an older message. Here's my very useful Avro message:

/code/avro-example/avro-1.6.1/src ./send_message.py AA BB MSG
Traceback (most recent call last):
  File "./send_message.py", line 56, in <module>
    print("Result: " + requestor.request("myecho", {"mymessage": message}))
  File "/home/vm42/code/avro-example/avro-1.6.1/src/avro/ipc.py", line 145, in request
    return self.issue_request(call_request, message_name, request_datum)
  File "/home/vm42/code/avro-example/avro-1.6.1/src/avro/ipc.py", line 264, in issue_request
    return self.request(message_name, request_datum)
  File "/home/vm42/code/avro-example/avro-1.6.1/src/avro/ipc.py", line 145, in request
    return self.issue_request(call_request, message_name, request_datum)
  File "/home/vm42/code/avro-example/avro-1.6.1/src/avro/ipc.py", line 262, in issue_request
    return self.read_call_response(message_name, buffer_decoder)
  File "/home/vm42/code/avro-example/avro-1.6.1/src/avro/ipc.py", line 222, in read_call_response
    response_metadata = META_READER.read(decoder)
  File "/home/vm42/code/avro-example/avro-1.6.1/src/avro/io.py", line 445, in read
    return self.read_data(self.writers_schema, self.readers_schema, decoder)
  File "/home/vm42/code/avro-example/avro-1.6.1/src/avro/io.py", line 486, in read_data
    return self.read_map(writers_schema, readers_schema, decoder)
  File "/home/vm42/code/avro-example/avro-1.6.1/src/avro/io.py", line 615, in read_map
    block_count = decoder.read_long()
  File "/home/vm42/code/avro-example/avro-1.6.1/src/avro/io.py", line 184, in read_long
    b = ord(self.read(1))
TypeError: ord() expected a character, but string of length 0 found


Alright, I've had enough. What I just attempted was a very, very common test case, and if there were any decent amount of unit tests, this problem would never have existed. My guess now is that there are no unit tests whatsoever, and that there isn't much of a user base, because I can't find this complaint anywhere via Googling (and I can't find much Avro documentation in the first place)! I sent an email to the Avro developer team this weekend and I've yet to receive a response. I am utterly unimpressed so far.

Conclusion:
1) Save yourself some time by using something else that is battle tested. Bleeding edge (in this case) is a waste of time.
2) The only thing that matters is your hands-on experience. Marketing and bias make Avro look amazing (dynamic features, flexibility, maintenance free, language support, ...), but none of that matters if it does not work TODAY.