Wednesday, January 1, 2014

I Love Google Apps and Apple Devices

I recently switched from iPad/iPhone to Nexus 7/Nexus 5. I have been using and trying out Androids and iDevices for the past few years. I tolerate the Nexus 7/5, but I still find the iPad/iPhone 5s more enjoyable to use. I've owned, played with, and tested Androids from the very first G1 (beta) up to 4.4.2, and although each version is an improvement over the previous one, I am still surprised at how many rough edges Android still has. Below are my biggest gripes, and the reasons why I would prefer the iDevices if price were not an issue:


1) Battery life

My Nexus 5 has tons of features, but the biggest "unfeature" is that it drains in a day even with light usage (browser & G Maps), so I end up having to charge every day. You can have 100 features, but they're all irrelevant at the end of the day when the battery is dead, you're lost/stranded in the Tenderloin District, and you can't make a call or look at a map.
I know some Android fanboi is going to knee-jerk respond with "You should try the Qi charger, it is great!" or "Try this other, slower Android! It has longer battery life!" Yeah... thanks... major inconvenience.

2) Placement of Android buttons

Placement of Android's back/menu buttons is, for lack of a better word, piss poor. Sorry to be so harsh, but I have normal-sized thumbs and I keep pressing back/menu by accident, especially when I press space. This happens over and over and over again.
See the problem here? The spacebar is too close to the 3 menu buttons below it. Maybe I'm a klutz with fat fingers who can't help pressing the wrong buttons constantly, but I'm sure I'm not the only klutz with fat fingers in this world who does this. This is possibly THE most annoying problem and the poorest product decision I have encountered to date.

By now an Android fanboi is going to knee-jerk scream, "Why don't you just tilt it?" or "Just root the phone and change the placement." Yeah...


3) Nexus 5/7 auto screen brightness burns the retina

iOS's screen is comfortable to look at in total darkness. You can also set auto brightness and fine-tune it. Another plus with iOS is its "night-mode," which inverts black and white for all apps and can be switched on/off very easily.

Sadly, on my brand new Nexus 5 and 7, brightness is either fully automatic (with none of the fine-tuning that iOS has) or fully manual (retro alert!). Furthermore, at the lowest brightness, the Nexus 5/7 is simply too bright to read in total darkness. To turn the brightness down further on Android, you can either use manual mode (a pain) or buy apps that compensate for the missing feature. I bought Lux for $3.80 to imitate iOS's auto-brightness, and downloaded Screen Dimmer for a quasi-night-mode -- not as good as iOS's night-mode, but better than nothing.

4) Built-in Chinese/En keyboard input

I know I'm going to get dissed for this because most Android users reading this post in English don't have this issue. Let me just be blunt here: the built-in Android Chinese/English keyboard input is inferior to iOS 7's. These are issues that I (and many other Chinese speakers, and I know there are many of us out there) run into every single day; they're repetitive, very annoying, and make everyone unproductive. On the iDevices, my Chinese keyboard and English keyboard are integrated -- same dictionary, and I can switch from one input to the other with one touch. But on Android, they are separate "apps," each with very different configurations and dictionaries, so switching between them is a pain, requiring a few extra steps.

So now you're probably going to say, "Why don't you just keep the Chinese input app, which has its very own English input options separate from the default Android keyboard? After all, there are hundreds of settings and options to choose from!!!" Yes, below are the problems I run into by doing exactly that:

4a) Default Chinese/En input: the dictionary size is pathetically puny. For a company with tons of search query data, I'm shocked that a Google device can't even recognize some of the most popular terms, like "apps". Are you kidding me? (This is not an issue with the Android English-only keyboard.):

WTF, "Alps"? No!!!

4b) Default Chinese/En input doesn't learn from humans: let's say I manually delete "Alps" and type "Apps" again -- it still gives me the same incorrect replacement! On my iDevice, going back and correcting a word again (as most laymen would do) means "this person really intends that word, keep that word." But Android doesn't get it. This problem comes up OVER AND OVER AGAIN and is extremely irritating.

Also on Android, the default English-only input requires extra steps to save what it learns. For example, when I type "da" (I use this in text messages), it corrects to "DA". On Android, one needs to add the word manually, with extra overhead -- unnecessary and annoying:


4c) Chinese/En context correction: let's go back to the "Apps" to "Alps" example. Say I now manually delete the "lps" in "Alps" so that I can correct it to "Apps". It then auto-corrects "lps" to "post", producing "Apost":

Come on, are you kidding?


4d) Default Android Pinyin really sucks, on several levels. One is that the dictionary/auto-recognition is horrible. When I type ZZJNT (an extremely common n-gram), I expect 中正紀念堂 (zhongzhengjiniantang) to be predicted, as on all my iOS and MacOS devices:

But on Android, this is what I get. BTW, I don't think this has anything to do with cross-strait relations, as I can find plenty of other common, non-political words that the default Android dictionary fails to include. And please don't tell me that there are hundreds and hundreds of dictionaries on the web I can choose from, download, and install manually on Android:

4e) iOS 7 handwriting recognition is superior, hands down. I can easily do n-gram handwriting recognition (excuse my handwriting). Steve Jobs would be extremely proud even though he didn't write Chinese, because this is the dream input method for the hundreds of millions of Chinese people who don't use Pinyin or Zhuyin:

But on Android, I'm forced to handwrite one character at a time. This is a very retro, 1990s method:

I've already written 新年, so why is Android not guessing 快樂 from context? It's got to be the most common n-gram in Chinese. Come on.

Furthermore, I'm not done writing 快 and it incorrectly auto-fills 小, because Android forces you to finish each character within milliseconds. So young people and old people who write slowly (even on the slowest setting) will keep inputting incorrect characters. Bad usability:
 

4f) By now an Android fanboi is itching to knee-jerk at me: "There are hundreds of other Chinese inputs to choose from, why don't you try them out?" Look, I already spent hours and hours on 3 other top Chinese input apps on the Google Play Store, and they all suffer from problems (albeit different ones) that would take a much, much longer post than this one to cover. In short, I don't have a lot of patience for this anymore, and I'm done QAing beta programs. More choices doesn't necessarily mean any of them is better than the iOS 7 input that just works.

There are over a billion Chinese people out there; surely Google can make the out-of-the-box input experience more tolerable for them?

5) Durability (or lack thereof)

My previous iPhones have taken a lot of abuse. My toddler tosses anything and everything he can get his hands on. My bare iPhones have taken plenty of beatings on hard floors and survived. On Jan 1, 2014, my one-month-old Nexus 5 (made by LG), in a protective case, took one little 3-foot fall onto hardwood -- it looks perfectly fine, but the display stopped responding to touch completely! Boo!

6) Support (or lack thereof)

Due to my broken Android, I went through Google tech support (not LG), doing a hard factory reset and a bunch of other things, and still had the problem. The worst part of it all? 2 hours wasted on the phone, and the Google customer support rep was completely clueless, delivering an experience on par with Fry's Electronics. "No, I want you to read the IMEI digits one by one. Don't tell me 'five hundred ninety-one'; just say FIVE NINE ONE so I can understand you better." I tried very hard not to explode, as she was pretty condescending.

After painful hours, I finally got an RMA and will be getting a replacement Nexus 5, which will arrive in a week. I am a VERY UNHAPPY CAMPER as of this minute. I will not have a phone for a whole week.

If this were an Apple device, I'd go to the store and an Apple Genius would have the issue resolved in less than a day, or would at least be a little bit nicer to a customer.


7) Android Fanbois

After writing this blog post, I was surprised by the amount of hate mail I got from Android fanbois (fanboi = internet meme for "fanatical boy"). Yes: I am too uneducated and too stupid to use an Android, I should have tried X Y Z apps to fix KitKat 4.4.2's deficiencies, I should have rooted my phone, anything made by Apple is evil and therefore so am I, Steve Jobs was an asshole and I support his kind by vouching for Apple stuff, yadda yadda yadda.


Summary:

Google is an amazing software company because its software gets the job done. I use Google Apps 90% of the time: Google Maps, YouTube, GMail, Google Authenticator, Google Calendar, Google Drive, Google Voice, Google Hangout, Google+, Google Contacts, Google Translate, and obviously, Google Search. 

Unfortunately, I find Google-affiliated Android hardware piss poor, and the customer support on par with Fry's Electronics. The top-of-the-line hardware (Nexus 5/7 by LG/Asus) sucks, and the Android UX is just as bad even after all these years of improvements. The low-end, cheap Androids are awfully slow (e.g. Google Maps takes 2-3 seconds to update when I scroll), and the expensive Androids still suffer from battery-hog problems and other UX issues that obviously can't be fixed with better hardware.

In an ideal world where money is not an issue, I would switch back to all [unlocked] iDevices and install Google Apps on them. But I'm cheap, so I'll just have to put up with Android.

Wednesday, April 10, 2013

Google, Please Mine Bitcoins

The price of Bitcoins is skyrocketing. Demand is going up and the supply is limited. In time, the new supply will only dwindle (about 10 million out of 21 million coins have been mined so far), and as long as there are unstable currencies around the world, the demand for an alternative currency such as Bitcoin will likely remain (or maybe even go up, who knows?!?). As I try to grasp the future social and economic impact of this new and strange currency, I started to wonder what the head of Google (who can tap into some of the world's most amazing and abundant servers) thinks about it.

For the greater good of its Shareholders, Employees, and All Earth Bound Slaves of Fiat Currency, Larry should rally the company to mine Bitcoins. Here are some top reasons why:
  1. Google has a gazillion machines on Borg that are on standby, idling, and still eating up electricity. Putting Google's servers to work mining Bitcoins would generate wealth that could be distributed back to the company and its Shareholders. GOOG to $1000 baby!
  2. Disgruntled employees who are forced to work on G+ (hey, I know there are a bunch of you out there) could transition to mining Bitcoins. I'm sure they'd be relieved. In addition, I think many employees would be proud of #4 (see below). Similarly, unmotivated workers who spend lots of time on the misc-mtv@ troll mailing list and memegen would finally feel like they're making a change in the world, and be productive again. Look, if doing something is good for employees and good for Shareholders, why not?
  3. Google is in a superior position to other companies when it comes to running massive cloud/bandwidth/custom backplane/custom OS/GPU/ASIC/etc. systems at scale. Obviously I haven't run the back-of-the-envelope calculations, but if any company can make serious money by earning pennies at Google scale, it's Google.
  4. The security of Bitcoin depends on the majority of clients/miners being honest. If Google starts running miners on its servers, it will help keep Bitcoin even more honest and secure. Therefore, Google mining = doing good deeds for everyone, and Do Good >> Do No Evil, right?
  5. Similar to the above: if some institution is going to own a huge chunk of all Bitcoins, I would rather that institution not be the authoritarians (the same people who control fiat currency on a whim) or a bunch of wild libertarian/anarchist hackers. I'd rather that institution be Google. Some people say that Google has become evil, but would you rather see Goldman Sachs, Walmart, Monsanto, or the Chevron/Exxon/Enron-alikes involved? Um, no thanks. In Google I trust [a little bit more]. Dominate me, Google. Please!
  6. Yeah yeah, I know this is an ungoogly thing to say (how dare I criticize my friends/former peers). But more likely than not, many existing Googlers are already running Bitcoin miners on their Google-issued machines, which is not much different from how those machines have been used for other noble deeds like SETI@Home/gene-folding/.... Create a new division in GoogleX, or one might as well make it official and make it a company-wide OKR. Just make sure they deposit back into the Shareholders' Bitcoin wallet pool, m'kay?
Larry, please boldly go where no man has gone before. The needs of the many outweigh the needs of the few. I AM NOT JOKING. Thanks for reading; I will now go troll on Reddit. And /.

Wednesday, March 28, 2012

Coding Standard In The Company

Dear Engineers,

Our engineering organization is going to grow. We will hire many more people, and our code base will explode. In a large organization, coding is not just a form of human-to-computer communication. It is also a form of human-to-human communication. In this sense, coding is like writing an essay. An essay consists of paragraphs, sentences, words, etc. In addition, writers try to follow certain conventions such as organization/structure, spelling, syntax, and punctuation, and they do so for the sake of making communication more effective. There is a simple automated tool to check for the Python equivalent of punctuation/syntax/spelling (PEP8). There is also a Python equivalent of grammar checking (Pylint). Unfortunately, for high-level concepts such as organization/structure/patterns, there are no tools out there (and it would be quite a feat to write a compiler that performs analysis beyond a context-sensitive grammar). For these high-level concepts, the state of the art is basically the code review.

Over the past few months I've been reviewing many people's code. People tell me that my guidelines seem arbitrary and "made up." Let me make it clear to everybody that I didn't make up these guidelines. The guidelines below are very, VERY common, sound engineering practices. I will also admit that I too am guilty of breaking these guidelines from time to time, and I hope people will point that out so I can continually improve. In the end, the purpose of a coding guideline is to raise overall software quality so that the code is 1) easier for other people to read (it is a lot more painful to read and fix other people's code than to write your own, right?), 2) easier to debug, 3) easier to maintain, and 4) easier to extend.

Below are some of the most common coding issues. I understand that almost everyone here has coded for X/Y/Z years and knows how to code, but there are always little things that increase the quality of the code. In short, when you submit code for review, check for the items below, because your code review comments are actually quite predictable given these common issues:


1) "string typing" -- this is one of the most common habits programmers commit. String typing is the use of strings in parameters. Let's say you have a function that takes in "active", "disabled", "deleted" strings. As a programmer, you've done this for over 10 years and it has worked well so why should anyone care? Well, when some new hire comes in and makes a typo in the code (upper/lower case typo, or mistyped "activated" instead of "activate", etc), the code will be broken! In addition, refactoring (rename string constants in your code) in a 1 million line code project is extremely error prone and NOT fun -- what if "disable" and "disabled" strings are everywhere in the code, and many of them aren't even relevant to you?

Better: use constants (in a class). For example, a class Status may contain the constants ACTIVE, DISABLED, and DELETED. These are autocompleted by IDEs, so a refactor (rename, checking, etc.) is done in less than 1 second. Lastly, static analysis tools such as Pylint (or Pychecker, ...) can properly infer whether the names are valid BEFORE the code runs. Automated tools are your friends; if you try to be clever or lazy (because typing a string is so much easier, right?), then you are on your own.
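
A minimal sketch of the idea (class and field names are illustrative, not from our code base):

class Status(object):
    # Canonical status values: rename once in the IDE, fixed everywhere.
    ACTIVE = 'active'
    DISABLED = 'disabled'
    DELETED = 'deleted'

class Account(object):
    def __init__(self):
        self.status = Status.ACTIVE

def deactivate(account):
    # A typo such as Status.DISABELD is flagged by Pylint/Pychecker before
    # the program ever runs; the raw string 'disabeld' would slip through.
    account.status = Status.DISABLED

account = Account()
deactivate(account)
print(account.status)  # disabled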


2) "string hashing" -- this is similar to string typing. This is the use of dynamic language's hash to pass data structure from one function to another. In another word, you are relying on a particular *convention* to hold the program together, so if anyone else forgets and/or breaks your convention, you are f***'ed at runtime! For example, you may have data = {'id': 123, 'name': 'Joe Blow', ...}. When you refactor or make a typo, well... you won't know you've made a mistake until the program runs [in production]!

Better: use classes to hold objects. This way, automated tools can tell you whether there's a mistake or not BEFORE the program even runs.
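
A hedged sketch of the difference (field names are illustrative):

class User(object):
    # Explicit fields instead of a dict held together by convention.
    def __init__(self, user_id, name):
        self.user_id = user_id
        self.name = name

def greet(user):
    # A typo like user.nmae is flagged by static analysis before the program
    # runs; data['nmae'] on a plain dict only blows up at runtime [in production].
    return 'Hello, ' + user.name

print(greet(User(123, 'Joe Blow')))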


3) Reliance on a certain naming convention. This is similar to the above two. For example, let's say your code reads from a text file and the file contains the text entries "delete" and "create". You try to be clever and make two local functions, deleteName(...) and createName(...), and perform a dynamic call via locals()[my_str + 'Name'](name) or getattr(self, my_str + 'Name')(name). Well, your code is making certain assumptions, so if someone else changes the text file, you get a weird runtime error, your intern spends 6.5 hours staring at this weird issue, and time is wasted.

Better: don't get creative, and never assume your input is always clean, because someone else will change your input -- if not now, then definitely later.
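
One hedged way to keep the mapping explicit (handler names are illustrative):

def create_name(name):
    print('creating ' + name)

def delete_name(name):
    print('deleting ' + name)

# Explicit dispatch table: no string-to-function-name magic, and unexpected
# input from the text file fails loudly at the boundary instead of deep inside.
HANDLERS = {'create': create_name, 'delete': delete_name}

def dispatch(action, name):
    handler = HANDLERS.get(action)
    if handler is None:
        raise ValueError('unsupported action: %r' % action)
    handler(name)

dispatch('create', 'Joe')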


4) Importing from *, or importing from relative paths. This is problematic both from a readability standpoint and from a runtime path-resolution standpoint. Frequently, automated tools (such as nosetest) get confused by relative paths and output erroneous messages.

Better: first, for readability, import the full path explicitly (instead of *) so the reader knows exactly which component you're using. This is super easy with IDEs. Second, if you must import a bunch of components, import one level up. For example, "import xyz.ttypes" and use "ttypes.TypeA|B|C|D|E"... This has the added advantage that if you have a namespace collision (e.g. ttypes.Rad and models.Rad), you can disambiguate by writing ttypes.Rad and models.Rad.
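
A small illustration using standard-library modules (the xyz.ttypes example above reads the same way):

# Bad: from os.path import *  -- the reader can't tell where names come from.
# Better: import one level up and qualify the names.
from os import path

print(path.join('/tmp', 'example.txt'))

# Qualified names also disambiguate collisions between modules:
import json
import pickle

json.dumps({'a': 1})    # clearly json's dumps
pickle.dumps({'a': 1})  # clearly pickle's dumps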


5) Lacking simple unit tests. Let's say you are calling a library function bear.call(rabbit, berry), which takes two parameters, rabbit and berry. When bear.call's parameters change, Jenkins happily allows the check-in, and you don't realize your code is broken until runtime in production.

Better: unit tests *protect* your code from underlying changes. Even if you don't have time to increase coverage, a simple 5-line API-usage unit test goes a long way.
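
A sketch of such a test, written against the hypothetical bear.call from the example above:

import unittest
import bear  # the hypothetical library from the example above

class BearCallTest(unittest.TestCase):
    def test_call_api(self):
        # If bear.call's parameters change underneath us, this fails in
        # Jenkins instead of at runtime in production.
        self.assertIsNotNone(bear.call(rabbit='thumper', berry='blue'))

if __name__ == '__main__':
    unittest.main()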


6) Catch-all. This is the "try: ... except Exception: pass" pattern, AKA the "sweep dirt under the rug" pattern. Typically the programmer is trying to perform some operation, is unsure whether it will succeed, and wants to handle failure gracefully. So instead of addressing failures, the programmer blindly catches and passes *all exceptions*. Why is this a problem? Because 1) you are hiding problems that can compound and/or propagate later. For example, if you have an arithmetic precision error and the bad results are ignored, the errors eventually propagate to other functions and possibly get stored as data. I sure wouldn't want my bank to catch-all and save bad arithmetic results in the storage layer. 2) If you hide problems, programmers never get a chance to know there's an error, and therefore never fix it.

Better: catch exactly what you need to catch, and propagate other errors back to the original call site. For example, if you perform some math function, catch only math errors. If the math function raises a file error for whatever reason, just allow that error to propagate up to the caller, so that the full stack trace is shown to programmers.

There are of course rare cases where you do in fact catch all errors. For example, if you're writing a library layer that can call anything and everything (and the program must continue to run), then you have to catch all. In that case, you SHOULD log or print the errors instead of blindly doing nothing.
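
A hedged sketch of both cases (function names are illustrative):

import logging

def safe_ratio(numerator, denominator):
    try:
        return numerator / denominator
    except ZeroDivisionError:  # the one failure we expect and can handle
        return float('nan')
    # Any other error (TypeError, IOError, ...) propagates to the caller
    # with a full stack trace.

def run_plugin(plugin_fn):
    # Library-boundary case: the program must keep running, so catch all,
    # but LOG the failure instead of silently passing.
    try:
        plugin_fn()
    except Exception:
        logging.exception('plugin failed; continuing')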


7) Function parameters named a, b, value, max_val, and so on. Why is this bad? Because the reader of the code has no clue what they are, and even an IDE is not much help! Another pattern that is really hard to read is the "pass-all" *args + **kwargs parameter list. The reader has no choice but to dig into your code and READ which parameters are accepted.

Better: describe your parameters, e.g. max_iteration, baseline_threshold, and so on. This way, when someone calls your function and types ALT-P, the names of the parameters pop out immediately. In situations where you must use *args and **kwargs, document them clearly in your function.
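
A quick sketch (parameter names are illustrative):

def run_analysis(max_iterations=100, baseline_threshold=0.05):
    # Descriptive names show up in the IDE's parameter hint at every call site.
    return max_iterations * baseline_threshold

def run_flexible(*args, **kwargs):
    # Accepted kwargs: max_iterations (int), baseline_threshold (float).
    # When *args/**kwargs are unavoidable, spell out the accepted keys here
    # so the reader doesn't have to dig through the body.
    return run_analysis(**kwargs)

print(run_analysis(max_iterations=10, baseline_threshold=0.1))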


8) Reliance on runtime magic. You take a string and magically turn it into a function call. Or you use a bunch of introspection methods (find a function that starts with letter X and ends with Z, then call it). Why should this be avoided in general? Because automated tools cannot infer anything about the safety of reflection/dynamic calls, so your only assurance that the code is bug-free is running it with all possible inputs (ya, right). Typically, this type of code is full of dynamic getattr, locals()['func_name'](parameter), and so on.

Better: KISS. Don't try to out-clever the automated tools. Automated tools (Pylint, Pycharm, Pychecker) are here to help you; if you try to out-clever them, they become useless. And remember that the reader of your code has to think to follow your logic -- the more complexity he has to digest, the more likely a bug slips through.


9) Assumptions about timing. I see this from time to time. Real example: the author of project R observed that a script always exits within 45 seconds, so he created a cron job that forks off that script every minute. Well, as the data set got larger, the script started taking longer and longer. Eventually the script took so long to execute, while cron still launched a new one every minute, that the machine had a zillion of these processes in flight, thrashed, and ran out of virtual memory. The moral of the story: if you assume something behaves a certain way 99.99% of the time, the other 0.01% will really screw you up.

Better: don't make assumptions. In the case above, use a [priority] task queue.
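
A minimal sketch of the queue idea using only the standard library (a real deployment would use a proper distributed task queue):

import threading
try:
    import queue            # Python 3
except ImportError:
    import Queue as queue   # Python 2

jobs = queue.Queue()

def drain_jobs():
    while True:
        job = jobs.get()    # blocks until work arrives; slow jobs queue up
        try:                # instead of forking a zillion new processes
            job()
        finally:
            jobs.task_done()

worker = threading.Thread(target=drain_jobs)
worker.daemon = True        # let the example program exit
worker.start()

for _ in range(10):
    jobs.put(lambda: None)  # stand-in for the cron script's work
jobs.join()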


10) Busy waiting. If you write a [multi-threading] program that has a for/while-IO-sleep pattern, you're doing busy waiting. For every busy waiting pattern out there, there is an efficient I/O sleep-signal-wake implementation/pattern to use.

Better: this opens a long discussion of I/O, operating systems, waits, locks, and signals. Hit me up and I'll point you to publications and/or textbooks.
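
For a taste of it, a minimal sleep-signal-wake sketch with the standard library:

import threading

data_ready = threading.Event()

def consumer():
    data_ready.wait()       # sleeps in the OS until signaled -- no polling loop
    print('processing data')

def producer():
    # ... produce the data, then signal:
    data_ready.set()        # wakes the consumer immediately

t = threading.Thread(target=consumer)
t.start()
producer()
t.join()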


11) Global variables. You stick everything in global variables because it's "easier." Now your functions have side-effects. It may be easier to write programs using global variables, but they are harder to understand: other people have to read the source to see what every other piece of code is doing to those globals (causing the side-effects). Global-variable programming is the antithesis of OO programming.

Better: OO, because it improves communication and increases flexibility (even if it takes a bit more typing).
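
A tiny sketch of the contrast (names are illustrative):

class Counter(object):
    # State lives on the instance, not in a module-level global.
    def __init__(self):
        self.count = 0

    def increment(self):
        self.count += 1     # no hidden side-effects on unrelated code
        return self.count

a = Counter()
b = Counter()
a.increment()
print(a.count)  # 1
print(b.count)  # 0 -- instances are independent and easy to reason about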



There will be others from time to time. I hope these guidelines help people program in a way that maximizes communication between programmers, allowing the code base to grow and to remain readable, debuggable, maintainable, and extensible.

Thursday, December 29, 2011

Eng: Great Engineering Comes From Great Infrastructure

A great engineering culture starts with a great engineering infrastructure. A great infrastructure allows people to stack technologies arbitrarily upon each other, compounding their leverage to enormously high levels. Without that infrastructure and culture in place, a team struggles with mundane issues: integration, expertise segmentation, communication problems, redundant system administration efforts, and a silo'ed [undebuggable] code base.[1]

Many tech pundits say that Google is by far one of the worst product-oriented tech companies in Silicon Valley. Wave sucked donkeys. Buzz was marginally better. G+ is a premature baby. The first few versions of Android were terrible in terms of usability/security/developer APIs. The first few versions of Google search were mediocre at best. The first few versions of Gmail were a big turn-off for non-techies. And the early Google AdSense? Horrible contextual matching.

This is not to say that Google is a horrible company as a whole, but that it is a very misunderstood one. What Google lacks as a product company, it makes up for with a superb backend infrastructure and very effective engineering practices. Led by some of the world's most famous luminaries from both academia and industry, Google built some of the most amazing infrastructure anywhere: the homogeneous data center/Borg (EC2), GFS (S3), MapReduce (Hadoop), SSTable/BigTable (HBase), Service-Oriented Architecture (SoA), Protocol Buffer (Thrift), Mondrian (code review), Sawzall, and the SRE team & Borgmon (monitoring/management). These foundations were and still are ahead of their time; even today, Amazon and Hadoop are still trying to catch up. These amazing foundations allow Google developers to build, integrate, and iterate on a zillion different crappy products quickly. For example, while Facebook has shipped essentially one product over the last few years, Google used a minimal number of developers to create Wave (failure), Buzz (failure), and Google+ within a few short quarters.

So while Google's end-user products are not exactly exemplary (relative to what Apple has cranked out), its backend infrastructure is highly prized and coveted. Here are some simple engineering tenets that companies can replicate to create a Googly infrastructure of their own:

1) Use homogeneous systems. To minimize versioning problems, all developers use the same version of the OS as the production machines. Imagine the old days back in the 90s: a heterogeneous environment with NT 4.0 + HP-UX + SunOS + SCO + BSD + Irix + DEC, which creates silos of people with different expertise and invariably creates integration headaches. A significant amount of developer time is spent on integration and debugging; why create more problems with a highly heterogeneous environment?

At Google, there are only ~1.5 versions of the OS: "Goobuntu". Developers use the exact same version (and package dependencies) as the production system. "Versionitis" is no more.

2) Make the code base universally shared and accessible in a uniform version control system. This empowers any developer to look at and fix any other developer's code, which minimizes unnecessary communication (e.g. "Hello Joe, I can't see your code and it is doing something weird, can you look into it?"). A developer at Google can read the entire Google source tree under google3/... and check in fixes for anyone else. How is that for transparency?

3) Pick a few core [production] languages, and stick to them. By having just a few core languages, developers from different groups can read each other’s code, and be able to jump from one group to another with ease.

The worst scenario is when a company starts using various exotic languages (Haskell, Erlang, OCaml), or [as in the case of Twitter] when there are simply too many languages for one product (LISP + Perl + Python + PHP + Ruby + Lua). You really don't want emails going back and forth that look like "Hey Joe, do you know Lua? I think there's a bug in the re-Tweet feature and the person who wrote this Lua code left."

At Google, the 4 main languages are: C++ (backend), Java (backend and middle-tier), Python (middle-tier and frontend), and JavaScript (frontend). Almost every developer at Google [who has been around] is well versed in these 4. There are also many experimental, non-production languages/frameworks. For example, GWT has been out since 2007, and Go/DART are just experimental languages that no one touches.

4) Scale horizontally by building Service-Oriented-Architecture (SoA) systems. Instead of creating a monolithic, bloated code base, SoA allows the system to be spawned, replicated, and distributed. It also allows components/services to be tested individually. SoA requires communication, so unify the method of communication between the services (pick one: Protocol Buffer or Thrift) and the different core languages can talk to each other with ease. With a horizontally scaled solution, you simply add more of the same machines to handle more load. Look at this slide for more info.

4b) If you're building to scale, then don't scale vertically. It is tempting to build a crappy traditional architecture and then just upgrade to better hardware to scale -- more RAM, a faster machine, more processors, SSDs, exotic SANs, Oracle, etc. DON'T DO IT! Vertical scaling is a one-time process and has a hard bound. Your crappily written server may handle 2000 QPS today, and superior hardware may boost it to 3000 QPS. Once you've maxed out your hardware configuration to get that 3000 QPS, now what? You're stuck! Superior hardware is never going to cure problems born of shitware, for the same reason that high-rises can never be built on a muddy foundation.

4c) Do not use vendor-specific solutions. I've seen so many startups get locked into .NET or Oracle or federated MySQL because they used some exotic feature that is not available anywhere else, so they end up stuck, scaling vertically (see 4b). One example is a bunch of old-school programmers putting every piece of logic they can think of into SQL stored procedures. When it comes time to scale, they want to shard the databases (or even move to distributed processing), but they can't: they've invested too much in those vendor-specific procedures, making the transition very difficult if not impossible without rewriting the entire software. People who insist on piles of SQL stored procedures (logic in centralized computation) are usually inexperienced folks who have never worked with the horizontal scaling model: homogeneous, distributed computation.

5) Separate system administrators from developers. First, developers dislike mundane system administration tasks, and system administrators in general are not the best developers. Let the system administrators worry about system allocation, load, deployment, monitoring, and costs. Let the few elite architects worry about the architecture that provides reliability and fault tolerance, and let the developers worry about the code base, scalability (using SoA), extensibility, and usability.

6) Measure, measure, measure. Every system call, every RPC, everything should be logged in order to monitor, improve, and scale. Google does this by implementing varz (similar to stats), and the SRE teams (elite sysadmins) run tight monitoring and alerting systems.

7) Foster a culture of computer scientists, not hackers. Hackers can crank out demos fast (case in point: Yahoo), but they also create too many job opportunities for people who enjoy carrying pagers.

8) Hire, hire, hire. Hire based on ability instead of "he is my friend." This topic alone deserves a few pages of discussion.

Tuesday, December 20, 2011

Thrift IDL (protocol)

After the horrible experience with Avro, I considered both Protocol Buffer and Thrift for the company. Protocol Buffer's strongest point is that it is stable (not much has changed in the past few years). It is used in every single possible service at Google, it has gone through a very stringent code-review process, it has been written by the world's most seasoned and anal engineers, and it has thus been well battle-tested. However, I consciously passed over the opportunity to suggest Protocol Buffer for the company, partly because I'm considered a biased party: suggesting it would simply reinforce the idea that "Kevin is a Googler, so he's obviously biased. He thinks everything coming out of Google is amazing." To be fair, I really think Google cranks out shit end-user products most of the time (Wave, Buzz, G+, Location, Google Base, Android, etc. etc...). Sometimes Google happens to make a good end-user product, but only because Google throws a billion darts in the dark and occasionally one hits the bullseye. That's all.

I tested Thrift, and it is acceptable. In terms of features, it is very similar to Protocol Buffer. The first thing I tested was message backward and forward compatibility. There was no problem in either case. Whereas Avro returns an error saying the message format is different, the Thrift server gracefully (and correctly) disregards the new fields in newer messages and tolerates the missing fields in older ones.

With Thrift in Java, you set your Thrift objects using getters and setters, which is great because if a field changes (name or type), the Java compiler gives you an error immediately. In Python, you set your Thrift objects via the constructor, and the runtime will catch name errors. In contrast, Avro does none of this, so your program will just run along happily even though you've set my_integer="Not an integer", and somewhere down the line your program crashes and you're left scratching your head.
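
A hedged sketch of the Python side (my_service.ttypes stands in for a module generated by the Thrift compiler; the struct and its fields are illustrative):

# Generated by: thrift --gen py my_service.thrift
from my_service.ttypes import Person

p = Person(name='Joe Blow', age=42)  # constructor kwargs map to IDL fields
p2 = Person(nmae='Joe Blow')         # raises TypeError right here, instead of
                                     # crashing somewhere far downstream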

One last thing I love about Thrift: there is an asynchronous transport!!! This is exactly what powers AdSense, and allows people to easily prototype distributed computation architectures.
http://blog.rapleaf.com/dev/2010/06/23/fully-async-thrift-client-in-java/

There are a few Thrift "bugs" that should be fixed. For example, suppose you write the following in a message definition:
struct Person {
  2: string lastname = "last_default",
  7: string lastname = "HO",
  ...
}

The above should signal a compiler error (e.g. "Same field name not allowed."). There are many other cases that should signal an error but do not. I guess the developers are either too busy, too lazy, or just expect the target compiler (C++ or Java) to catch the error.

One other minor difference between Protocol Buffer and Thrift: Thrift has no deprecation keyword. In Protocol Buffer, a deprecated field compiles into Java, and the compiler will tell you the field is deprecated so programmers can update. It's not a big deal for us, but it may be a big deal for companies that keep updating contracts between two services.

In the end, my take on Avro vs. Thrift is like this: Avro is like the Microsoft Zune. The Zune has all the bells and whistles -- FM radio, a recorder, more buttons, a higher display resolution, an external HD, blah blah blah. The iPod, on the other hand, just does one thing. On paper, the Zune is superior to the iPod. On paper, Avro is superior to Thrift. But in the end, Avro just doesn't work well (no forward/backward compatibility; buggy, buggy, buggy; and the developers don't even respond to my bug report). What looks good on paper isn't necessarily good in practice. You can't trust everything you read. You have to play with it.

Monday, December 19, 2011

Avro, what a complete waste of time

I'm responsible for evaluating the different IDLs (Protocol Buffer, Avro, and Thrift) as a unified form of communication between the company's services. A key feature of a modern IDL is backward and forward message compatibility. For example, if the client adds one more field to a message, the server should be able to take the new message and process it (while ignoring the new field). The reverse also holds: the server may expect additional fields that the client does not send, in which case the server should just treat those fields as empty.

I started with Avro because I had high hopes for it. It has great features that neither PB nor Thrift has (no need for field deprecation, no separate IDL compiler to obtain), and it's built into Hadoop's MapReduce. My experience with Avro began with downloading the package (version 1.6.1, the latest). I tried an example (phunt-avro-rpc-quickstart-avro-release-1.2.0-9-gce46e91.zip), which included two small Python programs, start_server.py and send_message.py (the client). Both of them used the same IDL (mail.avpr). I got the client to send a message to the server with ease. Then I tried the most important aspect of an IDL -- forward and backward message compatibility. I expected the server to gracefully accept old and new messages, but instead got something completely unexpected:

PATH=~/code/avro-example/avro-1.6.1/src ./send_message.py AA BB MSG
Traceback (most recent call last):
File "./send_message.py", line 56, in
print("Result: " + requestor.request("myecho", {"mymessage": message}))
File "/home/vm42/code/avro-example/avro-1.6.1/src/avro/ipc.py", line 145, in request
return self.issue_request(call_request, message_name, request_datum)
File "/home/vm42/code/avro-example/avro-1.6.1/src/avro/ipc.py", line 260, in issue_request
call_response_exists = self.read_handshake_response(buffer_decoder)
File "/home/vm42/code/avro-example/avro-1.6.1/src/avro/ipc.py", line 204, in read_handshake_response
handshake_response.get('serverProtocol'))
File "/home/vm42/code/avro-example/avro-1.6.1/src/avro/ipc.py", line 120, in set_remote_protocol
REMOTE_PROTOCOLS[self.transceiver.remote_name] = self.remote_protocol
File "/home/vm42/code/avro-example/avro-1.6.1/src/avro/ipc.py", line 475, in
remote_name = property(lambda self: self.sock.getsockname())
AttributeError: 'NoneType' object has no attribute 'getsockname'


A well-designed IDL should at least show a warning message indicating that the field is unknown (or new, etc.). Nope! Avro returns a weird socket-related error. Looking at the Avro library (avro-1.6.1/src/avro/ipc.py), line ~474 yields:

# read-only properties
sock = property(lambda self: self.conn.sock)
remote_name = property(lambda self: self.sock.getsockname())


So, I'm no Python expert, but it's clear that self.sock does not exist here, so I manually set remote_name in the constructor __init__ (meaning it's not a read-only property anymore, but who cares) and voila, it works! Who the heck checked in this code anyway? My next attempt was the reverse: the server takes in a newer message while the client sends an older one. And here's my very useful Avro message:

PATH=~/code/avro-example/avro-1.6.1/src ./send_message.py AA BB MSG
Traceback (most recent call last):
File "./send_message.py", line 56, in
print("Result: " + requestor.request("myecho", {"mymessage": message}))
File "/home/vm42/code/avro-example/avro-1.6.1/src/avro/ipc.py", line 145, in request
return self.issue_request(call_request, message_name, request_datum)
File "/home/vm42/code/avro-example/avro-1.6.1/src/avro/ipc.py", line 264, in issue_request
return self.request(message_name, request_datum)
File "/home/vm42/code/avro-example/avro-1.6.1/src/avro/ipc.py", line 145, in request
return self.issue_request(call_request, message_name, request_datum)
File "/home/vm42/code/avro-example/avro-1.6.1/src/avro/ipc.py", line 262, in issue_request
return self.read_call_response(message_name, buffer_decoder)
File "/home/vm42/code/avro-example/avro-1.6.1/src/avro/ipc.py", line 222, in read_call_response
response_metadata = META_READER.read(decoder)
File "/home/vm42/code/avro-example/avro-1.6.1/src/avro/io.py", line 445, in read
return self.read_data(self.writers_schema, self.readers_schema, decoder)
File "/home/vm42/code/avro-example/avro-1.6.1/src/avro/io.py", line 486, in read_data
return self.read_map(writers_schema, readers_schema, decoder)
File "/home/vm42/code/avro-example/avro-1.6.1/src/avro/io.py", line 615, in read_map
block_count = decoder.read_long()
File "/home/vm42/code/avro-example/avro-1.6.1/src/avro/io.py", line 184, in read_long
b = ord(self.read(1))
TypeError: ord() expected a character, but string of length 0 found


Alright, I've had enough. What I just attempted was a very, very common test case, and with any decent set of unit tests, this problem would never have existed. My guess is that there are no unit tests whatsoever, and there isn't much of a user base, because I can't find this complaint anywhere via Googling (and I can't find much Avro documentation in the first place)! I sent an email to the Avro developer team this weekend and have yet to receive a response. I am utterly unimpressed so far.

Conclusion:
1) Save yourself some time by using something else that is battle-tested. The bleeding edge (in this case) is a waste of time.
2) The only thing that matters is your hands-on experience. Marketing and bias make Avro look amazing (dynamic features, flexibility, maintenance-free, language support, ...), but none of that matters if it does not work TODAY.

Friday, February 25, 2011

Recommending a fellow Googler to a startup

Once in a while I get this question: "Hey Kevin, thank you for recommending a fellow Googler. What do you think about X's technical skills?"

My long response is this:
The Google technical interview process is one of the most challenging interviews one can face. There's the resume screening (only one out of 1000 resumes passes), then email screening, then phone screening, a possible second phone screening, on-site interviews, and finally the hiring committee (in Mountain View) reviews long and very detailed written feedback from 7-10 interviewers. If someone makes it in as an engineer, you can be sure that person is way above the average of the millions and millions of people who send their resumes to Google each year. Think about this: if you're an engineer at a less-than-stellar company that doesn't value core engineering (Fox, Yahoo, Citysearch, AT&T Interactive, MySpace) and you think you can do better, you have already, at some point in your career, applied to Google. It's only human to want to do better. Look: chances are, an engineer you already know (who never went to Google) applied, and chances are he/she failed. I realize what I'm saying is really harsh, but this is the harsh reality. For this reason, even a really bad Googler is still above the tech industry average (especially compared with what I see in Los Angeles). Secondly, if a person survives the Google culture for a few years, you can be sure that person is at least average among Googlers, because the below-average Googlers get kicked out very, very fast; 2 to 3 quarters and you're out. I personally know a few who didn't survive a year -- usually they're super smart but unmotivated, and/or had other issues.

That said, technically, almost everyone I know at Google can kick the industry-average programmer's ass. Googlers tend to come from top-tier schools or top-tier companies. They made it into the system. They are hardcore, trained under the stringent Google Code Readability process. People strive to earn badges on their Moma page by being Googly -- by being technically good. I am not exaggerating or bragging; I'm just saying this after observing different people from different backgrounds, and relative to Google, the average tech standard is a pathetically low bar.

Anyways, if I vouch for someone from Google, that person is almost certainly more than technically adequate. But then again, so are 90% of the other Googlers. There are of course distinctions within this Special Forces group. Some people are slow but precise (they like to work on mission-critical code). Some people are fast but sloppy (they like to work on social networking sites). Some like Java. Some like Python. Some like Javascript. Some like C++. Some people are smart, and some people are simply mind-blowingly brilliant. The Google gene pool isn't entirely homogeneous.

In the end, you should not have to worry about a Googler's technical skills. You may, however, have to worry about many other things: giving them challenging enough tasks, making them feel like they're making a big impact on the world, and providing enough incentives and rewards to keep them; believe me, everyone is getting poached left and right these days with ridiculous packages. Keep in mind, there's a reason why Google managers tend to come and go very fast -- an ex-manager once commented to me that it's really, really hard to motivate and manage someone who is clearly much smarter than you are. I wasn't a manager at Google, but I can understand why. Some of the smartest people I've ever met are people I met at Google, and a few are a total pain in the ass to work with.