2009-07-02 Thu
It's hard to be a relational database lately. After years of faithful service everywhere you look the world is turning against you:
Certainly some say stick with the past. It's your fault, you aren't doing it right, give us another chance and all will be as it ever was. Some smirk saying this is nothing but a return to a more ancient time when IBM was King.
But it's in the air. It's in the code. A revolution is coming. To what? That is what is not yet clear.
Update 3: Presentation from the NoSQL Conference: slides, video.
Update 2: Jim Wilson helps with the Understanding HBase and BigTable by explaining them from a "conceptual standpoint."
Update: InfoQ interview: HBase Leads Discuss Hadoop, BigTable and Distributed Databases. "MapReduce (both Google's and Hadoop's) is ideal for processing huge amounts of data with sizes that would not fit in a traditional database. Neither is appropriate for transaction/single request processing."
Hbase is the open source answer to BigTable, Google's highly scalable distributed database. It is built on top of Hadoop (product), which implements functionality similar to Google's GFS and Map/Reduce systems.
Update 3: Presentation from the NoSQL conference: slides, video 1, video 2.
Update 2: The folks at Hypertable would like you to know that Hypertable is now officially sponsored by Baidu, China’s Leading Search Engine. As a sponsor of Hypertable, Baidu has committed an industrious team of engineers, numerous servers, and support
resources to improve the quality and development of the open source technology.
Update: InfoQ interview on Hypertable Lead Discusses Hadoop and Distributed Databases. Hypertable differs from HBase in that it is a higher performance implementation of Bigtable.
Skrentablog gives the heads up on Hypertable, Zvents' open-source BigTable clone. It's written in C++ and can run on top of either HDFS or KFS. Performance looks encouraging at 28M rows of data inserted at a per-node write rate of 7mb/sec.
Update 2: Presentation from the NoSQL conference: slides, video.
Update: Why you won't be building your killer app on a distributed hash table by Jonathan Ellis. Why I think Cassandra is the most promising of the open-source distributed databases --you get a relatively rich data model and a distribution model that supports efficient range queries. These are not things that can be grafted on top of a simpler DHT foundation, so Cassandra will be useful for a wider variety of applications.
James Hamilton has published a thorough summary of Facebook's Cassandra, another scalable key-value store for your perusal. It's open source and is described as a "BigTable data model running on a Dynamo-like infrastructure." Cassandra is used in Facebook as an email search system containing 25TB and over 100m mailboxes.
Update: Presentation from the NoSQL conference: slides, video 1, video 2.
Project Voldemort is an open source implementation of the basic parts of Dynamo (Amazon’s Highly Available Key-value Store) distributed key-value storage system. LinkedIn is using it in their production environment for "certain high-scalability storage problems where simple functional partitioning is not sufficient."
Update 2: They are now called NoSQL databases. So keep up! Eric Lai wrote a good article in Computerworld No to SQL? Anti-database movement gains steam about the phenomena. There was even a NoSQL conference. It was unfortunately full by the time I wanted to sign up, but there are presentations by all the major players. Nice Hacker News thread too.
Update: Some Notes on Distributed Key Stores by Leonard Lin. What's the best way to handle a fast growing system with 100M items that requires low latency and lots of inserts? Leanord takes a trip through several competing systems. The winner was: Tokyo Cabinet.
Richard Jones has put together a very nice list of various key-value stores around the internets. The list includes: Project Voldemort, Ringo, Scalaris, Kai, Dynomite, MemcacheDB, ThruDB, CouchDB, Cassandra, HBase, and Hypertable. Richard also includes some commentary and their basic components (language, fault tolerance, persistence, client protocol, data model, docs, community).
There's an excellent discussion in the comments of Paxos vs Vector Clock techniques for synchronizing writes in the face of network failures.
2009-07-01 Wed
In this podcast, we interview Jonathan Ellis about how Facebook's open sourced Cassandra Project took lessons learned from Amazon's Dynamo and Google's BigTable to tackle the difficult problem of building a highly scalable, always available, distributed data store.
For the last couple of months, we've been quietly developing a MySQL protocol parser for Maatkit. It isn't an implementation of the protocol: it's an observer of the protocol. This lets us gather queries from servers that don't have a slow query log enabled, at very high time resolution.
With this new functionality, it becomes possible for mk-query-digest to stand on the sidelines and watch queries fly by over TCP. It is only an observer on the sidelines: it is NOT a man in the middle like mysql-proxy, so it has basically zero impact on the running server (tcpdump is very efficient) and zero impact on the query latency. There are some unique challenges to watching an entire server's traffic, but we've found ways to solve those. Some of them are harder than others, such as making sense of a conversation when you start listening in the middle. In general, it's working very well. We can gather just about every bit of information about queries that mysql-proxy can, making this a viable way to monitor servers without the disadvantages of a proxy. In theory, we can gather ALL the same information, but in practice we are going for the 95% case.
As always with Maatkit, this has minimal dependencies. It doesn't require any Net::Pcap or other modules from CPAN. It's written in pure Perl, and it parses the output of tcpdump, rather than watching the network traffic directly. This might sound useless, but it's not. It means you can go tcpdump some traffic on a machine without Perl installed, and copy it to another machine for analysis, or send it via email to your friendly consultant, or do any of a number of other things. Decoupling things is very helpful sometimes.
Let's see how to gather queries and do something useful with them. I'll just watch the queries on a sandbox server on my laptop, and print out the profile synopsis so you can see how it works.
-
sudo tcpdump -i lo port 3306 -s 65535 -x -n -q -tttt> tcpdump.out
I run a few queries, quit, and cancel tcpdump. Now I've got a file and I'm ready to analyze it. Let's see:
-
mk-query-digest --type=tcpdump --report-format=profile tcpdump.out
-
# Rank Query ID Response time Calls R/Call Item
-
# ==== ================== ================ ======= ========== ====
-
# 1 0x088084BF139954D8 0.1009 90.2% 1 0.100881 SELECT dual
-
# 2 0x261703E684370D2C 0.0110 9.8% 1 0.011017
I'm kind of showing off the summary profile here to illustrate that you can get really compact results to see what's going on inside your server. What do you suppose that one query was that took a tenth of a second? We can find out.
-
mk-query-digest --type=tcpdump --limit 1 tcpdump.out
-
# 460ms user time, 30ms system time, 8.88M rss, 11.79M vsz
-
# Overall: 8 total, 6 unique, 0.15 QPS, 0.00x concurrency ________________
-
# total min max avg 95% stddev median
-
# Exec time 114ms 0 101ms 14ms 100ms 33ms 737us
-
# Hosts 8
-
# Time range 2009-07-01 23:55:41.334082 to 2009-07-01 23:56:33.929945
-
# bytes 215 14 49 26.88 46.83 9.76 26.08
-
# Errors 8
-
# Rows affe 0 0 0 0 0 0 0
-
# Warning c 0 0 0 0 0 0 0
-
# 0% No_good_index_used
-
# 12% No_index_used
-
-
# Query 1: 0 QPS, 0x concurrency, ID 0x088084BF139954D8 at byte 7517 _____
-
# This item is included in the report because it matches --limit.
-
# pct total min max avg 95% stddev median
-
# Count 12 1
-
# Exec time 88 101ms 101ms 101ms 101ms 101ms 0 101ms
-
# Users 1 msandbox
-
# Hosts 1 127.0.0.1
-
# Databases 1
-
# Time range 2009-07-01 23:56:31.022602 to 2009-07-01 23:56:31.022602
-
# bytes 22 49 49 49 49 49 0 49
-
# Errors 1 none
-
# Rows affe 0 0 0 0 0 0 0 0
-
# Warning c 0 0 0 0 0 0 0 0
-
# 0% No_good_index_used
-
# 0% No_index_used
-
# Query_time distribution
-
# 1us
-
# 10us
-
# 100us
-
# 1ms
-
# 10ms
-
# 100ms ################################################################
-
# 1s
-
# 10s+
-
# Tables
-
# SHOW TABLE STATUS LIKE 'dual'\G
-
# SHOW CREATE TABLE `dual`\G
-
# EXPLAIN
-
select 1 from ( select sleep(.1) from dual ) as x\G
Indeed, it's no surprise the query took a tenth of a second to execute, and now you see where "SELECT dual" comes from.
Notice that it is inspecting the protocol enough to see the flags set in the protocol, indicating the warning count, error count, rows affected, and whether no index or no good index was available. Look at the top of the report -- what is up with the 12% of queries that say No_index_used? If we increase --limit a bit, we can see
-
# 0% No_good_index_used
-
# 100% No_index_used
-
# Query_time distribution
-
... snip ...
-
show databases\G
I did not know that SHOW DATABASES sets the "no index used" flag, did you? Now we both do!
This is just a brief introduction to what the protocol parser can do. Of course, in real life it's much more useful than just seeing a query or two -- it has all the power of mk-query-digest for filtering, aggregating, printing and so forth.
Entry posted by Baron Schwartz | No comment
I was particularly mean spirited in my recent post about the TPC-H(tm) benchmark when I suggested that an unnamed analyst should perhaps leave the industry behind. I didn't really mean that.
I choose to give the TPC-H(tm) numbers significance, especially as an indicator of price/performance ratios. Kickfire kicks ass with a PP ratio of $0.81 per QPPH. This is very impressive. You can choose to disagree if you like.
Update 5: Shopzilla's Site Redo - You Get What You Measure. At the Velocity conference Phil Dixon, from Shopzilla, presented data showing a 5 second speed up resulted in a 25% increase in page views, a 10% increase in revenue, a 50% reduction in hardware, and a 120% increase traffic from Google. Built a new service oriented Java based stack. Keep it simple. Quality is a design decision. Obsessively easure everything. Used agile and built the site one page at a time to get feedback. Use proxies to incrementally expose users to new pages for A/B testing. Oracle Coherence Grid for caching. 1.5 second page load SLA. 650ms server side SLA. Make 30 parallel calls on server. 100 million requests a day. SLAs measure 95th percentile, averages not useful. Little things make a big difference.
Update 4: Slow Pages Lose Users. At the Velocity Conference Jake Brutlag (Google Search) and Eric Schurman (Microsoft Bing) presented study data showing delays under half a second impact business metrics and delay costs increase over time and persist. Page weight not key. Progressive rendering helps a lot.
Update 3: Nati Shalom's Take on this article. Lots of good stuff on designing architectures for latency minimization.
Update 2: Why Latency Lags Bandwidth, and What it Means to Computing by David Patterson. Reasons: Moore's Law helps BW more than latency; Distance limits latency; Bandwidth easier to sell; Latency help BW, but not vice versa; Bandwidth hurts latency; OS overhead hurts latency more than BW. Three ways to cope: Caching, Replication, Prediction. We haven't talked about prediction. Games use prediction, i.e, project where a character will go, but it's not a strategy much used in websites.
Update: Efficient data transfer through zero copy. Copying data kills. This excellent article explains the path data takes through the OS and how to reduce the number of copies to the big zero.
Latency matters. Amazon found every 100ms of latency cost them 1% in sales. Google found an extra .5 seconds in search page generation time dropped traffic by 20%. A broker could lose $4 million in revenues per millisecond if their electronic trading platform is 5 milliseconds behind the competition.




