Friday, June 8, 2012

Your Friday afternoon reading list

Here you go, just what you were waiting for :)

  1. A nice explanation of why it was rather challenging to build index-only scans into the Postgres MVCC engine: http://michael.otacoo.com/postgresql-2/postgresql-9-2-highlight-index-only-scans/
    The commit message talks about the “visibility map”, a feature present since PostgreSQL 8.4 that keeps track of which pages contain only tuples visible to all transactions (no data modified since the latest vacuum cleanup, for example). What this commit does is check whether the page that would need to be consulted contains only tuples already visible to the running transaction, in which case the heap fetch can be skipped entirely.
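    Out of curiosity, here is a tiny sketch of the feature in action, assuming PostgreSQL 9.2+, the psycopg2 driver and a scratch database (the table name and connection string below are made up):

```python
# Minimal index-only-scan demo; assumes PostgreSQL 9.2+ and psycopg2,
# with placeholder connection details and a throwaway table.
import psycopg2

conn = psycopg2.connect("dbname=test")   # placeholder DSN
conn.autocommit = True                   # VACUUM can't run inside a transaction block
cur = conn.cursor()

cur.execute("DROP TABLE IF EXISTS ios_demo")
cur.execute("CREATE TABLE ios_demo AS SELECT g AS id FROM generate_series(1, 100000) g")
cur.execute("CREATE INDEX ON ios_demo (id)")
# VACUUM sets the visibility-map bits for pages whose tuples are visible
# to all transactions; only those pages let the executor skip the heap.
cur.execute("VACUUM ANALYZE ios_demo")

cur.execute("EXPLAIN (ANALYZE) SELECT id FROM ios_demo WHERE id BETWEEN 1000 AND 2000")
for (line,) in cur.fetchall():
    print(line)   # expect an "Index Only Scan" node reporting "Heap Fetches: 0"
```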
  2. A simple introduction to Postgres's query timing features: http://momjian.us/main/blogs/pgblog/2012.html#June_8_2012
    Each data manipulation language (DML) command (SELECT, INSERT, UPDATE, DELETE) goes through three stages:
    1. parser
    2. planner
    3. executor
    You can actually time how long each stage takes.
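    Here is a rough sketch of one way to do that, assuming psycopg2 and a superuser connection (I believe those settings are superuser-only); the DSN is a placeholder and the log labels are from memory:

```python
# Per-stage timing via the log_*_stats settings; assumes psycopg2 and a
# superuser connection (placeholder DSN). Timings land in the server log.
import psycopg2

conn = psycopg2.connect("dbname=test user=postgres")  # placeholder DSN
cur = conn.cursor()

# Each setting makes the corresponding stage report how long it took.
for setting in ("log_parser_stats", "log_planner_stats", "log_executor_stats"):
    cur.execute("SET {} = on".format(setting))

cur.execute("SELECT count(*) FROM pg_class")
print(cur.fetchone())
# Now check the PostgreSQL server log for entries along the lines of
# "PARSER STATISTICS", "PLANNER STATISTICS" and "EXECUTOR STATISTICS".
```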
  3. A super-awesome discussion of the intricacies of Postgres 9.2's group commit algorithm, and a change that is in the hopper for Postgres 9.3: http://pgeoghegan.blogspot.com/2012/06/towards-14000-write-transactions-on-my.html
    Oftentimes, backends will find that the WAL they need has already been flushed, and will be able to simply fast-path out of the function that ensures that WAL is flushed (a call to that function is required to honour transactional semantics). In fact, it is expected that only a small minority of backends (one at a time, dubbed “the leader”) will actually ever go through with flushing WAL.
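    The real thing lives deep in the WAL code in C, but the leader/follower shape of the idea can be sketched in a few lines of entirely made-up, illustrative Python:

```python
# Illustrative leader/follower group commit -- not Postgres's actual code.
# One committer flushes on behalf of everyone queued behind it; latecomers
# whose WAL position is already durable fast-path out without flushing.
import threading

class GroupCommit:
    def __init__(self):
        self._lock = threading.Lock()
        self._flushed_cond = threading.Condition(self._lock)
        self._flushed_upto = 0       # highest WAL position known durable
        self._leader_active = False

    def _flush_wal(self, upto):
        pass                         # stand-in for the expensive fsync of WAL

    def wait_for_commit(self, my_lsn):
        with self._lock:
            while True:
                if self._flushed_upto >= my_lsn:
                    return                       # fast path: already flushed
                if not self._leader_active:
                    self._leader_active = True   # become the leader
                    break
                self._flushed_cond.wait()        # follower: sleep, don't spin
        # The leader flushes outside the lock so more committers can queue up.
        self._flush_wal(my_lsn)
        with self._lock:
            self._flushed_upto = max(self._flushed_upto, my_lsn)
            self._leader_active = False
            self._flushed_cond.notify_all()
```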
  4. A great essay on why your storage system really needs to be sensitive to whether it uses SSD hardware underneath: http://blog.empathybox.com/post/24415262152/ssds-and-distributed-data-systems
    Traditional B+Trees or hashes are no longer the most appropriate persistent data structure. This is not due to the drop in latency but due to the write endurance problem. Moving a database with a traditional storage engine to commodity SSDs will likely be quite fast, but the SSDs may stop working after a few months!
  5. One of the best sharding presentations I've seen yet, from the Tumblr team: https://github.com/tumblr/jetpants/blob/master/doc/VelocityEurope2011Presentation.pdf?raw=true
    Sharding is the implementation of horizontal partitioning outside of MySQL (at the application level or service level). Each partition is a separate table. They may be located in different database schemas and/or different instances of MySQL.
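    At the application level the core of it is just a lookup from an id to the instance that owns its range; a toy version (hosts and ranges invented, and in no way Jetpants itself) might look like this:

```python
# Toy range-based shard routing -- invented hosts/ranges, not Tumblr's Jetpants.
import bisect

# Upper bound (inclusive) of each shard's id range, and where that shard lives.
SHARD_RANGE_UPPER = [1_000_000, 2_000_000, 4_000_000]
SHARD_HOSTS       = ["db-shard-01", "db-shard-02", "db-shard-03"]

def shard_for(entity_id):
    """Return the MySQL host that owns this id's range."""
    i = bisect.bisect_left(SHARD_RANGE_UPPER, entity_id)
    if i == len(SHARD_HOSTS):
        raise KeyError("id %d is beyond the last shard's range" % entity_id)
    return SHARD_HOSTS[i]

print(shard_for(1_500_000))   # -> db-shard-02
```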
  6. Also from the Tumblr gang, a nifty short note about using parallel gzip and netcat to get the fastest possible transfer of immense data sets between nodes: http://engineering.tumblr.com/post/7658008285/efficiently-copying-files-to-multiple-destinations
    By adding tee and a FIFO to the mix, you can create a fast copy chain: each node in the chain saves the data locally while simultaneously sending it to the next server in the chain.
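    They do it with nc, tee, a FIFO and pigz; purely to illustrate the shape of the chain, here is the same idea as a little Python relay (host names and ports are made up):

```python
# Toy relay node in a copy chain: save the incoming stream locally while
# forwarding it to the next host. Hosts/ports are placeholders; the real
# trick in the post uses nc, tee, a FIFO and pigz instead.
import socket

LISTEN_PORT = 1234
NEXT_HOP = ("next-node.example.com", 1234)   # set to None on the last node

def relay(local_path, next_hop=NEXT_HOP):
    srv = socket.socket()
    srv.bind(("", LISTEN_PORT))
    srv.listen(1)
    conn, _ = srv.accept()

    downstream = socket.create_connection(next_hop) if next_hop else None
    with open(local_path, "wb") as out:
        while True:
            chunk = conn.recv(1 << 20)
            if not chunk:
                break
            out.write(chunk)                  # the "tee": keep a local copy...
            if downstream:
                downstream.sendall(chunk)     # ...while streaming it onward
    if downstream:
        downstream.close()
    conn.close()
    srv.close()
```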
  7. In the most recent issue of IEEE Computer, Prof. Eric Brewer revisits his famous CAP theorem, 12 years later: http://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed
    The challenging case for designers is to mitigate a partition’s effects on consistency and availability. The key idea is to manage partitions very explicitly, including not only detection, but also a specific recovery process and a plan for all of the invariants that might be violated during a partition.
  8. Twitter has open-sourced its distributed tracing infrastructure, designed for real-time tracing of service-oriented architectures under actual load: http://engineering.twitter.com/2012/06/distributed-systems-tracing-with-zipkin.html
    Zipkin started out as a project during our first Hack Week. During that week we implemented a basic version of the Google Dapper paper for Thrift. Today it has grown to include support for tracing HTTP, Thrift, Memcache, SQL and Redis requests.
    Here's Google Dapper, by the way, if you haven't seen it before: http://research.google.com/pubs/pub36356.html
  9. A great checklist if you find yourself considering the problem of, say, loading 500 million rows into your database, all at once: http://derwiki.tumblr.com/post/24490758395/loading-half-a-billion-rows-into-mysql
    Use LOAD DATA INFILE

    This is the most optimized path toward bulk loading structured data into MySQL. The manual's section 8.2.2.1, “Speed of INSERT Statements”, predicts a ~20x speedup over a bulk INSERT (i.e. an INSERT with thousands of rows in a single statement).

    LOAD DATA INFILE is the MySQL equivalent of Postgres's COPY FROM, I believe.
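    For what it's worth, here is a minimal sketch of that path, assuming the mysql-connector-python driver and a CSV file already sitting on the server (table, file and connection details are placeholders):

```python
# Bulk load via LOAD DATA INFILE; assumes mysql-connector-python and a CSV
# file already on the MySQL server. All names here are placeholders.
import mysql.connector

conn = mysql.connector.connect(user="loader", password="...", database="warehouse")
cur = conn.cursor()

# One statement streams the whole file in, avoiding per-row INSERT overhead.
cur.execute("""
    LOAD DATA INFILE '/var/lib/mysql-files/events.csv'
    INTO TABLE events
    FIELDS TERMINATED BY ',' ENCLOSED BY '"'
    LINES TERMINATED BY '\\n'
    IGNORE 1 LINES
""")
conn.commit()
```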
  10. A somewhat-marketing-slanted post about why you shouldn't expect any shared-disk database system to ever perform very well: http://database-scalability.blogspot.com/
    With a shared disk there is a single shared copy of my big data on the shared disk, and the database engine still has to maintain "buffer management, locking, thread locks/semaphores, and recovery tasks". But now, with a twist: all of the above needs to be done "globally" between all participating servers, through network adapters and cables, introducing latency. Every database server in the cluster needs to update all the other nodes for every "buffer management, locking, thread locks/semaphores, and recovery" task it performs on a block of data.
  11. A well-written plea to never poll, and never spin, and always use a real lock manager or real synchronization system instead: http://randomascii.wordpress.com/2012/06/05/in-praise-of-idleness/
    The comment block says “Note: these locks are for use when you aren’t likely to contend on the critical section”. Famous last words. Caveat lockor. Sorry Facebook, too risky, not recommended.
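    The whole argument, boiled down to an illustrative toy:

```python
# The article's point in miniature: a busy-wait burns a core (and hammers a
# cache line) while it waits; a condition variable lets the OS put the
# waiting thread to sleep until there is actually something to do.
import threading

ready = False
cond = threading.Condition()

def spin_wait():                 # what not to do
    while not ready:
        pass                     # 100% CPU the whole time we wait

def blocking_wait():             # what the article is asking for
    with cond:
        while not ready:
            cond.wait()          # sleeps until notified, zero CPU meanwhile

def publish():
    global ready
    with cond:
        ready = True
        cond.notify_all()
```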
  12. A fascinating and detailed article about trying to use the Linux "perf" toolkit to profile and measure complicated system software (in this case, the Postgres engine): http://rhaas.blogspot.com/2012/06/perf-good-bad-ugly.html
    perf seems to be the best of the Linux profiling tools that are currently available by a considerable margin, and I think that they've made good decisions about the functionality and user interface. The lack of proper documentation is probably the tool's biggest weakness right now, but hopefully that is something that will be addressed over time.
  13. An intriguing introduction to some of the searching issues that come up in genome sequencing, which, at its core, bears a lot of resemblance to the problem of computing diffs between two versions of an object. http://williamedwardscoder.tumblr.com/post/24071805525/searching-for-substrings-in-a-massive-string
    How might you search a really big string - a few billion symbols - for the best alignment of a large number of short substrings - perhaps just a few hundred symbols long or less - with some allowed edit-distance fuzzy-matching?
    Includes some interesting pointers to the Bowtie project, which I had not previously encountered: http://bowtie-bio.sourceforge.net/index.shtml
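    To make the problem concrete, here is a toy seed-and-extend sketch, in the spirit of (but nothing like as clever as) what Bowtie actually does; every name and parameter below is invented:

```python
# Toy seed-and-extend alignment: index exact k-mers of the big reference,
# use them to propose candidate offsets for a read, then verify each
# candidate with a small edit-distance check. Illustrative only -- nothing
# here would survive contact with genome-scale data.
from collections import defaultdict

def build_kmer_index(reference, k):
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def within_k_edits(a, b, max_edits):
    """Levenshtein DP with an early exit; True if dist(a, b) <= max_edits."""
    if abs(len(a) - len(b)) > max_edits:
        return False
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                   # deletion
                           cur[j - 1] + 1,                # insertion
                           prev[j - 1] + (ca != cb)))     # substitution
        if min(cur) > max_edits:
            return False                                  # whole row over budget
        prev = cur
    return prev[-1] <= max_edits

def align(read, reference, index, k, max_edits):
    """Yield reference offsets where the read matches within max_edits."""
    seen = set()
    for s in range(0, len(read) - k + 1, k):              # non-overlapping seeds
        for pos in index.get(read[s:s + k], ()):
            start = pos - s
            if start < 0 or start in seen:
                continue
            seen.add(start)
            if within_k_edits(read, reference[start:start + len(read)], max_edits):
                yield start

reference = "ACGT" * 50 + "TTGACCA" + "ACGT" * 50
index = build_kmer_index(reference, k=4)
print(sorted(align("ACGTTGACCAAC", reference, index, k=4, max_edits=2)))
# -> offsets near position 196, where the read fits within two edits
```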
  14. And, lastly, since no list should end with 13 items, from the University of Ulm, an entire diploma thesis entitled "Concurrent Programming for Scalable Web Architectures": http://berb.github.com/diploma-thesis/original/index.html

    Chapter 6 is, naturally, of particular interest to me, being a storage systems kinda guy:

    In this chapter, we consider the impact of concurrency and scalability on storage backends. We illustrate the challenge of guaranteeing consistency in distributed database systems and point to different consistency models. We then outline some internal concepts of distributed database systems and describe how they handle replication and partitioning.

Enjoy!
