Velocity 08: Storage at Scale


Google's reliability strategy is to buy cheap hardware with no reliability features and build reliable clusters from it, because no problem Google wants to solve fits on a single machine anyway.

The Google File System (GFS) is a cluster file system with a familiar interface, but it is not POSIX compliant. Bigtable is a distributed database system with a custom interface rather than SQL. There are hundreds of instances (cells) of each, scaling up to thousands of servers and petabytes of data.

In GFS, a master manages metadata. Data is broken into chunks (64 MB), and multiple copies of each chunk (typically three) are stored on different machines. The master also handles machine failures, which are frequent when you use lots of commodity hardware. Checksumming detects errors; replication allows for recovery. This all happens automatically. Hotspots get a higher replication factor.
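The detect-and-repair loop can be sketched in a few lines. This is a toy in-memory model (not Google's implementation): checksums spot a corrupt replica, and a healthy copy is re-replicated to a spare machine until the replication factor is restored. The machine names and `verify_and_repair` helper are invented for illustration.

```python
import hashlib

REPLICATION_FACTOR = 3

def checksum(data: bytes) -> str:
    """Content checksum used to detect silent corruption of a replica."""
    return hashlib.sha256(data).hexdigest()

def verify_and_repair(replicas: dict, expected: str) -> dict:
    """replicas maps machine name -> chunk bytes; returns a repaired set."""
    # Drop any replica whose checksum no longer matches the expected value.
    healthy = {m: d for m, d in replicas.items() if checksum(d) == expected}
    # Re-replicate from a surviving copy until back at full strength.
    source = next(iter(healthy.values()))
    for i in range(100):
        if len(healthy) >= REPLICATION_FACTOR:
            break
        spare = f"machine-{i}"
        if spare not in healthy:
            healthy[spare] = source
    return healthy
```

In the real system the master drives this: it notices under-replicated or corrupt chunks and schedules copies, so no operator has to intervene.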

Most data is in two formats:

  • RecordIO - sequence of variable sized records
  • SSTable - sorted sequence of key/value pairs
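The SSTable idea can be made concrete with a minimal sketch: an immutable, sorted sequence of key/value pairs supporting binary-search lookups and range scans. Real SSTables live on disk with block indexes; this in-memory class (names are mine, not from the talk) only illustrates the sorted-key design.

```python
import bisect

class SSTable:
    """Immutable, sorted key/value table; lookups are binary searches."""

    def __init__(self, items):
        # Sort once at build time; the table never changes afterwards.
        pairs = sorted(items)
        self._keys = [k for k, _ in pairs]
        self._values = [v for _, v in pairs]

    def get(self, key, default=None):
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._values[i]
        return default

    def range_scan(self, start, end):
        # All pairs with start <= key < end, in key order.
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        return list(zip(self._keys[lo:hi], self._values[lo:hi]))
```

Keeping keys sorted is what makes prefix and range scans cheap, which matters when rows are ordered by URL or user ID.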

Bigtable is built on top of GFS. It holds lots of semi-structured data ordered by URL, user ID, or geographic location. The size of the data sets varies widely (e.g. page data vs. satellite-image data).

Tables are broken into tablets, which are treated as chunks for replication in GFS. When a tablet gets too big, it is split in two. Tablets are stored as SSTables in GFS.
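Splitting is simple because rows are sorted: cut at the median key and each half still covers a contiguous key range. A toy sketch (the threshold here is tiny for illustration; real tablets are on the order of 100 MB or more):

```python
SPLIT_THRESHOLD = 4  # rows; a stand-in for a real size-in-bytes threshold

def maybe_split(tablet):
    """tablet is a sorted list of (key, value) rows.

    Returns [tablet] unchanged if it is small enough, otherwise two
    halves that each cover a contiguous, sorted key range.
    """
    if len(tablet) <= SPLIT_THRESHOLD:
        return [tablet]
    mid = len(tablet) // 2
    return [tablet[:mid], tablet[mid:]]
```

Because the split point is a key, the master only has to record which tablet owns which key range; no row data moves at split time.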

Scale is important: Google is envisioning how to build exabyte-scale systems. These systems need to be more and more automated, because the number of systems is growing faster than Google can hire.