I’m at Velocity 2009, sitting in on the “Hadoop Operations” talk.

Jeff Hammerbacher, Chief Scientist, Cloudera (email is first six of his last name at his company dot com). He has an ambitious agenda for this session and talks very fast, so sketchy notes and abbrevs for me. Pardon the crappy formatting.

slides are here.

Built data team at FB. ~30 ppl when he left. Built Hive and Cassandra.

Good resources:

  • “Hadoop: The Definitive Guide” by Tom White (must have)
  • “Hadoop Cluster Management” slides by Marco Nicosia’s 2009 USENIX talk

Hadoop: OSS for WSCs (warehouse-scale computers)

Typical cluster: 1U 2×4 core, 8GB RAM, 4×1TB SATA, 2×1 gE NIC; one switch per rack with 8 Gb intfc to backbone. Think 40-node-rack as unit.

HDFS: breaks files to 128MB, replicates blocks across nodes. W1RM design. checksumming, replication, compression included (tell you three times). Hooks in via Java, C, command line tools, FUSE, WebDAV, Thrift. Not usually mounted directly.

[how does it handle many small files? see HAR files below, see Common problems below, no statements about performance]

HDFS looks to diversly write blocks (across racks) using topology info.

MapReduce uses HDFS api to assign work to where the data is.

Avro: cross-language serialization for on-wire/RPC and persistence, includes versioning and security

HBase: Google’s BigTable lookalike on top of HDFS

Hive: SQL-like interface to structured data stored in HDFS. Replace DWH.

Pig: lang for dataflow programming.

Zookeeper: manage a distributed system

Good ways to dip your toes with Hadoop:


  • Log or msg warehouse
  • DB archival store
  • ETL for DWH
  • Search team projects (autocomplete, did you mean, indexing)
  • Targeted web crawls (market research, etc)


  • use retired DB servers
  • use unused desktops
  • use EC2

[skipped a lot about how the project runs, apache voting, etc.]

Don’t run Hadoop across two data centers; one per and communicate at the app layer. [this sounds a lot like the rules for MPI et al ca. 1999-2000]

Make sure to use ECC RAM. High volume mem churn requires it.

Linux/CentOS “mildly preferred”

Mount local FS “noatime” for performance.

Recommend ext3 over xfs. Local FS performance improvements (e.g. xfs) don’t necessarily translate to global perf improvements (network bottlenecks consume it). Mentioned an xfs long-write problem.

JBOD over RAID0; slightly better performance and losing a disk doesn’t suck as much.

Java 6 update 14 or later (update 14 makes 64-bit pointers as cheap as 32-bit).

Installation: http://www.cloudera.com/hadoop

“In our distribution we put [things] where they ought to be.” Register with init.d, etc.

Configuration: http://my.cloudera.com/

You spec topology and whether JT/NN live on same machine, it spits out the rest. Hangs on to it for you, too.

Config modes

Standalone mode:

  • Everything in one JVM
  • Only one reducer, so you might not be able to find the bug

Pseudo-dist mode:

  • All daemons on one box using socket IPC

Dist mode:

  • For production

Config files

  • xml based
  • org.apache.hadoop.conf has Configuration class
  • Later resources overwrite earlier; “final” keyword prevents overwrite
  • common-site.xml, hdfs-site.xml, mapred-site.xml
  • Look in .template for examples

Cloudera admins their soft-layer cluster with Puppet “with varying level of success”. He’s seen Chef, cfengine, bcfg2, and others.

Problems in config:

  • “The problem is almost always DNS”—Todd Lipcon
  • Open the necessary ports (many) in firewall
  • Disting ssh keys (Cloudera uses expect)
  • directory permissions (writing logs)
  • Use all your disks!
  • Don’t try to use NFS for large clusters
  • JAVA_HOME set right (esp. on Macs)

Nehalems ~2x performance improvement

HDFS NameNode (“the master”)

VERSION file specs layoutVersion (negative number, decrements for each new). You hope this doesn’t change much; upgrade is painful

NN manages fs image (inode map, in mem) and edit log (journal, to disk).

Secondary NN (on different node) aka checkpoint node (v0.21): replays journal and tells primary to forget some history to prevent the edit log from becoming ridiculously large.

Backup node: write same data to NFS to recover if local node blows up

DataNode: round-robins blocks across all nodes.

  • Heartbeats to the nodes
  • dfs.hosts[.exlcude] to allow/deny clients


  • Use Java libs or command line
  • libhdfs c library lacks features and has memory leaks (and FUSE interface uses it)
  • Client only contacts NN for metadata
  • Client keeps distance-ranked list of block locations for data reads
  • Client maintains write queues: data queue and ack queue (writes three times, can’t forget request until all three are ack’d).
  • First datanode in write takes responsibility for pass-down-the-line write requests rather than having client spray data at all 3/n data nodes expected to write.

Can’t seek and write, nor append. So you create new each time.

HDFS Operator Utilities

Safe mode

  • Loads image file, applies edit log, creates new (empty) edit log
  • Datanodes send blocklists to NN
  • NN uses this during startup, will only service metadata reads while in safe mode
  • Exits safe mode after 99.9% of blocks have reported in (configurable); only one replica of block must be known (can rereplicate)

FS Check (hadoop fsck)

  • Just talks to NN to look at metadata
  • Looks for minimally rep’d, over/under rep’d blocks
  • Identify missing replicas and rereplicate, blocks with 0 replicas (corrupt files)
  • hadoop fsck /path/to/file -files -blocks to determine blocks for file
  • Run ~1 hr in production, store output


  • admin quotas
  • add/remove datanodes
  • ckpoint fs image
  • monitor/manage fs upgrade


  • cksum local blocks (with bandwidth throttling)
  • Runs ~3 weeks (configurable)


  • goes thru cluster, makes disk utilization scores per datanode
  • rebalances if nodes are more than +/- 10% (with throttling)

Archive Tool

  • HAR file: like tar file, many entries in one HDFS namespace
  • Makes two index files and many part files (hopefully less than # of files you’re har’g)
  • Index files are used for lookup into part files
  • Doesn’t support compression and are W1RM.


  • Move large amounts of data in parallel
  • Implemented as MapReduce with no reducers
  • Can move data between data centers with this; can also saturate the network pipe


  • apply to directories, not users or groups
  • namespace quotas constrain your use of the NN resources
  • diskspace quotas constrain your use of the datanodes’ resources
  • No defaults (can’t make new directories pick them up)

Users, Groups, Permissions

  • Relatively new
  • Very UNIXy
  • Executable bit means nothing on file
  • Need write on dir to add/remove files
  • need exec on dir to access child dirs
  • identity of NN process superuser

Audit logs

  • Not on by default, but useful for security


  • Uses to compute distance measures for replication
  • Node, Rack, Core Switch
  • Some work to infer from IP

Web UIs

  • There are many
  • NN @ port 50070: /metrics /logLevel /stacks
  • 2NN @ port 50090
  • Datanode @ port 50075

HFDS Proxy: http server access for non-HDFS clients

ThriftFS: thrift server for non-HDFS clients


  • Helps recover from bad rm’s (indavertent rm -rf happened on FB cluster)

Common Problems

  • Disk capacity: crank up reserved space, keep close eye on space, watch hadoop logfiles
  • Slow disks which aren’t yet dead: can’t see as fail, but you have to watch
  • NIC goes out of gig-E mode
  • ckpoint and backup data: keep an eye on 2NN node, watch NN edit log size
  • check NFS mount for shared NN data structure
  • Long writes (> 1 hr) can see things get freaky; break them down
  • HDFS layoutVersion upgrades are scary
  • Many small files can consume namespace: keep an eye on consumption

Turn on fairshare schedulers (Cloudera rus it out of the box)

Use distributed cache to send common libs to all nodes

JobControl: good way to express job depedencies

Run canary jobs (sort, dfs write) to test functional status

Upgrades are scary. This will be less true as it reaches 1.0

One admin can easily carry a medium (100-node) cluster. Most activity is around commission/decommission.

Try not to lose more than N nodes, where N is your replication factor. You could hit the jackpot on those being the only three replicas of some needed block.

blog comments powered by Disqus


22 June 2009