LiveBlog: Hadoop Operations

I’m at Velocity 2009, sitting in on the “Hadoop Operations” talk.

Jeff Hammerbacher, Chief Scientist, Cloudera (email is first six of his last name at his company dot com). He has an ambitious agenda for this session and talks very fast, so sketchy notes and abbrevs for me. Pardon the crappy formatting.

slides are here.

Built data team at FB. ~30 ppl when he left. Built Hive and Cassandra.

Good resources:

“Hadoop: The Definitive Guide” by Tom White (must have)
“Hadoop Cluster Management” slides by Marco Nicosia’s 2009 USENIX talk

Hadoop: OSS for WSCs (warehouse-scale computers)

Typical cluster: 1U 2×4 core, 8GB RAM, 4×1TB SATA, 2×1 gE NIC; one switch per rack with 8 Gb intfc to backbone. Think 40-node-rack as unit.

HDFS: breaks files to 128MB, replicates blocks across nodes. W1RM design. checksumming, replication, compression included (tell you three times). Hooks in via Java, C, command line tools, FUSE, WebDAV, Thrift. Not usually mounted directly.

[how does it handle many small files? see HAR files below, see Common problems below, no statements about performance]

HDFS looks to diversly write blocks (across racks) using topology info.

MapReduce uses HDFS api to assign work to where the data is.

Avro: cross-language serialization for on-wire/RPC and persistence, includes versioning and security

HBase: Google’s BigTable lookalike on top of HDFS

Hive: SQL-like interface to structured data stored in HDFS. Replace DWH.

Pig: lang for dataflow programming.

Zookeeper: manage a distributed system

Good ways to dip your toes with Hadoop:

Projects:

Log or msg warehouse
DB archival store
ETL for DWH
Search team projects (autocomplete, did you mean, indexing)
Targeted web crawls (market research, etc)

Clusters:

use retired DB servers
use unused desktops
use EC2

[skipped a lot about how the project runs, apache voting, etc.]

Don’t run Hadoop across two data centers; one per and communicate at the app layer. [this sounds a lot like the rules for MPI et al ca. 1999-2000]

Make sure to use ECC RAM. High volume mem churn requires it.

Linux/CentOS “mildly preferred”

Mount local FS “noatime” for performance.

Recommend ext3 over xfs. Local FS performance improvements (e.g. xfs) don’t necessarily translate to global perf improvements (network bottlenecks consume it). Mentioned an xfs long-write problem.

JBOD over RAID0; slightly better performance and losing a disk doesn’t suck as much.

Java 6 update 14 or later (update 14 makes 64-bit pointers as cheap as 32-bit).

Installation: http://www.cloudera.com/hadoop

“In our distribution we put [things] where they ought to be.” Register with init.d, etc.

Configuration: http://my.cloudera.com/

You spec topology and whether JT/NN live on same machine, it spits out the rest. Hangs on to it for you, too.

Config modes

Standalone mode:

Everything in one JVM
Only one reducer, so you might not be able to find the bug

Pseudo-dist mode:

All daemons on one box using socket IPC

Dist mode:

For production

Config files

xml based
org.apache.hadoop.conf has Configuration class
Later resources overwrite earlier; “final” keyword prevents overwrite
common-site.xml, hdfs-site.xml, mapred-site.xml
Look in .template for examples

Cloudera admins their soft-layer cluster with Puppet “with varying level of success”. He’s seen Chef, cfengine, bcfg2, and others.

Problems in config:

“The problem is almost always DNS”—Todd Lipcon
Open the necessary ports (many) in firewall
Disting ssh keys (Cloudera uses expect)
directory permissions (writing logs)
Use all your disks!
Don’t try to use NFS for large clusters
JAVA_HOME set right (esp. on Macs)

Nehalems ~2x performance improvement

HDFS NameNode (“the master”)

VERSION file specs layoutVersion (negative number, decrements for each new). You hope this doesn’t change much; upgrade is painful

NN manages fs image (inode map, in mem) and edit log (journal, to disk).

Secondary NN (on different node) aka checkpoint node (v0.21): replays journal and tells primary to forget some history to prevent the edit log from becoming ridiculously large.

Backup node: write same data to NFS to recover if local node blows up

DataNode: round-robins blocks across all nodes.

Heartbeats to the nodes
dfs.hosts[.exlcude] to allow/deny clients

Client:

Use Java libs or command line
libhdfs c library lacks features and has memory leaks (and FUSE interface uses it)
Client only contacts NN for metadata
Client keeps distance-ranked list of block locations for data reads
Client maintains write queues: data queue and ack queue (writes three times, can’t forget request until all three are ack’d).
First datanode in write takes responsibility for pass-down-the-line write requests rather than having client spray data at all 3/n data nodes expected to write.

Can’t seek and write, nor append. So you create new each time.

HDFS Operator Utilities

Safe mode

Loads image file, applies edit log, creates new (empty) edit log
Datanodes send blocklists to NN
NN uses this during startup, will only service metadata reads while in safe mode
Exits safe mode after 99.9% of blocks have reported in (configurable); only one replica of block must be known (can rereplicate)

FS Check (hadoop fsck)

Just talks to NN to look at metadata
Looks for minimally rep’d, over/under rep’d blocks
Identify missing replicas and rereplicate, blocks with 0 replicas (corrupt files)
hadoop fsck /path/to/file -files -blocks to determine blocks for file
Run ~1 hr in production, store output

dfsadmin

admin quotas
add/remove datanodes
ckpoint fs image
monitor/manage fs upgrade

DataBlockScanner

cksum local blocks (with bandwidth throttling)
Runs ~3 weeks (configurable)

Balancer

goes thru cluster, makes disk utilization scores per datanode
rebalances if nodes are more than +/- 10% (with throttling)

Archive Tool

HAR file: like tar file, many entries in one HDFS namespace
Makes two index files and many part files (hopefully less than # of files you’re har’g)
Index files are used for lookup into part files
Doesn’t support compression and are W1RM.

distcp

Move large amounts of data in parallel
Implemented as MapReduce with no reducers
Can move data between data centers with this; can also saturate the network pipe

Quotas

apply to directories, not users or groups
namespace quotas constrain your use of the NN resources
diskspace quotas constrain your use of the datanodes’ resources
No defaults (can’t make new directories pick them up)

Users, Groups, Permissions

Relatively new
Very UNIXy
Executable bit means nothing on file
Need write on dir to add/remove files
need exec on dir to access child dirs
identity of NN process superuser

Audit logs

Not on by default, but useful for security

Topology

Uses to compute distance measures for replication
Node, Rack, Core Switch
Some work to infer from IP

Web UIs

There are many
NN @ port 50070: /metrics /logLevel /stacks
2NN @ port 50090
Datanode @ port 50075

HFDS Proxy: http server access for non-HDFS clients

ThriftFS: thrift server for non-HDFS clients

Trash:

Helps recover from bad rm’s (indavertent rm -rf happened on FB cluster)

Common Problems

Disk capacity: crank up reserved space, keep close eye on space, watch hadoop logfiles
Slow disks which aren’t yet dead: can’t see as fail, but you have to watch
NIC goes out of gig-E mode
ckpoint and backup data: keep an eye on 2NN node, watch NN edit log size
check NFS mount for shared NN data structure
Long writes (> 1 hr) can see things get freaky; break them down
HDFS layoutVersion upgrades are scary
Many small files can consume namespace: keep an eye on consumption

Turn on fairshare schedulers (Cloudera rus it out of the box)

Use distributed cache to send common libs to all nodes

JobControl: good way to express job depedencies

Run canary jobs (sort, dfs write) to test functional status

Upgrades are scary. This will be less true as it reaches 1.0

One admin can easily carry a medium (100-node) cluster. Most activity is around commission/decommission.

Try not to lose more than N nodes, where N is your replication factor. You could hit the jackpot on those being the only three replicas of some needed block.

← Previous Archive Next →

blog comments powered by Disqus

Published

22 June 2009