Yahoo! Developer Network Blog

« Previous | Main | Next »


May 4, 2009

Using ZooKeeper to tame system test for large-scale services

In the last release I took on the task of setting up a true system test environment for Apache ZooKeeper. Our previous environment ran the system test in a single JVM instance, which meant that there were some test scenarios that we just couldn't reproduce. In this new environment we wanted to be able to run tests across multiple hosts and deal with different numbers of machines and cluster environments.


My first attempt used ssh and scripts to fire off servers and clients on a cluster of machines. I soon realized that there was a significant amount of hardcoded configuration and envirnment
assumptions that would make the setup inflexible and impractical. So, I started over from scratch.


For the system test I wanted to use some set of machines, M, to host some number of ZooKeeper servers, S, and some number of clients C. Ideally the system test could just discover M at runtime, startup S servers on some subset of M and then startup C clients on the rest. I realized this is a perfect application for ZooKeeper. I also realized that such a use case can apply to more than just system tests.


What I needed to do was very similar to the rendezvous problem: you have two processes that startup with no knowledge of where each other is running and they need to find each other. Common solutions to this involve service discovery by broadcasting on well known ports, but for our network topologies broadcast based solutions will not work. But, ZooKeeper makes rendezvous easy. Each process creates an ephemeral znode with contact information as a child of a previously agreed upon znode (the rendezvous point). For example, let's use "/systest/available" as the rendezvous point. If four hosts startup, /systest/available would have the following children:


/systest/available/host1
/systest/available/host2
/systest/available/host3
/systest/available/host4


I can now see what machines are available for use in our system test.


Now that I know the machines that I can use, I want to start assigning work to them. The Java process that created /systest/available/host1 also created /systest/assignments/host1 and is watching that znode for children to appear. So when I need to assign task1 to a machine and assuming host1 is the least loaded machine, the system test will create the znode /systest/assignments/host1/task1, which will contain information about what class to instantiation and start as well as configuration parameters. The creation of /systest/assignments/host1/task1 will trigger a watch on the agent process running on that machine. The agent running on that machine will read the information about the
new task and start it. I also use the task znode to stop tasks, change their configuration, and get task status.


The end result is a very clean system test environment. The test classes themselves just interact through ZooKeeper, so there aren't any hardcoded assumptions about the environment needed. This also serves as a nice illustration of how Apache ZooKeeper can be used for more than just leader election and locking. Of course the next step would be to combine this scheme with something like OSGi to provide automatic management of the class dependences of the instances.


See the Apache ZooKeeper website to give it a try. The system test describe here is in the latest release: 3.1.1

Ben Reed
Yahoo!

Posted at May 4, 2009 3:21 PM

Bookmark this on Delicious

Comments

I am attempting to use the zookeeper systest provided in 3.1.1 and discussed in this article. However, the test container is failing to contact the local server. I have summarized the instructions in /zookeeper-3.1.1/src/java/systest/README.txt to the following shell script:

# -------------------------------------------------------
# copy resources to ~/tmp
# -------------------------------------------------------
single-machine.cfg
zookeeper-3.1.1-fatjar.jar
zookeeper-dev.jar

# -------------------------------------------------------
# run a zookeeper standalone instance (cluster is ok too)
# -------------------------------------------------------

java -jar zookeeper-3.1.1-fatjar.jar server single-machine.cfg > zookeeper.log & 2>&1

# -------------------------------------------------------
# on each host start the system test container
# -------------------------------------------------------

java -jar zookeeper-3.1.1-fatjar.jar ic TODDG01LT 2181 /sysTest > container.log & 2>&1

# -------------------------------------------------------
# Error, the container cannot connect to the server
# -------------------------------------------------------

org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /sysTest/assignments
at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:518)

I'm attempting to sort this out, and it seems that FatJarMain is invoked by the jar, and that it is looking for classes to start...

#FatJarMain.java
* This is a generic Main class that is completely driven by the
* /mainClasses resource on the class path. This resource has the
* format:
*


* cmd:mainClass:Description
*

I'll hook up a debugger shortly to see why the container is failing...

Posted by: Todd Greenwood at May 18, 2009 4:27 PM | Permalink

Found the error, this line in my script should be host:port:

java -jar zookeeper-3.1.1-fatjar.jar ic todd_container localhost:2181 /sysTest > container.log & 2>&1

So, now that the zkserver and container are running, the test is run with:

# -------------------------------------------------------
# initiate the system test using the fatjar
# -------------------------------------------------------
java -jar zookeeper-dev-fatjar.jar systest org.apache.zookeeper.test.system.SimpleSysTest | tee ./log/simplesystest.log

This works, although the systest output is a bit hard to use, and it's not immediately clear to me how to configure this system test to run against an ensemble of zookeeper servers.

Posted by: Todd Greenwood at May 19, 2009 1:56 PM | Permalink

Post a comment

Comment Policy: We encourage comments and look forward to hearing from you. Please note that Yahoo! may, in our sole discretion, remove comments if they are off topic, inappropriate, or otherwise violate our Terms of Service.

Remember Me?

Hadoop is a trademark of the Apache Software Foundation.

Copyright © 2010 Yahoo! Inc. All rights reserved. Copyright | Privacy Policy

Help us continue to improve the Yahoo! Developer Network: Send Your Suggestions