Saturday, July 31, 2010

There is nothing new under the sun

One of the team (AJ) was sent on a Test Driven Development course this week and, on his return, gave a show-and-tell at the weekly retrospective. Some interesting points came out of the subsequent discussion.

AJ said that he had been taught that the definition of "legacy code" was code that has no tests (after a quick Google, I assume this came from Michael Feathers' book Working Effectively with Legacy Code).

But, I pointed out, if we accept this definition then all test code is itself legacy code. Who has tests for their tests - Quis custodiet ipsos custodes?

So, what is the best way to avoid the pain often associated with "legacy" code? After all, some of our tests are now over three years old. We decided:
  • Make methods so short that unit testing them becomes almost trivial.
  • If a method is truly trivial, there is little point in testing it. (We don't like unit testing everything, since that sounds like we're just ticking check boxes for the Time and Motion man.)
  • Do as little mocking as possible, since writing expectations can lead to hard-to-maintain code. Instead, overriding methods of the (production) class that call other classes, make network calls etc with methods in a (test) subclass is preferred.
  • Continue using EasyMock on the projects that already use it, but use Mockito for new projects, since test code is much more succinct with Mockito. One team member was alarmed that we would then have two mocking frameworks, but was won over by the rest of the team, who had used Mockito and said it was trivial to learn.
  • If we just quickly get a failing test to pass rather than thinking about why it's failing, the test serves no purpose.
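To make the subclass-override point concrete, here is a minimal sketch (the class and method names are invented for illustration): the production class isolates the collaborator-facing call in a protected method, and a test subclass replaces it with a canned value - no mocking framework required.

```java
// Hypothetical production class: the network call is isolated in its own
// overridable method so the surrounding logic can be tested in isolation.
class PriceService {
    public double quoteWithMarkup(String symbol) {
        return fetchPrice(symbol) * 1.05;   // the logic under test
    }

    protected double fetchPrice(String symbol) {
        // In production this would make a network call (elided here).
        throw new UnsupportedOperationException("real network call elided");
    }
}

// Test subclass: overrides the expensive method with a canned response.
class StubbedPriceService extends PriceService {
    @Override
    protected double fetchPrice(String symbol) {
        return 100.0;   // canned value, no network involved
    }
}
```

A test then exercises `quoteWithMarkup` on a `StubbedPriceService` and asserts on the result, with no expectations to maintain.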
AJ also mentioned Acceptance Testing. This caused some confusion as to what exactly this was. For those of us who were brought up on Pressman, the definition is:

Conducted by the end user rather than the system developer, an acceptance test can range from an informal "test drive" to a planned and systematically executed series of tests.
(Software Engineering: A Practitioner's Approach, European Adaptation, Fourth Edition, p522 - Roger S Pressman)

More than once, I've heard people use the term "acceptance tests" and assume that their audience knows they mean automated regression acceptance tests of the kind encouraged by frameworks such as FitNesse. Pressman's definition is more general. So, I started Googling for formal definitions of the terms the team was discussing.


Test Driven Development

The notable characteristics of TDD are:
  • automated testing [1]
  • regression testing
  • black- and white-box testing [2]
  • unit testing [1]
  • test-first programming [1] ("write a little test that doesn't work, and perhaps doesn't even compile to begin with").
[1] Test Driven Development: By Example - Kent Beck
[2] http://www.threeriversinstitute.org/Testing%20Dichotomies%20and%20TDD.htm

The ubiquitous JUnit is an example of a TDD framework.

Team observations: test-first programming is nice but should not be dogmatically enforced; and just because you're using JUnit doesn't mean you're unit testing.
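For the avoidance of doubt, the test-first cycle can be sketched without any framework at all (Money is the running example from Beck's book; this compressed, framework-free version is mine):

```java
// Step 1 (red): the test is written first and fails - or, as Beck says,
// "perhaps doesn't even compile to begin with", because Money.add does
// not exist yet:
//
//     assert new Money(5).add(new Money(3)).amount == 8;
//
// Step 2 (green): the simplest production code that makes the test pass.
class Money {
    final int amount;

    Money(int amount) {
        this.amount = amount;
    }

    Money add(Money other) {
        return new Money(this.amount + other.amount);
    }
}
// Step 3 (refactor): remove duplication, then write the next little test.
```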


Behaviour Driven Development
Behaviour Driven Development (to give it its British spelling) was a term coined by fellow Brit Dan North. It is a methodology that brings together:
  • automated testing [3]
  • regression testing [3]
  • black-box testing
  • acceptance testing [3]
[3] http://www.slideshare.net/skillsmatter/bdd-introduction

It distinguishes itself from other methodologies in that "acceptance criteria should be executable". To this end, it uses a domain-specific language that looks like natural language, which makes the tests easy for the business to read.

JBehave is an example of a framework that facilitates this.

Team observation: these tests take a long time to run, which can hamper productivity; testers and business people are supposed to write the tests, but that very rarely happens and instead the developer does it.


This is a very incomplete list of terms. Ideally, I want to launch Gimp and put together a matrix of which methodologies include which elements. Although they often come with some very nice frameworks, new methodologies often claim to be paradigm shifts (Dan North calls BDD "a second generation ... methodology" [3]). Really, they are emphasising ideas that have long been established.

Wednesday, July 28, 2010

ActiveMQ Patch Accepted

Yay! The Apache ActiveMQ team have accepted my patch!

This fixes a bug that prevented the consumption of relevant messages from a Topic, for which we have a durable subscription, when it was full of messages we were not interested in.

The bug manifested itself as the consuming thread simply hanging despite perfectly good messages being added to the JMS Topic.

Fortunately, ActiveMQ comes with MBeans that are easily activated. This allowed us to watch what the little blighter was doing with the much under-utilized JConsole. Consumption stopped when the number of irrelevant messages in the Topic hit 100 - the default size of the buffer. This was the clue that prompted me to go and look in the code at where my consuming thread was blocking (JConsole also shows where all the JVM's threads are).

Incidentally, I'm surprised how few really good Java developers use JConsole - despite the fact you get it for free with your JDK. For monitoring an application in production, you can't beat it. Other commonly used libraries often come with MBeans ready to deploy. For instance, did you know that you can change the Log4J logging threshold at runtime? If you want to make your own MBeans and you're using Spring, it couldn't be easier. You just need a org.springframework.jmx.export.MBeanExporter in your context and to annotate your class with @ManagedResource.
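Under the hood, Spring's MBeanExporter is registering your beans with the same platform MBeanServer that you can drive by hand. A minimal raw-JMX sketch of the idea (the LogConfig bean and its ObjectName are invented for illustration) looks like this:

```java
import java.lang.management.ManagementFactory;
import javax.management.Attribute;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class JmxSketch {

    // Standard MBean convention: the management interface must be named
    // <ImplementationClass>MBean and be implemented by the class.
    public interface LogConfigMBean {
        String getThreshold();
        void setThreshold(String threshold);
    }

    public static class LogConfig implements LogConfigMBean {
        private volatile String threshold = "INFO";
        public String getThreshold() { return threshold; }
        public void setThreshold(String threshold) { this.threshold = threshold; }
    }

    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        ObjectName name = new ObjectName("myapp:type=LogConfig");
        server.registerMBean(new LogConfig(), name);

        // The bean is now visible in JConsole; its attributes can be read
        // and changed at runtime, e.g.:
        server.setAttribute(name, new Attribute("Threshold", "DEBUG"));
        System.out.println(server.getAttribute(name, "Threshold"));
    }
}
```

Spring's annotations simply spare you the naming convention and the manual registration.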

Saturday, July 10, 2010

Brief comparison of locking between Oracle, Postgres and Sybase

I'm always surprised at how little most server-side developers know about databases, even when all but the most trivial applications depend on them. And I am not talking about understanding SQL (worthy cause though that is) but understanding what is happening in the DB regarding locks. You generally don't see problems at this level during testing, but under a non-trivial load they are a major concern.

So, partly to remind myself, here are some notes on the databases of which I have experience.

Sybase uses (exclusive) write and (shared) read locks extensively. If you update any data, you have an exclusive lock on that data until the end of the transaction. No other thread can even read it. "That data" can be anything. It doesn't need to be a row. It could be a table or an index. So, perhaps surprisingly, these two threads block each other:

TX1: begin tran
TX2: begin tran
TX1: insert into X values ('y')
TX2: insert into X values ('y')   -- blocked by TX1
TX1: commit tran                  -- TX2 released
TX2: commit tran

They are dealing with different rows but TX1 holds an exclusive lock on the table.

Oracle and Postgres deal with things differently. They use Multiversion Concurrency Control. At the start of a transaction, you get something similar to a snapshot of the database that you can play with until you "merge" your changes into the "trunk" (note: this is only a loose analogy and there are areas where it breaks down - see later).

So, these two (Oracle) processes don't block each other:

TX1: set transaction isolation level read committed ;
TX2: set transaction isolation level read committed ;
TX1: insert into X values ('y')
TX2: insert into X values ('y')   -- not blocked
TX1: commit;
TX2: commit;

They only block each other when they really are contending for the same row:

TX1: set transaction isolation level read committed ;
TX2: set transaction isolation level read committed ;
TX1: update X set Y ='y' where Z = z;
TX2: update X set Y ='y' where Z = z;   -- blocked by TX1
TX1: commit;                            -- TX2 released
TX2: commit;

But (unlike Sybase) TX2 is not blocked in Oracle or Postgres in this scenario:

TX1: set transaction isolation level read committed ;
TX2: set transaction isolation level read committed ;
TX1: update X set Y ='y2' where Z = z;
TX2: select Y from X where Z = z;       -- not blocked; sees the old value for Y
TX1: commit;
TX2: commit;

Sybase would block TX2 on the select until TX1 committed or rolled back. This leads to an interesting, unintuitive corner case in Sybase that I have seen. Since a shared read lock is still a lock, you can have deadlocks between two transactions even if one of those transactions is read-only. For instance, if I issue a select * where ... on a table, I start locking rows in that table. If another transaction updates a row that my select has not yet reached and then tries to update a row my select already holds a lock on - deadlock. Sybase chooses one transaction to be the loser. By default, this is the transaction that has done the least work.

So far, I've discussed the READ COMMITTED isolation level - the default on all three databases. The fun starts when you use SERIALIZABLE.

In Sybase:

TX1: set transaction isolation level 3   -- SERIALIZABLE
TX1: begin tran
TX1: select count(z) from X
TX2: set transaction isolation level 1   -- READ_COMMITTED
TX2: begin tran
TX2: insert into X values ('y')          -- blocked by TX1
TX1: commit;                             -- TX2 released
TX2: commit;

Note: if TX1 had been at isolation level 1 (READ_COMMITTED) there would have been no blocking in TX2.

Unsurprisingly, Oracle and Postgres don't block:

TX1: set transaction isolation level serializable;
TX2: set transaction isolation level serializable;
TX2: insert into X values ('y');
TX1: select count(z) from X              -- sees value 'w', say
TX2: insert into X values ('y')
TX1: select count(z) from X              -- sees value 'w' too
TX2: commit;
TX1: select count(z) from X              -- still sees value 'w' !
TX1: commit;

So, although Sybase blocks (which can be somewhat inefficient with anything but the shortest transactions), it is consistent. In Oracle and Postgres, you will continue to see the same value for as long as the transaction continues even though the data is stale and a perfectly good change has been successfully committed.

Oracle has to keep hold of this snapshot to maintain the SERIALIZABLE contract for potentially forever (or at least until you get "ORA-01555 snapshot too old"...).

And whereas Sybase guarantees that your view of the data is always SERIALIZABLE, Oracle and Postgres do not.


TX1: set transaction isolation level serializable;
TX2: set transaction isolation level serializable;
TX1: delete from X where Z=z;
TX2: update X set Y='y2' where Z=z       -- ERROR!
TX1: commit;

Oracle gives you:

Error: ORA-08177: can't serialize access for this transaction

and Postgres gives you:

ERROR: could not serialize access due to concurrent update
SQL state: 40001

Interestingly, the SERIALIZABLE criterion is defined as met if the final database state is the same as if all the transactions had been run one after another with no overlap. Note that the DB implementation is not obliged to actually do this, only that the final state must be equivalent to this behaviour. The Sybase programmers chose to guarantee this every time. The Oracle and Postgres developers chose to make a best effort and to throw an exception if the contract could not be honoured.

There will always be subtle differences between implementations. So, Java developers, get to know a little about your database of choice. (Oh, and keeping transactions as short as possible is a good idea in all three databases.)

Aside: it's a pity that the exceptions in the JDBC spec are so non-specific. I get a SQLException if the database is down (from which I can't really recover), and I get a SQLException if my transaction was deadlocked (from which I can recover, namely by trying again). A missed opportunity.
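In the absence of more specific exception types, about the best you can do is inspect SQLException.getSQLState(): deadlock victims and serialization failures report SQLState class 40 (e.g. the 40001 we saw from Postgres above). A sketch of a retry helper along those lines (TxRetry is my invention, not a real API):

```java
import java.sql.SQLException;
import java.util.concurrent.Callable;

// Hypothetical helper: retry a unit of work when the SQLState indicates the
// transaction was a deadlock/serialization victim (class 40xxx); rethrow
// immediately on anything else (e.g. the database being down).
class TxRetry {
    static <T> T withRetry(Callable<T> work, int maxAttempts) throws Exception {
        SQLException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return work.call();
            } catch (SQLException e) {
                String state = e.getSQLState();
                if (state == null || !state.startsWith("40")) {
                    throw e;   // unrecoverable: propagate
                }
                last = e;      // chosen as a victim: loop and try again
            }
        }
        throw last;            // retries exhausted
    }
}
```

In real code each attempt would be its own transaction (begin, work, commit), and you would probably back off between attempts rather than retrying immediately.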

Sunday, July 4, 2010

Concurrent modification of data structures

Sharing data is something that seems easy but has many subtle problems. When I architected a credit rating system in the past, I chose to make all the domain objects immutable. Although most developers were fine with this approach, it made a number of others antsy as manipulating the data became a little more unwieldy. By definition, they couldn't change the data (the remedy: make a similar object with the new field values) and they felt the constructors took too many arguments, since everything had to be set at object creation time (the remedy: decompose the model).

Anyway, I'm glad we went with that architecture all those years ago since now I am on a completely different project, I have to maintain code that did not adopt this pattern. As a result, things are messy.

As an aside, hire good developers. Good architects and good managers are for naught if the guys at the code face are not naturals.

First off, in the current mutable architecture, threads were taking objects from the cache and manipulating them for their own purposes. This changed the underlying data. Oops. So this was remedied quite early on by making a copy using simple Java serialization.
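That copy-via-serialization remedy looks roughly like this (DeepCopy is an invented name; the technique is the standard round-trip through an in-memory stream):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Defensive deep copy via Java serialization: callers can mutate their copy
// without touching the cached original. Simple, but expensive for large graphs.
class DeepCopy {
    @SuppressWarnings("unchecked")
    static <T extends Serializable> T copy(T original)
            throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(original);   // serialize the whole object graph
        }
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            return (T) in.readObject();  // rehydrate an independent copy
        }
    }
}
```

The cost described below follows directly from this: every read pays for serializing and deserializing the entire object graph.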

But frequently copying large objects using serialization is surprisingly expensive. With 70 threads reading a moderately sized object (~100s of KB) and 20 updating it, each every 45s, my four-core Linux box is maxing out.

What's more, I'm seeing plenty of ConcurrentModificationExceptions in the logs as the threads performing updates to the data interfere with the threads performing the serialization. This forces a retry, which consumes yet more CPU cycles.

And to top it all, occasionally one read thread gets stuck in an infinite loop and consumes all the CPU because another thread changed a java.util.HashSet (read more about this pathological phenomenon here).

Sure, you could say the JavaDoc clearly says don't change non-thread-safe collections while another thread is reading them. But most developers don't reckon on somebody serializing an object of their class while somebody else updates a collection within it.

And sure, you could say use ConcurrentHashMap and its ilk ("They do not throw ConcurrentModificationException" and "all operations are thread-safe") but we never chose the implementation of our maps and sets - Hibernate did as it rehydrated the object.

These are innocent mistakes (albeit with potentially devastating impacts). Less forgivable perhaps are developers updating java.util.Sets with code like:

aSet.remove(anElement);
aSet.add(anElement);

The equality of our anElement object is based on its primary key, so that's sensible. But of course, in a shared-data world, this introduces a horrible race condition: for a moment, aSet does not contain anElement. Obvious when you think about it, but although this code was well unit tested, the developer never thought about concurrency. It only manifests itself as a bug when the system is under heavy load.

My patch involved using a ReentrantReadWriteLock, which allowed me to have shared ReadLocks and exclusive WriteLocks. Firing up JMeter and running my stress tests showed a moderate degradation in performance (5-10%) but, under heavy load, 0% errors (compared to 5% errors, due to race conditions, without this locking).
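Stripped of the project specifics, the shape of the patch was something like this (GuardedSet and its methods are invented for illustration): readers share the read lock and don't block each other, while writers take the exclusive write lock, so the remove/add pair above becomes atomic from any reader's point of view.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Many concurrent readers, exclusive writers, around a plain HashSet.
class GuardedSet<E> {
    private final Set<E> set = new HashSet<>();
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    boolean contains(E element) {
        lock.readLock().lock();       // shared: readers don't block each other
        try {
            return set.contains(element);
        } finally {
            lock.readLock().unlock();
        }
    }

    void replace(E element) {
        lock.writeLock().lock();      // exclusive: no reader sees the gap
        try {
            set.remove(element);      // the remove/add pair is now atomic
            set.add(element);
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```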

Ideally, I'd like to re-write this crufty code entirely and wrap changes to the underlying data structure in Command objects that can only be executed in one class. This would give me a centralized place to implement and manage the locking code. Unfortunately, access to the underlying data is currently scattered across this component.
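The Command-object rewrite I have in mind would look something like this sketch (all names invented): mutations become Command objects, and only one class executes them, so the locking policy lives in exactly one place.

```java
import java.util.HashSet;
import java.util.Set;

// A mutation of the shared data, expressed as an object.
interface Command<E> {
    void applyTo(Set<E> data);
}

// The single choke point: all changes go through execute(), so the
// synchronization policy is implemented and managed in one class only.
class DataStore<E> {
    private final Set<E> data = new HashSet<>();

    synchronized void execute(Command<E> command) {
        command.applyTo(data);
    }

    synchronized boolean contains(E element) {
        return data.contains(element);
    }
}
```

The racy remove/add pair from earlier then becomes a single Command, e.g. `store.execute(s -> { s.remove(x); s.add(x); });`, executed atomically.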

Alas, it's unlikely that I'll get to do this as for the client my current solution is "good enough".

As for improving the performance, I managed to squeeze out a 30% saving by copying only what was needed by the calling code, rather than copying a larger object graph and then returning a subset of that data. But that's less interesting.