Replication vs. Partitioned Caches
From the (slightly old) Oracle Coherence 3.5:
Partitioned caches use divide and conquer: unlike replicated caches, no single node holds the entire data set. This makes them better suited to write-heavy systems, since "the Replicated Cache service needs to distribute the operation to all the nodes in the cluster and receive confirmation from them that the operation was completed successfully".
The downside of partitioned caches is that access is more expensive. "It is important to note that if the requested piece of data is not managed locally, it will always take only one additional network call to get it. Another thing to consider is that the objects in a partitioned cache are always stored in a serialized binary form. This means that every read request will have to deserialize the object, introducing additional latency."
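The "only one additional network call" guarantee follows from the key-to-owner mapping being deterministic: any node can compute where a key lives without asking anyone else. Here is a toy sketch of that idea (the node names and partition count are illustrative assumptions; Coherence's real partitioning strategy is more sophisticated):

```python
import zlib

NODES = ("node-1", "node-2", "node-3")
PARTITIONS = 257  # assumed partition count, similar to Coherence's default

def owner_of(key):
    """Deterministically map a key to a partition, and a partition to a node."""
    partition = zlib.crc32(key.encode()) % PARTITIONS
    return NODES[partition % len(NODES)]

def network_hops(requesting_node, key):
    """0 hops if the requester owns the key, exactly 1 otherwise.

    Because ownership is computable locally, a miss never turns into
    a cluster-wide search.
    """
    return 0 if owner_of(key) == requesting_node else 1
```

Contrast this with a replicated cache, where reads are always local but every write must reach every node.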
"You can also increase the number of backup copies to more than one, but this is not recommended by Oracle and is rarely done in practice. The reason for this is that it guards against a very unlikely scenario - that two or more physical machines will fail at exactly the same time.
"The chances are that you will either lose a single physical machine, in the case of hardware failure, or a whole cluster within a data center, in the case of catastrophic failure. A single backup copy, on a different physical box, is all you need in the former case, while in the latter no number of backup copies will be enough - you will need to implement much broader disaster recovery solution and guard against it by implementing cluster-to-cluster replication across multiple data centers." Oracle's Push Replication pattern is what we're using at the moment.
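The backup count is set per scheme in the cache configuration file. A fragment along these lines (the cache and scheme names are examples, not from the book) shows where the knob lives:

```xml
<cache-config>
  <caching-scheme-mapping>
    <cache-mapping>
      <cache-name>example-*</cache-name>
      <scheme-name>example-distributed</scheme-name>
    </cache-mapping>
  </caching-scheme-mapping>
  <caching-schemes>
    <distributed-scheme>
      <scheme-name>example-distributed</scheme-name>
      <!-- One backup copy is the default and, per the book, all you
           normally need; raising it only guards against multiple
           machines failing at exactly the same time. -->
      <backup-count>1</backup-count>
      <backing-map-scheme>
        <local-scheme/>
      </backing-map-scheme>
      <autostart>true</autostart>
    </distributed-scheme>
  </caching-schemes>
</cache-config>
```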
What you see is not necessarily what you get.
"If your data set is 1 GB and you have one backup copy for each object, you need at least 2GB of RAM.
"The cache indexes you create will need some memory.
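Putting those points together, a back-of-envelope sizing calculation looks like this. The overhead and headroom ratios below are illustrative assumptions for the sketch, not Coherence recommendations:

```python
def required_heap_gb(data_gb, backup_count=1,
                     index_overhead=0.10, headroom=0.50):
    """Estimate total cluster heap needed for a given primary data set.

    Primary data is multiplied by (1 + backup_count), then padded for
    index memory and for the free space the GC needs to run smoothly.
    """
    stored = data_gb * (1 + backup_count)         # primaries + backups
    with_indexes = stored * (1 + index_overhead)  # index memory
    return with_indexes * (1 + headroom)          # free space for GC
```

With 1 GB of data and one backup, storage alone already requires 2 GB, and the real figure is higher once indexes and GC headroom are counted.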
The dangers of OOMEs are acute:
"You need to leave enough free space... or a frequent garbage collection will likely bring everything to a standstill, or worse yet, you will run out of memory and most likely bring the whole cluster down - when one node fails in a low-memory situation, it will likely have a domino effect on other nodes as they try to accommodate more data than they can handle".
I can testify that this is indeed the case, having seen it happen first-hand.
The book claims that too much memory can also be a problem. I can't vouch for this yet as we're only just moving to larger memory usage and it remains to be seen where the sweet spot is.
"Once the heap grows over a certain size (2GB at the moment for most JVMs), the garbage collection pause can become too long to be tolerated by users of an application, and possibly long enough that Coherence will assume the node is unavailable and kick it out of the cluster...
"Because of this, the recommended heap size for Coherence nodes is typically in the range of 1 to 2 GB, with 1GB usually being the optimal size that ensures that garbage collection pauses are short enough to be unnoticeable."
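In practice that advice translates into the JVM flags used to launch each storage node. A hedged example (the classpath entries are placeholders; `DefaultCacheServer` is Coherence's standard storage-node launcher):

```shell
# Start a Coherence storage node with a fixed 1 GB heap.
# Setting -Xms equal to -Xmx avoids heap-resizing pauses; 1 GB is the
# book's suggested sweet spot for keeping GC pauses unnoticeable.
java -server -Xms1g -Xmx1g \
     -cp coherence.jar:your-app.jar \
     com.tangosol.net.DefaultCacheServer
```

The corollary is that large data sets are handled by running more 1 GB nodes, not bigger ones.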
The book was written in 2009 and GC technology may have improved so YMMV.
Nodes leaving the cluster
The advantage of using a Coherence*Extend client is that its membership in the cluster is not as critical, since it connects through a proxy rather than joining as a full cluster member.
"When a node fails, the Partitioned Cache service will notify all the other nodes to promote backup copies of the data that the failed node had primary responsibility for, and to create new backup copies on different nodes.
"When the failed node recovers, or a new node joins the cluster, the Partitioned Cache service will fail back some of the data to it by repartitioning the cluster and asking all of the existing members to move some of their data to the new node".
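The quoted failover and failback steps can be modelled with a toy simulation (this is my own sketch of the described behaviour, not Coherence internals: each key gets a primary copy on one node and a backup on another, and membership changes trigger a full repartition):

```python
import zlib

def _partition(key, n):
    return zlib.crc32(key.encode()) % n

class Cluster:
    """Toy partitioned cache: one primary and one backup copy per key."""

    def __init__(self, nodes):
        # node name -> {"primary": {key: value}, "backup": {key: value}}
        self.nodes = {n: {"primary": {}, "backup": {}} for n in nodes}

    def _placement(self, key):
        names = sorted(self.nodes)
        i = _partition(key, len(names))
        return names[i], names[(i + 1) % len(names)]  # primary, backup

    def put(self, key, value):
        primary, backup = self._placement(key)
        self.nodes[primary]["primary"][key] = value
        self.nodes[backup]["backup"][key] = value

    def get(self, key):
        for store in self.nodes.values():
            if key in store["primary"]:
                return store["primary"][key]
        return None

    def _rebalance(self):
        # Gather every surviving copy, then re-place each key so it
        # again has one primary and one backup among current members.
        data = {}
        for store in self.nodes.values():
            data.update(store["backup"])
            data.update(store["primary"])
        for store in self.nodes.values():
            store["primary"].clear()
            store["backup"].clear()
        for key, value in data.items():
            self.put(key, value)

    def fail(self, dead):
        # Keys owned by the dead node survive via their backup copies,
        # which the rebalance promotes; fresh backups are then created.
        self.nodes.pop(dead)
        self._rebalance()

    def join(self, node):
        # Fail-back: existing members hand some data to the newcomer.
        self.nodes[node] = {"primary": {}, "backup": {}}
        self._rebalance()
```

Because the primary and backup for a key are always on different nodes, a single node failure never loses data; losing two nodes at once can, which is exactly the scenario the backup-count discussion above weighs up.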