Some more notes on the different types and aspects of neural net architecture - what to use and when.
"Common columnar data has a static structure to it, and is best modeled in DL4J by a classic multilayer perceptron neural network. These problems may benefit from minor feature engineering, but we can often let the network find the best weights on the dataset on its own. Hyperparameter tuning is one of the main challenges when modeling with MLPs." 
"While we recommend the practitioner start with CNNs for image modeling problems, application of these networks have begun to appear for text modeling problems as well."  For example machine translation, sentence classification and sentiment analysis.
"Normalization is a broad category of methods that seek to make different samples seen by a machine-learning model more similar to each other, which helps the model learn and generalize well to new data. The most common form of data normalization is ... centering the data on 0 by subtracting the mean from the data, and giving the data a unit standard deviation by dividing the data by its standard deviation. In effect, this makes the assumption that the data follows a normal (or Gaussian) distribution and makes sure this distribution is centered and scaled to unit variance
"data normalization should be a concern after every transformation operated by the network... The main effect of batch normalization is that it helps with gradient propagation—much like residual connections—and thus allows for deeper networks." 
"A bottleneck layer is a layer that contains few nodes compared to the previous layers. It can be used to obtain a representation of the input with reduced dimensionality. An example of this is the use of autoencoders with bottleneck layers for nonlinear dimensionality reduction" (StackOverflow).
Recurrent Neural Networks
"Recurrent Neural Networks are in the family of feed-forward neural-networks. They differ from other feed-forward networks in their ability to send information over time-steps... Modeling the time dimension is a hallmark of Recurrent Neural Networks." 
"That is, a feedforward network has no notion of order in time, and the only input it considers is the current example it has been exposed to. Feedforward networks are amnesiacs regarding their recent past; they remember nostalgically only the formative moments of training.
Recurrent networks, on the other hand, take as their input not just the current input example they see, but also what they have perceived previously in time" (DeepLearning4J).
RNNs do this by feeding their hidden state (output) from time step t-1 back into the network alongside the input at time step t.
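The recurrence can be sketched as a plain loop over time steps. This is a minimal simple-RNN forward pass under assumed names (rnn_forward, w_x, w_h, b) and a tanh activation; it is not the DL4J implementation, and it leaves out training entirely.

```python
import numpy as np

def rnn_forward(inputs, w_x, w_h, b):
    """Run a simple RNN over a sequence of input vectors, one step at a time."""
    h = np.zeros(w_h.shape[0])        # state before the first time step
    states = []
    for x_t in inputs:
        # The new state depends on the current input AND the previous state.
        h = np.tanh(w_x @ x_t + w_h @ h + b)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
timesteps, input_dim, hidden_dim = 5, 3, 4   # hypothetical sizes
inputs = rng.normal(size=(timesteps, input_dim))
w_x = rng.normal(size=(hidden_dim, input_dim))
w_h = rng.normal(size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)
states = rnn_forward(inputs, w_x, w_h, b)
```

Because each state is computed from the previous one, feeding the same inputs in a different order generally produces a different final state, which is what gives the network its notion of order in time.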
"A recurrent neural network can be thought of as multiple copies of the same network, each passing a message to a successor... This chain-like nature reveals that recurrent neural networks are intimately related to sequences and lists. They’re the natural architecture of neural network to use for such data... Essential to these successes is the use of “LSTMs,” a very special kind of recurrent neural network which works, for many tasks, much much better than the standard version." (Chris Olah's blog).
"When we see data that is the combination of both time and image (eg video) we use a special hybrid network of Long Short-Term Memory (LSTM) and convolutional layers.
"Generally, you want to use multiple epochs and one iteration (.iterations(1) option) when training; multiple iterations are generally only used when doing full-batch training on very small data sets" (DeepLearning4J).
"If the input is a fixed-length sequence (ie all time-series are the same length), you might as well use an MLP or CNN." 
"As long as the activation function is something nonlinear, a neural network with at least one hidden layer can approximate arbitrary functions... The nonlinear neural network model with a hidden layer ... is flexible enough to approximately represent any function! What a time to be alive!"  This is called the Universal Approximation Theorem.
See Chris Olah's blog: "Each layer stretches and squishes space, but it never cuts, breaks, or folds it... Transformations like this, which don’t affect topology, are called homeomorphisms. Formally, they are bijections that are continuous functions both ways... Layers with N inputs and N outputs are homeomorphisms, if the weight matrix, W, is non-singular."
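One way to see the "bijection" part concretely: with a non-singular weight matrix and a tanh activation, a layer can be undone exactly. The sketch below assumes a 2-input, 2-output tanh layer with hand-picked weights; the function names layer and layer_inverse are made up for the example.

```python
import numpy as np

w = np.array([[2.0, 1.0], [1.0, 1.0]])   # determinant is 1, so W is non-singular
b = np.array([0.5, -0.5])

def layer(x):
    """Forward pass: linear map, shift, then tanh."""
    return np.tanh(x @ w.T + b)

def layer_inverse(y):
    """Undo tanh, undo the shift, then undo the linear map."""
    return (np.arctanh(y) - b) @ np.linalg.inv(w).T

points = np.array([[0.0, 0.0], [0.3, -0.2], [-0.5, 0.4], [1.0, -1.0]])
recovered = layer_inverse(layer(points))
```

If W were singular, np.linalg.inv would fail and distinct inputs could map to the same output, so the layer would no longer be invertible.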
"If you do not put an activation function between two layers, then two layers together will serve no better than one, because their effect will still be just a linear transformation... The reason why people use ReLU between layers is because it is non-saturating (and is also faster to compute). Think about the graph of a sigmoid function. If the absolute value of x is large, then the derivative of the sigmoid function is small, which means that as we propagate the error backwards, the gradient of the error will vanish very quickly as we go back through the layers. With ReLU the derivative is 1 for all positive inputs, so the gradient for those neurons that fired will not be changed by the activation unit at all and will not slow down the gradient descent.
"For the last layer of the network the activation unit also depends on the task. For regression you will want to use the sigmoid or tanh activation, because you want the result to be between 0 and 1. For classification you will want only one of your outputs to be one and all others zeros, but there's no differentiable way to achieve precisely that, so you will want to use a softmax to approximate it" (from StackOverflow).
"You don’t have to worry too much about which activation function is better under what circumstances. That’s still an active research topic... Usually, the best one is chosen by using cross-validation to determine which one gives the best model, given the dataset you’re working with." 
 Deep Learning: A Practitioner's Approach
 Deep Learning with Python
 Machine Learning with TensorFlow