Tuesday, July 17, 2018

Spark CLI nohup


You may package a JAR and use spark-submit to submit the code to Spark. But sometimes you just want to hack, and sometimes that hack might be a long-running query. How do you keep spark-shell running after you have gone home?

It took me some fiddling but this works (with a bit of help from StackExchange).

In Unix shell #1, do:

mkfifo my_pipe
nohup spark-shell YOUR_CONFIG < my_pipe > YOUR_OUTPUT_FILE

Now in Unix shell #2, do:

nohup cat YOUR_SPARK_SCALA > my_pipe 2>/dev/null &

You should now see the Spark shell jump into life.

Now, back in shell #1, press CTRL+z and type:

jobs

identify your application's job ID and type:

bg JOB_ID

Alternatively, following the advice in this StackOverflow answer, you can press CTRL+z, find the JOB_ID from jobs, then bg as above before calling:

disown -h %JOB_ID

You may now log off and go home.

Thursday, July 12, 2018

Are you better?


There is a certain amount of randomness in neural nets. Run it twice and you're unlikely to get the same answer. So, when making changes, how do we know things have improved?

I ran exactly the same Spark neural net code on slightly different data. The data represented the same use case but in one run I used L1 normalization and in the other L2. For each normalization technique, I ran the neural net 9 times. The accuracy looked like this:
Accuracy of a neural net. Red is when using data with L2 normalization, blue with L1.
At first blush, it looks like L1 is better but it's not certain. After all, some of the high values for L2 accuracy are higher than the low values for L1. And since each normalization has only 9 data points, could this apparent difference just be due to luck?

        Accuracy (%)    Std. Dev. (%)
L1      94.9            0.36
L2      94.3            0.42
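
Before doing anything formal, a quick back-of-the-envelope simulation gives a feel for how often a gap this size could appear by luck alone. This isn't part of the original analysis; the 94.6% and 0.4% below are just rough pooled guesses taken from the table above.

    import numpy as np

    np.random.seed(42)

    # Suppose (hypothetically) both runs share a true accuracy of 94.6% with a
    # true standard deviation of 0.4%. How often do two groups of 9 runs
    # differ by 0.6% or more purely by chance?
    true_mean, true_sd, n, trials = 94.6, 0.4, 9, 100000

    a = np.random.normal(true_mean, true_sd, size=(trials, n)).mean(axis=1)
    b = np.random.normal(true_mean, true_sd, size=(trials, n)).mean(axis=1)

    print("P(|difference| >= 0.6) = {0:.4f}".format(np.mean(np.abs(a - b) >= 0.6)))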


The Classical Approach

The standard deviation in a sample can be used to estimate the standard deviation of the whole population. The standard deviation of the mean is σ/√N (aka the standard error) where σ is the (true) standard deviation of the population and N is the number of measurements per trial [Boaz p772].

To demonstrate, let's take 5 observations 1 million times. We'll compare the calculated value with the results of a simulation in Python:

    import numpy as np

    N = 5
    Nsamp = 10 ** 6
    sigma_x = 2

    x = np.random.normal(0, sigma_x, size=(Nsamp, N))

    mu_samp = x.mean(1)  # 10 ** 6 values that are the mean of each row
    sig_samp = sigma_x * N ** -0.5

    print("{0:.3f} should equal {1:.3f}".format(np.std(mu_samp, ddof=1), sig_samp))

This outputs:

0.894 should equal 0.894

(Code stolen from here).

Using this relationship between the standard deviation of the samples and the true mean, we can then conduct a t-test.

"Using a t-test allows you to test  
  • whether the mean of a sample differs significantly from an expected value, or 
  • whether the means of two groups differ significantly from an expected value, or 
  • whether the means of two groups differ significantly from each other.
"[William 'Student'] Gosset's key observation was the dependence on sample size for determining the probability that the mean of the population lies within a given distance of the mean of the sample, if a normal distribution is assumed... Gosset noted that when samples are collected from a normal distribution, and if the number of samples is small, and these are used to estimate the variance, then the distribution (for the variable x):

t = (x̄ - μ) / (σ/√N)

is both flatter, and has more observations appearing in the tails, than a normal distribution, when the sample sizes are less than 30... This distribution is known as the t distribution and approximates a normal distribution if n (and by implication [degrees of freedom]) are large (greater than 30 in practical terms)." [1]
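
To see this fattening of the tails for yourself, here is a small sketch of my own (not from [1]) using scipy, comparing the two-sided tail probability of the t distribution at various degrees of freedom with that of a standard normal:

    from scipy import stats

    # For small degrees of freedom the t distribution puts noticeably more
    # mass beyond |t| = 2 than the normal does; by around 30 degrees of
    # freedom the two are practically indistinguishable.
    for df in (3, 9, 30, 100):
        print("df = {0:3d}: P(|T| > 2) = {1:.3f}".format(df, 2 * stats.t.sf(2, df)))

    print("normal    : P(|Z| > 2) = {0:.3f}".format(2 * stats.norm.sf(2)))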

All things being equal, "most of the time, you’d expect to get t-values close to 0. That makes sense, right? Because if you randomly select representative samples from a population, the mean of most of those random samples from the population should be close to the overall population mean, making their differences (and thus the calculated t-values) close to 0... The t-value measures the size of the difference relative to the variation in your sample data... T is simply the calculated difference represented in units of standard error" [from a blog here].

Calculating the t-value in Python is easy:

from scipy import stats

t_value, p_value = stats.ttest_ind(data.l1, data.l2, equal_var=False)
print("t = {0:.3f}, p = {1:.3f}".format(t_value, p_value))

t = 2.755, p = 0.014
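
To make explicit what ttest_ind is doing, here is a sketch that recomputes Welch's t and p by hand. The arrays below are hypothetical stand-ins generated from the summary statistics in the table (the raw accuracies aren't reproduced in this post), so the numbers won't match the 2.755 above, but the manual and scipy values should agree with each other:

    import numpy as np
    from scipy import stats

    # Hypothetical stand-ins for the two sets of nine accuracies.
    np.random.seed(0)
    l1 = np.random.normal(94.9, 0.36, size=9)
    l2 = np.random.normal(94.3, 0.42, size=9)

    # Welch's t: the difference of the means in units of its standard error.
    v1, v2 = l1.var(ddof=1) / len(l1), l2.var(ddof=1) / len(l2)
    t_manual = (l1.mean() - l2.mean()) / np.sqrt(v1 + v2)

    # Welch-Satterthwaite degrees of freedom, then p as the two-sided tail area.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (len(l1) - 1) + v2 ** 2 / (len(l2) - 1))
    p_manual = 2 * stats.t.sf(abs(t_manual), df)

    t_scipy, p_scipy = stats.ttest_ind(l1, l2, equal_var=False)
    print("t: {0:.3f} should equal {1:.3f}".format(t_manual, t_scipy))
    print("p: {0:.3f} should equal {1:.3f}".format(p_manual, p_scipy))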

The values for p (the probability of seeing a given t-value by chance) and the t-value itself are related. Given the distribution that represents the PDF for t, p is the area under that curve beyond the observed t-value, that is, the probability of getting a t at least that extreme if there were no real difference.

"Confidence limits are expressed in terms of a confidence coefficient. Although the choice of confidence coefficient is somewhat arbitrary, in practice 90%, 95%, and 99% intervals are often used, with 95% being the most commonly used." [NIST]

Given our data, it is unlikely that the two sets of accuracies were drawn from the same underlying distribution because p = 0.014. That is significant at the 95% (though not the 99%) confidence level.

[Aside: "As a technical note, a 95% confidence interval does not mean that there is a 95% probability that the interval contains the true mean. The interval computed from a given sample either contains the true mean or it does not. Instead, the level of confidence is associated with the method of calculating the interval. The confidence coefficient is simply the proportion of samples of a given size that may be expected to contain the true mean. That is, for a 95% confidence interval, if many samples are collected and the confidence interval computed, in the long run about 95 % of these intervals would contain the true mean." [NIST]]

The Bayesian Approach

"If you have taken a statistics course, you have probably been taught this technique (although not necessarily learned this technique... you may have felt uncomfortable with the derivation" [2].

So, let's now refer to Bayesian Methods for Hackers (at github). I assumed that the true standard deviation was uniformly distributed between 0% and 2%, and that the means lay uniformly anywhere between 90% and 99%. Finally, I assumed the values for the accuracy follow a Gaussian.

These assumptions may or may not be correct but they're not outrageous and PyMC should find the optimum answer anyway.
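
My actual notebook is the one mentioned below; what follows is just a rough PyMC3 sketch of the assumptions I've just described, run on hypothetical stand-in data, to show the shape of the model:

    import numpy as np
    import pymc3 as pm

    # Hypothetical stand-ins for the nine accuracies per normalization technique.
    np.random.seed(0)
    l1_acc = np.random.normal(94.9, 0.36, size=9)
    l2_acc = np.random.normal(94.3, 0.42, size=9)

    with pm.Model():
        # Uniform priors: means anywhere in 90-99%, standard deviations in 0-2%.
        mu_l1 = pm.Uniform("mu_l1", lower=90, upper=99)
        mu_l2 = pm.Uniform("mu_l2", lower=90, upper=99)
        sd_l1 = pm.Uniform("sd_l1", lower=0, upper=2)
        sd_l2 = pm.Uniform("sd_l2", lower=0, upper=2)

        # Gaussian likelihood for the observed accuracies.
        pm.Normal("obs_l1", mu=mu_l1, sd=sd_l1, observed=l1_acc)
        pm.Normal("obs_l2", mu=mu_l2, sd=sd_l2, observed=l2_acc)

        # The quantity of interest: the difference between the two means.
        pm.Deterministic("diff", mu_l1 - mu_l2)

        trace = pm.sample(5000, tune=1000)

    # Posterior probability that L1 normalization really is better than L2.
    print("P(L1 better) = {0:.3f}".format((trace["diff"] > 0).mean()))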

The probable difference between the mean accuracies looks like this (code available on my own GitHub repository):
Histogram of the difference in accuracy between L1 and L2 normalization using the same ANN.
This shows that, given what we know, there is most likely an improvement of about 0.5% in using L1 over L2, a quantity the classical approach did not give us. There is a possibility that, despite the data, L2 is actually better than L1 but, as we can see from the graph, this is very unlikely.

[1] Statistics in a Nutshell, O'Reilly
[2] Bayesian Methods for Hackers, Davidson-Pilon

Python Crib Sheet #2


Context managers - aka 'with'

"Context managers wrap a block and manage requirements on entry and departure from the block and are marked by the with keyword. File objects are context managers... we know that the file will be closed immediately after the last read, whether the operation was successful or not... closure of the file is also assured, because it’s part of the file object’s context management, so we don’t need to write the code. In other words, by using with combined with a context management (in this case a file object), we don’t need to worry about the routine cleanup." [1]

In pseudo-code (from effbot):

    class controlled_execution:
        def __enter__(self):
            set things up
            return thing
        def __exit__(self, type, value, traceback):
            tear things down

    with controlled_execution() as thing:
         some code
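
As a concrete (if trivial) illustration of the same shape, here is a minimal context manager of my own that times whatever block it wraps:

    import time

    class timed:
        def __enter__(self):
            # set things up
            self.start = time.time()
            return self                      # bound to the name after 'as', if any

        def __exit__(self, type, value, traceback):
            # tear things down
            print("block took {0:.3f}s".format(time.time() - self.start))
            return False                     # False: don't swallow exceptions

    with timed():
        sum(range(10 ** 6))                  # the 'some code' being timed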


* Operator

"A special parameter can be defined that will collect all extra positional arguments in a function call into a tuple" [1]. See the example here where "zip is its own inverse". That is, it can both turn two lists into one list of tuples (as you'd expect) and also turn a list of tuples into two lists.

For example:

>>> def g(*x):
...     for a in x:
...         print(a)
... 
>>> g(*(1, 2))
1
2
>>> g((1, 2))
(1, 2)
>>> g(*[(1, 2), (3, 4)])
(1, 2)
(3, 4)
>>> g([(1, 2), (3, 4)])
[(1, 2), (3, 4)]

That is, adding * means the tuple or list becomes a sequence of their constituents.

To summarize:

Call            def g(*x)          def g(x)
g((1, 2))       prints (1, 2)      prints 1 then 2
g(*(1, 2))      prints 1 then 2    TypeError

and the same applies for lists, where '(' and ')' become '[' and ']'.
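
To illustrate the earlier claim that "zip is its own inverse", a quick sketch:

    numbers = [1, 2, 3]
    letters = ['a', 'b', 'c']

    # zip two lists into a list of pairs...
    pairs = list(zip(numbers, letters))   # [(1, 'a'), (2, 'b'), (3, 'c')]

    # ...and splatting the pairs back into zip undoes it
    nums_again, letters_again = zip(*pairs)
    print(nums_again)                     # (1, 2, 3)
    print(letters_again)                  # ('a', 'b', 'c')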

The future works (certain T&Cs may apply)

"It’s possible to import several features of Python 3, like division, into Python 2.x by using the __future__ module; but even if you imported all the features of the __future__ library, the differences in library structure, the distinction between strings and Unicode, and so on all would make for code that would be hard to maintain and debug." [1]

"With __future__ module's inclusion, you can slowly be accustomed to incompatible changes or to such ones introducing new keywords." (StackOverflow).

An example is the difference in integer division between Python 2 and 3. In Python 2, dividing two integers with / gives another integer; in Python 3, it gives a float. Using __future__ can make Python 2 behave like Python 3:

$ python
Python 2.7.12 (default, Nov 19 2016, 06:48:10)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> 1/3
0
>>> from __future__ import division
>>> 1/3
0.3333333333333333
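
Note that the old truncating behaviour is still available as the floor-division operator, //, with or without the import:

    from __future__ import division

    print(1 / 3)    # true division: 0.3333333333333333 in Python 2 (with the import) and 3
    print(1 // 3)   # floor division: 0 in both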

[1] The Quick Python Book