Thursday, November 29, 2018

Hive


I'm building a Hadoop ecosystem test harness that will allow a whole slew of applications to be tested in a single JVM, in a single integration test (source).

A test for this harness fires up ZooKeeper, Kafka, HDFS, Spark and finally Hive. It feeds Kafka with messages, and Spark Structured Streaming processes them and writes them as Parquet to HDFS. Finally, Hive reads these files and the test checks that what Hive reads is what Spark processed.

This all worked fine until I decided to partition the output of the Spark DataStreamWriter. Then Hive didn't see any data and the test failed on the assertion that what Spark sees, Hive sees. Annoyingly, the test failed not because of a bug but because of what is essentially misconfiguration. The reason is that Hive keeps a store of metadata about all the things it can see, and although the data is there to be read, Hive has not been told to read it until the metastore is updated.

So, having put a long Thread.sleep in the test code, I fired up Beeline with something like:

beeline> !connect  jdbc:hive2://localhost:40327/default

whereupon we're asked to enter a user name and password. First, we recognize the data:

0: jdbc:hive2://localhost:40327/default> create external table parquet_table_name (key String, value String) PARTITIONED BY (partitionKey Date)  STORED AS PARQUET LOCATION 'hdfs://127.0.0.1:33799/tmp_parquet/';
No rows affected (0.077 seconds)

but we can't see any rows:

0: jdbc:hive2://localhost:40327/default> select count(*) from parquet_table_name;
No rows selected (0.204 seconds)
+------+
| _c0  |
+------+
| 0    |
+------+

This seems to be because there are no partitions:

0: jdbc:hive2://localhost:40327/default> show partitions parquet_table_name;
+------------+
| partition  |
+------------+
+------------+
No rows selected (0.219 seconds)

The solution was to add the partitions manually:

1: jdbc:hive2://localhost:40327/default> ALTER TABLE parquet_table_name ADD PARTITION (partitionKey='2018-11-28') location 'hdfs://127.0.0.1:33799/tmp_parquet/partitionKey=2018-11-28/';
No rows affected (0.155 seconds)
1: jdbc:hive2://localhost:40327/default> show partitions parquet_table_name ;
No rows affected (0.327 seconds)
+--------------------------+
|        partition         |
+--------------------------+
| partitionkey=2018-11-28  |
+--------------------------+

Contrary to the advice I read elsewhere, MSCK REPAIR TABLE parquet_table_name SYNC PARTITIONS; did not seem to help me, apparently because of the camel-case partition column name. (The command ALTER TABLE table_name RECOVER PARTITIONS; seems to be just for Amazon's version of Hive.)

Now, counting the rows gives me (some of) my data:

1: jdbc:hive2://localhost:40327/default> select count(*) from parquet_table_name;
+-------+
|  _c0  |
+-------+
| 2879  |
+-------+
1 row selected (1.939 seconds)

I'm not a Hive expert so there may be a solution that just adds anything in the directory. Despite all the advice on forums, I could not get that to work: the column name on which the table is partitioned needs to be lowercase and mine was camel case. However, this hack works for now.
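Until something better turns up, the ALTER TABLE statements can at least be generated rather than typed by hand. The following is a minimal sketch in Python and is not part of the harness. It assumes you already have the partition directory paths (for example, from listing the HDFS directory), and the helper name add_partition_statements is made up for illustration. It only builds statement strings in the same form as the one used above; it does not talk to Hive.

```python
def add_partition_statements(table, partition_dirs):
    """Build one ALTER TABLE ... ADD PARTITION statement per partition directory.

    Hypothetical helper: the last path component of each directory is expected
    to look like <column>=<value>; anything else is skipped.
    """
    statements = []
    for directory in partition_dirs:
        base = directory.rstrip("/")
        leaf = base.rsplit("/", 1)[-1]
        column, separator, value = leaf.partition("=")
        if not separator:
            continue  # not a partition directory (e.g. _spark_metadata)
        statements.append(
            f"ALTER TABLE {table} ADD PARTITION ({column}='{value}') "
            f"location '{base}/';"
        )
    return statements


# Example paths only; the ports are the ones from the session above.
dirs = [
    "hdfs://127.0.0.1:33799/tmp_parquet/_spark_metadata",
    "hdfs://127.0.0.1:33799/tmp_parquet/partitionKey=2018-11-28",
]
for statement in add_partition_statements("parquet_table_name", dirs):
    print(statement)
```

Each printed line could then be pasted into Beeline or sent over JDBC from the test itself.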

There appears to be a JIRA to make this automatic but there is nothing at the moment.

Thursday, November 8, 2018

Science


Summer seems an age ago but these are some notes I took from my holiday book list. The theme is "what is science?", a deceptively easy question. See Confidence, Credibility, and why Frequentism and Science do not Mix, which argues "frequentism is generally answering the wrong question". (Briefly, it says: "frequentists consider model parameters to be fixed and data to be random, while Bayesians consider model parameters to be random and data to be fixed... the Bayesian solution is a statement of probability about the parameter value given fixed bounds. The frequentist solution is a probability about the bounds given a fixed parameter value.")

The Scientific Methodology

Inductive reasoning is where "the premises are viewed as supplying strong evidence for the truth of the conclusion. While the conclusion of a deductive argument is certain, the truth of the conclusion of an inductive argument may be probable, based upon the evidence given." (Wikipedia)

"The scientific procedure for the study of a physical system can be (rather arbitrarily) divided into the following three steps.

i) Parameterization of the system: discovery of a minimal set of model parameters whose values completely characterize the system (from a given point of view).

ii) Forward modeling: discovery of the physical laws allowing us, for given values of the model parameters, to make predictions on the results of measurements on some observable parameters.

iii) Inverse modeling: use of the actual results of some measurements of the observable parameters to infer the actual values of the model parameters.

Strong feedback exists between these steps, and a dramatic advance in one of them is usually followed by advances in the other two. While the first two steps are mainly inductive, the third step is deductive. This means that the rules of thinking that we follow in the first two steps are difficult to make explicit. On the contrary, the mathematical theory of logic (completed with probability theory) seems to apply quite well to the third step, to which this book is devoted."

- Inverse Problem Theory, Albert Tarantola

The Logic of Science

First, terminology. "A syllogism ... is a kind of logical argument that applies deductive reasoning to arrive at a conclusion based on two or more propositions that are asserted or assumed to be true" (Wikipedia). For example, all men are mortal. Socrates is a man. Therefore, Socrates is mortal.

In The Logic of Science, E T Jaynes argues: "our theme is simply: Probability Theory as Extended Logic.

"Deductive reasoning can be analyzed ultimately into the repeated application of two strong syllogisms:

If A is true, then B is true
A is true
Therefore, B is true

and its inverse

If A is true, then B is true
B is false
Therefore, A is false

[However] ... in almost all the situations confronting us we do not have the right kind of information to allow this kind of reasoning. We fall back on weaker syllogisms

If A is true, then B is true
B is true
Therefore, A becomes more plausible

"The rain at 10 AM is not the physical cause of the clouds at 9:45 AM. Nevertheless, the proper logical connection is not in the uncertain causal direction (clouds ⇒ rain), but rather (rain ⇒ clouds) which is certain, although noncausal."

"Suppose some dark night a policeman walks down a street, apparently deserted; but suddenly he hears a burglar alarm, looks across the street, and sees a jewelry store with a broken window. Then a gentleman wearing a mask comes crawling out through the broken window, carrying a bag which turns out to be full of expensive jewelry. The policeman doesn’t hesitate at all in deciding that this gentleman is dishonest [even if] there may have been a perfectly innocent explanation for everything.

"The reasoning of our policeman was not even of the above types. It is best described by a still weaker syllogism:

If A is true, then B becomes more plausible
B is true
Therefore, A becomes more plausible

"But in spite of the apparent weakness of this argument, when stated abstractly in terms of A and B, we recognize that the policeman’s conclusion has a very strong convincing power. There is something which makes us believe that in this particular case, his argument had almost the power of deductive reasoning."
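Reading plausibility as probability, both weak syllogisms drop out of Bayes' rule. The sketch below is a small numeric illustration in Python, not anything taken from the book; the probabilities are invented purely for the example.

```python
def posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """P(A|B) from Bayes' rule, with P(B) expanded over A and not-A."""
    p_b = p_b_given_a * prior_a + p_b_given_not_a * (1.0 - prior_a)
    return p_b_given_a * prior_a / p_b


prior = 0.1  # made-up prior plausibility of A

# First weak syllogism: A implies B, so P(B|A) = 1, and B is observed.
weak = posterior(prior, p_b_given_a=1.0, p_b_given_not_a=0.3)

# The policeman's syllogism: A only makes B more plausible,
# i.e. P(B|A) > P(B|not A), and B is observed.
weaker = posterior(prior, p_b_given_a=0.6, p_b_given_not_a=0.3)

print(f"prior P(A)              = {prior:.3f}")
print(f"P(A|B) when A implies B = {weak:.3f}")
print(f"P(A|B) when A favours B = {weaker:.3f}")
```

In both cases the posterior exceeds the prior, so A "becomes more plausible". Shrinking p_b_given_not_a towards zero pushes the posterior towards 1, which is one way to read why the policeman's argument feels almost deductive.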

Think Complexity

The rest of this post consists of notes I made from reading Think Complexity by Allen B Downey, which he has generously made free to download (see here).

Science

Determinism. Statements from weak (1) to strong (4):

D1: Deterministic models can make accurate predictions for some physical systems.

D2: Many physical systems can be modelled by deterministic processes, but some are intrinsically random.

D3: All events are caused by prior events, but many physical systems are nevertheless fundamentally unpredictable.

D4: All events are caused by prior events, and can (at least in principle) be predicted.

Wolfram’s demonstration of complex behaviour in simple cellular automata is … disturbing, at least to a deterministic world view.
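To make this concrete, here is a minimal sketch of one of Wolfram's elementary cellular automata, Rule 30, in Python. It is not Downey's code: the function names, the grid width and the choice to treat cells beyond the edges as 0 are assumptions made for this example.

```python
def step(cells, rule=30):
    """Apply an elementary cellular automaton rule once; edges are padded with 0."""
    padded = [0] + cells + [0]
    new_cells = []
    for i in range(1, len(padded) - 1):
        # The three-cell neighbourhood, read as a binary number 0..7,
        # selects one bit of the rule number.
        index = (padded[i - 1] << 2) | (padded[i] << 1) | padded[i + 1]
        new_cells.append((rule >> index) & 1)
    return new_cells


def run(width=31, steps=15, rule=30):
    """Start from a single live cell in the middle and collect every row."""
    row = [0] * width
    row[width // 2] = 1
    rows = [row]
    for _ in range(steps):
        row = step(row, rule)
        rows.append(row)
    return rows


for row in run():
    print("".join("#" if cell else "." for cell in row))
```

The update rule is fully deterministic and fits in a few lines, yet the printed triangle has no obvious repeating structure, which is the point of the remark above.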

Models

In general, we expect a model that is more realistic to make better predictions and to provide more believable explanations. Of course, this is only true up to a point. Models that are more detailed are harder to work with, and usually less amenable to analysis. At some point, a model becomes so complex that it is easier to experiment with the system. At the other extreme, simple models can be compelling exactly because they are simple.

Simple models offer a different kind of explanation than detailed models. With a detailed model, the argument goes something like this: “We are interested in physical system S, so we construct a detailed model, M, and show by analysis and simulation that M exhibits a behaviour, B, that is similar (qualitatively or quantitatively) to an observation of the real system, O. So why does O happen? Because S is similar to M, and B is similar to O, and we can prove that M leads to B.”

With simple models we can’t claim that S is similar to M, because it isn’t. Instead, the argument goes like this: “There is a set of models that share a common set of features. Any model that has these features exhibits behaviour B. If we make an observation, O, that resembles B, one way to explain it is to show that the system, S, has the set of features sufficient to produce B.”

For this kind of argument, adding more features doesn’t help. Making the model more realistic doesn’t make the model more reliable; it only obscures the difference between the essential features that cause O and the incidental features that are particular to S. The features x and y are sufficient to produce the behaviour. Adding more detail, like features w and z, might make the model more realistic, but that realism adds no explanatory power.

Scientific Realism

Think Heraclitus and the river.

Note: the gene was postulated 50 years before DNA was discovered. During that time, it was only a postulated entity rather than a “real” one.

SR1: Scientific theories are true or false to the degree that they approximate reality, but no theory is exactly true. Some postulated entities may be real, but there is no principled way to say which ones.

SR2: As science advances, our theories become better approximations of reality. At least some postulated entities are known to be real.

SR3: Some theories are exactly true; others are approximately true. Entities postulated by true theories, and some entities in approximate theories, are real.

SR4: A theory is true if it describes reality correctly, and false otherwise. The entities postulated by true theories are real; others are not.

Instrumentalism

But SR1 is so weak that it verges on instrumentalism, which is the view that we can’t say whether a theory is true or false because we can’t know whether a theory corresponds to reality.

Criticality

“A system is “critical” if it is in transition between two phases; for example, water at its freezing point is a critical system.”

Think sand pile, cellular automata and periodic avalanches. Common behaviour of criticality is:

“Long-tailed distributions of some physical quantities: for example, in freezing water the distribution of crystal sizes is characterized by a power law.

“Fractal geometries: canonical example is a snowflake. Fractals are characterized by self-similarity; that is, parts of the pattern resemble scaled copies of the whole.”

Pink noise: “Specifically, the power at frequency f is proportional to 1/f.”
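The sand pile mentioned above can be sketched in a few lines of Python. This is a minimal version in the style of the Bak-Tang-Wiesenfeld sand pile, not Downey's implementation; the grid size, the number of grains, the toppling threshold of 4 and the function names are all assumptions made for the example.

```python
import random


def topple(grid):
    """Relax the pile and return the avalanche size (number of topplings).

    Any cell holding 4 or more grains gives one grain to each of its four
    neighbours; grains pushed over the edge of the grid are lost.
    """
    size = len(grid)
    avalanche = 0
    unstable = [(r, c) for r in range(size) for c in range(size) if grid[r][c] >= 4]
    while unstable:
        r, c = unstable.pop()
        if grid[r][c] < 4:
            continue  # already relaxed by an earlier toppling
        grid[r][c] -= 4
        avalanche += 1
        if grid[r][c] >= 4:
            unstable.append((r, c))
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < size and 0 <= nc < size:
                grid[nr][nc] += 1
                if grid[nr][nc] >= 4:
                    unstable.append((nr, nc))
    return avalanche


def simulate(size=20, grains=3000, seed=1):
    """Drop grains one at a time on random cells and record each avalanche size."""
    rng = random.Random(seed)
    grid = [[0] * size for _ in range(size)]
    sizes = []
    for _ in range(grains):
        grid[rng.randrange(size)][rng.randrange(size)] += 1
        sizes.append(topple(grid))
    return grid, sizes


grid, sizes = simulate()
print("largest avalanche:", max(sizes))
print("drops causing no avalanche:", sizes.count(0))
```

Tallying how often each avalanche size occurs in sizes is the natural next step if you want to look for the long-tailed distribution described above.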

Reductionism

“A reductionist model describes a system by describing its parts and their interactions... it depends on an analogy between the components of the model and the components of the system”

An example is the ideal gas law, which approximates reality by ignoring inter-molecular interactions.

Holistic

“Holistic models are more focused on similarities between systems and less interested in analogous parts.

• Identify a kind of behavior that appears in a variety of systems.

• Find the simplest model that demonstrates that behavior.”

For example, memes and genes are not analogous, but their propagation shares the same evolutionary behaviour.

Prediction of SoC (self-organized criticality)

“If Perrow’s “normal accident theory” is correct, there may be no special cause of large failures.”

Agent-based Models

"The characteristics of agent-based models include:

• Agents that model intelligent behavior, usually with a simple set of rules.
• The agents are usually situated in space (or in a network), and interact with each other locally.
• They usually have imperfect, local information.
• Often there is variability between agents.
• Often there are random elements, either among the agents or in the world."
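A toy traffic model shows most of these characteristics in a few lines of Python. It is loosely modelled on the Highway example from the book, quoted next, but it is not Downey's code: the road length, speed limit, following distance and the exact update rule are assumptions chosen to keep the sketch short.

```python
import random


def simulate_highway(n_drivers=30, length=300, speed_limit=5,
                     safe_gap=8, steps=200, seed=0):
    """Simulate drivers on a circular one-lane road.

    Returns the final positions and speeds, listed in driving order.
    """
    rng = random.Random(seed)
    # Random elements: distinct starting cells and random starting speeds.
    positions = sorted(rng.sample(range(length), n_drivers))
    speeds = [rng.randint(0, speed_limit) for _ in range(n_drivers)]

    for _ in range(steps):
        new_speeds = []
        for i in range(n_drivers):
            # Local information only: the distance to the driver in front.
            ahead = positions[(i + 1) % n_drivers]
            gap = (ahead - positions[i]) % length
            if gap < safe_gap:
                speed = max(speeds[i] - 1, 0)            # too close: brake
            else:
                speed = min(speeds[i] + 1, speed_limit)  # room ahead: accelerate
            if speed >= gap:
                speed = 0  # would reach the driver in front: stop completely
            new_speeds.append(speed)
        # Everyone moves at once; positions wrap around the circle.
        positions = [(p + s) % length for p, s in zip(positions, new_speeds)]
        speeds = new_speeds
    return positions, speeds


positions, speeds = simulate_highway()
print("stopped drivers:", speeds.count(0))
print("drivers at the speed limit:", speeds.count(5))
```

Whether jams form depends on the density of drivers and on safe_gap, so it is worth varying n_drivers and watching how many drivers end up stopped.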

"The Highway is a one-lane road that forms a circle, but it is displayed as a series of rows that spiral down the canvas. Each driver starts with a random position and speed. At each time step, each Driver accelerates or brakes based on the distance between it and the Driver in front. ... If the following distance is too short, the Driver brakes; otherwise it accelerates. [There are] two other constraints: there is a speed limit for each driver, and if the current speed would cause a collision, the Driver comes to a complete stop. You will probably see a traffic jam, and the natural question is, “Why?” There is nothing about the Highway or Driver behavior that obviously causes traffic jams."