Thursday, August 27, 2020

Azure

The Azure web GUI is quite immature. For instance, if you install the Azure Storage Explorer (Windows only) it doesn't show timestamps of files. Fortunately, a lot (everything?) can be done from the command line. This, for instance, mounts an SMB share in the cloud on my local Linux box where it can be treated like any other directory:

sudo mount -t cifs //XXX.file.core.windows.net/DIRECTORY /mnt/DIRECTORY -o vers=3.0,username=USERNAME,password=GET_THIS_FROM_THE_WEB_GUI,dir_mode=0777,file_mode=0777,serverino 

Also, if you want to put a multi-line value into Microsoft's Key Vault, you'll find you can't do it in the web GUI. You need to put the text, line breaks included, into YOUR_FILE and use:

az keyvault secret set --name YOUR_KEY --vault-name VAULT_NAME --value "`cat YOUR_FILE`"


Docker and K8s in the Azure cloud

First, tag your image with something like:

docker tag 8d2be7e5d4eb XXX.azurecr.io/YYY:1.0

where XXX is your image repository subdomain in Azure and YYY is the name of the artifact. Log in with:

az acr login -n XXX

and now you can push your artifact into the Azure infrastructure:

docker push XXX.azurecr.io/YYY:1.0

(You might need to run az acr login -n XXX first) 

Let's check it works:

kubectl run -i --tty --attach ARBITRARY_NAME --image=XXX.azurecr.io/YYY:1.0  --command -- /bin/bash

and behold, we are on the CLI of a remote container in the Azure cloud.

But let's not forget to clean up after ourselves with:

kubectl delete deployment ARBITRARY_NAME


Network Speeds

By having your image pushed to K8s, you can run your code in Azure as easily as on your laptop. The big benefit is network speed. In my case, I was decrypting an RSA-encrypted file taken from BLOB storage at about 1mb/s on my (well-specced) laptop but exactly the same code was easily managing 10mb/s in the Azure cloud. (Yes, I know that using asymmetric ciphers for large files is not efficient [SO] but this was imposed on us by our client). By using jstack, I could see that the threads on my laptop were spending most of their time in IO, not in Bouncy Castle.


Thursday, August 20, 2020

Self-documenting tests

Even though it's 2020, self-documenting tests are still niche. There is Clairvoyance, a Scala flavour of YatSpec (that I mention extensively in this article for IBM). Its creator, Rhys Keepence, is an old colleague and told me recently that it "is mostly up to date (although I haven’t yet published for Scala 2.13). The docs on my github page are not super up to date, but the latest version is 1.0.129".

However, introducing yet another new library to the codebase was too much of an ask so I started using GivenWhenThen in ScalaTest. It was somewhat painful to get it to print just the Given, When, Then outputs to a separate file that can be version controlled for the edification of other data scientists, without all the gubbins that are also spat out in your typical build.

I eventually did it using these resources from the ScalaTest docs. The top-and-bottom of it is that the GWT outputs are captured in a Reporter that may be bespoke (IntelliJ uses its own to separate the GWTs from the logging). This Reporter can then spew out the events at the end of the test. But if you want them in a file produced by your build, you'll need something like this (in Maven):

    <build>
        <plugins>
            <plugin>
                <groupId>org.scalatest</groupId>
                <artifactId>scalatest-maven-plugin</artifactId>
                <version>2.0.0</version>
                <configuration>
                    <reportsDirectory>${project.build.directory}/surefire-reports</reportsDirectory>
                    <junitxml>.</junitxml>
                    <stderr/>
                    <filereports>W ../docs/src/main/acceptance_tests/scenarios.txt</filereports>
                </configuration>
...

and then copy scenarios.txt to where it can be version controlled with something like:

            <plugin>
                <artifactId>maven-resources-plugin</artifactId>
                <version>3.1.0</version>
                <executions>
                    <execution>
                        <id>copy-resources</id>
                        <phase>package</phase>
                        <goals>
                            <goal>copy-resources</goal>
                        </goals>
                        <configuration>
                            <outputDirectory>${basedir}/../docs/src/main/acceptance_tests/</outputDirectory>
                            <resources>
                                <resource>
                                    <directory>target/docs/src/main/acceptance_tests/</directory>
                                    <filtering>true</filtering>
...

The W in filereports means "without colour". Although colour looks good on a Unix CLI, it just adds odd escape characters to a text file, and a plain text file is what the data scientists want.
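For reference, here is a minimal sketch of the kind of spec whose Given/When/Then lines a file reporter would pick up. It assumes ScalaTest 3.1 or later on the classpath (earlier 3.0.x versions use org.scalatest.WordSpec and org.scalatest.Matchers instead), and the word-counting domain and the class name are purely hypothetical.

```scala
import org.scalatest.GivenWhenThen
import org.scalatest.matchers.should.Matchers
import org.scalatest.wordspec.AnyWordSpec

// Hypothetical spec: the word-counting domain is only for illustration.
class WordCountSpec extends AnyWordSpec with GivenWhenThen with Matchers {

  "A word counter" should {
    "count each distinct word" in {
      Given("a line of text with repeated words")
      val line = "to be or not to be"

      When("the words are counted")
      val counts = line.split(" ").groupBy(identity).map { case (word, hits) => word -> hits.length }

      Then("the repeated words are counted twice")
      counts("to") shouldBe 2
      counts("be") shouldBe 2

      And("the other words are counted once")
      counts("or") shouldBe 1
    }
  }
}
```

The strings passed to Given, When, Then and And are reported as events alongside the test name, which is why they can be harvested by a Reporter rather than scraped from the log.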

I'm unaware of a similar BDD framework for ZIO which means I need to mix my ZIO tests with ScalaTest. Unfortunately, I noticed that with Maven, some ZIO tests were failing but this did not stop the build. I documented this on ZIO's github here.


Friday, August 14, 2020

Encryption


A few random notes I've been making about security libraries I've been using this past year or so.

How Random is Random?

SecureRandom is the gold standard. However, "depending on the implementation, the generateSeed and nextBytes methods may block as entropy is being gathered, for example, if they need to read from /dev/random on various Unix-like operating systems." [JavaDocs] This hasn't been a problem for me so far as I create one million 64-bit random numbers in my unit tests and the whole process takes about a second or two.

On Linux, you can see the temperature of the CPU, fan speeds etc by installing the tools mentioned here (AskUbuntu). Physical readings like these are one possible source of entropy for generating randomness.

There's an interesting addition to the Java API called ThreadLocalRandom that is more efficient than java.util.Random when used from many threads, but it is still not appropriate where cryptographically secure random numbers are needed.
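To make the comparison concrete, here is a small sketch that fills an array with 64-bit values from each generator and times the work. The object and method names are my own, and the timing helper is deliberately crude (no warm-up), so treat any numbers it prints as a rough indication only.

```scala
import java.security.SecureRandom
import java.util.concurrent.ThreadLocalRandom

object RandomSketch {

  /** n 64-bit values from a cryptographically strong source. */
  def secureLongs(n: Int): Array[Long] = {
    val random = new SecureRandom() // platform default; may block while entropy is gathered
    Array.fill(n)(random.nextLong())
  }

  /** n 64-bit values from the fast per-thread generator; NOT cryptographically secure. */
  def fastLongs(n: Int): Array[Long] =
    Array.fill(n)(ThreadLocalRandom.current().nextLong())

  /** Wall-clock milliseconds taken to evaluate the block. */
  def millisFor[A](block: => A): Long = {
    val start = System.nanoTime()
    block
    (System.nanoTime() - start) / 1000000L
  }
}
```

Something like println(RandomSketch.millisFor(RandomSketch.secureLongs(1000000))) gives a feel for whether blocking is an issue on a particular machine.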


PGP or GPG?

"OpenPGP is the IETF-approved standard that defines encryption technology that uses processes that are interoperable with PGP. pgp is Symantec's proprietary encryption solution. pgp adheres to the OpenPGP standard and provides an interface that allows users to easily encrypt their files." [NetworkWorld]

"gpg is the OpenPGP part of the GNU Privacy Guard (GnuPG). It is a tool to provide digital encryption and signing services using the OpenPGP standard. gpg features complete key management and all the bells and whistles you would expect from a full OpenPGP implementation." [gpg man pages].

You can have the public key embedded in the file which can identify the recipient. Why is this useful? "As far as I know, the recipient's public key IDs, key Validity dates, name, and email address are embedded in the GPG ASCII Armor file (GnuPG Manual ). So using pub key file / Key ID / Name / Email to identify which public key to use should all be equivalent." [StackExchange]

Importing a private key

If you haven't got the key, you can't decrypt a file. But if you have, you don't need to specify it. For instance, if I try to decrypt a file for which I don't have a key, I see:

$ gpg --output file.zip -d file.zip.pgp 
gpg: encrypted with RSA key, ID EAC258F9825D4C9C
gpg: decryption failed: No secret key

However, I can import it:

$ gpg --import ~/Temp/key.txt
gpg: key EAC258F9825D4C9C: public key "XXXX-TEST " imported
gpg: key EAC258F9825D4C9C: secret key imported
gpg: Total number processed: 1
gpg:               imported: 1
gpg:       secret keys read: 1
gpg:   secret keys imported: 1

and now decrypt it:

$ gpg --output file.zip -d file.zip.pgp 
gpg: encrypted with 4096-bit RSA key, ID EAC258F9825D4C9C, created 2020-02-27
      "XXXX-TEST "

Bouncy Castle

Bouncy Castle is the de facto library to allow the JVM to access OpenPGP files.

One gotcha I found when using Bouncy Castle in an über JAR that was called in a Docker container was:

Caused by: java.util.jar.JarException: file:/home/henryp/main-1.0-SNAPSHOT-jar-with-dependencies.jar has unsigned entries ...

There doesn't seem to be a huge amount you can do about this if you insist on using über JARs as "You can't bundle a cryptographic library. They have to be signed for the JVM to load them, and the signature is destroyed when merged into the shadow jar." [GitHub]

This seems to be something specific to Oracle's JDK because if my Dockerfile starts with:

FROM openjdk:11-jdk-slim

I don't have this problem. 


Encrypted ZIPs

I was hoping to stream a zip file that was encrypted, decrypting and unzipping as I went but was worried about the ZIP format. Note that a "directory is placed at the end of a ZIP file. This identifies what files are in the ZIP and identifies where in the ZIP that file is located. This allows ZIP readers to load the list of files without reading the entire ZIP archive. ZIP archives can also include extra data that is not related to the ZIP archive." [Wikipedia]

So, could I really decrypt and unzip a stream?

Changing to GZIP wouldn't help either because "Both zip and gzip use the same compressing format internally, the main difference is in the metadata: zip has it at the end of the file, gzip at the beginning (and gzip only supports one enclosed file easily)." [StackOverflow]

But decrypting the stream and forking a process that unzips it using PipedInputStream and PipedOutputStream seems to work even on files of about 1gb.
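A minimal sketch of the piped approach follows. It is not my production code: the decryption step is replaced by a hypothetical produce function that simply writes ZIP bytes, the names are invented for illustration, and it assumes Java 9 or later for readAllBytes. It works because ZipInputStream reads the local header in front of each entry and does not need the directory at the end of the archive.

```scala
import java.io.{ByteArrayOutputStream, OutputStream, PipedInputStream, PipedOutputStream}
import java.nio.charset.StandardCharsets.UTF_8
import java.util.zip.{ZipEntry, ZipInputStream, ZipOutputStream}

object StreamingUnzip {

  /** Builds a small ZIP in memory; a stand-in for the decrypted bytes. */
  def zipOf(files: Map[String, String]): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val zip   = new ZipOutputStream(bytes)
    files.foreach { case (name, content) =>
      zip.putNextEntry(new ZipEntry(name))
      zip.write(content.getBytes(UTF_8))
      zip.closeEntry()
    }
    zip.close() // writes the central directory at the end
    bytes.toByteArray
  }

  /** Runs `produce` on a second thread while the calling thread unzips what it writes. */
  def unzipWhileProducing(produce: OutputStream => Unit): Map[String, String] = {
    val pipeIn  = new PipedInputStream(64 * 1024)
    val pipeOut = new PipedOutputStream(pipeIn)

    val producer = new Thread(() => {
      try produce(pipeOut)
      finally pipeOut.close() // tells the reader that the stream has ended
    })
    producer.start()

    val zip = new ZipInputStream(pipeIn)
    try {
      Iterator
        .continually(zip.getNextEntry)
        .takeWhile(_ != null)
        .map(entry => entry.getName -> new String(zip.readAllBytes(), UTF_8))
        .toMap
    } finally {
      // Drain what is left (the central directory) so the producer is never left blocked on a full pipe.
      val scratch = new Array[Byte](8192)
      while (pipeIn.read(scratch) != -1) {}
      producer.join()
      zip.close()
    }
  }
}
```

The two ends of a pipe must live on different threads, otherwise a full buffer deadlocks the writer. In real use, produce would be whatever writes the decrypted bytes, and each entry would be streamed onwards rather than collected into a Map.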

Encrypted Parquet

Parquet Modular Encryption allows certain columns to be encrypted.

OAuth

“Designed specifically to work with … (HTTP), OAuth essentially allows access tokens to be issued to third-party clients by an authorization server, with the approval of the resource owner” [Wikipedia]

"A trust store is used to authenticate peers. A key store is used to authenticate yourself." [StackOverflow]

You can get tokens in Google Dataflow with this:

    import com.google.auth.oauth2.ComputeEngineCredentials

    val credentials = ComputeEngineCredentials.create()
    val accessToken = credentials.refreshAccessToken()
    logger.info(s"accessToken = $accessToken")             // OAuth token
    logger.info(s"getAccount = ${credentials.getAccount}") // service account name

Although almost ubiquitous, OAuth has its drawbacks:

Ross A. Baker @rossabaker Jun 08 04:27
I think the specification is far too complex for what it accomplishes.
I had a lengthy argument implementing it when I worked at a security company as a replacement for a request signing algorithm.
In request signing, you can't just steal a token the way you can in OAuth2.
And the argument is, "Well, it's sent over TLS, what does it matter?"
And as we were having that argument, those tokens were appearing in clear text in our logs.
Was it a shitty implementation? Absolutely. But all it takes is one mistake like that.
It's neither as convenient as basic auth, nor as secure as something like an HMAC-signed request. I feel like it operates in a middle ground that suits no purpose very well.

Gavin Bisesi @Daenyth Jun 08 14:48
another pain point is that IIUC the oauth spec is very full of "MAY" options and relatively few "MUST" options, so every actual implementation does stuff differently and nothing is compatible with anything else eg you often need specific logic to support X vs Y backends

An alternative to OAuth is Request Signing. Basically, the server has all private keys, clients only have their own, and messages are encrypted and signed - see Andrew Hoang's blog.
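The HMAC-signed request that Ross mentions can be sketched with nothing but the JDK. This is an illustration of the general idea, not the scheme from the blog above: it uses a shared secret rather than key pairs, and the canonical string (method, path and body joined by newlines) and all the names are assumptions of mine.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.security.MessageDigest
import javax.crypto.Mac
import javax.crypto.spec.SecretKeySpec

object RequestSigning {
  private val Algorithm = "HmacSHA256"

  /** HMAC over a canonical form of the request; the layout of that form is an assumption. */
  def sign(secret: Array[Byte], method: String, path: String, body: String): Array[Byte] = {
    val mac = Mac.getInstance(Algorithm)
    mac.init(new SecretKeySpec(secret, Algorithm))
    mac.doFinal(s"$method\n$path\n$body".getBytes(UTF_8))
  }

  /** Recomputes the signature on the server side and compares the two byte arrays. */
  def verify(secret: Array[Byte], method: String, path: String, body: String, signature: Array[Byte]): Boolean =
    MessageDigest.isEqual(sign(secret, method, path, body), signature)
}
```

The point of the design is that the secret never travels with the request. A leaked signature is only good for that one request, whereas a leaked bearer token can be replayed for anything the token allows. A real scheme would also sign a timestamp or nonce to stop even that one request being replayed.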


Gotchas

When storing pass phrases etc in files, be careful that your editor does not add a newline. For instance, open a file in vi like so:

$ vi /tmp/5Chars.txt

Type the string 12345, save and close it.

$ ls -l /tmp/5Chars.txt
-rw-r--r-- 1 henryp kismet 6 Nov 24 10:23 /tmp/5Chars.txt

What? It's 6 bytes, not 5! One solution is this:

$ echo -n 12345 > /tmp/5Chars.txt 
$ ls -l /tmp/5Chars.txt
-rw-r--r-- 1 henryp kismet 5 Nov 24 10:25 /tmp/5Chars.txt

That's better.
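If you can't control how the file was written, the other option is to be defensive when reading it. A small sketch, with hypothetical names, that strips at most one trailing line terminator (so it assumes a pass phrase never legitimately ends in a newline):

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.nio.file.{Files, Path}

object PassPhraseFile {

  /** Raw file contents, including any newline the editor appended. */
  def raw(path: Path): String =
    new String(Files.readAllBytes(path), UTF_8)

  /** Contents with at most one trailing line terminator removed. */
  def withoutTrailingNewline(path: Path): String =
    raw(path).stripLineEnd
}
```

I'd still prefer the file to be correct in the first place, as other tools reading it will not be so forgiving.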

Thursday, August 13, 2020

Dependency Injection with ZIO

In a previous post, I showed how to set up a test in ZIO-land. This is more of a "how it works" post.

If you recall, we created a ZIO that needs layers to unit test it. To expand, we have a ZIO that looks like this:

ZIO[Init with Flow with Errors, Throwable, ProblemList]

where Init, Flow and Errors are my bespoke layers and ProblemList is just a type alias for a List of Eithers.

We add these layers later with provideLayer, using implementations that are either for testing or for production. By then, the interesting code has already been called: the construction of the ZIO in the first place. For me, it looks like:

  def flow(paths:       Filenames,
           configFile:  String,
           session:     SparkSession,
           fs:          FileSystem): ZIO[Init with Flow with Errors, Throwable, ProblemList] = for {
      s         <- Init.init(configFile, session, fs)
      results   <- Flow.resultsFor(s, paths, session, fs)

...

Now, taking the first element in the for comprehension, we see it looks like:

    def init(configFile:  String,
             session:     SparkSession,
             fs:          FileSystem): ZIO[Init, Throwable, Settings]
      = ZIO.accessM(_.get.initializeWith(configFile, session, fs))

This is where it becomes interesting. 

Although ZIO.accessM looks like we're calling a function, this is syntactic sugar. We're actually receiving a ZIO.AccessMPartiallyApplied[R] that "Effectfully accesses the environment of the effect" [docs].

The _.get returns the service we created (test or production) and initializeWith(...) is just calling my code. But how does get do its magic? Well, Init is just a type alias to Has[A] and "The trait Has[A] is used with ZIO environment to express an effect's dependency on a service of type A" [docs]. How it magics this into existence is down to the Izumi reflect library, whose mysteries I'm only just understanding. 

You can apply any number of layers and leave others dangling. For instance, if we have a:

RIO[KeyVaultLayer with Blocking, A]

(where RIO[R, A] is just an alias for ZIO[R, Throwable, A]) and we add layers so:

val kvService:  ULayer[KeyVaultLayer]               = ...
val zio:        RIO[KeyVaultLayer with Blocking, A] = ...
val partialZio: RIO[Blocking, A]                    = zio.provideSomeLayer[Blocking](kvService)

then you'll notice that partialZio still has one of the original requirements.

Although the ZLayer.fromFunction method seems to allow dependencies between layers, I was scratching my head on how to create a dependency on something that isn't a layer (in my case, a configuration that has been read within the ZIO I want the layer added to). I worked around it by having my layer provide a factory rather than the service itself.  

The takeaway from all this is that you can defer adding a dependency until after the for-comprehension code (but before you execute it). This is very convenient for testing.
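To see the whole cycle in one place, here is a self-contained sketch. It assumes ZIO 1.x on the classpath (Has and accessM do not exist in ZIO 2), and the Secrets and Audit services are hypothetical stand-ins for my real layers.

```scala
import zio._

object LayerSketch {

  // Hypothetical service standing in for something like a key vault.
  type Secrets = Has[Secrets.Service]
  object Secrets {
    trait Service { def lookup(name: String): Task[String] }

    // Only describes the access; no Service exists at this point.
    def lookup(name: String): RIO[Secrets, String] =
      ZIO.accessM(_.get.lookup(name))

    val test: ULayer[Secrets] = ZLayer.succeed[Service](new Service {
      def lookup(name: String): Task[String] = Task.succeed(s"fake-$name")
    })
  }

  // A second hypothetical service, so that one requirement can be left dangling.
  type Audit = Has[Audit.Service]
  object Audit {
    trait Service { def record(message: String): UIO[Unit] }

    def record(message: String): URIO[Audit, Unit] =
      ZIO.accessM(_.get.record(message))

    val silent: ULayer[Audit] = ZLayer.succeed[Service](new Service {
      def record(message: String): UIO[Unit] = UIO.unit
    })
  }

  // The for-comprehension is written before any implementation is chosen.
  val program: RIO[Secrets with Audit, String] = for {
    secret <- Secrets.lookup("password")
    _      <- Audit.record("looked up password")
  } yield secret

  // Satisfy one requirement and leave the other dangling ...
  val needsAudit: RIO[Audit, String] = program.provideSomeLayer[Audit](Secrets.test)

  // ... then satisfy the rest just before execution.
  val runnable: Task[String] = needsAudit.provideLayer(Audit.silent)

  def run(): String = Runtime.default.unsafeRun(runnable)
}
```

Swapping Secrets.test for a production layer changes nothing in program, which is the whole point.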