Wednesday, May 25, 2022

The CLI for busy Data Scientists and Engineers

I've been asked to give a talk on the command line interface for a mixed audience of Data Scientists and Engineers. Since the world is becoming ever more container-based, I'm focussing on Linux.

Containers

Sometimes you need to diagnose things within the container. Jump into the container with:

docker exec -it CONTAINER_ID /bin/bash

To get the basic diagnostic tools mentioned below, you'll generally need to execute:

apt-get update               # you need to run this first
apt-get install net-tools    # gives you netstat
apt-get install iputils-ping # gives you ping
apt-get install procps       # gives you ps
apt-get install lsof        

Note that these installations will all be gone the next time you start a container from the image, as the underlying image does not change.

You can find out which package your favourite command belongs to by running something like this:

$ dpkg -S /bin/netstat
net-tools: /bin/netstat

Formatting

You can use Perl-compatible regular expressions in grep with the -P switch. For example, let's search for lines that are strictly composed of two 5-letter words separated by a space (the ^ and $ anchor the pattern to the start and end of the line):

$ echo hello world  | grep -P "^\w{5}\s\w{5}$"
hello world
$ echo hello wordle | grep -P "^\w{5}\s\w{5}$"
$
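
GNU grep's -o switch prints only the part of the line that matched, which pairs well with -P. Here is a minimal sketch; the sample text is made up for illustration, and \K (a PCRE feature) discards everything matched before it:

```shell
# Sample input, invented for this example.
sample="hello world
hello wordle
start stops"

# ^ and $ anchor the pattern, so the whole line must be two 5-letter words.
matches=$(printf '%s\n' "$sample" | grep -P '^\w{5}\s\w{5}$')
printf '%s\n' "$matches"

# -o prints only the matched text; \K drops the first word and the space,
# leaving the second word of each matching line.
second_words=$(printf '%s\n' "$sample" | grep -oP '^\w{5}\s\K\w{5}$')
printf '%s\n' "$second_words"
```

The first command should print the first and third lines; the second should print just world and stops.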

You can extract elements from a string with an arbitrary delimiter with awk. For example, this takes the first and sixth elements from a line of CSV:

$ echo this,is,a,line,of,csv | awk -F',' '{print $1 " " $6}'
this csv
$
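
awk can also filter and do arithmetic on fields. A small sketch, assuming a made-up CSV of name, department and salary:

```shell
# Made-up CSV data: name,department,salary
csv="alice,engineering,100
bob,science,90
carol,engineering,110"

# -F',' sets the delimiter; $3 is the third field; END runs after the last line.
total=$(printf '%s\n' "$csv" | awk -F',' '{sum += $3} END {print sum}')
echo "$total"

# Print the first field of the rows whose second field is "engineering".
engineers=$(printf '%s\n' "$csv" | awk -F',' '$2 == "engineering" {print $1}')
echo "$engineers"
```

This prints 300, followed by alice and carol on separate lines.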

To prettify output, use column like this:

$ echo "hello, world. I love you!
goodbye cruelest world! So sad" | column -t
hello,   world.    I       love  you!
goodbye  cruelest  world!  So    sad

$

To print to standard out as well as to a file, use tee. For example:

$ echo And now for something completely different | tee /tmp/monty_python.txt
And now for something completely different
$ cat /tmp/monty_python.txt 
And now for something completely different
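
By default tee overwrites the file; the -a switch appends instead. A minimal sketch, using a temporary file from mktemp so nothing of yours gets clobbered:

```shell
log_file=$(mktemp)   # temporary file, created just for this example

echo "first line"  | tee "$log_file"      # overwrites the file
echo "second line" | tee -a "$log_file"   # -a appends rather than overwriting

line_count=$(wc -l < "$log_file")
echo "lines in file: $line_count"
```

Both lines appear on the terminal and both end up in the file.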

To capture everything typed and output to your terminal (very useful in training), use script:

$ script -f /tmp/my_keystrokes.log
Script started, file is /tmp/my_keystrokes.log
$ echo hello world
hello world
$ cat /tmp/my_keystrokes.log 
Script started on 2022-05-13 16:10:37+0100
$ echo hello world
hello world
$ cat /tmp/my_keystrokes.log 
$

Beware its recursive nature: displaying the log while you are still recording writes that output back into the log. Anyway, you stop it with an exit.

You can poll an output with watch. For example, this will keep an eye on the Netty threads in a Java application (IntelliJ as it happens):

watch "jstack `jps | grep Main | awk '{print \$1}'` | grep -A10 ^\\\"Netty\ Builtin\ Server"

Note that the $1 has been escaped and the quote mark within the quotes has been escaped with three backslashes. The switch -A10 just shows the 10 lines After each line we pattern matched. Backticks execute a command within a command. Of course, we can avoid escaping the $1 with:

$ watch "jstack $(jps | grep Main | awk '{print $1}') | grep -A10 ^\\\"Netty\ Builtin\ Server"

Note the $(...) syntax, which replaces the backticks.
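
One practical advantage of $(...) over backticks is that it nests without extra escaping. A minimal sketch, using a made-up path (the file does not need to exist):

```shell
# Nested command substitution with $(...): no extra escaping is needed.
parent_name=$(basename "$(dirname "/tmp/logs/app/server.log")")
echo "$parent_name"

# The same thing with backticks needs the inner backticks escaped.
parent_name_bt=`basename \`dirname "/tmp/logs/app/server.log"\``
echo "$parent_name_bt"
```

Both forms print app, but the first is much easier to read and extend.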

Resources

The command htop gives lots of information on a running system. Pressing P or M orders the processes by processor or memory usage respectively. VIRT and RES are your virtual memory (how much your application has asked for) and resident memory (how much it's actually using); the latter is normally the more important. The load average tells you how much work is backing up. Anything over the number of processors you have is suboptimal. How many processors do you have?

$ grep -c ^processor /proc/cpuinfo 
16
$
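
To compare the load average against the processor count without opening htop, something like the following sketch may help. It assumes a Linux box where /proc/loadavg is readable; the ok and overloaded labels are my own invention:

```shell
# Number of processors, counted from /proc/cpuinfo.
cores=$(grep -c ^processor /proc/cpuinfo)

# The first field of /proc/loadavg is the 1-minute load average.
load=$(cut -d' ' -f1 /proc/loadavg)

# awk does the comparison because the shell only handles integers.
status=$(awk -v load="$load" -v cores="$cores" \
  'BEGIN { if (load > cores) print "overloaded"; else print "ok" }')
echo "load=$load cores=$cores status=$status"
```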

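The top command also lists zombie tasks. These are processes that have finished but whose parent has not yet collected their exit status, so they linger in the process table. Tasks that are irretrievably stuck, often because of an I/O or hardware driver issue, show up in state D (uninterruptible sleep) instead.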

File handles can be seen using lsof. This can be useful to see, for example, where something is logging. For instance, guessing that IntelliJ logs to a file that has log in its name, we can run:

$ lsof -p 12610 2>/dev/null | grep log$
java    12610 henryp   12w      REG              259,3    7393039 41035613 /home/henryp/.cache/JetBrains/IntelliJIdea2021.2/log/idea.log
$

The 2>/dev/null redirects standard error (file descriptor 2) to /dev/null, a dark pit where it is ignored.
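
To see the redirection in isolation, here is a small sketch. The function noisy is hypothetical, written only to produce one line on each stream:

```shell
# A hypothetical function that writes one line to stdout and one to stderr.
noisy() {
  echo "useful output"
  echo "an error message" >&2
}

# 2>/dev/null discards stderr, so only stdout is captured.
only_stdout=$(noisy 2>/dev/null)
echo "$only_stdout"

# 2>&1 merges stderr into stdout, so both lines are captured.
both=$(noisy 2>&1)
echo "$both"
```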

To see what your firewall is dropping (useful when you've misconfigured a node), run:

$ sudo iptables -L -n -v -x

To see current network connections, run:

$ netstat -nap 

You might want to pipe that to grep LISTEN to see what processes are listening and on which port. Very useful if something already has control of port 8080.

For threads, you can see what's going on by accessing the /proc directory. While threads are easy to see in Java (jstack), Python is a little more opaque, not least because the Global Interpreter Lock (GIL) only really allows one thread to execute Python bytecode at a time (even though Python lets you create many threads). To utilise more processors, you must start separate, heavyweight processes (see the multiprocessing module). Anyway, find the process ID you're interested in and run something like:

$ sudo cat /proc/YOUR_PROCESS_ID/stack
[<0>] do_wait+0x1cb/0x230
[<0>] kernel_wait4+0x89/0x130
[<0>] __do_sys_wait4+0x95/0xa0
[<0>] __x64_sys_wait4+0x1e/0x20
[<0>] do_syscall_64+0x57/0x190
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

as everything is a file in Unix, right? This happens to be the stack in Python code that is time.sleeping. You'll see a similar looking stack for a Java thread that happens to be waiting.
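
The /proc directory also tells you which threads a process has. A minimal sketch, assuming Linux and using the current shell's own process ID ($$) so that no sudo is needed:

```shell
pid=$$   # the current shell; substitute the process ID you care about

# Each thread of a process appears as a directory under /proc/PID/task.
thread_ids=$(ls "/proc/$pid/task")
echo "$thread_ids"

# The Threads field of /proc/PID/status gives the count directly.
thread_count=$(awk '/^Threads:/ {print $2}' "/proc/$pid/status")
echo "threads: $thread_count"
```

A plain shell normally reports a single thread, whose ID equals the process ID; a JVM will show many more.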

If you want to pin work to certain cores, use something like taskset. For example, if I wanted to run COMMAND on all but one of my 16 cores, I run:

taskset 0xFFFE COMMAND

This is very useful if some data munging is so intense it's bringing my system down. Using this, at least one core is left free for the OS.
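
The mask is just a bit field with one bit per core, so it can be computed rather than worked out by hand. A sketch, assuming 16 cores and that core 0 is the one to leave free:

```shell
total_cores=16   # assumed here; use the count from /proc/cpuinfo on your box

# Set the lowest $total_cores bits, then clear bit 0 to exclude core 0.
mask=$(printf '0x%X' $(( ((1 << total_cores) - 1) & ~1 )))
echo "$mask"
```

This prints 0xFFFE. If you find masks hard to read, taskset -c 1-15 COMMAND expresses the same thing as a list of cores.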

Finally, vmstat gives you lots of information about the health of the box, such as blocks being read from and written to disk (bi/bo), the number of processes runnable (not necessarily running) and the number blocked (r/b), and the number of context switches per second (cs).