Terminals and Shells 2: Processes and their environments

This is part of a series covering 'glue' knowledge. This is stuff that may be difficult to find in any normal training material. If you're a new developer or programmer you will hopefully find it useful. I try to explain more of the implementation side if it helps understanding.

This particular part of the series is about terminal and shell usage. I focus mostly on Linux-based shells.

Processes and their environment

Environment variables

Like most programming languages, your shell has the ability to set and get the value of variables. The variables can have names like THING or other_thing_123, i.e. made up of alphanumerics and underscores (and not starting with a digit). The values are always just bytes, but these bytes are very often human readable strings.

One of these variables you will have is PATH, where the value is some colon separated paths. We will talk more about what this does soon. We can print it out with echo:

echo $PATH

When we want to get the value of a variable we prefix it with a dollar sign ($). If we do echo PATH we'd just print the word PATH to the screen, which is not very useful.

If we want to set a variable we use an equals sign:

COUNT=123
echo $COUNT

This setting of variables is not a command, it's part of the shell program itself.

It's important to know that shells are typically 'stringly' typed. The value of a variable is a string of characters/bytes. COUNT was set to 123 here, which is the string of characters "123", not the number. Some things may treat it as a number, but it's stored as a string.

You cannot have spaces on either side of the equals sign here. If the value has spaces you'll need to quote it, similar to when passing arguments:

GREETING="Hello, world!"

Another important point to realise is that these variables are substituted with their value by the shell. This is not something that executables handle (unlike how executables handle relative paths). When we run echo $PATH, the echo command never sees $PATH; the shell will have already substituted $PATH with its value before it gets passed to echo as an argument.
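One way to convince yourself of this is to start a program with no shell in between. The following Python sketch launches a tiny child program that prints its arguments, first directly and then through a shell. DEMO_VAR and the two helper functions are made up for illustration, and it assumes a POSIX system where Python's shell=True uses /bin/sh:

```python
import os
import shlex
import subprocess
import sys

# A tiny child program that prints the arguments it actually received.
CHILD = "import sys; print(sys.argv[1:])"

# DEMO_VAR is a made-up variable, added only to the child's environment.
ENV = {**os.environ, "DEMO_VAR": "expanded"}


def run_without_shell(arg):
    """Start the child directly; no shell ever looks at the argument."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD, arg],
        env=ENV, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def run_with_shell(arg):
    """Build a command line and let the shell interpret it first."""
    command = f"{shlex.quote(sys.executable)} -c {shlex.quote(CHILD)} {arg}"
    result = subprocess.run(
        command, shell=True,
        env=ENV, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


print(run_without_shell("$DEMO_VAR"))  # ['$DEMO_VAR']
print(run_with_shell("$DEMO_VAR"))     # ['expanded']
```

Without a shell the child receives the literal text $DEMO_VAR. The substitution only happens when a shell reads the command line first.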

In Bash, this substitution is dumb. If your variable contains spaces, this will cause commands to interpret the substitution as multiple arguments. It's effectively as if you typed the value yourself. This is the case even though you used quotes when setting the variable.

For example (note that the $ here just indicates the shell prompt, not something you type):

$ TRASH="Some giant file.pdf"
$ rm $TRASH
rm: Some: No such file or directory
rm: giant: No such file or directory
rm: file.pdf: No such file or directory

We forgot to quote $TRASH, which got expanded and passed like rm Some giant file.pdf. rm sees this as three separate arguments, rather than a single file to remove; rm "$TRASH" would have passed the whole value as one argument. This behaviour varies between shells (zsh, for example, does not split unquoted variables by default), but Bash is common enough that it's a good habit to quote things, especially when writing scripts.
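The splitting can be observed from the receiving program's side. This is a rough sketch, assuming /bin/sh exists and splits unquoted variables the way Bash does; the args_seen_by_child helper is hypothetical. It runs a child that reports how many arguments it was handed:

```python
import os
import shlex
import subprocess
import sys

# Child program: report how many arguments arrived and what they were.
CHILD = "import sys; print(len(sys.argv) - 1, sys.argv[1:])"


def args_seen_by_child(shell_words):
    """Run the child via /bin/sh with TRASH set; return what the child printed."""
    command = f"{shlex.quote(sys.executable)} -c {shlex.quote(CHILD)} {shell_words}"
    result = subprocess.run(
        ["/bin/sh", "-c", command],
        env={**os.environ, "TRASH": "Some giant file.pdf"},
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


print(args_seen_by_child('$TRASH'))    # 3 ['Some', 'giant', 'file.pdf']
print(args_seen_by_child('"$TRASH"'))  # 1 ['Some giant file.pdf']
```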

This is where the difference between single quotes ('hello') and double quotes ("world") comes in. Inside double quotes the shell still substitutes variables; inside single quotes it does not. So echo '$HOME' will print the literal string $HOME to the terminal.

The variable name can be wrapped in braces, which is useful when it might otherwise get mixed up with the rest of the string:

$ PREFIX="develop-"
$ echo "$PREFIXapi"

This prints just an empty line, because PREFIXapi is a valid variable name, and that variable is not set, so it is substituted with the empty string. To do this properly we can wrap the variable name with braces:

$ echo "${PREFIX}api"
develop-api
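The quoting and brace rules can be checked side by side by feeding one-line scripts to a shell and capturing exactly what comes out. This is only a sketch: sh_output is a hypothetical helper, and it assumes /bin/sh is available.

```python
import os
import subprocess


def sh_output(script, **variables):
    """Run a one-line script with /bin/sh and return exactly what it printed."""
    env = dict(os.environ)
    env.pop("PREFIXapi", None)  # make sure this name really is unset
    env.update(variables)
    result = subprocess.run(
        ["/bin/sh", "-c", script],
        env=env, capture_output=True, text=True, check=True,
    )
    return result.stdout


# Single quotes: the shell leaves the text alone.
print(repr(sh_output("echo '$PREFIX'", PREFIX="develop-")))       # '$PREFIX\n'
# Double quotes: the variable is substituted.
print(repr(sh_output('echo "$PREFIX"', PREFIX="develop-")))       # 'develop-\n'
# No braces: the shell looks up PREFIXapi, which is unset.
print(repr(sh_output('echo "$PREFIXapi"', PREFIX="develop-")))    # '\n'
# Braces mark where the variable name ends.
print(repr(sh_output('echo "${PREFIX}api"', PREFIX="develop-")))  # 'develop-api\n'
```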

Making your own command

Commands are not magic. They are just executable files somewhere on the filesystem. We're going to make our own executable, and make it a full blown command you can run like any other.

Making `echo`

I'm going to make a simple version of the echo command in a few languages I'm familiar with. This is to try and demonstrate that this is functionality present in any popular language. You don't have to fully understand what they're all doing.

JavaScript on Node (i.e. not the browser):

#!/usr/bin/env node

const process = require("process");
const args = process.argv.slice(2);
console.log(args.join(" "));

Python:

#!/usr/bin/env python3

import sys
args = sys.argv[1:]
print(' '.join(args))

C:

#include <stdio.h>

int main(int argc, char const** argv) {
    char const* sep = "";
    for (int i = 1; i < argc; i++) {
        printf("%s%s", sep, argv[i]);
        sep = " ";
    }
    puts("");
    return 0;
}

Go:

package main
import (
    "fmt"
    "os"
)

func main() {
    sep := ""
    for i := 1; i < len(os.Args); i++ {
        fmt.Print(sep, os.Args[i])
        sep = " "
    }
    fmt.Println()
}

Okay, so what are these doing? They're all accessing some structure called 'args' or 'argv' in one way or another. The name argv is conventionally read as 'argument vector' (with argc, in the C version, being the 'argument count'): the list of command line arguments the program was started with.

This argv corresponds to the command line arguments we pass in, for example we run our Node version like so:

node myecho.js one two three

This entire line is pretty much what argv contains, including node and myecho.js (Node stores these first two as full paths to the node executable and to the script). So the program skips the first two elements, then prints the remaining ones, just like echo would.
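Different runtimes trim argv differently, which is why the Node version slices off two elements while the Python version only slices off one. As a quick sketch (the script name show_argv.py and the full_argv helper are made up), we can write a throwaway Python script that prints its whole argv and run it:

```python
import os
import subprocess
import sys
import tempfile

# A throwaway script that prints its whole argument vector.
SCRIPT = "import sys\nprint(sys.argv)\n"


def full_argv(*args):
    """Write the script to a temporary file, run it, and return its output."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "show_argv.py")
        with open(path, "w") as f:
            f.write(SCRIPT)
        result = subprocess.run(
            [sys.executable, path, *args],
            capture_output=True, text=True, check=True,
        )
    return result.stdout.strip()


# The first element is the path to the script (a temporary path here),
# followed by 'one', 'two', 'three'. The interpreter itself is not included.
print(full_argv("one", "two", "three"))
```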

Making it feel like a command

You may have noticed that the way we run our Node myecho is more complicated than the normal echo command. We can fix that. First we have to talk about PATH.

PATH is a special environment variable. You can see yours by running

echo $PATH

On my machine this spits out something like this:

/Users/me/.docker/bin:/opt/homebrew/bin:/opt/homebrew/sbin:/usr/local/bin:/Users/me/.docker/bin:/Users/me/.cargo/bin

This is actually a list of paths to directories separated by colons (:). These paths are how the shell finds commands. When you type a command into the command line, the shell looks at the contents of the directories in each listed path, in order, and tries to execute the first file it finds with that name.
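The lookup itself is simple enough to sketch in a few lines of Python. This is an approximation rather than what any particular shell does: find_command is a hypothetical helper, it only returns files that are already executable, and real shells add caching and other details on top.

```python
import os


def find_command(name, path=None):
    """Return the first executable file called `name` on the PATH, or None."""
    if path is None:
        path = os.environ.get("PATH", "")
    for directory in path.split(os.pathsep):
        # An empty PATH entry conventionally means the current directory.
        candidate = os.path.join(directory or os.curdir, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


print(find_command("sh"))                      # wherever sh lives on your PATH
print(find_command("no-such-command-abc123"))  # None
```

Python's standard library ships a more careful version of the same idea as shutil.which.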

You can figure out where a given executable is stored using the which command:

$ which docker
/usr/local/bin/docker

This prints the path to the docker command, and its directory (/usr/local/bin) is listed in the PATH I printed earlier. You'll see bin a lot here; it's short for 'binary', a common name for executables.

Let's temporarily add our current directory to the PATH and see what happens. PWD is another of these environment variables that is set to your current working directory. You don't have to understand what this is doing just yet:

$ PATH="$PATH:$PWD"
$ myecho.js
zsh: permission denied: myecho.js

This half worked. The shell seems to have found the file we wrote, but gave a permission denied error. This is because the file is not considered executable. On Linux (and macOS) a file must be explicitly marked as executable before we can execute it. We can do this with the chmod ("change mode") command. We make it executable by 'adding' the x:

chmod +x ./myecho.js
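Under the hood, 'executable' is just a few bits in the file's mode. As a rough Python sketch of what adding the x does (make_executable is a hypothetical helper, and unlike a bare chmod +x it ignores your umask and always sets the bit for user, group and others):

```python
import os
import stat
import tempfile


def make_executable(path):
    """Add the execute bit for user, group and others, keeping the other bits."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)


with tempfile.TemporaryDirectory() as tmp:
    script = os.path.join(tmp, "myecho.js")
    with open(script, "w") as f:
        f.write("#!/usr/bin/env node\n")
    before = stat.filemode(os.stat(script).st_mode)
    make_executable(script)
    after = stat.filemode(os.stat(script).st_mode)

# Typically something like: -rw-r--r-- -> -rwxr-xr-x
print(before, "->", after)
```

The same mode string is what ls -l shows in its first column.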

Now we can run our echo command like an actual command:

$ myecho.js one two three
one two three

Shebang/hashbang

There's a missing puzzle piece to the above (well, several). One is how the shell knew that it should use node to execute myecho.js. Linux generally doesn't care about file name extensions (.js), so it didn't use that. It certainly doesn't inspect the entire file contents and attempt to figure out that it is JavaScript code.

The magic is actually the first line of our script:

#!/usr/bin/env node

This line is called a 'shebang' or a 'hashbang' after the hash # and bang ! that the line starts with. When an executable file is called, the kernel checks for these characters. If they are present, it executes the command after the shebang, adding the current file path as an argument. In this case we use yet another command called env, which finds the given command and executes it. This means when we execute myecho.js, we're actually executing something like /usr/bin/env node myecho.js.
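To make the rewriting concrete, here is a small Python sketch of roughly what happens to the command line. It is a simplification of the real kernel logic (which has length limits and other details), shebang_command is a hypothetical helper, and it follows the Linux behaviour of treating everything after the interpreter path as a single optional argument:

```python
import os
import tempfile


def shebang_command(path):
    """Return the command a '#!' script is effectively run as, or None."""
    with open(path, "rb") as f:
        first_line = f.readline()
    if not first_line.startswith(b"#!"):
        return None
    rest = first_line[2:].decode(errors="replace").strip()
    # Split at most once: the interpreter path, then the remainder of the
    # line as one optional argument.
    parts = rest.split(maxsplit=1)
    if not parts:
        return None
    # The path of the script itself is appended as the final argument.
    return parts + [path]


with tempfile.TemporaryDirectory() as tmp:
    script = os.path.join(tmp, "myecho.js")
    with open(script, "w") as f:
        f.write("#!/usr/bin/env node\nconsole.log('hi');\n")
    command = shebang_command(script)

print(command[:2], os.path.basename(command[2]))  # ['/usr/bin/env', 'node'] myecho.js
```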

You could instead give it the exact path to node (which you can find with which), like this:

#!/usr/local/bin/node

This would work just as well, but node would have to be in exactly this place, which may not be true for everyone's installation of Node. The env command is smarter and searches PATH for the executable instead, so is more likely to work on other people's machines (a useful property in scripts!).

Back to environment variables

Your shell has environment variables, but this isn't really a shell feature. The shell just exposes the environment variables to the user. Every process has its own environment. These environments are not shared between processes. Every process can have completely different environment variables.

We can see this with a slightly different JavaScript/Node program:

const process = require("process");
console.log(process.env.PATH);

This will print out the PATH environment variable. These variables can also be modified by the program like we can do in our shell. This can be used to change the behaviour of programs. You can set environment variables in your shell, then invoke your program.
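The same idea in Python, as a small sketch of 'every process has its own environment'. The DEMO_MODE and DEMO_CHILD_ONLY names are made up, and the example assumes neither is already set. The program changes its own environment, then starts a child that changes its own copy, and the child's change never reaches the parent:

```python
import os
import subprocess
import sys

# This child sets a variable in its own environment, prints it, then exits.
CHILD = (
    'import os; os.environ["DEMO_CHILD_ONLY"] = "set in child"; '
    'print(os.environ["DEMO_CHILD_ONLY"])'
)

# Reading and writing our own environment.
os.environ["DEMO_MODE"] = "verbose"
print(os.environ["DEMO_MODE"])  # verbose

result = subprocess.run(
    [sys.executable, "-c", CHILD],
    capture_output=True, text=True, check=True,
)
child_value = result.stdout.strip()
parent_value = os.environ.get("DEMO_CHILD_ONLY")

print(child_value)   # set in child
print(parent_value)  # None
```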

Let's say we have an application with a database. In production we use a big database cluster somewhere, but for local development we use a locally running database. Here's a pretend program:

const process = require("process");

if (process.env.NODE_ENV === "development") {
  console.log("Dev mode, use local database.");
} else {
  console.log("Not dev mode, use production database.");
}

So if we set our NODE_ENV variable we can control which mode the application is in:

$ NODE_ENV=development
$ node app.js
Not dev mode, use production database.

Well, that didn't work...

Export and inheritance

It turns out a shell doesn't automatically 'pass on' all of its variables when it runs a command. In order to make sure that this variable is passed on to our command we need to export it:

$ export NODE_ENV
$ NODE_ENV=development
$ node app.js
Dev mode, use local database.

This export NODE_ENV tells the shell that this variable should be inherited by any child process. A child process is any process that our shell produces. More generally, for any given process (the parent), any process it produces is a child process of it.

We don't have to export the variable after every change. We don't even need to have the variable defined before we export it. This exporting is a feature of the shell, not processes in general. Child processes typically inherit the entire environment of the parent, unless specified not to by the parent.
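Outside the shell there is no export step. As a sketch of this in Python (DEMO_SETTING and child_sees are made-up names), a child started with subprocess inherits the parent's whole environment by default, and the parent can instead hand over a different one:

```python
import os
import subprocess
import sys

# Child program: print the DEMO_SETTING it was given, or None.
CHILD = 'import os; print(os.environ.get("DEMO_SETTING"))'


def child_sees(env=None):
    """Start a child process; env=None means 'inherit my environment'."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        env=env, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


os.environ["DEMO_SETTING"] = "from parent"

# Inherited by default, no export needed.
print(child_sees())  # from parent

# The parent can pass a different environment instead.
trimmed = {k: v for k, v in os.environ.items() if k != "DEMO_SETTING"}
print(child_sees(env=trimmed))  # None
```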

You can also define environment variables directly before a command in order to have those variables set for just that command:

$ NODE_ENV=development node app.js
Dev mode, use local database.

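This syntax is also why values containing spaces must be quoted when setting a variable. Without the quotes, GREETING=Hello world would be interpreted like the above: run the command world with GREETING set to Hello.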

You can have more by just separating them with spaces like A=1 B=2 node app.js. It's also worth noting that these variables don't have to be capitalised, lowercase is fine. But node_env is a different variable to NODE_ENV. They are case sensitive.
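The three forms above (plain assignment, export, and the one-off prefix) can be compared by checking what a child process actually receives in each case. This is a sketch that assumes /bin/sh behaves like the shells discussed here; the child_sees helper is hypothetical.

```python
import os
import shlex
import subprocess
import sys

# Child program: print the NODE_ENV it was given, or None.
CHILD = 'import os; print(os.environ.get("NODE_ENV"))'
RUN_CHILD = f"{shlex.quote(sys.executable)} -c {shlex.quote(CHILD)}"


def child_sees(shell_script):
    """Run shell_script in /bin/sh, starting without NODE_ENV set."""
    env = dict(os.environ)
    env.pop("NODE_ENV", None)
    result = subprocess.run(
        ["/bin/sh", "-c", shell_script],
        env=env, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


# Plain assignment: a shell variable only, not passed on.
print(child_sees(f"NODE_ENV=development; {RUN_CHILD}"))                   # None
# Exported (even before it is set): passed on to the child.
print(child_sees(f"export NODE_ENV; NODE_ENV=development; {RUN_CHILD}"))  # development
# One-off prefix: set for just this command.
print(child_sees(f"NODE_ENV=development {RUN_CHILD}"))                    # development
```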

You can print your entire shell environment with the printenv command.

Summary

We've learned a bit more about how the shell handles environment variables, and a bit about processes more generally in Linux. This knowledge should be useful throughout the rest of your career, whether it's building AWS Lambda Functions, writing React, or hacking at the kernel.