Terminals and Shells 2: Processes and their environments
This is part of a series covering 'glue' knowledge. This is stuff that may be difficult to find in any normal training material. If you're a new developer or programmer you will hopefully find it useful. I try to explain more of the implementation side if it helps understanding.
This particular part of the series is about terminal and shell usage. I focus mostly on Linux based shells.
Processes and their environment
Environment variables
Like most programming languages, your shell has the ability to set and get the value of variables. The variables can have names like THING or other_thing_123, i.e. containing alphanumerics and underscores (though a name cannot start with a digit). The values are always just bytes, but these bytes are very often human-readable strings.
One of the variables you will have is PATH, whose value is a list of colon-separated paths. We will talk more about what this does soon. We can print it out with echo:
echo $PATH
When we want to get the value of a variable we prefix its name with a dollar sign ($). If we ran echo PATH we'd just print the word PATH to the screen, which is not very useful.
If we want to set a variable we use an equals sign:
COUNT=123
echo $COUNT
This setting of variables is not a command, it's part of the shell program itself.
It's important to know that shells are typically 'stringly' typed. The value of a variable is a string of characters/bytes. COUNT was set to 123 here, which is the string of characters, not the number. Some things may treat it as a number, but it's stored as a string.
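The same 'everything is a string' rule shows up when a program reads its environment. Here is a rough Python sketch of that idea; COUNT is just the example name from above, and the script sets it itself so that it runs on its own:

```python
import os

# Environment values are strings, so COUNT is stored as the text "123".
os.environ["COUNT"] = "123"

count_text = os.environ["COUNT"]   # still the string "123"
count_number = int(count_text)     # convert explicitly when a number is needed

# Python's os.environ refuses to store anything that is not a string.
try:
    os.environ["COUNT"] = 123
    rejected = False
except TypeError:
    rejected = True

print(repr(count_text), count_number + 1, rejected)
```

The program has to decide for itself when the text should be treated as a number.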
You cannot have spaces either side of the equals sign here. If the value has spaces you'll need to quote it, similar to when passing arguments:
GREETING="Hello, world!"
Another important point to realise is that these variables are substituted with their values by the shell. This is not something that executables handle (unlike how executables handle relative paths). When we run echo $PATH, the echo command never sees $PATH: the shell will have already substituted $PATH with its value before passing it to echo as an argument.
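One way to convince yourself that the substitution belongs to the shell is to start a program without a shell in between. The Python sketch below does that with a small throwaway child program (child_code is invented for this demo); it assumes nothing beyond the Python standard library:

```python
import subprocess
import sys

# A tiny child program that prints its first argument exactly as received.
child_code = "import sys; print(sys.argv[1])"

# No shell is involved here, so nothing expands $PATH before the child starts.
result = subprocess.run(
    [sys.executable, "-c", child_code, "$PATH"],
    capture_output=True,
    text=True,
    check=True,
)
received = result.stdout.strip()
print(received)  # the literal text $PATH
```

The child receives the five characters $PATH, because no shell was there to replace them.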
In Bash, this substitution is dumb. If your variable contains spaces, this will cause commands to interpret the substitution as multiple arguments. It's effectively as if you typed the value yourself. This is the case even though you used quotes when setting the variable.
For example (note that the $ here just indicates the shell prompt, not something you type):
$ TRASH="Some giant file.pdf"
$ rm $TRASH
rm: Some: No such file or directory
rm: giant: No such file or directory
rm: file.pdf: No such file or directory
We forgot to quote $TRASH, which got expanded and passed like rm Some giant file.pdf. The rm command sees this as three separate arguments, rather than a single file to remove. Writing rm "$TRASH" instead keeps the value together as one argument. This behaviour varies between shells, but Bash is common enough that it's a good habit to quote things, especially when writing scripts.
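To see the splitting in isolation, here is a small Python sketch. It uses the standard shlex module, which only approximates how a shell splits a command line into arguments, so treat it as an illustration rather than as Bash itself:

```python
import shlex

trash = "Some giant file.pdf"

# Unquoted: the value is pasted into the command line and split on spaces.
unquoted_args = shlex.split("rm " + trash)

# Quoted: shlex.quote wraps the value so it survives as one argument.
quoted_args = shlex.split("rm " + shlex.quote(trash))

print(unquoted_args)  # ['rm', 'Some', 'giant', 'file.pdf']
print(quoted_args)    # ['rm', 'Some giant file.pdf']
```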
This is where the difference between single quotes ('hello') and double quotes ("world") comes in. Variables inside double quotes are substituted by the shell, but variables inside single quotes are not. So echo '$HOME' will print the literal string $HOME to the terminal.
The variable name can be wrapped in braces, which is useful when it might otherwise get mixed up with the rest of the string:
$ PREFIX="develop-"
$ echo "$PREFIXapi"
This prints nothing (just an empty line), because PREFIXapi is a valid variable name, and that variable is not set, so it is substituted with the empty string. To do this properly we can wrap the variable name with braces:
$ echo "${PREFIX}api"
develop-api
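The reason the braces matter is that a bare variable name greedily takes every following letter, digit and underscore. The Python sketch below is a deliberately simplified, hypothetical expander (the expand function and VARIABLE pattern are made up for this illustration and cover far less than a real shell does):

```python
import re

# $NAME or ${NAME}; a bare name greedily takes letters, digits and underscores.
VARIABLE = re.compile(r"\$(?:\{([A-Za-z_][A-Za-z0-9_]*)\}|([A-Za-z_][A-Za-z0-9_]*))")


def expand(text, variables):
    """Replace each variable reference, using "" for unset names."""
    def lookup(match):
        name = match.group(1) or match.group(2)
        return variables.get(name, "")
    return VARIABLE.sub(lookup, text)


variables = {"PREFIX": "develop-"}
without_braces = expand("$PREFIXapi", variables)
with_braces = expand("${PREFIX}api", variables)
print(repr(without_braces))  # ''
print(repr(with_braces))     # 'develop-api'
```

Without braces the expander looks up PREFIXapi, which is unset; with braces the name stops at the closing brace.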
Making your own command
Commands are not magic. They are just executable files somewhere on the filesystem. We're going to make our own executable, and make it a full blown command you can run like any other.
Making echo
I'm going to make a simple version of the echo command in a few languages I'm familiar with. This is to demonstrate that this functionality is present in any popular language. You don't have to fully understand what they're all doing.
JavaScript on Node (ie not the browser):
#!/usr/bin/env node
const process = require("process");
const args = process.argv.slice(2);
console.log(args.join(" "));
Python:
#!/usr/bin/env python3
import sys
args = sys.argv[1:]
print(' '.join(args))
C:
#include <stdio.h>
int main(int argc, char const** argv) {
    char const* sep = "";
    for (int i = 1; i < argc; i++) {
        printf("%s%s", sep, argv[i]);
        sep = " ";
    }
    puts("");
    return 0;
}
Go:
package main
import (
	"fmt"
	"os"
)
func main() {
	sep := ""
	for i := 1; i < len(os.Args); i++ {
		fmt.Print(sep, os.Args[i])
		sep = " "
	}
	fmt.Println()
}
Okay, so what are these doing? They're all accessing some structure called 'args' or 'argv' in one way or another. This is short for the command line 'arguments vector', or at least that's a common thought, the 'vector' part is hard to verify.
This argv corresponds to the command line arguments we pass in. For example, we run our Node version like so:
node myecho.js one two three
This entire line is pretty much exactly what argv contains, including node and myecho.js (Node stores these two as full paths). So the Node version skips the first two elements, then prints the remaining ones, just like echo would.
Making it feel like a command
You may have noticed that the way we run our Node myecho is more complicated than the normal echo command. We can fix that. First we have to talk about PATH.
PATH is a special environment variable. You can see yours by running:
echo $PATH
On my machine this spits out something like this:
/Users/me/.docker/bin:/opt/homebrew/bin:/opt/homebrew/sbin:/usr/local/bin:/Users/me/.docker/bin:/Users/me/.cargo/bin
This is actually a list of paths to directories separated by colons (:). These paths are how the shell finds commands. When you type a command into the command line, the shell looks at the contents of the directories in each listed path, in order, and tries to execute the first file it finds with that name.
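The search described above can be sketched in a few lines of Python. This is a simplified illustration, not how any particular shell implements it: find_command is a hypothetical helper, it treats any file with an execute bit as runnable, and it ignores empty PATH entries. Python's own shutil.which is a more complete version of the same idea.

```python
import os
import stat
import tempfile


def find_command(name, path_value):
    """Return the first executable file called name in the listed directories."""
    for directory in path_value.split(os.pathsep):
        if not directory:
            continue  # simplification: ignore empty entries
        candidate = os.path.join(directory, name)
        # Simplification: "executable" here means any execute bit is set.
        if os.path.isfile(candidate) and os.stat(candidate).st_mode & 0o111:
            return candidate
    return None


# Demo: two throwaway directories that both contain a file called "tool".
first_dir = tempfile.mkdtemp()
second_dir = tempfile.mkdtemp()
for directory in (first_dir, second_dir):
    path = os.path.join(directory, "tool")
    with open(path, "w") as handle:
        handle.write("#!/bin/sh\necho hello\n")
    os.chmod(path, stat.S_IRWXU)  # mark it executable for the owner

demo_path = first_dir + os.pathsep + second_dir
found = find_command("tool", demo_path)
missing = find_command("no-such-tool", demo_path)
print(found)
```

Both directories contain tool, but the one listed first wins, which is why the order of entries in PATH matters.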
You can figure out where a given executable is stored using the which command:
$ which docker
/usr/local/bin/docker
You can see this prints out the path to the docker command, and that its directory is listed in the PATH I printed earlier. You'll see bin a lot here. It means 'binary', a common name for executables.
Let's temporarily add our current directory to the PATH and see what happens. PWD is another of these environment variables, and it is set to your current working directory. You don't have to understand what this is doing just yet:
$ PATH="$PATH:$PWD"
$ myecho.js
zsh: permission denied: myecho.js
This half worked. The shell seems to have found the file we wrote, but gave us a permission denied error. This is because the file is not considered executable. On Linux (and macOS) a file must specifically be marked as executable in order for us to execute it. We can do this with the chmod ("change mode") command. We make it executable by 'adding' the x:
chmod +x ./myecho.js
Now we can run our echo command like an actual command:
$ myecho.js one two three
one two three
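Under the hood, 'adding the x' means switching on the execute bits in the file's mode. The Python sketch below shows roughly what chmod +x does, on a throwaway file rather than your real script. It is an approximation: the real command also takes your umask into account, and the 0o644 starting mode is just an assumption for the demo.

```python
import os
import stat
import tempfile

# Create a throwaway script that starts out readable and writable only.
handle, script_path = tempfile.mkstemp(suffix=".js")
os.close(handle)
os.chmod(script_path, 0o644)

EXECUTE_BITS = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH  # the three 'x' bits

mode_before = stat.S_IMODE(os.stat(script_path).st_mode)

# Roughly what 'chmod +x' does: keep the existing bits and add execute.
os.chmod(script_path, mode_before | EXECUTE_BITS)
mode_after = stat.S_IMODE(os.stat(script_path).st_mode)

print(oct(mode_before), oct(mode_after))  # 0o644 0o755
os.remove(script_path)
```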
Shebang/hashbang
There's a missing puzzle piece to the above (well, several). One is how the shell knew that it should use node to execute myecho.js. Linux generally doesn't care about file name extensions (.js), so it didn't use that. It certainly doesn't inspect the entire file contents and attempt to figure out that it is JavaScript code.
The magic is actually the first line of our script:
#!/usr/bin/env node
This line is called a 'shebang' or a 'hashbang' after the hash # and bang ! that the line starts with. When an executable file is run, the kernel checks for these characters. If they are present, it executes the command after the shebang, adding the current file path as an argument. In this case we use yet another command called env, which finds the given command and executes it. This means when we execute myecho.js, we're actually executing something like /usr/bin/env node myecho.js.
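To make that rewriting concrete, here is a Python sketch that reads a file's first line and builds the command the shebang implies. shebang_command is a hypothetical helper written for this article, and it only loosely mirrors what the Linux kernel does (for instance, it treats everything after the interpreter path as a single optional argument and ignores length limits):

```python
import os
import tempfile


def shebang_command(script_path):
    """Return the command list a shebang line implies, or None without one."""
    with open(script_path, "rb") as handle:
        first_line = handle.readline().decode("utf-8", errors="replace").strip()
    if not first_line.startswith("#!"):
        return None
    # Interpreter path first, then everything after it as one optional argument.
    parts = first_line[2:].strip().split(None, 1)
    if not parts:
        return None
    return parts + [script_path]


# Demo with a throwaway file that starts like our script does.
handle, demo_script = tempfile.mkstemp(suffix=".js")
with os.fdopen(handle, "w") as script:
    script.write('#!/usr/bin/env node\nconsole.log("hi");\n')

command = shebang_command(demo_script)
print(command)
os.remove(demo_script)
```

The printed list is /usr/bin/env, node, and then the path to the script itself.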
You could instead give it the exact path to node (which you can find with which), like this:
#!/usr/local/bin/node
This would work just as well, but node would have to be in exactly this place, which may not be true for everyone's installation of Node. The env command is smarter and searches PATH for the executable instead, so it is more likely to work on other people's machines (a useful property in scripts!).
Back to environment variables
Your shell has environment variables, but this isn't really a shell feature. The shell just exposes the environment variables to the user. Every process has its own environment. These environments are not shared between processes. Every process can have completely different environment variables.
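A quick way to see that environments are per-process is to let a child process change a variable and then check the parent. In this Python sketch DEMO_VAR is a made-up name and the child is a small throwaway program:

```python
import os
import subprocess
import sys

os.environ["DEMO_VAR"] = "original"  # DEMO_VAR is a made-up name for this demo

# The child changes its own copy of the variable and reports what it sees.
child_code = (
    "import os\n"
    "os.environ['DEMO_VAR'] = 'changed by child'\n"
    "print(os.environ['DEMO_VAR'])\n"
)
result = subprocess.run(
    [sys.executable, "-c", child_code],
    capture_output=True,
    text=True,
    check=True,
)
child_value = result.stdout.strip()
parent_value = os.environ["DEMO_VAR"]
print(child_value)   # changed by child
print(parent_value)  # original
```

The child started with a copy of the parent's value, and its change never travelled back.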
We can see this with a slightly different JavaScript/Node program:
const process = require("process");
console.log(process.env.PATH);
This will print out the PATH environment variable. A program can also modify these variables, just like we can in our shell. Environment variables can be used to change the behaviour of programs: you set them in your shell, then invoke your program.
Let's say we have an application with a database. In production we use a big database cluster somewhere, but for local development we use a locally running database. Here's a pretend program:
const process = require("process");
if (process.env.NODE_ENV === "development") {
    console.log("Dev mode, use local database.");
} else {
    console.log("Not dev mode, use production database.");
}
So if we set our NODE_ENV variable we can control which mode the application is in:
$ NODE_ENV=development
$ node app.js
Not dev mode, use production database.
Well, that didn't work...
Export and inheritance
It turns out a shell doesn't automatically 'pass on' all of its variables when it runs a command. To make sure that this variable is passed on to our command we need to export it:
$ export NODE_ENV
$ NODE_ENV=development
$ node app.js
Dev mode, use local database.
This export NODE_ENV tells the shell that this variable should be inherited by any child process. A child process is any process that our shell starts. More generally, for any given process (the parent), any process it starts is a child process of it.
We don't have to export the variable after every change. We don't even need to have the variable defined before we export it. This exporting is a feature of the shell, not processes in general. Child processes typically inherit the entire environment of the parent, unless specified not to by the parent.
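A loose analogy in Python may help here. An ordinary Python variable behaves a bit like an unexported shell variable: it exists only inside the parent. Putting the value into os.environ is a bit like exporting it, because child processes inherit that by default. This is only an analogy under those assumptions, with a throwaway child program:

```python
import os
import subprocess
import sys

# The child reports NODE_ENV, or "unset" if it received no such variable.
child_code = "import os; print(os.environ.get('NODE_ENV', 'unset'))"


def run_child():
    result = subprocess.run(
        [sys.executable, "-c", child_code],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


os.environ.pop("NODE_ENV", None)   # start from a clean slate

NODE_ENV = "development"           # a plain variable: the child never sees it
seen_before = run_child()

os.environ["NODE_ENV"] = NODE_ENV  # now part of the environment, like export
seen_after = run_child()

print(seen_before, seen_after)  # unset development
```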
You can also define environment variables directly before a command in order to have those variables set for just that command:
$ NODE_ENV=development node app.js
Dev mode, use local database.
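This syntax is why you have to quote values that contain spaces when setting variables. Without the quotes, GREETING=Hello, world! would be interpreted like the above: the shell would set GREETING to Hello, and then try to run world! as a command.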
You can set more than one by separating them with spaces, like A=1 B=2 node app.js. It's also worth noting that these variables don't have to be capitalised; lowercase is fine. But node_env is a different variable to NODE_ENV. They are case sensitive.
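Setting a variable for just one command has a close equivalent when one program starts another. In this Python sketch the parent copies its own environment, adds NODE_ENV to the copy, and hands that copy to a single child (again a throwaway program, not the app.js from above):

```python
import os
import subprocess
import sys

child_code = "import os; print(os.environ.get('NODE_ENV', 'unset'))"

os.environ.pop("NODE_ENV", None)  # make sure the parent has no NODE_ENV

# Copy the parent's environment and add the variable for this one child only.
child_env = dict(os.environ, NODE_ENV="development")
result = subprocess.run(
    [sys.executable, "-c", child_code],
    env=child_env,
    capture_output=True,
    text=True,
    check=True,
)
child_saw = result.stdout.strip()
parent_has_it = "NODE_ENV" in os.environ
print(child_saw, parent_has_it)  # development False
```

The child sees the variable, while the parent's own environment is left untouched.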
You can print your entire shell environment with the printenv command.
Summary
We've learned a bit more about the shell with how it handles environment variables, and learned a bit about processes more generally in Linux. This knowledge should be useful throughout the rest of your career, whether it's building AWS Lambda Functions, writing React, or hacking at the kernel.