Owen Gage

Chasing a newline

What's the ASCII representation of a newline \n character? We can write a simple program to print out some text with a newline, and then look at the binary output:

print("hello\nworld")

If we print the hexadecimal for the output:

% python3 hello.py > output.txt
% xxd output.txt 
00000000: 6865 6c6c 6f0a 776f 726c 640a            hello.world.

Python's print also adds a newline character at the end, so we can easily see that the hexadecimal for newline is 0a, ie 10 in decimal.

We also could have just looked it up. But the question I want to answer is, where is the 10? It's no where in our source code, but obviously the final program 'knows' that when you write \n it should produce that binary value.

If it's not in our Python code, where else could it be? We run the Python code using the Python executable, which is obviously just another program, so maybe the newline-to-10 knowledge exists there? We can look at the source code for CPython which is generally the version of Python you'll find everywhere. It is written in C.

CPython

Quickly trying to follow where Python interprets string literals we can follow a chain of functions calls:

Here's a cut down version of that last function to show just the newline relevant parts:

PyObject *
_PyUnicode_DecodeUnicodeEscapeInternal2(
    const char *s,
    Py_ssize_t size,
    const char *errors,
    Py_ssize_t *consumed,
    int *first_invalid_escape_char,
    const char **first_invalid_escape_ptr)
{
    // snip
#define WRITE_ASCII_CHAR(ch)                                              \
        do {                                                              \
            assert(ch <= 127);                                            \
            assert(writer.pos < writer.size);                             \
            PyUnicode_WRITE(writer.kind, writer.data, writer.pos++, ch);  \
        } while(0)

    switch (c) {
    // snip
    case 'n': WRITE_ASCII_CHAR('\n'); continue;
    // snip
    }

    // snip
    return NULL;
}

When Python sees the \n, as in two separate characters in the string it is parsing, it simply writes the byte \n. We find ourselves in exactly the same situation as with our Python source code. There is no '10' here. The C language also has \n as an escape, and the CPython source code simply uses that.

This means the information for a newline being 10 doesn't exist in our source code, and it doesn't exist in the source code for the Python interpretter itself. But it must exist in the python executable in order for our program to work. The only place this information could have gotten into the executable is when the interpretter itself was compiled. The interpretter is written in C, so lets look at a common C compiler, GCC.

GCC

If you poke around a bit, you'll find that GCC is written in a kind of amalgamation of C and C++! This seems very strange if you haven't come across it before. How do you compile GCC if the very thing you are trying to get is a C compiler? The answer is another C compiler! This is a process known as bootstrapping. That C compiler is very likely to be GCC as well, but an earlier version.

There's no reason a C compiler cannot be written in C. The only issue being having a C compiler in the first place. Generally bootstrapping processes will use some earlier compiler to compile the current one, and then compile it again with the just-produced compiler. The means you eventually compile your source code with the very compiler it encodes, and this all works!

The pattern will typically be something like

  1. Write some assembly code for an assembler.
  2. Hand translate your assembly code to machine code.
  3. Use your new assembler to assemble itself.
  4. Write a simple C compiler in this assembly. You don't bother with optimizations or speed. Assemble it.
  5. Write your complex C compiler, whatever optimizations and other smarts you want, compile it!
  6. Compile your fancy compiler code with your fancy compiler. You now have an optimized compiler for other C programs.

The last step tends to occur over and over again, meaning you have a very long chain of compilers building its next generation. Other language compilers generally follow a similar pattern.

Let's chase our newline character again, but this time in GCC's source code. If we start in the lexer and follow the trail we end up at the convert_escape function. Here's a trimmed version of that:

static const uchar *
convert_escape (cpp_reader *pfile, const uchar *from, const uchar *limit,
        struct _cpp_strbuf *tbuf, struct cset_converter cvt,
        cpp_string_location_reader *loc_reader,
        cpp_substring_ranges *ranges, bool uneval)
{
  /* Values of \a \b \e \f \n \r \t \v respectively.  */
#if HOST_CHARSET == HOST_CHARSET_ASCII
  static const uchar charconsts[] = {  7,  8, 27, 12, 10, 13,  9, 11 };
#elif HOST_CHARSET == HOST_CHARSET_EBCDIC
  static const uchar charconsts[] = { 47, 22, 39, 12, 21, 13,  5, 11 };
#else
#error "unknown host character set"
#endif

  uchar c;

  /* Record the location of the backslash.  */
  source_range char_range;
  if (loc_reader)
    char_range = loc_reader->get_next ();

  c = *from;
  switch (c)
    {
    // snip
    case 'n': c = charconsts[4];  break;
    // snip
    }

    // snip

  return from + 1;
}

We have 10! When \n is seen in a string literal by this compiler, it looks up the numeric value in the HOST_CHARSET, which here for ASCII is the value 10. If it wasn't for GCC being written to have different character sets you would probably have just seen \n here again.

So for this chain of programs, the knowledge of \n being 10 isn't in our source code, it wasn't even in the interpretter source code, but it was finally in the interpretter's compiler source code. This trail ends quite suddenly, but it could have gone much further. In theory this knowledge could have been encoded in long-lost assembly or even handmade machine code, carried forward purely through the produced executables.

Trusting trust

This is more general than newlines, every other ASCII character has a similar 'hidden knowledge', and arbitrary knowledge or behaviour could be hidden. This idea was part a speech by Ken Thompson, titled 'Reflections on Trusting Trust'. It's not clear what the best source to link to is, so feel free to search for the title.

There are efforts to make this bootstrapping process more transparent and therefore trustable, such as bootstrappable which also has some great further reading.