Applying "Programming Without Pointers" to an mbox indexer using Zig

Programming without pointers

I must have watched Programming without pointers (or PWP) roughly 10 times by now. In garbage collected languages where memory allocations are mostly invisible, it can be so easy to forget that memory allocations are often poison for high performance software.

In the video, Zig’s creator Andrew Kelley shows us how to structure our programs so that they only need to do minimal memory allocations, reducing the number of total pointers that we need to manage. Memory allocations are expensive, so reducing them can improve the overall complexity of our programs. I think that the pattern of PWP can make the code a little bit more complex if you’re not used to writing code that way. But once you’re used to it, I think that complexity can go away as you’ll recognise the pattern when you see it.

Here’s how I applied it to a real problem that I had.

Migrating email providers

Recently I migrated from Hey to Gmail. Long story short: Hey doesn’t integrate with anything, and there is no API. Gmail integrates with everything, and there is a complete API.

Mbox format

Hey exports email archives in mbox format. It seems like an obsolete format, but interestingly Google Takeout also uses it.

An mbox file looks like this:

From alice@example.com Mon Jan  6 09:00:00 2025
Message-ID: <001@example.com>
From: Alice <alice@example.com>
To: Bob <bob@example.com>
Subject: Weekly sync notes
Date: Mon, 6 Jan 2025 09:00:00 +0000

Hi Bob,

Here are the notes from our weekly sync.

- Review Q1 roadmap
- Assign sprint tasks

Cheers,
Alice

From bob@example.com Mon Jan  6 14:30:00 2025
Message-ID: <002@example.com>
From: Bob <bob@example.com>
To: Alice <alice@example.com>
Subject: Re: Weekly sync notes
Date: Mon, 6 Jan 2025 14:30:00 +0000

Thanks Alice, looks good. I'll update the board.

Bob

As you can see, it is plain text. Each email message in an mbox file is denoted by a line that starts with the text From . The end of an email message is either the start of a new email message or EOF.

Anatomy of a valid email message

From <email> <date>
<headers>
<blank line>
<body>

It’s very simple. The From line, followed by headers, then the body of the email. The body of the email can contain attachments too, but for my purposes this is not important—it’s just a string.

Processing my mbox export

The export from Hey was a bit over 5GB. I used GYB to import my initial export into Gmail. This took a very long time (around 8 hours). While I had updated my DNS records so that new mail would come into Gmail, there were still some emails coming into Hey. Presumably this is due to DNS propagation taking some time. When I was ready to do the final import, I was concerned that importing a new export would result in a bunch of duplicate messages in Gmail. There were a few issues on the GYB GitHub about this. Not to mention it would mean another 8 hours of waiting.

To resolve this, I figured that since mbox is just plain text, it would be fairly simple to write a tool using Zig that, given two mbox files, outputs a third mbox file that is the difference of the inputs. This is how mbox-diff was born.

Initially I used a fairly brute force approach to index each file before calculating the diff:

const MboxIndex = struct {
    // ...

    const Location = struct {
        start: u64,
        end: u64,
    };

    messages: std.StringHashMapUnmanaged(Location) = .empty,

    fn deinit(self: *MboxIndex, allocator: Allocator) void {
        var iter = self.messages.keyIterator();
        while (iter.next()) |key| {
            allocator.free(key.*);
        }
        self.messages.deinit(allocator);
    }

    // ...
}

As you can see from the definition of MboxIndex.messages, when we call deinit we need to iterate all the keys and free them individually. For my 5GB mbox file with roughly 30K messages, this is 30K allocations.

Applying PWP

After watching the PWP video again, I settled on a new design. All of the hashmap keys are stored as null-terminated strings in message_ids. So then the keys in locations are slices of the message_ids ArrayList.

message_ids: std.ArrayListUnmanaged(u8) = .empty,
locations: std.StringHashMapUnmanaged(Location) = .empty,

const Index = @This();

pub fn deinit(self: *Index, allocator: Allocator) void {
    self.locations.deinit(allocator);
    self.message_ids.deinit(allocator);
}

This is only two allocations that need to be freed, not thirty thousand! I should note that while building the index we need to keep a StringHashMap of IDs that we have already seen — however once the index has been created this is no longer needed. Right now there is only the two tools: mbox-index and mbox-diff, but I am going to build out a few more tools for working with mbox files that can use the index to quickly scan or filter an mbox file.

As implied by the existence of mbox-index, we can write the index to disk so that diffs can be calculated almost instantly if the index already exists. This might be useful if you wanted to keep incremental backups of your email, quickly search for a particular message, or even browse emails in a TUI.

I still need to support loading the mbox-index files when running mbox-diff, but that wasn’t the learning outcome of this for me. I was more interested in learning and applying PWP.

Reading and writing the index file

Index binary format

To write the index file to disk, we need a binary representation. I used a fairly ubiquitous design, with a magic header, a version number, and multiple chunks with a type constant, a length and then the data.

// Binary index format (v1):
//   [8]    magic
//   [u64]  version
//   [u64]  chunk type 0x01 (message IDs)
//   [u64]  blob length
//   [...]  null-terminated message IDs, concatenated
//   [u64]  chunk type 0x02 (locations)
//   [u64]  entry count
//   [...]  {start: u64, end: u64} per message, ordered to match IDs
//
// All integers are little-endian.
const file_header = [_]u8{ 0x00, 0x08, 0x10, 'm', 'b', 'i', 'd', 'x' };
const file_version: u64 = 0x01;

Reading and writing the index

I love Zig’s reader and writer interfaces! I find them fun to write and very easy to comprehend. Due to the simple data structures of the index, writing takes zero allocations. And because of the file design and PWP, reading only requires two! One for the message_ids array list, and one for the locations hash map. Since the file tells us how many ids and locations there are, we can preallocate too.

pub fn read(allocator: Allocator, reader: *std.io.Reader) !Index {
    const header = try reader.takeArray(8);
    if (!std.mem.eql(u8, header, &file_header)) return error.InvalidFormat;

    const version = try reader.takeInt(u64, .little);
    if (version != file_version) return error.UnsupportedVersion;

    const ids_chunk_type = try reader.takeInt(u64, .little);
    if (ids_chunk_type != chunk_type_message_ids) return error.InvalidFormat;
    const ids_blob_len = try reader.takeInt(u64, .little);

    var message_ids: std.ArrayListUnmanaged(u8) = try .initCapacity(allocator, @intCast(ids_blob_len));
    errdefer message_ids.deinit(allocator);
    message_ids.items.len = @intCast(ids_blob_len);
    try reader.readSliceAll(message_ids.items);

    const locs_chunk_type = try reader.takeInt(u64, .little);
    if (locs_chunk_type != chunk_type_locations) return error.InvalidFormat;
    const entry_count = try reader.takeInt(u64, .little);

    var locations: std.StringHashMapUnmanaged(Location) = .empty;
    errdefer locations.deinit(allocator);
    try locations.ensureTotalCapacity(allocator, @intCast(entry_count));

    var iter: MessageIdIterator = .init(&message_ids);
    var i: u64 = 0;
    while (i < entry_count) : (i += 1) {
        const start = try reader.takeInt(u64, .little);
        const end = try reader.takeInt(u64, .little);
        const id = iter.next() orelse return error.InvalidFormat;
        locations.putAssumeCapacityNoClobber(id, .{ .start = start, .end = end });
    }

    return .{
        .message_ids = message_ids,
        .locations = locations,
    };
}

pub fn write(self: Index, writer: *std.io.Writer) !void {
    try writer.writeAll(&file_header);
    try writer.writeInt(u64, file_version, .little);

    try writer.writeInt(u64, chunk_type_message_ids, .little);
    try writer.writeInt(u64, self.message_ids.items.len, .little);
    try writer.writeAll(self.message_ids.items);

    try writer.writeInt(u64, chunk_type_locations, .little);
    try writer.writeInt(u64, self.locations.count(), .little);

    var iter: MessageIdIterator = .init(&self.message_ids);
    while (iter.next()) |id| {
        const loc = self.locations.get(id) orelse continue;
        try writer.writeInt(u64, loc.start, .little);
        try writer.writeInt(u64, loc.end, .little);
    }

    try writer.flush();
}

Conclusion

I know that the mbox format is fairly esoteric at this point. But for almost 20 years my passion has always been learning new things, and it’s fun!

I want to apply PWP even more in my Zig code. Perhaps it might find its way into gdzig.

I had a lot of fun working on this. Somehow it feels nostalgic and new at the same time. Probably the mixture of old formats and new code.

If you got this far, thanks for reading, and consider checking out neutils.