In this post, we will explore the CIDR trie data structure and how it can help you manage IP addresses and subnets in your Rust project.

As usual in this series, we will take a tiny piece of Rust out of a real project and explain it in a very practical way.

Problem

Imagine that your Rust service should only be available from specific IP addresses. This may be desirable for limiting access to certain geographical regions (similar to geo-fencing) or other services you control.

This service should not be slowed down by the need to check every incoming request's IP address against a list of allowed IP addresses.

Let's have a look at the size of the geo-whois-asn-country-ipv4.csv file that only contains ipv4 addresses: it contains 243K entries and is about 7.13 MB in size. Simply checking every incoming request against this list with Vec::contains is obviously too inefficient.

Furthermore, the problem is that we don't have the list of all single IP addresses that we can search, but we have a list of IP ranges, for example:

  • from 1.10.10.0 to 1.10.10.255 belongs to IN/India
  • from 1.5.0.0 to 1.5.255.255 belongs to JP/Japan

and so on.

Now when a client connects to our service we have to check a single IP to those given ranges and decide if the client is allowed to connect or not.

Solution

Let's first introduce a slightly different format for those IP ranges, the CIDR notation. CIDR stands for Classless Inter-Domain Routing, and it is a compact way to represent an IP address range. For example:

  • 1.10.10.0 to 1.10.10.255 will be represented as 1.10.10.0/24
  • 1.5.0.0 to 1.5.255.255 will be represented as 1.5.0.0/16

So the suffix /24 or /16 tells us how many leading bits of the IP address are fixed. Keep in mind that a full ipv4 address has 32 bits. That means if 24 bits are fixed, the remaining 8 bits are variable, and as we will see later, not important to us.

For the post we will assume we already have an efficient way of converting the range format above to the CIDR notation. We now want to find a data structure that is well-suited to searching for the country of a given IP address.

We ultimately want to use the data structure in our service like this:

// usage example
let mut country_ip_list = CidrCountryMap::new();
country_ip_list.insert("1.10.10.0/24", "IN");
country_ip_list.insert("1.5.0.0/16", "JP");

// search for the country of the given IP address
assert_eq!(country_ip_list.search("1.10.10.1"), Some("IN"));
assert_eq!(country_ip_list.search("1.10.10.22"), Some("IN"));
assert_eq!(country_ip_list.search("1.5.0.1"), Some("JP"));

// not in the list
assert_eq!(country_ip_list.search("10.1.1.1"), None);

Note: this code is pseudo code. In production we would likely not use strings for the IP addresses, but the std::net::Ipv4Addr type of the std lib.

The CidrCountryMap

One data structure we could base the CidrCountryMap on is a specialized trie. In it's most generic form, it can be used to search based on the prefix of an element (e.g. the first few letters of a string).

But let's first look at what a trie is and it's use case. According to wikipedia, a trie is also called digital tree or prefix tree. To quote:

Unlike a binary search tree, nodes in the trie do not store their associated key. Instead, a node's position in the trie defines the key with which it is associated. This distributes the value of each key across the data structure, and means that not every node necessarily has an associated value.

So the gist is, we would not store full values on nodes, but the data (that we want to store) will distribute on several nodes in the tree on a path. A path starting at the root (usally visualized on the top) leading down to a leaf (last not in the tree) or a terminal node.

This data distribution on paths comes with some very neat properties: the time it takes to search for an item, depends only on the key length, not on the amount of stored items. The same is true for inserting items.

The CIDR trie specifics

In the CIDR trie we represent a bit of the IP address as a node on the path. The complete path from the root to a (terminal) leaf node therefore represents the fixed bits of the IP address (without the trailing masked bits). The leaf node furthermore stores the country code associated with the IP address range.

For example let's look at the IP range 1.10.10.0/24 and their single bytes:

1  -> 00000001
10 -> 00001010
10 -> 00001010
0  -> 00000000

and the /24 suffix tells us that the first 24 bits are the ones we are taking into account. That means we don't store the last 8 bits of the IP range.

So we can build a trie just as we would read those binary numbers, that looks like this (showing only the first 2 bytes):

In the illustration we can see path that the first 16 bit of the IP address would span. The nodes of the 3rd byte are not shown, to keep the illustration simple. But just imagine a continued path down to the leaf node. The leaf node would contain the country code and also would be marked as a terminal node.

That means when we want to search for the country of the IP address, we must just process the bits of the given IP address and follow the path in the trie. If we reach a terminal node, we have found the country code.

That is the basic idea of the CIDR trie. Let's implement this trie in Rust now.

Implementation

Let's introduce the CidrCountryMap struct that holds the trie and the methods to insert and search for a given IP address.

#[derive(Default)]
pub struct CidrCountryMap {
    root: Node,
}

The Node struct represents a single node in the trie. It holds the country code, the child nodes and it keeps the information if it's a terminal node in the whole trie or not.

#[derive(Default)]
struct Node {
    is_terminal: bool,
    children: [Option<Box<Node>>; 2],
    value: Option<String>,
}

Note: we're using a fixed size array of for 2 Nodes (children: [Option<Box<Node>>;2]) to represent the child nodes. This is a common pattern in Rust to represent a tree structure. This avoids further allocations when inserting nodes. Just in case, could also use a Vec instead of an array. Also this is specific to a binary trie, e.g. for storing arbitrary ASCII characters we would have to (pre-)allocate 256 children (one per character). In that case it would be better to use a HashMap insteaf of a fixed size array.

We have to use Box because Rust does not allow recursive types that are not on the heap.

Let's implement the CidrCountryMap insert method.

impl CidrCountryMap {
    pub fn insert(&mut self, cidr: &str, data: impl Into<String>) -> Result<()> {
        let (cidr, take_bits) = cidr
            .split_once('/')
            .ok_or(CidrCountryMapError::SplitError)?;
        let mut take_bits = take_bits
            .parse::<usize>()
            .map_err(|err| CidrCountryMapError::ParseIntError(err))?;

        let mut current_node = &mut self.root;
        for byte in cidr.split('.').flat_map(|octet| octet.parse::<u8>()) {
            for bi in (0..8).rev() {
                if take_bits == 0 {
                    break;
                }
                let index = (byte >> bi) & 1;
                current_node = current_node.children[index as usize].get_or_insert(Box::default());
                take_bits -= 1;
            }
        }

        current_node.is_terminal = true;
        current_node.value = Some(data.into());

        Ok(())
    }}
  • it splits the CIDR notation into the IP address and the number of bits
  • it iterates over the bytes of the IP address and the bits of the CIDR notation
  • it processes the bits from left to right by using bit shifting and masking (byte >> bi) & 1 (bi is the amount of bits we shift to the right)
    • first we shift by 7 bits, then by 6 bits, then by 5 bits and so on
    • after shifting the bits to the right we select the right-most bit with the mask 1
    • by doing so we iterate through the bits of the byte from left to right
    • this gives either true or false (1 or 0) and we use this as an index to access the child array
  • we take or insert a new node in the trie for each bit
  • after all current_node is the leaf node and we set the country code and mark it as terminal

Note: You might wonder about the the use of Result<()> and CidrCountryMapError. I kept it out on purpose to keep the code simple and not use .unwrap(). You can find the full code in playground link on the References section.

The search method is quite similar to the insert method. We just have to iterate over the bits of the IP address and follow the path in the trie.

impl CidrCountryMap {
    pub fn search(&self, cidr: &str) -> Option<&str> {
        let mut current_node = &self.root;
        for byte in cidr.split('.').flat_map(|octet| octet.parse::<u8>()) {
            for bi in (0..8).rev() {
                let index = byte >> bi & 0b1;

                if let Some(node) = &current_node.children[index as usize] {
                    current_node = node;
                } else if current_node.is_terminal {
                    break;
                } else {
                    return None;
                }
            }
        }
        current_node.value.as_deref()
    }
}
  • we break the loop if we reach a terminal node
  • we return None if we reach a node that has no children
  • we return the country code of the end node, the leaf node, if we reach the end of the IP address

Conclusion

Searching an IP Address in the CIDR trie is done in O(k) time complexity. Where k is the number of bits in the IP Adress to look up, so O(32). It may also take less than O(k) time when the IP is not present in the trie.

The CIDR trie is a very efficient data structure to search for IP addresses in a given range.

Note: For other more complex use cases it might be better to roll the trie_rs crate, which provides a more generic trie implementation. However, I want to encourage you to mind a clean and small dependency list in your projects. Be aware that every external dependency comes with the cost of maintenance and less control over security and stability.

Improvement ideas:

  • we could store the whole trie Nodes in a single Vec to avoid multiple allocations and to improve cache locality
  • we could compress the trie by combining nodes that have only one child, this approach is then called to be a Radix Trie
  • we could store the country code in a separate vector and use the index as a reference in the trie nodes, to save memory and to avoid storing the country code multiple times

Special thanks to Jonas and David for providing feedback and improvement suggestions to this post 🙏 🦀 🚀

References