Using Roaring Bitmaps in Elixir

2025-05-08

A known limitation of Elixir as a higher level language built on the BEAM is performance and memory management when working with large data sets. At Tern we had large sets we needed in memory to do business analysis and decision making fast. To do this we used Roaring bitmaps the gold standard for fast memory efficient sets.

We initially built a custom implementation in pure Elixir but hit some performance problems, working with high performance binary data structures is not Erlang's strength. Fortunately we found roaring-rs a Rust crate implementation we could use with Rustler. Tern open sourced Roaring, so others can work with large sets fast and efficiently without leaving Elixir. While unpolished the wrapper was used in production with performance sensitive code powered by it.

In this post I want to show off how to use RoaringBitmaps in Elixir.

Basic Usage

Creating a new set is easy, just decide on the 32 or 64 bit implementations based on your max integer size.

{:ok, set_1} = RoaringBitmap64.new()
:ok = RoaringBitmap64.insert(set_1, 1) 

{:ok, set_2} = RoaringBitmap64.new()
:ok = RoaringBitmap64.insert(set_2, 1) 
:ok = RoaringBitmap64.insert(set_2, 2) 

Example set operations:

{:ok, union_set} = RoaringBitmap64.union(set_1, set_2)
{:ok, 2} = RoaringBitmap64.size(union_set) 
{:ok, [1, 2]} = RoaringBitmap64.to_lsit(union_set) 

Real World Example

To show off how this can be used I am using a local data set for Blue Bikes trips in Cambridge, MA. At ~14M rows and a CSV download of 3.5GB it's big enough to benefit from RoaringBitmaps.

Each row is a trip which we'll save in an Ecto schema like so:

defmodule Cbb.BlueBikes.Trip do
  use Ecto.Schema

  import Ecto.Changeset

  schema "blue_bikes_trips" do
    # external string id in the form of start-end timestamps
    field :ride_external_id, :string
    field :started_at, :utc_datetime
    field :ended_at, :utc_datetime

    field :start_station_external_id, :string
    field :start_station_name, :string
    field :start_station_lat, :string
    field :start_station_long, :string
    field :start_station_location, :string

    field :end_station_external_id, :string
    field :end_station_name, :string
    field :end_station_lat, :string
    field :end_station_long, :string
    field :end_station_location, :string

    field :bike_type, Ecto.Enum, values: [classic: 1, electric: 2]
    field :member_type, Ecto.Enum, values: [member: 1, casual: 2]
  end
end

Importing the CSV

See the full implementation for parsing logic, etc but the interesting part is

File.stream!(file, read_ahead: 100_000)
|> NimbleCSV.RFC4180.parse_stream()
|> Stream.chunk_every(1_000)
|> Stream.map(fn rows ->
  attrs_list =
    Enum.map(rows, fn csv_row ->
      trip_attrs(csv_row)
    end)

  Cbb.Repo.insert_all(Trip, attrs_list,
    on_conflict: :nothing,
    conflict_target: [:ride_external_id]
  )
end)
|> Stream.run()

This saves each Blue Bike trip as a row in Postgres with an integer ID we can use as our RoaringBitmap reference.

Building the RoaringBitmaps

Let's define a couple sets based off SQL queries, this hides the where clauses to build the queries for simplicity but they can be found here.

alias Cbb.BlueBikes.Queries

set_list = [
  %{
    key: :members,
    name: "Member Trips",
    query: Trip |> Queries.for_members()
  },
  %{
    key: :non_members,
    name: "Non-Member Trips",
    query: Trip |> Queries.for_members(false)
  },
  %{
    key: :round_trips,
    name: "Round Trips",
    query: Trip |> Queries.where_round_trip()
  },
  %{
    key: :bike_classic,
    name: "Classic Bike Trips",
    query: Trip |> Queries.for_bike_type(:classic)
  },
  %{
    key: :bike_electric,
    name: "Electric Bike Trips",
    query: Trip |> Queries.for_bike_type(:electric)
  }
]
sets = Enum.map(set_list, fn set_attrs ->
  {:ok, set} = RoaringBitmap32.new()

  {:ok, set} =
    Cbb.Repo.transaction(
      fn ->
        Cbb.Repo.stream(
          set_attrs.query
          |> select([t], t.id)
        )
        |> Enum.reduce(set, fn trip_id, set ->
          :ok = RoaringBitmap32.insert(set, trip_id)

          set
        end)
      end,
      timeout: 90_000
    )

  {set_attrs.key, set}
end)
|> Map.new()

We now have map of references we can use:

%{
  members: #Reference<0.3376069143.556138498.164279>,
  non_members: #Reference<0.3376069143.556138497.221516>,
  round_trips: #Reference<0.3376069143.556138497.242082>,
  bike_classic: #Reference<0.3376069143.556138497.245076>,
  bike_electric: #Reference<0.3376069143.556400641.71502>
}

We can do some simple analysis with the sets, let's answer the question, which of our sets return to the same station they left from?

{:ok, ebike_round_trips} = RoaringBitmap32.intersection(sets.bike_electric, sets.round_trips)
{:ok, classic_round_trips} = RoaringBitmap32.intersection(sets.bike_classic, sets.round_trips)
{:ok, member_round_trips} = RoaringBitmap32.intersection(sets.members, sets.round_trips)
{:ok, non_member_round_trips} = RoaringBitmap32.intersection(sets.non_members, sets.round_trips)

RoaringBitmaps.rate(ebike_round_trips, sets.bike_electric)
# 0.001861091867258799
RoaringBitmaps.rate(classic_round_trips, sets.bike_classic)
# 0.016158657608797578
RoaringBitmaps.rate(member_round_trips, sets.members)
# 0.011310790342330712
RoaringBitmaps.rate(non_member_round_trips, sets.non_members)
# 0.0305662697828675

RoaringBitmap32.size(non_member_round_trips)
# {:ok, 83906}

Interesting, 3% of Blue Bike trips by non-members return to the same dock, while in general only 1.5% of trips do this. A relatively trivial example but with these building blocks we can start to do some interesting things right in ELixir without overloading the database.

These don't take up much memory at all and can be used to build scalable analysis or business rule engines off of large collections like users or products without leaving Elixir.

Saving the RoaringBitmaps

We built the sets and showed how to use them but we don't want to rebuild them every time, we need to persist them. RoaringBitmaps have a standard format they can be encoded in for persistence. Roaring implements EctoRoaring.RoaringBitmap32 and a Postgres extension for peristence.

# Migration
create table(:trip_sets) do
  add :name, :string
  add :key, :string
  add :membership_32, :bytea
end

create unique_index(:trip_sets, [:key])

# Schema
schema "trip_sets" do
  field :name, :string
  field :key, :string
  field :membership_32, EctoRoaring.RoaringBitmap32
end

One benefit of the standardized roaring bitmaps in Postgres: we can write sql queries using these same sets, I'll show that in a follow up post.

Take aways

These sets take up very little memory, based on before and after :erlang.memory calls the RoaringBitmap increased the process memory by 3MB while building MapSets containing the id's as integers used 1.9 GB of memory.

Need to work with large sets in Elixir, offload load from your DB or enable high scale features efficiently? Roaring might be able to help.

You can find all the code for this blog post on Github or Source Hut.