2025-05-08
A known limitation of Elixir as a higher level language built on the BEAM is performance and memory management when working with large data sets. At Tern we had large sets we needed in memory to do business analysis and decision making fast. To do this we used Roaring bitmaps the gold standard for fast memory efficient sets.
We initially built a custom implementation in pure Elixir but hit some performance problems, working with high performance binary data structures is not Erlang's strength. Fortunately we found roaring-rs a Rust crate implementation we could use with Rustler. Tern open sourced Roaring, so others can work with large sets fast and efficiently without leaving Elixir. While unpolished the wrapper was used in production with performance sensitive code powered by it.
In this post I want to show off how to use RoaringBitmaps in Elixir.
Creating a new set is easy, just decide on the 32 or 64 bit implementations based on your max integer size.
{:ok, set_1} = RoaringBitmap64.new()
:ok = RoaringBitmap64.insert(set_1, 1)
{:ok, set_2} = RoaringBitmap64.new()
:ok = RoaringBitmap64.insert(set_2, 1)
:ok = RoaringBitmap64.insert(set_2, 2)
Example set operations:
{:ok, union_set} = RoaringBitmap64.union(set_1, set_2)
{:ok, 2} = RoaringBitmap64.size(union_set)
{:ok, [1, 2]} = RoaringBitmap64.to_lsit(union_set) To show off how this can be used I am using a local data set for Blue Bikes trips in Cambridge, MA. At ~14M rows and a CSV download of 3.5GB it's big enough to benefit from RoaringBitmaps.
Each row is a trip which we'll save in an Ecto schema like so:
defmodule Cbb.BlueBikes.Trip do
use Ecto.Schema
import Ecto.Changeset
schema "blue_bikes_trips" do
# external string id in the form of start-end timestamps
field :ride_external_id, :string
field :started_at, :utc_datetime
field :ended_at, :utc_datetime
field :start_station_external_id, :string
field :start_station_name, :string
field :start_station_lat, :string
field :start_station_long, :string
field :start_station_location, :string
field :end_station_external_id, :string
field :end_station_name, :string
field :end_station_lat, :string
field :end_station_long, :string
field :end_station_location, :string
field :bike_type, Ecto.Enum, values: [classic: 1, electric: 2]
field :member_type, Ecto.Enum, values: [member: 1, casual: 2]
end
endSee the full implementation for parsing logic, etc but the interesting part is
File.stream!(file, read_ahead: 100_000)
|> NimbleCSV.RFC4180.parse_stream()
|> Stream.chunk_every(1_000)
|> Stream.map(fn rows ->
attrs_list =
Enum.map(rows, fn csv_row ->
trip_attrs(csv_row)
end)
Cbb.Repo.insert_all(Trip, attrs_list,
on_conflict: :nothing,
conflict_target: [:ride_external_id]
)
end)
|> Stream.run()
This saves each Blue Bike trip as a row in Postgres with an integer ID we can use as our RoaringBitmap reference.
Let's define a couple sets based off SQL queries, this hides the where clauses to build the queries for simplicity but they can be found here.
alias Cbb.BlueBikes.Queries
set_list = [
%{
key: :members,
name: "Member Trips",
query: Trip |> Queries.for_members()
},
%{
key: :non_members,
name: "Non-Member Trips",
query: Trip |> Queries.for_members(false)
},
%{
key: :round_trips,
name: "Round Trips",
query: Trip |> Queries.where_round_trip()
},
%{
key: :bike_classic,
name: "Classic Bike Trips",
query: Trip |> Queries.for_bike_type(:classic)
},
%{
key: :bike_electric,
name: "Electric Bike Trips",
query: Trip |> Queries.for_bike_type(:electric)
}
]sets = Enum.map(set_list, fn set_attrs ->
{:ok, set} = RoaringBitmap32.new()
{:ok, set} =
Cbb.Repo.transaction(
fn ->
Cbb.Repo.stream(
set_attrs.query
|> select([t], t.id)
)
|> Enum.reduce(set, fn trip_id, set ->
:ok = RoaringBitmap32.insert(set, trip_id)
set
end)
end,
timeout: 90_000
)
{set_attrs.key, set}
end)
|> Map.new()
We now have map of references we can use:
%{
members: #Reference<0.3376069143.556138498.164279>,
non_members: #Reference<0.3376069143.556138497.221516>,
round_trips: #Reference<0.3376069143.556138497.242082>,
bike_classic: #Reference<0.3376069143.556138497.245076>,
bike_electric: #Reference<0.3376069143.556400641.71502>
}
We can do some simple analysis with the sets, let's answer the question, which of our sets return to the same station they left from?
{:ok, ebike_round_trips} = RoaringBitmap32.intersection(sets.bike_electric, sets.round_trips)
{:ok, classic_round_trips} = RoaringBitmap32.intersection(sets.bike_classic, sets.round_trips)
{:ok, member_round_trips} = RoaringBitmap32.intersection(sets.members, sets.round_trips)
{:ok, non_member_round_trips} = RoaringBitmap32.intersection(sets.non_members, sets.round_trips)
RoaringBitmaps.rate(ebike_round_trips, sets.bike_electric)
# 0.001861091867258799
RoaringBitmaps.rate(classic_round_trips, sets.bike_classic)
# 0.016158657608797578
RoaringBitmaps.rate(member_round_trips, sets.members)
# 0.011310790342330712
RoaringBitmaps.rate(non_member_round_trips, sets.non_members)
# 0.0305662697828675
RoaringBitmap32.size(non_member_round_trips)
# {:ok, 83906}
Interesting, 3% of Blue Bike trips by non-members return to the same dock, while in general only 1.5% of trips do this. A relatively trivial example but with these building blocks we can start to do some interesting things right in ELixir without overloading the database.
These don't take up much memory at all and can be used to build scalable analysis or business rule engines off of large collections like users or products without leaving Elixir.
We built the sets and showed how to use them but we don't want to rebuild them
every time, we need to persist them. RoaringBitmaps have a standard format they
can be encoded in for persistence. Roaring implements EctoRoaring.RoaringBitmap32
and a Postgres extension for peristence.
# Migration
create table(:trip_sets) do
add :name, :string
add :key, :string
add :membership_32, :bytea
end
create unique_index(:trip_sets, [:key])
# Schema
schema "trip_sets" do
field :name, :string
field :key, :string
field :membership_32, EctoRoaring.RoaringBitmap32
end
One benefit of the standardized roaring bitmaps in Postgres: we can write sql queries using these same sets, I'll show that in a follow up post.
These sets take up very little memory, based on before and after :erlang.memory
calls the RoaringBitmap increased the process memory by 3MB while building
MapSets containing the id's as integers used 1.9 GB of memory.
Need to work with large sets in Elixir, offload load from your DB or enable high scale features efficiently? Roaring might be able to help.
You can find all the code for this blog post on Github or Source Hut.