RoaringBitmap 的底层数据结构

我曾开发过基于 RoaringBitmap 的分布式去重服务，用于在线用户数去重统计及离线计算中的去重统计，本文基于 0.9.8 简要分析 RoaringBitmap 的底层数据结构。

Java 中基本类型 int 为有符号实现，可表示的范围为 [-2147483648, 2147483647]，即整个范围段长达 42 亿之多，即使是仅使用正数部分也有 21 亿，对于大部分公司的用户维度的统计已经足够，所以我们选择基于 32 位 int 的 RoaringBitmap 进行分析。如果对基于 64 位 long 的 Roaring64Bitmap 实现感兴趣，可以自行查看源码：Roaring64Bitmap.java。

RoaringBitmap 的整体设计思想是把 32 位的 int 整数拆分为两个 16 位的部分进行存储，高 16 位的部分用于定位低 16 位的部分存储在哪一个 container，然后在对应的 container 中存储低 16 位值。可能上面的说法有一点抽象，我们使用最常用的 add 方法进行分析，add 方法用于将一个 int 类型的整数添加至 RoaringBitmap，其源码实现如下：

/**
 * Add the value to the container (set the value to "true"), whether it already appears or not.
 *
 * Java lacks native unsigned integers but the x argument is considered to be unsigned.
 * Within bitmaps, numbers are ordered according to {@link Integer#compareUnsigned}.
 * We order the numbers like 0, 1, ..., 2147483647, -2147483648, -2147483647,..., -1.
 *
 * @param x integer value
 */
@Override
public void add(final int x) {
  final char hb = Util.highbits(x);
  final int i = highLowContainer.getIndex(hb);
  if (i >= 0) {
    highLowContainer.setContainerAtIndex(i,
        highLowContainer.getContainerAtIndex(i).add(Util.lowbits(x)));
  } else {
    final ArrayContainer newac = new ArrayContainer();
    highLowContainer.insertNewKeyValueAt(-i - 1, hb, newac.add(Util.lowbits(x)));
  }
}

其中 line 12 将输入的 int 类型的值 x 取出高 16 位的比特并作为 char 类型进行接收，源码如下：

1
2
3

protected static char highbits(int x) {
  return (char) (x >>> 16);
}

可以看出，对 x 进行无符号右移 16 位再将 int 强转为 char 及得到高 16 位比特的值，我们知道在 Java 中 char 与 short 的内存占用均为两个字节，它们的差异在于 char 是无符号的，而 short 是有符号的，具体的在 JLS§4.2.1. Integral Types and Values 中有以下描述：

For short, from -32768 to 32767, inclusive
For char, from ‘\u0000’ to ‘\uffff’ inclusive, that is, from 0 to 65535

其实在早期的 RoaringBitmap 版本中，是采用 short 接收的，但是必须处理关于符号位的问题，直到一次 PR: Replace emulated unsigned shorts with chars #364 将该问题进行了处理，调整为了使用 char 来接收高 16 位的比特值，关于 Java 中基础数字类型不支持无符号的问题，我在大学时期编写基于 RXTX 的串口通讯程序时，也频繁处理过关于符号位的相关问题，所以此处将使用有符号的 short 接收调整为使用无符号的 char 接收后，实现将更加简单。

line 13 是根据 x 高 16 位的比特位值获取到对应的 Container，此处核心的变量为 highLowContainer，它本质上是一个存储 Container 的数组，其声明为：

RoaringArray highLowContainer = null;

/**
 * Create an empty bitmap
 */
public RoaringBitmap() {
  highLowContainer = new RoaringArray();
}

其中 RoaringArray 的部分源码如下：

static final int INITIAL_CAPACITY = 4;

char[] keys = null;

Container[] values = null;

int size = 0;

protected RoaringArray() {
  this(INITIAL_CAPACITY);
}

RoaringArray(int initialCapacity) {
  this(new char[initialCapacity], new Container[initialCapacity], 0);
}

RoaringArray(char[] keys, Container[] values, int size) {
  this.keys = keys;
  this.values = values;
  this.size = size;
}

可以看出，由用于存储高 16 位值的数组 char[] keys 及存储低 16 位值的 Container[] values 组成，其中 Container 抽象类的源码见：Container.java。

回到 line 13，final int i = highLowContainer.getIndex(hb); 是根据高 16 位比特值对应的 char 值获取到对应 container 在 highLowContainer 中的索引。

设想我们第一次调用 org.roaringbitmap.RoaringBitmap#add(int) 方法，那么 highLowContainer.getIndex(hb) 将会返回 -1，此时将进入 line 18，创建一个类型为 ArrayContainer 的 container，line 19 将 x 的低 16 位值添加至 ArrayContainer 后并将该 container 添加至 highLowContainer 中，至此完成一次简单的 add 操作。

基于 32 位 int 类型的 RoaringBitmap 可表达的范围为 [0, 2³²)，高 16 位可表达的范围为 [0, 2¹⁶)，即可表达的范围值个数为 65536，低 16 位亦是如此。那么实际上整个范围段 [0, 2³²) 即被划分为 65536 个块进行表示，每个块可表达的值的个数为 65536 个，即可以把整个范围段划分为如下的子范围段：([0, 2¹⁶), [2¹⁶, 2 × 2¹⁶), …)，每一个子范围段对应着我们刚刚提到的核心概念：container，之前提到的 highLowContainer 用于根据高 16 位的值获取对应的 container，然后再在定位到的 container 上进行相关操作，其中 Container 含有以下类型：

以上的几种实现中，最常用到的为 ArrayContainer 与 BitmapContainer，RoaringBitmap 会根据子范围段的数据密集程度选择使用不同的 Container 实现以保证最低的存储空间占用。比如上面示例中根据高 16 位的值没有找到对应的 container 时，此时新建了一个 container，选用的 ArrayContainer 作为容器的实现，顾名思义，ArrayContainer 将低 16 位的值作为数组存储，部分源码如下：

private static final int DEFAULT_INIT_SIZE = 4;

char[] content;

/**
 * Create an array container with default capacity
 */
public ArrayContainer() {
  this(DEFAULT_INIT_SIZE);
}

/**
 * Create an array container with specified capacity
 *
 * @param capacity The capacity of the container
 */
public ArrayContainer(final int capacity) {
  content = new char[capacity];
}

protected static int serializedSizeInBytes(int cardinality) {
  return cardinality * 2 + 2;
}

@Override
public void serialize(DataOutput out) throws IOException {
  out.writeShort(Character.reverseBytes((char) this.cardinality));
  // little endian
  for (int k = 0; k < this.cardinality; ++k) {
    out.writeShort(Character.reverseBytes(this.content[k]));
  }
}

@Override
public int serializedSizeInBytes() {
  return serializedSizeInBytes(cardinality);
}

从以上源码可以看出内部采用 char[] content; 该 container 中低 16 位的值，每个 16 位采用一个 char 进行存储，即占用两个字节，虽然预先申请了 4 个 char 的数组，但是整体的设计中，每一个低 16 位的值存储占用两个字节，序列化的时候也只是在 content 数组之外附加了数组的长度，额外再占用了两个字节，如果 content 数组的长度为 c，那么序列化的存储占用为 2×c + 2 个字节。在 ArrayContainer 的 add 方法中，有如下代码：

static final int DEFAULT_MAX_SIZE = 4096;// containers with DEFAULT_MAX_SZE or less integers should be ArrayContainers

/**
 * running time is in O(n) time if insert is not in order.
 */
@Override
public Container add(final char x) {
  if (cardinality == 0 || (cardinality > 0
          && (x) > (content[cardinality - 1]))) {
    if (cardinality >= DEFAULT_MAX_SIZE) {
      return toBitmapContainer().add(x);
    }
    if (cardinality >= this.content.length) {
      increaseCapacity();
    }
    content[cardinality++] = x;
  }

  // ...

  return this;
}

可以看出，当 ArrayContainer 中的元素个数大于等于 4096 时，会将该 container 实现转换为 BitmapContainer，转换的原因为空间占用已经等于 BitmapContainer 所需的空间占用，因为一个范围段需要保存的元素数量为 65536，使用原生的比特位保存所需的存储空间为 65536/8 = 8192 字节，之前的 ArrayContainer 一个元素占用两个字节，当存储的元素数量大于等于 4096 时，就没有使用 BitmapContainer 划算了，因为 BitmapContainer 的存储占用是固定的 8192 字节，而 ArrayContainer 在元素数量不多时空间占用低于 8192 字节，元素数量超过 4096 时就不止了，所以此处进行了转换，以保证低的空间占用，其中 BitmapContainer 的部分源码为：

public static final int MAX_CAPACITY = 1 << 16;

final long[] bitmap;

/**
 * Create a bitmap container with all bits set to false
 */
public BitmapContainer() {
  this.cardinality = 0;
  this.bitmap = new long[MAX_CAPACITY / 64];
}

// the parameter is for overloading and symmetry with ArrayContainer
protected static int serializedSizeInBytes(int unusedCardinality) {
  return MAX_CAPACITY / 8;
}

@Override
public void serialize(DataOutput out) throws IOException {
  // little endian
  for (long w : bitmap) {
    out.writeLong(Long.reverseBytes(w));
  }
}

@Override
public int serializedSizeInBytes() {
  return serializedSizeInBytes(0);
}

可以看出，单个 BitmapContainer 可表示的范围段大小为 1 << 16 = 2¹⁶ = 65536，在 Java 中，long 占用 8 个字节，一个字节占用 8 个比特，所以单个 BitmapContainer 用 long[65536/64] 作为底层的数据结构，即 1024 个 long 数字，占用 8192 字节。

以上，即为最常用到的两种 Container，如果用官网的描述，那就是尽管 RoaringBitmap 被称为 Bitmap，但它是一种混合的数据结构，将未压缩的 Bitmap 与排序数组相结合。自我们创建一个空的 RoaringBitmap 开始，随着第一个值的加入，一个 ArrayContainer 被创建，继续往这同一个 ArrayContainer 中添加值，当元素数量超过 4096 时，这个 ArrayContainer 将被转换为 BitmapContainer，另一方面，如果一个值从 BitmapContainer 中移除，当 BitmapContainer 中的元素数量小于等于 4096 时，BitmapContainer 将被转换回 ArrayContainer，当 container 中不含有任何元素时，这个 container 将从 highLowContainer 中移除。可以看出，在常用的 add 及 remove 方法的调用过程中，Container 的类型在 ArrayContainer 与 BitmapContainer 中变换，并没有用到 RunContainer，根据需求，如果我们调用 org.roaringbitmap.RoaringBitmap#runOptimize 将会遍历当前的所有 container，将各个 container 转换为最节省空间占用的实现，源码如下：

/**
 * Use a run-length encoding where it is more space efficient
 *
 * @return whether a change was applied
 */
public boolean runOptimize() {
  boolean answer = false;
  for (int i = 0; i < this.highLowContainer.size(); i++) {
    Container c = this.highLowContainer.getContainerAtIndex(i).runOptimize();
    if (c instanceof RunContainer) {
      answer = true;
    }
    this.highLowContainer.setContainerAtIndex(i, c);
  }
  return answer;
}

下面简单介绍下 RunContainer，其主要利用的为行程长度编码，在具有大量连续值时能够节省空间占用，如果值的分布较为稀疏，那么压缩效果并不好，具体可以参考：Run-length encoding - Wikipedia，下面为 RunContainer 的存储部分源码：

private char[] valueslength;// we interleave values and lengths, so
// that if you have the values 11,12,13,14,15, you store that as 11,4 where 4 means that beyond 11
// itself, there are
// 4 contiguous values that follows.
// Other example: e.g., 1, 10, 20,0, 31,2 would be a concise representation of 1, 2, ..., 11, 20,
// 31, 32, 33

protected static int serializedSizeInBytes(int numberOfRuns) {
  return 2 + 2 * 2 * numberOfRuns; // each run requires 2 2-byte entries.
}

根据源码中的注释可以知道，如果我们有以下的值 11, 12, 13, 14, 15，我们使用 11, 4 进行存储，即表示数字从 11 开始，后面有 4 个连续的值，且因为每个 Container 只需要存储低 16 位的值，所以值部分用两个字节的 char 即可表示，且一个 Container 的值的个数最大为 65536，所以长度部分用两个字节的 char 即可表示，最后，用 char[] valueslength 即可表示这个 RunContainer 中的所有具有连续值的片段，而具体之前的 ArrayContainer 与 BitmapContainer 转换为 RunContainer 后空间占用是否会降低，还需要进行计算才能确认，所以有了以下的代码，如 ArrayContainer 中的 runOptimize 方法：

@Override
public Container runOptimize() {
  // TODO: consider borrowing the BitmapContainer idea of early
  // abandonment
  // with ArrayContainers, when the number of runs in the arrayContainer
  // passes some threshold based on the cardinality.
  int numRuns = numberOfRuns();
  int sizeAsRunContainer = RunContainer.serializedSizeInBytes(numRuns);
  if (getArraySizeInBytes() > sizeAsRunContainer) {
    return new RunContainer(this, numRuns); // this could be maybe
                                            // faster if initial
                                            // container is a bitmap
  } else {
    return this;
  }
}

可以看出，只有当 RunContainer 的空间占用低于 ArrayContainer 时，才会转换为 RunContainer。同理，BitmapContainer 中的 runOptimize 方法实现如下：

@Override
public Container runOptimize() {
  int numRuns = numberOfRunsLowerBound(MAXRUNS); // decent choice

  int sizeAsRunContainerLowerBound = RunContainer.serializedSizeInBytes(numRuns);

  if (sizeAsRunContainerLowerBound >= getArraySizeInBytes()) {
    return this;
  }
  // else numRuns is a relatively tight bound that needs to be exact
  // in some cases (or if we need to make the runContainer the right
  // size)
  numRuns += numberOfRunsAdjustment();
  int sizeAsRunContainer = RunContainer.serializedSizeInBytes(numRuns);

  if (getArraySizeInBytes() > sizeAsRunContainer) {
    return new RunContainer(this, numRuns);
  } else {
    return this;
  }
}

逻辑相似，也是需要 RunContainer 的空间占用更低时才会转换为 RunContainer。最后，再看看 RunContainer 的 runOptimize 方法实现：

/**
 * Convert to Array or Bitmap container if the serialized form would be shorter. Exactly the same
 * functionality as toEfficientContainer.
 */
@Override
public Container runOptimize() {
  return toEfficientContainer();
}

// convert to bitmap or array *if needed*
private Container toEfficientContainer() {
  int sizeAsRunContainer = RunContainer.serializedSizeInBytes(this.nbrruns);
  int sizeAsBitmapContainer = BitmapContainer.serializedSizeInBytes(0);
  int card = this.getCardinality();
  int sizeAsArrayContainer = ArrayContainer.serializedSizeInBytes(card);
  if (sizeAsRunContainer <= Math.min(sizeAsBitmapContainer, sizeAsArrayContainer)) {
    return this;
  }
  return toBitmapOrArrayContainer(card);
}

根据注释，可以知道，根据序列化后的空间占用决定是否转换，如果转换的话需要转换为具体哪一种 Container 实现。

以上，即为 RoaringBitmap 底层数据结构的简要分析，其涉及到的技术还有 JVM 指令内联，如：JVM Intrinsics，以及单个 container 空间占用不会超过 8KB 以支持 L1 cache 同时存储多个 container，其余涉及的技术可以参考下方相关链接，此处不再一一分析。

Reference

A Quick Look at RoaringBitmap | Richard Startin’s Blog
Engineering Fast Indexes for Big Data Applications: Spark Summit East talk by Daniel Lemire
Roaring Bitmaps
Consistently faster and smaller compressed bitmaps with Roaring