
HashMap

The following analysis is based on the HashMap in JDK 8. Let's start with its constructors:

/**
 * Constructs an empty <tt>HashMap</tt> with the specified initial
 * capacity and load factor.
 *
 * @param initialCapacity the initial capacity
 * @param loadFactor      the load factor
 * @throws IllegalArgumentException if the initial capacity is negative
 *         or the load factor is nonpositive
 */
public HashMap(int initialCapacity, float loadFactor) {
    if (initialCapacity < 0)
        throw new IllegalArgumentException("Illegal initial capacity: " +
                                           initialCapacity);
    if (initialCapacity > MAXIMUM_CAPACITY)
        initialCapacity = MAXIMUM_CAPACITY;
    if (loadFactor <= 0 || Float.isNaN(loadFactor))
        throw new IllegalArgumentException("Illegal load factor: " +
                                           loadFactor);
    this.loadFactor = loadFactor;
    this.threshold = tableSizeFor(initialCapacity);
}

/**
 * Constructs an empty <tt>HashMap</tt> with the specified initial
 * capacity and the default load factor (0.75).
 *
 * @param initialCapacity the initial capacity.
 * @throws IllegalArgumentException if the initial capacity is negative.
 */
public HashMap(int initialCapacity) {
    this(initialCapacity, DEFAULT_LOAD_FACTOR);
}

/**
 * Constructs an empty <tt>HashMap</tt> with the default initial capacity
 * (16) and the default load factor (0.75).
 */
public HashMap() {
    this.loadFactor = DEFAULT_LOAD_FACTOR; // all other fields defaulted
}

As you can see, the most common form, new HashMap(), merely sets loadFactor to the default load factor of 0.75; the underlying array Node<K,V>[] table is not initialized at this point.

Why 0.75 as the default load factor? The HashMap JavaDoc addresses this directly:

As a general rule, the default load factor (.75) offers a good tradeoff between time and space costs. Higher values decrease the space overhead but increase the lookup cost (reflected in most of the operations of the HashMap class, including get and put). The expected number of entries in the map and its load factor should be taken into account when setting its initial capacity, so as to minimize the number of rehash operations. If the initial capacity is greater than the maximum number of entries divided by the load factor, no rehash operations will ever occur.

If many mappings are to be stored in a HashMap instance, creating it with a sufficiently large capacity will allow the mappings to be stored more efficiently than letting it perform automatic rehashing as needed to grow the table. Note that using many keys with the same hashCode() is a sure way to slow down performance of any hash table. To ameliorate impact, when keys are Comparable, this class may use comparison order among keys to help break ties.

Now look at the last line of HashMap(int initialCapacity, float loadFactor):
this.threshold = tableSizeFor(initialCapacity). The tableSizeFor method is implemented as follows:

/**
 * Returns a power of two size for the given target capacity.
 */
static final int tableSizeFor(int cap) {
    int n = cap - 1;
    n |= n >>> 1;
    n |= n >>> 2;
    n |= n >>> 4;
    n |= n >>> 8;
    n |= n >>> 16;
    return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
}

This code may look baffling at first sight, so let's set most of it aside and focus on a single line:

n |= n >>> 1;

This line is simple: it ORs n with n >>> 1 and assigns the result back to n. For example:

n = 8:
00000000 00000000 00000000 00001000
n >>> 1:
00000000 00000000 00000000 00000100
n = n | n >>> 1:
00000000 00000000 00000000 00001100

As you can see, the bit immediately below n's highest set bit is now also 1. If we apply n |= n >>> 1; again, we get:

n = n | n >>> 1:
00000000 00000000 00000000 00001110
n = n | n >>> 1:
00000000 00000000 00000000 00001111

So by repeatedly applying n |= n >>> 1;, every bit below n's highest set bit eventually becomes 1.

Now consider an extreme case:

n = 1073741824:
01000000 00000000 00000000 00000000

In this extreme case, we would need 30 iterations of n |= n >>> 1; to set all the bits below the highest set bit:

01000000 00000000 00000000 00000000
01100000 00000000 00000000 00000000
01110000 00000000 00000000 00000000
01111000 00000000 00000000 00000000
01111100 00000000 00000000 00000000
...
01111111 11111111 11111111 11111111

Observing the binary progression above: after the first n |= n >>> 1;, the top two bits are guaranteed to be 1. So for the second step we don't need to shift by just one bit; we can shift by two bits at once:

01000000 00000000 00000000 00000000
n = n | n >>> 1:
01100000 00000000 00000000 00000000
n = n | n >>> 2:
01111000 00000000 00000000 00000000
...
01111111 11111111 11111111 11111111

After n |= n >>> 2;, the top four bits are guaranteed to be 1, so the third step can shift four bits at once:

01000000 00000000 00000000 00000000
n = n | n >>> 1:
01100000 00000000 00000000 00000000
n = n | n >>> 2:
01111000 00000000 00000000 00000000
n = n | n >>> 4:
01111111 10000000 00000000 00000000
...
01111111 11111111 11111111 11111111

Continuing the pattern, the remaining shift distances are clearly 8 and 16. So, to set every bit below the highest set bit of a positive integer n to 1, the following sequence suffices:

n |= n >>> 1;
n |= n >>> 2;
n |= n >>> 4;
n |= n >>> 8;
n |= n >>> 16;

After these operations, every bit of n from its highest original set bit downward is 1, so n + 1 is a power of two. The code:

n |= n >>> 1;
n |= n >>> 2;
n |= n >>> 4;
n |= n >>> 8;
n |= n >>> 16;
return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;

where MAXIMUM_CAPACITY is defined as:

/**
 * The maximum capacity, used if a higher value is implicitly specified
 * by either of the constructors with arguments.
 * MUST be a power of two <= 1<<30.
 */
static final int MAXIMUM_CAPACITY = 1 << 30;

Why shift left by 30 bits rather than 31? Because Java's primitive integer types are signed: the largest power of two representable as a positive int is 1 << 30, whose binary representation is:

1 << 30:
01000000 00000000 00000000 00000000

Shifting 1 left by 31 bits instead gives 1 << 31 = -2147483648, which clearly cannot serve as a maximum capacity:

1 << 31 = -2147483648:
10000000 00000000 00000000 00000000
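These boundary values are easy to check directly. The following standalone snippet (my own illustration, not JDK code) prints both shifts:

```java
public class ShiftOverflowDemo {
    public static void main(String[] args) {
        // 1 << 30 is the largest positive power of two for a signed 32-bit int
        System.out.println(1 << 30); // 1073741824
        // shifting one bit further lands on the sign bit and wraps negative
        System.out.println(1 << 31); // -2147483648
        System.out.println(Integer.toBinaryString(1 << 30));
        System.out.println(Integer.toBinaryString(1 << 31));
    }
}
```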

Now back to the tableSizeFor method:

static final int tableSizeFor(int cap) {
    int n = cap - 1;
    n |= n >>> 1;
    n |= n >>> 2;
    n |= n >>> 4;
    n |= n >>> 8;
    n |= n >>> 16;
    return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
}

What is the purpose of the first line, int n = cap - 1;? Suppose the decrement were absent and tableSizeFor were implemented as:

static final int tableSizeFor(int cap) {
    int n = cap;
    n |= n >>> 1;
    n |= n >>> 2;
    n |= n >>> 4;
    n |= n >>> 8;
    n |= n >>> 16;
    return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
}

cap = 8 时,此时返回值为 16,如果我们使用作者的 int n = cap - 1; 版本进行计算,此时返回值为 8, 可以知道,作者编写的 tableSizeFor 方法的返回值为大于等于 cap 的 2 的 n 次幂。这就是 JDK 8 中 tableSizeFor 方法的实现原理。

Let's also look at the HashMap constructor in earlier JDK versions. The JDK 1.3 code is:

public HashMap(int initialCapacity, float loadFactor) {
    if (initialCapacity < 0)
        throw new IllegalArgumentException("Illegal Initial Capacity: "+
                                           initialCapacity);
    if (loadFactor <= 0 || Float.isNaN(loadFactor))
        throw new IllegalArgumentException("Illegal Load factor: "+
                                           loadFactor);
    if (initialCapacity==0)
        initialCapacity = 1;
    this.loadFactor = loadFactor;
    table = new Entry[initialCapacity];
    threshold = (int)(initialCapacity * loadFactor);
}

Notice that JDK 1.3 makes no attempt to round the user-specified initialCapacity up to a power of two. Now the JDK 1.4 code:

public HashMap(int initialCapacity, float loadFactor) {
    if (initialCapacity < 0)
        throw new IllegalArgumentException("Illegal initial capacity: " +
                                           initialCapacity);
    if (initialCapacity > MAXIMUM_CAPACITY)
        initialCapacity = MAXIMUM_CAPACITY;
    if (loadFactor <= 0 || Float.isNaN(loadFactor))
        throw new IllegalArgumentException("Illegal load factor: " +
                                           loadFactor);

    // Find a power of 2 >= initialCapacity
    int capacity = 1;
    while (capacity < initialCapacity)
        capacity <<= 1;

    this.loadFactor = loadFactor;
    threshold = (int)(capacity * loadFactor);
    table = new Entry[capacity];
    init();
}

So JDK 1.4 introduced the logic of finding the smallest power of two greater than or equal to the user-specified initialCapacity, although its loop is less efficient than JDK 8's tableSizeFor.

By JDK 7, the power-of-two rounding was implemented via Integer.highestOneBit, with logic almost identical to JDK 8's tableSizeFor. This family of optimizations comes from Hacker's Delight; many of the bit-manipulation methods in the Integer class are drawn from that book.
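As an illustration of the Integer.highestOneBit approach, roughly what JDK 7 does (this sketch is my own simplification, not the exact JDK 7 source, and ignores the MAXIMUM_CAPACITY clamp):

```java
public class RoundUpDemo {
    // Round cap up to the next power of two using Integer.highestOneBit.
    // (cap - 1) << 1 is at least cap and less than 2 * cap, so its highest
    // set bit is exactly the smallest power of two >= cap.
    static int roundUpToPowerOfTwo(int cap) {
        return cap <= 1 ? 1 : Integer.highestOneBit((cap - 1) << 1);
    }

    public static void main(String[] args) {
        System.out.println(roundUpToPowerOfTwo(8)); // 8
        System.out.println(roundUpToPowerOfTwo(9)); // 16
        System.out.println(roundUpToPowerOfTwo(1)); // 1
    }
}
```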

Why must the size be a power of two at all, though? Could we simply use the user-specified cap as the size of the underlying array? To answer that, we have to revisit the JDK 1.3 HashMap constructors:

/**
 * Constructs a new, empty map with the specified initial
 * capacity and the specified load factor.
 *
 * @param initialCapacity the initial capacity of the HashMap.
 * @param loadFactor      the load factor of the HashMap
 * @throws IllegalArgumentException if the initial capacity is less
 *         than zero, or if the load factor is nonpositive.
 */
public HashMap(int initialCapacity, float loadFactor) {
    if (initialCapacity < 0)
        throw new IllegalArgumentException("Illegal Initial Capacity: "+
                                           initialCapacity);
    if (loadFactor <= 0 || Float.isNaN(loadFactor))
        throw new IllegalArgumentException("Illegal Load factor: "+
                                           loadFactor);
    if (initialCapacity==0)
        initialCapacity = 1;
    this.loadFactor = loadFactor;
    table = new Entry[initialCapacity];
    threshold = (int)(initialCapacity * loadFactor);
}

/**
 * Constructs a new, empty map with the specified initial capacity
 * and default load factor, which is <tt>0.75</tt>.
 *
 * @param initialCapacity the initial capacity of the HashMap.
 * @throws IllegalArgumentException if the initial capacity is less
 *         than zero.
 */
public HashMap(int initialCapacity) {
    this(initialCapacity, 0.75f);
}

/**
 * Constructs a new, empty map with a default capacity and load
 * factor, which is <tt>0.75</tt>.
 */
public HashMap() {
    this(11, 0.75f);
}

Similar logic can be found in the JDK 8 source of Hashtable:

/**
 * Constructs a new, empty hashtable with the specified initial
 * capacity and the specified load factor.
 *
 * @param initialCapacity the initial capacity of the hashtable.
 * @param loadFactor      the load factor of the hashtable.
 * @exception IllegalArgumentException if the initial capacity is less
 *            than zero, or if the load factor is nonpositive.
 */
public Hashtable(int initialCapacity, float loadFactor) {
    if (initialCapacity < 0)
        throw new IllegalArgumentException("Illegal Capacity: "+
                                           initialCapacity);
    if (loadFactor <= 0 || Float.isNaN(loadFactor))
        throw new IllegalArgumentException("Illegal Load: "+loadFactor);

    if (initialCapacity==0)
        initialCapacity = 1;
    this.loadFactor = loadFactor;
    table = new Entry<?,?>[initialCapacity];
    threshold = (int)Math.min(initialCapacity * loadFactor, MAX_ARRAY_SIZE + 1);
}

/**
 * Constructs a new, empty hashtable with the specified initial capacity
 * and default load factor (0.75).
 *
 * @param initialCapacity the initial capacity of the hashtable.
 * @exception IllegalArgumentException if the initial capacity is less
 *            than zero.
 */
public Hashtable(int initialCapacity) {
    this(initialCapacity, 0.75f);
}

/**
 * Constructs a new, empty hashtable with a default initial capacity (11)
 * and load factor (0.75).
 */
public Hashtable() {
    this(11, 0.75f);
}

So both JDK 1.3's HashMap and JDK 8's Hashtable default to an initial capacity of 11, and when the user specifies an initialCapacity they use it directly as the size of the underlying array. Why, then, does JDK 8's HashMap insist on a power-of-two size for the underlying array?

First, we need to be clear that when hashCode values are unevenly distributed, a prime table size yields a lower collision probability. An example:

Consider the set of hashCode values K = {0, 3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36, 39, 42, 45, 48, 51, 54, 57, 60, 63, 66, 69, 72, 75, 78, 81, 84, 87, 90, 93, 96, 99}. With an underlying array of size 12 and the rule int index = hashCode % size; for locating a key's node, every key in K lands in index ∈ {0, 3, 6, 9}. With size 11, the keys in K spread across index ∈ {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10}. When the values in K share a common factor with the table size, the resulting indices are all multiples of that factor, which raises the collision probability. So to reduce collisions we want to minimize the number of common divisors between the hashCode values and the table size; since the hashCode values cannot be changed, the only lever is the table size, and a prime size minimizes the number of common divisors, giving the lowest collision probability.
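The bucket counts claimed above can be checked with a short program (my own illustration; the names are arbitrary):

```java
import java.util.HashSet;
import java.util.Set;

public class BucketSpreadDemo {
    // Counts how many distinct buckets the hash codes 0, 3, 6, ..., 99
    // occupy for a given table size, using index = hashCode % size.
    static int distinctBuckets(int size) {
        Set<Integer> buckets = new HashSet<>();
        for (int h = 0; h < 100; h += 3) {
            buckets.add(h % size);
        }
        return buckets.size();
    }

    public static void main(String[] args) {
        // size 12 shares the factor 3 with every key: only 4 buckets used
        System.out.println("size 12: " + distinctBuckets(12) + " buckets"); // 4
        // size 11 is prime: all 11 buckets used
        System.out.println("size 11: " + distinctBuckets(11) + " buckets"); // 11
    }
}
```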

Why, then, did HashMap switch to power-of-two sizes starting with JDK 1.4? First look at how JDK 1.3, which has no power-of-two sizing, locates the index for a key:

JDK 1.3:
index = (hash & 0x7FFFFFFF) % tab.length;

This masks the hash with 0x7FFFFFFF to obtain a non-negative value and then takes the remainder modulo tab.length, the most obvious approach.

Now the JDK 1.4 code for locating a key's index, after power-of-two sizing had been introduced:

JDK 1.4:

/**
 * Returns index for hash code h.
 */
static int indexFor(int h, int length) {
    return h & (length-1);
}

The remainder operation is gone, replaced by a single bitwise AND. This relies on the identity h & (length - 1) == h % length, which holds only when length is a power of two (length = 2^n) and h is non-negative.

HashMap Requires a Better hashCode() - JDK 1.4 Part II 中,可以看出与运算的运行耗时不到余运算的四分之一。

Another benefit of a power-of-two size concerns JDK 8's resize method: during expansion, hash & oldCap cheaply determines each node's index in the new table. The expansion logic is described in detail later in this article.

From the discussion so far, we can summarize the consequences of sizing the hash table's underlying array to a power of two:

Advantages:

  • Locating the node for a key needs no costly remainder operation; a single bitwise AND suffices
  • resize can efficiently compute each node's index in the expanded table

Disadvantage:

  • For poorly distributed hashCode values, collisions are more likely than with a prime size

Here is a reply from Josh Bloch, one of HashMap's authors, on the power-of-two choice:

The downside of using a power-of-two is that the resulting hash table is very sensitive to the quality of the hash function (hashCode). It is imperative that any change in the input must affect the low order bits of the hash value. (Ideally, it should affect all bits of the hash value with equal likelihood.) Because we have no assurance that this is true, we put in a secondary (or “defensive”) hash function when we switched to the power-of-two hash table. This hash function is applied to the results of hashCode before masking off the low order bits. Its job is to scatter the information over all the bits, and in particular, into the low order bits. Of course it has to run very fast, or you lose the benefit of switching to the power-of-two-sized table. The original secondary hash function in 1.4 turned out to be insufficient. We knew that this was a theoretical possibility, but we thought that it didn’t affect any practical data sets. We were wrong. The replacement secondary hash function (which I developed with the aid of a computer) has strong statistical properties that pretty much guarantee good bucket distribution.

Quoted from: HashMap Requires a Better hashCode() - JDK 1.4 Part II

In short: the downside of a power-of-two table is that it is very sensitive to the quality of the hash function, and it is critical that changes in the input affect the low-order bits of the hash value. That is why, when switching to power-of-two tables, a secondary (defensive) hash function was introduced, whose job is to scatter the information across all bits, and in particular into the low-order bits.

This foreshadows the hash function in JDK 8. Here are the JDK 8 methods involved in locating the node for a key:

/**
 * Computes key.hashCode() and spreads (XORs) higher bits of hash
 * to lower.  Because the table uses power-of-two masking, sets of
 * hashes that vary only in bits above the current mask will
 * always collide. (Among known examples are sets of Float keys
 * holding consecutive whole numbers in small tables.)  So we
 * apply a transform that spreads the impact of higher bits
 * downward. There is a tradeoff between speed, utility, and
 * quality of bit-spreading. Because many common sets of hashes
 * are already reasonably distributed (so don't benefit from
 * spreading), and because we use trees to handle large sets of
 * collisions in bins, we just XOR some shifted bits in the
 * cheapest possible way to reduce systematic lossage, as well as
 * to incorporate impact of the highest bits that would otherwise
 * never be used in index calculations because of table bounds.
 */
static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}

public V put(K key, V value) {
    return putVal(hash(key), key, value, false, true);
}

final V putVal(int hash, K key, V value, boolean onlyIfAbsent,
               boolean evict) {
    Node<K,V>[] tab; Node<K,V> p; int n, i;
    if ((tab = table) == null || (n = tab.length) == 0)
        n = (tab = resize()).length;
    if ((p = tab[i = (n - 1) & hash]) == null)
        tab[i] = newNode(hash, key, value, null);
    else {
        // ...
    }
    ++modCount;
    if (++size > threshold)
        resize();
    afterNodeInsertion(evict);
    return null;
}

Locating the node that holds the current key works exactly as in JDK 1.4:

tabIndex = (capacity - 1) & hash(key);

As for the hash method itself, why XOR hashCode() with hashCode() >>> 16? The documentation explains:

Because the table uses power-of-two masking, sets of hashes that vary only in bits above the current mask will always collide.


Let's work through an example with a capacity = 8 HashMap and four arbitrarily chosen keys:

capacity = 8:
00000000 00000000 00000000 00001000
capacity - 1:
00000000 00000000 00000000 00000111

keyA.hashCode() = 17:
00000000 00000000 00000000 00010001
keyB.hashCode() = 33:
00000000 00000000 00000000 00100001
keyC.hashCode() = 524289:
00000000 00001000 00000000 00000001
keyD.hashCode() = 65537:
00000000 00000001 00000000 00000001

If we skip the XOR of hashCode() with hashCode() >>> 16, i.e. suppose hash simply returned key.hashCode(), then:

tabIndexOfKeyA = (capacity - 1) & keyA.hashCode() = 1
tabIndexOfKeyB = (capacity - 1) & keyB.hashCode() = 1
tabIndexOfKeyC = (capacity - 1) & keyC.hashCode() = 1
tabIndexOfKeyD = (capacity - 1) & keyD.hashCode() = 1

Exactly as the documentation says, sets of hashes that vary only in bits above the current mask always collide: all four keys land in the node at index = 1.

Now apply the real hash method, XORing hashCode() with hashCode() >>> 16:

hash(keyA) = keyA.hashCode() ^ (keyA.hashCode() >>> 16):
00000000 00000000 00000000 00010001 // keyA.hashCode()
00000000 00000000 00000000 00000000 // keyA.hashCode() >>> 16

00000000 00000000 00000000 00010001 // hash(keyA)
00000000 00000000 00000000 00000111 // capacity - 1

tabIndexOfKeyA = (capacity - 1) & hash(keyA) = 1
hash(keyB) = keyB.hashCode() ^ (keyB.hashCode() >>> 16):
00000000 00000000 00000000 00100001 // keyB.hashCode()
00000000 00000000 00000000 00000000 // keyB.hashCode() >>> 16

00000000 00000000 00000000 00100001 // hash(keyB)
00000000 00000000 00000000 00000111 // capacity - 1

tabIndexOfKeyB = (capacity - 1) & hash(keyB) = 1
hash(keyC) = keyC.hashCode() ^ (keyC.hashCode() >>> 16):
00000000 00001000 00000000 00000001 // keyC.hashCode()
00000000 00000000 00000000 00001000 // keyC.hashCode() >>> 16

00000000 00001000 00000000 00001001 // hash(keyC)
00000000 00000000 00000000 00000111 // capacity - 1

tabIndexOfKeyC = (capacity - 1) & hash(keyC) = 1
hash(keyD) = keyD.hashCode() ^ (keyD.hashCode() >>> 16):
00000000 00000001 00000000 00000001 // keyD.hashCode()
00000000 00000000 00000000 00000001 // keyD.hashCode() >>> 16

00000000 00000001 00000000 00000000 // hash(keyD)
00000000 00000000 00000000 00000111 // capacity - 1

tabIndexOfKeyD = (capacity - 1) & hash(keyD) = 0

After mixing the high bits into the low bits, keyD lands in the node at index = 0, reducing the probability of a hash collision.
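The four-key walkthrough above can be reproduced in a few lines (the spread method below inlines the same transform as JDK 8's HashMap.hash for non-null keys; the class and variable names are my own):

```java
public class HashSpreadDemo {
    // Same high-to-low bit spreading as JDK 8's hash(Object), for a raw int.
    static int spread(int h) {
        return h ^ (h >>> 16);
    }

    public static void main(String[] args) {
        int capacity = 8;
        int[] hashCodes = {17, 33, 524289, 65537}; // keyA..keyD from the text
        for (int h : hashCodes) {
            int plain = (capacity - 1) & h;         // without spreading
            int mixed = (capacity - 1) & spread(h); // with spreading
            System.out.println(h + ": plain index=" + plain
                    + ", spread index=" + mixed);
        }
        // Only keyD (65537) moves: its differing bit sits above the mask,
        // and the XOR folds it down into the low bits.
    }
}
```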

You can also approach this through the comment's remark that "Among known examples are sets of Float keys holding consecutive whole numbers in small tables." Since Float keys are rare in everyday development I did not use that example, but if you are interested, try consecutive whole-number Floats and observe their distribution in a HashMap. For the binary representation of floating-point numbers, see 13 | 魔数 0x5f3759df-极客时间; for Float's hashCode(), see the JDK source. I won't expand on it here.

Next, the putVal method:

/**
 * Implements Map.put and related methods.
 *
 * @param hash hash for key
 * @param key the key
 * @param value the value to put
 * @param onlyIfAbsent if true, don't change existing value
 * @param evict if false, the table is in creation mode.
 * @return previous value, or null if none
 */
final V putVal(int hash, K key, V value, boolean onlyIfAbsent,
               boolean evict) {
    Node<K,V>[] tab; Node<K,V> p; int n, i;
    if ((tab = table) == null || (n = tab.length) == 0)
        n = (tab = resize()).length;
    if ((p = tab[i = (n - 1) & hash]) == null)
        tab[i] = newNode(hash, key, value, null);
    else {
        Node<K,V> e; K k;
        if (p.hash == hash &&
            ((k = p.key) == key || (key != null && key.equals(k))))
            e = p;
        else if (p instanceof TreeNode)
            e = ((TreeNode<K,V>)p).putTreeVal(this, tab, hash, key, value);
        else {
            for (int binCount = 0; ; ++binCount) {
                if ((e = p.next) == null) {
                    p.next = newNode(hash, key, value, null);
                    if (binCount >= TREEIFY_THRESHOLD - 1) // -1 for 1st
                        treeifyBin(tab, hash);
                    break;
                }
                if (e.hash == hash &&
                    ((k = e.key) == key || (key != null && key.equals(k))))
                    break;
                p = e;
            }
        }
        if (e != null) { // existing mapping for key
            V oldValue = e.value;
            if (!onlyIfAbsent || oldValue == null)
                e.value = value;
            afterNodeAccess(e);
            return oldValue;
        }
    }
    ++modCount;
    if (++size > threshold)
        resize();
    afterNodeInsertion(evict);
    return null;
}

The rough logic: if the node at the target index is uninitialized, create it from the entry being put; otherwise, if the node's key is fully equal to the key being put, replace that node's value in place; if not, create a new node and link it via next into a list. When the list length reaches 8, the bin is converted to a red-black tree, lowering lookup complexity from O(n) to O(log n). On this red-black tree optimization, see JEP 180: Handle Frequent HashMap Collisions with Balanced Trees, which also explains why Hashtable did not receive it: Hashtable dates back to Java 1.0 and much legacy code depends on its iteration order, which this change would alter (within a colliding bucket, element order would change from insertion order to comparison order), so it was skipped for backward compatibility. WeakHashMap was skipped because introducing the feature caused unacceptable microbenchmark regressions, and IdentityHashMap needs no such optimization because it hashes with System.identityHashCode(), making collisions extremely unlikely.

An expansion is triggered both when the underlying array tab is empty and when, after an insertion, the HashMap's element count exceeds the threshold; both paths call the resize method, whose code is:

/**
 * Initializes or doubles table size.  If null, allocates in
 * accord with initial capacity target held in field threshold.
 * Otherwise, because we are using power-of-two expansion, the
 * elements from each bin must either stay at same index, or move
 * with a power of two offset in the new table.
 *
 * @return the table
 */
final Node<K,V>[] resize() {
    Node<K,V>[] oldTab = table;
    int oldCap = (oldTab == null) ? 0 : oldTab.length;
    int oldThr = threshold;
    int newCap, newThr = 0;
    if (oldCap > 0) {
        if (oldCap >= MAXIMUM_CAPACITY) {
            threshold = Integer.MAX_VALUE;
            return oldTab;
        }
        else if ((newCap = oldCap << 1) < MAXIMUM_CAPACITY &&
                 oldCap >= DEFAULT_INITIAL_CAPACITY)
            newThr = oldThr << 1; // double threshold
    }
    else if (oldThr > 0) // initial capacity was placed in threshold
        newCap = oldThr;
    else {               // zero initial threshold signifies using defaults
        newCap = DEFAULT_INITIAL_CAPACITY;
        newThr = (int)(DEFAULT_LOAD_FACTOR * DEFAULT_INITIAL_CAPACITY);
    }
    if (newThr == 0) {
        float ft = (float)newCap * loadFactor;
        newThr = (newCap < MAXIMUM_CAPACITY && ft < (float)MAXIMUM_CAPACITY ?
                  (int)ft : Integer.MAX_VALUE);
    }
    threshold = newThr;
    @SuppressWarnings({"rawtypes","unchecked"})
    Node<K,V>[] newTab = (Node<K,V>[])new Node[newCap];
    table = newTab;
    if (oldTab != null) {
        for (int j = 0; j < oldCap; ++j) {
            Node<K,V> e;
            if ((e = oldTab[j]) != null) {
                oldTab[j] = null;
                if (e.next == null)
                    newTab[e.hash & (newCap - 1)] = e;
                else if (e instanceof TreeNode)
                    ((TreeNode<K,V>)e).split(this, newTab, j, oldCap);
                else { // preserve order
                    Node<K,V> loHead = null, loTail = null;
                    Node<K,V> hiHead = null, hiTail = null;
                    Node<K,V> next;
                    do {
                        next = e.next;
                        if ((e.hash & oldCap) == 0) {
                            if (loTail == null)
                                loHead = e;
                            else
                                loTail.next = e;
                            loTail = e;
                        }
                        else {
                            if (hiTail == null)
                                hiHead = e;
                            else
                                hiTail.next = e;
                            hiTail = e;
                        }
                    } while ((e = next) != null);
                    if (loTail != null) {
                        loTail.next = null;
                        newTab[j] = loHead;
                    }
                    if (hiTail != null) {
                        hiTail.next = null;
                        newTab[j + oldCap] = hiHead;
                    }
                }
            }
        }
    }
    return newTab;
}

First, the method's own comment: it initializes the table or doubles its size; if the table is null, it is allocated according to the initial capacity target held in the threshold field; otherwise, because expansion is by powers of two, each element in a bin either stays at the same index or moves by an offset of the old capacity in the new table.

That may not be entirely clear yet, so let's unpack it, returning once more to the last line of HashMap(int initialCapacity, float loadFactor): this.threshold = tableSizeFor(initialCapacity). This line assigns the computed table size to this.threshold, which is what puzzled me at first: why store a table size in the threshold field? The answer is in the comment on this.threshold:

/**
 * The next size value at which to resize (capacity * load factor).
 *
 * @serial
 */
// (The javadoc description is true upon serialization.
// Additionally, if the table array has not been allocated, this
// field holds the initial array capacity, or zero signifying
// DEFAULT_INITIAL_CAPACITY.)
int threshold;

So while the table array has not yet been allocated, this field temporarily holds the initial array capacity, which explains the note in resize: if the table is null, allocate according to the initial capacity held in threshold.

In resize, when oldCap is greater than 0, the new capacity is computed as newCap = oldCap << 1, i.e. the old capacity is doubled. Once you understand the power-of-two sizing explained at the start of this article, the low/high list splitting during expansion follows naturally. For example, with oldCap = 16 and therefore newCap = 32, take the hash values {5, 8, 20, 53, 60}, whose binary representations are:

oldCap = 16:
00000000 00000000 00000000 00010000
oldCap - 1 = 15:
00000000 00000000 00000000 00001111

hash = 5:
00000000 00000000 00000000 00000101 // oldIndex = 5
hash = 8:
00000000 00000000 00000000 00001000 // oldIndex = 8
hash = 20:
00000000 00000000 00000000 00010100 // oldIndex = 4
hash = 53:
00000000 00000000 00000000 00110101 // oldIndex = 5
hash = 60:
00000000 00000000 00000000 00111100 // oldIndex = 12

newCap = 32:
00000000 00000000 00000000 00100000
newCap - 1 = 31:
00000000 00000000 00000000 00011111
hash = 5:
00000000 00000000 00000000 00000101 // newIndex = 5
hash = 8:
00000000 00000000 00000000 00001000 // newIndex = 8
hash = 20:
00000000 00000000 00000000 00010100 // newIndex = 20 = 4 + oldCap
hash = 53:
00000000 00000000 00000000 00110101 // newIndex = 21 = 5 + oldCap
hash = 60:
00000000 00000000 00000000 00111100 // newIndex = 28 = 12 + oldCap

The pivotal check in resize is if ((e.hash & oldCap) == 0), which decides whether a node goes into the low list or the high list. How is this derived? We know a key's index is computed as tabIndex = (capacity - 1) & hash(key);, so compare oldCap - 1 with newCap - 1:

oldCap - 1 = 15:
00000000 00000000 00000000 00001111
newCap - 1 = 31:
00000000 00000000 00000000 00011111

The only bit in which they differ is exactly the bit of oldCap:

oldCap = 16:
00000000 00000000 00000000 00010000

So the meaning of if ((e.hash & oldCap) == 0) is clear: it tests whether the index computed by tabIndex = (capacity - 1) & hash(key); changes after the expansion. If the AND is 0, the index is unchanged; otherwise the index shifts by exactly oldCap, which matches the example above.
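The split rule can be checked against the example hash set directly (my own snippet, not JDK code):

```java
public class ResizeSplitDemo {
    public static void main(String[] args) {
        int oldCap = 16, newCap = 32;
        int[] hashes = {5, 8, 20, 53, 60}; // the example set from the text
        for (int h : hashes) {
            int oldIndex = h & (oldCap - 1);
            int newIndex = h & (newCap - 1);
            boolean staysLow = (h & oldCap) == 0; // the check used in resize()
            // staysLow  <=>  newIndex == oldIndex
            // otherwise      newIndex == oldIndex + oldCap
            System.out.println("hash=" + h + " oldIndex=" + oldIndex
                    + " newIndex=" + newIndex + " low list=" + staysLow);
        }
    }
}
```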

Finally, consider the declarations of some of HashMap's fields:

/**
 * The table, initialized on first use, and resized as
 * necessary. When allocated, length is always a power of two.
 * (We also tolerate length zero in some operations to allow
 * bootstrapping mechanics that are currently not needed.)
 */
transient Node<K,V>[] table;

/**
 * Holds cached entrySet(). Note that AbstractMap fields are used
 * for keySet() and values().
 */
transient Set<Map.Entry<K,V>> entrySet;

/**
 * The number of key-value mappings contained in this map.
 */
transient int size;

/**
 * The number of times this HashMap has been structurally modified.
 * Structural modifications are those that change the number of mappings in
 * the HashMap or otherwise modify its internal structure (e.g.,
 * rehash). This field is used to make iterators on Collection-views of
 * the HashMap fail-fast. (See ConcurrentModificationException).
 */
transient int modCount;

Why are all these fields marked transient? For the rationale, see Serializable. The crucial point is that a key's hashCode() must stay stable while the key is in the HashMap; otherwise the element becomes unreachable. I once handled a production incident caused by exactly this: after a key object was put into a HashMap, one of its member fields was mutated, so its hashCode() changed, remove could no longer locate the entry, memory leaked, and the process eventually hit an OOM.
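The incident described above is easy to reproduce with a deliberately broken key whose hashCode() depends on mutable state (all names here are my own, purely for illustration):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

public class MutableKeyDemo {
    static class Key {
        int id;
        Key(int id) { this.id = id; }
        @Override public int hashCode() { return Objects.hash(id); }
        @Override public boolean equals(Object o) {
            return o instanceof Key && ((Key) o).id == id;
        }
    }

    public static void main(String[] args) {
        Map<Key, String> map = new HashMap<>();
        Key key = new Key(1);
        map.put(key, "value");
        key.id = 2; // hashCode() now differs from what it was at insertion
        // Lookups probe the bucket for the NEW hash, but the entry still
        // sits in the bucket chosen from the OLD hash.
        System.out.println(map.get(key));    // null
        System.out.println(map.remove(key)); // null: the entry cannot be removed
        System.out.println(map.size());      // 1: the entry leaks
    }
}
```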

Finally, it must be mentioned that the HashMap JavaDoc states in bold:

Note that this implementation is not synchronized.

The implementation is not synchronized, i.e. it is not thread-safe; for the thread-safety issues, see the following two articles:

Reference

Why the size of Hashmap is power of 2 in Java - Runzhuo Li
HashMap.tableSizeFor(…). How does this code round up to the next power of 2? - Stack Overflow
Why is the initialCapacity of Hashtable 11 while the DEFAULT_INITIAL_CAPACITY in HashMap is 16 and requires a power of 2? - Stack Overflow
Prime number - Wikipedia
data structures - Why is it best to use a prime number as a mod in a hashing function? - Computer Science Stack Exchange
Why does HashMap.put both compare hashes and test equality? - Stack Overflow
Change to HashMap hash function in Java 8 - Stack Overflow
Prefer to insert node into the head or tail of the linked list when we implement a separate chaining HashMap? - Stack Overflow