Serializable

Posted on 2021-04-25

In Java serialization, the two core interfaces are Serializable and Externalizable.

The source of Serializable is:

public interface Serializable { }

The source of Externalizable is:

public interface Externalizable extends java.io.Serializable {
    /**
     * The object implements the writeExternal method to save its contents
     * by calling the methods of DataOutput for its primitive values or
     * calling the writeObject method of ObjectOutput for objects, strings,
     * and arrays.
     *
     * @serialData Overriding methods should use this tag to describe
     *             the data layout of this Externalizable object.
     *             List the sequence of element types and, if possible,
     *             relate the element to a public/protected field and/or
     *             method of this Externalizable class.
     *
     * @param out the stream to write the object to
     * @exception IOException Includes any I/O exceptions that may occur
     */
    void writeExternal(ObjectOutput out) throws IOException;

    /**
     * The object implements the readExternal method to restore its
     * contents by calling the methods of DataInput for primitive
     * types and readObject for objects, strings and arrays.  The
     * readExternal method must read the values in the same sequence
     * and with the same types as were written by writeExternal.
     *
     * @param in the stream to read data from in order to restore the object
     * @exception IOException if I/O errors occur
     * @exception ClassNotFoundException If the class for an object being
     *              restored cannot be found.
     */
    void readExternal(ObjectInput in) throws IOException, ClassNotFoundException;
}

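As shown, Externalizable extends Serializable and declares two additional methods. For the differences between them, see What is the difference between Serializable and Externalizable in Java?. For the implementation, follow the call chain in the JDK source from ObjectOutputStream.writeObject down to ObjectOutputStream.writeOrdinaryObject, where the two interfaces are handled differently; the details are not analyzed here.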
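The difference between the two interfaces can be sketched with a small round trip. This is only an illustration: the classes SerializationDemo, Point and ExtPoint and the helper roundTrip are hypothetical names made up for this example, not part of the JDK.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class SerializationDemo {
    // Serializable: a marker interface; the stream writes the fields itself.
    static class Point implements Serializable {
        private static final long serialVersionUID = 1L;
        final int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    // Externalizable: the class decides what is written and in which order.
    // A public no-arg constructor is required for deserialization.
    public static class ExtPoint implements Externalizable {
        int x, y;
        public ExtPoint() { }
        ExtPoint(int x, int y) { this.x = x; this.y = y; }

        @Override
        public void writeExternal(ObjectOutput out) throws IOException {
            out.writeInt(x);
            out.writeInt(y);
        }

        @Override
        public void readExternal(ObjectInput in) throws IOException {
            // Must read in the same order and with the same types as written.
            x = in.readInt();
            y = in.readInt();
        }
    }

    // Hypothetical helper: serialize to a byte array and read the object back.
    static Object roundTrip(Object value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(value);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Point p = (Point) roundTrip(new Point(1, 2));
        ExtPoint e = (ExtPoint) roundTrip(new ExtPoint(3, 4));
        System.out.println(p.x + "," + p.y + " " + e.x + "," + e.y);
    }
}
```

Both objects survive the round trip; the difference is who controls the byte layout, the stream (Serializable) or the class itself (Externalizable).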

Any discussion of serialization has to mention the transient keyword. The JLS describes it only briefly:

Variables may be marked transient to indicate that they are not part of the persistent state of an object.

transient appears frequently in the source of common collection framework classes. Why did the authors add it?
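As a minimal sketch of the basic effect, assuming default serialization, a transient field is skipped when the object is written and comes back with its default value. The class names TransientDemo and Session are hypothetical and exist only for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class TransientDemo {
    static class Session implements Serializable {
        private static final long serialVersionUID = 1L;
        String user;            // part of the persistent state
        transient String token; // excluded from default serialization
        Session(String user, String token) {
            this.user = user;
            this.token = token;
        }
    }

    // Hypothetical helper: write the session to bytes and read it back.
    static Session roundTrip(Session session) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(session);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Session) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Session copy = roundTrip(new Session("alice", "secret"));
        System.out.println(copy.user);  // the non-transient field is restored
        System.out.println(copy.token); // the transient field is null after the round trip
    }
}
```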


GC Causes

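Posted on 2021-04-24

In my experience, the most common GC causes are:

System.gc(): triggered explicitly from code
Allocation Failure: the most common GC cause; it mainly occurs in the young generation and triggers a Young GC
Ergonomics: caused by the collector dynamically resizing the heap to meet its throughput or minimum pause time goal; setting the minimum and maximum heap sizes to the same value is recommended, see Tuning Tips for Heap Sizes
Metadata GC Threshold: triggered when the committed space of the metadata area reaches the high-water mark, see Class Metadata
GCLocker Initiated GC: a thread inside a JNI critical region may hold pointers into the heap, so GC is temporarily blocked while any thread is in such a region; the GC is triggered once all threads have left the JNI critical region, see GCLocker_Initiated_GC and GetPrimitiveArrayCritical, ReleasePrimitiveArrayCritical
G1 Evacuation Pause: see G1_Evacuation_Pause
G1 Humongous Allocation: see G1_Humongous_Allocation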
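The first cause in the list is easy to observe from code. The following is a rough sketch, assuming a standard HotSpot JVM; GcCountDemo and totalGcCount are hypothetical names. System.gc() is only a request, and it is ignored when the JVM runs with -XX:+DisableExplicitGC, so the counter is not guaranteed to change.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

class GcCountDemo {
    // Sums the collection counts reported by all collectors of this JVM.
    // getCollectionCount() returns -1 when the count is undefined, so clamp at 0.
    static long totalGcCount() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            total += Math.max(0, gc.getCollectionCount());
        }
        return total;
    }

    public static void main(String[] args) {
        long before = totalGcCount();
        System.gc(); // explicit request; shows up in GC logs with the cause System.gc()
        long after = totalGcCount();
        System.out.println("collections before: " + before + ", after: " + after);
    }
}
```

Running it with GC logging enabled (-Xlog:gc on JDK 9 and later, -XX:+PrintGCDetails on JDK 8) should make the cause visible in the log output.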

Reference

gcCause.hpp
gcCause.cpp
GC Causes

G1 GC

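Posted on 2021-04-23

On Oracle's official website, G1 GC is introduced in Garbage-First Garbage Collector and Garbage-First Garbage Collector Tuning. The key points are:

The G1 collector is a server-style garbage collector, targeted at multiprocessor machines with large memories. It meets garbage collection (GC) pause time goals with high probability while achieving high throughput. Whole-heap operations, such as global marking, are performed concurrently with the application threads. This prevents interruptions proportional to heap or live-data size.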


HashMap

Posted on 2021-04-18

The following analysis is based on HashMap in JDK 8. First, the constructors:

/**
 * Constructs an empty <tt>HashMap</tt> with the specified initial
 * capacity and load factor.
 *
 * @param  initialCapacity the initial capacity
 * @param  loadFactor      the load factor
 * @throws IllegalArgumentException if the initial capacity is negative
 *         or the load factor is nonpositive
 */
public HashMap(int initialCapacity, float loadFactor) {
    if (initialCapacity < 0)
        throw new IllegalArgumentException("Illegal initial capacity: " +
                                           initialCapacity);
    if (initialCapacity > MAXIMUM_CAPACITY)
        initialCapacity = MAXIMUM_CAPACITY;
    if (loadFactor <= 0 || Float.isNaN(loadFactor))
        throw new IllegalArgumentException("Illegal load factor: " +
                                           loadFactor);
    this.loadFactor = loadFactor;
    this.threshold = tableSizeFor(initialCapacity);
}

/**
 * Constructs an empty <tt>HashMap</tt> with the specified initial
 * capacity and the default load factor (0.75).
 *
 * @param  initialCapacity the initial capacity.
 * @throws IllegalArgumentException if the initial capacity is negative.
 */
public HashMap(int initialCapacity) {
    this(initialCapacity, DEFAULT_LOAD_FACTOR);
}

/**
 * Constructs an empty <tt>HashMap</tt> with the default initial capacity
 * (16) and the default load factor (0.75).
 */
public HashMap() {
    this.loadFactor = DEFAULT_LOAD_FACTOR; // all other fields defaulted
}

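As shown, the most common call, new HashMap(), only sets loadFactor to the default load factor of 0.75; the underlying array Node<K,V>[] table is not initialized at this point.

Why is the default load factor 0.75? The HashMap JavaDoc addresses this explicitly: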

As a general rule, the default load factor (.75) offers a good tradeoff between time and space costs. Higher values decrease the space overhead but increase the lookup cost (reflected in most of the operations of the HashMap class, including get and put). The expected number of entries in the map and its load factor should be taken into account when setting its initial capacity, so as to minimize the number of rehash operations. If the initial capacity is greater than the maximum number of entries divided by the load factor, no rehash operations will ever occur.

If many mappings are to be stored in a HashMap instance, creating it with a sufficiently large capacity will allow the mappings to be stored more efficiently than letting it perform automatic rehashing as needed to grow the table. Note that using many keys with the same hashCode() is a sure way to slow down performance of any hash table. To ameliorate impact, when keys are Comparable, this class may use comparison order among keys to help break ties.
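The sizing advice in the JavaDoc above can be turned into a small helper. This is a sketch under stated assumptions: InitialCapacityDemo and capacityFor are hypothetical names, and the rule used here is simply the expected number of entries divided by the load factor, rounded up.

```java
import java.util.HashMap;
import java.util.Map;

class InitialCapacityDemo {
    // Hypothetical helper: smallest capacity that is at least
    // expectedEntries / loadFactor, as suggested by the JavaDoc.
    static int capacityFor(int expectedEntries, float loadFactor) {
        return (int) Math.ceil(expectedEntries / (double) loadFactor);
    }

    public static void main(String[] args) {
        int expected = 100;
        // 100 / 0.75 = 133.33..., rounded up to 134.
        int capacity = capacityFor(expected, 0.75f);
        Map<String, Integer> map = new HashMap<>(capacity);
        for (int i = 0; i < expected; i++) {
            map.put("key" + i, i);
        }
        System.out.println(capacity + " " + map.size());
    }
}
```

The constructor rounds the requested capacity up to a power of two, so the table ends up somewhat larger than the value passed in; the point of the helper is only to avoid rehashing while the expected entries are inserted.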

Now look at the last line of HashMap(int initialCapacity, float loadFactor):
this.threshold = tableSizeFor(initialCapacity). The tableSizeFor method is implemented as follows:

/**
 * Returns a power of two size for the given target capacity.
 */
static final int tableSizeFor(int cap) {
    int n = cap - 1;
    n |= n >>> 1;
    n |= n >>> 2;
    n |= n >>> 4;
    n |= n >>> 8;
    n |= n >>> 16;
    return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
}
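To see what the bit smearing does, the method can be copied into a standalone class and run on a few inputs. TableSizeForDemo is a hypothetical wrapper for this illustration; the method body is the JDK 8 code shown above, and MAXIMUM_CAPACITY is 1 << 30 as in HashMap.

```java
class TableSizeForDemo {
    static final int MAXIMUM_CAPACITY = 1 << 30;

    // Copy of the JDK 8 method shown above.
    static int tableSizeFor(int cap) {
        int n = cap - 1;  // so that an exact power of two maps to itself
        n |= n >>> 1;     // each step copies the highest set bit further right,
        n |= n >>> 2;     // until every bit below it is set
        n |= n >>> 4;
        n |= n >>> 8;
        n |= n >>> 16;
        return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
    }

    public static void main(String[] args) {
        // Trace for cap = 17: n = 16 (10000) -> 11000 -> 11110 -> 11111 = 31, result 32.
        int[] inputs = {0, 1, 10, 16, 17};
        for (int cap : inputs) {
            System.out.println(cap + " -> " + tableSizeFor(cap));
        }
    }
}
```

For these inputs the results are 1, 1, 16, 16 and 32: the smallest power of two that is greater than or equal to the requested capacity.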


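On the Number of Partitions in Spark

Posted on 2021-04-07

Coalesce Hints for SQL Queries: this feature controls the number of output files. Our data warehouse sync used to be slow; investigation showed that most of the time went to data exchange with OSS, mainly because of small files. After the shuffle, each table's sync job produced 200 files by default. We later changed the job to compute a suitable partition count from each table's row count and embed it in the SQL through the hint above, which cut the total sync time by nearly 50%.

We also found an explicit call to System.gc() in the EMR-OSS connector, which caused a lot of time to be spent on unnecessary Full GCs. We removed the call to speed up the sync.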
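The partition-count idea can be sketched as plain string building. Everything here is an assumption made for illustration: the post does not give the actual formula, so the rows-per-partition rule, the cap of 200, the table name orders, and the names CoalesceHintDemo, partitionsFor and withCoalesceHint are all hypothetical. Only the hint syntax /*+ COALESCE(n) */ comes from Spark SQL.

```java
class CoalesceHintDemo {
    // Hypothetical sizing rule: one partition per rowsPerPartition rows,
    // clamped to the range [1, maxPartitions].
    static int partitionsFor(long rowCount, long rowsPerPartition, int maxPartitions) {
        long needed = (rowCount + rowsPerPartition - 1) / rowsPerPartition; // ceiling division
        return (int) Math.max(1, Math.min(maxPartitions, needed));
    }

    // Embeds the computed partition count into the query as a Spark SQL coalesce hint.
    static String withCoalesceHint(String table, int partitions) {
        return "SELECT /*+ COALESCE(" + partitions + ") */ * FROM " + table;
    }

    public static void main(String[] args) {
        int partitions = partitionsFor(2_500_000L, 1_000_000L, 200);
        System.out.println(withCoalesceHint("orders", partitions));
    }
}
```

A small table then produces a single output file instead of 200, which is the effect that reduced the number of small files written to OSS.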

Reference

Spark Partitioning & Partition Understanding
Spark SQL Shuffle Partitions