Poison

Implementing Class-Loading Isolation for a Java Agent

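Before I actually built a Java agent, I did not understand why one needs class-loading isolation. That changed when our business needed a Java agent for application monitoring: during development I gained a much deeper understanding of the class loader hierarchy and of class isolation. This post is a brief record of that.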

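Our early monitoring agent did no class-loading isolation. The first implementation was very simple: it only watched for heap dump files and raised an alert when one appeared, so the agent had no dependencies at all. As the business grew and more and more dependencies were added to the agent, we found that integrating it into JVM applications triggered all kinds of class-loading errors, for example "X cannot be cast to X" exceptions.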

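First, a brief introduction to Java agents. According to Oracle's documentation for java.lang.instrument (Java Platform SE 8), a Java agent is deployed as a JAR file and can be started in two ways: on the command line together with the Java application, or, after the JVM has already started, by attaching to the running application and loading the agent into it.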

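The documentation also states that the agent's entry class is loaded by the system class loader, which is also the class loader that loads the class containing the application's main method.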

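In our setup the Java agent is wired in uniformly at the Docker image layer, that is, it starts with the Java application via the command line, so the agent's entry class provides a premain method. There is no specific restriction on what premain may do: anything an application's main method can do is also legal in premain, including creating threads.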
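
As an illustration of the premain entry point described above, here is a minimal sketch in the spirit of the early heap-dump-watching agent. The class name `HeapDumpWatchAgent`, the use of the agent argument as a directory, and the five-second polling interval are all assumptions made for this example; they are not taken from our real agent.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.instrument.Instrumentation;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical agent class. Declared without "public" only so the sketch fits in
// one file; the agent jar's manifest would name it in its Premain-Class attribute.
class HeapDumpWatchAgent {

    // Invoked by the JVM before the application's main method when the JVM is
    // started with -javaagent:agent.jar=<dir>; agentArgs is the text after '='.
    public static void premain(String agentArgs, Instrumentation inst) {
        Path dir = Paths.get(agentArgs == null || agentArgs.isEmpty() ? "." : agentArgs);
        Thread watcher = new Thread(() -> watch(dir), "heap-dump-watcher");
        watcher.setDaemon(true); // a daemon thread never keeps the application JVM alive
        watcher.start();
    }

    // Polls until a heap dump shows up; an I/O failure simply ends the watcher thread.
    private static void watch(Path dir) {
        try {
            while (!hasHeapDump(dir)) {
                Thread.sleep(5_000);
            }
            System.err.println("Heap dump found in " + dir); // stand-in for a real alert
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // True if the directory currently contains at least one *.hprof file.
    static boolean hasHeapDump(Path dir) {
        try (DirectoryStream<Path> dumps = Files.newDirectoryStream(dir, "*.hprof")) {
            return dumps.iterator().hasNext();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

An agent this small uses only JDK classes, which is why it cannot clash with anything the application ships.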

So why does a Java agent need class-loading isolation? An example shows what goes wrong without it. Our agent needs to write logs, so we added the following dependency:

<dependency>
    <groupId>org.apache.logging.log4j</groupId>
    <artifactId>log4j-core</artifactId>
    <version>2.14.1</version>
</dependency>

That is, the Java agent depends on log4j-core 2.14.1. Our web application, meanwhile, has the following dependency:

<dependency>
    <groupId>org.apache.logging.log4j</groupId>
    <artifactId>log4j-core</artifactId>
    <version>2.11.1</version>
</dependency>

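That is, the web application depends on log4j-core 2.11.1. From our earlier analysis of Tomcat's class loader hierarchy, we know that a web application's dependencies live in /WEB-INF/lib and are loaded by the Webapp ClassLoader. log4j loads its own plugins with a mechanism similar to Java SPI: it scans for META-INF/org/apache/logging/log4j/core/config/plugins/Log4j2Plugins.dat files, reads the plugin implementation class names listed in them, and then loads those classes. From the web application's point of view, it is the log4j-core 2.11.1 code that looks up and loads the implementation classes. The loading code is in org.apache.logging.log4j.core.config.plugins.util.PluginRegistry#decodeCacheFiles: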

private Map<String, List<PluginType<?>>> decodeCacheFiles(final ClassLoader loader) {
    final long startTime = System.nanoTime();
    final PluginCache cache = new PluginCache();
    try {
        final Enumeration<URL> resources = loader.getResources(PluginProcessor.PLUGIN_CACHE_FILE);
        if (resources == null) {
            LOGGER.info("Plugin preloads not available from class loader {}", loader);
        } else {
            cache.loadCacheFiles(resources);
        }
    } catch (final IOException ioe) {
        LOGGER.warn("Unable to preload plugins", ioe);
    }
    final Map<String, List<PluginType<?>>> newPluginsByCategory = new HashMap<>();
    int pluginCount = 0;
    for (final Map.Entry<String, Map<String, PluginEntry>> outer : cache.getAllCategories().entrySet()) {
        final String categoryLowerCase = outer.getKey();
        final List<PluginType<?>> types = new ArrayList<>(outer.getValue().size());
        newPluginsByCategory.put(categoryLowerCase, types);
        for (final Map.Entry<String, PluginEntry> inner : outer.getValue().entrySet()) {
            final PluginEntry entry = inner.getValue();
            final String className = entry.getClassName();
            try {
                final Class<?> clazz = loader.loadClass(className);
                final PluginType<?> type = new PluginType<>(entry, clazz, entry.getName());
                types.add(type);
                ++pluginCount;
            } catch (final ClassNotFoundException e) {
                LOGGER.info("Plugin [{}] could not be loaded due to missing classes.", className, e);
            } catch (final VerifyError e) {
                LOGGER.info("Plugin [{}] could not be loaded due to verification error.", className, e);
            }
        }
    }

    final long endTime = System.nanoTime();
    final DecimalFormat numFormat = new DecimalFormat("#0.000000");
    final double seconds = (endTime - startTime) * 1e-9;
    LOGGER.debug("Took {} seconds to load {} plugins from {}",
        numFormat.format(seconds), pluginCount, loader);
    return newPluginsByCategory;
}
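
The `loader.getResources(...)` call in this method asks the whole delegation chain, not just one loader. A minimal sketch of that behavior, using two temporary directories and plain `URLClassLoader` instances as stand-ins for the agent's and the web application's class paths, follows. The resource name and the class `GetResourcesDemo` are made up for the example. Note also that a plain `URLClassLoader` consults its parent first, so the order of the returned URLs here says nothing about the order Tomcat's Webapp ClassLoader uses; the sketch only shows that both copies are visible from the child.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.List;

class GetResourcesDemo {
    // Hypothetical resource name standing in for Log4j2Plugins.dat.
    static final String RESOURCE = "META-INF/demo/Plugins.dat";

    // Creates a temporary directory that contains one copy of RESOURCE.
    static URL classPathRoot(String label) throws IOException {
        Path root = Files.createTempDirectory(label);
        Path file = root.resolve(RESOURCE);
        Files.createDirectories(file.getParent());
        Files.write(file, label.getBytes(StandardCharsets.UTF_8));
        return root.toUri().toURL(); // URL of an existing directory ends with '/'
    }

    // Asks the child loader for RESOURCE and returns every URL it can see.
    static List<URL> visibleCopies() {
        try (URLClassLoader parent = new URLClassLoader(new URL[] {classPathRoot("agent")}, null);
             URLClassLoader child = new URLClassLoader(new URL[] {classPathRoot("webapp")}, parent)) {
            return Collections.list(child.getResources(RESOURCE));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Prints one URL per copy found along the loader chain.
        visibleCopies().forEach(System.out::println);
    }
}
```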

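The key points in decodeCacheFiles are line 5 (the loader.getResources(...) call) and line 24 (the loader.loadClass(className) call). When the web application's log4j-core 2.11.1 executes line 5, the call starts from the Webapp ClassLoader, so it returns the Log4j2Plugins.dat resources found anywhere in the class loader hierarchy. Two are returned: the one inside the web application's log4j-core 2.11.1 and the one inside the agent's log4j-core 2.14.1, that is, two different versions of the same resource. Given the order in which the Webapp ClassLoader looks up resources, the Log4j2Plugins.dat from log4j-core 2.11.1 is returned before the one from log4j-core 2.14.1. Line 9 (cache.loadCacheFiles(resources)) then merges the data of the two files. That code is in org.apache.logging.log4j.core.config.plugins.processor.PluginCache#loadCacheFiles: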

private final Map<String, Map<String, PluginEntry>> categories =
    new LinkedHashMap<>();

/**
 * Loads and merges all the Log4j plugin cache files specified. Usually, this is obtained via a ClassLoader.
 *
 * @param resources URLs to all the desired plugin cache files to load.
 * @throws IOException if an I/O exception occurs.
 */
public void loadCacheFiles(final Enumeration<URL> resources) throws IOException {
    categories.clear();
    while (resources.hasMoreElements()) {
        final URL url = resources.nextElement();
        try (final DataInputStream in = new DataInputStream(new BufferedInputStream(url.openStream()))) {
            final int count = in.readInt();
            for (int i = 0; i < count; i++) {
                final String category = in.readUTF();
                final Map<String, PluginEntry> m = getCategory(category);
                final int entries = in.readInt();
                for (int j = 0; j < entries; j++) {
                    final PluginEntry entry = new PluginEntry();
                    entry.setKey(in.readUTF());
                    entry.setClassName(in.readUTF());
                    entry.setName(in.readUTF());
                    entry.setPrintable(in.readBoolean());
                    entry.setDefer(in.readBoolean());
                    entry.setCategory(category);
                    if (!m.containsKey(entry.getKey())) {
                        m.put(entry.getKey(), entry);
                    }
                }
            }
        }
    }
}
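
The `containsKey` check in `loadCacheFiles` means that the first file to define a key wins, while keys that exist only in a later file are still added. A minimal sketch of that merge rule, with plain string maps in place of `PluginEntry` objects, looks like this. The class `FirstWinsMergeDemo` and the map contents are illustrative only, not real contents of either Log4j2Plugins.dat file.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class FirstWinsMergeDemo {
    // Merges key -> class-name maps in list order; an existing key is never overwritten.
    static Map<String, String> merge(List<Map<String, String>> cacheFiles) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map<String, String> file : cacheFiles) {
            for (Map.Entry<String, String> entry : file.entrySet()) {
                if (!merged.containsKey(entry.getKey())) {
                    merged.put(entry.getKey(), entry.getValue());
                }
            }
        }
        return merged;
    }

    // Illustrative data: the webapp file is read first, the agent file second.
    static Map<String, String> example() {
        Map<String, String> webapp = Map.of("date", "DateLookup (webapp copy)");
        Map<String, String> agent = Map.of(
                "date", "DateLookup (agent copy)",
                "event", "EventLookup (agent copy only)");
        return merge(List.of(webapp, agent));
    }

    public static void main(String[] args) {
        example().forEach((key, value) -> System.out.println(key + " -> " + value));
    }
}
```

The merged result therefore contains entries for classes that exist only in the agent's version of the library, which is what sets up the failure discussed next.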

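As loadCacheFiles shows, the contents of the two Log4j2Plugins.dat resources are merged; when both files define the same key, the entry read first is kept. Afterwards, line 24 of decodeCacheFiles, final Class<?> clazz = loader.loadClass(className);, uses the Webapp ClassLoader to try to load each class. The web application's log4j-core 2.11.1 does not contain some of the classes listed in the Log4j2Plugins.dat of the agent's log4j-core 2.14.1, so for those classes the call on line 24 ends up delegating to the System ClassLoader, which loads them successfully. Later processing then narrows the loaded classes to a specific subtype. That code is in org.apache.logging.log4j.core.lookup.Interpolator#Interpolator(org.apache.logging.log4j.core.lookup.StrLookup, java.util.List<java.lang.String>):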

/**
 * Constructs an Interpolator using a given StrLookup and a list of packages to find Lookup plugins in.
 *
 * @param defaultLookup the default StrLookup to use as a fallback
 * @param pluginPackages a list of packages to scan for Lookup plugins
 * @since 2.1
 */
public Interpolator(final StrLookup defaultLookup, final List<String> pluginPackages) {
    this.defaultLookup = defaultLookup == null ? new MapLookup(new HashMap<String, String>()) : defaultLookup;
    final PluginManager manager = new PluginManager(CATEGORY);
    manager.collectPlugins(pluginPackages);
    final Map<String, PluginType<?>> plugins = manager.getPlugins();

    for (final Map.Entry<String, PluginType<?>> entry : plugins.entrySet()) {
        try {
            final Class<? extends StrLookup> clazz = entry.getValue().getPluginClass().asSubclass(StrLookup.class);
            strLookupMap.put(entry.getKey(), ReflectionUtil.instantiate(clazz));
        } catch (final Throwable t) {
            handleError(entry.getKey(), t);
        }
    }
}

The key line is line 16, which calls java.lang.Class#asSubclass. Its source is:

/**
 * Casts this {@code Class} object to represent a subclass of the class
 * represented by the specified class object. Checks that the cast
 * is valid, and throws a {@code ClassCastException} if it is not. If
 * this method succeeds, it always returns a reference to this class object.
 *
 * <p>This method is useful when a client needs to "narrow" the type of
 * a {@code Class} object to pass it to an API that restricts the
 * {@code Class} objects that it is willing to accept. A cast would
 * generate a compile-time warning, as the correctness of the cast
 * could not be checked at runtime (because generic types are implemented
 * by erasure).
 *
 * @param <U> the type to cast this class object to
 * @param clazz the class of the type to cast this class object to
 * @return this {@code Class} object, cast to represent a subclass of
 *    the specified class object.
 * @throws ClassCastException if this {@code Class} object does not
 *    represent a subclass of the specified class (here "subclass" includes
 *    the class itself).
 * @since 1.5
 */
@SuppressWarnings("unchecked")
public <U> Class<? extends U> asSubclass(Class<U> clazz) {
    if (clazz.isAssignableFrom(this))
        return (Class<? extends U>) this;
    else
        throw new ClassCastException(this.toString());
}

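So when java.lang.Class#asSubclass is called, this may be a class from log4j-core 2.14.1 that was loaded by the system class loader, while the argument StrLookup.class was loaded by the Webapp ClassLoader. From our earlier analysis of Class.isAssignableFrom we know that a class is identified at run time by its name together with the class loader that defined it. The plugin class loaded by the system class loader implements the system class loader's copy of StrLookup, not the Webapp ClassLoader's copy, so even though the inheritance relationship holds by name, the method returns false. The else branch runs and throws a ClassCastException. The stack trace is: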

2021-10-30 20:34:23,956 RMI TCP Connection(2)-127.0.0.1 ERROR Unable to create Lookup for event java.lang.ClassCastException: class org.apache.logging.log4j.core.lookup.EventLookup
    at java.lang.Class.asSubclass(Class.java:3404)
    at org.apache.logging.log4j.core.lookup.Interpolator.<init>(Interpolator.java:73)
    at org.apache.logging.log4j.core.config.AbstractConfiguration.doConfigure(AbstractConfiguration.java:502)
    at org.apache.logging.log4j.core.config.AbstractConfiguration.initialize(AbstractConfiguration.java:238)
    at org.apache.logging.log4j.core.config.AbstractConfiguration.start(AbstractConfiguration.java:250)
    at org.apache.logging.log4j.core.LoggerContext.setConfiguration(LoggerContext.java:547)
    at org.apache.logging.log4j.core.LoggerContext.reconfigure(LoggerContext.java:619)
    at org.apache.logging.log4j.core.LoggerContext.reconfigure(LoggerContext.java:636)
    at org.apache.logging.log4j.core.LoggerContext.start(LoggerContext.java:231)
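
The failing check can be reproduced without log4j. The sketch below is a simplified model, not log4j's or Tomcat's actual code: the nested types `Lookup` and `EventLookup` stand in for StrLookup and a plugin class, and the hypothetical `WebappLikeLoader` defines its own copy of `Lookup` while delegating everything else to its parent. It assumes the demo's compiled classes can be read as resources from its own class loader.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

class LoaderIdentityDemo {
    /** Stands in for StrLookup. */
    interface Lookup {}

    /** Stands in for a plugin class that only the parent loader can provide. */
    static class EventLookup implements Lookup {}

    /** Defines its own copy of Lookup; every other class is delegated to the parent. */
    static class WebappLikeLoader extends ClassLoader {
        private final String ownClass = Lookup.class.getName();

        WebappLikeLoader(ClassLoader parent) {
            super(parent);
        }

        @Override
        protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
            if (!name.equals(ownClass)) {
                return super.loadClass(name, resolve); // normal parent-first delegation
            }
            synchronized (getClassLoadingLock(name)) {
                Class<?> c = findLoadedClass(name);
                if (c == null) {
                    byte[] bytes = readClassBytes(name);
                    c = defineClass(name, bytes, 0, bytes.length);
                }
                if (resolve) {
                    resolveClass(c);
                }
                return c;
            }
        }

        // Reads the compiled bytes of the class through the parent loader.
        private byte[] readClassBytes(String name) throws ClassNotFoundException {
            String path = name.replace('.', '/') + ".class";
            try (InputStream in = getParent().getResourceAsStream(path)) {
                if (in == null) {
                    throw new ClassNotFoundException(name);
                }
                return in.readAllBytes();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    /** Returns { parent's Lookup accepts the plugin, webapp's Lookup accepts the plugin }. */
    static boolean[] compare() {
        try {
            ClassLoader appLoader = LoaderIdentityDemo.class.getClassLoader();
            ClassLoader webapp = new WebappLikeLoader(appLoader);
            Class<?> webappLookup = webapp.loadClass(Lookup.class.getName());
            Class<?> plugin = webapp.loadClass(EventLookup.class.getName()); // comes from the parent
            return new boolean[] {
                Lookup.class.isAssignableFrom(plugin),
                webappLookup.isAssignableFrom(plugin)
            };
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] result = compare();
        System.out.println("parent's Lookup accepts plugin: " + result[0]);
        System.out.println("webapp's Lookup accepts plugin: " + result[1]);
    }
}
```

The two `Lookup` classes share a name but have different defining loaders, so the plugin is assignable only to the copy defined by the loader that also defined the plugin.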

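That is one of the problems caused by a Java agent without class-loading isolation. Because application dependencies are complex, real-world workloads produce far more kinds of errors than this one. So how can it be solved? I looked at several open-source Java agent implementations. One is Uber's jvm-profiler, which uses the Maven Shade Plugin when building the agent JAR to relocate classes and exclude resources. An excerpt from its pom.xml: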

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <version>3.1.0</version>
    <executions>
        <execution>
            <phase>package</phase>
            <goals>
                <goal>shade</goal>
            </goals>
            <configuration>
                <relocations>
                    <relocation>
                        <pattern>org.apache</pattern>
                        <shadedPattern>ujagent_shaded.org.apache</shadedPattern>
                    </relocation>
                    <relocation>
                        <pattern>com.fasterxml</pattern>
                        <shadedPattern>ujagent_shaded.com.fasterxml</shadedPattern>
                    </relocation>
                    <relocation>
                        <pattern>org.codehaus</pattern>
                        <shadedPattern>ujagent_shaded.org.codehaus</shadedPattern>
                    </relocation>
                    <relocation>
                        <pattern>com.thoughtworks</pattern>
                        <shadedPattern>ujagent_shaded.com.thoughtworks</shadedPattern>
                    </relocation>
                    <relocation>
                        <pattern>org.xerial</pattern>
                        <shadedPattern>ujagent_shaded.org.xerial</shadedPattern>
                    </relocation>
                    <relocation>
                        <pattern>org.tukaani</pattern>
                        <shadedPattern>ujagent_shaded.org.tukaani</shadedPattern>
                    </relocation>
                    <relocation>
                        <pattern>org.slf4j</pattern>
                        <shadedPattern>ujagent_shaded.org.slf4j</shadedPattern>
                    </relocation>
                    <relocation>
                        <pattern>com.alibaba</pattern>
                        <shadedPattern>ujagent_shaded.com.alibaba</shadedPattern>
                    </relocation>
                    <relocation>
                        <pattern>net.logstash</pattern>
                        <shadedPattern>ujagent_shaded.net.logstash</shadedPattern>
                    </relocation>
                    <relocation>
                        <pattern>com.timgroup</pattern>
                        <shadedPattern>ujagent_shaded.com.timgroup</shadedPattern>
                    </relocation>
                    <relocation>
                        <pattern>com.amazonaws</pattern>
                        <shadedPattern>ujagent_shaded.com.amazonaws</shadedPattern>
                    </relocation>
                    <relocation>
                        <pattern>org.javassist</pattern>
                        <shadedPattern>ujagent_shaded.org.javassist</shadedPattern>
                    </relocation>
                    <relocation>
                        <pattern>org.joda</pattern>
                        <shadedPattern>ujagent_shaded.org.joda</shadedPattern>
                    </relocation>
                    <relocation>
                        <pattern>avro</pattern>
                        <shadedPattern>ujagent_shaded.avro</shadedPattern>
                    </relocation>
                    <relocation>
                        <pattern>edu</pattern>
                        <shadedPattern>ujagent_shaded.edu</shadedPattern>
                    </relocation>
                    <relocation>
                        <pattern>javassist</pattern>
                        <shadedPattern>ujagent_shaded.javassist</shadedPattern>
                    </relocation>
                    <relocation>
                        <pattern>net</pattern>
                        <shadedPattern>ujagent_shaded.net</shadedPattern>
                    </relocation>
                    <relocation>
                        <pattern>com.uber.data</pattern>
                        <shadedPattern>ujagent_shaded.com.uber.data</shadedPattern>
                    </relocation>
                    <relocation>
                        <pattern>com.uber.m3</pattern>
                        <shadedPattern>ujagent_shaded.com.uber.m3</shadedPattern>
                    </relocation>
                    <relocation>
                        <pattern>com.uber.stream</pattern>
                        <shadedPattern>ujagent_shaded.com.uber.stream</shadedPattern>
                    </relocation>
                    <relocation>
                        <pattern>com.uber.elk</pattern>
                        <shadedPattern>ujagent_shaded.com.uber.elk</shadedPattern>
                    </relocation>
                    <relocation>
                        <pattern>com.shade</pattern>
                        <shadedPattern>ujagent_shaded.com.shade</shadedPattern>
                    </relocation>
                    <relocation>
                        <pattern>org.yaml</pattern>
                        <shadedPattern>ujagent_org.yaml</shadedPattern>
                    </relocation>
                    <relocation>
                        <pattern>redis.clients</pattern>
                        <shadedPattern>ujagent_shaded.redis.clients</shadedPattern>
                    </relocation>
                    <relocation>
                        <pattern>org.influxdb</pattern>
                        <shadedPattern>ujagent_shaded.org.influxdb</shadedPattern>
                    </relocation>
                    <relocation>
                        <pattern>ch</pattern>
                        <shadedPattern>ujagent_shaded.ch</shadedPattern>
                    </relocation>
                    <relocation>
                        <pattern>okio</pattern>
                        <shadedPattern>ujagent_shaded.okio</shadedPattern>
                    </relocation>
                </relocations>
                <filters>
                    <filter>
                        <artifact>*:*</artifact>
                        <excludes>
                            <exclude>META-INF/maven/**</exclude>
                            <exclude>META-INF/services/com.fasterxml.**</exclude>
                            <exclude>META-INF/services/java.sql.Driver</exclude>
                            <exclude>awssdk_config_default.json</exclude>
                            <exclude>heatpipe.config.properties</exclude>
                            <exclude>log4j.properties</exclude>
                            <exclude>kafka/kafka-version.properties</exclude>
                            <exclude>darwin/x86_64/liblz4-java.dylib</exclude>
                            <exclude>linux/amd64/liblz4-java.so</exclude>
                            <exclude>linux/i386/liblz4-java.so</exclude>
                            <exclude>win32/amd64/liblz4-java.so</exclude>
                            <exclude>org/gjt/mm/mysql/**</exclude>
                        </excludes>
                    </filter>
                </filters>
            </configuration>
        </execution>
    </executions>
</plugin>

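The advantage of this approach is that it is relatively simple: it only takes configuration at packaging time. The drawback is that every class pulled in by the dependencies must be relocated and the corresponding resources excluded. Because Maven dependencies are transitive, this covers every class in the whole dependency tree. If the agent's relocation rules miss some classes, we are back to the situation described at the start: when the application depends on a different version, class cast exceptions and similar errors appear easily.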

Another approach can be seen in Elastic's Java agent. See AgentMain.java and ShadedClassLoader.java for the source; the idea is explained in detail in the PR "Isolated agent classloader by felixbarny · Pull Request #2109 · elastic/apm-agent-java · GitHub".

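In my opinion this is the more general and safer approach: load the agent's dependencies with a dedicated class loader whose parent is the Bootstrap ClassLoader, and change the default .class suffix of the agent's dependency classes so that the system class loader cannot load them. This achieves class isolation. Our internal Java agent currently uses a similar approach, and it resolved the class-loading problems we saw after integrating the agent into applications.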
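
A minimal sketch of this idea follows. It is not the Elastic, OpenTelemetry, or our internal implementation: the `.classdata` suffix, the temporary directory, and the names `IsolatedLoader` and `AgentWorker` are all made up for the example, and the demo copies class bytes from its own class path only to have something to load.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

class IsolatedAgentLoaderDemo {
    /** Hypothetical agent-internal class that should be loaded in isolation. */
    public static class AgentWorker {}

    /** Loads classes from "<dir>/<name>.classdata"; its parent is the bootstrap loader. */
    static class IsolatedLoader extends ClassLoader {
        private final Path dir;

        IsolatedLoader(Path dir) {
            super((ClassLoader) null); // null parent: only bootstrap classes are inherited
            this.dir = dir;
        }

        @Override
        protected Class<?> findClass(String name) throws ClassNotFoundException {
            Path file = dir.resolve(name.replace('.', '/') + ".classdata");
            if (!Files.isRegularFile(file)) {
                throw new ClassNotFoundException(name);
            }
            try {
                byte[] bytes = Files.readAllBytes(file);
                return defineClass(name, bytes, 0, bytes.length);
            } catch (IOException e) {
                throw new ClassNotFoundException(name, e);
            }
        }
    }

    /** Stores AgentWorker under the renamed suffix, then loads it through IsolatedLoader. */
    static Class<?> loadIsolated() {
        String name = AgentWorker.class.getName();
        String resource = name.replace('.', '/') + ".class";
        ClassLoader appLoader = IsolatedAgentLoaderDemo.class.getClassLoader();
        try {
            Path dir = Files.createTempDirectory("agent-classes");
            Path target = dir.resolve(name.replace('.', '/') + ".classdata");
            Files.createDirectories(target.getParent());
            try (InputStream in = appLoader.getResourceAsStream(resource)) {
                if (in == null) {
                    throw new IllegalStateException("class bytes not found: " + resource);
                }
                Files.copy(in, target);
            }
            return new IsolatedLoader(dir).loadClass(name);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Class<?> isolated = loadIsolated();
        System.out.println("same class object as the application's copy: "
                + (isolated == AgentWorker.class));
    }
}
```

In a real agent jar the renamed files would be packaged at build time, so no loader that looks for `.class` files would ever define the agent's dependencies.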

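A similar implementation can be found in OpenTelemetry. See OpenTelemetryAgent.java and AgentClassLoader.java for the source; the design is described in "opentelemetry-java-instrumentation/javaagent-jar-components.md at main · open-telemetry/opentelemetry-java-instrumentation · GitHub", which explains it thoroughly and includes diagrams, so I will not repeat it here.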

References

Load pre-loaded lookups via SPI, rather than hard-code in Interpolator. by tbwork · Pull Request #396 · apache/logging-log4j2 · GitHub
Tech talk: How To Write a JavaAgent (袁伟)
The definitive guide to Java agents by Rafael Winterhalter
How to Create a Java Agent and Why Would You Need One?