I have a batch process, written in Java, that analyzes extremely long sequences of tokens (maybe billions or even trillions of them!) and observes bi-gram patterns (i.e., word pairs).
In this code, bi-grams are represented as Pairs of Strings, using the ImmutablePair class from Apache Commons. I won't know the cardinality of the tokens in advance: they might be very repetitive, or each token might be totally unique.
The more data I can fit into memory, the better the analysis will be!
But I definitely can't process the whole job at once. So I need to load as much data as possible into a buffer, perform a partial analysis, flush my partial results to a file (or to an API, or whatever), then clear my caches and start over.
One way I'm optimizing memory usage is by using Guava interners to de-duplicate my String instances.
Right now, my code looks essentially like this:
int BUFFER_SIZE = 100_000_000;
Map<Pair<String, String>, LongAdder> bigramCounts = new HashMap<>(BUFFER_SIZE);
Interner<String> interner = Interners.newStrongInterner();
String prevToken = null;
Iterator<String> tokens = getTokensFromSomewhere();
while (tokens.hasNext()) {
    String token = interner.intern(tokens.next());
    if (prevToken != null) {
        Pair<String, String> bigram = new ImmutablePair<>(prevToken, token);
        LongAdder bigramCount = bigramCounts.computeIfAbsent(
                bigram,
                (c) -> new LongAdder()
        );
        bigramCount.increment();

        // If our buffer is full, we need to flush!
        boolean tooMuchMemoryPressure = bigramCounts.size() > BUFFER_SIZE;
        if (tooMuchMemoryPressure) {
            // Analyze the data, and write the partial results somewhere
            doSomeFancyAnalysis(bigramCounts);
            // Clear the buffer and start over
            bigramCounts.clear();
        }
    }
    prevToken = token;
}
The trouble with this code is that the size comparison is a very crude way of determining whether there is tooMuchMemoryPressure.
I want to run this job on many different kinds of hardware, with varying amounts of memory. No matter the instance, I want this code to adjust automatically and consume as much memory as it safely can.
Rather than using some hard-coded constant like BUFFER_SIZE (derived through experimentation, heuristics, and guesswork), I actually just want to ask the JVM whether the memory is almost full. But that's a very complicated question, considering the complexities of mark/sweep algorithms and all the different generational collectors.
What would be a good general-purpose approach for accomplishing something like this, assuming this batch-job might run on a variety of different machines, with different amounts of available memory? I don't need this to be extremely precise... I'm just looking for a rough signal to know that I need to flush the buffer soon, based on the state of the actual heap.
The simplest way to get a first glimpse of what is going on with the process's heap space is Runtime.freeMemory() together with Runtime.maxMemory() and Runtime.totalMemory(). Yet the first does not factor in garbage, so it is an under-estimation at best and may be completely misleading just before the GC kicks in.
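For completeness, a minimal sketch of that naive estimate (the method name estimateReallyFreeBytes is just illustrative):

// Optimistic upper bound on memory still available to the JVM.
// freeMemory() only covers the currently committed heap, so add the
// headroom the heap may still grow into (maxMemory() - totalMemory()).
static long estimateReallyFreeBytes() {
    Runtime rt = Runtime.getRuntime();
    long used = rt.totalMemory() - rt.freeMemory(); // still counts dead objects!
    return rt.maxMemory() - used;
}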
Assuming that for your application "memory pressure" basically means "(soon) not enough", the interesting value is free memory right after a GC.
This is available via GarbageCollectorMXBean, which provides GcInfo with the memory usage after each GC. The bean can be watched right after a GC because it is also a NotificationEmitter, although this is not advertised in the Javadoc. Some minimal code, patterned after a longer example:
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;
import javax.management.Notification;
import javax.management.NotificationEmitter;
import javax.management.openmbean.CompositeData;
import com.sun.management.GarbageCollectionNotificationInfo;
import com.sun.management.GcInfo;

void registerCallback() {
    List<GarbageCollectorMXBean> gcbeans = ManagementFactory.getGarbageCollectorMXBeans();
    for (GarbageCollectorMXBean gcbean : gcbeans) {
        System.out.println(gcbean.getName());
        // Every GarbageCollectorMXBean is also a NotificationEmitter
        NotificationEmitter emitter = (NotificationEmitter) gcbean;
        emitter.addNotificationListener(this::handle, null, null);
    }
}

private void handle(Notification notification, Object handback) {
    // Only react to GC notifications; the beans may emit other types as well
    if (!notification.getType()
            .equals(GarbageCollectionNotificationInfo.GARBAGE_COLLECTION_NOTIFICATION)) {
        return;
    }
    GarbageCollectionNotificationInfo info = GarbageCollectionNotificationInfo
            .from((CompositeData) notification.getUserData());
    GcInfo gcInfo = info.getGcInfo();
    // One entry per memory pool, e.g. "G1 Old Gen" -> MemoryUsage
    gcInfo.getMemoryUsageAfterGc().forEach((name, memUsage) -> {
        System.err.println(name + " -> " + memUsage);
    });
}
There will be several memUsage entries, and they will differ depending on the GC. But from the values provided (used, committed, and max) we can derive upper limits on free memory, which again should give the "rough signal" the OP is asking for.
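As a hedged sketch of how such a derivation could look inside the handler above (the pool-name matching, the volatile memoryPressure field, and the 20% threshold are assumptions to be tuned, not prescribed by the API):

// Inside handle(...): derive a pressure flag from post-GC usage of the
// old/tenured pool. Pool names vary by collector ("PS Old Gen", "G1 Old Gen", ...).
gcInfo.getMemoryUsageAfterGc().forEach((name, memUsage) -> {
    if (name.contains("Old") || name.contains("Tenured")) {
        // getMax() may be -1 (undefined); fall back to the committed size
        long max = memUsage.getMax() >= 0 ? memUsage.getMax() : memUsage.getCommitted();
        long free = max - memUsage.getUsed();
        // e.g. signal pressure when less than 20% of the pool is free after GC
        memoryPressure = free < max / 5; // memoryPressure: a volatile boolean field
    }
});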
The doSomeFancyAnalysis will certainly also need its share of fresh memory, so with a very rough estimate of how much that will be per bigram to analyze, this could be the limit to watch for.
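Putting it together, the size comparison in the original loop could then be replaced by that GC-derived flag, with the threshold chosen conservatively enough to leave headroom for the analysis. A rough sketch, reusing the hypothetical memoryPressure field from above:

// In the token loop: flush when the GC handler has signalled pressure,
// instead of comparing bigramCounts.size() against a hard-coded constant.
if (memoryPressure) {
    doSomeFancyAnalysis(bigramCounts);
    bigramCounts.clear();
    interner = Interners.newStrongInterner(); // drop the interned strings as well
    memoryPressure = false;                   // re-arm for the next post-GC signal
}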