What's MessagePack? It's like JSON, but fast and small. The binary format can improve your application's performance and save traffic. It also lets you encode data types that JSON does not support: most importantly, binary data can be embedded directly, whereas in JSON it requires expensive base64 encoding.
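To make this concrete, here is a rough sketch using the @msgpack/msgpack package (discussed later in this article) to pack a hypothetical event that contains raw binary data; with JSON, the body field would have to be base64-encoded first. The event shape and URL are purely illustrative.
import { encode, decode } from "@msgpack/msgpack";

const event = {
  type: "network-request",
  url: "https://example.com/api/users", // illustrative URL
  body: new Uint8Array([0x89, 0x50, 0x4e, 0x47]), // raw bytes, no base64 needed
};

const packed: Uint8Array = encode(event);        // compact binary representation
const unpacked = decode(packed) as typeof event; // body comes back as a Uint8Array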
MessagePack is the foundation of AppSpector communication protocols. We pack everything we send from our mobile SDKs for iOS, Android, and Flutter in MessagePack: all logs, network requests, performance metrics, SQL queries, and so on. Encoding/decoding performance is therefore critical.
At the SDK level it's not that critical, because events are captured over time and individual events are never big enough to cause significant issues. But on the other side of the wire we have a Web dashboard that has to process all events at once.
This can be challenging because a user session may contain thousands of events. Imagine we have to download, decompress, decode from MessagePack, insert into Redux, and update the UI for 250k objects.
I needed to make every step work as fast as possible. So I started with MessagePack decoding performance.
Initially, I was using the msgpack-lite library for parsing. It was pretty old, but it was still the best option a few years ago when we first implemented it.
I made several minor optimizations to it, but that took a lot of work due to low code quality, so I began to look for other options. That's when I discovered the official msgpack-javascript library. It was written in TypeScript and had decent code quality (shoutout to FUJI Goro for creating it).
We managed to migrate to the new library in just a few days. The next step was to make it work FAST.
Still, msgpack-javascript needed to be faster. It parsed 68000 docs/sec, an impressive number by any standard, but when you need to parse 50 megabytes of data on the front end, you have to squeeze out every bit of performance you can theoretically get.
What does that 68000 docs/sec number mean? The MessagePack library has a benchmark that parses a small document 500,000 times and measures how many documents it parses per second. I will use this benchmark to test the optimizations described in this article.
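In rough terms, the benchmark does something like the following (a simplified sketch, not the library's actual benchmark code; the sample document is illustrative):
import { encode, decode } from "@msgpack/msgpack";

const doc = { id: 1, name: "Hello, world!", tags: ["a", "b"], payload: new Uint8Array(16) };
const encoded = encode(doc);

const iterations = 500_000;
const start = Date.now();
for (let i = 0; i < iterations; i++) {
  decode(encoded);
}
const seconds = (Date.now() - start) / 1000;
console.log(`${Math.round(iterations / seconds)} docs/sec`);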
I started with a high-level code review, trying to find any noticeable performance issues. In just 5 minutes I found one: during decoding, each array was allocated with zero size, and every decoded element was then pushed into it.
this.stack.push({
type: State.ARRAY,
size,
array: [],
});
...
state.array.push(object);
The obvious fix was to preallocate the array with the size decoded from msgpack. Many JavaScript developers forget what's happening under the hood: each call to the push method reallocates the whole array if its current capacity is not big enough to store the new element.
We can fix that. First, allocate the array with the needed size; then use a position variable to insert new elements.
this.stack.push({
type: State.ARRAY,
size,
array: new Array<unknown>(size),
position: 0,
});
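With the size known up front, the push call in the decoding loop turns into an indexed write, roughly like this:
state.array[state.position] = object;
state.position++;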
By introducing this simple fix, we achieved 72000-74000 docs/sec on the default benchmark. It's just a few-percent improvement for documents with small arrays, but for edge cases with large arrays it gives us more than a 2x improvement. Pull request.
Overall it was just a 5% improvement over the initial speed: not a big deal, but every fraction of a percent matters in the end.
For a typical payload, a large percentage of the values are strings. msgpack-javascript uses a combination of manual string decoding in pure JS and an optional WebAssembly version.
Let's take a look at the JS version. It looks complex, and for every string it allocates an array of character codes and performs arithmetic on every byte.
export function utf8Decode(bytes: Uint8Array, inputOffset: number, byteLength: number): string {
let offset = inputOffset;
const end = offset + byteLength;
const out: Array<number> = [];
while (offset < end) {
const byte1 = bytes[offset++];
if ((byte1 & 0x80) === 0) {
// 1 byte
out.push(byte1);
} else if ((byte1 & 0xe0) === 0xc0) {
// 2 bytes
const byte2 = bytes[offset++] & 0x3f;
out.push(((byte1 & 0x1f) << 6) | byte2);
} else if ((byte1 & 0xf0) === 0xe0) {
// 3 bytes
const byte2 = bytes[offset++] & 0x3f;
const byte3 = bytes[offset++] & 0x3f;
out.push(((byte1 & 0x1f) << 12) | (byte2 << 6) | byte3);
} else if ((byte1 & 0xf8) === 0xf0) {
// 4 bytes
const byte2 = bytes[offset++] & 0x3f;
const byte3 = bytes[offset++] & 0x3f;
const byte4 = bytes[offset++] & 0x3f;
let unit = ((byte1 & 0x07) << 0x12) | (byte2 << 0x0c) | (byte3 << 0x06) | byte4;
if (unit > 0xffff) {
unit -= 0x10000;
out.push(((unit >>> 10) & 0x3ff) | 0xd800);
unit = 0xdc00 | (unit & 0x3ff);
}
out.push(unit);
} else {
out.push(byte1);
}
}
return String.fromCharCode.apply(String, out as any);
}
Let's make it simpler and faster.
const textDecoder = new TextDecoder("utf-8");
const stringValue = textDecoder.decode(bytes);
The snippet above uses the TextDecoder API provided by most browsers. It was specifically designed to decode/encode strings and is implemented in the native part, not in JavaScript.
Let's run the benchmark and see .... 40000 docs/sec.
How is it possible that the native API is significantly slower than the JS version?
Because this API requires calls across the JS <-> native bridge, which adds overhead to every string decoding request. Every byte has to be transferred from the JS virtual machine to the native side, and the same is true for the decoded string result.
Still, we should not abandon TextDecoder: its performance relative to the pure-JS decoder depends on the string length.
## string length=10 byteLength=10
utf8Decode x 8,147,700 ops/sec ±3.23% (84 runs sampled)
utf8DecodeWasm x 1,073,699 ops/sec ±2.33% (88 runs sampled)
TextDecoder x 693,559 ops/sec ±3.68% (74 runs sampled)
## string length=100 byteLength=100
utf8Decode x 860,952 ops/sec ±3.01% (83 runs sampled)
utf8DecodeWasm x 323,801 ops/sec ±8.54% (67 runs sampled)
TextDecoder x 484,350 ops/sec ±6.20% (69 runs sampled)
## string length=200 byteLength=200
utf8Decode x 458,241 ops/sec ±3.88% (88 runs sampled)
utf8DecodeWasm x 326,323 ops/sec ±5.80% (79 runs sampled)
TextDecoder x 454,980 ops/sec ±3.84% (74 runs sampled)
## string length=300 byteLength=300
utf8Decode x 298,996 ops/sec ±2.66% (83 runs sampled)
utf8DecodeWasm x 215,869 ops/sec ±9.42% (74 runs sampled)
TextDecoder x 403,562 ops/sec ±4.16% (75 runs sampled)
TextDecoder is far slower for small strings, but it catches up at around 200 bytes and pulls ahead for longer strings.
Let's add logic to the parsing flow that uses TextDecoder only for strings longer than 200 bytes.
const MIN_TEXT_DECODER_STRING_LENGTH = 200;
const defaultEncoding = "utf-8";
const sharedTextDecoder = typeof TextDecoder !== "undefined" ? new TextDecoder(defaultEncoding) : null;
export function utf8Decode(bytes: Uint8Array, inputOffset: number, byteLength: number): string {
let offset = inputOffset;
const end = offset + byteLength;
if (sharedTextDecoder !== null && byteLength > MIN_TEXT_DECODER_STRING_LENGTH) {
const stringBytes = bytes.subarray(offset, end);
return sharedTextDecoder.decode(stringBytes);
}
...rest of pure JS decoding logic
Let's run the benchmark test and see what happens... 112000 docs/sec, a 1.64x improvement over the initial speed.
And just so you realize what's happening: at this very moment we are faster than any other msgpack implementation for JavaScript, and we are even faster than native JSON.parse().
Benchmark on NodeJS/v12.3.1
operation | op | ms | op/s
----------------------------------------------------------------- | ------: | ----: | ------:
buf = Buffer.from(JSON.stringify(obj)); | 557200 | 5000 | 111440
buf = JSON.stringify(obj); | 1078100 | 5000 | 215620
obj = JSON.parse(buf); | 394300 | 5001 | 78844
buf = require("msgpack-lite").encode(obj); | 416400 | 5000 | 83280
obj = require("msgpack-lite").decode(buf); | 313600 | 5000 | 62720
buf = require("@msgpack/msgpack").encode(obj); | 646100 | 5000 | 129220
obj = require("@msgpack/msgpack").decode(buf); | 561800 | 5000 | 112360
✨ Done in 36.69s.
Can we push it even more?
For a moment, I thought there was nothing more I could do to improve the performance. But as in life - there is always one more thing.
As I already mentioned, strings are a big part of the typical payload. They are used for keys and values everywhere.
We have already optimized string decoding, but it still takes most of the time if we look at the profiler. Is there anything we can do to speed up decoding except to skip it? Can we just not decode strings at all?
I analyzed one of the AppSpector sessions to see how many strings it contained. It had 250k strings, and 130k of them were keys in maps. Most of these keys were the same:
I counted only 104 unique values among those 130k string instances, and around 20k of them were instances of the string "payload".
That didn't look right, and I needed to find a way to skip that redundant work somehow.
My first idea was to use a map with bytes as keys and strings as values: instead of decoding a string each time, we would just look it up in this cache. But a Uint8Array can't be used as a map key, and converting it to a key string would make the whole optimization pointless.
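Here is a quick illustration of why a plain Map does not help: typed arrays are compared by reference, so a freshly sliced key never matches a previously stored one, no matter how equal the contents are.
const cache = new Map<Uint8Array, string>();
cache.set(Uint8Array.of(112, 97, 121), "pay");
// A different Uint8Array instance with identical contents is a different key:
cache.get(Uint8Array.of(112, 97, 121)); // undefined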
Let's define the decoder logic. The decode method receives the msgpack bytes array, the offset of the string inside that array, and the string byte length decoded from the msgpack string header. It returns the decoded string.
class CachedKeyDecoder {
public decode(bytes: Uint8Array, inputOffset: number, byteLength: number): string {
// Find cached value
let value = this.get(bytes, inputOffset, byteLength);
// if it's not found then decode it using JS decoder and store in cache
if (!value) {
value = utf8DecodeJs(bytes, inputOffset, byteLength);
// Make copy of string bytes from main msgpack bytes array
const stringsBytes = bytes.slice(inputOffset, inputOffset + byteLength);
this.cache(stringsBytes, value);
}
return value;
}
}
Let's define what we are going to store in the cache. We need a decoded key string and bytes that represent it.
interface KeyCacheRecord {
readonly bytes: Uint8Array;
readonly key: string;
}
Let's implement the find-in-cache logic. It's pretty trivial: scan every byte of every cached record, and if all bytes match, return that record's key string.
class CachedKeyDecoder {
  private cachedValues = Array<KeyCacheRecord>();

  private get(bytes: Uint8Array, inputOffset: number, byteLength: number): string | null {
    for (let i = 0; i < this.cachedValues.length; i++) {
      const cacheRecord = this.cachedValues[i];
      // Skip if the requested byte length differs from this cache record's byte length
      if (byteLength !== cacheRecord.bytes.length) {
        continue;
      }
      // Compare every byte of the cache record with the bytes from the input array
      let found = true;
      for (let j = 0; j < byteLength; j++) {
        if (cacheRecord.bytes[j] !== bytes[inputOffset + j]) {
          found = false;
          break;
        }
      }
      if (found) {
        return cacheRecord.key;
      }
    }
    return null;
  }
}
This version works, but it needs to be optimized: it iterates over all cache records, even those with a different size. To fix that, we store records in an array of arrays indexed by key byte length, preallocated to the maximum cached key length + 1.
Now we can get all cache records with a byte size of 10 just by accessing cachedValues[10].
class CachedKeyDecoder {
private cachedValues = Array<Array<KeyCacheRecord>>();
constructor(private maxKeyLength: number = 32) {
this.cachedValues = new Array<Array<KeyCacheRecord>>(this.maxKeyLength + 1);
}
public get(bytes: Uint8Array, inputOffset: number, byteLength: number): string | null {
const chunks = this.cachedValues[byteLength];
if (chunks) {
return this.findCachedKey(bytes, inputOffset, byteLength, chunks);
} else {
return null;
}
}
}
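The store side, called as this.cache(...) in the decode method above, is not shown in these snippets. A minimal sketch under the same data layout could look like this; the maxKeyLength guard is my assumption, and the real implementation may differ:
private cache(bytes: Uint8Array, value: string): void {
  // Assumption: keys longer than maxKeyLength are simply not cached
  if (bytes.length > this.maxKeyLength) {
    return;
  }
  const record: KeyCacheRecord = { bytes, key: value };
  const chunks = this.cachedValues[bytes.length];
  if (chunks) {
    chunks.push(record);
  } else {
    this.cachedValues[bytes.length] = [record];
  }
}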
Now we need to optimize the findCachedKey function. First, we replace the found flag with a loop label. The code becomes more straightforward and faster:
private findCachedKey(
bytes: Uint8Array,
inputOffset: number,
byteLength: number,
chunks: Array<KeyCacheRecord>,
): string | null {
const chunksLength = chunks.length;
FIND_CHUNK: for (let i = 0; i < chunksLength; i++) {
const chunk = chunks[i];
for (let j = 0; j < byteLength; j++) {
if (chunk.bytes[j] !== bytes[inputOffset + j]) {
continue FIND_CHUNK;
}
}
return chunk.key;
}
return null;
}
Next, instead of iterating byte by byte from the beginning only, we'll iterate from the start and the end simultaneously. This lets us reject a cache record faster. For example, take two records with keys "payload" and "payment": iterating from the beginning, we have to check bytes 1 through 4 to discover that "payload" is not equal to "payment", while comparing from the end finds the mismatch on the very first check.
private findCachedKey(
bytes: Uint8Array,
inputOffset: number,
byteLength: number,
chunks: Array<KeyCacheRecord>,
): string | null {
const chunksLength = chunks.length;
const halfLength = byteLength / 2;
const endPosition = inputOffset + byteLength;
FIND_CHUNK: for (let i = 0; i < chunksLength; i++) {
const chunk = chunks[i];
for (let j = 0; j < halfLength; j++) {
if (chunk.bytes[j] !== bytes[inputOffset + j]) {
continue FIND_CHUNK;
}
if (chunk.bytes[byteLength - j - 1] !== bytes[endPosition - j - 1]) {
continue FIND_CHUNK;
}
}
return chunk.key;
}
return null;
}
Now it's time to apply some statistics. Usually some map keys are used much more often than others. For example, we have 20k "payload" strings and just a few "payment" strings, yet if "payment" happens to be cached before "payload", it will always be checked first.
Let's optimize for that. First, we need to add a hits counter to KeyCacheRecord:
interface KeyCacheRecord {
readonly bytes: Uint8Array;
readonly key: string;
hits: number;
}
We will increment this value each time a key is found inside the cache.
private findCachedKey(
bytes: Uint8Array,
inputOffset: number,
byteLength: number,
chunks: Array<KeyCacheRecord>,
): string | null {
const chunksLength = chunks.length;
const halfLength = byteLength / 2;
const endPosition = inputOffset + byteLength;
FIND_CHUNK: for (let i = 0; i < chunksLength; i++) {
const chunk = chunks[i];
for (let j = 0; j < halfLength; j++) {
if (chunk.bytes[j] !== bytes[inputOffset + j]) {
continue FIND_CHUNK;
}
if (chunk.bytes[byteLength - j - 1] !== bytes[endPosition - j - 1]) {
continue FIND_CHUNK;
}
}
chunk.hits++;
return chunk.key;
}
return null;
}
Now we have the statistics about key usage. Let's apply it and order keys by the number of hits so that the most used key will always be the first.
private findCachedKey(
bytes: Uint8Array,
inputOffset: number,
byteLength: number,
chunks: Array<KeyCacheRecord>,
): string | null {
let prevHits = 0;
const chunksLength = chunks.length;
const halfLength = byteLength / 2;
const endPosition = inputOffset + byteLength;
FIND_CHUNK: for (let i = 0; i < chunksLength; i++) {
const chunk = chunks[i];
if (i > 0 && prevHits < chunk.hits) {
// Sort chunks by number of hits
// in order to improve search speed for most used keys
const prevChunk = chunks[i - 1];
chunks[i] = prevChunk;
chunks[i - 1] = chunk;
prevHits = prevChunk.hits;
} else {
prevHits = chunk.hits;
}
for (let j = 0; j < halfLength; j++) {
if (chunk.bytes[j] !== bytes[inputOffset + j]) {
continue FIND_CHUNK;
}
if (chunk.bytes[byteLength - j - 1] !== bytes[endPosition - j - 1]) {
continue FIND_CHUNK;
}
}
chunk.hits++;
return chunk.key;
}
return null;
}
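Here is a quick usage sketch of the finished class; in the decoder itself it would be called from the map-key decoding path, while ordinary string values keep using the TextDecoder / pure-JS path:
const keyDecoder = new CachedKeyDecoder();
const payload = new TextEncoder().encode("payload");

// The first call decodes and caches; subsequent calls are served from the cache
const first = keyDecoder.decode(payload, 0, payload.length);  // "payload"
const second = keyDecoder.decode(payload, 0, payload.length); // cache hit, no UTF-8 decoding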
You can find the final version in this pull request.
We have built some fairly complex logic, so let's run the benchmark test: 180000 docs/sec, a 2.64x improvement over the initial speed!
The full PR is not merged yet: we are finalizing a public API that will let users control the cache size. It should be out shortly.