
How to improve MessagePack JavaScript decoder speed by 2.6 times.

What is MessagePack, and why should anyone care about its speed at all? It's like JSON, but fast and small. With this format, you can improve your application's performance and save traffic. You can also encode additional data types with it, such as binary data, which you can't do in JSON without resorting to expensive and inefficient base64 encoding.

MessagePack is the foundation of AppSpector communication protocols. Everything we send from our mobile SDK for iOS, Android and Flutter is packed using MessagePack. All logs, network requests, performance metrics, SQL queries - everything is encoded using this format and then sent to our server and our web dashboard for you to see. Encoding/decoding performance is critical for every component of our system.

At the SDK level it's not that critical because events are captured over time, and individual events are never big enough to cause significant performance issues. But on the other side of the wire, we have a web dashboard that has to process all events at once.

This can become a challenging task because an individual user session can sometimes contain hundreds of thousands of events. Just imagine: we have to download, decompress, and decode from MessagePack, then insert into Redux and update the UI, for 250k objects.

This is exactly why I needed to make every step work as fast as possible. So I started with MessagePack decoding performance.

Before

Initially, I was using the msgpack-lite library for parsing. It was pretty old, but still, it was the best option a few years ago when we first implemented it.

I made several small optimizations for it, but due to low code quality, it was hard to maintain and extend. I began to look for other options, and that's when I discovered the official msgpack-javascript library. It was written in TypeScript and had decent code quality. Huge thanks and shoutout to FUJI Goro for creating it!

In just a few days we managed to migrate to the new library. The next step was to make it work FAST.

Don't get me wrong, msgpack-javascript was far from being slow. It was actually able to parse 68,000 docs/sec. That's a very good number by any standard! But when you need to parse 50 megabytes of data on the frontend, you need to make sure you have all the performance that you can theoretically get.

What does that 68,000 docs/sec number mean? Well, the MessagePack library has a benchmark that parses a small document 500,000 times and measures how many copies it parses per second. I'm going to use this benchmark to test the optimizations described in this article.


Optimization 1 – Simple one

Initially, I started with a high-level code review, trying to find any noticeable performance issues. In just 5 minutes, I found one.

During decoding, each array decoded from msgpack was allocated with zero size, and each decoded element was pushed to the array:

this.stack.push({
  type: State.ARRAY,
  size,
  array: [],
});

...

state.array.push(object);

The obvious fix was to preallocate the array with the size decoded from msgpack. Many JavaScript developers forget what's happening under the hood 😕. Each call to the push method will reallocate the whole array if its current capacity is not big enough to store a new element. We can avoid this by allocating the array with the needed size and using a position variable to insert new elements at the appropriate positions.

this.stack.push({
  type: State.ARRAY,
  size,
  array: new Array<unknown>(size),
  position: 0,
});

...

state.array[state.position] = object;
state.position++;
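To see the contrast in isolation, here is a minimal, self-contained sketch (not the library's code) of the two approaches. Both produce identical arrays, but the preallocated version avoids repeated capacity growth inside the engine; timings vary by engine and run, so the sketch checks correctness rather than speed.

```javascript
// Both functions build [0, 1, ..., n-1]; the second avoids push()'s
// incremental capacity growth by allocating the full array up front.
function fillWithPush(n) {
  const out = [];
  for (let i = 0; i < n; i++) out.push(i);
  return out;
}

function fillPreallocated(n) {
  const out = new Array(n);
  for (let i = 0; i < n; i++) out[i] = i;
  return out;
}

const a = fillWithPush(1000);
const b = fillPreallocated(1000);
console.log(a.length === b.length && a.every((v, i) => v === b[i])); // true
```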

By introducing this simple fix, we were able to achieve a decoding speed of 72,000-74,000 docs/sec on the default benchmark. It's just a few-percent improvement for documents with small arrays, but for the edge-case scenario with a large array, it gives us more than a 2x improvement.

Pull request: https://github.com/msgpack/msgpack-javascript/pull/32

This is just a 5% improvement over the initial speed: not a big deal, but every fraction of a percent matters in the end.


Optimization 2 – UTF-8 decoding is expensive

For a typical payload, a large percentage of the values are strings. msgpack-javascript uses a combination of manual string decoding in pure JS and an optional WebAssembly version.

Let's take a look at the JS version. It looks pretty complex: for every string, it allocates an array for Unicode symbols and performs a bunch of mathematical operations.

export function utf8Decode(bytes: Uint8Array, inputOffset: number, byteLength: number): string {
  let offset = inputOffset;
  const end = offset + byteLength;

  const out: Array<number> = [];
  while (offset < end) {
    const byte1 = bytes[offset++];
    if ((byte1 & 0x80) === 0) {
      // 1 byte
      out.push(byte1);
    } else if ((byte1 & 0xe0) === 0xc0) {
      // 2 bytes
      const byte2 = bytes[offset++] & 0x3f;
      out.push(((byte1 & 0x1f) << 6) | byte2);
    } else if ((byte1 & 0xf0) === 0xe0) {
      // 3 bytes
      const byte2 = bytes[offset++] & 0x3f;
      const byte3 = bytes[offset++] & 0x3f;
      out.push(((byte1 & 0x1f) << 12) | (byte2 << 6) | byte3);
    } else if ((byte1 & 0xf8) === 0xf0) {
      // 4 bytes
      const byte2 = bytes[offset++] & 0x3f;
      const byte3 = bytes[offset++] & 0x3f;
      const byte4 = bytes[offset++] & 0x3f;
      let unit = ((byte1 & 0x07) << 0x12) | (byte2 << 0x0c) | (byte3 << 0x06) | byte4;
      if (unit > 0xffff) {
        unit -= 0x10000;
        out.push(((unit >>> 10) & 0x3ff) | 0xd800);
        unit = 0xdc00 | (unit & 0x3ff);
      }
      out.push(unit);
    } else {
      out.push(byte1);
    }
  }

  return String.fromCharCode.apply(String, out as any);
}

Can we make it simpler and possibly faster? Absolutely!

const textDecoder = new TextDecoder("utf-8");
const stringValue = textDecoder.decode(bytes);

This is the TextDecoder API provided by most browsers. It was specifically designed to decode/encode strings, and it's implemented in native code, not in JavaScript.

Let's run the benchmark and see... 40,000 docs/sec

WTF?! How is it possible that a native API is significantly slower than the JS version?

The answer is that this API requires calls across the JS <-> native bridge, which adds overhead to every string decoding request. Every byte has to be transferred from the JS virtual machine to the native side, and the same is true for the decoded string result.

Should we abandon the idea of using TextDecoder? Probably not yet. The ratio between transfer overhead and decoding time should differ depending on string length. Let's check how it behaves with different string lengths.

## string length=10 byteLength=10

utf8Decode x 8,147,700 ops/sec ±3.23% (84 runs sampled)
utf8DecodeWasm x 1,073,699 ops/sec ±2.33% (88 runs sampled)
TextDecoder x 693,559 ops/sec ±3.68% (74 runs sampled)

## string length=100 byteLength=100

utf8Decode x 860,952 ops/sec ±3.01% (83 runs sampled)
utf8DecodeWasm x 323,801 ops/sec ±8.54% (67 runs sampled)
TextDecoder x 484,350 ops/sec ±6.20% (69 runs sampled)

## string length=200 byteLength=200

utf8Decode x 458,241 ops/sec ±3.88% (88 runs sampled)
utf8DecodeWasm x 326,323 ops/sec ±5.80% (79 runs sampled)
TextDecoder x 454,980 ops/sec ±3.84% (74 runs sampled)

## string length=300 byteLength=300

utf8Decode x 298,996 ops/sec ±2.66% (83 runs sampled)
utf8DecodeWasm x 215,869 ops/sec ±9.42% (74 runs sampled)
TextDecoder x 403,562 ops/sec ±4.16% (75 runs sampled)

As we can see, TextDecoder is incredibly slow for small strings, but it becomes much faster for strings with sizes > 200 bytes.

Let's add logic to the parsing flow that will use TextDecoder only for strings with lengths > 200 bytes.

const MIN_TEXT_DECODER_STRING_LENGTH = 200;
const defaultEncoding = "utf-8";
const sharedTextDecoder = typeof TextDecoder !== "undefined" ? new TextDecoder(defaultEncoding) : null;

export function utf8Decode(bytes: Uint8Array, inputOffset: number, byteLength: number): string {
  let offset = inputOffset;
  const end = offset + byteLength;

  if (sharedTextDecoder !== null && byteLength > MIN_TEXT_DECODER_STRING_LENGTH) {
    const stringBytes = bytes.subarray(offset, end);
    return sharedTextDecoder.decode(stringBytes);
  }
  ...rest of pure JS decoding logic
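As a self-contained illustration of this hybrid strategy, here is a simplified plain-JS sketch. The short-string path below handles only ASCII for brevity; the real library keeps the full multi-byte UTF-8 loop shown earlier.

```javascript
// Hybrid decode: long strings go through the native TextDecoder, short
// strings through a simple JS loop. The ASCII-only short path is a
// simplification of the library's full multi-byte UTF-8 decoder.
const THRESHOLD = 200; // bytes; 200 matched the crossover in our benchmarks
const sharedDecoder = typeof TextDecoder !== "undefined" ? new TextDecoder("utf-8") : null;

function hybridUtf8Decode(bytes, offset, byteLength) {
  if (sharedDecoder !== null && byteLength > THRESHOLD) {
    return sharedDecoder.decode(bytes.subarray(offset, offset + byteLength));
  }
  let out = "";
  for (let i = 0; i < byteLength; i++) {
    out += String.fromCharCode(bytes[offset + i]);
  }
  return out;
}

const input = new TextEncoder().encode("hello, msgpack");
console.log(hybridUtf8Decode(input, 0, input.length)); // hello, msgpack
```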

Let's run the benchmark test and see what happens... 112,000 docs/sec

This is a 1.64x improvement over the initial speed. Not bad.

And just so you realize what's happening: at this very moment we are faster than any other msgpack implementation for JavaScript, and we are even faster than native JSON.parse() 🤯

Benchmark on NodeJS/v12.3.1

operation                                                         |   op   |   ms  |  op/s 
----------------------------------------------------------------- | ------: | ----: | ------:
buf = Buffer.from(JSON.stringify(obj));                           |  557200 |  5000 |  111440
buf = JSON.stringify(obj);                                        | 1078100 |  5000 |  215620
obj = JSON.parse(buf);                                            |  394300 |  5001 |   78844
buf = require("msgpack-lite").encode(obj);                        |  416400 |  5000 |   83280
obj = require("msgpack-lite").decode(buf);                        |  313600 |  5000 |   62720
buf = require("@msgpack/msgpack").encode(obj);                    |  646100 |  5000 |  129220
obj = require("@msgpack/msgpack").decode(buf);                    |  561800 |  5000 |  112360
✨  Done in 36.69s.

Can we push it even more?


Optimization 3 – Skip!

For a moment, I thought that I was done and there was nothing more I could do to improve the performance. But as in life - there is always one more thing.

Like I mentioned before, strings are a big part of the typical payload. They are used for keys and values everywhere. We have already optimized string decoding, but it still takes most of the time if we look at the profiler. There is nothing we can do to speed up decoding itself, except maybe trying to skip it? Can we just not decode strings at all?

I analyzed one of the AppSpector sessions to see how many strings it contained. In total it had 250k strings, and 130k of them were keys in maps. Most of these keys were the same: I counted only 104 unique values among 130k string instances. We had around 20k instances of the string "payload".

It didn't look right. I needed to find a way to skip that work somehow.

First, I thought about using a map with bytes as the key and the decoded string as the value. Instead of decoding each string, we would just look in this cache and get the decoded string from it. But a Uint8Array can't be used as a map key, and the process of converting it to a key string would make the whole optimization useless.
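A quick Node.js sketch shows why the naive map idea fails:

```javascript
// Map compares object keys by reference, not by content, so two
// identical byte sequences never hit the same cache entry.
const cache = new Map();
cache.set(new Uint8Array([1, 2, 3]), "abc");
console.log(cache.get(new Uint8Array([1, 2, 3]))); // undefined: different object identity
```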

Step 1:

Let's define the decoder logic. The decode method should receive the msgpack bytes array, the offset of the string position inside this array, and the string byte length decoded from the msgpack string header. It should return a decoded string.

class CachedKeyDecoder {
  public decode(bytes: Uint8Array, inputOffset: number, byteLength: number): string {
    // Find cached value
    let value = this.get(bytes, inputOffset, byteLength);

    // If it's not found, decode it using the JS decoder and store it in the cache
    if (!value) {
      value = utf8DecodeJs(bytes, inputOffset, byteLength);
      // Make a copy of the string bytes from the main msgpack bytes array
      const stringsBytes = bytes.slice(inputOffset, inputOffset + byteLength);
      this.cache(stringsBytes, value);
    }

    return value;
  }
}

Step 2:

Let's define what we are going to store in the cache. We need the decoded key string and the bytes that represent it.

interface KeyCacheRecord {
  readonly bytes: Uint8Array;
  readonly key: string;
}

Step 3:

Let's implement the find-in-cache logic. It's pretty trivial: it scans every byte of every cached record, and if all bytes match, it returns the key string.

class CachedKeyDecoder {
  private cachedValues = Array<KeyCacheRecord>();

  private get(bytes: Uint8Array, inputOffset: number, byteLength: number): string | null {
    for (let i = 0; i < this.cachedValues.length; i++) {
      const cacheRecord = this.cachedValues[i];
      // Skip if the request byte length is not equal to this cache record's byte length
      if (byteLength !== cacheRecord.bytes.length) {
        continue;
      }
      // Compare every byte of the cache record with the bytes from the input array
      let found = true;
      for (let j = 0; j < byteLength; j++) {
        if (cacheRecord.bytes[j] !== bytes[inputOffset + j]) {
          found = false;
          break;
        }
      }

      if (found) {
        return cacheRecord.key;
      }
    }

    return null;
  }
}

Step 4:

This version works, but it's far from optimal. First of all, it iterates over all cache records, even those with different sizes. To fix this, we use an array of arrays, preallocated to the maximum cached key length + 1.

Now we can get all cache records with a byte size of 10 by accessing cachedValues[10]:

class CachedKeyDecoder {
  private cachedValues = Array<Array<KeyCacheRecord>>();

  constructor(private maxKeyLength: number = 32) {
    this.cachedValues = new Array<Array<KeyCacheRecord>>(this.maxKeyLength + 1);
  }

  public get(bytes: Uint8Array, inputOffset: number, byteLength: number): string | null {
    const chunks = this.cachedValues[byteLength];

    if (chunks) {
      return this.findCachedKey(bytes, inputOffset, byteLength, chunks);
    } else {
      return null;
    }
  }
}

Step 5:

Now we need to optimize the findCachedKey function. First, we will replace the found flag with a loop label. The code is simpler and faster:

private findCachedKey(
  bytes: Uint8Array,
  inputOffset: number,
  byteLength: number,
  chunks: Array<KeyCacheRecord>,
): string | null {
    const chunksLength = chunks.length;
    FIND_CHUNK: for (let i = 0; i < chunksLength; i++) {
      const chunk = chunks[i];

      for (let j = 0; j < byteLength; j++) {
        if (chunk.bytes[j] !== bytes[inputOffset + j]) {
          continue FIND_CHUNK;
        }
      }

      return chunk.key;
    }

    return null;
}

Next, instead of iterating byte by byte from the beginning, we'll iterate from the start and the end at the same time. This lets us reject a cache record faster. For example, suppose we have two records with keys "payload" and "payment". Iterating from the beginning, we have to check the first four bytes to discover that the "payload" bytes are not equal to the "payment" bytes, while comparing from both ends rejects the pair after just two checks, because the last characters ("d" and "t") already differ.

private findCachedKey(
  bytes: Uint8Array,
  inputOffset: number,
  byteLength: number,
  chunks: Array<KeyCacheRecord>,
): string | null {
    const chunksLength = chunks.length;
    const halfLength = byteLength / 2;
    const endPosition = inputOffset + byteLength;
    FIND_CHUNK: for (let i = 0; i < chunksLength; i++) {
      const chunk = chunks[i];

      for (let j = 0; j < halfLength; j++) {
        if (chunk.bytes[j] !== bytes[inputOffset + j]) {
          continue FIND_CHUNK;
        }

        if (chunk.bytes[byteLength - j - 1] !== bytes[endPosition - j - 1]) {
          continue FIND_CHUNK;
        }
      }

      return chunk.key;
    }

    return null;
}
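In isolation, the two-ended comparison can be exercised as a standalone helper (a hypothetical bytesEqualTwoEnded, not part of the library) on the "payload"/"payment" example:

```javascript
// Compare two byte arrays of equal length from both ends at once.
// A mismatch near either end is detected in very few iterations.
function bytesEqualTwoEnded(a, b) {
  if (a.length !== b.length) return false;
  const n = a.length;
  const half = n / 2;
  for (let j = 0; j < half; j++) {
    if (a[j] !== b[j]) return false;               // check from the start
    if (a[n - j - 1] !== b[n - j - 1]) return false; // check from the end
  }
  return true;
}

const enc = new TextEncoder();
console.log(bytesEqualTwoEnded(enc.encode("payload"), enc.encode("payment"))); // false
console.log(bytesEqualTwoEnded(enc.encode("payload"), enc.encode("payload"))); // true
```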

Step 6:

Now it's time to apply some statistics. Usually, some map keys are used more than others. For example, we have 20k "payload" strings and just a few "payment" strings. However, if "payment" is cached before "payload", it will always be checked first.

Let's optimize for that. First, we need to add a hits counter to KeyCacheRecord:

interface KeyCacheRecord {
  readonly bytes: Uint8Array;
  readonly key: string;
  hits: number;
}

We will increment this value each time when a key is found inside the cache.

private findCachedKey(
  bytes: Uint8Array,
  inputOffset: number,
  byteLength: number,
  chunks: Array<KeyCacheRecord>,
): string | null {
    const chunksLength = chunks.length;
    const halfLength = byteLength / 2;
    const endPosition = inputOffset + byteLength;
    FIND_CHUNK: for (let i = 0; i < chunksLength; i++) {
      const chunk = chunks[i];

      for (let j = 0; j < halfLength; j++) {
        if (chunk.bytes[j] !== bytes[inputOffset + j]) {
          continue FIND_CHUNK;
        }

        if (chunk.bytes[byteLength - j - 1] !== bytes[endPosition - j - 1]) {
          continue FIND_CHUNK;
        }
      }

      chunk.hits++;

      return chunk.key;
    }

    return null;
}

Now we have statistics about key usage. Let's apply them and order keys by number of hits, so that the most-used key is always checked first.

private findCachedKey(
      bytes: Uint8Array,
      inputOffset: number,
      byteLength: number,
      chunks: Array<KeyCacheRecord>,
  ): string | null {
    let prevHits = 0;
    const chunksLength = chunks.length;
    const halfLength = byteLength / 2;
    const endPosition = inputOffset + byteLength;
    FIND_CHUNK: for (let i = 0; i < chunksLength; i++) {
      const chunk = chunks[i];

      if (i > 0 && prevHits < chunk.hits) {
        // Sort chunks by number of hits
        // in order to improve search speed for most used keys
        const prevChunk = chunks[i - 1];
        chunks[i] = prevChunk;
        chunks[i - 1] = chunk;
        prevHits = prevChunk.hits;
      } else {
        prevHits = chunk.hits;
      }

      for (let j = 0; j < halfLength; j++) {
        if (chunk.bytes[j] !== bytes[inputOffset + j]) {
          continue FIND_CHUNK;
        }

        if (chunk.bytes[byteLength - j - 1] !== bytes[endPosition - j - 1]) {
          continue FIND_CHUNK;
        }
      }

      chunk.hits++;

      return chunk.key;
    }

    return null;
}
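To tie the steps together, here is a stripped-down, plain-JavaScript sketch of the whole caching idea: a hypothetical TinyKeyDecoder, simplified to an ASCII-only decode path and without the hit-count reordering. The final TypeScript version is what actually shipped.

```javascript
// Minimal sketch of the key cache: buckets indexed by byte length,
// linear scan within a bucket, decode-and-store on a miss.
class TinyKeyDecoder {
  constructor(maxKeyLength = 32) {
    this.maxKeyLength = maxKeyLength;
    this.cachedValues = new Array(maxKeyLength + 1); // buckets indexed by byte length
  }

  decode(bytes, inputOffset, byteLength) {
    const chunks = this.cachedValues[byteLength];
    if (chunks) {
      outer: for (const chunk of chunks) {
        for (let j = 0; j < byteLength; j++) {
          if (chunk.bytes[j] !== bytes[inputOffset + j]) continue outer;
        }
        chunk.hits++; // cache hit: reuse the already-decoded string
        return chunk.key;
      }
    }
    // Cache miss: decode (ASCII-only here for brevity) and store a copy
    let key = "";
    for (let j = 0; j < byteLength; j++) {
      key += String.fromCharCode(bytes[inputOffset + j]);
    }
    if (byteLength <= this.maxKeyLength) {
      const record = { bytes: bytes.slice(inputOffset, inputOffset + byteLength), key, hits: 0 };
      if (!this.cachedValues[byteLength]) this.cachedValues[byteLength] = [];
      this.cachedValues[byteLength].push(record);
    }
    return key;
  }
}

const keyBytes = new TextEncoder().encode("payload");
const decoder = new TinyKeyDecoder();
console.log(decoder.decode(keyBytes, 0, keyBytes.length)); // payload (decoded and cached)
console.log(decoder.decode(keyBytes, 0, keyBytes.length)); // payload (served from the cache)
```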

You can find the final version in this pull request

We have spent some time building a pretty complex logic. Was it worth it?

Let's run the benchmark test: 180,000 docs/sec. This is a 2.64x improvement over the initial speed! Hell yeah, it was worth it!

The full PR is not merged yet; we are finalizing the public API that will let users control the cache size. It should be out shortly.

Summary

JavaScript has a reputation for being a slow language. That may have been true 10 years ago, but modern JS engines (especially V8) can deliver impressive performance. Even so, V8 can't fix your architecture or your algorithmic complexity. Sometimes the best way to improve performance is to rethink the way your code works. This is the first article in our performance optimization series; we plan to release many more. In upcoming articles, I will cover topics such as streaming parsing, web workers, and Redux performance optimizations.

I appreciate your attention, friends. Stay tuned!


About Us

AppSpector is a remote debugging and introspection tool for iOS and Android applications. With AppSpector, you can debug your app running in the same room or on another continent. You can measure app performance and view CoreData and SQLite content, logs, network requests, and much more in real time. Just like you, we struggled for years trying to find stupid mistakes and dreaming of better native tools, so finally we decided to build them. This is the instrument that you've been looking for.
