In most cases you can assume that one character in a string takes up 1 byte, but only in most cases. How many bytes do you think ü is? It turns out it is 2 bytes when encoded as UTF-8. Yet if you run 'ü'.length, JavaScript reports the string's length as 1. A Unicode character can display as a single character while being made up of multiple bytes of data. Usually this doesn't matter if you only need the length of a string, but it matters a great deal if you need the string's size in bytes.
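To see the difference, you can compare a string's length with the number of bytes produced by the standard TextEncoder API (available in modern browsers and in Node.js), which encodes a string as UTF-8:

```javascript
// A string's length counts UTF-16 code units, not bytes.
console.log('ü'.length); // 1

// TextEncoder encodes the string as UTF-8 and exposes the raw bytes.
const bytes = new TextEncoder().encode('ü');
console.log(bytes.length); // 2
```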

I bring this all up because I recently ran into an issue with a library that uploaded data to AWS S3 without letting the AWS SDK compute the size of the data automatically. It was doing something like this:

return s3Client.putObject({
    Body: contents || '',
    Bucket: bucket,
    Key: fullKey,
    // Bug: .length counts characters (UTF-16 code units), not bytes
    ContentLength: contents ? contents.length : 0,
    ContentType: CONTENT_TYPE_PLAIN_TEXT
});

Once I started trying to upload data that contained Unicode characters, AWS began returning BadDigest: The Content-MD5 you specified did not match what we received. errors. I knew the AWS SDK could compute the size of the uploaded data automatically, and sure enough, things started working once I removed the ContentLength property. That was when I realized why the difference between characters and bytes actually mattered in this case.
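The mismatch is easy to reproduce. For any string containing multi-byte characters, .length and the actual UTF-8 byte count diverge (the word 'über' here is just an illustrative example):

```javascript
const contents = 'über';

// .length counts UTF-16 code units...
console.log(contents.length); // 4

// ...but S3 receives UTF-8 bytes, and ü encodes to 2 of them.
console.log(Buffer.byteLength(contents)); // 5
```

If you do need to set ContentLength yourself rather than letting the SDK compute it, Buffer.byteLength(contents) gives the value S3 actually expects.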

If you are using Node.js and need the true size, in bytes, of a string, you can get it with a Buffer.

Buffer.from('ü').length
// Returns 2
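If you only need the count, Buffer.byteLength does the same calculation without allocating an intermediate buffer:

```javascript
Buffer.byteLength('ü')
// Returns 2
```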