eyJaafCsubstantially: cramming English words into JSON web tokens (JWTs)

A while back, my cofounder Ulysse and I spent an entire Friday evening solving a fantastically dumb technical problem. It really was a waste of time. Our founding engineer Blake later wrote in Slack, "guys, I’m pretty sure this is the least important thing we could be working on."

Ulysse had said something like this:

So you know how JWTs [JSON web tokens] kind of always start with 'eyJ'? Well, that’s just the base64 encoding for squiggly brace and quotation mark. I’m pretty sure I can figure out how to manipulate the JSON header to target a specific string. Like, there’s some string we can put in the header that can make a JWT say 'TESSERAL_SECRET_JWT'

I was, of course, very down to spend time on that. Why not?

We spent most of our Friday night figuring out exactly what we could write into valid JWTs. As it turns out, human-legible strings in JWTs are surprisingly hard to target! We weren’t, to our mild disappointment, able to write any variant of our company’s name into our JWTs. (More on this later.)

I decided to try something else. If we couldn't make just any text work, I'd figure out the longest word of consistent capitalization that I could stick into a JWT.

Anyway, here’s a blog post about a squandered Friday night (and Saturday morning).

An aside: yes, the code snippets that I’ve included here are very sloppy. You’re right to think so. I was just using Cursor to idly bolt stuff on while watching Gladiator 2 with my dog. Once something kinda sorta worked, I stopped. I’ve never pretended to be a good programmer anyway.

What are JWTs?

If you visit some major website and check your browser cookies, there’s a good chance you’ll see some cookie value that looks like this:

>    eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6Ik5lZCBPJ0xlYXJ5IiwibXNnX3RvX2Jsb2dfcmVhZGVyIjoiaGVsbG8iLCJpYXQiOjE1MTYyMzkwMjJ9.ChPEZQ5cI207MIvHl2S3-LmQaOKSTs8ppV_pjRhqsOk

That’s a JSON web token. It’s just some data broken into three period-delimited chunks:

A base64URL-encoded header: eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9
A base64URL-encoded body: eyJzdWIiOiIxMjM0NTY3ODkwIiwi...
A digital signature: ChPEZQ5cI207MIvHl2S3-LmQaOKSTs8ppV_pjRhqsOk

A JSON web token (or JWT, pronounced jot) will let a server stash some data on your browser. For example, it might store a few strings like your name or your job title. Now, the data’s not generally encrypted. Anyone who can see the JWT can read its contents. The data is signed, however, so anyone holding the right key can check that the JWT’s contents are authentic and haven’t been tampered with.
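Since the header and body are only encoded, not encrypted, a few lines of Python are enough to read them. This is a minimal sketch that uses the sample token from above; the helper name b64url_decode is made up for illustration, and nothing here checks the signature.

```python
import base64
import json

TOKEN = (
    "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9."
    "eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6Ik5lZCBPJ0xlYXJ5IiwibXNnX3RvX2Jsb2dfcmVhZGVyIjoiaGVsbG8iLCJpYXQiOjE1MTYyMzkwMjJ9."
    "ChPEZQ5cI207MIvHl2S3-LmQaOKSTs8ppV_pjRhqsOk"
)


def b64url_decode(segment: str) -> bytes:
    # JWT segments drop the trailing "=" padding, so restore it before decoding.
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


header_b64, body_b64, signature_b64 = TOKEN.split(".")
header = json.loads(b64url_decode(header_b64))
body = json.loads(b64url_decode(body_b64))

print(header)  # {'alg': 'HS256', 'typ': 'JWT'}
print(body["name"], body["msg_to_blog_reader"])
```

Reading a token this way tells you what it claims, not whether those claims are genuine.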

JWTs are very useful for authorization claims. You might give an authenticated user a short-lived access token that includes information like their name or their entitlements (e.g., indicating that someone is an 'admin' user). You’d have the client pass the access token to your server to validate its authenticity. Who is this person? What are they allowed to do? JWTs are just one way to pass around this kind of identity-related information.
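To make "validate its authenticity" a bit more concrete, here is a rough sketch of how an HS256 signature can be produced and checked with Python's standard library. The secret, the claims, and the function names are all invented for illustration. A real server should use a maintained JWT library, which also checks things like the alg header and expiry.

```python
import base64
import hashlib
import hmac

SECRET = b"not-a-real-secret"  # hypothetical key, for illustration only


def b64url_encode(data: bytes) -> str:
    # Base64URL without the trailing "=" padding, as JWTs use it.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def sign_hs256(header_b64: str, body_b64: str, secret: bytes) -> str:
    # The signature covers the encoded header and body joined by a period.
    signing_input = f"{header_b64}.{body_b64}".encode("ascii")
    return b64url_encode(hmac.new(secret, signing_input, hashlib.sha256).digest())


def verify_hs256(token: str, secret: bytes) -> bool:
    try:
        header_b64, body_b64, signature_b64 = token.split(".")
    except ValueError:
        return False  # not exactly three period-delimited chunks
    expected = sign_hs256(header_b64, body_b64, secret)
    return hmac.compare_digest(expected, signature_b64)


header_b64 = b64url_encode(b'{"alg":"HS256","typ":"JWT"}')
body_b64 = b64url_encode(b'{"sub":"1234567890","admin":false}')
token = f"{header_b64}.{body_b64}.{sign_hs256(header_b64, body_b64, SECRET)}"

# Swap in a different body but keep the old signature.
forged_body_b64 = b64url_encode(b'{"sub":"1234567890","admin":true}')
forged = f"{header_b64}.{forged_body_b64}.{token.split('.')[2]}"

print(verify_hs256(token, SECRET))   # True
print(verify_hs256(forged, SECRET))  # False
```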

The key point here is just that JWTs are Base64URL encoded. They translate normal-looking blobs of JSON into text that’s illegible to humans.

Mapping JSON to Base64URL

When we construct a JWT, we need two blobs of JSON as inputs: a header and a body. We pretty much treat the header and the body exactly the same way, so let’s just focus on the header for now.

The header looks something like this:

{
  "alg": "HS256",
  "typ": "JWT"
}

We need to convert that blob of JSON into some Base64URL encoded text that we can stash in the user’s browser. To do that, we do two different mappings:

We first UTF-8 encode the JSON string into a byte stream. Each ASCII character becomes one byte – or eight bits.
We then mash together the bits and pull them off into base64 chunks (i.e., six bits) that map to specific characters.
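Those two steps can be sketched with the standard library. One assumption worth flagging: the JSON is serialized compactly, with no spaces or newlines, because that is what reproduces the header chunk from the sample token.

```python
import base64
import json

header = {"alg": "HS256", "typ": "JWT"}

# Compact separators: no whitespace, so the bytes match the sample token.
header_json = json.dumps(header, separators=(",", ":"))

# Step 1: UTF-8 encode the JSON text into bytes.
header_bytes = header_json.encode("utf-8")

# Step 2: Base64URL encode the bytes and drop any "=" padding.
header_b64 = base64.urlsafe_b64encode(header_bytes).rstrip(b"=").decode("ascii")

print(header_json)        # {"alg":"HS256","typ":"JWT"}
print(len(header_bytes))  # 27
print(header_b64)         # eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9
```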

To be more concrete, let’s take the first few characters of the header as it actually gets serialized, with no whitespace: {"alg":"HS256","typ":"JWT"}. The curly brace { is 01111011 in ASCII. Then the quotation mark " is 00100010. The lowercase a is 01100001. We go on from there until we eventually get the other curly brace }, which is 01111101. You get the idea.

We concatenate all of the bits together into one long stream moving left to right: 011110110010001001100001…01111101. Note that we’re always dealing with some number of bits here that’s a multiple of eight. Each character from our input maps to eight bits.

We now have some long mess of zeroes and ones. Moving left to right, we take one six-bit chunk at a time and convert it into a Base64 character. We start with the first six bits, 011110, which map to lowercase e in Base64. Then the next six bits, 110010, map to lowercase y. (You can start to see why these things tend to start with eyJ!)
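Here is the same bit shuffling done by hand, as a small sketch. The function names are made up, and the second function assumes the input is a whole number of six-bit chunks (in other words, the original text was a multiple of three bytes long).

```python
import base64
import string

# The Base64URL alphabet: index 0 is "A", index 26 is "a", index 63 is "_".
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "-_"


def to_bits(text: str) -> str:
    # Eight bits per byte, concatenated left to right.
    return "".join(f"{byte:08b}" for byte in text.encode("utf-8"))


def bits_to_base64url(bits: str) -> str:
    # Assumes len(bits) is a multiple of six.
    return "".join(ALPHABET[int(bits[i:i + 6], 2)] for i in range(0, len(bits), 6))


bits = to_bits('{"a')
print(bits)                                            # 011110110010001001100001
print([bits[i:i + 6] for i in range(0, len(bits), 6)])  # ['011110', '110010', '001001', '100001']
print(bits_to_base64url(bits))                         # eyJh

# Sanity check against the standard library on the full 27-byte header.
full_header = '{"alg":"HS256","typ":"JWT"}'
by_hand = bits_to_base64url(to_bits(full_header))
by_library = base64.urlsafe_b64encode(full_header.encode("utf-8")).decode("ascii")
print(by_hand == by_library)  # True
```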

Moving from chunks of 8 bits to chunks of 6 bits means that we don’t have a 1:1 mapping from JSON characters to Base64URL encoded characters. This is really, really annoying, at least for the purposes of creating vanity JWTs.

Padding aside, we have to think in chunks of characters. Each three-character chunk of JSON maps to four Base64URL characters. (3 * 8 bits == 24 bits == 4 * 6 bits).
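A quick way to convince yourself of the block structure: encode each three-byte block on its own and glue the results together. A minimal sketch, using the 27-byte compact header so there are exactly nine full blocks and no padding to worry about:

```python
import base64


def encode(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).decode("ascii")


data = b'{"alg":"HS256","typ":"JWT"}'  # 27 bytes, so nine full blocks

blocks = [data[i:i + 3] for i in range(0, len(data), 3)]
for block in blocks:
    # Every three-byte block becomes exactly four Base64URL characters.
    print(block, "->", encode(block))

rejoined = "".join(encode(block) for block in blocks)
print(rejoined == encode(data))  # True
```

The flip side is that a four-character block only decodes cleanly if it starts on a block boundary, which is why word length and padding matter so much later on.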

Mapping base64URL to JSON

Writing a Base64URL string that translates into valid JSON isn’t that easy. If you’re just attempting to conjure some Base64URL string, you’ll typically create an invalid byte at some point.

For example, suppose you have the string “ERAL” in base64. That becomes 000100010001000000001011 in binary – or 00010001 00010000 00001011. Those are all control characters in ASCII. Control characters aren’t good JSON.
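You can check the ERAL example directly. This sketch decodes the block and then asks Python's JSON parser whether the result could sit inside a JSON key; the helper name is_valid_json_key is an illustrative invention.

```python
import base64
import json

raw = base64.urlsafe_b64decode("ERAL")
print([f"{byte:08b}" for byte in raw])  # ['00010001', '00010000', '00001011']
print(raw)                              # b'\x11\x10\x0b'


def is_valid_json_key(raw_bytes: bytes) -> bool:
    # Wrap the bytes as {"<bytes>":1} and see whether a strict parser accepts it.
    try:
        json.loads('{"' + raw_bytes.decode("utf-8") + '":1}')
        return True
    except ValueError:  # bad UTF-8 or bad JSON
        return False


print(is_valid_json_key(raw))                               # False
print(is_valid_json_key(base64.urlsafe_b64decode("ency")))  # True, decodes to zw2
```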

You need to target a specific subset of ASCII characters. The set of four-character Base64URL blocks that map back to a valid three-character block of ASCII isn’t entirely obvious. Consequently, the set of human-legible strings in base64 that convert back into valid JSON isn’t obvious.

We were able to find some general heuristics. For example, you usually don’t want to use characters that are densely packed with zeroes. A Base64URL character like A – which translates into 000000 – is going to increase your likelihood of landing on an ASCII control character, since ASCII control characters start with a bunch of zero bits.
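One narrow version of that heuristic can be made exact. The first byte of each decoded block is built from all six bits of the block's first character plus the top two bits of its second character, so the first character alone pins the first byte down to a range of four values. The sketch below (function names are mine) only covers that first position in a block; it says nothing about the other three characters.

```python
import string
from typing import Tuple

ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "-_"


def first_byte_range(first_char: str) -> Tuple[int, int]:
    # Lowest and highest byte a block can start with, given its first character.
    value = ALPHABET.index(first_char)
    return (value << 2, (value << 2) | 0b11)


def classify(first_char: str) -> str:
    low, high = first_byte_range(first_char)
    if high < 0x20:
        return "always a control character"
    if low >= 0x80:
        return "always a non-ASCII byte"
    return "can be printable ASCII"


for char in "AHIefgs":
    print(char, first_byte_range(char), classify(char))
```

So a block that starts with A through H always opens with a control character, and a block that starts with g or anything later in the alphabet always opens with a byte of 0x80 or above. Those high bytes are only acceptable as part of a multi-byte UTF-8 character.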

But we didn’t come up with an especially systematic set of rules. It should be possible. But I don’t really feel like doing things the smart way when it’s so easy to do the dumb thing.

Finding English words that I can stick in JWTs, the dumb way

We can start by cheating a little bit. I know I want some valid JSON. So we can start the base64URL encoded string with eyJa, which always decodes to {"Z. And then we can always end the base64URL string with any characters that target ":1} when decoded, e.g. IjoxfQ==.
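Both bookends are easy to confirm with the standard library:

```python
import base64

prefix = base64.urlsafe_b64decode("eyJa")
suffix = base64.urlsafe_b64decode("IjoxfQ==")

print(prefix)  # b'{"Z'
print(suffix)  # b'":1}'

# Going the other way, the suffix text encodes back to the same characters.
print(base64.urlsafe_b64encode(b'":1}').decode("ascii"))  # IjoxfQ==
```

Anything sandwiched between them ends up as the rest of a JSON key that starts with Z and has the value 1.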

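Here’s the general process I followed to brute-force this thing: iterate over a list of 10,000 common English words; for each word, work out how many extra letters it needs to fill out a whole number of four-character blocks, then try every combination of padding letters (at the front or at the back) until one works, e.g. for the string "hello",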

eyJaaaahelloIjoxfQ== doesn’t work
eyJaaabhelloIjoxfQ== doesn’t work
...
eyJaZzZhelloIjoxfQ== does work

Often, we can find more than one padding string that works. But we only need to find one. Using ZzZ, we can get eyJaZzZhelloIjoxfQ==, which maps to {"Zg6azYh":1}
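Checking a single candidate looks roughly like this. It is a small sketch with a made-up helper name, decodes_to_json, that treats a candidate as good if the whole thing decodes to UTF-8 text that parses as JSON.

```python
import base64
import json


def decodes_to_json(encoded_header: str) -> bool:
    try:
        json.loads(base64.urlsafe_b64decode(encoded_header).decode("utf-8"))
        return True
    except ValueError:  # bad Base64, bad UTF-8, or bad JSON
        return False


print(decodes_to_json("eyJaaaahelloIjoxfQ=="))  # False
print(decodes_to_json("eyJaZzZhelloIjoxfQ=="))  # True

decoded = base64.urlsafe_b64decode("eyJaZzZhelloIjoxfQ==").decode("utf-8")
print(decoded)  # {"Zg6azYh":1}
```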

Here’s some bad code (remember – watching Gladiator 2 with my dog, not a good programmer) that executes that process for a given word:

from itertools import product
import json
import base64

def generate_combinations(chars: str, n: int) -> product:
    return product(chars, repeat=n)

def url_safe_base64_decode(encoded_str: str) -> bytes:
    return base64.urlsafe_b64decode(encoded_str + "=" * (-len(encoded_str) % 4))

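def base64_creates_good_json(test_string: str) -> bool:
    # Decode "eyJa" + test_string, close the JSON with '":1}', and check that it parses.
    try:
        decoded = url_safe_base64_decode("eyJa" + test_string)
        decoded = decoded.decode("utf-8")
        output_json = decoded + '":1}'
        json.loads(output_json)  # This will raise JSONDecodeError if invalid
        print(test_string)
        print(output_json)
        return True
    except ValueError:
        # Covers binascii.Error, UnicodeDecodeError, and json.JSONDecodeError
        return False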

def get_valid_padded_base64(word: str) -> str:
    chars_list = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
    # Indexed by len(word) % 4: the padding that completes the last four-character block.
    padding_strings = [
        [],
        generate_combinations(chars_list, 3),
        generate_combinations(chars_list, 2),
        generate_combinations(chars_list, 1)
    ]
    remainder = len(word) % 4

    for padding_chars in padding_strings[remainder]:
        padding_string = ''.join(padding_chars)
        front_padded_word = padding_string + word
        end_padded_word = word + padding_string
        if base64_creates_good_json(front_padded_word):
            return front_padded_word
        if base64_creates_good_json(end_padded_word):
            return end_padded_word

    # Only try the bare word when it already fills whole blocks; otherwise it
    # would not line up with the '":1}' suffix in a real token.
    if remainder == 0 and base64_creates_good_json(word):
        return word

    raise ValueError(f"no valid padding found for {word!r}")

def script():

    valid_words = []

    with open('words.txt', 'r') as file:
        for line in file:
            word = line.strip()
            if len(word) <= 8:  # Skip short words early
                continue

            try:
                padded = get_valid_padded_base64(word)
                valid_words.append((word, padded))
                continue
            except ValueError:
                pass

            try:
                padded = get_valid_padded_base64(word.upper())
                valid_words.append((word.upper(), padded))
            except ValueError:
                pass

    valid_words.sort(key=lambda x: len(x[0]), reverse=True)
    for word, padded in valid_words:
        print(f"{word} | {padded}")

script()

Whatever. It works, I think.

English words that I can stick in JWTs

Without further ado, the longest English word I was able to stick into a JWT was substantially. It turns out that if you start your JWT’s header with {"Zi𬹻-j{bjYr, you get a JWT that starts with eyJaafCsubstantially.
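If you want to check that one yourself, here is a short sketch. It prints code points rather than the raw text, since the odd character in the middle is a four-byte UTF-8 sequence that not every terminal font will render.

```python
import base64

encoded = "eyJa" + "afCsubstantially"
header_start = base64.urlsafe_b64decode(encoded).decode("utf-8")

# 20 Base64URL characters decode to 15 bytes, which is 12 characters of text.
print(len(encoded), len(header_start.encode("utf-8")), len(header_start))
print([f"U+{ord(ch):04X}" for ch in header_start])

# Round trip: encoding the decoded text gives back the vanity prefix.
print(base64.urlsafe_b64encode(header_start.encode("utf-8")).decode("ascii"))
```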

The longest word that didn't need any padding seemed to be encyclopedia, which decodes to zw2rZ)yؚ.

Note that every padded string here is a whole number of four-character chunks that decodes cleanly on its own, so you can append one onto another and mash some of these together. For example, you could do something like eyJaacaAMENDMENTencyclopediaacaSIGNATUREaiphenomenonIjoxfQ==, which decodes to {"Ziƀ0CC0CSzw2rZ)yؚiƒ c@MDDj*azz&zz'":1}.
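Here is a small sketch of that mashing-together, using four padded entries from the list in this post. The function name build_vanity_header is invented for illustration, and the output is printed with JSON escapes so it stays ASCII-only.

```python
import base64
import json

PREFIX = "eyJa"      # decodes to {"Z
SUFFIX = "IjoxfQ=="  # decodes to ":1}


def build_vanity_header(padded_words):
    # Each padded word must be a multiple of four characters long,
    # otherwise the blocks after it would be misaligned.
    for padded in padded_words:
        if len(padded) % 4 != 0:
            raise ValueError(f"{padded!r} is not a whole number of blocks")
    return PREFIX + "".join(padded_words) + SUFFIX


encoded = build_vanity_header(
    ["acaAMENDMENT", "encyclopedia", "acaSIGNATURE", "aiphenomenon"]
)
parsed = json.loads(base64.urlsafe_b64decode(encoded).decode("utf-8"))

print(encoded)
print(json.dumps(parsed))  # non-ASCII characters show up as \uXXXX escapes
```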

I’m not sure why anyone would ever want to do that, but it’s possible!

Here’s the full list of long-ish words I was able to cram into JWTs, complete with one corresponding set of valid padding characters:

substantially | afCsubstantially
encyclopedia | encyclopedia
potentially | wpotentially
SURROUNDING | SURROUNDINGa
TERMINOLOGY | TERMINOLOGYg
activation | aeactivation
affiliated | acaffiliated
affiliates | acaffiliates
amendments | acamendments
assumption | aeassumption
cincinnati | cincinnaticg
documented | aidocumented
eventually | aceventually
expiration | aeexpiration
GOVERNMENT | adGOVERNMENT
immigrants | aeimmigrants
journalism | aijournalism
journalist | aijournalist
motivation | aemotivation
multimedia | aemultimedia
nationally | ainationally
phenomenon | aiphenomenon
statements | aistatements
translated | aitranslated
accidents | acaaccidents
activated | acaactivated
AMENDMENT | acaAMENDMENT
annotated | aeaannotated
anonymous | afCanonymous
appraisal | afCappraisal
basically | acabasically
beautiful | beautifulacg
collected | aeacollected
completed | aeacompleted
conducted | aeaconducted
CONSUMERS | acaCONSUMERS
contacted | aeacontacted
converted | aeaconverted
converter | aeaconverter
criticism | aeacriticism
databases | acadatabases
dedicated | acadedicated
DIFFERENT | acaDIFFERENT
diversity | diversityacg
documents | aeadocuments
estimated | aeaestimated
estimates | aeaestimates
evaluated | afCevaluated
expressed | acaexpressed
gibraltar | afCgibraltar
gradually | afCgradually
graduated | afCgraduated
graduates | afCgraduates
greetings | aeagreetings
guarantee | afCguarantee
handhelds | acahandhelds
impressed | afCimpressed
incidents | aeaincidents
indicated | aeaindicated
indicates | aeaindicates
initially | aeainitially
initiated | aeainitiated
literally | aealiterally
motivated | aeamotivated
naturally | acanaturally
opponents | afCopponents
portraits | afCportraits
protected | afCprotected
publicity | afCpublicity
purchased | afCpurchased
purchases | afCpurchases
residents | acaresidents
salvation | acasalvation
shipments | aeashipments
SIGNATURE | acaSIGNATURE
standings | afCstandings
templates | acatemplates
tennessee | acatennessee
translate | afCtranslate
universal | aeauniversal
virtually | afCvirtually

Better ideas?

Would love to know if you have better ideas for finding valid strings. Is there a word longer than encyclopedia that’s valid without padding? (There must be a few...) What’s the longest string we can build – without padding characters – of consecutive English words? What happens if we relax the requirement that a word be all-uppercase or all-lowercase?

Is there an elegant solution for finding words that can work?

About us

We at Tesseral find things like JWTs intrinsically interesting. We build auth software, after all. If you're working on some business software, we'd love to handle auth for you.