Super rad. Intuitively I would not have expected a meaningful difference between token efficiency and entropy.
I wonder if other tokenizers would be more or less accurate for calculating token efficiency. You'd probably have to adjust the cutoff to 'calibrate' different tokenizers, but it'd be interesting if accuracy could be pushed even higher.
Hey! Thanks for the comment and yea, interesting idea! I'd be happy to do some follow-up research if you're interested.
Somewhat related, but importing the tiktoken library into Betterleaks added roughly 5 MB of bloat to the binary. I wonder how a smaller tokenizer like `p50k_base` would perform? It'd be great to reduce that binary bloat.
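If anyone wants to experiment with this, here's a rough sketch of how the comparison could look (plain tiktoken in Python rather than the Betterleaks code path; the sample strings are made up, so treat the numbers as illustrative):

```python
# Rough sketch: token efficiency (len(str) / num_tokens) under two encodings.
# A cutoff calibrated for cl100k_base would need re-tuning for p50k_base.
import tiktoken

def token_efficiency(s: str, encoding_name: str) -> float:
    enc = tiktoken.get_encoding(encoding_name)
    return len(s) / len(enc.encode(s))

samples = [
    "chibearsfan123",                   # natural-language-ish password from the post
    "x7Qp9ZtR2LmV4KsN8JdW1FyB6HcA3Eg",  # made-up random-looking string
]
for s in samples:
    for name in ("cl100k_base", "p50k_base"):
        print(f"{name:12}  {token_efficiency(s, name):.2f}  {s}")
```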
Hey, great writeup!
Since it's also written in the post:
"A quick note on passwords. Token Efficiency does not do well with classifying bad passwords like “password123” or “chibearsfan123”. These passwords are basically natural language which means a high token efficiency value. Pass phrases also don’t do well because those are usually just straight up words."
What do you think is the best way to find these, then? Or is it something to drop from a secret scanner, because "whoever uses such a weak password deserves to be pwned anyway"?
This is a great write-up. Thank you.
I am curious, though, about the internationalization side of this. For example, how would this fare against Polish or Cyrillic-alphabet languages for tokenization? The English word "luck" is 1 token, but the Polish word for it, "szczęście", produces 4 tokens according to the tool you shared, which would meet the definition of "high tokenization" while still being a common "dictionary word".
I am wondering if this means for English words this makes sense due to training bias, but for other alphabets, maybe entropy might still be needed?
Hey Dwayne, thanks for the compliment and question! You are right, Token Efficiency using cl100k_base is biased toward English. There may be another "all language" model that produces similar token efficiency values for common and rare words across languages. Entropy is still a great filter and is needed in order to squeak out that 0.89 F1 score (the generic rule uses an entropy of 2.73 + Token Efficiency). But you're definitely right, the way token efficiency is shipped today you would still need an entropy filter for Polish or Cyrillic-alphabet languages.
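For anyone following along, the combination looks roughly like this (a minimal sketch of the idea, not the actual Betterleaks rule; the efficiency cutoff below is illustrative):

```python
# Sketch of the "entropy + Token Efficiency" combination: only flag a candidate
# when its character entropy is high AND its token efficiency is low.
# The 2.73 entropy cutoff is the one mentioned above; the efficiency cutoff
# is illustrative, not the shipped value.
import math
from collections import Counter

import tiktoken

ENTROPY_CUTOFF = 2.73
EFFICIENCY_CUTOFF = 2.5  # illustrative only

enc = tiktoken.get_encoding("cl100k_base")

def shannon_entropy(s: str) -> float:
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())

def token_efficiency(s: str) -> float:
    return len(s) / len(enc.encode(s))

def looks_like_secret(s: str) -> bool:
    return shannon_entropy(s) >= ENTROPY_CUTOFF and token_efficiency(s) <= EFFICIENCY_CUTOFF

# A common non-English word like "szczęście" still tokenizes inefficiently
# under cl100k_base; that is the bias discussed above, and it is why the
# entropy filter stays in the loop for non-English text.
```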
Feel free to open an issue for this on the Betterleaks repo if you're feeling motivated. Otherwise I'll eventually get to this.
Thanks for the acknowledgment, and glad I was able to contribute.
Great article!
I wonder if you looked manually into the TNs in the [2.0, 2.5] token efficiency range - what are they? Are they largely weak hand-created passwords, or something short? In many use cases, the stakeholder may decide this is not what they even need to protect against. If someone is using `Passw0rd-gmail` as a credential (14/6 = 2.33), they have bigger problems than it being hardcoded somewhere, and it may not be worth protecting at the scanner level. Potentially the threshold could still be moved to 2.00, or 2.05-2.10.
P.S. I also continue developing this idea separately, but haven't had a chance to publish formalized research like you did - congrats! I still like the term 'token density', although in that case the formula needs to be reversed: len(tok)/len(str), with density thresholds lingering in the 0.5-ish range.
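To make the relationship between the two formulations explicit, a quick sketch (token counts via tiktoken; the exact values depend on the encoding):

```python
# Token efficiency and token density are reciprocals:
# efficiency = chars per token, density = tokens per char.
# "efficiency <= 2.0" flags the same strings as "density >= 0.5".
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def token_efficiency(s: str) -> float:
    return len(s) / len(enc.encode(s))

def token_density(s: str) -> float:
    return len(enc.encode(s)) / len(s)

s = "Passw0rd-gmail"
print(token_efficiency(s), token_density(s))  # density == 1 / efficiency
```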
Thanks again for the great idea Dmitriy!
> I wonder if you looked manually into the TNs in the [2.0, 2.5] token efficiency range - what are they? Are they largely weak hand-created passwords, or something short?
TNs in that range are a group I did not look at manually. I can whip something up and take a look. If I had to guess, they might be hex-encoded IDs?
> In many use-cases, the stakeholder may decide this is not what they even need to protect against.
Agreed! That's why I'm trying to expose as much as possible via config/cli flags.