Discussion about this post

Dwayne McDaniel:

This is a great write-up. Thank you.

I am curious, though, about the internationalization side of this. For example, how would this fare for tokenization of Polish, or of languages with non-Latin alphabets such as Cyrillic? The English word "luck" is 1 token, but the Polish word for it, "szczęście", produces 4 tokens according to the tool you shared, which would meet the definition of "high tokenization" while still being a common dictionary word.

I am wondering if this means the approach works for English words due to training bias, but for other alphabets, maybe entropy might still be needed?
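The concern can be sanity-checked with simple arithmetic, assuming the article's chars-per-token efficiency metric and taking the 4-token count for "szczęście" as given from the tool (the counts below are not computed by a real tokenizer, only illustrative):

```python
# Token efficiency = characters per token (higher = more "word-like").
# Token counts are taken from the comment above, not from a tokenizer,
# so this is only illustrative arithmetic.

def token_efficiency(text: str, n_tokens: int) -> float:
    return len(text) / n_tokens

eff_en = token_efficiency("luck", 1)       # English: 4 chars / 1 token = 4.0
eff_pl = token_efficiency("szczęście", 4)  # Polish: 9 chars / 4 tokens = 2.25

print(eff_en, eff_pl)
```

At 2.25, the Polish word would sit in a borderline band just above a 2.0 threshold, which is consistent with the worry that common non-English dictionary words can look "high-tokenization".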

Dmitriy Alergant:

Thanks for the acknowledgment, and glad I was able to contribute.

Great article!

I wonder if you looked manually into the TNs in the [2.0, 2.5] token efficiency range. What are they? Are they largely weak hand-created passwords, or something short? In many use cases, the stakeholder may decide this is not what they even need to protect against. If someone is using `Passw0rd-gmail` as a credential (14/6 = 2.33), they have bigger problems besides it being hardcoded somewhere, and it may not be worth protecting at the scanner level. Potentially the threshold can still be moved to 2.00, or to 2.05-2.10.

P.S. I also continue developing this idea separately, but haven't had a chance to publish formalized research like you did - congrats! I still like the term 'token density', although in that case the formula needs to be reversed: len(tok)/len(str), with density thresholds lingering in the 0.5-ish range.
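The reversed "token density" framing is just the reciprocal of the efficiency ratio. A minimal sketch, reusing the 6-token count for `Passw0rd-gmail` stated above rather than calling a real tokenizer:

```python
# Token density = tokens per character, the reciprocal of token efficiency.
# An efficiency threshold of 2.0 chars/token corresponds to a density
# threshold of 0.5 tokens/char, matching the "0.5-ish" range mentioned above.

def token_density(text: str, n_tokens: int) -> float:
    return n_tokens / len(text)

# "Passw0rd-gmail": 14 characters, 6 tokens per the comment above.
d = token_density("Passw0rd-gmail", 6)
print(round(d, 3))  # below 0.5, i.e. efficiency above 2.0
```

Either ratio carries the same information; the choice is mostly about which direction of the threshold reads more naturally.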
