Merkle Tree Authentication with Vector Embeddings
A Neural Network take on Public Key Distribution With Tree Authentication
Could vector embeddings be a replacement for SHA-256 as the one-way function Merkle tree authentication protocols?
Vector embeddings is a hash?
Vector embedding models are a form of compression applied to input data that returns an output vector that contains all the information provided in the input data.
Secondly, the output vector is always of the same size, or more precisely - the same number of dimensions, regardless of the size of input.
Thirdly, each value in the vector embedding is a floating point number and assuming a level of precision (or number of decimals), a collision of 2 different inputs producing exactly the same vector is very highly unlikely.
Therefore, can we consider vector embedding models as a one-way function, sort of? An equivalent to SHA-256, but in the neural network space?
We can compare vector embeddings to painting with a wide brush: the main strokes will be relatively reversible, but the fine details (like which number is at the 4th decimal point at vector position #55) will be scrambled and generally not recoverable.
In other words, even though some amount of semantic and topical information from can be extracted from a vector embedding, getting the exact and complete input data from the output vector is nearly impossible as the highest ratio of recovered text within some published studies was at around 92% with advanced methods1.
This becomes even more prevalent when the data includes a lot of numbers and complicated tabular structure (i.e. transactions data).
Let’s then plant a Vector embeddings tree! 🌳
For the purpose of this exercise let’s revisit Ralphe C. Merkle’s original paper2 on Merkle trees, but instead of SHA-256 we’ll just use vector embeddings to hash the public file.
Bleeding edge research.
OK, so we have entries in a document for which we apply a one-way hash function, in this case E (for Embeddings), to the entire document. Even though E is applied to the whole document, its output is only 400 Bytes long (based on the calculation of 100 dimensions * 4 bytes per value). The small output of E is called the root, R, of the document.
Why trees? To save Earth space
Just to quickly reiterate why tree authentication is used: anyone can check if the entire document is correct by confirming R = E(document)
. If any part of the document changes then we have R ≠ E(altered document)
, which is easily checked by comparing the generated vector embedding from the altered document with the provided root.
The issue with this is having to know and store the whole document which is highly impractical, and where tree authentication comes into play. We divide the document Y into a sequence of entries:
Y = Y₁, Y₂, ..., Yn
We then define E(Y) as:
E(Y) = E(E(first half of Y), E(second half of Y))
Let’s say these entries are transactions where n = 100
, and a user needs to only confirm the first 50 transactions, no more. Provided we know the root R, to confirm these are correct we would then only need to know:
the first half of the document (the first 50 transactions in question)
vector embeddings of the second half of the document (400 Bytes)
This would be enough to generate the final vector embeddings E(Y) and check if it does indeed equal R.
And of course, if we apply this to each separate transaction Yn, the user needs to generate log₂(n) embedding vectors, along with the transaction they’re checking.
For an example of having to check a transaction in a much larger document of let’s say 1500 transactions or separate parts, that would still equate to having to generate only 3 vector embeddings, or 3 * 400B = 1 kilobyte of data.
Which is around 70x less data than the document containing those same transactions in plaintext format.
Sweet!
Authentication process
Everything else is the same as in the original protocol when using a standard cryptographic hash function like SHA-256.
The tree, once computed, is fixed and unchanging. So our document (which is just the leaves of the tree) is also fixed and unchanging, publicly verifiable and impossible to update.
Equally, the tree authentication path would include the vector embeddings generation function at each step. An example authentication path for Y5 is shown in figure 1 below:
For the generalized case, the protocol for public key distribution between user A and user B would be unchanged:
A obtains B's entry in the document and B's path in the vector embeddings tree (where in the tree is the leaf user B is on - or which specific transaction in our previous example) and checks if it’s correct
A generates a random key k₁ and transmits EB(k₁) to B
B obtains A's entry in the document and A's authentication path and confirms it’s correct.
B generates a random key k₂ and transmits EA(k₂) to A
A computes the final key
k = <k₁, k₂>
, wherek₂ = DA(EA(k₂))
B computes the final key
k = <k₁, k₂>
, wherek₁ = DB(EB(k₁))
The protocol can only be compromised if A's or B's private key (DA or DB) is compromised, if the vector embeddings R are not correctly known by A or B, or if there is a false and misleading entry in the document. The correct value of R is exchanged between pairs of users as they establish keys and is published to the network.
Conclusion
Using vector embedding models as a one-way hash function in Merkle trees looks feasible!
As for applications, I already have some ideas for using vector embedding trees which I will write about in my next posts, but I’m also sure many smart people will have interesting takes.
If you are one of those smart people, please share your opinion in the comments!
P.S. if no one really thought and wrote about this before me, then please respect my wish for this type of cryptographic tree to be called a Visualcrypto tree. Thank you.