Hello Alex,

I'm sorry for the (very) late draft proposal. I would appreciate some comments on it before I submit my final proposal.

Thanks in advance.

On Mar 28 2021, at 10:58 pm, Vassilis Xanthopoulos <xanthopoulos [ dot ] vassilis [ at ] gmail [ dot ] com> wrote:
> Hi Alex,
>
> I'm sorry, I didn't phrase my second point correctly. Of course even a minor
> change in the input of most hash functions, especially the ones intended to be
> used in cryptographic applications, will completely change the output.
>
> I am talking about the case where we use a fuzzy hash function, like spamsum used
> by ssdeep, and store the resulting hashes in the database. Then we could query
> the database trying to find stored hashes 'near' the input hash.
>
> In the meantime though I found out that this feature is present in most
> relational databases by just defining a custom distance function and specifying
> an acceptable max distance from our input.
>
> On Mar 28 2021, at 9:37 pm, Alexios Zavras <zvr+eellak [ at ] zvr [ dot ] gr> wrote:
> > Hi Vassilis
> >
> > Sure, as you say, the whole idea is to store N hashes.
> > This does not affect the main project -- which should be able
> > to be configured about the hashes that we want to keep.
> >
> > To your second point, I'm afraid this is not how hashes
> > and fuzzy matching work. If we were to have, for example,
> > a typical hash like SHA-512, a single bit change on the input
> > (let's say a file), would produce a completely different hash.
> > See, for example, a single space changing SHA-1 hashes:
> > $ echo "hello" | sha1sum
> > f572d396fae9206628714fb2ce00f72e94f2258f  -
> > $ echo "hello " | sha1sum
> > 2beec0d1b3eeee6c25b1eacc1063338047d588e3  -
> >
> > What kind of "fuzzy matching" at the database level
> > would say that these two hashes are "near" each other?
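The SHA-1 comparison quoted above can also be reproduced without a shell. The following is only a minimal sketch using Python's standard `hashlib`; the helper names `sha1_hex` and `bit_distance` are made up for illustration and do not come from hashesDB. It hashes the same two inputs and counts how many bits differ between the digests, which is one way to see that there is no useful notion of "near" for a cryptographic hash.

```python
import hashlib


def sha1_hex(data: bytes) -> str:
    """Return the SHA-1 digest of data as a hex string."""
    return hashlib.sha1(data).hexdigest()


def bit_distance(hex_a: str, hex_b: str) -> int:
    """Count the bits that differ between two equal-length hex digests."""
    if len(hex_a) != len(hex_b):
        raise ValueError("digests must have the same length")
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


# `echo` appends a newline, so the shell inputs were b"hello\n" and b"hello \n".
plain = sha1_hex(b"hello\n")
spaced = sha1_hex(b"hello \n")

print(plain)
print(spaced)
print("differing bits:", bit_distance(plain, spaced), "of", len(plain) * 4)
```

For a well-behaved cryptographic hash one would expect roughly half of the 160 bits to differ, regardless of how small the change to the input was.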
> >
> > On Sat, Mar 27, 2021, at 17:25, Vassilis Xanthopoulos wrote:
> > > Hello again Alex,
> > >
> > > I understand there is a trade-off between facilitating the querying process as
> > > you said and keeping the space requirements for this service reasonable.
> > > If we think about the other extreme case, we end up storing all possible hashes
> > > (at least the ones in the multihash table) for a file. This will make our
> > > service very space demanding.
> > >
> > > I believe there is a sweet spot where we store the `N` most 'appropriate' hashes
> > > (whatever appropriate means in our context), where `N` is enough to give some
> > > flexibility to the user and not make our service cumbersome. Is this pre-decided
> > > or open for discussion?
> > >
> > > Also, regarding ssdeep and fuzzy hashing. We can always perform the fuzzy part
> > > of this process at the application level, but is there any merit in choosing a
> > > Database System that implements 'fuzzy querying'? It would be nice to be able to
> > > search the database in an 'edit-distance like way' where we select all entries
> > > with distance at most X from our input. I did some research on this idea and I
> > > found out there are some implementations of it, and maybe even ways to
> > > implement it in a normal SQL environment.
> > >
> > > Vassilis.
> > >
> > >
> > > On Mar 27 2021, at 12:32 pm, Alexios Zavras <zvr+eellak [ at ] zvr [ dot ] gr> wrote:
> > > > Hi Vassilis, thanks for your interest.
> > > >
> > > > The rationale for having multiple hash values in the database
> > > > is purely to facilitate querying.
> > > >
> > > > It's the difference between telling a user "ask for a hash value;
> > > > get the result" and "you should have/install the file; you should
> > > > have/install software to produce MY favorite hash; you should
> > > > compute this hash; you can query with this hash".
> > > > Keep in mind that the first 3 conditions may be actual steps
> > > > to be performed.
> > > > Why burden the user?
> > > >
> > > > To take your idea to an extreme, I can tell the user:
> > > > "oh, you have the file; compute its SWHID and
> > > > check at archive.softwareheritage.org"
> > > > No need for us to do anything; the functionality already exists. ;-)
> > > >
> > > > On Sat, Mar 27, 2021, at 00:06, Vassilis Xanthopoulos wrote:
> > > > > Greetings everyone,
> > > > >
> > > > > My name is Vassilis Xanthopoulos and I am an undergraduate student at
> > > > > the National Technical University of Athens currently pursuing a degree
> > > > > in Electrical and Computer Engineering. Reading through the project ideas
> > > > > for GSoC 2021, the hashesDB project caught my eye and I have a question
> > > > > about a certain aspect of it.
> > > > >
> > > > > Is there any benefit in storing multiple hashes in the database for a
> > > > > single file instead of just one? I have some possible answers in mind
> > > > > including:
> > > > >
> > > > > * Using an optimal hash function for each file type
> > > > >
> > > > > This doesn't require storing all hashes for all files though,
> > > > > just the right hash for each file
> > > > >
> > > > > * Collision detection
> > > > >
> > > > > I believe it's very improbable to find collisions in our data,
> > > > > provided we use appropriate hash functions (maybe it's even a feature,
> > > > > since we would like some locality properties in our hash functions).
> > > > > All around it feels like a long shot.
> > > > >
> > > > > * Providing the flexibility of using various hash functions to
> > > > > digest a file and query the database.
> > > > >
> > > > > I consider this option to be a nice-to-have feature rather than a
> > > > > reason on its own to add so much redundant information in the
> > > > > database.
> > > > >
> > > > > Thanks in advance,
> > > > > Vassilis.
> > > > >
> > > > >
> > > > > ----
> > > > > You are receiving this message from the list: A discussion list for
> > > > > student developers and mentors of Google Summer of Code projects,
> > > > > https://lists.ellak.gr/gsoc-developers/listinfo.html
> > > > > You can unsubscribe from the list by sending an empty email
> > > > > message to <gsoc-developers+unsubscribe [ at ] ellak [ dot ] gr
> > > > > <mailto:gsoc-developers%2Bunsubscribe%40ellak.gr>>.
> > > > >
> > > >
> > > > --
> > > > -- zvr -
> >
> > --
> > -- zvr -
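The idea raised in the thread, defining a custom distance function in a relational database and selecting every stored hash within a maximum distance X of the input, can be sketched with Python's standard `sqlite3` module. Everything specific here is an assumption made for illustration: the `files` table, its `path` and `fuzzy_hash` columns, the `find_near` helper, and the sample strings, which are placeholders and not real ssdeep output. Plain Levenshtein distance also stands in for ssdeep's own comparison, which is more involved.

```python
import sqlite3


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def find_near(conn, query_hash, max_distance):
    """Return (path, distance) rows whose stored hash is within max_distance."""
    rows = conn.execute(
        "SELECT path, edit_distance(fuzzy_hash, ?) AS d FROM files "
        "WHERE edit_distance(fuzzy_hash, ?) <= ? ORDER BY d, path",
        (query_hash, query_hash, max_distance),
    )
    return rows.fetchall()


conn = sqlite3.connect(":memory:")
# Register the Python function so SQL statements can call it by name.
conn.create_function("edit_distance", 2, edit_distance)

# Hypothetical schema and placeholder values, not real fuzzy hashes.
conn.execute("CREATE TABLE files (path TEXT, fuzzy_hash TEXT)")
conn.executemany(
    "INSERT INTO files VALUES (?, ?)",
    [("a.txt", "abcdefgh"), ("b.txt", "abcdefgx"), ("c.txt", "zzzzzzzz")],
)

print(find_near(conn, "abcdefgh", 2))
```

Note that the distance function is evaluated for every row, so a query like this is a full table scan. Whether that is acceptable depends on how many hashes the service ends up storing.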