Hello again Alex,

I understand there is a trade-off between facilitating the querying process, as you said, and keeping the space requirements for this service reasonable. At the other extreme, we would end up storing every possible hash (at least the ones in the multihash table) for each file, which would make our service very space-demanding. I believe there is a sweet spot where we store the `N` most 'appropriate' hashes (whatever appropriate means in our context), with `N` large enough to give the user some flexibility but not so large that the service becomes cumbersome. Is this pre-decided or open for discussion?

Also, regarding ssdeep and fuzzy hashing: we can always perform the fuzzy part of this process at the application level, but is there any merit in choosing a database system that implements 'fuzzy querying'? It would be nice to be able to search the database in an 'edit-distance like' way, where we select all entries with distance at most X from our input. I did some research on this idea and found that there are some implementations of it, and maybe even ways to implement it in a normal SQL environment.

Vassilis.

On Mar 27 2021, at 12:32 pm, Alexios Zavras <zvr+eellak [ at ] zvr [ dot ] gr> wrote:
> Hi Vassilis, thanks for your interest.
>
> The rationale for having multiple hash values in the database
> is purely to facilitate querying.
>
> It's the difference between telling a user "ask for a hash value;
> get the result" and "you should have/install the file; you should
> have/install software to produce MY favorite hash; you should
> compute this hash; you can query with this hash".
> Keep in mind that the first 3 conditions may be actual steps
> to be performed. Why burden the user?
>
> To take your idea to an extreme, I can tell the user:
> "oh, you have the file; compute its SWHID and
> check at archive.softwareheritage.org"
> No need for us to do anything; the functionality is already existing.
> ;-)
>
> On Sat, Mar 27, 2021, at 00:06, Vassilis Xanthopoulos wrote:
> > Greetings everyone,
> >
> > My name is Vassilis Xanthopoulos and I am an undergraduate student at
> > the National Technical University of Athens currently pursuing a degree
> > in Electrical and Computer Engineering. Reading through the project ideas
> > for GSoC 2021, the hashesDB project caught my eye and I have a question
> > about a certain aspect of it.
> >
> > Is there any benefit in storing multiple hashes in the database for a
> > single file instead of just one? I have some possible answers in mind
> > including
> >
> > * Using an optimal hash function for each file type
> >
> >   This doesn't require storing all hashes for all files though,
> >   just the right hash for each file
> >
> > * Collision detection
> >
> >   I believe it's very improbable to find collisions in our data,
> >   provided we use appropriate hash functions (maybe it's even a feature,
> >   since we would like some locality properties in our hash functions).
> >   All around it feels like a long shot.
> >
> > * Providing the flexibility of using various hash functions to
> >   digest a file and query the database.
> >
> >   I consider this option to be a nice-to-have feature rather than a
> >   reason on its own to add so much redundant information in the
> >   database.
> >
> > Thanks in advance,
> > Vassilis.
> >
> > ----
> > You are receiving this message from the list: A discussion list for
> > student developers and mentors of Google Summer of Code projects,
> > https://lists.ellak.gr/gsoc-developers/listinfo.html
> > You can unsubscribe from the list by sending an empty email
> > message to <gsoc-developers+unsubscribe [ at ] ellak [ dot ] gr>.
>
> --
> -- zvr -
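
To make the 'distance at most X' query raised at the top of this message a bit more concrete, here is a minimal sketch of one way it might look in a normal SQL environment. Everything in it is an assumption for illustration: it uses SQLite through Python's `sqlite3` module, a hypothetical `file_hashes` table with one row per (file, algorithm) pair, and a plain Levenshtein distance registered as a user-defined SQL function. Plain edit distance over digest strings is only a stand-in here; it is not how ssdeep itself scores similarity.

```python
import sqlite3


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical multi-hash table."""
    conn = sqlite3.connect(":memory:")
    # Expose the Python function to SQL under the name edit_distance.
    conn.create_function("edit_distance", 2, edit_distance)
    conn.execute(
        "CREATE TABLE file_hashes (file_name TEXT, algorithm TEXT, digest TEXT)"
    )
    return conn


def fuzzy_lookup(conn, algorithm, digest, max_distance):
    """Return (file_name, digest) rows whose digest is within max_distance."""
    rows = conn.execute(
        "SELECT file_name, digest FROM file_hashes "
        "WHERE algorithm = ? AND edit_distance(digest, ?) <= ? "
        "ORDER BY file_name",
        (algorithm, digest, max_distance),
    )
    return rows.fetchall()
```

Calling `fuzzy_lookup(conn, "ssdeep", some_digest, 2)` would then select every stored digest of that algorithm within distance 2 of the input. Note that this approach evaluates the function on every candidate row, so it is a full scan per query; a database system with native fuzzy matching or suitable indexing might scale better, which is part of the question above.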