EELLAK - Mailing Lists

Re: GSoC 2021: HashesDB

Subject: Re: GSoC 2021: HashesDB
From: Vassilis Xanthopoulos <xanthopoulos [ dot ] vassilis [ at ] gmail [ dot ] com>
Date: Sun, 28 Mar 2021 22:58:13 +0300

Hi Alex,

I'm sorry, I didn't phrase my second point correctly. Of course even a minor
change in the input of most hash functions, especially the ones intended to be
used in cryptographic applications, will completely change the output.
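
To make the point concrete, here is a minimal Python sketch that reproduces the
`echo ... | sha1sum` comparison from the quoted message and counts how many bits
differ between the two digests. The helper names `sha1_hex` and `bit_distance`
are made up for this illustration; only `hashlib` from the standard library is
used.

```python
import hashlib


def sha1_hex(data: bytes) -> str:
    """Return the SHA-1 digest of `data` as a hex string."""
    return hashlib.sha1(data).hexdigest()


def bit_distance(hex_a: str, hex_b: str) -> int:
    """Count the differing bits between two equal-length hex digests."""
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


# `echo` appends a newline, so these inputs mirror
# `echo "hello"` and `echo "hello "` respectively.
a = sha1_hex(b"hello\n")
b = sha1_hex(b"hello \n")

print(a)
print(b)
print(bit_distance(a, b), "of", len(a) * 4, "bits differ")
```

For a cryptographic hash one would expect roughly half of the 160 bits to
differ, so no distance measure over the digests says anything useful about how
similar the inputs were.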

I am talking about the case where we use a fuzzy hash function, like the spamsum
algorithm used by ssdeep, and store the resulting hashes in the database. We
could then query the database for stored hashes that are 'near' the input hash.

In the meantime, though, I found out that this can be done in most relational
databases: we define a custom distance function and specify an acceptable
maximum distance from our input.
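
As a rough sketch of what I mean, the snippet below registers a custom distance
function in SQLite through Python's `sqlite3` module and selects every stored
hash within a maximum distance of the query. Several things here are my own
assumptions for illustration: the table `fuzzy_hashes`, the function names
`edit_distance` and `find_near`, and the placeholder hash strings. Real ssdeep
signatures have the form `blocksize:hash1:hash2` and ssdeep compares them with
its own 0-100 similarity score, so plain Levenshtein distance is only a
stand-in.

```python
import sqlite3


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]


conn = sqlite3.connect(":memory:")
# Expose the Python function to SQL under the (hypothetical) name edit_distance.
conn.create_function("edit_distance", 2, levenshtein)
conn.execute("CREATE TABLE fuzzy_hashes (file_name TEXT, hash TEXT)")
conn.executemany(
    "INSERT INTO fuzzy_hashes VALUES (?, ?)",
    [
        ("a.txt", "abcdefgh"),  # placeholder values, not real ssdeep output
        ("b.txt", "abcdefgx"),
        ("c.txt", "zzzzzzzz"),
    ],
)


def find_near(conn: sqlite3.Connection, query_hash: str, max_distance: int) -> list:
    """Return file names whose stored hash is within max_distance of query_hash."""
    rows = conn.execute(
        "SELECT file_name FROM fuzzy_hashes "
        "WHERE edit_distance(hash, ?) <= ? ORDER BY file_name",
        (query_hash, max_distance),
    )
    return [row[0] for row in rows]


print(find_near(conn, "abcdefgh", 1))
```

Note that a query like this evaluates the distance function for every row, so
it is a full table scan; whether that is acceptable depends on how many hashes
we end up storing.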

On Mar 28 2021, at 9:37 pm, Alexios Zavras <zvr+eellak [ at ] zvr [ dot ] gr> wrote:
> Hi Vassilis
>
> Sure, as you say, the whole idea is to store N hashes.
> This does not affect the main project -- which should be
> configurable as to which hashes we want to keep.
>
> To your second point, I'm afraid this is not how hashes
> and fuzzy matching work. If we were to have, for example,
> a typical hash like SHA-512, a single bit change on the input
> (let's say a file), would produce a completely different hash.
> See, for example, how a single space changes the SHA-1 hash:
> $ echo "hello" | sha1sum
> f572d396fae9206628714fb2ce00f72e94f2258f -
> $ echo "hello " | sha1sum
> 2beec0d1b3eeee6c25b1eacc1063338047d588e3 -
>
> What kind of "fuzzy matching" at the database level
> would say that these two hashes are "near" each other?
>
> On Sat, Mar 27, 2021, at 17:25, Vassilis Xanthopoulos wrote:
> > Hello again Alex,
> >
> > I understand there is a trade-off between facilitating the querying process as
> > you said and keeping the space requirements for this service reasonable.
> > If we think about the other extreme case, we end up storing all possible hashes
> > (at least the ones in the multihash table) for a file. This will make our
> > service very space demanding.
> >
> > I believe there is a sweet spot where we store the `N` most 'appropriate' hashes
> > (whatever appropriate means in our context), where `N` is enough to give some
> > flexibility to the user and not make our service cumbersome. Is this pre-decided
> > or open for discussion?
> >
> > Also, regarding ssdeep and fuzzy hashing. We can always perform the fuzzy part
> > of this process in the application level, but is there any merit in choosing a
> > Database System that implements 'fuzzy querying'? It would be nice to be able to
> > search the database in an 'edit-distance like way' where we select all entries
> > with distance at most X from our input. I did some research on this idea and I
> > found out there are some implementations of it, and maybe even ways to
> > implement it in a normal SQL environment.
> >
> > Vassilis.
> >
> >
> > On Mar 27 2021, at 12:32 pm, Alexios Zavras <zvr+eellak [ at ] zvr [ dot ] gr> wrote:
> > > Hi Vassilis, thanks for your interest.
> > >
> > > The rationale for having multiple hash values in the database
> > > is purely to facilitate querying.
> > >
> > > It's the difference between telling a user "ask for a hash value;
> > > get the result" and "you should have/install the file; you should
> > > have/install software to produce MY favorite hash; you should
> > > compute this hash; you can query with this hash".
> > > Keep in mind that the first 3 conditions may be actual steps
> > > to be performed. Why burden the user?
> > >
> > > To take your idea to an extreme, I can tell the user:
> > > "oh, you have the file; compute its SWHID and
> > > check at archive.softwareheritage.org"
> > > No need for us to do anything; the functionality already exists. ;-)
> > >
> > > On Sat, Mar 27, 2021, at 00:06, Vassilis Xanthopoulos wrote:
> > > > Greetings everyone,
> > > >
> > > > My name is Vassilis Xanthopoulos and I am an undergraduate student at
> > > > the National Technical University of Athens, currently pursuing a degree
> > > > in Electrical and Computer Engineering. Reading through the project ideas
> > > > for GSoC 2021, the hashesDB project caught my eye and I have a question
> > > > about a certain aspect of it.
> > > >
> > > > Is there any benefit in storing multiple hashes in the database for a
> > > > single file instead of just one? I have some possible answers in mind,
> > > > including:
> > > >
> > > > * Using an optimal hash function for each file type
> > > >
> > > > This doesn't require storing all hashes for all files though,
> > > > just the right hash for each file
> > > >
> > > > * Collision detection
> > > >
> > > > I believe it's very improbable to find collisions in our data,
> > > > provided we use appropriate hash functions (maybe it's even a feature,
> > > > since we would like some locality properties in our hash functions).
> > > > All in all, it feels like a long shot.
> > > >
> > > > * Providing the flexibility of using various hash functions to
> > > > digest a file and query the database.
> > > >
> > > > I consider this option to be a nice-to-have feature rather than a
> > > > reason on its own to add so much redundant information to the
> > > > database.
> > > >
> > > > Thanks in advance,
> > > > Vassilis.
> > > >
> > > >
> > > > ----
> > > > You are receiving this message from the list: A discussion list for
> > > > student developers and mentors of Google Summer of Code projects,
> > > > https://lists.ellak.gr/gsoc-developers/listinfo.html
> > > > You can unsubscribe from the list by sending an empty email message
> > > > to <gsoc-developers+unsubscribe [ at ] ellak [ dot ] gr>.
> > > >
> > >
> > > --
> > > -- zvr -
>
> --
> -- zvr -
>

