ΕΕΛΛΑΚ - Mailing Lists

Re: GSoC 2021: HashesDB

Subject: Re: GSoC 2021: HashesDB
From: Vassilis Xanthopoulos <xanthopoulos [ dot ] vassilis [ at ] gmail [ dot ] com>
Date: Tue, 13 Apr 2021 17:15:52 +0300

Hello Alex,

I'm sorry for the (very) late draft proposal. I would appreciate some comments on it before I submit my final proposal.
Thanks in advance.

On Mar 28 2021, at 10:58 pm, Vassilis Xanthopoulos <xanthopoulos [ dot ] vassilis [ at ] gmail [ dot ] com> wrote:
> Hi Alex,
>
> I'm sorry, I didn't phrase my second point correctly. Of course even a minor
> change in the input of most hash functions, especially the ones intended to be
> used in cryptographic applications, will completely change the output.
>
> I am talking about the case where we use a fuzzy hash function, like the
> spamsum algorithm used by ssdeep, and store the resulting hashes in the
> database. Then we could query the database to find stored hashes 'near' the
> input hash.
>
> In the meantime, though, I found out that most relational databases support
> this: we can define a custom distance function and specify an acceptable
> maximum distance from our input.
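
A minimal sketch of the "custom distance function plus maximum distance" idea, using SQLite through Python's standard `sqlite3` module. The table name `fuzzy_hashes`, the helper `near`, and the sample digests are hypothetical, and plain Levenshtein distance stands in for ssdeep's own comparison, which is more involved than a raw edit distance.

```python
import sqlite3


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


conn = sqlite3.connect(":memory:")
# Register the Python function so it can be called from SQL.
conn.create_function("edit_distance", 2, edit_distance)
conn.execute("CREATE TABLE fuzzy_hashes (file_name TEXT, digest TEXT)")
conn.executemany(
    "INSERT INTO fuzzy_hashes VALUES (?, ?)",
    # Hypothetical digests, chosen only to make the distances obvious.
    [("a.txt", "ABCDEFGH"), ("b.txt", "ABCDXFGH"), ("c.txt", "ZZZZZZZZ")],
)


def near(conn, digest, max_distance):
    """Return file names whose stored digest is within max_distance edits."""
    rows = conn.execute(
        "SELECT file_name FROM fuzzy_hashes "
        "WHERE edit_distance(digest, ?) <= ? ORDER BY file_name",
        (digest, max_distance),
    )
    return [name for (name,) in rows]


print(near(conn, "ABCDEFGH", 1))
```

Note that a query like this evaluates the distance function for every row, so it is a full table scan; whether that is acceptable depends on the size of the database.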
>
> On Mar 28 2021, at 9:37 pm, Alexios Zavras <zvr+eellak [ at ] zvr [ dot ] gr> wrote:
> > Hi Vassilis
> >
> > Sure, as you say, the whole idea is to store N hashes.
> > This does not affect the main project -- which should be
> > configurable with respect to the hashes that we want to keep.
> >
> > To your second point, I'm afraid this is not how hashes
> > and fuzzy matching work. If we were to have, for example,
> > a typical hash like SHA-512, a single bit change on the input
> > (let's say a file), would produce a completely different hash.
> > See, for example, how a single space changes the SHA-1 hash:
> > $ echo "hello" | sha1sum
> > f572d396fae9206628714fb2ce00f72e94f2258f -
> > $ echo "hello " | sha1sum
> > 2beec0d1b3eeee6c25b1eacc1063338047d588e3 -
> >
> > What kind of "fuzzy matching" at the database level
> > would say that these two hashes are "near" each other?
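
The same comparison can be reproduced with Python's standard `hashlib`, as a rough sketch. The helper names `sha1_hex` and `hamming_bits` are hypothetical; the bit count is only there to show that the two digests share no useful "nearness".

```python
import hashlib


def sha1_hex(data: bytes) -> str:
    """Hex-encoded SHA-1 digest of the given bytes."""
    return hashlib.sha1(data).hexdigest()


def hamming_bits(hex_a: str, hex_b: str) -> int:
    """Number of differing bits between two equal-length hex digests."""
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


# `echo` appends a newline, so the shell inputs are b"hello\n" and b"hello \n".
a = sha1_hex(b"hello\n")
b = sha1_hex(b"hello \n")

print(a)
print(b)
print(hamming_bits(a, b), "of 160 bits differ")
```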
> >
> > On Sat, Mar 27, 2021, at 17:25, Vassilis Xanthopoulos wrote:
> > > Hello again Alex,
> > >
> > > I understand there is a trade-off between facilitating the querying process as
> > > you said and keeping the space requirements for this service reasonable.
> > > If we think about the other extreme case, we end up storing all possible hashes
> > > (at least the ones in the multihash table) for a file. This will make our
> > > service very space demanding.
> > >
> > > I believe there is a sweet spot where we store the `N` most 'appropriate' hashes
> > > (whatever appropriate means in our context), where `N` is enough to give some
> > > flexibility to the user and not make our service cumbersome. Is this pre-decided
> > > or open for discussion?
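
To illustrate storing `N` configured hashes per file, here is a small sketch that computes several digests in a single pass over the data with Python's `hashlib`. The tuple `CONFIGURED_ALGORITHMS` and the function `digest_stream` are hypothetical names; which algorithms to keep is exactly the open configuration question above.

```python
import hashlib
import io

# Hypothetical configuration: the N algorithms the service chooses to store.
CONFIGURED_ALGORITHMS = ("sha1", "sha256", "sha512")


def digest_stream(stream, algorithms=CONFIGURED_ALGORITHMS, chunk_size=65536):
    """Read the stream once and return {algorithm: hex digest}."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    # Feed every chunk to every hasher so the input is read only once.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        for hasher in hashers.values():
            hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


digests = digest_stream(io.BytesIO(b"hello\n"))
for name, value in digests.items():
    print(name, value)
```

Reading the input once keeps the cost of adding another algorithm to CPU time and a few dozen bytes of storage per file, which is the trade-off being discussed.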
> > >
> > > Also, regarding ssdeep and fuzzy hashing. We can always perform the fuzzy part
> > > of this process in the application level, but is there any merit in choosing a
> > > Database System that implements 'fuzzy querying'? It would be nice to be able to
> > > search the database in an 'edit-distance like way' where we select all entries
> > > with distance at most X from our input. I did some research on this idea and I
> > > found out there are some implementations of it, and maybe even ways to
> > > implement it in a normal SQL environment.
> > >
> > > Vassilis.
> > >
> > >
> > > On Mar 27 2021, at 12:32 pm, Alexios Zavras <zvr+eellak [ at ] zvr [ dot ] gr> wrote:
> > > > Hi Vassilis, thanks for your interest.
> > > >
> > > > The rationale for having multiple hash values in the database
> > > > is purely to facilitate querying.
> > > >
> > > > It's the difference between telling a user "ask for a hash value;
> > > > get the result" and "you should have/install the file; you should
> > > > have/install software to produce MY favorite hash; you should
> > > > compute this hash; you can query with this hash".
> > > > Keep in mind that the first 3 conditions may be actual steps
> > > > to be performed. Why burden the user?
> > > >
> > > > To take your idea to an extreme, I can tell the user:
> > > > "oh, you have the file; compute its SWHID and
> > > > check at archive.softwareheritage.org"
> > > > No need for us to do anything; the functionality already exists. ;-)
> > > >
> > > > On Sat, Mar 27, 2021, at 00:06, Vassilis Xanthopoulos wrote:
> > > > > Greetings everyone,
> > > > >
> > > > > My name is Vassilis Xanthopoulos and I am an undergraduate student at
> > > > > the National Technical University of Athens, currently pursuing a degree
> > > > > in Electrical and Computer Engineering. Reading through the project ideas
> > > > > for GSoC 2021, the hashesDB project caught my eye and I have a question
> > > > > about a certain aspect of it.
> > > > >
> > > > > Is there any benefit in storing multiple hashes in the database for a
> > > > > single file instead of just one? I have some possible answers in mind,
> > > > > including:
> > > > >
> > > > > * Using an optimal hash function for each file type
> > > > >
> > > > > This doesn't require storing all hashes for all files though,
> > > > > just the right hash for each file
> > > > >
> > > > > * Collision detection
> > > > >
> > > > > I believe it's very improbable to find collisions in our data,
> > > > > provided we use appropriate hash functions (maybe it's even a feature,
> > > > > since we would like some locality properties in our hash functions).
> > > > > All in all, it feels like a long shot.
> > > > >
> > > > > * Providing the flexibility of using various hash functions to
> > > > > digest a file and query the database.
> > > > >
> > > > > I consider this option to be a nice-to-have feature rather than a
> > > > > reason on its own to add so much redundant information to the
> > > > > database.
> > > > >
> > > > > Thanks in advance,
> > > > > Vassilis.
> > > > >
> > > > >
> > > > > ----
> > > > > You are receiving this message from the list: A discussion list for
> > > > > student developers and mentors of Google Summer of Code projects,
> > > > > https://lists.ellak.gr/gsoc-developers/listinfo.html
> > > > > You can unsubscribe from the list by sending an empty email message
> > > > > to <gsoc-developers+unsubscribe [ at ] ellak [ dot ] gr>.
> > > > >
> > > >
> > > > --
> > > > -- zvr -
> >
> > --
> > -- zvr -
> >
>
