EELLAK - Mailing Lists

Re: GSoC 2021: HashesDB

Subject: Re: GSoC 2021: HashesDB
From: "Alexios Zavras" <zvr+eellak [ at ] zvr [ dot ] gr>
Date: Sun, 28 Mar 2021 20:37:22 +0200

Hi Vassilis

Sure, as you say, the whole idea is to store N hashes.
This does not affect the main project, which should let us
configure which hashes we want to keep.
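
As a rough sketch of what "configurable" could mean here, assuming Python's
hashlib and a made-up setting called CONFIGURED_HASHES (nothing like this has
been decided for the project), the N digests of a file can be computed in a
single pass over its contents:

```python
import hashlib
import io

# Hypothetical setting: which digests to keep per file.
# Each name must be an algorithm that hashlib provides.
CONFIGURED_HASHES = ("sha1", "sha256", "sha512")


def compute_hashes(stream, algorithms=CONFIGURED_HASHES, chunk_size=65536):
    """Read a binary stream once and return {algorithm: hex digest}."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        # Feed the same chunk to every configured hash function.
        for hasher in hashers.values():
            hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


if __name__ == "__main__":
    # Same bytes as `echo "hello"`, which appends a newline.
    for name, digest in compute_hashes(io.BytesIO(b"hello\n")).items():
        print(name, digest)
```

Changing the tuple changes what is stored; a lookup could then accept any of
the configured digests as the query key.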

To your second point, I'm afraid this is not how hashes
and fuzzy matching work. With a typical cryptographic hash like
SHA-512, a single-bit change in the input (let's say a file)
produces a completely different hash.
See, for example, how a single added space changes the SHA-1 hash:
$ echo "hello" | sha1sum 
f572d396fae9206628714fb2ce00f72e94f2258f  -
$ echo "hello " | sha1sum 
2beec0d1b3eeee6c25b1eacc1063338047d588e3  -

What kind of "fuzzy matching" at the database level
would say that these two hashes are "near" each other?
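
To put a number on that, here is a small sketch (Python with hashlib; the
helper name differing_bits is made up for illustration) that counts how many
of the 160 bits differ between the two SHA-1 digests above. For a
cryptographic hash one would expect roughly half of them to differ, however
small the change in the input:

```python
import hashlib


def differing_bits(hex_a, hex_b):
    """Hamming distance in bits between two equal-length hex digests."""
    if len(hex_a) != len(hex_b):
        raise ValueError("digests must have the same length")
    # XOR leaves a 1 wherever the two digests disagree.
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


if __name__ == "__main__":
    # echo appends a newline, so the hashed inputs are "hello\n" and "hello \n".
    a = hashlib.sha1(b"hello\n").hexdigest()
    b = hashlib.sha1(b"hello \n").hexdigest()
    print(a)
    print(b)
    print(differing_bits(a, b), "of", len(a) * 4, "bits differ")
```

A distance measure over the stored digests therefore says nothing about how
similar the original files were.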

On Sat, Mar 27, 2021, at 17:25, Vassilis Xanthopoulos wrote:
> Hello again Alex,
> 
> I understand there is a trade-off between facilitating the querying process as
> you said and keeping the space requirements for this service reasonable. 
> If we think about the other extreme case, we end up storing all possible hashes 
> (at least the ones in the multihash table) for a file. This will make our 
> service very space demanding.
> 
> I believe there is a sweet spot where we store the `N` most 'appropriate' hashes
> (whatever appropriate means in our context), where `N` is enough to give some
> flexibility to the user and not make our service cumbersome. Is this pre-decided
> or open for discussion?
> 
> Also, regarding ssdeep and fuzzy hashing: we can always perform the fuzzy part
> of this process at the application level, but is there any merit in choosing a
> database system that implements 'fuzzy querying'? It would be nice to be able to
> search the database in an 'edit-distance-like' way, where we select all entries
> within distance X of our input. I did some research on this idea and
> found that there are some implementations of it, and maybe even ways to
> implement it in a normal SQL environment.
> 
> Vassilis.
> 
> 
> On Mar 27 2021, at 12:32 pm, Alexios Zavras <zvr+eellak [ at ] zvr [ dot ] gr> wrote:
> > Hi Vassilis, thanks for your interest.
> > 
> > The rationale for having multiple hash values in the database
> > is purely to facilitate querying.
> > 
> > It's the difference between telling a user "ask for a hash value;
> > get the result" and "you should have/install the file; you should
> > have/install software to produce MY favorite hash; you should
> > compute this hash; you can query with this hash".
> > Keep in mind that the first 3 conditions may be actual steps
> > to be performed. Why burden the user?
> > 
> > To take your idea to an extreme, I can tell the user:
> > "oh, you have the file; compute its SWHID and
> > check at archive.softwareheritage.org"
> > No need for us to do anything; the functionality already exists. ;-)
> > 
> > On Sat, Mar 27, 2021, at 00:06, Vassilis Xanthopoulos wrote:
> > > Greetings everyone,
> > >
> > > My name is Vassilis Xanthopoulos and I am an undergraduate student at
> > > the National Technical University of Athens, currently pursuing a degree
> > > in Electrical and Computer Engineering. Reading through the project ideas
> > > for GSoC 2021, the hashesDB project caught my eye and I have a question
> > > about a certain aspect of it.
> > >
> > > Is there any benefit in storing multiple hashes in the database for a
> > > single file instead of just one? I have some possible answers in mind,
> > > including:
> > >
> > > * Using an optimal hash function for each file type
> > >
> > > This doesn't require storing all hashes for all files though,
> > > just the right hash for each file
> > >
> > > * Collision detection
> > >
> > > I believe it's very improbable to find collisions in our data,
> > > provided we use appropriate hash functions (maybe it's even a feature,
> > > since we would like some locality properties in our hash functions).
> > > All in all, it feels like a long shot.
> > >
> > > * Providing the flexibility of using various hash functions to
> > > digest a file and query the database.
> > >
> > > I consider this option to be a nice-to-have feature rather than a
> > > reason on its own to add so much redundant information to the
> > > database.
> > >
> > > Thanks in advance,
> > > Vassilis.
> > >
> > >
> > > ----
> > > You are receiving this message from the list: A discussion list for
> > > student developers and mentors of Google Summer of Code projects,
> > > https://lists.ellak.gr/gsoc-developers/listinfo.html
> > > You can unsubscribe from the list by sending an empty email
> > > to <gsoc-developers+unsubscribe [ at ] ellak [ dot ] gr>.
> > >
> > 
> > --
> > -- zvr -

-- 
-- zvr -

