EELLAK - Mailing Lists

Re: [GSoC 2021]Questions about HashesDB

Subject: Re: [GSoC 2021]Questions about HashesDB
From: "Alexios Zavras" <zvr+eellak [ at ] zvr [ dot ] gr>
Date: Tue, 16 Mar 2021 17:23:24 +0100

1. The project is completely open to what libraries/frameworks
will be used. If you feel that SQLAlchemy is a wrong decision,
feel free to ignore it. If you've reached this decision after careful
consideration, please include your arguments in your proposal;
they will help the evaluation.

2. That's a nice example of thinking about the problem
and realizing that something not described would be helpful.
Congratulations!
Now, think about it: what would be more helpful: storing this info
in the database itself, or writing external log files? How will this info
be used? How *can* it be used, depending on your decision
about storage?
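
To make the trade-off concrete, here is a minimal sketch of the
"inside the database" option using Python's built-in sqlite3 module.
The table and column names (scan, file, started_at, root_path) and the
helper functions are assumptions made up for illustration; they are
not part of any agreed hashesDB design.

```python
import sqlite3

# Hypothetical schema: every file row remembers which scan inserted it.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE scan (
    id         INTEGER PRIMARY KEY,
    started_at TEXT NOT NULL,
    root_path  TEXT NOT NULL
);
CREATE TABLE file (
    id      INTEGER PRIMARY KEY,
    scan_id INTEGER NOT NULL REFERENCES scan(id),
    path    TEXT NOT NULL
);
""")


def record_scan(conn, started_at, root_path, paths):
    """Insert one scan row plus one file row per scanned path."""
    cur = conn.execute(
        "INSERT INTO scan (started_at, root_path) VALUES (?, ?)",
        (started_at, root_path),
    )
    scan_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO file (scan_id, path) VALUES (?, ?)",
        [(scan_id, p) for p in paths],
    )
    conn.commit()
    return scan_id


def files_added_by(conn, scan_id):
    """Return the paths recorded by a given scan, sorted by path."""
    rows = conn.execute(
        "SELECT path FROM file WHERE scan_id = ? ORDER BY path", (scan_id,)
    )
    return [row[0] for row in rows]


first = record_scan(conn, "2021-03-16T17:00:00", "/archive/a",
                    ["/archive/a/x.bin", "/archive/a/y.bin"])
second = record_scan(conn, "2021-03-17T09:00:00", "/archive/b",
                     ["/archive/b/z.bin"])
print(files_added_by(conn, first))
```

With the history stored this way, "which files came from which scan"
becomes an ordinary SQL query; with an external log file the same
question would require parsing the log.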

3. The exploration and evaluation of different hashing functions
is very important. Once again, it's perfectly acceptable to say
in your proposals "we will NOT be doing X for such-and-such reasons". 
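
When weighing the varint cost raised in question 3, it may help to see
how small the multihash prefix is. The sketch below assumes the layout
from the multihash specification (function code as an unsigned varint,
then digest length as an unsigned varint, then the digest) and uses
sha2-256, whose code is 0x12. The helper names are hypothetical, and
no code is assumed for ssdeep.

```python
import hashlib

SHA2_256 = 0x12  # function code for sha2-256 in the multihash table


def encode_varint(n):
    """Encode a non-negative integer as an unsigned varint (7 bits per byte)."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf, pos=0):
    """Decode an unsigned varint starting at pos; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def multihash_encode(code, digest):
    """Prefix a raw digest with its function code and length."""
    return encode_varint(code) + encode_varint(len(digest)) + digest


def multihash_decode(mh):
    """Split a multihash into (function code, raw digest)."""
    code, pos = decode_varint(mh)
    length, pos = decode_varint(mh, pos)
    digest = mh[pos:pos + length]
    if len(digest) != length:
        raise ValueError("truncated multihash")
    return code, digest


digest = hashlib.sha256(b"hello").digest()
mh = multihash_encode(SHA2_256, digest)
code, recovered = multihash_decode(mh)
```

Separating the digest from the prefix means decoding two varints per
stored value, regardless of how long the digest is. Whether that is
significant for similarity checks is something to measure rather than
assume.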

4. Please read the official rules for GSoC; the answer to your question
is there:
https://developers.google.com/open-source/gsoc/faq#can_i_submit_more_than_one_proposal


[extra] You do mention "more columns [...] for new hash function values".
You should explore whether the database design could accommodate
new hash functions without more columns. Again, describing your thoughts
in your proposal will show you have considered the issue (whatever
your final decision may be).
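
One possible direction, sketched with Python's built-in sqlite3 and
hashlib modules, is to store one row per (file, hash function) pair
instead of one column per hash function. All table, column, and
function names below are hypothetical and only illustrate the idea;
other designs may well be better.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hash_function (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE file_hash (
    path        TEXT NOT NULL,
    function_id INTEGER NOT NULL REFERENCES hash_function(id),
    digest      TEXT NOT NULL,
    PRIMARY KEY (path, function_id)
);
""")


def add_hash(conn, path, function_name, digest):
    """Register the function name if it is new, then store the digest as a row."""
    conn.execute(
        "INSERT OR IGNORE INTO hash_function (name) VALUES (?)",
        (function_name,),
    )
    (function_id,) = conn.execute(
        "SELECT id FROM hash_function WHERE name = ?", (function_name,)
    ).fetchone()
    conn.execute(
        "INSERT INTO file_hash (path, function_id, digest) VALUES (?, ?, ?)",
        (path, function_id, digest),
    )


def functions_for(conn, path):
    """Return the names of the hash functions stored for a path, sorted."""
    rows = conn.execute(
        """SELECT hf.name FROM file_hash fh
           JOIN hash_function hf ON hf.id = fh.function_id
           WHERE fh.path = ? ORDER BY hf.name""",
        (path,),
    )
    return [row[0] for row in rows]


data = b"example file contents"
add_hash(conn, "a.bin", "sha256", hashlib.sha256(data).hexdigest())
add_hash(conn, "a.bin", "sha512", hashlib.sha512(data).hexdigest())
# A function added later needs only new rows; the schema is unchanged.
add_hash(conn, "a.bin", "blake2b", hashlib.blake2b(data).hexdigest())

columns = [row[1] for row in conn.execute("PRAGMA table_info(file_hash)")]
print(functions_for(conn, "a.bin"), columns)
```

In this layout no ALTER statement is needed when a new hash function
appears, at the cost of a join when reading the function names back.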


On Tue, Mar 16, 2021, at 00:29, Giorgos wrote:
> Hello,
> 
> I have some questions that occurred to me while studying the suggested material.
> 
> 1) While going through past successful GFOSS projects, I noticed that 
> SQLAlchemy was used in one of those. This made me think about using 
> SQLAlchemy in hashesdb. However, after reading more about this library, 
> I reached the conclusion that it would be better if we avoided the use 
> of SQLAlchemy (or of any object-relational mapper altogether), because:
> 
>  * I believe it would be helpful if we allowed our users to add more 
> columns (either for metadata or for new hash function values) after the 
> creation of the database. As far as I know, SQLAlchemy does not 
> support ALTER statements directly, which means we would have to use an 
> additional library such as Alembic to achieve this result. This would 
> increase both the complexity and the dependencies of our project. 
>  * As was discussed earlier, SQLite fits our needs, even when it 
> comes to large-scale projects, so we don't have to worry about easy 
> transition to another SQL database.
>  * Finally, it may impact the performance of our tool, when it comes to 
> large archives.
> I would like to hear your opinion about this design decision.
> 
> 2) Should we keep a history of the scans that we perform to populate the 
> database? If the answer is yes, would a simple log file be enough, or 
> should we keep a detailed list of the scans we performed inside the 
> database?
> 
> 3) I noticed that the ssdeep hash function doesn't have a fixed digest 
> size. Of course we can still use the "Multihash" format for the ssdeep 
> hashes, however I have the following concerns. My primary concern is 
> that using the multihash format for fuzzy hashes would impact the 
> performance of our similarity checking feature, since we would have to 
> separate the hash value from the rest of the multihash, a task that 
> involves calculating varints (possibly a lot of them). A secondary 
> concern is that we would have to add the ssdeep function to the 
> function table, in order to assign a number to it. However, other 
> multihash users may not be familiar with our custom table. On the other 
> hand, using the multihash format for regular hashing and not using it 
> for fuzzy hashing would lead to ambiguity. What do you think?
> 
> 4) In addition to these technical questions, I would like to know if 
> the GSoC rules allow us to submit multiple proposals to the same 
> organization, since there is another GFOSS-related project that caught 
> my interest and I would like to apply for it too, if this is acceptable.
> 
> Right now I feel comfortable with the provided material. Although I 
> suppose this is not directly related to the project, I am currently 
> trying to get a better grasp of fuzzy hashing in general. At the same 
> time I am trying to write a solid proposal for this project. For these 
> reasons I may start coding for this project a little bit later.
> 
> Thank you for your time,
> 
> Giorgos
> 
> On 4/3/21 12:52 PM, Alexios Zavras wrote:
> > Once you're OK, let's have a discussion on the design
> > before you start writing code.
> >
> > On Wed, Mar 3, 2021, at 10:37, Giorgos wrote:
> > > Hello,
> > >
> > > I am currently studying the resources included in the wiki (multihash,
> > > ssdeep), while simultaneously thinking about the database schema.
> > > Once I feel comfortable with the material, should I create a new GitHub
> > > repository for the project and start committing code?
> > >
> > > Thanks in advance,
> > > Giorgos
> 
> > > On 1/2/21 7:29 PM, Alexios Zavras wrote:
> > > > Thanks for your interest in getting involved with hashesDB.
> > > >
> > > > Please note that the db keeps only meta-information on files.
> > > > Each hash is a few bytes; at 250 bytes per file, 1 billion files
> > > > amount to 250 GB (nowadays available on a USB stick).
> > > > Even if we had that much data (which would be a nice problem to have),
> > > > one could partition the data, keeping different hashes
> > > > in different SQLite data files.
> > > >
> > > > The advantages of SQLite are obviously the single-file data store
> > > > and the serverless implementation.
> > > > I believe the ease of deployment it offers outweighs
> > > > any potential reservations about scale.
> > > >
> > > > To your last point, of course I'd welcome any involvement
> > > > in the project whenever possible.
> > > > However, it should be clear that any work outside the GSoC framework
> > > > will not affect any decisions related to GSoC.
> 
> 
> > > > On Mon, Feb 1, 2021, at 16:41, Giorgos wrote:
> > > > > Hello everyone,
> > > > >
> > > > > I am Giorgos Kosmas and I am a third-year undergraduate student at
> > > > > the National Technical University of Athens, studying towards a
> > > > > degree in Electrical and Computer Engineering.
> > > > >
> > > > > I am interested in participating in Google Summer of Code during the
> > > > > summer. Particularly, I would like to get involved in the development
> > > > > of the "hashesDB" project.
> > > > >
> > > > > I have some clarifying questions regarding the library in which the
> > > > > database will be implemented. I am familiar with SQLite, however I am
> > > > > not sure if SQLite is appropriate for this project, due to the fact
> > > > > that SQLite is a single-disk database. Given the fact that this
> > > > > project should support huge archives of files, is it likely that the
> > > > > total size of our database exceeds the capacity of a single disk? In
> > > > > this case the user would be required to buy a larger and more
> > > > > expensive hard disk, instead of just buying a much cheaper disk that
> > > > > would provide the additional storage space required.
> > > > >
> > > > > I would like to begin contributing to this project as soon as the
> > > > > fall exam period ends (late February), before the announcement of the
> > > > > accepted mentoring organizations. If this is possible, I would be
> > > > > glad to arrange a Skype call in order to discuss further technical
> > > > > decisions in detail.
> > > > >
> > > > > Thanks in advance,
> > > > >
> > > > > Giorgos
>

-- 
-- zvr -

----
You are receiving this message from the list: A discussion list for student developers and mentors of Google Summer of Code projects.
https://lists.ellak.gr/gsoc-developers/listinfo.html
You can unsubscribe from the list by sending an empty email to <gsoc-developers+unsubscribe [ at ] ellak [ dot ] gr>.
