EELLAK - Mailing Lists

Interest and Questions regarding: "Cleaning of HPLT Greek v2 Dataset for GlossApi LLM"

Dear Mr. Karounos and Mr. Vidras,

I hope you are doing well. My name is Nektarios, and I am a third-year
undergraduate student at the Department of Informatics and
Telecommunications, National and Kapodistrian University of Athens. I am
very interested in contributing to the "Cleaning of HPLT Greek v2 Dataset
for GlossApi LLM" project as part of Google Summer of Code (GSoC) and would
like to clarify some details before submitting my proposal.

Relevant Experience:

I have a strong background in Artificial Intelligence and Machine Learning
through my coursework:

- AI1 (Berkeley CS188 - The Pac-Man Projects): Implemented informed
  state-space search algorithms, reinforcement learning, and probabilistic
  models.
- AI2 (Current Course):
  - First assignment: Developed a sentiment classifier on a Twitter dataset
    using logistic regression and TF-IDF, including EDA, text
    preprocessing, feature extraction, and a detailed report.
  - Upcoming assignments (to be completed before the GSoC coding phase):
    Implementing feed-forward/deep neural networks, bidirectional stacked
    RNNs, and fine-tuning GreekBERT (Hugging Face).

Additionally, I am currently familiarizing myself with the FineWeb v1+
methodology and exploring the glossAPI GitHub repository to better
understand how this project fits into the existing ecosystem.

I wanted to ask for your advice on the following:

1. Should the FineWeb methodology be implemented as a standalone process,
   or as part of glossAPI’s source code?
2. Regarding the GSoC project duration, would this task be categorized as
   an Intermediate (175 hours) or Hard (350 hours) project?
3. Will the dataset cleaning process focus on the deduplicated (648GB) or
   cleaned (505GB) file of the HPLT Greek Dataset?

Thank you for your time and for this opportunity. I would greatly
appreciate your insights, and I look forward to the possibility of
contributing to this project.

Best regards,
Nektarios Tsimpourakis-Pavlakos
----
You are receiving this message from the list: A discussion list for student developers and mentors of Google Summer of Code projects,
https://lists.ellak.gr/gsoc-developers/listinfo.html
You can unsubscribe from the list by sending an empty email to <gsoc-developers+unsubscribe [ at ] ellak [ dot ] gr>.