Dear Mr. Karounos and Mr. Vidras,

I hope you are doing well. My name is Nektarios, and I am a third-year undergraduate student at the Department of Informatics and Telecommunications, National and Kapodistrian University of Athens. I am very interested in contributing to the "Cleaning of HPLT Greek v2 Dataset for GlossApi LLM" project as part of Google Summer of Code (GSoC) and would like to clarify some details before submitting my proposal.

Relevant Experience:

I have a strong background in Artificial Intelligence and Machine Learning through my coursework:

AI1 (Berkeley CS188 - The Pac-Man Projects): Implemented informed state-space search algorithms, reinforcement learning, and probabilistic models.

AI2 (Current Course):
First assignment: Developed a sentiment classifier on a Twitter dataset using logistic regression and TF-IDF, including EDA, text preprocessing, feature extraction, and a detailed report.
Upcoming assignments (to be completed before the GSoC coding phase): Implementing feed-forward/deep neural networks, bidirectional stacked RNNs, and fine-tuning GreekBERT (Hugging Face).

Additionally, I am currently familiarizing myself with the FineWeb v1+ methodology and exploring the glossAPI GitHub repository to better understand how this project fits into the existing ecosystem.

I wanted to ask for your advice on the following:

1. Should the FineWeb methodology be implemented as a standalone process, or as part of glossAPI’s source code?
2. Regarding the GSoC project duration, would this task be categorized as an Intermediate (175 hours) or Hard (350 hours) project?
3. Will the dataset cleaning process focus on the deduplicated (648GB) or cleaned (505GB) file of the HPLT Greek Dataset?

Thank you for your time and for this opportunity. I would greatly appreciate your insights, and I look forward to the possibility of contributing to this project.

Best regards,
Nektarios Tsimpourakis-Pavlakos
