Hi GFOSS community,

I'm Khushi Agrawal, an undergraduate student at DSCE, Bangalore, India, and I'm writing to share my GSoC 2026 proposal for the GlossAPI project for your feedback.

My proposal focuses on building an anonymization layer (Corpus.anonymize()) directly into the GlossAPI pipeline. Right now, highly valuable NLP datasets (like the ~300GB of OpenArchives data from GSoC 2025) are blocked from public release because they contain PII such as Greek Tax IDs and personal contact details. My goal is to mask this data safely without breaking downstream structural classifiers, making the datasets GDPR-compliant so they can be released for research and for training Greek LLMs.

Over the last few weeks, I've been discussing this with mentors Nikos Tsekos and Dimitrios Athanasopoulos. Their feedback really helped me pin down the technical design, especially how to handle edge cases like ML context loss and markdown structural leaks using Full-Block Batched Inference and $O(\log N)$ Bisection Mapping.

To make sure the core idea actually works, I've also built a working Proof of Concept (PoC) locally in the codebase, which includes a 12-case pytest suite and JSON sidecar generation for auditability.

*Draft Proposal:* Google Doc Link <https://docs.google.com/document/d/1FLiseNImolb73WOmboLIn_2YMtonJc9JpiaJQXoSSCQ/edit?tab=t.0>

I have also attached the PDF for reference. I would love to hear any thoughts, technical critiques, or suggestions from the community before the final deadline.

Thanks for your time,
*Khushi Agrawal*
GitHub: khushiiagrawal <https://github.com/khushiiagrawal>
Attachment: KhushiAgrawal_GlossAPI_GSoC'26.pdf (Adobe PDF document)