ΕΕΛΛΑΚ - Mailing Lists

[GSoC 2026] Proposal for Feedback: ML-Assisted Anonymization Layer for GlossAPI

Hi GFOSS community,

I'm Khushi Agrawal, an undergraduate student at DSCE, Bangalore, India, and
I'm writing to share my GSoC 2026 proposal for the GlossAPI project and ask
for your feedback.

My proposal focuses on building an anonymization layer (Corpus.anonymize())
directly into the GlossAPI pipeline. Right now, highly valuable NLP
datasets (like the ~300GB of OpenArchives data from GSoC 2025) are blocked
from public release because they contain PII such as Greek Tax IDs and
personal contact details. My goal is to safely mask this data without breaking
downstream structural classifiers, making the datasets GDPR-compliant so
they can be released for research and training Greek LLMs.
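
To make the masking goal concrete, here is a minimal sketch of rule-based masking for one PII type, the Greek Tax ID (AFM). It is an illustration only, not the proposal's implementation: the names `is_valid_afm` and `mask_afm` and the `[AFM]` placeholder are hypothetical, and the mod-11 check-digit rule used here is the commonly documented one and should be verified against an official source. Checking the digit before masking cuts false positives on other 9-digit numbers, and replacing matches in place leaves markdown structure such as headings untouched.

```python
import re

# Exactly nine ASCII digits that are not part of a longer digit run.
AFM_PATTERN = re.compile(r"(?<!\d)\d{9}(?!\d)", re.ASCII)


def is_valid_afm(candidate: str) -> bool:
    """Check the assumed AFM rule: weighted sum of the first 8 digits
    (weights 2^8 down to 2^1), mod 11, mod 10, equals the 9th digit."""
    if len(candidate) != 9 or not (candidate.isascii() and candidate.isdigit()):
        return False
    if candidate == "000000000":
        return False
    digits = [int(c) for c in candidate]
    weighted = sum(d << (8 - i) for i, d in enumerate(digits[:8]))
    return weighted % 11 % 10 == digits[8]


def mask_afm(text: str, placeholder: str = "[AFM]") -> str:
    """Replace only those 9-digit runs that pass the check-digit test."""
    def replace(match: re.Match) -> str:
        value = match.group()
        return placeholder if is_valid_afm(value) else value

    return AFM_PATTERN.sub(replace, text)


# Synthetic example value; the 10-digit phone number is left alone.
print(mask_afm("ΑΦΜ: 123456783, τηλ. 2101234567"))
```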

Over the last few weeks, I’ve been discussing this with mentors Nikos
Tsekos and Dimitrios Athanasopoulos. Their feedback really helped me pin
down the technical design, especially figuring out how to handle edge cases
like ML context loss and markdown structural leaks using Full-Block Batched
Inference and $O(\log N)$ Bisection Mapping.
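
This message does not spell out how Bisection Mapping works, so the following is only one plausible reading, sketched under stated assumptions: the model runs on the full concatenated text (so it keeps cross-block context), and each detected character offset is mapped back to its source block with a binary search over the sorted block start offsets, which costs O(log N) per lookup. The helper names `block_starts` and `locate` and the `"\n\n"` separator are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate


def block_starts(blocks: list[str], sep: str = "\n\n") -> list[int]:
    """Start offset of every block inside sep.join(blocks)."""
    lengths = [len(b) + len(sep) for b in blocks]
    return [0] + list(accumulate(lengths))[:-1]


def locate(offset: int, starts: list[int]) -> tuple[int, int]:
    """Map a global character offset to (block index, offset inside block).

    An offset that falls inside a separator maps to the preceding block
    with a local offset at or past that block's length.
    """
    if offset < 0:
        raise ValueError("offset must be non-negative")
    idx = bisect_right(starts, offset) - 1  # last block starting at or before offset
    return idx, offset - starts[idx]


blocks = ["# Title", "Call 2101234567 now", "End"]
starts = block_starts(blocks)
joined = "\n\n".join(blocks)
span_start = joined.index("2101234567")  # stand-in for a model-reported offset
print(locate(span_start, starts))  # (1, 5)
```

Keeping the mapping separate from inference means the detector never needs to know about block boundaries; only the write-back step does.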

To make sure the core idea actually works, I’ve also built a working Proof
of Concept (PoC) locally in the codebase, which includes a 12-case pytest
suite and JSON sidecar generation for auditability.
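
The PoC's actual sidecar schema is not described in this message. As a rough illustration of the auditability idea, a sidecar could record where each replacement happened and which detector produced it, without storing the original value. Every field name below (`doc_id`, `block`, `label`, `detector`, and so on) is an assumption made for the sketch.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class MaskedSpan:
    block: int      # index of the markdown block that was edited
    start: int      # character offset inside that block
    end: int        # exclusive end offset
    label: str      # PII category, e.g. "AFM" or "PHONE"
    detector: str   # which rule or model flagged the span


def build_sidecar(doc_id: str, spans: list[MaskedSpan]) -> str:
    """Serialize an audit record; the masked values themselves are never stored."""
    ordered = sorted(spans, key=lambda s: (s.block, s.start))
    record = {
        "doc_id": doc_id,
        "span_count": len(ordered),
        "spans": [asdict(s) for s in ordered],
    }
    return json.dumps(record, ensure_ascii=False, indent=2)


print(build_sidecar("doc-1", [MaskedSpan(1, 5, 15, "PHONE", "regex")]))
```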

*Draft Proposal:* Google Doc Link
<https://docs.google.com/document/d/1FLiseNImolb73WOmboLIn_2YMtonJc9JpiaJQXoSSCQ/edit?tab=t.0>

I have also attached the PDF for reference.

I would love to hear any thoughts, technical critiques, or suggestions from
the community before the final deadline.

Thanks for your time,

*Khushi Agrawal*

GitHub: khushiiagrawal <https://github.com/khushiiagrawal>

Attachment: KhushiAgrawal_GlossAPI_GSoC'26.pdf
Description: Adobe PDF document

----
You are receiving this message from the list: A discussion list for student developers and mentors of Google Summer of Code projects,
https://lists.ellak.gr/gsoc-developers/listinfo.html
You can unsubscribe from the list by sending an empty email to <gsoc-developers+unsubscribe [ at ] ellak [ dot ] gr>.
