EELLAK - Mailing Lists

GSOC'26 Integrating anonymization prototype

Subject: GSOC'26 Integrating anonymization prototype
From: Yuvraj Saxena <yuvraj100706 [ at ] gmail [ dot ] com>
Date: Thu, 12 Mar 2026 18:30:19 +0530

I am writing this mail again because I received a mail from mlmmj
suggesting that my earlier mail was either delayed or not delivered.

Hi, I am Yuvraj Saxena, a student at KIIT University, India, pursuing a
B.Tech in CSE and currently exploring the ML-assisted anonymization project
for GSoC'26.

So far I have been running the GlossAPI repository locally to better
understand the document extraction pipeline. While reading through the
pipeline (specifically gloss_extract.py), I implemented an anonymization
layer that fits the project description, i.e. it masks persons, locations,
organizations, phone numbers and email IDs. It consists of two layers: a
regex layer (handles phone numbers and email IDs) and an NER layer (handles
the rest). I chose GLiNER for the NER layer for better entity recognition.
Both layers replace each detected entity with a mask tag.
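
As a rough illustration of the regex layer described above, here is a
minimal sketch. It is not the code from anonymizer.py: the pattern names, the
tag strings ([EMAIL], [PHONE]) and the function name are assumptions made
for this example only.

```python
import re

# Assumed patterns for illustration; the prototype's actual regexes may differ.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Exactly 10 consecutive digits, not embedded in a longer run of digits.
PHONE_RE = re.compile(r"(?<!\d)\d{10}(?!\d)")


def mask_with_regex(text: str) -> str:
    """Replace email addresses and 10-digit phone numbers with mask tags."""
    # Emails are masked first so digits inside an address are never
    # mistaken for a phone number.
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text
```

The NER layer would then run on the output of this function, so that GLiNER
only has to handle persons, locations and organizations.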

In my prototype, I anonymize the text at the per-page level (to keep the
number of tokens per call manageable). While testing this approach I had a
few architectural questions:

1. Should I run the anonymizer per page (which I am currently doing) or
shift to anonymizing the document as a whole?
2. Should all masked entities of a type share a common tag, or should each
distinct entity get its own numbered tag, e.g. [Person] versus [Person 1],
so that the masks can be told apart?
3. Should anonymization be implemented as an optional stage within the
extraction pipeline, or as a separate module that can be enabled when
needed?
4. Are there preferred models or approaches that you would like me to
explore for entity detection in Greek datasets, or is GLiNER preferable?
5. In my current prototype I mask phone numbers using a 10-digit format,
which misses numbers written in international format with a country code
(e.g. +91). Should I refine the regex layer to cover international numbers,
or is a 10-digit pattern sufficient?
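
To make questions 2 and 5 concrete, here is a small sketch of what numbered
tags and an international phone pattern could look like. Everything in it
(the TagNumberer class, the mask_phones function, the PHONE label, and the
assumption that a country code is "+" followed by 1 to 3 digits) is
hypothetical and not part of the current prototype.

```python
import re
from collections import defaultdict

# Optional "+<1-3 digit country code>" prefix with an optional space or
# hyphen, followed by a 10-digit national number. This is an assumed
# format; real-world numbers vary more than this.
INTL_PHONE_RE = re.compile(r"(?<![\d+])(?:\+\d{1,3}[\s-]?)?\d{10}(?!\d)")


class TagNumberer:
    """Give each distinct entity string a stable numbered tag per label."""

    def __init__(self) -> None:
        # label -> {entity string -> index}
        self._seen: dict[str, dict[str, int]] = defaultdict(dict)

    def tag(self, label: str, value: str) -> str:
        seen = self._seen[label]
        if value not in seen:
            seen[value] = len(seen) + 1
        return f"[{label} {seen[value]}]"


def mask_phones(text: str, numberer: TagNumberer) -> str:
    """Mask phone numbers, reusing the same tag for a repeated number."""
    return INTL_PHONE_RE.sub(
        lambda m: numberer.tag("PHONE", m.group(0)), text
    )
```

With this design, passing the same TagNumberer to every page keeps the
numbering consistent across the whole document even when the masking itself
runs per page, which is one way questions 1 and 2 interact.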

I would really appreciate any guidance on the preferred architectural
direction before continuing further with the implementation.
Below is a link to my work on GlossAPI (see anonymizer.py,
gloss_extract.py and run_test.py). Please also let me know whether I should
open the PR for it now or later.
https://github.com/raja-jaloka/raj_glossAPI/tree/anonymization-prototype

Thanks!

----
You are receiving this message from the list: A discussion list for student developers and mentors of Google Summer of Code projects.
https://lists.ellak.gr/gsoc-developers/listinfo.html
You can unsubscribe from the list by sending an empty email to <gsoc-developers+unsubscribe [ at ] ellak [ dot ] gr>.