Hi,

I am resending this message because I received a notice from mlmmj saying that my earlier mail was either delayed or not delivered.

I am Yuvraj Saxena, a student at KIIT University, India, pursuing a B.Tech in CSE, and I am currently exploring the ML-assisted anonymization project for GSoC '26. So far I have been running the GlossAPI repository locally to better understand the document extraction pipeline.

While reading through the pipeline (specifically gloss_extract.py), I implemented an anonymization layer that fits the project description, i.e. it masks persons, locations, organizations, phone numbers and email IDs. It consists of two layers: a regex layer (handles phone numbers and email IDs) and an NER layer (handles the rest). I chose GLiNER for the NER layer for better entity recognition. Both layers replace each detected entity with a mask tag. In my prototype, I anonymize the text page by page, to keep the number of tokens processed at once manageable.

While testing this approach, I had a few architectural questions:

1. Should I run the anonymizer per page (which I am currently doing), or switch to anonymizing the document as a whole?
2. Should masked entities share a common tag, or should they get different tags depending on the anonymization layer? For example, [Person] or [Person 1], to distinguish between the masks.
3. Should anonymization be implemented as an optional stage within the extraction pipeline, or as a separate module that can be enabled when needed?
4. Are there preferred models or approaches you would like me to explore for entity detection in Greek datasets, or is GLiNER preferable?
5. In my current prototype, the phone regex matches a 10-digit format, so it might miss numbers written with a country code (e.g. +91) and other international formats. Should I refine the regex layer to cover international numbers, or is a 10-digit match sufficient?

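To make the tagging and phone-format questions concrete, here is a minimal sketch of how the regex layer could look. It is not the code in anonymizer.py: the class name RegexMasker, both patterns, and the choice to key phone numbers on their last 10 digits are assumptions made purely for illustration.

```python
import re

# Illustrative patterns only; not the ones used in the prototype.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Optional country code ("+91", "0030", ...) followed by exactly 10 digits.
PHONE_RE = re.compile(r"(?<![\w+])(?:(?:\+|00)\d{1,3}[\s.-]?)?\d{10}(?!\d)")


class RegexMasker:
    """Hypothetical regex layer that keeps tag numbers stable across pages."""

    def __init__(self):
        # (label, normalized value) -> running index for that label
        self._ids = {}

    def _tag(self, label, value):
        key = (label, value)
        if key not in self._ids:
            seen = sum(1 for (lbl, _) in self._ids if lbl == label)
            self._ids[key] = seen + 1
        return f"[{label} {self._ids[key]}]"

    def mask(self, text):
        # Emails first, so digits inside an address are never read as a phone.
        text = EMAIL_RE.sub(
            lambda m: self._tag("Email", m.group().lower()), text
        )
        # Assumption: the last 10 digits identify the number, so the same
        # number with and without a country code gets the same tag.
        text = PHONE_RE.sub(
            lambda m: self._tag("Phone", re.sub(r"\D", "", m.group())[-10:]),
            text,
        )
        return text


masker = RegexMasker()
page1 = masker.mask("Call +91 9876543210 or mail A@b.org")
page2 = masker.mask("9876543210, 00302101234567, a@b.org")
print(page1)  # Call [Phone 1] or mail [Email 1]
print(page2)  # [Phone 1], [Phone 2], [Email 1]
```

Reusing one masker object for every page is what would let per-page processing still give the same entity the same numbered tag throughout a document. A separate masker per page would restart the numbering on each page.
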
I would really appreciate any guidance on the preferred architectural direction before continuing further with the implementation. My work on GlossAPI is at the link below (see anonymizer.py, gloss_extract.py and run_test.py). Please also let me know whether I should open a PR for it now or later.

https://github.com/raja-jaloka/raj_glossAPI/tree/anonymization-prototype

Thanks!

---- You are receiving this message from the list: A discussion list for student developers and mentors of Google Summer of Code projects, https://lists.ellak.gr/gsoc-developers/listinfo.html You can unsubscribe from the list by sending an empty email to <gsoc-developers+unsubscribe [ at ] ellak [ dot ] gr>.