ΕΕΛΛΑΚ - Mailing Lists

Interest in ML-assisted Anonymization Project – Technical Questions

Subject: Interest in ML-assisted Anonymization Project – Technical Questions
From: Emmanuel Adewumi <sccsmart247 [ at ] gmail [ dot ] com>
Date: Sun, 1 Mar 2026 06:20:02 +0100

Dear mentors and contributors,

I am very interested in the “ML-assisted Anonymization Layer and Targeted
Pipeline Improvements for Greek Datasets” project and have been studying
the proposal carefully.

From a systems perspective, the core challenge appears to be designing a
high-recall anonymization layer that remains robust under Greek
morphological variation and OCR-induced noise, while integrating cleanly
into the existing GlossAPI pipeline.

Before drafting a proposal, I would appreciate clarification on a few
technical points:


   1. Are there existing annotated Greek NER datasets currently used or
   recommended for this project, or would part of the work involve curating or
   adapting such data?
   2. Has there been prior evaluation of multilingual transformer-based NER
   models (e.g., XLM-R or similar) on your datasets, particularly under OCR
   noise conditions?
   3. Do the datasets exhibit systematic OCR error patterns (e.g., accent
   loss, character confusion, token fragmentation) that should influence
   preprocessing design?
   4. In terms of anonymization policy, is recall considered strictly more
   important than precision, or is there a defined tradeoff target?
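To illustrate why accent loss (point 3) matters for preprocessing, here is a minimal sketch of an accent-insensitive matching key, using only the Python standard library. The function names `strip_accents` and `match_key` are hypothetical and not part of GlossAPI; the sketch assumes that OCR accent loss can be approximated by removing Unicode combining marks, which may not cover every error pattern in the actual datasets.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Remove combining marks (tonos, dialytika) from Greek text."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" (nonspacing mark) covers the Greek accent marks.
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", kept)


def match_key(text: str) -> str:
    """Hypothetical lookup key: accent-free and case-insensitive."""
    return strip_accents(text).casefold()


if __name__ == "__main__":
    # Accented and accent-stripped spellings map to the same key.
    print(match_key("Αθήνα") == match_key("ΑΘΗΝΑ"))
```

A key like this would only be used for matching (for example, gazetteer lookups); the original text would be kept for masking so that character offsets stay valid.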

Additionally, have LLM-based structured extraction approaches (e.g.,
prompt-based entity extraction using multilingual instruction-tuned
models) been considered or experimentally evaluated, either as a primary
method or as part of a hybrid architecture? If so, I would be interested
to understand how they compare to token-classification models in terms of
reliability and reproducibility within your pipeline constraints.

My initial thinking is to explore a layered architecture combining:


   - Deterministic regex detection (emails, phone numbers),
   - Transformer-based NER for PERSON/ORG,
   - Noise-aware preprocessing,
   - Confidence-based masking and evaluation tooling,
   - Potentially a controlled LLM-assisted validation layer if appropriate.
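As a rough illustration of the first and fourth layers above, the following is a minimal sketch, not a proposed implementation. The patterns are assumptions: the phone pattern only covers 10-digit Greek numbers starting with 2 or 6, optionally prefixed by +30. The names `mask_deterministic` and `mask_spans`, the span format, and the threshold value are hypothetical; the NER scores would come from whatever model is eventually chosen.

```python
import re

# Assumed patterns; real data would likely need broader coverage.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"(?<!\d)(?:\+30[ -]?)?[26]\d{9}(?!\d)"),
}


def mask_deterministic(text: str) -> str:
    """Replace regex-detectable identifiers with a label placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


def mask_spans(text: str, spans, threshold: float = 0.3) -> str:
    """Mask model-predicted spans whose confidence meets the threshold.

    `spans` is a list of (start, end, label, score) tuples from a
    hypothetical NER model. Overlapping spans are not handled here.
    """
    # Process right to left so earlier offsets stay valid after replacement.
    for start, end, label, score in sorted(spans, reverse=True):
        if score >= threshold:
            text = text[:start] + f"[{label}]" + text[end:]
    return text


if __name__ == "__main__":
    print(mask_deterministic("Email: maria@example.gr, τηλ. 6912345678"))
    print(mask_spans("Call John now", [(5, 9, "PERSON", 0.4)]))
```

A low threshold favors recall over precision, which is why the policy question in point 4 matters. Because regex masking changes character offsets, NER spans would need to be computed on the same version of the text that `mask_spans` receives.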

I would be glad to receive any guidance on preferred technical directions
or constraints within the current GlossAPI codebase.

Thank you for your time, and I look forward to following the discussion.

Best regards, Emmanuel Adewumi

----
You are receiving this message from the list: A discussion list for student developers and mentors of Google Summer of Code projects.
https://lists.ellak.gr/gsoc-developers/listinfo.html
You can unsubscribe from the list by sending an empty email message to <gsoc-developers+unsubscribe [ at ] ellak [ dot ] gr>.
