Dear Dr. Zavras and mentors,

I hope you are well. Following our conversation at FOSDEM, I took a deeper look into the Software Heritage specifications, particularly around artifact identification for SBOM mapping. This exploration identified reliable artifact identification as a key bottleneck for effective SBOM validation, which led me to focus on using SWHIDs to identify software components.

To better understand the practical challenges, I built a working prototype targeting the PyPI ecosystem, available at https://github.com/OdysseasKalaitsidis/SWHID_POC. The tool fetches a package source distribution (sdist) from PyPI, computes its SWHID locally using swh.model, and verifies it against the Software Heritage archive API.

My experiments revealed an important technical challenge. While pure-Python packages such as six 1.17.0 verified successfully, others, such as certifi, failed validation. The root cause appears to be build-time metadata included in PyPI distributions, such as .egg-info directories or PKG-INFO files, which are absent from the upstream source tree archived by Software Heritage. This suggests that a robust implementation requires ecosystem-specific normalization before hashing: stripping build artifacts to align distributions with their canonical source representation.

Given this complexity, I believe the project aligns better with a Large scope. The core deliverable would be ecosystem-specific normalization engines for PyPI and Crates.io, with Maven as a stretch goal due to the additional source-to-binary JAR mapping complexity. Building on that foundation, I would develop a PURL-based CLI tool and a lightweight REST API for querying verified mappings, and publish a dataset of top packages with SPDX 3.0 export support.

This work connects directly to the Unified SBOM Management via RDF Database Abstraction project.
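To make the normalization step above concrete, here is a minimal sketch of what stripping build artifacts could look like. The two name patterns (PKG-INFO files and *.egg-info directories) come from my certifi experiment; treating them as a sufficient exclusion list is an assumption, and determining the real per-ecosystem list would be a core deliverable of the project.

```python
import shutil
from pathlib import Path


def normalize_sdist_tree(root: Path) -> list[str]:
    """Strip PyPI build-time metadata so the tree matches the upstream source.

    Removes PKG-INFO files and *.egg-info directories (an illustrative,
    assumed pattern list). Returns the relative paths that were removed,
    so the normalization is auditable.
    """
    removed = []
    for path in sorted(root.rglob("*")):
        if not path.exists():
            # Parent directory may already have been removed.
            continue
        if path.is_dir() and path.name.endswith(".egg-info"):
            shutil.rmtree(path)
            removed.append(str(path.relative_to(root)))
        elif path.is_file() and path.name == "PKG-INFO":
            path.unlink()
            removed.append(str(path.relative_to(root)))
    return removed
```

After this pass, the tree could be handed to swh.model for SWHID computation, and the returned list of removed paths recorded alongside the mapping for transparency.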
Verified PURL-to-SWHID mappings would allow SBOMs stored in the RDF triplestore to be validated against content-addressed source artifacts, effectively closing the gap between declared components and their archived source code.

Through my previous projects, I have gained practical experience with Python data pipelines, Docker-based deployment, and Git workflows, which I believe provides a solid foundation for this work.

I would greatly appreciate your feedback on the direction and scope. I also have one architectural question for the proposal stage: should the tool treat the sdist SWHID from PyPI and the git-tag SWHID from the upstream repository as two distinct valid mappings, or should we aim to canonicalize everything to the git-based SWHID? This decision will significantly influence the schema design.

Best regards,
Odysseas Kalaitsidis
----
You are receiving this message via the list: a discussion list for student developers and mentors of Google Summer of Code projects, https://lists.ellak.gr/gsoc-developers/listinfo.html
You can unsubscribe from the list by sending an empty email to <gsoc-developers+unsubscribe [ at ] ellak [ dot ] gr>.