ΕΕΛΛΑΚ - Mailing Lists

Interest in "Using SWHID to Identify Software Components": POC & Architecture

Dear Dr. Zavras and mentors,

I hope you are well. Following our conversation at FOSDEM, I took a deeper
look into the Software Heritage specifications, particularly around
artifact identification for SBOM mapping. From this exploration, I
identified reliable artifact identification as a key bottleneck for
effective SBOM validation, which led me to focus on the idea of using
SWHIDs to identify software components.

To better understand the practical challenges, I built a working prototype
targeting the PyPI ecosystem, available at
https://github.com/OdysseasKalaitsidis/SWHID_POC. The tool fetches a
package source distribution (sdist) from PyPI, computes its SWHID locally
using swh.model, and verifies it against the Software Heritage archive API.
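To make the identification step concrete: for a single file, the content-level SWHID is simply the Git blob hash of the raw bytes, which can be computed with the standard library alone. This is only an illustrative sketch; the prototype itself uses swh.model, which also handles the directory-level (swh:1:dir:) identifiers needed for whole sdists.

```python
import hashlib

def content_swhid(data: bytes) -> str:
    """Compute a content-level SWHID (swh:1:cnt:...).

    Per the SWHID specification, this is the Git blob hash of the
    bytes: sha1(b"blob <length>\\0" + data).
    """
    header = b"blob %d\x00" % len(data)
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()

print(content_swhid(b"hello\n"))
# swh:1:cnt:ce013625030ba8dba906f756967f9e9ca394464a
```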

My experiments revealed an important technical challenge. While pure Python
packages such as six 1.17.0 verified successfully against the archive,
others such as certifi failed validation. The root cause appears to be build-time metadata
failed validation. The root cause appears to be build-time metadata
included in PyPI distributions, such as .egg-info directories or PKG-INFO
files, which are absent from the upstream source tree archived by Software
Heritage. This suggests that a robust implementation requires
ecosystem-specific normalization before hashing, stripping build artifacts
to align distributions with their canonical source representation.

Given this complexity, I believe the project aligns better with a Large
scope. The core deliverable would be ecosystem-specific normalization
engines for PyPI and Crates.io, with Maven as a stretch goal due to the
additional source-to-binary JAR mapping complexity. Building on that
foundation, I would develop a PURL-based CLI tool and a lightweight REST
API for querying verified mappings, as well as publish a dataset of top
packages with SPDX 3.0 export support.
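For the PURL-based entry point, the sketch below shows the minimal parsing the CLI would start from. The function name and the simplified grammar are my own; a real implementation would follow the full purl specification, including namespaces, qualifiers, and subpaths.

```python
def parse_purl(purl: str) -> tuple[str, str, str]:
    """Minimal parse of a package URL like 'pkg:pypi/six@1.17.0'
    into (type, name, version). Hypothetical helper: ignores
    namespaces, qualifiers, and subpaths from the full purl spec."""
    scheme, _, rest = purl.partition(":")
    if scheme != "pkg" or not rest:
        raise ValueError(f"not a purl: {purl!r}")
    pkg_type, _, name_version = rest.partition("/")
    name, _, version = name_version.partition("@")
    return pkg_type, name, version

print(parse_purl("pkg:pypi/six@1.17.0"))
# ('pypi', 'six', '1.17.0')
```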

This work connects directly to the Unified SBOM Management via RDF Database
Abstraction project. Verified PURL-to-SWHID mappings would allow SBOMs
stored in the RDF triplestore to be validated against content-addressed
source artifacts, effectively closing the gap between declared components
and their archived source code.

Through my previous projects, I have gained practical experience with
Python data pipelines, Docker-based deployment, and Git workflows, which I
believe provide a solid foundation for this work. I would greatly
appreciate your feedback on the direction and scope.

I also have one architectural question for the proposal stage: should the
tool treat the sdist SWHID from PyPI and the git-tag SWHID from the
upstream repository as two distinct valid mappings, or should we aim to
canonicalize everything to the git-based SWHID? This decision will
significantly influence the schema design.
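To make the trade-off concrete, a record shape for the first option might look like the following, where both SWHIDs are kept as distinct mappings tagged with their provenance. This is purely a hypothetical illustration of the schema question, not a proposed design.

```python
from dataclasses import dataclass

# Hypothetical record for option (a): both the sdist-derived and the
# git-tag-derived SWHIDs are stored as separate, provenance-tagged rows.
@dataclass(frozen=True)
class SwhidMapping:
    purl: str        # e.g. "pkg:pypi/six@1.17.0"
    swhid: str       # e.g. "swh:1:dir:..."
    provenance: str  # "pypi-sdist" or "upstream-git-tag"
```

Canonicalizing to the git-based SWHID instead would collapse this to one row per PURL, at the cost of losing the record of what was actually published on PyPI.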

Best regards,
Odysseas Kalaitsidis
----
You are receiving this message from the list: a mailing list for discussion among student developers & mentors of Google Summer of Code projects.
https://lists.ellak.gr/gsoc-developers/listinfo.html
You can unsubscribe from the list by sending an empty email to <gsoc-developers+unsubscribe [ at ] ellak [ dot ] gr>.