Hi Odysseas, thanks for your interest.
You've correctly identified a major issue with matching
"intrinsic" identifiers (like SWHID) with "extrinsic" ones
(like URL or DOI or PURL): what is the actual thing
that we are trying to identify?
Is it only the software? Does it include files like README and LICENSE?
It gets even worse when the package (on PyPI, in your case)
includes compiled binaries.
And of course extrinsic identifiers are almost useless
when one introduces local modifications.
I believe you've only scratched the surface of the problem's complexity,
and I'd invite you to explore more real-world use cases.
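
To make the point about README, LICENSE and build metadata concrete, here is a minimal Python sketch of the git-style intrinsic hashing that content and directory SWHIDs are built on. It is a simplification resting on several assumptions: a source tree is modelled as a nested dict, every file is treated as a regular non-executable file (mode 100644), symlinks are ignored, and the helper names (content_swhid, directory_swhid, strip_build_metadata) as well as the set of stripped names are hypothetical, not part of swh.model. For real work, swh.model is the tool to use.

```python
import hashlib


def git_object_id(kind: str, payload: bytes) -> str:
    # A git object id is SHA-1 over "<kind> <length>", a NUL byte, then the payload.
    header = f"{kind} {len(payload)}".encode() + b"\0"
    return hashlib.sha1(header + payload).hexdigest()


def content_swhid(data: bytes) -> str:
    # Content identifiers reuse the git blob hash of the file bytes.
    return "swh:1:cnt:" + git_object_id("blob", data)


def directory_id(tree: dict) -> str:
    # tree maps names to bytes (a file) or to a nested dict (a subdirectory).
    entries = []
    for name, node in tree.items():
        raw = name.encode()
        if isinstance(node, dict):
            # Git sorts directories as if their name ended with "/".
            entries.append((raw + b"/", b"40000", raw, directory_id(node)))
        else:
            # Assumption: every file is a plain, non-executable file.
            entries.append((raw, b"100644", raw, git_object_id("blob", node)))
    entries.sort(key=lambda entry: entry[0])
    payload = b"".join(
        mode + b" " + raw + b"\0" + bytes.fromhex(oid)
        for _, mode, raw, oid in entries
    )
    return git_object_id("tree", payload)


def directory_swhid(tree: dict) -> str:
    return "swh:1:dir:" + directory_id(tree)


def strip_build_metadata(tree: dict) -> dict:
    # Hypothetical normalization: drop PKG-INFO files and *.egg-info directories.
    cleaned = {}
    for name, node in tree.items():
        if name == "PKG-INFO" or name.endswith(".egg-info"):
            continue
        cleaned[name] = strip_build_metadata(node) if isinstance(node, dict) else node
    return cleaned


# Toy trees with made-up contents: an upstream checkout and the matching sdist.
upstream = {"LICENSE": b"MIT\n", "six.py": b"print('hi')\n"}
sdist = {
    **upstream,
    "PKG-INFO": b"Metadata-Version: 2.1\n",
    "six.egg-info": {"PKG-INFO": b"Metadata-Version: 2.1\n"},
}

print("upstream:  ", directory_swhid(upstream))
print("sdist:     ", directory_swhid(sdist))
print("normalized:", directory_swhid(strip_build_metadata(sdist)))
```

In this toy case the sdist and upstream identifiers differ only because of the extra metadata, and stripping it makes them agree again. Real distributions may differ from the upstream tree in other ways too (rewritten files, file modes, generated sources), so a rule this simple should not be expected to suffice in general.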
On Wed, Feb 25, 2026, at 03:14, Odysseas Kalaitsidis wrote:
> Dear Dr. Zavras and mentors,
>
> I hope you are well. Following our conversation at FOSDEM, I took a
> deeper look into the Software Heritage specifications, particularly
> around artifact identification for SBOM mapping. From this exploration,
> I identified reliable artifact identification as a key bottleneck for
> effective SBOM validation, which led me to focus on the idea of using
> SWHIDs to identify software components.
>
> To better understand the practical challenges, I built a working
> prototype targeting the PyPI ecosystem, available at
> https://github.com/OdysseasKalaitsidis/SWHID_POC. The tool fetches a
> package source distribution (sdist) from PyPI, computes its SWHID
> locally using swh.model, and verifies it against the Software Heritage
> archive API.
>
> My experiments revealed an important technical challenge. While pure
> Python packages such as six 1.17.0 verified successfully, others such
> as certifi failed validation. The root cause appears to be build-time
> metadata included in PyPI distributions, such as .egg-info directories
> or PKG-INFO files, which are absent from the upstream source tree
> archived by Software Heritage. This suggests that a robust
> implementation requires ecosystem-specific normalization before
> hashing, stripping build artifacts to align distributions with their
> canonical source representation.
>
> Given this complexity, I believe the project aligns better with a Large
> scope. The core deliverable would be ecosystem-specific normalization
> engines for PyPI and Crates.io, with Maven as a stretch goal due to the
> additional source-to-binary JAR mapping complexity. Building on that
> foundation, I would develop a PURL-based CLI tool and a lightweight
> REST API for querying verified mappings, as well as publish a dataset
> of top packages with SPDX 3.0 export support.
>
> This work connects directly to the Unified SBOM Management via RDF
> Database Abstraction project. Verified PURL-to-SWHID mappings would
> allow SBOMs stored in the RDF triplestore to be validated against
> content-addressed source artifacts, effectively closing the gap between
> declared components and their archived source code.
>
> Through my previous projects, I have gained practical experience with
> Python data pipelines, Docker-based deployment, and Git workflows,
> which I believe provide a solid foundation for this work. I would
> greatly appreciate your feedback on the direction and scope.
>
> I also have one architectural question for the proposal stage: should
> the tool treat the sdist SWHID from PyPI and the git-tag SWHID from the
> upstream repository as two distinct valid mappings, or should we aim to
> canonicalize everything to the git-based SWHID? This decision will
> significantly influence the schema design.
>
> Best regards,
> Odysseas Kalaitsidis
>
> ----
> You are receiving this message from the list: Mailing and discussion
> list for student developers & mentors of Google Summer of Code
> projects,
> https://lists.ellak.gr/gsoc-developers/listinfo.html
> You can unsubscribe from the list by sending an empty email to
> <gsoc-developers+unsubscribe [ at ] ellak [ dot ] gr>.
--
-- zvr -