EELLAK - Mailing Lists

Re: Interest in "Using SWHID to Identify Software Components": POC & Architecture

Hi Alexios,

Following your feedback, I spent the last few days exploring real-world
cases in more depth, starting with the pytorch example you mentioned.

I queried the PyPI API for torch 2.6.0 and found 20 wheels and zero source
distributions. The Linux x86 wheel is 731 MB because it includes CUDA
libraries, while the macOS ARM wheel is only 63 MB since it uses Metal.
Furthermore, download.pytorch.org has additional variants for CPU-only,
CUDA 11.8, 12.1, 12.4, and ROCm. So a single torch release has dozens of
distinct binary artifacts, each with different content, each producing a
different SWHID. A PURL like pkg:pypi/torch@2.6.0 does not map to one
artifact. It maps to a whole graph of them.
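For concreteness, here is a minimal sketch of how that breakdown can be read from the PyPI JSON API. The sample data below is illustrative and only mirrors the shape of the real response; the commented-out line shows the actual query.

```python
import json
from collections import Counter
from urllib.request import urlopen

def artifact_counts(release_files):
    """Group a PyPI release's files by packagetype (sdist vs bdist_wheel)."""
    return Counter(f["packagetype"] for f in release_files)

# Real query (response shape is documented for pypi.org/pypi/<name>/<version>/json):
#   files = json.load(urlopen("https://pypi.org/pypi/torch/2.6.0/json"))["urls"]
# Illustrative sample mirroring that shape:
files = [
    {"filename": "torch-2.6.0-cp312-manylinux_2_28_x86_64.whl",
     "packagetype": "bdist_wheel"},
    {"filename": "torch-2.6.0-cp312-macosx_11_0_arm64.whl",
     "packagetype": "bdist_wheel"},
]
counts = artifact_counts(files)
print(counts["bdist_wheel"], counts["sdist"])  # 2 wheels, 0 sdists in the sample
```

For torch 2.6.0 the real `urls` array contains 20 `bdist_wheel` entries and no `sdist` entry, which is the mismatch described above.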

I also looked into Crates.io. I downloaded serde 1.0.203 from the registry
and diffed it against the v1.0.203 git tag. Three files differ. The
registry adds .cargo_vcs_info.json which contains the source git commit
hash. It also rewrites Cargo.toml and keeps the original as
Cargo.toml.orig. All actual source files are identical. This is very
different from PyPI. The mismatch is predictable, and the registry even
provides a built-in link back to the source commit.
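Because the additions are predictable, recovering the source commit from a crate can be sketched in a few lines. The file names below are the ones cargo adds when packaging (as observed in the serde diff); the extracted crate directory is simulated here rather than downloaded.

```python
import json
import tempfile
from pathlib import Path

# Files cargo adds or preserves when packaging a crate (observed with serde 1.0.203):
REGISTRY_ARTIFACTS = {".cargo_vcs_info.json", "Cargo.toml.orig"}

def source_commit(crate_dir: Path) -> str:
    """Recover the upstream git commit recorded by cargo at packaging time."""
    info = json.loads((crate_dir / ".cargo_vcs_info.json").read_text())
    return info["git"]["sha1"]

def registry_only_files(crate_dir: Path) -> set[str]:
    """Top-level names present in the registry tarball but not in the git tree."""
    return {p.name for p in crate_dir.iterdir() if p.name in REGISTRY_ARTIFACTS}

# Simulated extracted crate ("deadbeef" is a placeholder commit; the real tarball
# comes from https://crates.io/api/v1/crates/serde/1.0.203/download):
d = Path(tempfile.mkdtemp())
(d / ".cargo_vcs_info.json").write_text('{"git": {"sha1": "deadbeef"}, "path_in_vcs": ""}')
(d / "Cargo.toml.orig").write_text('[package]\nname = "serde"\n')
(d / "Cargo.toml").write_text('[package]\nname = "serde"\n')
print(source_commit(d))                 # deadbeef
print(sorted(registry_only_files(d)))   # the two registry-added files
```

A verifier could strip exactly these known files before hashing, or simply record the commit and compare against the tagged tree.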

So the problem looks different depending on the ecosystem. On PyPI, the gap
between source and distribution is wide, with compiled binaries, platform
variants, and sometimes no sdist at all. On Crates.io, the gap is small and
the registry already gives you a path back to the source.

And as you pointed out earlier, for local modifications after installation,
the PURL stays the same even though the files on disk have changed. Only
computing SWHIDs on the installed files would catch that.
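A minimal sketch of that check, relying on the fact that a swh:1:cnt identifier is constructed the same way as a git blob hash. The file name and baseline mapping are made up for the example.

```python
import hashlib
import tempfile
from pathlib import Path

def content_swhid(data: bytes) -> str:
    """swh:1:cnt SWHID of a file's bytes (same construction as a git blob hash)."""
    digest = hashlib.sha1(b"blob %d\x00%b" % (len(data), data)).hexdigest()
    return f"swh:1:cnt:{digest}"

def modified_files(root: Path, baseline: dict[str, str]) -> list[str]:
    """Paths whose current content SWHID differs from the recorded baseline."""
    return [rel for rel, swhid in baseline.items()
            if content_swhid((root / rel).read_bytes()) != swhid]

root = Path(tempfile.mkdtemp())
(root / "mod.py").write_bytes(b"hello\n")
baseline = {"mod.py": content_swhid(b"hello\n")}   # recorded at install time
print(modified_files(root, baseline))   # [] -- file untouched
(root / "mod.py").write_bytes(b"patched\n")
print(modified_files(root, baseline))   # ['mod.py'] -- local edit detected
```

The PURL is identical before and after the edit; only the recomputed SWHID changes.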

Given these ecosystem differences, should the tool define separate
verification strategies per ecosystem with a common output format, or
should it try to use a single unified mapping model?
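To make the first option concrete, here is one way it could look; all names (`VerificationResult`, `EcosystemVerifier`, the fields) are hypothetical, and the crates implementation is a stub returning placeholder SWHIDs.

```python
from __future__ import annotations
from dataclasses import dataclass
from typing import Protocol

@dataclass
class VerificationResult:
    """Common output format shared by all ecosystem strategies (hypothetical)."""
    purl: str
    artifact_swhid: str        # SWHID of the artifact as distributed
    source_swhid: str | None   # SWHID of the matching source tree, if resolved
    matches_source: bool
    notes: list[str]           # ecosystem-specific caveats

class EcosystemVerifier(Protocol):
    def verify(self, purl: str) -> VerificationResult: ...

class CratesVerifier:
    """Stub: would diff the .crate tarball against the commit in .cargo_vcs_info.json."""
    def verify(self, purl: str) -> VerificationResult:
        return VerificationResult(purl, "swh:1:dir:...", "swh:1:dir:...",
                                  True, ["registry adds .cargo_vcs_info.json"])

result = CratesVerifier().verify("pkg:cargo/serde@1.0.203")
print(result.matches_source)
```

A PyPI strategy would plug in behind the same protocol but could return several artifact SWHIDs per PURL, which is where the unified-model alternative gets strained.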

Best regards,
Odysseas Kalaitsidis



On Mon, Mar 9, 2026 at 11:24 PM, Alexios Zavras <zvr+gsoc [ at ] zvr [ dot ] gr>
wrote:

> [please always use the gsoc-developers list for communicating.
> Others might also be helped by what is being discussed.]
>
> As I've written before, you should really think
> of what "SWHID" actually identifies.
> As you might have discovered, the single command
> "pip install pytorch" results in completely different files
> being installed, depending on the environment this is run.
>
> So, your approach of "one SWHID for the source
> and _one_ SWHID for the package" still falls short
> and cannot describe what is really happening.
>
>
> On Fri, Mar 6, 2026, at 02:53, Odysseas Kalaitsidis wrote:
> > Dear Dr. Zavras,
> >
> > Thank you for the feedback; you're right, I had only scratched the
> > surface.
> >
> > After thinking more about the mismatch I found with certifi, I realize
> > the problem gets much harder in other cases. Packages with C extensions
> > like numpy would have generated files and dependencies that don't exist
> > in the upstream repo. On Crates.io, the registry adds metadata that
> > doesn't match the Git tag. And Maven is a different problem: Central
> > hosts compiled JARs, so you'd need to go from a binary back to the
> > source.
> >
> > This made me rethink my approach. Trying to strip everything to force a
> > single SWHID match feels fragile and doesn't really scale. Instead of
> > forcing one SWHID, I think it makes more sense to give one to the
> > package and one to the source, and just record which source the package
> > came from. These links can be expressed with SPDX 3.0 relationships,
> > and the data would also fit well into an RDF-based system like the one
> > on the ideas list.
> >
> > I'm reworking my proposal around this. Would it be ok to share a
> > revised draft early next week?
> >
> > Best regards,
> > Odysseas Kalaitsidis
> >
> >
> >
> > On Thu, Feb 26, 2026 at 4:00 PM, Alexios Zavras
> > <zvr+gsoc [ at ] zvr [ dot ] gr> wrote:
> >> Hi Odysseas, thanks for your interest.
> >>
> >> You've correctly identified a major issue with matching
> >> "intrinsic" identifiers (like SWHID) with "extrinsic" ones
> >> (like URL or DOI or PURL): what is the actual thing
> >> that we are trying to identify?
> >> Is it only the software? Does it include files like README and LICENSE?
> >> It's even worse in the case where the package (in PyPI in your case)
> >> includes compiled binaries.
> >>
> >> And of course extrinsic identifiers are almost useless
> >> when one introduces local modifications.
> >>
> >> I believe you've just scratched the surface of the problem complexity
> >> and I'd invite you to explore more real-world use cases.
> >>
> >> On Wed, Feb 25, 2026, at 03:14, Odysseas Kalaitsidis wrote:
> >> > Dear Dr. Zavras and mentors,
> >> >
> >> > I hope you are well. Following our conversation at FOSDEM, I took a
> >> > deeper look into the Software Heritage specifications, particularly
> >> > around artifact identification for SBOM mapping. From this exploration,
> >> > I identified reliable artifact identification as a key bottleneck for
> >> > effective SBOM validation, which led me to focus on the idea of using
> >> > SWHIDs to identify software components.
> >> >
> >> > To better understand the practical challenges, I built a working
> >> > prototype targeting the PyPI ecosystem, available at
> >> > https://github.com/OdysseasKalaitsidis/SWHID_POC. The tool fetches a
> >> > package source distribution (sdist) from PyPI, computes its SWHID
> >> > locally using swh.model, and verifies it against the Software Heritage
> >> > archive API.
> >> >
> >> > My experiments revealed an important technical challenge. While pure
> >> > Python packages such as six 1.17.0 verified successfully, others such
> >> > as certifi failed validation. The root cause appears to be build-time
> >> > metadata included in PyPI distributions, such as .egg-info directories
> >> > or PKG-INFO files, which are absent from the upstream source tree
> >> > archived by Software Heritage. This suggests that a robust
> >> > implementation requires ecosystem-specific normalization before
> >> > hashing, stripping build artifacts to align distributions with their
> >> > canonical source representation.
> >> >
> >> > Given this complexity, I believe the project aligns better with a Large
> >> > scope. The core deliverable would be ecosystem-specific normalization
> >> > engines for PyPI and Crates.io, with Maven as a stretch goal due to the
> >> > additional source-to-binary JAR mapping complexity. Building on that
> >> > foundation, I would develop a PURL-based CLI tool and a lightweight
> >> > REST API for querying verified mappings, as well as publish a dataset
> >> > of top packages with SPDX 3.0 export support.
> >> >
> >> > This work connects directly to the Unified SBOM Management via RDF
> >> > Database Abstraction project. Verified PURL-to-SWHID mappings would
> >> > allow SBOMs stored in the RDF triplestore to be validated against
> >> > content-addressed source artifacts, effectively closing the gap
> between
> >> > declared components and their archived source code.
> >> >
> >> > Through my previous projects, I have gained practical experience with
> >> > Python data pipelines, Docker-based deployment, and Git workflows,
> >> > which I believe provide a solid foundation for this work. I would
> >> > greatly appreciate your feedback on the direction and scope.
> >> >
> >> > I also have one architectural question for the proposal stage: should
> >> > the tool treat the sdist SWHID from PyPI and the git-tag SWHID from the
> >> > upstream repository as two distinct valid mappings, or should we aim to
> >> > canonicalize everything to the git-based SWHID? This decision will
> >> > significantly influence the schema design.
> >> >
> >> > Best regards,
> >> > Odysseas Kalaitsidis
> >> >
> >> > ----
> >> > You are receiving this message from the list: a mailing and discussion
> >> > list for student developers & mentors of Google Summer of Code projects.
> >> > https://lists.ellak.gr/gsoc-developers/listinfo.html
> >> > You can unsubscribe from the list by sending an empty email to
> >> > <gsoc-developers+unsubscribe [ at ] ellak [ dot ] gr>.
> >>
> >> --
> >> -- zvr -
>
> --
> -- zvr -
>
