EELLAK - Mailing Lists

Re: Interest in "Using SWHID to Identify Software Components": POC & Architecture

Hi Alexios,

That makes sense: a unified interface where the user provides a PURL and
gets back a verified SWHID mapping, with the internals adapting to each
ecosystem.

What I found interesting is that some ecosystems already provide built-in
provenance links. On Crates.io, every crate includes .cargo_vcs_info.json
with the source commit hash. On PyPI, PEP 740 attestations can serve a
similar role for packages that use Trusted Publishing. These could act as
ecosystem-specific verification paths feeding into a common output format.

I will structure my proposal around this direction.

Best regards,

Odysseas Kalaitsidis


On Fri, Mar 13, 2026 at 1:27 AM, Alexios Zavras <zvr+gsoc [ at ] zvr [ dot ] gr>
wrote:

> Hi Odysseas
>
> You are correct to observe that every language ecosystem
> has its own peculiarities, and that the problem is completely different
> in each one of them.
> There are even variations within each ecosystem --
> e.g. Python packages consisting solely of Python code
> are much easier to handle than ones with compiled parts.
>
> The goal of the project is to have a unified approach
> to using SWHID to identify software components.
> But, as you found out, the implementation will eventually
> be very different, depending on the package manager case.
>
>
> On Wed, Mar 11, 2026, at 13:54, Odysseas Kalaitsidis wrote:
> > Hi Alexios,
> >
> > Following your feedback, I spent the last few days exploring real-world
> > cases in more depth, starting with the pytorch example you mentioned.
> >
> > I queried the PyPI API for torch 2.6.0 and found 20 wheels and zero
> > source distributions. The Linux x86 wheel is 731 MB because it includes
> > CUDA libraries, while the macOS ARM wheel is only 63 MB since it uses
> > Metal. Furthermore, download.pytorch.org has additional variants for
> > CPU-only, CUDA 11.8, 12.1, 12.4, and ROCm. So a single torch release
> > has dozens of distinct binary artifacts, each with different content,
> > each producing a different SWHID. A PURL like pkg:pypi/torch@2.6.0 does
> > not map to one artifact. It maps to a whole graph of them.
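The fan-out described above can be made concrete against the PyPI JSON API (https://pypi.org/pypi/&lt;name&gt;/&lt;version&gt;/json), whose "urls" array lists one entry per uploaded file. A rough sketch; the two sample entries are illustrative stand-ins, not the real torch file list:

```python
from collections import Counter

def summarize_release(urls: list[dict]) -> Counter:
    """Count a release's artifacts by packagetype, as reported in the
    PyPI JSON API's "urls" array (one entry per uploaded file)."""
    return Counter(f["packagetype"] for f in urls)

# Illustrative stand-in for the API's "urls" field.
sample_urls = [
    {"filename": "torch-2.6.0-cp312-manylinux1_x86_64.whl",
     "packagetype": "bdist_wheel"},
    {"filename": "torch-2.6.0-cp312-macosx_11_0_arm64.whl",
     "packagetype": "bdist_wheel"},
]

print(summarize_release(sample_urls))  # Counter({'bdist_wheel': 2})
# A release with no "sdist" entry is exactly the torch situation:
# every mappable artifact is a platform-specific wheel.
```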
> >
> > I also looked into Crates.io. I downloaded serde 1.0.203 from the
> > registry and diffed it against the v1.0.203 git tag. Three files
> > differ. The registry adds .cargo_vcs_info.json which contains the
> > source git commit hash. It also rewrites Cargo.toml and keeps the
> > original as Cargo.toml.orig. All actual source files are identical.
> > This is very different from PyPI. The mismatch is predictable, and the
> > registry even provides a built-in link back to the source commit.
> >
> > So the problem looks different depending on the ecosystem. On PyPI, the
> > gap between source and distribution is wide, with compiled binaries,
> > platform variants, and sometimes no sdist at all. On Crates.io, the gap
> > is small and the registry already gives you a path back to the source.
> >
> > And as you pointed out earlier, for local modifications after
> > installation, the PURL stays the same even though the files on disk
> > have changed. Only computing SWHIDs on the installed files would catch
> > that.
> >
> > Given these ecosystem differences, should the tool define separate
> > verification strategies per ecosystem with a common output format, or
> > should it try to use a single unified mapping model?
> >
> > Best regards,
> > Odysseas Kalaitsidis
> >
> >
> >
> > On Mon, Mar 9, 2026 at 11:24 PM, Alexios Zavras
> > <zvr+gsoc [ at ] zvr [ dot ] gr> wrote:
> >> [please always use the gsoc-developers list for communicating.
> >> Others might also be helped by what is being discussed.]
> >>
> >> As I've written before, you should really think
> >> of what "SWHID" actually identifies.
> >> As you might have discovered, the single command
> >> "pip install pytorch" results in completely different files
> >> being installed, depending on the environment in which it is run.
> >>
> >> So, your approach of "one SWHID for the source
> >> and _one_ SWHID for the package" still falls short
> >> and cannot describe what is really happening.
> >>
> >>
> >> On Fri, Mar 6, 2026, at 02:53, Odysseas Kalaitsidis wrote:
> >> > Dear Dr. Zavras,
> >> >
> >> > Thank you for the feedback; you're right, I had only scratched the
> >> > surface.
> >> >
> >> > After thinking more about the mismatch I found with certifi, I
> >> > realize the problem gets much harder in other cases. Packages with C
> >> > extensions like numpy would have generated files and dependencies
> >> > that don't exist in the upstream repo. On Crates.io, the registry
> >> > adds metadata that doesn't match the Git tag. And Maven is a
> >> > different problem, since Central has compiled JARs, so you'd need to
> >> > go from a binary back to the source.
> >> >
> >> > This made me rethink my approach. Trying to strip everything to
> >> > force a single SWHID match feels fragile and doesn't really scale.
> >> > Instead of forcing one SWHID, I think it makes more sense to give
> >> > one to the package and one to the source, and simply record which
> >> > source the package came from. These links can be expressed with SPDX
> >> > 3.0 relationships, and the data would also fit well into an
> >> > RDF-based system like the one on the ideas list.
> >> >
> >> > I'm reworking my proposal around this. Would it be OK to share a
> >> > revised draft early next week?
> >> >
> >> > Best regards,
> >> > Odysseas Kalaitsidis
> >> >
> >> >
> >> >
> >> > On Thu, Feb 26, 2026 at 4:00 PM, Alexios Zavras
> >> > <zvr+gsoc [ at ] zvr [ dot ] gr> wrote:
> >> >> Hi Odysseas, thanks for your interest.
> >> >>
> >> >> You've correctly identified a major issue with matching
> >> >> "intrinsic" identifiers (like SWHID) with "extrinsic" ones
> >> >> (like URL or DOI or PURL): what is the actual thing
> >> >> that we are trying to identify?
> >> >> Is it only the software? Does it include files like README and
> >> >> LICENSE?
> >> >> It's even worse in the case where the package (on PyPI, in your case)
> >> >> includes compiled binaries.
> >> >>
> >> >> And of course extrinsic identifiers are almost useless
> >> >> when one introduces local modifications.
> >> >>
> >> >> I believe you've just scratched the surface of the problem complexity
> >> >> and I'd invite you to explore more real-world use cases.
> >> >>
> >> >> On Wed, Feb 25, 2026, at 03:14, Odysseas Kalaitsidis wrote:
> >> >> > Dear Dr. Zavras and mentors,
> >> >> >
> >> >> > I hope you are well. Following our conversation at FOSDEM, I took
> >> >> > a deeper look into the Software Heritage specifications,
> >> >> > particularly around artifact identification for SBOM mapping. From
> >> >> > this exploration, I identified reliable artifact identification as
> >> >> > a key bottleneck for effective SBOM validation, which led me to
> >> >> > focus on the idea of using SWHIDs to identify software components.
> >> >> >
> >> >> > To better understand the practical challenges, I built a working
> >> >> > prototype targeting the PyPI ecosystem, available at
> >> >> > https://github.com/OdysseasKalaitsidis/SWHID_POC. The tool fetches
> >> >> > a package's source distribution (sdist) from PyPI, computes its
> >> >> > SWHID locally using swh.model, and verifies it against the
> >> >> > Software Heritage archive API.
> >> >> >
> >> >> > My experiments revealed an important technical challenge. While
> >> >> > pure Python packages such as six 1.17.0 verified successfully,
> >> >> > others such as certifi failed validation. The root cause appears
> >> >> > to be build-time metadata included in PyPI distributions, such as
> >> >> > .egg-info directories or PKG-INFO files, which are absent from the
> >> >> > upstream source tree archived by Software Heritage. This suggests
> >> >> > that a robust implementation requires ecosystem-specific
> >> >> > normalization before hashing, stripping build artifacts to align
> >> >> > distributions with their canonical source representation.
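The normalization step described above could start as a simple path filter before hashing. A sketch under stated assumptions: the ignore patterns below are a guess at a reasonable starting set for PyPI sdists, and a real implementation would need per-ecosystem tuning:

```python
import fnmatch

# Hypothetical ignore set for PyPI sdists; would need tuning per ecosystem.
STRIP_PATTERNS = ["*.egg-info", "*.egg-info/*", "PKG-INFO"]

def normalize_paths(paths: list[str]) -> list[str]:
    """Drop build-time metadata paths so the sdist tree can be hashed
    against the upstream source tree archived by Software Heritage."""
    return [
        p for p in paths
        if not any(fnmatch.fnmatch(p, pat) for pat in STRIP_PATTERNS)
    ]

sdist = ["certifi/core.py", "certifi.egg-info/PKG-INFO", "PKG-INFO", "setup.py"]
print(normalize_paths(sdist))  # → ['certifi/core.py', 'setup.py']
```

Only the surviving paths would feed into the directory-SWHID computation, which is what would let certifi-style packages match their archived source.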
> >> >> >
> >> >> > Given this complexity, I believe the project aligns better with a
> >> >> > Large scope. The core deliverable would be ecosystem-specific
> >> >> > normalization engines for PyPI and Crates.io, with Maven as a
> >> >> > stretch goal due to the additional source-to-binary JAR mapping
> >> >> > complexity. Building on that foundation, I would develop a
> >> >> > PURL-based CLI tool and a lightweight REST API for querying
> >> >> > verified mappings, as well as publish a dataset of top packages
> >> >> > with SPDX 3.0 export support.
> >> >> >
> >> >> > This work connects directly to the Unified SBOM Management via RDF
> >> >> > Database Abstraction project. Verified PURL-to-SWHID mappings
> >> >> > would allow SBOMs stored in the RDF triplestore to be validated
> >> >> > against content-addressed source artifacts, effectively closing
> >> >> > the gap between declared components and their archived source
> >> >> > code.
> >> >> >
> >> >> > Through my previous projects, I have gained practical experience
> >> >> > with Python data pipelines, Docker-based deployment, and Git
> >> >> > workflows, which I believe provide a solid foundation for this
> >> >> > work. I would greatly appreciate your feedback on the direction
> >> >> > and scope.
> >> >> >
> >> >> > I also have one architectural question for the proposal stage:
> >> >> > should the tool treat the sdist SWHID from PyPI and the git-tag
> >> >> > SWHID from the upstream repository as two distinct valid mappings,
> >> >> > or should we aim to canonicalize everything to the git-based
> >> >> > SWHID? This decision will significantly influence the schema
> >> >> > design.
> >> >> >
> >> >> > Best regards,
> >> >> > Odysseas Kalaitsidis
> >> >> >
> >> >> > ----
> >> >> > You are receiving this message from the list: a mailing and
> >> >> > discussion list for student developers and mentors of Google
> >> >> > Summer of Code projects,
> >> >> > https://lists.ellak.gr/gsoc-developers/listinfo.html
> >> >> > You can unsubscribe from the list by sending an empty email to
> >> >> > <gsoc-developers+unsubscribe [ at ] ellak [ dot ] gr>.
> >> >>
> >> >> --
> >> >> -- zvr -
> >>
> >> --
> >> -- zvr -
>
> --
> -- zvr -
>
