ELLAK - Mailing Lists

Re: Interest in "Using SWHID to Identify Software Components": POC & Architecture

Hi Alexios,

I have submitted my proposal on the GSoC portal.

The PoC has been updated with attestation verification for PyPI, in
addition to the existing Crates.io and Maven pipelines.

https://github.com/OdysseasKalaitsidis/SWHID_POC

Happy to discuss any questions.

Best Regards,
Odysseas Kalaitsidis


On Fri, Mar 13, 2026 at 12:01 PM, Odysseas Kalaitsidis <
odysseaskalaitsides [ at ] gmail [ dot ] com> wrote:

> Hi Alexios,
>
> That makes sense: a unified interface where the user provides a PURL and
> gets back a verified SWHID mapping, with the internals adapting to each
> ecosystem.
>
> What I found interesting is that some ecosystems already provide built-in
> provenance links. On Crates.io, every crate includes .cargo_vcs_info.json
> with the source commit hash. On PyPI, PEP 740 attestations can serve a
> similar role for packages that use Trusted Publishing. These could act as
> ecosystem-specific verification paths feeding into a common output format.
>
> I will structure my proposal around this direction.
>
> Best regards,
>
> Odysseas Kalaitsidis
>
>
> On Fri, Mar 13, 2026 at 1:27 AM, Alexios Zavras <zvr+gsoc [ at ] zvr [ dot ] gr>
> wrote:
>
>> Hi Odysseas
>>
>> You are correct to observe that every language ecosystem
>> has its own peculiarities, and the problem is completely different
>> in each one of them.
>> There are even variations within each ecosystem --
>> e.g. Python packages composed solely of Python code
>> are much easier to handle than ones with compiled parts.
>>
>> The goal of the project is to have a unified approach
>> to using SWHID to identify software components.
>> But, as you found out, the implementation will eventually
>> be very different depending on the package manager.
>>
>>
>> On Wed, Mar 11, 2026, at 13:54, Odysseas Kalaitsidis wrote:
>> > Hi Alexios,
>> >
>> > Following your feedback, I spent the last few days exploring real-world
>> > cases in more depth, starting with the pytorch example you mentioned.
>> >
>> > I queried the PyPI API for torch 2.6.0 and found 20 wheels and zero
>> > source distributions. The Linux x86 wheel is 731 MB because it includes
>> > CUDA libraries, while the macOS ARM wheel is only 63 MB since it uses
>> > Metal. Furthermore, download.pytorch.org has additional variants for
>> > CPU-only, CUDA 11.8, 12.1, 12.4, and ROCm. So a single torch release
>> > has dozens of distinct binary artifacts, each with different content,
>> > each producing a different SWHID. A PURL like pkg:pypi/torch@2.6.0 does
>> > not map to one artifact. It maps to a whole graph of them.
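One way to model that one-PURL-to-many-artifacts observation is a release record holding every artifact with its own SWHID; the names below (`Artifact`, `ReleaseMapping`) and the SWHID values are illustrative, not taken from the PoC:

```python
# Sketch: a single PURL resolves to many binary artifacts, each with
# its own content-derived SWHID. Class names and hashes are illustrative.
from dataclasses import dataclass, field


@dataclass
class Artifact:
    filename: str  # e.g. a wheel or sdist name
    swhid: str     # identifier of this artifact's content


@dataclass
class ReleaseMapping:
    purl: str                          # e.g. "pkg:pypi/torch@2.6.0"
    artifacts: list[Artifact] = field(default_factory=list)

    def swhids(self) -> set[str]:
        return {a.swhid for a in self.artifacts}


mapping = ReleaseMapping("pkg:pypi/torch@2.6.0")
mapping.artifacts.append(
    Artifact("torch-2.6.0-cp312-linux_x86_64.whl", "swh:1:dir:aaa"))
mapping.artifacts.append(
    Artifact("torch-2.6.0-cp312-macosx_arm64.whl", "swh:1:dir:bbb"))
```

A lookup for the PURL then returns the whole set of SWHIDs rather than pretending a single one exists.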
>> >
>> > I also looked into Crates.io. I downloaded serde 1.0.203 from the
>> > registry and diffed it against the v1.0.203 git tag. Three files
>> > differ. The registry adds .cargo_vcs_info.json which contains the
>> > source git commit hash. It also rewrites Cargo.toml and keeps the
>> > original as Cargo.toml.orig. All actual source files are identical.
>> > This is very different from PyPI. The mismatch is predictable, and the
>> > registry even provides a built-in link back to the source commit.
>> >
>> > So the problem looks different depending on the ecosystem. On PyPI, the
>> > gap between source and distribution is wide, with compiled binaries,
>> > platform variants, and sometimes no sdist at all. On Crates.io, the gap
>> > is small and the registry already gives you a path back to the source.
>> >
>> > And as you pointed out earlier, for local modifications after
>> > installation, the PURL stays the same even though the files on disk
>> > have changed. Only computing SWHIDs on the installed files would catch
>> > that.
>> >
>> > Given these ecosystem differences, should the tool define separate
>> > verification strategies per ecosystem with a common output format, or
>> > should it try to use a single unified mapping model?
>> >
>> > Best regards,
>> > Odysseas Kalaitsidis
>> >
>> >
>> >
>> > On Mon, Mar 9, 2026 at 11:24 PM, Alexios Zavras
>> > <zvr+gsoc [ at ] zvr [ dot ] gr> wrote:
>> >> [please always use the gsoc-developers list for communicating.
>> >> Others might also be helped by what is being discussed.]
>> >>
>> >> As I've written before, you should really think
>> >> about what "SWHID" actually identifies.
>> >> As you might have discovered, the single command
>> >> "pip install pytorch" results in completely different files
>> >> being installed, depending on the environment in which it is run.
>> >>
>> >> So, your approach of "one SWHID for the source
>> >> and _one_ SWHID for the package" still falls short
>> >> and cannot describe what is really happening.
>> >>
>> >>
>> >> On Fri, Mar 6, 2026, at 02:53, Odysseas Kalaitsidis wrote:
>> >> > Dear Dr. Zavras,
>> >> >
>> >> > Thank you for the feedback; you're right, I had only scratched the
>> >> > surface.
>> >> >
>> >> > After thinking more about the mismatch I found with certifi, I realize
>> >> > the problem gets much harder in other cases. Packages with C extensions
>> >> > like numpy would have generated files and dependencies that don't exist
>> >> > in the upstream repo. On Crates.io, the registry adds metadata that
>> >> > doesn't match the Git tag. And Maven is a different problem, since
>> >> > Central has compiled JARs, so you'd need to go from a binary back to
>> >> > the source.
>> >> >
>> >> > This made me rethink my approach. Trying to strip everything to force a
>> >> > single SWHID match feels fragile and doesn't really scale. Instead of
>> >> > forcing one SWHID, I think it makes more sense to give one to the
>> >> > package and one to the source, and just record which source the package
>> >> > came from. These links can be expressed with SPDX 3.0 relationships,
>> >> > and the data would also fit well into an RDF-based system like the one
>> >> > on the ideas list.
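The package-to-source link proposed above could be recorded as an SPDX 3.0-style relationship element. The sketch below follows SPDX 3.0's Relationship shape (`relationshipType`, `from`, `to`), but the exact property spellings and the choice of `generates` as the type should be treated as assumptions to verify against the SPDX 3.0 model:

```python
# Sketch: record "this package artifact was generated from this source"
# as an SPDX 3.0-style relationship, serialized as a plain dict.
# Property names and the "generates" type are assumptions to verify
# against the SPDX 3.0 specification.
def source_relationship(package_swhid: str, source_swhid: str) -> dict:
    return {
        "type": "Relationship",
        "relationshipType": "generates",  # the source generates the package
        "from": source_swhid,
        "to": [package_swhid],
    }
```

A collection of such dicts is trivially loadable into an RDF store, which is what makes the link to the ideas-list project natural.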
>> >> >
>> >> > I'm reworking my proposal around this. Would it be ok to share a
>> >> > revised draft early next week?
>> >> >
>> >> > Best regards,
>> >> > Odysseas Kalaitsidis
>> >> >
>> >> >
>> >> >
>> >> > On Thu, Feb 26, 2026 at 4:00 PM, Alexios Zavras
>> >> > <zvr+gsoc [ at ] zvr [ dot ] gr> wrote:
>> >> >> Hi Odysseas, thanks for your interest.
>> >> >>
>> >> >> You've correctly identified a major issue with matching
>> >> >> "intrinsic" identifiers (like SWHID) with "extrinsic" ones
>> >> >> (like URL or DOI or PURL): what is the actual thing
>> >> >> that we are trying to identify?
>> >> >> Is it only the software? Does it include files like README and
>> >> >> LICENSE?
>> >> >> It's even worse in the case where the package (in PyPI in your case)
>> >> >> includes compiled binaries.
>> >> >>
>> >> >> And of course extrinsic identifiers are almost useless
>> >> >> when one introduces local modifications.
>> >> >>
>> >> >> I believe you've just scratched the surface of the problem's
>> >> >> complexity, and I'd invite you to explore more real-world use cases.
>> >> >>
>> >> >> On Wed, Feb 25, 2026, at 03:14, Odysseas Kalaitsidis wrote:
>> >> >> > Dear Dr. Zavras and mentors,
>> >> >> >
>> >> >> > I hope you are well. Following our conversation at FOSDEM, I took a
>> >> >> > deeper look into the Software Heritage specifications, particularly
>> >> >> > around artifact identification for SBOM mapping. From this
>> >> >> > exploration, I identified reliable artifact identification as a key
>> >> >> > bottleneck for effective SBOM validation, which led me to focus on
>> >> >> > the idea of using SWHIDs to identify software components.
>> >> >> >
>> >> >> > To better understand the practical challenges, I built a working
>> >> >> > prototype targeting the PyPI ecosystem, available at
>> >> >> > https://github.com/OdysseasKalaitsidis/SWHID_POC. The tool fetches a
>> >> >> > package source distribution (sdist) from PyPI, computes its SWHID
>> >> >> > locally using swh.model, and verifies it against the Software
>> >> >> > Heritage archive API.
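For a sense of what the local computation involves: the PoC uses swh.model, but the simplest case, a content SWHID (`swh:1:cnt:...`), is just the Git blob SHA-1 of the file's bytes and can be sketched with the standard library alone (directory and revision SWHIDs need the fuller Merkle construction that swh.model provides):

```python
# Sketch: a SWHID for raw file content (swh:1:cnt:...) is the Git blob
# hash of the bytes: sha1(b"blob <len>\x00" + data). The PoC itself
# uses swh.model, which also covers directory/revision objects.
import hashlib


def content_swhid(data: bytes) -> str:
    header = b"blob %d\x00" % len(data)
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()
```

Because the scheme is Git-compatible, the same bytes hash identically whether they arrive via an sdist, a crate, or a repository checkout, which is what makes cross-ecosystem verification possible at all.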
>> >> >> >
>> >> >> > My experiments revealed an important technical challenge. While
>> >> >> > pure Python packages such as six 1.17.0 verified successfully,
>> >> >> > others such as certifi failed validation. The root cause appears to
>> >> >> > be build-time metadata included in PyPI distributions, such as
>> >> >> > .egg-info directories or PKG-INFO files, which are absent from the
>> >> >> > upstream source tree archived by Software Heritage. This suggests
>> >> >> > that a robust implementation requires ecosystem-specific
>> >> >> > normalization before hashing: stripping build artifacts to align
>> >> >> > distributions with their canonical source representation.
>> >> >> >
>> >> >> > Given this complexity, I believe the project aligns better with a
>> >> >> > Large scope. The core deliverable would be ecosystem-specific
>> >> >> > normalization engines for PyPI and Crates.io, with Maven as a
>> >> >> > stretch goal due to the additional source-to-binary JAR mapping
>> >> >> > complexity. Building on that foundation, I would develop a
>> >> >> > PURL-based CLI tool and a lightweight REST API for querying verified
>> >> >> > mappings, as well as publish a dataset of top packages with SPDX 3.0
>> >> >> > export support.
>> >> >> >
>> >> >> > This work connects directly to the Unified SBOM Management via RDF
>> >> >> > Database Abstraction project. Verified PURL-to-SWHID mappings would
>> >> >> > allow SBOMs stored in the RDF triplestore to be validated against
>> >> >> > content-addressed source artifacts, effectively closing the gap
>> >> >> > between declared components and their archived source code.
>> >> >> >
>> >> >> > Through my previous projects, I have gained practical experience
>> >> >> > with Python data pipelines, Docker-based deployment, and Git
>> >> >> > workflows, which I believe provide a solid foundation for this work.
>> >> >> > I would greatly appreciate your feedback on the direction and scope.
>> >> >> >
>> >> >> > I also have one architectural question for the proposal stage:
>> >> >> > should the tool treat the sdist SWHID from PyPI and the git-tag
>> >> >> > SWHID from the upstream repository as two distinct valid mappings,
>> >> >> > or should we aim to canonicalize everything to the git-based SWHID?
>> >> >> > This decision will significantly influence the schema design.
>> >> >> >
>> >> >> > Best regards,
>> >> >> > Odysseas Kalaitsidis
>> >> >> >
>> >> >>
>> >> >> --
>> >> >> -- zvr -
>> >>
>> >> --
>> >> -- zvr -
>>
>> --
>> -- zvr -
>>
>
----
You are receiving this message from the list: a discussion list for student developers and mentors of Google Summer of Code projects.
https://lists.ellak.gr/gsoc-developers/listinfo.html
You can unsubscribe from the list by sending an empty email to <gsoc-developers+unsubscribe [ at ] ellak [ dot ] gr>.
