Hi Alexios,

I have submitted my proposal on the GSoC portal. The PoC has been updated
with attestation verification for PyPI, in addition to the existing
Crates.io and Maven pipelines.
https://github.com/OdysseasKalaitsidis/SWHID_POC

Happy to discuss any questions.

Best regards,
Odysseas Kalaitsidis

On Fri, Mar 13, 2026 at 12:01 PM, Odysseas Kalaitsidis
<odysseaskalaitsides [ at ] gmail [ dot ] com> wrote:
> Hi Alexios,
>
> That makes sense. A unified interface where the user provides a PURL and
> gets back a verified SWHID mapping, with the internals adapting to each
> ecosystem.
>
> What I found interesting is that some ecosystems already provide built-in
> provenance links. On Crates.io, every crate includes .cargo_vcs_info.json
> with the source commit hash. On PyPI, PEP 740 attestations can serve a
> similar role for packages that use Trusted Publishing. These could act as
> ecosystem-specific verification paths feeding into a common output format.
>
> I will structure my proposal around this direction.
>
> Best regards,
> Odysseas Kalaitsidis
>
> On Fri, Mar 13, 2026 at 1:27 AM, Alexios Zavras <zvr+gsoc [ at ] zvr [ dot ] gr>
> wrote:
>> Hi Odysseas,
>>
>> You are correct when you observe that every language ecosystem has its
>> own peculiarities and that the problem is completely different in each
>> of them. There are even variations within each ecosystem -- e.g. Python
>> packages consisting solely of Python code are much easier to handle
>> than ones with compiled parts.
>>
>> The goal of the project is to have a unified approach to using SWHIDs
>> to identify software components. But, as you found out, the
>> implementation will eventually be very different, depending on the
>> package manager.
>>
>> On Wed, Mar 11, 2026, at 13:54, Odysseas Kalaitsidis wrote:
>> > Hi Alexios,
>> >
>> > Following your feedback, I spent the last few days exploring
>> > real-world cases in more depth, starting with the pytorch example you
>> > mentioned.
>> >
>> > I queried the PyPI API for torch 2.6.0 and found 20 wheels and zero
>> > source distributions. The Linux x86 wheel is 731 MB because it
>> > includes CUDA libraries, while the macOS ARM wheel is only 63 MB
>> > since it uses Metal. Furthermore, download.pytorch.org has additional
>> > variants for CPU-only, CUDA 11.8, 12.1, 12.4, and ROCm. So a single
>> > torch release has dozens of distinct binary artifacts, each with
>> > different content, each producing a different SWHID. A PURL like
>> > pkg:pypi/torch@2.6.0 does not map to one artifact. It maps to a whole
>> > graph of them.
>> >
>> > I also looked into Crates.io. I downloaded serde 1.0.203 from the
>> > registry and diffed it against the v1.0.203 git tag. Three files
>> > differ. The registry adds .cargo_vcs_info.json, which contains the
>> > source git commit hash. It also rewrites Cargo.toml and keeps the
>> > original as Cargo.toml.orig. All actual source files are identical.
>> > This is very different from PyPI. The mismatch is predictable, and
>> > the registry even provides a built-in link back to the source commit.
>> >
>> > So the problem looks different depending on the ecosystem. On PyPI,
>> > the gap between source and distribution is wide, with compiled
>> > binaries, platform variants, and sometimes no sdist at all. On
>> > Crates.io, the gap is small and the registry already gives you a path
>> > back to the source.
>> >
>> > And as you pointed out earlier, for local modifications after
>> > installation, the PURL stays the same even though the files on disk
>> > have changed. Only computing SWHIDs on the installed files would
>> > catch that.
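The wheel census described above comes from the PyPI JSON API (https://pypi.org/pypi/torch/2.6.0/json). A minimal sketch of the classification step, run here on an abbreviated hand-written response rather than a live request; the helper name and sample entries are illustrative, while the "urls"/"packagetype" fields are the API's real shape:

```python
from collections import Counter

def classify_release_files(release_json: dict) -> Counter:
    """Count artifact types ('bdist_wheel' vs 'sdist') in one PyPI release.

    `release_json` is the parsed body of
    https://pypi.org/pypi/<project>/<version>/json; each entry of its
    "urls" list describes one downloadable artifact.
    """
    return Counter(entry["packagetype"] for entry in release_json.get("urls", []))

# Abbreviated, hand-written stand-in for the torch 2.6.0 response
# (the real response lists 20 wheels and no sdist).
sample = {
    "urls": [
        {"filename": "torch-2.6.0-cp312-cp312-manylinux1_x86_64.whl",
         "packagetype": "bdist_wheel"},
        {"filename": "torch-2.6.0-cp312-none-macosx_11_0_arm64.whl",
         "packagetype": "bdist_wheel"},
    ]
}
counts = classify_release_files(sample)
print(counts["bdist_wheel"], counts["sdist"])  # → 2 0
```

A release with no "sdist" entry at all is exactly the case where the PURL cannot be tied to a single source artifact.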
>> >
>> > Given these ecosystem differences, should the tool define separate
>> > verification strategies per ecosystem with a common output format, or
>> > should it try to use a single unified mapping model?
>> >
>> > Best regards,
>> > Odysseas Kalaitsidis
>> >
>> > On Mon, Mar 9, 2026 at 11:24 PM, Alexios Zavras
>> > <zvr+gsoc [ at ] zvr [ dot ] gr> wrote:
>> >> [please always use the gsoc-developers list for communicating.
>> >> Others might also be helped by what is being discussed.]
>> >>
>> >> As I've written before, you should really think about what "SWHID"
>> >> actually identifies. As you might have discovered, the single
>> >> command "pip install pytorch" results in completely different files
>> >> being installed, depending on the environment it is run in.
>> >>
>> >> So, your approach of "one SWHID for the source and _one_ SWHID for
>> >> the package" still falls short and cannot describe what is really
>> >> happening.
>> >>
>> >> On Fri, Mar 6, 2026, at 02:53, Odysseas Kalaitsidis wrote:
>> >> > Dear Dr. Zavras,
>> >> >
>> >> > Thank you for the feedback. You're right, I had only scratched the
>> >> > surface.
>> >> >
>> >> > After thinking more about the mismatch I found with certifi, I
>> >> > realize the problem gets much harder in other cases. Packages with
>> >> > C extensions like numpy would have generated files and
>> >> > dependencies that don't exist in the upstream repo. On Crates.io,
>> >> > the registry adds metadata that doesn't match the Git tag. And
>> >> > Maven is a different problem, since Central has compiled JARs, so
>> >> > you'd need to go from a binary back to the source.
>> >> >
>> >> > This made me rethink my approach. Trying to strip everything to
>> >> > force a single SWHID match feels fragile and doesn't really scale.
>> >> > Instead of forcing one SWHID, I think it makes more sense to give
>> >> > one to the package and one to the source, and just record which
>> >> > source the package came from. These links can be expressed with
>> >> > SPDX 3.0 relationships, and the data would also fit well into an
>> >> > RDF-based system like the one on the ideas list.
>> >> >
>> >> > I'm reworking my proposal around this. Would it be ok to share a
>> >> > revised draft early next week?
>> >> >
>> >> > Best regards,
>> >> > Odysseas Kalaitsidis
>> >> >
>> >> > On Thu, Feb 26, 2026 at 4:00 PM, Alexios Zavras
>> >> > <zvr+gsoc [ at ] zvr [ dot ] gr> wrote:
>> >> >> Hi Odysseas, thanks for your interest.
>> >> >>
>> >> >> You've correctly identified a major issue with matching
>> >> >> "intrinsic" identifiers (like SWHID) with "extrinsic" ones (like
>> >> >> URL or DOI or PURL): what is the actual thing that we are trying
>> >> >> to identify? Is it only the software? Does it include files like
>> >> >> README and LICENSE? It's even worse in the case where the package
>> >> >> (in PyPI in your case) includes compiled binaries.
>> >> >>
>> >> >> And of course extrinsic identifiers are almost useless when one
>> >> >> introduces local modifications.
>> >> >>
>> >> >> I believe you've just scratched the surface of the problem's
>> >> >> complexity and I'd invite you to explore more real-world use
>> >> >> cases.
>> >> >>
>> >> >> On Wed, Feb 25, 2026, at 03:14, Odysseas Kalaitsidis wrote:
>> >> >> > Dear Dr. Zavras and mentors,
>> >> >> >
>> >> >> > I hope you are well. Following our conversation at FOSDEM, I
>> >> >> > took a deeper look into the Software Heritage specifications,
>> >> >> > particularly around artifact identification for SBOM mapping.
>> >> >> > From this exploration, I identified reliable artifact
>> >> >> > identification as a key bottleneck for effective SBOM
>> >> >> > validation, which led me to focus on the idea of using SWHIDs
>> >> >> > to identify software components.
>> >> >> >
>> >> >> > To better understand the practical challenges, I built a
>> >> >> > working prototype targeting the PyPI ecosystem, available at
>> >> >> > https://github.com/OdysseasKalaitsidis/SWHID_POC. The tool
>> >> >> > fetches a package source distribution (sdist) from PyPI,
>> >> >> > computes its SWHID locally using swh.model, and verifies it
>> >> >> > against the Software Heritage archive API.
>> >> >> >
>> >> >> > My experiments revealed an important technical challenge. While
>> >> >> > pure Python packages such as six 1.17.0 verified successfully,
>> >> >> > others such as certifi failed validation. The root cause
>> >> >> > appears to be build-time metadata included in PyPI
>> >> >> > distributions, such as .egg-info directories or PKG-INFO files,
>> >> >> > which are absent from the upstream source tree archived by
>> >> >> > Software Heritage. This suggests that a robust implementation
>> >> >> > requires ecosystem-specific normalization before hashing:
>> >> >> > stripping build artifacts to align distributions with their
>> >> >> > canonical source representation.
>> >> >> >
>> >> >> > Given this complexity, I believe the project aligns better with
>> >> >> > a Large scope. The core deliverable would be ecosystem-specific
>> >> >> > normalization engines for PyPI and Crates.io, with Maven as a
>> >> >> > stretch goal due to the additional source-to-binary JAR mapping
>> >> >> > complexity. Building on that foundation, I would develop a
>> >> >> > PURL-based CLI tool and a lightweight REST API for querying
>> >> >> > verified mappings, as well as publish a dataset of top packages
>> >> >> > with SPDX 3.0 export support.
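The PoC delegates SWHID computation to swh.model. For a self-contained picture of what is being hashed, here is the simplest case, a single file's content identifier; this is a from-scratch sketch, not the swh.model API. A content SWHID is the sha1 of the git blob encoding of the bytes, and directory identifiers build on it via git's tree encoding, which swh.model handles.

```python
import hashlib

def content_swhid(data: bytes) -> str:
    """Compute the SWHID of a single file's contents (swh:1:cnt:<hash>).

    The hash is the sha1 of the git blob encoding: the header
    b"blob <decimal length>", a NUL byte, then the raw data.
    """
    header = b"blob %d\x00" % len(data)
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()

# The empty file yields git's well-known empty-blob hash.
print(content_swhid(b""))
# → swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Presence in the archive can then be checked against an endpoint such as https://archive.softwareheritage.org/api/1/content/sha1_git:<hash>/, which is what verification after normalization amounts to.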
>> >> >> >
>> >> >> > This work connects directly to the Unified SBOM Management via
>> >> >> > RDF Database Abstraction project. Verified PURL-to-SWHID
>> >> >> > mappings would allow SBOMs stored in the RDF triplestore to be
>> >> >> > validated against content-addressed source artifacts,
>> >> >> > effectively closing the gap between declared components and
>> >> >> > their archived source code.
>> >> >> >
>> >> >> > Through my previous projects, I have gained practical
>> >> >> > experience with Python data pipelines, Docker-based deployment,
>> >> >> > and Git workflows, which I believe provide a solid foundation
>> >> >> > for this work. I would greatly appreciate your feedback on the
>> >> >> > direction and scope.
>> >> >> >
>> >> >> > I also have one architectural question for the proposal stage:
>> >> >> > should the tool treat the sdist SWHID from PyPI and the git-tag
>> >> >> > SWHID from the upstream repository as two distinct valid
>> >> >> > mappings, or should we aim to canonicalize everything to the
>> >> >> > git-based SWHID? This decision will significantly influence the
>> >> >> > schema design.
>> >> >> >
>> >> >> > Best regards,
>> >> >> > Odysseas Kalaitsidis
>> >> >> >
>> >> >> > ----
>> >> >> > You are receiving this message from the gsoc-developers list: a
>> >> >> > discussion list for student developers and mentors of Google
>> >> >> > Summer of Code projects,
>> >> >> > https://lists.ellak.gr/gsoc-developers/listinfo.html
>> >> >> > You can unsubscribe from the list by sending an empty email to
>> >> >> > <gsoc-developers+unsubscribe [ at ] ellak [ dot ] gr>.
>> >> >>
>> >> >> --
>> >> >> -- zvr -
>> >>
>> >> --
>> >> -- zvr -
>>
>> --
>> -- zvr -
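One way to picture the package-vs-source mapping discussed throughout this thread is as a single record linking both SWHIDs. This is an illustrative sketch only: the field names loosely echo the SPDX 3.0 Relationship shape (from/to/type), "generatedFrom" is an assumed relationship name to be checked against the SPDX 3.0 model, and the SWHIDs are placeholders, not real hashes.

```python
import json

# Illustrative mapping record: one verified package->source link.
# Field names and the relationship type are assumptions, not a
# normative SPDX 3.0 serialization; the SWHID hashes are placeholders.
mapping = {
    "purl": "pkg:cargo/serde@1.0.203",
    "package_swhid": "swh:1:dir:" + "0" * 40,  # registry tarball contents
    "source_swhid": "swh:1:rev:" + "1" * 40,   # the tag's source commit
    "relationship": {
        "from": "package_swhid",
        "to": "source_swhid",
        "type": "generatedFrom",  # assumed name; verify against SPDX 3.0
    },
}
print(json.dumps(mapping, indent=2))
```

A record like this keeps both identifiers valid while making the provenance link explicit, which is also the shape an RDF triplestore could ingest directly.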