Hi Alexios,

That makes sense: a unified interface where the user provides a PURL and gets
back a verified SWHID mapping, with the internals adapting to each ecosystem.

What I found interesting is that some ecosystems already provide built-in
provenance links. On Crates.io, every crate includes .cargo_vcs_info.json with
the source commit hash. On PyPI, PEP 740 attestations can serve a similar role
for packages that use Trusted Publishing. These could act as
ecosystem-specific verification paths feeding into a common output format.

I will structure my proposal around this direction.

Best regards,
Odysseas Kalaitsidis

On Fri, Mar 13, 2026 at 1:27 AM, Alexios Zavras <zvr+gsoc [ at ] zvr [ dot ] gr> wrote:
> Hi Odysseas,
>
> You are correct when you observe that every language ecosystem
> has its own peculiarities, and the problem is completely different
> in each one of them.
> There are even variations within each ecosystem --
> e.g. Python packages composed solely of Python code
> are much easier to handle than ones with compiled parts.
>
> The goal of the project is to have a unified approach
> to using SWHID to identify software components.
> But, as you found out, the implementation will eventually
> be very different, depending on the package manager case.
>
>
> On Wed, Mar 11, 2026, at 13:54, Odysseas Kalaitsidis wrote:
> > Hi Alexios,
> >
> > Following your feedback, I spent the last few days exploring real-world
> > cases in more depth, starting with the pytorch example you mentioned.
> >
> > I queried the PyPI API for torch 2.6.0 and found 20 wheels and zero
> > source distributions. The Linux x86 wheel is 731 MB because it includes
> > CUDA libraries, while the macOS ARM wheel is only 63 MB since it uses
> > Metal. Furthermore, download.pytorch.org has additional variants for
> > CPU-only, CUDA 11.8, 12.1, 12.4, and ROCm.
> > So a single torch release has dozens of distinct binary artifacts,
> > each with different content, each producing a different SWHID. A PURL
> > like pkg:pypi/torch@2.6.0 does not map to one artifact. It maps to a
> > whole graph of them.
> >
> > I also looked into Crates.io. I downloaded serde 1.0.203 from the
> > registry and diffed it against the v1.0.203 git tag. Three files
> > differ. The registry adds .cargo_vcs_info.json, which contains the
> > source git commit hash. It also rewrites Cargo.toml and keeps the
> > original as Cargo.toml.orig. All actual source files are identical.
> > This is very different from PyPI. The mismatch is predictable, and the
> > registry even provides a built-in link back to the source commit.
> >
> > So the problem looks different depending on the ecosystem. On PyPI, the
> > gap between source and distribution is wide, with compiled binaries,
> > platform variants, and sometimes no sdist at all. On Crates.io, the gap
> > is small and the registry already gives you a path back to the source.
> >
> > And as you pointed out earlier, for local modifications after
> > installation, the PURL stays the same even though the files on disk
> > have changed. Only computing SWHIDs on the installed files would catch
> > that.
> >
> > Given these ecosystem differences, should the tool define separate
> > verification strategies per ecosystem with a common output format, or
> > should it try to use a single unified mapping model?
> >
> > Best regards,
> > Odysseas Kalaitsidis
> >
> >
> >
> > On Mon, Mar 9, 2026 at 11:24 PM, Alexios Zavras
> > <zvr+gsoc [ at ] zvr [ dot ] gr> wrote:
> >> [Please always use the gsoc-developers list for communicating.
> >> Others might also be helped by what is being discussed.]
> >>
> >> As I've written before, you should really think
> >> about what "SWHID" actually identifies.
> >> As you might have discovered, the single command
> >> "pip install pytorch" results in completely different files
> >> being installed, depending on the environment in which it is run.
> >>
> >> So, your approach of "one SWHID for the source
> >> and _one_ SWHID for the package" still falls short
> >> and cannot describe what is really happening.
> >>
> >>
> >> On Fri, Mar 6, 2026, at 02:53, Odysseas Kalaitsidis wrote:
> >> > Dear Dr. Zavras,
> >> >
> >> > Thank you for the feedback; you're right, I had only scratched the
> >> > surface.
> >> >
> >> > After thinking more about the mismatch I found with certifi, I realize
> >> > the problem gets much harder in other cases. Packages with C extensions
> >> > like numpy would have generated files and dependencies that don't exist
> >> > in the upstream repo. On Crates.io, the registry adds metadata that
> >> > doesn't match the Git tag. And Maven is a different problem, since
> >> > Central has compiled JARs, so you'd need to go from a binary back to
> >> > the source.
> >> >
> >> > This made me rethink my approach. Trying to strip everything to force a
> >> > single SWHID match feels fragile and doesn't really scale. Instead of
> >> > forcing one SWHID, I think it makes more sense to give one to the
> >> > package and one to the source, and just record which source the package
> >> > came from. These links can be expressed with SPDX 3.0 relationships,
> >> > and the data would also fit well into an RDF-based system like the one
> >> > on the ideas list.
> >> >
> >> > I'm reworking my proposal around this. Would it be OK to share a
> >> > revised draft early next week?
> >> >
> >> > Best regards,
> >> > Odysseas Kalaitsidis
> >> >
> >> >
> >> >
> >> > On Thu, Feb 26, 2026 at 4:00 PM, Alexios Zavras
> >> > <zvr+gsoc [ at ] zvr [ dot ] gr> wrote:
> >> >> Hi Odysseas, thanks for your interest.
> >> >>
> >> >> You've correctly identified a major issue with matching
> >> >> "intrinsic" identifiers (like SWHID) with "extrinsic" ones
> >> >> (like URL or DOI or PURL): what is the actual thing
> >> >> that we are trying to identify?
> >> >> Is it only the software? Does it include files like README and
> >> >> LICENSE?
> >> >> It's even worse in the case where the package (PyPI in your case)
> >> >> includes compiled binaries.
> >> >>
> >> >> And of course extrinsic identifiers are almost useless
> >> >> when one introduces local modifications.
> >> >>
> >> >> I believe you've just scratched the surface of the problem's
> >> >> complexity, and I'd invite you to explore more real-world use cases.
> >> >>
> >> >> On Wed, Feb 25, 2026, at 03:14, Odysseas Kalaitsidis wrote:
> >> >> > Dear Dr. Zavras and mentors,
> >> >> >
> >> >> > I hope you are well. Following our conversation at FOSDEM, I took a
> >> >> > deeper look into the Software Heritage specifications, particularly
> >> >> > around artifact identification for SBOM mapping. From this
> >> >> > exploration, I identified reliable artifact identification as a key
> >> >> > bottleneck for effective SBOM validation, which led me to focus on
> >> >> > the idea of using SWHIDs to identify software components.
> >> >> >
> >> >> > To better understand the practical challenges, I built a working
> >> >> > prototype targeting the PyPI ecosystem, available at
> >> >> > https://github.com/OdysseasKalaitsidis/SWHID_POC.
> >> >> > The tool fetches a package source distribution (sdist) from PyPI,
> >> >> > computes its SWHID locally using swh.model, and verifies it against
> >> >> > the Software Heritage archive API.
> >> >> >
> >> >> > My experiments revealed an important technical challenge. While pure
> >> >> > Python packages such as six 1.17.0 verified successfully, others such
> >> >> > as certifi failed validation. The root cause appears to be build-time
> >> >> > metadata included in PyPI distributions, such as .egg-info
> >> >> > directories or PKG-INFO files, which are absent from the upstream
> >> >> > source tree archived by Software Heritage. This suggests that a
> >> >> > robust implementation requires ecosystem-specific normalization
> >> >> > before hashing, stripping build artifacts to align distributions
> >> >> > with their canonical source representation.
> >> >> >
> >> >> > Given this complexity, I believe the project aligns better with a
> >> >> > Large scope. The core deliverable would be ecosystem-specific
> >> >> > normalization engines for PyPI and Crates.io, with Maven as a
> >> >> > stretch goal due to the additional source-to-binary JAR mapping
> >> >> > complexity. Building on that foundation, I would develop a
> >> >> > PURL-based CLI tool and a lightweight REST API for querying verified
> >> >> > mappings, as well as publish a dataset of top packages with SPDX 3.0
> >> >> > export support.
> >> >> >
> >> >> > This work connects directly to the Unified SBOM Management via RDF
> >> >> > Database Abstraction project. Verified PURL-to-SWHID mappings would
> >> >> > allow SBOMs stored in the RDF triplestore to be validated against
> >> >> > content-addressed source artifacts, effectively closing the gap
> >> >> > between declared components and their archived source code.
> >> >> >
> >> >> > Through my previous projects, I have gained practical experience
> >> >> > with Python data pipelines, Docker-based deployment, and Git
> >> >> > workflows, which I believe provide a solid foundation for this work.
> >> >> > I would greatly appreciate your feedback on the direction and scope.
> >> >> >
> >> >> > I also have one architectural question for the proposal stage:
> >> >> > should the tool treat the sdist SWHID from PyPI and the git-tag
> >> >> > SWHID from the upstream repository as two distinct valid mappings,
> >> >> > or should we aim to canonicalize everything to the git-based SWHID?
> >> >> > This decision will significantly influence the schema design.
> >> >> >
> >> >> > Best regards,
> >> >> > Odysseas Kalaitsidis
> >> >> >
> >> >> > ----
> >> >> > You are receiving this message from the list: A discussion list for
> >> >> > student developers and mentors of Google Summer of Code projects,
> >> >> > https://lists.ellak.gr/gsoc-developers/listinfo.html
> >> >> > You can unsubscribe from the list by sending an empty email to
> >> >> > <gsoc-developers+unsubscribe [ at ] ellak [ dot ] gr>.
> >> >>
> >> >> --
> >> >> -- zvr -
> >>
> >> --
> >> -- zvr -
>
> --
> -- zvr -
>
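The Crates.io provenance link described at the top of this thread can be checked mechanically: a .crate file is a gzipped tar, and the .cargo_vcs_info.json the registry adds records the source commit under the "git"/"sha1" keys. A minimal Python sketch; the function name and the synthetic demo archive are illustrative, not part of any existing tool:

```python
import io
import json
import tarfile

def vcs_commit_from_crate(crate_bytes: bytes):
    """Return the git commit recorded in .cargo_vcs_info.json, or None.

    A .crate file is a gzipped tar with entries under "<name>-<version>/".
    """
    with tarfile.open(fileobj=io.BytesIO(crate_bytes), mode="r:gz") as tar:
        for member in tar.getmembers():
            if member.name.endswith("/.cargo_vcs_info.json"):
                info = json.load(tar.extractfile(member))
                return info.get("git", {}).get("sha1")
    return None

# Demo on an in-memory stand-in for serde-1.0.203.crate
# (the commit hash "deadbeef" is hypothetical):
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    payload = json.dumps({"git": {"sha1": "deadbeef"}}).encode()
    entry = tarfile.TarInfo("serde-1.0.203/.cargo_vcs_info.json")
    entry.size = len(payload)
    tar.addfile(entry, io.BytesIO(payload))
print(vcs_commit_from_crate(buf.getvalue()))  # prints: deadbeef
```

Since fetching a real crate would need the network, the demo constructs the archive in memory; against a real download, the returned hash is the commit to compare with the v1.0.203 tag.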
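On the verification side, the smallest building block of the SWHID computation mentioned in the prototype is the content identifier: for raw file bytes, swh:1:cnt uses git's blob hash (SHA-1 over a "blob <length>" header, a NUL byte, then the data), which is what swh.model computes for file contents. A dependency-free sketch for illustration:

```python
import hashlib

def content_swhid(data: bytes) -> str:
    # SWHID content hash == git blob hash: SHA-1 over the header
    # b"blob %d\x00" % len(data) followed by the data itself.
    header = b"blob %d\x00" % len(data)
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()

# The empty file yields the well-known git empty-blob hash:
print(content_swhid(b""))
# swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

The result is easy to cross-check against `git hash-object` on the same bytes, which makes this a convenient sanity test before bringing in swh.model for directory and revision identifiers.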
----
You are receiving this message from the list: A discussion list for student developers and mentors of Google Summer of Code projects, https://lists.ellak.gr/gsoc-developers/listinfo.html
You can unsubscribe from the list by sending an empty email to <gsoc-developers+unsubscribe [ at ] ellak [ dot ] gr>.