EELLAK - Mailing Lists

Re: Interest in "Using SWHID to Identify Software Components": POC & Architecture

  • Subject: Re: Interest in "Using SWHID to Identify Software Components": POC & Architecture
  • From: "Alexios Zavras" <zvr+gsoc [ at ] zvr [ dot ] gr>
  • Date: Mon, 09 Mar 2026 22:24:23 +0100
[please always use the gsoc-developers list for communicating.
Others might also be helped by what is being discussed.]

As I've written before, you should really think
about what a "SWHID" actually identifies.
As you might have discovered, the single command
"pip install torch" results in completely different files
being installed, depending on the environment in which it is run.

So, your approach of "one SWHID for the source
and _one_ SWHID for the package" still falls short
and cannot describe what is really happening.
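
To make the point concrete, here is a minimal sketch; it is not taken from
the prototype discussed in this thread. It relies only on the fact that the
SWHID of a file's content (swh:1:cnt:) is the Git blob hash of its bytes.
The file names and byte contents of the two "environments" are invented
for illustration; real installations contain many more files.

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Return the SWHID of a file's content (same hash as a Git blob)."""
    header = b"blob %d\x00" % len(data)
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()


def installed_swhids(files: dict[str, bytes]) -> dict[str, str]:
    """Map each installed path to the SWHID of its content."""
    return {path: content_swhid(data) for path, data in files.items()}


# Hypothetical files installed by the same command in two environments.
linux_install = {
    "pkg/__init__.py": b"from ._core import run\n",
    "pkg/_core.so": b"\x7fELF...linux build...",
}
windows_install = {
    "pkg/__init__.py": b"from ._core import run\n",
    "pkg/_core.pyd": b"MZ...windows build...",
}

linux_ids = installed_swhids(linux_install)
windows_ids = installed_swhids(windows_install)

# Content that is byte-identical in both environments.
shared = set(linux_ids.values()) & set(windows_ids.values())
# Content that exists in only one of the two environments.
only_one = set(linux_ids.values()) ^ set(windows_ids.values())

print(len(shared), len(only_one))
```

Only the pure-Python file shares an identifier across the two installations;
the compiled parts each get their own, so no single package-level SWHID
covers both.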


On Fri, Mar 6, 2026, at 02:53, Odysseas Kalaitsidis wrote:
> Dear Dr. Zavras,
>
> Thank you for the feedback. You're right: I had only scratched the surface.
>
> After thinking more about the mismatch I found with certifi, I realize 
> the problem gets much harder in other cases. Packages with C extensions 
> like numpy would have generated files and dependencies that don't exist 
> in the upstream repo. On Crates.io, the registry adds metadata that 
> doesn't match the Git tag. And Maven is a different problem, since 
> Central has compiled JARs, so you'd need to go from a binary back to 
> the source.
>
> This made me rethink my approach. Trying to strip everything to force a 
> single SWHID match feels fragile and doesn't really scale. Instead of 
> forcing one SWHID, I think it makes more sense to give one to the 
> package and one to the source, and just record which source the package 
> came from. These links can be expressed with SPDX 3.0 relationships, 
> and the data would also fit well into an RDF-based system like the one 
> on the ideas list.
>
> I'm reworking my proposal around this. Would it be ok to share a 
> revised draft early next week?
>
> Best regards,
> Odysseas Kalaitsidis
>
>
>
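
The model proposed in the message quoted above (separate identifiers for
the package and for the source, plus a recorded link between them) could be
represented roughly as follows. This is a sketch under assumptions: the
Link class, its field names, and the "built_from" label are invented for
illustration and are not SPDX 3.0 vocabulary, and the hashes are
placeholders. Grouping by source lets any number of package artifacts
point to the same source tree.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    package_swhid: str  # identifies one concrete distributed artifact
    source_swhid: str   # identifies the source tree it came from
    relation: str       # illustrative label, not an SPDX term
    note: str = ""      # free-text description of the artifact


def group_by_source(links):
    """Collect the package SWHIDs recorded for each source SWHID."""
    grouped = defaultdict(list)
    for link in links:
        grouped[link.source_swhid].append(link.package_swhid)
    return dict(grouped)


# Placeholder identifiers; real ones would be computed from actual trees.
SOURCE = "swh:1:dir:" + "a" * 40
links = [
    Link("swh:1:dir:" + "b" * 40, SOURCE, "built_from", "sdist"),
    Link("swh:1:dir:" + "c" * 40, SOURCE, "built_from", "wheel, linux"),
    Link("swh:1:dir:" + "d" * 40, SOURCE, "built_from", "wheel, windows"),
]

by_source = group_by_source(links)
print(len(by_source[SOURCE]))
```
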
> On Thu, Feb 26, 2026, at 4:00 PM, Alexios Zavras 
> <zvr+gsoc [ at ] zvr [ dot ] gr> wrote:
>> Hi Odysseas, thanks for your interest.
>> 
>> You've correctly identified a major issue with matching
>> "intrinsic" identifiers (like SWHID) with "extrinsic" ones
>> (like URL or DOI or PURL): what is the actual thing
>> that we are trying to identify?
>> Is it only the software? Does it include files like README and LICENSE?
>> It's even worse in the case where the package (in PyPI in your case)
>> includes compiled binaries.
>> 
>> And of course extrinsic identifiers are almost useless
>> when one introduces local modifications.
>> 
>> I believe you've just scratched the surface of the problem complexity
>> and I'd invite you to explore more real-world use cases.
>> 
>> On Wed, Feb 25, 2026, at 03:14, Odysseas Kalaitsidis wrote:
>> > Dear Dr. Zavras and mentors,
>> >
>> > I hope you are well. Following our conversation at FOSDEM, I took a 
>> > deeper look into the Software Heritage specifications, particularly 
>> > around artifact identification for SBOM mapping. From this exploration, 
>> > I identified reliable artifact identification as a key bottleneck for 
>> > effective SBOM validation, which led me to focus on the idea of using 
>> > SWHIDs to identify software components.
>> >
>> > To better understand the practical challenges, I built a working 
>> > prototype targeting the PyPI ecosystem, available at 
>> > https://github.com/OdysseasKalaitsidis/SWHID_POC. The tool fetches a 
>> > package source distribution (sdist) from PyPI, computes its SWHID 
>> > locally using swh.model, and verifies it against the Software Heritage 
>> > archive API.
>> >
>> > My experiments revealed an important technical challenge. While pure 
>> > Python packages such as six 1.17.0 verified successfully, others such 
>> > as certifi failed validation. The root cause appears to be build-time 
>> > metadata included in PyPI distributions, such as .egg-info directories 
>> > or PKG-INFO files, which are absent from the upstream source tree 
>> > archived by Software Heritage. This suggests that a robust 
>> > implementation requires ecosystem-specific normalization before 
>> > hashing, stripping build artifacts to align distributions with their 
>> > canonical source representation.
>> >
>> > Given this complexity, I believe the project aligns better with a Large 
>> > scope. The core deliverable would be ecosystem-specific normalization 
>> > engines for PyPI and Crates.io, with Maven as a stretch goal due to the 
>> > additional source-to-binary JAR mapping complexity. Building on that 
>> > foundation, I would develop a PURL-based CLI tool and a lightweight 
>> > REST API for querying verified mappings, as well as publish a dataset 
>> > of top packages with SPDX 3.0 export support.
>> >
>> > This work connects directly to the Unified SBOM Management via RDF 
>> > Database Abstraction project. Verified PURL-to-SWHID mappings would 
>> > allow SBOMs stored in the RDF triplestore to be validated against 
>> > content-addressed source artifacts, effectively closing the gap between 
>> > declared components and their archived source code.
>> >
>> > Through my previous projects, I have gained practical experience with 
>> > Python data pipelines, Docker-based deployment, and Git workflows, 
>> > which I believe provide a solid foundation for this work. I would 
>> > greatly appreciate your feedback on the direction and scope.
>> >
>> > I also have one architectural question for the proposal stage: should 
>> > the tool treat the sdist SWHID from PyPI and the git-tag SWHID from the 
>> > upstream repository as two distinct valid mappings, or should we aim to 
>> > canonicalize everything to the git-based SWHID? This decision will 
>> > significantly influence the schema design.
>> >
>> > Best regards,
>> > Odysseas Kalaitsidis
>> >
>> > ----
>> > You are receiving this message from the list: A discussion list for 
>> > student developers and mentors of Google Summer of Code projects.
>> > https://lists.ellak.gr/gsoc-developers/listinfo.html
>> > You can unsubscribe from the list by sending an empty email 
>> > to <gsoc-developers+unsubscribe [ at ] ellak [ dot ] gr>.
>> 
>> -- 
>> -- zvr -

-- 
-- zvr -