File Aggregation, Multi-Stack Dependency Insights

Welcome back to SBOM-Fridays! Where the first part of this series provided context and theory around SBOMs, this part lays out the philosophy and the technical implementation behind constructing an SBOM in more detail. We won’t finish or even start an SBOM in this part; we’re just prepping our solution and infrastructure for the engine itself.

Since the goal is to list every component used in our software, we’ll be leaning on the same files that build tools and compilers use to assemble the software.

This post will roughly be divided into two segments - gathering source files on the one hand and distilling them into raw usable metadata on the other - both with the purpose of feeding the engine with the exact fuel it needs.

Gathering Source Files

Existing SBOM solutions often request access to your full repository. But what do they touch once you hand them the keys to the whole garage? Just like we will, they read dependency manifests, traverse lock files and pull metadata for their contents. Often, SBOMs are generated as a last step in the build pipeline, just before the source files get compiled into DLLs for production servers. That means any SBOM generated by that process only reflects the vulnerabilities known at compile-time. It gets put on the production server and lives there, unchanged, until the next release cycle.

What files are we talking about?

Ecosystem	Source Files
.NET	`*.csproj`, `*.fsproj`, `*.vbproj`, `project.assets.json`
Angular, React, Node.js	`package.json`, `package-lock.json`
Java	`pom.xml`, `gradle.lockfile`
Go	`go.mod`, `go.sum`
Python	`requirements.txt`, `Pipfile`, `Pipfile.lock`, `poetry.lock`

In our experiments here, we’re going to give the CISO guys some sweaty palms, as we will take these source files with us and put them (encrypted & zipped) on our production server. Files on a production server that the application doesn’t need to run?! I’m a heretic, I know.

We’re obviously not going to just leave them there hanging out to dry, as they will be working HARD for our application's vulnerability assessment processes. We will be using them to up our game from compile-time SBOMs to SBOMs generated at run-time in the production environment. Why does it matter? Well, every SBOM created at run-time reflects the vulnerabilities known at the moment it is generated, not the ones known at the last release. The waterfall boyz can surf their own waves and still be compliant with ISO 27001.

If you’re a CISO and haven’t called an exorcist on me yet, we can tend to your sensitivities and just let our engine run on a development environment as well, if you’re absolutely not comfortable hoisting blueprints onto a prod server. I’m not judging, I’m trying to make a deal here.

Without giving away too much already, having these source files at hand in the environment that matters means we can set up a closed circuit of scanning and reporting vulnerabilities or undesired license changes. Go to part 4 of this series to see how we implement this for a .NET solution!

In a .NET C# ecosystem, be it on Azure or in your local Solution Explorer, we can do something like the following to grab our desired files. Just make sure you have an abstraction that sets the correct SolutionDirectory for your environment, as missing or incorrect source files ultimately lead to a worthless SBOM.

1. .NET: adapt the searchPattern to your type of project files (*.csproj, or go for *.*proj to catch them all)

var csprojPaths = Directory
  .GetFiles(
    path: SolutionDirectory,
    searchPattern: "*.csproj", 
    searchOption: SearchOption.AllDirectories)
  .ToList();

var csProjectAssetsPaths = Directory
  .GetFiles(
    path: SolutionDirectory,
    searchPattern: "project.assets.json",
    searchOption: SearchOption.AllDirectories)
  .ToList();

2. Front-ends based on package(-lock).json can use something like the following. Adapt it depending on your solution being a mono-repo carrying UIs, or a batch of micro-front-ends or whatever.

var packageJsonPath = Directory 
  .GetFiles( 
    path: SolutionDirectory,
    searchPattern: "package.json",
    searchOption: SearchOption.AllDirectories)
  .FirstOrDefault(path => 
    Path.GetFileName(path).Equals("package.json", StringComparison.OrdinalIgnoreCase) && 
    Path.GetFileName(Path.GetDirectoryName(path)).Equals("ClientApp", StringComparison.OrdinalIgnoreCase));

var packageJsonLockPath = Directory 
  .GetFiles( 
    path: SolutionDirectory,
    searchPattern: "package-lock.json",
    searchOption: SearchOption.AllDirectories) 
  .FirstOrDefault(path => 
    Path.GetFileName(path).Equals("package-lock.json", StringComparison.OrdinalIgnoreCase) &&
    Path.GetFileName(Path.GetDirectoryName(path)).Equals("ClientApp", StringComparison.OrdinalIgnoreCase));
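One thing to keep in mind with SearchOption.AllDirectories: it also walks into node_modules, where every installed package ships its own package.json. A minimal sketch of a walker that skips such folders could look like this. The names ManifestLocator and FindManifests, as well as the skip list, are my own assumptions for illustration, not part of the engine. Note that obj is deliberately not skipped, since that is where project.assets.json lives.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class ManifestLocator
{
    // Hypothetical skip list: folders that hold restored packages or VCS data,
    // not our own manifests. Adapt to your repository layout.
    private static readonly HashSet<string> SkippedDirectories =
        new(StringComparer.OrdinalIgnoreCase) { "node_modules", ".git" };

    // Returns every file under solutionDirectory that matches one of the
    // search patterns, without descending into skipped directories.
    public static List<string> FindManifests(string solutionDirectory, params string[] searchPatterns)
    {
        var results = new List<string>();
        var pending = new Stack<string>();
        pending.Push(solutionDirectory);

        while (pending.Count > 0)
        {
            var current = pending.Pop();

            foreach (var pattern in searchPatterns)
            {
                results.AddRange(Directory.GetFiles(current, pattern, SearchOption.TopDirectoryOnly));
            }

            foreach (var child in Directory.GetDirectories(current))
            {
                if (!SkippedDirectories.Contains(Path.GetFileName(child)))
                {
                    pending.Push(child);
                }
            }
        }

        return results;
    }
}
```

Usage would be something like ManifestLocator.FindManifests(SolutionDirectory, "*.csproj", "project.assets.json", "package.json", "package-lock.json"). Overlapping patterns return the same file more than once, so keep them disjoint or de-duplicate the result.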

In a build pipeline, you’d need something like the following (and I’m trying to keep it brief here) to scoop up the necessary files and zip them for transfer. Mind you, this is not set up for micro-front-ends carrying multiple package.json & package-lock.json files; you’d have to adapt the script to grab them recursively. Another obvious challenge is that, when flattening the files into a single folder, name clashes can give you some minor headaches.

# Backslash paths in the inline scripts assume a Windows build agent
- task: CopyFiles@2
  displayName: "Copy .csproj files for SCA"
  inputs:
    Contents: '**/*.csproj'
    TargetFolder: '$(Build.ArtifactStagingDirectory)/vulnerability-scanner'
    flattenFolders: true
- task: CopyFiles@2
  displayName: "Copy project.assets.json files to Temp"
  inputs:
    Contents: '**/project.assets.json'
    TargetFolder: '$(Build.ArtifactStagingDirectory)/vulnerability-scanner/assets'
    # Keep the folder structure for now: every project has a file with the same name
    flattenFolders: false
- task: PowerShell@2
  displayName: "Copy renamed project.assets.json files for SCA"
  inputs:
    targetType: inline
    script: |
      $assetsFolder = "$(Build.ArtifactStagingDirectory)\vulnerability-scanner\assets"
      $targetFolder = "$(Build.ArtifactStagingDirectory)\vulnerability-scanner"
      $files = Get-ChildItem -Path $assetsFolder -Recurse -Filter "project.assets.json"

      foreach ($file in $files) {
        # Path relative to the assets folder, e.g. MyProject\obj\project.assets.json
        $relativePath = $file.FullName.Substring($assetsFolder.Length).TrimStart('\')
        $parts = $relativePath -split '\\'
        # The first folder is used as the project name to avoid name clashes
        $projectName = $parts[0]
        $newName = "$projectName.project.assets.json"
        $destination = Join-Path $targetFolder $newName

        Write-Host "Moving and renaming $relativePath to $newName"
        Move-Item -Path $file.FullName -Destination $destination -Force
      }
      Remove-Item -Recurse -Force $assetsFolder
      Write-Host "Flattened project.assets.json files and removed /assets folder"
- task: CopyFiles@2
  displayName: "Copy package.json & package-lock.json to temp"
  inputs:
    Contents: 'MyApplication.UI/ClientApp/package*.json'
    TargetFolder: '$(Build.ArtifactStagingDirectory)/vulnerability-scanner/assets'
    flattenFolders: true
- task: PowerShell@2
  displayName: "Copy renamed package(-lock).json files for SCA"
  inputs:
    targetType: inline
    script: |
      $assetsFolder = "$(Build.ArtifactStagingDirectory)\vulnerability-scanner\assets"
      $targetFolder = "$(Build.ArtifactStagingDirectory)\vulnerability-scanner"

      $files = Get-ChildItem -Path $assetsFolder -Recurse -Filter "package*.json"

      foreach ($file in $files) {
        $relativePath = $file.FullName.Substring($assetsFolder.Length).TrimStart('\')
        # Single front-end, so the prefix is hard-coded
        $projectName = 'ClientApp'
        $newName = "$projectName.$relativePath"
        $destination = Join-Path $targetFolder $newName

        Write-Host "Moving and renaming $relativePath to $newName"
        Move-Item -Path $file.FullName -Destination $destination -Force
      }
      Remove-Item -Recurse -Force $assetsFolder
      Write-Host "Flattened package*.json files and removed /assets folder"
- task: ArchiveFiles@2
  displayName: 'Zip Vulnerability Scanner Source Files'
  inputs:
    rootFolderOrFile: '$(Build.ArtifactStagingDirectory)/vulnerability-scanner'
    includeRootFolder: false
    archiveType: 'zip'
    archiveFile: '$(Build.ArtifactStagingDirectory)/Drop/vulnerability-scanner/sca.zip'
    replaceExistingArchive: true

Ideally, we end this phase with a flattened zip/folder that contains only our .csproj files, project.assets.json files, package(-lock).json files and so on, depending on our tech stack and architecture.

Distilling Dependency Data

The actual engine will do the heavy lifting of collecting the relevant (meta)data, but in this part of the series, we will already discuss some of the core principles of it.

Our source files contain only so much information. To enrich this basic information, we can call in a little help from our friends over at the NuGet and NPM repositories, as well as the guys running the NVD & OSV databases. Set up a client for each of them in a factory pattern and inject them into your engine.
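To make the factory idea a bit more tangible, here is a minimal sketch, assuming one HttpClient per external source with its own base address. RegistryKind and RegistryClientFactory are hypothetical names, and the OSV API base address (https://api.osv.dev/v1/) is my assumption; the other addresses are the ones listed in this post. In an ASP.NET Core application you would more likely register named clients via IHttpClientFactory instead.

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;

public enum RegistryKind { NuGet, Npm, Nvd, Osv }

public sealed class RegistryClientFactory
{
    // One base address per external source, so the clients themselves
    // can work with relative URIs such as "query" or "rest/json/cpes/2.0".
    private static readonly IReadOnlyDictionary<RegistryKind, Uri> BaseAddresses =
        new Dictionary<RegistryKind, Uri>
        {
            [RegistryKind.NuGet] = new Uri("https://api.nuget.org/v3/"),
            [RegistryKind.Npm] = new Uri("https://registry.npmjs.org/"),
            [RegistryKind.Nvd] = new Uri("https://services.nvd.nist.gov/"),
            [RegistryKind.Osv] = new Uri("https://api.osv.dev/v1/"), // assumed API base
        };

    private readonly Dictionary<RegistryKind, HttpClient> _clients = new();

    // Creates the client on first use and hands out the same instance afterwards.
    public HttpClient GetClient(RegistryKind kind)
    {
        lock (_clients)
        {
            if (!_clients.TryGetValue(kind, out var client))
            {
                client = new HttpClient { BaseAddress = BaseAddresses[kind] };
                _clients[kind] = client;
            }

            return client;
        }
    }
}
```

Each registry client then receives its HttpClient through the constructor, which keeps base addresses and headers out of the engine itself.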

High level this is what our process looks like:

Let me briefly introduce the Four Horsemen of Dependency Management:

NuGet: https://api.nuget.org/v3/

The official package registry for the .NET ecosystem (C#, F#, VB.NET, etc.). Developers publish their libraries there so others can consume them. It contains package metadata (name, version, authors, description, license, project URL, etc.), dependency trees and the package archives themselves (.nupkg files, which are zip archives) that we can hash. We will mostly be using the metadata and the hashes of the package archives.

 public async Task<IPackageSearchMetadata> GetPackageMetadataAsync(NuGetInfo packageReference, CancellationToken cancellationToken)
 {
   var repo = Repository.Factory.GetCoreV3("https://api.nuget.org/v3/index.json");
   var resource = await repo.GetResourceAsync<PackageMetadataResource>(cancellationToken);

   var packageIdentity = new PackageIdentity(
     id: packageReference.NuGetPackage,
     version: NuGetVersion.Parse(packageReference.Version)
   );

   // SourceCacheContext is disposable, so release it when we are done
   using var cacheContext = new SourceCacheContext();

   return await resource.GetMetadataAsync(
     package: packageIdentity,
     sourceCacheContext: cacheContext,
     log: NullLogger.Instance,
     token: cancellationToken // pass the caller's token instead of default
   );
 }

 public async Task<string?> GetPackageHashAsync(NuGetInfo nuget, CancellationToken cancellationToken)
 {
   string envNuGetCache = string.Empty;
   string envCachedPackageFilePath = string.Empty;
   var fileName = $"{nuget.NuGetPackage.ToLowerInvariant()}.{nuget.Version}.nupkg";

   var localNuGetCache = Path.Combine(
     NugetCacheRoot,
     nuget.NuGetPackage.ToLowerInvariant(),
     nuget.Version);

   var localCachedPackageFilePath = Path.Combine(
     localNuGetCache,
     fileName);

   if (File.Exists(localCachedPackageFilePath))
   {
     envNuGetCache = localNuGetCache;
     envCachedPackageFilePath = localCachedPackageFilePath;
   } else {
     //points to D:/local/Temp on Production server
     var productionWriteableFolder = Path.GetTempPath();
     var nugetCache = Path.Combine(productionWriteableFolder, "NuGetCache");
     var productionNuGetCache = Path.Combine(
       nugetCache,
       nuget.NuGetPackage.ToLowerInvariant(), 
       nuget.Version);

     envNuGetCache = productionNuGetCache;
     envCachedPackageFilePath = Path.Combine(envNuGetCache, fileName);
   }

   // Not in any cache yet: download the .nupkg once and keep it for later runs
   if (!File.Exists(envCachedPackageFilePath))
   {
     Directory.CreateDirectory(envNuGetCache);

     var packageUrl = $"https://www.nuget.org/api/v2/package/{nuget.NuGetPackage}/{nuget.Version}";
     using var response = await _httpClient.GetAsync(packageUrl, cancellationToken);

     if (!response.IsSuccessStatusCode)
     {
       return null;
     }

     await using var fs = File.Create(envCachedPackageFilePath);
     await response.Content.CopyToAsync(fs, cancellationToken);
   }

   using var sha256 = SHA256.Create();
   using var stream = File.OpenRead(envCachedPackageFilePath);
   var hashBytes = sha256.ComputeHash(stream);

   return BitConverter.ToString(hashBytes).Replace("-", "").ToLowerInvariant();
 }

NPM: Node Package Manager, https://registry.npmjs.org/

The official registry for Node.js packages. It contains the full package metadata: versions, tarball URLs, dependencies, maintainers, license and repository info. We use it to enrich the SBOM data and to cross-reference vulnerabilities later (via OSV & NVD, see below).

 private async Task<JsonNode?> GetExactVersionMetadataAsync(string packageName, string rawVersion, CancellationToken cancellationToken)
 {
   var request = new HttpRequestMessage(
     HttpMethod.Get,
     Uri.EscapeDataString(packageName)
   );

   using var response = await _httpClient.SendAsync(
     request: request,
     completionOption: HttpCompletionOption.ResponseHeadersRead,
     cancellationToken: cancellationToken);

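   // Bail out early on 404s and other errors instead of parsing an error body
   if (!response.IsSuccessStatusCode)
   {
     return null;
   }

   var metadataJson = await response.Content.ReadAsStringAsync(cancellationToken);
   var metadata = JsonNode.Parse(metadataJson);

   if (metadata?["versions"] is not JsonObject versions)
   {
     return null;
   }

   // Try exact match first
   if (versions.TryGetPropertyValue(rawVersion, out JsonNode? exactMatch))
   {
     return exactMatch;
   }

   // Then try the cleaned-up version
   // (assumes ExtractSemverVersion strips range operators, e.g. "^1.2.3" -> "1.2.3")
   var targetVersion = Strings.ExtractSemverVersion(rawVersion);

   if (!string.IsNullOrEmpty(targetVersion) && versions.TryGetPropertyValue(targetVersion, out JsonNode? semverMatch))
   {
     return semverMatch;
   }

   // Last resort: loose substring match. Skipped for an empty version,
   // because an empty string is contained in every key.
   var fallbackKey = string.IsNullOrWhiteSpace(rawVersion)
     ? null
     : versions
       .Select(v => v.Key)
       .FirstOrDefault(k => k.Contains(rawVersion) || rawVersion.Contains(k));

   if (fallbackKey != null && versions.TryGetPropertyValue(fallbackKey, out JsonNode? fallbackMatch))
   {
     return fallbackMatch;
   }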
   return null;
 }

 public async Task<PackageMetadata?> FetchFullPackageMetadataAsync(string packageName, string? version, CancellationToken cancellationToken)
 {
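   var versionInfo = await GetExactVersionMetadataAsync(packageName, version ?? "", cancellationToken);

   // No matching version in the registry: nothing to enrich
   if (versionInfo is null)
   {
     return null;
   }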

   try
   {
     var license = versionInfo["license"]
       switch {
         JsonObject licObj => licObj["type"]?.ToString() ?? licObj.ToString(),
         JsonValue licVal => licVal.ToString(),
         _ => null };

     var homepage = versionInfo["homepage"]?.ToString();
     var description = versionInfo["description"]?.ToString();
     var tarball = versionInfo["dist"]?["tarball"]?.ToString();
     var versionInfoVersion = versionInfo["version"]?.ToString();
     var severity = versionInfo["severity"]?.ToString();
     var shasum = versionInfo["dist"]?["shasum"]?.ToString();
     var engines = versionInfo["engines"]
         switch {
           JsonObject obj =>
             obj.ToDictionary(kvp => kvp.Key, kvp => kvp.Value?.ToString()),
           JsonArray arr =>
               arr.Select((v, i) => new { Key = i.ToString(), Value = v?.ToString() })
               .ToDictionary(x => x.Key, x => x.Value),
           _ => new Dictionary<string, string>()};

     var repositoryUrl = versionInfo["repository"] is JsonObject repoObj ?
       repoObj["url"]?.ToString() :
       versionInfo["repository"]?.ToString();

     var authorName = versionInfo["author"] is JsonObject authorObj ?
       authorObj["name"]?.ToString() :
       versionInfo["author"]?.ToString(); // fallback if it's just a string

     var bugsUrl = versionInfo["bugs"]
       switch {
         JsonObject bugsObj => bugsObj["url"]?.ToString(),
         JsonValue bugsVal => bugsVal.ToString(), // covers plain string form
         _ => null};

     var vulnerabilities = await _osvRegistryClient.ListPackageVulnerabilitiesAsync("npm", packageName, versionInfo["version"]?.ToString(), cancellationToken);

     return new PackageMetadata
     {
       LicenseId = license,
       TopLevelLicense = null, // Optional: could cache top-level metadata if needed
       HomepageUrl = homepage,
       Description = description,
       TarballUrl = tarball,
       Version = versionInfoVersion,
       Vulnerabilities = vulnerabilities,
       Author = authorName,
       Severity = severity,
       BugsUrl = bugsUrl,
       RepositoryUrl = repositoryUrl,
       Shasum = shasum,
       Engines = engines
     };
   }
   catch (Exception ex)
   {
     var properties = new Dictionary<string, string>
     {
       { "packageName", packageName },
       { "version", version ?? string.Empty }
     };

     _telemetry.TrackException(ex, properties);

     return null;
   }
 }

 public async Task<string?> GetRemotePackageHashFromTarballAsync(string? tarballUrl, CancellationToken cancellationToken)
 {
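   // Without a tarball URL there is nothing to download and hash
   if (string.IsNullOrWhiteSpace(tarballUrl))
   {
     return null;
   }

   try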
   {
     using var stream = await _httpClient.GetStreamAsync(tarballUrl, cancellationToken);
     using var sha256 = SHA256.Create();
     var hashBytes = await sha256.ComputeHashAsync(stream, cancellationToken);

     return BitConverter.ToString(hashBytes).Replace("-", "").ToLowerInvariant();
   }
   catch (Exception ex)
   {
     var properties = new Dictionary<string, string>
     {
       { "tarballUrl", tarballUrl }
     };

     _telemetry.TrackException(ex, properties);

      return null;
   }
 }

NVD: National Vulnerability Database, https://services.nvd.nist.gov/

The vulnerability database maintained by the US government (NIST). It’s the reference database for Common Vulnerabilities & Exposures (CVEs). It contains records with ID, description, CVSS score, severity, references and Common Platform Enumeration (CPE) data, the standardized identifiers for software. We use it to validate our (estimated/generated) CPEs and then use a validated CPE to map package → CVEs → vulnerabilities.

 public string GenerateCandidateCpe(string ecosystem, string packageName, string? version = null, string? author = null)
 {
   if (string.IsNullOrWhiteSpace(packageName))
   {
     return string.Empty;
   }

   packageName = packageName.Trim().ToLowerInvariant();
   version = Strings.ExtractSemverVersion(version?.Trim());
   author = author?.Trim().ToLowerInvariant();
   ecosystem = ecosystem.Trim().ToLowerInvariant();

   string vendor = author ?? InferVendor(ecosystem, packageName);
   string product = InferProduct(ecosystem, packageName);
   string normalizedVersion = version ?? "*";

   return NormalizeCpeQuery($"cpe:2.3:a:{EscapeCpeField(vendor)}:{EscapeCpeField(product)}:{EscapeCpeField(normalizedVersion)}:*:*:*:*:*:*:*");
 }

 public async Task<bool> ValidateCpeAsync(string cpe, CancellationToken cancellationToken)
 {
   if (string.IsNullOrWhiteSpace(cpe))
   {
     return false;
   }

   var apiKey = await GetNvdApiKeyAsync();

   var request = new HttpRequestMessage(
     HttpMethod.Get,
     $"rest/json/cpes/2.0?cpeMatchString={Uri.EscapeDataString(cpe)}");

   request.Headers.Add("apiKey", apiKey);

   using var response = await _httpClient.SendAsync(
     request: request,
     completionOption: HttpCompletionOption.ResponseHeadersRead,
     cancellationToken: cancellationToken);

   if (!response.IsSuccessStatusCode)
   {
     return false;
   }

   var json = await response.Content.ReadAsStringAsync(cancellationToken);
   var parsed = JsonNode.Parse(json);

   return parsed?["products"]?.AsArray()?.Count > 0;
 }

OSV: Open Source Vulnerabilities, https://osv.dev/

A newer open-source project (by Google and partners) that provides a unified vulnerability database. Unlike NVD (which is CVE-based), OSV is package-ecosystem-native (NuGet, NPM, PyPI, Maven, etc.). Data is aggregated from many sources (including NVD, GitHub Advisories and the language ecosystems). It contains vulnerability records tied directly to package names and versions, CVSS severity, affected ranges and references. We use this to enrich our SBOM with vulnerability information that’s easier to match than raw NVD CPEs.

 public async Task<List<PackageVulnerability>> ListPackageVulnerabilitiesAsync(string ecosystem, string packageName, string version, CancellationToken cancellationToken)
 {
   var payload = new
   {
     package = new { name = packageName, ecosystem = ecosystem },
     version = version
   };

   using var request = new HttpRequestMessage(HttpMethod.Post, "query")
   {
     Content = new StringContent(JsonContracts.Serialize(payload, JsonContracts.ApiOptions), Encoding.UTF8, ContentTypes.Json)
   };

   using var response = await _httpClient.SendAsync(
     request,
     HttpCompletionOption.ResponseHeadersRead,
     cancellationToken
   );

   if (!response.IsSuccessStatusCode)
   {
     return new();
   }

   var json = JsonNode.Parse(await response.Content.ReadAsStringAsync(cancellationToken));

   return json?["vulns"]?.AsArray().Select(vuln => new PackageVulnerability {
     Id = vuln["id"]?.ToString() ?? $"OSV-{Guid.NewGuid()}",
     Url = $"https://osv.dev/vulnerability/{vuln["id"]?.ToString()}",
     Description = vuln["summary"]?.ToString() ?? "No description provided",
     Details = vuln["details"]?.ToString(),
     Published = DateTime.TryParse(vuln["published"]?.ToString(), out var pub) ? pub : null,
     Modified = DateTime.TryParse(vuln["modified"]?.ToString(), out var mod) ? mod : null,
     IsWithdrawn = !string.IsNullOrEmpty(vuln["withdrawn"]?.ToString()),
     Aliases = vuln["aliases"]?.AsArray()?
       .Select(alias => alias?.ToString())?
       .Where(alias => !string.IsNullOrWhiteSpace(alias))?
       .ToList() ?? new(),
     Ratings = vuln["severity"]?.AsArray().Select(sev => {
       var type = sev?["type"]?.ToString() ?? "unknown";
       var scoreStr = sev?["score"]?.ToString();

       // In the OSV schema, severity[].score normally carries the CVSS vector
       // string (e.g. "CVSS:3.1/AV:N/AC:L/..."), not a numeric score.
       // Only treat it as a number when it actually parses as one.
       var isNumeric = float.TryParse(
         scoreStr,
         System.Globalization.NumberStyles.Float,
         System.Globalization.CultureInfo.InvariantCulture,
         out var s);

       return new PackageVulnerabilityRating {
         // A base score has to be calculated from the vector downstream
         Score = isNumeric ? s : 0.0f,
         Type = type,
         CvssVector = isNumeric ? null : scoreStr
       };
     }).ToList() ?? new(),
     References = vuln["references"]?.AsArray()?
       .Select(r => new PackageVulnerabilityReference {
         Type = r?["type"]?.ToString(),
         Url = r?["url"]?.ToString()})
       // Keep only well-formed refs
       .Where(r => !string.IsNullOrWhiteSpace(r?.Url))
       // Deduplicate by URL (common in OSV)
       .GroupBy(r => r.Url, StringComparer.OrdinalIgnoreCase)
       .Select(g => g.First())
       .ToList() ?? new()})
     .ToList() ?? new();
 }

So, high level:

1. We get the basic dependency tree and info from the package names and versions in our source files
2. We extend the tree with more detailed info and more nested packages using the public repositories
3. We add vulnerabilities via the public databases

Together these will provide us with all of the component related info we need to produce a rich and extended SBOM.
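As a taste of that first step, here is a minimal sketch of pulling names and versions out of the raw files, assuming a package-lock.json with lockfileVersion 2 or 3 (which has a "packages" map keyed by install path) and .csproj files that use PackageReference items with a Version attribute. ManifestReader and its methods are hypothetical names, not the engine's API.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.Json.Nodes;
using System.Xml.Linq;

public static class ManifestReader
{
    // Reads (name, version) pairs from the "packages" map of a package-lock.json.
    public static List<(string Name, string Version)> ReadPackageLock(string packageLockJson)
    {
        var result = new List<(string Name, string Version)>();

        if (JsonNode.Parse(packageLockJson)?["packages"] is not JsonObject packages)
        {
            return result;
        }

        const string marker = "node_modules/";

        foreach (var entry in packages)
        {
            var version = entry.Value?["version"]?.ToString();

            // The "" key is the root project itself, not a dependency
            if (entry.Key.Length == 0 || version is null)
            {
                continue;
            }

            // "node_modules/a/node_modules/@scope/b" -> "@scope/b"
            var index = entry.Key.LastIndexOf(marker, StringComparison.Ordinal);
            var name = index >= 0 ? entry.Key.Substring(index + marker.Length) : entry.Key;

            result.Add((name, version));
        }

        return result;
    }

    // Reads PackageReference items from a .csproj. Version stays null when it is
    // not set as an attribute (e.g. with central package management).
    public static List<(string Name, string? Version)> ReadPackageReferences(string csprojXml)
    {
        return XDocument.Parse(csprojXml)
            .Descendants()
            .Where(element => element.Name.LocalName == "PackageReference")
            .Select(element => (
                Name: (string?)element.Attribute("Include"),
                Version: (string?)element.Attribute("Version")))
            .Where(reference => !string.IsNullOrWhiteSpace(reference.Name))
            .Select(reference => (Name: reference.Name!, Version: reference.Version))
            .ToList();
    }
}
```

The .csproj only lists direct references; the transitive ones are what project.assets.json is for, which is why we collect both.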

How everything then comes together to actually produce the SBOM, that is for SBOM-Fridays: III. The Engine, the Trade-offs & the Generating of the Artifact.

A heads-up for part 3: we're going to be doing a LOT of API calls, so it's best to request an API key from the NVD guys to avoid running into rate limits.

Also, in the context of adding actual business value, we’re not going to rely on caching, as we want the freshest possible data from our external sources. We will mitigate the extra load by de-duplicating the external API calls within a run. These are business choices you don’t have to follow, but they are definitely encouraged.
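De-duplicating without caching may sound contradictory, so here is a minimal sketch of what I mean, assuming one instance per SBOM run: the same package@version showing up in ten projects triggers one external call, and the next run starts from scratch again. RequestDeduplicator is a hypothetical name, and the key format is up to you.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public sealed class RequestDeduplicator<TResult>
{
    // Lazy makes sure the fetch delegate runs only once per key,
    // even when several callers ask for the same key at the same time.
    private readonly ConcurrentDictionary<string, Lazy<Task<TResult>>> _requests =
        new(StringComparer.OrdinalIgnoreCase);

    // Returns the already running (or finished) task for this key,
    // or starts the fetch if nobody asked for it yet.
    public Task<TResult> GetOrAddAsync(string key, Func<Task<TResult>> fetch)
    {
        return _requests.GetOrAdd(key, _ => new Lazy<Task<TResult>>(fetch)).Value;
    }
}
```

Create it at the start of a run and throw it away afterwards, so no data survives between runs. Note that a failed task stays in the dictionary for the rest of the run; whether you want to evict and retry is a design choice this sketch leaves open.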

Furthermore, chances are your SBOM takes longer than 3 minutes and 50 seconds to produce. Web requests tend to get cut off automatically around that mark (on Azure App Service, for instance, the load balancer drops requests that sit idle for 230 seconds), which makes a long-running operation approach necessary. So my endpoint puts a background job on a queue, with a poller that picks it up in the background. For local development, e.g. using a Postman request, this time limit is obviously not enforced, so if you haven't got a polled queue in your infrastructure, you can still code along with (most of) this demo.
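If you want to mimic that fire-and-poll pattern locally without a real queue, a minimal in-process stand-in could look like the sketch below. It is an assumption-heavy simplification: SbomJobRunner and JobState are hypothetical names, the state lives in memory only, and a production setup would use a durable queue and a separate worker instead of Task.Run.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

public enum JobState { Queued, Running, Succeeded, Failed }

public sealed class SbomJobRunner
{
    private readonly ConcurrentDictionary<Guid, JobState> _states = new();

    // Called by the endpoint: starts the work in the background and returns
    // immediately with an id the client can poll.
    public Guid Enqueue(Func<CancellationToken, Task> work, CancellationToken cancellationToken = default)
    {
        var id = Guid.NewGuid();
        _states[id] = JobState.Queued;

        _ = Task.Run(async () =>
        {
            _states[id] = JobState.Running;

            try
            {
                await work(cancellationToken);
                _states[id] = JobState.Succeeded;
            }
            catch
            {
                // Cancellation ends up here as well in this simplified sketch
                _states[id] = JobState.Failed;
            }
        });

        return id;
    }

    // Called by the status endpoint: null means the id is unknown.
    public JobState? GetState(Guid id)
    {
        return _states.TryGetValue(id, out var state) ? state : (JobState?)null;
    }
}
```

The generating endpoint returns the id right away, and a second endpoint exposes GetState so the client can poll until the job has Succeeded or Failed.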

Thank you for reading, see you in the next installment!

SBOM-Fridays: II. File Aggregation and Gathering Dependency Data from Multi-Stack Repos