How It Works

A transparent look at how CVEDoc analyzes vulnerability impact, detects suspicious CVE patterns, and turns findings into verifiable security portfolios.

What is a CVE, and why does it matter?

CVE stands for Common Vulnerabilities and Exposures. It's a global identifier system for publicly disclosed security vulnerabilities — every bug that affects real software gets a unique ID like CVE-2021-44228 (that one is Log4Shell). The IDs are allocated by the MITRE Corporation, a US non-profit that has maintained the program since 1999, and day-to-day assignment is delegated to CVE Numbering Authorities (CNAs) — hundreds of approved vendors, open-source projects, bug bounty platforms, and security research teams who can allocate IDs within their scope.

The system exists because, without it, everyone would talk about the same bug with a different name. A researcher might call it "the Apache deserialization thing," a vendor advisory might call it "CVE‑2023‑XXXX High severity RCE," a scanner tool might call it "plugin-42-finding-7," and a patch note might call it "security fix in 2.4.1." A CVE ID collapses all of that into one canonical reference that every database, scanner, advisory, patch, and SIEM rule can point to. That single-source-of-truth property is what makes the modern security ecosystem work at all — coordinated disclosure, automated scanning, patch tracking, compliance reporting, threat intelligence feeds, and every other downstream workflow are all keyed on CVE IDs.

The wider standards around CVE

  • CVSS — a 0–10 technical severity score from FIRST.org. Measures how bad a bug is in theory, not how many deployments it actually touches. CVEDoc's impact score exists because this number alone doesn't answer "who's affected."
  • CWE — a taxonomy of root-cause categories ("Cross-Site Scripting," "Use After Free") tagged against each CVE. This is what CVEDoc's portfolio page uses to derive researcher skill labels.
  • NVD — NIST's enriched view of the CVE feed: adds CVSS scores, affected-product identifiers, and reference links. It's the source most vulnerability tools actually consume.

How certification bodies and pentesting firms rely on CVEs

  • Certification curricula are built around them. Practical offensive certs — PortSwigger's BSCP, HackTheBox's CPTS, CWES, and CWEE, TCM Security's PNPT and PJPT, INE Security's eJPT and eCPPT, Zero-Point Security's CRTO, and CREST's CRT/CCT — all assume candidates can recognize, exploit, and remediate real CVEs in hands-on labs, with real vulnerable builds of real software.
  • Pentesting firms track CVEs to stock their arsenal. New disclosures in common targets (CMSes, VPNs, load balancers) turn into tested exploits in the red team toolkit, and engagement reports cite the CVE IDs so clients can map findings straight to their patching workflow.
  • Researcher reputation lives on CVEs. Credited CVEs on widely-deployed software are public, measurable evidence of skill — which is exactly why CVEDoc has a portfolio feature to translate those IDs into something non-specialists can read.

How governments rely on CVEs

  • CISA's KEV catalog. The US Known Exploited Vulnerabilities list catalogs CVEs confirmed to be actively exploited. Binding Operational Directive 22-01 legally requires US federal agencies to patch KEV entries within set deadlines.
  • National CERTs publish CVE advisories. CERT/CC, the UK's NCSC, Germany's BSI, Japan's JPCERT/CC, and dozens of other national response teams coordinate disclosure and publish advisories keyed to CVE IDs.
  • Compliance frameworks enforce CVE tracking. FedRAMP, PCI DSS, HIPAA, ISO 27001, SOC 2, and the DoD Risk Management Framework all expect vulnerability management programs that scan, track, and remediate against the CVE feed.

CVEs are the raw material CVEDoc works with. Every score on this site, every data source lookup, every skill label, and every cost ledger row traces back to one or more CVE IDs pulled directly from MITRE. The goal is to take this piece of critical security infrastructure — which is usually opaque to anyone outside the field — and turn it into something anybody can read, verify, and act on.

Researcher Portfolios

CVE IDs are cryptic to anyone outside security. CVE-2025-47939 means nothing to a hiring manager, even when it represents real, high-impact research. CVEDoc portfolios solve that translation problem: you provide a list of the CVEs you've analyzed, and the page turns them into the language non-specialists already understand — skills, specializations, affected digital assets, and verifiable evidence links.

A portfolio is deterministic. There's no LLM involved at the portfolio layer, nothing generative. The same input produces the same page every time, because every number is pure arithmetic over data the analyzer already stored. Every chart, skill label, and specialization traces back to a specific CVE report that anyone can click and audit.

For researchers

A single shareable URL that moves a recruiter from "what do these mean?" to "I can see the impact." Your findings get a digital-assets-protected tally, a specialization breakdown, a target-variety radar chart, and a top-vulnerability-classes bar chart — all built from the stored analyses, none of it spun up on the fly.

For hiring managers

A candidate's CVEs converted into skill labels and specializations you can filter on, with every single number linkable back to the underlying CVEDoc report. No trust-me claims, no AI-generated summaries. You can click through to verify each finding in seconds, and the impact score, data sources, and reasoning chain are visible for every CVE.

What the portfolio shows

  • Header metrics — total CVEs analyzed, total digital assets protected (sum of real installation counts from data sources), unique vendors touched, and the number of target categories.
  • Target variety radar — a radar chart across web apps, servers, frameworks, libraries, mobile, CLI tools, OS, and IoT. Shows at a glance whether the researcher is a specialist or has broad reach.
  • Top vulnerability classes — the most frequent CWEs across analyzed findings, each with its human name (e.g. "CWE-94 — Code Injection"), rendered as a horizontal bar chart.
  • Demonstrated skills — hiring-friendly skill labels derived from the CWEs (Command Injection, Memory Safety, Binary Exploitation, CSRF, Sandbox Escape, etc.). Pure lookup, no inference.
  • Specializations — role-relevant labels derived from the software types the researcher has targeted (Web Application Security, Operating System Security, IoT & Embedded Security, etc.).
  • Findings table — paginated, 12 per page, with CVE ID, title, CWE classes, digital asset count, and a color-coded impact bar. Every row links to the original CVEDoc report for verification.
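
Because every portfolio number is plain arithmetic over stored analyses, the aggregation can be sketched in a few lines of Python. The field names here (installations, vendor, cwes) are illustrative assumptions, not the actual stored schema:

```python
from collections import Counter

def portfolio_metrics(analyses: list[dict]) -> dict:
    """Deterministic aggregation over stored CVE analyses.
    Field names ('installations', 'vendor', 'cwes') are assumed for illustration."""
    return {
        "total_cves": len(analyses),
        # Sum of grounded installation counts across all analyzed CVEs
        "digital_assets_protected": sum(a["installations"] for a in analyses),
        "unique_vendors": len({a["vendor"] for a in analyses}),
        # Most frequent CWE classes, for the bar chart
        "top_cwes": Counter(c for a in analyses for c in a["cwes"]).most_common(5),
    }
```

The same input always yields the same output, which is the point: no generative step, just counting and summing.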

How to create one

Three steps, no accounts, no database migration:

  1. Analyze your CVEs through CVEDoc (either the UI or the API). Each analysis generates a share_id.
  2. Drop a JSON file at backend/portfolios/{username}.json with your display name, GitHub and LinkedIn URLs, tagline, and the list of share IDs.
  3. Visit /portfolio/{username}. The page reads the file, fetches the stored analyses, aggregates on the server, and renders.

The JSON stores only share IDs, not full URLs — so the same portfolio file works on any deployment host, whether it's the live site or a self-hosted copy.
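
Putting the three steps together, a minimal backend/portfolios/{username}.json could look like the sketch below. The exact key names are assumptions for illustration; only the fields themselves (display name, GitHub and LinkedIn URLs, tagline, share IDs) come from the steps above, and the share IDs shown are placeholders:

```json
{
  "display_name": "Jane Doe",
  "github": "https://github.com/janedoe",
  "linkedin": "https://www.linkedin.com/in/janedoe",
  "tagline": "Web application security researcher",
  "share_ids": ["a1b2c3d4", "e5f6a7b8"]
}
```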

Anyone can do this for free

CVEDoc is an open community project. Clone the repo, set your own Anthropic API key, and run it locally or on a VPS — there's no paywall, no seat license, no tier restriction. The live site at cvedoc.cyberm.ca is rate-limited and shares a small public Anthropic credit pool so it doesn't get drained, but the self-hosted version has no cap. If a researcher cares about owning their evidence, they can control the deployment themselves and publish a portfolio that will outlive any one hosted instance.

CVE Impact Analysis

Most CVE databases only provide a CVSS score, which measures technical severity but not real-world impact. A critical RCE in a hobby project with 10 users is very different from the same vulnerability in software running on a billion devices. CVEDoc bridges this gap with a 6-step methodology that measures how widely the affected software is actually deployed.

1. Fetch CVE Record from MITRE

We pull the official CVE record from the MITRE CVE API. This gives us the vulnerability title, description, CVSS score, affected product/version info, and reference links. We also extract each CWE with both its ID and human name (e.g. CWE-94 — Code Injection) so downstream consumers like the portfolio and the report page can show a readable vulnerability class instead of an opaque ID.

2. Classify Software Type via LLM

An LLM reads the CVE description and classifies the affected software into types like web_app, library, os, mobile_app, cli, etc. Each classification comes with a confidence percentage.

The LLM also extracts the product name, vendor, programming language, and package ecosystem. This is what drives which data sources we query next. Only classifications with 60%+ confidence are used for routing decisions.

3. Query Data Sources Based on Type

Different software types have different adoption signals. We route to the right data sources based on the classification:

  • BuiltWith — web_app, server, framework. Tracks how many live websites use a given technology. Only queried for web-facing software.
  • Package Registries — library, framework, sdk, cli. PyPI, npm, Packagist, RubyGems, Crates.io, NuGet, Maven. Checks weekly download counts.
  • Google Play — mobile_app. Scrapes the Google Play Store for download counts and install base.
  • Apple App Store — mobile_app. Scrapes apps.apple.com search results, navigates to the app page, and extracts the ratings count. Apple doesn't expose download numbers, but ratings count indicates usage scale.
  • Project Website Crawl — all types. The LLM finds the official project website, we crawl it, and a second LLM pass extracts any usage/adoption numbers mentioned on the page (e.g. "used by 500,000 companies").

Usage numbers from these sources are concrete figures retrieved from external websites. They represent real deployment data, not estimates.
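
The routing above amounts to a lookup table keyed on the classified software type, gated by the 60% confidence threshold from step 2. A minimal Python sketch, with illustrative names (the real routing logic is not published here):

```python
# Hypothetical sketch of type-to-source routing. Source names and type
# labels mirror the list above; the table and function are illustrative.
ROUTES = {
    "web_app":    ["BuiltWith", "Project Website Crawl"],
    "server":     ["BuiltWith", "Project Website Crawl"],
    "framework":  ["BuiltWith", "Package Registries", "Project Website Crawl"],
    "library":    ["Package Registries", "Project Website Crawl"],
    "sdk":        ["Package Registries", "Project Website Crawl"],
    "cli":        ["Package Registries", "Project Website Crawl"],
    "mobile_app": ["Google Play", "Apple App Store", "Project Website Crawl"],
}

def sources_for(software_type: str, confidence: float) -> list[str]:
    """Pick data sources for a classified type; below 60% confidence,
    fall back to the one source that applies to all types."""
    if confidence < 0.60:
        return ["Project Website Crawl"]
    return ROUTES.get(software_type, ["Project Website Crawl"])
```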

4. GitHub Stats (Secondary Signal)

We search GitHub for the project repository and pull stars, forks, watchers, open issues, and open PRs. This is a secondary popularity signal — stars don't equal deployments, but they do indicate community interest and ecosystem importance.

5. LLM Reasoning & Domain Knowledge

An LLM reviews all gathered data and generates a usage/adoption summary. Crucially, when data sources return low or irrelevant numbers, the LLM applies domain knowledge to fill the gap.

For example, no public data source reports "Windows Notepad installations" — but the LLM knows it ships with every copy of Windows (1B+ devices). This domain knowledge estimate feeds into the final score when it exceeds what the data sources found.

6. Compute Impact Score (0-100)

The final score is built from 5 weighted factors:

  • CVSS Base Score — max 25 pts. CVSS × 2.5, so a 10.0 scores the full 25.
  • Deployment Scale — max 40 pts. 1M+ installations = 40, 100K+ = 32, 10K+ = 24, 1K+ = 16, 100+ = 8.
  • GitHub Popularity — max 15 pts. 50K+ stars = 15, 10K+ = 12, 1K+ = 8, 100+ = 4.
  • Source Coverage — max 10 pts. 3 pts per data source that returned results (max 4 sources).
  • Dangerous CWE — max 10 pts. +10 if the CWE is command injection, SQLi, RCE, file upload, deserialization, hardcoded credentials, or auth bypass.
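
Summing the five factors is straightforward arithmetic. The sketch below mirrors the tiers listed above; how the "3 pts per source" rule interacts with the 10-point cap is an assumption, so treat this as illustrative rather than the production scorer:

```python
def impact_score(cvss: float, installs: int, stars: int,
                 sources_with_results: int, dangerous_cwe: bool) -> float:
    """Illustrative five-factor impact score; tier cutoffs mirror the list."""
    score = min(cvss * 2.5, 25.0)                       # CVSS base (max 25)
    for cutoff, pts in [(1_000_000, 40), (100_000, 32),
                        (10_000, 24), (1_000, 16), (100, 8)]:
        if installs >= cutoff:                          # deployment scale (max 40)
            score += pts
            break
    for cutoff, pts in [(50_000, 15), (10_000, 12), (1_000, 8), (100, 4)]:
        if stars >= cutoff:                             # GitHub popularity (max 15)
            score += pts
            break
    score += min(3 * sources_with_results, 10)          # source coverage (max 10, assumed cap)
    if dangerous_cwe:
        score += 10                                     # dangerous CWE bonus
    return min(score, 100.0)
```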

Installations vs. Estimated Users

Every CVE report surfaces two usage numbers with very different confidence levels:

  • Installations (also called "digital assets") — the real number reported by external data sources: BuiltWith live sites, package registry download counts, or a number scraped straight off the project's own website. When data sources come up dry, this can also come from LLM domain knowledge (for example: "Windows Notepad ships on every Windows install, ~1B+"). In either case, it's grounded.
  • Estimated users — installations multiplied by 10. This is an inference, not a measurement. We have no way to know the true user-to-install ratio for any given software, so we flag the number as an estimate and keep it separate from the grounded installation count.

Because installations is the reliable metric, it's what the portfolio page uses for both the "Digital assets protected" summary card and the per-CVE column. The estimated-users number is still visible on the individual CVE report page for callers who want it, but it deliberately doesn't feed into any aggregated portfolio total.
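
The relationship between the two numbers is a single multiplication, easy to make explicit (function and key names are illustrative):

```python
def usage_numbers(installations: int) -> dict:
    """Pair the grounded installation count with the flagged x10 user estimate."""
    return {
        "installations": installations,         # grounded: data sources or domain knowledge
        "estimated_users": installations * 10,  # inference: always flagged as an estimate
    }
```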

Impact Level

The impact level label is based purely on usage scale:

  • Critical — 1M+ installations
  • High — 100K+
  • Medium — 10K+
  • Low — under 10K
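
As a sketch, the label is a straight threshold lookup on the grounded installation count:

```python
def impact_level(installations: int) -> str:
    """Map an installation count to its impact label; cutoffs mirror the list."""
    if installations >= 1_000_000:
        return "Critical"
    if installations >= 100_000:
        return "High"
    if installations >= 10_000:
        return "Medium"
    return "Low"
```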

GitHub Repository Analysis

Some GitHub repositories exhibit suspicious CVE reporting patterns — long-time core maintainers filing numerous low-quality CVEs against their own project, potentially to inflate their security credentials or game bug bounty programs. CVEDoc's repo scanner detects these patterns with a 6-step analysis.

1. Scrape Security Advisories

We use Playwright to scrape every page of the repository's GitHub Security Advisories (GHSA). For each advisory, we extract the GHSA ID, title, severity, publication date, publisher, and most importantly the credits section — who was credited as Reporter, Analyst, or Finder for each vulnerability.

2. Fetch Top Contributors

We query the GitHub API for the repository's contributor statistics — the top 10 committers ranked by total commits. For each contributor we record their total lifetime commits, commits in the past year, first commit date, and last commit date.

3. Cross-Reference Credits vs. Contributors

We match each GHSA-credited username against the top contributors list. This is the core question: are the people reporting vulnerabilities the same people who write the code? Overlap between these groups is a potential red flag.

4. Classify Each Credited Person

Each person credited on a security advisory is classified into one of four categories:

  • Long-Tenure Core Committer — top 10 contributor with 180+ days tenure AND 50+ commits. Counted as insider.
  • Moderate Committer — top 10 contributor with shorter tenure or fewer commits. Counted as insider.
  • Recent Joiner — top 10 contributor but very new (under 180 days or under 20 commits). Counted as external.
  • External Reporter — not in the top 10 contributors at all. Counted as external.
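
The four categories reduce to a couple of comparisons. The order in which the checks are applied is an assumption, since the "shorter tenure or fewer commits" and "very new" buckets overlap on paper:

```python
def classify_credited(in_top10: bool, tenure_days: int,
                      commits: int) -> tuple[str, str]:
    """Illustrative four-way classification of a credited reporter.
    Returns (category, insider_or_external); check order is assumed."""
    if not in_top10:
        return ("External Reporter", "external")
    if tenure_days >= 180 and commits >= 50:
        return ("Long-Tenure Core Committer", "insider")
    if tenure_days < 180 or commits < 20:
        return ("Recent Joiner", "external")
    return ("Moderate Committer", "insider")
```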

5. Reporter Diversity Check

We analyze the ratio of insider vs. external reporters across all advisories. A healthy project should have a mix — many different external security researchers finding bugs. A project where 100% of CVEs are filed by the same 1-2 core maintainers is suspicious.

6. Compute Suspicion Score (0-100)

The suspicion score combines four weighted factors:

  • Insider CVE % — max 40 pts. Over 80% insider = 40, 61-80% = 30, 41-60% = 20, 1-40% = 10.
  • No External Reporters — max 25 pts. Zero external reporters = 25, only 1 external with many CVEs = 15, low external ratio = 10.
  • Insider Concentration — max 20 pts. Same insider credited 5+ times with many CVEs = 20, 3+ times = 10.
  • CVE Velocity — max 15 pts. Over 5 CVEs/year with 10+ total = 15, 3-5/year = 8.
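
The factor table translates into simple threshold arithmetic. The vaguer cutoffs are not specified, so the sketch below assumes 10+ CVEs counts as "many" and an external share under 20% counts as a "low ratio"; both are labeled assumptions, not the production rules:

```python
def suspicion_score(insider_pct: float, external_reporters: int,
                    max_insider_credits: int, cves_per_year: float,
                    total_cves: int) -> int:
    """Illustrative four-factor suspicion score; vague cutoffs are assumed."""
    MANY_CVES = 10    # assumed meaning of "many CVEs"
    LOW_RATIO = 20.0  # assumed: external share under 20% is "low"
    score = 0
    # Factor 1: insider CVE percentage (max 40)
    if insider_pct > 80:
        score += 40
    elif insider_pct > 60:
        score += 30
    elif insider_pct > 40:
        score += 20
    elif insider_pct > 0:
        score += 10
    # Factor 2: lack of external reporters (max 25)
    if external_reporters == 0:
        score += 25
    elif external_reporters == 1 and total_cves >= MANY_CVES:
        score += 15
    elif (100.0 - insider_pct) < LOW_RATIO:
        score += 10
    # Factor 3: insider concentration (max 20)
    if max_insider_credits >= 5 and total_cves >= MANY_CVES:
        score += 20
    elif max_insider_credits >= 3:
        score += 10
    # Factor 4: CVE velocity (max 15)
    if cves_per_year > 5 and total_cves >= 10:
        score += 15
    elif cves_per_year >= 3:
        score += 8
    return min(score, 100)
```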

What the Score Means

A high suspicion score does not mean the CVEs are fake. It means the reporting pattern is unusual and worth investigating. Legitimate projects can score high if, for example, a security-focused maintainer proactively audits their own code. The score is a starting point for human review, not a verdict.

  • 70+ — High Suspicion
  • 40-69 — Moderate
  • 15-39 — Low
  • Under 15 — No Suspicion

Shareable Reports

Every completed scan generates a permanent report at a unique URL that can be shared with anyone. Reports include a full breakdown of the score, all data sources checked, reasoning chain, and a timestamped record in EST. Reports can be exported as JSON for integration into other tools and workflows.