Provenance
Live trust scoring for blogs, YouTube, and PubMed
Trust dashboard / live pipeline

Make the source readable before you trust it.

Provenance pulls content from three source types, normalizes the result, scores trust with visible factors, and keeps the evidence attached so the output can be validated instead of guessed.

-- seeded sources · 1 live scrape endpoint · 4 explainable trust factors

API Routes

The backend endpoints powering this dashboard.

Swagger UI

/docs

Explore all endpoints via interactive API documentation.

Status

/health

Server health check and basic system info.

Metadata

/summary

High-level summary of the scraped data.

Data GET

/sources

Paginated list of all scraped source documents.

Action POST

/scrape

Live scrape pipeline for standard URLs.

How the system works

Use the arrow controls to move through each connected stage. The stage panel updates in sync so you can track what changed and why.

Step 01 / collect

Grab the source with the right tool

Blogs, YouTube, and PubMed each need different fetch logic, so collection starts with the right tool for each source and keeps source-specific metadata intact.

URL Metadata Transcript / abstract
Step 02 / normalize

Turn messy fields into one clean record

Noise gets removed, language gets detected, tags get attached, and the output becomes comparable across source families.

Cleaning Language Chunks
Step 03 / score

Score trust with visible reasons

Each score is computed from explicit factors, so every confidence claim is traceable to evidence.

Author Recency Citations
Step 04 / deliver

Keep the result evidence-backed

The final record includes the score, factor breakdown, flags, and content chunks so every claim has a clear audit trail.

Breakdown Flags Chunks

How scraping works in practice

This is the exact method used for each source type, including fallbacks when metadata or transcripts are missing.

Blog pipeline

Requests + BeautifulSoup, with JS fallback path

  • 1 Fetch HTML via HTTP GET with standard browser headers and a 15s timeout.
  • 2 Parse metadata (title, author, publish date) looking at OpenGraph tags, JSON-LD structured data, and common CSS class patterns like .author or .byline.
  • 3 Extract the main body text by prioritizing <article>, <main>, or #content tags, falling back to all <p> tags if necessary. Strip out navs, ads, and footers.
  • 4 Count visible citations by scanning the text for DOIs, PubMed IDs, or standard bracketed reference markers like [1].
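The citation counting in step 4 can be sketched as a regex scan over the extracted body text. This is a minimal illustration; the exact pattern set and the function name `count_citations` are assumptions, not the project's actual code:

```python
import re

# Patterns for visible citations: DOIs, PubMed IDs, and bracketed markers like [1].
CITATION_PATTERNS = [
    re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+"),   # DOI, e.g. 10.1000/xyz123
    re.compile(r"\bPMID:?\s*\d{6,9}\b", re.IGNORECASE),  # PubMed ID reference
    re.compile(r"\[\d{1,3}\]"),                          # bracketed reference marker
]

def count_citations(text: str) -> int:
    """Count visible citation markers in extracted body text."""
    return sum(len(p.findall(text)) for p in CITATION_PATTERNS)
```

A text containing `[1]`, `[2]`, one DOI, and one `PMID: 1234567` reference would count as four citations.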
YouTube pipeline

oEmbed + page metadata + transcript recovery

  • 1 Extract the unique Video ID from the provided URL (handles standard youtube.com/watch and shortened youtu.be formats).
  • 2 Use the official YouTube oEmbed API to fetch the video title and channel name without requiring a developer API key.
  • 3 Attempt to fetch the video transcript using the youtube-transcript-api to get the actual spoken words as the primary content payload.
  • 4 If the transcript is private or missing, gracefully fall back to parsing the raw video description text and apply a transcript_unavailable risk flag.
PubMed pipeline

Entrez XML first, HTML fallback second

  • 1 Parse the PMID (PubMed ID) directly from the URL.
  • 2 Use Biopython's Entrez E-utilities to hit the NCBI database. This returns structured, completely clean XML instead of messy HTML.
  • 3 Extract the title, author list, journal name, publication date, and full abstract directly from the XML tree.
  • 4 If the Entrez API is down or rate-limited, fall back to a traditional BeautifulSoup HTML scrape of the PubMed web page to ensure the pipeline doesn't break.
Unified scoring pipeline

One normalized schema for all source types

  • Raw text is cleaned, language-detected, topic-tagged, and chunked before scoring.
  • Trust score is computed from explicit factors: author credibility, citations, domain authority, recency, and disclaimers.
  • The API returns score + factor reasons + penalties + risk flags so every confidence claim remains auditable.
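The scoring step can be sketched as an explicit-factor function over the normalized record. The factor names come from the list above; the weights, penalty values, and field names are illustrative assumptions, not the system's real tuning:

```python
def score_trust(record: dict) -> dict:
    """Combine explicit factors into a 0-100 score with a visible breakdown.

    All weights and penalty values here are illustrative assumptions.
    """
    factors = {
        "author_credibility": 25 if record.get("author") else 0,
        "citations": min(record.get("citation_count", 0) * 5, 20),
        "domain_authority": 20 if record.get("domain_trusted") else 5,
        "recency": 20 if record.get("published_within_days", 9999) <= 365 else 5,
        "disclaimers": 15 if record.get("has_disclaimer") else 0,
    }
    flags = record.get("flags", [])
    penalties = []
    if "transcript_unavailable" in flags:
        penalties.append(("transcript_unavailable", -10))
    score = sum(factors.values()) + sum(p for _, p in penalties)
    return {
        "score": max(0, min(100, score)),
        "factors": factors,      # per-factor reasons behind the score
        "penalties": penalties,  # explicit deductions, never hidden
        "flags": flags,          # risk flags carried through to the response
    }
```

Because the return value keeps the factors, penalties, and flags alongside the number, every confidence claim stays auditable, which is the point of the design.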

Live scrape demo

Paste a URL, pick a source type, and the page will render the response with score, reasons, flags, and chunks.

Try it live

One input. One trust score. No hidden guesswork.

The interface is designed to feel like a tool, not a landing page. The form stays simple, the output stays readable, and the result keeps its evidence attached.

Visible scoring: the score comes with a breakdown and a short reason.
Source-specific fetches: the scraper changes based on whether the input is a blog, video, or paper.
Evidence-backed output: flags, factor reasons, and chunks are kept with the response so the rationale is always visible.
Scraping, cleaning, tagging, and scoring…
Live request

Waiting for input

Submit a URL to see the score, the breakdown, and the evidence behind it.


No result yet

The score explanation will show up here after the scrape completes.

-- Trust Score
Breakdown
Flags
Chunks