Data Methodology

How the
data is built.

Every record in the Valan dataset traces back to a named official government source. Here is exactly how it gets from there to your infrastructure.

The Collection Pipeline

01 — Collection

Primary sources only

Data is collected directly from official government procurement portals — the authoritative source of record for each jurisdiction. We do not use aggregators, secondary providers, or press release scraping.

Each source is monitored daily. New publications are detected and ingested within hours of appearing on the source portal.

80+ sources · 140 countries · T+0 from publication
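The detection step above can be sketched as a simple diff against what has already been ingested. The function and field names here are illustrative, not the production crawler:

```python
# Sketch of daily source monitoring: a new publication is anything on the
# portal's listing that we have not yet ingested. "notice_id" is a
# hypothetical per-source identifier.
def detect_new(portal_listing, ingested_ids):
    """Return notices that appeared since the last crawl."""
    return [n for n in portal_listing if n["notice_id"] not in ingested_ids]
```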
02 — Parsing

Multi-format extraction

Government procurement data arrives in every format imaginable — structured XML, JSON APIs, HTML tables, PDF notices, CSV bulk downloads. Each source has a dedicated parser built and maintained for its specific format and schema.

Parsing is source-aware: field definitions, date formats, currency codes, and agency identifiers are all handled at the source level before normalisation.

Structured · Semi-structured · PDF extraction · 8 languages
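Source-aware parsing can be pictured as a registry of per-source parse functions, each handling its portal's field names and date format before anything reaches normalisation. The source identifiers, raw field names, and date formats below are illustrative, not the production schemas:

```python
from datetime import datetime

# Hypothetical per-source parser registry: each portal gets a dedicated
# parse function that knows its own fields and conventions.
PARSERS = {}

def parser(source_id):
    """Register a parse function for one source portal."""
    def register(fn):
        PARSERS[source_id] = fn
        return fn
    return register

@parser("us_sam")
def parse_us_sam(raw):
    # Illustrative US feed: dates as MM/DD/YYYY, values already in USD.
    return {
        "title": raw["award_title"],
        "published_date": datetime.strptime(raw["posted"], "%m/%d/%Y").date(),
        "value_original": float(raw["amount"]),
        "currency": "USD",
    }

@parser("br_pncp")
def parse_br_pncp(raw):
    # Illustrative Brazilian feed: dates as DD/MM/YYYY, values in BRL.
    return {
        "title": raw["objeto"],
        "published_date": datetime.strptime(raw["dataPublicacao"], "%d/%m/%Y").date(),
        "value_original": float(raw["valorTotal"]),
        "currency": "BRL",
    }

def parse(source_id, raw):
    """Dispatch a raw record to its source's dedicated parser."""
    return PARSERS[source_id](raw)
```

The point of the dispatch is that all format quirks are contained in one function per source; everything downstream sees the same shape.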
03 — Normalisation

One schema

All parsed records are mapped to a single common schema regardless of source. A US federal contract award, a Brazilian PNCP tender, and a Czech public procurement record end up in the same table with the same field names and the same data types.

Currency normalisation converts all values to USD at the publication-date exchange rate, with original currency and value preserved. Dates are normalised to ISO 8601. Agency names are standardised within each jurisdiction.

Common schema · ISO currencies · Standardised taxonomies
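The currency step can be sketched as a lookup against a historical rates table keyed by currency and publication date. The rates table here is a two-row stand-in for the real FX feed, and the numbers are made up:

```python
from datetime import date

# Illustrative publication-date FX table; the real pipeline uses a full
# historical rates feed. The lookup logic is what matters here.
FX_USD = {
    ("BRL", date(2024, 3, 5)): 0.2021,
    ("CZK", date(2024, 3, 5)): 0.0427,
}

def normalise_value(value_original, currency, published_date):
    """Convert to USD at the publication-date rate, keeping the original."""
    rate = 1.0 if currency == "USD" else FX_USD[(currency, published_date)]
    return {
        "value_original": value_original,       # original value preserved
        "currency": currency,                   # original ISO 4217 code
        "value_usd": round(value_original * rate, 2),
        "published_date": published_date.isoformat(),  # ISO 8601
    }
```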
04 — Enrichment

Entity resolution

Award records are enriched with supplier entity resolution — matching award recipients to verified company identifiers (LEI, CRO, registration numbers) and, where applicable, listed equity tickers.

The ticker crosswalk covers 117,000 listed securities across global exchanges, enabling direct linkage between government award data and equity research workflows.

117K tickers · LEI resolution · 3M+ entity aliases
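Entity resolution can be sketched as name normalisation followed by alias and crosswalk lookups. The company names, LEI, and ticker below are invented for illustration; the real tables cover 117K listed securities and 3M+ aliases:

```python
import re

# Illustrative crosswalk and alias tables (all entries made up).
# Alias keys are stored in already-normalised form.
CROSSWALK = {
    "acme corp": {"lei": "5299001EXAMPLE00001X", "ticker": "ACME"},
}
ALIASES = {
    "acme corporation": "acme corp",
}

def normalise_name(raw_name):
    """Lowercase, strip punctuation, collapse whitespace."""
    name = re.sub(r"[^\w\s]", "", raw_name.lower())
    return re.sub(r"\s+", " ", name).strip()

def resolve_supplier(raw_name):
    """Map a reported supplier name to LEI and ticker, where known."""
    name = normalise_name(raw_name)
    name = ALIASES.get(name, name)
    match = CROSSWALK.get(name)
    return {
        "supplier_name": raw_name,  # as-reported name always preserved
        "supplier_lei": match["lei"] if match else None,
        "supplier_ticker": match["ticker"] if match else None,
    }
```

Unresolved names simply carry null identifiers; the as-reported name is never discarded.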
05 — Quality Control

Automated validation

Every record passes automated validation checks before promotion to the live dataset: duplicate detection across source and source identifier, value range validation, mandatory field checks, and PII screening to ensure no personal data is retained in the delivered dataset.

Data quality issues are flagged and quarantined for review before any affected records reach clients.

Deduplication · PII-clean · Range validation · Daily audit
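The promotion gate can be pictured as a list of checks per record, with any failure routing the record to quarantine instead of the live dataset. The field names follow the common schema below; the value-range threshold is illustrative:

```python
# Sketch of the validation gate: every record passes every check or is
# quarantined for review. Thresholds here are illustrative.
MANDATORY = ("source", "source_country", "published_date", "title")

def validate(record, seen_keys):
    """Return the list of failed checks; an empty list means promote."""
    failures = []
    if (record.get("source"), record.get("source_id")) in seen_keys:
        failures.append("duplicate")  # dedup across source + identifier
    for field in MANDATORY:
        if not record.get(field):
            failures.append(f"missing:{field}")
    value = record.get("value_usd")
    if value is not None and not (0 <= value <= 1e12):
        failures.append("value_out_of_range")
    return failures

def promote(records):
    """Split a batch into live records and quarantined (record, reasons) pairs."""
    live, quarantine, seen = [], [], set()
    for rec in records:
        failures = validate(rec, seen)
        if failures:
            quarantine.append((rec, failures))
        else:
            seen.add((rec["source"], rec["source_id"]))
            live.append(rec)
    return live, quarantine
```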
06 — Delivery

Daily to your infrastructure

The validated dataset is exported daily in structured Parquet format and delivered via S3 to client-specified buckets. Snowflake delivery and custom pipeline integration are available by arrangement.

Incremental daily updates contain only new and modified records, minimising data transfer. Full historical snapshots are available on request.

Parquet · S3 · Snowflake · Daily incremental
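The incremental logic reduces to a diff between consecutive daily snapshots keyed on source and source identifier. This sketch shows the diff only; in production the delta would then be written as Parquet and uploaded to the client bucket:

```python
# Sketch of the daily incremental: ship only records that are new or
# changed since the previous export. The actual export would serialise
# the delta as Parquet and push it to S3 — omitted here.
def incremental(previous, current):
    """Return records in `current` that are new or modified vs `previous`."""
    prev_by_key = {(r["source"], r["source_id"]): r for r in previous}
    return [
        rec for rec in current
        if prev_by_key.get((rec["source"], rec["source_id"])) != rec
    ]
```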
Refresh Cadence
Daily
New records ingested within hours of government publication. T+0 for all major source markets.
Historical Depth
2010 – present
Core markets back to 2010. Some sources extend further. Full backfill available on request.
Delivery Format
Parquet · S3
Structured columnar format. Snowflake delivery and custom integration available.
Languages Parsed
8 languages
English, Mandarin, Arabic, Spanish, Portuguese, French, German, Polish — with machine translation for secondary fields.
PII Policy
Zero retention
Personal data is stripped at ingestion. Delivered dataset contains only B2B entity and agency data.
Compliance
GDPR · Irish law
Operated under Irish law. All source data is publicly published by government. Full provenance available.
Common Schema

Core Fields

Every award and tender record — regardless of source — maps to the same normalised schema. Key fields available across the dataset:

source
Origin portal identifier — the official government source of record
source_country
ISO 3166-1 alpha-2 country code
published_date
Date the record appeared on the source portal (ISO 8601)
title
Contract or tender title as published, with English translation where applicable
value_usd
Normalised USD value at publication-date FX rate
value_original
Original reported value in source currency
currency
ISO 4217 currency code of original value
supplier_name
Awarded supplier entity name as reported
supplier_lei
Legal Entity Identifier where resolved
supplier_ticker
Listed equity ticker where entity-matched (117K securities)
agency_name
Contracting authority name, standardised within jurisdiction
category
CPV / NAICS / source-native category code, normalised
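Put together, a single record in the common schema looks like the following. Every value here is invented for illustration:

```python
# One illustrative award record in the common schema (all values made up).
record = {
    "source": "br_pncp",
    "source_country": "BR",                  # ISO 3166-1 alpha-2
    "published_date": "2024-03-05",          # ISO 8601
    "title": "Supply of hospital equipment",
    "value_usd": 202100.00,                  # publication-date FX rate
    "value_original": 1000000.00,
    "currency": "BRL",                       # ISO 4217
    "supplier_name": "Example Medical Ltda",
    "supplier_lei": None,                    # not resolved for this record
    "supplier_ticker": None,
    "agency_name": "Ministry of Health",
    "category": "33100000",                  # CPV-style code, illustrative
}
```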

What clean
looks like.

  • Deduplication across source and source identifier
  • Value outlier detection and flagging
  • Mandatory field validation before promotion
  • PII screening — zero personal data retained
  • Currency normalisation with FX audit trail
  • Daily automated quality audit
  • Full provenance — every record traces to source URL

Data quality in procurement is harder than it looks. Governments republish, amend, cancel, and re-award contracts. Values change. Suppliers merge. Agencies restructure.

Our pipeline tracks amendments and modifications — where a source publishes contract updates, those flow through to the dataset with the original record preserved and the amendment flagged. Clients see both the original obligation and any subsequent modifications.

Ceiling values on indefinite delivery vehicles — a common source of apparent anomalies in US federal data — are identified and flagged separately from actual obligated amounts.
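Amendment handling can be sketched as append-only history: a modification never overwrites the original record, it arrives as a new row flagged and linked back to its parent. The flag and link field names below are illustrative:

```python
# Sketch of amendment tracking: the original obligation is preserved and
# each modification is appended, flagged, and linked to its parent record.
# "is_amendment" and "amends" are hypothetical field names.
def apply_amendment(history, amendment):
    """Append an amendment to a record's history without overwriting it."""
    original = history[0]
    history.append({
        **original,
        **amendment,
        "is_amendment": True,
        "amends": (original["source"], original["source_id"]),
    })
    return history
```

Clients querying the dataset thus see both the original obligation and every subsequent modification as distinct rows.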

Questions about
coverage or quality?

Contact john@valan.io