Every record in the Valan dataset traces back to a named official government source. Here is exactly how it gets from there to your infrastructure.
Data is collected directly from official government procurement portals — the authoritative source of record for each jurisdiction. We do not use aggregators, secondary providers, or press-release scraping.
Each source is monitored daily. New publications are detected and ingested within hours of appearing on the source portal.
Government procurement data arrives in every format imaginable — structured XML, JSON APIs, HTML tables, PDF notices, CSV bulk downloads. Each source has a dedicated parser built and maintained for its specific format and schema.
Parsing is source-aware: field definitions, date formats, currency codes, and agency identifiers are all handled at the source level before normalisation.
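The per-source dispatch described above can be sketched as follows. This is a minimal illustration, not the production parser code: the source names, payload field names, and date formats shown here are assumptions chosen for the example.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class RawRecord:
    source: str   # which portal the record came from
    payload: dict # raw fields as published by that portal

# Each hypothetical parser knows its own source's field names and date
# format, and emits a partially normalised dict before the common-schema
# mapping step.
def parse_us_fpds(payload: dict) -> dict:
    return {
        "title": payload["award_description"],
        "published": datetime.strptime(payload["date_signed"], "%m/%d/%Y").date(),
        "value": float(payload["obligated_amount"]),
        "currency": "USD",
    }

def parse_br_pncp(payload: dict) -> dict:
    return {
        "title": payload["objetoCompra"],
        "published": datetime.strptime(payload["dataPublicacao"], "%Y-%m-%d").date(),
        "value": float(payload["valorTotal"]),
        "currency": "BRL",
    }

PARSERS = {"us_fpds": parse_us_fpds, "br_pncp": parse_br_pncp}

def parse(record: RawRecord) -> dict:
    # Dispatch to the dedicated parser for this record's source.
    return PARSERS[record.source](record.payload)
```

The point of the registry is that format quirks are contained at the edge: downstream normalisation only ever sees one record shape.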
All parsed records are mapped to a single common schema regardless of source. A US federal contract award, a Brazilian PNCP tender, and a Czech public procurement record end up in the same table with the same field names and the same data types.
Currency normalisation converts all values to USD at the publication-date exchange rate, with original currency and value preserved. Dates are normalised to ISO 8601. Agency names are standardised within each jurisdiction.
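The currency and date normalisation step can be sketched like this. The rate table is a stand-in for an FX history lookup, and the output field names are illustrative; the key point from the text is that the USD value is computed at the publication-date rate while the original value and currency are preserved alongside it.

```python
from datetime import date

# Stand-in FX table keyed by (currency, publication date); in a real
# pipeline this would come from a historical rates service.
FX_RATES = {("BRL", date(2024, 3, 1)): 0.201, ("CZK", date(2024, 3, 1)): 0.043}

def normalise_value(value: float, currency: str, published: date) -> dict:
    rate = 1.0 if currency == "USD" else FX_RATES[(currency, published)]
    return {
        "value_usd": round(value * rate, 2),     # publication-date conversion
        "value_original": value,                 # original value preserved
        "currency_original": currency,           # original currency preserved
        "published": published.isoformat(),      # ISO 8601 date
    }
```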
Award records are enriched with supplier entity resolution — matching award recipients to verified company identifiers (LEI, CRO, registration numbers) and, where applicable, listed equity tickers.
The ticker crosswalk covers 117,000 listed securities across global exchanges, enabling direct linkage between government award data and equity research workflows.
Every record passes automated validation checks before promotion to the live dataset: duplicate detection across source and source identifier, value range validation, mandatory field checks, and PII screening to ensure no personal data is retained in the delivered dataset.
Data quality issues are flagged and quarantined for review before any affected records reach clients.
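The validation gate can be sketched as a function that returns a list of issues; an empty list means the record is promotable, anything else sends it to quarantine. Field names, the value bound, and the PII pattern are illustrative assumptions, but the four check categories match the ones listed above.

```python
import re

MANDATORY = {"source", "source_id", "title", "published", "value_usd"}
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")  # crude PII screen example

def validate(record: dict, seen: set) -> list[str]:
    """Return issues found; an empty list means the record may be promoted."""
    issues = []
    key = (record.get("source"), record.get("source_id"))
    if key in seen:
        issues.append("duplicate")  # duplicate across source + source identifier
    missing = MANDATORY - record.keys()
    if missing:
        issues.append(f"missing fields: {sorted(missing)}")
    value = record.get("value_usd")
    if value is not None and not (0 <= value < 1e12):
        issues.append("value out of range")  # illustrative bound
    if any(EMAIL_RE.search(str(v)) for v in record.values()):
        issues.append("possible PII")  # e.g. a contact email left in free text
    seen.add(key)
    return issues
```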
The validated dataset is exported daily in columnar Parquet format and delivered via S3 to client-specified buckets. Snowflake delivery and custom pipeline integration are available by arrangement.
Incremental daily updates contain only new and modified records, minimising data transfer. Full historical snapshots are available on request.
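Conceptually, building a daily incremental means diffing today's snapshot against yesterday's by record identity and content. A minimal sketch, assuming records are keyed by source and source identifier (the real export operates on Parquet, not in-memory lists):

```python
import hashlib
import json

def record_key(r: dict) -> tuple:
    return (r["source"], r["source_id"])

def record_hash(r: dict) -> str:
    # Stable content hash so modified records are detected field-by-field.
    return hashlib.sha256(json.dumps(r, sort_keys=True).encode()).hexdigest()

def daily_increment(previous: list[dict], current: list[dict]) -> list[dict]:
    """Return only records that are new or modified since the last snapshot."""
    prev = {record_key(r): record_hash(r) for r in previous}
    return [
        r for r in current
        if record_key(r) not in prev or prev[record_key(r)] != record_hash(r)
    ]
```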
Every award and tender record, regardless of source, maps to the same normalised schema, with the same key fields available consistently across the dataset.
Data quality in procurement is harder than it looks. Governments republish, amend, cancel, and re-award contracts. Values change. Suppliers merge. Agencies restructure.
Our pipeline tracks amendments and modifications — where a source publishes contract updates, those flow through to the dataset with the original record preserved and the amendment flagged. Clients see both the original obligation and any subsequent modifications.
Ceiling values on indefinite delivery vehicles — a common source of apparent anomalies in US federal data — are identified and flagged separately from actual obligated amounts.
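One way to picture the amendment model described above: the original obligation is never overwritten, each modification is appended and flagged, and ceiling values carry their own marker. The class and field names here are a hypothetical sketch of that record shape, not the actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Award:
    source_id: str
    obligated_usd: float          # the original obligation, preserved as-is
    is_ceiling: bool = False      # True for IDV ceiling values, not obligations
    amendments: list = field(default_factory=list)

    def amend(self, new_value_usd: float, note: str) -> None:
        # Append and flag the modification; the original is never mutated.
        self.amendments.append({"value_usd": new_value_usd, "note": note})

    @property
    def current_value(self) -> float:
        # Latest amended value, falling back to the original obligation.
        return self.amendments[-1]["value_usd"] if self.amendments else self.obligated_usd
```

Clients can thus read both sides of the history: `obligated_usd` for the original record and `current_value` after modifications.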