Data Methodology

How the
data is built.

Every record in the Valan dataset traces back to a named official government source. Here is exactly how it gets from there to your infrastructure.

The Collection Pipeline

01 — Collection

Primary sources only

Data is collected directly from official government procurement portals — the authoritative source of record for each jurisdiction. We do not use aggregators, secondary providers, or press release scraping.

Each source is monitored daily. New publications are detected and ingested within hours of appearing on the source portal.

80+ sources · 140 countries · T+0 from publication
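The detection step above can be sketched as a simple diff against what has already been ingested. The function and field names here are illustrative, not the production crawler:

```python
# Sketch of daily source monitoring: a new publication is anything on the
# portal's listing that we have not yet ingested. "notice_id" is a
# hypothetical per-source identifier.
def detect_new(portal_listing, ingested_ids):
    """Return notices that appeared since the last crawl."""
    return [n for n in portal_listing if n["notice_id"] not in ingested_ids]
```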
02 — Parsing

Multi-format extraction

Government procurement data arrives in every format imaginable — structured XML, JSON APIs, HTML tables, PDF notices, CSV bulk downloads. Each source has a dedicated parser built and maintained for its specific format and schema.

Parsing is source-aware: field definitions, date formats, currency codes, and agency identifiers are all handled at the source level before normalisation.

Structured · Semi-structured · PDF extraction · 8 languages
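Source-aware parsing can be pictured as a registry of per-source parse functions, each handling its portal's field names and date format before anything reaches normalisation. The source identifiers, raw field names, and date formats below are illustrative, not the production schemas:

```python
from datetime import datetime

# Hypothetical per-source parser registry: each portal gets a dedicated
# parse function that knows its own fields and conventions.
PARSERS = {}

def parser(source_id):
    """Register a parse function for one source portal."""
    def register(fn):
        PARSERS[source_id] = fn
        return fn
    return register

@parser("us_sam")
def parse_us_sam(raw):
    # Illustrative US feed: dates as MM/DD/YYYY, values already in USD.
    return {
        "title": raw["award_title"],
        "published_date": datetime.strptime(raw["posted"], "%m/%d/%Y").date(),
        "value_original": float(raw["amount"]),
        "currency": "USD",
    }

@parser("br_pncp")
def parse_br_pncp(raw):
    # Illustrative Brazilian feed: dates as DD/MM/YYYY, values in BRL.
    return {
        "title": raw["objeto"],
        "published_date": datetime.strptime(raw["dataPublicacao"], "%d/%m/%Y").date(),
        "value_original": float(raw["valorTotal"]),
        "currency": "BRL",
    }

def parse(source_id, raw):
    """Dispatch a raw record to its source's dedicated parser."""
    return PARSERS[source_id](raw)
```

The point of the dispatch is that all format quirks are contained in one function per source; everything downstream sees the same shape.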
03 — Normalisation

One schema

All parsed records are mapped to a single common schema regardless of source. A US federal contract award, a Brazilian PNCP tender, and a Czech public procurement record end up in the same table with the same field names and the same data types.

Currency normalisation converts all values to USD at the publication-date exchange rate, with original currency and value preserved. Dates are normalised to ISO 8601. Agency names are standardised within each jurisdiction.

Common schema · ISO currencies · Standardised taxonomies
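The currency step can be sketched as a lookup against a historical rates table keyed by currency and publication date. The rates table here is a two-row stand-in for the real FX feed, and the numbers are made up:

```python
from datetime import date

# Illustrative publication-date FX table; the real pipeline uses a full
# historical rates feed. The lookup logic is what matters here.
FX_USD = {
    ("BRL", date(2024, 3, 5)): 0.2021,
    ("CZK", date(2024, 3, 5)): 0.0427,
}

def normalise_value(value_original, currency, published_date):
    """Convert to USD at the publication-date rate, keeping the original."""
    rate = 1.0 if currency == "USD" else FX_USD[(currency, published_date)]
    return {
        "value_original": value_original,       # original value preserved
        "currency": currency,                   # original ISO 4217 code
        "value_usd": round(value_original * rate, 2),
        "published_date": published_date.isoformat(),  # ISO 8601
    }
```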
04 — Enrichment

Entity resolution

Award records are enriched with supplier entity resolution — matching award recipients to verified company identifiers (LEI, CRO, registration numbers) and, where applicable, listed equity tickers.

The ticker crosswalk covers 117,000 listed securities across global exchanges, enabling direct linkage between government award data and equity research workflows.

117K tickers · LEI resolution · 3M+ entity aliases
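Entity resolution can be sketched as name normalisation followed by alias and crosswalk lookups. The company names, LEI, and ticker below are invented for illustration; the real tables cover 117K listed securities and 3M+ aliases:

```python
import re

# Illustrative crosswalk and alias tables (all entries made up).
# Alias keys are stored in already-normalised form.
CROSSWALK = {
    "acme corp": {"lei": "5299001EXAMPLE00001X", "ticker": "ACME"},
}
ALIASES = {
    "acme corporation": "acme corp",
}

def normalise_name(raw_name):
    """Lowercase, strip punctuation, collapse whitespace."""
    name = re.sub(r"[^\w\s]", "", raw_name.lower())
    return re.sub(r"\s+", " ", name).strip()

def resolve_supplier(raw_name):
    """Map a reported supplier name to LEI and ticker, where known."""
    name = normalise_name(raw_name)
    name = ALIASES.get(name, name)
    match = CROSSWALK.get(name)
    return {
        "supplier_name": raw_name,  # as-reported name always preserved
        "supplier_lei": match["lei"] if match else None,
        "supplier_ticker": match["ticker"] if match else None,
    }
```

Unresolved names simply carry null identifiers; the as-reported name is never discarded.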
05 — Quality Control

Automated validation

Every record passes automated validation checks before promotion to the live dataset: duplicate detection across source and source identifier, value range validation, mandatory field checks, and PII screening to ensure no personal data is retained in the delivered dataset.

Data quality issues are flagged and quarantined for review before any affected records reach clients.

Deduplication · PII-clean · Range validation · Daily audit
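The promotion gate can be pictured as a list of checks per record, with any failure routing the record to quarantine instead of the live dataset. The field names follow the common schema below; the value-range threshold is illustrative:

```python
# Sketch of the validation gate: every record passes every check or is
# quarantined for review. Thresholds here are illustrative.
MANDATORY = ("source", "source_country", "published_date", "title")

def validate(record, seen_keys):
    """Return the list of failed checks; an empty list means promote."""
    failures = []
    if (record.get("source"), record.get("source_id")) in seen_keys:
        failures.append("duplicate")  # dedup across source + identifier
    for field in MANDATORY:
        if not record.get(field):
            failures.append(f"missing:{field}")
    value = record.get("value_usd")
    if value is not None and not (0 <= value <= 1e12):
        failures.append("value_out_of_range")
    return failures

def promote(records):
    """Split a batch into live records and quarantined (record, reasons) pairs."""
    live, quarantine, seen = [], [], set()
    for rec in records:
        failures = validate(rec, seen)
        if failures:
            quarantine.append((rec, failures))
        else:
            seen.add((rec["source"], rec["source_id"]))
            live.append(rec)
    return live, quarantine
```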
06 — Delivery

Daily to your infrastructure

The validated dataset is exported daily in structured Parquet format and delivered via S3 to client-specified buckets. Snowflake delivery and custom pipeline integration are available by arrangement.

Incremental daily updates contain only new and modified records, minimising data transfer. Full historical snapshots are available on request.

Parquet · S3 · Snowflake · Daily incremental
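The incremental logic reduces to a diff between consecutive daily snapshots keyed on source and source identifier. This sketch shows the diff only; in production the delta would then be written as Parquet and uploaded to the client bucket:

```python
# Sketch of the daily incremental: ship only records that are new or
# changed since the previous export. The actual export would serialise
# the delta as Parquet and push it to S3 — omitted here.
def incremental(previous, current):
    """Return records in `current` that are new or modified vs `previous`."""
    prev_by_key = {(r["source"], r["source_id"]): r for r in previous}
    return [
        rec for rec in current
        if prev_by_key.get((rec["source"], rec["source_id"])) != rec
    ]
```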
Refresh Cadence
Daily
New records ingested within hours of government publication. T+0 for all major source markets.
Historical Depth
2010 – present
Core markets back to 2010. Some sources extend further. Full backfill available on request.
Delivery Format
Parquet · S3
Structured columnar format. Snowflake delivery and custom integration available.
Languages Parsed
8 languages
English, Mandarin, Arabic, Spanish, Portuguese, French, German, Polish — with machine translation for secondary fields.
PII Policy
Zero retention
Personal data is stripped at ingestion. Delivered dataset contains only B2B entity and agency data.
Compliance
GDPR · Irish law
Operated under Irish law. All source data is publicly published by government. Full provenance available.
Common Schema

Core Fields

Every award and tender record — regardless of source — maps to the same normalised schema. Key fields available across the dataset:

source
Origin portal identifier — the official government source of record
source_country
ISO 3166-1 alpha-2 country code
published_date
Date the record appeared on the source portal (ISO 8601)
title
Contract or tender title as published, with English translation where applicable
value_usd
Normalised USD value at publication-date FX rate
value_original
Original reported value in source currency
currency
ISO 4217 currency code of original value
supplier_name
Awarded supplier entity name as reported
supplier_lei
Legal Entity Identifier where resolved
supplier_ticker
Listed equity ticker where entity-matched (117K securities)
agency_name
Contracting authority name, standardised within jurisdiction
category
CPV / NAICS / source-native category code, normalised
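Put together, a single record in the common schema looks like the following. Every value here is invented for illustration:

```python
# One illustrative award record in the common schema (all values made up).
record = {
    "source": "br_pncp",
    "source_country": "BR",                  # ISO 3166-1 alpha-2
    "published_date": "2024-03-05",          # ISO 8601
    "title": "Supply of hospital equipment",
    "value_usd": 202100.00,                  # publication-date FX rate
    "value_original": 1000000.00,
    "currency": "BRL",                       # ISO 4217
    "supplier_name": "Example Medical Ltda",
    "supplier_lei": None,                    # not resolved for this record
    "supplier_ticker": None,
    "agency_name": "Ministry of Health",
    "category": "33100000",                  # CPV-style code, illustrative
}
```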

What clean
looks like.

  • Deduplication across source and source identifier
  • Value outlier detection and flagging
  • Mandatory field validation before promotion
  • PII screening — zero personal data retained
  • Currency normalisation with FX audit trail
  • Daily automated quality audit
  • Full provenance — every record traces to source URL

Data quality in procurement is harder than it looks. Governments republish, amend, cancel, and re-award contracts. Values change. Suppliers merge. Agencies restructure.

Our pipeline tracks amendments and modifications — where a source publishes contract updates, those flow through to the dataset with the original record preserved and the amendment flagged. Clients see both the original obligation and any subsequent modifications.

Ceiling values on indefinite delivery vehicles — a common source of apparent anomalies in US federal data — are identified and flagged separately from actual obligated amounts.
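Amendment handling can be sketched as append-only history: a modification never overwrites the original record, it arrives as a new row flagged and linked back to its parent. The flag and link field names below are illustrative:

```python
# Sketch of amendment tracking: the original obligation is preserved and
# each modification is appended, flagged, and linked to its parent record.
# "is_amendment" and "amends" are hypothetical field names.
def apply_amendment(history, amendment):
    """Append an amendment to a record's history without overwriting it."""
    original = history[0]
    history.append({
        **original,
        **amendment,
        "is_amendment": True,
        "amends": (original["source"], original["source_id"]),
    })
    return history
```

Clients querying the dataset thus see both the original obligation and every subsequent modification as distinct rows.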

Questions about
coverage or quality?

Contact john@valan.io