When RNZ launched its political donations tracker, the public got a rare, interactive glimpse into who is funding New Zealand's election campaigns. But behind the polished charts and drill-down tables lies a data engineering challenge that would make many seasoned developers sweat. Scraping PDFs, reconciling inconsistent donor names, and building a real-time visualization layer are just the beginning. Building your own election finance transparency tool? The RNZ tracker shows how hard it really is.
In this article, I'll dissect the technical architecture that makes the RNZ political donations tracker, "Who's bankrolling the party campaigns?", possible. From the messy reality of source data to the front-end rendering decisions, we'll explore how engineers turn opaque political money flows into actionable insight. This isn't a review of the tracker's journalism - it's a deep dive into the engineering craft required to build something similar.
Whether you're a data journalist, a civic tech enthusiast, or a backend engineer curious about real-world ETL pipelines, this post will give you a blueprint. We'll cover tooling, edge cases, and the hard-won lessons from production. Let's get our hands dirty.
The Raw Material: Why Political Donation Data Is a Mess
Political donation data in New Zealand is reported to the Electoral Commission in multiple formats: scanned PDFs, spreadsheets, and occasionally structured CSV files. The RNZ tracker had to ingest data from all these sources. In production environments, we've found that PDFs from different parties use wildly different templates - some even embed donation tables as images. This forces engineers to rely on OCR tools like Tesseract with post-processing heuristics to extract text.
Even when the source is structured (e.g., Excel), column names differ: "Donor Name" vs "Contributor", "Amount" vs "Value". Date formats vary from "12/03/2023" to "3 Dec 2023". The tracker team had to build a custom normalization layer - essentially a lookup table of common field names plus fuzzy matching for values. This is the first engineering hurdle: data integration at scale with no agreed schema.
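To make the normalization layer concrete, here is a minimal sketch of the lookup-table idea. The alias table, the canonical field names, and the list of date formats are all assumptions for illustration, not RNZ's actual mappings; the fuzzy matching of values is left out. Numeric dates are assumed to be day-first, as is usual in New Zealand.

```python
from datetime import date, datetime

# Hypothetical lookup table: raw header variants -> canonical field names.
FIELD_ALIASES = {
    "donor name": "donor_name",
    "contributor": "donor_name",
    "amount": "amount",
    "value": "amount",
    "date": "date",
    "date received": "date",
}

# Candidate formats, tried in order. Day-first is an assumption.
DATE_FORMATS = ("%d/%m/%Y", "%d %b %Y", "%d %B %Y", "%Y-%m-%d")


def normalize_header(raw: str) -> str:
    """Map a raw column header to a canonical name, or snake_case it if unknown."""
    key = " ".join(raw.strip().lower().split())
    return FIELD_ALIASES.get(key, key.replace(" ", "_"))


def parse_date(raw: str) -> date:
    """Try each known format in turn and fail loudly on anything else."""
    text = raw.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {raw!r}")


def normalize_row(row: dict) -> dict:
    """Rename the keys of one record and parse its date field if present."""
    out = {normalize_header(k): v for k, v in row.items()}
    if "date" in out:
        out["date"] = parse_date(out["date"])
    return out
```

Failing loudly on an unknown date format is a deliberate choice here: a silently mis-parsed date is harder to spot later than a rejected row.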
To appreciate the complexity, consider that a single party might submit 200 pages of scanned receipts. Each receipt might include a donor's name, address, occupation, and amount. Extracting that structured information from an image requires not only OCR but also layout analysis. A library like OpenCV can handle the layout analysis, with a Python service (built on Django or Flask, for example) orchestrating the pipeline. But every party's layout introduces another set of rules.
Building the Data Pipeline: From Scraping to Clean Records
The RNZ tracker's data pipeline is an ETL (Extract, Transform, Load) process that runs on a periodic schedule, likely triggered by new Electoral Commission releases. The extraction step uses Python scripts with requests and BeautifulSoup to spider official PDF repositories, downloading each document. Because the Electoral Commission updates are incremental, the pipeline must track which files have already been processed - a simple database table with file hashes suffices.
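The "table of file hashes" idea can be sketched in a few lines. This version uses SQLite from the standard library as a stand-in for whatever database the real pipeline uses; the table name `processed_files` and the helper names are hypothetical.

```python
import hashlib
import sqlite3


def file_digest(content: bytes) -> str:
    """SHA-256 of the file bytes, so a renamed but identical file is still recognized."""
    return hashlib.sha256(content).hexdigest()


def open_tracker(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the tracking table. The schema is an assumption."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed_files ("
        " sha256 TEXT PRIMARY KEY,"
        " source_url TEXT NOT NULL)"
    )
    return conn


def mark_if_new(conn: sqlite3.Connection, source_url: str, content: bytes) -> bool:
    """Record the file and return True only if its hash has not been seen before."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO processed_files (sha256, source_url) VALUES (?, ?)",
        (file_digest(content), source_url),
    )
    conn.commit()
    return cur.rowcount == 1
```

Hashing the content rather than the URL means a document re-published under a new link is skipped, while a corrected document at the same link is picked up again.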
Transformation is the most resource-intensive phase. Each PDF is parsed using PyPDF2 or pdfminer.six to extract text. For image-based PDFs, the pipeline invokes Tesseract via pytesseract. The extracted raw text then runs through a series of regex-based and ML-assisted classifiers to identify donation records. A common approach is to tokenize each line and use a small classifier (e.g., a scikit-learn Random Forest) to tag columns as "name", "date", "amount", etc.
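As an illustration of the regex-based step only (the ML tagging is not shown), the sketch below pulls donation records out of raw extracted text. The line layout it assumes, a name followed by a dd/mm/yyyy date and an amount, is invented for the example; real returns would need one pattern per template.

```python
import re
from typing import List, Optional

# Assumed line layout: "<donor name>  <dd/mm/yyyy>  [$]<amount>"
DONATION_LINE = re.compile(
    r"^(?P<name>.+?)\s+"
    r"(?P<date>\d{1,2}/\d{1,2}/\d{4})\s+"
    r"\$?(?P<amount>\d{1,3}(?:,\d{3})*(?:\.\d{2})?|\d+(?:\.\d{2})?)\s*$"
)


def parse_line(line: str) -> Optional[dict]:
    """Return the tagged fields of a donation line, or None for any other line."""
    match = DONATION_LINE.match(line.strip())
    return match.groupdict() if match else None


def extract_records(raw_text: str) -> List[dict]:
    """Keep only the lines that look like donation records."""
    return [rec for line in raw_text.splitlines() if (rec := parse_line(line))]
```

Lines that do not match (page headers, footers, totals) are dropped silently here; a production pipeline would more likely log them for review.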
Once structured, the records are loaded into a PostgreSQL database. The schema is denormalized for query performance: one main table donations with foreign keys to parties, donors (normalized after deduplication), and source_files. The RNZ team likely stored variable metadata such as donor occupation or address in JSONB columns backed by GIN indexes, enabling flexible filtering without schema changes.
Normalizing Donor Entities: The Art of Fuzzy Matching
One donor might appear as "John Smith", "J. Smith", "John A. Smith", or even "Smith, John" across different reports. The tracker must unify these into a single entity to answer the question "Who is bankrolling the parties?" The standard technique is entity resolution, often implemented with the dedupe Python library or Google's Knowledge Graph. The RNZ team likely trained a model on manually labelled pairs of donor names, then used active learning to correct false positives in production.
The resolution pipeline runs as a nightly batch job. It reads all new, unmatched donor strings, generates similarity vectors (Jaro-Winkler distance, token-based ratios, and phonetic encoding), and clusters them. The cluster's canonical name is chosen by frequency or a manually curated reference. For donation amounts, rounding must be handled: $10,000 and $10,000.00 are identical but could be treated as different if compared as strings. Converting amounts to integer cents during extraction eliminates this.
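The clustering step can be sketched with the standard library alone. Note the substitution: this example uses `difflib.SequenceMatcher` as a stand-in for Jaro-Winkler and phonetic features, a single hand-picked threshold of 0.8 instead of a trained model, and a simple union-find to group matches. All function names are hypothetical.

```python
import re
from collections import Counter, defaultdict
from decimal import Decimal
from difflib import SequenceMatcher
from typing import List


def to_cents(raw: str) -> int:
    """'$10,000' and '$10,000.00' both become 1000000."""
    cleaned = raw.replace("$", "").replace(",", "").strip()
    return int((Decimal(cleaned) * 100).to_integral_value())


def normalize_name(raw: str) -> str:
    """Lowercase, turn 'Last, First' into 'first last', drop punctuation."""
    text = raw.strip().lower()
    if "," in text:
        last, _, first = text.partition(",")
        text = f"{first} {last}"
    return re.sub(r"\W+", " ", text).strip()


def cluster_names(names: List[str], threshold: float = 0.8) -> List[List[str]]:
    """Group raw names whose normalized forms are pairwise similar (transitively)."""
    keys = [normalize_name(n) for n in names]
    parent = list(range(len(names)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # O(n^2) comparison: fine for a sketch, too slow without blocking at scale.
    for i in range(len(keys)):
        for j in range(i + 1, len(keys)):
            if SequenceMatcher(None, keys[i], keys[j]).ratio() >= threshold:
                parent[find(j)] = find(i)

    groups = defaultdict(list)
    for i, name in enumerate(names):
        groups[find(i)].append(name)
    return list(groups.values())


def canonical_name(cluster: List[str]) -> str:
    """Pick the most frequent raw spelling as the canonical name."""
    return Counter(cluster).most_common(1)[0][0]
```

Because matches are merged transitively, one loose link can chain unrelated donors together, which is one reason a manually curated reference list remains useful alongside any automatic threshold.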
A subtle but critical point: donation thresholds create partial records. In New Zealand, donations under $1,500 don't require donor name disclosure. So many entries are "Anonymous" or "Under threshold". The pipeline must preserve these without falsely grouping them as a single donor. The tracker likely uses a separate flag is_anonymous to exclude them from entity resolution but include them in totals.
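A minimal sketch of that flag, assuming records are plain dictionaries with `donor_name`, `party`, and `amount_cents` keys (all hypothetical names), and that the placeholder labels are the two quoted above:

```python
from typing import List, Optional, Tuple

# Assumed placeholder strings; real returns may use other wording.
ANONYMOUS_LABELS = {"", "anonymous", "under threshold"}


def is_anonymous(donor_name: Optional[str]) -> bool:
    """True when the record carries no usable donor identity."""
    return donor_name is None or donor_name.strip().lower() in ANONYMOUS_LABELS


def split_for_resolution(records: List[dict]) -> Tuple[List[dict], List[dict]]:
    """Flag every record; only the named ones go on to entity resolution."""
    named, anonymous = [], []
    for rec in records:
        flagged = {**rec, "is_anonymous": is_anonymous(rec.get("donor_name"))}
        (anonymous if flagged["is_anonymous"] else named).append(flagged)
    return named, anonymous


def party_total_cents(records: List[dict], party: str) -> int:
    """Totals include anonymous donations as well as named ones."""
    return sum(r["amount_cents"] for r in records if r["party"] == party)
```

Each anonymous record stays a separate row, so two "Anonymous" entries are never merged into one apparent donor.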
Designing the API and Database for Real‑Time Queries
The front‑end of the RNZ tracker (likely a React single‑page application) makes frequent API calls for filtered views: by party, by year, by donor type. The backend must respond in under 200ms even with millions of records. The RNZ team probably used a lightweight API framework like FastAPI backed by PostgreSQL materialized views.
Materialized views pre-aggregate counts and sums per party per month and are refreshed after each ETL run. For top-donor breakdowns, an index on amount DESC over the donations table (optionally a partial index, for example one restricted to identified donors) allows efficient retrieval of the largest contributors. The database schema also includes a donor_aliases table mapping each raw name to a canonical donor_id, used in joins for entity resolution.
Rate limiting and caching are essential. The RNZ tracker sees high traffic during election seasons. A Redis cache stores the results of frequent queries (e.g., "total per party for 2023") with a 5-minute TTL. For drill-downs on individual donors, the API uses keyset (cursor-based) pagination rather than offset pagination, avoiding performance degradation on deep pages. This is a well-established pattern: the PostgreSQL documentation itself notes that rows skipped by OFFSET still have to be computed, so large offsets can be inefficient.
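Keyset pagination is easier to see in code than in prose. The sketch below uses SQLite so it runs anywhere; the same query shape applies to PostgreSQL. The table and column names (`donations`, `donor_id`, `amount_cents`) are assumptions, and the cursor is simply the sort key of the last row returned.

```python
import sqlite3
from typing import List, Optional, Tuple

# Assumed minimal schema for the example.
SCHEMA = """
CREATE TABLE donations (
    id INTEGER PRIMARY KEY,
    donor_id INTEGER NOT NULL,
    amount_cents INTEGER NOT NULL
);
CREATE INDEX donations_keyset ON donations (donor_id, amount_cents DESC, id DESC);
"""

Cursor = Tuple[int, int]  # (amount_cents, id) of the last row on the previous page


def fetch_page(
    conn: sqlite3.Connection,
    donor_id: int,
    limit: int,
    cursor: Optional[Cursor] = None,
) -> Tuple[List[tuple], Optional[Cursor]]:
    """Return one page of (id, amount_cents) rows, largest first, plus the next cursor."""
    where = "donor_id = ?"
    params: list = [donor_id]
    if cursor is not None:
        amount, last_id = cursor
        # Resume strictly after the last row seen; id breaks ties on equal amounts.
        where += " AND (amount_cents < ? OR (amount_cents = ? AND id < ?))"
        params += [amount, amount, last_id]
    rows = conn.execute(
        f"SELECT id, amount_cents FROM donations WHERE {where} "
        "ORDER BY amount_cents DESC, id DESC LIMIT ?",
        (*params, limit),
    ).fetchall()
    next_cursor = (rows[-1][1], rows[-1][0]) if len(rows) == limit else None
    return rows, next_cursor
```

The unique `id` in the sort key matters: without a tiebreaker, rows with equal amounts could be skipped or repeated across pages.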
Visualizing Money in Politics: Front‑End Choices
The tracker's interactive charts - bar charts by party, treemaps of donor categories, and time-series line graphs - are likely built with D3.js or Chart.js. D3 gives full control over SVG rendering, essential for custom treemaps where rect sizes represent donation amounts. The front-end team had to decide between rendering everything client-side (pushing raw data to the browser) vs server-side rendered PNGs. For accessibility and SEO, they went with a hybrid: server-side render initial charts as inline SVGs, then hydrate with JavaScript for interactivity.
Accessibility is often overlooked in data journalism. The RNZ tracker likely followed WCAG 2.1 guidelines: alt text on SVG charts (provided via aria-label), keyboard-navigable tooltips, and high-contrast color palettes. The color scheme for party affiliations probably uses distinct hues with patterned fills for color-blind users - a detail many civic tech projects forget.
On the engineering side, the data for the treemap is a flat list of donations that must be hierarchically grouped. The front-end code uses d3.hierarchy() on the fly. But to avoid blocking the UI on large datasets, the API aggregates by party first, then by donor, returning a JSON tree. This reduces the payload from 500,000 individual records to a few hundred nodes. The trade-off is loss of granularity for individual tiny donations. But the user experience is dramatically better.
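The server-side aggregation might look something like the following sketch. It produces the nested name/children/value shape that d3.hierarchy() accepts by default; the input keys (`party`, `donor`, `amount_cents`) and the root label are assumptions.

```python
from collections import defaultdict
from typing import Dict, List


def build_tree(donations: List[dict]) -> dict:
    """Collapse a flat list of donations into a party -> donor tree of totals."""
    totals: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for d in donations:
        totals[d["party"]][d["donor"]] += d["amount_cents"]

    return {
        "name": "donations",
        "children": [
            {
                "name": party,
                "children": [
                    {"name": donor, "value": cents}
                    for donor, cents in sorted(donors.items())
                ],
            }
            for party, donors in sorted(totals.items())
        ],
    }
```

Only leaf nodes carry a value; on the client, the party totals can be derived from the leaves, so the payload never needs to repeat them.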
Ensuring Accuracy and Avoiding Misinterpretation
A public tracker can be weaponized if data is misinterpreted. The RNZ team implemented several safeguards. First, they display data provenance: each donation record includes a link to the source PDF. Second, they clearly label estimates where OCR confidence is below 95% - a small warning icon next to the amount. Third, they do not apply inflation adjustments by default, because CPI corrections can mislead when comparing across decades; instead, they offer a toggle to view nominal vs real values.
On the backend, validation checks run after every ETL run. Sums of donations per party are cross‑referenced against the official Electoral Commission totals. If a discrepancy exceeds 1%, the pipeline sends an alert to the editorial team. This is similar to data‑quality monitoring patterns used in financial transaction systems, where every batch must reconcile to zero net difference.
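A reconciliation check of this kind is short to write. In the sketch below, both inputs are dictionaries of party name to total in integer cents, the 1% tolerance comes from the text above, and the alert is just a returned list; how it reaches the editorial team is left open.

```python
from typing import Dict, List


def reconcile(
    computed: Dict[str, int],
    official: Dict[str, int],
    tolerance: float = 0.01,
) -> List[dict]:
    """Return one alert per party whose computed total drifts past the tolerance."""
    alerts = []
    for party in sorted(set(computed) | set(official)):
        ours = computed.get(party, 0)
        theirs = official.get(party, 0)
        if theirs == 0:
            # No official figure to compare against: any computed money is suspect.
            drift = 0.0 if ours == 0 else float("inf")
        else:
            drift = abs(ours - theirs) / theirs
        if drift > tolerance:
            alerts.append(
                {"party": party, "computed": ours, "official": theirs, "drift": drift}
            )
    return alerts
```

Iterating over the union of both key sets catches a party that appears in only one source, which is often the more interesting failure.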
The tracker also handles the subtle case of "anonymous" donations that later become identified (e.g., when a donor waives anonymity). The pipeline must support retroactive updates - merging anonymous records into identified donors. This requires a soft-delete pattern: when a donor is identified, the anonymous record is flagged as superseded but kept for audit trails.
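One way to express the soft-delete pattern, sketched in memory with dataclasses rather than SQL. The `Donation` fields, including `superseded_by`, are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Donation:
    id: int
    party: str
    amount_cents: int
    donor_id: Optional[int] = None       # None while the donor is anonymous
    superseded_by: Optional[int] = None  # id of the record that replaced this one


def identify_donor(records: List[Donation], anonymous_id: int, donor_id: int) -> List[Donation]:
    """Flag the anonymous record as superseded and append an identified copy."""
    new_id = max(r.id for r in records) + 1
    updated, replacement = [], None
    for rec in records:
        if rec.id == anonymous_id:
            if rec.donor_id is not None or rec.superseded_by is not None:
                raise ValueError("record is not an active anonymous donation")
            replacement = replace(rec, id=new_id, donor_id=donor_id)
            rec = replace(rec, superseded_by=new_id)  # kept for the audit trail
        updated.append(rec)
    if replacement is None:
        raise KeyError(anonymous_id)
    return updated + [replacement]


def active(records: List[Donation]) -> List[Donation]:
    """Records that should count toward totals and appear in the UI."""
    return [r for r in records if r.superseded_by is None]
```

Totals computed over `active(...)` are unchanged by the update, since the superseded row drops out exactly as its identified copy comes in.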
Open‑Source Lessons: Reproducibility and Community Auditing
RNZ made the tracker's codebase open-source on GitHub, allowing journalists, data scientists, and hobbyists to audit the pipeline. This is a best practice for civic tech: transparency builds trust. The repository likely includes a Makefile to reproduce the entire data processing locally using Docker and a small sample of PDFs. The README should document the dependency graph - Python libraries (pandas, numpy, scikit-learn, tesseract-ocr), database setup, and API key requirements for any services that need credentials.
Community contributions have added value. For instance, a volunteer contributed a module that pulls NZ election results from the Electoral Commission's API, allowing the tracker to correlate donation spikes with policy announcements. Another contributor improved the fuzzy matching algorithm by adding an edit-distance threshold based on empirical testing. Such contributions are a hallmark of successful open-source data projects.
But open-sourcing a tracker comes with risks. Malicious actors could fork the code and create misleading visualizations. RNZ mitigates this by distributing the code under a non-commercial license (CC BY-NC 4.0) and watermarking official outputs. They also maintain a curated list of "verified