MyVoterWisdom

About this project & how it was built

Completeness + Consistency + Validity = Wisdom  —  Can the Wisdom Formula tell us which AI models to trust?

📖 The Origin Story

In October 2024, SysWisdom.AI set out to answer one question: does the Wisdom Formula actually work on a real-world dataset? We needed a problem with publicly available, verifiable historical data — and U.S. presidential elections by county fit perfectly. Every vote is on the public record. Every county reports the same structured data. The ground truth is known.

We began curating county-level election data from 2004 through 2024, applying the three Wisdom dimensions to measure whether the data itself was sound enough to build predictions on. The project became a live test of our own framework.

📰
We did it — MyVoterWisdom Alpha is Complete
SysWisdom.AI blog · Aaron · Feb 28, 2025

🔍 Data Sample: Why These 39 Counties?

Anyone doing serious electoral research will ask: why these 39 counties, what were the selection criteria, and is the sample geographically or demographically representative? Those are the right questions. Here are honest answers.

How the 39 Counties Were Selected

The counties were chosen pragmatically, not statistically. The project started in October 2024 as a Wisdom Formula validation experiment, not an academic sample of the U.S. electorate. The selection criteria were:

There was no random sampling, no stratification by population or demographics, and no attempt to construct a nationally representative panel. The dataset was built to test a framework, and it shows.

What the Sample Covers — and What It Misses

| Dimension | In Sample | Gaps |
|---|---|---|
| State coverage | 25 of 51 states/D.C. | 26 states entirely absent, including IA, MN, MO, VA, MA, SC, LA, AR, and most of the Plains & Mountain West |
| Urban/Rural balance | Majority large urban counties | Heavily skewed toward cities (Manhattan, Chicago, Houston, Seattle, Detroit, Atlanta, Philadelphia, Miami). Rural and exurban counties under-represented. |
| State concentration | HI: 4 counties (100%); KS: 4 counties; DE: 3 counties (100%); GA: 3 counties | Some small states are fully covered; most large states have 1–2 counties. Texas (254 counties) is represented by Harris County alone. |
| Party lean | Mix of reliably Dem, reliably Rep, and swing | Over-represents deep-blue urban counties. The Wisdom flag distribution (82% True) reflects this selection bias. |
| Data richness | Full: registered voters + mail-in + in-person splits | This is where the 39 counties exceed broader datasets: most county-level sources lack registration and mail-in detail. |

How the Data Expanded: The Three-Tier Architecture

In Phase 6 (May 2026) the project expanded beyond the original 39 counties by integrating the MIT Election Data and Science Lab (MEDSL) county presidential returns dataset (CC BY 4.0). The ML model now trains on a global dataset of 19,155 county-year records across 1,956 counties and 51 states.

| Tier | Source | Coverage | Role |
|---|---|---|---|
| Tier 1 | Manual curated (39 counties) | 39 counties · 25 states · 2004–2024 | Gold standard: includes voter registration + mail-in detail. Hand-verified against official county sources. |
| Tier 2 | MEDSL / Harvard Dataverse | 1,956 counties · 51 states · 2000–2024 | Authoritative academic source. CC BY 4.0. Vote totals only; no registration or mail-in splits. |
| Tier 3 | tonmcg / GitHub (scraped) | ~3,100 counties · 2008–2024 | Secondary validation only. Not authoritative: scraped from news outlets. MIT license. |

What this means for the prediction tool: when you select a county and run a prediction, the global Random Forest model draws on cross-county patterns from all 1,956 counties — not just the 39 original hand-curated ones. The original 39 remain the highest-quality rows in the dataset because they carry registration and mail-in data that MEDSL does not include.
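As a rough illustration of how the tiers could be combined into one training set, the sketch below keeps the Tier 1 row whenever a county-year appears in both Tier 1 and Tier 2, and leaves Tier 3 out because it is used for validation only. The merge rule and the field names (`fips`, `year`, `tier`) are assumptions made for this example, not the project's actual pipeline.

```python
# Lower number = more trusted. The ordering follows the tier table above;
# the "prefer the more trusted tier" rule itself is an assumption.
TIER_PRIORITY = {"tier1": 1, "tier2": 2}


def merge_tiers(rows):
    """Keep one row per (county, year), preferring the most trusted tier.

    Each row is a dict with hypothetical keys "fips", "year" and "tier".
    Rows from tiers not listed in TIER_PRIORITY (e.g. Tier 3) are skipped.
    """
    best = {}
    for row in rows:
        if row["tier"] not in TIER_PRIORITY:
            continue  # validation-only sources never enter the training set
        key = (row["fips"], row["year"])
        current = best.get(key)
        if current is None or TIER_PRIORITY[row["tier"]] < TIER_PRIORITY[current["tier"]]:
            best[key] = row
    return list(best.values())


# Placeholder rows: only the identifiers matter for this illustration.
rows = [
    {"fips": "48201", "year": 2020, "tier": "tier2"},
    {"fips": "48201", "year": 2020, "tier": "tier1"},
    {"fips": "48201", "year": 2000, "tier": "tier2"},
    {"fips": "48201", "year": 2020, "tier": "tier3"},
]
merged = merge_tiers(rows)
```

With these placeholder rows, the 2020 record comes from Tier 1 and the 2000 record, which predates the hand-curated range, falls back to Tier 2.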

Known Remaining Limitations

We publish these limitations because transparency is part of the mandate. See DISCLAIMER.md for the full model limitations statement.

📊 Why 73.8% Is the Honest Score

When you click Check Data Quality on the main page, our backend sends prediction_pres_data.csv to the SysWisdom Data Quality API and scores it against the three Wisdom dimensions. Here is what the score means — and why we deliberately do not inflate it.

Current score — prediction_pres_data.csv (38 rows × 11 columns)

Completeness: 100%
Consistency: 100%
Validity: 25%
Overall: 73.8%

Validity is 25% because the API detects statistical outliers in projected vote totals — Harris County TX projects 1.7 million ballots while Glacier County MT projects 5,370. That 300× spread triggers outlier flags on 5 columns. But this is not bad data: it is real geographic diversity across the United States. A rural Montana county and a major Texas urban county are both valid data points. The score is honest, not broken.
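To see why a wide but legitimate range gets flagged, here is a minimal sketch using Tukey's 1.5 × IQR rule. The method the SysWisdom API actually uses is not described here, so the IQR rule is an assumption chosen only because it is a common outlier test. The two endpoint values come from the paragraph above; the values in between are made-up placeholders.

```python
import statistics


def iqr_outliers(values):
    """Return the values that fall outside the 1.5 * IQR fences."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]


# 5,370 (Glacier County MT) and 1,700,000 (Harris County TX) are from the text.
# The middle values are illustrative placeholders, not real projections.
projected_ballots = [5_370, 12_000, 45_000, 80_000, 150_000, 300_000, 1_700_000]

spread = max(projected_ballots) / min(projected_ballots)  # about 317x
flagged = iqr_outliers(projected_ballots)                 # only the largest value

print(f"spread: {spread:.0f}x, flagged: {flagged}")
```

The largest county is flagged even though its value is real, which is the behaviour the Validity score reflects.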

The 70% threshold in our GitHub Actions gate means a dataset needs to be at least Mostly Wise before it can be merged. A contributor who accidentally deletes data or introduces inconsistent values would drop the score and be blocked automatically. This dataset, with its expected geographic spread, passes at 73.8%.
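The gate logic itself is simple to sketch. The example below takes the overall score as an input rather than recomputing it, because the weighting the API applies is not documented here (a plain average of 100, 100, and 25 would be 75.0, not 73.8). The function names are hypothetical; only the 70% threshold comes from the text.

```python
THRESHOLD = 70.0  # from the text: PRs scoring below 70% are blocked


def passes_gate(overall_score: float, threshold: float = THRESHOLD) -> bool:
    """Return True when the dataset's overall score meets the threshold."""
    return overall_score >= threshold


def gate_exit_code(overall_score: float) -> int:
    """Map the gate result to a process exit code (0 = pass, 1 = fail).

    A CI step that exits non-zero marks the check as failed.
    """
    return 0 if passes_gate(overall_score) else 1


current = gate_exit_code(73.8)   # the score reported above
degraded = gate_exit_code(62.0)  # hypothetical score after a bad edit
```

A CI job could end with `sys.exit(gate_exit_code(score))` so that a failing score blocks the merge.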

🤝 How to Contribute County Data

This is a community-curated dataset. We welcome contributions from researchers, civic technologists, and data volunteers. Every data PR is automatically checked by the Wisdom quality gate before it can be merged.

  1. Fork sysWisdom/myvoterwisdom on GitHub and clone your fork locally.
  2. Add or update rows in data/voting_pres_data.csv or data/prediction_pres_data.csv. Use public election records only — no PII, no individual voter data.
  3. Open a Pull Request. GitHub Actions will automatically run the Wisdom Data Quality Gate — your changes need a score ≥ 70% to be approved.
  4. A maintainer reviews the PR for accuracy and context. If the data checks out, it is merged and the live predictions update on the next deploy.
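Before opening a PR, contributors may want to run a quick local sanity check. The sketch below is not the Wisdom Data Quality Gate; it only catches two mistakes that would plausibly lower a completeness score: rows with the wrong number of fields and rows with blank values. The column names in the sample are hypothetical.

```python
import csv
import io


def check_rows(csv_text: str) -> list[str]:
    """Return a list of human-readable problems found in the CSV text."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    if header is None:
        return ["file is empty"]
    problems = []
    for row in reader:
        line_no = reader.line_num  # source line of the row just read
        if len(row) != len(header):
            problems.append(
                f"line {line_no}: expected {len(header)} fields, found {len(row)}"
            )
        elif any(cell.strip() == "" for cell in row):
            problems.append(f"line {line_no}: blank value")
    return problems


# Hypothetical column names, used only for illustration.
sample = "county,state,year\nHarris,TX\n,MT,2024\nGlacier,MT,2024\n"
problems = check_rows(sample)
```

To check a real file, read it with `open(path, newline="", encoding="utf-8").read()` and pass the text to `check_rows`.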

Questions? Open a GitHub Discussion or email info@syswisdom.ai.

⚖️ Responsible AI Statement

MyVoterWisdom was built as a civic education tool and a framework validation experiment. We are committed to the following principles:

🏛️

Public data only

All data is county-level aggregate election records from public sources. No individual voter data. No PII. No micro-targeting.

🚫

Not for campaigns

This platform must not be used for political advertising, voter suppression, campaign targeting, or any active electoral campaign activity.

🔓

Open-source forever

BSD 3-Clause license. All code, data, and model logic are public. No black boxes. Anyone can audit, fork, or improve it.

📚

Educational purpose

Designed to help people understand how AI models work on real civic data — not to influence votes or declare winners before election day.

🧪

Honest uncertainty

When models disagree, we say so. A "Models disagree" flag is not a failure — it is Consistency working correctly.

🤝

Community reviewed

Data quality is enforced by an automated gate and by human review. No single person controls what gets merged.
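The "Models disagree" flag mentioned under Honest uncertainty can be pictured as a simple consensus check. This is a minimal sketch under assumptions: the model keys and the "Models agree" label are hypothetical, and the real tool may compare probabilities rather than labels.

```python
def consensus_flag(predictions: dict[str, str]) -> str:
    """Return "Models agree" if every model predicts the same label,
    otherwise "Models disagree".

    `predictions` maps a model name to its predicted label.
    """
    if not predictions:
        raise ValueError("at least one model prediction is required")
    distinct = set(predictions.values())
    return "Models agree" if len(distinct) == 1 else "Models disagree"


# Hypothetical outputs from four models for one county.
unanimous = consensus_flag({"rf": "A", "lr": "A", "svm": "A", "gb": "A"})
split = consensus_flag({"rf": "A", "lr": "B", "svm": "A", "gb": "A"})
```

Surfacing the split case instead of hiding it behind a majority vote is what keeps the uncertainty visible.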

🔧 What We Built

The full stack is free-tier and open-source. No cloud billing required to run or contribute.

| Layer | Technology | Notes |
|---|---|---|
| Frontend | GitHub Pages | Auto-deploys static/ on push to main |
| Backend | Flask 3 on Render.com | Python 3.11, free tier, /predict + /data-quality |
| ML Models | scikit-learn + imbalanced-learn | RF, LR, SVM, Gradient Boosting with SMOTE |
| Notebooks | Google Colab | 3 notebooks, free, no install required |
| Data Quality | SysWisdom DQ API | Proxied server-side, key never in browser |
| CI Gate | GitHub Actions | Wisdom quality check blocks PRs below 70% |
| Domain | myvoter.syswisdom.ai | Wix DNS CNAME → syswisdom.github.io |
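The Data Quality row notes that the API key never reaches the browser. One way to picture that is a backend helper that reads the key from an environment variable and attaches it to the outgoing request. The sketch below only builds the request and does not send it. The endpoint URL, the environment variable name, and the header scheme are all placeholders, since the real API contract is not documented here.

```python
import os
import urllib.request

# Placeholders: the real endpoint, variable name and auth header are assumptions.
DQ_API_URL = "https://dq.example.invalid/score"
API_KEY_ENV = "SYSWISDOM_DQ_API_KEY"


def build_dq_request(csv_bytes: bytes) -> urllib.request.Request:
    """Build a POST request carrying the CSV and the server-side API key.

    The key is read from the server's environment, so it is never
    embedded in the static frontend.
    """
    api_key = os.environ.get(API_KEY_ENV, "")
    return urllib.request.Request(
        DQ_API_URL,
        data=csv_bytes,
        method="POST",
        headers={
            "Content-Type": "text/csv",
            "Authorization": f"Bearer {api_key}",
        },
    )


request = build_dq_request(b"county,state,year\n")
```

The browser would call the backend's own /data-quality route, and only the backend would send a request like this one to the upstream API.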