About — MyVoterWisdom

📖 The Origin Story

In October 2024, SysWisdom.AI set out to answer one question: does the Wisdom Formula actually work on a real-world dataset? We needed a problem with publicly available, verifiable historical data — and U.S. presidential elections by county fit perfectly. Every vote is on the public record. Every county reports the same structured data. The ground truth is known.

We began curating county-level election data from 2004 through 2024, applying the three Wisdom dimensions to measure whether the data itself was sound enough to build predictions on. The project became a live test of our own framework.

🌱

October 2024

Project started — Wisdom model validation

Data curation begins. 27 counties across 2004–2024 general elections. Goal: validate the Wisdom Formula against a real, politically neutral dataset.
🏁

February 2025

Alpha complete

Electoral analytics platform alpha ships. 27 counties, voter trend analysis, AI forecasting models for 2028 using Random Forest, Laplace smoothing, and SMOTE. 541 social media views with zero SEO investment. 20 data-driven analysis posts published.
🚀

May 2026

Beta — open-source public release & MEDSL expansion

Original 39-county hand-curated set open-sourced on GitHub. MEDSL (Harvard Dataverse, CC BY 4.0) integration expanded training data to 1,956 counties across all 50 states plus D.C. A global ML model replaces per-county training, eliminating single-class fallback and overfitting on small samples. Electoral College aggregation (Phase 7) added: county → state → 270 EV map. Live at myvoter.syswisdom.ai.

📰

We did it — MyVoterWisdom Alpha is Complete

SysWisdom.AI blog · Aaron · Feb 28, 2025


🔍 Data Sample: Why These 39 Counties?

Anyone doing serious electoral research will ask: why these 39 counties, what were the selection criteria, and is the sample geographically or demographically representative? Those are the right questions. Here are honest answers.

How the 39 Counties Were Selected

The counties were chosen pragmatically, not statistically. The project started in October 2024 as a Wisdom Formula validation experiment, not an academic sample of the U.S. electorate. The selection criteria were:

Public availability of complete data — registered voters, ballots cast, mail-in vs. in-person splits across 2004–2024
Geographic spread across party-lean categories (reliably blue, reliably red, genuine swing)
Sufficient year-over-year variation to make Wisdom flag calculation meaningful
Hand-verification feasibility — every row was manually cross-checked against official county sources

There was no random sampling, no stratification by population or demographics, and no attempt to construct a nationally representative panel. The dataset was built to test a framework, and it shows.

What the Sample Covers — and What It Misses

Dimension	In Sample	Gaps
State coverage	25 of 51 states/D.C.	26 states entirely absent — including IA, MN, MO, VA, MA, SC, LA, AR, and most of the Plains & Mountain West
Urban/Rural balance	Majority large urban counties	Heavily skewed toward cities (Manhattan, Chicago, Houston, Seattle, Detroit, Atlanta, Philadelphia, Miami). Rural and exurban counties under-represented.
State concentration	HI: 4 counties (100%); KS: 4 counties; DE: 3 counties (100%); GA: 3 counties	Some small states are fully covered; most large states have 1–2 counties. Texas (254 counties) is represented by Harris County alone.
Party lean	Mix of reliably Dem, reliably Rep, and swing	Over-represents deep-blue urban counties. The Wisdom flag distribution (82% True) reflects this selection bias.
Data richness	Full: registered voters + mail-in + in-person splits	This is where the 39 counties exceed broader datasets — most county-level sources lack registration and mail-in detail.

How the Data Expanded: The Three-Tier Architecture

In Phase 6 (May 2026) the project expanded beyond the original 39 counties by integrating the MIT Election Data and Science Lab (MEDSL) county presidential returns dataset (CC BY 4.0). The ML model now trains on a global dataset of 19,155 county-year records across 1,956 counties in all 50 states plus D.C.

Tier	Source	Coverage	Role
Tier 1	Manual curated (39 counties)	39 counties · 25 states · 2004–2024	Gold standard — includes voter registration + mail-in detail. Hand-verified against official county sources.
Tier 2	MEDSL / Harvard Dataverse	1,956 counties · 50 states + D.C. · 2000–2024	Authoritative academic source. CC BY 4.0. Vote totals only, with no registration or mail-in splits.
Tier 3	tonmcg / GitHub (scraped)	~3,100 counties · 2008–2024	Secondary validation only. Not authoritative — scraped from news outlets. MIT license.

What this means for the prediction tool: when you select a county and run a prediction, the global Random Forest model draws on cross-county patterns from all 1,956 counties — not just the 39 original hand-curated ones. The original 39 remain the highest-quality rows in the dataset because they carry registration and mail-in data that MEDSL does not include.
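Why pooling matters can be seen with a small check. The sketch below is illustrative only: the county names, the rows, and the helper functions are hypothetical and are not taken from the project's code. It shows that a county which has always voted the same way offers only one class label, so a model trained on that county alone has nothing to separate, while the pooled rows contain both classes.

```python
from collections import defaultdict

# Hypothetical rows: (county, year, winning party). Illustrative values only.
rows = [
    ("County A", 2012, "DEM"), ("County A", 2016, "DEM"), ("County A", 2020, "DEM"),
    ("County B", 2012, "REP"), ("County B", 2016, "REP"), ("County B", 2020, "REP"),
    ("County C", 2012, "DEM"), ("County C", 2016, "REP"), ("County C", 2020, "DEM"),
]

def classes_per_county(rows):
    """Map each county to the set of winner labels seen in its history."""
    seen = defaultdict(set)
    for county, _year, winner in rows:
        seen[county].add(winner)
    return dict(seen)

def single_class_counties(rows):
    """Counties with one label only: a per-county classifier cannot learn
    a decision boundary from them and would need a fallback."""
    by_county = classes_per_county(rows)
    return sorted(c for c, labels in by_county.items() if len(labels) == 1)

per_county_problem = single_class_counties(rows)
pooled_labels = {winner for _county, _year, winner in rows}

print(per_county_problem)     # counties that would need a fallback
print(sorted(pooled_labels))  # labels available to one pooled model
```

In this toy data two of the three counties are single-class on their own, yet the pooled set still carries both labels, which is the situation a global model is meant to handle.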

Known Remaining Limitations

MEDSL covers 2000–2024 presidential elections only; not all ~3,100 U.S. counties are present in every year
No voter registration data for MEDSL rows — turnout feature is zero for those records
No demographic or Census features — the model predicts from voting history alone, not why counties vote as they do
Alaska uses borough boundaries; the original AK entry (House District 40) was relabeled as Matanuska-Susitna Borough — a known approximation
The Electoral College projection (Phase 7) treats Maine and Nebraska as winner-take-all for simplicity

We publish these limitations because transparency is part of the mandate. See DISCLAIMER.md for the full model limitations statement.
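The winner-take-all simplification noted in the list above can be sketched as a two-step roll-up: sum county projections per state, then award all of a state's electoral votes to the leader. This is a minimal sketch under stated assumptions, not the project's Phase 7 code. The county labels and vote counts are hypothetical, and only the two electoral-vote figures (Delaware 3, Kansas 6) are real.

```python
from collections import defaultdict

# Electoral votes for the two states used in this toy example.
ELECTORAL_VOTES = {"DE": 3, "KS": 6}

# Hypothetical projections: (state, county label, dem_votes, rep_votes).
county_projections = [
    ("DE", "county_1", 150_000, 90_000),
    ("DE", "county_2", 40_000, 45_000),
    ("DE", "county_3", 55_000, 75_000),
    ("KS", "county_1", 150_000, 160_000),
    ("KS", "county_2", 90_000, 120_000),
]

def state_totals(county_rows):
    """Step 1: county -> state. Sum projected votes per party per state."""
    totals = defaultdict(lambda: {"DEM": 0, "REP": 0})
    for state, _county, dem, rep in county_rows:
        totals[state]["DEM"] += dem
        totals[state]["REP"] += rep
    return dict(totals)

def electoral_map(county_rows, electoral_votes):
    """Step 2: state -> electoral votes, winner-take-all in every state.
    Exact ties go to the first key ("DEM"); a real tool would need a rule."""
    ev = {"DEM": 0, "REP": 0}
    for state, votes in state_totals(county_rows).items():
        winner = max(votes, key=votes.get)
        ev[winner] += electoral_votes[state]
    return ev

result = electoral_map(county_projections, ELECTORAL_VOTES)
print(result)
```

Handling Maine and Nebraska properly would mean awarding some votes by congressional district, which requires district-level projections that a county roll-up like this one does not produce.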

📊 Why 73.8% Is the Honest Score

When you click Check Data Quality on the main page, our backend sends prediction_pres_data.csv to the SysWisdom Data Quality API and scores it against the three Wisdom dimensions. Here is what the score means — and why we deliberately do not inflate it.

Current score — prediction_pres_data.csv (38 rows × 11 columns)

Completeness: 100%
Consistency: 100%
Validity: 25%
Overall: 73.8%

Validity is 25% because the API detects statistical outliers in projected vote totals: Harris County TX projects 1.7 million ballots while Glacier County MT projects 5,370. That spread of more than 300× triggers outlier flags on five columns. But this is not bad data: it is real geographic diversity across the United States. A rural Montana county and a major Texas urban county are both valid data points. The score is honest, not broken.
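To see how a purely statistical rule ends up flagging a legitimate value, here is a minimal sketch using the common 1.5 × IQR fence. The SysWisdom Data Quality API's actual outlier method is not described on this page, so the rule, the function name, and the six middle values are assumptions; only the two endpoints (5,370 and 1.7 million) come from the text above.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey-style fences)."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

# First and last values are from the text; the rest are hypothetical fillers.
projected_ballots = [
    5_370, 20_000, 40_000, 60_000, 80_000, 100_000, 120_000, 1_700_000,
]

flagged = iqr_outliers(projected_ballots)
ratio = max(projected_ballots) / min(projected_ballots)

print(flagged)       # values the fence treats as outliers
print(round(ratio))  # largest projection divided by smallest
```

With these numbers the large urban county falls outside the upper fence even though nothing about it is wrong, which is the pattern the Validity score is picking up.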

The 70% threshold in our GitHub Actions gate means a dataset needs to be at least Mostly Wise before it can be merged. A contributor who accidentally deletes data or introduces inconsistent values would drop the score and be blocked automatically. This dataset, with its expected geographic spread, passes at 73.8%.
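A quick arithmetic check on the published numbers: a plain average of the three dimensions would give 75%, not 73.8%, so the overall score is presumably weighted. The weighting formula is not stated on this page. The sketch below assumes a weighted average whose weights sum to 1 and derives the validity weight that assumption would imply; treat it as a back-of-envelope reading, not the API's definition.

```python
# Published values from the score card above.
scores = {"completeness": 100.0, "consistency": 100.0, "validity": 25.0}
reported_overall = 73.8
threshold = 70.0  # GitHub Actions gate

# A plain (equal-weight) mean of the three dimensions.
equal_weight_mean = sum(scores.values()) / len(scores)

# Assumption: overall = w * validity + (1 - w) * 100, because the other
# two dimensions are both 100. Solving for w gives the implied weight.
implied_validity_weight = (100.0 - reported_overall) / (100.0 - scores["validity"])

passes_gate = reported_overall >= threshold

print(equal_weight_mean)
print(round(implied_validity_weight, 3))
print(passes_gate)
```

Under that assumption validity carries roughly 35% of the overall score, and the dataset clears the 70% gate by 3.8 points.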

🤝 How to Contribute County Data

This is a community-curated dataset. We welcome contributions from researchers, civic technologists, and data volunteers. Every data PR is automatically checked by the Wisdom quality gate before it can be merged.

Fork sysWisdom/myvoterwisdom on GitHub and clone your fork locally.
Add or update rows in data/voting_pres_data.csv or data/prediction_pres_data.csv. Use public election records only — no PII, no individual voter data.
Open a Pull Request. GitHub Actions will automatically run the Wisdom Data Quality Gate — your changes need a score ≥ 70% to be approved.
A maintainer reviews the PR for accuracy and context. If the data checks out, it is merged and the live predictions update on the next deploy.
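Before opening a PR, a contributor may want a rough local sanity check for accidentally blanked cells. The sketch below computes the share of non-empty cells in a CSV. It is not the Wisdom gate, and the real Completeness definition used by the API may differ. The function name and the sample column names are hypothetical; the actual schema of the data files is not shown on this page.

```python
import csv
import io

def completeness(csv_text):
    """Percentage of cells that hold a non-blank value."""
    reader = csv.DictReader(io.StringIO(csv_text))
    total = filled = 0
    for row in reader:
        for value in row.values():
            total += 1
            # Missing trailing fields arrive as None; blanks as "".
            if isinstance(value, str) and value.strip():
                filled += 1
    return 100.0 * filled / total if total else 0.0

# Hypothetical two-row sample with one blank cell in the last column.
sample = (
    "county,state,year,ballots_cast\n"
    "Example County,XX,2020,1000\n"
    "Example County,XX,2024,\n"
)

score = completeness(sample)
print(score)
```

To run it against a local file, read the file's text and pass it in, for example `completeness(open("data/voting_pres_data.csv", encoding="utf-8").read())`.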

Questions? Open a GitHub Discussion or email info@syswisdom.ai.

⚖️ Responsible AI Statement

MyVoterWisdom was built as a civic education tool and a framework validation experiment. We are committed to the following principles:

🏛️

Public data only

All data is county-level aggregate election records from public sources. No individual voter data. No PII. No micro-targeting.

🚫

Not for campaigns

This platform must not be used for political advertising, voter suppression, campaign targeting, or any active electoral campaign activity.

🔓

Open-source forever

BSD 3-Clause license. All code, data, and model logic are public. No black boxes. Anyone can audit, fork, or improve it.

📚

Educational purpose

Designed to help people understand how AI models work on real civic data — not to influence votes or declare winners before election day.

🧪

Honest uncertainty

When models disagree, we say so. A "Models disagree" flag is not a failure — it is Consistency working correctly.

🤝

Community reviewed

Data quality is enforced by an automated gate and by human review. No single person controls what gets merged.

🔧 What We Built

The full stack is free-tier and open-source. No cloud billing required to run or contribute.

Layer	Technology	Notes
Frontend	GitHub Pages	Auto-deploys static/ on push to main
Backend	Flask 3 on Render.com	Python 3.11, free tier, /predict + /data-quality
ML Models	scikit-learn + imbalanced-learn	RF, LR, SVM, Gradient Boosting with SMOTE
Notebooks	Google Colab	3 notebooks, free, no install required
Data Quality	SysWisdom DQ API	Proxied server-side, key never in browser
CI Gate	GitHub Actions	Wisdom quality check blocks PRs below 70%
Domain	myvoter.syswisdom.ai	Wix DNS CNAME → syswisdom.github.io