📖 The Origin Story
In October 2024, SysWisdom.AI set out to answer one question: does the Wisdom Formula actually work on a real-world dataset? We needed a problem with publicly available, verifiable historical data — and U.S. presidential elections by county fit perfectly. Every vote is on the public record. Every county reports the same structured data. The ground truth is known.
We began curating county-level election data from 2004 through 2024, applying the three Wisdom dimensions to measure whether the data itself was sound enough to build predictions on. The project became a live test of our own framework.
- 🌱 October 2024: Project started (Wisdom model validation). Data curation begins. 27 counties across 2004–2024 general elections. Goal: validate the Wisdom Formula against a real, politically neutral dataset.
- 🏁 February 2025: Alpha complete. Electoral analytics platform alpha ships. 27 counties, voter trend analysis, AI forecasting models for 2028 using Random Forest, Laplace smoothing, and SMOTE. 541 social media views with zero SEO investment. 20 data-driven analysis posts published.
- 🚀 May 2026: Beta (open-source public release and MEDSL expansion). Original 39-county hand-curated set open-sourced on GitHub. MEDSL (Harvard Dataverse, CC BY 4.0) integration expanded training data to 1,956 counties across all 50 states plus D.C. A global ML model replaces per-county training, eliminating single-class fallback and overfitting on small samples. Electoral College aggregation (Phase 7) added: county → state → 270 EV map. Live at myvoter.syswisdom.ai.
🔍 Data Sample: Why These 39 Counties?
Anyone doing serious electoral research will ask: why these 39 counties, what were the selection criteria, and is the sample geographically or demographically representative? Those are the right questions. Here are honest answers.
How the 39 Counties Were Selected
The counties were chosen pragmatically, not statistically. The project started in October 2024 as a Wisdom Formula validation experiment, not an academic sample of the U.S. electorate. The selection criteria were:
- Public availability of complete data — registered voters, ballots cast, mail-in vs. in-person splits across 2004–2024
- Geographic spread across party-lean categories (reliably blue, reliably red, genuine swing)
- Sufficient year-over-year variation to make Wisdom flag calculation meaningful
- Hand-verification feasibility — every row was manually cross-checked against official county sources
There was no random sampling, no stratification by population or demographics, and no attempt to construct a nationally representative panel. The dataset was built to test a framework, and it shows.
What the Sample Covers — and What It Misses
| Dimension | In Sample | Gaps |
|---|---|---|
| State coverage | 25 of 51 states/D.C. | 26 states entirely absent — including IA, MN, MO, VA, MA, SC, LA, AR, and most of the Plains & Mountain West |
| Urban/Rural balance | Majority large urban counties | Heavily skewed toward cities (Manhattan, Chicago, Houston, Seattle, Detroit, Atlanta, Philadelphia, Miami). Rural and exurban counties under-represented. |
| State concentration | HI: 4 counties (100%); KS: 4 counties; DE: 3 counties (100%); GA: 3 counties | Some small states are fully covered; most large states have 1–2 counties. Texas (254 counties) is represented by Harris County alone. |
| Party lean | Mix of reliably Dem, reliably Rep, and swing | Over-represents deep-blue urban counties. The Wisdom flag distribution (82% True) reflects this selection bias. |
| Data richness | Full: registered voters + mail-in + in-person splits | This is where the 39 counties exceed broader datasets — most county-level sources lack registration and mail-in detail. |
How the Data Expanded: The Three-Tier Architecture
In Phase 6 (May 2026) the project expanded beyond the original 39 counties by integrating the MIT Election Data and Science Lab (MEDSL) county presidential returns dataset (CC BY 4.0). The ML model now trains on a global dataset of 19,155 county-year records across 1,956 counties in 50 states plus D.C.
| Tier | Source | Coverage | Role |
|---|---|---|---|
| Tier 1 | Manual curated (39 counties) | 39 counties · 25 states · 2004–2024 | Gold standard — includes voter registration + mail-in detail. Hand-verified against official county sources. |
| Tier 2 | MEDSL / Harvard Dataverse | 1,956 counties · 50 states + D.C. · 2000–2024 | Authoritative academic source. CC BY 4.0. Vote totals only, with no registration or mail-in splits. |
| Tier 3 | tonmcg / GitHub (scraped) | ~3,100 counties · 2008–2024 | Secondary validation only. Not authoritative — scraped from news outlets. MIT license. |
What this means for the prediction tool: when you select a county and run a prediction, the global Random Forest model draws on cross-county patterns from all 1,956 counties — not just the 39 original hand-curated ones. The original 39 remain the highest-quality rows in the dataset because they carry registration and mail-in data that MEDSL does not include.
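The difference between tiers shows up most directly in the turnout feature. The sketch below is a minimal illustration, assuming turnout is defined as ballots cast divided by registered voters and falls back to zero when registration is missing (as is the case for MEDSL rows). The function name and argument names are hypothetical, not taken from the project code.

```python
from typing import Optional


def turnout_feature(ballots_cast: int, registered_voters: Optional[int]) -> float:
    """Return ballots_cast / registered_voters, or 0.0 when registration is unknown.

    Assumption: this mirrors the documented behaviour that rows without
    registration data get a turnout feature of zero. The real pipeline may
    compute the feature differently.
    """
    if not registered_voters:  # covers None and 0
        return 0.0
    return ballots_cast / registered_voters


# Tier 1 style row: registration is known, so turnout is a real ratio.
tier1_turnout = turnout_feature(ballots_cast=750, registered_voters=1000)

# Tier 2 style row: vote totals only, so the feature falls back to zero.
tier2_turnout = turnout_feature(ballots_cast=750, registered_voters=None)
```

A zero here means "unknown", not "nobody voted", which is worth keeping in mind when reading feature importances from the global model.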
Known Remaining Limitations
- MEDSL covers 2000–2024 presidential elections only; not all ~3,100 U.S. counties are present in every year
- No voter registration data for MEDSL rows — turnout feature is zero for those records
- No demographic or Census features — the model predicts from voting history alone, not why counties vote as they do
- Alaska uses borough boundaries; the original AK entry (House District 40) was relabeled as Matanuska-Susitna Borough — a known approximation
- The Electoral College projection (Phase 7) treats Maine and Nebraska as winner-take-all for simplicity
We publish these limitations because transparency is part of the mandate. See DISCLAIMER.md for the full model limitations statement.
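The county → state → Electoral College roll-up can be sketched as follows. This is a simplified illustration under stated assumptions, not the project's Phase 7 code: it considers two parties only, applies winner-take-all in every state (the same simplification noted above for Maine and Nebraska), and leaves exact ties unallocated. The function name, the row layout, and the vote counts in the example are placeholders; only the electoral vote counts for Delaware (3) and Kansas (6) are real.

```python
from collections import defaultdict

EV_TO_WIN = 270  # majority of the 538 electoral votes


def project_electoral_college(county_rows, electoral_votes):
    """Aggregate county projections to states, then to electoral votes.

    county_rows: iterable of (state, dem_votes, rep_votes) tuples.
    electoral_votes: dict mapping state code to its electoral vote count.
    Winner-take-all everywhere; exact ties are left unallocated.
    """
    state_totals = defaultdict(lambda: [0, 0])
    for state, dem, rep in county_rows:
        state_totals[state][0] += dem
        state_totals[state][1] += rep

    ev = {"DEM": 0, "REP": 0}
    for state, (dem, rep) in state_totals.items():
        if dem == rep:
            continue
        winner = "DEM" if dem > rep else "REP"
        ev[winner] += electoral_votes[state]
    return ev


# Toy input with made-up vote counts, two counties per state.
toy_rows = [("DE", 60, 40), ("DE", 30, 20), ("KS", 10, 50), ("KS", 20, 25)]
toy_ev = project_electoral_college(toy_rows, {"DE": 3, "KS": 6})
has_winner = any(total >= EV_TO_WIN for total in toy_ev.values())
```

Because only the counties present in the data are summed, a state's projected winner depends on which of its counties are covered. That is one reason the coverage gaps listed above matter for the map.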
📊 Why 73.8% Is the Honest Score
When you click Check Data Quality on the main page, our backend sends prediction_pres_data.csv to the SysWisdom Data Quality API, which scores it against the three Wisdom dimensions. Here is what the score means, and why we deliberately do not inflate it.
Current score — prediction_pres_data.csv (38 rows × 11 columns)
Validity is 25% because the API detects statistical outliers in projected vote totals — Harris County TX projects 1.7 million ballots while Glacier County MT projects 5,370. That 300× spread triggers outlier flags on 5 columns. But this is not bad data: it is real geographic diversity across the United States. A rural Montana county and a major Texas urban county are both valid data points. The score is honest, not broken.
The 70% threshold in our GitHub Actions gate means a dataset needs to be at least Mostly Wise before it can be merged. A contributor who accidentally deletes data or introduces inconsistent values would drop the score and be blocked automatically. This dataset, with its expected geographic spread, passes at 73.8%.
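The gate logic itself is a simple comparison. The sketch below shows how a CI step might turn a score into a pass/fail exit code. It assumes the overall score has already been obtained from the Data Quality API, and the function names are hypothetical rather than taken from the repository's workflow.

```python
GATE_THRESHOLD = 70.0  # percent, as stated for the GitHub Actions gate


def passes_gate(score: float, threshold: float = GATE_THRESHOLD) -> bool:
    """Return True when the score meets or exceeds the merge threshold."""
    return score >= threshold


def gate_exit_code(score: float) -> int:
    """Map the gate result to a process exit code: 0 passes, 1 blocks the PR."""
    return 0 if passes_gate(score) else 1


if __name__ == "__main__":
    current_score = 73.8  # the score reported above for prediction_pres_data.csv
    status = "PASS" if passes_gate(current_score) else "BLOCKED"
    print(f"Wisdom score {current_score}% -> {status}")
```

A CI job would typically call something like `sys.exit(gate_exit_code(score))` so that a non-zero exit code fails the check.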
🤝 How to Contribute County Data
This is a community-curated dataset. We welcome contributions from researchers, civic technologists, and data volunteers. Every data PR is automatically checked by the Wisdom quality gate before it can be merged.
- Fork sysWisdom/myvoterwisdom on GitHub and clone your fork locally.
- Add or update rows in data/voting_pres_data.csv or data/prediction_pres_data.csv. Use public election records only: no PII, no individual voter data.
- Open a Pull Request. GitHub Actions will automatically run the Wisdom Data Quality Gate; your changes need a score ≥ 70% to be approved.
- A maintainer reviews the PR for accuracy and context. If the data checks out, it is merged and the live predictions update on the next deploy.
Questions? Open a GitHub Discussion or email info@syswisdom.ai.
⚖️ Responsible AI Statement
MyVoterWisdom was built as a civic education tool and a framework validation experiment. We are committed to the following principles:
Public data only
All data is county-level aggregate election records from public sources. No individual voter data. No PII. No micro-targeting.
Not for campaigns
This platform must not be used for political advertising, voter suppression, campaign targeting, or any active electoral campaign activity.
Open-source forever
BSD 3-Clause license. All code, data, and model logic is public. No black boxes. Anyone can audit, fork, or improve it.
Educational purpose
Designed to help people understand how AI models work on real civic data — not to influence votes or declare winners before election day.
Honest uncertainty
When models disagree, we say so. A "Models disagree" flag is not a failure — it is Consistency working correctly.
Community reviewed
Data quality is enforced by an automated gate and by human review. No single person controls what gets merged.
🔧 What We Built
The full stack is free-tier and open-source. No cloud billing required to run or contribute.
| Layer | Technology | Notes |
|---|---|---|
| Frontend | GitHub Pages | Auto-deploys static/ on push to main |
| Backend | Flask 3 on Render.com | Python 3.11, free tier, /predict + /data-quality |
| ML Models | scikit-learn + imbalanced-learn | RF, LR, SVM, Gradient Boosting with SMOTE |
| Notebooks | Google Colab | 3 notebooks, free, no install required |
| Data Quality | SysWisdom DQ API | Proxied server-side, key never in browser |
| CI Gate | GitHub Actions | Wisdom quality check blocks PRs below 70% |
| Domain | myvoter.syswisdom.ai | Wix DNS CNAME → syswisdom.github.io |
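The "proxied server-side, key never in browser" row can be illustrated with a small standard-library sketch. Everything specific here is an assumption made for illustration: the endpoint URL is a placeholder, the environment variable name is hypothetical, and the bearer-token header is a guess at the auth scheme, since the real API contract is not described on this page. The request is built but not sent.

```python
import os
import urllib.request

# Placeholder endpoint; the real Data Quality API URL is not given here.
DQ_API_URL = "https://dq-api.example.invalid/score"


def build_dq_request(csv_bytes: bytes, api_key: str) -> urllib.request.Request:
    """Build the server-side request that forwards a CSV to the DQ API.

    The API key is attached on the backend only, so it never appears in
    any JavaScript or HTML delivered to the browser.
    """
    request = urllib.request.Request(DQ_API_URL, data=csv_bytes, method="POST")
    request.add_header("Content-Type", "text/csv")
    request.add_header("Authorization", f"Bearer {api_key}")  # assumed scheme
    return request


# Hypothetical env var name; on a real host this would be a configured secret.
demo_key = os.environ.get("DQ_API_KEY", "demo-key")
demo_request = build_dq_request(b"county,votes\nExample,100\n", demo_key)
```

The browser only ever talks to the backend's own /data-quality route; the backend builds a request like this one and relays the response.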