AUSynth Methodology

What AUSynth Is

AUSynth provides synthetic Australian population data: individual-level records that statistically represent the demographic and socioeconomic patterns found in the ABS 2021 Census, updated to reflect current population estimates and economic conditions.

The data is synthetic: no record corresponds to a real person. But the records faithfully preserve the joint distributions and relationships found in the source Census data. Each synthetic person, family, or dwelling is a statistically plausible combination of attributes, drawn from distributions estimated at the suburb level.

AUSynth exists because real unit-record Census data is not available to researchers, analysts, or businesses in Australia. The ABS publishes aggregate cross-tabulations, but these cannot be used for record-level modelling, machine learning training, microsimulation, or any analysis requiring individual observations. AUSynth fills this gap.

Data Sources

AUSynth is built on publicly available ABS datasets:

ABS Census of Population and Housing, 2021. The foundational source. Cross-tabulations capturing how demographic variables relate to each other within each suburb (for example, how income varies by age group and sex in a given area).

ABS Estimated Resident Population (ERP), Catalogue 3218.0. Population counts by Local Government Area, used to scale suburb populations from the 2021 Census base year to current levels.

ABS Wage Price Index (WPI), Catalogue 6345.0. State-level wage growth indices used to adjust income distributions from 2021 nominal values to current levels.

ABS Consumer Price Index (CPI), Catalogue 6401.0. Capital-city-level housing cost indices used to adjust mortgage repayment and rental distributions.

Australian Statistical Geography Standard (ASGS), 2021. The geographic framework defining Suburbs and Localities (SAL), Local Government Areas (LGA), Greater Capital City Statistical Areas (GCCSA), and State/Territory boundaries.

Synthetic Generation

The synthetic population is generated using established Markov Chain Monte Carlo methods that preserve the joint distributions of demographic variables as observed in the source Census. The generation process works at the suburb level: for each of Australia's 15,343 covered suburbs, conditional probability distributions estimated from Census cross-tabulations are used to produce individual-level records that reproduce the statistical relationships found in the real data.
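As a rough illustration of the general technique, and not AUSynth's actual implementation, a Gibbs sampler (one common MCMC method) draws each attribute in turn from its conditional distribution given the others. The cross-tabulation counts, category labels, and function names below are invented for the example; a real suburb would involve many more variables and only partial cross-tabulations.

```python
import random

# Hypothetical age-by-income cross-tabulation for one suburb (invented counts).
CROSSTAB = {
    ("15-24", "low"): 40, ("15-24", "high"): 10,
    ("25-64", "low"): 30, ("25-64", "high"): 70,
    ("65+", "low"): 30, ("65+", "high"): 20,
}

def conditional(fixed_index, fixed_value):
    """Counts for the free variable given the value of the other variable."""
    free_index = 1 - fixed_index
    return {
        key[free_index]: count
        for key, count in CROSSTAB.items()
        if key[fixed_index] == fixed_value
    }

def draw(counts, rng):
    """Draw one category with probability proportional to its count."""
    categories = list(counts)
    weights = [counts[c] for c in categories]
    return rng.choices(categories, weights=weights, k=1)[0]

def gibbs_sample(n_records, burn_in=100, thin=5, seed=0):
    """Alternate draws of age | income and income | age, keeping thinned states."""
    rng = random.Random(seed)
    age, income = "25-64", "low"  # arbitrary starting state
    records = []
    for step in range(burn_in + n_records * thin):
        age = draw(conditional(1, income), rng)   # age given current income
        income = draw(conditional(0, age), rng)   # income given new age
        if step >= burn_in and (step - burn_in) % thin == 0:
            records.append({"age_group": age, "income_band": income})
    return records

records = gibbs_sample(200)
```

The burn-in and thinning values are arbitrary choices for the sketch; they reduce dependence on the starting state and between consecutive draws.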

Population totals are calibrated to ABS 2025 estimates. Income distributions are adjusted for cumulative wage growth using state-specific WPI factors. Housing cost distributions are adjusted using capital-city-level CPI indices for mortgage repayments and rents.

Each suburb's pool contains as many synthetic records as its projected population. Customer queries are sampled without replacement from the pool, so repeat queries yield different subsets, which supports multiple imputation workflows.
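Sampling without replacement can be sketched as follows. The pool contents, the record_id field, and the query_records function are hypothetical and stand in for whatever the real query interface does.

```python
import random

def query_records(pool, n, rng):
    """Return n distinct records drawn without replacement from a suburb pool."""
    if n > len(pool):
        raise ValueError("requested more records than the pool holds")
    return rng.sample(pool, n)

# Hypothetical pool for a suburb with a projected population of 1,000.
pool = [{"record_id": i} for i in range(1000)]

rng = random.Random()  # unseeded, so separate queries draw independently
first = query_records(pool, 50, rng)
second = query_records(pool, 50, rng)
```

Within a single query no record appears twice. Two independent queries are separate random draws, so they will almost always differ, although individual records can appear in both.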

Structural Validation

Every generated record is checked against a comprehensive set of validation rules derived from the ABS Census Dictionary 2021. These rules encode the structural relationships between variables: for example, children under 15 cannot have an occupation, and a renter cannot have mortgage repayments. Records that violate any rule are discarded, ensuring that every delivered record is structurally plausible.

It is important to understand what validation does and does not do. It removes structurally impossible combinations, such as a 5-year-old with a full-time job. It does not remove improbable but possible combinations, such as a 25-year-old retiree. The synthetic data may contain records that are unusual but not impossible.
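A minimal sketch of rule-based filtering, using the two example rules mentioned above. The field names (age, occupation, tenure, mortgage_repayment) and their values are assumptions made for illustration and are not the actual AUSynth schema.

```python
def child_with_occupation(rec):
    """Structural violation: a person under 15 with an occupation."""
    age = rec.get("age")
    return age is not None and age < 15 and rec.get("occupation") is not None

def renter_with_mortgage(rec):
    """Structural violation: a rented dwelling with mortgage repayments."""
    return rec.get("tenure") == "rented" and rec.get("mortgage_repayment") is not None

# Each rule returns True when the record is structurally impossible.
RULES = [child_with_occupation, renter_with_mortgage]

def is_valid(rec):
    return not any(rule(rec) for rule in RULES)

candidates = [
    {"age": 5, "occupation": "Manager"},                  # impossible: discarded
    {"age": 25, "occupation": None, "retired": True},     # unusual but possible: kept
    {"tenure": "rented", "mortgage_repayment": 2400},     # impossible: discarded
]
delivered = [rec for rec in candidates if is_valid(rec)]
```

Only the 25-year-old retiree survives the filter, which mirrors the distinction between impossible and merely improbable records.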

Quality

AUSynth has been extensively validated against ABS Census 2021 source distributions. Key demographic relationships (age, income, sex, education, occupation, industry, birthplace, and transport) are reproduced with high fidelity. Family-level relationships are reproduced with good fidelity. Fidelity for dwelling-level household composition is moderate; the planned v1.1 linking release will substantially improve this.

Quality is measured using Standardised Root Mean Square Error (SRMSE), which quantifies how closely the synthetic distributions match ABS targets. Scores below 0.05 indicate a strong fit, and scores below 0.10 indicate a good fit. Person and family datasets achieve strong fit; dwelling data is acceptable and will improve with household linking. For detailed validation metrics and published benchmark comparisons, see the Quality Benchmarks page.
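For readers unfamiliar with the metric, the sketch below uses one common definition of SRMSE: the root mean square error across the cells of a table, divided by the mean target cell value. Whether AUSynth uses exactly this variant is an assumption, and the cell counts are invented.

```python
import math

def srmse(synthetic, target):
    """RMSE between matching table cells, standardised by the mean target cell."""
    if len(synthetic) != len(target) or not target:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(target)
    rmse = math.sqrt(sum((s - t) ** 2 for s, t in zip(synthetic, target)) / n)
    mean_target = sum(target) / n
    return rmse / mean_target

# Hypothetical cell counts for one cross-tabulation in one suburb.
target_counts = [100, 200, 300, 400]
synthetic_counts = [102, 197, 304, 397]

score = srmse(synthetic_counts, target_counts)  # roughly 0.012, a strong fit
```

A score of 0 means the synthetic table matches the target exactly; larger values indicate larger relative deviations.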

Quality is highest for populous suburbs with rich source data and lowest for very small suburbs where the underlying Census data is naturally sparse. Suburbs with original Census populations below 100 are flagged with small_suburb_flag to alert users to this.

Geographic Coverage

AUSynth covers 15,343 of Australia's 15,352 Suburbs and Localities across 547 Local Government Areas and all states and territories. The 9 excluded entries are ABS statistical constructs (such as "No usual address") that do not represent physical locations.

Temporal Adjustment

The Census captures Australia as it was on 10 August 2021. AUSynth adjusts three dimensions to produce records representative of current conditions. Population counts are scaled to 2025 using LGA-level ERP growth rates. Income distributions are adjusted using state-specific wage growth. Housing cost distributions are adjusted using capital-city-level CPI indices.
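The population scaling step can be sketched as a simple ratio, assuming each suburb inherits the growth rate of its parent LGA. The function name and the figures are illustrative, not ABS values.

```python
def scale_population(census_pop, lga_erp_base, lga_erp_current):
    """Scale a suburb's Census count by its LGA's ERP growth since the base year."""
    if lga_erp_base <= 0:
        raise ValueError("base-year ERP must be positive")
    growth = lga_erp_current / lga_erp_base
    return round(census_pop * growth)

# Hypothetical suburb of 2,000 people in an LGA that grew from 100,000 to 105,000.
projected = scale_population(2000, 100_000, 105_000)  # 2100
```

Applying one LGA-level rate to every suburb in that LGA is a simplification: suburbs that grew faster or slower than their LGA as a whole will be over- or under-scaled.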

The structural composition of the population (how age relates to occupation, for example) reflects 2021 patterns. Industry and occupation distributions have not been adjusted for post-2021 labour market changes.

Known Limitations

Dwelling household composition. In v1.0, person, family, and dwelling datasets are generated independently, which limits household-level coherence. The planned v1.1 release will introduce person-family-dwelling linking, substantially improving dwelling-level fidelity.

Static labour market structure. Industry and occupation distributions reflect the 2021 Census. Sectors that have grown or contracted significantly since then will be under- or over-represented.

Income bracket boundaries. Income variables use the ABS's fixed 2021 bracket definitions, adjusted for wage growth. The adjustment shifts probability mass across brackets but cannot change the bracket boundaries themselves.
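One way to picture this adjustment is sketched below. It assumes incomes are spread uniformly within each closed bracket, scales them by a wage growth factor, and re-bins the result into the same fixed brackets. The bracket edges, shares, growth factor, and the uniform-spread assumption are all illustrative and may differ from the actual method.

```python
import math

# Hypothetical weekly income bracket edges; the last bracket is open-ended.
EDGES = [0, 500, 1000, 2000, math.inf]

def shift_bracket_shares(shares, growth):
    """Rescale incomes by `growth` and re-bin them into the fixed brackets.

    Only handles growth >= 1, so the open-ended top bracket keeps its mass.
    """
    if growth < 1:
        raise ValueError("this sketch only handles growth >= 1")
    new_shares = [0.0] * len(shares)
    for i, share in enumerate(shares):
        lo, hi = EDGES[i], EDGES[i + 1]
        if math.isinf(hi):
            new_shares[i] += share  # top bracket: everyone stays in it
            continue
        new_lo, new_hi = lo * growth, hi * growth
        width = new_hi - new_lo
        for j in range(len(shares)):
            # Length of the scaled source bracket that falls inside bracket j.
            overlap = min(new_hi, EDGES[j + 1]) - max(new_lo, EDGES[j])
            if overlap > 0:
                new_shares[j] += share * overlap / width
    return new_shares

shares_2021 = [0.2, 0.3, 0.4, 0.1]  # invented bracket shares
shares_now = shift_bracket_shares(shares_2021, growth=1.10)
```

The bracket edges never move; only the share of people in each bracket changes, with mass drifting toward the higher brackets.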

No longitudinal dimension. Each record is a cross-sectional snapshot. Longitudinal capability is under research for future releases.

Census coverage. AUSynth inherits whatever coverage patterns exist in the ABS Census 2021. Populations undercounted in the Census are correspondingly underrepresented in the synthetic data.

Updates

The dataset is refreshed periodically. Economic adjustments (income, housing costs) are updated quarterly, and the population is fully recalibrated annually. Full Census 2026 integration is planned for 2027/28, once the ABS publishes the data.

References

Farooq, B., Bierlaire, M., Hurtubia, R., & Flötteröd, G. (2013). Simulation based population synthesis. Transportation Research Part B: Methodological, 58, 243-263.

Australian Bureau of Statistics. (2022). Census of Population and Housing: Census Dictionary, 2021. ABS Catalogue No. 2901.0.

AUSynth is synthetic proxy data. It preserves the statistical relationships and population counts found in the ABS Census 2021, adjusted to current demographic and economic conditions. It should not be treated as a direct observation of real individuals or used as the sole basis for decisions affecting real people.