Glossary

Statistical Terms

Conditional distribution; The probability distribution of one variable given fixed values of one or more other variables. For example, the income distribution for women aged 30-34 in a given suburb. AUSynth preserves conditional distributions from ABS Census 2021 cross-tabulations.

Cross-tabulation; A table showing the frequency or proportion of observations for every combination of two or more categorical variables. ABS TableBuilder publishes cross-tabulations of Census variables, which serve as the foundation for AUSynth's statistical estimates.

Joint distribution; The probability distribution over all variables simultaneously. AUSynth's goal is to generate records that approximate the true joint distribution within each suburb.

Marginal distribution; The distribution of a single variable, ignoring all other variables. For example, the proportion of people in each age group in a suburb. AUSynth validates synthetic output by comparing marginal distributions against ABS targets.
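The four distribution terms above can be illustrated with a small cross-tabulation. This is a minimal sketch: the counts are invented for illustration (they are not ABS figures), and the helper names `joint`, `marginal`, and `conditional` are hypothetical rather than part of AUSynth.

```python
from collections import defaultdict

# Invented counts for one suburb: (age group, sex) -> number of persons.
# Illustrative values only, not ABS data.
crosstab = {
    ("25-29", "Female"): 30,
    ("25-29", "Male"): 20,
    ("30-34", "Female"): 25,
    ("30-34", "Male"): 25,
}

def joint(table):
    """Normalise cell counts into a joint distribution over all variables."""
    total = sum(table.values())
    return {cell: n / total for cell, n in table.items()}

def marginal(table, axis):
    """Sum the joint distribution over every variable except `axis`."""
    out = defaultdict(float)
    for cell, p in joint(table).items():
        out[cell[axis]] += p
    return dict(out)

def conditional(table, axis, given_axis, given_value):
    """Distribution of variable `axis` among cells where `given_axis` equals `given_value`."""
    subset = {c: n for c, n in table.items() if c[given_axis] == given_value}
    total = sum(subset.values())
    out = defaultdict(float)
    for cell, n in subset.items():
        out[cell[axis]] += n / total
    return dict(out)

age_marginal = marginal(crosstab, axis=0)
sex_given_age = conditional(crosstab, axis=1, given_axis=0, given_value="25-29")
```

Here `age_marginal` ignores sex entirely, while `sex_given_age` describes sex only among people aged 25-29.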

Markov Chain Monte Carlo (MCMC); A class of algorithms that generate samples from a target probability distribution by constructing a Markov chain whose long-run behaviour matches that distribution. AUSynth uses MCMC methods to generate synthetic records that preserve the statistical relationships found in Census data. The general approach is well documented in the population synthesis literature.
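This glossary does not say which MCMC algorithm AUSynth uses. As an illustration only, the sketch below uses Gibbs sampling, one common member of the class, which repeatedly redraws each variable from its conditional distribution given the others. The conditional tables are invented, and the names `draw` and `gibbs` are hypothetical.

```python
import random

# Invented conditional tables, chosen to be consistent with a single joint
# distribution. Illustrative values only, not ABS data.
P_SEX_GIVEN_AGE = {
    "25-29": {"Female": 0.6, "Male": 0.4},
    "30-34": {"Female": 0.5, "Male": 0.5},
}
P_AGE_GIVEN_SEX = {
    "Female": {"25-29": 30 / 55, "30-34": 25 / 55},
    "Male": {"25-29": 20 / 45, "30-34": 25 / 45},
}

def draw(dist, rng):
    """Draw one category from a {category: probability} mapping."""
    cats = list(dist)
    return rng.choices(cats, weights=[dist[c] for c in cats], k=1)[0]

def gibbs(n_samples, burn_in=100, seed=0):
    """Alternate draws from each conditional; keep states after burn-in."""
    rng = random.Random(seed)
    age, sex = "25-29", "Female"  # arbitrary starting state
    records = []
    for step in range(burn_in + n_samples):
        age = draw(P_AGE_GIVEN_SEX[sex], rng)
        sex = draw(P_SEX_GIVEN_AGE[age], rng)
        if step >= burn_in:
            records.append((age, sex))
    return records

records = gibbs(5000)
```

After the burn-in period, the share of each (age, sex) combination in `records` should approach the joint distribution implied by the two conditional tables.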

Multiple imputation; A statistical technique for creating multiple plausible versions of a dataset, each generated independently. Analyses are run on each version separately, and results are combined using Rubin's rules to produce estimates that properly account for uncertainty. AUSynth's without-replacement sampling supports multiple imputation workflows.

Rubin's rules; Formulas for combining parameter estimates and standard errors from multiple imputed datasets. The combined estimate is the average across imputations. The combined variance is the average within-imputation variance plus the between-imputation variance inflated by a factor of (1 + 1/m), where m is the number of imputations, so that imputation uncertainty is accounted for.
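A minimal sketch of the pooling step, using the standard form of Rubin's rules (total variance = W + (1 + 1/m) B). The function name `pool_rubin` and the input numbers are invented for illustration.

```python
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Combine m point estimates and their squared standard errors.

    estimates: one parameter estimate per imputed dataset
    variances: the squared standard error of each estimate
    """
    m = len(estimates)
    q_bar = mean(estimates)       # combined estimate
    w = mean(variances)           # within-imputation variance
    b = variance(estimates)       # between-imputation variance (m - 1 denominator)
    total = w + (1 + 1 / m) * b   # total variance
    return q_bar, total

# Invented results from three imputed datasets.
combined, total_var = pool_rubin([10.0, 12.0, 11.0], [4.0, 5.0, 3.0])
```

The combined standard error is the square root of `total_var`.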

SRMSE (Standardised Root Mean Square Error); A measure of how closely a synthetic distribution matches a target distribution, normalised by the mean category proportion. Lower is better. Values below 0.05 are generally considered a strong fit for synthetic population data.
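Several variants of SRMSE appear in practice. The sketch below follows the definition given here (root mean square error across categories, divided by the mean target proportion) and may differ in detail from AUSynth's own calculation. The function name and the example proportions are invented.

```python
from math import sqrt

def srmse(synthetic, target):
    """SRMSE between two {category: proportion} mappings that each sum to 1."""
    cats = target.keys() | synthetic.keys()
    k = len(cats)
    mse = sum((synthetic.get(c, 0.0) - target.get(c, 0.0)) ** 2 for c in cats) / k
    mean_prop = sum(target.get(c, 0.0) for c in cats) / k  # equals 1/k for proportions
    return sqrt(mse) / mean_prop

# Invented proportions for a three-category variable.
target = {"A": 0.50, "B": 0.30, "C": 0.20}
synthetic = {"A": 0.48, "B": 0.31, "C": 0.21}
score = srmse(synthetic, target)
```

For these invented inputs the score is roughly 0.042, which falls under the 0.05 threshold mentioned above.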

Synthetic data; Data generated algorithmically to approximate the statistical properties of a real dataset, without containing any real individual records. AUSynth generates synthetic population records that preserve the distributional relationships found in the ABS Census 2021.

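Total Variation Distance (TVD); A measure of the largest difference in probability that two distributions can assign to the same set of outcomes. For categorical variables it equals half the sum of the absolute differences between category proportions. Ranges from 0 (identical) to 1 (no overlap). Used in validation to compare synthetic distributions against ABS targets.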
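For categorical data, TVD reduces to half the sum of absolute differences in category proportions. A minimal sketch, with an invented function name and invented proportions:

```python
def tvd(p, q):
    """Total variation distance between two {category: proportion} mappings."""
    cats = p.keys() | q.keys()
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in cats)

# Invented proportions for a three-category variable.
target = {"A": 0.50, "B": 0.30, "C": 0.20}
synthetic = {"A": 0.48, "B": 0.31, "C": 0.21}
distance = tvd(synthetic, target)

# The two extremes of the range.
identical = tvd(target, target)
disjoint = tvd({"A": 1.0}, {"B": 1.0})
```

Here `distance` is about 0.02, `identical` is 0, and `disjoint` is 1.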

ABS Terms

ABS (Australian Bureau of Statistics); Australia's national statistical agency, responsible for the Census and the statistical frameworks that AUSynth builds upon.

ASGS (Australian Statistical Geography Standard); The ABS framework defining geographic boundaries and hierarchies across Australia. ASGS 2021 defines the suburb, LGA, GCCSA, state, and national boundaries used by AUSynth.

Census 2021; The Australian Census of Population and Housing conducted on 10 August 2021. Provides the conditional probability distributions and marginal targets from which synthetic records are generated.

Census Dictionary; The ABS reference document (Catalogue 2901.0) defining every Census variable, its categories, and the rules governing which categories apply to which persons. AUSynth's validation rules are derived from these definitional relationships.

CPI (Consumer Price Index); An ABS index (Catalogue 6401.0) measuring changes in the price of goods and services. AUSynth uses CPI Housing (for mortgage repayments) and CPI Rents (for weekly rental costs) to adjust housing cost distributions.

ERP (Estimated Resident Population); The ABS's authoritative measure of how many people live in each area (Catalogue 3218.0). Published at the LGA level. AUSynth uses ERP to scale suburb populations from 2021 Census counts to current levels.
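The exact scaling method is not described in this glossary. One simple possibility, shown here purely as an assumption, is to apply the growth ratio of the containing LGA's ERP to each suburb's Census count. The function name and all figures are invented.

```python
def scale_population(census_count, lga_erp_2021, lga_erp_current):
    """Scale a 2021 suburb count by its LGA's ERP growth ratio.

    Assumes every suburb grows at the same rate as its LGA, which is a
    simplification made for this sketch.
    """
    growth = lga_erp_current / lga_erp_2021
    return round(census_count * growth)

# Invented figures: an LGA that grew 5% since 2021.
projected = scale_population(census_count=4000,
                             lga_erp_2021=100_000,
                             lga_erp_current=105_000)
```

With these invented inputs, a suburb of 4,000 people scales to 4,200.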

GCCSA (Greater Capital City Statistical Area); An ASGS geographic unit comprising a capital city and its surrounding functional urban area. Australia has eight capital city GCCSAs, with "Rest of State" regions covering the remainder of each state and the Northern Territory.

LGA (Local Government Area); An ASGS geographic unit corresponding to a local council or shire boundary. Australia has approximately 560 LGAs. AUSynth uses LGA-level ERP for population scaling.

Not applicable; An ABS category code indicating that a variable does not logically apply to the record. For example, occupation is "Not applicable" for children under 15. This is a structural code, not missing data.

Not stated; An ABS category code indicating that the respondent did not provide a response, or the response could not be coded. This represents genuinely missing information, distinct from "Not applicable".

SAL (Suburb and Locality); The ASGS geographic unit representing suburbs in urban areas and localities in rural areas. There are 15,352 SALs in Australia. AUSynth covers 15,343, excluding 9 that are ABS statistical constructs.

TableBuilder; An ABS online tool that allows users to create custom cross-tabulations from Census data. AUSynth's statistical distributions are derived from TableBuilder data.

WPI (Wage Price Index); An ABS index (Catalogue 6345.0) measuring changes in the price of labour. Published quarterly at the state level. AUSynth uses WPI to adjust income distributions from 2021 to current levels.

AUSynth Terms

Dataset; One of the three record types generated by AUSynth: persons, families, or dwellings. Each dataset has its own set of variables and is generated independently in v1.0. Version 1.1 will introduce linking across datasets.

Hierarchical geography; An output option that includes all geographic levels (suburb, postcode, LGA, GCCSA, state) in each record. Priced at 1.5x the standard credit rate.

Pool; The complete set of pre-generated, validated synthetic records for a single suburb and dataset. The pool size equals the suburb's projected 2025-26 population. Customer queries sample from the pool without replacement, so repeat queries return different subsets.
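A minimal sketch of without-replacement sampling from a pool. It assumes only that no record appears twice within a single query. Whether AUSynth also excludes previously delivered records across queries is not stated here. The function name `query_pool` and the record labels are hypothetical.

```python
import random

def query_pool(pool, n, rng):
    """Draw n distinct records from the pool (no record repeats within a draw)."""
    return rng.sample(pool, n)

rng = random.Random(7)  # seeded only to make the sketch reproducible
pool = [f"record-{i}" for i in range(1000)]  # stand-in for one suburb's pool

first = query_pool(pool, 100, rng)
second = query_pool(pool, 100, rng)
```

Each call returns 100 distinct records, and because each draw is random, `first` and `second` are different subsets of the same pool.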

Small suburb flag; A boolean indicator (small_suburb_flag = True) on suburbs whose original Census population was below 100. These suburbs have their pool floored to 100 records to ensure usability, but users should be aware that distributional fidelity may be lower due to sparser source data.

Validation filtering; The post-generation step that checks every synthetic record against structural rules encoding impossibilities in the Australian Census. Records violating any rule are discarded. This removes impossible combinations (a 5-year-old with a full-time job) but not improbable ones (a 25-year-old retiree).
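The filtering step can be sketched as a list of rules, each flagging one structural impossibility. The two rules below are taken from examples in this glossary. The field names (`age`, `occupation`, `labour_force`) and category labels are hypothetical and are not Census variable mnemonics.

```python
# Each rule returns True when the record is structurally impossible.
RULES = [
    # Occupation must be "Not applicable" for children under 15.
    lambda r: r["age"] < 15 and r["occupation"] != "Not applicable",
    # A child under 15 cannot hold a full-time job.
    lambda r: r["age"] < 15 and r["labour_force"] == "Employed full-time",
]

def is_valid(record):
    """A record survives only if it violates none of the rules."""
    return not any(rule(record) for rule in RULES)

records = [
    # Impossible: a 5-year-old with a full-time job.
    {"age": 5, "occupation": "Not applicable", "labour_force": "Employed full-time"},
    # Improbable but possible: a 25-year-old outside the labour force.
    {"age": 25, "occupation": "Not applicable", "labour_force": "Not in the labour force"},
    {"age": 40, "occupation": "Professionals", "labour_force": "Employed full-time"},
]
kept = [r for r in records if is_valid(r)]
```

Only the first record is discarded. The improbable 25-year-old is kept, because the rules encode impossibility rather than likelihood.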

Variable scope; The subset of variables relevant to a given dataset. Person-scope variables (e.g., AGE5P, SEXP) are only used when generating person records; family-scope variables (e.g., FMCF, CDCF) only for family records; dwelling-scope variables (e.g., DWTD, TEND) only for dwelling records.


See also: Methodology · Data Dictionary · FAQ