Frequently Asked Questions

About AUSynth

Is this real Australian data?

No. AUSynth provides synthetic records that are statistically representative of Australian demographics but do not correspond to any real individual. Each record is generated algorithmically. The data preserves the distributional properties (joint and conditional distributions of variables within each suburb) found in the ABS Census 2021, adjusted to current population estimates and economic conditions. It is proxy data suitable for modelling, analysis, and development, not a sample of real Census records.

How is this different from ABS TableBuilder?

ABS TableBuilder provides aggregate cross-tabulations: counts or proportions showing how many people in an area have a given combination of attributes. It does not provide individual records. You cannot run a regression, train a machine learning model, or perform microsimulation on a cross-tabulation table.

AUSynth generates individual-level records that reproduce the same statistical relationships you see in TableBuilder, but in a format you can load into pandas, R, Stata, or any tool that works with rectangular data. Each row is a synthetic person, family, or dwelling with a consistent set of attributes.

How is the data generated?

AUSynth uses established statistical methods (Markov Chain Monte Carlo with hierarchical conditioning) to generate synthetic populations that preserve the joint demographic relationships found in ABS Census 2021 data. Quality is verified through extensive validation against source distributions. The generation approach follows peer-reviewed methods documented in the population synthesis literature.

Can I use this for academic research?

Yes. AUSynth is designed for research applications including hypothesis exploration, model development, and methods testing. If you publish research using AUSynth data, please cite it as a synthetic dataset and note the version number. Be transparent with reviewers that the data is synthetic and does not represent real individuals.

AUSynth should not be used as a substitute for real data when real data is available and accessible. Its value is in contexts where unit-record Census data is unavailable, which, in Australia, is essentially all contexts outside the ABS itself.

Can I use this commercially?

Yes, subject to your account tier's terms of service. Commercial use cases include market sizing, site selection modelling, demographic profiling for retail planning, and training machine learning models. You may not redistribute the raw synthetic data as a standalone product.

What can't I use it for?

AUSynth should not be used as the sole basis for decisions that directly affect real individuals, such as determining someone's eligibility for services, making lending decisions, or profiling specific real people. The data represents statistical patterns, not real individuals. It is also not suitable for producing official statistics. That is the role of the ABS.

Is the source code available?

AUSynth is a commercial product. The synthetic data is licensed for use, but the underlying generation methodology is proprietary intellectual property. Published academic literature references the general class of methods we use.

Can I reproduce the data myself?

While the underlying methodologies are documented in academic literature, the specific implementation, calibrations, and quality controls in AUSynth represent significant proprietary engineering work. We provide the data; reproduction is not part of the commercial license.

How do I cite AUSynth in academic work?

"AUSynth Synthetic Australian Population Dataset, version 1.0 (2026). Available at [URL]. Based on Australian Bureau of Statistics Census 2021."

See the Citation Guide for methodological references suitable for academic papers.


Data and Quality

How current is the data?

The underlying demographic structure (how variables relate to each other) comes from the ABS Census conducted on 10 August 2021. Three adjustments bring the data forward to current conditions. Population counts are scaled to 2025 using ABS Estimated Resident Population actuals. Income distributions are adjusted to December 2025 using state-level Wage Price Index growth. Housing cost distributions (rent and mortgage repayments) are adjusted to December 2025 using capital-city-level CPI indices.

The structural composition of the population (e.g., the relationship between age and occupation) reflects 2021 patterns. Industry and occupation distributions have not been adjusted for post-2021 labour market changes.

How accurate is the data?

AUSynth has been extensively validated against ABS Census 2021 source distributions. Key demographic relationships (age, income, sex, education, occupation, industry) are reproduced with high fidelity for persons and families. Dwelling-level data is acceptable for distributional analysis and will improve substantially with person-family-dwelling linking in v1.1.

The data is well-suited for identifying demographic patterns, comparing across suburbs, modelling, and machine learning training. As with any synthetic data, precision is higher for common categories in populous suburbs and lower for rare categories in small suburbs. For specific validation metrics, see the Quality Benchmarks page.

What does "small suburb warning" mean?

Suburbs with an original Census population below 100 have their synthetic pool floored to 100 records. This ensures every suburb has enough data for analysis. For these small suburbs, the underlying source data has fewer observations to estimate distributions from, so the synthetic records are best used for general patterns rather than fine-grained demographic breakdowns.

The small_suburb_flag field tells you whether a suburb was floored. Of the 15,343 suburbs covered, 8,879 (57.9%) have this flag set. Most of them are rural localities with very small populations.

What should I know about dwelling data?

Dwelling-level variables describe household composition: how many people live in a dwelling, the type of dwelling, how many bedrooms, the tenure type, and so on. In v1.0, the three datasets (persons, families, dwellings) are generated independently. This means dwelling data is well-suited for analysing distributional patterns (for example, comparing the mix of housing types or tenure arrangements across suburbs) but is less precise for individual-household analysis.

Version 1.1 will introduce person-family-dwelling linking, which will bring dwelling fidelity in line with the persons and families datasets.


Variables

What variables are included?

AUSynth v1.0 covers 47 variables across three datasets. Persons (24 variables): age, sex, income, labour force status, occupation, industry, hours worked, marital status, relationship in household, birthplace, education, highest educational attainment, method of travel to work, year of arrival, qualification level, health conditions, language, and others. Families (9 variables): family composition, dependent children, family income, labour force status, and related attributes. Dwellings (14 variables): dwelling type, tenure, bedrooms, mortgage repayments, rent, household income, vehicles, and household composition measures.

The complete list with all category labels is in the Data Dictionary.

Why are variable names cryptic (e.g., AGE5P)?

AUSynth uses the exact variable codes defined by the ABS for the Census 2021. AGE5P means "Age in Five Year Groups, Person". SEXP means "Sex, Person". TEND means "Tenure and Landlord Type, Dwelling". Using ABS codes ensures consistency with official documentation and avoids ambiguity that could arise from renaming.

The Data Dictionary maps every code to its human-readable description and provides convenience functions for applying labels in Python and R.

What does "Not applicable" mean?

"Not applicable" is an ABS structural code indicating that a variable does not logically apply to the record. A child aged 0-14 has OCCP (Occupation) = "Not applicable" because the ABS does not collect occupation data for children. A person not in the labour force has HRWRP (Hours Worked) = "Not applicable" because the concept is undefined for them.

"Not applicable" is not missing data. It is definitionally correct. In AUSynth's integer coding, "Not applicable" is always the last category index for each variable.

What does "Not stated" mean?

"Not stated" means the Census respondent did not provide an answer, or the response could not be classified. This is genuinely missing information, distinct from "Not applicable". AUSynth's synthetic records include "Not stated" at approximately the same rate as the Census cross-tabulations for each variable and suburb.

What's the difference between persons, families, and dwellings?

These are three separate datasets, each describing a different unit of analysis.

Persons records describe individuals: their age, sex, income, occupation, and so on. One record = one person.

Families records describe family units within a household: how many dependent children, family income, family composition type. One record = one family.

Dwellings records describe physical dwellings: type (house, apartment, etc.), tenure, bedrooms, costs. One record = one dwelling.

In v1.0, the three datasets are generated independently; you cannot link a person to their family or dwelling. Version 1.1 will introduce hierarchical linking.

Are the three datasets linkable?

Not in v1.0. Each dataset is generated from its own set of distributions without reference to the other datasets. A person record does not know which dwelling it belongs to. Version 1.1 will generate person-family-dwelling hierarchies, making the datasets linkable.


Geography

What geographic levels are supported?

AUSynth's primary unit is the Suburb and Locality (SAL) as defined by the ABS. You can also query at higher levels: postcode, Local Government Area (LGA), state, or all of Australia. The system aggregates suburb-level pools accordingly.

How many suburbs are covered?

15,343 of Australia's 15,352 Suburbs and Localities. The 9 excluded entries are ABS statistical constructs that do not represent physical locations (such as "No usual address" and "Outside Australia").

What's the difference between SAL and suburbs?

SAL (Suburb and Locality) is the official ABS geographic code for suburbs. In urban areas, SALs generally align with commonly understood suburb boundaries. In rural areas, they correspond to localities. The boundaries are defined by the Australian Statistical Geography Standard (ASGS) 2021 and may not perfectly match postal or colloquial suburb definitions.

Why are some suburb names followed by a state code?

Australia has suburbs with the same name in different states. Paddington exists in both NSW and QLD. Abbotsford exists in NSW and VIC. When a suburb name is shared across states, AUSynth disambiguates by adding the state in parentheses: "Paddington (QLD)", "Paddington (NSW)". Suburbs with unique names appear without a state qualifier.
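The disambiguation rule described above can be sketched in a few lines of Python. This is an illustration only, not AUSynth's implementation: the function name display_names and the (name, state) input shape are assumptions, and uniqueness is judged within the list you pass in.

```python
from collections import defaultdict

def display_names(suburbs):
    """Return display names for (name, state) pairs.

    A name gets a "(STATE)" qualifier only when it appears in more
    than one state within the input list.
    """
    states_by_name = defaultdict(set)
    for name, state in suburbs:
        states_by_name[name].add(state)
    return [
        f"{name} ({state})" if len(states_by_name[name]) > 1 else name
        for name, state in suburbs
    ]

# Small illustrative list; uniqueness is relative to this list only.
suburbs = [
    ("Paddington", "NSW"),
    ("Paddington", "QLD"),
    ("Abbotsford", "NSW"),
    ("Abbotsford", "VIC"),
    ("Woolloomooloo", "NSW"),
]
labels = display_names(suburbs)
print(labels)
```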

Can I query by postcode?

Yes. Set geography_level to postcode and provide postcode strings in geography_selections. The system aggregates pools from all suburbs within the postcode. Note that postcode boundaries do not always align neatly with suburb boundaries; the ABS mapping is used.
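As a rough sketch, a postcode query body might be assembled like this in Python. Only the field names geography_level, geography_selections, and n_observations come from this FAQ; the helper build_query is hypothetical, the postcode values are illustrative, and the endpoint, authentication, and any other required fields are described in the API Reference rather than here.

```python
import json

def build_query(geography_level, geography_selections, n_observations):
    """Assemble a query body from the three parameters named in this FAQ.

    Hypothetical helper: the real request may need further fields.
    """
    if not geography_selections:
        raise ValueError("geography_selections must not be empty")
    return {
        "geography_level": geography_level,
        # Postcodes are kept as strings so leading zeros survive (e.g. "0800").
        "geography_selections": [str(s) for s in geography_selections],
        "n_observations": int(n_observations),
    }

# Illustrative values only.
body = build_query("postcode", ["2021", "4064"], 1000)
print(json.dumps(body, indent=2))
```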

Can I do custom geographies (e.g., trade areas, catchment zones)?

Not directly. AUSynth does not support arbitrary polygon queries. The recommended approach is to identify all suburbs that fall within your custom geography, query them individually or as a multi-suburb request, and combine the results. For trade area analysis, you can purchase the relevant suburbs and filter or weight records as needed.
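One way to weight records for a custom area is sketched below in Python. It assumes you have already worked out, by your own means, what share of each suburb falls inside the area; the suburb identifiers, the weights, and the integer category codes are placeholders, and weighted_shares is a hypothetical helper, not part of AUSynth.

```python
from collections import defaultdict

def weighted_shares(records, suburb_weights, variable):
    """Weighted category shares of `variable` across a custom area.

    records        : iterable of dicts, each with a "suburb_id" key
    suburb_weights : suburb_id -> share of that suburb inside the area (0 to 1)
    """
    totals = defaultdict(float)
    for rec in records:
        # Suburbs not listed in suburb_weights contribute nothing.
        weight = suburb_weights.get(rec["suburb_id"], 0.0)
        totals[rec[variable]] += weight
    grand_total = sum(totals.values())
    if grand_total == 0:
        raise ValueError("no records fall inside the custom area")
    return {category: t / grand_total for category, t in totals.items()}

# Placeholder dwelling records: suburb "A" lies fully inside the
# trade area, suburb "B" only half inside.
records = [
    {"suburb_id": "A", "TEND": 1},
    {"suburb_id": "A", "TEND": 2},
    {"suburb_id": "B", "TEND": 1},
    {"suburb_id": "B", "TEND": 1},
]
suburb_weights = {"A": 1.0, "B": 0.5}
shares = weighted_shares(records, suburb_weights, "TEND")
print(shares)
```

Weighting by area share treats people or dwellings as evenly spread within each suburb, which is a simplification; a population-based share would be a reasonable alternative if you have one.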


Pricing and Credits

How do credits work?

1 credit = 500 observations. If you query 1,000 person records for one suburb, that costs 2 credits. If you query 1,000 records for 3 suburbs (3,000 records total), that costs 6 credits. The preview endpoint shows the exact cost before you commit.

Why is hierarchical geography 1.5x?

Hierarchical geography adds five extra columns to each record (suburb, postcode, LGA, GCCSA, state), increasing the data volume and processing required. The 1.5x multiplier reflects this. If you only need records for analysis within a single suburb, use the standard option and save credits.
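The credit arithmetic can be sketched in Python as follows. The function estimate_credits is hypothetical, and two details are assumptions not stated in this FAQ: that partial credits round up, and that the 1.5x multiplier is applied before rounding. The preview endpoint remains the authoritative source for the exact cost.

```python
import math
from fractions import Fraction

OBS_PER_CREDIT = 500                      # 1 credit = 500 observations
HIERARCHICAL_MULTIPLIER = Fraction(3, 2)  # the 1.5x multiplier

def estimate_credits(n_observations, n_suburbs=1, hierarchical=False):
    """Estimate the credit cost of a query (a sketch, not the billing logic)."""
    total_obs = n_observations * n_suburbs
    multiplier = HIERARCHICAL_MULTIPLIER if hierarchical else Fraction(1)
    # Assumption: any fraction of a credit is rounded up.
    return math.ceil(Fraction(total_obs, OBS_PER_CREDIT) * multiplier)

print(estimate_credits(1000))                     # 2, as in the example above
print(estimate_credits(1000, n_suburbs=3))        # 6, as in the example above
print(estimate_credits(1000, hierarchical=True))  # 3 under the stated assumptions
```

Because each suburb contributes up to n_observations records, a suburb with a small pool may return fewer records than requested, so the real cost can be lower than this estimate.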

Do credits expire?

Purchased credits do not expire. Free-tier credits reset weekly; any unused free credits at the end of the week are lost and replaced with a fresh allocation.

Can I get a refund on credits?

Refund policies are set by your account agreement. Credits spent on executed queries are generally non-refundable because the data has been generated and delivered. Contact support for specific situations.

Are there volume discounts?

Credit bundles are priced with built-in volume discounts; larger bundles have a lower per-credit cost. See the pricing page for current bundle options.

Is there academic pricing?

Academic pricing details are available on request. Contact us with your institutional affiliation and intended use case.


Technical

What format is the data?

AUSynth supports three output formats. Parquet is the recommended format for analysis. It is compact, fast to load, preserves column types, and is supported by pandas, R (arrow), and most modern data tools. CSV is a universal plain-text format compatible with everything, including Excel, but produces larger files. XLSX (Excel) is available for users who work primarily in spreadsheets.

How do I open Parquet files?

In Python: pd.read_parquet("file.parquet") (requires pyarrow or fastparquet). In R: arrow::read_parquet("file.parquet"). On the command line: parquet-tools show file.parquet. DuckDB can also query Parquet files directly: SELECT * FROM 'file.parquet' LIMIT 10.

Why does the data use integer indices instead of labels?

Three reasons. File size: integer columns are far more compact than repeating text labels across thousands of rows. ABS convention: the integer indices follow the category ordering in the ABS Census Dictionary. ML pipelines: integer-encoded categorical variables are the expected input for most machine learning frameworks. Use the Data Dictionary to map indices to labels when you need readable output.

How do I apply labels to the data?

Download the data dictionary in Python or R format from the Dictionary page. In Python, import the module and call apply_labels(df, ["AGE5P", "SEXP"]). In R, source the script and call apply_labels_df(df, c("AGE5P", "SEXP")). Both add new columns with a _label suffix containing the ABS category text.

Is there an API?

Yes. The REST API supports query preview, execution, history, credit balance, and geography search. Full documentation is in the API Reference. Quick start guides are available for Python and R.

What's the rate limit?

60 requests per minute per API key. If exceeded, the API returns HTTP 429 with a Retry-After header. Implement exponential backoff in automated workflows. The rate limit applies to all endpoints collectively.
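A minimal backoff loop might look like the Python sketch below. It is transport-agnostic: send is any callable you supply that performs one request and returns (status_code, headers, body), so nothing here depends on a particular HTTP library or on AUSynth's endpoints. The function name, its parameters, and the fake responses are all illustrative assumptions. Only the seconds form of Retry-After is handled; otherwise the loop falls back to exponential backoff with jitter.

```python
import random
import time

def call_with_backoff(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a non-429 response or attempts run out.

    send() must return a (status_code, headers_dict, body) tuple.
    """
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 429:
            return status, body
        if attempt == max_attempts - 1:
            break  # no point sleeping after the final attempt
        retry_after = headers.get("Retry-After")
        if retry_after is not None and retry_after.isdigit():
            delay = float(retry_after)  # the server told us how long to wait
        else:
            # Exponential backoff with jitter: 1s, 2s, 4s, ... plus noise.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
        sleep(delay)
    raise RuntimeError(f"still rate limited after {max_attempts} attempts")

# Demonstration with fake responses and a recording sleep function,
# so the example runs instantly and without network access.
responses = iter([
    (429, {"Retry-After": "2"}, None),
    (429, {}, None),
    (200, {}, {"ok": True}),
])
waits = []
status, body = call_with_backoff(lambda: next(responses), sleep=waits.append)
print(status, body, waits)
```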

Can I download multiple suburbs at once?

Yes. Pass multiple suburb identifiers in the geography_selections array. Each suburb contributes up to n_observations records, and the output includes a suburb_id column identifying the source suburb.

What's the maximum query size?

For each suburb, you can request up to the pool size, which equals the suburb's projected 2025-26 population. Pool sizes range from 100 (small suburb floor) to over 300,000 for large suburbs. There is no maximum on the number of suburbs per query, but very large multi-suburb queries may take longer to process.

Can I get the same records twice?

By default, no. Each query samples from the pool without replacement, so repeated queries return different subsets. This is intentional. It supports multiple imputation workflows where each imputation should be an independent draw. If you need to reproduce a specific sample, save the data locally after downloading.


Multiple Imputation

What is multiple imputation?

Multiple imputation is a statistical technique for properly accounting for uncertainty in your data. Instead of analysing a single dataset, you create multiple versions (imputations), run your analysis on each, and combine the results using formulas (Rubin's rules) that account for both within-imputation and between-imputation variability. The result is estimates with correctly calibrated confidence intervals.

How many imputations should I use?

Standard practice is 5-20 imputations for most analyses. Five is sufficient for simple point estimates. If you need precise confidence intervals or are working with rare subgroups, use 10-20. More imputations improve the stability of the between-imputation variance estimate but with diminishing returns beyond 20.

Are imputations independent?

Yes. AUSynth samples from the pool without replacement for each query, meaning each imputation is a genuinely different subset of the synthetic population. The imputations are not identical redrawings; they represent the sampling variability inherent in drawing a finite sample from a pool.

What software supports multiple imputation?

In R, the mitools package provides MIcombine() for applying Rubin's rules. The mice package offers a more comprehensive framework. In Python, you can implement Rubin's rules manually (the formulas are simple) or use statsmodels, which has some multiple imputation support. The Quick Start: R guide includes a worked example.
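For reference, a manual Python implementation of Rubin's rules for a single scalar estimate could look like the sketch below. The function name rubin_combine and the input numbers are illustrative; in practice each estimate and its variance (squared standard error) would come from running the same analysis on a separate AUSynth draw.

```python
import math
from statistics import mean, variance

def rubin_combine(estimates, variances):
    """Pool m point estimates and their variances using Rubin's rules."""
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need at least two imputations with matching variances")
    q_bar = mean(estimates)   # pooled point estimate
    u_bar = mean(variances)   # within-imputation variance
    b = variance(estimates)   # between-imputation variance (m - 1 denominator)
    total_var = u_bar + (1 + 1 / m) * b
    if b == 0:
        df = math.inf         # no between-imputation variability
    else:
        # Rubin's degrees of freedom for the pooled t reference distribution.
        df = (m - 1) * (1 + u_bar / ((1 + 1 / m) * b)) ** 2
    return {"estimate": q_bar, "se": math.sqrt(total_var), "df": df}

# Made-up numbers: three imputations, each with variance 1.0.
pooled = rubin_combine([10.0, 12.0, 11.0], [1.0, 1.0, 1.0])
print(pooled)
```

The pooled standard error and degrees of freedom can then be used with a t distribution to form a confidence interval.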


Updates and Versioning

Will the data be updated?

Yes. Quarterly updates refresh the WPI and CPI indices, adjusting income and housing cost distributions. Annual updates recalculate suburb populations using the latest ERP estimates. Major updates occur when the ABS releases new Census data (next expected: Census 2026, with data likely available 2028).

What happens to my saved data when the version changes?

Data you have already downloaded is not affected by version changes. It remains valid as a product of the version under which it was generated. If you want data generated under the new version, re-execute your queries. The metadata.version field in the API response tells you which version produced each download.

What's planned for v1.1?

The main feature of v1.1 is person-family-dwelling linking. Instead of generating the three datasets independently, v1.1 will construct coherent households: first generating persons, grouping them into families, then assembling families into dwellings. This is expected to substantially improve dwelling-level data quality and enable cross-dataset analysis. See the Versioning page for the full roadmap.

Will AUSynth support Census 2026?

Yes. When the ABS publishes Census 2026 cross-tabulations (expected 2028), AUSynth will be fully recalibrated against the new data. This will update all distributions, variable definitions (if any change), and population targets.


Privacy and Ethics

Does AUSynth contain real people's data?

No. Every record is generated algorithmically. No individual record corresponds to a real person. The generation process uses aggregate statistical tables published by the ABS, not unit-record Census data. It is mathematically impossible to identify or re-identify any individual from the synthetic output.

AUSynth's only input is publicly available ABS aggregate data. The ABS publishes cross-tabulations specifically for public use, and AUSynth does not access, copy, or redistribute any confidential or restricted ABS data.

Can synthetic data be biased?

Yes. The synthetic data inherits whatever biases exist in the ABS Census 2021. If certain populations were undercounted in the Census (e.g., people experiencing homelessness, remote Indigenous communities), those undercounts are reflected in the synthetic data. AUSynth does not correct for Census coverage biases. It reproduces the data as collected.

As with any statistical model, it is good practice to validate key findings against other data sources when possible.


AUSynth is synthetic proxy data. It preserves the statistical relationships and population counts found in the ABS Census 2021, adjusted to current demographic and economic conditions. It should not be treated as a direct observation of real individuals or used as the sole basis for decisions affecting real people.