Quickstart — Python
This guide walks you through querying AUSynth and analysing synthetic population data using Python. You will go from sign-up to a working analysis in under 10 minutes.
Prerequisites
You need Python 3.8 or later and a few standard packages. If you are working in a fresh environment:
pip install pandas requests pyarrow
pandas handles tabular data. requests calls the API. pyarrow reads Parquet files (optional; you can also request CSV output).
Get Your API Key
Sign in to your AUSynth account at ausynth.com. Navigate to Account → API Keys and generate a new key. Copy it somewhere safe. It will not be shown again.
API_KEY = "your-api-key-here"
BASE_URL = "https://api.ausynth.com/v1"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}
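Hard-coding the key in a script makes it easy to leak through version control. One alternative, sketched below, reads the key from an environment variable. The variable name AUSYNTH_API_KEY and the helper build_headers are assumptions made for this illustration; they are not part of the AUSynth API.

```python
import os


def build_headers(env_var="AUSYNTH_API_KEY"):
    """Build the Authorization header from an environment variable.

    The variable name is an assumption for this sketch; use whatever
    name suits your own setup.
    """
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return {"Authorization": f"Bearer {api_key}"}
```

With the variable set in your shell, HEADERS = build_headers() can stand in for the hard-coded assignment.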
Your First Query
Let's generate 1,000 synthetic person records for Paddington, QLD.
Step 1: Search for the Suburb
Use the geography search endpoint to find the suburb identifier:
import requests
resp = requests.get(
    f"{BASE_URL}/geography/search",
    params={"query": "Paddington", "state": "QLD"},
    headers=HEADERS
)
results = resp.json()["results"]
for r in results:
    print(f"{r['suburb_id']}; pop. {r['pool_size']}")
This returns matching suburbs with their pool sizes (the maximum number of records available).
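Because the pool size is the maximum number of records available, it can help to cap a request before building the query. This guide does not say whether the API rejects or truncates oversized requests, so treat the following as a defensive sketch. The helper cap_to_pool and the example record are hypothetical.

```python
def cap_to_pool(requested, pool_size):
    """Return the largest record count the pool can actually supply."""
    if requested < 1:
        raise ValueError("requested must be at least 1")
    return min(requested, pool_size)


# Made-up search result for illustration only; real values come from the API.
example = {"suburb_id": "EXAMPLE_SUBURB", "pool_size": 800}
n_observations = cap_to_pool(1000, example["pool_size"])
print(n_observations)  # 800
```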
Step 2: Preview the Cost
Before spending credits, preview what the query will cost:
query = {
    "geography_level": "suburb",
    "geography_selections": ["Paddington (QLD)"],
    "dataset_type": "persons",
    "n_observations": 1000,
    "output_format": "parquet"
}
preview = requests.post(
    f"{BASE_URL}/query/preview",
    json=query,
    headers=HEADERS
).json()
print(f"Records: {preview['n_observations']}")
print(f"Credit cost: {preview['credit_cost']}")
print(f"Pool available: {preview['pool_available']}")
The preview endpoint validates your query and reports the credit cost without executing it.
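One way to use the preview is as a budget gate in front of the execute call. The sketch below relies only on the credit_cost field printed above. The helper within_budget, the max_credits threshold, and the sample payload are hypothetical.

```python
def within_budget(preview, max_credits):
    """Return True if the previewed credit cost fits the caller's budget.

    `preview` is the parsed JSON from /query/preview; only the
    `credit_cost` field is used here.
    """
    return preview["credit_cost"] <= max_credits


# Made-up preview payload for illustration only.
sample_preview = {"n_observations": 1000, "credit_cost": 10, "pool_available": 5000}
if within_budget(sample_preview, max_credits=25):
    print("OK to execute")
else:
    print("Too expensive; reduce n_observations")
```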
Step 3: Execute the Query
result = requests.post(
    f"{BASE_URL}/query/execute",
    json=query,
    headers=HEADERS
).json()
print(f"Query ID: {result['query_id']}")
print(f"Download URL: {result['download_url']}")
Step 4: Load the Data
import pandas as pd
df = pd.read_parquet(result["download_url"])
print(f"Shape: {df.shape}")
print(df.head())
You should see 1,000 rows and 21 columns (one per person variable), all containing integer codes.
Apply Human-Readable Labels
The raw data uses integer indices: for example, AGE5P = 3 means "15–19 years" and SEXP = 0 means "Male". Download the data dictionary from the Dictionary page and place ausynth_dictionary.py in your working directory.
from ausynth_dictionary import apply_labels, VARIABLE_LABELS
# Label specific columns
df = apply_labels(df, ["AGE5P", "SEXP", "INCP"])
print(df[["AGE5P", "AGE5P_label", "SEXP", "SEXP_label"]].head())
This adds a column with the _label suffix for each listed variable, containing the ABS category text. The original integer columns are preserved for computation.
Explore the Data
Age–Sex Distribution
cross = pd.crosstab(df["AGE5P_label"], df["SEXP_label"], normalize="all")
print(cross.round(3))
Income Distribution by Sex
df = apply_labels(df, ["INCP"])
income_by_sex = pd.crosstab(
    df["INCP_label"], df["SEXP_label"], normalize="columns"
)
print(income_by_sex.round(3))
Simple Visualisation
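# matplotlib is not in the prerequisites list: pip install matplotlib
import matplotlib.pyplot as plt
# Count by integer code so the bars follow age order, then swap in the labels.
# Sorting the label text alphabetically would generally not match age order.
age_dist = df["AGE5P"].value_counts().sort_index()
age_dist.index = age_dist.index.map(dict(zip(df["AGE5P"], df["AGE5P_label"])))
age_dist.plot(kind="barh", figsize=(8, 6), title="Age Distribution; Paddington QLD")
plt.xlabel("Count")
plt.tight_layout()
plt.savefig("paddington_age_dist.png")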
Multi-Suburb Queries
You can request records from multiple suburbs in a single query:
query = {
    "geography_level": "suburb",
    "geography_selections": ["Paddington (QLD)", "Toorak", "Inala"],
    "dataset_type": "persons",
    "n_observations": 500,
    "output_format": "parquet"
}
result = requests.post(
    f"{BASE_URL}/query/execute",
    json=query,
    headers=HEADERS
).json()
df = pd.read_parquet(result["download_url"])
print(df.groupby("suburb_id").size())
Each suburb contributes up to n_observations records. The suburb_id column identifies which suburb each record belongs to.
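As a quick worked example of that rule, the upper bound on returned rows is the number of suburbs multiplied by n_observations. Suburbs whose pool is smaller than n_observations would presumably return fewer rows, so this is a ceiling and not a prediction. The helper name max_total_records is hypothetical.

```python
def max_total_records(geography_selections, n_observations):
    """Upper bound on rows when each suburb contributes up to n_observations."""
    return len(geography_selections) * n_observations


selections = ["Paddington (QLD)", "Toorak", "Inala"]
print(max_total_records(selections, 500))  # 1500
```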
Hierarchical Geography
Add full geographic context to each record by requesting hierarchical output:
query["include_geography"] = "hierarchical"
result = requests.post(
    f"{BASE_URL}/query/execute",
    json=query,
    headers=HEADERS
).json()
df = pd.read_parquet(result["download_url"])
print(df[["suburb_id", "postcode", "lga", "gccsa", "state"]].head())
Hierarchical geography is billed at 1.5× the standard credit rate.
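To estimate the effect on a budget, multiply the standard credit cost by 1.5. The sketch below assumes the multiplier applies to the whole query cost and that no rounding is involved; neither detail is stated in this guide, so confirm the real figure with the preview endpoint. The helper name hierarchical_cost is hypothetical.

```python
HIERARCHICAL_MULTIPLIER = 1.5  # rate stated in this guide


def hierarchical_cost(standard_credit_cost):
    """Estimate the credit cost of a query with hierarchical geography."""
    return standard_credit_cost * HIERARCHICAL_MULTIPLIER


# A query that costs 10 credits at the standard rate (made-up figure).
print(hierarchical_cost(10))  # 15.0
```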
Multiple Imputation Workflow
AUSynth draws records from the pool by random sampling without replacement, so no record appears twice within a query and each query returns a different subset. That makes repeated queries suitable for multiple imputation:
imputations = []
for i in range(5):
    result = requests.post(
        f"{BASE_URL}/query/execute",
        json=query,
        headers=HEADERS
    ).json()
    imp_df = pd.read_parquet(result["download_url"])
    imp_df["imputation"] = i
    imputations.append(imp_df)
all_imps = pd.concat(imputations, ignore_index=True)
print(f"Total records across 5 imputations: {len(all_imps)}")
Run your analysis on each imputation separately, then combine estimates using Rubin's rules. See the Glossary for a brief explanation of Rubin's rules.
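For a scalar estimate, Rubin's rules take the mean of the per-imputation estimates as the pooled estimate and add the average within-imputation variance to an inflated between-imputation variance. The following is a minimal sketch using only the standard library. The function name pool_rubin and the example numbers are made up for illustration.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Combine per-imputation results with Rubin's rules.

    estimates: point estimate from each imputed dataset
    variances: squared standard error of each estimate
    Returns (pooled_estimate, total_variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need matching lists with at least two imputations")
    pooled = mean(estimates)        # average of the point estimates
    within = mean(variances)        # average within-imputation variance
    between = variance(estimates)   # sample variance across imputations
    total = within + (1 + 1 / m) * between
    return pooled, total


# Made-up numbers for illustration only.
pooled_est, total_var = pool_rubin([10.0, 12.0, 11.0], [1.0, 1.0, 1.0])
print(pooled_est, round(total_var, 4))
```

The square root of the total variance gives the pooled standard error.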
Working with Families and Dwellings
The process is identical for the other datasets; only dataset_type changes. For families:
family_query = {
    "geography_level": "suburb",
    "geography_selections": ["Paddington (QLD)"],
    "dataset_type": "families",
    "n_observations": 500,
    "output_format": "parquet"
}
result = requests.post(
    f"{BASE_URL}/query/execute",
    json=family_query,
    headers=HEADERS
).json()
families = pd.read_parquet(result["download_url"])
print(families.head())
Family records have 9 variables and dwelling records have 14. The same dictionary covers all three datasets.
Error Handling
The API returns standard HTTP status codes. Common cases to handle:
result = requests.post(
    f"{BASE_URL}/query/execute",
    json=query,
    headers=HEADERS
)
if result.status_code == 200:
    data = result.json()
    df = pd.read_parquet(data["download_url"])
elif result.status_code == 402:
    print("Insufficient credits. Top up at ausynth.com/account.")
elif result.status_code == 422:
    error = result.json()
    print(f"Invalid query: {error['detail']}")
elif result.status_code == 429:
    print("Rate limited. Wait and retry.")
else:
    print(f"Unexpected error: {result.status_code}")
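The "wait and retry" advice for a 429 response can be automated with exponential backoff. The sketch below is generic and does not depend on any AUSynth-specific behaviour. This guide does not say whether the API sends a Retry-After header, so the delays simply double. The helper name post_with_retry and its parameters are hypothetical.

```python
import time


def post_with_retry(send, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a non-429 response or attempts run out.

    send: zero-argument callable returning an object with .status_code
          (for example, a lambda wrapping requests.post)
    The delay doubles after each rate-limited attempt: 1 s, 2 s, 4 s, ...
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        response = send()
        if response.status_code != 429:
            return response
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return response  # still rate limited after the final attempt
```

A call might look like post_with_retry(lambda: requests.post(f"{BASE_URL}/query/execute", json=query, headers=HEADERS)), with the returned response then passed through the status checks above.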
Next Steps
Consult the API Reference for the complete endpoint specification, including pagination, saved queries, and query history. The FAQ covers common questions about data interpretation, credit usage, and best practices.
AUSynth is synthetic proxy data. It preserves the statistical relationships and population counts found in the ABS Census 2021, adjusted to current demographic and economic conditions. It should not be treated as a direct observation of real individuals or used as the sole basis for decisions affecting real people.