Quick Start: Python

This guide walks you through querying AUSynth and analysing synthetic population data using Python. You will go from sign-up to a working analysis in under 10 minutes.

Prerequisites

You need Python 3.8 or later and a few standard packages. If you are working in a fresh environment:

pip install pandas requests pyarrow matplotlib

pandas handles tabular data. requests calls the API. pyarrow reads Parquet files (optional; you can also request CSV output). matplotlib is only needed for the chart in the Simple Visualisation section.

Get Your API Key

Sign in to your AUSynth account at ausynth.com. Navigate to Account → API Keys and generate a new key. Copy it somewhere safe. It will not be shown again.

API_KEY = "your-api-key-here"
BASE_URL = "https://api.ausynth.com/v1"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}
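Because the key is shown only once, you may prefer to keep it out of your source files. The following is a minimal sketch, assuming you have exported the key in an environment variable called AUSYNTH_API_KEY. That variable name and both helper functions are chosen here for illustration; they are not something AUSynth defines.

```python
import os


def load_api_key(env_var="AUSYNTH_API_KEY"):
    """Read the API key from an environment variable (the name is an assumption)."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key


def build_headers(api_key):
    """Build the same Authorization header used throughout this guide."""
    return {"Authorization": f"Bearer {api_key}"}
```

With this in place, HEADERS = build_headers(load_api_key()) can replace the hard-coded assignment.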

Your First Query

Let's generate 1,000 synthetic person records for Paddington, QLD.

Step 1: Search for the Suburb

Use the geography search endpoint to find the suburb identifier:

import requests

resp = requests.get(
    f"{BASE_URL}/geography/search",
    params={"query": "Paddington", "state": "QLD"},
    headers=HEADERS
)
resp.raise_for_status()  # stop on an HTTP error instead of failing later with a KeyError
results = resp.json()["results"]
for r in results:
    print(f"{r['suburb_id']}: pool size {r['pool_size']}")

This returns matching suburbs with their pool sizes (the maximum number of records available).
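Since the pool size caps how many records a suburb can supply, it can be worth filtering the search results before building a query. A small sketch, assuming each result is a dict with the suburb_id and pool_size fields printed above; the helper name is hypothetical.

```python
def suburbs_with_capacity(results, n_observations):
    """Keep only the search results whose pool can supply n_observations records."""
    return [r for r in results if r["pool_size"] >= n_observations]
```

For this guide you would call suburbs_with_capacity(results, 1000) and continue only if the list is not empty.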

Step 2: Preview the Cost

Before spending credits, preview what the query will cost:

query = {
    "geography_level": "suburb",
    "geography_selections": ["Paddington (QLD)"],
    "dataset_type": "persons",
    "n_observations": 1000,
    "output_format": "parquet"
}

preview = requests.post(
    f"{BASE_URL}/query/preview",
    json=query,
    headers=HEADERS
).json()

print(f"Records: {preview['n_observations']}")
print(f"Credit cost: {preview['credit_cost']}")
print(f"Pool available: {preview['pool_available']}")

The preview endpoint validates your query and reports the credit cost without executing it.
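In a script, you may want to check the preview automatically before executing. The sketch below assumes the three preview fields printed above and, in particular, assumes that pool_available is a record count rather than a flag. The helper name and the max_credits budget are illustrative choices, not part of the API.

```python
def check_preview(preview, max_credits):
    """Return a list of problems found in a preview response; empty means OK.

    Assumes preview has n_observations, credit_cost and pool_available,
    and that pool_available is a number of records (an assumption).
    """
    problems = []
    if preview["credit_cost"] > max_credits:
        problems.append(
            f"cost {preview['credit_cost']} exceeds budget {max_credits}"
        )
    if preview["pool_available"] < preview["n_observations"]:
        problems.append(
            f"pool has {preview['pool_available']} records, "
            f"fewer than the {preview['n_observations']} requested"
        )
    return problems
```

Calling check_preview(preview, max_credits=50) and executing only when it returns an empty list keeps an unexpectedly expensive query from running.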

Step 3: Execute the Query

result = requests.post(
    f"{BASE_URL}/query/execute",
    json=query,
    headers=HEADERS
).json()

print(f"Query ID: {result['query_id']}")
print(f"Download URL: {result['download_url']}")

Step 4: Load the Data

import pandas as pd

df = pd.read_parquet(result["download_url"])
print(f"Shape: {df.shape}")
print(df.head())

You should see 1,000 rows and 21 columns (one per person variable), all containing integer codes.
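If you want to confirm this programmatically, a quick check along the following lines may help. It is a sketch: the function name is made up, and the expected shape is simply the 1,000 rows and 21 columns described above.

```python
import pandas as pd


def check_person_frame(df, expected_rows=1000, expected_cols=21):
    """Raise ValueError if df does not match the shape and dtypes described above."""
    if df.shape != (expected_rows, expected_cols):
        raise ValueError(
            f"expected {(expected_rows, expected_cols)}, got {df.shape}"
        )
    # Every column should hold integer codes, not strings or floats.
    non_integer = [
        col for col in df.columns
        if not pd.api.types.is_integer_dtype(df[col])
    ]
    if non_integer:
        raise ValueError(f"non-integer columns: {non_integer}")
```

Run check_person_frame(df) right after loading; it returns silently when the frame looks as expected.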

Apply Human-Readable Labels

The raw data uses integer codes: for example, AGE5P = 3 means "15–19 years" and SEXP = 0 means "Male". Download the data dictionary from the Dictionary page and place ausynth_dictionary.py in your working directory.

from ausynth_dictionary import apply_labels, VARIABLE_LABELS

# Label specific columns
df = apply_labels(df, ["AGE5P", "SEXP", "INCP"])
print(df[["AGE5P", "AGE5P_label", "SEXP", "SEXP_label"]].head())

This adds a column with the _label suffix for each requested variable, containing the ABS category text. The original integer columns are preserved for computation.
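To make the labelling step less opaque, here is a rough sketch of what such a function might do internally. It is not the code shipped in ausynth_dictionary.py: the function name and the EXAMPLE_LABELS mapping are hypothetical, and the mapping contains only the two codes quoted in this guide.

```python
import pandas as pd

# Hypothetical excerpt of a code-to-label mapping; the real dictionary
# covers every variable and category.
EXAMPLE_LABELS = {
    "AGE5P": {3: "15–19 years"},
    "SEXP": {0: "Male"},
}


def apply_labels_sketch(df, columns, labels=EXAMPLE_LABELS):
    """Return a copy of df with a <column>_label column for each listed column."""
    out = df.copy()
    for col in columns:
        # Codes missing from the mapping become NaN rather than raising.
        out[f"{col}_label"] = out[col].map(labels[col])
    return out
```

The key point is that labels are added alongside the codes instead of replacing them, so numeric operations on the original columns keep working.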

Explore the Data

Age–Sex Distribution

cross = pd.crosstab(df["AGE5P_label"], df["SEXP_label"], normalize="all")
print(cross.round(3))

Income Distribution by Sex

# INCP_label was already added by the apply_labels call above
income_by_sex = pd.crosstab(
    df["INCP_label"], df["SEXP_label"], normalize="columns"
)
print(income_by_sex.round(3))

Simple Visualisation

import matplotlib.pyplot as plt

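# Count by integer code so the bars follow age order rather than
# the alphabetical order of the label text, then index by label
age_dist = df.groupby(["AGE5P", "AGE5P_label"]).size().droplevel("AGE5P")
age_dist.plot(kind="barh", figsize=(8, 6), title="Age Distribution: Paddington, QLD")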
plt.xlabel("Count")
plt.tight_layout()
plt.savefig("paddington_age_dist.png")

Multi-Suburb Queries

You can request records from multiple suburbs in a single query:

query = {
    "geography_level": "suburb",
    "geography_selections": ["Paddington (QLD)", "Toorak", "Inala"],
    "dataset_type": "persons",
    "n_observations": 500,
    "output_format": "parquet"
}

result = requests.post(
    f"{BASE_URL}/query/execute",
    json=query,
    headers=HEADERS
).json()

df = pd.read_parquet(result["download_url"])
print(df.groupby("suburb_id").size())

Each suburb contributes up to n_observations records. The suburb_id column identifies which suburb each record belongs to.

Hierarchical Geography

Add full geographic context to each record by requesting hierarchical output:

query["include_geography"] = "hierarchical"

result = requests.post(
    f"{BASE_URL}/query/execute",
    json=query,
    headers=HEADERS
).json()

df = pd.read_parquet(result["download_url"])
print(df[["suburb_id", "postcode", "lga", "gccsa", "state"]].head())

Hierarchical geography is billed at 1.5× the standard credit rate.
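As a worked example of that multiplier, the sketch below scales a standard credit cost by 1.5. The function name is hypothetical, and how the service rounds fractional credits is not stated here, so treat the preview endpoint as the authoritative figure.

```python
HIERARCHICAL_MULTIPLIER = 1.5  # rate stated in this guide


def hierarchical_cost(standard_cost):
    """Estimate the credit cost of a query once hierarchical geography is added."""
    # Rounding behaviour is unknown, so the raw product is returned.
    return standard_cost * HIERARCHICAL_MULTIPLIER
```

A query that previews at 10 credits without hierarchical geography would come to 15 credits with it under this rate.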

Multiple Imputation Workflow

AUSynth samples records from the pool without replacement, so no record is repeated within a query. Each query draws a fresh sample and therefore returns a different subset, which makes repeated queries suitable for multiple imputation:

imputations = []
for i in range(5):
    result = requests.post(
        f"{BASE_URL}/query/execute",
        json=query,
        headers=HEADERS
    ).json()
    imp_df = pd.read_parquet(result["download_url"])
    imp_df["imputation"] = i
    imputations.append(imp_df)

all_imps = pd.concat(imputations, ignore_index=True)
print(f"Total records across 5 imputations: {len(all_imps)}")

Run your analysis on each imputation separately, then combine estimates using Rubin's rules. See the Glossary for a brief explanation of Rubin's rules.
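The pooling step can be sketched in a few lines. This assumes you have already computed, for each imputation, a point estimate and its variance (squared standard error); the function name is illustrative. It implements the standard form of Rubin's rules: the pooled estimate is the mean of the estimates, and the total variance is the mean within-imputation variance plus (1 + 1/m) times the between-imputation variance.

```python
import statistics


def pool_rubin(estimates, variances):
    """Combine per-imputation estimates and variances using Rubin's rules.

    Returns (pooled_estimate, total_variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two estimates, each with a variance")
    pooled = statistics.fmean(estimates)
    within = statistics.fmean(variances)        # average sampling variance
    between = statistics.variance(estimates)    # sample variance, divisor m - 1
    total = within + (1 + 1 / m) * between
    return pooled, total
```

For instance, you could compute a proportion p in each of the five frames with variance p(1 - p)/n, pass the five pairs to pool_rubin, and take the square root of the total variance as the pooled standard error.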

Working with Families and Dwellings

The process is identical: only dataset_type changes.

family_query = {
    "geography_level": "suburb",
    "geography_selections": ["Paddington (QLD)"],
    "dataset_type": "families",
    "n_observations": 500,
    "output_format": "parquet"
}

result = requests.post(
    f"{BASE_URL}/query/execute",
    json=family_query,
    headers=HEADERS
).json()

families = pd.read_parquet(result["download_url"])
print(families.head())

Family records have 9 variables and dwelling records have 14. The same dictionary covers all three datasets.

Error Handling

The API returns standard HTTP status codes. Common cases to handle:

result = requests.post(
    f"{BASE_URL}/query/execute",
    json=query,
    headers=HEADERS
)

if result.status_code == 200:
    data = result.json()
    df = pd.read_parquet(data["download_url"])
elif result.status_code == 402:
    print("Insufficient credits. Top up at ausynth.com/account.")
elif result.status_code == 422:
    error = result.json()
    print(f"Invalid query: {error['detail']}")
elif result.status_code == 429:
    print("Rate limited. Wait and retry.")
else:
    print(f"Unexpected error: {result.status_code}")
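For the 429 case, "wait and retry" can be automated with exponential backoff. The helper below is a sketch under assumptions: its name, the default of four attempts, and the one-second starting delay are arbitrary choices, and it does not look for any retry hint the API may send. It takes a zero-argument callable so that it works with any request.

```python
import time


def retry_on_429(send, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a non-429 response or attempts run out.

    send is any zero-argument callable returning an object with a
    status_code attribute, such as a lambda wrapping requests.post.
    """
    resp = None
    for attempt in range(max_attempts):
        resp = send()
        if resp.status_code != 429:
            break
        if attempt < max_attempts - 1:
            # Double the wait each time: 1 s, 2 s, 4 s, ...
            sleep(base_delay * 2 ** attempt)
    return resp
```

To use it with the execute call, wrap the request in a lambda: retry_on_429(lambda: requests.post(f"{BASE_URL}/query/execute", json=query, headers=HEADERS)). The returned response can then go through the same status checks as above.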

Next Steps

Consult the API Reference for the complete endpoint specification, including pagination, saved queries, and query history. The FAQ covers common questions about data interpretation, credit usage, and best practices.


AUSynth is synthetic proxy data. It preserves the statistical relationships and population counts found in the ABS Census 2021, adjusted to current demographic and economic conditions. It should not be treated as a direct observation of real individuals or used as the sole basis for decisions affecting real people.