Quick Start: R
This guide walks you through querying AUSynth and analysing synthetic population data using R. You will go from sign-up to a working analysis in under 10 minutes.
Prerequisites
You need R 4.1 or later, because the examples use the native pipe (|>), which was introduced in R 4.1. Install the required packages if you do not already have them:
install.packages(c("httr", "jsonlite", "arrow", "dplyr"))
httr handles API requests. jsonlite parses JSON responses. arrow reads Parquet files (optional; you can also request CSV output). dplyr provides data manipulation verbs. Later sections also use ggplot2 and mitools; install them only if you plan to run those examples.
Get Your API Key
Sign in to your AUSynth account at ausynth.com. Navigate to Account → API Keys and generate a new key. Copy it somewhere safe, because it will not be shown again. Then set it, together with the base URL, at the top of your script:
API_KEY <- "your-api-key-here"
BASE_URL <- "https://api.ausynth.com/v1"
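A key written directly into a script is easy to leak through version control. One common alternative, sketched here, reads it from an environment variable. The variable name AUSYNTH_API_KEY and the helper get_api_key() are illustrative assumptions, not something AUSynth defines.

```r
# Sketch: read the API key from an environment variable instead of the script.
# AUSYNTH_API_KEY is a hypothetical name; set it in ~/.Renviron or your shell.
get_api_key <- function(var = "AUSYNTH_API_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop(sprintf("Environment variable %s is not set.", var))
  }
  key
}

# Usage (uncomment once the variable is set):
# API_KEY <- get_api_key()
```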
Next, define two small helpers, one for GET and one for POST, so that the authorisation header is not repeated in every call:
library(httr)
library(jsonlite)
ausynth_get <- function(path, ...) {
  resp <- GET(
    paste0(BASE_URL, path),
    add_headers(Authorization = paste("Bearer", API_KEY)),
    ...
  )
  content(resp, as = "parsed", type = "application/json")
}

ausynth_post <- function(path, body) {
  resp <- POST(
    paste0(BASE_URL, path),
    add_headers(Authorization = paste("Bearer", API_KEY)),
    body = body,
    encode = "json"
  )
  content(resp, as = "parsed", type = "application/json")
}
Your First Query
Let's generate 1,000 synthetic person records for Paddington, QLD.
Step 1: Search for the Suburb
results <- ausynth_get("/geography/search", query = list(query = "Paddington", state = "QLD"))
for (r in results$results) {
  cat(sprintf("%s; pop. %d\n", r$suburb_id, r$pool_size))
}
This returns matching suburbs with their pool sizes (the maximum number of records available).
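If a search returns many matches, a data frame is easier to sort and filter than a nested list. The following is a minimal sketch, assuming each element of results$results carries the suburb_id and pool_size fields used in the loop above. The helper name search_results_to_df() and the mock values are made up for illustration.

```r
# Sketch: flatten the parsed search response into a data frame.
# Assumes each result has suburb_id and pool_size, as in the loop above.
search_results_to_df <- function(results) {
  items <- results$results
  data.frame(
    suburb_id = vapply(items, function(r) as.character(r$suburb_id), character(1)),
    pool_size = vapply(items, function(r) as.numeric(r$pool_size), numeric(1)),
    stringsAsFactors = FALSE
  )
}

# Mock response for illustration only; these values are invented.
mock <- list(results = list(
  list(suburb_id = "Paddington (QLD)", pool_size = 8000),
  list(suburb_id = "Example Suburb", pool_size = 1200)
))
suburbs <- search_results_to_df(mock)

# Largest pools first
suburbs[order(-suburbs$pool_size), ]
```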
Step 2: Preview the Cost
Before spending credits, preview what the query will cost:
query_body <- list(
  geography_level = "suburb",
  geography_selections = list("Paddington (QLD)"),
  dataset_type = "persons",
  n_observations = 1000,
  output_format = "parquet"
)
preview <- ausynth_post("/query/preview", body = query_body)
cat(sprintf("Records: %d\nCredit cost: %d\nPool available: %d\n",
            preview$n_observations, preview$credit_cost, preview$pool_available))
The preview endpoint validates your query and reports the credit cost without executing it.
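In a script, the preview can act as a guard before the execute call. This sketch assumes that credit_cost and pool_available are numeric and that pool_available is a record count. The helper check_preview() and the max_credits budget are hypothetical.

```r
# Sketch: stop before executing if a previewed query exceeds a budget.
# `preview` is the parsed response from /query/preview; max_credits is a
# limit you choose yourself, not an AUSynth setting.
check_preview <- function(preview, requested, max_credits) {
  if (preview$credit_cost > max_credits) {
    stop(sprintf("Query costs %s credits, over the budget of %s.",
                 preview$credit_cost, max_credits))
  }
  if (preview$pool_available < requested) {
    warning(sprintf("Only %s records available; %s requested.",
                    preview$pool_available, requested))
  }
  invisible(TRUE)
}

# Usage:
# check_preview(preview, requested = query_body$n_observations, max_credits = 50)
```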
Step 3: Execute the Query
result <- ausynth_post("/query/execute", body = query_body)
cat(sprintf("Query ID: %s\n", result$query_id))
cat(sprintf("Download URL: %s\n", result$download_url))
Step 4: Load the Data
library(arrow)
df <- read_parquet(result$download_url)
cat(sprintf("Rows: %d, Columns: %d\n", nrow(df), ncol(df)))
head(df)
You should see 1,000 rows and 21 columns (one per person variable), all containing integer codes.
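Depending on your arrow version and how the download URL is served, read_parquet() may not be able to open a remote URL directly. A fallback, sketched here, downloads the file to a temporary path and reads it from disk. The helper name read_remote_file() is hypothetical; download.file() and tempfile() are base R.

```r
# Sketch: download a file to a temporary path, then read it locally.
# `reader` defaults to arrow::read_parquet; pass another reader (for example
# read.csv with ext = ".csv") if you requested CSV output.
read_remote_file <- function(url, reader = arrow::read_parquet, ext = ".parquet") {
  path <- tempfile(fileext = ext)
  # mode = "wb" keeps binary files intact on Windows
  status <- download.file(url, destfile = path, mode = "wb", quiet = TRUE)
  if (status != 0) {
    stop("Download failed with status ", status)
  }
  reader(path)
}

# Usage with the result from Step 3:
# df <- read_remote_file(result$download_url)
# stopifnot(nrow(df) == 1000)
```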
Apply Human-Readable Labels
The raw data uses integer codes rather than text. For example, AGE5P = 3 means "15–19 years" and SEXP = 0 means "Male". Download the data dictionary from the Dictionary page and place ausynth_dictionary.R in your working directory.
source("ausynth_dictionary.R")
# Label specific columns
df <- apply_labels_df(df, c("AGE5P", "SEXP", "INCP"))
head(df[, c("AGE5P", "AGE5P_label", "SEXP", "SEXP_label")])
For each listed variable, this adds a companion column with the _label suffix, stored as a factor whose levels are in the correct order. The original integer columns are preserved for computation.
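To make the idea concrete, here is a minimal sketch of how zero-based integer codes can be mapped to a factor. It is not the implementation in ausynth_dictionary.R, and the label vector is a truncated stand-in consistent with the AGE5P example above, not the real dictionary.

```r
# Sketch: map zero-based integer codes to a factor with ordered levels.
# label_codes() is a hypothetical helper, not part of ausynth_dictionary.R.
label_codes <- function(codes, labels) {
  factor(codes, levels = seq_along(labels) - 1L, labels = labels)
}

# Illustrative labels only: code 0 is the first label, code 3 the fourth.
age_labels <- c("0–4 years", "5–9 years", "10–14 years", "15–19 years")
age_factor <- label_codes(c(3L, 0L, 3L), age_labels)
table(age_factor)
```

Setting the levels explicitly keeps the age groups in their natural order in tables and plots, rather than in alphabetical order.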
Explore the Data
Age–Sex Distribution
table(df$AGE5P_label, df$SEXP_label) |> prop.table() |> round(3)
Income Distribution by Sex
df <- apply_labels_df(df, "INCP")
income_sex <- table(df$INCP_label, df$SEXP_label)
prop.table(income_sex, margin = 2) |> round(3)
Simple Visualisation
barplot(
  table(df$AGE5P_label),
  horiz = TRUE, las = 1, cex.names = 0.7,
  main = "Age Distribution: Paddington QLD",
  xlab = "Count"
)
Or with ggplot2:
library(ggplot2)
ggplot(df, aes(x = AGE5P_label)) +
  geom_bar() +
  coord_flip() +
  labs(title = "Age Distribution: Paddington QLD", x = NULL, y = "Count") +
  theme_minimal()
ggsave("paddington_age_dist.png", width = 8, height = 6)
Multi-Suburb Queries
You can request records from multiple suburbs in a single query:
query_body <- list(
  geography_level = "suburb",
  geography_selections = list("Paddington (QLD)", "Toorak", "Inala"),
  dataset_type = "persons",
  n_observations = 500,
  output_format = "parquet"
)
result <- ausynth_post("/query/execute", body = query_body)
df <- read_parquet(result$download_url)
library(dplyr)
df |> count(suburb_id)
Each suburb contributes up to n_observations records, so the query above returns at most 1,500 rows (3 suburbs × 500). The suburb_id column identifies which suburb each record belongs to.
Hierarchical Geography
Add full geographic context to each record by requesting hierarchical output:
query_body$include_geography <- "hierarchical"
result <- ausynth_post("/query/execute", body = query_body)
df <- read_parquet(result$download_url)
head(df[, c("suburb_id", "postcode", "lga", "gccsa", "state")])
Hierarchical geography is billed at 1.5× the standard credit rate.
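As a quick worked example of the multiplier, the sketch below scales a standard cost by 1.5. The helper hierarchical_cost() and the figure of 100 credits are hypothetical, and whether AUSynth rounds the result is not stated here, so no rounding is applied. The preview endpoint remains the authoritative source for the actual cost.

```r
# Sketch: estimate the hierarchical credit cost from a standard cost.
# The multiplier of 1.5 comes from the text above; rounding is not modelled.
hierarchical_cost <- function(standard_cost, multiplier = 1.5) {
  standard_cost * multiplier
}

hierarchical_cost(100)  # a hypothetical standard cost of 100 credits gives 150
```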
Multiple Imputation Workflow
AUSynth samples records from the pool without replacement. This means each query returns a different subset, making it suitable for multiple imputation:
n_imp <- 5
imputations <- vector("list", n_imp)
for (i in seq_len(n_imp)) {
  result <- ausynth_post("/query/execute", body = query_body)
  imp_df <- read_parquet(result$download_url)
  imp_df$imputation <- i
  imputations[[i]] <- imp_df
}
all_imps <- bind_rows(imputations)
cat(sprintf("Total records across %d imputations: %d\n", n_imp, nrow(all_imps)))
Run your analysis on each imputation separately, then combine estimates. The mitools package provides MIcombine() for applying Rubin's rules:
library(mitools)
# Split into one data frame per imputation (base R)
imp_list <- split(all_imps, all_imps$imputation)
# Example: estimate mean income category by imputation
estimates <- lapply(imp_list, function(d) {
  list(coef = mean(d$INCP, na.rm = TRUE),
       var = var(d$INCP, na.rm = TRUE) / nrow(d))
})
# Combine using Rubin's rules
combined <- MIcombine(
  results = lapply(estimates, function(e) e$coef),
  variances = lapply(estimates, function(e) e$var)
)
summary(combined)
See the Glossary for a brief explanation of Rubin's rules.
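For readers who want to see the arithmetic, the sketch below applies Rubin's rules to a scalar estimate in base R. The pooled estimate is the mean of the per-imputation estimates, and the total variance is the mean within-imputation variance plus (1 + 1/m) times the between-imputation variance. The function name pool_rubin() and the input numbers are illustrative only.

```r
# Sketch: Rubin's rules for a scalar estimate, in base R.
#   q: vector of point estimates, one per imputation
#   u: vector of their sampling variances
pool_rubin <- function(q, u) {
  m <- length(q)
  stopifnot(m >= 2, length(u) == m)
  q_bar <- mean(q)                  # pooled point estimate
  w <- mean(u)                      # within-imputation variance
  b <- var(q)                       # between-imputation variance
  total <- w + (1 + 1 / m) * b      # total variance
  list(estimate = q_bar, variance = total, se = sqrt(total))
}

# Made-up inputs: three imputations with estimates 1, 2, 3
pooled <- pool_rubin(q = c(1, 2, 3), u = c(0.5, 0.5, 0.5))
pooled$estimate  # 2
pooled$variance  # 0.5 + (4/3) * 1, about 1.833
```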
Working with Families and Dwellings
The process is the same as for persons; only dataset_type changes:
family_query <- list(
  geography_level = "suburb",
  geography_selections = list("Paddington (QLD)"),
  dataset_type = "families",
  n_observations = 500,
  output_format = "parquet"
)
result <- ausynth_post("/query/execute", body = family_query)
families <- read_parquet(result$download_url)
head(families)
Family records have 9 variables and dwelling records have 14. The same dictionary covers all three datasets.
Error Handling
Wrap API calls to handle common error codes:
safe_query <- function(body) {
  resp <- POST(
    paste0(BASE_URL, "/query/execute"),
    add_headers(Authorization = paste("Bearer", API_KEY)),
    body = body,
    encode = "json"
  )
  if (status_code(resp) == 200) {
    return(content(resp, as = "parsed", type = "application/json"))
  } else if (status_code(resp) == 402) {
    stop("Insufficient credits. Top up at ausynth.com/account.")
  } else if (status_code(resp) == 422) {
    err <- content(resp, as = "parsed")
    stop(sprintf("Invalid query: %s", err$detail))
  } else if (status_code(resp) == 429) {
    stop("Rate limited. Wait and retry.")
  } else {
    stop(sprintf("Unexpected error: %d", status_code(resp)))
  }
}
result <- safe_query(query_body)
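The "wait and retry" advice for rate limiting can be automated. The following is a minimal sketch of retrying with exponential backoff in base R. The helper with_retry() and its parameters are hypothetical, and as written it retries on any error, including 402 and 422 responses where retrying will not help.

```r
# Sketch: retry a function with exponential backoff between attempts.
# with_retry() is a hypothetical helper, not part of httr or AUSynth.
with_retry <- function(fn, max_tries = 3, base_wait = 1) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(fn(), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    if (attempt == max_tries) {
      stop(result)  # re-signal the last error
    }
    # With base_wait = 1 this waits 1s, then 2s, then 4s, and so on
    Sys.sleep(base_wait * 2^(attempt - 1))
  }
}

# Usage:
# result <- with_retry(function() safe_query(query_body))
```

httr also provides RETRY(), which wraps a single request with similar backoff behaviour and may be a better fit if you do not need the custom error messages.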
Next Steps
Consult the API Reference for the complete endpoint specification, including pagination, saved queries, and query history. The FAQ covers common questions about data interpretation, credit usage, and best practices.
AUSynth is synthetic proxy data. It preserves the statistical relationships and population counts found in the ABS Census 2021, adjusted to current demographic and economic conditions. It should not be treated as a direct observation of real individuals or used as the sole basis for decisions affecting real people.