Quick Start: R

This guide walks you through querying AUSynth and analysing synthetic population data using R. You will go from sign-up to a working analysis in under 10 minutes.

Prerequisites

You need R 4.1 or later, because the examples use the native pipe operator (|>). Install the required packages if you do not already have them:

install.packages(c("httr", "jsonlite", "arrow", "dplyr"))

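httr handles API requests. jsonlite parses JSON responses. arrow reads Parquet files (optional; you can also request CSV output). dplyr provides data manipulation verbs. The plotting and multiple-imputation examples later in this guide also use ggplot2 and mitools, which are not included in the install command above.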

Get Your API Key

Sign in to your AUSynth account at ausynth.com. Navigate to Account → API Keys and generate a new key. Copy it somewhere safe. It will not be shown again.

API_KEY <- "your-api-key-here"
BASE_URL <- "https://api.ausynth.com/v1"
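Hard-coding a key in a script makes it easy to leak through version control. A minimal sketch of one alternative is to read the key from an environment variable. The variable name AUSYNTH_API_KEY and the helper get_api_key() are illustrative assumptions, not part of AUSynth:

```r
# Read the API key from an environment variable instead of the script.
# AUSYNTH_API_KEY is a hypothetical name chosen for this example.
get_api_key <- function(var = "AUSYNTH_API_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop(sprintf("Environment variable %s is not set.", var), call. = FALSE)
  }
  key
}
```

With the variable set (for example in ~/.Renviron), API_KEY <- get_api_key() can stand in for the literal assignment.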

Define two small helpers, one for GET requests and one for POST requests, to avoid repeating the authorisation header:

library(httr)
library(jsonlite)

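# GET helper: extra arguments (for example query = list(...)) are passed on to GET().
# Neither helper checks the HTTP status; see Error Handling for that.
ausynth_get <- function(path, ...) {
  resp <- GET(
    paste0(BASE_URL, path),
    add_headers(Authorization = paste("Bearer", API_KEY)),
    ...
  )
  # Parse the JSON body into nested R lists
  content(resp, as = "parsed", type = "application/json")
}

# POST helper: body is an R list that httr serialises to JSON
ausynth_post <- function(path, body) {
  resp <- POST(
    paste0(BASE_URL, path),
    add_headers(Authorization = paste("Bearer", API_KEY)),
    body = body,
    encode = "json"
  )
  content(resp, as = "parsed", type = "application/json")
}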

Your First Query

Let's generate 1,000 synthetic person records for Paddington, QLD.

Step 1: Search for the Suburb

results <- ausynth_get("/geography/search", query = list(query = "Paddington", state = "QLD"))
for (r in results$results) {
  cat(sprintf("%s: pool size %d\n", r$suburb_id, r$pool_size))
}

This returns matching suburbs with their pool sizes (the maximum number of records available).
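When a search returns several matches, a data frame is often easier to scan than printed lines. The sketch below works on a mock of the parsed response: the field names suburb_id and pool_size come from the example above, while the values and the helper name results_to_df() are made up for illustration.

```r
# Mock of the parsed search response. Values are illustrative only.
results <- list(results = list(
  list(suburb_id = "EXAMPLE_A", pool_size = 8000L),
  list(suburb_id = "EXAMPLE_B", pool_size = 12000L)
))

# Turn the list of matches into a data frame, one row per suburb.
results_to_df <- function(res) {
  rows <- lapply(res$results, function(r) {
    data.frame(suburb_id = r$suburb_id, pool_size = r$pool_size,
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

matches <- results_to_df(results)

# Largest pools first
matches[order(-matches$pool_size), ]
```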

Step 2: Preview the Cost

Before spending credits, preview what the query will cost:

query_body <- list(
  geography_level = "suburb",
  geography_selections = list("Paddington (QLD)"),
  dataset_type = "persons",
  n_observations = 1000,
  output_format = "parquet"
)

preview <- ausynth_post("/query/preview", body = query_body)
cat(sprintf("Records: %d\nCredit cost: %d\nPool available: %d\n",
            preview$n_observations, preview$credit_cost, preview$pool_available))

The preview endpoint validates your query and reports the credit cost without executing it.
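The preview can also act as a guard in a script. The following is a sketch only: check_preview() and max_credits are hypothetical names, the mock values are made up, and comparing n_observations against pool_available assumes the preview echoes the requested count rather than a capped one.

```r
# Mock preview response; field names follow the example above, values are made up.
preview <- list(n_observations = 1000, credit_cost = 50, pool_available = 8000)

# Stop before executing if the preview looks wrong. max_credits is a
# budget chosen by the caller, not an AUSynth parameter.
check_preview <- function(preview, max_credits) {
  if (preview$credit_cost > max_credits) {
    stop(sprintf("Query costs %d credits, budget is %d.",
                 preview$credit_cost, max_credits), call. = FALSE)
  }
  if (preview$n_observations > preview$pool_available) {
    stop("Requested more records than the pool holds.", call. = FALSE)
  }
  invisible(TRUE)
}

check_preview(preview, max_credits = 100)
```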

Step 3: Execute the Query

result <- ausynth_post("/query/execute", body = query_body)
cat(sprintf("Query ID: %s\n", result$query_id))
cat(sprintf("Download URL: %s\n", result$download_url))

Step 4: Load the Data

library(arrow)

df <- read_parquet(result$download_url)
cat(sprintf("Rows: %d, Columns: %d\n", nrow(df), ncol(df)))
head(df)

You should see 1,000 rows and 21 columns (one per person variable), all containing integer codes.
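Depending on the arrow version and how the file is served, read_parquet() may not accept an HTTPS URL directly. A hedged fallback is to download the file first and read it from disk. The helper name load_parquet_from_url() is an assumption for this example, and the sketch assumes the download URL needs no extra headers:

```r
# Download the Parquet file to a local path, then read it from disk.
# The default destination is a temp file that R removes when the session ends.
load_parquet_from_url <- function(url, dest = tempfile(fileext = ".parquet")) {
  # mode = "wb" keeps the binary file intact on Windows
  utils::download.file(url, dest, mode = "wb", quiet = TRUE)
  arrow::read_parquet(dest)
}
```

Usage would then be df <- load_parquet_from_url(result$download_url).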

Apply Human-Readable Labels

The raw data uses integer codes: for example, AGE5P = 3 means "15–19 years" and SEXP = 0 means "Male". Download the data dictionary from the Dictionary page and place ausynth_dictionary.R in your working directory.

source("ausynth_dictionary.R")

# Label specific columns
df <- apply_labels_df(df, c("AGE5P", "SEXP", "INCP"))
head(df[, c("AGE5P", "AGE5P_label", "SEXP", "SEXP_label")])

For each labelled variable, this adds a column with a _label suffix, stored as a factor whose levels follow the code order. The original integer columns are preserved for computation.
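To make the idea concrete, here is a base R sketch of what labelling a single column amounts to. It is not the dictionary's implementation: "Male" for code 0 is taken from the text above, while "Female" for code 1 and the toy data are assumptions.

```r
# Labels in code order. "Female" for code 1 is an assumption for this example.
sexp_labels <- c("Male", "Female")

demo <- data.frame(SEXP = c(0L, 1L, 1L, 0L))

# Codes start at 0 and R indexes from 1, hence the + 1L.
demo$SEXP_label <- factor(sexp_labels[demo$SEXP + 1L], levels = sexp_labels)

demo
```

Fixing the levels explicitly keeps tables and plots in code order rather than alphabetical order.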

Explore the Data

Age–Sex Distribution

table(df$AGE5P_label, df$SEXP_label) |> prop.table() |> round(3)

Income Distribution by Sex

# INCP_label was already added by the apply_labels_df() call above
income_sex <- table(df$INCP_label, df$SEXP_label)
prop.table(income_sex, margin = 2) |> round(3)
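The margin = 2 argument is what turns the table into a within-sex distribution. A small self-contained illustration on made-up data (the categories and counts are not AUSynth output):

```r
# Toy data, made up for illustration: two income bands by two sexes.
toy <- data.frame(
  income = c("Low", "Low", "High", "High", "High", "Low"),
  sex    = c("Male", "Female", "Male", "Female", "Female", "Male")
)
tab <- table(toy$income, toy$sex)

# margin = 2 divides each cell by its column total,
# giving the income distribution within each sex.
col_props <- prop.table(tab, margin = 2)

# Each column now sums to 1
colSums(col_props)
```

With margin = 1 the rows would sum to 1 instead, giving the sex split within each income band.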

Simple Visualisation

barplot(
  table(df$AGE5P_label),
  horiz = TRUE, las = 1, cex.names = 0.7,
  main = "Age Distribution: Paddington QLD",
  xlab = "Count"
)

Or with ggplot2:

library(ggplot2)

ggplot(df, aes(x = AGE5P_label)) +
  geom_bar() +
  coord_flip() +
  labs(title = "Age Distribution: Paddington QLD", x = NULL, y = "Count") +
  theme_minimal()

ggsave("paddington_age_dist.png", width = 8, height = 6)

Multi-Suburb Queries

You can request records from multiple suburbs in a single query:

query_body <- list(
  geography_level = "suburb",
  geography_selections = list("Paddington (QLD)", "Toorak", "Inala"),
  dataset_type = "persons",
  n_observations = 500,
  output_format = "parquet"
)

result <- ausynth_post("/query/execute", body = query_body)
df <- read_parquet(result$download_url)

library(dplyr)
df |> count(suburb_id)

Each suburb contributes up to n_observations records. The suburb_id column identifies which suburb each record belongs to.
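Because the limit applies per suburb, the upper bound on the combined result is the number of suburbs times n_observations. A quick arithmetic check, with the relevant fields of the query body repeated so the snippet stands alone:

```r
query_body <- list(
  geography_selections = list("Paddington (QLD)", "Toorak", "Inala"),
  n_observations = 500
)

# Upper bound on rows: every suburb returns at most n_observations records.
max_rows <- length(query_body$geography_selections) * query_body$n_observations
max_rows  # 1500
```

A suburb whose pool is smaller than n_observations would return fewer records, so the actual total can be below this bound.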

Hierarchical Geography

Add full geographic context to each record by requesting hierarchical output:

query_body$include_geography <- "hierarchical"

result <- ausynth_post("/query/execute", body = query_body)
df <- read_parquet(result$download_url)
head(df[, c("suburb_id", "postcode", "lga", "gccsa", "state")])

Hierarchical geography is billed at 1.5× the standard credit rate.

Multiple Imputation Workflow

AUSynth samples records from the pool without replacement, and each query draws a new sample. Repeated queries therefore return different subsets, which makes them suitable for a multiple imputation workflow:

n_imp <- 5
imputations <- vector("list", n_imp)

for (i in seq_len(n_imp)) {
  result <- ausynth_post("/query/execute", body = query_body)
  imp_df <- read_parquet(result$download_url)
  imp_df$imputation <- i
  imputations[[i]] <- imp_df
}

all_imps <- bind_rows(imputations)
cat(sprintf("Total records across %d imputations: %d\n", n_imp, nrow(all_imps)))

Run your analysis on each imputation separately, then combine estimates. The mitools package provides MIcombine() for applying Rubin's rules:

library(mitools)

# Split the stacked data back into one data frame per imputation
imp_list <- split(all_imps, all_imps$imputation)

# Example: estimate mean income category by imputation
estimates <- lapply(imp_list, function(d) {
  list(coef = mean(d$INCP, na.rm = TRUE),
       var = var(d$INCP, na.rm = TRUE) / nrow(d))
})

# Combine using Rubin's rules
combined <- MIcombine(
  results = lapply(estimates, function(e) e$coef),
  variances = lapply(estimates, function(e) e$var)
)
summary(combined)

See the Glossary for a brief explanation of Rubin's rules.
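For a single scalar estimate, Rubin's rules are short enough to write out by hand, which can help when checking the MIcombine() output. The sketch below uses made-up estimates and variances; rubin_pool() is a name chosen for this example.

```r
# Made-up per-imputation estimates (q) and their variances (u).
q <- c(5.0, 5.2, 4.9, 5.1, 5.3)
u <- rep(0.01, 5)

rubin_pool <- function(q, u) {
  m <- length(q)
  q_bar <- mean(q)              # pooled point estimate
  w <- mean(u)                  # within-imputation variance
  b <- var(q)                   # between-imputation variance
  total <- w + (1 + 1 / m) * b  # total variance
  list(estimate = q_bar, variance = total, se = sqrt(total))
}

pooled <- rubin_pool(q, u)
pooled
```

The between-imputation term is inflated by (1 + 1/m) to account for using a finite number of imputations.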

Working with Families and Dwellings

The process is identical apart from the value of dataset_type:

family_query <- list(
  geography_level = "suburb",
  geography_selections = list("Paddington (QLD)"),
  dataset_type = "families",
  n_observations = 500,
  output_format = "parquet"
)

result <- ausynth_post("/query/execute", body = family_query)
families <- read_parquet(result$download_url)
head(families)

Family records have 9 variables and dwelling records have 14. The same dictionary covers all three datasets.

Error Handling

Wrap API calls to handle common error codes:

safe_query <- function(body) {
  resp <- POST(
    paste0(BASE_URL, "/query/execute"),
    add_headers(Authorization = paste("Bearer", API_KEY)),
    body = body,
    encode = "json"
  )
  
  if (status_code(resp) == 200) {
    return(content(resp, as = "parsed", type = "application/json"))
  } else if (status_code(resp) == 402) {
    stop("Insufficient credits. Top up at ausynth.com/account.")
  } else if (status_code(resp) == 422) {
    err <- content(resp, as = "parsed")
    stop(sprintf("Invalid query: %s", err$detail))
  } else if (status_code(resp) == 429) {
    stop("Rate limited. Wait and retry.")
  } else {
    stop(sprintf("Unexpected error: %d", status_code(resp)))
  }
}

result <- safe_query(query_body)
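For rate limiting in particular, retrying after a pause is usually more useful than stopping. The wrapper below is a sketch under stated assumptions: with_retry() is a hypothetical helper, it recognises rate limiting by the "Rate limited" message that safe_query() raises, and the wait times are arbitrary because the actual rate-limit window is not specified here. A fake function stands in for the API call so the snippet runs offline.

```r
# Retry fn() with exponential backoff, but only for rate-limit errors.
with_retry <- function(fn, max_tries = 3, base_wait = 1) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(fn(), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    rate_limited <- grepl("Rate limited", conditionMessage(result), fixed = TRUE)
    if (!rate_limited || attempt == max_tries) {
      stop(result)  # re-raise anything else, or give up after the last try
    }
    Sys.sleep(base_wait * 2^(attempt - 1))  # wait doubles on each attempt
  }
}

# Stand-in for the API call: fails twice with a rate-limit error, then succeeds.
calls <- 0
flaky <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("Rate limited. Wait and retry.")
  "ok"
}

outcome <- with_retry(flaky, base_wait = 0)
```

In real use the call would look like with_retry(function() safe_query(query_body)).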

Next Steps

Consult the API Reference for the complete endpoint specification, including pagination, saved queries, and query history. The FAQ covers common questions about data interpretation, credit usage, and best practices.


AUSynth is synthetic proxy data. It preserves the statistical relationships and population counts found in the ABS Census 2021, adjusted to current demographic and economic conditions. It should not be treated as a direct observation of real individuals or used as the sole basis for decisions affecting real people.