Utilities for Synthetic Controls

The Synthetic Control Method is a causal inference tool used to estimate the impact of a policy or event on a single entity. It constructs a “synthetic” counterfactual by taking a weighted average of similar untreated units (the donor pool) to mirror the treated unit’s behavior before the intervention.

The validity of SCM depends on how closely the synthetic unit matches the treated unit during the pre-treatment period.

Traditionally, Abadie et al. (2010) utilized the Root Mean Square Prediction Error to quantify the “gap” between the treated unit and its synthetic counterpart.
To generalize and simplify interpretation of RMSPE, Adhikari and Alm (2016) developed the Pre-treatment Fit Index. This provides a standardized, intuitive value that makes it easier to assess whether the synthetic unit is a high-quality match or if the discrepancy is too large for a reliable analysis.

I show a step-by-step example to calculate the pre-treatment fit index in Stata and Python.

Stata Example

/*
Author: Bibek Adhikari.
*/

* ==========================================
* 1. Load Data & Setup Panel
* ==========================================
clear all
sysuse smoking, clear
xtset state year

* ==========================================
* 2. Define Parameters
* ==========================================
local outcome "cigsale"
local treatment_year = 1989
local treatment_unit = 3
local first_year = 1970

* ==========================================
* 3. Run Synth
* ==========================================
synth `outcome' `outcome' cigsale(1988) cigsale(1980) cigsale(1975) beer lnincome retprice age15to24, trunit(`treatment_unit') trperiod(`treatment_year') fig

* ==========================================
* 4. Calculate RMSPE & Fit Index
* ==========================================

* Store RMSPE in scalar
scalar rmspe = e(RMSPE)[1,1]

* Create Benchmark RMSPE (Levels)
gen outcome_square = `outcome'*`outcome'
sum outcome_square if year<`treatment_year' & year >= `first_year' & state==`treatment_unit'
scalar benchmark_rmspe = sqrt(r(mean))

* Calculate Fit Index
scalar fit_index = round(rmspe/benchmark_rmspe,0.001)

* ==========================================
* 5. Display Results
* ==========================================
di "{hline 30}"
di "Actual RMSPE is " rmspe
di "Benchmark RMSPE is " benchmark_rmspe	
di "Pre-treatment fit index is " fit_index
di "{hline 30}"

Python Example

Note: This script generates synthetic data so it can be run immediately without external files.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pysyncon import Dataprep, Synth

# ==========================================
# 1. Generate Synthetic Data (Replaces read_csv)
# ==========================================
np.random.seed(42) # for reproducibility

states = range(1, 40) # 39 states
years = range(1970, 2001)
data_rows = []

for state in states:
    # distinct baseline for each state (state fixed effect)
    state_base = np.random.uniform(80, 150) 
    
    for year in years:
        # Predictors (random generation with some trend)
        beer = np.random.normal(20, 5)
        lnincome = np.log(np.random.uniform(20000, 60000))
        retprice = 30 + (year - 1970) * 2 + np.random.normal(0, 2)
        age15to24 = np.random.uniform(0.1, 0.2)
        
        # Outcome (cigsale) calculation based on predictors + noise
        # This ensures predictors actually have predictive power
        noise = np.random.normal(0, 2)
        cigsale = state_base - (year - 1970) * 0.5 - (retprice * 0.2) + noise
        
        # *** APPLY TREATMENT EFFECT ***
        # Drop sales by 25 units for State 3 after 1988
        if state == 3 and year >= 1989:
            cigsale -= 25
            
        data_rows.append([state, year, cigsale, beer, lnincome, retprice, age15to24])

# Create the DataFrame
df = pd.DataFrame(data_rows, columns=['state', 'year', 'cigsale', 'beer', 'lnincome', 'retprice', 'age15to24'])

print("Synthetic data generated successfully.")

# ==========================================
# 2. Define Parameters
# ==========================================
outcome = "cigsale"
treatment_year = 1989
treatment_unit = 3   
first_year = 1970

# Define controls (everyone except state 3)
control_units = [u for u in df['state'].unique() if u != treatment_unit]

# ==========================================
# 3. Prepare Data (Dataprep)
# ==========================================
dataprep = Dataprep(
    foo=df,
    predictors=["beer", "lnincome", "retprice", "age15to24"],
    predictors_op="mean",
    time_predictors_prior=range(first_year, treatment_year),
    special_predictors=[
        ("cigsale", [1988], "mean"),
        ("cigsale", [1980], "mean"),
        ("cigsale", [1975], "mean"),
    ],
    dependent=outcome,
    unit_variable="state",
    time_variable="year",
    treatment_identifier=treatment_unit,
    controls_identifier=control_units,
    time_optimize_ssr=range(first_year, treatment_year)
)

# ==========================================
# 4. Run Synth
# ==========================================
print("Running optimization...")
synth = Synth()
synth.fit(dataprep)

# ==========================================
# 5. Calculate RMSPE & Fit Index (Manual Pandas Method)
# ==========================================

# A. Get the weights from the model
# specific_weights is a Series where index=state_id, value=weight
specific_weights = synth.weights() 

# B. Construct the Synthetic Line manually
# 1. Filter data for control units only
control_data = df[df['state'].isin(control_units)]

# 2. Pivot so we have a matrix: Rows=Years, Columns=States, Values=Outcome
# This makes the math easy: (Year x State) dot (State Weights)
control_matrix = control_data.pivot(index='year', columns='state', values=outcome)

# 3. Align the weights with the columns (crucial step!)
# We ensure the weights are in the exact same order as the matrix columns
aligned_weights = specific_weights.reindex(control_matrix.columns).fillna(0)

# 4. Calculate Synthetic Outcome (Dot Product)
synthetic_outcome = control_matrix.dot(aligned_weights)

# C. Get Actual Outcome for Treated Unit
treated_outcome = df[df['state'] == treatment_unit].set_index('year')[outcome]

# D. Calculate the Gap (Actual - Synthetic)
gap = treated_outcome - synthetic_outcome

# E. Calculate Pre-treatment RMSPE
# We slice specifically for the pre-treatment years
pre_gap = gap.loc[first_year:treatment_year-1]
rmspe = np.sqrt(np.mean(pre_gap ** 2))

# F. Calculate Benchmark RMSPE
# (Root Mean Square of the treated unit's LEVEL, not error)
pre_treated = treated_outcome.loc[first_year:treatment_year-1]
benchmark_rmspe = np.sqrt(np.mean(pre_treated ** 2))

# G. Calculate Fit Index
fit_index = round(rmspe / benchmark_rmspe, 5)

# ==========================================
# 6. Display Results
# ==========================================
print("\n" + "="*30)
print(f"Actual RMSPE is: {rmspe:.5f}")
print(f"Benchmark RMSPE is: {benchmark_rmspe:.5f}")
print(f"Pre-treatment fit index is: {fit_index}")
print("="*30 + "\n")

# ==========================================
# 7. Plotting (Manual)
# ==========================================
plt.figure(figsize=(10, 6))
# Plot Actual
plt.plot(treated_outcome.index, treated_outcome, label=f"Actual State {treatment_unit}", linewidth=2)
# Plot Synthetic
plt.plot(synthetic_outcome.index, synthetic_outcome, label="Synthetic Control", linestyle="--", linewidth=2)

# Add vertical line for treatment
plt.axvline(x=treatment_year, color='red', linestyle=':', label='Intervention (1989)')

plt.title("Synthetic Control Method: Actual vs Synthetic")
plt.xlabel("Year")
plt.ylabel("Cigarette Sales")
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()