Helper library for enabling data studies: utilities for study enablement such as date shifting, profiling, and related workflows.
!pip install git+https://github.com/Health-Informatics-UoN/nuh-helper.gitnuh_helper.date_shift— Date shifting for patient data in Excel/DataFrames (consistent shifts per patient ID, reproducible via linking tables).nuh_helper.profile- Profile a dataset into a Scan Report
from nuh_helper import shift_excel_dates
# or: from nuh_helper.date_shift import shift_excel_dates
# Configure which sheets and columns to shift
sheet_configs = {
"patients": {
"patient_id_col": "patient_id",
"date_columns": ["dob", "date_of_diagnosis"],
"header_row": 1, # Optional: zero-based row index for column names
},
"labs": {
"patient_id_col": "patient_id",
"date_columns": ["test_date"],
"header_row": 1,
},
}
# Shift dates in the Excel file
shift_excel_dates(
input_file="input.xlsx",
output_file="output.xlsx",
patient_sheet="patients",
patient_id_col="patient_id",
sheet_configs=sheet_configs,
min_shift_days=-15, # Lower range
max_shift_days=15, # Upper range
seed=42, # For reproducibility
date_format="YYYY-MM-DD",
)To use the same shifts across multiple runs, save and reuse a linking table:
# First run: generate and save shifts
shift_excel_dates(
input_file="input.xlsx",
output_file="output.xlsx",
patient_sheet="patients",
patient_id_col="patient_id",
sheet_configs=sheet_configs,
linking_table_output="shift_mappings.csv", # Save shifts
seed=42,
)
# Subsequent runs: reuse the same shifts
shift_excel_dates(
input_file="new_input.xlsx",
output_file="new_output.xlsx",
patient_sheet="patients",
patient_id_col="patient_id",
sheet_configs=sheet_configs,
linking_table_path="shift_mappings.csv", # Reuse saved shifts
)If your workbook has rich formatting (cell styles, column widths, conditional formatting, etc.) use shift_excel_dates_inplace instead. It copies the input file and modifies date cells directly via openpyxl, so all formatting is preserved exactly.
from nuh_helper import shift_excel_dates_inplace
shift_excel_dates_inplace(
input_file="input.xlsx",
output_file="output.xlsx",
patient_sheet="patients",
patient_id_col="patient_id",
sheet_configs=sheet_configs,
seed=42,
linking_table_output="shift_mappings.csv",
)The function accepts the same parameters as shift_excel_dates except date_format (not needed — the original cell format is preserved). External links and named ranges are removed from the output to avoid Excel repair dialogs.
input_file: Path to input Excel fileoutput_file: Path to output Excel file with shifted datespatient_sheet: Name of the sheet containing patient IDspatient_id_col: Name of the column containing patient IDssheet_configs: Dictionary mapping sheet names to configuration dicts with:patient_id_col: Patient ID column name in that sheetdate_columns: List of date column names to shiftheader_row: (Optional) Zero-based row index for the row that contains column namesskip_rows_after_header: (Optional) List of zero-based row indices to exclude from data (e.g. a data-type row immediately below the header)
patient_header_row: (Optional) Zero-based header row for the patient sheet (default: 0). If the patient sheet is insheet_configs, that sheet’sheader_rowis used instead.patient_skip_rows: (Optional) Zero-based row indices to exclude from patient data (e.g. a data-type row). If the patient sheet is insheet_configs, that sheet’sskip_rows_after_headeris used instead.min_shift_days/max_shift_days: Range of days to shift (default: -15 to 15)linking_table_path: (Optional) Path to existing linking table CSV for reproducibilitylinking_table_output: (Optional) Path to save the linking table CSVseed: (Optional) Random seed for generating shiftsdate_format: (Optional,shift_excel_datesonly) Excel date format string (e.g., ‘YYYY-MM-DD’)
Sheets can have a non-standard layout: e.g. a merged title row, then a description row, then the actual column names, then a data-type row. Configure as follows:
- Set
header_rowto the zero-based index of the row that contains the column names (the row you use for config:patient_id_col,date_columns). - Set
skip_rows_after_headerto the indices of any rows below the header that should not be treated as data (e.g. a data-type row). - Merged cells: The library reads the header row via openpyxl and resolves merged cells (value taken from the top-left of each merge), so column names are correct even when the sheet has merged cells. Merged ranges in the description area (rows above the header) are preserved when writing the output.
- Shifts dates consistently across multiple Excel sheets
shift_excel_dates_inplace: full formatting preservation (cell styles, column widths, conditional formatting, etc.)- Preserves Excel structure (description rows and merged cells in that area)
- Correct header detection with merged cells (openpyxl-based resolution)
- Optional skip of rows after the header (e.g. data-type row) via
skip_rows_after_header - Supports flexible date parsing (handles various formats and placeholders like "Unknown")
- Reproducible shifts via linking tables
Profile a dataset and generate a Scan Report.
from nuh_helper import generate_scan_report
csv_files = [
"patients.csv",
]
generate_scan_report(csv_files, min_cell_count=5)