Potential enhancements to the TableSleuth CDK infrastructure.
Current Status: The ec2.Instance L2 construct in AWS CDK does not support spot instances directly.
Issue: When attempting to use spot instances, the following error occurs:
AttributeError: module 'aws_cdk.aws_ec2' has no attribute 'SpotInstanceOptions'
Root Cause: The ec2.Instance construct is designed for on-demand instances only. AWS CDK does not provide a SpotInstanceOptions class or spot instance support for the high-level ec2.Instance construct.
Current Implementation: The CDK stack uses on-demand instances only. The use_spot configuration option is preserved in cdk.json but currently has no effect. A warning is printed if use_spot is set to true.
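The behavior described above (flag preserved, ignored, and warned about) can be sketched in plain Python. This is a minimal illustration, not the stack's actual code: the function name `load_use_spot` and the placement of `use_spot` under a `context` key in cdk.json are assumptions.

```python
import json
import warnings


def load_use_spot(context: dict) -> bool:
    """Return the configured use_spot flag and warn that it is currently ignored."""
    use_spot = bool(context.get("use_spot", False))
    if use_spot:
        warnings.warn(
            "use_spot is set, but ec2.Instance does not support spot instances; "
            "deploying an on-demand instance instead.",
            stacklevel=2,
        )
    return use_spot


# Hypothetical excerpt of cdk.json; the real file layout may differ.
cdk_json = json.loads('{"context": {"use_spot": true, "instance_type": "t3.small"}}')
requested = load_use_spot(cdk_json["context"])  # emits the warning, deployment stays on-demand
```

Keeping the flag readable while ignoring it means existing cdk.json files stay valid if spot support is added later.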
Improvement: Refactor to use Launch Templates with Auto Scaling Groups
# Create launch template
launch_template = ec2.LaunchTemplate(
self,
"TablesleuthLaunchTemplate",
instance_type=ec2.InstanceType(self.config.instance_type),
machine_image=machine_image,
security_group=self.security_group,
role=self.instance_role,
user_data=ec2.UserData.custom(self._get_user_data()),
block_devices=[
ec2.BlockDevice(
device_name="/dev/xvda",
volume=ec2.BlockDeviceVolume.ebs(
volume_size=30,
encrypted=True,
delete_on_termination=True,
),
)
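],
# Request spot capacity through the launch template; max_price is in USD per hour
spot_options=ec2.LaunchTemplateSpotOptions(max_price=0.05) if self.config.use_spot else None,
)
# Create Auto Scaling Group from the launch template.
# spot_price cannot be combined with launch_template, so spot is configured above.
asg = autoscaling.AutoScalingGroup(
self,
"TablesleuthASG",
vpc=self.vpc,
launch_template=launch_template,
min_capacity=1,
max_capacity=1,
desired_capacity=1,
)
Alternative Approaches: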
Option 1: EC2 Fleet (L1 Construct)
fleet = ec2.CfnEC2Fleet(
self,
"SpotFleet",
launch_template_configs=[
ec2.CfnEC2Fleet.FleetLaunchTemplateConfigRequestProperty(
launch_template_specification=ec2.CfnEC2Fleet.FleetLaunchTemplateSpecificationRequestProperty(
launch_template_id=launch_template.launch_template_id,
version="$Latest",
)
)
],
target_capacity_specification=ec2.CfnEC2Fleet.TargetCapacitySpecificationRequestProperty(
total_target_capacity=1,
default_target_capacity_type="spot",
),
spot_options=ec2.CfnEC2Fleet.SpotOptionsRequestProperty(
allocation_strategy="lowestPrice",
instance_interruption_behavior="terminate",
),
)
Option 2: AWS Batch
For workloads that can tolerate interruptions:
compute_env = batch.FargateComputeEnvironment(
self,
"SpotComputeEnv",
vpc=self.vpc,
spot=True,
)
Cost Comparison:
| Instance Type | On-Demand | Spot | Savings |
|---|---|---|---|
| t3.small | | | 70% |
| m4.xlarge | | | 70% |
Benefits:
- 70% cost savings for dev/staging environments
- Configurable via useSpot in cdk.json
- Maintains same functionality
Recommendation: For the TableSleuth use case (development/testing environment), the current on-demand implementation is acceptable because:
- Simplicity - The ec2.Instance construct is simpler and more straightforward
- Stability - No risk of spot instance interruptions during analysis
- Cost - Using t3.small for dev environment is already very affordable (~$15/month)
- Stop When Not in Use - Users can stop the instance when not in use to save costs
If cost optimization becomes critical, implement the Launch Template with Auto Scaling Group refactor described above in a future version.
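The trade-off in this recommendation comes down to simple arithmetic. The sketch below starts from the ~$15/month on-demand figure and the 70% spot savings quoted above. The 730-hour month, the 10 hours x 22 working days schedule, and the function name `monthly_cost` are assumptions for illustration. EBS storage, which is still billed while an instance is stopped, is ignored.

```python
HOURS_PER_MONTH = 730  # assumed average number of hours in a month


def monthly_cost(on_demand_monthly, spot_discount=0.0, hours_running=HOURS_PER_MONTH):
    """Estimate monthly compute cost in USD from an always-on on-demand price."""
    hourly = on_demand_monthly / HOURS_PER_MONTH
    return round(hourly * (1 - spot_discount) * hours_running, 2)


always_on = monthly_cost(15.0)                            # on-demand, never stopped
spot = monthly_cost(15.0, spot_discount=0.70)             # spot at the quoted 70% savings
office_hours = monthly_cost(15.0, hours_running=10 * 22)  # on-demand, stopped off-hours

print(always_on, spot, office_hours)
```

Under these assumptions, stopping the on-demand instance outside working hours lands in roughly the same range as spot pricing, which supports keeping the simpler ec2.Instance setup for dev use.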
Current: Passwords in environment variables
Improvement:
# Store password in Secrets Manager
password = secretsmanager.Secret(
self,
"GizmoSQLPassword",
description="GizmoSQL server password",
generate_secret_string=secretsmanager.SecretStringGenerator(
exclude_punctuation=True,
password_length=32,
),
)
# Reference in user data
user_data.add_commands(
f"export GIZMOSQL_PASSWORD=$(aws secretsmanager get-secret-value "
f"--secret-id {password.secret_arn} --query SecretString --output text)"
)
# Grant read access
password.grant_read(instance.role)
Benefits:
- Automatic password generation
- Rotation support
- Audit trail via CloudTrail
- No hard-coded passwords in configuration or the synthesized template
Add monitoring and alerting:
# CPU utilization alarm
cpu_alarm = cloudwatch.Alarm(
self,
"HighCPUAlarm",
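# ec2.Instance exposes no metric_* helpers, so the AWS/EC2 metric is built explicitly
metric=cloudwatch.Metric(
namespace="AWS/EC2",
metric_name="CPUUtilization",
dimensions_map={"InstanceId": instance.instance_id},
),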
threshold=80,
evaluation_periods=2,
alarm_description="Alert when CPU exceeds 80%",
)
# Disk space alarm
disk_alarm = cloudwatch.Alarm(
self,
"LowDiskSpaceAlarm",
metric=cloudwatch.Metric(
namespace="CWAgent",
metric_name="disk_used_percent",
dimensions_map={"path": "/"},
),
threshold=85,
evaluation_periods=2,
comparison_operator=cloudwatch.ComparisonOperator.GREATER_THAN_THRESHOLD,
)
# Status check alarm
status_alarm = cloudwatch.Alarm(
self,
"StatusCheckAlarm",
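# ec2.Instance exposes no metric_* helpers, so the AWS/EC2 metric is built explicitly
metric=cloudwatch.Metric(
namespace="AWS/EC2",
metric_name="StatusCheckFailed",
dimensions_map={"InstanceId": instance.instance_id},
),
threshold=1,
evaluation_periods=2,
)
Replace SSH with Session Manager: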
# Add SSM permissions to role
role.add_managed_policy(
iam.ManagedPolicy.from_aws_managed_policy_name(
"AmazonSSMManagedInstanceCore"
)
)
# No key pair needed
# Connect via: aws ssm start-session --target <instance-id>
Benefits:
- No SSH keys to manage
- Better audit trail
- No open ports required
- Session recording
Add compliance monitoring:
# Ensure EBS encryption
config.ManagedRule(
self,
"EBSEncryptionRule",
identifier=config.ManagedRuleIdentifiers.EC2_EBS_ENCRYPTION_BY_DEFAULT,
)
# Ensure VPC Flow Logs
config.ManagedRule(
self,
"VPCFlowLogsRule",
identifier=config.ManagedRuleIdentifiers.VPC_FLOW_LOGS_ENABLED,
)
Add infrastructure tests:
# tests/test_stack.py
import aws_cdk as cdk
import pytest
from aws_cdk import assertions
# TablesleuthStack and test_config are assumed to be provided by the project under test
def test_vpc_created():
    app = cdk.App()
    stack = TablesleuthStack(app, "TestStack", config=test_config)
    template = assertions.Template.from_stack(stack)
    template.resource_count_is("AWS::EC2::VPC", 1)
    template.has_resource_properties("AWS::EC2::VPC", {
        "CidrBlock": "10.10.0.0/16"
    })
def test_ebs_encryption_enabled():
    app = cdk.App()
    stack = TablesleuthStack(app, "TestStack", config=test_config)
    template = assertions.Template.from_stack(stack)
    template.has_resource_properties("AWS::EC2::Instance", {
        "BlockDeviceMappings": [
            {
                "Ebs": {
                    "Encrypted": True
                }
            }
        ]
    })
def test_iam_least_privilege():
    app = cdk.App()
    stack = TablesleuthStack(app, "TestStack", config=test_config)
    template = assertions.Template.from_stack(stack)
    # Expect an inline policy statement whose Action contains no wildcard
    template.has_resource_properties("AWS::IAM::Role", {
        "Policies": assertions.Match.array_with([
            assertions.Match.object_like({
                "PolicyDocument": {
                    "Statement": assertions.Match.array_with([
                        assertions.Match.object_like({
                            "Action": assertions.Match.not_(
                                assertions.Match.string_like_regexp(".*\\*.*")
                            )
                        })
                    ])
                }
            })
        ])
    })
Automated compliance checks:
from aws_cdk import Annotations, Aspects, IAspect
from aws_cdk import aws_ec2 as ec2, aws_s3 as s3
import jsii
@jsii.implements(IAspect)
class SecurityAspect:
    def visit(self, node):
        # Check S3 buckets have encryption
        if isinstance(node, s3.CfnBucket):
            if not node.bucket_encryption:
                Annotations.of(node).add_error(
                    "S3 buckets must have encryption enabled"
                )
        # Warn when an EC2 instance declares no block device mappings,
        # since EBS encryption cannot be confirmed in that case
        if isinstance(node, ec2.CfnInstance):
            if not node.block_device_mappings:
                Annotations.of(node).add_warning(
                    "EC2 instances should have encrypted EBS volumes"
                )
        # Check security groups don't allow 0.0.0.0/0 SSH
        if isinstance(node, ec2.CfnSecurityGroup):
            for rule in node.security_group_ingress or []:
                if rule.get("FromPort") == 22 and rule.get("CidrIp") == "0.0.0.0/0":
                    Annotations.of(node).add_error(
                        "Security groups should not allow SSH from 0.0.0.0/0"
                    )
# Apply to app
Aspects.of(app).add(SecurityAspect())
Automated best practices validation:
from aws_cdk import Aspects
from cdk_nag import AwsSolutionsChecks, NagSuppressions
# Apply AWS Solutions checks
Aspects.of(app).add(AwsSolutionsChecks(verbose=True))
# Suppress specific rules with justification
NagSuppressions.add_stack_suppressions(
stack,
[
{
"id": "AwsSolutions-IAM4",
"reason": "Managed policies acceptable for this use case"
}
]
)
Add monitoring dashboard:
dashboard = cloudwatch.Dashboard(
self,
"TablesleuthDashboard",
dashboard_name=f"tablesleuth-{environment}",
)
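# ec2.Instance exposes no metric_* helpers, so AWS/EC2 metrics are built explicitly
def ec2_metric(metric_name):
    return cloudwatch.Metric(
        namespace="AWS/EC2",
        metric_name=metric_name,
        dimensions_map={"InstanceId": instance.instance_id},
    )
dashboard.add_widgets(
    cloudwatch.GraphWidget(
        title="CPU Utilization",
        left=[ec2_metric("CPUUtilization")],
    ),
    cloudwatch.GraphWidget(
        title="Network Traffic",
        left=[
            ec2_metric("NetworkIn"),
            ec2_metric("NetworkOut"),
        ],
    ),
    cloudwatch.SingleValueWidget(
        title="Instance Status",
        metrics=[ec2_metric("StatusCheckFailed")],
    ),
)
Split VPC into separate stack: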
# network_stack.py
class NetworkStack(Stack):
    def __init__(self, scope, construct_id, **kwargs):
        super().__init__(scope, construct_id, **kwargs)
        self.vpc = ec2.Vpc(
            self,
            "TablesleuthVPC",
            # ... configuration
        )
# compute_stack.py
class ComputeStack(Stack):
    def __init__(self, scope, construct_id, vpc, **kwargs):
        super().__init__(scope, construct_id, **kwargs)
        self.instance = ec2.Instance(
            self,
            "TablesleuthInstance",
            vpc=vpc,
            # ... configuration
        )
# app.py
network_stack = NetworkStack(app, "TablesleuthNetwork-dev")
compute_stack = ComputeStack(
    app,
    "TablesleuthCompute-dev",
    vpc=network_stack.vpc,
)
Benefits:
- Independent lifecycle management
- VPC can be shared across stacks
- Easier to update compute without touching network
Pre-bake Python and dependencies:
# Create image pipeline
image_recipe = imagebuilder.CfnImageRecipe(
self,
"TablesleuthRecipe",
name="tablesleuth-recipe",
version="1.0.0",
parent_image="ami-xxxxx", # Amazon Linux 2023
components=[
{
"componentArn": python_component.attr_arn,
},
{
"componentArn": gizmosql_component.attr_arn,
},
],
)
# Use custom AMI
instance = ec2.Instance(
self,
"TablesleuthInstance",
machine_image=ec2.MachineImage.generic_linux({
"us-east-2": custom_ami_id,
}),
)
Benefits:
- Faster instance startup (~1 minute vs ~5 minutes)
- Consistent environment
- Easier updates
Support multiple instances:
asg = autoscaling.AutoScalingGroup(
self,
"TablesleuthASG",
vpc=vpc,
instance_type=instance_type,
machine_image=machine_image,
min_capacity=1,
max_capacity=5,
desired_capacity=1,
user_data=user_data,
)
# Add scaling policies
asg.scale_on_cpu_utilization(
"ScaleOnCPU",
target_utilization_percent=70,
)
Add load balancer for multiple instances:
lb = elbv2.ApplicationLoadBalancer(
self,
"TablesleuthLB",
vpc=vpc,
internet_facing=True,
)
listener = lb.add_listener(
"Listener",
port=443,
certificates=[certificate],
)
listener.add_targets(
"Targets",
port=8080,
targets=[asg],
health_check=elbv2.HealthCheck(
path="/health",
interval=cdk.Duration.seconds(30),
),
)
Automated EBS snapshots:
backup_plan = backup.BackupPlan(
self,
"BackupPlan",
backup_plan_name="tablesleuth-backup",
)
backup_plan.add_rule(
backup.BackupPlanRule(
backup_vault=backup_vault,
rule_name="DailyBackup",
schedule_expression=events.Schedule.cron(
hour="2",
minute="0",
),
delete_after=cdk.Duration.days(7),
)
)
backup_plan.add_selection(
"Selection",
resources=[
backup.BackupResource.from_ec2_instance(instance),
],
)
Scheduled start/stop:
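# Lambda function to stop the instance at night
stop_function = lambda_.Function(
    self,
    "StopInstance",
    runtime=lambda_.Runtime.PYTHON_3_11,
    handler="index.handler",
    code=lambda_.Code.from_inline("""
import boto3
def handler(event, context):
    ec2 = boto3.client('ec2')
    ec2.stop_instances(InstanceIds=[event['instance_id']])
    return {'statusCode': 200}
"""),
)
# Allow the function to stop this instance only
stop_function.add_to_role_policy(iam.PolicyStatement(
    actions=["ec2:StopInstances"],
    resources=[self.format_arn(service="ec2", resource="instance", resource_name=instance.instance_id)],
))
# EventBridge rule to trigger at 18:00 UTC (cron schedules are evaluated in UTC)
events.Rule(
    self,
    "StopInstanceRule",
    schedule=events.Schedule.cron(hour="18", minute="0"),
    targets=[targets.LambdaFunction(
        stop_function,
        # Scheduled events carry no custom fields, so pass the instance ID explicitly
        event=events.RuleTargetInput.from_object({"instance_id": instance.instance_id}),
    )],
)
Deploy to multiple regions: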
# app.py
for region in ["us-east-2", "us-west-2", "eu-west-1"]:
    TablesleuthStack(
        app,
        f"TablesleuthStack-{environment}-{region}",
        config=config,
        env=cdk.Environment(
            account=os.getenv("CDK_DEFAULT_ACCOUNT"),
            region=region,
        ),
    )
Automated deployment:
pipeline = pipelines.CodePipeline(
self,
"Pipeline",
synth=pipelines.ShellStep(
"Synth",
input=pipelines.CodePipelineSource.git_hub(
"user/repo",
"main",
),
commands=[
"npm install -g aws-cdk",
"pip install -r requirements.txt",
"cdk synth",
],
),
)
# Add deployment stages
pipeline.add_stage(
TablesleuthStage(app, "Dev", config=dev_config)
)
pipeline.add_stage(
TablesleuthStage(app, "Prod", config=prod_config),
pre=[
pipelines.ManualApprovalStep("PromoteToProd"),
],
)
- Secrets Manager integration
- CloudWatch alarms
- Systems Manager Session Manager
- Unit tests
- CDK Aspects
- CDK Nag
- CloudWatch Dashboard
- Separate stateful stack
- Backup and recovery
- Custom AMI
- Auto Scaling
- Load Balancer
- Multi-region
- CI/CD pipeline
| Improvement | Effort | Impact | Priority |
|---|---|---|---|
| Secrets Manager | 2 hours | High | P1 |
| CloudWatch Alarms | 1 hour | Medium | P1 |
| Session Manager | 1 hour | High | P1 |
| Unit Tests | 4 hours | High | P2 |
| CDK Aspects | 2 hours | Medium | P2 |
| CDK Nag | 1 hour | Medium | P2 |
| Dashboard | 2 hours | Low | P3 |
| Separate Stacks | 3 hours | Medium | P3 |
| Custom AMI | 8 hours | Medium | P4 |
| Auto Scaling | 4 hours | Low | P4 |
| Load Balancer | 4 hours | Low | P4 |
| CI/CD Pipeline | 8 hours | High | P4 |
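To see how the estimated effort is distributed across priorities, the table can be totalled with a few lines of Python. This is only a convenience sketch: the tuples are transcribed from the table above, and the names `IMPROVEMENTS` and `effort_by_priority` are illustrative.

```python
from collections import defaultdict

# (improvement, estimated hours, priority), transcribed from the table above
IMPROVEMENTS = [
    ("Secrets Manager", 2, "P1"),
    ("CloudWatch Alarms", 1, "P1"),
    ("Session Manager", 1, "P1"),
    ("Unit Tests", 4, "P2"),
    ("CDK Aspects", 2, "P2"),
    ("CDK Nag", 1, "P2"),
    ("Dashboard", 2, "P3"),
    ("Separate Stacks", 3, "P3"),
    ("Custom AMI", 8, "P4"),
    ("Auto Scaling", 4, "P4"),
    ("Load Balancer", 4, "P4"),
    ("CI/CD Pipeline", 8, "P4"),
]


def effort_by_priority(items):
    """Sum estimated hours per priority label, ordered P1..P4."""
    totals = defaultdict(int)
    for _name, hours, priority in items:
        totals[priority] += hours
    return dict(sorted(totals.items()))


totals = effort_by_priority(IMPROVEMENTS)
print(totals)                # hours per priority
print(sum(totals.values()))  # total estimated hours
```

By these estimates the three P1 items together take about half a day, while the P4 items account for more than half of the total effort.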