Cloud Security as a Graph Problem

CSPM gives you 46,268 findings. A graph database gives you the relationships that make them answerable. Here's the pipeline — real queries, real numbers — that connects the two.

One nightly scan. 48 AWS accounts. 46,268 FAIL findings.

Critical: 616. High: 8,399. Medium: 25,417. Low: 11,836.

Same question as Part 1’s 4,200-finding screenshot: where do I even start?

Part 1 argued that context is the bottleneck — scanners are cheap and deliberately context-stripped; the leverage move is attaching context at detection time, not triage time. This post is about what that looks like in practice: the data model, the tooling, and the automation that turns 46,268 findings into a triaged, owner-attributed queue.

The short version: security questions are almost always relationship questions, and the natural data structure for relationships is a graph.

Why cloud findings are hard to act on

“S3 bucket is public.” Our CSPM surfaced 119 buckets with a Principal: * Allow statement in the bucket policy. To act on any one of them you need to know: is this intentional? Is it a static website bucket behind CloudFront? Is the account dev or prod? What’s actually in it?

The scanner doesn’t know. By design.

After pulling graph context — account tier, CloudFront distribution edges, IP-condition allowlists — the real list was five buckets. No conditions, no block-public-access, no CloudFront, in production-tier accounts. The other 114 resolved automatically. Same finding type, completely different risk profile.

The answer to “does this finding matter?” is almost always a relationship question:

  • Is this asset internet-facing? — reachability
  • If this principal is compromised, what can it reach? — blast radius
  • Who owns it and who do I page? — ownership
  • Is the vulnerable path actually reachable in prod? — exposure

These are graph questions. The answers live in edges between nodes — between an IAM role and the EC2 instances it’s attached to, between an S3 bucket and the policies that reference it, between a running instance and the subnet that determines whether it’s public-facing. A flat CSV of findings doesn’t model edges. A graph database does.

Cartography: your infra as a graph

Cartography (Lyft, OSS) ingests AWS, GCP, GitHub, Kubernetes, and others into Neo4j. Every resource becomes a node. Every relationship becomes an edge.

What that looks like in practice: EC2Instance nodes connected via MEMBER_OF_EC2_SECURITY_GROUP to EC2SecurityGroup nodes, which connect to IpPermissionInbound nodes with their CIDR rules. AWSRole nodes connected via TRUSTS_AWS_PRINCIPAL edges to the principals that can assume them. S3Bucket nodes connected via TAGGED edges to AWSTag nodes carrying ownership keys. EBS snapshots traceable through CREATED_FROM → EBSVolume → ATTACHED_TO → EC2Instance → MEMBER_OF_EKS_CLUSTER, all the way back to the EKS cluster the volume belonged to.

The schema isn’t a flat list of resources. It’s a navigable map of your infrastructure.

Finding public S3 buckets with graph context:

MATCH (s:S3Bucket) WHERE s.anonymous_access = true
OPTIONAL MATCH (s)-[:TAGGED]->(owner:AWSTag {key: 'Owner'})
RETURN s.name AS bucket, s.anonymous_actions AS actions, owner.value AS owner
ORDER BY size(s.anonymous_actions) DESC

Result on our baseline: 24 buckets via Cartography’s anonymous_access flag. Steampipe’s live Principal: * check found 119. The 95-bucket gap is ACL+policy combinations with IP-condition allowlists that Cartography’s flag doesn’t decompose. That’s where Steampipe fills in (more on this in the Steampipe section below).

Blast radius of an EC2 instance’s attached role:

MATCH (i:EC2Instance)
WHERE i.id = $resourceId OR i.instanceid = $resourceId

OPTIONAL MATCH (i)-[:MEMBER_OF_EC2_SECURITY_GROUP]->(sg:EC2SecurityGroup)
OPTIONAL MATCH (i)-[:PART_OF_SUBNET]->(sub:EC2Subnet)-[:MEMBER_OF_AWS_VPC]->(vpc:AWSVpc)
OPTIONAL MATCH (i)-[:INSTANCE_PROFILE]->(ip:AWSInstanceProfile)-[:ASSOCIATED_WITH]->(role:AWSRole)
OPTIONAL MATCH (role)-[:STS_ASSUMEROLE_ALLOW]->(assumable:AWSRole)
OPTIONAL MATCH (role)-[:POLICY]->(policy:AWSPolicy)

RETURN
  i.publicipaddress as publicIp,
  i.state as state,
  collect(DISTINCT {id: sg.id, name: sg.name}) as securityGroups,
  vpc.id as vpc,
  role.arn as roleArn,
  collect(DISTINCT {arn: assumable.arn, name: assumable.name}) as assumableRoles,
  collect(DISTINCT policy.name) as policies

This runs synchronously in the analyst dashboard as a tool call — sub-5 seconds. The cross-account trust query runs alongside it:

MATCH (role:AWSRole)-[:TRUSTS_AWS_PRINCIPAL]->(principal)
WHERE role.arn = $roleArn
RETURN principal.arn as trustedPrincipal, labels(principal) as labels

Together these two queries answer the blast-radius question that no single CSPM finding surfaces.

BloodHound proved this framing works: for a decade, Active Directory security has been done as graph queries — “what is the shortest path from this user to Domain Admin?” Cartography is the same bet applied to cloud infrastructure. The primitives are different (IAM roles instead of AD groups, STS_ASSUMEROLE_ALLOW instead of MemberOf), but the insight is identical: relationships are where the risk lives.
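To make the BloodHound analogy concrete, here is a minimal sketch of the same "shortest path to the crown jewels" question over assume-role edges. It assumes the STS_ASSUMEROLE_ALLOW edges have already been exported into a plain Python dict; the role names and the ASSUME_EDGES map are hypothetical, and in Neo4j itself this would more naturally be a shortestPath pattern in Cypher.

```python
from collections import deque

# Hypothetical adjacency map: role -> roles it is allowed to assume.
# Stands in for STS_ASSUMEROLE_ALLOW edges exported from the graph.
ASSUME_EDGES = {
    "role/web": ["role/deploy"],
    "role/deploy": ["role/admin"],
    "role/batch": [],
    "role/admin": [],
}

def shortest_assume_path(edges, start, target):
    """Breadth-first search: return the shortest role chain, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # target is not reachable from start

print(shortest_assume_path(ASSUME_EDGES, "role/web", "role/admin"))
```

Breadth-first search visits roles in order of hop count, so the first path that reaches the target is also a shortest one.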

Deployment reality

This isn’t magic. Cartography runs on an EC2 instance (r7i.2xlarge, single-host Docker). Neo4j runs as a container alongside it. The nightly sync kicks off at 8pm PT via a systemd timer with a 10-minute randomized delay. The sync fans out per account, and each account syncs independently: a failed sync on one account doesn’t block the others.
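The per-account isolation is the part worth copying. A minimal sketch of the idea, assuming a hypothetical sync_one(account_id) callable that wraps whatever actually runs the sync for one account; this is an illustration of the failure-isolation pattern, not how Cartography itself is invoked.

```python
from concurrent.futures import ThreadPoolExecutor

def sync_all(account_ids, sync_one, max_workers=8):
    """Run sync_one per account; one account failing never stops the rest."""
    def guarded(account_id):
        try:
            sync_one(account_id)
            return account_id, "ok"
        except Exception as exc:  # isolate any per-account failure
            return account_id, f"failed: {exc}"

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(guarded, account_ids))

def fake_sync(account_id):
    """Hypothetical stand-in for a real sync; one account is broken."""
    if account_id == "222222222222":
        raise RuntimeError("AccessDenied")

status = sync_all(["111111111111", "222222222222", "333333333333"], fake_sync)
print(status)
```

The returned status map is what you would alert on: a nightly run with one failed account is still a usable graph for the other accounts.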

The node/edge examples above reflect our actual Cartography schema. One important note: verify node and edge type names against the current Cartography version before writing code against them. The schema has been updated across versions, and a type that existed in 0.90 may have been renamed by 0.134.

What Cartography doesn’t cover

Cartography is a nightly snapshot. It reflects the state of your infra at ingestion time, not right now. Resources created this morning aren’t in last night’s graph. That’s where Steampipe fills in.

Commercial CSPMs like Wiz and Orca solve this with near-real-time connectors. The tradeoff is vendor lock-in and cost at scale. CloudQuery offers a relational (Postgres-backed) alternative if your team thinks in SQL rather than Cypher. The approach described here is the open-source end of the spectrum: bring your own infrastructure, full schema access, no vendor dependency.

Steampipe: live queries for the freshness gap

Steampipe queries cloud APIs directly using SQL. Our production config has 48 per-account connections behind an aws_org aggregator. It’s AWS-only for us (one plugin), but the plugin ecosystem covers GCP, Azure, GitHub, Kubernetes, and more.

The Cartography/Steampipe split in practice:

  • Cartography: relationship traversals, blast radius, ownership edges, anything that requires graph structure
  • Steampipe: live current-state checks, freshness validation, anything that requires a real API call right now

The S3 discrepancy query, confirming live block-public-access (BPA) state for buckets Cartography flagged:

SELECT name, account_id, bucket_policy_is_public,
       block_public_acls, block_public_policy,
       ignore_public_acls, restrict_public_buckets
FROM aws_org.aws_s3_bucket

This is what exposed the 119 vs 24 discrepancy. Cartography’s anonymous_access flag is conservative; Steampipe’s live policy read catches the broader Principal: * case including IP-condition policies that the flag misses.
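The distinction between "Principal: * with no conditions" and "Principal: * behind an IP allowlist" can be sketched as a small policy classifier. This is a simplified illustration, not the production check: it only looks at the IpAddress operator on aws:SourceIp, so any other condition type (VPC endpoint, org ID, and so on) is not examined and counts as open here. The function and constant names are hypothetical.

```python
import json

def _is_wildcard(principal):
    """True for Principal "*" or {"AWS": "*"} (including a list with "*")."""
    if principal == "*":
        return True
    if isinstance(principal, dict):
        aws = principal.get("AWS")
        values = aws if isinstance(aws, list) else [aws]
        return "*" in values
    return False

def classify_bucket_policy(policy_json):
    """Return 'open', 'ip_restricted', or 'not_public' for one bucket policy."""
    statements = json.loads(policy_json).get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]
    verdict = "not_public"
    for stmt in statements:
        if stmt.get("Effect") != "Allow" or not _is_wildcard(stmt.get("Principal")):
            continue
        ip_cond = stmt.get("Condition", {}).get("IpAddress", {})
        has_ip = any(key.lower() == "aws:sourceip" for key in ip_cond)
        if not has_ip:
            return "open"  # one unconditioned wildcard Allow is enough
        verdict = "ip_restricted"
    return verdict

# Illustrative policies only.
OPEN_POLICY = json.dumps({"Statement": [
    {"Effect": "Allow", "Principal": "*", "Action": "s3:GetObject",
     "Resource": "arn:aws:s3:::example/*"}]})
ALLOWLIST_POLICY = json.dumps({"Statement": [
    {"Effect": "Allow", "Principal": {"AWS": "*"}, "Action": "s3:GetObject",
     "Resource": "arn:aws:s3:::example/*",
     "Condition": {"IpAddress": {"aws:SourceIp": ["203.0.113.0/24"]}}}]})
PRIVATE_POLICY = json.dumps({"Statement": [
    {"Effect": "Allow", "Principal": {"AWS": "arn:aws:iam::111111111111:root"},
     "Action": "s3:GetObject", "Resource": "arn:aws:s3:::example/*"}]})

for policy in (OPEN_POLICY, ALLOWLIST_POLICY, PRIVATE_POLICY):
    print(classify_bucket_policy(policy))
```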

Other live checks in production use — EC2 instances with public IPs, IMDSv1 enabled, and an attached IAM role:

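SELECT instance_id, instance_state, public_ip_address,
  metadata_options ->> 'HttpTokens' as http_tokens,
  iam_instance_profile_arn, account_id,
  tags ->> 'Owner' as owner_tag
FROM aws_org.aws_ec2_instance
-- 'optional' means IMDSv1 requests are still accepted
WHERE public_ip_address IS NOT NULL
  AND metadata_options ->> 'HttpTokens' = 'optional'
  AND iam_instance_profile_arn IS NOT NULL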

This returned 105 instances matching the IMDSv1 + public IP + IAM role pattern — the exact set used to validate the CSPM finding (CF-06). The Steampipe live count matched the Cartography-enriched count exactly. When they match, you have confidence; when they don’t, the gap is worth investigating.
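The "do the two sources agree?" check is a set comparison. A minimal sketch, assuming both result sets have been reduced to plain lists of resource IDs; the reconcile name and the example IDs are hypothetical.

```python
def reconcile(graph_ids, live_ids):
    """Compare IDs from the nightly graph with IDs from a live query."""
    graph_ids, live_ids = set(graph_ids), set(live_ids)
    return {
        "match": graph_ids == live_ids,
        # In the snapshot but gone live: possibly terminated since the sync.
        "only_in_graph": sorted(graph_ids - live_ids),
        # Live but not in the snapshot: possibly created since the sync,
        # or a case the graph-side flag does not capture.
        "only_live": sorted(live_ids - graph_ids),
    }

result = reconcile(["i-aaa", "i-bbb"], ["i-bbb", "i-ccc"])
print(result)
```

Comparing IDs rather than counts matters: two sources can both report the same count while disagreeing on which resources are in the set.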

Automating the loop: from finding to triaged ticket

The pipeline shape:

CSPM (Prowler)
        │
        │  raw findings (OCSF JSON)
        ▼
Enrichment layer
   ├─► Cartography: asset relationships, blast radius, tag edges
   ├─► Steampipe: live state confirmation
   └─► Tag lookup: team, environment, owner
        │
        │  enriched finding
        ▼
Reclassification engine (deterministic rules)
        │
        │  regraded severity + rule key
        ▼
Attribution pipeline
   ├─► Terraform git blame
   ├─► Deploy pipeline blame (CloudTrail → role → repo)
   ├─► Resource lineage (derivative from parent ARN)
   └─► BambooHR employment status check
        │
        │  reclassified finding + attributed owner
        ▼
Ticketing / routing

The key thing about the reclassification engine: it’s deterministic, not an LLM. Every regrade rule is a Python function keyed by check_id with explicit conditions and an explicit rule label. The LLM lives in the analyst dashboard — interactive, read-only, for ad-hoc queries — not in the nightly pipeline. This is intentional: the automation path needs to be auditable, fast, and right for the bulk case. The LLM is for the fuzzy residue that the rules don’t cover.
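One way to picture "a Python function keyed by check_id with an explicit rule label" is a small registry. This is a sketch of the shape under assumptions, not the actual engine: the RULES dict, the rule decorator, the example check ID, and the account_tier field are all hypothetical.

```python
RULES = {}

def rule(check_id):
    """Register a regrade function under a scanner check_id."""
    def register(fn):
        RULES[check_id] = fn
        return fn
    return register

@rule("example_check_id")  # hypothetical check ID
def regrade_example(finding):
    # Explicit condition, explicit label: nothing fuzzy in this path.
    if finding.get("account_tier") == "sandbox":
        return "low", "example_sandbox_account"
    return None  # no rule fired; keep the scanner's severity

def reclassify(finding):
    """Apply the rule for this check_id, if any, and record which rule fired."""
    fn = RULES.get(finding["check_id"])
    outcome = fn(finding) if fn else None
    if outcome is None:
        return {**finding, "rule": None}
    severity, rule_key = outcome
    return {**finding, "severity": severity, "rule": rule_key}

print(reclassify({"check_id": "example_check_id", "severity": "high",
                  "account_tier": "sandbox"}))
```

Recording the rule key next to the new severity is what makes the regrade auditable: every changed finding can be traced to one function.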

Reclassification examples

The 16.1% of findings that got regraded (5,824 out of 36,078) fell into recognizable patterns. Three worth showing in detail:

1. Bulk downgrade: context shows no risk

A large batch of secretsmanager_automatic_rotation_enabled / secretsmanager_secret_rotated_periodically findings (severity HIGH) matched a known pattern: the resource ARN contained litellm, identifying them as API proxy keys where rotation isn’t applicable. One ARN-pattern rule, no graph query needed.

Result: 3,662 findings moved HIGH → LOW. That’s 63% of the total regrade volume, from a single rule. The context wasn’t in the finding. It was in knowing what that resource class is.

2. Graph-enriched triage: security group exposure depends on what’s attached

ec2_securitygroup_allow_ingress_* findings (HIGH) all look the same on paper: a security group allows inbound from 0.0.0.0/0. The actual risk is entirely about what’s attached to the security group. One Cypher query:

UNWIND $sg_ids AS sg_id
MATCH (sg:EC2SecurityGroup {groupid: sg_id})
OPTIONAL MATCH (sg)<-[:MEMBER_OF_EC2_SECURITY_GROUP]-(i:EC2Instance)
OPTIONAL MATCH (sg)<-[:MEMBER_OF_EC2_SECURITY_GROUP]-(db:Database)
RETURN sg_id,
       COLLECT(DISTINCT i.instanceid) AS instances,
       COLLECT(DISTINCT i.state) AS states,
       COLLECT(DISTINCT db.id) AS databases

The outcomes from the same Cypher result, split by what’s attached:

Context                                Regrade            Count
No instances, no databases attached    HIGH → LOW         24
Only stopped instances                 HIGH → LOW         7
Running instances, private IPs only    HIGH → MEDIUM      1,178
EBS volume, detached instance          HIGH → LOW         186
Database directly behind open SG       HIGH → CRITICAL    2

Same check ID. Same original severity. Five different contexts and three different real risk levels, all from one graph query.
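The table above can be read as a decision function over one row of the Cypher result. A minimal sketch under assumptions: the rule keys are hypothetical, the precedence order (database first) is a guess, the public_ips field is an extra input the query above does not return, and the EBS volume row is left out because it depends on a different traversal.

```python
def regrade_open_sg(row):
    """Map one enriched security-group row to (severity, rule_key)."""
    if row["databases"]:
        return "critical", "sg_open_database_attached"
    if not row["instances"]:
        return "low", "sg_open_nothing_attached"
    if all(state == "stopped" for state in row["states"]):
        return "low", "sg_open_only_stopped_instances"
    if not row.get("public_ips"):  # assumed extra field, see lead-in
        return "medium", "sg_open_private_ips_only"
    return "high", "sg_open_unchanged"

rows = [
    {"instances": [], "states": [], "databases": []},
    {"instances": ["i-1"], "states": ["stopped"], "databases": []},
    {"instances": ["i-2"], "states": ["running"], "databases": [], "public_ips": []},
    {"instances": ["i-3"], "states": ["running"], "databases": ["db-1"]},
]
for row in rows:
    print(regrade_open_sg(row))
```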

3. Escalation: the combination that matters

ec2_instance_imdsv2_enabled (HIGH): IMDSv1 is enabled, which means an SSRF vulnerability could let an attacker steal the instance’s IAM credentials via the metadata endpoint. But “SSRF is possible → credential theft” only holds if the instance (a) has a public IP and (b) has an IAM role attached. It becomes urgent when (c) there is also no owner to page.

UNWIND $iids AS iid
MATCH (i:EC2Instance {instanceid: iid})
OPTIONAL MATCH (i)-[:TAGGED]->(t:AWSTag {key: 'Owner'})
RETURN iid, i.publicipaddress, i.iaminstanceprofile, t.value

Any instance satisfying all three conditions got escalated to CRITICAL. The rule:

if has_public and has_iam and not has_owner:
    results.append((*_base(f), "critical", "imdsv2_public_iam_no_owner",
                    "Public IP + IAM role, no owner — SSRF → credential theft"))

Result: 96 findings HIGH → CRITICAL. The 9 instances that had an Owner tag stayed HIGH — the risk is real, but there’s someone to page. The 96 without an owner are the ones that needed to be in front of the security team before anything else.
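For readers who want the rule fragment above in runnable form, here is a self-contained sketch. The _base helper, the finding fields, and the context row keys are assumptions made for illustration; only the three-way condition and the rule key come from the fragment.

```python
def _base(f):
    """Hypothetical helper: the identifying fields each result row starts with."""
    return (f["finding_id"], f["resource_id"], f["severity"])

def regrade_imdsv1(findings, context):
    """context maps instance ID -> one row of graph enrichment."""
    results = []
    for f in findings:
        row = context.get(f["resource_id"], {})
        has_public = bool(row.get("publicipaddress"))
        has_iam = bool(row.get("iaminstanceprofile"))
        has_owner = bool(row.get("owner"))
        if has_public and has_iam and not has_owner:
            results.append((*_base(f), "critical", "imdsv2_public_iam_no_owner",
                            "Public IP + IAM role, no owner: SSRF to credential theft"))
    return results

findings = [
    {"finding_id": "f1", "resource_id": "i-1", "severity": "high"},
    {"finding_id": "f2", "resource_id": "i-2", "severity": "high"},
]
context = {
    "i-1": {"publicipaddress": "198.51.100.7", "iaminstanceprofile": "profile-a", "owner": None},
    "i-2": {"publicipaddress": "198.51.100.8", "iaminstanceprofile": "profile-b", "owner": "team-x"},
}
escalated = regrade_imdsv1(findings, context)
print(escalated)
```

Findings that do not match produce no row, which is how an owned instance stays at its original HIGH.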

Attribution: who do I page?

For every finding, the question is: who created this resource, and do they still work here?

The attribution chain, in order:

  1. Direct tag: a team-contact tag or Owner AWSTag. Fastest path. Only 329 resources had this set, a low single-digit percentage of the EC2 fleet.
  2. Terraform git blame: for resources declared in .tf files, git blame --porcelain on the resource block across 19 Terraform repos → first committer email.
  3. Deploy pipeline blame: for container workloads, CloudTrail PutImage/RegisterTaskDefinition event → push-role name → repo via a role-name-to-repo mapping config.
  4. Resource lineage: for derivatives (snapshots, volumes, load balancers), trace parent ARN via Cartography edges (CREATED_FROM, ATTACHED_TO, MEMBER_OF), inherit parent’s RCA row.
  5. System-generated: GuardDuty-managed buckets, SecurityHub findings, CloudFront OAI. Attribute to the AWS service, not a human.
  6. Unresolved: git_blame_unmatched, pipeline_unresolved, derivative_unresolved. The row still lands in the attribution table, but with no email. These route to a triage queue.
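The chain above is an ordered fallback: try each source in turn and stop at the first one that yields an email. A minimal sketch, assuming each source can be wrapped as a function from ARN to email or None. The lookup dicts and resolver names are hypothetical stand-ins, and the system-generated step is omitted because it attributes to a service rather than an email.

```python
def attribute(arn, resolvers):
    """Try each (method, resolver) in order; the first non-empty email wins."""
    for method, resolve in resolvers:
        email = resolve(arn)
        if email:
            return {"arn": arn, "method": method, "email": email}
    # Every fallback exhausted: the row still exists, but with no email.
    return {"arn": arn, "method": "unresolved", "email": None}

# Hypothetical lookups standing in for the tag, git blame, and CloudTrail sources.
TAGS = {"arn:aws:s3:::bucket-a": "team-a@example.com"}
BLAME = {"arn:aws:s3:::bucket-b": "dev@example.com"}

RESOLVERS = [
    ("direct_tag", TAGS.get),
    ("terraform_git_blame", BLAME.get),
    ("deploy_pipeline_blame", lambda arn: None),
    ("resource_lineage", lambda arn: None),
]

for arn in ("arn:aws:s3:::bucket-a", "arn:aws:s3:::bucket-b", "arn:aws:s3:::bucket-c"):
    print(attribute(arn, RESOLVERS))
```

Keeping the method name on every row is what lets you measure coverage per fallback and see which signal is carrying the load.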

The combined pipeline targets ~93.5% ARN coverage. The ~6.5% that exhaust all fallbacks are the “unattributed” bucket: resources that either predated the Terraform history or were created via console click-ops with no trace.

Once you have an email, one more check: is this person still here?

SELECT employment_status FROM <identity_store>
WHERE canonical_email = $owner_email;

'terminated' means they’re absent from the HR system. A terminated creator is a flag, not an answer — the resource may have been handed off, or it may be genuinely orphaned.
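A minimal sketch of that lookup, using an in-memory SQLite table as a stand-in for the identity store. The identity_store table name and its columns are assumptions for illustration; the one behavior taken from the text is that an email with no row is reported as 'terminated'.

```python
import sqlite3

def owner_status(conn, owner_email):
    """Return the stored status, or 'terminated' when the email has no row."""
    row = conn.execute(
        "SELECT employment_status FROM identity_store WHERE canonical_email = ?",
        (owner_email,),
    ).fetchone()
    return row[0] if row else "terminated"

# Hypothetical stand-in for the real identity store.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE identity_store ("
    "canonical_email TEXT PRIMARY KEY, employment_status TEXT)"
)
conn.execute("INSERT INTO identity_store VALUES ('a@example.com', 'active')")

print(owner_status(conn, "a@example.com"))
print(owner_status(conn, "gone@example.com"))
```

Treating the result as a flag keeps the pipeline honest: a 'terminated' owner routes the finding to a human, it does not close it.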

Three use cases

1. S3 triage: 119 findings → 5 real tickets

119 buckets had a Principal: * Allow statement in their policy. Raw CSPM severity: mixed Medium/High. To know which ones actually needed a ticket, we needed: is the bucket intentionally public? What’s in it? What account tier?

Graph + Steampipe live check + IP-condition analysis got this to five buckets: no IP conditions, no block-public-access settings, in production-tier accounts. The other 114 either had allowlists restricting access to known office CIDRs, were dev/sandbox accounts, or were CloudFront origin buckets where public access is structural.

Five tickets created, not 119. The other 114 were documented and closed automatically.
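The 119 → 5 reduction is a sequence of exclusions, each with a reason worth recording. A minimal sketch, assuming the enrichment has already been flattened into one dict per bucket; the field names, reason labels, and the order of the checks are hypothetical.

```python
def needs_ticket(bucket):
    """Return (ticket?, reason) for one enriched public-bucket finding."""
    if bucket["has_ip_condition"]:
        return False, "ip_allowlist"
    if bucket["block_public_access"]:
        return False, "bpa_enabled"
    if bucket["cloudfront_origin"]:
        return False, "cloudfront_origin"
    if bucket["account_tier"] != "prod":
        return False, "non_prod_account"
    return True, "open_in_prod"

buckets = [
    {"name": "static-site", "has_ip_condition": False, "block_public_access": False,
     "cloudfront_origin": True, "account_tier": "prod"},
    {"name": "office-share", "has_ip_condition": True, "block_public_access": False,
     "cloudfront_origin": False, "account_tier": "prod"},
    {"name": "exports", "has_ip_condition": False, "block_public_access": False,
     "cloudfront_origin": False, "account_tier": "prod"},
]
tickets = [b["name"] for b in buckets if needs_ticket(b)[0]]
print(tickets)
```

The reason string is what gets written to the closed findings, so "documented and closed automatically" leaves an audit trail.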

2. IMDSv1 escalation: 105 instances → 96 criticals, 9 highs

105 instances had IMDSv1 enabled with a public IP and an attached IAM role (CF-06). Without graph context, these are 105 identical HIGH findings. With context: 96 had no Owner tag (escalated to CRITICAL, paged immediately), 9 had an Owner tag (stayed HIGH, ticketed to the owner).

One graph query stratified 105 uniform findings into two meaningfully different action items.

Edge case worth noting: several MongoDB production nodes were in this set but their security groups restrict access to app-layer traffic — direct exploitation requires an app-layer SSRF, not a direct network path. The rule escalates on “no owner” rather than “no SG protection” because the SG restriction is a mitigation, not a fix, and the instance still has no one to page.

3. Orphaned admin credentials: 45% of AWS users had left the company

Access Analyzer found 125 unused IAM access keys. Steampipe found 39 keys over 2 years old. Standard scan-output treatment: hygiene findings, Low to Medium severity.

Identity spine cross-reference: 573 AWS users were absent from the HR system, 45% of all users with AWS access. 79 contractors held full admin privileges, 42 of them not in the HR system. 530 former full-time employees still had active company-domain principals in IAM.

These aren’t hygiene findings. They’re orphaned admin credentials with no current owner that were never offboarded at termination. The graph traversal (IAM principal → canonical email → employment status) turned a Low-severity hygiene finding into a board-level access control incident.

What this doesn’t solve

Data classification. The graph knows a bucket exists and who can reach it. It doesn’t know what’s in it. Whether a public bucket contains static marketing assets or customer PII is not in Cartography — it’s in S3 Select queries, object tagging, and domain knowledge. The graph tells you blast radius; data classification tells you blast impact.

Runtime behavior. Whether a Lambda actually fires in production isn’t in the graph. Whether a suspicious IAM role has ever been used isn’t in Cartography. For that you need CloudTrail (historical) or runtime tooling (continuous). The graph is structural; behavior is orthogonal.

Tag coverage. Only 329 resources had an Owner tag set on our baseline scan. That’s a low single-digit percentage of the EC2 fleet. The Terraform git blame and pipeline blame fallbacks recover a lot of that ground, but they require good Terraform hygiene and consistent role-naming conventions — neither of which you can assume. The attribution pipeline is only as good as the signals feeding it.

Wildcard IAM noise. Every S3 bucket has ~1,150 wildcard IAM paths from org-wide admin roles. The first pass at access matrix analysis was unusable because of this. The fix was to switch from raw path count to scoped_iam_path_count — paths that are specific to a bucket, not inherited from *:* policies. If you build IAM blast radius analysis, build this filter from day one.
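A minimal sketch of what a scoped path filter could look like, assuming each IAM path carries the Resource entries of the statement that grants it. The path structure and field names are hypothetical; only the idea (count paths that name the bucket, drop paths that reach it through a wildcard) comes from the text.

```python
def scoped_iam_path_count(paths, bucket_arn):
    """Count paths whose granting statement names this bucket specifically."""
    def is_scoped(path):
        return any(
            resource == bucket_arn or resource.startswith(bucket_arn + "/")
            for resource in path["resources"]
        )
    return sum(1 for path in paths if is_scoped(path))

paths = [
    {"role": "OrgAdmin", "resources": ["*"]},                  # org-wide wildcard
    {"role": "S3Admin", "resources": ["arn:aws:s3:::*"]},      # service-wide wildcard
    {"role": "log-writer", "resources": ["arn:aws:s3:::logs/*"]},
    {"role": "archiver", "resources": ["arn:aws:s3:::logs-archive"]},  # other bucket
]
print(len(paths), scoped_iam_path_count(paths, "arn:aws:s3:::logs"))
```

Matching on the bucket ARN plus a trailing slash avoids counting a different bucket that merely shares a name prefix.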

Activity status from small samples. We initially marked buckets DORMANT or ACTIVE based on a 5-object alphabetical sample of LastModified dates. This was wrong for 11 buckets — including a 2.2 PB bucket that showed no recent modifications in the sample but had 231,000 CloudTrail events per day. Small samples on large datasets produce confident wrong answers. Use CloudWatch storage metrics and CloudTrail for activity classification, not object-level sampling.
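The failure mode is easy to reproduce. In this sketch, a 5-object alphabetical sample lands entirely on old keys, while an event count for the same bucket says it is busy. The 90-day window, the key layout, and the function names are assumptions for illustration; the 231,000 events per day figure is the one quoted above.

```python
from datetime import date

def dormant_by_sample(objects, today, n=5, days=90):
    """The flawed approach: judge a bucket by its first n keys alphabetically."""
    sample = sorted(objects, key=lambda o: o["key"])[:n]
    return all((today - o["last_modified"]).days > days for o in sample)

def dormant_by_events(events_per_day):
    """The sturdier signal: any API activity at all means not dormant."""
    return events_per_day == 0

today = date(2024, 6, 1)
objects = [{"key": f"archive/{i}", "last_modified": date(2020, 1, 1)} for i in range(5)]
objects.append({"key": "zz/latest", "last_modified": today})  # written today

print(dormant_by_sample(objects, today))  # sample never reaches zz/latest
print(dormant_by_events(231_000))
```

Alphabetical order correlates with age whenever keys are date-prefixed or append-only, which is why the sample is biased and not just small.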

Maintenance. The Cypher queries above work against our current Cartography schema. Cartography updates its node/edge types. AWS adds resource types. The regrade rules need updating as new check IDs are added. Every piece of this pipeline has a maintenance surface. Budget for it before you ship it.

The through-line

Part 1 said: “the leverage isn’t more scanners — it’s anything that closes the context gap.”

The pipeline above is what closing the context gap looks like in practice. 46,268 findings from one nightly run. 16.1% regraded by deterministic rules in sub-second Neo4j queries. HIGH findings reduced by 63%. The 96 findings that most needed attention (public instances with an attached IAM role and no owner to page) surfaced as Critical before any human touched a ticket.

That’s not AI replacing the analyst. It’s a graph database doing the relationship work that used to be archaeology, so the analyst can spend their time on the 4% of findings where the rules don’t have enough signal to be confident.

Detection is solved. Context is still where the work is. This is what that looks like when you build it.


Part 2 of the Security Engineering Is a Context Problem series. Next: how the same pattern applies to security operations alert triage — same graph, different data sources, different urgency.

Idea, framing, and edits: Aseem. Drafting assistance: Claude.