Evolution of Static Code Scanning Tools

Static code analysis has come a long way from the days of simple string searches. As applications and threats have grown more complex, tooling has evolved to deliver both precision and context-awareness. This post walks through that journey, from grep to Semgrep and on to taint-aware engines like CodeQL and Checkmarx, with concrete examples along the way.

🧠 1. Conceptual Foundation

What is `grep` and How It's Used in Security

grep is a Unix command-line utility that searches for lines matching a given regular expression. In security, it's a classic first step to:

Look for usage of dangerous APIs (eval, exec, system)
Identify insecure configurations (AWS_SECRET, password =)
Detect patterns in logs or diffs

grep -rnw './src' -e 'eval'

Limitations of Text-Based Search

No context: grep doesn't understand syntax or semantics.
False positives: Matches might be in comments or safe use-cases.
False negatives: Slight variations in code syntax are missed.
No dataflow: Cannot trace if a tainted input reaches a sensitive sink.
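
To make the false-positive and false-negative points concrete, here is a minimal sketch that mimics a whole-word `grep` with Python's `re` module. The sample snippet and the `grep_lines` helper are made up for illustration; the sample is only searched as text, never executed.

```python
import re

# Same idea as `grep -w eval`: a whole-word textual match, line by line.
EVAL_PATTERN = re.compile(r"\beval\b")

# Hypothetical code under review, held as plain text.
SAMPLE = '''\
# never call eval on user input
result = eval(user_input)
run = getattr(__builtins__, "ev" + "al")
run(user_input)
'''

def grep_lines(pattern, text):
    """Return the 1-based line numbers whose text matches the pattern."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if pattern.search(line)
    ]

hits = grep_lines(EVAL_PATTERN, SAMPLE)
print(hits)  # [1, 2]
```

Line 1 is a comment (a false positive), while lines 3 and 4 reach `eval` through string concatenation and are never flagged (a false negative).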

Abstract Syntax Trees (AST) to the Rescue

An AST is a structured, tree-like representation of code where each node corresponds to a language construct (function, variable, call, etc). ASTs let tools understand code at a syntactic level, making matches more reliable.
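
As a rough illustration of what syntactic matching buys you, the sketch below uses Python's standard `ast` module to find direct calls to `eval`. The `find_calls` helper and the sample source are assumptions for this example, not part of any particular scanner.

```python
import ast

# Hypothetical code under review: a comment, a string, and one real call.
SAMPLE = '''\
# never call eval on user input
message = "eval is dangerous"
result = eval(user_input)
'''

def find_calls(source, function_name):
    """Return line numbers of direct calls to `function_name` in `source`."""
    tree = ast.parse(source)
    return [
        node.lineno
        for node in ast.walk(tree)
        if isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id == function_name
    ]

eval_calls = find_calls(SAMPLE, "eval")
print(eval_calls)  # [3]
```

The comment and the string literal both contain the word "eval", but only the real call on line 3 is reported, because the match is made against call nodes rather than raw text.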

🧠 AST Structure Example (Mermaid)
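
One way to see this structure for yourself is to print the node types of a parsed statement. This is a small sketch using Python's `ast` module; the `outline` helper is a name invented for this example.

```python
import ast

def outline(node, depth=0):
    """Return the node type names of an AST as indented lines."""
    lines = ["  " * depth + type(node).__name__]
    for child in ast.iter_child_nodes(node):
        lines.extend(outline(child, depth + 1))
    return lines

tree_lines = outline(ast.parse("x = eval(y)"))
print("\n".join(tree_lines))
```

For `x = eval(y)` the outline starts at a `Module` node, which contains an `Assign` node whose value is a `Call` node. That nesting is what lets a tool ask "is this a call?" instead of "does this line contain these characters?".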

How Semgrep Leverages AST

Semgrep is an open-source static analysis tool that performs pattern matching over ASTs. Instead of regex, you write Semgrep rules using structured patterns.

Example:

rules:
  - id: no-eval
    pattern: eval(...)
    message: Avoid eval()
    severity: ERROR
    languages: [javascript]

Python and Java Examples

Python SSRF-like Issue:

import requests
from flask import Flask, request

app = Flask(__name__)

@app.route('/proxy')
def proxy():
    # Vulnerable: the caller controls which URL the server fetches.
    url = request.args.get('url')
    return requests.get(url).content
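
One common mitigation for a handler like this is to check the requested URL against an allowlist before fetching it. The sketch below is a minimal version of that idea, assuming a hypothetical `ALLOWED_HOSTS` set and `is_allowed_url` helper. It does not cover redirects or DNS rebinding, which a production fix would also need to consider.

```python
from urllib.parse import urlparse

# Hypothetical allowlist; a real service would load this from configuration.
ALLOWED_HOSTS = {"api.example.com", "images.example.com"}

def is_allowed_url(url, allowed_hosts=ALLOWED_HOSTS):
    """Accept only http(s) URLs whose host is explicitly allowlisted."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    # hostname is lowercased and excludes any port or userinfo.
    return parsed.hostname in allowed_hosts

print(is_allowed_url("https://api.example.com/v1/items"))          # True
print(is_allowed_url("http://169.254.169.254/latest/meta-data/"))  # False
print(is_allowed_url("file:///etc/passwd"))                        # False
```

In the proxy handler, the fetch would only run when `is_allowed_url(url)` returns `True`.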

Java SQLi:

// Vulnerable: user input is concatenated straight into the SQL string.
String userInput = request.getParameter("user");
String query = "SELECT * FROM users WHERE name = '" + userInput + "'";
Statement stmt = connection.createStatement();
ResultSet rs = stmt.executeQuery(query);
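
The standard fix is a parameterized query (in Java, a `PreparedStatement`). To keep the examples in one language, here is the same idea sketched in Python with the built-in `sqlite3` module. The `find_users` helper and the in-memory table are assumptions for illustration.

```python
import sqlite3

def find_users(connection, name):
    """Look up users by name using a bound parameter, not concatenation."""
    cursor = connection.execute(
        "SELECT name FROM users WHERE name = ?", (name,)
    )
    return [row[0] for row in cursor.fetchall()]

# Throwaway in-memory database for the demo.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE users (name TEXT)")
connection.executemany("INSERT INTO users VALUES (?)", [("alice",), ("bob",)])

print(find_users(connection, "alice"))        # ['alice']
print(find_users(connection, "' OR '1'='1"))  # []
```

Because the input is bound as a value, the classic `' OR '1'='1` payload is treated as a literal name and matches nothing.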

🧬 2. Deep Dive Into Taint Analysis

What is Taint?

"Taint" marks data that comes from untrusted sources (e.g., user input).

Key Concepts

Source: Where tainted data comes from (e.g., req.query.id)
Sink: Where data should not go if tainted (e.g., child_process.exec())
Propagation: How taint moves through variables or functions.
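Sanitizer: Code that cleans or validates tainted data for a specific sink (e.g., encodeURIComponent() for a URL component). A sanitizer that is safe for one context may not be safe for another.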
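
These four concepts fit together in a few lines of code. The following toy tracker walks a straight-line program and reports sinks that receive tainted data. The program encoding and every function name in it (`request_arg`, `run_shell`, `shell_quote`) are hypothetical, and a real engine handles far more than this.

```python
# Hypothetical names chosen for this sketch; real tools ship curated lists.
SOURCES = {"request_arg"}
SINKS = {"run_shell"}
SANITIZERS = {"shell_quote"}

# Each statement is (target, function, argument_variables).
# target is None when the call result is discarded.
PROGRAM = [
    ("raw", "request_arg", []),         # source: raw becomes tainted
    ("name", "strip", ["raw"]),         # propagation: taint flows to name
    ("safe", "shell_quote", ["name"]),  # sanitizer: result is clean
    (None, "run_shell", ["safe"]),      # sink with clean data: no finding
    (None, "run_shell", ["name"]),      # sink with tainted data: finding
]

def find_tainted_sinks(program):
    """Return indexes of statements where tainted data reaches a sink."""
    tainted = set()
    findings = []
    for index, (target, function, args) in enumerate(program):
        args_tainted = any(arg in tainted for arg in args)
        if function in SINKS and args_tainted:
            findings.append(index)
        if target is None:
            continue
        if function in SOURCES:
            tainted.add(target)
        elif function in SANITIZERS or not args_tainted:
            tainted.discard(target)
        else:
            tainted.add(target)
    return findings

findings = find_tainted_sinks(PROGRAM)
print(findings)  # [4]
```

Only the last statement is flagged: the earlier sink call receives the sanitized value, so it stays quiet.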

🧬 Taint Flow Diagram (Mermaid)

Real-World Examples

XSS: res.send(req.query.name)
SQLi: db.query("SELECT * FROM users WHERE id = " + req.query.id)
SSRF: http.get(req.query.url)

Internals: How Taint Analysis Works

Most modern tools use:

Control Flow Graphs (CFGs): Model the possible execution paths through the code
Data Flow Graphs (DFGs): Model how data propagates between variables and calls
Symbol Tables: Keep track of variables, types, and scopes
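
To show how a CFG and dataflow combine, here is a minimal worklist sketch that propagates taint over a tiny hand-built graph with a loop. The node encoding and names are assumptions made for this example; production engines build these graphs from the AST and track much richer state.

```python
from collections import deque

# CFG for this hypothetical snippet:
#   x = source()
#   while cond: y = x
#   exec(y)
# Each node is (kind, target, operand).
NODES = {
    "entry": ("source", "x", None),
    "check": ("branch", None, None),
    "body": ("copy", "y", "x"),
    "sink": ("sink", None, "y"),
}
SUCCESSORS = {
    "entry": ["check"],
    "check": ["body", "sink"],
    "body": ["check"],  # back edge that forms the loop
    "sink": [],
}

def transfer(node, tainted_in):
    """Apply one node's effect to the set of tainted variables."""
    kind, target, operand = NODES[node]
    tainted_out = set(tainted_in)
    if kind == "source":
        tainted_out.add(target)
    elif kind == "copy":
        if operand in tainted_in:
            tainted_out.add(target)
        else:
            tainted_out.discard(target)
    return tainted_out

def analyze():
    """Iterate to a fixed point; return the tainted-in set for every node."""
    tainted_in = {node: set() for node in NODES}
    worklist = deque(NODES)  # start with every node queued
    while worklist:
        node = worklist.popleft()
        tainted_out = transfer(node, tainted_in[node])
        for successor in SUCCESSORS[node]:
            if not tainted_out <= tainted_in[successor]:
                tainted_in[successor] |= tainted_out
                worklist.append(successor)
    return tainted_in

def sink_findings(tainted_in):
    """Return sink nodes whose operand may be tainted on some path."""
    return [
        node
        for node, (kind, _target, operand) in NODES.items()
        if kind == "sink" and operand in tainted_in[node]
    ]

result = analyze()
print(sink_findings(result))  # ['sink']
```

The first time the sink is visited, `y` is not yet known to be tainted. It is only after the loop body feeds back into `check` that the taint on `y` reaches the sink, which is why the analysis iterates until nothing changes.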

⚙️ Basic Control Flow Graph (Mermaid)

⚙️ Advanced Control Flow (Loop + Branches)

Reducing False Positives

Use CFG + DFG to avoid matching on unreachable or dead code
Incorporate sanitization context into rule writing
Customize source/sink/sanitizer functions specific to your app
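
As a sketch of the first point, the snippet below drops findings that sit in code the CFG cannot reach from the entry point. The graph, node names, and findings are all hypothetical.

```python
def reachable(successors, entry):
    """Return every node reachable from `entry` via depth-first search."""
    seen = set()
    stack = [entry]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(successors.get(node, []))
    return seen

# Hypothetical CFG: "legacy_exec" has no incoming edge, so it is dead code.
SUCCESSORS = {
    "entry": ["validate"],
    "validate": ["respond"],
    "respond": [],
    "legacy_exec": ["respond"],
}
RAW_FINDINGS = [
    {"rule": "command-injection", "node": "legacy_exec"},
    {"rule": "xss", "node": "respond"},
]

live = reachable(SUCCESSORS, "entry")
kept = [finding for finding in RAW_FINDINGS if finding["node"] in live]
print([finding["rule"] for finding in kept])  # ['xss']
```

Some teams prefer to keep dead-code findings at a lower priority instead of dropping them, since dead code can come back to life in a later change.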

Pattern vs Taint-Based Detection

Feature	Pattern Matching	Taint Analysis
Scope	Single line/function	Full dataflow
Accuracy	Medium	High
Speed	Fast	Slower
Complexity	Simple rules	Requires CFG + DFG

⚔️ 3. Practical Tool Comparison: SSRF in Node.js

... (Section unchanged for brevity) ...

Using Checkmarx

Full taint-aware engine
CxQuery Language: Custom query language to define patterns and flows
Accurately detects SSRF
Enterprise-grade dashboards, policy gating, and CI integrations

Using CodeQL

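// Simplified sketch rather than a complete query: isSource() and isSink() stand in
// for the sources and sinks you would define in a taint-tracking configuration.
import javascript
from DataFlow::PathNode source, DataFlow::PathNode sink
where source.isSource() and sink.isSink() and DataFlow::localFlow(source, sink)
select source, sink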

Uses QL, a logic programming language
Highly customizable with reusable libraries (e.g., DataFlow, TaintTracking)
Visualizes paths from source to sink via VS Code + GitHub Code Scanning integrations

DevSecOps Fit

Tool	Speed	Accuracy	Taint-Aware	CI Friendly
`grep`	✅ Fast	❌ Low	❌	✅
Semgrep OSS	✅ Fast	⚠️ Medium	⚠️ Single-function only	✅
Semgrep Pro	⚠️ Medium	✅ Good	✅	✅
Checkmarx	❌ Slow	✅ High	✅	⚠️ Medium
CodeQL	❌ Slow	✅ High	✅	⚠️ Medium

🧹 4. Summary Table: Tool Comparison

Feature	Semgrep OSS	Semgrep Pro	Checkmarx	CodeQL
Pattern Matching	✅	✅	❌	❌
Taint Mode	⚠️ Single-function only	✅	✅	✅
Speed	Fast	Medium	Slow	Slow
Accuracy	Medium	High	High	High
Customizability	High	High	Medium	Very High
Ideal For	CI + PRs	CI + Security Teams	Compliance + Large Orgs	Custom Rules + Power Users

🗺️ 5. Final Takeaways

Start with Semgrep OSS if you're early-stage or want fast CI checks
Upgrade to Semgrep Pro for cross-function and cross-file taint analysis without enterprise overhead
Use CodeQL when building deep custom security queries for large repos
Use Checkmarx if you need mature reporting, policy gates, or integrations

Taint Mode is the Middle Path

Semgrep's taint mode brings much of the precision of dataflow analysis without requiring a full enterprise-grade engine. It is ideal for engineering teams who want signal over noise.
