From Grep to Taint Analysis: The Evolution of Static Code Scanning

Specialized in uncovering vulnerabilities within software supply chains and dependency ecosystems. Creator of SCAGoat and other open-source security tools. Speaker at Black Hat, DEF CON, and AppSec conferences with research on malicious package detection, dependency confusion, and CI/CD security.
Static code analysis has come a long way from the days of simple string searches. With the rising complexity of applications and threats, our tooling has evolved to meet the demand for both precision and context-awareness. This blog takes you through that journey - from grep to Semgrep, and on to powerful taint-aware engines like CodeQL and Checkmarx - with real examples and actionable insights.
🧠 1. Conceptual Foundation
What is grep and How It's Used in Security
grep is a Unix command-line utility that searches for lines matching a given regular expression. In security, it's a classic first step to:
Look for usage of dangerous APIs (eval, exec, system)
Identify insecure configurations (AWS_SECRET, password =)
Detect patterns in logs or diffs
grep -rnw './src' -e 'eval'
Limitations of Text-Based Search
No context: grep doesn't understand syntax or semantics.
False positives: Matches might be in comments or safe use-cases.
False negatives: Slight variations in code syntax are missed.
No dataflow: Cannot trace if a tainted input reaches a sensitive sink.
Abstract Syntax Trees (AST) to the Rescue
An AST is a structured, tree-like representation of code where each node corresponds to a language construct (function, variable, call, etc). ASTs let tools understand code at a syntactic level, making matches more reliable.
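To make the contrast with text search concrete, here is a minimal sketch using Python's standard `ast` module. The `SAMPLE` snippet and the helper names `grep_lines` and `find_eval_calls` are illustrative assumptions, not part of any tool discussed here.

```python
import ast
import re

# Illustrative input: one comment, one string literal, one real call.
SAMPLE = '''
# never call eval(user_input) here
label = "eval(x) is dangerous"
result = eval(user_input)
'''

def grep_lines(source: str) -> list[int]:
    """Text search: report every line that contains 'eval('."""
    return [i for i, line in enumerate(source.splitlines(), start=1)
            if re.search(r"eval\(", line)]

def find_eval_calls(source: str) -> list[int]:
    """AST search: report only real calls to a function named eval."""
    tree = ast.parse(source)
    return [node.lineno for node in ast.walk(tree)
            if isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id == "eval"]

print(grep_lines(SAMPLE))       # [2, 3, 4]: the comment and the string also match
print(find_eval_calls(SAMPLE))  # [4]: only the actual call
```

The text search flags the comment and the string literal, while the AST walk sees them as a comment (dropped by the parser) and a constant, so only the call on line 4 is reported.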
🧠 AST Structure Example (Mermaid)
How Semgrep Leverages AST
Semgrep is an open-source static analysis tool that performs pattern matching over ASTs. Instead of regex, you write Semgrep rules using structured patterns.
Example:
rules:
  - id: no-eval
    pattern: eval(...)
    message: Avoid eval()
    severity: ERROR
    languages: [javascript]
Python and Java Examples
Python SSRF-like Issue:
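import requests
from flask import Flask, request

app = Flask(__name__)

@app.route('/proxy')
def proxy():
    # Vulnerable: the user-controlled URL is fetched without any validation.
    url = request.args.get('url')
    return requests.get(url).content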
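For contrast, the kind of check an analyzer could be taught to treat as a sanitizer for this endpoint might look like the sketch below. The `ALLOWED_HOSTS` set and the `is_allowed_url` helper are hypothetical, and the check is deliberately minimal: it does not handle redirects or DNS rebinding.

```python
from urllib.parse import urlparse

# Hypothetical allowlist; a real service would load this from configuration.
ALLOWED_HOSTS = {"api.example.com"}

def is_allowed_url(url: str) -> bool:
    """Accept only http(s) URLs whose host is on the allowlist."""
    parsed = urlparse(url)
    return parsed.scheme in {"http", "https"} and parsed.hostname in ALLOWED_HOSTS

print(is_allowed_url("https://api.example.com/v1/users"))
print(is_allowed_url("http://169.254.169.254/latest/meta-data/"))
```

In the `proxy()` handler above, calling such a check before `requests.get(url)` and rejecting the request when it fails would put a validation step between the source and the sink.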
Java SQLi:
String userInput = request.getParameter("user");
String query = "SELECT * FROM users WHERE name = '" + userInput + "'";
Statement stmt = connection.createStatement();
ResultSet rs = stmt.executeQuery(query);
🧬 2. Deep Dive Into Taint Analysis
What is Taint?
"Taint" marks data that comes from untrusted sources (e.g., user input).
Key Concepts
Source: Where tainted data comes from (e.g., req.query.id)
Sink: Where data should not go if tainted (e.g., child_process.exec())
Propagation: How taint moves through variables or functions.
Sanitizer: Code that cleans or validates taint (e.g., encodeURIComponent())
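The four concepts can be illustrated with a small runtime model. This is only a sketch for intuition: static analyzers reason about these flows without running the code. The `Tainted` class and the functions `read_query_param`, `sanitize`, and `run_command` are hypothetical names, and percent-encoding stands in for `encodeURIComponent()`. It is not a real defense against command injection.

```python
from urllib.parse import quote

class Tainted(str):
    """A string that came from an untrusted source."""

    def __add__(self, other):
        # Propagation: concatenating with tainted data yields tainted data.
        return Tainted(str(self) + str(other))

    def __radd__(self, other):
        return Tainted(str(other) + str(self))

def read_query_param(value: str) -> Tainted:
    """Source: everything arriving from the request is marked tainted."""
    return Tainted(value)

def sanitize(value: str) -> str:
    """Sanitizer: percent-encode and return a plain, untainted str."""
    return quote(str(value), safe="")

def run_command(command: str) -> str:
    """Sink: refuse tainted input (a real sink would execute the command)."""
    if isinstance(command, Tainted):
        raise ValueError("tainted data reached a sink")
    return f"would run: {command}"

user_id = read_query_param("42; rm -rf /")
try:
    run_command("lookup " + user_id)            # source -> propagation -> sink
except ValueError as err:
    print(err)
print(run_command("lookup " + sanitize(user_id)))  # sanitizer breaks the flow
```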
🧬 Taint Flow Diagram (Mermaid)
Real-World Examples
XSS: res.send(req.query.name)
SQLi: db.query("SELECT * FROM users WHERE id = " + req.query.id)
SSRF: http.get(req.query.url)
Internals: How Taint Analysis Works
Most modern tools use:
Control Flow Graphs (CFGs): Track possible execution paths
Data Flow Graphs (DFGs): Model how data propagates
Symbol Tables: Keep track of variables, types, and scopes
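As a rough sketch of how these pieces fit together, the snippet below runs a tiny static taint check over Python source using the standard `ast` module. It assumes straight-line code only (no branches, loops, or function calls into other code), which is exactly the part a real engine handles with a CFG. The `SOURCES`, `SINKS`, and `SANITIZERS` sets and all helper names are hypothetical. The `tainted` set plays the role of a very small symbol table.

```python
import ast

SOURCES = {"input"}       # calls whose result is untrusted (assumption)
SINKS = {"system"}        # calls that must not receive tainted data
SANITIZERS = {"quote"}    # calls whose result is considered clean

def call_name(call: ast.Call) -> str:
    """Return 'f' for f(...) and 'g' for obj.g(...)."""
    func = call.func
    if isinstance(func, ast.Name):
        return func.id
    if isinstance(func, ast.Attribute):
        return func.attr
    return ""

def expr_is_tainted(node: ast.AST, tainted: set[str]) -> bool:
    """Data flow: does this expression depend on a tainted value?"""
    if isinstance(node, ast.Name):
        return node.id in tainted
    if isinstance(node, ast.Call):
        name = call_name(node)
        if name in SOURCES:
            return True
        if name in SANITIZERS:
            return False
        return any(expr_is_tainted(arg, tainted) for arg in node.args)
    return any(expr_is_tainted(child, tainted)
               for child in ast.iter_child_nodes(node))

def find_tainted_sinks(source: str) -> list[int]:
    """Return line numbers where tainted data reaches a sink."""
    tainted: set[str] = set()
    findings: list[int] = []
    for stmt in ast.parse(source).body:       # statements in program order
        for node in ast.walk(stmt):
            if (isinstance(node, ast.Call) and call_name(node) in SINKS
                    and any(expr_is_tainted(a, tainted) for a in node.args)):
                findings.append(node.lineno)
        if isinstance(stmt, ast.Assign) and isinstance(stmt.targets[0], ast.Name):
            target = stmt.targets[0].id
            if expr_is_tainted(stmt.value, tainted):
                tainted.add(target)           # taint propagates to the target
            else:
                tainted.discard(target)       # reassigned with clean data

    return findings

CODE = """name = input()
cmd = "ping " + name
os.system(cmd)
safe = quote(name)
os.system("ping " + safe)
"""
print(find_tainted_sinks(CODE))  # [3]
```

Only the call on line 3 is reported: the second `os.system` call receives a value that passed through the sanitizer.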
⚙️ Basic Control Flow Graph (Mermaid)
⚙️ Advanced Control Flow (Loop + Branches)
Reducing False Positives
Use CFG + DFG to avoid matching on unreachable or dead code
Incorporate sanitization context into rule writing
Customize source/sink/sanitizer functions specific to your app
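The first point can be sketched with a plain reachability check over a control flow graph. The hand-written `CFG` below is a hypothetical graph for a handler that returns before its `fetch(url)` call, so the sink node has no incoming edge. The node names and helpers are assumptions for illustration.

```python
from collections import deque

# Hypothetical CFG for:
#   def handler(url):
#       return "blocked"
#       fetch(url)          # dead code after the return
CFG = {
    "entry": ["return_blocked"],
    "return_blocked": ["exit"],
    "fetch_url": ["exit"],   # sink node, but no edge leads to it
    "exit": [],
}
SINK_NODES = {"fetch_url"}

def reachable_from(cfg: dict[str, list[str]], start: str) -> set[str]:
    """Breadth-first search over CFG edges from the entry node."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for successor in cfg[node]:
            if successor not in seen:
                seen.add(successor)
                queue.append(successor)
    return seen

def reportable_sinks(cfg: dict[str, list[str]], sinks: set[str],
                     entry: str = "entry") -> set[str]:
    """Keep only sinks that some execution path can actually reach."""
    return sinks & reachable_from(cfg, entry)

print(reportable_sinks(CFG, SINK_NODES))  # set(): the dead sink is not reported
```

A text or pattern match would still flag `fetch(url)` here. Filtering by reachability is one way an engine with a CFG can drop that finding.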
Pattern vs Taint-Based Detection
| Feature | Pattern Matching | Taint Analysis |
| Scope | Single line/function | Full dataflow |
| Accuracy | Medium | High |
| Speed | Fast | Slower |
| Complexity | Simple rules | Requires CFG + DFG |
⚔️ 3. Practical Tool Comparison: SSRF in Node.js
... (Section unchanged for brevity) ...
Using Checkmarx
Full taint-aware engine
CxQuery Language: Custom query language to define patterns and flows
Accurately detects SSRF
Enterprise-grade dashboards, policy gating, and CI integrations
Using CodeQL
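// Simplified sketch, not a complete query: a real CodeQL query defines its
// sources and sinks in a taint-tracking configuration and selects the
// flow paths between them.
import javascript
from DataFlow::PathNode source, DataFlow::PathNode sink
where source.isSource() and sink.isSink() and DataFlow::localFlow(source, sink)
select source, sink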
Uses QL, a logic programming language
Highly customizable with reusable libraries (e.g., DataFlow, Security::XSS)
Visualizes paths from source to sink via VS Code + GitHub Code Scanning integrations
DevSecOps Fit
| Tool | Speed | Accuracy | Taint-Aware | CI Friendly |
| grep | ✅ Fast | ❌ Low | ❌ | ✅ |
| Semgrep OSS | ✅ Fast | ⚠️ Medium | ⚠️ Single-function only | ✅ |
| Semgrep Pro | ⚠️ Medium | ✅ Good | ✅ | ✅ |
| Checkmarx | ❌ Slow | ✅ High | ✅ | ⚠️ Medium |
| CodeQL | ❌ Slow | ✅ High | ✅ | ⚠️ Medium |
🧹 4. Summary Table: Tool Comparison
| Feature | Semgrep OSS | Semgrep Pro | Checkmarx | CodeQL |
| Pattern Matching | ✅ | ✅ | ❌ | ❌ |
| Taint Mode | ⚠️ Single-function only | ✅ | ✅ | ✅ |
| Speed | Fast | Medium | Slow | Slow |
| Accuracy | Medium | High | High | High |
| Customizability | High | High | Medium | Very High |
| Ideal For | CI + PRs | CI + Security Teams | Compliance + Large Orgs | Custom Rules + Power Users |
🗺️ 5. Final Takeaways
Start with Semgrep OSS if you're early-stage or want fast CI checks
Upgrade to Semgrep Pro for cross-function and cross-file taint analysis without enterprise overhead
Use CodeQL when building deep custom security queries for large repos
Use Checkmarx if you need mature reporting, policy gates, or integrations
Taint Mode is the Middle Path
It brings the precision of dataflow without requiring a full enterprise-grade engine. Ideal for engineering teams who want signal over noise.




