From Grep to Taint Analysis: The Evolution of Static Code Scanning


Specialized in uncovering vulnerabilities within software supply chains and dependency ecosystems. Creator of SCAGoat and other open-source security tools. Speaker at Black Hat, DEF CON, and AppSec conferences with research on malicious package detection, dependency confusion, and CI/CD security.

Static code analysis has come a long way from the days of simple string searches. With the rising complexity of applications and threats, our tooling has evolved to meet the demand for both precision and context-awareness. This blog takes you through that journey - from grep to Semgrep, and on to powerful taint-aware engines like CodeQL and Checkmarx - with real examples and actionable insights.


🧠 1. Conceptual Foundation

What is grep and How It's Used in Security

grep is a Unix command-line utility that searches for lines matching a given regular expression. In security, it's a classic first step to:

  • Look for usage of dangerous APIs (eval, exec, system)

  • Identify insecure configurations (AWS_SECRET, password =)

  • Detect patterns in logs or diffs

grep -rnw './src' -e 'eval'

This works as a quick first pass, but it has clear limitations:

  • No context: grep doesn't understand syntax or semantics.

  • False positives: Matches might be in comments or safe use-cases.

  • False negatives: Slight variations in code syntax are missed.

  • No dataflow: Cannot trace if a tainted input reaches a sensitive sink.
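The false positives and false negatives listed above can be reproduced with a few lines of Python. This is a minimal sketch that approximates `grep -w eval` with a word-boundary regex. The sample source text is made up for illustration and is only searched, never executed.

```python
import re

# Rough equivalent of `grep -w eval`: match "eval" as a whole word.
EVAL_WORD = re.compile(r"\beval\b")

# Hypothetical source file contents, used only as text to search.
SAMPLE = '''\
# never call eval on user input
result = eval(user_input)
run = getattr(__builtins__, "ev" + "al")
doc = "eval is dangerous"
'''

def grep_lines(pattern, text):
    """Return (line_number, line) pairs whose text matches the pattern."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if pattern.search(line)
    ]

hits = grep_lines(EVAL_WORD, SAMPLE)
for number, line in hits:
    print(f"{number}: {line}")
```

Lines 1 and 4 are reported even though they are a comment and a string literal (false positives), while the obfuscated lookup on line 3 is missed entirely (a false negative).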

Abstract Syntax Trees (AST) to the Rescue

An AST is a structured, tree-like representation of code where each node corresponds to a language construct (function, variable, call, etc.). ASTs let tools understand code at a syntactic level, making matches more reliable.
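To make the idea concrete, here is a small sketch that uses Python's standard `ast` module to find direct calls to `eval`. It is an illustration of syntax-aware matching in general, not how Semgrep or any other tool is implemented, and the sample source is invented.

```python
import ast

# Hypothetical source file contents; parsed, never executed.
SAMPLE = '''\
# never call eval on user input
result = eval(user_input)
doc = "eval is dangerous"
total = evaluate(3)
'''

def find_eval_calls(source):
    """Return line numbers where the name `eval` is called directly."""
    tree = ast.parse(source)
    return [
        node.lineno
        for node in ast.walk(tree)
        if isinstance(node, ast.Call)          # a call expression...
        and isinstance(node.func, ast.Name)    # ...whose callee is a bare name...
        and node.func.id == "eval"             # ...spelled exactly "eval"
    ]

print(find_eval_calls(SAMPLE))
```

Because comments never reach the tree and string literals are constant nodes rather than calls, only the real call on line 2 is reported.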

🧠 AST Structure Example (Mermaid)

How Semgrep Leverages AST

Semgrep is an open-source static analysis tool that performs pattern matching over ASTs. Instead of regex, you write Semgrep rules using structured patterns.

Example:

rules:
  - id: no-eval
    pattern: eval(...)
    message: Avoid eval()
    severity: ERROR
    languages: [javascript]

Python and Java Examples

Python SSRF-like Issue:

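import requests
from flask import Flask, request

app = Flask(__name__)

@app.route('/proxy')
def proxy():
    # Source: attacker-controlled query parameter
    url = request.args.get('url')
    # Sink: the server fetches whatever URL it was given (SSRF)
    return requests.get(url).content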

Java SQLi:

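// Vulnerable: user input is concatenated straight into the SQL string.
String userInput = request.getParameter("user");
String query = "SELECT * FROM users WHERE name = '" + userInput + "'";
Statement stmt = connection.createStatement();
ResultSet rs = stmt.executeQuery(query); // sink: executes attacker-influenced SQL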

🧬 2. Deep Dive Into Taint Analysis

What is Taint?

"Taint" marks data that comes from untrusted sources (e.g., user input).

Key Concepts

  • Source: Where tainted data comes from (e.g., req.query.id)

  • Sink: Where data should not go if tainted (e.g., child_process.exec())

  • Propagation: How taint moves through variables or functions.

  • Sanitizer: Code that cleans or validates tainted data so it is safe for a given sink (e.g., encodeURIComponent() for a value placed in a URL component)
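The four concepts above fit into a very small static checker. The following is a toy sketch, assuming straight-line, top-level Python code only (no branches, functions, or aliasing through containers). The source, sink, and sanitizer names are hypothetical configuration, and `validate_url` in particular is an invented helper, not a real library function.

```python
import ast

SOURCES = {"request.args.get"}   # hypothetical source list
SINKS = {"requests.get"}         # hypothetical sink list
SANITIZERS = {"validate_url"}    # hypothetical sanitizer name

def dotted_name(node):
    """Render a Name/Attribute chain such as request.args.get as a string."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Attribute):
        base = dotted_name(node.value)
        return f"{base}.{node.attr}" if base else None
    return None

def is_tainted(expr, tainted):
    """Decide whether an expression may carry tainted data."""
    if isinstance(expr, ast.Name):
        return expr.id in tainted
    if isinstance(expr, ast.Call):
        name = dotted_name(expr.func)
        if name in SOURCES:
            return True       # source: introduces taint
        if name in SANITIZERS:
            return False      # sanitizer: result is treated as clean
    # Propagation: an expression is tainted if any sub-expression is.
    return any(is_tainted(child, tainted) for child in ast.iter_child_nodes(expr))

def find_tainted_sinks(source):
    """Return line numbers where a sink receives a tainted argument."""
    tainted, findings = set(), []
    for stmt in ast.parse(source).body:
        # Check sinks against the taint state before this statement runs.
        for node in ast.walk(stmt):
            if isinstance(node, ast.Call) and dotted_name(node.func) in SINKS:
                if any(is_tainted(arg, tainted) for arg in node.args):
                    findings.append(node.lineno)
        # Update the taint state for simple `name = value` assignments.
        if isinstance(stmt, ast.Assign):
            value_tainted = is_tainted(stmt.value, tainted)
            for target in stmt.targets:
                if isinstance(target, ast.Name):
                    if value_tainted:
                        tainted.add(target.id)
                    else:
                        tainted.discard(target.id)
    return findings

SAMPLE = '''\
url = request.args.get("url")
alias = url
requests.get(alias)
safe = validate_url(url)
requests.get(safe)
'''

print(find_tainted_sinks(SAMPLE))
```

Only the call on line 3 is flagged: taint propagated from `url` to `alias`, while the value passed on line 5 went through the sanitizer first.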

🧬 Taint Flow Diagram (Mermaid)

Real-World Examples

Internals: How Taint Analysis Works

Most modern tools use:

  • Control Flow Graphs (CFGs): Track possible execution paths

  • Data Flow Graphs (DFGs): Model how data propagates

  • Symbol Tables: Keep track of variables, types, and scopes
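As a rough illustration of how control flow and data flow combine, the sketch below runs a worklist-style taint propagation over a hand-written CFG that contains a loop. The node names, the tuple layout, and the four statement kinds are assumptions made for this example. Production engines build these graphs from real code and track far more detail.

```python
from collections import deque

# Hypothetical CFG for:
#   url = <source>; while cond: dest = url; fetch(dest)
# Each node maps to (kind, target_variable, used_variables, successors).
CFG = {
    "entry": ("source", "url", [], ["head"]),
    "head": ("branch", None, [], ["body", "fetch"]),
    "body": ("assign", "dest", ["url"], ["head"]),  # back edge to the loop head
    "fetch": ("sink", None, ["dest"], []),
}

def transfer(kind, target, used, tainted):
    """Compute the tainted-variable set after one node executes."""
    out = set(tainted)
    if kind == "source":
        out.add(target)
    elif kind == "sanitize":
        out.discard(target)
    elif kind == "assign":
        if out & set(used):
            out.add(target)
        else:
            out.discard(target)
    return out

def analyze(cfg, entry):
    """Return sink nodes that a tainted variable may reach on some path."""
    tainted_in = {node: set() for node in cfg}
    visited = set()
    worklist = deque([entry])
    while worklist:
        node = worklist.popleft()
        visited.add(node)
        kind, target, used, successors = cfg[node]
        out = transfer(kind, target, used, tainted_in[node])
        for succ in successors:
            # Join: a variable is tainted if it is tainted on any incoming path.
            merged = tainted_in[succ] | out
            if merged != tainted_in[succ] or succ not in visited:
                tainted_in[succ] = merged
                if succ not in worklist:
                    worklist.append(succ)
    return [
        node
        for node, (kind, _target, used, _succ) in cfg.items()
        if kind == "sink" and node in visited and tainted_in[node] & set(used)
    ]

print(analyze(CFG, "entry"))
```

On the first pass `dest` is still clean at the loop head. The taint only shows up after the back edge from `body` is processed, which is why the analysis iterates until nothing changes rather than making a single pass.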

⚙️ Basic Control Flow Graph (Mermaid)

⚙️ Advanced Control Flow (Loop + Branches)

Reducing False Positives

  • Use CFG + DFG to avoid matching on unreachable or dead code

  • Incorporate sanitization context into rule writing

  • Customize source/sink/sanitizer functions specific to your app

Pattern vs Taint-Based Detection

Feature    | Pattern Matching     | Taint Analysis
Scope      | Single line/function | Full dataflow
Accuracy   | Medium               | High
Speed      | Fast                 | Slower
Complexity | Simple rules         | Requires CFG + DFG

⚔️ 3. Practical Tool Comparison: SSRF in Node.js

... (Section unchanged for brevity) ...

Using Checkmarx

  • Full taint-aware engine

  • CxQL (Checkmarx Query Language): Custom query language to define patterns and flows

  • Accurately detects SSRF

  • Enterprise-grade dashboards, policy gating, and CI integrations

Using CodeQL

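// Sketch only: assumes a recent CodeQL release that ships the standard RequestForgeryQuery library.
import javascript
import semmle.javascript.security.dataflow.RequestForgeryQuery
import RequestForgeryFlow::PathGraph

from RequestForgeryFlow::PathNode source, RequestForgeryFlow::PathNode sink
where RequestForgeryFlow::flowPath(source, sink)
select sink.getNode(), source, sink, "Request URL depends on a user-controlled value."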
  • Uses QL, a logic programming language

  • Highly customizable with reusable libraries (e.g., DataFlow, TaintTracking)

  • Visualizes paths from source to sink via VS Code + GitHub Code Scanning integrations

DevSecOps Fit

Tool        | Speed     | Accuracy  | Taint-Aware | CI Friendly
grep        | ✅ Fast   | ❌ Low    |             |
Semgrep OSS | ✅ Fast   | ⚠️ Medium |             |
Semgrep Pro | ⚠️ Medium | ✅ Good   |             |
Checkmarx   | ❌ Slow   | ✅ High   |             | ⚠️ Medium
CodeQL      | ❌ Slow   | ✅ High   |             | ⚠️ Medium

🧹 4. Summary Table: Tool Comparison

Feature          | Semgrep OSS | Semgrep Pro         | Checkmarx               | CodeQL
Pattern Matching |             |                     |                         |
Taint Mode       |             |                     |                         |
Speed            | Fast        | Medium              | Slow                    | Slow
Accuracy         | Medium      | High                | High                    | High
Customizability  | High        | High                | Medium                  | Very High
Ideal For        | CI + PRs    | CI + Security Teams | Compliance + Large Orgs | Custom Rules + Power Users

🗺️ 5. Final Takeaways

  • Start with Semgrep OSS if you're early-stage or want fast CI checks

  • Upgrade to Semgrep Pro for taint analysis without enterprise overhead

  • Use CodeQL when building deep custom security queries for large repos

  • Use Checkmarx if you need mature reporting, policy gates, or integrations

Taint Mode is the Middle Path

Semgrep's taint mode brings the precision of dataflow analysis without requiring a full enterprise-grade engine. It is ideal for engineering teams who want signal over noise.
