Harnessing Semgrep for Multi-File Vulnerability Detection and Reducing False Positives -- miguellopes.net

In the realm of application security, the term “shift-left” has gained prominence. It emphasizes identifying and addressing security issues earlier in the development lifecycle, ideally as code is being written. Semgrep, with its pattern-based code scanning capabilities, is a quintessential tool for this shift-left movement. However, while it’s adept at flagging potential vulnerabilities in individual files, its open-source engine analyzes each file in isolation. This can lead to false positives, especially when context spanning multiple files is crucial for accurate vulnerability detection.

In this post, we’ll explore a method to not only harness Semgrep for multi-file vulnerability detection but also to significantly reduce false positives by understanding the broader context.

The Multi-File Vulnerability Challenge

Consider a hypothetical scenario in a Python application:

1. source.py: A module that captures user input.

# source.py

def get_user_input():
    return input("Enter a query: ")

if __name__ == "__main__":
    user_query = get_user_input()
    # Import the function that is actually called below
    from database_sink import execute_query_internal
    execute_query_internal(user_query)

2. database_sink.py: A module with multiple functions that interact with a database. While some functions are safe, others can introduce vulnerabilities like SQL injection if used improperly.

# database_sink.py

import sqlite3

def safe_execute_query(query):
    # Uses parameterized query to prevent SQL injection
    conn = sqlite3.connect('example.db')
    cursor = conn.cursor()
    cursor.execute("SELECT * FROM users WHERE name=?", (query,))
    return cursor.fetchall()

def execute_query_internal(query):
    # Vulnerable to SQL injection
    conn = sqlite3.connect('example.db')
    cursor = conn.cursor()
    cursor.execute(query)
    return cursor.fetchall()

In this setup, the execute_query_internal function in database_sink.py is vulnerable to SQL injection, especially if it directly receives unvalidated input from source.py.
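To make the difference between the two functions concrete, here is a minimal sketch using an in-memory SQLite database. The table layout, the sample rows, and the helper names (make_db, safe_lookup, unsafe_execute) are assumptions made for illustration; they are not part of the modules above.

```python
import sqlite3

def make_db():
    # In-memory database with a hypothetical users table (illustration only)
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("alice", "a-secret"), ("bob", "b-secret")],
    )
    return conn

def safe_lookup(conn, name):
    # The value is bound as a parameter, so it is never parsed as SQL
    return conn.execute("SELECT name FROM users WHERE name=?", (name,)).fetchall()

def unsafe_execute(conn, query):
    # Mirrors execute_query_internal: the caller controls the entire statement
    return conn.execute(query).fetchall()

conn = make_db()

# A classic injection payload is treated as a plain (non-matching) name
safe_rows = safe_lookup(conn, "x' OR '1'='1")

# With the unsafe function, whoever supplies the input chooses what runs
unsafe_rows = unsafe_execute(conn, "SELECT name, secret FROM users ORDER BY name")
```

With the parameterized version the payload matches no rows, while the unsafe version returns whatever the attacker-supplied statement selects, including columns the application never meant to expose.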

A Two-Pronged Solution: Semgrep + Python

1. Crafting Semgrep Rules:

Using Semgrep’s YAML-based rule definitions, we create:

Source Rule: Detects functions or lines capturing user input.

rules:
- id: user-input-source
  languages:
    - python
  message: User input source detected
  severity: WARNING
  patterns:
    - pattern: input(...)

Sink Rule: Pinpoints the vulnerable database interaction function.

rules:
- id: vulnerable-db-function
  languages:
    - python
  message: Vulnerable database function detected
  severity: WARNING
  patterns:
    - pattern: cursor.execute($QUERY)

2. Python Script for Contextual Correlation:

With our Semgrep rules in place, we use Python to:

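Execute the Semgrep rules on the respective files.
Analyze the results. If both the source and the vulnerable sink are detected, the script flags a potential vulnerability. Note that the script below treats the co-occurrence of a source and a sink as a proxy for a logical flow between them; it does not itself verify that the input reaches the sink.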

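import json
import os
import subprocess
import tempfile

# ... [Semgrep rules defined here] ...

def run_semgrep(rule_content, target_file):
    # Write the rule to a temporary .yaml file that Semgrep can load via --config
    with tempfile.NamedTemporaryFile("w", suffix=".yaml", delete=False) as rule_file:
        rule_file.write(rule_content)
    try:
        result = subprocess.run(
            # --json makes Semgrep print machine-readable findings to stdout
            ["semgrep", "--config", rule_file.name, "--json", target_file],
            text=True,
            capture_output=True,
        )
    finally:
        os.unlink(rule_file.name)
    return json.loads(result.stdout)

source_results = run_semgrep(source_rule, "source.py")
sink_results = run_semgrep(sink_rule, "database_sink.py")

# Flag only when both a source and a sink were found
if source_results["results"] and sink_results["results"]:
    print("Potential vulnerability detected!")
    print("Source in:", source_results["results"][0]["path"])
    print("Sink in:", sink_results["results"][0]["path"])
else:
    print("No correlated vulnerability found.")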
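The script above only checks that both rules fired. One way to tighten the correlation is to confirm that the source file actually imports and calls the function containing the sink finding. This is a rough sketch of that idea using Python's standard ast module. The helper names (called_imports, enclosing_function, correlate) and the inlined SOURCE and SINK strings are hypothetical, and the sink line number is hard-coded here where a real run would presumably take it from each finding's start line in Semgrep's JSON output.

```python
import ast

def called_imports(source_code, sink_module):
    """Names imported from sink_module that source_code also calls directly."""
    local_to_original = {}
    called = set()
    for node in ast.walk(ast.parse(source_code)):
        if isinstance(node, ast.ImportFrom) and node.module == sink_module:
            for alias in node.names:
                # Map the local name (possibly an alias) back to the original
                local_to_original[alias.asname or alias.name] = alias.name
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            called.add(node.func.id)
    return {orig for local, orig in local_to_original.items() if local in called}

def enclosing_function(sink_code, line):
    """Name of the function whose body spans the 1-based line, or None."""
    for node in ast.walk(ast.parse(sink_code)):
        if isinstance(node, ast.FunctionDef) and node.lineno <= line <= node.end_lineno:
            return node.name
    return None

def correlate(source_code, sink_code, sink_module, sink_lines):
    """Sink functions that contain a finding and are called from the source file."""
    reachable = called_imports(source_code, sink_module)
    flagged = {enclosing_function(sink_code, line) for line in sink_lines}
    return sorted(reachable & flagged)

# Hypothetical stand-ins for source.py and database_sink.py
SOURCE = '''
from database_sink import execute_query_internal

def get_user_input():
    return input("Enter a query: ")

execute_query_internal(get_user_input())
'''

SINK = '''
import sqlite3

def safe_execute_query(query):
    cursor = sqlite3.connect("example.db").cursor()
    cursor.execute("SELECT * FROM users WHERE name=?", (query,))
    return cursor.fetchall()

def execute_query_internal(query):
    cursor = sqlite3.connect("example.db").cursor()
    cursor.execute(query)
    return cursor.fetchall()
'''

# Line 11 of SINK is the cursor.execute(query) call the sink rule would match
confirmed = correlate(SOURCE, SINK, "database_sink", sink_lines=[11])
```

This sketch only recognizes `from module import name` followed by a direct call. It does not handle `import module` with attribute calls, and it does not track whether the user input is the value actually passed, so it narrows the findings rather than proving a data flow.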

Reducing False Positives

By taking into account the broader context and the flow of data across multiple files, this approach can substantially reduce false positives. Instead of flagging every instance of user input or every database function, we raise an alarm only when a source and a vulnerable sink are both present and plausibly connected. The stronger the correlation check (for example, confirming that the source module actually calls the flagged function), the fewer spurious alerts reach developers, which reduces noise and the risk of “alert fatigue”. More importantly, it lets teams focus on likely, evidence-backed problems rather than getting sidetracked by potential but unsubstantiated issues.

Conclusion

Semgrep is undeniably a powerful tool for the shift-left movement, allowing developers to catch vulnerabilities early in the development process. However, the approach of extending its capabilities for multi-file vulnerability detection, as outlined in this post, is particularly suited for application security (AppSec) engineers during code reviews. This method provides a deeper, more holistic view of potential vulnerabilities, especially those that span across multiple files or modules. While developers can benefit from running Semgrep on an ad-hoc basis to catch straightforward issues, the nuanced detection of multi-file vulnerabilities is best handled by AppSec professionals who can understand and interpret the broader context. By leveraging this approach, AppSec engineers can ensure a more comprehensive and accurate security review, further bolstering the application’s defense against potential threats.
