When building multi-tenant applications, one of a developer’s first concerns is preventing one customer from seeing another customer’s data. In traditional SQL databases, we have decades of best practices to enforce this separation.

But what about vector databases (sometimes referred to as “AI’s long-term memory”), which have become a critical component in most large-scale enterprise AI applications?

While vector databases may offer a new way to interact with data, they share a fundamental security requirement with all database systems: secure application design.

As with all multi-tenant applications, it’s crucial to implement proper authorization checks to ensure one tenant cannot access another’s data. This article will explore how a classic application-level vulnerability can manifest in a vector database environment and, more importantly, how to implement straightforward and effective defenses.

First of All, What Are Embeddings?

Vector embeddings are numerical representations of data like text, images, or audio, transforming them into lists of numbers (vectors) that capture their inherent meaning and relationships within a high-dimensional space. These embeddings enable computers to understand and process unstructured data by representing semantic similarity: data points with similar meanings will have “close” vectors. This capability makes them fundamental to many of the most prominent types of AI applications, like semantic search, recommendation systems, natural language processing, and image analysis, as they allow machine learning algorithms to effectively analyze and derive insights from complex information.

Fundamentally, that means vector databases do not have the equivalents of columns, rows, or even tables that you would find in SQL. Nor do they use the key-value pairs, wide columns, or documents found in NoSQL solutions. Instead, they work by grouping semantically similar items together, and a query asks which stored data points are most similar to the input.

Think of it this way:

  • A traditional SQL database is like stadium seating. Each piece of data has a specific, assigned seat (row, column, table). It’s highly organized and rigid.
  • A vector database is like the open field at a festival. People (your data points) are grouped together based on their interests. People who like the same music (semantically similar data) will be clustered close together.
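
To make the idea of “close” vectors concrete, here is a minimal sketch that ranks a few items by cosine similarity. The three-dimensional vectors are hand-made illustrative values, not the output of a real embedding model, and cosine similarity is only one of several distance measures a vector database might use.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made 3-dimensional "embeddings" (illustrative values only).
embeddings = {
    "quarterly revenue": [0.9, 0.1, 0.0],
    "sales forecast": [0.8, 0.2, 0.1],
    "cat pictures": [0.0, 0.1, 0.9],
}

# Rank every item by its similarity to the "quarterly revenue" vector.
query = embeddings["quarterly revenue"]
ranked = sorted(
    embeddings,
    key=lambda name: cosine_similarity(query, embeddings[name]),
    reverse=True,
)
print(ranked)
```

The two finance-related items rank ahead of the unrelated one, which is the “people with the same interests stand together” effect from the festival analogy.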

But then how do you separate data?!

If everything is on an open field, how do you build walls between different tenants’ data?

The primary security primitive for data isolation in vector databases is the namespace. Namespaces act as distinct, isolated fields for each tenant’s data. When you query, you specify which namespace you want to search in, ensuring you only get results from that tenant’s data cluster.
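
As a rough illustration of how a namespace scopes a query, the sketch below uses a toy in-memory store. `TinyVectorStore` and its methods are hypothetical names invented for this example and do not correspond to any particular vendor’s API.

```python
import math
from collections import defaultdict

class TinyVectorStore:
    """Toy in-memory store that keeps a separate record list per namespace."""

    def __init__(self):
        self._namespaces = defaultdict(list)

    def upsert(self, namespace, record_id, vector, metadata):
        self._namespaces[namespace].append((record_id, vector, metadata))

    def query(self, namespace, vector, top_k=3):
        # Only records stored under the requested namespace are considered.
        records = self._namespaces.get(namespace, [])
        ranked = sorted(records, key=lambda r: math.dist(vector, r[1]))
        return [(rid, meta) for rid, _, meta in ranked[:top_k]]

store = TinyVectorStore()
store.upsert("tenant_a", "a1", [0.1, 0.2], {"text": "Tenant A sales report"})
store.upsert("tenant_b", "b1", [0.1, 0.2], {"text": "Tenant B sales report"})

# Identical vectors, but the namespace decides whose data comes back.
results = store.query("tenant_a", [0.1, 0.2])
print(results)
```

Note that the store returns whatever namespace it is asked for. The isolation is only as strong as the code that chooses the `namespace` argument.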

A Broken Trust Boundary

The leak happens when an application improperly trusts user-supplied input to decide which namespace to query. This is a classic Insecure Direct Object Reference (IDOR) vulnerability, adapted for vector databases.

Most applications follow a simple pattern:

 

# Standard connection setup
API_KEY = "<SUPER_SECRET_API_KEY>"
DATABASE = "AmazingDatabase"

vc = VectorDatabase(api_key=API_KEY)
db = vc.Database(DATABASE)

# Later, for a query that comes through for a different customer...
# The 'customer' variable might come from a URL or API parameter.
results = db.query(namespace=customer, vector=vectors)

 

There are many ways the namespace can be changed unintentionally. The most common is trusting the client to determine which namespace or partition to query. As an example of this broken access control, consider a web application that uses URL routing, with a path such as:

https://example.com/customer/<company_name>/sales?query=<text query>

Manipulating <company_name> in the URL on an unchecked path would end up querying another company’s information.

As an example of vulnerable code:

 

from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/customer/<company_name>/sales", methods=["GET"])
def get_sales_data(company_name):
    """
    This endpoint is VULNERABLE. It trusts the 'company_name' from the URL
    to select the database namespace.
    """
    # The user's query from the URL (e.g., ?query=projections)
    user_query = request.args.get("query", "")

    # VULNERABILITY: The 'company_name' from the URL path is passed directly
    # as the 'namespace' to the database query function. There is no check
    # to see if the logged-in user actually belongs to this company.
    # (database_query is a placeholder for the application's search helper.)
    results = database_query(namespace=company_name, query=user_query)
    return jsonify(results)

 

From here, a bad actor would set about exfiltrating data. In a vector database world this isn’t as easy as running `SELECT *` as you would in SQL, or passing an empty filter (`{}`) as you would against a NoSQL solution. Instead, the primary known method is iterative crawling.

It works by starting with a single point in the vector space and expanding outward until no new data points are discovered.

Crawl by Vector

Step 1: Get an Initial “Seed” Vector

The attacker first needs a starting point inside the namespace. They can get this by querying with a very generic vector.

  • Zero Vector: The simplest approach is to query with a vector of all zeros: [0.0, 0.0, …, 0.0]. This will return the k vectors that happen to be closest to the origin of the vector space.
  • Random Vector: Alternatively, they could generate a random vector.

This initial query gives the attacker their first set of valid vectors and, more importantly, their associated metadata.

Step 2: Use Results as New Queries (The “Crawl”)

The attacker now takes the vectors returned from the first query and uses each one as a new query vector. This is the crawling step.

  1. The attacker maintains a list of discovered vector IDs to avoid duplicate work.
  2. They take a vector they just discovered and query the database with it.
  3. This query returns its own nearest neighbors, some of which may be new, undiscovered vectors.
  4. The attacker adds any new vector IDs and their metadata to their loot and adds the new vectors to their queue of items to query with.

They repeat this process, fanning out through the data. It’s like finding one person in a crowd, asking who their closest friends are, then asking each of those friends who their closest friends are, until the entire social network is mapped.
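
The seed-and-crawl loop can be simulated against a toy index to show why it matters for defenders. This is a minimal sketch under stated assumptions: `INDEX`, `query`, and `crawl` are hypothetical names, the two-dimensional vectors are made up, and the simulated query returns stored vectors alongside their IDs, which a real API may or may not do.

```python
import math
from collections import deque

# Toy namespace contents: id -> 2-D vector (illustrative values only).
INDEX = {
    "doc1": (0.0, 0.1), "doc2": (0.2, 0.1), "doc3": (0.4, 0.2),
    "doc4": (0.6, 0.2), "doc5": (0.8, 0.3), "doc6": (1.0, 0.3),
}

def query(vector, top_k=3):
    """Return the top_k (id, vector) pairs nearest to the query vector."""
    ranked = sorted(INDEX.items(), key=lambda item: math.dist(vector, item[1]))
    return ranked[:top_k]

def crawl(seed, top_k=3):
    """Breadth-first walk: every newly seen vector becomes a new query."""
    discovered = {}          # id -> vector, doubles as the "already seen" set
    queue = deque([seed])
    queries_sent = 0
    while queue:
        current = queue.popleft()
        queries_sent += 1
        for record_id, vector in query(current, top_k):
            if record_id not in discovered:
                discovered[record_id] = vector
                queue.append(vector)
    return discovered, queries_sent

# Seed with the zero vector, as described in Step 1.
discovered, queries_sent = crawl((0.0, 0.0))
print(len(discovered), "records found with", queries_sent, "queries")
```

In this toy index the walk reaches every record. In practice a crawl can stall when clusters are far apart or top_k is small, which is one reason limiting top_k helps.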

Crawl by Text

When the application accepts only text queries rather than raw vectors, the attack is just as feasible, but it relies on using the application’s own functionality as a tool to navigate the semantic space. The core principle of “seed and crawl” remains the same, but instead of using vectors directly, the attacker uses the text metadata they exfiltrate.

Step 1: The Initial Seed Query (Text-based)

The attacker starts by submitting a very common or generic word/phrase as a query. The goal is to get any valid result from the compromised namespace.

  • Examples of seed queries: “the”, “a”, “data”, “report”, “test”, “summary”

The application takes this generic term, converts it into a vector, and returns the top_k results (top_k being the number of most similar results returned per query), typically as text snippets from the vectors’ metadata.

Step 2: Use Returned Text as New Queries (The “Crawl”)

This is the core of the text-based crawl. The attacker now has a small set of legitimate text snippets from the private dataset. Each of these snippets is a much better “seed” than the original generic term because it represents a real data point.

  1. The attacker takes a text result they just received, for example, “Q3 financial projections summary”.
  2. They submit this exact text snippet as a new query to the vulnerable endpoint.
  3. The application receives this specific phrase, embeds it, and runs a similarity search. The results will be other text chunks that are semantically very close to “Q3 financial projections summary”.
  4. The attacker collects all the new, unique text snippets from the results and adds them to their queue of queries to submit.

By repeatedly using the output of one query as the input for the next, the attacker can systematically walk through the semantic clusters of the tenant’s data, exfiltrating the content chunk by chunk.
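
The text-based variant can be simulated the same way. In the sketch below, `embed` is a deliberately crude stand-in for a real embedding model (it just compares word overlap), and `CORPUS`, `search`, and `crawl_by_text` are hypothetical names used only for illustration.

```python
from collections import deque

# Toy private corpus for one tenant (illustrative text only).
CORPUS = [
    "Q3 financial projections summary",
    "Q3 financial results by region",
    "regional sales results for EMEA",
    "EMEA hiring plan and budget",
    "budget summary for test lab",
]

def embed(text):
    """Stand-in for an embedding model: the set of lowercase words."""
    return set(text.lower().split())

def search(text_query, top_k=3):
    """Return up to top_k chunks sharing the most words with the query."""
    words = embed(text_query)
    scored = [(len(words & embed(chunk)), chunk) for chunk in CORPUS]
    hits = [pair for pair in scored if pair[0] > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in hits[:top_k]]

def crawl_by_text(seed_terms, top_k=3):
    """Feed each returned snippet back in as the next query."""
    seen = set()
    queue = deque(seed_terms)
    while queue:
        for chunk in search(queue.popleft(), top_k):
            if chunk not in seen:
                seen.add(chunk)
                queue.append(chunk)
    return seen

leaked = crawl_by_text(["summary"])
print(len(leaked), "of", len(CORPUS), "chunks recovered")
```

Every query after the seed is an exact copy of a previously returned snippet, which is the pattern a monitoring system can look for.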

Defense Techniques

Along with known defense techniques from OWASP, here are some additional ways to defend vector databases.

Use Non-Guessable Identifiers (UUIDs)

Using non-guessable identifiers, like Universally Unique Identifiers (UUIDs), is a fundamental security practice. Instead of using easily predictable tenant names like “company_a” or “tenant_1”, you use long, random strings (e.g., 855a004c-31c3-4d7a-826c-d22f87a8b417).  This method will stop bad actors from enumerating through all of the namespaces easily.  However, this is Security through Obscurity, and should only be considered one layer in the broader defense posture.
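
One way to do this is to generate a random (version 4) UUID per tenant and keep the mapping on the server. The sketch below uses Python’s standard `uuid` module. The in-memory `_namespace_registry` and the `namespace_for` helper are hypothetical; a real application would persist the mapping in a trusted datastore.

```python
import uuid

# Hypothetical server-side registry: tenant name -> opaque namespace ID.
_namespace_registry = {}

def namespace_for(tenant_name):
    """Return the tenant's random namespace ID, creating it on first use."""
    if tenant_name not in _namespace_registry:
        _namespace_registry[tenant_name] = str(uuid.uuid4())
    return _namespace_registry[tenant_name]

ns_a = namespace_for("company_a")
ns_b = namespace_for("company_b")
print(ns_a != ns_b)
```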

Enforce Server-Side Authorization

The server must never use an identifier from the client directly in a security-sensitive context. Instead, it should rely on a trusted, server-side source of information, like the user’s authenticated session.

The correct workflow is:

  1. Authenticate User: The user logs in and provides proof of identity (e.g., a session cookie or a JWT).
  2. Identify Tenant Server-Side: The application uses the authenticated user’s identity to look up their corresponding tenant ID (i.e., their namespace) from a trusted database or identity provider.
  3. Query Data: The application uses this server-derived namespace for the vector database query. Any tenant identifier provided in the URL should either be ignored for security decisions or validated against the server-side value.
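
The three steps above can be sketched as a small, framework-independent helper. The `SESSIONS` and `USER_NAMESPACE` tables, the `Forbidden` exception, and `resolve_namespace` are all hypothetical stand-ins for an application’s real session store and user database or identity provider.

```python
# Hypothetical trusted, server-side lookup tables.
SESSIONS = {"session-abc": "alice"}
USER_NAMESPACE = {"alice": "855a004c-31c3-4d7a-826c-d22f87a8b417"}

class Forbidden(Exception):
    """Raised when the caller may not access the requested tenant."""

def resolve_namespace(session_token, requested_namespace=None):
    """Derive the namespace from the session, never from client input."""
    # Step 1: authenticate the caller.
    user = SESSIONS.get(session_token)
    if user is None:
        raise Forbidden("not authenticated")

    # Step 2: look up the tenant on the server side.
    namespace = USER_NAMESPACE.get(user)
    if namespace is None:
        raise Forbidden("user has no tenant")

    # Step 3: a client-supplied value is only validated, never trusted.
    if requested_namespace is not None and requested_namespace != namespace:
        raise Forbidden("tenant mismatch")
    return namespace

namespace = resolve_namespace("session-abc")
print(namespace)
```

The returned value, not anything from the URL, is what should be passed as the namespace to the vector database query.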

Detecting a Crawl Attempt

Even with strong preventative measures, you should have monitoring in place to detect suspicious activity. Iterative crawling generates a distinct, machine-like pattern that stands out from normal human behavior. Here are the key signals to watch for:

  1. High Query Volume: The most obvious signal is a sudden, sustained spike in the number of queries coming from a single user or IP address. An attacker will try to exfiltrate data as quickly as possible. Implement rate limiting on your API endpoints to slow them down and trigger alerts on high-velocity usage.
  2. Repetitive Query Patterns: This is the smoking gun for this specific attack. The attacker uses the output of one query as the input for the next. You can detect this by logging query inputs and their corresponding results. Trigger an alert if a user consistently submits new queries that are identical or highly similar to the text metadata they just received in a previous result.
  3. Anomalous top_k Requests: Most applications show a standard number of nearest neighbors (top_k), e.g., the top 5 or 10. If an attacker finds a way to manipulate the top_k value and requests an unusually high number (e.g., top_k=1000), it’s a major red flag for a data exfiltration attempt.
  4. High Data Egress: Monitor the amount of data (specifically the text metadata) being sent to a single user. If an account that normally retrieves a few kilobytes of data per day suddenly pulls megabytes, it warrants investigation.
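
As a rough sketch of how the first three signals might be checked in application code, consider the monitor below. The class name, the alert labels, and all three thresholds are assumptions chosen for illustration and would need tuning against real traffic. Data egress monitoring is omitted for brevity.

```python
import time
from collections import defaultdict, deque

MAX_QUERIES_PER_MINUTE = 30   # assumed threshold
MAX_TOP_K = 10                # assumed application limit
ECHO_ALERT_THRESHOLD = 3      # assumed: consecutive echoed queries

class CrawlMonitor:
    """Tracks per-user query behavior and reports suspicious patterns."""

    def __init__(self):
        self._query_times = defaultdict(deque)
        self._last_results = defaultdict(set)
        self._echo_counts = defaultdict(int)

    def check(self, user_id, query_text, top_k, now=None):
        """Return a list of alert labels raised by this query."""
        now = time.monotonic() if now is None else now
        alerts = []

        # Signal 1: query volume in a sliding 60-second window.
        times = self._query_times[user_id]
        times.append(now)
        while times and now - times[0] > 60:
            times.popleft()
        if len(times) > MAX_QUERIES_PER_MINUTE:
            alerts.append("high_query_volume")

        # Signal 2: the query repeats text from the previous result set.
        if query_text.strip().lower() in self._last_results[user_id]:
            self._echo_counts[user_id] += 1
            if self._echo_counts[user_id] >= ECHO_ALERT_THRESHOLD:
                alerts.append("echoed_result_queries")
        else:
            self._echo_counts[user_id] = 0

        # Signal 3: an unusually large top_k.
        if top_k > MAX_TOP_K:
            alerts.append("anomalous_top_k")
        return alerts

    def record_results(self, user_id, result_texts):
        """Remember what was just returned so the next query can be compared."""
        self._last_results[user_id] = {t.strip().lower() for t in result_texts}

monitor = CrawlMonitor()
print(monitor.check("user-1", "report", top_k=1000, now=0.0))
```

This version only catches exact echoes of returned text. Catching “highly similar” queries would require comparing embeddings, which is left out here.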

New Tech, Timeless Principles

The shift to vector databases represents a powerful evolution in data management, but it demands a parallel evolution in our security mindset. The very feature that makes these databases so effective, organizing data by semantic meaning in a vast, open space, places immense importance on the digital walls we build between tenants. As discussed, a classic web application vulnerability like an Insecure Direct Object Reference (IDOR) can provide an attacker the key to another tenant’s private space, allowing them to map and exfiltrate data through iterative crawling.

By combining strong preventative measures, such as non-guessable namespace IDs and robust server-side authorization, with vigilant detective controls that watch for the tell-tale patterns of a crawling attack, developers can confidently build secure multi-tenant AI applications.
