During the Operationalizing AI event in Boca Raton, participants got a chance to see some AI in action using a library called Chroma. Chroma’s documentation calls it an “open source embedding database.” Chroma uses a language model to turn text into embeddings, which lets you search for sentences that are merely similar to sentences in your documents, not just sentences that appear in them word for word. With this approach, users don’t have to remember exact sentences. Instead they can search for something similar to what’s in the documents, or even just for search terms. This makes for a far more flexible search engine for your documents.
While we won’t build an entire app together, I’ll show you the basic Python code that accomplishes the search portion. To do this, I pulled down several pages from Wikipedia and saved them as text files.
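To make “similar to” a bit more concrete: an embedding database turns each piece of text into a list of numbers (a vector) and ranks documents by how close their vectors are to the query’s vector. Here’s a minimal sketch of that idea. The three-number vectors are made up for illustration and are not output from Chroma or any real model, and cosine similarity is just one common closeness measure; Chroma’s own default distance may differ.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings"; real ones have hundreds of dimensions.
toy_vectors = {
    "gecko.txt": [0.9, 0.1, 0.2],
    "alligator.txt": [0.6, 0.7, 0.1],
}
query_vector = [0.8, 0.2, 0.3]  # pretend this encodes "nighttime lizards"

def rank(query, vectors):
    """Return document names sorted from most to least similar to the query."""
    return sorted(
        vectors,
        key=lambda name: cosine_similarity(query, vectors[name]),
        reverse=True,
    )

best_match = rank(query_vector, toy_vectors)[0]
print(best_match)
```

The point is only that “search” here means “find the nearest vector,” which is why a query doesn’t need to share any exact words with the document it matches.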
Using a GPU
If you have an NVIDIA GPU card in your computer, you should be able to use this code. If not, you can easily allocate a server on AWS. If you’re up to speed on AWS, you’ll want to use a GPU instance. The one I chose is the g4dn.xlarge instance with Ubuntu installed as described here. What’s great about these GPU instances is that each one has an actual NVIDIA GPU attached to it. NVIDIA cards provide a huge number of cores (usually more than two thousand) that are great at floating-point vector arithmetic. That computing power is exactly what you need for processing large language models, which is what we’ll be doing here.
Just remember to shut the server down! It runs about $13 per day!
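Before going further, it can help to confirm that a GPU is actually visible on whatever machine you ended up on. Here’s a small sketch that asks the nvidia-smi tool (which ships with the NVIDIA driver) for the GPU names. The helper name list_nvidia_gpus is my own, and the sketch assumes that a missing nvidia-smi means no usable NVIDIA GPU.

```python
import shutil
import subprocess

def list_nvidia_gpus():
    """Return GPU names reported by nvidia-smi, or [] if none are visible."""
    if shutil.which("nvidia-smi") is None:
        return []  # driver tools not on PATH, so assume no usable NVIDIA GPU
    try:
        completed = subprocess.run(
            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
            capture_output=True, text=True, timeout=10, check=True,
        )
    except (subprocess.SubprocessError, OSError):
        return []  # the tool exists but failed, e.g. no driver loaded
    return [line.strip() for line in completed.stdout.splitlines() if line.strip()]

gpus = list_nvidia_gpus()
print(gpus if gpus else "No NVIDIA GPU detected")
```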
Install Python and Chroma
First install Python if you haven’t done so already, using these three lines:
sudo apt install python3
sudo apt install python3-pip
sudo apt install python-is-python3
Then install Chroma by typing this:
pip install chromadb
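If you want to double-check that the install worked before writing any real code, a quick sketch like this reports whether the package can be found. It only uses the standard library, so it runs even when chromadb is missing; the helper names are my own.

```python
import importlib.util
from importlib import metadata

def is_importable(module):
    """Return True if Python can locate the named top-level module."""
    return importlib.util.find_spec(module) is not None

def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

print("chromadb importable:", is_importable("chromadb"))
print("chromadb version:", installed_version("chromadb"))
```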
Grabbing Some Wikis
Next you’ll want to copy and paste the text of several Wikipedia pages. I copied each entry into a separate text file and put the files in a directory called wiki. The searches later in this post rely on two of them, gecko.txt and alligator.txt.
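If you’d rather not create the files by hand, a few lines of Python can write pasted text into the wiki directory for you. This is only a sketch: save_entry is a hypothetical helper of mine, and it assumes you already have the page text in a string.

```python
from pathlib import Path

def save_entry(name, text, directory="wiki"):
    """Write text to <directory>/<name>.txt as UTF-8 and return the path."""
    folder = Path(directory)
    folder.mkdir(parents=True, exist_ok=True)  # create wiki/ on first use
    path = folder / f"{name}.txt"
    path.write_text(text, encoding="utf-8")
    return path

# Example (uncomment and paste your own text):
# save_entry("gecko", "...text copied from the Wikipedia page...")
```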
Creating Some Python Chroma Code
Now you’re ready for some code. Use whatever editor you know, and put the following code in. (Save it in the parent directory of the wiki directory, so that the ./wiki path in the code resolves correctly.)
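import os
import chromadb

def load_txt_files(directory):
    # Read each file in the directory; return the contents and the filenames
    txt_files = []
    names = []
    for filename in os.listdir(directory):
        file_path = os.path.join(directory, filename)
        with open(file_path, 'r') as file:
            txt_files.append(file.read())
            names.append(filename)
    return (txt_files, names)

chroma_client = chromadb.Client()
collection = chroma_client.create_collection(name="wiki")
docs, ids = load_txt_files('./wiki')
# Load the documents into the collection, using the filenames as IDs
collection.add(documents=docs, ids=ids)

done = False
while not done:
    query = input('What are you searching for? (or Q to quit) ')
    if query == 'Q':
        done = True
    else:
        results = collection.query(
            query_texts = [query],
            n_results = 1
        )
        # Print the IDs (filenames) of the closest matching documents
        print(results['ids'][0])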
Save this in whatever filename you want, provided it ends with .py. (I called it chromafun.py.)
Then run it like so:
python chromafun.py
The first thing the app does is read in the text from all the saved Wikipedia files. It saves the text in a list, along with the filenames in a separate list, returning the two lists as a tuple. (I chose this because that’s the format Chroma wants them in.)
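If your wiki directory might pick up stray files, or your text contains non-ASCII characters (Wikipedia pages often do), a slightly more defensive loader may be worth having. Here’s a sketch of the same idea. The .txt filter, the sorted order, and the explicit UTF-8 encoding are my assumptions about what you’d want, not something the script above requires.

```python
import os

def load_txt_files(directory):
    """Return (contents, filenames) for the .txt files in directory."""
    contents, names = [], []
    for filename in sorted(os.listdir(directory)):  # sorted for a stable order
        if not filename.endswith(".txt"):
            continue  # skip anything that is not a saved text file
        path = os.path.join(directory, filename)
        with open(path, "r", encoding="utf-8") as handle:
            contents.append(handle.read())
            names.append(filename)
    return contents, names
```

The two lists stay parallel: the text at position i in the first list belongs to the filename at position i in the second.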
Then I create a Chroma client and a collection named wiki. The collection will hold the documents to be searched. You load them by calling the collection’s add function, passing in the document contents and the document IDs; in this case I used the filenames as the IDs.
That’s all it takes. Now you can search using sentences similar to what’s actually found in the documents. The app prompts you for a search. The Gecko entry contains the actual sentence “geckos are usually nocturnal,” so you might try that first just as a test. It should print out gecko.txt as the resulting document. Then get creative! Here are a few sentences to try:
- geckos are nighttime
- geckos prefer nighttime
- lizards prefer nighttime
- nighttime lizards
- creatures of the night
All of these should locate the document containing the original sentence, “geckos are usually nocturnal,” and then print out the name of the file, gecko.txt.
Notice the query line and the parameter called n_results. This is how many results we want to get back in the query. Try entering “large reptile”. You’ll only get back one entry; in my case I got back gecko.txt, to my surprise. Try changing the n_results line in the code to 2, and run the program again, and again put in “large reptile”. When I do, I get back both gecko.txt and alligator.txt.
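It helps to know the shape of what query hands back. As far as I know, the result is a dictionary whose values are lists of lists, with one inner list per query text, and by default it includes a distances entry where smaller means closer. Here’s a sketch that pairs each ID with its distance. The sample dictionary is hand-written to mimic that shape, and its distance numbers are invented for illustration, not real output.

```python
def top_matches(results, query_index=0):
    """Pair each returned ID with its distance for one query text."""
    ids = results["ids"][query_index]
    distances = results["distances"][query_index]
    return list(zip(ids, distances))

# Hand-written sample in the shape collection.query() returns;
# the distance values are invented for illustration.
sample_results = {
    "ids": [["gecko.txt", "alligator.txt"]],
    "distances": [[1.21, 1.34]],
}

for doc_id, distance in top_matches(sample_results):
    print(f"{doc_id}  (distance {distance:.2f})")
```

In the real app you would pass in the dictionary returned by collection.query instead of sample_results.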
This is a pretty basic app, as all it does is search through a handful of documents that I downloaded from Wikipedia. I also kept it simple by saving plain text only, with no HTML or PDF formatting; handling those formats would likely be the next step. Then think about how you could build a full app around this simple code. You might have a large set of documents on site that you want employees to be able to search through. Or you might have, for example, software library documentation; anything, really. And instead of just printing the filename, you could make a complete web front end that prints out the titles of the documents, with links to the actual documents. Then before you know it, you have an AI-powered search engine for all of your documents.
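As a taste of the HTML step, here’s a rough sketch that pulls the visible text out of an HTML page using only Python’s standard library, so the result could be saved as a text file and loaded like the others. The TextExtractor class and html_to_text function are my own names, and this is deliberately crude: it skips script and style contents but makes no attempt to preserve layout.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside script/style
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)

def html_to_text(html):
    """Return the visible text of an HTML string, joined with spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

PDFs are a different story and generally need a third-party library, so I’ll leave those out of this sketch.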