How to use Generative AI & Vector Embeddings with Ruby (and a little bit of Python)

Artificial intelligence seems to have reached a peak in popularity, and for good reason: it’s advancing very fast, it has reached the public eye (thanks to ChatGPT), and there is VERY active research in the field right now.

So I fully immersed myself into the topic (the best way to learn something new, btw) so I can understand it & distill the basic concepts & practical ideas for you!

What we are going to do is build a small project, which is a RAG-powered question-answering bot.

Let’s get started!

What is RAG & Why is it Useful?

So the 1st thing we need to address is the concept of RAG. It means Retrieval-Augmented Generation, and it’s used to feed a generative AI model (like ChatGPT) with extra content & information so that it can produce a more accurate answer.

In addition…

RAG is very useful when we have private documents that our AI model hasn’t been trained on. Like your company sales figures, your personal notes, or some website that requires login.

So what we’re able to do is: take this data, clean it, chunk it into smaller pieces, generate something called “vector embeddings”, and then save those embeddings in a special database that we can query to find the relevant chunks from our documents.

Then we feed these to a generative AI & ask it to come up with an answer using ONLY the provided context.

So that’s the idea of RAG.

Notice this is different from fine-tuning. When you fine-tune a model you’re giving it additional training so that it becomes more specialized for a particular task, while RAG leaves the model untouched & only changes what you put in the prompt.

Now let’s move on to the next topic!

What are Vector Embeddings?

I mentioned this fancy term, vector embeddings, what is this mysterious entity? Well, embeddings are just arrays of many floating-point numbers.

They are used to capture meaning & context from text, for example, the word “cat”, has an embedding that may look like this:

[-0.015128633938729763, 6.5704266489774454e-06, 0.013684336096048355, 0.000725181307643652, 0.0792837142944336, 0.02489437721669674, 0.04961267486214638, 0.04444398730993271, -0.0032522594556212425, -0.07948382198810577, ...]

The thing is, if we generate embeddings for similar words, like “animal” or “kitten”, their values will be closer to the embedding for “cat” than the values for an unrelated word like “house”.

So that’s how we can find related words & sentences, thanks to these vector embeddings.

Notice a few important things:

  • First, embeddings are not exclusive to text, it’s also possible to generate them for images, audio, etc.
  • Second, embeddings are generated by Machine Learning models, and each model will generate different embeddings.
  • Third, there are specialized models for generating embeddings (some freely available, others paid, like the ones from OpenAI).
  • Fourth, generating embeddings using these models is way cheaper than generating text completions (like GPT).

Ok, fine, we should have an idea of what we are dealing with, but how can we generate them ourselves?

And how can we store these embeddings so we can use them?

That’s coming next!

Generating & Storing Text Embeddings

It’s time to get our hands dirty & write some code… We have a few options to generate embeddings: we can use a paid API, or we can run a model locally.

If you have a modern GPU with at least 6GB of VRAM then running locally for development & education is an option, with some models, you may even get away with less.

Now here is the thing, to run a model locally there are also many options, an app called Ollama allows us to run ML (machine learning) models locally & expose them as an API.

But for the sake of learning & to have more control over the whole thing, we are going to write a bit of Python code (don’t worry, it’s just basic stuff, and I used ChatGPT to help me get started btw) that wraps around a model runner & sets up an API endpoint that we can use from Ruby.

Using a bit of Python here will just make our life easier with running the model part, but for the rest of the code, we can still use Ruby.

Here is the vector embeddings Python endpoint:

# Step 1: Import necessary libraries
from flask import Flask, jsonify, request
from sentence_transformers import SentenceTransformer

# Step 2: Create a Flask application
app = Flask(__name__)

# Step 3: Define your function
def get_embeddings(data):
    sentences = data
    model = SentenceTransformer('thenlper/gte-base')
    embeddings = model.encode(sentences)

    return embeddings

# Step 4: Create an API endpoint
@app.route('/get_embeddings', methods=['POST'])
def api_call_function():
    # Get every 'data' value from the form-encoded request body
    result = get_embeddings(request.form.getlist('data'))

    # Return the result as JSON
    return jsonify({"embeddings": result.tolist()})

# Step 5: Run the flask application locally
if __name__ == '__main__':
    app.run(port=5000)

You’ll need to install some packages, via the pip package manager (it’s similar to RubyGems).

Like this:

pip install flask sentence_transformers

Also, notice that Python is very picky about white space, so if you get syntax errors make sure everything is correctly indented & there are no extra spaces (blank lines are fine).

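Now run the code with the python command, passing the name of the file where you saved it (embeddings_api.py is just an example name, use your own):

python embeddings_api.py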

This will run a Flask server (similar to Sinatra in Ruby) on port 5000, which we will now use.

Here is the Ruby code we need to generate & print our embeddings:

require 'json'
require 'uri'
require 'excon'

EMBEDDINGS_API_URL = "http://localhost:5000/get_embeddings"

class SentenceTransformer
  # Sends one or more strings to the Python endpoint & returns an array of vectors
  def self.get_embeddings_from_model(data)
    response = Excon.post(
      EMBEDDINGS_API_URL,
      body: URI.encode_www_form({data: data}),
      headers: { "Content-Type" => "application/x-www-form-urlencoded" }
    )

    JSON.parse(response.body)["embeddings"]
  end
end

data = ARGV[0..]

vectors = SentenceTransformer.get_embeddings_from_model(data)

# Array of vector embeddings
p vectors

# How many vectors we are getting back
p vectors.size

# Size of an individual vector
p vectors.first.size

We can run it like this:

ruby get_embeddings.rb "cat"

And we should see our embeddings printed for the word “cat”, remember that you can also send in full sentences, it’s not only for single words.
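If you’re curious about what actually goes over the wire when you pass several words or sentences, here is a small sketch that only uses Ruby’s standard library. It shows how URI.encode_www_form turns an array into repeated data keys, which is the shape that getlist('data') reads on the Python side. The sample inputs are made up for illustration.

```ruby
require 'uri'

# Made-up inputs, just for illustration
inputs = ["cat", "a small kitten"]

# An array value becomes one key=value pair per element
form_body = URI.encode_www_form({ data: inputs })

puts form_body

# Decoding the body gives back one value per sentence
decoded = URI.decode_www_form(form_body).map { |_key, value| value }

p decoded
```

This is why the same endpoint works for a single word & for a whole list of sentences.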

The 1st time you try to fetch your embeddings the Python code will download the ML model from Hugging Face. This particular model, thenlper/gte-base, requires about 300 MB of disk space. Other models, such as those for creating a chat experience or images, can be much larger (>= 3 GB).

In addition, we can use curl:

curl http://localhost:5000/get_embeddings -d "data=cat"

This can serve as a verification method to make sure everything is working correctly. It should produce the same numerical values for the same word & model.

Storing our Embeddings

Now that we’re able to create embeddings we want to store them.

You can find many vector databases out there, like Pinecone, pgvector, and even Redis, but for this example, I’m going to use Chroma DB.

Install Chroma on your system (the pip package provides the server, the gem is the Ruby client):

pip install chromadb
gem install chroma-db

Start the DB inside your project directory:

cd my-sexy-ai-project
chroma run

Connect to the DB & add new embeddings:

require 'chroma-db'
require 'logger'
require 'digest'

# This script also needs the SentenceTransformer class from the previous example

Chroma.connect_host = "http://localhost:8000"
Chroma.logger       = Logger.new($stdout)
Chroma.log_level    = Chroma::LEVEL_ERROR

# Check current Chroma server version
version = Chroma::Resources::Database.version
puts "Connected to DB: v #{version}"

# Example values, pick your own collection name & source label
collection_name   = "my_documents"
EMBEDDINGS_SOURCE = "Intro to Stats 3ed"

# Create a collection, this is like a new table on SQL
collection = Chroma::Resources::Collection.create(collection_name, { lang: "ruby", gem: "chroma-db" })

document = ARGV[0]
vectors  = SentenceTransformer.get_embeddings_from_model(document)

# MD5 digest is used here to create unique IDs, but you can use any kind of id-generation scheme that you like, as long as they are unique for each embedding

# Inside the metadata key you can include any info that you find convenient, it's often useful to include the source of your data, especially if you're sourcing from multiple documents

vectors.each do |vector|
  collection.add(
    [
      Chroma::Resources::Embedding.new(
        id: Digest::MD5.hexdigest(document),
        embedding: vector,
        document: document,
        metadata: { source: EMBEDDINGS_SOURCE }
      )
    ]
  )
end
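One thing to watch out for: once you start storing many chunks, every chunk needs its own ID, and two chunks with the same text would produce the same MD5 digest. Here is a minimal sketch of one possible scheme, which hashes the source name, the chunk position & the chunk text together. The chunk_id helper and the sample chunks are hypothetical, they are not part of the chroma-db gem.

```ruby
require 'digest'

# Hypothetical helper: builds a stable ID from the source, position & text
def chunk_id(source, index, text)
  Digest::MD5.hexdigest("#{source}:#{index}:#{text}")
end

source = "Intro to Stats 3ed"

# Made-up chunks, notice that the first & last have the same text
chunks = ["First chunk of text.", "Second chunk of text.", "First chunk of text."]

ids = chunks.each_with_index.map { |text, index| chunk_id(source, index, text) }

# Same input always gives the same ID, so re-running an import is repeatable
p ids.first == chunk_id(source, 0, chunks.first)

# Repeated text still gets a distinct ID because the index is part of the hash
p ids.uniq.size == chunks.size
```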

Once we have our embeddings we can query the DB:

# I added a whole free book on statistics to my Chroma Db, this is my example query
query = "how to calculate Z value for normals"

# Just one result as an example, but you should try 3-4 results
results = collection.query(
    query_embeddings: SentenceTransformer.get_embeddings_from_model(query),
    results: 1
)

pp results

The output ends with the matched document & its metadata (the beginning is trimmed here):

   "Statistical software can also be used. A normal probability table is given in Appendix B.1 on page 427 and abbreviated in Table 3.8. We use this table to identify the percentile corresponding to any particular Z-score. For instance, the percentile of Z = 0.43 is shown in row 0.4 and column 0.03 in Table 3.8: 0.6664, or the 66.64th percentile",
  @metadata={"source"=>"Intro to Stats 3ed"}>]

This will return an array of Chroma objects. The two values we care most about are “document”, which refers to the original content we saved along with its embedding, & “distance”, which refers to how close your query is to the retrieved embedding (smaller means closer).

Important things:

  • The vector database takes embeddings for its query because it compares embeddings with embeddings using math. Specifically, it uses a “similarity function”, and one such function is “cosine similarity”.
  • Try querying for the same word, for example, if you added “chocolate” search for “chocolate”, then “dessert”, and “person”, to see the differences in the distance value.
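To make the “similarity function” idea less abstract, here is a minimal sketch of cosine similarity in plain Ruby. The three vectors are made-up 3-dimensional examples (real embeddings have hundreds of dimensions), and this is only an illustration of the math, not what your vector database runs internally.

```ruby
# Cosine similarity: dot product divided by the product of the vector lengths
def cosine_similarity(a, b)
  dot    = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })

  dot / (norm_a * norm_b)
end

# Made-up vectors, NOT real embeddings
cat    = [0.9, 0.8, 0.1]
kitten = [0.85, 0.75, 0.2]
house  = [0.1, 0.2, 0.9]

# Related words should score closer to 1.0
puts cosine_similarity(cat, kitten)

# Unrelated words should score lower
puts cosine_similarity(cat, house)
```

A value near 1.0 means the vectors point in almost the same direction. A distance works the other way around, smaller means more similar.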

That’s it for embeddings, now you can take some content, like your class notes for this semester, or your favorite PDF, and break them down into small chunks (256 to 512 characters works well) & add them into Chroma, then try some queries to see the results.

You could also get your favorite videos or podcasts transcribed, or anything else you like. Just giving you some ideas to play around with.

Chunking Data

You could just break your data at a fixed amount of characters, but you will probably break words & sentences in half, which is no bueno.
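To see the problem in action, here is a tiny sketch of the naive approach in plain Ruby. The sample text & the 20-character size are made up, just to keep the output short.

```ruby
text = "Embeddings capture meaning. Chunks should keep sentences whole."

# Naive approach: cut every 20 characters, no matter what is there
naive_chunks = text.scan(/.{1,20}/m)

naive_chunks.each { |chunk| p chunk }
```

The very first chunk already ends in the middle of the word “meaning”, so part of the context is lost before we even generate an embedding.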

It’s helpful to try & maintain the original context as much as possible when creating your chunks, to do this you can use the baran Ruby gem.

Like this:

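require 'baran'

# Example values, tune them for your own data
CHUNK_SIZE      = 512
OVERLAP_PERCENT = 0.25

splitter = Baran::RecursiveCharacterTextSplitter.new(
    chunk_size: CHUNK_SIZE,
    chunk_overlap: (CHUNK_SIZE * OVERLAP_PERCENT).to_i,
    separators: ["\n\n", "\n", ".", ","]
)

# text is the content you want to split
chunks = splitter.chunks(text)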

This will return an array of chunks & be smarter about the chunking process.

You can try different chunk sizes & overlaps, but from my testing you don’t want a chunk size bigger than 1024 characters, as these ML models have a limit to the amount of text they can work with for a single embedding.
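If the overlap idea isn’t clear yet, this small sketch may help. It’s a simplified sliding window that counts words instead of characters to keep the output readable, so it is not how baran works internally. The chunk_words helper & the sample text are hypothetical.

```ruby
# Hypothetical helper: groups words into chunks that share `overlap` words
def chunk_words(text, chunk_size:, overlap:)
  raise ArgumentError, "overlap must be smaller than chunk_size" if overlap >= chunk_size

  words  = text.split
  step   = chunk_size - overlap
  chunks = []
  start  = 0

  while start < words.size
    chunks << words[start, chunk_size].join(" ")

    # Stop once the current chunk has reached the end of the text
    break if start + chunk_size >= words.size

    start += step
  end

  chunks
end

text = "one two three four five six seven eight nine ten"

chunk_words(text, chunk_size: 4, overlap: 2).each { |chunk| p chunk }
```

Each chunk repeats the last two words of the previous one, so a sentence that falls on a boundary still shows up complete in at least one chunk.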

Generating an Answer


We got our vector DB loaded up with our beautiful data & we can query it, we should get back some documents related to our query, now what?

We send it to a “smarter” AI model to produce our answer, the easy way is to use GPT-3.5-turbo, with this prompt:

# This is your question, it can be the same one you searched for on Chroma
question = "..."

# This is the joined chunks from Chroma DB
context  = "..."

chat_prompt = "Answer the following user question using ONLY the given context, never mention the context, just answer the question. Question: #{question}. Context: #{context}"
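If you’re wondering how to get from the query results to that context string, here is a minimal sketch. It uses a Struct as a stand-in for the objects returned by the vector DB, and the documents, distances & the MAX_CHUNKS limit are all made up for illustration.

```ruby
# Stand-in for the objects returned by the vector DB query (hypothetical data)
Result = Struct.new(:document, :distance)

results = [
  Result.new("Statistical software can also be used.", 0.48),
  Result.new("The percentile of Z = 0.43 is 0.6664.", 0.21),
  Result.new("A normal probability table is given in Appendix B.1.", 0.35)
]

# Made-up limit, the idea is to keep the prompt short
MAX_CHUNKS = 2

# Closest chunks first, then join the text of the ones we keep
context = results
  .sort_by(&:distance)
  .first(MAX_CHUNKS)
  .map(&:document)
  .join("\n\n")

question = "how to calculate Z value for normals"

chat_prompt = "Answer the following user question using ONLY the given context, never mention the context, just answer the question. Question: #{question}. Context: #{context}"

puts chat_prompt
```

Sorting by distance before trimming means that, if you have to drop chunks to stay within the model’s limits, you drop the least relevant ones.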

You can use the ruby-openai gem for the API:

require 'openai'

client = OpenAI::Client.new(access_token: "<your API key>")

response = client.chat(
    parameters: {
        model: "gpt-3.5-turbo",
        messages: [
            { role: "user", content: chat_prompt }
        ]
    }
)

if response.key?("error")
    puts "Error: #{response}"
else
    puts response.dig("choices", 0, "message", "content")
end

And GPT will summarize your data in a way that answers the question.
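If you want to play with the response handling without spending API credits, here is a small sketch that uses hand-written hashes. It assumes the response comes back as a parsed hash shaped like the chat completion JSON. The extract_answer helper & both sample hashes are hypothetical.

```ruby
# Hypothetical helper: returns the answer text or a readable error message
def extract_answer(response)
  if response.key?("error")
    "Error: #{response.dig("error", "message")}"
  else
    response.dig("choices", 0, "message", "content")
  end
end

# Hand-written hashes shaped like the parsed JSON responses
success = {
  "choices" => [
    { "message" => { "role" => "assistant", "content" => "Look up the Z-score in a normal probability table." } }
  ]
}

failure = { "error" => { "message" => "Incorrect API key provided" } }

puts extract_answer(success)
puts extract_answer(failure)
```

Using dig here means a response with a missing key gives you nil instead of raising an exception.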

If your data is more private or if you don’t want to give OpenAI your moneys, you can run a local generative model like Mistral 7B using the Python API technique, like we did for vector embeddings.


And there you have it, the journey from the exciting realms of Generative AI & Vector Embeddings to the practical integration of these concepts using Ruby, with a sprinkle of Python for the heavy lifting.

Thanks for reading & have fun! 🙂