Generate sentence embeddings
In this lesson, you will use the cleaned movie dataset to create vector embeddings from the plot summaries, using the model you selected earlier in your `python/.env` file.
Each embedding will be stored alongside its metadata, preparing it for the next step of writing to the blockchain.
Filter by box office revenue (optional)
To streamline processing, the script filters out movies with low box office revenue.
The default threshold is $100,000,000, which yields around 1,000 movies. You can adjust this value in your `python/.env` file by modifying:

`BOX_OFFICE_THRESHOLD=100_000_000`
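A detail worth noting is that Python accepts underscores as digit separators when parsing integers, which is why the `.env` value can be written as `100_000_000`. A minimal sketch of how a script might read such a threshold from the environment (the variable name matches the lesson; the `setdefault` default is only for illustration):

```python
import os

# For illustration, supply a default; in the real setup the value
# comes from python/.env via your environment loader.
os.environ.setdefault("BOX_OFFICE_THRESHOLD", "100_000_000")

# int() accepts underscore digit separators (PEP 515).
threshold = int(os.environ["BOX_OFFICE_THRESHOLD"])
print(threshold)  # 100000000
```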
- Setting a lower threshold includes more movies (e.g., `0` includes the complete dataset of approximately 42,000).
- Setting a higher threshold includes fewer movies, speeding up processing (e.g., `1_000_000_000` returns only a handful).
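Conceptually, the filter step simply keeps rows whose revenue meets the threshold. A sketch with illustrative in-memory rows (the column names here are hypothetical, not necessarily those used in `movie_data.csv`):

```python
# Illustrative rows; the real script reads them from data/movie_data.csv.
movies = [
    {"title": "Blockbuster", "box_office": 250_000_000, "plot": "..."},
    {"title": "Indie Gem", "box_office": 3_000_000, "plot": "..."},
]

BOX_OFFICE_THRESHOLD = 100_000_000

# Keep only movies at or above the revenue threshold.
kept = [m for m in movies if m["box_office"] >= BOX_OFFICE_THRESHOLD]
print([m["title"] for m in kept])  # ['Blockbuster']
```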
Run the script
Navigate to the `python/` folder and execute:

`python vectorize.py`
This command loads `data/movie_data.csv`, filters the rows by box office revenue, encodes each plot into a vector using the model specified in your `.env` file, and saves the results to a JSONL file.
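The load-filter-encode-save pipeline can be sketched as follows. The `embed` function here is a stand-in for the sentence-embedding model named in your `.env` file (a real script would call something like `model.encode(plot)`); the sample row is illustrative:

```python
import json

def embed(text):
    # Placeholder for the real sentence-embedding model; returns a
    # fixed-size vector so the JSONL structure can be demonstrated.
    return [float(len(text)), 0.0]

# Stand-in for rows already filtered by box office revenue.
rows = [{"title": "Blockbuster", "plot": "A heist goes wrong."}]

# Each JSONL line is one JSON object: metadata plus a "vector" field.
lines = [json.dumps({**row, "vector": embed(row["plot"])}) for row in rows]
jsonl = "\n".join(lines)
print(jsonl)
```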
Output file
The output will be saved to `data/movie_vectors.jsonl`.
Each line in this file represents a JSON object that contains:
- Complete movie metadata (title, plot, release date, etc.)
- A `"vector"` field that stores the embedded plot
This file holds both the vectors and the metadata, ready to be written on-chain in the upcoming step.
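To sanity-check the output, you can parse the file back one line at a time, since each line is an independent JSON object. A sketch using a string as a stand-in for the file contents (in practice you would iterate over `open("data/movie_vectors.jsonl")`; the field values are illustrative):

```python
import json

# Stand-in for one line of data/movie_vectors.jsonl.
sample = '{"title": "Blockbuster", "plot": "A heist goes wrong.", "vector": [0.1, 0.2]}'

for line in sample.splitlines():
    record = json.loads(line)
    # Every record carries its metadata plus the embedded plot vector.
    print(record["title"], len(record["vector"]))
```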
What’s next?
In the next step, you will upload the vectors along with the movie metadata to your Chromia chain, enabling rich and semantic search capabilities.