Run the project

Run PySpark

cd pyspark

Generate dummy data

The dummy data will be generated in the database of the active node.

python seed_products.py
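
If you are curious what a seeding step like this typically involves, the sketch below generates simple product records and writes them through a client. The record fields, the NodeClient class, and its insert method are hypothetical placeholders, not the project's actual API.

# Minimal sketch of a dummy-data seeder. NodeClient and insert() are
# hypothetical placeholders for whatever client the project actually uses.
import random

from node_client import NodeClient  # hypothetical client for the active node

def generate_products(n):
    """Generate n dummy product records."""
    return [
        {
            "id": i,
            "name": f"product-{i}",
            "price": round(random.uniform(1.0, 100.0), 2),
        }
        for i in range(n)
    ]

if __name__ == "__main__":
    client = NodeClient()  # hypothetical: connects to the active node's database
    for product in generate_products(1000):
        client.insert("products", product)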

Execute PySpark functionality

This process retrieves all data from the node, converts it into a PySpark DataFrame, and performs various analyses.
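
As a rough illustration of that flow (not the script's actual code), the sketch below turns a list of fetched records into a PySpark DataFrame and runs one simple aggregation; fetch_all_products is a hypothetical stand-in for the real retrieval step.

# Sketch: convert fetched records into a PySpark DataFrame and analyze them.
from pyspark.sql import Row, SparkSession
from pyspark.sql import functions as F

def fetch_all_products():
    # Hypothetical stand-in for retrieving all records from the node.
    return [
        {"id": 1, "name": "product-1", "price": 19.99},
        {"id": 2, "name": "product-2", "price": 42.50},
    ]

spark = SparkSession.builder.appName("products-analysis").getOrCreate()

df = spark.createDataFrame([Row(**record) for record in fetch_all_products()])

# Example analysis: average price across all products.
df.agg(F.avg("price").alias("avg_price")).show()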

Note: When working with large datasets, you may encounter a Java OutOfMemoryError. This is a common issue when PySpark runs out of heap memory. To prevent it, set the following environment variables before running the script (if your setup loads a .env file, you can add them there instead):

For Linux/Mac/WSL users, use:

# Set these environment variables before running your script
export SPARK_DRIVER_MEMORY=2g
export SPARK_EXECUTOR_MEMORY=2g
export SPARK_DRIVER_MAXRESULTSIZE=1g

For Windows users, use:

# Set these environment variables before running your script
set SPARK_DRIVER_MEMORY=2g
set SPARK_EXECUTOR_MEMORY=2g
set SPARK_DRIVER_MAXRESULTSIZE=1g
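
Depending on how the script creates its SparkSession, it may read these variables itself and map them onto the corresponding Spark settings. A minimal sketch of that pattern, assuming the script does the mapping explicitly:

# Sketch: feeding the environment variables into the SparkSession.
# Assumption: the script applies them as Spark configs itself; the
# defaults below are illustrative, not the project's actual values.
import os

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("products-analysis")
    .config("spark.driver.memory", os.getenv("SPARK_DRIVER_MEMORY", "1g"))
    .config("spark.executor.memory", os.getenv("SPARK_EXECUTOR_MEMORY", "1g"))
    .config("spark.driver.maxResultSize", os.getenv("SPARK_DRIVER_MAXRESULTSIZE", "1g"))
    .getOrCreate()
)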

Now run the script:

python get_products_paginated.py