
Github Actions: Build Pyspark Package


I am using GitHub Actions to build PySpark .zip files with the following YAML snippet:

name: Build Artifacts
on:
  push:
    branches:
    - main
jobs:
  build:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.9"]
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: ${{ matrix.python-version }}
      - name: Make artifact directory
        run: mkdir -p ./dist
      - name: Create Zip File
        uses: montudor/[email protected]
        with:
          args: sh -c "cd data_compaction && zip -r ../src.zip src/"
      - name: Push zip file to S3
        uses: qoqa/[email protected]
        env:
          AWS_S3_BUCKET: 'dev-bucket'
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          AWS_REGION: 'us-east-1'
          AWS_S3_PATH: '/artifacts/src.zip'
          FILE: 'src.zip'
      - name: Push main file to S3
        uses: qoqa/[email protected]
        env:
          AWS_S3_BUCKET: 'dev-bucket'
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          AWS_REGION: 'us-east-1'
          AWS_S3_PATH: '/artifacts/main.py'
          FILE: './data_compaction/main.py'

The zip file is created and successfully pushed to S3, but when I try to import the modules from the zip I get a ModuleNotFoundError. I am running spark-submit --py-files src.zip main.py

However, when I zip the files on my local machine using a Makefile and run the spark-submit command, it works fine. The Makefile looks like this:

build:
    rm -f -r ./dist
    mkdir ./dist
    cp main.py ./dist
    cd ./src && zip -r ../dist/src.zip .

My project directory is as follows:

data_compaction
├── Makefile
├── main.py
└── src
    ├── jobs
    │   ├── __init__.py
    │   ├── xyz.py
    │   └── abc.py
    └── utilities
        ├── __init__.py
        └── spark_foundation.py

And my main.py has this snippet to import the modules:

import os
import sys

# When run via spark-submit, the src.zip shipped with --py-files sits in the
# working directory; otherwise fall back to the local src directory.
if os.path.exists('src.zip'):
    sys.path.insert(0, 'src.zip')
else:
    sys.path.insert(0, './src')

from utilities.spark_foundation import spark_session
from jobs.xyz import func1
from jobs.abc import func2
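
As background: the zip-on-sys.path trick works because CPython's zipimport treats an archive on sys.path like a directory, provided the package folders sit at the zip root. A minimal self-contained sketch (the archive name and module contents here are illustrative stand-ins, not the actual project files):

```python
import sys
import zipfile

# Build a minimal archive with the package directory at the zip root,
# mirroring the layout the Makefile produces.
with zipfile.ZipFile("modzip.zip", "w") as zf:
    zf.writestr("utilities/__init__.py", "")
    zf.writestr(
        "utilities/spark_foundation.py",
        "def spark_session():\n    return 'session stub'\n",
    )

sys.path.insert(0, "modzip.zip")  # same trick main.py uses with src.zip
from utilities.spark_foundation import spark_session

print(spark_session())  # prints: session stub
```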

PS: I am new to GitHub Actions


Solution

  • Have you also unzipped the src.zip from S3? In the Makefile you change into the src directory and zip everything underneath it, so the packages (jobs, utilities) end up at the root of the archive. In the CI YAML, however, you change into data_compaction and zip the src directory recursively, so the archive contains a top-level src/ folder and the imports in main.py no longer resolve. It should work again when you change the CI command to:

      - name: Create Zip File
        uses: montudor/[email protected]
        with:
          args: sh -c "cd data_compaction/src && zip -r ../src.zip ."
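
If it helps to see the difference, a few lines of Python reproduce it: zipping the src directory itself puts a top-level src/ entry in the archive, while zipping its contents puts the packages at the root (the paths below are illustrative stand-ins for the real modules):

```python
import zipfile

# Original CI step: `cd data_compaction && zip -r ../src.zip src/`
# -> modules end up nested under a top-level src/ folder.
with zipfile.ZipFile("broken.zip", "w") as zf:
    zf.writestr("src/jobs/__init__.py", "")

# Makefile / fixed CI step: `cd data_compaction/src && zip -r ../src.zip .`
# -> packages sit at the archive root, where the imports expect them.
with zipfile.ZipFile("working.zip", "w") as zf:
    zf.writestr("jobs/__init__.py", "")

print(zipfile.ZipFile("broken.zip").namelist())   # ['src/jobs/__init__.py']
print(zipfile.ZipFile("working.zip").namelist())  # ['jobs/__init__.py']
```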