
Deploying Dagster to GCP#

To deploy Dagster to GCP, Google Compute Engine (GCE) can host Dagit, Google Cloud SQL can store runs and events, and Google Cloud Storage (GCS) can serve as the storage backend for an IO manager.

Hosting Dagit or Dagster Daemon on GCE#

To host Dagit or Dagster Daemon on a bare VM or in Docker on GCE, see Running Dagster as a service.

Using Cloud SQL for run and event log storage#

We recommend launching a Cloud SQL PostgreSQL instance for run and event log data. You can configure Dagit to use Cloud SQL for run and event log storage by setting blocks in your $DAGSTER_HOME/dagster.yaml appropriately:

run_storage:
  module: dagster_postgres.run_storage
  class: PostgresRunStorage
  config:
    postgres_db:
      username: { username }
      password: { password }
      hostname: { hostname }
      db_name: { db_name }
      port: { port }

event_log_storage:
  module: dagster_postgres.event_log
  class: PostgresEventLogStorage
  config:
    postgres_db:
      username: { username }
      password: { password }
      hostname: { hostname }
      db_name: { db_name }
      port: { port }

schedule_storage:
  module: dagster_postgres.schedule_storage
  class: PostgresScheduleStorage
  config:
    postgres_db:
      username: { username }
      password: { password }
      hostname: { hostname }
      db_name: { db_name }
      port: { port }

In this case, you'll want to ensure you provide the correct connection credentials for your Cloud SQL instance, and that the node or container hosting Dagit is able to connect to Cloud SQL.
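If you'd rather not write credentials into the file directly, Dagster's config system can read individual values from environment variables via an env: key. A minimal sketch for the run storage block, assuming a DAGSTER_PG_PASSWORD environment variable (the variable name is your choice):

run_storage:
  module: dagster_postgres.run_storage
  class: PostgresRunStorage
  config:
    postgres_db:
      username: { username }
      password:
        env: DAGSTER_PG_PASSWORD # resolved from the environment when the file is loaded
      hostname: { hostname }
      db_name: { db_name }
      port: { port }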

Be sure that this file is present, and DAGSTER_HOME is set, on the node where Dagit is running.

Note that using Cloud SQL for run and event log storage does not require that Dagit be running in the cloud. If you are connecting a local Dagit instance to remote Cloud SQL storage, double-check that your local node is able to reach Cloud SQL (for example, via the Cloud SQL Auth Proxy or an authorized network).

Using GCS for IO Management#

You'll probably also want to configure a GCS bucket to store intermediates via persistent IO managers. This enables re-execution, review, and audit of intermediate solid results, and cross-node cooperation (e.g., with the multiprocess or Dagster Celery executors).

You'll first need to use gcs_pickle_io_manager as your IO manager, or customize your own persistent IO manager (see example).

from dagster import ModeDefinition
from dagster_gcp.gcs.io_manager import gcs_pickle_io_manager
from dagster_gcp.gcs.resources import gcs_resource

# Define a mode that swaps in the GCS-backed IO manager, which pickles
# solid outputs and stores them in a GCS bucket instead of the local filesystem.
prod_mode = ModeDefinition(
    name="prod",
    resource_defs={"gcs": gcs_resource, "io_manager": gcs_pickle_io_manager},
)
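
To make this mode usable, attach it to a pipeline definition. The solids and pipeline below are a minimal illustrative sketch, not part of the GCP setup itself:

from dagster import pipeline, solid

@solid
def produce_number(context):
    return 42

@solid
def double(context, x):
    return x * 2

# Selecting the "prod" mode at launch time routes every solid output
# through the GCS-backed IO manager defined above.
@pipeline(mode_defs=[prod_mode])
def my_pipeline():
    double(produce_number())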

Then, add the following YAML block in your pipeline config:

resources:
  io_manager:
    config:
      gcs_bucket: my-cool-bucket
      gcs_prefix: good/prefix-for-files-

With this in place, your pipeline runs will store intermediates on GCS in the location gs://<bucket>/<gcs_prefix>/storage/<pipeline run id>/intermediates/<solid name>.compute, where <gcs_prefix> defaults to dagster if not configured.
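
If you launch runs from Python rather than through Dagit, the same configuration can be passed programmatically. A sketch, assuming the illustrative my_pipeline defined above:

from dagster import execute_pipeline

result = execute_pipeline(
    my_pipeline,
    mode="prod",  # selects the mode that wires in the GCS IO manager
    run_config={
        "resources": {
            "io_manager": {
                "config": {
                    "gcs_bucket": "my-cool-bucket",
                    "gcs_prefix": "good/prefix-for-files-",
                }
            }
        }
    },
)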