
Polyphemus (Development)

Polyphemus is the model-serving service used for adversarial generation workloads. It proxies generation requests to Vertex AI and exposes authenticated REST endpoints.

Runtime and deployment notes

  • Runtime baseline: Python >=3.12
  • Router module: apps/polyphemus/src/rhesis/polyphemus/routers/services.py
  • Request schemas: apps/polyphemus/src/rhesis/polyphemus/schemas/schemas.py
  • Docker image: API-only service image. PyTorch is not bundled in the Polyphemus container; model weights and serving runtime live behind Vertex AI.

API endpoints

Polyphemus exposes two primary generation endpoints:

| Endpoint | Purpose | Auth |
|---|---|---|
| POST /generate | Single generation request | Bearer token required |
| POST /generate_batch | Batch generation for multiple requests | Bearer token required |

/generate_batch accepts up to 50 items per call (MAX_BATCH_SIZE).
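A client that has more than 50 items simply needs to split its work across multiple /generate_batch calls. The sketch below is illustrative client-side code, not part of Polyphemus; only the 50-item cap (MAX_BATCH_SIZE) comes from the docs.

```python
# Client-side helper (hypothetical): split work into /generate_batch-sized payloads.
# MAX_BATCH_SIZE mirrors the documented server-side cap of 50 items per call.
MAX_BATCH_SIZE = 50

def chunk_requests(requests, max_batch_size=MAX_BATCH_SIZE):
    """Split a list of generation requests into payloads the batch endpoint accepts."""
    return [
        {"requests": requests[i:i + max_batch_size]}
        for i in range(0, len(requests), max_batch_size)
    ]
```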

Environment configuration

Polyphemus reads Vertex AI target configuration from environment variables:

| Variable | Required | Description |
|---|---|---|
| POLYPHEMUS_ENDPOINT_ID | Yes | Vertex AI endpoint identifier |
| POLYPHEMUS_PROJECT_ID | Yes | GCP project ID for endpoint invocation |
| POLYPHEMUS_LOCATION | No | Vertex AI region (defaults to us-central1) |
| VLLM_LOGGING_LEVEL | No | vLLM container log verbosity for Vertex serving (for example, DEBUG, INFO) |

If required variables are missing, the service returns HTTP 400 with configuration error details.
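The validation behavior can be sketched as follows. This is an assumption about the shape of the check, not the actual service code; the variable names and the 400-on-missing-config behavior are from the docs, while the function name and return shape are illustrative.

```python
import os

# Required Vertex AI configuration, per the table above.
REQUIRED_VARS = ("POLYPHEMUS_ENDPOINT_ID", "POLYPHEMUS_PROJECT_ID")

def check_vertex_config(env=None):
    """Sketch of the config check: (ok, detail), with a 400-style error when
    required variables are missing and the documented us-central1 default."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        return False, {"status": 400, "detail": f"Missing configuration: {', '.join(missing)}"}
    return True, {
        "endpoint_id": env["POLYPHEMUS_ENDPOINT_ID"],
        "project_id": env["POLYPHEMUS_PROJECT_ID"],
        "location": env.get("POLYPHEMUS_LOCATION", "us-central1"),
    }
```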

Deployment region variable mapping (v0.2.8+)

Region configuration uses two separate variables depending on context:

| Context | Variable | Source | Where it is consumed |
|---|---|---|---|
| GitHub Actions CI/CD workflow | REGION | secrets.REGION (falls back to secrets.TF_VAR_REGION, then us-central1) | .github/workflows/polyphemus.yml |
| Running Polyphemus service | POLYPHEMUS_LOCATION | Set to $REGION by the CI workflow | apps/polyphemus/src/rhesis/polyphemus/routers/services.py |
| Vertex model deployment script | GCP_REGION | Set directly in the local environment (not mapped from REGION) | apps/polyphemus/model_deployment/config.py |

The workflow maps REGION → POLYPHEMUS_LOCATION automatically for service deployments. The model deployment script reads GCP_REGION independently; when running it locally you must export GCP_REGION yourself (see apps/polyphemus/model_deployment/.env.example).
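The CI-side fallback chain for REGION can be expressed as a tiny sketch. The precedence (secrets.REGION, then secrets.TF_VAR_REGION, then us-central1) is from the table above; the function name and dict-based secrets lookup are illustrative.

```python
def resolve_region(secrets):
    """Illustrative resolution of the workflow's REGION value:
    secrets.REGION -> secrets.TF_VAR_REGION -> the us-central1 default."""
    return secrets.get("REGION") or secrets.get("TF_VAR_REGION") or "us-central1"
```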

vLLM logging level (v0.2.9+)

When deploying Polyphemus to Vertex AI, you can control serving container verbosity with VLLM_LOGGING_LEVEL.

deploy-polyphemus.sh
export VLLM_LOGGING_LEVEL=DEBUG
python apps/polyphemus/model_deployment/deploy.py --skip-existing

If set, deployment injects VLLM_LOGGING_LEVEL into the serving container environment.

Polyphemus deployment separates the lightweight API container from the Vertex AI serving container. Configure vLLM logging on the Vertex deployment, not by installing PyTorch or model runtime dependencies into the Polyphemus API image.
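The conditional injection described above might look like the following in the deployment script. This is a sketch under the assumption that the script builds the serving container environment as a dict; the function name is hypothetical, while VLLM_LOGGING_LEVEL and its only-if-set behavior are from the docs.

```python
import os

def serving_container_env(base_env=None):
    """Build the Vertex serving container env, injecting VLLM_LOGGING_LEVEL
    only when it is set in the deploying shell (illustrative sketch)."""
    env = dict(base_env or {})
    level = os.environ.get("VLLM_LOGGING_LEVEL")
    if level:
        env["VLLM_LOGGING_LEVEL"] = level
    return env
```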

Batch request and response format

generate_batch_request.json
{
  "requests": [
    {
      "messages": [
        {
          "role": "user",
          "content": "Summarize this policy document."
        }
      ],
      "temperature": 0.7,
      "max_tokens": 1024
    },
    {
      "messages": [
        {
          "role": "user",
          "content": "Extract key risks from this response."
        }
      ],
      "temperature": 0.2
    }
  ]
}
generate_batch_response.json
{
  "responses": [
    {
      "choices": [
        {
          "message": {
            "content": "..."
          }
        }
      ],
      "model": "vertex_ai/model",
      "usage": {
        "prompt_tokens": 120,
        "completion_tokens": 85
      }
    },
    {
      "error": "Generation timeout"
    }
  ]
}
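As the sample response shows, items succeed or fail independently, so callers should handle per-item errors rather than treating the whole batch as failed. A minimal client-side sketch, assuming the response shape shown above (the function name is illustrative):

```python
def split_batch_results(payload):
    """Separate successful generation texts from per-item errors in a
    /generate_batch response shaped like the example above."""
    texts, errors = [], []
    for item in payload.get("responses", []):
        if "error" in item:
            errors.append(item["error"])
        else:
            texts.append(item["choices"][0]["message"]["content"])
    return texts, errors
```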

Rate limiting is applied through check_rate_limit. For batch calls, one HTTP request counts as one rate-limit unit regardless of item count.
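The one-unit-per-HTTP-request semantics can be illustrated with a simple fixed-window counter. This is a stand-in sketch, not the actual check_rate_limit implementation; only the batch-counts-as-one behavior is from the docs.

```python
import time

class FixedWindowLimiter:
    """Illustrative fixed-window limiter: each HTTP request (single or batch)
    consumes exactly one unit, regardless of how many items the batch carries."""

    def __init__(self, limit, window_seconds=60):
        self.limit = limit
        self.window = window_seconds
        self.count = 0
        self.window_start = time.monotonic()

    def check_rate_limit(self):
        """Return True and consume one unit, or False if the window is exhausted."""
        now = time.monotonic()
        if now - self.window_start >= self.window:
            self.window_start = now
            self.count = 0
        if self.count >= self.limit:
            return False
        self.count += 1
        return True
```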