ECS: Experiment Control System

Covers data taking and experiment control

If O2 is a hub-and-spoke design, the AliECS system is the hub

gRPC is favoured in ALICE. It allows programs to communicate as if they were making local procedure calls.
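To make the "local procedure call" point concrete, here is a minimal Go sketch of such a call. It assumes client stubs generated from a hypothetical control.proto; the service, method and field names (ControlClient, StartActivity, etc.) are invented for illustration and are not the actual AliECS API.

```go
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	pb "example.com/control/pb" // hypothetical protoc-generated stubs
)

func main() {
	// Connect to a (hypothetical) control core; the address is a placeholder.
	conn, err := grpc.Dial("core-host:50051",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	client := pb.NewControlClient(conn)
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// The remote call reads exactly like a local function call.
	reply, err := client.StartActivity(ctx, &pb.StartActivityRequest{Name: "physics"})
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("reply: %v", reply)
}
```

From the caller's point of view, client.StartActivity behaves like an ordinary function call, even though it crosses the network.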

The O2/FLP cluster

  • Service machines host the AliECS core

    • Data taking in AliECS
    • Workflow processing
      • How do we launch tasks on clusters of machines?
      • Workflow templates are specified in YAML and stored in a git repository
        • in effect, the workflow is the program and the AliECS core is the interpreter
        • with git we get versioning and can develop workflows without touching source code on the core
        • a template covers which tasks should run and how they should talk to each other
      • Configuration of the workflows comes from the Apricot configuration service
        • includes a templating engine (see the sketch after this block)
    • Integration plugins handle
      • communication with non-O2 machines
        • e.g. the SOR (start of run) operation
      • the O2 monitoring system via Kafka
      • O2 bookkeeping
      • the O2 EPN cluster
    • Task scheduler
      • deals exclusively with the FLP cluster
      • keeps track of resources
      • and makes use of them by translating the output of workflow processing into actual commands
      • via Apache Mesos
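To illustrate the "workflow template plus templating engine" idea from the workflow processing notes above, here is a rough Go sketch that renders a made-up YAML workflow template and then parses it. The schema, field names and variables are invented for illustration; they do not match the real AliECS workflow templates or the Apricot service.

```go
package main

import (
	"bytes"
	"fmt"
	"log"
	"text/template"

	"gopkg.in/yaml.v3"
)

// A made-up workflow template: which tasks run, and a templated variable.
const workflowTemplate = `
name: readout-qc-{{ .Detector }}
tasks:
  - name: readout
    host_class: flp
  - name: qc-checker
    host_class: qc
vars:
  run_type: {{ .RunType }}
`

type Workflow struct {
	Name  string `yaml:"name"`
	Tasks []struct {
		Name      string `yaml:"name"`
		HostClass string `yaml:"host_class"`
	} `yaml:"tasks"`
	Vars map[string]string `yaml:"vars"`
}

func main() {
	// In the real system these values would come from the configuration
	// service and the GUI; here they are hard-coded for the sketch.
	vars := map[string]string{"Detector": "its", "RunType": "PHYSICS"}

	tmpl := template.Must(template.New("wf").Parse(workflowTemplate))
	var rendered bytes.Buffer
	if err := tmpl.Execute(&rendered, vars); err != nil {
		log.Fatal(err)
	}

	var wf Workflow
	if err := yaml.Unmarshal(rendered.Bytes(), &wf); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("workflow %q with %d tasks\n", wf.Name, len(wf.Tasks))
}
```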
  • 202 FLPs and 15 QC nodes

    • FLPs
      • use custom PCIe readout hardware to collect data from the detectors
      • after the readout cards, the data flows through a chain:
        • O2 readout
        • FairMQ data flow tasks
        • O2 DPL processing workflows
          • handle branching in the chain, for example for quality control
        • finally, data leaves the FLPs for the EPNs in the data center
    • How does AliECS interact with this chain?
      • it ‘opens the faucet’ at the correct time, i.e. it starts and stops the tasks in the chain in the right order
  • Apache Mesos

    • Allows a cluster of computers to be interacted with as a single computer.
    • master/agent architecture
    • Mesos receives an ordered list of tasks from the scheduler
    • the Mesos agent spins up the AliECS executor
      • this is needed because tasks are complex and are represented as state machines
      • tasks must be synchronised (see the sketch after this block)
        • e.g. all tasks must be configured before they can be started
      • the executor can detect and handle errors
      • and it handles non-dataflow tasks as well
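A minimal sketch of the synchronisation requirement mentioned above: every task is modelled as a small state machine, and nothing is started until all tasks report that they are configured. The states and the Task type are simplified stand-ins, not the actual AliECS or FairMQ state machine.

```go
package main

import (
	"fmt"
	"sync"
)

type State string

const (
	Standby    State = "STANDBY"
	Configured State = "CONFIGURED"
	Running    State = "RUNNING"
)

type Task struct {
	Name  string
	State State
}

func (t *Task) Configure() { t.State = Configured }
func (t *Task) Start()     { t.State = Running }

func main() {
	tasks := []*Task{
		{Name: "readout", State: Standby},
		{Name: "qc-checker", State: Standby},
	}

	// Phase 1: configure every task, possibly in parallel...
	var wg sync.WaitGroup
	for _, t := range tasks {
		wg.Add(1)
		go func(t *Task) {
			defer wg.Done()
			t.Configure()
		}(t)
	}
	wg.Wait() // ...and wait: the barrier between CONFIGURE and START.

	// Phase 2: only now is it safe to start anything.
	for _, t := range tasks {
		t.Start()
		fmt.Println(t.Name, "->", t.State)
	}
}
```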
  • Workflow load and deploy takes as input

    • task templates, which define how to run each task
    • the workflow template YAML
    • task configuration templates for the tasks that run on the FLPs
  • and performs

    • variable precedence resolution (see the sketch at the end of this section)
      • decides which variables win in the configuration, e.g. variables set in the GUI take precedence
    • DPL sub-workflow resolution
      • DPL workflows are not AliECS workflows themselves, so they have to be resolved into tasks
    • template processing
    • resource allocation
    • task-host constraint resolution
      • e.g. FLPs have 1-3 readout cards
      • e.g. the InfiniBand network is not multiplexed, it’s point to point, so certain tasks must run on specific hosts
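As a closing illustration of variable precedence resolution, here is a tiny Go sketch that merges variable sources in priority order, so a value set in the GUI overrides defaults from the templates. The source names, keys and values are made up.

```go
package main

import "fmt"

// merge applies the maps in order; later maps take precedence.
func merge(sources ...map[string]string) map[string]string {
	out := map[string]string{}
	for _, src := range sources {
		for k, v := range src {
			out[k] = v
		}
	}
	return out
}

func main() {
	defaults := map[string]string{"run_type": "TECHNICAL", "n_hbf": "128"}
	workflow := map[string]string{"run_type": "PHYSICS"}
	gui := map[string]string{"n_hbf": "256"} // set by the operator in the GUI

	resolved := merge(defaults, workflow, gui)
	fmt.Println(resolved) // map[n_hbf:256 run_type:PHYSICS]
}
```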