ECS: Experiment Control System

Covers data taking and experiment control

If O2 is a hub-and-spoke design, the AliECS system is the hub

gRPC is favoured in ALICE. It allows programs to communicate as if they were making local procedure calls.
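To make the "local procedure call" point concrete, here is a minimal Go sketch of such a call. It assumes client stubs generated from a hypothetical control.proto; the service, method and field names (ControlClient, StartActivity, etc.) are invented for illustration and are not the actual AliECS API.

```go
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	pb "example.com/control/pb" // hypothetical protoc-generated stubs
)

func main() {
	// Connect to a (hypothetical) control core; the address is a placeholder.
	conn, err := grpc.Dial("core-host:50051",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	client := pb.NewControlClient(conn)
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// The remote call reads exactly like a local function call.
	reply, err := client.StartActivity(ctx, &pb.StartActivityRequest{Name: "physics"})
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("reply: %v", reply)
}
```

From the caller's point of view, client.StartActivity behaves like an ordinary function call, even though it crosses the network.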

The O2/FLP cluster

  • Service machines host the AliECS core

    • Data taking in AliECS
    • Workflow processing
      • How do we launch tasks on clusters of machines?
      • Workflow templates are specified in YAML and stored in a git repository
        • in effect, the workflow is the program and the AliECS core is the interpreter
        • with git we get versioning and can develop workflows without touching source code on the core
        • a template covers which tasks should run and how they should talk to each other
      • Configuration of the workflows comes from the Apricot configuration service
        • includes a templating engine (see the sketch after this block)
    • Integration plugins handle
      • communication with non-O2 machines
        • e.g. the SOR (start of run) operation
      • the O2 monitoring system via Kafka
      • O2 bookkeeping
      • the O2 EPN cluster
    • Task scheduler
      • deals exclusively with the FLP cluster
      • keeps track of resources
      • and makes use of them by translating the output of workflow processing into actual commands
      • via Apache Mesos
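To illustrate the "workflow template plus templating engine" idea from the workflow processing notes above, here is a rough Go sketch that renders a made-up YAML workflow template and then parses it. The schema, field names and variables are invented for illustration; they do not match the real AliECS workflow templates or the Apricot service.

```go
package main

import (
	"bytes"
	"fmt"
	"log"
	"text/template"

	"gopkg.in/yaml.v3"
)

// A made-up workflow template: which tasks run, and a templated variable.
const workflowTemplate = `
name: readout-qc-{{ .Detector }}
tasks:
  - name: readout
    host_class: flp
  - name: qc-checker
    host_class: qc
vars:
  run_type: {{ .RunType }}
`

type Workflow struct {
	Name  string `yaml:"name"`
	Tasks []struct {
		Name      string `yaml:"name"`
		HostClass string `yaml:"host_class"`
	} `yaml:"tasks"`
	Vars map[string]string `yaml:"vars"`
}

func main() {
	// In the real system these values would come from the configuration
	// service and the GUI; here they are hard-coded for the sketch.
	vars := map[string]string{"Detector": "its", "RunType": "PHYSICS"}

	tmpl := template.Must(template.New("wf").Parse(workflowTemplate))
	var rendered bytes.Buffer
	if err := tmpl.Execute(&rendered, vars); err != nil {
		log.Fatal(err)
	}

	var wf Workflow
	if err := yaml.Unmarshal(rendered.Bytes(), &wf); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("workflow %q with %d tasks\n", wf.Name, len(wf.Tasks))
}
```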
  • 202 FLPs and 15 QC nodes

    • FLPs
      • use custom PCIe readout hardware to collect data from the detectors
      • after the readout cards, the data flows through a chain:
        • O2 readout
        • FairMQ data flow tasks
        • O2 DPL processing workflows
          • handle branching in the chain, for example for quality control
        • finally, data leaves the FLPs for the EPNs in the data center
    • How does AliECS interact with this chain?
      • it ‘opens the faucet’ at the correct time, i.e. it starts and stops the tasks in the chain in the right order
  • Apache Mesos

    • Allows a cluster of computers to be interacted with as a single computer.
    • master/agent architecture
    • Mesos receives an ordered list of tasks from the scheduler
    • the Mesos agent spins up the AliECS executor
      • this is needed because tasks are complex and are represented as state machines
      • tasks must be synchronised (see the sketch after this block)
        • e.g. all tasks must be configured before they can be started
      • the executor can detect and handle errors
      • and it handles non-dataflow tasks as well
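A minimal sketch of the synchronisation requirement mentioned above: every task is modelled as a small state machine, and nothing is started until all tasks report that they are configured. The states and the Task type are simplified stand-ins, not the actual AliECS or FairMQ state machine.

```go
package main

import (
	"fmt"
	"sync"
)

type State string

const (
	Standby    State = "STANDBY"
	Configured State = "CONFIGURED"
	Running    State = "RUNNING"
)

type Task struct {
	Name  string
	State State
}

func (t *Task) Configure() { t.State = Configured }
func (t *Task) Start()     { t.State = Running }

func main() {
	tasks := []*Task{
		{Name: "readout", State: Standby},
		{Name: "qc-checker", State: Standby},
	}

	// Phase 1: configure every task, possibly in parallel...
	var wg sync.WaitGroup
	for _, t := range tasks {
		wg.Add(1)
		go func(t *Task) {
			defer wg.Done()
			t.Configure()
		}(t)
	}
	wg.Wait() // ...and wait: the barrier between CONFIGURE and START.

	// Phase 2: only now is it safe to start anything.
	for _, t := range tasks {
		t.Start()
		fmt.Println(t.Name, "->", t.State)
	}
}
```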
  • Workflow load and deploy takes as input

    • task templates, which define how to run each task
    • the workflow template YAML
    • task configuration templates for the tasks that run on the FLPs
  • and performs

    • variable precedence resolution (see the sketch at the end of this section)
      • decides which variables win in the configuration, e.g. variables set in the GUI take precedence
    • DPL sub-workflow resolution
      • DPL workflows are not AliECS workflows themselves, so they have to be resolved into tasks
    • template processing
    • resource allocation
    • task-host constraint resolution
      • e.g. FLPs have 1-3 readout cards
      • e.g. the InfiniBand network is not multiplexed, it’s point to point, so certain tasks must run on specific hosts
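As a closing illustration of variable precedence resolution, here is a tiny Go sketch that merges variable sources in priority order, so a value set in the GUI overrides defaults from the templates. The source names, keys and values are made up.

```go
package main

import "fmt"

// merge applies the maps in order; later maps take precedence.
func merge(sources ...map[string]string) map[string]string {
	out := map[string]string{}
	for _, src := range sources {
		for k, v := range src {
			out[k] = v
		}
	}
	return out
}

func main() {
	defaults := map[string]string{"run_type": "TECHNICAL", "n_hbf": "128"}
	workflow := map[string]string{"run_type": "PHYSICS"}
	gui := map[string]string{"n_hbf": "256"} // set by the operator in the GUI

	resolved := merge(defaults, workflow, gui)
	fmt.Println(resolved) // map[n_hbf:256 run_type:PHYSICS]
}
```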