ECS → Experiment control system
Covers data taking and experiments
If O^2 is a hub-and-spoke design, the ALiECS system is the hub
gRPC is favoured in ALICE. It allows programs on different machines to communicate as if they were making local procedure calls.
The O^2/FLP cluster
Service machines in the ALiECS core
- Data taking in ALiECS
- Workflow processing
- How do we launch tasks on clusters of machines?
- Workflow templates are specified in YAML and stored in a Git repository
- i.e. the workflow is the program and the ALiECS core is the interpreter
- with Git we get versioning and can develop workflows without touching source code on the core.
- a template covers which tasks should run and how they should talk to each other.
- Configuration of the workflows comes from the Apricot configuration service
- includes a templating engine
- Integration plugins
- handle communication with non-O2 machines
- e.g. SOR (start of run) operation
- O2 monitoring system in Kafka
- O2 bookkeeping
- O2 EPN cluster
- Task scheduler
- deals exclusively with the FLP cluster
- keeps track of resources
- and makes use of them by translating the output of the workflow processing into actual commands.
- via Apache Mesos
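
A minimal sketch of the "workflow is the program, core is the interpreter" idea from the notes above. Everything here is an assumption for illustration: the template layout, the task names, the commands and the `$variable` syntax are hypothetical and are not the real ALiECS schema or templating engine. The template is written as the Python dict a YAML loader would produce, so the example needs only the standard library.

```python
from string import Template

# Hypothetical workflow template (not the real ALiECS schema), shown as the
# dict that a YAML loader would return for a file kept in a Git repository.
WORKFLOW = {
    "name": "readout-demo",
    "tasks": [
        {"name": "readout", "command": "readout --detector $detector"},
        {"name": "processing", "command": "process --run $run_number"},
    ],
}


def render(workflow, variables):
    """Fill in the template variables and return one concrete command per task."""
    commands = []
    for task in workflow["tasks"]:
        command = Template(task["command"]).substitute(variables)
        commands.append((task["name"], command))
    return commands


# The variables would come from the configuration service; here they are hard-coded.
commands = render(WORKFLOW, {"detector": "TPC", "run_number": 42})
for name, command in commands:
    print(f"{name}: {command}")
```

The point of the split is that `WORKFLOW` can change in the repository without any change to `render`, which plays the role of the interpreter.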
202 FLPs and 15 QC (quality control) machines
- FLPs
- use custom PCIe hardware to collect data from the LHC
- After these, we have a chain
- O2 readout
- FairMQ data flow tasks(?)
- O2 DPL processing workflows
- handles branching in the chain, for example for quality control.
- Finally, data leaves the FLPs for the EPNs in the data center.
- How does ALiECS interact with this chain?
- ‘Opens the faucet’ at the correct time
- FLPs
Apache Mesos
- Lets a cluster of computers be used as if it were a single computer.
- Master/Agents architecture
- Mesos receives an ordered list of tasks from the scheduler
- the Mesos agent spins up the ALiECS executor
- this is needed because tasks are complex, represented as state machines
- tasks must be synchronised
- e.g. all tasks must be configured before they can be started.
- we can detect and handle errors
- as well as handling non-dataflow tasks
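
A rough sketch of why tasks are modelled as state machines and why they must be synchronised: a group transition is refused unless every task is in the required state. The state names, event names and the `Task` and `transition_all` helpers are hypothetical and chosen for illustration; they are not taken from the ALiECS executor.

```python
from enum import Enum


class State(Enum):
    STANDBY = 1
    CONFIGURED = 2
    RUNNING = 3
    ERROR = 4


# event -> (state the task must be in, state it moves to)
TRANSITIONS = {
    "configure": (State.STANDBY, State.CONFIGURED),
    "start": (State.CONFIGURED, State.RUNNING),
    "stop": (State.RUNNING, State.CONFIGURED),
}


class Task:
    def __init__(self, name):
        self.name = name
        self.state = State.STANDBY

    def fire(self, event):
        """Apply one event; an illegal transition puts the task in ERROR."""
        source, target = TRANSITIONS[event]
        if self.state is not source:
            self.state = State.ERROR
            raise RuntimeError(f"{self.name}: cannot {event} from {self.state.name}")
        self.state = target


def transition_all(tasks, event):
    """Fire the event on every task, but only if all of them are ready for it."""
    source, _ = TRANSITIONS[event]
    not_ready = [t.name for t in tasks if t.state is not source]
    if not_ready:
        raise RuntimeError(f"cannot {event}, not ready: {not_ready}")
    for task in tasks:
        task.fire(event)


tasks = [Task("readout"), Task("processing")]
tasks[0].fire("configure")
try:
    transition_all(tasks, "start")  # refused: "processing" is still in STANDBY
except RuntimeError as err:
    blocked = str(err)
tasks[1].fire("configure")
transition_all(tasks, "start")  # now every task is CONFIGURED, so all start
```

Checking the whole group before firing any event means a refused transition leaves every task untouched, which keeps error handling simple.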
Workflow load and deploy takes as input
- task templates define how to run each task
- workflow template yaml
- task configuration templates run on the FLPs
and performs
- variable precedence resolution
- decides which variables win in the configuration files, e.g. variables set in the GUI take precedence
- DPL sub-workflow resolution
- DPL workflows are not necessarily ALICE-specific
- template processing
- resource allocation
- task-host constraint resolution
- e.g. FLPs have 1-3 cards.
- e.g. the InfiniBand network is not multiplexed, it's point-to-point
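
Task-host constraint resolution can be pictured as a matching problem. The sketch below is a simple first-fit placement under stated assumptions: each task needs a minimum number of readout cards, each FLP has 1-3 cards, and one task goes on one host. The host names, task names and the `assign` helper are hypothetical, and this is not the algorithm ALiECS actually uses.

```python
def assign(tasks, hosts):
    """Place each task on the first free host with enough cards (first fit)."""
    free = dict(hosts)  # host name -> number of cards, copied so we can remove entries
    placement = {}
    for task, cards_needed in tasks.items():
        host = next((h for h, cards in free.items() if cards >= cards_needed), None)
        if host is None:
            raise RuntimeError(f"no free host has {cards_needed} cards for {task}")
        placement[task] = host
        del free[host]  # assumption: one task per host
    return placement


# Hypothetical inventory: FLPs with 1-3 cards each.
hosts = {"flp-a": 1, "flp-b": 3, "flp-c": 2}
tasks = {"readout-big": 3, "readout-small": 1, "readout-mid": 2}

placement = assign(tasks, hosts)
print(placement)
```

First fit is easy to follow but can fail on inputs where a valid placement exists, so a real resolver would need something stronger.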
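
Variable precedence resolution can be illustrated with layered dictionaries. The notes only state that GUI variables take precedence; the other layers, their order and all variable names below are assumptions made for the example.

```python
from collections import ChainMap

# Hypothetical layers, lowest to highest priority: defaults < workflow < GUI.
defaults = {"detector": "TPC", "n_flps": 1, "qc_enabled": False}
workflow_vars = {"n_flps": 3}
gui_vars = {"qc_enabled": True, "n_flps": 2}

# ChainMap looks a key up in its maps from left to right, so the first map wins.
resolved = dict(ChainMap(gui_vars, workflow_vars, defaults))
print(resolved)
```

Here `n_flps` is set in all three layers and ends up as 2, the GUI value, while `detector` falls through to the default.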