xref: /OK3568_Linux_fs/kernel/Documentation/scheduler/sched-domains.rst (revision 4882a59341e53eb6f0b4789bf948001014eff981)
=================
Scheduler Domains
=================

Each CPU has a "base" scheduling domain (struct sched_domain). The domain
hierarchy is built from these base domains via the ->parent pointer. ->parent
MUST be NULL terminated, and domain structures should be per-CPU as they are
locklessly updated.

Each scheduling domain spans a number of CPUs (stored in the ->span field).
A domain's span MUST be a superset of its child's span (this restriction could
be relaxed if the need arises), and the base domain for CPU i MUST span at
least CPU i. The top domain for each CPU will generally span all CPUs in the
system, although strictly it need not; if it does not, some CPUs may never be
given tasks to run unless the CPUs-allowed mask is explicitly set. A sched
domain's span means "balance process load among these CPUs".

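The two invariants above (a NULL-terminated ->parent chain, with each parent's span a superset of its child's) can be illustrated with a toy sketch. The `toy_sched_domain` structure and bitmask spans below are hypothetical simplifications for illustration, not the kernel's actual types:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified mirror of struct sched_domain: spans are
 * modelled as plain CPU bitmasks and ->parent is NULL at the top. */
struct toy_sched_domain {
	unsigned long span;              /* bitmask of CPUs this domain covers */
	struct toy_sched_domain *parent; /* NULL-terminated chain upward       */
};

/* Walk from a base domain to the top, checking that every parent's span
 * is a superset of its child's span.  Returns 1 if the invariant holds. */
static int spans_nest_upward(const struct toy_sched_domain *sd)
{
	for (; sd->parent; sd = sd->parent)
		if ((sd->span & sd->parent->span) != sd->span)
			return 0;
	return 1;
}
```

A base domain spanning CPUs {0,1} under a top domain spanning {0..3} passes this check; a parent that drops a child's CPU fails it.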
Each scheduling domain must have one or more CPU groups (struct sched_group)
which are organised as a circular one-way linked list from the ->groups
pointer. The union of the cpumasks of these groups MUST be the same as the
domain's span. The group pointed to by the ->groups pointer MUST contain the
CPU to which the domain belongs. Groups may be shared among CPUs as they
contain read-only data after they have been set up. The intersection of the
cpumasks of any two of these groups may be non-empty. If this is the case,
the SD_OVERLAP flag is set on the corresponding scheduling domain and its
groups may not be shared between CPUs.

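The "union of group cpumasks equals the domain's span" rule can be checked by walking the circular group list once. Again, `toy_sched_group` is a hypothetical stand-in, not the kernel's struct sched_group:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: groups hang off a domain as a circular singly
 * linked list; the last group's ->next points back at the first. */
struct toy_sched_group {
	unsigned long cpumask;          /* CPUs covered by this group   */
	struct toy_sched_group *next;   /* circular one-way list        */
};

/* OR together the cpumasks of every group on the circular list.
 * For a valid domain this union must equal the domain's span. */
static unsigned long groups_cpumask_union(const struct toy_sched_group *head)
{
	unsigned long u = 0;
	const struct toy_sched_group *g = head;

	do {
		u |= g->cpumask;
		g = g->next;
	} while (g != head);
	return u;
}
```

For a domain spanning CPUs {0..3} with two groups covering {0,1} and {2,3}, the union comes out to the full span, as required.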
Balancing within a sched domain occurs between groups. That is, each group
is treated as one entity. The load of a group is defined as the sum of the
load of each of its member CPUs, and only when the load of a group becomes
out of balance are tasks moved between groups.

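Since a group is treated as one entity, its load is simply the sum over its member CPUs. A minimal sketch, assuming per-CPU loads in an array and groups as bitmasks (both hypothetical):

```c
#include <assert.h>

/* Hypothetical sketch: a group's load is the sum of its member CPUs'
 * loads; balancing compares these per-group totals, not per-CPU loads. */
static unsigned long toy_group_load(const unsigned long *cpu_load,
				    unsigned long group_mask, int nr_cpus)
{
	unsigned long load = 0;

	for (int i = 0; i < nr_cpus; i++)
		if (group_mask & (1UL << i))
			load += cpu_load[i];
	return load;
}
```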
In kernel/sched/core.c, trigger_load_balance() is run periodically on each CPU
through scheduler_tick(). It raises a softirq once the next regularly scheduled
rebalancing event for the current runqueue has arrived. The actual load
balancing workhorse, run_rebalance_domains()->rebalance_domains(), is then run
in softirq context (SCHED_SOFTIRQ).

The latter function takes two arguments: the current CPU and whether it was
idle at the time scheduler_tick() ran. It then iterates over all sched domains
our CPU is on, starting from its base domain and going up the ->parent chain.
While doing that, it checks to see if the current domain has exhausted its
rebalance interval. If so, it runs load_balance() on that domain. It then
checks the parent sched_domain (if it exists), then the parent of the parent,
and so forth.

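The shape of that walk can be sketched as follows. The `toy_domain` structure, its jiffies-like timestamps, and the counting helper are all hypothetical; the real rebalance_domains() also adjusts intervals and handles idle state:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch of the rebalance walk: per-domain bookkeeping
 * for when the domain was last balanced and how often it may be. */
struct toy_domain {
	unsigned long last_balance;   /* time of last rebalance (jiffies-like) */
	unsigned long interval;       /* minimum time between rebalances       */
	struct toy_domain *parent;    /* NULL-terminated chain upward          */
};

/* Starting at the base domain and going up the ->parent chain, count the
 * domains whose rebalance interval has expired -- those are the ones
 * load_balance() would be run on. */
static int domains_due_for_balance(const struct toy_domain *sd,
				   unsigned long now)
{
	int due = 0;

	for (; sd; sd = sd->parent)
		if (now - sd->last_balance >= sd->interval)
			due++;
	return due;
}
```

With a base interval of 10 and a parent interval of 100, an early tick rebalances only the base domain, while a later one reaches the parent too -- higher, wider domains are rebalanced less often.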
Initially, load_balance() finds the busiest group in the current sched domain.
If it succeeds, it looks for the busiest runqueue among all the CPUs' runqueues
in that group. If it manages to find such a runqueue, it locks both our initial
CPU's runqueue and the newly found busiest one and starts moving tasks from it
to our runqueue. The exact number of tasks amounts to an imbalance previously
computed while iterating over this sched domain's groups.

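The second step -- picking the busiest CPU runqueue within the busiest group -- reduces to a max scan over the group's members. A hypothetical sketch, reusing bitmask groups and a per-CPU load array (the real load_balance() uses weighted load and several heuristics):

```c
#include <assert.h>

/* Hypothetical sketch of busiest-runqueue selection inside a group:
 * return the index of the most loaded member CPU, or -1 if the group
 * mask selects no CPU with nonzero load. */
static int toy_busiest_cpu(const unsigned long *cpu_load,
			   unsigned long group_mask, int nr_cpus)
{
	int busiest = -1;
	unsigned long max_load = 0;

	for (int i = 0; i < nr_cpus; i++)
		if ((group_mask & (1UL << i)) && cpu_load[i] > max_load) {
			max_load = cpu_load[i];
			busiest = i;
		}
	return busiest;
}
```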
Implementing sched domains
==========================

The "base" domain will "span" the first level of the hierarchy. In the case
of SMT, you'll span all siblings of the physical CPU, with each group being
a single virtual CPU.

In SMP, the parent of the base domain will span all physical CPUs in the
node, with each group being a single physical CPU. Then with NUMA, the parent
of the SMP domain will span the entire machine, with each group having the
cpumask of a node. Alternatively, you could do multi-level NUMA; Opteron, for
example, might have just one domain covering its one NUMA level.

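Putting the levels together, the spans for a hypothetical two-package machine with two SMT siblings per package might look like this. The arrays and validity check are purely illustrative, tying the topology back to the invariants stated earlier:

```c
#include <assert.h>

/* Hypothetical spans, as CPU bitmasks, for a machine with CPUs 0-1 in
 * package 0 and CPUs 2-3 in package 1: the base (SMT) domain of each CPU
 * spans its siblings; the parent (SMP) domain spans the whole machine. */
static const unsigned long smt_span[4] = { 0x3, 0x3, 0xC, 0xC };
static const unsigned long smp_span[4] = { 0xF, 0xF, 0xF, 0xF };

/* Check the invariants from earlier: the base domain for CPU i spans at
 * least CPU i, and each parent's span is a superset of its child's. */
static int toy_topology_valid(int nr_cpus)
{
	for (int i = 0; i < nr_cpus; i++) {
		if (!(smt_span[i] & (1UL << i)))
			return 0;
		if ((smt_span[i] & smp_span[i]) != smt_span[i])
			return 0;
	}
	return 1;
}
```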
The implementor should read comments in include/linux/sched.h:
struct sched_domain fields, SD_FLAG_*, SD_*_INIT to get an idea of
the specifics and what to tune.

Architectures may override the default SD_*_INIT flags while using the
generic domain builder in kernel/sched/core.c if they wish to retain the
traditional SMT->SMP->NUMA topology (or some subset of that). This can be
done by #define'ing ARCH_HAS_SCHED_TUNE.

Alternatively, the architecture may completely override the generic domain
builder by #define'ing ARCH_HAS_SCHED_DOMAIN and exporting its own
arch_init_sched_domains function. This function will attach domains to all
CPUs using cpu_attach_domain.

The sched-domains debugging infrastructure can be enabled via
CONFIG_SCHED_DEBUG. This enables an error-checking parse of the sched
domains, which should catch most of the possible errors described above. It
also prints out the domain structure in a visual format.