=============================
No-MMU memory mapping support
=============================

The kernel has limited support for memory mapping under no-MMU conditions, such
as are used in uClinux environments. From the userspace point of view, memory
mapping is used in conjunction with the mmap() system call, the shmat() call
and the execve() system call. From the kernel's point of view, execve() mapping
is actually performed by the binfmt drivers, which call back into the mmap()
routines to do the actual work.

Memory mapping behaviour also involves the way fork(), vfork(), clone() and
ptrace() work. Under uClinux there is no fork(), and clone() must be supplied
the CLONE_VM flag.

The behaviour is similar between the MMU and no-MMU cases, but not identical,
and it is also much more restricted in the latter case:

 (#) Anonymous mapping, MAP_PRIVATE

        In the MMU case: VM regions backed by arbitrary pages; copy-on-write
        across fork.

        In the no-MMU case: VM regions backed by arbitrary contiguous runs of
        pages.

 (#) Anonymous mapping, MAP_SHARED

        These behave very much like private mappings, except that they're
        shared across fork() or clone() without CLONE_VM in the MMU case. Since
        the no-MMU case supports neither fork() nor clone() without CLONE_VM,
        behaviour is identical to MAP_PRIVATE there.

 (#) File, MAP_PRIVATE, PROT_READ / PROT_EXEC, !PROT_WRITE

        In the MMU case: VM regions backed by pages read from file; changes to
        the underlying file are reflected in the mapping; copied across fork.

        In the no-MMU case:

         - If one exists, the kernel will re-use an existing mapping to the
           same segment of the same file if that has compatible permissions,
           even if this was created by another process.

         - If possible, the file mapping will be directly on the backing device
           if the backing device has the NOMMU_MAP_DIRECT capability and
           appropriate mapping protection capabilities. Ramfs, romfs, cramfs
           and mtd might all permit this.

         - If the backing device can't or won't permit direct sharing,
           but does have the NOMMU_MAP_COPY capability, then a copy of the
           appropriate bit of the file will be read into a contiguous bit of
           memory and any extraneous space beyond the EOF will be cleared.

         - Writes to the file do not affect the mapping; writes to the mapping
           are visible in other processes (no MMU protection), but should not
           happen.

 (#) File, MAP_PRIVATE, PROT_READ / PROT_EXEC, PROT_WRITE

        In the MMU case: like the non-PROT_WRITE case, except that the pages in
        question get copied before the write actually happens. From that point
        on writes to the file underneath that page no longer get reflected into
        the mapping's backing pages. The page is then backed by swap instead.

        In the no-MMU case: works much like the non-PROT_WRITE case, except
        that a copy is always taken and never shared.

 (#) Regular file / blockdev, MAP_SHARED, PROT_READ / PROT_EXEC / PROT_WRITE

        In the MMU case: VM regions backed by pages read from file; changes to
        pages written back to file; writes to file reflected into pages backing
        mapping; shared across fork.

        In the no-MMU case: not supported.

 (#) Memory backed regular file, MAP_SHARED, PROT_READ / PROT_EXEC / PROT_WRITE

        In the MMU case: As for ordinary regular files.

        In the no-MMU case: The filesystem providing the memory-backed file
        (such as ramfs or tmpfs) may choose to honour an open, truncate, mmap
        sequence by providing a contiguous sequence of pages to map. In that
        case, a shared-writable memory mapping will be possible. It will work
        as for the MMU case. If the filesystem does not provide any such
        support, then the mapping request will be denied.

 (#) Memory backed blockdev, MAP_SHARED, PROT_READ / PROT_EXEC / PROT_WRITE

        In the MMU case: As for ordinary regular files.

        In the no-MMU case: As for memory backed regular files, but the
        blockdev must be able to provide a contiguous run of pages without
        truncate being called. The ramdisk driver could do this if it allocated
        all its memory as a contiguous array upfront.

 (#) Memory backed chardev, MAP_SHARED, PROT_READ / PROT_EXEC / PROT_WRITE

        In the MMU case: As for ordinary regular files.

        In the no-MMU case: The character device driver may choose to honour
        the mmap() by providing direct access to the underlying device if it
        provides memory or quasi-memory that can be accessed directly. Examples
        of such are frame buffers and flash devices. If the driver does not
        provide any such support, then the mapping request will be denied.

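As an illustration of the private, read-only file case in the list above, the
following userspace sketch maps a whole file with MAP_PRIVATE and PROT_READ.
The helper name map_file_private_ro() is hypothetical, and the code assumes a
regular file of non-zero length. It passes no address hint and does not assume
anything about the alignment of the returned pointer, since a no-MMU kernel
may map the data directly off the backing device.

.. code-block:: c

    #include <fcntl.h>
    #include <stddef.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /*
     * Map an entire file read-only and privately (hypothetical helper).
     * Returns NULL on failure; on success *len_out holds the mapped length.
     */
    static const char *map_file_private_ro(const char *path, size_t *len_out)
    {
        struct stat st;
        void *p;
        int fd = open(path, O_RDONLY);

        if (fd < 0)
            return NULL;
        if (fstat(fd, &st) < 0 || st.st_size <= 0) {
            close(fd);
            return NULL;
        }

        /* No address hint and no MAP_FIXED: let the kernel choose. */
        p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);  /* the mapping keeps its own reference to the file */
        if (p == MAP_FAILED)
            return NULL;

        *len_out = (size_t)st.st_size;
        return p;
    }

The caller should treat the buffer as strictly read-only and release it with
munmap(), passing the same base address and length.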

Further notes on no-MMU MMAP
============================

 (#) A request for a private mapping of a file may return a buffer that is not
     page-aligned.  This is because XIP may take place, and the data may not be
     page-aligned in the backing store.

 (#) The buffer returned for an anonymous mapping request will always be page
     aligned.  If possible, the size of the request should be a power of two;
     otherwise some of the space may be wasted, as the kernel must allocate a
     power-of-2 granule but will only discard the excess if appropriately
     configured, as this has an effect on fragmentation.

 (#) The memory allocated by a request for an anonymous mapping will normally
     be cleared by the kernel before being returned in accordance with the
     Linux man pages (ver 2.22 or later).

     In the MMU case this can be achieved with reasonable performance as
     regions are backed by virtual pages, with the contents only being mapped
     to cleared physical pages when a write happens on that specific page
     (prior to which, the pages are effectively mapped to the global zero page
     from which reads can take place).  This spreads out the time it takes to
     initialize the contents of a page - depending on the write-usage of the
     mapping.

     In the no-MMU case, however, anonymous mappings are backed by physical
     pages, and the entire map is cleared at allocation time.  This can cause
     significant delays during a userspace malloc() as the C library does an
     anonymous mapping and the kernel then does a memset for the entire map.

     However, for memory that isn't required to be precleared - such as that
     returned by malloc() - mmap() can take a MAP_UNINITIALIZED flag to
     indicate to the kernel that it shouldn't bother clearing the memory before
     returning it.  Note that CONFIG_MMAP_ALLOW_UNINITIALIZED must be enabled
     to permit this, otherwise the flag will be ignored.

     uClibc uses this to speed up malloc(), and the ELF-FDPIC binfmt uses this
     to allocate the brk and stack region.

 (#) A list of all the private copy and anonymous mappings on the system is
     visible through /proc/maps in no-MMU mode.

 (#) A list of all the mappings in use by a process is visible through
     /proc/<pid>/maps in no-MMU mode.

 (#) Supplying MAP_FIXED or requesting a particular mapping address will
     result in an error.

 (#) Files mapped privately usually have to have a read method provided by the
     driver or filesystem so that the contents can be read into the memory
     allocated if mmap() chooses not to map the backing device directly. An
     error will result if they don't. This is most likely to be encountered
     with character device files, pipes, fifos and sockets.

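The MAP_UNINITIALIZED note above can be sketched from userspace as follows.
This is a minimal example under stated assumptions: alloc_scratch() is a
hypothetical helper, and the fallback value 0x4000000 is assumed from the
generic Linux UAPI headers for the case where the C library does not define
the flag (an architecture may use a different value). The memory must be
treated as having undefined contents, because the kernel may or may not have
cleared it.

.. code-block:: c

    #include <stddef.h>
    #include <sys/mman.h>

    /* Assumed fallback; prefer the definition from the system headers. */
    #ifndef MAP_UNINITIALIZED
    #define MAP_UNINITIALIZED 0x4000000
    #endif

    /*
     * Allocate scratch memory without asking for it to be precleared
     * (hypothetical helper).  Returns NULL on failure.
     */
    static void *alloc_scratch(size_t len)
    {
        /* NULL hint, no MAP_FIXED: a fixed address is an error on no-MMU. */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_UNINITIALIZED,
                       -1, 0);

        return p == MAP_FAILED ? NULL : p;
    }

A caller that needs zeroed memory should either omit the flag or clear the
buffer itself.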

Interprocess shared memory
==========================

Both SYSV IPC SHM shared memory and POSIX shared memory are supported in NOMMU
mode.  The former through the usual mechanism, the latter through files created
on ramfs or tmpfs mounts.

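The file-based route can be sketched as the open, truncate, mmap sequence
below. This is an illustrative userspace example only: map_shared_file() is a
hypothetical helper, and the path passed to it is assumed to live on a ramfs
or tmpfs mount when running on a no-MMU kernel.

.. code-block:: c

    #include <fcntl.h>
    #include <stddef.h>
    #include <sys/mman.h>
    #include <sys/types.h>
    #include <unistd.h>

    /*
     * Create (or open) a file, size it, and map it shared and writable
     * (hypothetical helper).  Returns NULL on failure.
     */
    static void *map_shared_file(const char *path, size_t len)
    {
        void *p;
        int fd = open(path, O_RDWR | O_CREAT, 0600);

        if (fd < 0)
            return NULL;

        /* Set the file size before mapping it. */
        if (ftruncate(fd, (off_t)len) < 0) {
            close(fd);
            return NULL;
        }

        p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);
        return p == MAP_FAILED ? NULL : p;
    }

Every process that maps the same file this way sees the same memory; the
mapping is released with munmap() and the file removed with unlink().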

Futexes
=======

Futexes are supported in NOMMU mode if the arch supports them.  An error will
be given if an address passed to the futex system call lies outside the
mappings made by a process or if the mapping in which the address lies does not
support futexes (such as an I/O chardev mapping).

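A minimal sketch of calling the futex system call on a word inside the
process's own mappings is shown below. The wrapper names futex_wait() and
futex_wake() are hypothetical, since the C library does not export a futex()
function, and the sketch assumes an architecture that provides SYS_futex.

.. code-block:: c

    #include <errno.h>
    #include <linux/futex.h>
    #include <stdint.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Sleep while *addr == expected; fails at once if the values differ. */
    static long futex_wait(uint32_t *addr, uint32_t expected)
    {
        return syscall(SYS_futex, addr, FUTEX_WAIT, expected, NULL, NULL, 0);
    }

    /* Wake up to n waiters on addr; returns the number actually woken. */
    static long futex_wake(uint32_t *addr, int n)
    {
        return syscall(SYS_futex, addr, FUTEX_WAKE, n, NULL, NULL, 0);
    }

The futex word must lie in memory the process has mapped, such as its data,
stack, or an anonymous mapping; per the text above, an address in an
unsupported mapping is expected to make the call fail with an error.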

No-MMU mremap
=============

The mremap() function is partially supported.  It may change the size of a
mapping, and may move it [#]_ if MREMAP_MAYMOVE is specified and if the new size
of the mapping exceeds the size of the slab object currently occupied by the
memory to which the mapping refers, or if a smaller slab object could be used.

MREMAP_FIXED is not supported, though it is ignored if there's no change of
address and the object does not need to be moved.

Shared mappings may not be moved.  Shareable mappings may not be moved either,
even if they are not currently shared.

The mremap() function must be given an exact match for base address and size of
a previously mapped object.  It may not be used to create holes in existing
mappings, move parts of existing mappings or resize parts of mappings.  It must
act on a complete mapping.

.. [#] Not currently supported.

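The exact-match rule can be illustrated with the sketch below, which resizes a
complete mapping by passing back the same base address and length that mmap()
returned. resize_whole_mapping() is a hypothetical helper. It goes through
syscall() only so that the example does not depend on the feature-test macros
that guard the glibc mremap() prototype; calling mremap() directly with the
same arguments is equivalent. The fallback flag value is assumed from the
Linux UAPI headers.

.. code-block:: c

    #include <stddef.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Assumed fallback; glibc exposes this only under _GNU_SOURCE. */
    #ifndef MREMAP_MAYMOVE
    #define MREMAP_MAYMOVE 1
    #endif

    /*
     * Resize a whole mapping (hypothetical helper).  'base' and 'old_len'
     * must be exactly what the original mmap() call produced.  Returns the
     * possibly unchanged base address, or MAP_FAILED on error.
     */
    static void *resize_whole_mapping(void *base, size_t old_len,
                                      size_t new_len)
    {
        /* On error syscall() returns -1, which is MAP_FAILED as a pointer. */
        return (void *)syscall(SYS_mremap, base, old_len, new_len,
                               MREMAP_MAYMOVE);
    }

Callers should be prepared for failure when the mapping is shared or
shareable, or when the request would require a move that the kernel cannot
perform.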

Providing shareable character device support
============================================

To provide shareable character device support, a driver must provide a
file->f_op->get_unmapped_area() operation. The mmap() routines will call this
to get a proposed address for the mapping. This may return an error if it
doesn't wish to honour the mapping because it's too long, at a weird offset,
under some unsupported combination of flags, or for any other reason.

The driver should also provide backing device information with capabilities set
to indicate the permitted types of mapping on such devices. The default is
assumed to be readable and writable, not executable, and only shareable
directly (can't be copied).

The file->f_op->mmap() operation will be called to actually inaugurate the
mapping. It can be rejected at that point. Returning the ENOSYS error will
cause the mapping to be copied instead if NOMMU_MAP_COPY is specified.

The vm_ops->close() routine will be invoked when the last mapping on a chardev
is removed. An existing mapping will be shared, partially or not, if possible
without notifying the driver.

The file->f_op->get_unmapped_area() operation is also permitted to return
-ENOSYS. This will be taken to mean that this operation just doesn't want to
handle it, despite the fact it's got an operation. For instance, it might try
directing the call to a secondary driver which turns out not to implement it.
Such is the case for the framebuffer driver which attempts to direct the call
to the device-specific driver. Under such circumstances, the mapping request
will be rejected if NOMMU_MAP_COPY is not specified, and a copy mapped
otherwise.

.. important::

        Some types of device may present a different appearance to anyone
        looking at them in certain modes. Flash chips can be like this; for
        instance if they're in programming or erase mode, you might see the
        status reflected in the mapping, instead of the data.

        In such a case, care must be taken lest userspace see a shared or a
        private mapping showing such information when the driver is busy
        controlling the device. Remember especially: private executable
        mappings may still be mapped directly off the device under some
        circumstances!

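The -ENOSYS fallback described in this section can be summarised by the small
userspace model below. It is not kernel code: resolve_mapping(), the
map_outcome enum and MODEL_CAP_MAP_COPY are hypothetical names that only mirror
the prose (a zero status means the driver accepted the mapping, -ENOSYS means
it declined to handle it, and any other error is a rejection). The model also
ignores the distinction between shared and private requests.

.. code-block:: c

    #include <errno.h>

    /* Hypothetical capability bit for this model; not the kernel's value. */
    #define MODEL_CAP_MAP_COPY 0x1u

    enum map_outcome {
        MAP_OUTCOME_DIRECT,   /* driver accepted: map the device directly */
        MAP_OUTCOME_COPY,     /* driver declined: map a copy instead */
        MAP_OUTCOME_REJECTED  /* the mapping request fails */
    };

    /*
     * driver_ret: 0 on success or a negative errno, as returned by the
     *             driver's get_unmapped_area() or mmap() operation.
     * caps:       capability bits advertised for the backing device.
     */
    static enum map_outcome resolve_mapping(int driver_ret, unsigned int caps)
    {
        if (driver_ret == 0)
            return MAP_OUTCOME_DIRECT;

        /* Only -ENOSYS falls back to a copy, and only if copying is allowed. */
        if (driver_ret == -ENOSYS && (caps & MODEL_CAP_MAP_COPY))
            return MAP_OUTCOME_COPY;

        return MAP_OUTCOME_REJECTED;
    }

For example, a driver that returns -ENOSYS for a device without the copy
capability ends up with a rejected request, while the same return value with
the capability set yields a copied mapping.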

Providing shareable memory-backed file support
==============================================

Provision of shared mappings on memory backed files is similar to the provision
of support for shared mapped character devices. The main difference is that the
filesystem providing the service will probably allocate a contiguous collection
of pages and permit mappings to be made on that.

It is recommended that a truncate operation applied to such a file that
increases the file size, if that file is empty, be taken as a request to gather
enough pages to honour a mapping. This is required to support POSIX shared
memory.

Memory backed devices are indicated by the mapping's backing device info having
the memory_backed flag set.


Providing shareable block device support
========================================

Provision of shared mappings on block device files is exactly the same as for
character devices. If there isn't a real device underneath, then the driver
should allocate sufficient contiguous memory to honour any supported mapping.


Adjusting page trimming behaviour
=================================

NOMMU mmap automatically rounds up to the nearest power-of-2 number of pages
when performing an allocation.  This can have adverse effects on memory
fragmentation, and as such, is left configurable.  The default behaviour is to
aggressively trim allocations and discard any excess pages back into the page
allocator.  In order to retain finer-grained control over fragmentation, this
behaviour can either be disabled completely, or bumped up to a higher page
watermark where trimming begins.

Page trimming behaviour is configurable via the sysctl ``vm.nr_trim_pages``.
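
The arithmetic behind the rounding and trimming can be sketched with the small
model below. It is an approximation under stated assumptions rather than the
kernel's implementation: pow2_pages() and pages_retained() are hypothetical
names, a ``vm.nr_trim_pages`` value of 0 is assumed to disable trimming, and a
value of N >= 1 is assumed to trim only when at least N excess pages would
otherwise be kept.

.. code-block:: c

    #include <stddef.h>

    /* Round a page count (>= 1) up to the next power of two. */
    static size_t pow2_pages(size_t pages)
    {
        size_t n = 1;

        while (n < pages)
            n <<= 1;
        return n;
    }

    /*
     * Model of how many pages a mapping of 'request_pages' keeps for a
     * given trim watermark (assumed semantics, see the text above).
     */
    static size_t pages_retained(size_t request_pages, size_t nr_trim_pages)
    {
        size_t total = pow2_pages(request_pages);
        size_t excess = total - request_pages;

        /* Trim back to the request once the excess reaches the watermark. */
        if (nr_trim_pages != 0 && excess >= nr_trim_pages)
            return request_pages;

        return total;  /* trimming disabled, or excess below the watermark */
    }

Under this model a 5-page request is rounded up to 8 pages; with a watermark
of 1 the 3 excess pages are returned to the page allocator, while with a
watermark of 0 or 4 all 8 pages stay allocated.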