.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>

===========================================
Fast & Portable DES encryption & decryption
===========================================

.. note::

   Below is the original README file from the descore.shar package,
   converted to ReST format.

------------------------------------------------------------------------------

des - fast & portable DES encryption & decryption.

Copyright |copy| 1992  Dana L. How

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU Library General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU Library General Public License for more details.

You should have received a copy of the GNU Library General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.

Author's address: how@isl.stanford.edu

.. README,v 1.15 1992/05/20 00:25:32 how E

==>> To compile after untarring/unsharring, just ``make`` <<==

This package was designed with the following goals:

1. Highest possible encryption/decryption PERFORMANCE.
2. PORTABILITY to any byte-addressable host with a 32-bit unsigned C type.
3. Plug-compatible replacement for KERBEROS's low-level routines.

This second release includes a number of performance enhancements for
register-starved machines.  My discussions with Richard Outerbridge,
71755.204@compuserve.com, sparked a number of these enhancements.

To more rapidly understand the code in this package, inspect desSmallFips.i
(created by typing ``make``) BEFORE you tackle desCode.h.  The latter is set
up in a parameterized fashion so it can easily be modified by speed-daemon
hackers in pursuit of that last microsecond.  You will find it more
illuminating to inspect one specific implementation,
and then move on to the common abstract skeleton with this one in mind.


performance comparison to other available des code which i could
compile on a SPARCStation 1 (cc -O4, gcc -O2):

this code (byte-order independent):

  - 30us per encryption (options: 64k tables, no IP/FP)
  - 33us per encryption (options: 64k tables, FIPS standard bit ordering)
  - 45us per encryption (options:  2k tables, no IP/FP)
  - 48us per encryption (options:  2k tables, FIPS standard bit ordering)
  - 275us to set a new key (uses 1k of key tables)

        this has the quickest encryption/decryption routines i've seen.
        since i was interested in fast des filters rather than crypt(3)
        and password cracking, i haven't really bothered yet to speed up
        the key setting routine. also, i have no interest in re-implementing
        all the other junk in the mit kerberos des library, so i've just
        provided my routines with little stub interfaces so they can be
        used as drop-in replacements with mit's code or any of the mit-
        compatible packages below. (note that the first two timings above
        are highly variable because of cache effects).

kerberos des replacement from australia (version 1.95):

  - 53us per encryption (uses 2k of tables)
  - 96us to set a new key (uses 2.25k of key tables)

        so despite the author's inclusion of some of the performance
        improvements i had suggested to him, this package's
        encryption/decryption is still slower on the sparc and 68000.
        more specifically, 19-40% slower on the 68020 and 11-35% slower
        on the sparc,  depending on the compiler;
        in full gory detail (ALT_ECB is a libdes variant):

        ===========  =========  ============  ===========  ============  =========
        compiler     machine    desCore (us)  libdes (us)  ALT_ECB (us)  slower by
        ===========  =========  ============  ===========  ============  =========
        gcc 2.1 -O2  Sun 3/110  304           369.5        461.8         22%
        cc -O1       Sun 3/110  336           436.6        399.3         19%
        cc -O2       Sun 3/110  360           532.4        505.1         40%
        cc -O4       Sun 3/110  365           532.3        505.3         38%
        gcc 2.1 -O2  Sun 4/50   48            53.4         57.5          11%
        cc -O2       Sun 4/50   48            64.6         64.7          35%
        cc -O4       Sun 4/50   48            64.7         64.9          35%
        ===========  =========  ============  ===========  ============  =========

        (my time measurements are not as accurate as his).

   the comments in my first release of desCore on version 1.92:

   - 68us per encryption (uses 2k of tables)
   - 96us to set a new key (uses 2.25k of key tables)

        this is a very nice package which implements the most important
        of the optimizations which i did in my encryption routines.
        it's a bit weak on common low-level optimizations which is why
        it's 39%-106% slower.  because he was interested in fast crypt(3) and
        password-cracking applications,  he also used the same ideas to
        speed up the key-setting routines with impressive results.
        (at some point i may do the same in my package).  he also implements
        the rest of the mit des library.

        (code from eay@psych.psy.uq.oz.au via comp.sources.misc)

fast crypt(3) package from denmark:

        the des routine here is buried inside a loop to do the
        crypt function and i didn't feel like ripping it out and measuring
        performance. his code takes 26 sparc instructions to compute one
        des iteration; above, Quick (64k) takes 21 and Small (2k) takes 37.
        he claims to use 280k of tables but the iteration calculation seems
        to use only 128k.  his tables and code are machine independent.

        (code from glad@daimi.aau.dk via alt.sources or comp.sources.misc)

swedish reimplementation of Kerberos des library:

  - 108us per encryption (uses 34k worth of tables)
  - 134us to set a new key (uses 32k of key tables to get this speed!)

        the tables used seem to be machine-independent;
        he seems to have included a lot of special case code
        so that, e.g., ``long`` loads can be used instead of 4 ``char`` loads
        when the machine's architecture allows it.

        (code obtained from chalmers.se:pub/des)

crack 3.3c package from england:

        as in crypt above, the des routine is buried in a loop. it's
        also very modified for crypt.  his iteration code uses 16k
        of tables and appears to be slow.

        (code obtained from aem@aber.ac.uk via alt.sources or comp.sources.misc)

``highly optimized`` and tweaked Kerberos/Athena code (byte-order dependent):

  - 165us per encryption (uses 6k worth of tables)
  - 478us to set a new key (uses <1k of key tables)

        so despite the comments in this code, it was possible to get
        faster code AND smaller tables, as well as making the tables
        machine-independent.

        (code obtained from prep.ai.mit.edu)

UC Berkeley code (depends on machine-endedness):

  - 226us per encryption
  - 10848us to set a new key

        table sizes are unclear, but they don't look very small.

        (code obtained from wuarchive.wustl.edu)


motivation and history
======================

a while ago i wanted some des routines and the routines documented on sun's
man pages either didn't exist or dumped core.  i had heard of kerberos,
and knew that it used des,  so i figured i'd use its routines.  but once
i got it and looked at the code,  it really set off a lot of pet peeves -
it was too convoluted, the code had been written without taking
advantage of the regular structure of operations such as IP, E, and FP
(i.e. the author didn't sit down and think before coding),
it was excessively slow,  the author had attempted to clarify the code
by adding MORE statements to make the data movement more ``consistent``
instead of simplifying his implementation and cutting down on all data
movement (in particular, his use of L1, R1, L2, R2), and it was full of
idiotic ``tweaks`` for particular machines which failed to deliver significant
speedups but which did obfuscate everything.  so i took the test data
from his verification program and rewrote everything else.

a while later i ran across the great crypt(3) package mentioned above.
the fact that this guy was computing 2 sboxes per table lookup rather
than one (and using a MUCH larger table in the process) emboldened me to
do the same - it was a trivial change from which i had been scared away
by the larger table size.  in his case he didn't realize you don't need to keep
the working data in TWO forms, one for easy use of half the sboxes in
indexing, the other for easy use of the other half; instead you can keep
it in the form for the first half and use a simple rotate to get the other
half.  this means i have (almost) half the data manipulation and half
the table size.  in fairness though he might be encoding something particular
to crypt(3) in his tables - i didn't check.

i'm glad that i implemented it the way i did, because this C version is
portable (the ifdef's are performance enhancements) and it is faster
than versions hand-written in assembly for the sparc!


porting notes
=============

one thing i did not want to do was write an enormous mess
which depended on endedness and other machine quirks,
and which necessarily produced different code and different lookup tables
for different machines.  see the kerberos code for an example
of what i didn't want to do; all their endedness-specific ``optimizations``
obfuscate the code and in the end were slower than a simpler machine
independent approach.  however, there are always some portability
considerations of some kind, and i have included some options
for varying numbers of register variables.
perhaps some will still regard the result as a mess!

1) i assume everything is byte addressable, although i don't actually
   depend on the byte order, and that bytes are 8 bits.
   i assume word pointers can be freely cast to and from char pointers.
   note that 99% of C programs make these assumptions.
   i always use unsigned char's if the high bit could be set.
2) the typedef ``word`` means a 32 bit unsigned integral type.
   if ``unsigned long`` is not 32 bits, change the typedef in desCore.h.
   i assume sizeof(word) == 4 EVERYWHERE.

the (worst-case) cost of my NOT doing endedness-specific optimizations
in the data loading and storing code surrounding the key iterations
is less than 12%.  also, there is the added benefit that
the input and output work areas do not need to be word-aligned.
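the byte-at-a-time loading described above can be sketched as follows.
this is an illustration only, not the package's code: ``load_word`` and
``store_word`` are hypothetical names, ``uint32_t`` stands in for the
``word`` typedef, and the most-significant-byte-first order is an assumption.
because each word is assembled from individual bytes with shifts, the result
is the same on any host byte order and the buffers need no alignment.

.. code-block:: c

   #include <stdint.h>

   /* stand-in for the package's ``word`` typedef (assumption) */
   typedef uint32_t word;

   /* refuses to compile if word is not exactly 4 bytes */
   typedef char word_is_4_bytes[sizeof(word) == 4 ? 1 : -1];

   /* build a word from 4 bytes, most significant byte first (assumption) */
   static word load_word(const unsigned char *p)
   {
       return ((word)p[0] << 24) | ((word)p[1] << 16) |
              ((word)p[2] << 8)  |  (word)p[3];
   }

   /* write a word back as 4 bytes in the same order */
   static void store_word(unsigned char *p, word w)
   {
       p[0] = (unsigned char)(w >> 24);
       p[1] = (unsigned char)(w >> 16);
       p[2] = (unsigned char)(w >> 8);
       p[3] = (unsigned char)w;
   }

both helpers accept any ``unsigned char`` pointer, including one at an odd
address, which is the property the paragraph above relies on.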


OPTIONAL performance optimizations
==================================

1) you should define one of ``i386``, ``vax``, ``mc68000``, or ``sparc``,
   whichever one is closest to the capabilities of your machine.
   see the start of desCode.h to see exactly what this selection implies.
   note that if you select the wrong one, the des code will still work;
   these are just performance tweaks.
2) for those with functional ``asm`` keywords: you should change the
   ROR and ROL macros to use machine rotate instructions if you have them.
   this will save 2 instructions and a temporary per use,
   or about 32 to 40 instructions per en/decryption.

   note that gcc is smart enough to translate the ROL/ROR macros into
   machine rotates!
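without usable ``asm``, a rotate is normally built from two shifts and an or,
which is the pattern a compiler such as gcc can turn back into a single
rotate instruction.  a minimal sketch of that pattern follows; ``rol32`` and
``ror32`` are hypothetical names, not the package's ROL and ROR macros, whose
exact form is in desCode.h.

.. code-block:: c

   #include <stdint.h>

   /* rotate left by n bits; only valid for 0 < n < 32 */
   static uint32_t rol32(uint32_t x, unsigned n)
   {
       return (x << n) | (x >> (32 - n));
   }

   /* rotate right by n bits; only valid for 0 < n < 32 */
   static uint32_t ror32(uint32_t x, unsigned n)
   {
       return (x >> n) | (x << (32 - n));
   }

the two shifts plus the or are the extra instructions and the temporary that
a machine rotate avoids.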

these optimizations are all rather persnickety, yet with them you should
be able to get performance equal to assembly-coding, except that:

1) with the lack of a bit rotate operator in C, rotates have to be synthesized
   from shifts.  so access to ``asm`` will speed things up if your machine
   has rotates, as explained above in (2) (not necessary if you use gcc).
2) if your machine has fewer than 12 32-bit registers i doubt your compiler will
   generate good code.

   ``i386`` tries to configure the code for a 386 by only declaring 3 registers
   (it appears that gcc can use ebx, esi and edi to hold register variables).
   however, if you like assembly coding, the 386 does have 7 32-bit registers,
   and if you use ALL of them, use ``scaled by 8`` address modes with displacement
   and other tricks, you can get reasonable routines for DesQuickCore... with
   about 250 instructions apiece.  For DesSmall... it will help to rearrange
   des_keymap, i.e., now the sbox # is the high part of the index and
   the 6 bits of data is the low part; it helps to exchange these.

   since i have no way to conveniently test it i have not provided my
   shoehorned 386 version.  note that with this release of desCore, gcc is able
   to put everything in registers(!), and generate about 370 instructions apiece
   for the DesQuickCore... routines!

coding notes
============

the en/decryption routines each use 6 necessary register variables,
with 4 being actively used at once during the inner iterations.
if you don't have 4 register variables get a new machine.
up to 8 more registers are used to hold constants in some configurations.

i assume that the use of a constant is more expensive than using a register:

a) additionally, i have tried to put the larger constants in registers.
   registering priority was by the following:

        - anything more than 12 bits (bad for RISC and CISC)
        - greater than 127 in value (can't use movq or byte immediate on CISC)
        - 9-127 (may not be able to use CISC shift immediate or add/sub quick)
        - 1-8 were never registered, being the cheapest constants.

b) the compiler may be too stupid to realize table and table+256 should
   be assigned to different constant registers and instead repetitively
   do the arithmetic, so i assign these to explicit ``m`` register variables
   when possible and helpful.

i assume that indexing is cheaper or equivalent to auto increment/decrement,
where the index is 7 bits unsigned or smaller.
this assumption is reversed for 68k and vax.

i assume that addresses can be cheaply formed from two registers,
or from a register and a small constant.
for the 68000, the ``two registers and small offset`` form is used sparingly.
all index scaling is done explicitly - no hidden shifts by log2(sizeof).

the code is written so that even a dumb compiler
should never need more than one hidden temporary,
increasing the chance that everything will fit in the registers.
KEEP THIS MORE SUBTLE POINT IN MIND IF YOU REWRITE ANYTHING.

(actually, there are some code fragments now which do require two temps,
but fixing it would either break the structure of the macros or
require declaring another temporary).


special efficient data format
==============================

bits are manipulated in this arrangement most of the time (S7 S5 S3 S1)::

        003130292827xxxx242322212019xxxx161514131211xxxx080706050403xxxx

(the x bits are still there, i'm just emphasizing where the S boxes are).
bits are rotated left 4 when computing S6 S4 S2 S0::

        282726252423xxxx201918171615xxxx121110090807xxxx040302010031xxxx

the rightmost two bits are usually cleared so the lower byte can be used
as an index into an sbox mapping table. the next two x'd bits are set
to various values to access different parts of the tables.
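to make the layout concrete: in either arrangement the upper six bits of each
byte form one S box input, so four inputs can be read from the word as it
stands and the other four after a rotate left by 4.  the sketch below only
extracts the eight 6-bit inputs.  it is an illustration with hypothetical
names (``rol32``, ``sbox_inputs``), it assumes the diagrams above are written
most significant bit first, and it does none of the key mixing or table
lookups of the real code.

.. code-block:: c

   #include <stdint.h>

   /* rotate left by n bits; only valid for 0 < n < 32 */
   static uint32_t rol32(uint32_t x, unsigned n)
   {
       return (x << n) | (x >> (32 - n));
   }

   /*
    * in[k] receives the 6-bit input of S box k (numbered 0..7 as in the
    * text).  the odd boxes come from w itself, the even boxes from w
    * rotated left by 4; byte 0 is the least significant byte.
    */
   static void sbox_inputs(uint32_t w, uint8_t in[8])
   {
       uint32_t r = rol32(w, 4);
       int i;

       for (i = 0; i < 4; i++) {
           in[2 * i + 1] = (uint8_t)((w >> (8 * i + 2)) & 0x3f); /* S1 S3 S5 S7 */
           in[2 * i]     = (uint8_t)((r >> (8 * i + 2)) & 0x3f); /* S0 S2 S4 S6 */
       }
   }

masking a byte with ``0xfc`` instead of shifting it down by 2 gives the same
six bits already scaled by 4, which is why the cleared low bits let the byte
serve directly as an offset into a table of 32-bit entries.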


how to use the routines
=======================

datatypes:
        pointer to 8 byte area of type DesData
        used to hold keys and input/output blocks to des.

        pointer to 128 byte area of type DesKeys
        used to hold full 768-bit key.
        must be long-aligned.

DesQuickInit()
        call this before using any other routine with ``Quick`` in its name.
        it generates the special 64k table these routines need.
DesQuickDone()
        frees this table

DesMethod(m, k)
        m points to a 128-byte block, k points to an 8-byte des key
        which must have odd parity (or -1 is returned) and which must
        not be a (semi-)weak key (or -2 is returned).
        normally DesMethod() returns 0.

        m is filled in from k so that when one of the routines below
        is called with m, the routine will act like standard des
        en/decryption with the key k. if you use DesMethod,
        you supply a standard 56-bit key; however, if you fill in
        m yourself, you will get a 768-bit key - but then it won't
        be standard.  it's 768 bits not 1024 because the least significant
        two bits of each byte are not used.  note that these two bits
        will be set to magic constants which speed up the encryption/decryption
        on some machines.  and yes, each byte controls
        a specific sbox during a specific iteration.

        you really shouldn't use the 768-bit format directly;  i should
        provide a routine that converts 128 6-bit bytes (specified in
        S-box mapping order or something) into the right format for you.
        this would entail some byte concatenation and rotation.
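the odd-parity requirement on k means that each of the 8 key bytes must
contain an odd number of 1 bits.  a minimal sketch of such a check follows;
``has_odd_parity`` is a hypothetical name, this is not DesMethod's own code,
and it does not test for (semi-)weak keys.

.. code-block:: c

   #include <stddef.h>

   /* return 1 if every byte of the 8-byte key has odd parity, else 0 */
   static int has_odd_parity(const unsigned char key[8])
   {
       size_t i;

       for (i = 0; i < 8; i++) {
           unsigned b = key[i];

           /* fold the byte so that bit 0 becomes the xor of all 8 bits */
           b ^= b >> 4;
           b ^= b >> 2;
           b ^= b >> 1;
           if ((b & 1u) == 0)
               return 0;   /* even number of 1 bits in this byte */
       }
       return 1;
   }

a caller could run a check like this before DesMethod() to tell a parity
problem apart from other failures, under the assumptions stated above.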

Des{Small|Quick}{Fips|Core}{Encrypt|Decrypt}(d, m, s)
        performs des on the 8 bytes at s into the 8 bytes at d
        (``d``, ``s``: ``char *``).

        uses m as a 768-bit key as explained above.

        the Encrypt|Decrypt choice is obvious.

        Fips|Core determines whether a completely standard FIPS initial
        and final permutation is done; if not, then the data is loaded
        and stored in a nonstandard bit order (FIPS w/o IP/FP).

        Fips slows down Quick by 10%, Small by 9%.

        Small|Quick determines whether you use the normal routine
        or the crazy quick one which gobbles up 64k more of memory.
        Small is 50% slower than Quick, but Quick needs 32 times as much
        memory.  Quick is included for programs that do nothing but DES,
        e.g., encryption filters, etc.


Getting it to compile on your machine
=====================================

there are no machine-dependencies in the code (see porting),
except perhaps the ``now()`` macro in desTest.c.
ALL generated tables are machine independent.
you should edit the Makefile with the appropriate optimization flags
for your compiler (MAX optimization).


Speeding up kerberos (and/or its des library)
=============================================

note that i have included a kerberos-compatible interface in desUtil.c
through the functions des_key_sched() and des_ecb_encrypt().
to use these with kerberos or kerberos-compatible code put desCore.a
ahead of the kerberos-compatible library on your linker's command line.
you should not need to #include desCore.h;  just include the header
file provided with the kerberos library.

Other uses
==========

the macros in desCode.h would be very useful for putting inline des
functions in more complicated encryption routines.