1 files changed, 198 insertions, 0 deletions
diff --git a/vendor/github.com/minio/md5-simd/README.md b/vendor/github.com/minio/md5-simd/README.md
new file mode 100644
index 0000000..fa6fce1
--- /dev/null
+++ b/vendor/github.com/minio/md5-simd/README.md
@@ -0,0 +1,198 @@
+# md5-simd
+This is a SIMD accelerated MD5 package, allowing up to either 8 (AVX2) or 16 (AVX512) independent MD5 sums to be calculated on a single CPU core.
+It was originally based on the [md5vec](https://github.com/igneous-systems/md5vec) repository by Igneous Systems, but has been made more flexible by, among other things, supporting different message sizes per lane and adding AVX512 support.
+`md5-simd` integrates a mechanism similar to the one described in [minio/sha256-simd](https://github.com/minio/sha256-simd#support-for-avx512) that makes it easy for clients to take advantage of the parallel nature of the MD5 calculation. This results in reduced overall CPU load.
+It is important to understand that `md5-simd` **does not speed up** a single threaded MD5 hash sum. 
+Rather it allows multiple __independent__  MD5 sums to be computed in parallel on the same CPU core, 
+thereby making more efficient usage of the computing resources.
+## Usage
+[![Documentation](https://godoc.org/github.com/minio/md5-simd?status.svg)](https://pkg.go.dev/github.com/minio/md5-simd?tab=doc)
+In order to use `md5-simd`, you must first create a `Server`, which can be
+used to instantiate one or more objects for MD5 hashing.
+These objects conform to the regular [`hash.Hash`](https://pkg.go.dev/hash?tab=doc#Hash) interface 
+and as such the normal Write/Reset/Sum functionality works as expected. 
+As an example: 
+```
+    // Create server
+    server := md5simd.NewServer()
+    defer server.Close()
+    // Create hashing object (conforming to hash.Hash)
+    md5Hash := server.NewHash()
+    defer md5Hash.Close()
+    // Write one (or more) blocks
+    md5Hash.Write(block)
+    
+    // Return digest
+    digest := md5Hash.Sum([]byte{})
+```
+To maintain performance, both the [Server](https://pkg.go.dev/github.com/minio/md5-simd?tab=doc#Server)
+and each individual [Hasher](https://pkg.go.dev/github.com/minio/md5-simd?tab=doc#Hasher) should
+be closed using the `Close()` function when no longer needed.
+A Hasher can be reused efficiently by calling [`Reset()`](https://pkg.go.dev/hash?tab=doc#Hash).
+If your system does not support the required instructions, `md5-simd` falls back to using `crypto/md5` for hashing.
+## Limitations
+As explained above, `md5-simd` does not speed up an individual MD5 hash sum computation,
+unless some hierarchical tree construct is used, but that would produce different digests than standard MD5.
+Running a single hash on a server results in approximately half the throughput.
+Instead, it allows running multiple MD5 calculations in parallel on a single CPU core. 
+This can be beneficial in e.g. multi-threaded server applications where many goroutines
+are dealing with many requests and multiple MD5 calculations can be packed/scheduled for parallel execution on a single core.
+This will result in a lower overall CPU usage as compared to using the standard `crypto/md5`
+functionality where each MD5 hash computation will consume a single thread (core).
+It is best to test and measure the overall CPU usage in a representative usage scenario in your application
+to get an overall understanding of the benefits of `md5-simd` as compared to `crypto/md5`, ideally under heavy CPU load.
+Also note that `md5-simd` works best with large objects,
+so if your application only hashes small objects of a few kilobytes
+you may be better off using `crypto/md5`.
+## Performance
+For the best performance writes should be a multiple of 64 bytes, ideally a multiple of 32KB.
+To help with that a [`buffered := bufio.NewWriterSize(hasher, 32<<10)`](https://golang.org/pkg/bufio/#NewWriterSize) 
+can be inserted if you are unsure of the sizes of the writes. 
+Remember to [flush](https://golang.org/pkg/bufio/#Writer.Flush) `buffered` before reading the hash. 
+A single 'server' can process 16 streams concurrently with 1 core (AVX512) or 2 cores (AVX2).
+In situations where it is likely that more than 16 streams are fully loaded, it may be beneficial
+to use multiple servers.
+The following chart compares the multi-core performance of `crypto/md5`, the AVX2 code, and the AVX512 code:
+![md5-performance-overview](chart/Multi-core-MD5-Aggregated-Hashing-Performance.png)
+Compared to `crypto/md5`, the AVX2 version is up to 4x faster:
+```
+$ benchcmp crypto-md5.txt avx2.txt 
+benchmark                     old MB/s     new MB/s     speedup
+BenchmarkParallel/32KB-4      2229.22      7370.50      3.31x
+BenchmarkParallel/64KB-4      2233.61      8248.46      3.69x
+BenchmarkParallel/128KB-4     2235.43      8660.74      3.87x
+BenchmarkParallel/256KB-4     2236.39      8863.87      3.96x
+BenchmarkParallel/512KB-4     2238.05      8985.39      4.01x
+BenchmarkParallel/1MB-4       2233.56      9042.62      4.05x
+BenchmarkParallel/2MB-4       2224.11      9014.46      4.05x
+BenchmarkParallel/4MB-4       2199.78      8993.61      4.09x
+BenchmarkParallel/8MB-4       2182.48      8748.22      4.01x
+```
+Compared to `crypto/md5`, the AVX512 version is up to about 8x faster (for larger block sizes):
+```
+$ benchcmp crypto-md5.txt avx512.txt
+benchmark                     old MB/s     new MB/s     speedup
+BenchmarkParallel/32KB-4      2229.22      11605.78     5.21x
+BenchmarkParallel/64KB-4      2233.61      14329.65     6.42x
+BenchmarkParallel/128KB-4     2235.43      16166.39     7.23x
+BenchmarkParallel/256KB-4     2236.39      15570.09     6.96x
+BenchmarkParallel/512KB-4     2238.05      16705.83     7.46x
+BenchmarkParallel/1MB-4       2233.56      16941.95     7.59x
+BenchmarkParallel/2MB-4       2224.11      17136.01     7.70x
+BenchmarkParallel/4MB-4       2199.78      17218.61     7.83x
+BenchmarkParallel/8MB-4       2182.48      17252.88     7.91x
+```
+These measurements were performed on an AWS EC2 instance of type `c5.xlarge` equipped with a Xeon Platinum 8124M CPU at 3.0 GHz.
+If only one or two inputs are available, the scalar calculation method is used,
+since it is the fastest option in these cases.
+## Operation
+To make operation as easy as possible there is a “Server” coordinating everything. The server keeps track of individual hash states and updates them as new data comes in. This can be visualized as follows:
+![server-architecture](chart/server-architecture.png)
+The data is sent to the server from each hash input in blocks of up to 32KB per round. In our testing we found this to be the block size that yielded the best results.
+Whenever there is data available, the server will collect data for up to 16 hashes and process all 16 lanes in parallel. This means that if 16 hashes have data available, all the lanes will be filled. However, since that may not be the case, the server will fill fewer lanes and do a round anyway. Lanes can also be partially filled if less than 32KB of data is written.
+![server-lanes-example](chart/server-lanes-example.png)
+In this example 4 lanes are fully filled and 2 lanes are partially filled. In this case the black areas will simply be masked out from the results and ignored. This is also why calculating a single hash on a server will not result in any speedup and hash writes should be a multiple of 32KB for the best performance.
+With AVX512, all 16 calculations are done on a single core; with AVX2, they are done on 2 cores if there is data for more than 8 lanes.
+So for optimal usage there should be data available for all 16 hashes. It may be perfectly reasonable to use more than 16 concurrent hashes.
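The round behaviour described above can be modelled with a short sketch. It only illustrates the lane-filling arithmetic and is not the server's actual scheduler: `laneCount`, `blockSize`, and `roundUtilization` are hypothetical names, and the byte counts in `main` are made up to mirror the example with 4 full and 2 partially filled lanes.

```go
package main

import "fmt"

const (
	laneCount = 16       // lanes processed per round, per the text above
	blockSize = 32 << 10 // up to 32KB per lane per round
)

// roundUtilization takes the number of pending bytes for each hash and
// reports how many lanes one round would use, and what fraction of the
// round's total capacity carries real data.
func roundUtilization(pending []int) (lanes int, used float64) {
	total := 0
	for _, n := range pending {
		if lanes == laneCount {
			break // remaining hashes wait for a later round
		}
		if n <= 0 {
			continue // nothing to hash for this input
		}
		if n > blockSize {
			n = blockSize // a lane takes at most one block per round
		}
		total += n
		lanes++
	}
	return lanes, float64(total) / float64(laneCount*blockSize)
}

func main() {
	// Four full lanes and two partially filled lanes (hypothetical sizes).
	pending := []int{32 << 10, 32 << 10, 32 << 10, 32 << 10, 16 << 10, 8 << 10}
	lanes, used := roundUtilization(pending)
	fmt.Printf("lanes used: %d of %d, capacity used: %.1f%%\n", lanes, laneCount, used*100)
}
```

In this model the unused capacity corresponds to the masked-out areas of a round, which is why a single hash on a server gains nothing from the extra lanes.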
+## Design & Tech
+md5-simd has both an AVX2 (8-lane parallel) and an AVX512 (16-lane parallel) version of the algorithm to accelerate the computation, with the following function definitions:
+```
+//go:noescape
+func block8(state *uint32, base uintptr, bufs *int32, cache *byte, n int)
+//go:noescape
+func block16(state *uint32, ptrs *int64, mask uint64, n int)
+```
+The AVX2 version is based on the [md5vec](https://github.com/igneous-systems/md5vec) repository and is essentially unchanged except for minor (cosmetic) changes.
+The AVX512 version is derived from the AVX2 version but adds some further optimizations and simplifications.
+### Caching in upper ZMM registers
+The AVX2 version passes in a `cache8` block of memory (about 0.5 KB) for temporary storage of intermediate results during `ROUND1` which are subsequently used during `ROUND2` through to `ROUND4`.
+Since AVX512 has twice as many registers (32 ZMM registers as compared to 16 YMM registers), it is possible to use the upper 16 ZMM registers for keeping the intermediate states on the CPU. As such, there is no need to pass a corresponding `cache16` into the AVX512 block function.
+### Direct loading using 64-bit pointers
+The AVX2 version uses the `VPGATHERDD` instruction (for YMM) to do a parallel load of 8 lanes using (8 independent) 32-bit offsets. Since there is no control over how the 8 slices that are passed into the (Golang) `blockMd5` function are laid out in memory, it is not possible to derive a "base" address and corresponding offsets (all within 32 bits) for all 8 slices.
+As such, the AVX2 version uses an interim buffer to collect the byte slices to be hashed from all 8 input slices and passes this buffer along with (fixed) 32-bit offsets into the assembly code.
+For the AVX512 version this interim buffer is not needed since the AVX512 code uses a pair of `VPGATHERQD` instructions to directly dereference 64-bit pointers (from a base register address that is initialized to zero).
+Note that two load (gather) instructions are needed because the AVX512 version processes 16 lanes in parallel, requiring 16 times 64 bits = 1024 bits of pointers in total to be loaded. A simple `VALIGND` and `VPORD` are subsequently used to merge the lower and upper halves together into a single ZMM register (that contains 16 lanes of 32-bit DWORDS).
+### Masking support
+Because pointers are passed directly from the Golang slices, we need to protect against NULL pointers.
+For this, a 16-bit mask is passed into the AVX512 assembly code, which is used during the `VPGATHERQD` instructions to mask out lanes that could otherwise result in segmentation violations.
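As an illustration of the masking idea, the sketch below derives a per-lane bit mask from 16 input slices. The helper `laneMask` is hypothetical and is not taken from the package; it only shows how "lane has data" could map to one bit per lane, so that lanes with a nil pointer are never dereferenced by a masked gather.

```go
package main

import "fmt"

// laneMask is a hypothetical helper: it returns a mask with bit i set when
// lane i has data to load. Lanes with a clear bit would be skipped by a
// masked gather, so their (nil) pointers are never dereferenced.
func laneMask(inputs [16][]byte) uint16 {
	var mask uint16
	for i, in := range inputs {
		if len(in) > 0 {
			mask |= 1 << uint(i)
		}
	}
	return mask
}

func main() {
	var inputs [16][]byte
	inputs[0] = []byte("lane zero")
	inputs[3] = []byte("lane three")
	// Only lanes 0 and 3 have data, so only bits 0 and 3 are set.
	fmt.Printf("%016b\n", laneMask(inputs))
}
```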
+### Minor optimizations
+The `roll` macro (three instructions on AVX2) is no longer needed for AVX512 and is replaced by a single `VPROLD` instruction.
+Also, several logical operations from the various ROUNDS of the AVX2 version could be combined into a single instruction using ternary logic (with the `VPTERNLOGD` instruction), resulting in a further simplification and speed-up.
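To show how a three-input boolean function collapses into one ternary-logic operation, the sketch below computes the 8-bit truth table for each of the MD5 round functions from RFC 1321. The helper `ternlogImm` is hypothetical, and the bit ordering (first operand as the most significant index bit) is an assumption; the immediates actually used in the assembly depend on the operand order chosen there and may differ.

```go
package main

import "fmt"

// ternlogImm builds the 8-bit truth table for a three-input bitwise function.
// Bit (a<<2 | b<<1 | c) of the result is f(a, b, c) for single-bit inputs,
// which is how a ternary-logic immediate encodes its function.
func ternlogImm(f func(a, b, c uint8) uint8) uint8 {
	var imm uint8
	for idx := uint8(0); idx < 8; idx++ {
		a, b, c := (idx>>2)&1, (idx>>1)&1, idx&1
		imm |= (f(a, b, c) & 1) << idx
	}
	return imm
}

func main() {
	// MD5 round functions on single bits (x, y, z as in RFC 1321).
	F := func(x, y, z uint8) uint8 { return (x & y) | (^x & z) }
	G := func(x, y, z uint8) uint8 { return (x & z) | (y & ^z) }
	H := func(x, y, z uint8) uint8 { return x ^ y ^ z }
	I := func(x, y, z uint8) uint8 { return y ^ (x | ^z) }
	fmt.Printf("F=%#02x G=%#02x H=%#02x I=%#02x\n",
		ternlogImm(F), ternlogImm(G), ternlogImm(H), ternlogImm(I))
}
```

Each round function uses two to four scalar operations here, while a single ternary-logic instruction evaluates the whole truth table at once for every lane.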
+## Low level block function performance
+The benchmark below shows the (single thread) maximum performance of the `block()` function for AVX2 (having 8 lanes) and AVX512 (having 16 lanes). Also the baseline single-core performance from the standard `crypto/md5` package is shown for comparison.
+```
+BenchmarkCryptoMd5-4                     687.66 MB/s           0 B/op          0 allocs/op
+BenchmarkBlock8-4                       4144.80 MB/s           0 B/op          0 allocs/op
+BenchmarkBlock16-4                      8228.88 MB/s           0 B/op          0 allocs/op
+```
+## License
+`md5-simd` is released under the Apache License v2.0. You can find the complete text in the file LICENSE.
+## Contributing
+Contributions are welcome; please send PRs for any enhancements.
+\ No newline at end of file
