Release note of version 1.3.3.0

Set release date
Add MediaList4 option
2023-10-10 13:08:48 +09:00 · 2023-10-10 13:02:35 +09:00 · 2023-09-30 18:47:27 +09:00 · 2023-09-30 18:46:28 +09:00 · 2023-09-30 18:44:34 +09:00 · 2023-09-27 11:21:22 +09:00
26 changed files with 2344 additions and 1354 deletions
--- a/README.md
+++ b/README.md
@@ -1,57 +1,65 @@
 # MultiPar
-### v1.3.2.9 is public
+### v1.3.3.0 is public
 &nbsp; This is the final release of v1.3.2 tree. 
 Because I want to public this as a stable version, I didn't change contents so much. 
 PAR clients are same as previous version. 
 Including long term used applications may be good to avoid false positive at Malware detection.
-&nbsp; I fixed a [compatibility issue in calling 7-Zip](https://github.com/Yutaka-Sawada/MultiPar/issues/92), 
+&nbsp; This is a testing version to improve speed of PAR2 calculation. 
-which I didn't know the change. 
+Because the new method isn't tested so much, there may be a bug, failure, or mistake. 
-Thanks Lyoko-Jeremie for bug report. 
+Be careful to use this non-stable version. 
-The incident happened, when a user selected many files.
+When you don't want to test by yourself, you should not use this yet. 
 If you see a problem, please report the incident. 
 I will try to solve as possible as I can.
-&nbsp; I made a sample feature to Save & Restore different "base directories". 
+&nbsp; The PAR2 calculation speed may be 10% ~ 50% faster than old version. 
-When you put PAR files in another folder from source files, it will set the previous directory automatically. 
+The optimization depends on hardware environment. 
-Because this feature was tested little, it's disabled by default at this time. 
+I don't know what is the best setting on which PC. 
-If you want to enable, add section `[Path]` on "MultiPar.ini". 
+From [many tests of debug versions](https://github.com/Yutaka-Sawada/MultiPar/issues/99), 
-Then set `MRUMax` value, which is the maximum number of stored directries. 
+it will select maybe better setting automatically. 
-You may set the value upto 26. It's disabled, when the value is 0. 
+Thanks testers for many trials. 
-These two lines are like below:
+If you want to compare speed of different settings on your PC, you may try those debug versions.
 ```
 [Path]
 MRUMax=5
 ```
-&nbsp; While I made MultiPar as an utility tool, I didn't give priority to its speed. 
+&nbsp; I changed GPU implementation largely, too. 
-If someone wants faster Parchive tool, I suggest to use ParPar tools instead of MultiPar. 
+To adopt CPU optimization, it will process smaller tasks on GPU. 
-They are "[High performance PAR2 create client for NodeJS](https://github.com/animetosho/ParPar)" or 
+Because GPU don't use CPU's cache, it's inefficient for GPU's task. 
-"[speed focused par2cmdline fork](https://github.com/animetosho/par2cmdline-turbo)". 
+I don't know that new method is faster than old version or not.
-Though the speed depends on hardware environments and user's setting, it would be 50% ~ 100 % faster than my par2j. 
+
-Only when you have a very fast graphics borad, GPU enabled par2j may be faster. 
+Threshold to use GPU:
-I plan to improve speed of par2j in next v1.3.3 tree.
+- Data size must be larger than 200 MB.
-Though it will become 20% ~ 30% faster than old par2j, ParPar would be faster mostly.
+- Block size must be larger than 64 KB.
 - Number of source blocks must be more than 192.
 - Number of recovery blocks must be more than 8.
 &nbsp; Because [a user requested](https://github.com/Yutaka-Sawada/MultiPar/issues/102), 
 I implemented a way to add 5th item in "Media size" on Create window. 
 Write this line `MediaList4=name:size` under `[Option]` section in `MultiPar.ini`. 
 Currently, you cannot change the item on Option window.
-[ Changes from 1.3.2.8 to 1.3.2.9 ]  
+[ Changes from 1.3.2.9 to 1.3.3.0 ]  
 GUI update
- New
+- Change
-  - Verification may save different base directories in MultiPar.ini file.
+  - Option adapted to new "lc" settings.
  - It's possible to add 5th item in "Media size" on Create window.
- Bug fix
+PAR2 client update
-  - Archiver's option was updated for recent 7-Zip versions.
+- Change
  - Max number of using threads is increased to 32.
  - Threshold to use GPU was decreased.
 - Improvement
  - Matrix inversion may use more threads.
  - L3 cache optimization was improved for recent CPUs.
 [ Hash value ]  
-MultiPar132.zip  
+MultiPar1330.zip  
-MD5: 305D86C8C7A0F5C1A23CEAFFBE4F02BF  
+MD5: 79570F84B74ECF8E5100561F7AAC3803  
-SHA1: 464BB7AB7D14FD35D2AEF99042EEB8E556DA0417  
+SHA1: ACF7F164001708789C5D94003ED6B5C172235D54  
-MultiPar132_setup.exe  
+MultiPar1330_setup.exe  
-MD5: 18F9BE1FF1C6D668E3A3906C691CCB98  
+MD5: D1F1A5A4DF1C9EDD698C9A017AF31039  
-SHA1: 116C6B2A15FCFD9BB74F0EF9D6C8A4BF78299588  
+SHA1: 4C3314B909572A303EBBE8E015A2E813841CFA33  
 &nbsp; To install under "Program Files" or "Program Files (x86)" directory, 
 you must start the installer with administrative privileges by selecting 
 "Run as administrator" on right-click menu.
--- a/alpha/MultiPar.exe
+++ b/alpha/MultiPar.exe
--- a/alpha/Update_English.txt
+++ b/alpha/Update_English.txt
@@ -1,3 +1,23 @@
 Release note of v1.3.3 tree
 [ Changes from 1.3.2.9 to 1.3.3.0 ] (2023/10/10)
 GUI update
 Change
  Option adapted to new "lc" settings.
  It's possible to add 5th item in "Media size" on Create window.
 PAR2 client update
 Change
  Max number of using threads is increased to 32.
  Threshold to use GPU was decreased.
 Improvement
  Matrix inversion may use more threads.
  L3 cache optimization was improved for recent CPUs.
 /* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */
 Release note of v1.3.2 tree
 I tried to decrease probability of false positive at Malware detection.
--- a/alpha/Update_Japanese.txt
+++ b/alpha/Update_Japanese.txt
@@ -1,9 +1,20 @@
-v1.3.2 の更新情報 (2023/08/26)
+v1.3.3 の更新情報 (2023/10/10)
-[ 1.3.1 から 1.3.2 への変更点 ]
+　まだ動作実験中ですので、不安な人は前のバージョンを使ってください。
 [ 1.3.2 から 1.3.3 への変更点 ]
 ・クライアントの変更点
 　CPU Cache の利用方法を改善して速くなりました。
 /* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */
 v1.3.2 の更新情報 (2023/08/26)
 　マルウェアとして誤検知されにくいようにしました。
 [ 1.3.1 から 1.3.2 への変更点 ]
 ・GUI の変更点
 　動作完了時に Python スクリプトを呼び出せるようにしました。
 　Python スクリプトで作った便利ツールをいくつか追加しました。
--- a/alpha/help/Command_GUI.txt
+++ b/alpha/help/Command_GUI.txt
@@ -1,4 +1,4 @@
-[ MultiPar GUI - version 1.3.2.9 or later ]
+[ MultiPar GUI - version 1.3.3.0 or later ]
 Usage: MultiPar.exe [command] [/base path] [/list path] [files]
@@ -106,6 +106,12 @@ Because of alphabet, "MRUMax=26" is the maximum.
 If you write "MRUMax=0" or remove the line,
 "Most Resent Used List" is disabled.
 If you want to add 5th item in "Media size" list on Create window,
 write this line "MediaList4=name:size" under "[Option]" section.
 Because the name and size are splitted by ":",
 you cannot include ":" in the name.
 The max size is 999999999999 bytes. (931 GB)
 Example of lines on "MultiPar.ini";
@@ -114,5 +120,6 @@ FontName=Arial
 RedundancyMax=10
 Sort=8
 RecoveryFileLimit=1
 MediaList4=7.9GB DVD:8480000000
 [Path]
 MRUMax=5
--- a/alpha/help/Command_par2j.txt
+++ b/alpha/help/Command_par2j.txt
@@ -1,4 +1,4 @@
-[ par2j.exe - version 1.3.2.8 or later ]
+[ par2j.exe - version 1.3.3.0 or later ]
 Type "par2j.exe" to see version, test integrity, and show usage below.
@@ -359,19 +359,22 @@ the protected archive file is made in the directory.
 /lc :
 Set this, if you want to set number of using threads for Multi-Core CPU,
 or want to disable extra feature. (SSE2 is always used.)
-The format is "/lc#", # is from 1 to 11 as the number of using threads,
+The format is "/lc#" (# is from 1 to 32 as the number of using threads).
 12 to use quarter number of physical Cores,
 13 to use half of physical Cores,
 14 to use 3/4 number of physical Cores,
 15 to use the number of physical Cores (disable Hyper Threading),
 or 15 to use one less number of physical Cores on CPU with 6 or more physical Cores.
 Without this option (or /lc0),
 it uses the number of physical Cores on CPU with 6 or more physical Cores,
 or one more threads on CPU with Hyper Threading and 5 or less physical Cores.
- You may set additional combinations; +16 to disable SSSE3,
+ It's possible to set by rate as following. (It's /lc0 by default.)
-+128 to disable CLMUL, +256 to disable JIT, +512 to disable AVX2,
+251: It uses quarter number of physical Cores.
-+32 or +64 (slower device) to enable GPU acceleration.
+252: It uses half of physical Cores.
 253: It uses 3/4 number of physical Cores.
 254: It uses one less threads than number of physical Cores.
  0: It uses the number of physical Cores.
 255: It uses one more threads than number of physical Cores.
 You may set additional combinations;
 +1024 to disable CLMUL (and use old SSSE3 code), 
 +2048 to disable JIT (for SSE2), 
 +4096 to disable SSSE3,
 +8192 to disable AVX2,
 +256 or +512 (slower device) to enable GPU acceleration.
 for example,  /lc1 to use single Core, /lc45 to use half Cores and GPU
--- a/alpha/par2j.exe
+++ b/alpha/par2j.exe
--- a/alpha/par2j64.exe
+++ b/alpha/par2j64.exe
--- a/source/par2j/Command_par2j.txt
+++ b/source/par2j/Command_par2j.txt
@@ -1,4 +1,4 @@
-[ par2j.exe - version 1.3.2.8 or later ]
+[ par2j.exe - version 1.3.3.0 or later ]
 Type "par2j.exe" to see version, test integrity, and show usage below.
@@ -359,19 +359,22 @@ the protected archive file is made in the directory.
 /lc :
 Set this, if you want to set number of using threads for Multi-Core CPU,
 or want to disable extra feature. (SSE2 is always used.)
-The format is "/lc#", # is from 1 to 11 as the number of using threads,
+The format is "/lc#" (# is from 1 to 32 as the number of using threads).
 12 to use quarter number of physical Cores,
 13 to use half of physical Cores,
 14 to use 3/4 number of physical Cores,
 15 to use the number of physical Cores (disable Hyper Threading),
 or 15 to use one less number of physical Cores on CPU with 6 or more physical Cores.
 Without this option (or /lc0),
 it uses the number of physical Cores on CPU with 6 or more physical Cores,
 or one more threads on CPU with Hyper Threading and 5 or less physical Cores.
- You may set additional combinations; +16 to disable SSSE3,
+ It's possible to set by rate as following. (It's /lc0 by default.)
-+128 to disable CLMUL, +256 to disable JIT, +512 to disable AVX2,
+251: It uses quarter number of physical Cores.
-+32 or +64 (slower device) to enable GPU acceleration.
+252: It uses half of physical Cores.
 253: It uses 3/4 number of physical Cores.
 254: It uses one less threads than number of physical Cores.
  0: It uses the number of physical Cores.
 255: It uses one more threads than number of physical Cores.
 You may set additional combinations;
 +1024 to disable CLMUL (and use old SSSE3 code), 
 +2048 to disable JIT (for SSE2), 
 +4096 to disable SSSE3,
 +8192 to disable AVX2,
 +256 or +512 (slower device) to enable GPU acceleration.
 for example,  /lc1 to use single Core, /lc45 to use half Cores and GPU
--- a/source/par2j/common2.c
+++ b/source/par2j/common2.c
@@ -1,5 +1,5 @@
 // common2.c
-// Copyright : 2023-03-14 Yutaka Sawada
+// Copyright : 2023-09-23 Yutaka Sawada
 // License : GPL
 #ifndef _UNICODE
@@ -1849,8 +1849,9 @@ int sqrt32(int num)
 int cpu_num = 1;	// CPU/Core 個数が制限されてる場合は、上位に本来の数を置く
 // /arch:SSE2, +1=SSSE3, +2=SSE4.1, +4=SSE4.2, +8=CLMUL, +16=AVX2, +128=JIT(SSE2), +256=Old
 // 上位 16-bit = L2 cache サイズから計算した制限サイズ
 unsigned int cpu_flag = 0;
-unsigned int cpu_cache = 0;	// 上位 16-bit = L2 cache * 2, 下位 16-bit = L3 cache
+unsigned int cpu_cache = 0;	// 上位 16-bit = L3 cache の制限サイズ, 下位 16-bit = 同時処理数
 unsigned int memory_use = 0;	// メモリー使用量 0=auto, 1～7 -> 1/8 ～ 7/8
 static int count_bit(DWORD_PTR value)
@@ -1869,7 +1870,7 @@ static int count_bit(DWORD_PTR value)
 void check_cpu(void)
 {
 	int core_count = 0, use_count;
-	unsigned int CPUInfo[4];
+	unsigned int CPUInfo[4], limit_size = 0;
 	unsigned int returnLength, byteOffset;
 	DWORD_PTR ProcessAffinityMask, SystemAffinityMask; // 32-bit なら 4バイト、64-bit なら 8バイト整数
 	PSYSTEM_LOGICAL_PROCESSOR_INFORMATION buffer = NULL, ptr;
@@ -2006,42 +2007,52 @@ void check_cpu(void)
 		//printf("Number of available physical processor cores: %d\n", core_count);
 		if (cache3_size > 0){
 			//printf("L3 cache: %d KB (%d way)\n", cache3_size >> 10 , cache3_way);
-			cache3_size /= cache3_way;	// set-associative のサイズにする
+			cpu_cache = cache3_size / cache3_way;	// set-associative のサイズにする
-			if (cache3_size < 131072)
+			if (cpu_cache < 131072)
-				cache3_size = 128 << 10;	// 128 KB 以上にする
+				cpu_cache = 128 << 10;	// 128 KB 以上にする
 		}
 		if (cache2_size > 0){
 			//printf("L2 cache: %d KB (%d way)\n", cache2_size >> 10, cache2_way);
-			cache2_size /= cache2_way;	// set-associative のサイズにする
+			limit_size = cache2_size / cache2_way;	// set-associative のサイズにする
-			if (cache2_size < 32768)
+			if (limit_size < 65536)
-				cache2_size = 32 << 10;	// 32 KB 以上にする
+				limit_size = 64 << 10;	// 64 KB 以上にする
-			//printf("Limit size of Cache Blocking: %d KB\n", cache2_size >> 10);
+			// 同時処理数を決める
-			cpu_cache = cache2_size | (cache3_size >> 17);
+			if (cache2_way >= 16){
 				returnLength = cache2_way / 2;	// L2 cache の分割数が多い場合は、その半分にする
 			} else {
 				returnLength = 0;
 			}
 			if (cache3_size > 0){	// L2 cache に対する L3 cache のサイズの倍率にする
 				byteOffset = cache3_size / cache2_size;
 				if (returnLength < byteOffset){
 					returnLength = byteOffset;
 					if (cache2_way >= cache3_way)	// L2 cache の分割数が L3 cache 以上なら 1.5倍にする
 						returnLength += returnLength / 2;
 				}
 			}
 			cpu_cache |= returnLength & 0x1FFFF;
 		}
 	}
-	if (cpu_cache == 0)	// キャッシュ・サイズが不明なら、128 KB にする
+	if (limit_size == 0)	// キャッシュ・サイズが不明なら、128 KB にする
-		cpu_cache = 128 << 10;
+		limit_size = 128 << 10;
 	//printf("Limit size of Cache Blocking: %d KB\n", limit_size >> 10);
 	// cpu_flag の上位 16-bit にキャッシュの制限サイズを置く
 	cpu_flag |= limit_size & 0xFFFF0000;	// 64 KB 未満は無視する
 	if (core_count == 0){	// 物理コア数が不明なら、論理コア数と同じにする
 		core_count = cpu_num;
 		use_count = cpu_num;
-	} else if (core_count < cpu_num){	// 物理コア数が共有されてるなら
+	} else if (core_count < cpu_num){	// 物理コアが共有されてるなら
-		if (core_count >= 6){			// 6 コア以上ならそれ以上増やさない
+		use_count = core_count;	// 物理コア数と同じにする
 			use_count = core_count;
 		} else {	// 2~5 コアなら 1個だけ増やす
 			use_count = core_count + 1;
 		}
 	} else {
-		use_count = core_count;
+		use_count = cpu_num;	// 論理コア数と同じにする
 	}
 	if (use_count > MAX_CPU)	// 利用するコア数が実装上の制限を越えないようにする
 		use_count = MAX_CPU;
 	//printf("Core count: logical, physical, use = %d, %d, %d\n", cpu_num, core_count, use_count);
 	// 上位に論理コア数と物理コア数、下位に利用するコア数を配置する
 	cpu_num = (cpu_num << 24) | (core_count << 16) | use_count;
 	// cpu_flag の上位 17-bit にキャッシュの制限サイズを置く
 	cpu_flag |= cpu_cache & 0xFFFF8000;	// 32 KB 未満は無視する
 }
 // OS が 32-bit か 64-bit かを調べる
--- a/source/par2j/common2.h
+++ b/source/par2j/common2.h
@@ -6,11 +6,11 @@ extern "C" {
 #endif
 #ifndef _WIN64	// 32-bit 版なら
-#define MAX_CPU			8			// 32-bit 版は少なくしておく
+#define MAX_CPU			16			// 32-bit 版は少なくしておく
 #define MAX_MEM_SIZE	0x7F000000	// 確保するメモリー領域の最大値 2032MB
 #define MAX_MEM_SIZE32	0x50000000	// 32-bit OS で確保するメモリー領域の最大値 1280MB
 #else
-#define MAX_CPU			16			// 最大 CPU/Core 個数 (スレッド本数)
+#define MAX_CPU			32			// 最大 CPU/Core 個数 (スレッド本数)
 #endif
 #define MAX_LEN			1024		// ファイル名の最大文字数 (末尾のNULL文字も含む)
--- a/source/par2j/create.c
+++ b/source/par2j/create.c
@@ -1,5 +1,5 @@
 // create.c
-// Copyright : 2022-02-16 Yutaka Sawada
+// Copyright : 2023-09-23 Yutaka Sawada
 // License : GPL
 #ifndef _UNICODE
--- a/source/par2j/gf16.c
+++ b/source/par2j/gf16.c
@@ -80,6 +80,9 @@ void galois_align32_multiply(unsigned char *r1, unsigned char *r2, unsigned int
 void galois_align32avx_multiply(unsigned char *r1, unsigned char *r2, unsigned int len, int factor);
 void galois_align256_multiply(unsigned char *r1, unsigned char *r2, unsigned int len, int factor);
 void galois_align32_multiply2(unsigned char *src1, unsigned char *src2, unsigned char *dst, unsigned int len, int factor1, int factor2);
 void galois_align32avx_multiply2(unsigned char *src1, unsigned char *src2, unsigned char *dst, unsigned int len, int factor1, int factor2);
 void galois_altmap_none(unsigned char *data, unsigned int bsize);
 // AVX2 と SSSE3 の ALTMAP は 32バイト単位で行う
@@ -125,6 +128,7 @@ int galois_create_table(void)
 	// CPU によって使う関数を変更する
 	sse_unit = 16;	// 16, 32, 64, 128 のどれでもいい (32のSSSE3は少し速い、GPUが識別するのに注意)
 	galois_align_multiply = galois_align16_multiply;
 	galois_align_multiply2 = NULL;
 	galois_altmap_change = galois_altmap_none;
 	galois_altmap_return = galois_altmap_none;
 	checksum16_altmap = checksum16;
@@ -135,6 +139,7 @@ int galois_create_table(void)
 		//printf("\nUse AVX2 & ALTMAP\n");
 		sse_unit = 32;	// 32, 64, 128 のどれでもいい
 		galois_align_multiply = galois_align32avx_multiply;
 		galois_align_multiply2 = galois_align32avx_multiply2;
 		galois_altmap_change = galois_altmap32_change;
 		galois_altmap_return = galois_altmap32_return;
 		checksum16_altmap = checksum16_altmap32;
@@ -144,6 +149,7 @@ int galois_create_table(void)
 			//printf("\nUse SSSE3 & ALTMAP\n");
 			sse_unit = 32;	// 32, 64, 128 のどれでもいい
 			galois_align_multiply = galois_align32_multiply;
 			galois_align_multiply2 = galois_align32_multiply2;
 			galois_altmap_change = galois_altmap32_change;
 			galois_altmap_return = galois_altmap32_return;
 			checksum16_altmap = checksum16_altmap32;
@@ -154,6 +160,7 @@ int galois_create_table(void)
 			//printf("\nUse JIT(SSE2) & ALTMAP\n");
 			sse_unit = 256;
 			galois_align_multiply = galois_align256_multiply;
 			galois_align_multiply2 = NULL;
 			galois_altmap_change = galois_altmap256_change;
 			galois_altmap_return = galois_altmap256_return;
 			checksum16_altmap = checksum16_altmap256;
@@ -777,19 +784,21 @@ lp32:
 #else	// 64-bit 版ではインライン・アセンブラを使えない
 // (__m128i *) で逐次ポインターをキャスト変換するよりも、
 // 先に __m128i* で定義しておいた方が、連続した領域へのアクセス最適化がうまくいく？
 // ほとんど変わらない気がする（むしろ遅い？）・・・コンパイラ次第なのかも
 // tables for split four combined multiplication
-static void create_eight_table(unsigned char *mtab, int factor){
+static void create_eight_table(unsigned char *mtab, int factor)
 {
 	int count = 4;
 	__m128i *tbl;
-	__m128i xmm0, xmm1, xmm2, xmm3, xmm7;
+	__m128i xmm0, xmm1, xmm2, xmm3, mask;
 	tbl = (__m128i *)mtab;
 	// create mask for 8-bit
-	xmm7 = _mm_setzero_si128();
+	mask = _mm_setzero_si128();
-	xmm7 = _mm_cmpeq_epi16(xmm7, xmm7);	// 0xFFFF *8
+	mask = _mm_cmpeq_epi16(mask, mask);	// 0xFFFF *8
-	xmm7 = _mm_srli_epi16(xmm7, 8);		// 0x00FF *8
+	mask = _mm_srli_epi16(mask, 8);		// 0x00FF *8
 	while (1){
 		xmm0 = _mm_cvtsi32_si128(factor);			// [_][_][_][_][_][_][_][1]
@@ -817,8 +826,8 @@ static void create_eight_table(unsigned char *mtab, int factor){
 		xmm0 = _mm_load_si128(&xmm2);
 		xmm1 = _mm_load_si128(&xmm3);
-		xmm0 = _mm_and_si128(xmm0, xmm7);
+		xmm0 = _mm_and_si128(xmm0, mask);
-		xmm1 = _mm_and_si128(xmm1, xmm7);
+		xmm1 = _mm_and_si128(xmm1, mask);
 		xmm0 = _mm_packus_epi16(xmm0, xmm1);	// lower 8-bit * 16
 		xmm2 = _mm_srli_epi16(xmm2, 8);
 		xmm3 = _mm_srli_epi16(xmm3, 8);
@@ -911,13 +920,9 @@ static void gf16_ssse3_block16u(unsigned char *input, unsigned char *output, uns
 // Address (input) does not need be 16-byte aligned
 static void gf16_ssse3_block32u(unsigned char *input, unsigned char *output, unsigned int bsize, unsigned char *table)
 {
 	__m128i *src, *dst;
 	__m128i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
 	__m128i tbl0, tbl1, tbl2, tbl3, tbl4, tbl5, tbl6, tbl7;
 	src = (__m128i *)input;
 	dst = (__m128i *)output;
 	// copy tables to local
 	tbl0 = _mm_load_si128((__m128i *)table);
 	tbl1 = _mm_load_si128((__m128i *)table + 1);
@@ -936,8 +941,8 @@ static void gf16_ssse3_block32u(unsigned char *input, unsigned char *output, uns
 	xmm7 = _mm_packus_epi16(xmm7, xmm7);	// 0x0F *16
 	while (bsize != 0){
-		xmm1 = _mm_loadu_si128(src);	// read source 32-bytes
+		xmm1 = _mm_loadu_si128((__m128i *)input);	// read source 32-bytes
-		xmm3 = _mm_loadu_si128(src + 1);
+		xmm3 = _mm_loadu_si128((__m128i *)input + 1);
 		xmm0 = _mm_and_si128(xmm1, xmm6);	// erase higher byte
 		xmm2 = _mm_and_si128(xmm3, xmm6);
 		xmm1 = _mm_srli_epi16(xmm1, 8);		// move higher byte to lower
@@ -975,17 +980,17 @@ static void gf16_ssse3_block32u(unsigned char *input, unsigned char *output, uns
 		xmm4 = _mm_xor_si128(xmm4, xmm2);	// combine result
 		xmm5 = _mm_xor_si128(xmm5, xmm3);
-		xmm0 = _mm_load_si128(dst);	// read dest 32-bytes
+		xmm0 = _mm_load_si128((__m128i *)output);	// read dest 32-bytes
-		xmm1 = _mm_load_si128(dst + 1);
+		xmm1 = _mm_load_si128((__m128i *)output + 1);
 		xmm3 = _mm_unpacklo_epi8(xmm4, xmm5);	// interleave lower and higher bytes
 		xmm4 = _mm_unpackhi_epi8(xmm4, xmm5);
 		xmm0 = _mm_xor_si128(xmm0, xmm3);
 		xmm1 = _mm_xor_si128(xmm1, xmm4);
-		_mm_store_si128(dst, xmm0);	// write dest 32-bytes
+		_mm_store_si128((__m128i *)output, xmm0);	// write dest 32-bytes
-		_mm_store_si128(dst + 1, xmm1);
+		_mm_store_si128((__m128i *)output + 1, xmm1);
-		src += 2;
+		input += 32;
-		dst += 2;
+		output += 32;
 		bsize -= 32;
 	}
 }
@@ -993,13 +998,9 @@ static void gf16_ssse3_block32u(unsigned char *input, unsigned char *output, uns
 // xmm レジスタにテーブルを読み込む方が 64-bit 版で微妙に速い
 static void gf16_ssse3_block32_altmap(unsigned char *input, unsigned char *output, unsigned int bsize, unsigned char *table)
 {
 	__m128i *src, *dst;
 	__m128i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm7;
 	__m128i tbl0, tbl1, tbl2, tbl3, tbl4, tbl5, tbl6, tbl7;
 	src = (__m128i *)input;
 	dst = (__m128i *)output;
 	// copy tables to local
 	tbl0 = _mm_load_si128((__m128i *)table);
 	tbl1 = _mm_load_si128((__m128i *)table + 1);
@@ -1017,8 +1018,8 @@ static void gf16_ssse3_block32_altmap(unsigned char *input, unsigned char *outpu
 	xmm7 = _mm_packus_epi16(xmm7, xmm7);	// 0x0F *16
 	while (bsize != 0){
-		xmm0 = _mm_load_si128(src);	// read source 32-bytes
+		xmm0 = _mm_load_si128((__m128i *)input);	// read source 32-bytes
-		xmm1 = _mm_load_si128(src + 1);
+		xmm1 = _mm_load_si128((__m128i *)input + 1);
 		xmm3 = _mm_load_si128(&xmm0);	// copy source
 		xmm0 = _mm_srli_epi16(xmm0, 4);	// prepare next 4-bit
@@ -1054,17 +1055,17 @@ static void gf16_ssse3_block32_altmap(unsigned char *input, unsigned char *outpu
 		xmm2 = _mm_shuffle_epi8(xmm2, xmm0);	// table look-up
 		xmm3 = _mm_shuffle_epi8(xmm3, xmm0);
-		xmm0 = _mm_load_si128(dst);	// read dest 32-bytes
+		xmm0 = _mm_load_si128((__m128i *)output);	// read dest 32-bytes
-		xmm1 = _mm_load_si128(dst + 1);
+		xmm1 = _mm_load_si128((__m128i *)output + 1);
 		xmm4 = _mm_xor_si128(xmm4, xmm2);	// combine result
 		xmm5 = _mm_xor_si128(xmm5, xmm3);
 		xmm4 = _mm_xor_si128(xmm4, xmm0);
 		xmm5 = _mm_xor_si128(xmm5, xmm1);
-		_mm_store_si128(dst, xmm4);	// write dest 32-bytes
+		_mm_store_si128((__m128i *)output, xmm4);	// write dest 32-bytes
-		_mm_store_si128(dst + 1, xmm5);
+		_mm_store_si128((__m128i *)output + 1, xmm5);
-		src += 2;
+		input += 32;
-		dst += 2;
+		output += 32;
 		bsize -= 32;
 	}
 }
@@ -1141,16 +1142,288 @@ static void gf16_ssse3_block32_altmap(unsigned char *input, unsigned char *outpu
 #endif
 // 逆行列計算用に掛け算だけする（XORで追加しない）
 static void gf16_ssse3_block16s(unsigned char *data, unsigned int bsize, unsigned char *table)
 {
 	__m128i dest, mask, xmm0, xmm1, xmm3, xmm4, xmm5, xmm6;
 	__m128i tbl0, tbl1, tbl2, tbl3, tbl4, tbl5, tbl6, tbl7;
 	// copy tables to local
 	tbl0 = _mm_load_si128((__m128i *)table);
 	tbl1 = _mm_load_si128((__m128i *)table + 1);
 	tbl2 = _mm_load_si128((__m128i *)table + 2);
 	tbl3 = _mm_load_si128((__m128i *)table + 3);
 	tbl4 = _mm_load_si128((__m128i *)table + 4);
 	tbl5 = _mm_load_si128((__m128i *)table + 5);
 	tbl6 = _mm_load_si128((__m128i *)table + 6);
 	tbl7 = _mm_load_si128((__m128i *)table + 7);
 	// create mask for 8 entries
 	mask = _mm_setzero_si128();
 	mask = _mm_cmpeq_epi16(mask, mask);	// 0xFFFF *8
 	mask = _mm_srli_epi16(mask, 12);	// 0x000F *8
 	while (bsize != 0){
 		xmm0 = _mm_load_si128((__m128i *)data);	// read source 16-bytes
 		xmm3 = _mm_load_si128(&tbl0);	// low table
 		xmm4 = _mm_load_si128(&tbl1);	// high table
 		xmm1 = _mm_load_si128(&xmm0);			// copy source
 		xmm0 = _mm_srli_epi16(xmm0, 4);			// prepare next 4-bit
 		xmm1 = _mm_and_si128(xmm1, mask);		// src & 0x000F
 		xmm3 = _mm_shuffle_epi8(xmm3, xmm1);	// table look-up
 		xmm1 = _mm_slli_epi16(xmm1, 8);			// shift 8-bit for higher table
 		xmm4 = _mm_shuffle_epi8(xmm4, xmm1);
 			xmm5 = _mm_load_si128(&tbl2);	// low table
 			xmm6 = _mm_load_si128(&tbl3);	// high table
 		dest = _mm_xor_si128(xmm3, xmm4);	// combine high and low
 			xmm1 = _mm_load_si128(&xmm0);			// copy source
 			xmm0 = _mm_srli_epi16(xmm0, 4);			// prepare next 4-bit
 			xmm1 = _mm_and_si128(xmm1, mask);		// src & 0x000F
 			xmm5 = _mm_shuffle_epi8(xmm5, xmm1);	// table look-up
 			xmm1 = _mm_slli_epi16(xmm1, 8);			// shift 8-bit for higher table
 			xmm6 = _mm_shuffle_epi8(xmm6, xmm1);
 		xmm3 = _mm_load_si128(&tbl4);	// low table
 		xmm4 = _mm_load_si128(&tbl5);	// high table
 			xmm5 = _mm_xor_si128(xmm5, xmm6);	// combine high and low
 			dest = _mm_xor_si128(dest, xmm5);
 		xmm1 = _mm_load_si128(&xmm0);			// copy source
 		xmm0 = _mm_srli_epi16(xmm0, 4);			// prepare next 4-bit
 		xmm1 = _mm_and_si128(xmm1, mask);		// src & 0x000F
 		xmm3 = _mm_shuffle_epi8(xmm3, xmm1);	// table look-up
 		xmm1 = _mm_slli_epi16(xmm1, 8);			// shift 8-bit for higher table
 		xmm4 = _mm_shuffle_epi8(xmm4, xmm1);
 			xmm5 = _mm_load_si128(&tbl6);	// low table
 			xmm6 = _mm_load_si128(&tbl7);	// high table
 		xmm3 = _mm_xor_si128(xmm3, xmm4);	// combine high and low
 		dest = _mm_xor_si128(dest, xmm3);
 			xmm5 = _mm_shuffle_epi8(xmm5, xmm0);	// table look-up
 			xmm0 = _mm_slli_epi16(xmm0, 8);			// shift 8-bit for higher table
 			xmm6 = _mm_shuffle_epi8(xmm6, xmm0);
 			xmm5 = _mm_xor_si128(xmm5, xmm6);	// combine high and low
 			dest = _mm_xor_si128(dest, xmm5);
 		_mm_store_si128((__m128i *)data, dest);
 		data += 16;
 		bsize -= 16;
 	}
 }
 // ２ブロック同時に計算することで、メモリーへのアクセス回数を減らす
 // 128バイトのテーブルを２個用意しておくこと
 // xmm レジスタの数が足りないので、テーブルを毎回ロードする
 static void gf16_ssse3_block32_altmap2(unsigned char *input1, unsigned char *input2, unsigned char *output, unsigned int bsize, unsigned char *table)
 {
 	__m128i *tbl;
 	__m128i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, mask;
 	tbl = (__m128i *)table;
 	// create mask for 16 entries
 	mask = _mm_setzero_si128();
 	mask = _mm_cmpeq_epi16(mask, mask);	// 0xFFFF *8
 	mask = _mm_srli_epi16(mask, 12);	// 0x000F *8
 	mask = _mm_packus_epi16(mask, mask);	// 0x0F *16
 	while (bsize != 0){
 		xmm0 = _mm_load_si128((__m128i *)input1);	// read source 32-bytes
 		xmm1 = _mm_load_si128((__m128i *)input1 + 1);
 		xmm6 = _mm_load_si128(&xmm0);	// copy source
 		xmm0 = _mm_srli_epi16(xmm0, 4);	// prepare next 4-bit
 		xmm6 = _mm_and_si128(xmm6, mask);	// src & 0x0F
 		xmm0 = _mm_and_si128(xmm0, mask);	// (src >> 4) & 0x0F
 		xmm4 = _mm_load_si128(tbl);	// load tables
 		xmm5 = _mm_load_si128(tbl + 1);
 		xmm4 = _mm_shuffle_epi8(xmm4, xmm6);	// table look-up
 		xmm5 = _mm_shuffle_epi8(xmm5, xmm6);
 		xmm2 = _mm_load_si128(tbl + 2);	// load tables
 		xmm3 = _mm_load_si128(tbl + 3);
 		xmm2 = _mm_shuffle_epi8(xmm2, xmm0);	// table look-up
 		xmm3 = _mm_shuffle_epi8(xmm3, xmm0);
 		xmm4 = _mm_xor_si128(xmm4, xmm2);	// combine result
 		xmm5 = _mm_xor_si128(xmm5, xmm3);
 		xmm0 = _mm_load_si128(&xmm1);	// copy source
 		xmm0 = _mm_srli_epi16(xmm0, 4);	// prepare next 4-bit
 		xmm1 = _mm_and_si128(xmm1, mask);	// src & 0x0F
 		xmm0 = _mm_and_si128(xmm0, mask);	// (src >> 4) & 0x0F
 		xmm2 = _mm_load_si128(tbl + 4);	// load tables
 		xmm3 = _mm_load_si128(tbl + 5);
 		xmm2 = _mm_shuffle_epi8(xmm2, xmm1);	// table look-up
 		xmm3 = _mm_shuffle_epi8(xmm3, xmm1);
 		xmm4 = _mm_xor_si128(xmm4, xmm2);	// combine result
 		xmm5 = _mm_xor_si128(xmm5, xmm3);
 		xmm2 = _mm_load_si128(tbl + 6);	// load tables
 		xmm3 = _mm_load_si128(tbl + 7);
 		xmm2 = _mm_shuffle_epi8(xmm2, xmm0);	// table look-up
 		xmm3 = _mm_shuffle_epi8(xmm3, xmm0);
 		xmm4 = _mm_xor_si128(xmm4, xmm2);	// combine result
 		xmm5 = _mm_xor_si128(xmm5, xmm3);
 			xmm0 = _mm_load_si128((__m128i *)input2);	// read source 32-bytes
 			xmm1 = _mm_load_si128((__m128i *)input2 + 1);
 			xmm6 = _mm_load_si128(&xmm0);	// copy source
 			xmm0 = _mm_srli_epi16(xmm0, 4);	// prepare next 4-bit
 			xmm6 = _mm_and_si128(xmm6, mask);	// src & 0x0F
 			xmm0 = _mm_and_si128(xmm0, mask);	// (src >> 4) & 0x0F
 			xmm2 = _mm_load_si128(tbl + 8);	// load tables
 			xmm3 = _mm_load_si128(tbl + 9);
 			xmm2 = _mm_shuffle_epi8(xmm2, xmm6);	// table look-up
 			xmm3 = _mm_shuffle_epi8(xmm3, xmm6);
 			xmm4 = _mm_xor_si128(xmm4, xmm2);	// combine result
 			xmm5 = _mm_xor_si128(xmm5, xmm3);
 			xmm2 = _mm_load_si128(tbl + 10);	// load tables
 			xmm3 = _mm_load_si128(tbl + 11);
 			xmm2 = _mm_shuffle_epi8(xmm2, xmm0);	// table look-up
 			xmm3 = _mm_shuffle_epi8(xmm3, xmm0);
 			xmm4 = _mm_xor_si128(xmm4, xmm2);	// combine result
 			xmm5 = _mm_xor_si128(xmm5, xmm3);
 			xmm0 = _mm_load_si128(&xmm1);	// copy source
 			xmm0 = _mm_srli_epi16(xmm0, 4);	// prepare next 4-bit
 			xmm1 = _mm_and_si128(xmm1, mask);	// src & 0x0F
 			xmm0 = _mm_and_si128(xmm0, mask);	// (src >> 4) & 0x0F
 			xmm2 = _mm_load_si128(tbl + 12);	// load tables
 			xmm3 = _mm_load_si128(tbl + 13);
 			xmm2 = _mm_shuffle_epi8(xmm2, xmm1);	// table look-up
 			xmm3 = _mm_shuffle_epi8(xmm3, xmm1);
 			xmm4 = _mm_xor_si128(xmm4, xmm2);	// combine result
 			xmm5 = _mm_xor_si128(xmm5, xmm3);
 			xmm2 = _mm_load_si128(tbl + 14);	// load tables
 			xmm3 = _mm_load_si128(tbl + 15);
 			xmm2 = _mm_shuffle_epi8(xmm2, xmm0);	// table look-up
 			xmm3 = _mm_shuffle_epi8(xmm3, xmm0);
 			xmm4 = _mm_xor_si128(xmm4, xmm2);	// combine result
 			xmm5 = _mm_xor_si128(xmm5, xmm3);
 		xmm0 = _mm_load_si128((__m128i *)output);	// read dest 32-bytes
 		xmm1 = _mm_load_si128((__m128i *)output + 1);
 		xmm0 = _mm_xor_si128(xmm0, xmm4);
 		xmm1 = _mm_xor_si128(xmm1, xmm5);
 		_mm_store_si128((__m128i *)output, xmm0);	// write dest 32-bytes
 		_mm_store_si128((__m128i *)output + 1, xmm1);
 		input1 += 32;
 		input2 += 32;
 		output += 32;
 		bsize -= 32;
 	}
 }
 /* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */
 // AVX2 命令を使うには Windows 7 以降じゃないといけない
 // _mm256_permute2x128_si256 の control の意味は以下を参照
 // http://www.felixcloutier.com/x86/VPERM2I128.html
-// テーブルを並び替えて使えば、ループ内の並び替え回数を一回に減らせる
+// AVX2 を使って全体を２倍していくと、13% ぐらい速くなる
-static void gf16_avx2_block32(unsigned char *input, unsigned char *output, unsigned int bsize, unsigned char *table)
+// でも、テーブル作成が少し速くなっても、全体的な速度はほとんど変わらない・・・
 static void create_eight_table_avx2(unsigned char *mtab, int factor)
 {
-	__m256i tbl0, tbl1, tbl2, tbl3, mask, src0, src1, tmp0, tmp1, tmp2, tmp3;
+	int count;
 	__m128i xmm0, xmm1, xmm2, xmm3, mask8;
 	__m256i ymm0, ymm1, ymm2, ymm3, base, poly, mask16;
 	// create mask for 8-bit
 	mask8 = _mm_setzero_si128();
 	mask8 = _mm_cmpeq_epi16(mask8, mask8);	// 0xFFFF *8
 	mask8 = _mm_srli_epi16(mask8, 8);		// 0x00FF *8
 	xmm0 = _mm_cvtsi32_si128(factor);			// [_][_][_][_][_][_][_][1]
 	xmm1 = _mm_setzero_si128();
 	factor = (factor << 1) ^ (((factor << 16) >> 31) & 0x1100B);
 	xmm1 = _mm_insert_epi16(xmm1, factor, 1);	// [_][_][_][_][_][_][2][_]
 	xmm2 = _mm_setzero_si128();
 	factor = (factor << 1) ^ (((factor << 16) >> 31) & 0x1100B);
 	xmm2 = _mm_insert_epi16(xmm2, factor, 4);	// [_][_][_][4][_][_][_][_]
 	xmm1 = _mm_unpacklo_epi16(xmm1, xmm1);		// [_][_][_][_][2][2][_][_]
 	factor = (factor << 1) ^ (((factor << 16) >> 31) & 0x1100B);
 	xmm3 = _mm_cvtsi32_si128(factor);			// [_][_][_][_][_][_][_][8]
 	xmm0 = _mm_shufflelo_epi16(xmm0, _MM_SHUFFLE(0, 1, 0, 1));	// [_][_][_][_][1][_][1][_]
 	xmm3 = _mm_unpacklo_epi16(xmm3, xmm3);						// [_][_][_][_][_][_][8][8]
 	xmm0 = _mm_xor_si128(xmm0, xmm1);							// [_][_][_][_][3][2][1][_]
 	xmm2 = _mm_shufflehi_epi16(xmm2, _MM_SHUFFLE(0, 0, 0, 0));	// [4][4][4][4][_][_][_][_]
 	xmm0 = _mm_unpacklo_epi64(xmm0, xmm0);						// [3][2][1][_][3][2][1][_]
 	xmm3 = _mm_shuffle_epi32(xmm3, _MM_SHUFFLE(0, 0, 0, 0));	// [8][8][8][8][8][8][8][8]
 	xmm2 = _mm_xor_si128(xmm2, xmm0);							// [7][6][5][4][3][2][1][_]
 	xmm3 = _mm_xor_si128(xmm3, xmm2);							// [15][14][13][12][11][10][9][8]
 	// 途中で AVX2 命令を使っても遅くならないっぽい
 	poly = _mm256_set1_epi32(0x100B100B);	// PRIM_POLY = 0x1100B * 16
 	mask16 = _mm256_cmpeq_epi16(poly, poly);
 	mask16 = _mm256_srli_epi16(mask16, 8);	// 0x00FF *16
 	base = _mm256_setzero_si256();
 	base = _mm256_inserti128_si256(base, xmm2, 0);
 	base = _mm256_inserti128_si256(base, xmm3, 1);
 	// ymm レジスタに読み込んでる間にメモリーに書き込んだ方が速い
 	xmm0 = _mm_and_si128(xmm2, mask8);
 	xmm1 = _mm_and_si128(xmm3, mask8);
 	xmm0 = _mm_packus_epi16(xmm0, xmm1);	// lower 8-bit * 16
 	xmm2 = _mm_srli_epi16(xmm2, 8);
 	xmm3 = _mm_srli_epi16(xmm3, 8);
 	xmm2 = _mm_packus_epi16(xmm2, xmm3);	// higher 8-bit * 16
 	_mm_store_si128((__m128i *)mtab    , xmm0);
 	_mm_store_si128((__m128i *)mtab + 1, xmm2);
 	for (count = 1; count < 4; count++){
 		// 全体を２倍する
 		ymm0 = _mm256_slli_epi16(base, 1);
 		ymm1 = _mm256_srai_epi16(base, 15);
 		ymm1 = _mm256_and_si256(ymm1, poly);
 		base = _mm256_xor_si256(ymm1, ymm0);
 		// 全体を２倍する
 		ymm0 = _mm256_slli_epi16(base, 1);
 		ymm1 = _mm256_srai_epi16(base, 15);
 		ymm1 = _mm256_and_si256(ymm1, poly);
 		base = _mm256_xor_si256(ymm1, ymm0);
 		// 全体を２倍する
 		ymm0 = _mm256_slli_epi16(base, 1);
 		ymm1 = _mm256_srai_epi16(base, 15);
 		ymm1 = _mm256_and_si256(ymm1, poly);
 		base = _mm256_xor_si256(ymm1, ymm0);
 		// 全体を２倍する
 		ymm0 = _mm256_slli_epi16(base, 1);
 		ymm1 = _mm256_srai_epi16(base, 15);
 		ymm1 = _mm256_and_si256(ymm1, poly);
 		base = _mm256_xor_si256(ymm1, ymm0);
 		// 並び替えて保存する
 		ymm0 = _mm256_and_si256(base, mask16);	// lower 8-bit * 16
 		ymm1 = _mm256_srli_epi16(base, 8);		// higher 8-bit * 16
 		ymm2 = _mm256_permute2x128_si256(ymm0, ymm1, 0x20);
 		ymm3 = _mm256_permute2x128_si256(ymm0, ymm1, 0x31);
 		ymm0 = _mm256_packus_epi16(ymm2, ymm3);
 		_mm256_store_si256((__m256i *)mtab + count, ymm0);
 	}
 	// AVX-SSE 切り替えの回避
 	_mm256_zeroupper();
 }
 // 逆行列計算用に掛け算だけする（XORで追加しない）
 static void gf16_avx2_block32s(unsigned char *data, unsigned int bsize, unsigned char *table)
 {
 	__m256i tbl0, tbl1, tbl2, tbl3, tbl4, tbl5, tbl6, tbl7;
 	__m256i mask, dest, src0, src1, tmp0, tmp1, tmp2, tmp3;
 	// copy tables to local
 	tmp0 = _mm256_load_si256((__m256i *)table);		// tbl0[low0][high0] <- 0x0f[lo][lo]
@@ -1158,11 +1431,152 @@ static void gf16_avx2_block32(unsigned char *input, unsigned char *output, unsig
 	tmp2 = _mm256_load_si256((__m256i *)table + 2);	// tbl2[low2][high2] <- 0x0f[hi][hi]
 	tmp3 = _mm256_load_si256((__m256i *)table + 3);	// tbl3[low3][high3] <- 0xf0[hi][hi]
-	// re-arrange table order
+	// split to 8 tables
-	tbl0 = _mm256_permute2x128_si256(tmp0, tmp2, 0x30);	// tblA[low0][high2] <- 0x0f[lo][hi]
+	tbl0 = _mm256_permute2x128_si256(tmp0, tmp0, 0x00);	// tbl0[low0][low0]
-	tbl1 = _mm256_permute2x128_si256(tmp1, tmp3, 0x30);	// tblB[low1][high3] <- 0xf0[lo][hi]
+	tbl1 = _mm256_permute2x128_si256(tmp1, tmp1, 0x00);	// tbl1[low1][low1]
-	tbl2 = _mm256_permute2x128_si256(tmp2, tmp0, 0x03);	// tblC[high0][low2] <- 0x0f[lo][hi]
+	tbl2 = _mm256_permute2x128_si256(tmp2, tmp2, 0x00);	// tbl2[low2][low2]
-	tbl3 = _mm256_permute2x128_si256(tmp3, tmp1, 0x03);	// tblD[high1][low3] <- 0xf0[lo][hi]
+	tbl3 = _mm256_permute2x128_si256(tmp3, tmp3, 0x00);	// tbl3[low3][low3]
 	tbl4 = _mm256_permute2x128_si256(tmp0, tmp0, 0x11);	// tbl0[high0][high0]
 	tbl5 = _mm256_permute2x128_si256(tmp1, tmp1, 0x11);	// tbl1[high1][high1]
 	tbl6 = _mm256_permute2x128_si256(tmp2, tmp2, 0x11);	// tbl2[high2][high2]
 	tbl7 = _mm256_permute2x128_si256(tmp3, tmp3, 0x11);	// tbl3[high3][high3]
 	// create mask for 16 entries
 	mask = _mm256_cmpeq_epi16(tmp0, tmp0);	// 0xFFFF *16
 	mask = _mm256_srli_epi16(mask, 12);		// 0x000F *16
 	while (bsize != 0){
 		src0 = _mm256_load_si256((__m256i *)data);	// read source 32-bytes
 		src1 = _mm256_and_si256(src0, mask);	// src & 0x0F
 		tmp0 = _mm256_shuffle_epi8(tbl0, src1);	// table look-up
 		src1 = _mm256_slli_epi16(src1, 8);		// shift 8-bit for higher table
 		tmp1 = _mm256_shuffle_epi8(tbl4, src1);
 		src0 = _mm256_srli_epi16(src0, 4);		// prepare next 4-bit
 		dest = _mm256_xor_si256(tmp0, tmp1);	// combine high and low
 		src1 = _mm256_and_si256(src0, mask);	// src & 0x0F
 		tmp0 = _mm256_shuffle_epi8(tbl1, src1);	// table look-up
 		src1 = _mm256_slli_epi16(src1, 8);		// shift 8-bit for higher table
 		tmp1 = _mm256_shuffle_epi8(tbl5, src1);
 		dest = _mm256_xor_si256(dest, tmp0);	// combine high and low
 		src0 = _mm256_srli_epi16(src0, 4);		// prepare next 4-bit
 		dest = _mm256_xor_si256(dest, tmp1);
 		src1 = _mm256_and_si256(src0, mask);	// src & 0x0F
 		tmp0 = _mm256_shuffle_epi8(tbl2, src1);	// table look-up
 		src1 = _mm256_slli_epi16(src1, 8);		// shift 8-bit for higher table
 		tmp1 = _mm256_shuffle_epi8(tbl6, src1);
 		dest = _mm256_xor_si256(dest, tmp0);	// combine high and low
 		src0 = _mm256_srli_epi16(src0, 4);		// prepare next 4-bit
 		dest = _mm256_xor_si256(dest, tmp1);
 		src1 = _mm256_and_si256(src0, mask);	// src & 0x0F
 		tmp0 = _mm256_shuffle_epi8(tbl3, src1);	// table look-up
 		src1 = _mm256_slli_epi16(src1, 8);		// shift 8-bit for higher table
 		tmp1 = _mm256_shuffle_epi8(tbl7, src1);
 		dest = _mm256_xor_si256(dest, tmp0);	// combine high and low
 		dest = _mm256_xor_si256(dest, tmp1);
 		_mm256_store_si256((__m256i *)data, dest);	// write dest 32-bytes
 		data += 32;
 		bsize -= 32;
 	}
 	// AVX-SSE 切り替えの回避
 	_mm256_zeroupper();
 }
 // 逆行列計算用に ALTMAP されてないソースにも対応しておく
 // Address (input) does not need be 32-byte aligned
 static void gf16_avx2_block32u(unsigned char *input, unsigned char *output, unsigned int bsize, unsigned char *table)
 {
 	__m256i tbl0, tbl1, tbl2, tbl3, tbl4, tbl5, tbl6, tbl7;
 	__m256i mask, dest, src0, src1, tmp0, tmp1, tmp2, tmp3;
 	// copy tables to local
 	tmp0 = _mm256_load_si256((__m256i *)table);		// tbl0[low0][high0] <- 0x0f[lo][lo]
 	tmp1 = _mm256_load_si256((__m256i *)table + 1);	// tbl1[low1][high1] <- 0xf0[lo][lo]
 	tmp2 = _mm256_load_si256((__m256i *)table + 2);	// tbl2[low2][high2] <- 0x0f[hi][hi]
 	tmp3 = _mm256_load_si256((__m256i *)table + 3);	// tbl3[low3][high3] <- 0xf0[hi][hi]
 	// split to 8 tables
 	tbl0 = _mm256_permute2x128_si256(tmp0, tmp0, 0x00);	// tbl0[low0][low0]
 	tbl1 = _mm256_permute2x128_si256(tmp1, tmp1, 0x00);	// tbl1[low1][low1]
 	tbl2 = _mm256_permute2x128_si256(tmp2, tmp2, 0x00);	// tbl2[low2][low2]
 	tbl3 = _mm256_permute2x128_si256(tmp3, tmp3, 0x00);	// tbl3[low3][low3]
 	tbl4 = _mm256_permute2x128_si256(tmp0, tmp0, 0x11);	// tbl0[high0][high0]
 	tbl5 = _mm256_permute2x128_si256(tmp1, tmp1, 0x11);	// tbl1[high1][high1]
 	tbl6 = _mm256_permute2x128_si256(tmp2, tmp2, 0x11);	// tbl2[high2][high2]
 	tbl7 = _mm256_permute2x128_si256(tmp3, tmp3, 0x11);	// tbl3[high3][high3]
 	// create mask for 16 entries
 	mask = _mm256_cmpeq_epi16(tmp0, tmp0);	// 0xFFFF *16
 	mask = _mm256_srli_epi16(mask, 12);		// 0x000F *16
 	while (bsize != 0){
 		src0 = _mm256_loadu_si256((__m256i *)input);	// read source 32-bytes
 		dest = _mm256_load_si256((__m256i *)output);	// read dest 32-bytes
 		src1 = _mm256_and_si256(src0, mask);	// src & 0x0F
 		tmp0 = _mm256_shuffle_epi8(tbl0, src1);	// table look-up
 		src1 = _mm256_slli_epi16(src1, 8);		// shift 8-bit for higher table
 		tmp1 = _mm256_shuffle_epi8(tbl4, src1);
 		dest = _mm256_xor_si256(dest, tmp0);	// combine high and low
 		src0 = _mm256_srli_epi16(src0, 4);		// prepare next 4-bit
 		dest = _mm256_xor_si256(dest, tmp1);
 		src1 = _mm256_and_si256(src0, mask);	// src & 0x0F
 		tmp0 = _mm256_shuffle_epi8(tbl1, src1);	// table look-up
 		src1 = _mm256_slli_epi16(src1, 8);		// shift 8-bit for higher table
 		tmp1 = _mm256_shuffle_epi8(tbl5, src1);
 		dest = _mm256_xor_si256(dest, tmp0);	// combine high and low
 		src0 = _mm256_srli_epi16(src0, 4);		// prepare next 4-bit
 		dest = _mm256_xor_si256(dest, tmp1);
 		src1 = _mm256_and_si256(src0, mask);	// src & 0x0F
 		tmp0 = _mm256_shuffle_epi8(tbl2, src1);	// table look-up
 		src1 = _mm256_slli_epi16(src1, 8);		// shift 8-bit for higher table
 		tmp1 = _mm256_shuffle_epi8(tbl6, src1);
 		dest = _mm256_xor_si256(dest, tmp0);	// combine high and low
 		src0 = _mm256_srli_epi16(src0, 4);		// prepare next 4-bit
 		dest = _mm256_xor_si256(dest, tmp1);
 		src1 = _mm256_and_si256(src0, mask);	// src & 0x0F
 		tmp0 = _mm256_shuffle_epi8(tbl3, src1);	// table look-up
 		src1 = _mm256_slli_epi16(src1, 8);		// shift 8-bit for higher table
 		tmp1 = _mm256_shuffle_epi8(tbl7, src1);
 		dest = _mm256_xor_si256(dest, tmp0);	// combine high and low
 		dest = _mm256_xor_si256(dest, tmp1);
 		_mm256_store_si256((__m256i *)output, dest);	// write dest 32-bytes
 		input += 32;
 		output += 32;
 		bsize -= 32;
 	}
 	// AVX-SSE 切り替えの回避
 	_mm256_zeroupper();
 }
 // テーブルを並び替えて使えば、ループ内の並び替え回数を一回に減らせる
 static void gf16_avx2_block32(unsigned char *input, unsigned char *output, unsigned int bsize, unsigned char *table)
 {
 	__m256i tbl0, tbl1, tbl2, tbl3, mask, dest, src0, src1, tmp0, tmp1, tmp2, tmp3;
 	// copy tables to local
 	tmp0 = _mm256_load_si256((__m256i *)table);		// tbl0[low0][high0] <- 0x0f[lo][lo]
 	tmp1 = _mm256_load_si256((__m256i *)table + 1);	// tbl1[low1][high1] <- 0xf0[lo][lo]
 	tmp2 = _mm256_load_si256((__m256i *)table + 2);	// tbl2[low2][high2] <- 0x0f[hi][hi]
 	tmp3 = _mm256_load_si256((__m256i *)table + 3);	// tbl3[low3][high3] <- 0xf0[hi][hi]
 	// re-arrange table order (permute より blend の方が速いらしい)
 	tbl0 = _mm256_blend_epi32(tmp0, tmp2, 0xF0);		// tbl0[low0][high2] <- 0x0f[lo][hi]
 	tbl1 = _mm256_blend_epi32(tmp1, tmp3, 0xF0);		// tbl1[low1][high3] <- 0xf0[lo][hi]
 	tbl2 = _mm256_permute2x128_si256(tmp2, tmp0, 0x03);	// tbl2[high0][low2] <- 0x0f[lo][hi]
 	tbl3 = _mm256_permute2x128_si256(tmp3, tmp1, 0x03);	// tbl3[high1][low3] <- 0xf0[lo][hi]
 	// create mask for 32 entries
 	mask = _mm256_cmpeq_epi16(tmp0, tmp0);	// 0xFFFF *16
@@ -1184,10 +1598,10 @@ static void gf16_avx2_block32(unsigned char *input, unsigned char *output, unsig
 		tmp2 = _mm256_xor_si256(tmp2, tmp3);
 		tmp2 = _mm256_permute2x128_si256(tmp2, tmp2, 0x01);	// exchange low & high 128-bit
-		src1 = _mm256_load_si256((__m256i *)output);	// read dest 32-bytes
+		dest = _mm256_load_si256((__m256i *)output);	// read dest 32-bytes
-		src1 = _mm256_xor_si256(src1, tmp0);
+		tmp0 = _mm256_xor_si256(tmp0, tmp2);
-		src1 = _mm256_xor_si256(src1, tmp2);
+		dest = _mm256_xor_si256(dest, tmp0);
-		_mm256_store_si256((__m256i *)output, src1);	// write dest 32-bytes
+		_mm256_store_si256((__m256i *)output, dest);	// write dest 32-bytes
 		input += 32;
 		output += 32;
@@ -1300,6 +1714,83 @@ static void gf16_avx2_block32(unsigned char *input, unsigned char *output, unsig
 }
 */
 // ２ブロック同時に計算することで、メモリーへのアクセス回数を減らす
 // 128バイトのテーブルを２個用意しておくこと
 static void gf16_avx2_block32_2(unsigned char *input1, unsigned char *input2, unsigned char *output, unsigned int bsize, unsigned char *table)
 {
 	__m256i mask, src0, src1, tmp0, tmp1, tmp2, tmp3;
 	__m256i tbl0, tbl1, tbl2, tbl3, tbl4, tbl5, tbl6, tbl7;
 	// copy tables to local
 	tmp0 = _mm256_load_si256((__m256i *)table);		// tbl0[low0][high0] <- 0x0f[lo][lo]
 	tmp1 = _mm256_load_si256((__m256i *)table + 1);	// tbl1[low1][high1] <- 0xf0[lo][lo]
 	tmp2 = _mm256_load_si256((__m256i *)table + 2);	// tbl2[low2][high2] <- 0x0f[hi][hi]
 	tmp3 = _mm256_load_si256((__m256i *)table + 3);	// tbl3[low3][high3] <- 0xf0[hi][hi]
 	// re-arrange table order (permute より blend の方が速いらしい)
 	tbl0 = _mm256_blend_epi32(tmp0, tmp2, 0xF0);		// tbl0[low0][high2] <- 0x0f[lo][hi]
 	tbl1 = _mm256_blend_epi32(tmp1, tmp3, 0xF0);		// tbl1[low1][high3] <- 0xf0[lo][hi]
 	tbl2 = _mm256_permute2x128_si256(tmp2, tmp0, 0x03);	// tbl2[high0][low2] <- 0x0f[lo][hi]
 	tbl3 = _mm256_permute2x128_si256(tmp3, tmp1, 0x03);	// tbl3[high1][low3] <- 0xf0[lo][hi]
 	tmp0 = _mm256_load_si256((__m256i *)table + 4);
 	tmp1 = _mm256_load_si256((__m256i *)table + 5);
 	tmp2 = _mm256_load_si256((__m256i *)table + 6);
 	tmp3 = _mm256_load_si256((__m256i *)table + 7);
 	tbl4 = _mm256_blend_epi32(tmp0, tmp2, 0xF0);
 	tbl5 = _mm256_blend_epi32(tmp1, tmp3, 0xF0);
 	tbl6 = _mm256_permute2x128_si256(tmp2, tmp0, 0x03);
 	tbl7 = _mm256_permute2x128_si256(tmp3, tmp1, 0x03);
 	// create mask for 32 entries
 	mask = _mm256_cmpeq_epi16(tmp0, tmp0);	// 0xFFFF *16
 	mask = _mm256_srli_epi16(mask, 12);		// 0x000F *16
 	mask = _mm256_packus_epi16(mask, mask);	// 0x0F *32
 	while (bsize != 0){
 		src0 = _mm256_load_si256((__m256i *)input1);	// read source 32-bytes
 		src1 = _mm256_srli_epi16(src0, 4);	// prepare next 4-bit
 		src0 = _mm256_and_si256(src0, mask);	// src & 0x0F
 		src1 = _mm256_and_si256(src1, mask);	// (src >> 4) & 0x0F
 		tmp0 = _mm256_shuffle_epi8(tbl0, src0);	// table look-up
 		tmp1 = _mm256_shuffle_epi8(tbl1, src1);
 		tmp2 = _mm256_shuffle_epi8(tbl2, src0);
 		tmp3 = _mm256_shuffle_epi8(tbl3, src1);
 		tmp0 = _mm256_xor_si256(tmp0, tmp1);	// combine result
 		tmp2 = _mm256_xor_si256(tmp2, tmp3);
 			src0 = _mm256_load_si256((__m256i *)input2);	// read source 32-bytes
 			src1 = _mm256_srli_epi16(src0, 4);	// prepare next 4-bit
 			src0 = _mm256_and_si256(src0, mask);	// src & 0x0F
 			src1 = _mm256_and_si256(src1, mask);	// (src >> 4) & 0x0F
 			tmp1 = _mm256_shuffle_epi8(tbl4, src0);	// table look-up
 			tmp3 = _mm256_shuffle_epi8(tbl6, src0);
 			tmp0 = _mm256_xor_si256(tmp0, tmp1);	// combine result
 			tmp2 = _mm256_xor_si256(tmp2, tmp3);
 			tmp1 = _mm256_shuffle_epi8(tbl5, src1);	// table look-up
 			tmp3 = _mm256_shuffle_epi8(tbl7, src1);
 			tmp0 = _mm256_xor_si256(tmp0, tmp1);	// combine result
 			tmp2 = _mm256_xor_si256(tmp2, tmp3);
 		src0 = _mm256_load_si256((__m256i *)output);	// read dest 32-bytes
 		tmp2 = _mm256_permute2x128_si256(tmp2, tmp2, 0x01);	// exchange low & high 128-bit
 		src0 = _mm256_xor_si256(src0, tmp0);
 		src0 = _mm256_xor_si256(src0, tmp2);
 		_mm256_store_si256((__m256i *)output, src0);	// write dest 32-bytes
 		input1 += 32;
 		input2 += 32;
 		output += 32;
 		bsize -= 32;
 	}
 	// AVX-SSE 切り替えの回避
 	_mm256_zeroupper();
 }
 /* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */
 // データを並び替えることで、メモリーアクセスを高速化する
@@ -1953,9 +2444,57 @@ void galois_region_multiply(
 		return;
 	}
-	if (count >= 32){	// 64バイト以上なら掛け算用のテーブルを使った方が速い
+	if (count >= 64){	// 64バイト以上なら掛け算用のテーブルを使った方が速い
 #ifndef NO_SIMD
-		if (cpu_flag & 1){	// SSSE3 対応なら
+		if (cpu_flag & 16){	// AVX2 対応なら
 			__declspec( align(32) ) unsigned char small_table[128];
 			int s, d;
 			create_eight_table_avx2(small_table, factor);
 			// アドレスが 32の倍数で無い場合は 32バイト単位で計算する効率が落ちる
 			while ((ULONG_PTR)r2 & 0x1E){
 				// そこで最初の 1～15個(2～30バイト)だけ普通に計算する
 				s = r1[0];
 				d = r2[0];
 				d ^= small_table[s & 0xF] | ((int)(small_table[16 + (s & 0xF)]) << 8);
 				s = s >> 4;
 				d ^= small_table[32 + (s & 0xF)] | ((int)(small_table[48 + (s & 0xF)]) << 8);
 				s = s >> 4;
 				d ^= small_table[64 + (s & 0xF)] | ((int)(small_table[80 + (s & 0xF)]) << 8);
 				s = s >> 4;
 				d ^= small_table[96 + s] | ((int)(small_table[112 + s]) << 8);
 				r2[0] = (unsigned short)d;
 				r1++;
 				r2++;
 				count--;
 			}
 			// 16個ずつ計算するので 16の倍数にする
 			gf16_avx2_block32u((unsigned char *)r1, (unsigned char *)r2,
 					(count & 0xFFFFFFF0) << 1, small_table);
 			r1 += count & 0xFFFFFFF0;
 			r2 += count & 0xFFFFFFF0;
 			count &= 15;
 			// 残りは 1個ずつ計算する
 			while (count != 0){
 				s = r1[0];
 				d = r2[0];
 				d ^= small_table[s & 0xF] | ((int)(small_table[16 + (s & 0xF)]) << 8);
 				s = s >> 4;
 				d ^= small_table[32 + (s & 0xF)] | ((int)(small_table[48 + (s & 0xF)]) << 8);
 				s = s >> 4;
 				d ^= small_table[64 + (s & 0xF)] | ((int)(small_table[80 + (s & 0xF)]) << 8);
 				s = s >> 4;
 				d ^= small_table[96 + s] | ((int)(small_table[112 + s]) << 8);
 				r2[0] = (unsigned short)d;
 				r1++;
 				r2++;
 				count--;
 			}
 		} else if (cpu_flag & 1){	// SSSE3 対応なら
 			__declspec( align(16) ) unsigned char small_table[128];
 			int s, d;
@@ -2093,7 +2632,97 @@ void galois_region_divide(
 {
 	factor = galois_reciprocal(factor);	// factor = 1 / factor
-	if (count >= 32){
+	if (count >= 64){
 // 行列サイズが小さいのでテーブル作成に時間がかかって、全く速くならない・・・
 /*
 #ifndef NO_SIMD
 		if (cpu_flag & 16){	// AVX2 対応なら
 			__declspec( align(32) ) unsigned char small_table[128];
 			int s, d;
 			create_eight_table_avx2(small_table, factor);
 			// アドレスが 32の倍数で無い場合は 32バイト単位で計算する効率が落ちる
 			while ((ULONG_PTR)r1 & 0x1E){
 				// そこで最初の 1～15個(2～30バイト)だけ普通に計算する
 				s = r1[0];
 				d = small_table[s & 0xF] | ((int)(small_table[16 + (s & 0xF)]) << 8);
 				s = s >> 4;
 				d ^= small_table[32 + (s & 0xF)] | ((int)(small_table[48 + (s & 0xF)]) << 8);
 				s = s >> 4;
 				d ^= small_table[64 + (s & 0xF)] | ((int)(small_table[80 + (s & 0xF)]) << 8);
 				s = s >> 4;
 				d ^= small_table[96 + s] | ((int)(small_table[112 + s]) << 8);
 				r1[0] = (unsigned short)d;
 				r1++;
 				count--;
 			}
 			// 16個ずつ計算するので 16の倍数にする
 			gf16_avx2_block32s((unsigned char *)r1, (count & 0xFFFFFFF0) << 1, small_table);
 			r1 += count & 0xFFFFFFF0;
 			count &= 15;
 			// 残りは 1個ずつ計算する
 			while (count != 0){
 				s = r1[0];
 				d = small_table[s & 0xF] | ((int)(small_table[16 + (s & 0xF)]) << 8);
 				s = s >> 4;
 				d ^= small_table[32 + (s & 0xF)] | ((int)(small_table[48 + (s & 0xF)]) << 8);
 				s = s >> 4;
 				d ^= small_table[64 + (s & 0xF)] | ((int)(small_table[80 + (s & 0xF)]) << 8);
 				s = s >> 4;
 				d ^= small_table[96 + s] | ((int)(small_table[112 + s]) << 8);
 				r1[0] = (unsigned short)d;
 				r1++;
 				count--;
 			}
 		} else if (cpu_flag & 1){	// SSSE3 対応なら
 			__declspec( align(16) ) unsigned char small_table[128];
 			int s, d;
 			create_eight_table(small_table, factor);
 			// アドレスが 16の倍数で無い場合は 16バイト単位で計算する効率が落ちる
 			while ((ULONG_PTR)r1 & 0xE){
 				// そこで最初の 1～7個(2～14バイト)だけ普通に計算する
 				s = r1[0];
 				d = small_table[s & 0xF] | ((int)(small_table[16 + (s & 0xF)]) << 8);
 				s = s >> 4;
 				d ^= small_table[32 + (s & 0xF)] | ((int)(small_table[48 + (s & 0xF)]) << 8);
 				s = s >> 4;
 				d ^= small_table[64 + (s & 0xF)] | ((int)(small_table[80 + (s & 0xF)]) << 8);
 				s = s >> 4;
 				d ^= small_table[96 + s] | ((int)(small_table[112 + s]) << 8);
 				r1[0] = (unsigned short)d;
 				r1++;
 				count--;
 			}
 			// 8個ずつ計算するので 8の倍数にする
 			gf16_ssse3_block16s((unsigned char *)r1, (count & 0xFFFFFFF8) << 1, small_table);
 			r1 += count & 0xFFFFFFF8;
 			count &= 7;
 			// 残りは 1個ずつ計算する
 			while (count != 0){
 				s = r1[0];
 				d = small_table[s & 0xF] | ((int)(small_table[16 + (s & 0xF)]) << 8);
 				s = s >> 4;
 				d ^= small_table[32 + (s & 0xF)] | ((int)(small_table[48 + (s & 0xF)]) << 8);
 				s = s >> 4;
 				d ^= small_table[64 + (s & 0xF)] | ((int)(small_table[80 + (s & 0xF)]) << 8);
 				s = s >> 4;
 				d ^= small_table[96 + s] | ((int)(small_table[112 + s]) << 8);
 				r1[0] = (unsigned short)d;
 				r1++;
 				count--;
 			}
 		} else {	// Combined Multi Table support (2 tables of 256-entries)
 #endif
 */
 			unsigned int mtab[256 * 2];
 			create_two_table(mtab, factor);	// 掛け算用のテーブルをその場で構成する
@@ -2117,6 +2746,11 @@ void galois_region_divide(
 			// 奇数なら最後に 1個余る
 			if (count == 1)
 				r1[0] = (unsigned short)(mtab[((unsigned char *)r1)[0]] ^ mtab[256 + ((unsigned char *)r1)[1]]);
 /*
 #ifndef NO_SIMD
 		}
 #endif
 */
 	} else {	// 小さいデータは普通に計算する
 		int log_y = galois_log_table[factor];
@@ -2271,6 +2905,42 @@ void galois_align32_multiply(
 	}
 }
 // 掛け算を２回行って、一度に更新する (SSSE3 & ALTMAP)
 void galois_align32_multiply2(
 	unsigned char *src1,	// Region to multiply (must be aligned by 16)
 	unsigned char *src2,
 	unsigned char *dst,		// Products go here
 	unsigned int len,		// Byte length (must be multiple of 32)
 	int factor1,			// Number to multiply by
 	int factor2)
 {
 	if ((factor1 == 1) && (factor2 == 1)){	// 両方の factor が 1の場合
 		__m128i xmm0, xmm1, xmm2;
 		while (len != 0){
 			xmm0 = _mm_load_si128((__m128i *)dst);
 			xmm1 = _mm_load_si128((__m128i *)src1);
 			xmm2 = _mm_load_si128((__m128i *)src2);
 			xmm0 = _mm_xor_si128(xmm0, xmm1);
 			xmm0 = _mm_xor_si128(xmm0, xmm2);
 			_mm_store_si128((__m128i *)dst, xmm0);
 			src1 += 16;
 			src2 += 16;
 			dst += 16;
 			len -= 16;
 		}
 	// 掛け算用のテーブルを常に作成する (32バイトだと少し遅くなる)
 	} else {
 		__declspec( align(16) ) unsigned char small_table[256];
 		create_eight_table(small_table, factor1);
 		create_eight_table(small_table + 128, factor2);
 		gf16_ssse3_block32_altmap2(src1, src2, dst, len, small_table);
 	}
 }
 // 256バイトごとに並び替えられたバッファー専用の JIT(SSE2) を使った掛け算
 void galois_align256_multiply(
 	unsigned char *r1,	// Region to multiply (must be aligned by 16)
@@ -2321,32 +2991,54 @@ void galois_align32avx_multiply(
 			}
 			_mm256_zeroupper();	// AVX-SSE 切り替えの回避
 /*
 			__m128i xmm0, xmm1;	// 16バイトごとに XOR する
 			while (len != 0){
 				xmm0 = _mm_load_si128((__m128i *)r1);
 				xmm1 = _mm_load_si128((__m128i *)r2);
 				xmm1 = _mm_xor_si128(xmm1, xmm0);
 				_mm_store_si128((__m128i *)r2, xmm1);
 				r1 += 16;
 				r2 += 16;
 				len -= 16;
 			}
 */
 		}
 	// 掛け算用のテーブルを常に作成する (32バイトだと少し遅くなる)
 	} else {
 		__declspec( align(32) ) unsigned char small_table[128];
-		create_eight_table(small_table, factor);
+		create_eight_table_avx2(small_table, factor);
 		gf16_avx2_block32(r1, r2, len, small_table);
 	}
 }
 // 掛け算を２回行って、一度に更新する (AVX2 & ALTMAP)
 void galois_align32avx_multiply2(
 	unsigned char *src1,	// Region to multiply (must be aligned by 32)
 	unsigned char *src2,
 	unsigned char *dst,		// Products go here
 	unsigned int len,		// Byte length (must be multiple of 32)
 	int factor1,			// Number to multiply by
 	int factor2)
 {
 	if ((factor1 == 1) && (factor2 == 1)){	// 両方の factor が 1の場合
 		__m256i ymm0, ymm1, ymm2;
 		while (len != 0){
 			ymm0 = _mm256_load_si256((__m256i *)dst);
 			ymm1 = _mm256_load_si256((__m256i *)src1);
 			ymm2 = _mm256_load_si256((__m256i *)src2);
 			ymm0 = _mm256_xor_si256(ymm0, ymm1);
 			ymm0 = _mm256_xor_si256(ymm0, ymm2);
 			_mm256_store_si256((__m256i *)dst, ymm0);
 			src1 += 32;
 			src2 += 32;
 			dst += 32;
 			len -= 32;
 		}
 		_mm256_zeroupper();	// AVX-SSE 切り替えの回避
 	// 掛け算用のテーブルを常に作成する (32バイトだと少し遅くなる)
 	} else {
 		__declspec( align(32) ) unsigned char small_table[256];
 		create_eight_table_avx2(small_table, factor1);
 		create_eight_table_avx2(small_table + 128, factor2);
 		gf16_avx2_block32_2(src1, src2, dst, len, small_table);
 	}
 }
 /* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */
 // チェックサムを計算する
--- a/source/par2j/gf16.h
+++ b/source/par2j/gf16.h
@@ -47,6 +47,15 @@ typedef void (* REGION_MULTIPLY) (
 	int factor);		// Number to multiply by
 REGION_MULTIPLY galois_align_multiply;
 typedef void (* REGION_MULTIPLY2) (
 	unsigned char *src1,	// Region to multiply
 	unsigned char *src2,
 	unsigned char *dst,		// Products go here
 	unsigned int len,		// Byte length
 	int factor1,			// Number to multiply by
 	int factor2);
 REGION_MULTIPLY2 galois_align_multiply2;
 // 領域並び替え用の関数定義
 typedef void (* REGION_ALTMAP) (unsigned char *data, unsigned int bsize);
 REGION_ALTMAP galois_altmap_change;
--- a/source/par2j/lib_opencl.c
+++ b/source/par2j/lib_opencl.c
@@ -1,5 +1,5 @@
 // lib_opencl.c
-// Copyright : 2023-06-01 Yutaka Sawada
+// Copyright : 2023-09-23 Yutaka Sawada
 // License : GPL
 #ifndef _WIN32_WINNT
@@ -72,11 +72,10 @@ typedef cl_int (CL_API_CALL *API_clEnqueueNDRangeKernel)(cl_command_queue, cl_ke
 /* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */
 // グローバル変数
-extern unsigned int cpu_flag, cpu_cache;	// declared in common2.h
+extern unsigned int cpu_flag;	// declared in common2.h
 extern int cpu_num;
 #define MAX_DEVICE 3
 #define MAX_GROUP_NUM 64
 HMODULE hLibOpenCL = NULL;
@@ -103,18 +102,17 @@ API_clEnqueueNDRangeKernel gfn_clEnqueueNDRangeKernel;
 入力
 OpenCL_method : どのデバイスを選ぶか
 unit_size : ブロックの単位サイズ
 chunk_size: 分割された断片サイズ
 src_max : ソース・ブロック個数
 chunk_size = 0: 標準では分割しない
 出力
 return : エラー番号
 src_max : 最大で何ブロックまでソースを読み込めるか
 chunk_size : CPUスレッドの分割サイズ
 OpenCL_method : 動作フラグいろいろ
 */
 // 0=成功, 1～エラー番号
-int init_OpenCL(int unit_size, int *src_max, int *chunk_size)
+int init_OpenCL(int unit_size, int chunk_size, int *src_max)
 {
 	char buf[2048], *p_source;
 	int err = 0, i, j;
@@ -141,7 +139,7 @@ int init_OpenCL(int unit_size, int *src_max, int *chunk_size)
 	API_clGetKernelWorkGroupInfo fn_clGetKernelWorkGroupInfo;
 	cl_int ret;
 	cl_uint num_platforms = 0, num_devices = 0, num_groups, param_value;
-	cl_ulong param_value8, cache_size;
+	cl_ulong param_value8;
 	cl_platform_id platform_id[MAX_DEVICE], selected_platform;	// Intel, AMD, Nvidia などドライバーの提供元
 	cl_device_id device_id[MAX_DEVICE], selected_device;	// CPU や GPU など
 	cl_program program;
@@ -309,19 +307,14 @@ int init_OpenCL(int unit_size, int *src_max, int *chunk_size)
 			ret = fn_clGetDeviceInfo(device_id[j], CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(size_t), &data_size, NULL);
 			if (ret != CL_SUCCESS)
 				continue;
-			ret = fn_clGetDeviceInfo(device_id[j], CL_DEVICE_HOST_UNIFIED_MEMORY, sizeof(cl_uint), &param_value, NULL);
+			// CL_DEVICE_HOST_UNIFIED_MEMORY は OpenCL 2.0 以降で非推奨になったので、参照しない
 			if (ret != CL_SUCCESS)
 				continue;
 			if (param_value != 0)
 				param_value = 1;
 #ifdef DEBUG_OUTPUT
 			printf("MAX_COMPUTE_UNITS = %d\n", num_groups);
 			printf("MAX_WORK_GROUP_SIZE = %zd\n", data_size);
 			printf("HOST_UNIFIED_MEMORY = %d\n", param_value);
 #endif
-			// MAX_COMPUTE_UNITS * MAX_WORK_GROUP_SIZE で計算力を測る、外付けGPUなら値を倍にする
+			// MAX_COMPUTE_UNITS * MAX_WORK_GROUP_SIZE で計算力を測る
-			count = (2 - param_value) * (int)data_size * num_groups;
+			count = (int)data_size * num_groups;
 			count *= OpenCL_method;	// 符号を変える
 			//printf("prev = %d, now = %d\n", gpu_power, count);
 			if ((count > gpu_power) && (data_size >= 256) &&	// 256以上ないとテーブルを作れない
@@ -330,8 +323,6 @@ int init_OpenCL(int unit_size, int *src_max, int *chunk_size)
 				selected_device = device_id[j];	// 使うデバイスの ID
 				selected_platform = platform_id[i];
 				OpenCL_group_num = num_groups;	// ワークグループ数は COMPUTE_UNITS 数にする
 				if (OpenCL_group_num > MAX_GROUP_NUM)	// 制限を付けてローカルメモリーの消費を抑える
 					OpenCL_group_num = MAX_GROUP_NUM;
 				alloc_max = (size_t)param_value8;
 				// AMD Radeon ではメモリー領域が全体の 1/4 とは限らない
@@ -345,26 +336,6 @@ int init_OpenCL(int unit_size, int *src_max, int *chunk_size)
 					if ((cl_ulong)alloc_max > param_value8)
 						alloc_max = (size_t)param_value8;
 				}
 				cache_size = 0;
 				ret = fn_clGetDeviceInfo(device_id[j], CL_DEVICE_GLOBAL_MEM_CACHE_TYPE, sizeof(cl_uint), &num_groups, NULL);
 				if (ret == CL_SUCCESS){
 #ifdef DEBUG_OUTPUT
 					printf("GLOBAL_MEM_CACHE_TYPE = %d\n", num_groups);
 #endif
 					if (num_groups & 3){	// CL_READ_ONLY_CACHE or CL_READ_WRITE_CACHE
 						ret = fn_clGetDeviceInfo(device_id[j], CL_DEVICE_GLOBAL_MEM_CACHE_SIZE, sizeof(cl_ulong), &cache_size, NULL);
 						if (ret == CL_SUCCESS){
 #ifdef DEBUG_OUTPUT
 							printf("GLOBAL_MEM_CACHE_SIZE = %I64d KB\n", cache_size >> 10);
 #endif
 							if (param_value != 0){	// 内蔵 GPU なら CPU との共有キャッシュを活用する
 								if (cache_size >= 1048576)	// サイズが小さい場合は分割しない
 									cache_size |= 0x40000000;
 							}
 						}
 					}
 				}
 			}
 		}
 	}
@@ -395,67 +366,28 @@ int init_OpenCL(int unit_size, int *src_max, int *chunk_size)
 		return (ret << 8) | 12;
 	// 計算方式を選択する
 	gpu_power = unit_size;	// unit_size は MEM_UNIT の倍数になってる
 	if ((((cpu_flag & 0x101) == 1) || ((cpu_flag & 16) != 0)) && (sse_unit == 32)){
 		OpenCL_method = 2;	// SSSE3 & ALTMAP または AVX2 ならデータの並び替え対応版を使う
 		if (cache_size & 0x40000000){	// 内蔵 GPU でキャッシュを利用できるなら、CPUスレッドと同じにする
 			j = cpu_cache & 0x7FFF8000;	// CPUのキャッシュ上限サイズ
 			count = (int)(cache_size & 0x3FFFFFFF) / 4;	// ただし、認識できるサイズの 1/4 までにする
 			if ((j == 0) || (j > count))
 				j = count;
 			count = 1;
 			while (gpu_power > j){	// 制限サイズより大きいなら
 				// 分割数を増やして chunk のサイズを試算してみる
 				count++;
 				gpu_power = (unit_size + count - 1) / count;
 				gpu_power = (gpu_power + (MEM_UNIT - 1)) & ~(MEM_UNIT - 1);	// MEM_UNITの倍数にする
 			}
 			if (count > 1){
 				*chunk_size = gpu_power;
 				OpenCL_method = 3;
 #ifdef DEBUG_OUTPUT
 				printf("gpu cache: limit size = %d, chunk size = %d, split = %d\n", j, gpu_power, count);
 #endif
 			}
 /*
 		// 32バイト単位のメモリーアクセスならキャッシュする必要なし？計算速度が半減する・・・
 		} else if ((cache_size & 0x3FFFFFFF) > OpenCL_group_num * 4096){	// 2KB の倍はいるかも？
 #ifdef DEBUG_OUTPUT
 			printf("gpu: cache size = %d, read size = %d\n", cache_size & 0x3FFFFFFF, OpenCL_group_num * 2048);
 #endif
 			OpenCL_method = 1;
 */
 		}
 	} else if (((cpu_flag & 128) != 0) && (sse_unit == 256)){
 		OpenCL_method = 4;	// JIT(SSE2) は bit ごとに上位から 16バイトずつ並ぶ
 		// ローカルのテーブルサイズが異なることに注意
 		// XOR 方式以外は 2KB (4バイト * 256項目 * 2個) 使う
 		// XOR (JIT) は 64バイト (4バイト * 16項目) 使う
 #ifdef DEBUG_OUTPUT
 //		printf("4 KB cache (16-bytes * 256 work items), use if\n");
 #endif
 	} else {
-		OpenCL_method = 1;	// MMX用のコードは遅いので、キャッシュ最適化する必要が無い
+		OpenCL_method = 1;	// 並び替えられてないデータ用
 	}
 	// work group 数が必要以上に多い場合は減らす
-/*
+	if (OpenCL_method == 2){
 	if (OpenCL_method == 4){
 		// work item 一個が 16バイトずつ計算する、256個なら work group ごとに 4KB 担当する
 		data_size = unit_size / 4096;
 	} else 
 */
 	if (OpenCL_method & 2){
 		// work item 一個が 8バイトずつ計算する、256個なら work group ごとに 2KB 担当する
-		data_size = unit_size / 2048;
+		data_size = chunk_size / 2048;
 	} else {
 		// work item 一個が 4バイトずつ計算する、256個なら work group ごとに 1KB 担当する
-		data_size = unit_size / 1024;
+		data_size = chunk_size / 1024;
 	}
 	if (OpenCL_group_num > data_size){
 		OpenCL_group_num = data_size;
-		printf("Number of work groups is reduced to %d\n", (int)OpenCL_group_num);
+		printf("Number of work groups is reduced to %zd\n", OpenCL_group_num);
 	}
 	// 最大で何ブロック分のメモリー領域を保持できるのか（ここではまだ確保しない）
@@ -469,9 +401,9 @@ int init_OpenCL(int unit_size, int *src_max, int *chunk_size)
 	printf("src buf : %zd KB (%d blocks), possible\n", data_size >> 10, count);
 #endif
-	// 出力先は1ブロック分だけあればいい
+	// 出力先はchunk 1個分だけあればいい
 	// CL_MEM_ALLOC_HOST_PTRを使えばpinned memoryになるらしい
-	data_size = unit_size;
+	data_size = (chunk_size + 63) & ~63;	//  cache line sizes (64 bytes) の倍数にする
 	OpenCL_dst = gfn_clCreateBuffer(OpenCL_context, CL_MEM_WRITE_ONLY | CL_MEM_ALLOC_HOST_PTR, data_size, NULL, &ret);
 	if (ret != CL_SUCCESS)
 		return (ret << 8) | 13;
@@ -574,7 +506,7 @@ int init_OpenCL(int unit_size, int *src_max, int *chunk_size)
 	FreeResource(glob);	// not required ?
 	// 定数を指定する
-	wsprintfA(buf, "-D BLK_SIZE=%d -D CHK_SIZE=%d", unit_size / 4, gpu_power / 4);
+	wsprintfA(buf, "-cl-fast-relaxed-math -D BLK_SIZE=%d", unit_size / 4);
 	// 使用する OpenCL デバイス用にコンパイルする
 	ret = fn_clBuildProgram(program, 1, &selected_device, buf, NULL, NULL);
@@ -768,11 +700,12 @@ int gpu_copy_blocks(
 }
 // ソース・ブロックを掛け算する
-int gpu_multiply_blocks(
+int gpu_multiply_chunks(
 	int src_num,			// Number of multiplying source blocks
 	unsigned short *mat,	// Matrix of numbers to multiply by
 	unsigned char *buf,		// Products go here
-	int len)				// Byte length
+	int offset,				// Offset in each block
 	int length)				// Byte length
 {
 	unsigned __int64 *vram, *src, *dst;
 	size_t global_size, local_size;
@@ -787,6 +720,14 @@ int gpu_multiply_blocks(
 	ret = gfn_clSetKernelArg(OpenCL_kernel, 3, sizeof(int), &src_num);
 	if (ret != CL_SUCCESS)
 		return (ret << 8) | 103;
 	offset /= 4;	// 4バイト整数単位にする
 	ret = gfn_clSetKernelArg(OpenCL_kernel, 4, sizeof(int), &offset);
 	if (ret != CL_SUCCESS)
 		return (ret << 8) | 104;
 	length /= 4;	// 4バイト整数単位にする
 	ret = gfn_clSetKernelArg(OpenCL_kernel, 5, sizeof(int), &length);
 	if (ret != CL_SUCCESS)
 		return (ret << 8) | 105;
 	// カーネル並列実行
 	local_size = 256;	// テーブルやキャッシュのため、work item 数は 256に固定する
@@ -797,18 +738,18 @@ int gpu_multiply_blocks(
 		return (ret << 8) | 11;
 	// 出力内容をホスト側に反映させる
-	vram = gfn_clEnqueueMapBuffer(OpenCL_command, OpenCL_dst, CL_TRUE, CL_MAP_READ, 0, len, 0, NULL, NULL, &ret);
+	vram = gfn_clEnqueueMapBuffer(OpenCL_command, OpenCL_dst, CL_TRUE, CL_MAP_READ, 0, length * 4, 0, NULL, NULL, &ret);
 	if (ret != CL_SUCCESS)
 		return (ret << 8) | 12;
 	// 8バイトごとに XOR する (SSE2 で XOR しても速くならず)
 	src = vram;
 	dst = (unsigned __int64 *)buf;
-	while (len > 0){
+	while (length > 0){
 		*dst ^= *src;
 		dst++;
 		src++;
-		len -= 8;
+		length -= 2;
 	}
 	// ホスト側でデータを変更しなくても、clEnqueueMapBufferと対で呼び出さないといけない
--- a/source/par2j/lib_opencl.h
+++ b/source/par2j/lib_opencl.h
@@ -10,20 +10,21 @@ extern "C" {
 extern int OpenCL_method;
-int init_OpenCL(int unit_size, int *src_max, int *chunk_size);
+int init_OpenCL(int unit_size, int chunk_size, int *src_max);
 int free_OpenCL(void);
 void info_OpenCL(char *buf, int buf_size);
 int gpu_copy_blocks(
 	unsigned char *data,
 	int unit_size,
-	int src_end);
+	int src_num);
-int gpu_multiply_blocks(
+int gpu_multiply_chunks(
 	int src_num,			// Number of multiplying source blocks
 	unsigned short *mat,	// Matrix of numbers to multiply by
 	unsigned char *buf,		// Products go here
-	int len);				// Byte length
+	int offset,				// Offset in each block
 	int length);			// Byte length
 int gpu_finish(void);
--- a/source/par2j/md5_crc.c
+++ b/source/par2j/md5_crc.c
@@ -1,5 +1,5 @@
 // md5_crc.c
-// Copyright : 2022-10-01 Yutaka Sawada
+// Copyright : 2023-08-28 Yutaka Sawada
 // License : GPL
 #ifndef _UNICODE
@@ -672,7 +672,7 @@ time1_start = GetTickCount();
 	// バッファー・サイズが大きいのでヒープ領域を使う
 	for (io_size = IO_SIZE; io_size < 1048576; io_size += IO_SIZE){	// 1 MB までにする
-		if ((io_size + IO_SIZE > (cpu_cache << 17)) || ((__int64)(io_size + IO_SIZE) * 4 > file_left))
+		if ((io_size + IO_SIZE > (cpu_cache & 0xFFFE0000)) || ((__int64)(io_size + IO_SIZE) * 4 > file_left))
 			break;
 	}
 	buf1 = _aligned_malloc(io_size * 2, 64);
@@ -867,7 +867,7 @@ DWORD WINAPI file_hash_crc2(LPVOID lpParameter)
 	// バッファー・サイズが大きいのでヒープ領域を使う
 	prog_tick = 1;
 	for (io_size = IO_SIZE; io_size < 1048576; io_size += IO_SIZE){	// IO_SIZE の倍数で 1 MB までにする
-		if ((io_size + IO_SIZE > (cpu_cache << 17)) || ((__int64)(io_size + IO_SIZE) * 4 > file_left))
+		if ((io_size + IO_SIZE > (cpu_cache & 0xFFFE0000)) || ((__int64)(io_size + IO_SIZE) * 4 > file_left))
 			break;
 		prog_tick++;
 	}
@@ -1304,7 +1304,7 @@ DWORD WINAPI file_hash_background(LPVOID lpParameter)
 	// バッファー・サイズが大きいのでヒープ領域を使う
 	for (io_size = IO_SIZE; io_size < 1048576; io_size += IO_SIZE){	// IO_SIZE の倍数で 1 MB までにする
-		if ((io_size + IO_SIZE > (cpu_cache << 17)) || ((__int64)(io_size + IO_SIZE) * 4 > file_size))
+		if ((io_size + IO_SIZE > (cpu_cache & 0xFFFE0000)) || ((__int64)(io_size + IO_SIZE) * 4 > file_size))
 			break;
 	}
 	//printf("\n io_size = %d\n", io_size);
--- a/source/par2j/par2.c
+++ b/source/par2j/par2.c
@@ -1,5 +1,5 @@
 // par2.c
-// Copyright : 2023-03-15 Yutaka Sawada
+// Copyright : 2023-09-21 Yutaka Sawada
 // License : GPL
 #ifndef _UNICODE
@@ -112,12 +112,12 @@ int par2_create(
 		err = -12;
 	} else {
 		// メモリーを確保できるか試す
-		err = read_block_num(parity_num, cpu_num - 1, 0, 256);
+		err = read_block_num(parity_num, 0, 256);
 		if (err == 0)
 			err = -13;
 	}
 #ifdef TIMER
-	printf("read_block_num = %d\n", read_block_num(parity_num, cpu_num - 1, 0, 256));
+	printf("read_block_num = %d\n", read_block_num(parity_num, 0, 256));
 #endif
 	if (err > 0){	// 1-pass方式が可能
 #ifdef TIMER
--- a/source/par2j/par2_cmd.c
+++ b/source/par2j/par2_cmd.c
@@ -1,5 +1,5 @@
 // par2_cmd.c
-// Copyright : 2023-03-18 Yutaka Sawada
+// Copyright : 2023-09-28 Yutaka Sawada
 // License : GPL
 #ifndef _UNICODE
@@ -86,7 +86,7 @@ static void print_environment(void)
 	printf("CPU thread\t: %d / %d\n", cpu_num & 0xFFFF, cpu_num >> 24);
 	cpu_num &= 0xFFFF;	// 利用するコア数だけにしておく
-	printf("CPU cache limit : %d KB, %d KB\n", (cpu_cache & 0x7FFF8000) >> 10, (cpu_cache & 0x00007FFF) << 7);
+	printf("CPU cache limit : %d KB, %d KB\n", (cpu_flag & 0xFFFF0000) >> 10, (cpu_cache & 0xFFFE0000) >> 10);
 #ifndef _WIN64	// 32-bit 版は MMX, SSE2, SSSE3 のどれかを表示する
 	printf("CPU extra\t:");
 	if (cpu_flag & 1){
@@ -1481,39 +1481,42 @@ ri= switch_set & 0x00040000
 					k = (k * 10) + (tmp_p[j] - '0');
 					j++;
 				}
-				if (k & 32){	// GPU を使う
+				if (k & 256){	// GPU を使う
 					OpenCL_method = 1;	// Faster GPU
-				} else if (k & 64){
+				} else if (k & 512){
 					OpenCL_method = -1;	// Slower GPU
 				}
-				if (k & 16)	// SSSE3 を使わない
+				if (k & 1024)	// CLMUL を使わない、SSSE3 の古いエンコーダーを使う
 					cpu_flag &= 0xFFFFFFFE;
 				if (k & 128)	// CLMUL を使わない、SSSE3 の古いエンコーダーを使う
 					cpu_flag = (cpu_flag & 0xFFFFFFF7) | 0x100;
-				if (k & 256)	// JIT(SSE2) を使わない
+				if (k & 2048)	// JIT(SSE2) を使わない
 					cpu_flag &= 0xFFFFFF7F;
-				if (k & 512)	// AVX2 を使わない
+				if (k & 4096)	// SSSE3 を使わない
 					cpu_flag &= 0xFFFFFFFE;
 				if (k & 8192)	// AVX2 を使わない
 					cpu_flag &= 0xFFFFFFEF;
-				if (k & 15){	// 使用するコア数を変更する
+				if (k & 255){	// 使用するコア数を変更する
-					k &= 15;	// 1～15 の範囲
+					k &= 255;	// 1～255 の範囲
 					// printf("\n lc# = %d , logical = %d, physical = %d \n", k, cpu_num >> 24, (cpu_num & 0x00FF0000) >> 16);
-					if (k == 12){	// 物理コア数の 1/4 にする
+					if (k == 251){	// 物理コア数の 1/4 にする
 						k = ((cpu_num & 0x00FF0000) >> 16) / 4;
-					} else if (k == 13){	// 物理コア数の半分にする
+					} else if (k == 252){	// 物理コア数の半分にする
 						k = ((cpu_num & 0x00FF0000) >> 16) / 2;
-					} else if (k == 14){	// 物理コア数の 3/4 にする
+					} else if (k == 253){	// 物理コア数の 3/4 にする
 						k = (((cpu_num & 0x00FF0000) >> 16) * 3) / 4;
-					} else if (k == 15){	// 物理コア数にする
+					} else if (k == 254){	// 物理コア数より減らす
-						k = (cpu_num & 0x00FF0000) >> 16;
+						k = ((cpu_num & 0x00FF0000) >> 16) - 1;
-						if (k >= 6)
+					} else if (k == 255){	// 物理コア数より増やす
-							k--;	// 物理コア数が 6以上なら、1個減らす
+						k = ((cpu_num & 0x00FF0000) >> 16) + 1;
-					} else if (k > (cpu_num >> 24)){
+						//k = cpu_num >> 16;
-						k = cpu_num >> 24;	// 論理コア数を超えないようにする
+						//k = ((k & 0xFF) + (k >> 8)) / 2;	// 物理コア数と論理コア数の中間にする？
 						// タスクマネージャーにおける CPU使用率は 100%になるけど、速くはならない・・・
 					}
 					if (k > MAX_CPU){
 						k = MAX_CPU;
 					} else if (k < 1){
 						k = 1;
 					} else if (k > (cpu_num >> 24)){
 						k = cpu_num >> 24;	// 論理コア数を超えないようにする
 					}
 					cpu_num = (cpu_num & 0xFFFF0000) | k;	// 指定されたコア数を下位に配置する
 				}
--- a/source/par2j/reedsolomon.c
+++ b/source/par2j/reedsolomon.c
@@ -1,5 +1,5 @@
 // reedsolomon.c
-// Copyright : 2023-05-29 Yutaka Sawada
+// Copyright : 2023-09-28 Yutaka Sawada
 // License : GPL
 #ifndef _UNICODE
@@ -30,13 +30,13 @@
 // GPU を使う最小データサイズ (MB 単位)
 // GPU の起動には時間がかかるので、データが小さすぎると逆に遅くなる
-#define GPU_DATA_LIMIT 512
+#define GPU_DATA_LIMIT 200
 // GPU を使う最小ブロックサイズとブロック数
 // CPU と GPU で処理を割り振る為には、ある程度のブロック数を必要とする
 #define GPU_BLOCK_SIZE_LIMIT 65536
-#define GPU_SOURCE_COUNT_LIMIT 256
+#define GPU_SOURCE_COUNT_LIMIT 192
-#define GPU_PARITY_COUNT_LIMIT 32
+#define GPU_PARITY_COUNT_LIMIT 8
 /* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */
@@ -46,7 +46,7 @@ int try_cache_blocking(int unit_size)
 	int limit_size, chunk_count, chunk_size, cache_line_diff;
 	// CPUキャッシュをどのくらいまで使うか
-	limit_size = cpu_flag & 0x7FFF8000;	// 最低でも 32KB になる
+	limit_size = cpu_flag & 0x7FFF0000;	// 最低でも 64KB になる
 	if (limit_size == 0)	// キャッシュ・サイズを取得できなかった場合は最適化しない
 		return unit_size;
@@ -160,7 +160,6 @@ unsigned int get_io_size(
 // 何ブロックまとめてファイルから読み込むかを空きメモリー量から計算する
 int read_block_num(
 	int keep_num,			// 保持するパリティ・ブロック数
 	int add_num,			// 余裕を見るブロック数
 	size_t trial_alloc,		// 確保できるか確認するのか
 	int alloc_unit)			// メモリー単位の境界 (sse_unit か MEM_UNIT)
 {
@@ -177,7 +176,7 @@ int read_block_num(
 	if (trial_alloc){
 		__int64 possible_size;
-		possible_size = (__int64)unit_size * (source_num + keep_num + add_num);
+		possible_size = (__int64)unit_size * (source_num + keep_num);
 #ifndef _WIN64	// 32-bit 版なら
 		if (possible_size > MAX_MEM_SIZE)	// 確保する最大サイズを 2GB までにする
 			possible_size = MAX_MEM_SIZE;
@@ -191,13 +190,13 @@ int read_block_num(
 	}
 	mem_size = get_mem_size(trial_alloc) / unit_size;	// 何個分確保できるか
-	if (mem_size >= (size_t)(source_num + keep_num + add_num)){	// 最大個数より多い
+	if (mem_size >= (size_t)(source_num + keep_num)){	// 最大個数より多い
 		buf_num = source_num;
-	} else if ((int)mem_size < read_min + keep_num + add_num){	// 少なすぎる
+	} else if ((int)mem_size < read_min + keep_num){	// 少なすぎる
 		buf_num = 0;	// メモリー不足の印
 	} else {	// ソース・ブロック個数を等分割する
 		int split_num;
-		buf_num = (int)mem_size - (keep_num + add_num);
+		buf_num = (int)mem_size - keep_num;
 		split_num = (source_num + buf_num - 1) / buf_num;	// 何回に別けて読み込むか
 		buf_num = (source_num + split_num - 1) / split_num;
 	}
@@ -263,7 +262,7 @@ static int invert_matrix_st(unsigned short *mat,
 /* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */
 // マルチ・プロセッサー対応
-
+/*
 typedef struct {	// RS threading control struct
 	unsigned short *mat;	// 行列
 	int cols;	// 横行の長さ
@@ -308,8 +307,57 @@ static DWORD WINAPI thread_func(LPVOID lpParameter)
 	CloseHandle(th->end);
 	return 0;
 }
 */
 typedef struct {	// Maxtrix Inversion threading control struct
 	unsigned short *mat;	// 行列
 	int cols;	// 横行の長さ
 	volatile int start;	// 掛ける行の先頭位置
 	volatile int pivot;	// 倍率となる値の位置
 	volatile int skip;	// とばす行
 	volatile int now;	// 消去する行
 	HANDLE run;
 	HANDLE end;
 } INV_TH;
 // サブ・スレッド
 static DWORD WINAPI thread_func(LPVOID lpParameter)
 {
 	unsigned short *mat;
 	int j, cols, row_start2, factor;
 	HANDLE hRun, hEnd;
 	INV_TH *th;
 	th = (INV_TH *)lpParameter;
 	mat = th->mat;
 	cols = th->cols;
 	hRun = th->run;
 	hEnd = th->end;
 	SetEvent(hEnd);	// 設定完了を通知する
 	WaitForSingleObject(hRun, INFINITE);	// 計算開始の合図を待つ
 	while (th->skip >= 0){
 		while ((j = InterlockedDecrement(&(th->now))) >= 0){	// j = --th_now
 			if (j == th->skip)
 				continue;
 			row_start2 = cols * j;	// その行の開始位置
 			factor = mat[row_start2 + th->pivot];	// j 行の pivot 列の値
 			mat[row_start2 + th->pivot] = 0;	// これが行列を一個で済ます手
 			// 先の計算により、i 行の pivot 列の値は必ず 1なので、この factor が倍率になる
 			galois_region_multiply(mat + th->start, mat + row_start2, cols, factor);
 		}
 		//_mm_sfence();	// メモリーへの書き込みを完了する
 		SetEvent(hEnd);	// 計算終了を通知する
 		WaitForSingleObject(hRun, INFINITE);	// 計算開始の合図を待つ
 	}
 	// 終了処理
 	CloseHandle(hRun);
 	CloseHandle(hEnd);
 	return 0;
 }
 // マルチ・スレッドで逆行列を計算する (利用するパリティ・ブロックの所だけ)
 /*
 static int invert_matrix_mt(unsigned short *mat,
 	int rows,				// 横行の数、行列の縦サイズ、失われたソース・ブロックの数 = 利用するパリティ・ブロック数
 	int cols,				// 縦列の数、行列の横サイズ、本来のソース・ブロック数
@@ -411,6 +459,130 @@ static int invert_matrix_mt(unsigned short *mat,
 	CloseHandle(th->h);
 	return 0;
 }
 */
 static int invert_matrix_mt(unsigned short *mat,
 	int rows,				// 横行の数、行列の縦サイズ、失われたソース・ブロックの数 = 利用するパリティ・ブロック数
 	int cols,				// 縦列の数、行列の横サイズ、本来のソース・ブロック数
 	source_ctx_r *s_blk)	// 各ソース・ブロックの情報
 {
 	int err = 0, j, row_start2, factor, sub_num;
 	unsigned int time_last = GetTickCount();
 	HANDLE hSub[MAX_CPU / 2], hRun[MAX_CPU / 2], hEnd[MAX_CPU / 2];
 	INV_TH th[1];
 	memset(hSub, 0, sizeof(HANDLE) * (MAX_CPU / 2));
 	memset(th, 0, sizeof(INV_TH));
 	// サブ・スレッドの数は平方根（切り上げ）にする
 	sub_num = 1;
 	j = 2;
 	while (j < cpu_num){	// 1~2=1, 3~4=2, 5~8=3, 9~16=4, 17~32=5
 		sub_num++;
 		j *= 2;
 	}
 	if (sub_num > rows - 2)
 		sub_num = rows - 2;	// 多過ぎても意味ないので制限する
 #ifdef TIMER
 	// 使うスレッド数は、メイン・スレッドの分も含めるので 1個増える
 	printf("\nMaxtrix Inversion with %d threads\n", sub_num + 1);
 #endif
 	// サブ・スレッドを起動する
 	th->mat = mat;
 	th->cols = cols;
 	for (j = 0; j < sub_num; j++){	// サブ・スレッドごとに
 		// イベントを作成する
 		hRun[j] = CreateEvent(NULL, FALSE, FALSE, NULL);	// 両方とも Auto Reset にする
 		if (hRun[j] == NULL){
 			print_win32_err();
 			printf("error, inv-thread\n");
 			err = 1;
 			goto error_end;
 		}
 		hEnd[j] = CreateEvent(NULL, FALSE, FALSE, NULL);
 		if (hEnd[j] == NULL){
 			print_win32_err();
 			CloseHandle(hRun[j]);
 			printf("error, inv-thread\n");
 			err = 1;
 			goto error_end;
 		}
 		// サブ・スレッドを起動する
 		th->run = hRun[j];
 		th->end = hEnd[j];
 		//_mm_sfence();	// メモリーへの書き込みを完了してからスレッドを起動する
 		hSub[j] = (HANDLE)_beginthreadex(NULL, STACK_SIZE, thread_func, (LPVOID)th, 0, NULL);
 		if (hSub[j] == NULL){
 			print_win32_err();
 			CloseHandle(hRun[j]);
 			CloseHandle(hEnd[j]);
 			printf("error, inv-thread\n");
 			err = 1;
 			goto error_end;
 		}
 		WaitForSingleObject(hEnd[j], INFINITE);	// 設定終了の合図を待つ (リセットする)
 	}
 	// Gaussian Elimination with 1 matrix
 	th->pivot = 0;
 	th->start = 0;	// その行の開始位置
 	for (th->skip = 0; th->skip < rows; th->skip++){
 		// 経過表示
 		if (GetTickCount() - time_last >= UPDATE_TIME){
 			if (print_progress((th->skip * 1000) / rows)){
 				err = 2;
 				goto error_end;
 			}
 			time_last = GetTickCount();
 		}
 		// その行 (パリティ・ブロック) がどのソース・ブロックの代用か
 		while ((th->pivot < cols) && (s_blk[th->pivot].exist != 0))
 			th->pivot++;
 		// Divide the row by element i,pivot
 		factor = mat[th->start + th->pivot];
 		if (factor > 1){
 			mat[th->start + th->pivot] = 1;	// これが行列を一個で済ます手
 			galois_region_divide(mat + th->start, cols, factor);
 		} else if (factor == 0){	// factor = 0 だと、その行列の逆行列を計算できない
 			err = (0x00010000 | th->pivot);	// どのソース・ブロックで問題が発生したのかを返す
 			goto error_end;
 		}
 		// 別の行の同じ pivot 列が 0以外なら、その値を 0にするために、
 		// i 行を何倍かしたものを XOR する
 		th->now = rows;	// 初期値 + 1
 		//_mm_sfence();	// メモリーへの書き込みを完了してからスレッドを再開する
 		for (j = 0; j < sub_num; j++)
 			SetEvent(hRun[j]);	// サブ・スレッドに計算を開始させる
 		while ((j = InterlockedDecrement(&(th->now))) >= 0){	// j = --th_now
 			if (j == th->skip)	// 同じ行はとばす
 				continue;
 			row_start2 = cols * j;	// その行の開始位置
 			factor = mat[row_start2 + th->pivot];	// j 行の pivot 列の値
 			mat[row_start2 + th->pivot] = 0;	// これが行列を一個で済ます手
 			// 先の計算により、i 行の pivot 列の値は必ず 1なので、この factor が倍率になる
 			galois_region_multiply(mat + th->start, mat + row_start2, cols, factor);
 		}
 		WaitForMultipleObjects(sub_num, hEnd, TRUE, INFINITE);	// サブ・スレッドの計算終了の合図を待つ
 		th->start += cols;
 		th->pivot++;
 	}
 error_end:
 	InterlockedExchange(&(th->skip), -1);		// 終了指示
 	for (j = 0; j < sub_num; j++){
 		if (hSub[j]){	// サブ・スレッドを終了させる
 			SetEvent(hRun[j]);
 			WaitForSingleObject(hSub[j], INFINITE);
 			CloseHandle(hSub[j]);
 		}
 	}
 	return err;
 }
 /* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */
 /*
@@ -539,11 +711,9 @@ unsigned int time_total = GetTickCount();
 	}
 	// パリティ計算用の行列演算の準備をする
-	if (parity_num > source_num){
+	len = sizeof(unsigned short) * source_num;
-		len = sizeof(unsigned short) * (source_num + parity_num);
+	if (OpenCL_method != 0)
-	} else {
+		len *= 2;	// GPU の作業領域も確保しておく
 		len = sizeof(unsigned short) * source_num * 2;
 	}
 	constant = malloc(len);
 	if (constant == NULL){
 		printf("malloc, %d\n", len);
@@ -551,7 +721,11 @@ unsigned int time_total = GetTickCount();
 		goto error_end;
 	}
 #ifdef TIMER
-	printf("\nmatrix size = %d.%d KB\n", len >> 10, (len >> 10) % 10);
+	if (len & 0xFFFFF000){
 		printf("\nmatrix size = %u KB\n", len >> 10);
 	} else {
 		printf("\nmatrix size = %u Bytes\n", len);
 	}
 #endif
 	// パリティ検査行列の基になる定数
 	make_encode_constant(constant);
@@ -623,11 +797,9 @@ unsigned int time_total = GetTickCount();
 	}
 	// パリティ計算用の行列演算の準備をする
-	if (parity_num > source_num){
+	len = sizeof(unsigned short) * source_num;
-		len = sizeof(unsigned short) * (source_num + parity_num);
+	if (OpenCL_method != 0)
-	} else {
+		len *= 2;	// GPU の作業領域も確保しておく
 		len = sizeof(unsigned short) * source_num * 2;
 	}
 	constant = malloc(len);
 	if (constant == NULL){
 		printf("malloc, %d\n", len);
@@ -635,7 +807,11 @@ unsigned int time_total = GetTickCount();
 		goto error_end;
 	}
 #ifdef TIMER
-	printf("\nmatrix size = %d.%d KB\n", len >> 10, (len >> 10) % 10);
+	if (len & 0xFFFFF000){
 		printf("\nmatrix size = %u KB\n", len >> 10);
 	} else {
 		printf("\nmatrix size = %u Bytes\n", len);
 	}
 #endif
 	// パリティ検査行列の基になる定数
 	make_encode_constant(constant);
@@ -719,9 +895,11 @@ unsigned int time_matrix = 0, time_total = GetTickCount();
 	}
 #ifdef TIMER
 	if (len & 0xFFF00000){
-		printf("\nmatrix size = %d.%d MB\n", len >> 20, (len >> 20) % 10);
+		printf("\nmatrix size = %u MB\n", len >> 20);
 	} else if (len & 0x000FF000){
 		printf("\nmatrix size = %u KB\n", len >> 10);
 	} else {
-		printf("\nmatrix size = %d.%d KB\n", len >> 10, (len >> 10) % 10);
+		printf("\nmatrix size = %u Bytes\n", len);
 	}
 #endif
 	// 何番目の消失ソース・ブロックがどのパリティで代替されるか
@@ -783,7 +961,7 @@ time_matrix = GetTickCount() - time_matrix;
 		if (memory_use & 16){
 			err = -4;	// SSD なら Read all 方式でブロックが断片化しても速い
 		} else 
-		if (read_block_num(block_lost, 2, 0, MEM_UNIT) != 0){
+		if (read_block_num(block_lost, 0, MEM_UNIT) != 0){
 			err = -5;	// HDD でメモリーが足りてるなら Read some 方式を使う
 		} else {
 			err = -4;	// メモリー不足なら Read all 方式でブロックを断片化させる
@@ -793,7 +971,7 @@ time_matrix = GetTickCount() - time_matrix;
 		if (memory_use & 16){
 			err = -2;	// SSD なら Read all 方式でブロックが断片化しても速い
 		} else 
-		if (read_block_num(block_lost, cpu_num - 1, 0, sse_unit) != 0){
+		if (read_block_num(block_lost, 0, sse_unit) != 0){
 			err = -3;	// HDD でメモリーが足りてるなら Read some 方式を使う
 		} else {
 			err = -2;	// メモリー不足なら Read all 方式でブロックを断片化させる
--- a/source/par2j/reedsolomon.h
+++ b/source/par2j/reedsolomon.h
@@ -10,7 +10,6 @@ extern "C" {
 // Read all source & Keep some parity 方式
 // 部分的なエンコードを行う最低ブロック数
 #define PART_MAX_RATE	1	// ソース・ブロック数の 1/2  = 50%
 #define PART_MIN_RATE	5	// ソース・ブロック数の 1/32 = 3.1%
 // Read some source & Keep all parity 方式
@@ -33,7 +32,6 @@ unsigned int get_io_size(
 // 何ブロックまとめてファイルから読み込むかを空きメモリー量から計算する
 int read_block_num(
 	int keep_num,			// 保持するパリティ・ブロック数
 	int add_num,			// 余裕を見るブロック数
 	size_t trial_alloc,		// 確保できるか確認するのか
 	int alloc_unit);		// メモリー単位の境界 (sse_unit か MEM_UNIT)
--- a/source/par2j/res_par2j.rc
+++ b/source/par2j/res_par2j.rc
@@ -1,8 +1,8 @@
 1 RT_STRING ".\\source.cl"
 1 VERSIONINFO
-FILEVERSION 1,3,2,8
+FILEVERSION 1,3,3,0
-PRODUCTVERSION 1,3,2,0
+PRODUCTVERSION 1,3,3,0
 FILEOS 0x40004
 FILETYPE 0x1
 {
@@ -13,8 +13,8 @@ BLOCK "StringFileInfo"
 		VALUE "FileDescription", "PAR2 client"
 		VALUE "LegalCopyright", "Copyright (C) 2023 Yutaka Sawada"
 		VALUE "ProductName", "par2j"
-		VALUE "FileVersion", "1.3.2.8"
+		VALUE "FileVersion", "1.3.3.0"
-		VALUE "ProductVersion", "1.3.2.0"
+		VALUE "ProductVersion", "1.3.3.0"
 	}
 }
--- a/source/par2j/rs_decode.c
+++ b/source/par2j/rs_decode.c
--- a/source/par2j/rs_encode.c
+++ b/source/par2j/rs_encode.c
--- a/source/par2j/source.cl
+++ b/source/par2j/source.cl
@@ -18,7 +18,9 @@ __kernel void method1(
 	__global uint *src,
 	__global uint *dst,
 	__global ushort *factors,
-	int blk_num)
+	int blk_num,
 	int offset,
 	int length)
 {
 	__local uint mtab[512];
 	int i, blk;
@@ -27,14 +29,15 @@ __kernel void method1(
 	const int work_size = get_global_size(0);
 	const int table_id = get_local_id(0);
-	for (i = work_id; i < BLK_SIZE; i += work_size)
+	src += offset;
 	for (i = work_id; i < length; i += work_size)
 		dst[i] = 0;
 	for (blk = 0; blk < blk_num; blk++){
 		calc_table(mtab, table_id, factors[blk]);
 		barrier(CLK_LOCAL_MEM_FENCE);
-		for (i = work_id; i < BLK_SIZE; i += work_size){
+		for (i = work_id; i < length; i += work_size){
 			v = src[i];
 			sum = mtab[(uchar)(v >> 16)] ^ mtab[256 + (v >> 24)];
 			sum <<= 16;
@@ -50,7 +53,9 @@ __kernel void method2(
 	__global uint *src,
 	__global uint *dst,
 	__global ushort *factors,
-	int blk_num)
+	int blk_num,
 	int offset,
 	int length)
 {
 	__local uint mtab[512];
 	int i, blk, pos;
@@ -59,7 +64,8 @@ __kernel void method2(
 	const int work_size = get_global_size(0) * 2;
 	const int table_id = get_local_id(0);
-	for (i = work_id; i < BLK_SIZE; i += work_size){
+	src += offset;
 	for (i = work_id; i < length; i += work_size){
 		dst[i    ] = 0;
 		dst[i + 1] = 0;
 	}
@@ -68,7 +74,7 @@ __kernel void method2(
 		calc_table(mtab, table_id, factors[blk]);
 		barrier(CLK_LOCAL_MEM_FENCE);
-		for (i = work_id; i < BLK_SIZE; i += work_size){
+		for (i = work_id; i < length; i += work_size){
 			pos = (i & ~7) + ((i & 7) >> 1);
 			lo = src[pos    ];
 			hi = src[pos + 4];
@@ -86,64 +92,13 @@ __kernel void method2(
 	}
 }
 __kernel void method3(
 	__global uint *src,
 	__global uint *dst,
 	__global ushort *factors,
 	int blk_num)
 {
 	__global uint *blk_src;
 	__local uint mtab[512];
 	int i, blk, chk_size, remain, pos;
 	uint lo, hi, sum1, sum2;
 	const int work_id = get_global_id(0) * 2;
 	const int work_size = get_global_size(0) * 2;
 	const int table_id = get_local_id(0);
 	remain = BLK_SIZE;
 	chk_size = CHK_SIZE;
 	while (remain > 0){
 		if (chk_size > remain)
 			chk_size = remain;
 		for (i = work_id; i < chk_size; i += work_size){
 			dst[i    ] = 0;
 			dst[i + 1] = 0;
 		}
 		blk_src = src;
 		for (blk = 0; blk < blk_num; blk++){
 			calc_table(mtab, table_id, factors[blk]);
 			barrier(CLK_LOCAL_MEM_FENCE);
 			for (i = work_id; i < chk_size; i += work_size){
 				pos = (i & ~7) + ((i & 7) >> 1);
 				lo = blk_src[pos    ];
 				hi = blk_src[pos + 4];
 				sum1 = mtab[(uchar)(lo >> 16)] ^ mtab[256 + (uchar)(hi >> 16)];
 				sum2 = mtab[lo >> 24] ^ mtab[256 + (hi >> 24)];
 				sum1 <<= 16;
 				sum2 <<= 16;
 				sum1 ^= mtab[(uchar)lo] ^ mtab[256 + (uchar)hi];
 				sum2 ^= mtab[(uchar)(lo >> 8)] ^ mtab[256 + (uchar)(hi >> 8)];
 				dst[pos    ] ^= (sum1 & 0x00FF00FF) | ((sum2 & 0x00FF00FF) << 8);
 				dst[pos + 4] ^= ((sum1 & 0xFF00FF00) >> 8) | (sum2 & 0xFF00FF00);
 			}
 			blk_src += BLK_SIZE;
 			barrier(CLK_LOCAL_MEM_FENCE);
 		}
 		src += CHK_SIZE;
 		dst += CHK_SIZE;
 		remain -= CHK_SIZE;
 	}
 }
 __kernel void method4(
 	__global uint *src,
 	__global uint *dst,
 	__global ushort *factors,
-	int blk_num)
+	int blk_num,
 	int offset,
 	int length)
 {
 	__local int table[16];
 	__local uint cache[256];
@@ -152,7 +107,8 @@ __kernel void method4(
 	const int work_id = get_global_id(0);
 	const int work_size = get_global_size(0);
-	for (i = work_id; i < BLK_SIZE; i += work_size)
+	src += offset;
 	for (i = work_id; i < length; i += work_size)
 		dst[i] = 0;
 	for (blk = 0; blk < blk_num; blk++){
@@ -166,7 +122,7 @@ __kernel void method4(
 		}
 		barrier(CLK_LOCAL_MEM_FENCE);
-		for (i = work_id; i < BLK_SIZE; i += work_size){
+		for (i = work_id; i < length; i += work_size){
 			pos = i & 255;
 			cache[pos] = src[i];
 			barrier(CLK_LOCAL_MEM_FENCE);
--- a/source/par2j/version.h
+++ b/source/par2j/version.h
@@ -1,2 +1,2 @@
-#define FILE_VERSION "1.3.2.8"	// ファイルのバージョン番号
+#define FILE_VERSION "1.3.3.0"	// ファイルのバージョン番号
-#define PRODUCT_VERSION "1.3.2"	// 製品のバージョン番号
+#define PRODUCT_VERSION "1.3.3"	// 製品のバージョン番号
Author	SHA1	Message	Date
Yutaka Sawada	82197ac0d0	Release note of version 1.3.3.0	2023-10-10 13:08:48 +09:00
Yutaka Sawada	959cf0e8d4	Set release date	2023-10-10 13:02:35 +09:00
Yutaka Sawada	bcbdc2fe38	Add MediaList4 option	2023-09-30 18:47:27 +09:00
Yutaka Sawada	ca2c7731d4	Add 5th item in "Media size"	2023-09-30 18:46:28 +09:00
Yutaka Sawada	aeb8913a58	Decrease threshold to use GPU	2023-09-30 18:44:34 +09:00
Yutaka Sawada	323a53d808	CPU usage slider was returned to 5 levels.	2023-09-27 11:21:22 +09:00
Yutaka Sawada	54931fc0e7	Optimization for CPU cache	2023-09-24 21:26:09 +09:00
Yutaka Sawada	3024186aa6	Change "lc" option	2023-09-24 21:22:57 +09:00
Yutaka Sawada	ceed4ebd83	What is changed in v1.3.3.0	2023-09-24 21:21:42 +09:00
Yutaka Sawada	471246df18	Samples of v1.3.3.0	2023-09-24 21:20:43 +09:00
`@@ -1,2 +1,2 @@`
	`#define FILE_VERSION "1.3.2.8" // ファイルのバージョン番号`	`#define FILE_VERSION "1.3.3.0" // ファイルのバージョン番号`
	`#define PRODUCT_VERSION "1.3.2" // 製品のバージョン番号`	`#define PRODUCT_VERSION "1.3.3" // 製品のバージョン番号`