Release note of version 1.3.3.0

Set release date
Add MediaList4 option
2023-10-10 13:08:48 +09:00 · 2023-10-10 13:02:35 +09:00 · 2023-09-30 18:47:27 +09:00 · 2023-09-30 18:46:28 +09:00 · 2023-09-30 18:44:34 +09:00 · 2023-09-27 11:21:22 +09:00
26 changed files with 2344 additions and 1354 deletions
--- a/README.md
+++ b/README.md
@@ -1,57 +1,65 @@
 # MultiPar

-### v1.3.2.9 is public
-&nbsp; This is the final release of v1.3.2 tree. 
-Because I want to public this as a stable version, I didn't change contents so much. 
-PAR clients are same as previous version. 
-Including long term used applications may be good to avoid false positive at Malware detection.
+### v1.3.3.0 is public

-&nbsp; I fixed a [compatibility issue in calling 7-Zip](https://github.com/Yutaka-Sawada/MultiPar/issues/92), 
-which I didn't know the change. 
-Thanks Lyoko-Jeremie for bug report. 
-The incident happened, when a user selected many files.
+&nbsp; This is a test version that aims to speed up PAR2 calculation. 
+Because the new method has not been tested much, it may contain bugs or fail in some cases. 
+Use this unstable version with care. 
+If you don't want to test it yourself, you should not use it yet. 
+If you find a problem, please report it. 
+I will try to fix it as far as I can.

-&nbsp; I made a sample feature to Save & Restore different "base directories". 
-When you put PAR files in another folder from source files, it will set the previous directory automatically. 
-Because this feature was tested little, it's disabled by default at this time. 
-If you want to enable, add section `[Path]` on "MultiPar.ini". 
-Then set `MRUMax` value, which is the maximum number of stored directries. 
-You may set the value upto 26. It's disabled, when the value is 0. 
-These two lines are like below:
-```
-[Path]
-MRUMax=5
-```
+&nbsp; PAR2 calculation may be 10% ~ 50% faster than in the old version. 
+The gain depends on the hardware environment. 
+I don't know which setting is best on which PC. 
+Based on [many tests of debug versions](https://github.com/Yutaka-Sawada/MultiPar/issues/99), 
+it automatically selects a setting that is probably better. 
+Thanks to the testers for their many trials. 
+If you want to compare the speed of different settings on your PC, you may try those debug versions.

-&nbsp; While I made MultiPar as an utility tool, I didn't give priority to its speed. 
-If someone wants faster Parchive tool, I suggest to use ParPar tools instead of MultiPar. 
-They are "[High performance PAR2 create client for NodeJS](https://github.com/animetosho/ParPar)" or 
-"[speed focused par2cmdline fork](https://github.com/animetosho/par2cmdline-turbo)". 
-Though the speed depends on hardware environments and user's setting, it would be 50% ~ 100 % faster than my par2j. 
-Only when you have a very fast graphics borad, GPU enabled par2j may be faster. 
-I plan to improve speed of par2j in next v1.3.3 tree.
-Though it will become 20% ~ 30% faster than old par2j, ParPar would be faster mostly.
+&nbsp; I changed the GPU implementation considerably, too. 
+To match the CPU optimization, it now processes smaller tasks on the GPU. 
+Because the GPU doesn't use the CPU's cache, this is inefficient for the GPU's task. 
+I don't know whether the new method is faster than the old version.
+
+Threshold to use GPU:
+- Data size must be larger than 200 MB.
+- Block size must be larger than 64 KB.
+- Number of source blocks must be more than 192.
+- Number of recovery blocks must be more than 8.
+
+&nbsp; Because [a user requested it](https://github.com/Yutaka-Sawada/MultiPar/issues/102), 
+I implemented a way to add a 5th item to "Media size" on the Create window. 
+Write the line `MediaList4=name:size` under the `[Option]` section in `MultiPar.ini`. 
+Currently, you cannot change the item on the Option window.


-[ Changes from 1.3.2.8 to 1.3.2.9 ]  
+[ Changes from 1.3.2.9 to 1.3.3.0 ]  

 GUI update
- New
-  - Verification may save different base directories in MultiPar.ini file.
+- Change
+  - Option window was adapted to the new "lc" settings.
+  - It's possible to add a 5th item to "Media size" on the Create window.

- Bug fix
-  - Archiver's option was updated for recent 7-Zip versions.
+PAR2 client update
+- Change
+  - Maximum number of threads was increased to 32.
+  - Threshold to use GPU was decreased.
+
+- Improvement
+  - Matrix inversion may use more threads.
+  - L3 cache optimization was improved for recent CPUs.


 [ Hash value ]  

-MultiPar132.zip  
-MD5: 305D86C8C7A0F5C1A23CEAFFBE4F02BF  
-SHA1: 464BB7AB7D14FD35D2AEF99042EEB8E556DA0417  
+MultiPar1330.zip  
+MD5: 79570F84B74ECF8E5100561F7AAC3803  
+SHA1: ACF7F164001708789C5D94003ED6B5C172235D54  

-MultiPar132_setup.exe  
-MD5: 18F9BE1FF1C6D668E3A3906C691CCB98  
-SHA1: 116C6B2A15FCFD9BB74F0EF9D6C8A4BF78299588  
+MultiPar1330_setup.exe  
+MD5: D1F1A5A4DF1C9EDD698C9A017AF31039  
+SHA1: 4C3314B909572A303EBBE8E015A2E813841CFA33  
 &nbsp; To install under "Program Files" or "Program Files (x86)" directory, 
 you must start the installer with administrative privileges by selecting 
 "Run as administrator" on right-click menu.
--- a/alpha/MultiPar.exe
+++ b/alpha/MultiPar.exe
--- a/alpha/Update_English.txt
+++ b/alpha/Update_English.txt
@@ -1,3 +1,23 @@
+Release note of v1.3.3 tree
+
+[ Changes from 1.3.2.9 to 1.3.3.0 ] (2023/10/10)
+
+GUI update
+ Change
+  Option window was adapted to the new "lc" settings.
+  It's possible to add a 5th item to "Media size" on the Create window.
+
+PAR2 client update
+ Change
+  Maximum number of threads was increased to 32.
+  Threshold to use GPU was decreased.
+
+ Improvement
+  Matrix inversion may use more threads.
+  L3 cache optimization was improved for recent CPUs.
+
+/* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */
+
 Release note of v1.3.2 tree

 I tried to decrease probability of false positive at Malware detection.
--- a/alpha/Update_Japanese.txt
+++ b/alpha/Update_Japanese.txt
@@ -1,9 +1,20 @@
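-Update information for v1.3.2 (2023/08/26)
+Update information for v1.3.3 (2023/10/10)

-[ Changes from 1.3.1 to 1.3.2 ]
+  This version is still being tested, so if you are unsure, please use the previous version.
+
+[ Changes from 1.3.2 to 1.3.3 ]
+
+* Client changes
+  Improved how the CPU cache is used, which made it faster.
+
+/* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */
+
+Update information for v1.3.2 (2023/08/26)

   Made it less likely to be falsely detected as malware.

+[ Changes from 1.3.1 to 1.3.2 ]
+
 * GUI changes
   A Python script can now be called when an operation completes.
   Added several utility tools written as Python scripts.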
--- a/alpha/help/Command_GUI.txt
+++ b/alpha/help/Command_GUI.txt
@@ -1,4 +1,4 @@
-[ MultiPar GUI - version 1.3.2.9 or later ]
+[ MultiPar GUI - version 1.3.3.0 or later ]

 Usage: MultiPar.exe [command] [/base path] [/list path] [files]

@@ -106,6 +106,12 @@ Because of alphabet, "MRUMax=26" is the maximum.
 If you write "MRUMax=0" or remove the line,
 "Most Recently Used List" is disabled.

+ If you want to add a 5th item to the "Media size" list on the Create window,
+write the line "MediaList4=name:size" under the "[Option]" section.
+Because the name and size are separated by ":",
+you cannot include ":" in the name.
+The maximum size is 999999999999 bytes (931 GB).
+

 Example of lines on "MultiPar.ini";

@@ -114,5 +120,6 @@ FontName=Arial
 RedundancyMax=10
 Sort=8
 RecoveryFileLimit=1
+MediaList4=7.9GB DVD:8480000000
 [Path]
 MRUMax=5
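
The `MediaList4=name:size` rule described above could be parsed roughly as follows. This is a minimal sketch and not MultiPar's own parser: the function name `parse_media_item`, its parameters, and the choice to split at the first ":" are assumptions made for illustration. Only the ":" separator and the 999999999999-byte maximum come from the help text.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

// Maximum size stated in the help text.
#define MEDIA_SIZE_MAX 999999999999ULL

// Hypothetical helper: splits "name:size" at the first ':' (assumption).
// Returns 0 on success, -1 when the value is malformed or too large.
int parse_media_item(const char *value, char *name, size_t name_cap, uint64_t *size)
{
	const char *sep = strchr(value, ':');
	char *end;
	unsigned long long n;
	size_t name_len;

	if (sep == NULL || sep == value)
		return -1;	// no separator, or empty name
	name_len = (size_t)(sep - value);
	if (name_len >= name_cap)
		return -1;	// name does not fit in the caller's buffer
	if (sep[1] < '0' || sep[1] > '9')
		return -1;	// size must start with a digit
	errno = 0;
	n = strtoull(sep + 1, &end, 10);
	if (errno != 0 || *end != '\0' || n > MEDIA_SIZE_MAX)
		return -1;	// trailing characters, overflow, or above the limit
	memcpy(name, value, name_len);
	name[name_len] = '\0';
	*size = (uint64_t)n;
	return 0;
}
```

With the example line from the help file, `parse_media_item("7.9GB DVD:8480000000", ...)` would yield the name `7.9GB DVD` and the size 8480000000.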
--- a/alpha/help/Command_par2j.txt
+++ b/alpha/help/Command_par2j.txt
@@ -1,4 +1,4 @@
-[ par2j.exe - version 1.3.2.8 or later ]
+[ par2j.exe - version 1.3.3.0 or later ]

 Type "par2j.exe" to see version, test integrity, and show usage below.

@@ -359,19 +359,22 @@ the protected archive file is made in the directory.
 /lc :
 Set this, if you want to set number of using threads for Multi-Core CPU,
 or want to disable extra feature. (SSE2 is always used.)
-The format is "/lc#", # is from 1 to 11 as the number of using threads,
-12 to use quarter number of physical Cores,
-13 to use half of physical Cores,
-14 to use 3/4 number of physical Cores,
-15 to use the number of physical Cores (disable Hyper Threading),
-or 15 to use one less number of physical Cores on CPU with 6 or more physical Cores.
-Without this option (or /lc0),
-it uses the number of physical Cores on CPU with 6 or more physical Cores,
-or one more threads on CPU with Hyper Threading and 5 or less physical Cores.
+The format is "/lc#", where # is the number of threads to use, from 1 to 32.

+ You can also set the number as a ratio of physical Cores as follows. (The default is /lc0.)
+251: Use a quarter of the physical Cores.
+252: Use half of the physical Cores.
+253: Use 3/4 of the physical Cores.
+254: Use one thread fewer than the number of physical Cores.
+  0: Use the number of physical Cores.
+255: Use one thread more than the number of physical Cores.
+
+ You may add the following values in combination:
++1024 to disable CLMUL (and use old SSSE3 code), 
++2048 to disable JIT (for SSE2), 
++4096 to disable SSSE3,
++8192 to disable AVX2,
++256 or +512 (slower device) to enable GPU acceleration.

 for example,  /lc1 to use single Core, /lc45 to use half Cores and GPU
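
As a rough illustration of how the new values combine, the sketch below builds an `/lc` number from a thread setting plus option values. The enum and function names are hypothetical; only the numeric values are taken from the added help text. Under those values, half of the physical Cores together with GPU acceleration works out to 252 + 256 = 508.

```c
#include <stdio.h>

// Numeric values come from the help text; the names are hypothetical.
enum {
	LC_QUARTER_CORES = 251,
	LC_HALF_CORES    = 252,
	LC_3_4_CORES     = 253,
	LC_ONE_LESS      = 254,
	LC_ONE_MORE      = 255,
	LC_GPU           = 256,
	LC_GPU_SLOW      = 512,
	LC_NO_CLMUL      = 1024,
	LC_NO_JIT        = 2048,
	LC_NO_SSSE3      = 4096,
	LC_NO_AVX2       = 8192
};

// The low 8 bits hold the thread setting, the higher bits hold the options.
int make_lc_value(int threads, int options)
{
	return (threads & 255) | (options & ~255);
}

// Writes the option as text, for example "/lc508".
int format_lc_option(char *buf, size_t buf_size, int threads, int options)
{
	return snprintf(buf, buf_size, "/lc%d", make_lc_value(threads, options));
}
```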

--- a/alpha/par2j.exe
+++ b/alpha/par2j.exe
--- a/alpha/par2j64.exe
+++ b/alpha/par2j64.exe
--- a/source/par2j/Command_par2j.txt
+++ b/source/par2j/Command_par2j.txt
@@ -1,4 +1,4 @@
-[ par2j.exe - version 1.3.2.8 or later ]
+[ par2j.exe - version 1.3.3.0 or later ]

 Type "par2j.exe" to see version, test integrity, and show usage below.

@@ -359,19 +359,22 @@ the protected archive file is made in the directory.
 /lc :
 Set this, if you want to set number of using threads for Multi-Core CPU,
 or want to disable extra feature. (SSE2 is always used.)
-The format is "/lc#", # is from 1 to 11 as the number of using threads,
-12 to use quarter number of physical Cores,
-13 to use half of physical Cores,
-14 to use 3/4 number of physical Cores,
-15 to use the number of physical Cores (disable Hyper Threading),
-or 15 to use one less number of physical Cores on CPU with 6 or more physical Cores.
-Without this option (or /lc0),
-it uses the number of physical Cores on CPU with 6 or more physical Cores,
-or one more threads on CPU with Hyper Threading and 5 or less physical Cores.
+The format is "/lc#", where # is the number of threads to use, from 1 to 32.

+ You can also set the number as a ratio of physical Cores as follows. (The default is /lc0.)
+251: Use a quarter of the physical Cores.
+252: Use half of the physical Cores.
+253: Use 3/4 of the physical Cores.
+254: Use one thread fewer than the number of physical Cores.
+  0: Use the number of physical Cores.
+255: Use one thread more than the number of physical Cores.
+
+ You may add the following values in combination:
++1024 to disable CLMUL (and use old SSSE3 code), 
++2048 to disable JIT (for SSE2), 
++4096 to disable SSSE3,
++8192 to disable AVX2,
++256 or +512 (slower device) to enable GPU acceleration.

 for example,  /lc1 to use single Core, /lc45 to use half Cores and GPU

--- a/source/par2j/common2.c
+++ b/source/par2j/common2.c
@@ -1,5 +1,5 @@
 // common2.c
-// Copyright : 2023-03-14 Yutaka Sawada
+// Copyright : 2023-09-23 Yutaka Sawada
 // License : GPL

 #ifndef _UNICODE
@@ -1849,8 +1849,9 @@ int sqrt32(int num)

 int cpu_num = 1;	// when the number of CPUs/Cores is restricted, the original count is kept in the upper bits
 // /arch:SSE2, +1=SSSE3, +2=SSE4.1, +4=SSE4.2, +8=CLMUL, +16=AVX2, +128=JIT(SSE2), +256=Old
+// upper 16 bits = limit size calculated from the L2 cache size
 unsigned int cpu_flag = 0;
-unsigned int cpu_cache = 0;	// upper 16 bits = L2 cache * 2, lower 16 bits = L3 cache
+unsigned int cpu_cache = 0;	// upper 16 bits = limit size of L3 cache, lower 16 bits = number of concurrent tasks
 unsigned int memory_use = 0;	// memory usage: 0=auto, 1 to 7 -> 1/8 to 7/8

 static int count_bit(DWORD_PTR value)
@@ -1869,7 +1870,7 @@ static int count_bit(DWORD_PTR value)
 void check_cpu(void)
 {
 	int core_count = 0, use_count;
-	unsigned int CPUInfo[4];
+	unsigned int CPUInfo[4], limit_size = 0;
 	unsigned int returnLength, byteOffset;
 	DWORD_PTR ProcessAffinityMask, SystemAffinityMask; // 4-byte integer on 32-bit, 8-byte integer on 64-bit
 	PSYSTEM_LOGICAL_PROCESSOR_INFORMATION buffer = NULL, ptr;
@@ -2006,42 +2007,52 @@ void check_cpu(void)
 		//printf("Number of available physical processor cores: %d\n", core_count);
 		if (cache3_size > 0){
 			//printf("L3 cache: %d KB (%d way)\n", cache3_size >> 10 , cache3_way);
-			cache3_size /= cache3_way;	// size per way (set-associative)
-			if (cache3_size < 131072)
-				cache3_size = 128 << 10;	// make it at least 128 KB
+			cpu_cache = cache3_size / cache3_way;	// size per way (set-associative)
+			if (cpu_cache < 131072)
+				cpu_cache = 128 << 10;	// make it at least 128 KB
 		}
 		if (cache2_size > 0){
 			//printf("L2 cache: %d KB (%d way)\n", cache2_size >> 10, cache2_way);
-			cache2_size /= cache2_way;	// size per way (set-associative)
-			if (cache2_size < 32768)
-				cache2_size = 32 << 10;	// make it at least 32 KB
-			//printf("Limit size of Cache Blocking: %d KB\n", cache2_size >> 10);
-			cpu_cache = cache2_size | (cache3_size >> 17);
+			limit_size = cache2_size / cache2_way;	// size per way (set-associative)
+			if (limit_size < 65536)
+				limit_size = 64 << 10;	// make it at least 64 KB
+			// decide the number of concurrent tasks
+			if (cache2_way >= 16){
+				returnLength = cache2_way / 2;	// when the L2 cache has many ways, use half of that number
+			} else {
+				returnLength = 0;
+			}
+			if (cache3_size > 0){	// use the ratio of the L3 cache size to the L2 cache size
+				byteOffset = cache3_size / cache2_size;
+				if (returnLength < byteOffset){
+					returnLength = byteOffset;
+					if (cache2_way >= cache3_way)	// if L2 has at least as many ways as L3, multiply by 1.5
+						returnLength += returnLength / 2;
+				}
+			}
+			cpu_cache |= returnLength & 0x1FFFF;
 		}
 	}

-	if (cpu_cache == 0)	// if the cache size is unknown, use 128 KB
-		cpu_cache = 128 << 10;
+	if (limit_size == 0)	// if the cache size is unknown, use 128 KB
+		limit_size = 128 << 10;
+	//printf("Limit size of Cache Blocking: %d KB\n", limit_size >> 10);
+	// put the cache limit size in the upper 16 bits of cpu_flag
+	cpu_flag |= limit_size & 0xFFFF0000;	// anything below 64 KB is ignored
+
 	if (core_count == 0){	// if the number of physical cores is unknown, use the number of logical cores
 		core_count = cpu_num;
 		use_count = cpu_num;
-	} else if (core_count < cpu_num){	// if physical cores are shared
-		if (core_count >= 6){			// with 6 or more cores, don't add any more
-			use_count = core_count;
-		} else {	// with 2 to 5 cores, add just one
-			use_count = core_count + 1;
-		}
+	} else if (core_count < cpu_num){	// if physical cores are shared
+		use_count = core_count;	// use the number of physical cores
 	} else {
-		use_count = core_count;
+		use_count = cpu_num;	// use the number of logical cores
 	}
 	if (use_count > MAX_CPU)	// keep the number of cores in use within the implementation limit
 		use_count = MAX_CPU;
 	//printf("Core count: logical, physical, use = %d, %d, %d\n", cpu_num, core_count, use_count);
 	// put the logical and physical core counts in the upper bits, and the number of cores to use in the lower bits
 	cpu_num = (cpu_num << 24) | (core_count << 16) | use_count;
-
-	// put the cache limit size in the upper 17 bits of cpu_flag
-	cpu_flag |= cpu_cache & 0xFFFF8000;	// anything below 32 KB is ignored
 }

 // check whether the OS is 32-bit or 64-bit
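
The cache arithmetic added to `check_cpu` above can be read as two small pure functions. The sketch below restates it outside the Windows API calls so the packing is easier to follow. The function names are hypothetical, and the sketch assumes the way counts are non-zero whenever the matching cache size is non-zero.

```c
// Hypothetical restatement of the packing done in check_cpu().
// Returns: upper bits = L3 size per way (at least 128 KB),
//          low bits   = number of concurrent tasks.
unsigned int pack_cpu_cache(unsigned int cache2_size, unsigned int cache2_way,
                            unsigned int cache3_size, unsigned int cache3_way)
{
	unsigned int cpu_cache = 0, count, ratio;

	if (cache3_size > 0){
		cpu_cache = cache3_size / cache3_way;	// size per way
		if (cpu_cache < 131072)
			cpu_cache = 128 << 10;	// at least 128 KB
	}
	if (cache2_size > 0){
		// half the number of L2 ways when there are 16 or more
		count = (cache2_way >= 16) ? cache2_way / 2 : 0;
		if (cache3_size > 0){
			ratio = cache3_size / cache2_size;	// L3 size relative to L2 size
			if (count < ratio){
				count = ratio;
				if (cache2_way >= cache3_way)
					count += count / 2;	// 1.5 times
			}
		}
		cpu_cache |= count & 0x1FFFF;
	}
	return cpu_cache;
}

// L2 size per way, at least 64 KB, or 128 KB when the size is unknown.
// The caller is expected to store (result & 0xFFFF0000) in cpu_flag.
unsigned int l2_limit_size(unsigned int cache2_size, unsigned int cache2_way)
{
	unsigned int limit_size = 0;

	if (cache2_size > 0){
		limit_size = cache2_size / cache2_way;
		if (limit_size < 65536)
			limit_size = 64 << 10;
	}
	if (limit_size == 0)
		limit_size = 128 << 10;
	return limit_size;
}
```

For example, a 512 KB 8-way L2 with a 32 MB 16-way L3 gives a 2 MB L3 limit in the upper bits and 64 concurrent tasks in the low bits.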
--- a/source/par2j/common2.h
+++ b/source/par2j/common2.h
@@ -6,11 +6,11 @@ extern "C" {
 #endif

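 #ifndef _WIN64	// for the 32-bit version
-#define MAX_CPU			8			// kept smaller in the 32-bit version
+#define MAX_CPU			16			// kept smaller in the 32-bit version
 #define MAX_MEM_SIZE	0x7F000000	// maximum size of the memory area to allocate: 2032MB
 #define MAX_MEM_SIZE32	0x50000000	// maximum size of the memory area to allocate on a 32-bit OS: 1280MB
 #else
-#define MAX_CPU			16			// maximum number of CPUs/Cores (number of threads)
+#define MAX_CPU			32			// maximum number of CPUs/Cores (number of threads)
 #endif

 #define MAX_LEN			1024		// maximum number of characters in a file name (including the terminating NULL character)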
--- a/source/par2j/create.c
+++ b/source/par2j/create.c
@@ -1,5 +1,5 @@
 // create.c
-// Copyright : 2022-02-16 Yutaka Sawada
+// Copyright : 2023-09-23 Yutaka Sawada
 // License : GPL

 #ifndef _UNICODE
--- a/source/par2j/gf16.c
+++ b/source/par2j/gf16.c
--- a/source/par2j/gf16.h
+++ b/source/par2j/gf16.h
@@ -47,6 +47,15 @@ typedef void (* REGION_MULTIPLY) (
 	int factor);		// Number to multiply by
 REGION_MULTIPLY galois_align_multiply;

+typedef void (* REGION_MULTIPLY2) (
+	unsigned char *src1,	// Region to multiply
+	unsigned char *src2,
+	unsigned char *dst,		// Products go here
+	unsigned int len,		// Byte length
+	int factor1,			// Number to multiply by
+	int factor2);
+REGION_MULTIPLY2 galois_align_multiply2;
+
 // function definition for rearranging a region
 typedef void (* REGION_ALTMAP) (unsigned char *data, unsigned int bsize);
 REGION_ALTMAP galois_altmap_change;
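
To make the new `REGION_MULTIPLY2` signature concrete, here is a plain reference version of what a two-source region multiply could compute. It is only a sketch under stated assumptions: it assumes the products are XOR-ed into `dst`, that data is stored as little-endian 16-bit words in the normal (not ALTMAP) layout, and that the field is the PAR2 GF(2^16) with generator polynomial 0x1100B. The function names are hypothetical, and the optimized routines in gf16.c are not shown in this diff.

```c
#include <stdint.h>

// Bit-by-bit multiplication in GF(2^16), reduced by 0x1100B.
uint16_t gf16_mul_ref(uint16_t a, uint16_t b)
{
	uint32_t acc = 0, x = a;

	while (b != 0){
		if (b & 1)
			acc ^= x;	// add (XOR) the current multiple of a
		x <<= 1;
		if (x & 0x10000)
			x ^= 0x1100B;	// reduce back to 16 bits
		b >>= 1;
	}
	return (uint16_t)acc;
}

// dst ^= src1 * factor1 ^ src2 * factor2, one 16-bit word at a time.
// Same parameter list as the REGION_MULTIPLY2 typedef; len is in bytes.
void region_multiply2_ref(unsigned char *src1, unsigned char *src2,
                          unsigned char *dst, unsigned int len,
                          int factor1, int factor2)
{
	unsigned int i;

	for (i = 0; i + 1 < len; i += 2){
		uint16_t s1 = (uint16_t)(src1[i] | (src1[i + 1] << 8));
		uint16_t s2 = (uint16_t)(src2[i] | (src2[i + 1] << 8));
		uint16_t p = gf16_mul_ref(s1, (uint16_t)factor1)
		           ^ gf16_mul_ref(s2, (uint16_t)factor2);
		dst[i] ^= (unsigned char)(p & 0xFF);
		dst[i + 1] ^= (unsigned char)(p >> 8);
	}
}
```

Handling two sources per call lets each destination word be read and written once for two products, which is presumably the point of the new function pointer.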
--- a/source/par2j/lib_opencl.c
+++ b/source/par2j/lib_opencl.c
@@ -1,5 +1,5 @@
 // lib_opencl.c
-// Copyright : 2023-06-01 Yutaka Sawada
+// Copyright : 2023-09-23 Yutaka Sawada
 // License : GPL

 #ifndef _WIN32_WINNT
@@ -72,11 +72,10 @@ typedef cl_int (CL_API_CALL *API_clEnqueueNDRangeKernel)(cl_command_queue, cl_ke
 /* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */
 // global variables

-extern unsigned int cpu_flag, cpu_cache;	// declared in common2.h
+extern unsigned int cpu_flag;	// declared in common2.h
 extern int cpu_num;

 #define MAX_DEVICE 3
-#define MAX_GROUP_NUM 64

 HMODULE hLibOpenCL = NULL;

@@ -103,18 +102,17 @@ API_clEnqueueNDRangeKernel gfn_clEnqueueNDRangeKernel;
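 Input
 OpenCL_method : which device to select
 unit_size : unit size of a block
+chunk_size: size of each split chunk
 src_max : number of source blocks
-chunk_size = 0: not split by default

 Output
 return : error number
 src_max : maximum number of source blocks that can be loaded
-chunk_size : split size for CPU threads
 OpenCL_method : various operation flags
 */

 // 0=success, 1 and above = error number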
-int init_OpenCL(int unit_size, int *src_max, int *chunk_size)
+int init_OpenCL(int unit_size, int chunk_size, int *src_max)
 {
 	char buf[2048], *p_source;
 	int err = 0, i, j;
@@ -141,7 +139,7 @@ int init_OpenCL(int unit_size, int *src_max, int *chunk_size)
 	API_clGetKernelWorkGroupInfo fn_clGetKernelWorkGroupInfo;
 	cl_int ret;
 	cl_uint num_platforms = 0, num_devices = 0, num_groups, param_value;
-	cl_ulong param_value8, cache_size;
+	cl_ulong param_value8;
 	cl_platform_id platform_id[MAX_DEVICE], selected_platform;	// driver vendor such as Intel, AMD, Nvidia
 	cl_device_id device_id[MAX_DEVICE], selected_device;	// CPU, GPU, etc.
 	cl_program program;
@@ -309,19 +307,14 @@ int init_OpenCL(int unit_size, int *src_max, int *chunk_size)
 			ret = fn_clGetDeviceInfo(device_id[j], CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(size_t), &data_size, NULL);
 			if (ret != CL_SUCCESS)
 				continue;
-			ret = fn_clGetDeviceInfo(device_id[j], CL_DEVICE_HOST_UNIFIED_MEMORY, sizeof(cl_uint), &param_value, NULL);
-			if (ret != CL_SUCCESS)
-				continue;
-			if (param_value != 0)
-				param_value = 1;
+			// CL_DEVICE_HOST_UNIFIED_MEMORY was deprecated in OpenCL 2.0, so it is no longer queried

 #ifdef DEBUG_OUTPUT
 			printf("MAX_COMPUTE_UNITS = %d\n", num_groups);
 			printf("MAX_WORK_GROUP_SIZE = %zd\n", data_size);
-			printf("HOST_UNIFIED_MEMORY = %d\n", param_value);
 #endif
-			// estimate compute power by MAX_COMPUTE_UNITS * MAX_WORK_GROUP_SIZE; double the value for a discrete GPU
-			count = (2 - param_value) * (int)data_size * num_groups;
+			// estimate compute power by MAX_COMPUTE_UNITS * MAX_WORK_GROUP_SIZE
+			count = (int)data_size * num_groups;
 			count *= OpenCL_method;	// change the sign
 			//printf("prev = %d, now = %d\n", gpu_power, count);
 			if ((count > gpu_power) && (data_size >= 256) &&	// the table cannot be built with fewer than 256
@@ -330,8 +323,6 @@ int init_OpenCL(int unit_size, int *src_max, int *chunk_size)
 				selected_device = device_id[j];	// ID of the device to use
 				selected_platform = platform_id[i];
 				OpenCL_group_num = num_groups;	// set the number of work groups to the number of COMPUTE_UNITS
-				if (OpenCL_group_num > MAX_GROUP_NUM)	// cap it to limit local memory consumption
-					OpenCL_group_num = MAX_GROUP_NUM;
 				alloc_max = (size_t)param_value8;

 				// on AMD Radeon, the memory area is not necessarily 1/4 of the total
@@ -345,26 +336,6 @@ int init_OpenCL(int unit_size, int *src_max, int *chunk_size)
 					if ((cl_ulong)alloc_max > param_value8)
 						alloc_max = (size_t)param_value8;
 				}
-
-				cache_size = 0;
-				ret = fn_clGetDeviceInfo(device_id[j], CL_DEVICE_GLOBAL_MEM_CACHE_TYPE, sizeof(cl_uint), &num_groups, NULL);
-				if (ret == CL_SUCCESS){
-#ifdef DEBUG_OUTPUT
-					printf("GLOBAL_MEM_CACHE_TYPE = %d\n", num_groups);
-#endif
-					if (num_groups & 3){	// CL_READ_ONLY_CACHE or CL_READ_WRITE_CACHE
-						ret = fn_clGetDeviceInfo(device_id[j], CL_DEVICE_GLOBAL_MEM_CACHE_SIZE, sizeof(cl_ulong), &cache_size, NULL);
-						if (ret == CL_SUCCESS){
-#ifdef DEBUG_OUTPUT
-							printf("GLOBAL_MEM_CACHE_SIZE = %I64d KB\n", cache_size >> 10);
-#endif
-							if (param_value != 0){	// for an integrated GPU, make use of the cache shared with the CPU
-								if (cache_size >= 1048576)	// do not split when the size is small
-									cache_size |= 0x40000000;
-							}
-						}
-					}
-				}
 			}
 		}
 	}
@@ -395,67 +366,28 @@ int init_OpenCL(int unit_size, int *src_max, int *chunk_size)
 		return (ret << 8) | 12;

 	// select the calculation method
-	gpu_power = unit_size;	// unit_size is a multiple of MEM_UNIT
 	if ((((cpu_flag & 0x101) == 1) || ((cpu_flag & 16) != 0)) && (sse_unit == 32)){
 		OpenCL_method = 2;	// for SSSE3 & ALTMAP or AVX2, use the version that supports rearranged data
-		if (cache_size & 0x40000000){	// if an integrated GPU can use the cache, do the same as the CPU threads
-			j = cpu_cache & 0x7FFF8000;	// upper limit of the CPU cache size
-			count = (int)(cache_size & 0x3FFFFFFF) / 4;	// but no more than 1/4 of the detected size
-			if ((j == 0) || (j > count))
-				j = count;
-			count = 1;
-			while (gpu_power > j){	// while larger than the limit size
-				// increase the number of splits and estimate the chunk size
-				count++;
-				gpu_power = (unit_size + count - 1) / count;
-				gpu_power = (gpu_power + (MEM_UNIT - 1)) & ~(MEM_UNIT - 1);	// round up to a multiple of MEM_UNIT
-			}
-			if (count > 1){
-				*chunk_size = gpu_power;
-				OpenCL_method = 3;
-#ifdef DEBUG_OUTPUT
-				printf("gpu cache: limit size = %d, chunk size = %d, split = %d\n", j, gpu_power, count);
-#endif
-			}
-/*
-		// no need to cache with 32-byte memory accesses? the calculation speed is halved...
-		} else if ((cache_size & 0x3FFFFFFF) > OpenCL_group_num * 4096){	// twice 2KB may be needed?
-#ifdef DEBUG_OUTPUT
-			printf("gpu: cache size = %d, read size = %d\n", cache_size & 0x3FFFFFFF, OpenCL_group_num * 2048);
-#endif
-			OpenCL_method = 1;
-*/
-		}
-
 	} else if (((cpu_flag & 128) != 0) && (sse_unit == 256)){
 		OpenCL_method = 4;	// JIT(SSE2) lays out 16 bytes per bit, starting from the highest bit
 		// note that the local table size differs
 		// methods other than XOR use 2KB (4 bytes * 256 entries * 2 tables)
 		// XOR (JIT) uses 64 bytes (4 bytes * 16 entries)
-#ifdef DEBUG_OUTPUT
-//		printf("4 KB cache (16-bytes * 256 work items), use if\n");
-#endif
 	} else {
-		OpenCL_method = 1;	// the MMX code is slow, so cache optimization is unnecessary
+		OpenCL_method = 1;	// for data that is not rearranged
 	}

 	// reduce the number of work groups when there are more than needed
-/*
-	if (OpenCL_method == 4){
-		// each work item calculates 16 bytes; with 256 items, each work group handles 4KB
-		data_size = unit_size / 4096;
-	} else 
-*/
-	if (OpenCL_method & 2){
+	if (OpenCL_method == 2){
 		// each work item calculates 8 bytes; with 256 items, each work group handles 2KB
-		data_size = unit_size / 2048;
+		data_size = chunk_size / 2048;
 	} else {
 		// each work item calculates 4 bytes; with 256 items, each work group handles 1KB
 		// work item 一個が 4バイトずつ計算する、256個なら work group ごとに 1KB 担当する
-		data_size = unit_size / 1024;
+		data_size = chunk_size / 1024;
 	}
 	if (OpenCL_group_num > data_size){
 		OpenCL_group_num = data_size;
-		printf("Number of work groups is reduced to %d\n", (int)OpenCL_group_num);
+		printf("Number of work groups is reduced to %zd\n", OpenCL_group_num);
 	}

 	// how many blocks of memory can be held at most (not allocated yet at this point)
@@ -469,9 +401,9 @@ int init_OpenCL(int unit_size, int *src_max, int *chunk_size)
 	printf("src buf : %zd KB (%d blocks), possible\n", data_size >> 10, count);
 #endif

-	// the output buffer only needs to hold one block
+	// the output buffer only needs to hold one chunk
 	// using CL_MEM_ALLOC_HOST_PTR apparently gives pinned memory
-	data_size = unit_size;
+	data_size = (chunk_size + 63) & ~63;	// round up to a multiple of the cache line size (64 bytes)
 	OpenCL_dst = gfn_clCreateBuffer(OpenCL_context, CL_MEM_WRITE_ONLY | CL_MEM_ALLOC_HOST_PTR, data_size, NULL, &ret);
 	if (ret != CL_SUCCESS)
 		return (ret << 8) | 13;
@@ -574,7 +506,7 @@ int init_OpenCL(int unit_size, int *src_max, int *chunk_size)
 	FreeResource(glob);	// not required ?

 	// specify constants
-	wsprintfA(buf, "-D BLK_SIZE=%d -D CHK_SIZE=%d", unit_size / 4, gpu_power / 4);
+	wsprintfA(buf, "-cl-fast-relaxed-math -D BLK_SIZE=%d", unit_size / 4);

 	// compile for the OpenCL device in use
 	ret = fn_clBuildProgram(program, 1, &selected_device, buf, NULL, NULL);
@@ -768,11 +700,12 @@ int gpu_copy_blocks(
 }

 // multiply source blocks
-int gpu_multiply_blocks(
+int gpu_multiply_chunks(
 	int src_num,			// Number of multiplying source blocks
 	unsigned short *mat,	// Matrix of numbers to multiply by
 	unsigned char *buf,		// Products go here
-	int len)				// Byte length
+	int offset,				// Offset in each block
+	int length)				// Byte length
 {
 	unsigned __int64 *vram, *src, *dst;
 	size_t global_size, local_size;
@@ -787,6 +720,14 @@ int gpu_multiply_blocks(
 	ret = gfn_clSetKernelArg(OpenCL_kernel, 3, sizeof(int), &src_num);
 	if (ret != CL_SUCCESS)
 		return (ret << 8) | 103;
+	offset /= 4;	// convert to units of 4-byte integers
+	ret = gfn_clSetKernelArg(OpenCL_kernel, 4, sizeof(int), &offset);
+	if (ret != CL_SUCCESS)
+		return (ret << 8) | 104;
+	length /= 4;	// convert to units of 4-byte integers
+	ret = gfn_clSetKernelArg(OpenCL_kernel, 5, sizeof(int), &length);
+	if (ret != CL_SUCCESS)
+		return (ret << 8) | 105;

 	// run the kernel in parallel
 	local_size = 256;	// the number of work items is fixed at 256 because of the tables and cache
@@ -797,18 +738,18 @@ int gpu_multiply_blocks(
 		return (ret << 8) | 11;

 	// make the output visible on the host side
-	vram = gfn_clEnqueueMapBuffer(OpenCL_command, OpenCL_dst, CL_TRUE, CL_MAP_READ, 0, len, 0, NULL, NULL, &ret);
+	vram = gfn_clEnqueueMapBuffer(OpenCL_command, OpenCL_dst, CL_TRUE, CL_MAP_READ, 0, length * 4, 0, NULL, NULL, &ret);
 	if (ret != CL_SUCCESS)
 		return (ret << 8) | 12;

 	// XOR 8 bytes at a time (doing the XOR with SSE2 was not faster)
 	src = vram;
 	dst = (unsigned __int64 *)buf;
-	while (len > 0){
+	while (length > 0){
 		*dst ^= *src;
 		dst++;
 		src++;
-		len -= 8;
+		length -= 2;
 	}

 	// even when the host does not modify the data, this must be called as a pair with clEnqueueMapBuffer
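
The changed loop counter in `gpu_multiply_chunks` is easy to misread, because `length` now counts 4-byte units while each step still XORs 8 bytes. The sketch below isolates that loop. The function name is hypothetical, `memcpy` is used instead of pointer casts to keep the sketch portable, and it assumes both buffers are padded to a multiple of 8 bytes, which the 64-byte rounding of the output buffer in this diff appears to provide.

```c
#include <stdint.h>
#include <string.h>

// XOR src into dst, 8 bytes per step.
// length is given in 4-byte units, so it drops by 2 on every step.
// With an odd length the last step covers one extra 4-byte unit,
// which is why the buffers are assumed to be padded.
void xor_chunk(unsigned char *dst, const unsigned char *src, int length)
{
	while (length > 0){
		uint64_t a, b;

		memcpy(&a, dst, 8);
		memcpy(&b, src, 8);
		a ^= b;
		memcpy(dst, &a, 8);
		dst += 8;
		src += 8;
		length -= 2;
	}
}
```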
--- a/source/par2j/lib_opencl.h
+++ b/source/par2j/lib_opencl.h
@@ -10,20 +10,21 @@ extern "C" {

 extern int OpenCL_method;

-int init_OpenCL(int unit_size, int *src_max, int *chunk_size);
+int init_OpenCL(int unit_size, int chunk_size, int *src_max);
 int free_OpenCL(void);
 void info_OpenCL(char *buf, int buf_size);

 int gpu_copy_blocks(
 	unsigned char *data,
 	int unit_size,
-	int src_end);
+	int src_num);

-int gpu_multiply_blocks(
+int gpu_multiply_chunks(
 	int src_num,			// Number of multiplying source blocks
 	unsigned short *mat,	// Matrix of numbers to multiply by
 	unsigned char *buf,		// Products go here
-	int len);				// Byte length
+	int offset,				// Offset in each block
+	int length);			// Byte length

 int gpu_finish(void);

--- a/source/par2j/md5_crc.c
+++ b/source/par2j/md5_crc.c
@@ -1,5 +1,5 @@
 // md5_crc.c
-// Copyright : 2022-10-01 Yutaka Sawada
+// Copyright : 2023-08-28 Yutaka Sawada
 // License : GPL

 #ifndef _UNICODE
@@ -672,7 +672,7 @@ time1_start = GetTickCount();

 	// the buffer is large, so use the heap
 	for (io_size = IO_SIZE; io_size < 1048576; io_size += IO_SIZE){	// up to 1 MB
-		if ((io_size + IO_SIZE > (cpu_cache << 17)) || ((__int64)(io_size + IO_SIZE) * 4 > file_left))
+		if ((io_size + IO_SIZE > (cpu_cache & 0xFFFE0000)) || ((__int64)(io_size + IO_SIZE) * 4 > file_left))
 			break;
 	}
 	buf1 = _aligned_malloc(io_size * 2, 64);
@@ -867,7 +867,7 @@ DWORD WINAPI file_hash_crc2(LPVOID lpParameter)
 	// the buffer is large, so use the heap
 	prog_tick = 1;
 	for (io_size = IO_SIZE; io_size < 1048576; io_size += IO_SIZE){	// a multiple of IO_SIZE, up to 1 MB
-		if ((io_size + IO_SIZE > (cpu_cache << 17)) || ((__int64)(io_size + IO_SIZE) * 4 > file_left))
+		if ((io_size + IO_SIZE > (cpu_cache & 0xFFFE0000)) || ((__int64)(io_size + IO_SIZE) * 4 > file_left))
 			break;
 		prog_tick++;
 	}
@@ -1304,7 +1304,7 @@ DWORD WINAPI file_hash_background(LPVOID lpParameter)

 	// the buffer is large, so use the heap
 	for (io_size = IO_SIZE; io_size < 1048576; io_size += IO_SIZE){	// a multiple of IO_SIZE, up to 1 MB
-		if ((io_size + IO_SIZE > (cpu_cache << 17)) || ((__int64)(io_size + IO_SIZE) * 4 > file_size))
+		if ((io_size + IO_SIZE > (cpu_cache & 0xFFFE0000)) || ((__int64)(io_size + IO_SIZE) * 4 > file_size))
 			break;
 	}
 	//printf("\n io_size = %d\n", io_size);
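
The buffer-size loop that appears three times in this file can be summarized as one function. This is a sketch for reading purposes only: `pick_io_size` and `io_step` are hypothetical names, and `io_step` stands in for `IO_SIZE`, whose value is not shown in this diff.

```c
// Grows the buffer in steps of io_step up to 1 MB, stopping early when the
// next step would exceed the L3 cache limit stored in the upper bits of
// cpu_cache, or when four buffers of the next size would exceed the
// remaining file size.
unsigned int pick_io_size(unsigned int io_step, unsigned int cpu_cache,
                          long long file_left)
{
	unsigned int cache_limit = cpu_cache & 0xFFFE0000;	// drop the low bits
	unsigned int io_size;

	for (io_size = io_step; io_size < 1048576; io_size += io_step){
		if ((io_size + io_step > cache_limit) ||
		    ((long long)(io_size + io_step) * 4 > file_left))
			break;
	}
	return io_size;
}
```

The mask replaces the old `cpu_cache << 17`, because the L3 limit is now stored directly in bytes rather than in 128 KB units.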
--- a/source/par2j/par2.c
+++ b/source/par2j/par2.c
@@ -1,5 +1,5 @@
 // par2.c
-// Copyright : 2023-03-15 Yutaka Sawada
+// Copyright : 2023-09-21 Yutaka Sawada
 // License : GPL

 #ifndef _UNICODE
@@ -112,12 +112,12 @@ int par2_create(
 		err = -12;
 	} else {
 		// test whether memory can be allocated
-		err = read_block_num(parity_num, cpu_num - 1, 0, 256);
+		err = read_block_num(parity_num, 0, 256);
 		if (err == 0)
 			err = -13;
 	}
 #ifdef TIMER
-	printf("read_block_num = %d\n", read_block_num(parity_num, cpu_num - 1, 0, 256));
+	printf("read_block_num = %d\n", read_block_num(parity_num, 0, 256));
 #endif
 	if (err > 0){	// the 1-pass method is possible
 #ifdef TIMER
--- a/source/par2j/par2_cmd.c
+++ b/source/par2j/par2_cmd.c
@@ -1,5 +1,5 @@
 // par2_cmd.c
-// Copyright : 2023-03-18 Yutaka Sawada
+// Copyright : 2023-09-28 Yutaka Sawada
 // License : GPL

 #ifndef _UNICODE
@@ -86,7 +86,7 @@ static void print_environment(void)

 	printf("CPU thread\t: %d / %d\n", cpu_num & 0xFFFF, cpu_num >> 24);
 	cpu_num &= 0xFFFF;	// keep only the number of cores to use
-	printf("CPU cache limit : %d KB, %d KB\n", (cpu_cache & 0x7FFF8000) >> 10, (cpu_cache & 0x00007FFF) << 7);
+	printf("CPU cache limit : %d KB, %d KB\n", (cpu_flag & 0xFFFF0000) >> 10, (cpu_cache & 0xFFFE0000) >> 10);
 #ifndef _WIN64	// the 32-bit version shows one of MMX, SSE2, SSSE3
 	printf("CPU extra\t:");
 	if (cpu_flag & 1){
@@ -1481,39 +1481,42 @@ ri= switch_set & 0x00040000
 					k = (k * 10) + (tmp_p[j] - '0');
 					j++;
 				}
-				if (k & 32){	// use the GPU
+				if (k & 256){	// use the GPU
 					OpenCL_method = 1;	// Faster GPU
-				} else if (k & 64){
+				} else if (k & 512){
 					OpenCL_method = -1;	// Slower GPU
 				}
-				if (k & 16)	// do not use SSSE3
-					cpu_flag &= 0xFFFFFFFE;
-				if (k & 128)	// do not use CLMUL, use the old SSSE3 encoder
+				if (k & 1024)	// do not use CLMUL, use the old SSSE3 encoder
 					cpu_flag = (cpu_flag & 0xFFFFFFF7) | 0x100;
-				if (k & 256)	// do not use JIT(SSE2)
+				if (k & 2048)	// do not use JIT(SSE2)
 					cpu_flag &= 0xFFFFFF7F;
-				if (k & 512)	// do not use AVX2
+				if (k & 4096)	// do not use SSSE3
+					cpu_flag &= 0xFFFFFFFE;
+				if (k & 8192)	// do not use AVX2
 					cpu_flag &= 0xFFFFFFEF;
-				if (k & 15){	// change the number of cores to use
-					k &= 15;	// range 1 to 15
+				if (k & 255){	// change the number of cores to use
+					k &= 255;	// range 1 to 255
 					// printf("\n lc# = %d , logical = %d, physical = %d \n", k, cpu_num >> 24, (cpu_num & 0x00FF0000) >> 16);
-					if (k == 12){	// use 1/4 of the physical cores
+					if (k == 251){	// use 1/4 of the physical cores
 						k = ((cpu_num & 0x00FF0000) >> 16) / 4;
-					} else if (k == 13){	// use half of the physical cores
+					} else if (k == 252){	// use half of the physical cores
 						k = ((cpu_num & 0x00FF0000) >> 16) / 2;
-					} else if (k == 14){	// use 3/4 of the physical cores
+					} else if (k == 253){	// use 3/4 of the physical cores
 						k = (((cpu_num & 0x00FF0000) >> 16) * 3) / 4;
-					} else if (k == 15){	// use the number of physical cores
-						k = (cpu_num & 0x00FF0000) >> 16;
-						if (k >= 6)
-							k--;	// with 6 or more physical cores, use one less
-					} else if (k > (cpu_num >> 24)){
-						k = cpu_num >> 24;	// do not exceed the number of logical cores
+					} else if (k == 254){	// use fewer than the number of physical cores
+						k = ((cpu_num & 0x00FF0000) >> 16) - 1;
+					} else if (k == 255){	// use more than the number of physical cores
+						k = ((cpu_num & 0x00FF0000) >> 16) + 1;
+						//k = cpu_num >> 16;
+						//k = ((k & 0xFF) + (k >> 8)) / 2;	// use the midpoint between physical and logical core counts?
+						// CPU usage in Task Manager reaches 100%, but it does not get faster...
 					}
 					if (k > MAX_CPU){
 						k = MAX_CPU;
 					} else if (k < 1){
 						k = 1;
+					} else if (k > (cpu_num >> 24)){
+						k = cpu_num >> 24;	// do not exceed the number of logical cores
 					}
 					cpu_num = (cpu_num & 0xFFFF0000) | k;	// put the specified number of cores in the lower bits
 				}
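
The thread-count part of the `/lc` handling above can be restated as a pure function, which makes the special values and the clamping order easier to check. This is a sketch, not the code in par2_cmd.c: the function name and parameters are hypothetical, the core counts are passed in directly instead of being unpacked from `cpu_num`, and the default for 0 follows the new `check_cpu` logic with a known physical core count.

```c
// Hypothetical helper: turns the low 8 bits of an /lc value into a thread count.
int resolve_thread_count(int lc_value, int logical, int physical, int max_cpu)
{
	int k = lc_value & 255;

	if (k == 0){	// default: physical cores when they are shared, else logical cores
		k = (physical < logical) ? physical : logical;
		return (k > max_cpu) ? max_cpu : k;
	}
	if (k == 251){
		k = physical / 4;
	} else if (k == 252){
		k = physical / 2;
	} else if (k == 253){
		k = (physical * 3) / 4;
	} else if (k == 254){
		k = physical - 1;
	} else if (k == 255){
		k = physical + 1;
	}
	// same else-if order as the diff: the logical-core check is only
	// reached when k is within 1..max_cpu
	if (k > max_cpu){
		k = max_cpu;
	} else if (k < 1){
		k = 1;
	} else if (k > logical){
		k = logical;
	}
	return k;
}
```

On a CPU with 8 physical and 16 logical cores, `/lc252` would give 4 threads and `/lc255` would give 9. One detail visible in the else-if chain is that a value above `max_cpu` is set to `max_cpu` without the logical-core check being applied afterwards.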
--- a/source/par2j/reedsolomon.c
+++ b/source/par2j/reedsolomon.c
@@ -1,5 +1,5 @@
 // reedsolomon.c
-// Copyright : 2023-05-29 Yutaka Sawada
+// Copyright : 2023-09-28 Yutaka Sawada
 // License : GPL

 #ifndef _UNICODE
@@ -30,13 +30,13 @@

 // minimum data size for using the GPU (in MB)
 // starting the GPU takes time, so it becomes slower instead when the data is too small
-#define GPU_DATA_LIMIT 512
+#define GPU_DATA_LIMIT 200

 // minimum block size and block counts for using the GPU
 // splitting work between the CPU and GPU requires a certain number of blocks
 #define GPU_BLOCK_SIZE_LIMIT 65536
-#define GPU_SOURCE_COUNT_LIMIT 256
-#define GPU_PARITY_COUNT_LIMIT 32
+#define GPU_SOURCE_COUNT_LIMIT 192
+#define GPU_PARITY_COUNT_LIMIT 8
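
The four limits defined above correspond to the "Threshold to use GPU" list in the README of this commit. A rough sketch of how they might be combined is shown below. The function name and parameters are hypothetical, and the strict comparisons follow the README wording ("larger than", "more than") rather than the actual checks in reedsolomon.c, which are not part of this diff. 1 MB is taken as 2^20 bytes.

```c
#include <stdbool.h>
#include <stdint.h>

// Values copied from the defines in this diff.
#define GPU_DATA_LIMIT 200	// MB
#define GPU_BLOCK_SIZE_LIMIT 65536
#define GPU_SOURCE_COUNT_LIMIT 192
#define GPU_PARITY_COUNT_LIMIT 8

// Returns true when all four thresholds are exceeded.
// Assumption: every comparison is strict, as the README words it.
bool gpu_thresholds_met(uint64_t data_bytes, unsigned int block_bytes,
                        int source_num, int parity_num)
{
	if (data_bytes <= ((uint64_t)GPU_DATA_LIMIT << 20))
		return false;	// data too small, GPU start-up cost dominates
	if (block_bytes <= GPU_BLOCK_SIZE_LIMIT)
		return false;
	if (source_num <= GPU_SOURCE_COUNT_LIMIT)
		return false;
	if (parity_num <= GPU_PARITY_COUNT_LIMIT)
		return false;
	return true;
}
```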

 /* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */

@@ -46,7 +46,7 @@ int try_cache_blocking(int unit_size)
 	int limit_size, chunk_count, chunk_size, cache_line_diff;

 	// How much of the CPU cache to use
-	limit_size = cpu_flag & 0x7FFF8000;	// at least 32 KB
+	limit_size = cpu_flag & 0x7FFF0000;	// at least 64 KB
 	if (limit_size == 0)	// do not optimize when the cache size could not be obtained
 		return unit_size;

@@ -160,7 +160,6 @@ unsigned int get_io_size(
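 // Compute from the amount of free memory how many blocks to read from files at once
 int read_block_num(
 	int keep_num,			// number of parity blocks to keep
-	int add_num,			// number of spare blocks to allow for
 	size_t trial_alloc,		// whether to check that the allocation succeeds
 	int alloc_unit)			// memory unit alignment (sse_unit or MEM_UNIT)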
 {
@@ -177,7 +176,7 @@ int read_block_num(

 	if (trial_alloc){
 		__int64 possible_size;
-		possible_size = (__int64)unit_size * (source_num + keep_num + add_num);
+		possible_size = (__int64)unit_size * (source_num + keep_num);
 #ifndef _WIN64	// 32-bit build only
 		if (possible_size > MAX_MEM_SIZE)	// cap the allocation at 2 GB
 			possible_size = MAX_MEM_SIZE;
@@ -191,13 +190,13 @@ int read_block_num(
 	}
 	mem_size = get_mem_size(trial_alloc) / unit_size;	// how many blocks can be allocated

-	if (mem_size >= (size_t)(source_num + keep_num + add_num)){	// enough for the maximum count
+	if (mem_size >= (size_t)(source_num + keep_num)){	// enough for the maximum count
 		buf_num = source_num;
-	} else if ((int)mem_size < read_min + keep_num + add_num){	// too few
+	} else if ((int)mem_size < read_min + keep_num){	// too few
 		buf_num = 0;	// marks insufficient memory
 	} else {	// split the source blocks into equal parts
 		int split_num;
-		buf_num = (int)mem_size - (keep_num + add_num);
+		buf_num = (int)mem_size - keep_num;
 		split_num = (source_num + buf_num - 1) / buf_num;	// number of read passes
 		buf_num = (source_num + split_num - 1) / split_num;
 	}
@@ -263,7 +262,7 @@ static int invert_matrix_st(unsigned short *mat,

 /* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */
 // Multi-processor support
-
+/*
 typedef struct {	// RS threading control struct
 	unsigned short *mat;	// matrix
 	int cols;	// row length
@@ -308,8 +307,57 @@ static DWORD WINAPI thread_func(LPVOID lpParameter)
 	CloseHandle(th->end);
 	return 0;
 }
+*/
+typedef struct {	// Matrix Inversion threading control struct
+	unsigned short *mat;	// matrix
+	int cols;	// row length
+	volatile int start;	// start offset of the pivot row that gets multiplied
+	volatile int pivot;	// column of the value used as the factor
+	volatile int skip;	// row to skip
+	volatile int now;	// row to eliminate
+	HANDLE run;
+	HANDLE end;
+} INV_TH;
+
+// Sub-thread
+static DWORD WINAPI thread_func(LPVOID lpParameter)
+{
+	unsigned short *mat;
+	int j, cols, row_start2, factor;
+	HANDLE hRun, hEnd;
+	INV_TH *th;
+
+	th = (INV_TH *)lpParameter;
+	mat = th->mat;
+	cols = th->cols;
+	hRun = th->run;
+	hEnd = th->end;
+	SetEvent(hEnd);	// signal that setup is complete
+
+	WaitForSingleObject(hRun, INFINITE);	// wait for the signal to start computing
+	while (th->skip >= 0){
+		while ((j = InterlockedDecrement(&(th->now))) >= 0){	// j = --th_now
+			if (j == th->skip)
+				continue;
+			row_start2 = cols * j;	// start offset of that row
+			factor = mat[row_start2 + th->pivot];	// value at row j, pivot column
+			mat[row_start2 + th->pivot] = 0;	// this trick lets a single matrix suffice
+			// the earlier step guarantees the pivot-column value of row i is 1, so this factor is the multiplier
+			galois_region_multiply(mat + th->start, mat + row_start2, cols, factor);
+		}
+		//_mm_sfence();	// complete writes to memory
+		SetEvent(hEnd);	// signal that the computation is finished
+		WaitForSingleObject(hRun, INFINITE);	// wait for the signal to start computing
+	}
+
+	// cleanup
+	CloseHandle(hRun);
+	CloseHandle(hEnd);
+	return 0;
+}

 // Compute the inverse matrix with multiple threads (only for the parity blocks in use)
+/*
 static int invert_matrix_mt(unsigned short *mat,
 	int rows,				// number of rows, matrix height, number of lost source blocks = number of parity blocks used
 	int cols,				// number of columns, matrix width, original number of source blocks
@@ -411,6 +459,130 @@ static int invert_matrix_mt(unsigned short *mat,
 	CloseHandle(th->h);
 	return 0;
 }
+*/
+
+static int invert_matrix_mt(unsigned short *mat,
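+	int rows,				// number of rows, matrix height, number of lost source blocks = number of parity blocks used
+	int cols,				// number of columns, matrix width, original number of source blocks
+	source_ctx_r *s_blk)	// information on each source block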
+{
+	int err = 0, j, row_start2, factor, sub_num;
+	unsigned int time_last = GetTickCount();
+	HANDLE hSub[MAX_CPU / 2], hRun[MAX_CPU / 2], hEnd[MAX_CPU / 2];
+	INV_TH th[1];
+
+	memset(hSub, 0, sizeof(HANDLE) * (MAX_CPU / 2));
+	memset(th, 0, sizeof(INV_TH));
+
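+	// Number of sub-threads: log2 of the core count, rounded up (the original comment calls this a square root)
+	sub_num = 1;
+	j = 2;
+	while (j < cpu_num){	// 1~2=1, 3~4=2, 5~8=3, 9~16=4, 17~32=5
+		sub_num++;
+		j *= 2;
+	}
+	if (sub_num > rows - 2)
+		sub_num = rows - 2;	// more threads than this would not help, so limit the count
+#ifdef TIMER
+	// the printed thread count includes the main thread, so it is one higher
+	printf("\nMatrix Inversion with %d threads\n", sub_num + 1);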
+#endif
+
+	// Launch the sub-threads
+	th->mat = mat;
+	th->cols = cols;
+	for (j = 0; j < sub_num; j++){	// for each sub-thread
+		// create the events
+		hRun[j] = CreateEvent(NULL, FALSE, FALSE, NULL);	// both events are auto-reset
+		if (hRun[j] == NULL){
+			print_win32_err();
+			printf("error, inv-thread\n");
+			err = 1;
+			goto error_end;
+		}
+		hEnd[j] = CreateEvent(NULL, FALSE, FALSE, NULL);
+		if (hEnd[j] == NULL){
+			print_win32_err();
+			CloseHandle(hRun[j]);
+			printf("error, inv-thread\n");
+			err = 1;
+			goto error_end;
+		}
+		// launch the sub-thread
+		th->run = hRun[j];
+		th->end = hEnd[j];
+		//_mm_sfence();	// complete writes to memory before starting the thread
+		hSub[j] = (HANDLE)_beginthreadex(NULL, STACK_SIZE, thread_func, (LPVOID)th, 0, NULL);
+		if (hSub[j] == NULL){
+			print_win32_err();
+			CloseHandle(hRun[j]);
+			CloseHandle(hEnd[j]);
+			printf("error, inv-thread\n");
+			err = 1;
+			goto error_end;
+		}
+		WaitForSingleObject(hEnd[j], INFINITE);	// wait for the setup-complete signal (this resets the event)
+	}
+
+	// Gaussian Elimination with 1 matrix
+	th->pivot = 0;
+	th->start = 0;	// start offset of that row
+	for (th->skip = 0; th->skip < rows; th->skip++){
+		// show progress
+		if (GetTickCount() - time_last >= UPDATE_TIME){
+			if (print_progress((th->skip * 1000) / rows)){
+				err = 2;
+				goto error_end;
+			}
+			time_last = GetTickCount();
+		}
+
+		// find which source block this row (parity block) substitutes for
+		while ((th->pivot < cols) && (s_blk[th->pivot].exist != 0))
+			th->pivot++;
+
+		// Divide the row by element i,pivot
+		factor = mat[th->start + th->pivot];
+		if (factor > 1){
+			mat[th->start + th->pivot] = 1;	// this trick lets a single matrix suffice
+			galois_region_divide(mat + th->start, cols, factor);
+		} else if (factor == 0){	// with factor = 0 the matrix cannot be inverted
+			err = (0x00010000 | th->pivot);	// return which source block caused the problem
+			goto error_end;
+		}
+
+		// If another row has a nonzero value in the same pivot column,
+		// XOR a multiple of row i into it to make that value 0
+		th->now = rows;	// initial value + 1
+		//_mm_sfence();	// complete writes to memory before resuming the threads
+		for (j = 0; j < sub_num; j++)
+			SetEvent(hRun[j]);	// tell the sub-threads to start computing
+		while ((j = InterlockedDecrement(&(th->now))) >= 0){	// j = --th_now
+			if (j == th->skip)	// skip the pivot row itself
+				continue;
+			row_start2 = cols * j;	// start offset of that row
+			factor = mat[row_start2 + th->pivot];	// value at row j, pivot column
+			mat[row_start2 + th->pivot] = 0;	// this trick lets a single matrix suffice
+			// the earlier step guarantees the pivot-column value of row i is 1, so this factor is the multiplier
+			galois_region_multiply(mat + th->start, mat + row_start2, cols, factor);
+		}
+
+		WaitForMultipleObjects(sub_num, hEnd, TRUE, INFINITE);	// wait for the sub-threads to signal completion
+		th->start += cols;
+		th->pivot++;
+	}
+
+error_end:
+	InterlockedExchange(&(th->skip), -1);		// tell the threads to exit
+	for (j = 0; j < sub_num; j++){
+		if (hSub[j]){	// terminate the sub-thread
+			SetEvent(hRun[j]);
+			WaitForSingleObject(hSub[j], INFINITE);
+			CloseHandle(hSub[j]);
+		}
+	}
+	return err;
+}

 /* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */
 /*
@@ -539,11 +711,9 @@ unsigned int time_total = GetTickCount();
 	}

 	// Prepare the matrix arithmetic for parity calculation
-	if (parity_num > source_num){
-		len = sizeof(unsigned short) * (source_num + parity_num);
-	} else {
-		len = sizeof(unsigned short) * source_num * 2;
-	}
+	len = sizeof(unsigned short) * source_num;
+	if (OpenCL_method != 0)
+		len *= 2;	// also reserve a work area for the GPU
 	constant = malloc(len);
 	if (constant == NULL){
 		printf("malloc, %d\n", len);
@@ -551,7 +721,11 @@ unsigned int time_total = GetTickCount();
 		goto error_end;
 	}
 #ifdef TIMER
-	printf("\nmatrix size = %d.%d KB\n", len >> 10, (len >> 10) % 10);
+	if (len & 0xFFFFF000){
+		printf("\nmatrix size = %u KB\n", len >> 10);
+	} else {
+		printf("\nmatrix size = %u Bytes\n", len);
+	}
 #endif
 	// Constants that form the basis of the parity-check matrix
 	make_encode_constant(constant);
@@ -623,11 +797,9 @@ unsigned int time_total = GetTickCount();
 	}

 	// Prepare the matrix arithmetic for parity calculation
-	if (parity_num > source_num){
-		len = sizeof(unsigned short) * (source_num + parity_num);
-	} else {
-		len = sizeof(unsigned short) * source_num * 2;
-	}
+	len = sizeof(unsigned short) * source_num;
+	if (OpenCL_method != 0)
+		len *= 2;	// also reserve a work area for the GPU
 	constant = malloc(len);
 	if (constant == NULL){
 		printf("malloc, %d\n", len);
@@ -635,7 +807,11 @@ unsigned int time_total = GetTickCount();
 		goto error_end;
 	}
 #ifdef TIMER
-	printf("\nmatrix size = %d.%d KB\n", len >> 10, (len >> 10) % 10);
+	if (len & 0xFFFFF000){
+		printf("\nmatrix size = %u KB\n", len >> 10);
+	} else {
+		printf("\nmatrix size = %u Bytes\n", len);
+	}
 #endif
 	// Constants that form the basis of the parity-check matrix
 	make_encode_constant(constant);
@@ -719,9 +895,11 @@ unsigned int time_matrix = 0, time_total = GetTickCount();
 	}
 #ifdef TIMER
 	if (len & 0xFFF00000){
-		printf("\nmatrix size = %d.%d MB\n", len >> 20, (len >> 20) % 10);
+		printf("\nmatrix size = %u MB\n", len >> 20);
+	} else if (len & 0x000FF000){
+		printf("\nmatrix size = %u KB\n", len >> 10);
 	} else {
-		printf("\nmatrix size = %d.%d KB\n", len >> 10, (len >> 10) % 10);
+		printf("\nmatrix size = %u Bytes\n", len);
 	}
 #endif
 	// Which parity block substitutes for each lost source block
@@ -783,7 +961,7 @@ time_matrix = GetTickCount() - time_matrix;
 		if (memory_use & 16){
 			err = -4;	// on an SSD, the Read all method is fast even when blocks are fragmented
 		} else 
-		if (read_block_num(block_lost, 2, 0, MEM_UNIT) != 0){
+		if (read_block_num(block_lost, 0, MEM_UNIT) != 0){
 			err = -5;	// on an HDD with enough memory, use the Read some method
 		} else {
 			err = -4;	// with too little memory, use the Read all method and let the blocks fragment
@@ -793,7 +971,7 @@ time_matrix = GetTickCount() - time_matrix;
 		if (memory_use & 16){
 			err = -2;	// on an SSD, the Read all method is fast even when blocks are fragmented
 		} else 
-		if (read_block_num(block_lost, cpu_num - 1, 0, sse_unit) != 0){
+		if (read_block_num(block_lost, 0, sse_unit) != 0){
 			err = -3;	// on an HDD with enough memory, use the Read some method
 		} else {
 			err = -2;	// with too little memory, use the Read all method and let the blocks fragment
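
Reviewer note: the sub-thread count chosen by the new `invert_matrix_mt` can be checked in isolation. This is a minimal sketch, assuming `cpu_num` is a plain thread count at that point; the function name `inversion_sub_threads` is hypothetical. The count grows with the base-2 logarithm of the core count, which matches the table in the code comment (1~2=1, 3~4=2, 5~8=3, 9~16=4, 17~32=5).

```c
/* Sketch only: same loop and limit as in invert_matrix_mt.
 * inversion_sub_threads is a hypothetical name, not part of par2j. */
static int inversion_sub_threads(int cpu_num, int rows)
{
	int sub_num = 1;
	int j = 2;

	/* Add one sub-thread each time the core count doubles. */
	while (j < cpu_num) {
		sub_num++;
		j *= 2;
	}

	/* Each pass eliminates rows - 1 rows shared with the main thread,
	 * so more than rows - 2 sub-threads could never all be busy. */
	if (sub_num > rows - 2)
		sub_num = rows - 2;

	return sub_num;
}
```

The main thread also takes rows from the shared counter, so the number of working threads is the returned value plus one, which is what the `TIMER` message prints.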
--- a/source/par2j/reedsolomon.h
+++ b/source/par2j/reedsolomon.h
@@ -10,7 +10,6 @@ extern "C" {

 // Read all source & Keep some parity method
 // Minimum number of blocks for partial encoding
-#define PART_MAX_RATE	1	// 1/2 of the source block count = 50%
 #define PART_MIN_RATE	5	// 1/32 of the source block count = 3.1%

 // Read some source & Keep all parity method
@@ -33,7 +32,6 @@ unsigned int get_io_size(
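 // Compute from the amount of free memory how many blocks to read from files at once
 int read_block_num(
 	int keep_num,			// number of parity blocks to keep
-	int add_num,			// number of spare blocks to allow for
 	size_t trial_alloc,		// whether to check that the allocation succeeds
 	int alloc_unit);		// memory unit alignment (sse_unit or MEM_UNIT)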
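
Reviewer note: with `add_num` removed, the block-count decision in `read_block_num` reduces to the arithmetic below. This is a minimal sketch, not the function itself: `split_read_count` is a hypothetical name, and `source_num`, `read_min` and the number of blocks that fit in memory are passed as parameters here, whereas the original takes them from variables defined outside the hunks shown.

```c
/* Sketch only: the three-way decision from read_block_num after this change.
 * split_read_count and its parameter list are hypothetical. */
static int split_read_count(int source_num, int mem_blocks, int keep_num, int read_min)
{
	int buf_num, split_num;

	if (mem_blocks >= source_num + keep_num)
		return source_num;	/* everything fits: read all source blocks at once */
	if (mem_blocks < read_min + keep_num)
		return 0;		/* marks insufficient memory */

	/* Split the source blocks into equal parts. */
	buf_num = mem_blocks - keep_num;
	split_num = (source_num + buf_num - 1) / buf_num;	/* number of read passes */
	return (source_num + split_num - 1) / split_num;	/* blocks per pass */
}
```

For example, 1000 source blocks with room for 400 blocks and 100 kept parity blocks leave 300 buffers, so the data is read in 4 passes of 250 blocks each instead of 3 passes of 300 plus one of 100.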

--- a/source/par2j/res_par2j.rc
+++ b/source/par2j/res_par2j.rc
@@ -1,8 +1,8 @@
 1 RT_STRING ".\\source.cl"

 1 VERSIONINFO
-FILEVERSION 1,3,2,8
-PRODUCTVERSION 1,3,2,0
+FILEVERSION 1,3,3,0
+PRODUCTVERSION 1,3,3,0
 FILEOS 0x40004
 FILETYPE 0x1
 {
@@ -13,8 +13,8 @@ BLOCK "StringFileInfo"
 		VALUE "FileDescription", "PAR2 client"
 		VALUE "LegalCopyright", "Copyright (C) 2023 Yutaka Sawada"
 		VALUE "ProductName", "par2j"
-		VALUE "FileVersion", "1.3.2.8"
-		VALUE "ProductVersion", "1.3.2.0"
+		VALUE "FileVersion", "1.3.3.0"
+		VALUE "ProductVersion", "1.3.3.0"
 	}
 }

--- a/source/par2j/rs_decode.c
+++ b/source/par2j/rs_decode.c
--- a/source/par2j/rs_encode.c
+++ b/source/par2j/rs_encode.c
--- a/source/par2j/source.cl
+++ b/source/par2j/source.cl
@@ -18,7 +18,9 @@ __kernel void method1(
 	__global uint *src,
 	__global uint *dst,
 	__global ushort *factors,
-	int blk_num)
+	int blk_num,
+	int offset,
+	int length)
 {
 	__local uint mtab[512];
 	int i, blk;
@@ -27,14 +29,15 @@ __kernel void method1(
 	const int work_size = get_global_size(0);
 	const int table_id = get_local_id(0);

-	for (i = work_id; i < BLK_SIZE; i += work_size)
+	src += offset;
+	for (i = work_id; i < length; i += work_size)
 		dst[i] = 0;

 	for (blk = 0; blk < blk_num; blk++){
 		calc_table(mtab, table_id, factors[blk]);
 		barrier(CLK_LOCAL_MEM_FENCE);

-		for (i = work_id; i < BLK_SIZE; i += work_size){
+		for (i = work_id; i < length; i += work_size){
 			v = src[i];
 			sum = mtab[(uchar)(v >> 16)] ^ mtab[256 + (v >> 24)];
 			sum <<= 16;
@@ -50,7 +53,9 @@ __kernel void method2(
 	__global uint *src,
 	__global uint *dst,
 	__global ushort *factors,
-	int blk_num)
+	int blk_num,
+	int offset,
+	int length)
 {
 	__local uint mtab[512];
 	int i, blk, pos;
@@ -59,7 +64,8 @@ __kernel void method2(
 	const int work_size = get_global_size(0) * 2;
 	const int table_id = get_local_id(0);

-	for (i = work_id; i < BLK_SIZE; i += work_size){
+	src += offset;
+	for (i = work_id; i < length; i += work_size){
 		dst[i    ] = 0;
 		dst[i + 1] = 0;
 	}
@@ -68,7 +74,7 @@ __kernel void method2(
 		calc_table(mtab, table_id, factors[blk]);
 		barrier(CLK_LOCAL_MEM_FENCE);

-		for (i = work_id; i < BLK_SIZE; i += work_size){
+		for (i = work_id; i < length; i += work_size){
 			pos = (i & ~7) + ((i & 7) >> 1);
 			lo = src[pos    ];
 			hi = src[pos + 4];
@@ -86,64 +92,13 @@ __kernel void method2(
 	}
 }

-__kernel void method3(
-	__global uint *src,
-	__global uint *dst,
-	__global ushort *factors,
-	int blk_num)
-{
-	__global uint *blk_src;
-	__local uint mtab[512];
-	int i, blk, chk_size, remain, pos;
-	uint lo, hi, sum1, sum2;
-	const int work_id = get_global_id(0) * 2;
-	const int work_size = get_global_size(0) * 2;
-	const int table_id = get_local_id(0);
-
-	remain = BLK_SIZE;
-	chk_size = CHK_SIZE;
-	while (remain > 0){
-		if (chk_size > remain)
-			chk_size = remain;
-
-		for (i = work_id; i < chk_size; i += work_size){
-			dst[i    ] = 0;
-			dst[i + 1] = 0;
-		}
-
-		blk_src = src;
-		for (blk = 0; blk < blk_num; blk++){
-			calc_table(mtab, table_id, factors[blk]);
-			barrier(CLK_LOCAL_MEM_FENCE);
-
-			for (i = work_id; i < chk_size; i += work_size){
-				pos = (i & ~7) + ((i & 7) >> 1);
-				lo = blk_src[pos    ];
-				hi = blk_src[pos + 4];
-				sum1 = mtab[(uchar)(lo >> 16)] ^ mtab[256 + (uchar)(hi >> 16)];
-				sum2 = mtab[lo >> 24] ^ mtab[256 + (hi >> 24)];
-				sum1 <<= 16;
-				sum2 <<= 16;
-				sum1 ^= mtab[(uchar)lo] ^ mtab[256 + (uchar)hi];
-				sum2 ^= mtab[(uchar)(lo >> 8)] ^ mtab[256 + (uchar)(hi >> 8)];
-				dst[pos    ] ^= (sum1 & 0x00FF00FF) | ((sum2 & 0x00FF00FF) << 8);
-				dst[pos + 4] ^= ((sum1 & 0xFF00FF00) >> 8) | (sum2 & 0xFF00FF00);
-			}
-			blk_src += BLK_SIZE;
-			barrier(CLK_LOCAL_MEM_FENCE);
-		}
-
-		src += CHK_SIZE;
-		dst += CHK_SIZE;
-		remain -= CHK_SIZE;
-	}
-}
-
 __kernel void method4(
 	__global uint *src,
 	__global uint *dst,
 	__global ushort *factors,
-	int blk_num)
+	int blk_num,
+	int offset,
+	int length)
 {
 	__local int table[16];
 	__local uint cache[256];
@@ -152,7 +107,8 @@ __kernel void method4(
 	const int work_id = get_global_id(0);
 	const int work_size = get_global_size(0);

-	for (i = work_id; i < BLK_SIZE; i += work_size)
+	src += offset;
+	for (i = work_id; i < length; i += work_size)
 		dst[i] = 0;

 	for (blk = 0; blk < blk_num; blk++){
@@ -166,7 +122,7 @@ __kernel void method4(
 		}
 		barrier(CLK_LOCAL_MEM_FENCE);

-		for (i = work_id; i < BLK_SIZE; i += work_size){
+		for (i = work_id; i < length; i += work_size){
 			pos = i & 255;
 			cache[pos] = src[i];
 			barrier(CLK_LOCAL_MEM_FENCE);
--- a/source/par2j/version.h
+++ b/source/par2j/version.h
@@ -1,2 +1,2 @@
-#define FILE_VERSION "1.3.2.8"	// file version number
-#define PRODUCT_VERSION "1.3.2"	// product version number
+#define FILE_VERSION "1.3.3.0"	// file version number
+#define PRODUCT_VERSION "1.3.3"	// product version number
Author	SHA1	Message	Date
Yutaka Sawada	82197ac0d0	Release note of version 1.3.3.0	2023-10-10 13:08:48 +09:00
Yutaka Sawada	959cf0e8d4	Set release date	2023-10-10 13:02:35 +09:00
Yutaka Sawada	bcbdc2fe38	Add MediaList4 option	2023-09-30 18:47:27 +09:00
Yutaka Sawada	ca2c7731d4	Add 5th item in "Media size"	2023-09-30 18:46:28 +09:00
Yutaka Sawada	aeb8913a58	Decrease threshold to use GPU	2023-09-30 18:44:34 +09:00
Yutaka Sawada	323a53d808	CPU usage slider was returned to 5 levels.	2023-09-27 11:21:22 +09:00
Yutaka Sawada	54931fc0e7	Optimization for CPU cache	2023-09-24 21:26:09 +09:00
Yutaka Sawada	3024186aa6	Change "lc" option	2023-09-24 21:22:57 +09:00
Yutaka Sawada	ceed4ebd83	What is changed in v1.3.3.0	2023-09-24 21:21:42 +09:00
Yutaka Sawada	471246df18	Samples of v1.3.3.0	2023-09-24 21:20:43 +09:00