Release note of version 1.3.3.2

Update to year 2024
2024-01-10 13:18:10 +09:00 · 2024-01-10 10:44:19 +09:00 · 2024-01-10 10:40:33 +09:00 · 2023-12-26 18:57:09 +09:00 · 2023-12-26 18:56:26 +09:00 · 2023-12-26 18:53:12 +09:00
35 changed files with 2155 additions and 1012 deletions
--- a/README.md
+++ b/README.md
@@ -1,68 +1,44 @@
 # MultiPar

-### v1.3.3.0 is public
+### v1.3.3.2 is public

-&nbsp; This is a testing version to improve speed of PAR2 calculation. 
-Because the new method isn't tested so much, there may be a bug, failure, or mistake. 
-Be careful to use this non-stable version. 
-When you don't want to test by yourself, you should not use this yet. 
+&nbsp; This is a small fix version to improve performance of GPU acceleration. 
+It will become faster on AMD Radeon graphics boards. 
+It may be slightly faster on Nvidia GeForce graphics boards. 
+There is no difference in CPU calculation. 
+Because this hasn't been tested much, there may be a bug, failure, or mistake. 
 If you see a problem, please report the incident. 
 I will try to solve it as well as I can.

-&nbsp; The PAR2 calculation speed may be 10% ~ 50% faster than old version. 
-The optimization depends on hardware environment. 
-I don't know what is the best setting on which PC. 
-From [many tests of debug versions](https://github.com/Yutaka-Sawada/MultiPar/issues/99), 
-it will select maybe better setting automatically. 
-Thanks testers for many trials. 
-If you want to compare speed of different settings on your PC, you may try those debug versions.
-
-&nbsp; I changed GPU implementation largely, too. 
-To adopt CPU optimization, it will process smaller tasks on GPU. 
-Because GPU don't use CPU's cache, it's inefficient for GPU's task. 
-I don't know that new method is faster than old version or not.
-
-Threshold to use GPU:
- Data size must be larger than 200 MB.
- Block size must be larger than 64 KB.
- Number of source blocks must be more than 192.
- Number of recovery blocks must be more than 8.
-
-&nbsp; Because [a user requested](https://github.com/Yutaka-Sawada/MultiPar/issues/102), 
-I implemented a way to add 5th item in "Media size" on Create window. 
-Write this line `MediaList4=name:size` under `[Option]` section in `MultiPar.ini`. 
-Currently, you cannot change the item on Option window.
+&nbsp; I changed 3 points in my OpenCL implementation. 
+It's possible to test them with the `lc` option on the command line. 
+Thanks to [cavalia88, Slava46, and Anime Tosho for many tests and wonderful ideas](https://github.com/Yutaka-Sawada/MultiPar/issues/107). 
+OpenCL performance varies between graphics boards. 
+If you have a fast graphics board, enabling "GPU acceleration" would be faster. 
+If it's not so fast (or is slow) on your PC, just un-check the feature.
+1) Data transfer between PC's RAM and GPU's VRAM
+2) Calculation over GPU
+3) Calculate 2 blocks at once to reduce number of table lookups


-[ Changes from 1.3.2.9 to 1.3.3.0 ]  
-
-GUI update
- Change
-  - Option adapted to new "lc" settings.
-  - It's possible to add 5th item in "Media size" on Create window.
+[ Changes from 1.3.3.1 to 1.3.3.2 ]  

 PAR2 client update
- Change
-  - Max number of using threads is increased to 32.
-  - Threshold to use GPU was decreased.
-
 - Improvement
-  - Matrix inversion may use more threads.
-  - L3 cache optimization was improved for recent CPUs.
+  - GPU acceleration will work well on AMD graphics boards.


 [ Hash value ]  

-MultiPar1330.zip  
-MD5: 79570F84B74ECF8E5100561F7AAC3803  
-SHA1: ACF7F164001708789C5D94003ED6B5C172235D54  
+MultiPar1332.zip  
+MD5: 5F2848ED7F65C632D1FED42A39B66F95  
+SHA1: CFA2CC6D217704BE2AF9DEDE15B117E9DC26A25B  

-MultiPar1330_setup.exe  
-MD5: D1F1A5A4DF1C9EDD698C9A017AF31039  
-SHA1: 4C3314B909572A303EBBE8E015A2E813841CFA33  
+MultiPar1332_setup.exe  
+MD5: 338F9D0842762338DC83921BBE546AF8  
+SHA1: 2A11FD544D49AA7B952214733C9D8E53F647592E  
 &nbsp; To install under "Program Files" or "Program Files (x86)" directory, 
-you must start the installer with administrative privileges by selecting 
-"Run as administrator" on right-click menu.
+you must select "Install for all users" at the first dialog.

 &nbsp; Old versions and source code packages are available at 
 [GitHub](https://github.com/Yutaka-Sawada/MultiPar/releases) or 
--- a/alpha/MultiPar.exe
+++ b/alpha/MultiPar.exe
--- a/alpha/ReadMe_English.txt
+++ b/alpha/ReadMe_English.txt
@@ -71,14 +71,13 @@ Don't send current PAR3 files to others, who may not have the same version.

 [ How to install or uninstall with installer package ]

- Double click setup file ( MultiPar131_setup.exe or something like this name ),
+ Double click setup file ( MultiPar133_setup.exe or something like this name ),
 and follow the installer dialog.
 At version up, if you want to use previous setting, overwrite install is possible.
 Before overwrite install, you should un-check "Integrate MultiPar into Shell".
 You may need to re-start OS after overwrite install or uninstall rarely.
 To install under "Program Files" or "Program Files (x86)" directory,
-you must start the installer with administrative privileges by selecting
-"Run as administrator" on right-click menu.
+you must select "Install for all users" at the first dialog.

 You can uninstall through the Windows OS's Control Panel,
 or double click unins000.exe in a folder which MultiPar was installed.
@@ -100,7 +99,7 @@ In either case, user made icons and association are available for the user only.

 [ How to install with archive version ]

- Unpack compressed file ( MultiPar131.zip or something like this name ) in a folder.
+ Unpack compressed file ( MultiPar133.zip or something like this name ) in a folder.
 MultiPar.exe is the interface of MultiPar.

 You can create short-cut icon or send-to link at Option window later.
--- a/alpha/ReadMe_Japanese.txt
+++ b/alpha/ReadMe_Japanese.txt
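@@ -107,12 +107,14 @@ The PAR 3.0 format is still in flux in its details, and compatibility with the final version

 [ How to install or uninstall with installer package ]

- Double-click the installer ( MultiPar131_setup.exe or a similar name ),
+ Double-click the installer ( MultiPar133_setup.exe or a similar name ),
 and the install screen appears, so follow its instructions.
 When upgrading, you may install over the old version if you want to keep your settings.
 Before an overwrite install, un-check "Integrate MultiPar into Shell".
 You may be asked to restart the OS after an overwrite install or uninstall.
 To install under "Program Files" or "Program Files (x86)",
+select "Install for all users" at the first dialog.
+
 you need to select "Run as administrator" on the right-click menu
 and start the installer with administrative privileges.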

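@@ -137,7 +139,7 @@ Double-click unins000.exe in the folder where MultiPar was installed

 [ How to install with archive version ]

- Unpack the distributed compressed file ( MultiPar131.zip or a similar name ) and put
+ Unpack the distributed compressed file ( MultiPar133.zip or a similar name ) and put
 all the extracted files into any suitable folder.
 MultiPar.exe among them is the executable file of MultiPar.
 Double-click it to start MultiPar.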
--- a/alpha/Update_English.txt
+++ b/alpha/Update_English.txt
@@ -1,5 +1,29 @@
 Release note of v1.3.3 tree

+ par2j's "lc" option was changed to support more threads.
+
+[ Changes from 1.3.3.1 to 1.3.3.2 ] (2024/01/10)
+
+PAR2 client update
+ Improvement
+  GPU acceleration will work well on AMD graphics boards.
+
+/* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */
+
+[ Changes from 1.3.3.0 to 1.3.3.1 ] (2023/11/11)
+
+Installer update
+ It shows dialog to select "per user" or "per machine" installation.
+
+PAR2 client update
+ Change
+  Max number of threads to read files on SSD was increased to 6.
+
+ Improvement
+  GPU acceleration would become faster.
+
+/* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */
+
 [ Changes from 1.3.2.9 to 1.3.3.0 ] (2023/10/10)

 GUI update
--- a/alpha/Update_Japanese.txt
+++ b/alpha/Update_Japanese.txt
@@ -1,4 +1,4 @@
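-Update information of v1.3.3 (2023/10/10)
+Update information of v1.3.3 (2024/01/10)

  This version is still being tested, so if you are worried, please use the previous version.

@@ -6,6 +6,7 @@

 * Changes in the client
   It became faster by improving how the CPU cache is used.
+  GPU acceleration also became faster.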

 /* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */

--- a/alpha/help/0409/index.htm
+++ b/alpha/help/0409/index.htm
@@ -51,7 +51,7 @@ There are command-line manuals in "<tt>help</tt>" folder.
 </table>

 <hr>
-<small>last update 2023/06/13 for version 1.3.2.9</small>
+<small>last update 2023/11/11 for version 1.3.3.1</small>

 </body>
 </html>
--- a/alpha/help/0409/python.htm
+++ b/alpha/help/0409/python.htm
@@ -173,7 +173,8 @@ It will verify each PAR2 set continuously.
 It's possible to stop queue on GUI.
 <tr><td>Script file<td><tt>queue_verify.py</tt>
 <tr><td>Caution<td>You must select MultiPar Option: "Re-use verification result" to be "For 3 days" or longer, 
-and you should check "Don't search subfolders" in "Verification and Repair options" section on "Client behavior" tab.
+and you should check "Don't search subfolders" in "Verification and Repair options" section on "Client behavior" tab. 
+Furthermore, you must set proper <tt>save_path</tt> in the script.
 </table>
 </p>
 <p>&nbsp
--- a/alpha/help/0409/readme3.htm
+++ b/alpha/help/0409/readme3.htm
@@ -8,14 +8,13 @@

 <h3>Install or uninstall with installer package</h3>
 <p>&nbsp
-Double click setup file ( <tt>MultiPar131_setup.exe</tt> or something like this name ),
+Double click setup file ( <tt>MultiPar133_setup.exe</tt> or something like this name ),
 and follow the installer dialog. 
 At version up, if you want to use previous setting, overwrite install is possible. 
 Before overwrite install, you should un-check "Integrate MultiPar into Shell". 
 You may need to re-start OS after overwrite install or uninstall rarely. 
 To install under "<tt>Program Files</tt>" or "<tt>Program Files (x86)</tt>" directory, 
-you must start the installer with administrative privileges by selecting 
-"Run as administrator" on right-click menu.
+you must select "Install for all users" at the first dialog.
 </p>
 <p>&nbsp
 You can uninstall through the Windows OS's Control Panel, 
@@ -42,7 +41,7 @@ In either case, user made icons and association are available for the user only.

 <h3>Install with archive version</h3>
 <p>&nbsp
-Unpack compressed file ( <tt>MultiPar131.zip</tt> or something like this name ) in a folder.
+Unpack compressed file ( <tt>MultiPar133.zip</tt> or something like this name ) in a folder.
 <tt>MultiPar.exe</tt> is the interface of MultiPar.
 </p>
 <p>&nbsp
--- a/alpha/help/0411/index.htm
+++ b/alpha/help/0411/index.htm
@@ -51,7 +51,7 @@
 </table>

 <hr>
-<small>Last update 2023/02/27 (version 1.3.2.8)</small>
+<small>Last update 2023/11/11 (version 1.3.3.1)</small>

 </body>
 </html>
--- a/alpha/help/0411/readme3.htm
+++ b/alpha/help/0411/readme3.htm
@@ -8,14 +8,13 @@

 <h3>Install or uninstall with installer package</h3>
 <p>&nbsp
-Double-click the installer ( <tt>MultiPar131_setup.exe</tt> or a similar name ),
+Double-click the installer ( <tt>MultiPar133_setup.exe</tt> or a similar name ),
 and the install screen appears, so follow its instructions.
 When upgrading, you may install over the old version if you want to keep your settings.
 Before an overwrite install, un-check "Integrate MultiPar into Shell".
 You may be asked to restart the OS after an overwrite install or uninstall.
 To install under "<tt>Program Files</tt>" or "<tt>Program Files (x86)</tt>",
-you need to select "Run as administrator" on the right-click menu
-and start the installer with administrative privileges.
+select "Install for all users" at the first dialog.
 </p>
 <p>&nbsp
 You can uninstall through the Windows OS's Control Panel,
@@ -43,7 +42,7 @@ MultiPar

 <h3>Install with archive version</h3>
 <p>&nbsp
-Unpack the distributed compressed file ( <tt>MultiPar131.zip</tt> or a similar name ) and put
+Unpack the distributed compressed file ( <tt>MultiPar133.zip</tt> or a similar name ) and put
 all the extracted files into any suitable folder.
 <tt>MultiPar.exe</tt> among them is the executable file of MultiPar.
 Double-click it to start MultiPar.
--- a/alpha/help/0804/index.htm
+++ b/alpha/help/0804/index.htm
@@ -50,7 +50,7 @@
 </table>

 <hr>
-<small>Last updated 2023/02/27 for version 1.3.2.8, Simplified Chinese localization by Deng Shiqing</small>
+<small>Last updated 2023/11/11 for version 1.3.3.1, Simplified Chinese localization by Deng Shiqing</small>

 </body>
 </html>
--- a/alpha/help/0804/readme3.htm
+++ b/alpha/help/0804/readme3.htm
@@ -8,7 +8,7 @@

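 <h3>Install or uninstall with installer package</h3>
 <p>&nbsp
-Double-click the setup file (<tt>MultiPar131_setup.exe</tt> or a similarly named file), then follow the installer dialog. When upgrading, you can do an overwrite install if you want to keep your previous settings. Before an overwrite install, you should un-check "Integrate MultiPar into Shell". You may need to restart the operating system after an overwrite install or uninstall. To install under the "<tt>Program Files</tt>" or "<tt>Program Files (x86)</tt>" directory, you must select "Run as administrator" on the right-click menu and start the installer with administrative privileges.
+Double-click the setup file (<tt>MultiPar133_setup.exe</tt> or a similarly named file), then follow the installer dialog. When upgrading, you can do an overwrite install if you want to keep your previous settings. Before an overwrite install, you should un-check "Integrate MultiPar into Shell". You may need to restart the operating system after an overwrite install or uninstall. To install under the "<tt>Program Files</tt>" or "<tt>Program Files (x86)</tt>" directory, you must select "Install for all users" at the first dialog.
 </p>
 <p>&nbsp
 You can uninstall through the Windows Control Panel, or by double-clicking <tt>unins000.exe</tt> in the folder where MultiPar is installed. Because the uninstaller does not delete setting files or files added after installation, you may delete them yourself.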
@@ -26,7 +26,7 @@

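 <h3>Install with archive version</h3>
 <p>&nbsp
-Unpack the compressed file (<tt>MultiPar131.zip</tt> or a similarly named file) in a folder.
+Unpack the compressed file (<tt>MultiPar133.zip</tt> or a similarly named file) in a folder.
 <tt>MultiPar.exe</tt> is the launcher of MultiPar.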
 </p>
 <p>&nbsp
--- a/alpha/help/Command_par2j.txt
+++ b/alpha/help/Command_par2j.txt
@@ -1,4 +1,4 @@
-[ par2j.exe - version 1.3.3.0 or later ]
+[ par2j.exe - version 1.3.3.2 or later ]

 Type "par2j.exe" to see version, test integrity, and show usage below.

@@ -367,16 +367,24 @@ The format is "/lc#" (# is from 1 to 32 as the number of using threads).
 253: It uses 3/4 number of physical Cores.
 254: It uses one less threads than number of physical Cores.
  0: It uses the number of physical Cores.
-255: It uses one more threads than number of physical Cores.
+255: It tries to use more threads than number of physical Cores.

- You may set additional combinations;
-+1024 to disable CLMUL (and use old SSSE3 code), 
-+2048 to disable JIT (for SSE2), 
-+4096 to disable SSSE3,
-+8192 to disable AVX2,
-+256 or +512 (slower device) to enable GPU acceleration.
+ You may set additional combinations for CPU feature;
++1024 to disable CLMUL (and use slower SSSE3 code)
++2048 to disable JIT (for SSE2)
++4096 to disable SSSE3
++8192 to disable AVX2

- for example,  /lc1 to use single Core, /lc45 to use half Cores and GPU
+ You may set additional combinations for GPU control;
++256 or +512 (slower device) to enable GPU acceleration
++65536 for classic method
++131072 for 16-byte memory access
++262144 for 4-byte memory access and calculate 2 blocks at once
++524288 for 16-byte memory access and calculate 2 blocks at once
++1048576 for CL_MEM_COPY_HOST_PTR or +2097152 for CL_MEM_USE_HOST_PTR
+(When you set exclusive bits, larger value will be used.)
+
+ for example,  /lc1 to use single Core, /lc508 to use half Cores and GPU

 /m :
 Set this, if you want to set memory usage.
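
The `/lc` value is a sum of independent settings, so the documented example `/lc508` can be read as 252 + 256. The sketch below composes such a value in C. It assumes that 252 is the "half of physical Cores" setting (inferred from the example, because the hunk above starts at 253) and that the thread setting occupies the low 8 bits. The macro and function names are illustrative and are not taken from par2j.

```c
#include <assert.h>
#include <stdio.h>

/* Values copied from the option text above; the names are hypothetical. */
#define LC_HALF_CORES  252u   /* assumption: inferred from the /lc508 example */
#define LC_GPU         256u   /* enable GPU acceleration */
#define LC_GPU_SLOW    512u   /* enable GPU acceleration on the slower device */
#define LC_NO_CLMUL    1024u  /* disable CLMUL */
#define LC_NO_JIT      2048u  /* disable JIT (for SSE2) */
#define LC_NO_SSSE3    4096u  /* disable SSSE3 */
#define LC_NO_AVX2     8192u  /* disable AVX2 */

/* Combine a thread setting (0..255) with feature flags into one /lc value. */
unsigned int lc_compose(unsigned int threads, unsigned int flags)
{
	assert(threads <= 255u);       /* assumption: low 8 bits hold the thread setting */
	assert((flags & 255u) == 0u);  /* flags must not overlap the thread setting */
	return threads | flags;
}

/* Write the option string (for example "/lc508") into buf. */
int lc_format(char *buf, size_t size, unsigned int value)
{
	return snprintf(buf, size, "/lc%u", value);
}
```

With these definitions, `lc_compose(LC_HALF_CORES, LC_GPU)` gives 508, which matches the `/lc508` example in the option text.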
--- a/alpha/par2j.exe
+++ b/alpha/par2j.exe
--- a/alpha/par2j64.exe
+++ b/alpha/par2j64.exe
--- a/source/par2j/Command_par2j.txt
+++ b/source/par2j/Command_par2j.txt
@@ -1,4 +1,4 @@
-[ par2j.exe - version 1.3.3.0 or later ]
+[ par2j.exe - version 1.3.3.2 or later ]

 Type "par2j.exe" to see version, test integrity, and show usage below.

@@ -367,16 +367,24 @@ The format is "/lc#" (# is from 1 to 32 as the number of using threads).
 253: It uses 3/4 number of physical Cores.
 254: It uses one less threads than number of physical Cores.
  0: It uses the number of physical Cores.
-255: It uses one more threads than number of physical Cores.
+255: It tries to use more threads than number of physical Cores.

- You may set additional combinations;
-+1024 to disable CLMUL (and use old SSSE3 code), 
-+2048 to disable JIT (for SSE2), 
-+4096 to disable SSSE3,
-+8192 to disable AVX2,
-+256 or +512 (slower device) to enable GPU acceleration.
+ You may set additional combinations for CPU feature;
++1024 to disable CLMUL (and use slower SSSE3 code)
++2048 to disable JIT (for SSE2)
++4096 to disable SSSE3
++8192 to disable AVX2

- for example,  /lc1 to use single Core, /lc45 to use half Cores and GPU
+ You may set additional combinations for GPU control;
++256 or +512 (slower device) to enable GPU acceleration
++65536 for classic method
++131072 for 16-byte memory access
++262144 for 4-byte memory access and calculate 2 blocks at once
++524288 for 16-byte memory access and calculate 2 blocks at once
++1048576 for CL_MEM_COPY_HOST_PTR or +2097152 for CL_MEM_USE_HOST_PTR
+(When you set exclusive bits, larger value will be used.)
+
+ for example,  /lc1 to use single Core, /lc508 to use half Cores and GPU

 /m :
 Set this, if you want to set memory usage.
--- a/source/par2j/common2.c
+++ b/source/par2j/common2.c
@@ -1,5 +1,5 @@
 // common2.c
-// Copyright : 2023-09-23 Yutaka Sawada
+// Copyright : 2023-10-13 Yutaka Sawada
 // License : GPL

 #ifndef _UNICODE
@@ -1848,7 +1848,7 @@ int sqrt32(int num)
 /* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */

 int cpu_num = 1;	// When the number of CPUs/Cores is limited, the real count is kept in the upper bits
-// /arch:SSE2, +1=SSSE3, +2=SSE4.1, +4=SSE4.2, +8=CLMUL, +16=AVX2, +128=JIT(SSE2), +256=Old
+// /arch:SSE2, +1=SSSE3, +2=SSE4.1, +4=SSE4.2, +8=CLMUL, +16=AVX2, +128=JIT(SSE2), +256=no ALTMAP
 // upper 16 bits = size limit calculated from the L2 cache size
 unsigned int cpu_flag = 0;
 unsigned int cpu_cache = 0;	// upper 16 bits = L3 cache size limit, lower 16 bits = number of simultaneous tasks
--- a/source/par2j/create.c
+++ b/source/par2j/create.c
@@ -1,5 +1,5 @@
 // create.c
-// Copyright : 2023-09-23 Yutaka Sawada
+// Copyright : 2023-12-12 Yutaka Sawada
 // License : GPL

 #ifndef _UNICODE
@@ -26,6 +26,11 @@

 //#define TIMER // for experiments

+#ifdef TIMER
+#include <time.h>
+static double time_sec, time_speed;
+#endif
+
 // Compare items when sorting
 static int sort_cmp(const void *elem1, const void *elem2)
 {
@@ -196,7 +201,7 @@ int set_common_packet(
 	__int64 prog_now = 0;

 #ifdef TIMER
-unsigned int time_start = GetTickCount();
+clock_t time_start = clock();
 #endif
 	print_progress_text(0, "Computing file hash");

@@ -305,14 +310,14 @@ unsigned int time_start = GetTickCount();
 	off += (64 + main_packet_size);

 #ifdef TIMER
-time_start = GetTickCount() - time_start;
-printf("hash %d.%03d sec", time_start / 1000, time_start % 1000);
-if (time_start > 0){
-	time_start = (int)((total_file_size * 125) / ((__int64)time_start * 131072));
-	printf(", %d MB/s\n", time_start);
+time_start = clock() - time_start;
+time_sec = (double)time_start / CLOCKS_PER_SEC;
+if (time_sec > 0){
+	time_speed = (double)total_file_size / (time_sec * 1048576);
 } else {
-	printf("\n");
+	time_speed = 0;
 }
+printf("hash %.3f sec, %.0f MB/s\n", time_sec, time_speed);
 #endif

 error_end:
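
The new timer code converts a `clock()` difference to seconds and then to MB/s, where 1 MB is 1048576 bytes. Below is a minimal standalone sketch of the same arithmetic; the function names `speed_mb_per_sec` and `elapsed_sec` are hypothetical. Note that `clock()` reports wall-clock time with the Microsoft C runtime but processor time on POSIX systems, so it measures elapsed time only on the former.

```c
#include <assert.h>
#include <time.h>

/* Throughput in MB/s (1 MB = 1048576 bytes).
   Returns 0 when no time elapsed, like the guard in the diff above. */
double speed_mb_per_sec(long long total_bytes, double seconds)
{
	assert(total_bytes >= 0);
	if (seconds > 0)
		return (double)total_bytes / (seconds * 1048576);
	return 0;
}

/* Seconds passed since 'start', a value previously returned by clock(). */
double elapsed_sec(clock_t start)
{
	return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

For example, hashing 100 MB in 2 seconds gives `speed_mb_per_sec(104857600LL, 2.0)`, which is 50 MB/s.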
@@ -320,7 +325,7 @@ error_end:
 	return off;
 }

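-#define MAX_MULTI_READ	4	// Max number of files to read simultaneously on SSD
+#define MAX_MULTI_READ	6	// Max number of files to read simultaneously on SSD

 // Version that calculates hash values of multiple files on SSD simultaneously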
 int set_common_packet_multi(
@@ -341,18 +346,16 @@ int set_common_packet_multi(
 	FILE_HASH_TH th[MAX_MULTI_READ];

 #ifdef TIMER
-unsigned int time_start = GetTickCount();
+clock_t time_start = clock();
 #endif

 	memset(hSub, 0, sizeof(HANDLE) * MAX_MULTI_READ);
 	memset(th, 0, sizeof(FILE_HASH_TH) * MAX_MULTI_READ);
 	// Increase the number of threads according to the number of Cores
 	if ((memory_use & 32) != 0){	// NVMe SSD
-		if (cpu_num >= 8){	// 8 ~ 16 Cores
-			multi_read = 4;
-		} else {	// 3 Cores + Hyper-threading, or 4 ~ 7 Cores
-			multi_read = 3;
-		}
+		multi_read = (cpu_num + 2) / 3 + 1;	// 3=2, 4~6=3, 7~9=4, 10~12=5, 13~=6
+		if (multi_read > MAX_MULTI_READ)
+			multi_read = MAX_MULTI_READ;
 	} else {	// SATA SSD
 		multi_read = 2;
 	}
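
The new NVMe rule replaces the two fixed cases with a formula that grows with the Core count and is capped at `MAX_MULTI_READ`. The sketch below isolates that formula so the mapping in the code comment can be checked; the function name `nvme_multi_read` is hypothetical.

```c
#include <assert.h>

#define MAX_MULTI_READ 6	/* same limit as in the diff above */

/* Number of files to read at once on an NVMe SSD for a given Core count. */
int nvme_multi_read(int cpu_num)
{
	int multi_read;

	assert(cpu_num >= 1);
	multi_read = (cpu_num + 2) / 3 + 1;	/* 3=2, 4~6=3, 7~9=4, 10~12=5, 13~=6 */
	if (multi_read > MAX_MULTI_READ)
		multi_read = MAX_MULTI_READ;
	return multi_read;
}
```

The old code returned 4 for 8 or more Cores and 3 otherwise, so a 16-Core machine moves from 4 to 6 simultaneous reads under the new rule.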
@@ -547,14 +550,14 @@ unsigned int time_start = GetTickCount();
 	}
 	print_progress_done();	// Print a newline and return to the start of the line
 #ifdef TIMER
-time_start = GetTickCount() - time_start;
-printf("hash %d.%03d sec", time_start / 1000, time_start % 1000);
-if (time_start > 0){
-	time_start = (int)((total_file_size * 125) / ((__int64)time_start * 131072));
-	printf(", %d MB/s\n", time_start);
+time_start = clock() - time_start;
+time_sec = (double)time_start / CLOCKS_PER_SEC;
+if (time_sec > 0){
+	time_speed = (double)total_file_size / (time_sec * 1048576);
 } else {
-	printf("\n");
+	time_speed = 0;
 }
+printf("hash %.3f sec, %.0f MB/s\n", time_sec, time_speed);
 #endif

 error_end:
@@ -702,7 +705,7 @@ int set_common_packet_hash(
 	__int64 prog_now = 0;

 #ifdef TIMER
-unsigned int time_start = GetTickCount();
+clock_t time_start = clock();
 #endif
 	print_progress_text(0, "Computing file hash");

@@ -742,8 +745,8 @@ unsigned int time_start = GetTickCount();
 	print_progress_done();	// Print a newline and return to the start of the line

 #ifdef TIMER
-time_start = GetTickCount() - time_start;
-printf("hash %d.%03d sec\n", time_start / 1000, time_start % 1000);
+time_start = clock() - time_start;
+printf("hash %.3f sec\n", (double)time_start / CLOCKS_PER_SEC);
 #endif
 	return 0;
 }
@@ -1067,7 +1070,7 @@ int create_recovery_file(
 #endif

 #ifdef TIMER
-unsigned int time_start = GetTickCount();
+clock_t time_start = clock();
 #endif
 	print_progress_text(0, "Constructing recovery file");
 	time_last = GetTickCount();
@@ -1260,8 +1263,8 @@ unsigned int time_start = GetTickCount();
 	print_progress_done();	// Print a newline and return to the start of the line

 #ifdef TIMER
-time_start = GetTickCount() - time_start;
-printf("write %d.%03d sec\n", time_start / 1000, time_start % 1000);
+time_start = clock() - time_start;
+printf("write %.3f sec\n", (double)time_start / CLOCKS_PER_SEC);
 #endif

 	return 0;
@@ -1282,6 +1285,7 @@ int create_recovery_file_1pass(
 	int footer_size,			// Buffer size of the footer packets
 	HANDLE *rcv_hFile,			// Handle of each recovery file
 	unsigned char *p_buf,		// Parity blocks already calculated
+	unsigned char *g_buf,		// For GPU (must be NULL when GPU isn't used)
 	unsigned int unit_size)
 {
 	unsigned char *packet_header, hash[HASH_SIZE];
@@ -1438,6 +1442,10 @@ int create_recovery_file_1pass(

 		// Recovery Slice packets are written later
 		for (j = block_start; j < block_start + block_count; j++){
+			if (g_buf != NULL){	// When GPU was used
+				// Combine the results of CPU threads and the GPU thread
+				galois_align_xor(g_buf + (size_t)unit_size * j, p_buf, unit_size);
+			}
 			// Verify the checksum of the parity block
 			checksum16_return(p_buf, hash, unit_size - HASH_SIZE);
 			if (memcmp(p_buf + unit_size - HASH_SIZE, hash, HASH_SIZE) != 0){
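
When the GPU is used, the diff above merges the GPU buffer into the parity block with `galois_align_xor` before the checksum is verified. Presumably the CPU threads and the GPU thread each accumulate a partial result for the same block; merging by XOR works because addition in a Galois field of characteristic 2 is XOR. Below is a plain-C sketch of such a merge, without the alignment requirements or SIMD of the real function. The name `xor_merge` and its argument order are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* XOR 'len' bytes of src into dst (dst ^= src). */
void xor_merge(const unsigned char *src, unsigned char *dst, size_t len)
{
	size_t i;

	assert(src != NULL && dst != NULL);
	for (i = 0; i < len; i++)
		dst[i] ^= src[i];
}
```

Because XOR is its own inverse, applying the same merge twice restores the original destination buffer, which is a quick way to sanity-check an implementation.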
--- a/source/par2j/create.h
+++ b/source/par2j/create.h
@@ -82,6 +82,7 @@ int create_recovery_file_1pass(
 	int footer_size,			// Buffer size of the footer packets
 	HANDLE *rcv_hFile,			// Handle of each recovery file
 	unsigned char *p_buf,		// Parity blocks already calculated
+	unsigned char *g_buf,		// For GPU (must be NULL when GPU isn't used)
 	unsigned int unit_size);

 // Delete recovery files that are being created
--- a/source/par2j/gf16.c
+++ b/source/par2j/gf16.c
@@ -71,7 +71,6 @@ extern unsigned int cpu_flag;	// declared in common2.h
 // Forward declarations for switching functions depending on the CPU

 //#define NO_SIMD	// When not using SIMD
-//#define NO_ALTMAP	// When not using the rearrangement for SSSE3 or JIT(SSE2) (CLMUL and 32-byte units stay enabled)

 int sse_unit;

@@ -134,8 +133,11 @@ int galois_create_table(void)
 	checksum16_altmap = checksum16;
 	checksum16_return = checksum16;
 #ifndef NO_SIMD
-#ifndef NO_ALTMAP
-	if (cpu_flag & 16){	// If AVX2 is supported
+	if (cpu_flag & 256){	// When not using the rearrangement for AVX2, SSSE3, JIT(SSE2)
+		// Instructions such as AVX-512 may be supported in the future
+		//printf("\nWithout ALTMAP\n");
+		//sse_unit = 32;
+	} else if (cpu_flag & 16){	// If AVX2 is supported
 		//printf("\nUse AVX2 & ALTMAP\n");
 		sse_unit = 32;	// Any of 32, 64, 128 is fine
 		galois_align_multiply = galois_align32avx_multiply;
@@ -145,16 +147,14 @@ int galois_create_table(void)
 		checksum16_altmap = checksum16_altmap32;
 		checksum16_return = checksum16_return32;
 	} else if (cpu_flag & 1){	// If SSSE3 is supported
-		if ((cpu_flag & 256) == 0){	// Use SSSE3 & ALTMAP
-			//printf("\nUse SSSE3 & ALTMAP\n");
-			sse_unit = 32;	// Any of 32, 64, 128 is fine
-			galois_align_multiply = galois_align32_multiply;
-			galois_align_multiply2 = galois_align32_multiply2;
-			galois_altmap_change = galois_altmap32_change;
-			galois_altmap_return = galois_altmap32_return;
-			checksum16_altmap = checksum16_altmap32;
-			checksum16_return = checksum16_return32;
-		}
+		//printf("\nUse SSSE3 & ALTMAP\n");
+		sse_unit = 32;	// Any of 32, 64, 128 is fine
+		galois_align_multiply = galois_align32_multiply;
+		galois_align_multiply2 = galois_align32_multiply2;
+		galois_altmap_change = galois_altmap32_change;
+		galois_altmap_return = galois_altmap32_return;
+		checksum16_altmap = checksum16_altmap32;
+		checksum16_return = checksum16_return32;
 	} else {	// When SSSE3 isn't available
 		if ((cpu_flag & 128) && (jit_alloc() == 0)){	// Use JIT(SSE2)
 			//printf("\nUse JIT(SSE2) & ALTMAP\n");
@@ -167,7 +167,6 @@ int galois_create_table(void)
 			checksum16_return = checksum16_return256;
 		}
 	}
-#endif
 #endif

 	return 0;
@@ -2792,11 +2791,11 @@ void galois_align_xor(
 #endif
 }

-// Multiplication only for buffers aligned to a 16-byte boundary
+// Multiplication only for buffers aligned to a 16-byte boundary (without ALTMAP)
 void galois_align16_multiply(
 	unsigned char *r1,	// Region to multiply (must be aligned by 16)
 	unsigned char *r2,	// Products go here
-	unsigned int len,	// Byte length (must be multiple of 32)
+	unsigned int len,	// Byte length (must be multiple of 16)
 	int factor)			// Number to multiply by
 {
 	if (factor <= 1){
@@ -2826,6 +2825,16 @@ void galois_align16_multiply(

 	// Always create the table for multiplication (slightly slower with 32 bytes)
 #ifndef NO_SIMD
+/*
+	// Only when sse_unit is a multiple of 32
+	} else if (cpu_flag & 16){	// If AVX2 is supported
+		__declspec( align(32) ) unsigned char small_table[128];
+
+		create_eight_table_avx2(small_table, factor);
+
+		gf16_avx2_block32u(r1, r2, len, small_table);
+*/
+
 	} else if (cpu_flag & 1){	// If SSSE3 is supported
 		__declspec( align(16) ) unsigned char small_table[128];

@@ -2869,7 +2878,6 @@ void galois_align16_multiply(
 			len -= 8;
 		}
 #endif
-
 	}
 }

--- a/source/par2j/gf16.h
+++ b/source/par2j/gf16.h
@@ -6,7 +6,7 @@ extern "C" {
 #endif


-extern unsigned short *galois_log_table;
+//extern unsigned short *galois_log_table;
 extern unsigned int cpu_flag;

 int galois_create_table(void);	// Returns 0 on success, -1 on failure
--- a/source/par2j/lib_opencl.c
+++ b/source/par2j/lib_opencl.c
@@ -1,5 +1,5 @@
 // lib_opencl.c
-// Copyright : 2023-09-23 Yutaka Sawada
+// Copyright : 2023-12-26 Yutaka Sawada
 // License : GPL

 #ifndef _WIN32_WINNT
@@ -84,7 +84,7 @@ cl_command_queue OpenCL_command = NULL;
 cl_kernel OpenCL_kernel = NULL;
 cl_mem OpenCL_src = NULL, OpenCL_dst = NULL, OpenCL_buf = NULL;
 size_t OpenCL_group_num;
-int OpenCL_method = 0;	// positive = choose the fast device, negative = choose the slow device
+int OpenCL_method = 0;	// By default GPU isn't used, and the behavior is selected automatically

 API_clCreateBuffer gfn_clCreateBuffer;
 API_clReleaseMemObject gfn_clReleaseMemObject;
@@ -100,9 +100,12 @@ API_clEnqueueNDRangeKernel gfn_clEnqueueNDRangeKernel;

 /*
 Input
-OpenCL_method : which device to choose
+OpenCL_method : which device and function to choose
+    0x100 = choose the fast device,    0x200 = choose the slow device
+  0x10000 = calculate 1 block at a time, 0x20000 = try to calculate 2 blocks at a time
+  0x40000 = 4-byte memory access,  0x80000 = try 16-byte memory access
+ 0x100000 = CL_MEM_COPY_HOST_PTR, 0x200000 = CL_MEM_USE_HOST_PTR
 unit_size : unit size of a block
-chunk_size: size of a split chunk
 src_max : number of source blocks

 Output
@@ -112,11 +115,12 @@ OpenCL_method : various behavior flags
 */

 // 0 = success, 1 or more = error number
-int init_OpenCL(int unit_size, int chunk_size, int *src_max)
+int init_OpenCL(unsigned int unit_size, int *src_max)
 {
 	char buf[2048], *p_source;
 	int err = 0, i, j;
 	int gpu_power, count;
+	int unified_memory;	// non zero = Integrated GPU
 	size_t data_size, alloc_max;
 	//FILE *fp;
 	HRSRC res;
@@ -137,6 +141,7 @@ int init_OpenCL(int unit_size, int chunk_size, int *src_max)
 	API_clReleaseProgram fn_clReleaseProgram;
 	API_clCreateKernel fn_clCreateKernel;
 	API_clGetKernelWorkGroupInfo fn_clGetKernelWorkGroupInfo;
+	API_clReleaseKernel fn_clReleaseKernel;
 	cl_int ret;
 	cl_uint num_platforms = 0, num_devices = 0, num_groups, param_value;
 	cl_ulong param_value8;
@@ -216,6 +221,9 @@ int init_OpenCL(int unit_size, int chunk_size, int *src_max)
 	fn_clGetKernelWorkGroupInfo = (API_clGetKernelWorkGroupInfo)GetProcAddress(hLibOpenCL, "clGetKernelWorkGroupInfo");
 	if (fn_clGetKernelWorkGroupInfo == NULL)
 		return err;
+	fn_clReleaseKernel = (API_clReleaseKernel)GetProcAddress(hLibOpenCL, "clReleaseKernel");
+	if (fn_clReleaseKernel == NULL)
+		return err;
 	gfn_clFinish = (API_clFinish)GetProcAddress(hLibOpenCL, "clFinish");
 	if (gfn_clFinish == NULL)
 		return err;
@@ -227,12 +235,10 @@ int init_OpenCL(int unit_size, int chunk_size, int *src_max)
 	ret = fn_clGetPlatformIDs(MAX_DEVICE, platform_id, &num_platforms);
 	if (ret != CL_SUCCESS)
 		return (ret << 8) | 10;
-	if (OpenCL_method >= 0){	// Change the selection order and the initial value
-		OpenCL_method = 1;
-		gpu_power = 0;
-	} else {
-		OpenCL_method = -1;
+	if (OpenCL_method & 0x200){	// Change the selection order and the initial value
 		gpu_power = INT_MIN;
+	} else {
+		gpu_power = 0;
 	}
 	alloc_max = 0;

@@ -266,56 +272,42 @@ int init_OpenCL(int unit_size, int chunk_size, int *src_max)
 			ret = fn_clGetDeviceInfo(device_id[j], CL_DEVICE_VERSION, sizeof(buf), buf, NULL);
 			if (ret == CL_SUCCESS)
 				printf("Device version = %s\n", buf);
-			ret = fn_clGetDeviceInfo(device_id[j], CL_DEVICE_LOCAL_MEM_SIZE, sizeof(cl_ulong), &param_value8, NULL);
-			if (ret == CL_SUCCESS)
-				printf("LOCAL_MEM_SIZE = %I64d KB\n", param_value8 >> 10);
-
-			// 無理とは思うけど、一応チェックする
-//#define CL_DEVICE_SVM_CAPABILITIES                  0x1053
-//#define CL_DEVICE_SVM_COARSE_GRAIN_BUFFER           (1 << 0)
-//#define CL_DEVICE_SVM_FINE_GRAIN_BUFFER             (1 << 1)
-//#define CL_DEVICE_SVM_FINE_GRAIN_SYSTEM             (1 << 2)
-//#define CL_DEVICE_SVM_ATOMICS                       (1 << 3)
-//			ret = fn_clGetDeviceInfo(device_id[j], CL_DEVICE_SVM_CAPABILITIES, sizeof(cl_ulong), &param_value8, NULL);
-//			if (ret == CL_INVALID_VALUE)
-//				printf("Shared Virtual Memory is not supported\n");
-//			if (ret == CL_SUCCESS)
-//				printf("Shared Virtual Memory = 0x%I64X\n", param_value8);
 #endif

-			ret = fn_clGetDeviceInfo(device_id[j], CL_DEVICE_ADDRESS_BITS, sizeof(cl_uint), &param_value, NULL);
-			if (ret != CL_SUCCESS)
-				continue;
+			// not an error if this value cannot be obtained
+			ret = fn_clGetDeviceInfo(device_id[j], CL_DEVICE_HOST_UNIFIED_MEMORY, sizeof(cl_uint), &param_value, NULL);
+			if (ret == CL_SUCCESS){
+				if (param_value != 0){
+#ifdef DEBUG_OUTPUT
+					printf("HOST_UNIFIED_MEMORY = %d\n", param_value);
+#endif
+					param_value = 1;
+				}
+			} else {	// CL_DEVICE_HOST_UNIFIED_MEMORY was deprecated in OpenCL 2.0 and later
+				param_value = 0;
+			}
+
+			// treat it as an error if this value cannot be obtained
 			ret = fn_clGetDeviceInfo(device_id[j], CL_DEVICE_MAX_MEM_ALLOC_SIZE, sizeof(cl_ulong), &param_value8, NULL);
 			if (ret != CL_SUCCESS)
 				continue;
 #ifdef DEBUG_OUTPUT
-			printf("ADDRESS_BITS = %d\n", param_value);
 			printf("MAX_MEM_ALLOC_SIZE = %I64d MB\n", param_value8 >> 20);
 #endif
-			if (param_value == 32){	// CL_DEVICE_ADDRESS_BITS によって確保するメモリー領域の上限を変える
-				if (param_value8 > 0x30000000)	// 768MB までにする
-					param_value8 = 0x30000000;
-			} else {	// 64-bit OS でも 2GB までにする
-				if (param_value8 > 0x80000000)
-					param_value8 = 0x80000000;
-			}
-
 			ret = fn_clGetDeviceInfo(device_id[j], CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(cl_uint), &num_groups, NULL);
 			if (ret != CL_SUCCESS)
 				continue;
 			ret = fn_clGetDeviceInfo(device_id[j], CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(size_t), &data_size, NULL);
 			if (ret != CL_SUCCESS)
 				continue;
-			// CL_DEVICE_HOST_UNIFIED_MEMORY は OpenCL 2.0 以降で非推奨になったので、参照しない
-
 #ifdef DEBUG_OUTPUT
 			printf("MAX_COMPUTE_UNITS = %d\n", num_groups);
 			printf("MAX_WORK_GROUP_SIZE = %zd\n", data_size);
 #endif
-			// MAX_COMPUTE_UNITS * MAX_WORK_GROUP_SIZE で計算力を測る
-			count = (int)data_size * num_groups;
-			count *= OpenCL_method;	// 符号を変える
+			// estimate compute power as MAX_COMPUTE_UNITS * MAX_WORK_GROUP_SIZE; double the value for a discrete GPU
+			count = (2 - param_value) * (int)data_size * num_groups;
+			if (OpenCL_method & 0x200)	// Prefer slower device
+				count *= -1;	// flip the sign
 			//printf("prev = %d, now = %d\n", gpu_power, count);
 			if ((count > gpu_power) && (data_size >= 256) &&	// the table cannot be built with fewer than 256
 					(param_value8 / 8 > (cl_ulong)unit_size)){	// does it fit within CL_DEVICE_MAX_MEM_ALLOC_SIZE?
@@ -324,14 +316,15 @@ int init_OpenCL(int unit_size, int chunk_size, int *src_max)
 				selected_platform = platform_id[i];
 				OpenCL_group_num = num_groups;	// use the number of COMPUTE_UNITS as the number of work groups
 				alloc_max = (size_t)param_value8;
+				unified_memory = param_value;	// 0 = discrete GPU, 1 = integrated GPU

-				// on AMD Radeon, the maximum allocation is not necessarily 1/4 of the total memory
+				// on AMD and Intel GPUs, the maximum allocation is not necessarily 1/4 of the total memory
 				ret = fn_clGetDeviceInfo(device_id[j], CL_DEVICE_GLOBAL_MEM_SIZE, sizeof(cl_ulong), &param_value8, NULL);
 				if (ret == CL_SUCCESS){
 #ifdef DEBUG_OUTPUT
 					printf("GLOBAL_MEM_SIZE = %I64d MB\n", param_value8 >> 20);
 #endif
-					// limit the size of each buffer to 1/4 of the total
+					// limit the size of each buffer to 1/4 of the total (using too much VRAM makes it unstable)
 					param_value8 /= 4;
 					if ((cl_ulong)alloc_max > param_value8)
 						alloc_max = (size_t)param_value8;
@@ -365,31 +358,6 @@ int init_OpenCL(int unit_size, int chunk_size, int *src_max)
 	if (ret != CL_SUCCESS)
 		return (ret << 8) | 12;

-	// 計算方式を選択する
-	if ((((cpu_flag & 0x101) == 1) || ((cpu_flag & 16) != 0)) && (sse_unit == 32)){
-		OpenCL_method = 2;	// SSSE3 & ALTMAP または AVX2 ならデータの並び替え対応版を使う
-	} else if (((cpu_flag & 128) != 0) && (sse_unit == 256)){
-		OpenCL_method = 4;	// JIT(SSE2) は bit ごとに上位から 16バイトずつ並ぶ
-		// ローカルのテーブルサイズが異なることに注意
-		// XOR 方式以外は 2KB (4バイト * 256項目 * 2個) 使う
-		// XOR (JIT) は 64バイト (4バイト * 16項目) 使う
-	} else {
-		OpenCL_method = 1;	// 並び替えられてないデータ用
-	}
-
-	// work group 数が必要以上に多い場合は減らす
-	if (OpenCL_method == 2){
-		// work item 一個が 8バイトずつ計算する、256個なら work group ごとに 2KB 担当する
-		data_size = chunk_size / 2048;
-	} else {
-		// work item 一個が 4バイトずつ計算する、256個なら work group ごとに 1KB 担当する
-		data_size = chunk_size / 1024;
-	}
-	if (OpenCL_group_num > data_size){
-		OpenCL_group_num = data_size;
-		printf("Number of work groups is reduced to %zd\n", OpenCL_group_num);
-	}
-
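 	// how many blocks of memory can be held at most (nothing is allocated here yet)
 	// the amount actually allocated later will be smaller than this
 	count = (int)(alloc_max / unit_size);	// work backwards from the amount of memory that can be allocated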
@@ -401,25 +369,6 @@ int init_OpenCL(int unit_size, int chunk_size, int *src_max)
 	printf("src buf : %zd KB (%d blocks), possible\n", data_size >> 10, count);
 #endif

-	// 出力先はchunk 1個分だけあればいい
-	// CL_MEM_ALLOC_HOST_PTRを使えばpinned memoryになるらしい
-	data_size = (chunk_size + 63) & ~63;	//  cache line sizes (64 bytes) の倍数にする
-	OpenCL_dst = gfn_clCreateBuffer(OpenCL_context, CL_MEM_WRITE_ONLY | CL_MEM_ALLOC_HOST_PTR, data_size, NULL, &ret);
-	if (ret != CL_SUCCESS)
-		return (ret << 8) | 13;
-#ifdef DEBUG_OUTPUT
-	printf("dst buf : %zd KB (%zd Bytes), OK\n", data_size >> 10, data_size);
-#endif
-
-	// factor は最大個数分 (src_max個)
-	data_size = sizeof(unsigned short) * (*src_max);
-	OpenCL_buf = gfn_clCreateBuffer(OpenCL_context, CL_MEM_READ_ONLY, data_size, NULL, &ret);
-	if (ret != CL_SUCCESS)
-		return (ret << 8) | 14;
-#ifdef DEBUG_OUTPUT
-	printf("factor buf : %zd Bytes (%d factors), OK\n", data_size, (*src_max));
-#endif
-
 /*
 	// テキスト形式の OpenCL C ソース・コードを読み込む
 	err = 4;
@@ -520,18 +469,208 @@ int init_OpenCL(int unit_size, int chunk_size, int *src_max)
 		return (ret << 8) | 21;
 	}

-	// カーネル関数を抽出する
-	wsprintfA(buf, "method%d", OpenCL_method & 7);
-	OpenCL_kernel = fn_clCreateKernel(program, buf, &ret);
-	if (ret != CL_SUCCESS)
-		return (ret << 8) | 22;
+	// select the calculation method
+	if ((((cpu_flag & 0x101) == 1) || ((cpu_flag & 0x110) == 0x10)) && (sse_unit == 32)){
+		int select_method;	// with SSSE3 & ALTMAP or AVX2, use the kernels that support rearranged data
+		if (OpenCL_method & 0x80000){	// 16-byte and 2 blocks
+			select_method = 12;
+		} else if (OpenCL_method & 0x40000){	// 4-byte and 2 blocks
+			select_method = 10;
+		} else if (OpenCL_method & 0x20000){	// 16-byte
+			select_method = 4;
+		} else if (OpenCL_method & 0x10000){	// 4-byte
+			select_method = 2;
+		} else {	// create kernels and inspect their details
+			size_t item2, item4, item10, item12;
+			cl_kernel kernel2, kernel4, kernel10, kernel12;
+			item2 = item4 = item10 = item12 = 0;
+			// check the heaviest and fastest one first
+			wsprintfA(buf, "method%d", 12);
+			kernel12 = fn_clCreateKernel(program, buf, &ret);
+			if (ret == CL_SUCCESS){
+				ret = fn_clGetKernelWorkGroupInfo(kernel12, selected_device, CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, sizeof(size_t), &item12, NULL);
+				if (ret == CL_SUCCESS){
 #ifdef DEBUG_OUTPUT
-	printf("CreateKernel : %s\n", buf);
+					printf("\nTesting %s\n", buf);
+					printf("PREFERRED_WORK_GROUP_SIZE_MULTIPLE = %zu\n", item12);
 #endif
+				}
+			}
+			if (item12 >= 32){	// 32 or more is assumed to run comfortably
+				select_method = 12;
+				OpenCL_kernel = kernel12;
+#ifdef DEBUG_OUTPUT
+				printf("\nSelected method%d\n", select_method);
+#endif
+			} else {	// compare with the other kernels
+				wsprintfA(buf, "method%d", 2);
+				kernel2 = fn_clCreateKernel(program, buf, &ret);
+				if (ret == CL_SUCCESS){
+					ret = fn_clGetKernelWorkGroupInfo(kernel2, selected_device, CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, sizeof(size_t), &item2, NULL);
+					if (ret == CL_SUCCESS){
+#ifdef DEBUG_OUTPUT
+						printf("\nTesting %s\n", buf);
+						printf("PREFERRED_WORK_GROUP_SIZE_MULTIPLE = %zu\n", item2);
+#endif
+					}
+				}
+				if (item12 >= item2){
+					select_method = 12;
+					OpenCL_kernel = kernel12;
+					ret = fn_clReleaseKernel(kernel2);
+#ifdef DEBUG_OUTPUT
+					if (ret != CL_SUCCESS)
+						printf("clReleaseKernel : Failed\n");
+					printf("\nSelected method%d\n", select_method);
+#endif
+				} else {
+					ret = fn_clReleaseKernel(kernel12);
+#ifdef DEBUG_OUTPUT
+					if (ret != CL_SUCCESS)
+						printf("clReleaseKernel : Failed\n");
+#endif
+					wsprintfA(buf, "method%d", 10);
+					kernel10 = fn_clCreateKernel(program, buf, &ret);
+					if (ret == CL_SUCCESS){
+						ret = fn_clGetKernelWorkGroupInfo(kernel10, selected_device, CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, sizeof(size_t), &item10, NULL);
+						if (ret == CL_SUCCESS){
+#ifdef DEBUG_OUTPUT
+							printf("\nTesting %s\n", buf);
+							printf("PREFERRED_WORK_GROUP_SIZE_MULTIPLE = %zu\n", item10);
+#endif
+						}
+					}
+					if (item10 >= item2){
+						select_method = 10;
+						OpenCL_kernel = kernel10;
+						ret = fn_clReleaseKernel(kernel2);
+#ifdef DEBUG_OUTPUT
+						if (ret != CL_SUCCESS)
+							printf("clReleaseKernel : Failed\n");
+						printf("\nSelected method%d\n", select_method);
+#endif
+					} else {
+						wsprintfA(buf, "method%d", 4);
+						kernel4 = fn_clCreateKernel(program, buf, &ret);
+						if (ret == CL_SUCCESS){
+							ret = fn_clGetKernelWorkGroupInfo(kernel4, selected_device, CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, sizeof(size_t), &item4, NULL);
+							if (ret == CL_SUCCESS){
+#ifdef DEBUG_OUTPUT
+								printf("\nTesting %s\n", buf);
+								printf("PREFERRED_WORK_GROUP_SIZE_MULTIPLE = %zu\n", item4);
+#endif
+							}
+						}
+						if (item4 >= item2){
+							select_method = 4;
+							OpenCL_kernel = kernel4;
+							ret = fn_clReleaseKernel(kernel2);
+#ifdef DEBUG_OUTPUT
+							if (ret != CL_SUCCESS)
+								printf("clReleaseKernel : Failed\n");
+							printf("\nSelected method%d\n", select_method);
+#endif
+						} else {
+							select_method = 2;
+							OpenCL_kernel = kernel2;
+							ret = fn_clReleaseKernel(kernel4);
+#ifdef DEBUG_OUTPUT
+							if (ret != CL_SUCCESS)
+								printf("clReleaseKernel : Failed\n");
+							printf("\nSelected method%d\n", select_method);
+#endif
+						}
+					}
+				}
+			}
+		}
+		OpenCL_method |= select_method;
+	} else if (((cpu_flag & 128) != 0) && (sse_unit == 256)){
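+		OpenCL_method |= 16;	// for JIT (SSE2), data is arranged bit by bit from the top, 16 bytes each
+		// note that the local table size differs
+		// all methods except XOR use 2 KB (4 bytes * 256 entries * 2 tables)
+		// XOR (JIT) uses 64 bytes (4 bytes * 16 entries)
+	} else {
+		int select_method;	// for data that has not been rearranged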
+		if (OpenCL_method & 0x40000){	// 4-byte and 2 blocks
+			select_method = 9;
+		} else if (OpenCL_method & 0x10000){	// 4-byte
+			select_method = 1;
+		} else {	// create kernels and inspect their details
+			size_t item1, item9;
+			cl_kernel kernel1, kernel9;
+			item1 = item9 = 0;
+			// check the heaviest and fastest one first
+			wsprintfA(buf, "method%d", 9);
+			kernel9 = fn_clCreateKernel(program, buf, &ret);
+			if (ret == CL_SUCCESS){
+				ret = fn_clGetKernelWorkGroupInfo(kernel9, selected_device, CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, sizeof(size_t), &item9, NULL);
+				if (ret == CL_SUCCESS){
+#ifdef DEBUG_OUTPUT
+					printf("\nTesting %s\n", buf);
+					printf("PREFERRED_WORK_GROUP_SIZE_MULTIPLE = %zu\n", item9);
+#endif
+				}
+			}
+			if (item9 >= 32){	// 32 or more is assumed to run comfortably
+				select_method = 9;
+				OpenCL_kernel = kernel9;
+#ifdef DEBUG_OUTPUT
+				printf("\nSelected method%d\n", select_method);
+#endif
+			} else {	// compare with the other kernel
+				wsprintfA(buf, "method%d", 1);
+				kernel1 = fn_clCreateKernel(program, buf, &ret);
+				if (ret == CL_SUCCESS){
+					ret = fn_clGetKernelWorkGroupInfo(kernel1, selected_device, CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, sizeof(size_t), &item1, NULL);
+					if (ret == CL_SUCCESS){
+#ifdef DEBUG_OUTPUT
+						printf("\nTesting %s\n", buf);
+						printf("PREFERRED_WORK_GROUP_SIZE_MULTIPLE = %zu\n", item1);
+#endif
+					}
+				}
+				if (item9 >= item1){
+					select_method = 9;
+					OpenCL_kernel = kernel9;
+					ret = fn_clReleaseKernel(kernel1);
+#ifdef DEBUG_OUTPUT
+					if (ret != CL_SUCCESS)
+						printf("clReleaseKernel : Failed\n");
+					printf("\nSelected method%d\n", select_method);
+#endif
+				} else {
+					select_method = 1;
+					OpenCL_kernel = kernel1;
+					ret = fn_clReleaseKernel(kernel9);
+#ifdef DEBUG_OUTPUT
+					if (ret != CL_SUCCESS)
+						printf("clReleaseKernel : Failed\n");
+					printf("\nSelected method%d\n", select_method);
+#endif
+				}
+			}
+		}
+		OpenCL_method |= select_method;
+	}
+
+	// obtain the kernel function
+	if (OpenCL_kernel == NULL){
+		wsprintfA(buf, "method%d", OpenCL_method & 31);
+		OpenCL_kernel = fn_clCreateKernel(program, buf, &ret);
+		if (ret != CL_SUCCESS)
+			return (ret << 8) | 22;
+#ifdef DEBUG_OUTPUT
+		printf("CreateKernel : %s\n", buf);
+		ret = fn_clGetKernelWorkGroupInfo(OpenCL_kernel, selected_device, CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, sizeof(size_t), &data_size, NULL);
+		if (ret == CL_SUCCESS)
+			printf("PREFERRED_WORK_GROUP_SIZE_MULTIPLE = %zu\n", data_size);
+#endif
+	}

 	// check how many work items the kernel can run
-	ret = fn_clGetKernelWorkGroupInfo(OpenCL_kernel, NULL, CL_KERNEL_WORK_GROUP_SIZE, sizeof(size_t), &data_size, NULL);
-	if ((ret == CL_SUCCESS) && (data_size < 256)){	// 最低でも 256以上は必要
+	ret = fn_clGetKernelWorkGroupInfo(OpenCL_kernel, selected_device, CL_KERNEL_WORK_GROUP_SIZE, sizeof(size_t), &data_size, NULL);
+	if ((ret == CL_SUCCESS) && (data_size < 256)){	// at least 256 is required
 #ifdef DEBUG_OUTPUT
 		printf("KERNEL_WORK_GROUP_SIZE = %zd\n", data_size);
 #endif
@@ -550,6 +689,60 @@ int init_OpenCL(int unit_size, int chunk_size, int *src_max)
 		fn_clUnloadCompiler();
 	}

+	// reduce the number of work groups if it is larger than necessary
+	if (OpenCL_method & 4){
+		// each work item processes 32 bytes; with 256 of them, each work group handles 8 KB
+		data_size = unit_size / 8192;
+	} else if (OpenCL_method & 2){
+		// each work item processes 8 bytes; with 256 of them, each work group handles 2 KB
+		data_size = unit_size / 2048;
+	} else {
+		// each work item processes 4 bytes; with 256 of them, each work group handles 1 KB
+		data_size = unit_size / 1024;
+	}
+	if (OpenCL_group_num > data_size){
+		OpenCL_group_num = data_size;
+		printf("Number of work groups is reduced to %zd\n", OpenCL_group_num);
+	}
+
+	// change how the data is accessed depending on the device
+	if (OpenCL_method & 0x200000){
+		OpenCL_method |= 32;
+	} else if ((OpenCL_method & 0x100000) == 0){
+		if (unified_memory){
+			OpenCL_method |= 32;	// use CL_MEM_USE_HOST_PTR for an integrated GPU
+		} else {	// among discrete GPUs, Nvidia ones are handled differently
+			ret = fn_clGetDeviceInfo(selected_device, CL_DEVICE_VERSION, sizeof(buf), buf, NULL);
+			if (ret == CL_SUCCESS){
+				if (strstr(buf, "CUDA") != NULL)
+					OpenCL_method |= 32;	// use CL_MEM_USE_HOST_PTR for an NVIDIA GPU
+			}
+		}
+	}
+
+	// the destination only needs room for one block
+	// using CL_MEM_ALLOC_HOST_PTR reportedly gives pinned memory
+	data_size = unit_size;
+	if (OpenCL_method & 8)
+		data_size *= 2;	// allocate twice as much so that 2 blocks can be computed at a time
+	OpenCL_dst = gfn_clCreateBuffer(OpenCL_context, CL_MEM_ALLOC_HOST_PTR, data_size, NULL, &ret);
+	if (ret != CL_SUCCESS)
+		return (ret << 8) | 13;
+#ifdef DEBUG_OUTPUT
+	printf("dst buf : %zd KB (%zd Bytes), OK\n", data_size >> 10, data_size);
+#endif
+
+	// factors: room for the maximum count (src_max entries)
+	data_size = sizeof(unsigned short) * (*src_max);
+	if (OpenCL_method & 8)
+		data_size *= 2;	// allocate twice as much so that 2 blocks can be computed at a time
+	OpenCL_buf = gfn_clCreateBuffer(OpenCL_context, CL_MEM_READ_ONLY, data_size, NULL, &ret);
+	if (ret != CL_SUCCESS)
+		return (ret << 8) | 14;
+#ifdef DEBUG_OUTPUT
+	printf("factor buf : %zd Bytes (%d factors), OK\n", data_size, (*src_max));
+#endif
+
 	// set the kernel arguments
 	ret = gfn_clSetKernelArg(OpenCL_kernel, 1, sizeof(cl_mem), &OpenCL_dst);
 	if (ret != CL_SUCCESS)
@@ -557,13 +750,12 @@ int init_OpenCL(int unit_size, int chunk_size, int *src_max)
 	ret = gfn_clSetKernelArg(OpenCL_kernel, 2, sizeof(cl_mem), &OpenCL_buf);
 	if (ret != CL_SUCCESS)
 		return (ret << 8) | 102;
-	if (ret != CL_SUCCESS)
-		return (ret << 8) | 103;

 #ifdef DEBUG_OUTPUT
 	// number of work items
 	printf("\nMax number of work items = %zd (256 * %zd)\n", OpenCL_group_num * 256, OpenCL_group_num);
 #endif
+	OpenCL_method &= 0xFF;	// finally, clear the selection flags

 	return 0;
 }
@@ -675,16 +867,24 @@ void info_OpenCL(char *buf, int buf_size)
 // copy the source blocks to the device
 int gpu_copy_blocks(
 	unsigned char *data,	// buffer of blocks (must be aligned to 4096)
-	int unit_size,			// must be a multiple of 4096
+	unsigned int unit_size,	// must be a multiple of 4096
 	int src_num)			// number of blocks to copy
 {
 	size_t data_size;
 	cl_int ret;
+	cl_mem_flags flags;

 	// regardless of integrated or discrete GPU, allocate only the memory that is needed each time
 	data_size = (size_t)unit_size * src_num;
-	// an Intel GPU can do zero-copy; on a GeForce GPU too, memory use is lower and the copy is faster
-	OpenCL_src = gfn_clCreateBuffer(OpenCL_context, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR, data_size, data, &ret);
+	if (OpenCL_method & 32){	// zero-copy on an AMD APU or an integrated GPU
+		// in actual comparisons, a GeForce GPU also uses less memory and copies faster
+		// NVIDIA GPUs reportedly cache the data in VRAM even with CL_MEM_USE_HOST_PTR, so it is fast
+		flags = CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR;
+	} else {	// for a discrete GPU, copy the data to VRAM
+		// an AMD GPU must be told explicitly to copy
+		flags = CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR;
+	}
+	OpenCL_src = gfn_clCreateBuffer(OpenCL_context, flags, data_size, data, &ret);
 	if (ret != CL_SUCCESS)
 		return (ret << 8) | 1;
 #ifdef DEBUG_OUTPUT
@@ -700,62 +900,67 @@ int gpu_copy_blocks(
 }

 // multiply the source blocks
-int gpu_multiply_chunks(
+int gpu_multiply_blocks(
 	int src_num,			// Number of multiplying source blocks
 	unsigned short *mat,	// Matrix of numbers to multiply by
+	unsigned short *mat2,	// Set to calculate 2 blocks at once
 	unsigned char *buf,		// Products go here
-	int offset,				// Offset in each block
-	int length)				// Byte length
+	unsigned int len)		// Byte length
 {
 	unsigned __int64 *vram, *src, *dst;
 	size_t global_size, local_size;
 	cl_int ret;

 	// write the array of factors to the device
-	ret = gfn_clEnqueueWriteBuffer(OpenCL_command, OpenCL_buf, CL_FALSE, 0, sizeof(short) * src_num, mat, 0, NULL, NULL);
+	if (mat2 == NULL){	// copy the factors for 1 block only
+		ret = gfn_clEnqueueWriteBuffer(OpenCL_command, OpenCL_buf, CL_FALSE, 0, sizeof(short) * src_num, mat, 0, NULL, NULL);
+	} else {	// when computing 2 blocks at a time, the array is twice as large
+		if ((size_t)mat2 == 1){	// a real address can never be 1, so this value works as a marker
+			ret = gfn_clEnqueueWriteBuffer(OpenCL_command, OpenCL_buf, CL_FALSE, 0, sizeof(short) * src_num * 2, mat, 0, NULL, NULL);
+		} else {	// copy twice
+			size_t data_size = sizeof(short) * src_num;
+			ret = gfn_clEnqueueWriteBuffer(OpenCL_command, OpenCL_buf, CL_FALSE, 0, data_size, mat, 0, NULL, NULL);
+			if (ret != CL_SUCCESS)
+				return (ret << 8) | 10;
+			// the second array is copied from a different location
+			ret = gfn_clEnqueueWriteBuffer(OpenCL_command, OpenCL_buf, CL_FALSE, data_size, data_size, mat2, 0, NULL, NULL);
+		}
+	}
 	if (ret != CL_SUCCESS)
-		return (ret << 8) | 10;
+		return (ret << 8) | 11;

 	// set the kernel argument
 	ret = gfn_clSetKernelArg(OpenCL_kernel, 3, sizeof(int), &src_num);
 	if (ret != CL_SUCCESS)
 		return (ret << 8) | 103;
-	offset /= 4;	// 4バイト整数単位にする
-	ret = gfn_clSetKernelArg(OpenCL_kernel, 4, sizeof(int), &offset);
-	if (ret != CL_SUCCESS)
-		return (ret << 8) | 104;
-	length /= 4;	// 4バイト整数単位にする
-	ret = gfn_clSetKernelArg(OpenCL_kernel, 5, sizeof(int), &length);
-	if (ret != CL_SUCCESS)
-		return (ret << 8) | 105;

 	// run the kernel in parallel
-	local_size = 256;	// テーブルやキャッシュのため、work item 数は 256に固定する
+	local_size = 256;	// the number of work items is fixed at 256 for the table and cache
 	global_size = OpenCL_group_num * 256;
-	//printf("group num = %d, global size = %d, local size = 256 \n", OpenCL_group_num, global_size);
+	//printf("group num = %d, global size = %d, local size = %d \n", OpenCL_group_num, global_size, local_size);
 	ret = gfn_clEnqueueNDRangeKernel(OpenCL_command, OpenCL_kernel, 1, NULL, &global_size, &local_size, 0, NULL, NULL);
 	if (ret != CL_SUCCESS)
-		return (ret << 8) | 11;
+		return (ret << 8) | 12;

 	// make the output visible on the host side
-	vram = gfn_clEnqueueMapBuffer(OpenCL_command, OpenCL_dst, CL_TRUE, CL_MAP_READ, 0, length * 4, 0, NULL, NULL, &ret);
+	vram = gfn_clEnqueueMapBuffer(OpenCL_command, OpenCL_dst, CL_TRUE, CL_MAP_READ, 0, len, 0, NULL, NULL, &ret);
 	if (ret != CL_SUCCESS)
-		return (ret << 8) | 12;
+		return (ret << 8) | 13;

 	// XOR 8 bytes at a time (doing the XOR with SSE2 was not faster)
 	src = vram;
 	dst = (unsigned __int64 *)buf;
-	while (length > 0){
+	while (len > 0){
 		*dst ^= *src;
 		dst++;
 		src++;
-		length -= 2;
+		len -= 8;
 	}

 	// even if the host does not modify the data, this must be called as a pair with clEnqueueMapBuffer
 	ret = gfn_clEnqueueUnmapMemObject(OpenCL_command, OpenCL_dst, vram, 0, NULL, NULL);
 	if (ret != CL_SUCCESS)
-		return (ret << 8) | 13;
+		return (ret << 8) | 14;

 	return 0;
 }
@@ -768,12 +973,12 @@ int gpu_finish(void)
 	// wait for all processing to finish
 	ret = gfn_clFinish(OpenCL_command);
 	if (ret != CL_SUCCESS)
-		return (ret << 8) | 20;
+		return (ret << 8) | 30;

 	if (OpenCL_src != NULL){	// release it if it is allocated
 		ret = gfn_clReleaseMemObject(OpenCL_src);
 		if (ret != CL_SUCCESS)
-			return (ret << 8) | 21;
+			return (ret << 8) | 31;
 		OpenCL_src = NULL;
 	}

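The device selection in `init_OpenCL` above ranks each device by `MAX_COMPUTE_UNITS * MAX_WORK_GROUP_SIZE`, doubles the value for a discrete GPU, and flips the sign when flag `0x200` asks for the slower device. The following is only a minimal sketch of that arithmetic; the helper names `device_score` and `initial_best` are hypothetical and do not exist in `lib_opencl.c`.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper (not part of lib_opencl.c).
 * unified_memory: 1 = integrated GPU, 0 = discrete GPU.
 * prefer_slower : non-zero when flag 0x200 is set. */
static int device_score(size_t max_work_group_size, unsigned int compute_units,
                        int unified_memory, int prefer_slower)
{
    int score;

    assert(unified_memory == 0 || unified_memory == 1);
    /* A discrete GPU (unified_memory == 0) gets twice the weight. */
    score = (2 - unified_memory) * (int)max_work_group_size * (int)compute_units;
    if (prefer_slower)
        score = -score; /* with the sign flipped, the smallest device wins */
    return score;
}

/* Starting value for the "best so far" comparison (gpu_power in the diff). */
static int initial_best(int prefer_slower)
{
    return prefer_slower ? INT_MIN : 0;
}
```

A device replaces the current choice only when its score is greater than the best so far, which is why the starting value has to be `INT_MIN` once scores are negative.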
--- a/source/par2j/lib_opencl.h
+++ b/source/par2j/lib_opencl.h
@@ -10,21 +10,21 @@ extern "C" {

 extern int OpenCL_method;

-int init_OpenCL(int unit_size, int chunk_size, int *src_max);
+int init_OpenCL(unsigned int unit_size, int *src_max);
 int free_OpenCL(void);
 void info_OpenCL(char *buf, int buf_size);

 int gpu_copy_blocks(
 	unsigned char *data,
-	int unit_size,
+	unsigned int unit_size,
 	int src_num);

-int gpu_multiply_chunks(
+int gpu_multiply_blocks(
 	int src_num,			// Number of multiplying source blocks
 	unsigned short *mat,	// Matrix of numbers to multiply by
+	unsigned short *mat2,	// Set to calculate 2 blocks at once
 	unsigned char *buf,		// Products go here
-	int offset,				// Offset in each block
-	int length);			// Byte length
+	unsigned int len);		// Byte length

 int gpu_finish(void);

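In the new `gpu_multiply_blocks` signature, `len` is a byte length, and the host-side loop that merges the mapped result into `buf` now subtracts 8 per 64-bit word. The sketch below shows that accumulation in isolation, assuming `len` is a multiple of 8 (the diff requires `unit_size` to be a multiple of 4096). The function name `xor_accumulate` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-alone version of the XOR loop in gpu_multiply_blocks.
 * dst: host buffer that receives the products
 * src: data read back from the device
 * len: byte length, assumed to be a multiple of 8 */
static void xor_accumulate(uint64_t *dst, const uint64_t *src, size_t len)
{
    assert(len % 8 == 0); /* otherwise the unsigned counter would wrap */
    while (len > 0) {
        *dst ^= *src; /* fold 8 bytes of the result into the output */
        dst++;
        src++;
        len -= 8;
    }
}
```

Keeping the counter in bytes avoids the earlier unit conversion, where the length was divided by 4 before being passed to the kernel and the loop counted in 4-byte units.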
--- a/source/par2j/list.c
+++ b/source/par2j/list.c
@@ -1,5 +1,5 @@
 // list.c
-// Copyright : 2022-10-14 Yutaka Sawada
+// Copyright : 2023-12-12 Yutaka Sawada
 // License : GPL

 #ifndef _UNICODE
@@ -26,6 +26,11 @@

 //#define TIMER // for experiments

+#ifdef TIMER
+#include <time.h>
+static double time_sec, time_speed;
+#endif
+
 // check the hash values of the files in the recovery set (empty files are excluded)
 // 0x00 = the file exists and is complete
 // 0x01 = the file does not exist
@@ -296,7 +301,7 @@ int check_file_complete(
 {
 	int i, rv;
 #ifdef TIMER
-unsigned int time_start = GetTickCount();
+clock_t time_start = clock();
 #endif

 	printf("\nVerifying Input File   :\n");
@@ -332,14 +337,14 @@ unsigned int time_start = GetTickCount();
 	}

 #ifdef TIMER
-time_start = GetTickCount() - time_start;
-printf("\n hash %d.%03d sec", time_start / 1000, time_start % 1000);
-if (time_start > 0){
-	time_start = (int)((total_file_size * 125) / ((__int64)time_start * 131072));
-	printf(", %d MB/s\n", time_start);
+time_start = clock() - time_start;
+time_sec = (double)time_start / CLOCKS_PER_SEC;
+if (time_sec > 0){
+	time_speed = (double)total_file_size / (time_sec * 1048576);
 } else {
-	printf("\n");
+	time_speed = 0;
 }
+printf("\n hash %.3f sec, %.0f MB/s\n", time_sec, time_speed);
 #endif
 	return 0;
 }
@@ -348,7 +353,7 @@ if (time_start > 0){
 // verify multiple files simultaneously on an SSD

 // set to about twice MAX_MULTI_READ?
-#define MAX_READ_NUM 10
+#define MAX_READ_NUM 12

 int check_file_complete_multi(
 	char *ascii_buf,
@@ -364,17 +369,15 @@ int check_file_complete_multi(
 	HANDLE hSub[MAX_READ_NUM];
 	FILE_CHECK_TH th[MAX_READ_NUM];
 #ifdef TIMER
-unsigned int time_start = GetTickCount();
+clock_t time_start = clock();
 #endif

 	memset(hSub, 0, sizeof(HANDLE) * MAX_READ_NUM);
 	// increase the number of threads according to the number of cores
 	if ((memory_use & 32) != 0){	// NVMe SSD
-		if (cpu_num >= 8){	// 8 ~ 16 Cores
-			multi_read = 4;
-		} else {	// 3 Cores + Hyper-threading, or 4 ~ 7 Cores
-			multi_read = 3;
-		}
+		multi_read = (cpu_num + 2) / 3 + 1;	// 3=2, 4~6=3, 7~9=4, 10~12=5, 13~=6
+		if (multi_read > MAX_READ_NUM / 2)
+			multi_read = MAX_READ_NUM / 2;
 	} else {	// SATA SSD
 		multi_read = 2;
 	}
@@ -632,14 +635,14 @@ unsigned int time_start = GetTickCount();
 	}

 #ifdef TIMER
-time_start = GetTickCount() - time_start;
-printf("\n hash %d.%03d sec", time_start / 1000, time_start % 1000);
-if (time_start > 0){
-	time_start = (int)((total_file_size * 125) / ((__int64)time_start * 131072));
-	printf(", %d MB/s\n", time_start);
+time_start = clock() - time_start;
+time_sec = (double)time_start / CLOCKS_PER_SEC;
+if (time_sec > 0){
+	time_speed = (double)total_file_size / (time_sec * 1048576);
 } else {
-	printf("\n");
+	time_speed = 0;
 }
+printf("\n hash %.3f sec, %.0f MB/s\n", time_sec, time_speed);
 #endif

 error_end:
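The NVMe branch of `check_file_complete_multi` above replaces the fixed 3-or-4 thread choice with a formula that grows with the core count and is capped at half of `MAX_READ_NUM`. The sketch below reproduces only that formula so the mapping in the source comment (`3=2, 4~6=3, 7~9=4, 10~12=5, 13~=6`) can be checked. The function name `nvme_read_threads` is hypothetical.

```c
#include <assert.h>

#define MAX_READ_NUM 12 /* value after this change */

/* Hypothetical helper: number of simultaneous file reads on an NVMe SSD. */
static int nvme_read_threads(int cpu_num)
{
    int multi_read;

    assert(cpu_num > 0);
    multi_read = (cpu_num + 2) / 3 + 1; /* one more reader per 3 cores */
    if (multi_read > MAX_READ_NUM / 2)
        multi_read = MAX_READ_NUM / 2;  /* never more than 6 */
    return multi_read;
}
```

With integer division, 16 cores would give 7 before the cap, so the limit of `MAX_READ_NUM / 2` is what holds large CPUs at 6 readers.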
--- a/source/par2j/md5_crc.c
+++ b/source/par2j/md5_crc.c
@@ -1,5 +1,5 @@
 // md5_crc.c
-// Copyright : 2023-08-28 Yutaka Sawada
+// Copyright : 2023-12-12 Yutaka Sawada
 // License : GPL

 #ifndef _UNICODE
@@ -21,7 +21,6 @@
 #include "phmd5.h"
 #include "md5_crc.h"

-
 /* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */

 // compute the MD5 hash of a byte array
@@ -200,10 +199,14 @@ int file_md5_crc32_block(
 //#define TIMER // for experiments

 #ifdef TIMER
-static unsigned int time_start, time1_start;
-static unsigned int time_total = 0, time2_total = 0, time3_total = 0;
+#include <time.h>
+static double time_sec, time_speed;
+static clock_t time_start, time1_start;
+static clock_t time_total = 0, time2_total = 0, time3_total = 0;
 #endif

+#define MAX_BUF_SIZE	2097152	// maximum size when the heap is used
+
 // compute the file hash and the checksum of each slice at the same time
 int file_hash_crc(
 	wchar_t *file_name,			// file whose hash is computed
@@ -222,7 +225,7 @@ int file_hash_crc(
 	HANDLE hFile;
 	OVERLAPPED ol;
 #ifdef TIMER
-time1_start = GetTickCount();
+time1_start = clock();
 #endif

 	// open the source file
@@ -249,11 +252,11 @@ time1_start = GetTickCount();
 	if (file_left < IO_SIZE)
 		read_size = (unsigned int)file_left;
 #ifdef TIMER
-time_start = GetTickCount();
+time_start = clock();
 #endif
 	off = ReadFile(hFile, buf1, read_size, NULL, &ol);
 #ifdef TIMER
-time2_total += GetTickCount() - time_start;
+time2_total += clock() - time_start;
 #endif
 	if ((off == 0) && (GetLastError() != ERROR_IO_PENDING)){
 		print_win32_err();
@@ -279,11 +282,11 @@ time2_total += GetTickCount() - time_start;
 			ol.OffsetHigh = (unsigned int)(file_off >> 32);
 			file_off += IO_SIZE;
 #ifdef TIMER
-time_start = GetTickCount();
+time_start = clock();
 #endif
 			off = ReadFile(hFile, buf, read_size, NULL, &ol);
 #ifdef TIMER
-time2_total += GetTickCount() - time_start;
+time2_total += clock() - time_start;
 #endif
 			if ((off == 0) && (GetLastError() != ERROR_IO_PENDING)){
 				print_win32_err();
@@ -299,7 +302,7 @@ time2_total += GetTickCount() - time_start;
 		}

 #ifdef TIMER
-time_start = GetTickCount();
+time_start = clock();
 #endif
 		off = 0;	// checksum calculation
 		if (block_left > 0){	// add the part that was missing last time
@@ -336,7 +339,7 @@ time_start = GetTickCount();
 			}
 		}
 #ifdef TIMER
-time3_total += GetTickCount() - time_start;
+time3_total += clock() - time_start;
 #endif

 		// show progress
@@ -367,16 +370,17 @@ error_end:
 		CloseHandle(ol.hEvent);

 #ifdef TIMER
-time_total += GetTickCount() - time1_start;
+time_total += clock() - time1_start;
 if (*prog_now == total_file_size){
-	printf("\nread  %d.%03d sec\n", time2_total / 1000, time2_total % 1000);
-	printf("main  %d.%03d sec\n", time3_total / 1000, time3_total % 1000);
-	if (time_total > 0){
-		time_start = (int)((total_file_size * 125) / ((__int64)time_total * 131072));
+	printf("\nread  %.3f sec\n", (double)time2_total / CLOCKS_PER_SEC);
+	printf("main  %.3f sec\n", (double)time3_total / CLOCKS_PER_SEC);
+	time_sec = (double)time_total / CLOCKS_PER_SEC;
+	if (time_sec > 0){
+		time_speed = (double)total_file_size / (time_sec * 1048576);
 	} else {
-		time_start = 0;
+		time_speed = 0;
 	}
-	printf("total %d.%03d sec, %d MB/s\n", time_total / 1000, time_total % 1000, time_start);
+	printf("total %.3f sec, %.0f MB/s\n", time_sec, time_speed);
 }
 #endif
 	return err;
@@ -401,7 +405,7 @@ int file_hash_crc(
 	HANDLE hFile;
 	OVERLAPPED ol;
 #ifdef TIMER
-time1_start = GetTickCount();
+time1_start = clock();
 #endif

 	// open the source file
@@ -440,11 +444,11 @@ error_retry_read:
 	if (file_left < IO_SIZE)
 		read_size = (unsigned int)file_left;
 #ifdef TIMER
-time_start = GetTickCount();
+time_start = clock();
 #endif
 	off = ReadFile(hFile, buf1, read_size, NULL, &ol);
 #ifdef TIMER
-time2_total += GetTickCount() - time_start;
+time2_total += clock() - time_start;
 #endif
 	if ((off == 0) && (GetLastError() != ERROR_IO_PENDING)){
 		print_win32_err();
@@ -534,11 +538,11 @@ error_retry_pause:
 			ol.OffsetHigh = (unsigned int)(file_off >> 32);
 			file_off += IO_SIZE;
 #ifdef TIMER
-time_start = GetTickCount();
+time_start = clock();
 #endif
 			off = ReadFile(hFile, buf, read_size, NULL, &ol);
 #ifdef TIMER
-time2_total += GetTickCount() - time_start;
+time2_total += clock() - time_start;
 #endif
 			if ((off == 0) && (GetLastError() != ERROR_IO_PENDING)){
 				print_win32_err();
@@ -555,7 +559,7 @@ time2_total += GetTickCount() - time_start;
 		}

 #ifdef TIMER
-time_start = GetTickCount();
+time_start = clock();
 #endif
 		off = 0;	// checksum calculation
 		if (block_left > 0){	// add the part that was missing last time
@@ -592,7 +596,7 @@ time_start = GetTickCount();
 			}
 		}
 #ifdef TIMER
-time3_total += GetTickCount() - time_start;
+time3_total += clock() - time_start;
 #endif

 		// show progress
@@ -623,16 +627,17 @@ error_end:
 		CloseHandle(ol.hEvent);

 #ifdef TIMER
-time_total += GetTickCount() - time1_start;
+time_total += clock() - time1_start;
 if (*prog_now == total_file_size){
-	printf("\nread  %d.%03d sec\n", time2_total / 1000, time2_total % 1000);
-	printf("main  %d.%03d sec\n", time3_total / 1000, time3_total % 1000);
-	if (time_total > 0){
-		time_start = (int)((total_file_size * 125) / ((__int64)time_total * 131072));
+	printf("\nread  %.3f sec\n", (double)time2_total / CLOCKS_PER_SEC);
+	printf("main  %.3f sec\n", (double)time3_total / CLOCKS_PER_SEC);
+	time_sec = (double)time_total / CLOCKS_PER_SEC;
+	if (time_sec > 0){
+		time_speed = (double)total_file_size / (time_sec * 1048576);
 	} else {
-		time_start = 0;
+		time_speed = 0;
 	}
-	printf("total %d.%03d sec, %d MB/s\n", time_total / 1000, time_total % 1000, time_start);
+	printf("total %.3f sec, %.0f MB/s\n", time_sec, time_speed);
 }
 #endif
 	return err;
@@ -658,7 +663,7 @@ int file_hash_crc(
 	HANDLE hFile;
 	OVERLAPPED ol;
 #ifdef TIMER
-time1_start = GetTickCount();
+time1_start = clock();
 #endif

 	// open the source file
@@ -671,7 +676,7 @@ time1_start = GetTickCount();
 	}

 	// the buffer is large, so use the heap
-	for (io_size = IO_SIZE; io_size < 1048576; io_size += IO_SIZE){	// 1 MB までにする
+	for (io_size = IO_SIZE; io_size <= MAX_BUF_SIZE; io_size += IO_SIZE){	// keep it a multiple of IO_SIZE
 		if ((io_size + IO_SIZE > (cpu_cache & 0xFFFE0000)) || ((__int64)(io_size + IO_SIZE) * 4 > file_left))
 			break;
 	}
@@ -697,11 +702,11 @@ time1_start = GetTickCount();
 	if (file_left < io_size)
 		read_size = (unsigned int)file_left;
 #ifdef TIMER
-time_start = GetTickCount();
+time_start = clock();
 #endif
 	off = ReadFile(hFile, buf1, read_size, NULL, &ol);
 #ifdef TIMER
-time2_total += GetTickCount() - time_start;
+time2_total += clock() - time_start;
 #endif
 	if ((off == 0) && (GetLastError() != ERROR_IO_PENDING)){
 		print_win32_err();
@@ -727,11 +732,11 @@ time2_total += GetTickCount() - time_start;
 			ol.OffsetHigh = (unsigned int)(file_off >> 32);
 			file_off += io_size;
 #ifdef TIMER
-time_start = GetTickCount();
+time_start = clock();
 #endif
 			off = ReadFile(hFile, buf, read_size, NULL, &ol);
 #ifdef TIMER
-time2_total += GetTickCount() - time_start;
+time2_total += clock() - time_start;
 #endif
 			if ((off == 0) && (GetLastError() != ERROR_IO_PENDING)){
 				print_win32_err();
@@ -747,7 +752,7 @@ time2_total += GetTickCount() - time_start;
 		}

 #ifdef TIMER
-time_start = GetTickCount();
+time_start = clock();
 #endif
 		off = 0;	// checksum calculation
 		if (block_left > 0){	// add the part that was missing last time
@@ -784,7 +789,7 @@ time_start = GetTickCount();
 			}
 		}
 #ifdef TIMER
-time3_total += GetTickCount() - time_start;
+time3_total += clock() - time_start;
 #endif

 		// Show progress
@@ -817,16 +822,17 @@ error_end:
 		_aligned_free(buf1);

 #ifdef TIMER
-time_total += GetTickCount() - time1_start;
+time_total += clock() - time1_start;
 if (*prog_now == total_file_size){
-	printf("\nread  %d.%03d sec\n", time2_total / 1000, time2_total % 1000);
-	printf("main  %d.%03d sec\n", time3_total / 1000, time3_total % 1000);
-	if (time_total > 0){
-		time_start = (int)((total_file_size * 125) / ((__int64)time_total * 131072));
+	printf("\nread  %.3f sec\n", (double)time2_total / CLOCKS_PER_SEC);
+	printf("main  %.3f sec\n", (double)time3_total / CLOCKS_PER_SEC);
+	time_sec = (double)time_total / CLOCKS_PER_SEC;
+	if (time_sec > 0){
+		time_speed = (double)total_file_size / (time_sec * 1048576);
 	} else {
-		time_start = 0;
+		time_speed = 0;
 	}
-	printf("total %d.%03d sec, %d MB/s\n", time_total / 1000, time_total % 1000, time_start);
+	printf("total %.3f sec, %.0f MB/s\n", time_sec, time_speed);
 }
 #endif
 	return err;
@@ -866,7 +872,7 @@ DWORD WINAPI file_hash_crc2(LPVOID lpParameter)

 	// The buffer is large, so allocate it on the heap
 	prog_tick = 1;
-	for (io_size = IO_SIZE; io_size < 1048576; io_size += IO_SIZE){	// a multiple of IO_SIZE, up to 1 MB
+	for (io_size = IO_SIZE; io_size <= MAX_BUF_SIZE; io_size += IO_SIZE){	// keep it a multiple of IO_SIZE
 		if ((io_size + IO_SIZE > (cpu_cache & 0xFFFE0000)) || ((__int64)(io_size + IO_SIZE) * 4 > file_left))
 			break;
 		prog_tick++;
@@ -1036,7 +1042,7 @@ int file_hash_check(
 	PHMD5 hash_ctx, block_ctx;
 	OVERLAPPED ol;
 #ifdef TIMER
-time1_start = GetTickCount();
+time1_start = clock();
 #endif

 	prog_last = -1;	// show the name of the file being verified every time
@@ -1060,11 +1066,11 @@ time1_start = GetTickCount();
 		file_left = file_size - 16384;	// check only up to the original file size
 	}
 #ifdef TIMER
-time_start = GetTickCount();
+time_start = clock();
 #endif
 	off = ReadFile(hFile, buf, len, NULL, &ol);
 #ifdef TIMER
-time2_total += GetTickCount() - time_start;
+time2_total += clock() - time_start;
 #endif
 	if ((off == 0) && (GetLastError() != ERROR_IO_PENDING)){
 		print_win32_err();
@@ -1139,11 +1145,11 @@ time2_total += GetTickCount() - time_start;
 	if (file_left < IO_SIZE)
 		read_size = (unsigned int)file_left;
 #ifdef TIMER
-time_start = GetTickCount();
+time_start = clock();
 #endif
 	off = ReadFile(hFile, buf1, read_size, NULL, &ol);
 #ifdef TIMER
-time2_total += GetTickCount() - time_start;
+time2_total += clock() - time_start;
 #endif
 	if ((off == 0) && (GetLastError() != ERROR_IO_PENDING)){
 		print_win32_err();
@@ -1166,11 +1172,11 @@ time2_total += GetTickCount() - time_start;
 			ol.OffsetHigh = (unsigned int)(file_off >> 32);
 			file_off += IO_SIZE;
 #ifdef TIMER
-time_start = GetTickCount();
+time_start = clock();
 #endif
 			off = ReadFile(hFile, buf, read_size, NULL, &ol);
 #ifdef TIMER
-time2_total += GetTickCount() - time_start;
+time2_total += clock() - time_start;
 #endif
 			if ((off == 0) && (GetLastError() != ERROR_IO_PENDING)){
 				print_win32_err();
@@ -1185,7 +1191,7 @@ time2_total += GetTickCount() - time_start;
 		}

 #ifdef TIMER
-time_start = GetTickCount();
+time_start = clock();
 #endif
 		if (s_blk != NULL){
 			off = 0;
@@ -1228,7 +1234,7 @@ time_start = GetTickCount();
 			Phmd5Process(&hash_ctx, buf, len);	// MD5 calculation
 		}
 #ifdef TIMER
-time3_total += GetTickCount() - time_start;
+time3_total += clock() - time_start;
 #endif

 		// Show progress
@@ -1265,15 +1271,16 @@ error_end:
 		CloseHandle(ol.hEvent);

 #ifdef TIMER
-time_total += GetTickCount() - time1_start;
-	printf("\nread  %d.%03d sec\n", time2_total / 1000, time2_total % 1000);
-	printf("main  %d.%03d sec\n", time3_total / 1000, time3_total % 1000);
-	if (time_total > 0){
-		time_start = (int)((file_size * 125) / ((__int64)time_total * 131072));
+time_total += clock() - time1_start;
+	printf("\nread  %.3f sec\n", (double)time2_total / CLOCKS_PER_SEC);
+	printf("main  %.3f sec\n", (double)time3_total / CLOCKS_PER_SEC);
+	time_sec = (double)time_total / CLOCKS_PER_SEC;
+	if (time_sec > 0){
+		time_speed = (double)file_size / (time_sec * 1048576);
 	} else {
-		time_start = 0;
+		time_speed = 0;
 	}
-	printf("total %d.%03d sec, %d MB/s\n", time_total / 1000, time_total % 1000, time_start);
+	printf("total %.3f sec, %.0f MB/s\n", time_sec, time_speed);
 #endif
 	return comp_num;
 }
@@ -1303,7 +1310,7 @@ DWORD WINAPI file_hash_background(LPVOID lpParameter)
 	find_next = files[num].b_off;	// index of the first block

 	// The buffer is large, so allocate it on the heap
-	for (io_size = IO_SIZE; io_size < 1048576; io_size += IO_SIZE){	// a multiple of IO_SIZE, up to 1 MB
+	for (io_size = IO_SIZE; io_size <= MAX_BUF_SIZE; io_size += IO_SIZE){	// keep it a multiple of IO_SIZE
 		if ((io_size + IO_SIZE > (cpu_cache & 0xFFFE0000)) || ((__int64)(io_size + IO_SIZE) * 4 > file_size))
 			break;
 	}
@@ -1534,7 +1541,7 @@ int file_hash_direct(
 	HANDLE hFile;
 	OVERLAPPED ol;
 #ifdef TIMER
-time1_start = GetTickCount();
+time1_start = clock();
 #endif

 	prog_last = -1;	// show the name of the file being verified every time
@@ -1590,11 +1597,11 @@ time1_start = GetTickCount();
 		file_left = file_size - 16384;	// check only up to the original file size
 	}
 #ifdef TIMER
-time_start = GetTickCount();
+time_start = clock();
 #endif
 	off = ReadFile(hFile, buf, read_size, NULL, &ol);
 #ifdef TIMER
-time2_total += GetTickCount() - time_start;
+time2_total += clock() - time_start;
 #endif
 	if ((off == 0) && (GetLastError() != ERROR_IO_PENDING)){
 		comp_num = -1;
@@ -1677,11 +1684,11 @@ time2_total += GetTickCount() - time_start;
 		read_size = (read_size + 4095) & ~4095;	// round up to a multiple of 4 KB
 	}
 #ifdef TIMER
-time_start = GetTickCount();
+time_start = clock();
 #endif
 	off = ReadFile(hFile, buf1, read_size, NULL, &ol);
 #ifdef TIMER
-time2_total += GetTickCount() - time_start;
+time2_total += clock() - time_start;
 #endif
 	if ((off == 0) && (GetLastError() != ERROR_IO_PENDING)){
 		print_win32_err();
@@ -1708,11 +1715,11 @@ time2_total += GetTickCount() - time_start;
 			ol.OffsetHigh = (unsigned int)(file_off >> 32);
 			file_off += IO_SIZE;
 #ifdef TIMER
-time_start = GetTickCount();
+time_start = clock();
 #endif
 			off = ReadFile(hFile, buf, read_size, NULL, &ol);
 #ifdef TIMER
-time2_total += GetTickCount() - time_start;
+time2_total += clock() - time_start;
 #endif
 			if ((off == 0) && (GetLastError() != ERROR_IO_PENDING)){
 				print_win32_err();
@@ -1727,7 +1734,7 @@ time2_total += GetTickCount() - time_start;
 		}

 #ifdef TIMER
-time_start = GetTickCount();
+time_start = clock();
 #endif
 		if (s_blk != NULL){
 			off = 0;
@@ -1769,7 +1776,7 @@ time_start = GetTickCount();
 			Phmd5Process(&hash_ctx, buf, len);	// MD5 calculation
 		}
 #ifdef TIMER
-time3_total += GetTickCount() - time_start;
+time3_total += clock() - time_start;
 #endif

 		// Show progress
@@ -1810,10 +1817,16 @@ error_end:
 		_aligned_free(buf1);

 #ifdef TIMER
-time_total += GetTickCount() - time1_start;
-	printf("\nread  %d.%03d sec\n", time2_total / 1000, time2_total % 1000);
-	printf("main  %d.%03d sec\n", time3_total / 1000, time3_total % 1000);
-	printf("total %d.%03d sec\n", time_total / 1000, time_total % 1000);
+time_total += clock() - time1_start;
+	printf("\nread  %.3f sec\n", (double)time2_total / CLOCKS_PER_SEC);
+	printf("main  %.3f sec\n", (double)time3_total / CLOCKS_PER_SEC);
+	time_sec = (double)time_total / CLOCKS_PER_SEC;
+	if (time_sec > 0){
+		time_speed = (double)file_size / (time_sec * 1048576);
+	} else {
+		time_speed = 0;
+	}
+	printf("total %.3f sec, %.0f MB/s\n", time_sec, time_speed);
 #endif
 	return comp_num;
 }
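The timer hunks above replace the millisecond counter `GetTickCount()` with `clock()` and report throughput as a floating-point MB/s value instead of an integer. A minimal sketch of that arithmetic follows. The function names `ticks_to_seconds`, `speed_mb_per_sec` and `print_total` are hypothetical and do not appear in the patch; only the formulas and the `printf` format are taken from it.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Convert a clock() tick difference to seconds. */
double ticks_to_seconds(clock_t ticks)
{
	return (double)ticks / CLOCKS_PER_SEC;
}

/* Throughput in MB/s (1 MB = 1048576 bytes); 0 when no time has elapsed,
   which mirrors the "time_sec > 0" guard in the patch. */
double speed_mb_per_sec(long long bytes, double seconds)
{
	assert(bytes >= 0);
	if (seconds > 0)
		return (double)bytes / (seconds * 1048576);
	return 0;
}

/* Print one summary line in the format used by the patched TIMER output. */
void print_total(long long bytes, clock_t ticks)
{
	double sec = ticks_to_seconds(ticks);

	printf("total %.3f sec, %.0f MB/s\n", sec, speed_mb_per_sec(bytes, sec));
}
```

Note that `clock()` does not measure the same thing everywhere: Microsoft's C runtime reports elapsed wall-clock time, while POSIX systems report processor time, so a port of this timer to another platform would likely need a different clock source.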
--- a/source/par2j/par2.c
+++ b/source/par2j/par2.c
@@ -1,5 +1,5 @@
 // par2.c
-// Copyright : 2023-09-21 Yutaka Sawada
+// Copyright : 2023-10-15 Yutaka Sawada
 // License : GPL

 #ifndef _UNICODE
@@ -181,7 +181,7 @@ int par2_create(
 			}
 		} else {
 			// Create the common packets
-			if ((memory_use & 16) && (cpu_num >= 4) && (entity_num >= 2)){	// on an SSD, process multiple files at the same time
+			if ((memory_use & 16) && (cpu_num >= 3) && (entity_num >= 2)){	// on an SSD, process multiple files at the same time
 				common_size = set_common_packet_multi(common_buf, &packet_num, (switch_p & 2) >> 1, files);
 			} else {
 				common_size = set_common_packet(common_buf, &packet_num, (switch_p & 2) >> 1, files);
@@ -529,7 +529,7 @@ int par2_verify(

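 	// Check whether the source files are complete
 	// A file is in one of four states: complete, missing, appended, or damaged (number of complete blocks)
-	if ((memory_use & 16) && (cpu_num >= 4) && (entity_num >= 2)){	// on an SSD, process multiple files at the same time
+	if ((memory_use & 16) && (cpu_num >= 3) && (entity_num >= 2)){	// on an SSD, process multiple files at the same time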
 		err = check_file_complete_multi(ascii_buf, uni_buf, files, s_blk);
 	} else {
 		err = check_file_complete(ascii_buf, uni_buf, files, s_blk);
@@ -741,7 +741,7 @@ int par2_repair(

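 	// List whether each source file is complete
 	// A file is in one of four states: complete, missing, appended, or damaged (number of complete blocks)
-	if ((memory_use & 16) && (cpu_num >= 4) && (entity_num >= 2)){	// on an SSD, process multiple files at the same time
+	if ((memory_use & 16) && (cpu_num >= 3) && (entity_num >= 2)){	// on an SSD, process multiple files at the same time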
 		err = check_file_complete_multi(ascii_buf, uni_buf, files, s_blk);
 	} else {
 		err = check_file_complete(ascii_buf, uni_buf, files, s_blk);
--- a/source/par2j/par2_cmd.c
+++ b/source/par2j/par2_cmd.c
@@ -1,5 +1,5 @@
 // par2_cmd.c
-// Copyright : 2023-09-28 Yutaka Sawada
+// Copyright : 2023-12-09 Yutaka Sawada
 // License : GPL

 #ifndef _UNICODE
@@ -87,35 +87,37 @@ static void print_environment(void)
 	printf("CPU thread\t: %d / %d\n", cpu_num & 0xFFFF, cpu_num >> 24);
 	cpu_num &= 0xFFFF;	// keep only the number of cores to use
 	printf("CPU cache limit : %d KB, %d KB\n", (cpu_flag & 0xFFFF0000) >> 10, (cpu_cache & 0xFFFE0000) >> 10);
-#ifndef _WIN64	// the 32-bit build shows one of MMX, SSE2, SSSE3
+#ifndef _WIN64	// the 32-bit build shows one of MMX, SSE2, SSSE3, AVX2
 	printf("CPU extra\t:");
-	if (cpu_flag & 1){
+	if (((cpu_flag & 16) != 0) && ((cpu_flag & 256) == 0)){
+		printf(" AVX2");
+	} else if (cpu_flag & 1){
 		if (cpu_flag & 256){
-			printf(" SSSE3(old)");
+			printf(" SSSE3(slow)");
 		} else {
 			printf(" SSSE3");
 		}
-	} else if (cpu_flag & 128){
+	} else if (((cpu_flag & 128) != 0) && ((cpu_flag & 256) == 0)){
 		printf(" SSE2");
 	} else {
 		printf(" MMX");
 	}
-#else	// the 64-bit build shows SSE2 or SSSE3
+#else	// the 64-bit build shows SSE2, SSSE3, or AVX2
 	printf("CPU extra\t: x64");
-	if (cpu_flag & 1){
+	if (((cpu_flag & 16) != 0) && ((cpu_flag & 256) == 0)){
+		printf(" AVX2");
+	} else if (cpu_flag & 1){
 		if (cpu_flag & 256){
-			printf(" SSSE3(old)");
+			printf(" SSSE3(slow)");
 		} else {
 			printf(" SSSE3");
 		}
-	} else if (cpu_flag & 128){
+	} else if (((cpu_flag & 128) != 0) && ((cpu_flag & 256) == 0)){
 		printf(" SSE2");
 	}
 #endif
 	if (cpu_flag & 8)
 		printf(" CLMUL");
-	if (cpu_flag & 16)
-		printf(" AVX2");
 	printf("\nMemory usage\t: ");
 	if (memory_use & 7){
 		printf("%d/8", memory_use & 7);
@@ -1477,17 +1479,15 @@ ri= switch_set & 0x00040000
 			} else if (wcsncmp(tmp_p, L"lc", 2) == 0){
 				k = 0;
 				j = 2;
-				while ((j < 2 + 5) && (tmp_p[j] >= '0') && (tmp_p[j] <= '9')){
+				while ((j < 2 + 7) && (tmp_p[j] >= '0') && (tmp_p[j] <= '9')){
 					k = (k * 10) + (tmp_p[j] - '0');
 					j++;
 				}
-				if (k & 256){	// use the GPU
-					OpenCL_method = 1;	// Faster GPU
-				} else if (k & 512){
-					OpenCL_method = -1;	// Slower GPU
+				if (k & 0x300){	// use the GPU
+					OpenCL_method = k & 0x003F0300;
 				}
-				if (k & 1024)	// do not use CLMUL; use the old SSSE3 encoder
-					cpu_flag = (cpu_flag & 0xFFFFFFF7) | 0x100;
+				if (k & 1024)	// do not use CLMUL or ALTMAP
+					cpu_flag = (cpu_flag & 0xFFFFFFF7) | 256;
 				if (k & 2048)	// do not use JIT (SSE2)
 					cpu_flag &= 0xFFFFFF7F;
 				if (k & 4096)	// do not use SSSE3
@@ -1506,10 +1506,10 @@ ri= switch_set & 0x00040000
 					} else if (k == 254){	// fewer than the number of physical cores
 						k = ((cpu_num & 0x00FF0000) >> 16) - 1;
 					} else if (k == 255){	// more than the number of physical cores
-						k = ((cpu_num & 0x00FF0000) >> 16) + 1;
-						//k = cpu_num >> 16;
-						//k = ((k & 0xFF) + (k >> 8)) / 2;	// midway between the physical and logical core counts?
+						k = cpu_num >> 16;
+						k = ((k & 0xFF) + (k >> 8)) / 2;	// midway between the physical and logical core counts?
 						// CPU usage in Task Manager reaches 100%, but it does not get any faster...
+						// k = (k & 0xFF) + ((k >> 8) - (k & 0xFF)) / 4;	// 5/4 of the physical core count?
 					}
 					if (k > MAX_CPU){
 						k = MAX_CPU;
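The `lc` hunk above widens the digit limit from 5 to 7 and passes the masked value through to `OpenCL_method` instead of mapping it to 1 or -1. A rough sketch of that decoding is shown below, using a plain `char` string rather than the wide string in par2j. The struct `lc_sketch` and both function names are hypothetical. The meaning of the bits under `0x003F0000` is not shown in this patch, so the sketch only masks them, and the `k & 4096` branch is omitted because its body is not visible in the hunk.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical container for the decoded option; not part of par2j. */
struct lc_sketch {
	int opencl_method;	/* 0 when the GPU is not requested */
	unsigned int cpu_flag;	/* CPU feature flags after the overrides */
};

/* Read up to max_digits decimal digits, as the patched loop does with 7. */
int parse_lc_value(const char *digits, int max_digits)
{
	int k = 0, j = 0;

	assert(digits != NULL);
	while ((j < max_digits) && (digits[j] >= '0') && (digits[j] <= '9')){
		k = (k * 10) + (digits[j] - '0');
		j++;
	}
	return k;
}

/* Apply the bit tests visible in the hunk to an initial cpu_flag. */
struct lc_sketch decode_lc(int k, unsigned int cpu_flag)
{
	struct lc_sketch r;

	r.opencl_method = 0;
	if (k & 0x300)	/* value 256 or 512 requests the GPU */
		r.opencl_method = k & 0x003F0300;
	if (k & 1024)	/* clear flag 8 (CLMUL) and set flag 256 */
		cpu_flag = (cpu_flag & 0xFFFFFFF7) | 256;
	if (k & 2048)	/* clear flag 128 (JIT for SSE2) */
		cpu_flag &= 0xFFFFFF7F;
	r.cpu_flag = cpu_flag;
	return r;
}
```

The old limit of five digits could not represent values above 99999, so the wider limit is presumably what lets the upper bits of the `0x003F0300` mask be set from the command line.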
--- a/source/par2j/reedsolomon.c
+++ b/source/par2j/reedsolomon.c
@@ -1,5 +1,5 @@
 // reedsolomon.c
-// Copyright : 2023-09-28 Yutaka Sawada
+// Copyright : 2023-12-12 Yutaka Sawada
 // License : GPL

 #ifndef _UNICODE
@@ -27,6 +27,9 @@
 #include "rs_decode.h"
 #include "reedsolomon.h"

+#ifdef TIMER
+#include <time.h>
+#endif

 // Minimum data size for using the GPU (in MB)
 // Starting the GPU takes time, so it becomes slower when the data is too small
@@ -204,6 +207,48 @@ int read_block_num(
 	return buf_num;
 }

+// Decide how many threads run the 1st encode/decode
+int calc_thread_num1(int max_num)
+{
+	int i, num;
+
+	// Use fewer threads while reading (0 when single-threaded)
+	num = 0;
+	i = 1;
+	while (i * 2 <= cpu_num){	// 1=0, 2~3=1, 4~7=2, 8~15=3, 16~31=4, 32=5
+		num++;
+		i *= 2;
+	}
+	if (num > max_num)
+		num = max_num;
+
+	return num;
+}
+
+// Decide how many threads run the 1st and 2nd encode/decode
+int calc_thread_num2(int max_num, int *cpu_num2)
+{
+	int i, num1, num2;
+
+	// Use fewer threads while reading (0 when single-threaded)
+	num1 = 0;
+	i = 2;
+	while (i <= cpu_num){	// 1=0, 2~3=1, 4~7=2, 8~15=3, 16~31=4, 32=5
+		num1++;
+		i *= 2;
+	}
+	if (num1 > max_num)
+		num1 = max_num;
+
+	// Always use two threads, one for the CPU and one for the GPU
+	num2 = cpu_num;
+	if (num2 < 2)
+		num2 = 2;
+	*cpu_num2 = num2;
+
+	return num1;
+}
+
 /* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */
 // Based on the matrix inversion method in Hayato Togawa's book 「演習と応用FORTRAN77」,
 // this is Gaussian elimination, slightly modified so that only one matrix is needed
@@ -697,7 +742,7 @@ int rs_encode(
 	int err = 0;
 	unsigned int len;
 #ifdef TIMER
-unsigned int time_total = GetTickCount();
+clock_t time_total = clock();
 #endif

 	if (galois_create_table()){
@@ -713,7 +758,7 @@ unsigned int time_total = GetTickCount();
 	// Prepare the matrix operation for parity calculation
 	len = sizeof(unsigned short) * source_num;
 	if (OpenCL_method != 0)
-		len *= 2;	// also reserve working space for the GPU
+		len *= 3;	// also reserve working space for the GPU
 	constant = malloc(len);
 	if (constant == NULL){
 		printf("malloc, %d\n", len);
@@ -757,8 +802,8 @@ unsigned int time_total = GetTickCount();
 		err = encode_method2(file_path, header_buf, rcv_hFile, files, s_blk, p_blk, constant);
 #ifdef TIMER
 	if (err != 1){
-		time_total = GetTickCount() - time_total;
-		printf("total  %d.%03d sec\n", time_total / 1000, time_total % 1000);
+		time_total = clock() - time_total;
+		printf("total  %.3f sec\n", (double)time_total / CLOCKS_PER_SEC);
 	}
 #endif

@@ -788,7 +833,7 @@ int rs_encode_1pass(
 	int err = 0;
 	unsigned int len;
 #ifdef TIMER
-unsigned int time_total = GetTickCount();
+clock_t time_total = clock();
 #endif

 	if (galois_create_table()){
@@ -799,7 +844,7 @@ unsigned int time_total = GetTickCount();
 	// Prepare the matrix operation for parity calculation
 	len = sizeof(unsigned short) * source_num;
 	if (OpenCL_method != 0)
-		len *= 2;	// also reserve working space for the GPU
+		len *= 3;	// also reserve working space for the GPU
 	constant = malloc(len);
 	if (constant == NULL){
 		printf("malloc, %d\n", len);
@@ -846,8 +891,8 @@ unsigned int time_total = GetTickCount();
 	if (err < 0){
 		printf("switching to 2-pass processing, %d\n", err);
 	} else if (err != 1){
-		time_total = GetTickCount() - time_total;
-		printf("total  %d.%03d sec\n", time_total / 1000, time_total % 1000);
+		time_total = clock() - time_total;
+		printf("total  %.3f sec\n", (double)time_total / CLOCKS_PER_SEC);
 	}
 #endif

@@ -871,7 +916,7 @@ int rs_decode(
 	int err = 0, i, j, k;
 	unsigned int len;
 #ifdef TIMER
-unsigned int time_matrix = 0, time_total = GetTickCount();
+clock_t time_matrix = 0, time_total = clock();
 #endif

 	if (galois_create_table()){
@@ -906,7 +951,7 @@ unsigned int time_matrix = 0, time_total = GetTickCount();
 	id = mat + (block_lost * source_num);

 #ifdef TIMER
-time_matrix = GetTickCount();
+time_matrix = clock();
 #endif
 	// Compute the matrix for recovery
 	print_progress_text(0, "Computing matrix");
@@ -947,7 +992,7 @@ time_matrix = GetTickCount();
 	//for (i = 0; i < block_lost; i++)
 	//	printf("id[%d] = %d\n", i, id[i]);
 #ifdef TIMER
-time_matrix = GetTickCount() - time_matrix;
+time_matrix = clock() - time_matrix;
 #endif

 #ifdef TIMER
@@ -960,8 +1005,7 @@ time_matrix = GetTickCount() - time_matrix;
 		// With many blocks, assign a thread to each block (uses the GPU)
 		if (memory_use & 16){
 			err = -4;	// on an SSD, the Read all method is fast even when blocks are fragmented
-		} else 
-		if (read_block_num(block_lost, 0, MEM_UNIT) != 0){
+		} else if (read_block_num(block_lost * 2, 0, MEM_UNIT) != 0){
 			err = -5;	// on an HDD with enough memory, use the Read some method
 		} else {
 			err = -4;	// when memory is short, use the Read all method and fragment the blocks
@@ -970,8 +1014,7 @@ time_matrix = GetTickCount() - time_matrix;
 		// Decide whether to read all source blocks in fragments or to read some of them whole
 		if (memory_use & 16){
 			err = -2;	// on an SSD, the Read all method is fast even when blocks are fragmented
-		} else 
-		if (read_block_num(block_lost, 0, sse_unit) != 0){
+		} else if (read_block_num(block_lost, 0, sse_unit) != 0){
 			err = -3;	// on an HDD with enough memory, use the Read some method
 		} else {
 			err = -2;	// when memory is short, use the Read all method and fragment the blocks
@@ -992,9 +1035,9 @@ time_matrix = GetTickCount() - time_matrix;
 		err = decode_method2(file_path, block_lost, rcv_hFile, files, s_blk, p_blk, mat);
 #ifdef TIMER
 	if (err != 1){
-		time_total = GetTickCount() - time_total;
-		printf("total  %d.%03d sec\n", time_total / 1000, time_total % 1000);
-		printf("matrix %d.%03d sec\n", time_matrix / 1000, time_matrix % 1000);
+		time_total = clock() - time_total;
+		printf("total  %.3f sec\n", (double)time_total / CLOCKS_PER_SEC);
+		printf("matrix %.3f sec\n", (double)time_matrix / CLOCKS_PER_SEC);
 	}
 #endif
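The new `calc_thread_num1` and `calc_thread_num2` helpers both reduce the thread count used while reading to roughly the base-2 logarithm of the core count, capped by `max_num`. The sketch below restates that loop as a standalone function. The name `thread_num_sketch` is hypothetical, and the global `cpu_num` is replaced by a parameter so the function can be tried in isolation.

```c
#include <assert.h>

/* Same loop as calc_thread_num1, with the core count passed as an argument. */
int thread_num_sketch(int cpu_count, int max_num)
{
	int i = 1, num = 0;

	assert(cpu_count >= 1);
	while (i * 2 <= cpu_count){	/* num ends as floor(log2(cpu_count)) */
		num++;
		i *= 2;
	}
	if (num > max_num)
		num = max_num;
	return num;
}
```

This reproduces the mapping listed in the source comment: 1 core gives 0, 2 to 3 give 1, 4 to 7 give 2, 8 to 15 give 3, 16 to 31 give 4, and 32 gives 5.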

--- a/source/par2j/reedsolomon.h
+++ b/source/par2j/reedsolomon.h
@@ -17,6 +17,10 @@ extern "C" {
 #define READ_MIN_RATE	1	// 1/2 = 50% of the number of blocks to keep
 #define READ_MIN_NUM	16

+// Limit the number of blocks processed at once, to optimize for the CPU cache
+#define CACHE_MIN_NUM	8
+#define CACHE_MAX_NUM	128
+
 /* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */

 // Try cache blocking
@@ -35,6 +39,12 @@ int read_block_num(
 	size_t trial_alloc,		// whether to check that the memory can be allocated
 	int alloc_unit);		// memory unit boundary (sse_unit or MEM_UNIT)

+// Decide how many threads run the 1st encode/decode
+int calc_thread_num1(int max_num);
+
+// Decide how many threads run the 1st and 2nd encode/decode
+int calc_thread_num2(int max_num, int *cpu_num2);
+
 /* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */

 // Encode using Reed-Solomon codes
--- a/source/par2j/res_par2j.rc
+++ b/source/par2j/res_par2j.rc
@@ -1,7 +1,7 @@
 1 RT_STRING ".\\source.cl"

 1 VERSIONINFO
-FILEVERSION 1,3,3,0
+FILEVERSION 1,3,3,2
 PRODUCTVERSION 1,3,3,0
 FILEOS 0x40004
 FILETYPE 0x1
@@ -11,9 +11,9 @@ BLOCK "StringFileInfo"
 	BLOCK "040904B0"
 	{
 		VALUE "FileDescription", "PAR2 client"
-		VALUE "LegalCopyright", "Copyright (C) 2023 Yutaka Sawada"
+		VALUE "LegalCopyright", "Copyright (C) 2024 Yutaka Sawada"
 		VALUE "ProductName", "par2j"
-		VALUE "FileVersion", "1.3.3.0"
+		VALUE "FileVersion", "1.3.3.2"
 		VALUE "ProductVersion", "1.3.3.0"
 	}
 }
--- a/source/par2j/rs_decode.c
+++ b/source/par2j/rs_decode.c
--- a/source/par2j/rs_encode.c
+++ b/source/par2j/rs_encode.c
--- a/source/par2j/source.cl
+++ b/source/par2j/source.cl
@@ -1,10 +1,11 @@
 void calc_table(__local uint *mtab, int id, int factor)
 {
-	int i, sum = 0;
+	int i, sum;

-	for (i = 0; i < 8; i++){
-		sum = (id & (1 << i)) ? (sum ^ factor) : sum;
-		factor = (factor & 0x8000) ? ((factor << 1) ^ 0x1100B) : (factor << 1);
+	sum = ((id << 31) >> 31) & factor;
+	for (i = 1; i < 8; i++){
+		factor = (factor << 1) ^ (((factor << 16) >> 31) & 0x1100B);
+		sum ^= ((id << (31 - i)) >> 31) & factor;
 	}
 	mtab[id] = sum;

@@ -14,13 +15,35 @@ void calc_table(__local uint *mtab, int id, int factor)
 	mtab[id + 256] = sum;
 }

+void calc_table2(__local uint *mtab, int id, int factor, int factor2)
+{
+	int i, sum, sum2, mask;
+
+	mask = (id << 31) >> 31;
+	sum = mask & factor;
+	sum2 = mask & factor2;
+	for (i = 1; i < 8; i++){
+		factor = (factor << 1) ^ (((factor << 16) >> 31) & 0x1100B);
+		factor2 = (factor2 << 1) ^ (((factor2 << 16) >> 31) & 0x1100B);
+		mask = (id << (31 - i)) >> 31;
+		sum ^= mask & factor;
+		sum2 ^= mask & factor2;
+	}
+	mtab[id] = sum | (sum2 << 16);
+
+	sum = (sum << 4) ^ (((sum << 16) >> 31) & 0x88058) ^ (((sum << 17) >> 31) & 0x4402C) ^ (((sum << 18) >> 31) & 0x22016) ^ (((sum << 19) >> 31) & 0x1100B);
+	sum = (sum << 4) ^ (((sum << 16) >> 31) & 0x88058) ^ (((sum << 17) >> 31) & 0x4402C) ^ (((sum << 18) >> 31) & 0x22016) ^ (((sum << 19) >> 31) & 0x1100B);
+	sum2 = (sum2 << 4) ^ (((sum2 << 16) >> 31) & 0x88058) ^ (((sum2 << 17) >> 31) & 0x4402C) ^ (((sum2 << 18) >> 31) & 0x22016) ^ (((sum2 << 19) >> 31) & 0x1100B);
+	sum2 = (sum2 << 4) ^ (((sum2 << 16) >> 31) & 0x88058) ^ (((sum2 << 17) >> 31) & 0x4402C) ^ (((sum2 << 18) >> 31) & 0x22016) ^ (((sum2 << 19) >> 31) & 0x1100B);
+
+	mtab[id + 256] = sum | (sum2 << 16);
+}
+
 __kernel void method1(
 	__global uint *src,
 	__global uint *dst,
 	__global ushort *factors,
-	int blk_num,
-	int offset,
-	int length)
+	int blk_num)
 {
 	__local uint mtab[512];
 	int i, blk;
@@ -29,15 +52,15 @@ __kernel void method1(
 	const int work_size = get_global_size(0);
 	const int table_id = get_local_id(0);

-	src += offset;
-	for (i = work_id; i < length; i += work_size)
+	for (i = work_id; i < BLK_SIZE; i += work_size)
 		dst[i] = 0;

 	for (blk = 0; blk < blk_num; blk++){
+		barrier(CLK_LOCAL_MEM_FENCE);
 		calc_table(mtab, table_id, factors[blk]);
 		barrier(CLK_LOCAL_MEM_FENCE);

-		for (i = work_id; i < length; i += work_size){
+		for (i = work_id; i < BLK_SIZE; i += work_size){
 			v = src[i];
 			sum = mtab[(uchar)(v >> 16)] ^ mtab[256 + (v >> 24)];
 			sum <<= 16;
@@ -45,7 +68,6 @@ __kernel void method1(
 			dst[i] ^= sum;
 		}
 		src += BLK_SIZE;
-		barrier(CLK_LOCAL_MEM_FENCE);
 	}
 }

@@ -53,9 +75,7 @@ __kernel void method2(
 	__global uint *src,
 	__global uint *dst,
 	__global ushort *factors,
-	int blk_num,
-	int offset,
-	int length)
+	int blk_num)
 {
 	__local uint mtab[512];
 	int i, blk, pos;
@@ -64,17 +84,17 @@ __kernel void method2(
 	const int work_size = get_global_size(0) * 2;
 	const int table_id = get_local_id(0);

-	src += offset;
-	for (i = work_id; i < length; i += work_size){
+	for (i = work_id; i < BLK_SIZE; i += work_size){
 		dst[i    ] = 0;
 		dst[i + 1] = 0;
 	}

 	for (blk = 0; blk < blk_num; blk++){
+		barrier(CLK_LOCAL_MEM_FENCE);
 		calc_table(mtab, table_id, factors[blk]);
 		barrier(CLK_LOCAL_MEM_FENCE);

-		for (i = work_id; i < length; i += work_size){
+		for (i = work_id; i < BLK_SIZE; i += work_size){
 			pos = (i & ~7) + ((i & 7) >> 1);
 			lo = src[pos    ];
 			hi = src[pos + 4];
@@ -88,50 +108,220 @@ __kernel void method2(
 			dst[pos + 4] ^= ((sum1 & 0xFF00FF00) >> 8) | (sum2 & 0xFF00FF00);
 		}
 		src += BLK_SIZE;
-		barrier(CLK_LOCAL_MEM_FENCE);
 	}
 }

 __kernel void method4(
+	__global uint4 *src,
+	__global uint4 *dst,
+	__global ushort *factors,
+	int blk_num)
+{
+	__local uint mtab[512];
+	int i, blk;
+	uchar4 r0, r1, r2, r3, r4, r5, r6, r7;
+	uchar16 lo, hi;
+	const int work_id = get_global_id(0) * 2;
+	const int work_size = get_global_size(0) * 2;
+	const int table_id = get_local_id(0);
+
+	for (i = work_id; i < BLK_SIZE / 4; i += work_size){
+		dst[i    ] = 0;
+		dst[i + 1] = 0;
+	}
+
+	for (blk = 0; blk < blk_num; blk++){
+		barrier(CLK_LOCAL_MEM_FENCE);
+		calc_table(mtab, table_id, factors[blk]);
+		barrier(CLK_LOCAL_MEM_FENCE);
+
+		for (i = work_id; i < BLK_SIZE / 4; i += work_size){
+			lo = as_uchar16(src[i    ]);
+			hi = as_uchar16(src[i + 1]);
+			r0 = (uchar4)(as_uchar2((ushort)(mtab[lo.s0] ^ mtab[256 + hi.s0])), as_uchar2((ushort)(mtab[lo.s1] ^ mtab[256 + hi.s1])));
+			r1 = (uchar4)(as_uchar2((ushort)(mtab[lo.s2] ^ mtab[256 + hi.s2])), as_uchar2((ushort)(mtab[lo.s3] ^ mtab[256 + hi.s3])));
+			r2 = (uchar4)(as_uchar2((ushort)(mtab[lo.s4] ^ mtab[256 + hi.s4])), as_uchar2((ushort)(mtab[lo.s5] ^ mtab[256 + hi.s5])));
+			r3 = (uchar4)(as_uchar2((ushort)(mtab[lo.s6] ^ mtab[256 + hi.s6])), as_uchar2((ushort)(mtab[lo.s7] ^ mtab[256 + hi.s7])));
+			r4 = (uchar4)(as_uchar2((ushort)(mtab[lo.s8] ^ mtab[256 + hi.s8])), as_uchar2((ushort)(mtab[lo.s9] ^ mtab[256 + hi.s9])));
+			r5 = (uchar4)(as_uchar2((ushort)(mtab[lo.sa] ^ mtab[256 + hi.sa])), as_uchar2((ushort)(mtab[lo.sb] ^ mtab[256 + hi.sb])));
+			r6 = (uchar4)(as_uchar2((ushort)(mtab[lo.sc] ^ mtab[256 + hi.sc])), as_uchar2((ushort)(mtab[lo.sd] ^ mtab[256 + hi.sd])));
+			r7 = (uchar4)(as_uchar2((ushort)(mtab[lo.se] ^ mtab[256 + hi.se])), as_uchar2((ushort)(mtab[lo.sf] ^ mtab[256 + hi.sf])));
+			dst[i    ] ^= as_uint4((uchar16)(r0.x, r0.z, r1.x, r1.z, r2.x, r2.z, r3.x, r3.z, r4.x, r4.z, r5.x, r5.z, r6.x, r6.z, r7.x, r7.z));
+			dst[i + 1] ^= as_uint4((uchar16)(r0.y, r0.w, r1.y, r1.w, r2.y, r2.w, r3.y, r3.w, r4.y, r4.w, r5.y, r5.w, r6.y, r6.w, r7.y, r7.w));
+		}
+		src += BLK_SIZE / 4;
+	}
+}
+
+__kernel void method9(
 	__global uint *src,
 	__global uint *dst,
 	__global ushort *factors,
-	int blk_num,
-	int offset,
-	int length)
+	int blk_num)
+{
+	__local uint mtab[512];
+	int i, blk;
+	uint v, sum, sum2;
+	const int work_id = get_global_id(0);
+	const int work_size = get_global_size(0);
+	const int table_id = get_local_id(0);
+
+	for (i = work_id; i < BLK_SIZE; i += work_size){
+		dst[i] = 0;
+		dst[i + BLK_SIZE] = 0;
+	}
+
+	for (blk = 0; blk < blk_num; blk++){
+		barrier(CLK_LOCAL_MEM_FENCE);
+		calc_table2(mtab, table_id, factors[blk], factors[blk_num + blk]);
+		barrier(CLK_LOCAL_MEM_FENCE);
+
+		for (i = work_id; i < BLK_SIZE; i += work_size){
+			v = src[i];
+			sum  = mtab[(uchar)v] ^ mtab[256 + (uchar)(v >> 8)];
+			sum2 = mtab[(uchar)(v >> 16)] ^ mtab[256 + (v >> 24)];
+			dst[i] ^= (sum & 0xFFFF) | (sum2 << 16);
+			dst[i + BLK_SIZE] ^= (sum >> 16) | (sum2 & 0xFFFF0000);
+		}
+		src += BLK_SIZE;
+	}
+}
+
+__kernel void method10(
+	__global uint *src,
+	__global uint *dst,
+	__global ushort *factors,
+	int blk_num)
+{
+	__local uint mtab[512];
+	int i, blk, pos;
+	uint lo, hi, t0, t1, t2, t3;
+	const int work_id = get_global_id(0) * 2;
+	const int work_size = get_global_size(0) * 2;
+	const int table_id = get_local_id(0);
+
+	for (i = work_id; i < BLK_SIZE; i += work_size){
+		dst[i    ] = 0;
+		dst[i + 1] = 0;
+		dst[i + BLK_SIZE    ] = 0;
+		dst[i + BLK_SIZE + 1] = 0;
+	}
+
+	for (blk = 0; blk < blk_num; blk++){
+		barrier(CLK_LOCAL_MEM_FENCE);
+		calc_table2(mtab, table_id, factors[blk], factors[blk_num + blk]);
+		barrier(CLK_LOCAL_MEM_FENCE);
+
+		for (i = work_id; i < BLK_SIZE; i += work_size){
+			pos = (i & ~7) + ((i & 7) >> 1);
+			lo = src[pos    ];
+			hi = src[pos + 4];
+			t0 = mtab[(uchar)lo] ^ mtab[256 + (uchar)hi];
+			t1 = mtab[(uchar)(lo >> 8)] ^ mtab[256 + (uchar)(hi >> 8)];
+			t2 = mtab[(uchar)(lo >> 16)] ^ mtab[256 + (uchar)(hi >> 16)];
+			t3 = mtab[lo >> 24] ^ mtab[256 + (hi >> 24)];
+			dst[pos    ] ^= (uchar)t0 | ((t1 << 8) & 0xFF00) | ((t2 << 16) & 0xFF0000) | (t3 << 24);
+			dst[pos + 4] ^= (uchar)(t0 >> 8) | (t1 & 0xFF00) | ((t2 << 8) & 0xFF0000) | ((t3 << 16) & 0xFF000000);
+			dst[pos + BLK_SIZE    ] ^= (uchar)(t0 >> 16) | ((t1 >> 8) & 0xFF00) | (t2 & 0xFF0000) | ((t3 << 8) & 0xFF000000);
+			dst[pos + BLK_SIZE + 4] ^= (t0 >> 24) | ((t1 >> 16) & 0xFF00) | ((t2 >> 8) & 0xFF0000) | (t3 & 0xFF000000);
+		}
+		src += BLK_SIZE;
+	}
+}
+
+__kernel void method12(
+	__global uint4 *src,
+	__global uint4 *dst,
+	__global ushort *factors,
+	int blk_num)
+{
+	__local uint mtab[512];
+	int i, blk;
+	uchar4 r0, r1, r2, r3, r4, r5, r6, r7, r8, r9, rA, rB, rC, rD, rE, rF;
+	uchar16 lo, hi;
+	const int work_id = get_global_id(0) * 2;
+	const int work_size = get_global_size(0) * 2;
+	const int table_id = get_local_id(0);
+
+	for (i = work_id; i < BLK_SIZE / 4; i += work_size){
+		dst[i    ] = 0;
+		dst[i + 1] = 0;
+		dst[i + BLK_SIZE / 4    ] = 0;
+		dst[i + BLK_SIZE / 4 + 1] = 0;
+	}
+
+	for (blk = 0; blk < blk_num; blk++){
+		barrier(CLK_LOCAL_MEM_FENCE);
+		calc_table2(mtab, table_id, factors[blk], factors[blk_num + blk]);
+		barrier(CLK_LOCAL_MEM_FENCE);
+
+		for (i = work_id; i < BLK_SIZE / 4; i += work_size){
+			lo = as_uchar16(src[i    ]);
+			hi = as_uchar16(src[i + 1]);
+			r0 = as_uchar4(mtab[lo.s0] ^ mtab[256 + hi.s0]);
+			r1 = as_uchar4(mtab[lo.s1] ^ mtab[256 + hi.s1]);
+			r2 = as_uchar4(mtab[lo.s2] ^ mtab[256 + hi.s2]);
+			r3 = as_uchar4(mtab[lo.s3] ^ mtab[256 + hi.s3]);
+			r4 = as_uchar4(mtab[lo.s4] ^ mtab[256 + hi.s4]);
+			r5 = as_uchar4(mtab[lo.s5] ^ mtab[256 + hi.s5]);
+			r6 = as_uchar4(mtab[lo.s6] ^ mtab[256 + hi.s6]);
+			r7 = as_uchar4(mtab[lo.s7] ^ mtab[256 + hi.s7]);
+			r8 = as_uchar4(mtab[lo.s8] ^ mtab[256 + hi.s8]);
+			r9 = as_uchar4(mtab[lo.s9] ^ mtab[256 + hi.s9]);
+			rA = as_uchar4(mtab[lo.sa] ^ mtab[256 + hi.sa]);
+			rB = as_uchar4(mtab[lo.sb] ^ mtab[256 + hi.sb]);
+			rC = as_uchar4(mtab[lo.sc] ^ mtab[256 + hi.sc]);
+			rD = as_uchar4(mtab[lo.sd] ^ mtab[256 + hi.sd]);
+			rE = as_uchar4(mtab[lo.se] ^ mtab[256 + hi.se]);
+			rF = as_uchar4(mtab[lo.sf] ^ mtab[256 + hi.sf]);
+			dst[i    ] ^= as_uint4((uchar16)(r0.x, r1.x, r2.x, r3.x, r4.x, r5.x, r6.x, r7.x, r8.x, r9.x, rA.x, rB.x, rC.x, rD.x, rE.x, rF.x));
+			dst[i + 1] ^= as_uint4((uchar16)(r0.y, r1.y, r2.y, r3.y, r4.y, r5.y, r6.y, r7.y, r8.y, r9.y, rA.y, rB.y, rC.y, rD.y, rE.y, rF.y));
+			dst[i + BLK_SIZE / 4    ] ^= as_uint4((uchar16)(r0.z, r1.z, r2.z, r3.z, r4.z, r5.z, r6.z, r7.z, r8.z, r9.z, rA.z, rB.z, rC.z, rD.z, rE.z, rF.z));
+			dst[i + BLK_SIZE / 4 + 1] ^= as_uint4((uchar16)(r0.w, r1.w, r2.w, r3.w, r4.w, r5.w, r6.w, r7.w, r8.w, r9.w, rA.w, rB.w, rC.w, rD.w, rE.w, rF.w));
+		}
+		src += BLK_SIZE / 4;
+	}
+}
+
+__kernel void method16(
+	__global uint *src,
+	__global uint *dst,
+	__global ushort *factors,
+	int blk_num)
 {
 	__local int table[16];
 	__local uint cache[256];
-	int i, j, blk, pos, sht, mask;
+	int i, j, blk, pos, mask, tmp;
 	uint sum;
 	const int work_id = get_global_id(0);
 	const int work_size = get_global_size(0);

-	src += offset;
-	for (i = work_id; i < length; i += work_size)
+	for (i = work_id; i < BLK_SIZE; i += work_size)
 		dst[i] = 0;

 	for (blk = 0; blk < blk_num; blk++){
 		if (get_local_id(0) == 0){
-			pos = factors[blk] << 16;
-			table[0] = pos;
+			tmp = factors[blk];
+			table[0] = tmp;
 			for (j = 1; j < 16; j++){
-				pos = (pos << 1) ^ ((pos >> 31) & 0x100B0000);
-				table[j] = pos;
+				mask = (tmp & 0x8000) ? 0x1100B : 0;
+				tmp = (tmp << 1) ^ mask;
+				table[j] = tmp;
 			}
 		}
 		barrier(CLK_LOCAL_MEM_FENCE);

-		for (i = work_id; i < length; i += work_size){
+		for (i = work_id; i < BLK_SIZE; i += work_size){
 			pos = i & 255;
 			cache[pos] = src[i];
 			barrier(CLK_LOCAL_MEM_FENCE);

 			sum = 0;
-			sht = (i & 60) >> 2;
+			tmp = (i & 60) >> 2;
+			tmp = 0x8000 >> tmp;
 			pos &= ~60;
 			for (j = 15; j >= 0; j--){
-				mask = (table[j] << sht) >> 31;
+				mask = (table[j] & tmp) ? 0xFFFFFFFF : 0;
 				sum ^= mask & cache[pos];
 				pos += 4;
 			}
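The `calc_table` change in source.cl replaces the per-bit conditionals with masks that are either all ones or all zeros, which avoids divergent branches on the GPU. The C sketch below compares the two forms for one table entry, that is, multiplying `factor` by an 8-bit `id` modulo the polynomial `0x1100B`. The function names are hypothetical. The kernel builds its masks with signed shifts such as `(id << 31) >> 31`; this sketch uses unsigned negation instead, an assumption made because shifting into the sign bit of a signed `int` is not portable in ISO C.

```c
#include <assert.h>
#include <stdint.h>

#define GF16_POLY 0x1100Bu	/* reduction polynomial used by the kernels */

/* Old form: one conditional per bit of id. */
uint32_t gf16_mul8_branch(uint32_t id, uint32_t factor)
{
	uint32_t sum = 0;
	int i;

	assert(id < 256 && factor < 65536);
	for (i = 0; i < 8; i++){
		if (id & (1u << i))
			sum ^= factor;
		factor = (factor & 0x8000u) ? ((factor << 1) ^ GF16_POLY) : (factor << 1);
	}
	return sum;
}

/* New form: every condition becomes an all-ones or all-zeros mask. */
uint32_t gf16_mul8_mask(uint32_t id, uint32_t factor)
{
	uint32_t sum;
	int i;

	assert(id < 256 && factor < 65536);
	sum = (0u - (id & 1u)) & factor;
	for (i = 1; i < 8; i++){
		/* double factor, reducing when bit 15 was set */
		factor = (factor << 1) ^ ((0u - ((factor >> 15) & 1u)) & GF16_POLY);
		/* add it when bit i of id is set */
		sum ^= (0u - ((id >> i) & 1u)) & factor;
	}
	return sum;
}
```

Both functions perform the same sequence of doublings and conditional additions, so they return identical values; the mask form simply moves the first addition out of the loop and drops the final, unused doubling.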
--- a/source/par2j/version.h
+++ b/source/par2j/version.h
@@ -1,2 +1,2 @@
-#define FILE_VERSION "1.3.3.0"	// file version number
+#define FILE_VERSION "1.3.3.2"	// file version number
 #define PRODUCT_VERSION "1.3.3"	// product version number
Author	SHA1	Message	Date
Yutaka Sawada	3b8d510aeb	Release note of version 1.3.3.2	2024-01-10 13:18:10 +09:00
Yutaka Sawada	9132c437fc	Update to year 2024	2024-01-10 10:44:19 +09:00
Yutaka Sawada	7159bbb1fd	Update to year 2024	2024-01-10 10:40:33 +09:00
Yutaka Sawada	ae9643f2ce	Add files via upload	2023-12-26 18:57:09 +09:00
Yutaka Sawada	6559e62276	Change lc option	2023-12-26 18:56:26 +09:00
Yutaka Sawada	1552fb8ec8	Add files via upload	2023-12-26 18:53:12 +09:00
Yutaka Sawada	79d0b184b8	Add notice of save_path	2023-12-03 21:43:40 +09:00
Yutaka Sawada	2793349268	Update PAR2 clients	2023-11-27 14:31:12 +09:00
Yutaka Sawada	4a7845dc7a	Erase old section	2023-11-27 14:29:58 +09:00
Yutaka Sawada	978bbe4b40	Optimization for AMD GPU	2023-11-27 14:19:43 +09:00
Yutaka Sawada	0bd2b92237	Optimization for AMD GPU	2023-11-27 14:18:13 +09:00
Yutaka Sawada	be51d4c842	Update for v1.3.3.1	2023-11-19 11:28:42 +09:00
Yutaka Sawada	af2ac4b113	Notice of changed option	2023-11-18 19:30:13 +09:00
Yutaka Sawada	6dd7949030	Release note of version 1.3.3.1	2023-11-11 13:11:15 +09:00
Yutaka Sawada	a27f8221cd	Update installer usage	2023-11-11 12:50:47 +09:00
Yutaka Sawada	8cff776c5e	Update installer usage	2023-11-11 12:49:45 +09:00
Yutaka Sawada	bae9e8a0d8	Add files via upload	2023-11-11 12:49:05 +09:00
Yutaka Sawada	bbfad5b9df	Add files via upload	2023-11-11 12:48:13 +09:00
Yutaka Sawada	fd24693c6b	Update installer usage	2023-11-11 11:54:12 +09:00
Yutaka Sawada	cc9d3595bd	Update installer usage	2023-11-11 11:53:34 +09:00
Yutaka Sawada	1b397d8976	Update version number	2023-11-07 12:53:49 +09:00
Yutaka Sawada	8c06ad76b6	Change max buffer size	2023-10-29 16:57:15 +09:00
Yutaka Sawada	50b735d3a5	Update PAR2 clients	2023-10-29 16:56:04 +09:00
Yutaka Sawada	5660fcf7c5	Improve GPU function a little	2023-10-27 13:31:56 +09:00
Yutaka Sawada	e979c07600	Update PAR2 clients	2023-10-27 13:31:18 +09:00
Yutaka Sawada	cd7d1f9450	Fixed a bug in GPU function	2023-10-25 14:06:06 +09:00
Yutaka Sawada	cdaceef840	Fixed a bug	2023-10-25 14:05:16 +09:00
Yutaka Sawada	6ce606977b	Update PAR2 clients	2023-10-23 10:57:51 +09:00
Yutaka Sawada	fb72e811d0	Improve GPU function	2023-10-23 10:54:28 +09:00