Command Reference ( extract mode )
List of extract mode commands.
extract ( chunklet ) : cmdbox -m extract -c chunklet <Option>
Extracts text from the specified document file.
| Option | Type | Multi | Required | Default | Choices | Description |
|---|---|---|---|---|---|---|
| --host <host> | str | | required | localhost | | Specify the service host of the Redis server. |
| --port <port> | int | | required | 6379 | | Specify the service port of the Redis server. |
| --password <password> | passwd | | required | password | | Specify the access password of the Redis server (optional). If omitted, 'password' is used. |
| --svname <svname> | str | | required | cmdbox | | Specify the service name of the inference server. |
| --scope <scope> | str | | required | current | | Specifies the scope to be referenced. When omitted, 'client' is used. |
| --fwpath <fwpath> | file | multi | required | | | Specify the base path used to check whether the specified path is out of bounds. A path that is not under this path results in an error. |
| --loadpath <loadpath> | file | | required | | | Specify the source file path. |
| --client_data <client_data> | str | | | | | Specify the path of the data folder when local is referenced. |
| --chunk_lang <chunk_lang> | str | | | auto | auto, ja, en | Specify the language of the text to be chunked. If auto is specified, the language is detected automatically. |
| --chunk_max_token_counter <chunk_max_token_counter> | str | | | gpt-4o | | Specify the token counter (a model name such as gpt-4o) used to count tokens when chunking text. |
| --chunk_max_tokens <chunk_max_tokens> | int | | | 1024 | | Specify the maximum number of tokens per chunk. |
| --chunk_max_sentences <chunk_max_sentences> | int | | | 4 | | Specify the maximum number of sentences (not characters) per chunk. |
| --chunk_overlap_percent <chunk_overlap_percent> | int | | | 20 | | Specifies the overlap between chunks as a percentage. |
| --retry_count <retry_count> | int | | | 3 | | Specifies the number of reconnection attempts to the Redis server. |
| --retry_interval <retry_interval> | int | | | 5 | | Specifies the number of seconds to wait before reconnecting to the Redis server. |
| --timeout <timeout> | int | | | 120 | | Specify the maximum waiting time until the server responds. |
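To show how the options above fit together, the following minimal sketch assembles a chunklet invocation as an argument list. Only the flag names and default values come from the table. The helper name build_chunklet_argv and the file paths are hypothetical, and the sketch assumes that a multi option such as --fwpath is repeated once per value and that options with a documented default may be left out.

```python
import shlex


def build_chunklet_argv(loadpath: str, fwpath: list[str], **options: object) -> list[str]:
    """Assemble an argv list for `cmdbox -m extract -c chunklet` (hypothetical helper)."""
    argv = ["cmdbox", "-m", "extract", "-c", "chunklet", "--loadpath", loadpath]
    # --fwpath is a multi option; this sketch assumes it is repeated per value.
    for path in fwpath:
        argv += ["--fwpath", path]
    # Remaining options are passed as "--name value" pairs.
    for name, value in options.items():
        argv += [f"--{name}", str(value)]
    return argv


# Hypothetical paths; the chunking values repeat the documented defaults.
argv = build_chunklet_argv(
    "docs/sample.txt",
    ["docs"],
    chunk_lang="en",
    chunk_max_tokens=1024,
    chunk_overlap_percent=20,
)
print(shlex.join(argv))
```

The resulting list can be handed to subprocess.run without going through a shell, which avoids quoting problems in paths.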
Output Schema
This command implements output_schema(), which returns the Result model.
{
"success": {
"performance": [
{
"key": "string",
"value": null
}
],
"file": "<class 'pathlib.Path'>",
"data": [
{
"content": "string",
"metadata": {}
}
]
},
"warn": {},
"error": {},
"schema": {},
"end": false
}
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| success | Data \| null | no | null | Result on success |
| success.performance | list[KeyVal] \| null | no | null | List of performance information |
| success.file | Path \| str \| null | no | null | File path |
| success.data | list[ContentRecord] \| null | no | null | Processing result data |
| warn | dict[str, any] \| Data \| str \| bool \| null | no | null | Result when there is a warning |
| warn.performance | list[KeyVal] \| null | no | null | List of performance information |
| error | dict[str, any] \| Data \| str \| bool \| null | no | null | Result when there is an error |
| error.performance | list[KeyVal] \| null | no | null | List of performance information |
| schema | dict[str, any] \| null | no | null | Schema information |
| end | bool \| null | no | null | End flag |
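Since every field in this schema is optional, a consumer has to tolerate missing keys. The sketch below reads a result shaped like the schema above and collects the chunk texts. The payload is a hand-written example, not real command output, and chunk_texts is a hypothetical helper name.

```python
import json

# Hand-written example shaped like the schema above; not real command output.
raw = """
{
  "success": {
    "performance": [{"key": "elapsed", "value": "0.1s"}],
    "file": "docs/sample.txt",
    "data": [
      {"content": "First chunk.", "metadata": {}},
      {"content": "Second chunk.", "metadata": {}}
    ]
  },
  "end": false
}
"""


def chunk_texts(result: dict) -> list[str]:
    """Return the content of every chunk, or raise if the result carries an error."""
    if result.get("error"):
        raise RuntimeError(f"extract failed: {result['error']}")
    success = result.get("success") or {}
    # Every field is optional in the schema, so fall back to an empty list.
    return [record["content"] for record in success.get("data") or []]


result = json.loads(raw)
print(chunk_texts(result))
```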
extract ( del ) : cmdbox -m extract -c del <Option>
Delete the extraction configuration.
| Option | Type | Multi | Required | Default | Choices | Description |
|---|---|---|---|---|---|---|
| --host <host> | str | | required | localhost | | Specify the service host of the Redis server. |
| --port <port> | int | | required | 6379 | | Specify the service port of the Redis server. |
| --password <password> | passwd | | required | password | | Specify the access password of the Redis server (optional). If omitted, 'password' is used. |
| --svname <svname> | str | | required | cmdbox | | Specify the service name of the inference server. |
| --retry_count <retry_count> | int | | | 3 | | Specifies the number of reconnection attempts to the Redis server. |
| --retry_interval <retry_interval> | int | | | 5 | | Specifies the number of seconds to wait before reconnecting to the Redis server. |
| --timeout <timeout> | int | | | 120 | | Specify the maximum waiting time until the server responds. |
| --extract_name <extract_name> | str | | required | | | Specify the name of the extraction configuration to delete. |
Output Schema
This command implements output_schema(), which returns the Result model.
{
"success": {
"performance": [
{
"key": "string",
"value": null
}
],
"data": "string"
},
"warn": {},
"error": {},
"schema": {},
"end": false
}
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| success | Data \| null | no | null | Result on success |
| success.performance | list[KeyVal] \| null | no | null | List of performance information |
| success.data | str \| null | no | null | Processing result data |
| warn | dict[str, any] \| Data \| str \| bool \| null | no | null | Result when there is a warning |
| warn.performance | list[KeyVal] \| null | no | null | List of performance information |
| error | dict[str, any] \| Data \| str \| bool \| null | no | null | Result when there is an error |
| error.performance | list[KeyVal] \| null | no | null | List of performance information |
| schema | dict[str, any] \| null | no | null | Schema information |
| end | bool \| null | no | null | End flag |
extract ( list ) : cmdbox -m extract -c list <Option>
Display a list of saved extraction settings.
| Option | Type | Multi | Required | Default | Choices | Description |
|---|---|---|---|---|---|---|
| --host <host> | str | | required | localhost | | Specify the service host of the Redis server. |
| --port <port> | int | | required | 6379 | | Specify the service port of the Redis server. |
| --password <password> | passwd | | required | password | | Specify the access password of the Redis server (optional). If omitted, 'password' is used. |
| --svname <svname> | str | | required | cmdbox | | Specify the service name of the inference server. If omitted, 'server' is used. |
| --retry_count <retry_count> | int | | | 3 | | Specifies the number of reconnection attempts to the Redis server. If a value less than 0 is specified, reconnection is retried indefinitely. |
| --retry_interval <retry_interval> | int | | | 5 | | Specifies the number of seconds to wait before reconnecting to the Redis server. |
| --timeout <timeout> | int | | | 60 | | Specify the maximum waiting time until the server responds. |
| --kwd <kwd> | str | | | | | Specify the name to search for. Partial matches are returned. |
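The interplay of --retry_count and --retry_interval can be sketched as a small retry loop. This is an illustration under stated assumptions, not the cmdbox implementation: call_with_retry, the connect callable, and the choice of ConnectionError are hypothetical, and the sketch assumes that retry_count counts reconnections after the first attempt.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(connect: Callable[[], T], retry_count: int = 3,
                    retry_interval: float = 5.0,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Call connect(), reconnecting after failures.

    A retry_count below 0 retries forever; otherwise at most retry_count
    reconnections follow the first attempt (an assumption of this sketch).
    """
    attempt = 0
    while True:
        try:
            return connect()
        except ConnectionError:
            # Give up once the allowed number of reconnections is used up.
            if 0 <= retry_count <= attempt:
                raise
            attempt += 1
            sleep(retry_interval)


calls = []


def flaky() -> str:
    """Stand-in for a Redis connection that succeeds on the third call."""
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError("redis not reachable")
    return "connected"


print(call_with_retry(flaky, retry_count=3, retry_interval=0, sleep=lambda s: None))
```

Passing the sleep function as a parameter keeps the sketch testable without real waiting.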
Output Schema
This command implements output_schema(), which returns the Result model.
{
"success": {
"performance": [
{
"key": "string",
"value": null
}
],
"data": [
{
"name": "string",
"path": "<class 'pathlib.Path'>"
}
]
},
"warn": {},
"error": {},
"schema": {},
"end": false
}
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| success | Data \| null | no | null | Result on success |
| success.performance | list[KeyVal] \| null | no | null | List of performance information |
| success.data | list[NamePath] | no | (required) | Processing result data |
| success.data.name | str | yes | (required) | Name |
| success.data.path | Path \| str \| null | no | null | Path |
| warn | dict[str, any] \| Data \| str \| bool \| null | no | null | Result when there is a warning |
| warn.performance | list[KeyVal] \| null | no | null | List of performance information |
| error | dict[str, any] \| Data \| str \| bool \| null | no | null | Result when there is an error |
| error.performance | list[KeyVal] \| null | no | null | List of performance information |
| schema | dict[str, any] \| null | no | null | Schema information |
| end | bool \| null | no | null | End flag |
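The partial-match search of --kwd can be pictured as a substring filter over the name field of success.data. The sketch below assumes a case-sensitive substring match, which the reference does not confirm. The helper filter_by_keyword and the sample entries (including their paths) are hypothetical.

```python
def filter_by_keyword(records: list[dict], kwd: str) -> list[dict]:
    """Keep records whose name contains kwd; an empty kwd keeps everything."""
    return [record for record in records if kwd in record["name"]]


# Hypothetical entries shaped like success.data (name and path).
data = [
    {"name": "manual_pdf", "path": "extract/manual_pdf.json"},
    {"name": "notes_txt", "path": "extract/notes_txt.json"},
]

print([record["name"] for record in filter_by_keyword(data, "pdf")])
```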
extract ( load ) : cmdbox -m extract -c load <Option>
Loads settings for extracting text from the specified file.
| Option | Type | Multi | Required | Default | Choices | Description |
|---|---|---|---|---|---|---|
| --host <host> | str | | required | localhost | | Specify the service host of the Redis server. |
| --port <port> | int | | required | 6379 | | Specify the service port of the Redis server. |
| --password <password> | passwd | | required | password | | Specify the access password of the Redis server (optional). If omitted, 'password' is used. |
| --svname <svname> | str | | required | cmdbox | | Specify the service name of the inference server. |
| --retry_count <retry_count> | int | | | 3 | | Specifies the number of reconnection attempts to the Redis server. |
| --retry_interval <retry_interval> | int | | | 5 | | Specifies the number of seconds to wait before reconnecting to the Redis server. |
| --timeout <timeout> | int | | | 120 | | Specify the maximum waiting time until the server responds. |
| --extract_name <extract_name> | str | | required | | | Specify the name of the extraction configuration to load. |
Output Schema
This command implements output_schema(), which returns the Result model.
{
"success": {
"performance": [
{
"key": "string",
"value": null
}
],
"extract_name": "string",
"extract_type": "string",
"extract_cmd": "string",
"scope": "string",
"client_data": "string",
"loadpath": "string",
"loadregs": "string"
},
"warn": {},
"error": {},
"schema": {},
"end": false
}
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| success | Data \| null | no | null | Result on success |
| success.performance | list[KeyVal] \| null | no | null | List of performance information |
| success.extract_name | str \| null | no | null | Extraction name |
| success.extract_type | str \| null | no | null | Extraction type |
| success.extract_cmd | str \| null | no | null | Extraction command |
| success.scope | str \| null | no | null | Scope |
| success.client_data | str \| null | no | null | Client data |
| success.loadpath | str \| null | no | null | Load path |
| success.loadregs | str \| null | no | null | Load regular expression |
| warn | dict[str, any] \| Data \| str \| bool \| null | no | null | Result when there is a warning |
| warn.performance | list[KeyVal] \| null | no | null | List of performance information |
| error | dict[str, any] \| Data \| str \| bool \| null | no | null | Result when there is an error |
| error.performance | list[KeyVal] \| null | no | null | List of performance information |
| schema | dict[str, any] \| null | no | null | Schema information |
| end | bool \| null | no | null | End flag |
extract ( pdfplumber ) : cmdbox -m extract -c pdfplumber <Option>
Extracts text from the specified document file.
| Option | Type | Multi | Required | Default | Choices | Description |
|---|---|---|---|---|---|---|
| --host <host> | str | | required | localhost | | Specify the service host of the Redis server. |
| --port <port> | int | | required | 6379 | | Specify the service port of the Redis server. |
| --password <password> | passwd | | required | password | | Specify the access password of the Redis server (optional). If omitted, 'password' is used. |
| --svname <svname> | str | | required | cmdbox | | Specify the service name of the inference server. |
| --scope <scope> | str | | required | current | | Specifies the scope to be referenced. When omitted, 'client' is used. |
| --loadpath <loadpath> | file | | required | | | Specify the source file path. |
| --fwpath <fwpath> | file | multi | required | | | Specify the base path used to check whether the specified path is out of bounds. A path that is not under this path results in an error. |
| --client_data <client_data> | str | | | | | Specify the path of the data folder when local is referenced. |
| --chunk_table <chunk_table> | str | | | table | none, table, row_with_header | Specifies how to chunk tables in the PDF file. none: do not chunk by table; table: chunk by table; row_with_header: chunk by row (with header). |
| --chunk_table_header <chunk_table_header> | str | multi | | | | Specify the names of the table header items in the PDF file, from left to right, to replace the existing header items. |
| --chunk_exclude <chunk_exclude> | str | multi | | | | Specify a regular expression for strings that should not be included in a chunk. When the expression matches, embedding is not performed. |
| --chunk_size <chunk_size> | int | | | 1000 | | Specifies the chunk size. |
| --chunk_overlap <chunk_overlap> | int | | | 50 | | Specifies the overlap size between chunks. |
| --chunk_separator <chunk_separator> | str | multi | | | | Specifies the delimiter character for chunking. |
| --chunk_spage <chunk_spage> | int | | | 0 | | Specifies the starting page of the embedding range. |
| --chunk_epage <chunk_epage> | int | | | 9999 | | Specifies the ending page of the embedding range. |
| --retry_count <retry_count> | int | | | 3 | | Specifies the number of reconnection attempts to the Redis server. |
| --retry_interval <retry_interval> | int | | | 5 | | Specifies the number of seconds to wait before reconnecting to the Redis server. |
| --timeout <timeout> | int | | | 120 | | Specify the maximum waiting time until the server responds. |
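To give a feel for how --chunk_size, --chunk_overlap, and --chunk_exclude interact, here is a minimal sliding-window sketch. It is not the cmdbox implementation: it assumes sizes are counted in characters, ignores --chunk_separator and table handling, and reads --chunk_exclude as "skip any chunk that matches". The function name chunk_text is hypothetical.

```python
import re


def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 50,
               exclude: tuple[str, ...] = ()) -> list[str]:
    """Split text into windows of chunk_size characters sharing chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        # Skip chunks matching any exclude pattern (one reading of --chunk_exclude).
        if not any(re.search(pattern, chunk) for pattern in exclude):
            chunks.append(chunk)
        # Stop once the window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


chunks = chunk_text("abcdefghij" * 3, chunk_size=12, chunk_overlap=2)
print(chunks)
```

With a 30-character input, a size of 12, and an overlap of 2, each window starts 10 characters after the previous one, so the last two characters of one chunk reappear at the start of the next.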
Output Schema
This command implements output_schema(), which returns the Result model.
{
"success": {
"performance": [
{
"key": "string",
"value": null
}
],
"file": "<class 'pathlib.Path'>",
"data": [
{
"content": "string",
"metadata": {}
}
]
},
"warn": {},
"error": {},
"schema": {},
"end": false
}
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| success | Data \| null | no | null | Result on success |
| success.performance | list[KeyVal] \| null | no | null | List of performance information |
| success.file | Path \| str \| null | no | null | File path |
| success.data | list[ContentRecord] \| null | no | null | Processing result data |
| warn | dict[str, any] \| Data \| str \| bool \| null | no | null | Result when there is a warning |
| warn.performance | list[KeyVal] \| null | no | null | List of performance information |
| error | dict[str, any] \| Data \| str \| bool \| null | no | null | Result when there is an error |
| error.performance | list[KeyVal] \| null | no | null | List of performance information |
| schema | dict[str, any] \| null | no | null | Schema information |
| end | bool \| null | no | null | End flag |
extract ( save ) : cmdbox -m extract -c save <Option>
Saves settings for extracting text from the specified file.
| Option | Type | Multi | Required | Default | Choices | Description |
|---|---|---|---|---|---|---|
| --host <host> | str | | required | localhost | | Specify the service host of the Redis server. |
| --port <port> | int | | required | 6379 | | Specify the service port of the Redis server. |
| --password <password> | passwd | | required | password | | Specify the access password of the Redis server (optional). If omitted, 'password' is used. |
| --svname <svname> | str | | required | cmdbox | | Specify the service name of the inference server. |
| --retry_count <retry_count> | int | | | 3 | | Specifies the number of reconnection attempts to the Redis server. |
| --retry_interval <retry_interval> | int | | | 5 | | Specifies the number of seconds to wait before reconnecting to the Redis server. |
| --timeout <timeout> | int | | | 120 | | Specify the maximum waiting time until the server responds. |
| --extract_name <extract_name> | str | | required | | | Specify the name of the extraction configuration. |
| --extract_cmd <extract_cmd> | str | | required | | | Specify the name of the extraction command setting. |
| --extract_type <extract_type> | str | | required | file | | Specify the type of extraction. |
| --scope <scope> | str | | | client | | Specify the reference scope. The available scopes are client and server. |
| --client_data <client_data> | str | | | | | Specify the path of the data folder when local is referenced. |
| --loadpath <loadpath> | dir | | required | | | Specify the source path. |
| --loadregs <loadregs> | str | | required | .* | | Specifies a regular expression pattern that selects what to load. |
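The pairing of --loadpath (a directory) with --loadregs (a regular expression) suggests a filter over the files below that directory. The sketch below shows one way such a filter could work. It assumes the pattern must match the whole path relative to loadpath, which the reference does not state, and select_files and the sample paths are hypothetical.

```python
import re
from pathlib import PurePosixPath


def select_files(paths: list[str], loadpath: str, loadregs: str = ".*") -> list[str]:
    """Return the paths under loadpath whose relative path fully matches loadregs."""
    pattern = re.compile(loadregs)
    base = PurePosixPath(loadpath)
    selected = []
    for candidate in paths:
        try:
            relative = PurePosixPath(candidate).relative_to(base)
        except ValueError:
            continue  # Not under loadpath, so it is never loaded.
        if pattern.fullmatch(str(relative)):
            selected.append(candidate)
    return selected


# Hypothetical file listing.
files = ["docs/a.pdf", "docs/sub/b.pdf", "docs/c.txt", "other/d.pdf"]
print(select_files(files, "docs", r".*\.pdf"))
```

With the default pattern .* every file under loadpath is selected, which matches the documented default of loading everything.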
Output Schema
This command implements output_schema(), which returns the Result model.
{
"success": {
"performance": [
{
"key": "string",
"value": null
}
],
"data": "string"
},
"warn": {},
"error": {},
"schema": {},
"end": false
}
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| success | Data \| null | no | null | Result on success |
| success.performance | list[KeyVal] \| null | no | null | List of performance information |
| success.data | str \| null | no | null | Processing result data |
| warn | dict[str, any] \| Data \| str \| bool \| null | no | null | Result when there is a warning |
| warn.performance | list[KeyVal] \| null | no | null | List of performance information |
| error | dict[str, any] \| Data \| str \| bool \| null | no | null | Result when there is an error |
| error.performance | list[KeyVal] \| null | no | null | List of performance information |
| schema | dict[str, any] \| null | no | null | Schema information |
| end | bool \| null | no | null | End flag |