- 1. Prerequisites
- 2. Command Specifications
- 3. Telemetry Specifications
- 4. FDIR Specifications
- 5. Recovery Processing
- 6. How to use FTP
This document explains how to use the Open-SET software. First, install Open-SET with the installer (see the README).
The configuration of this software platform is as follows.
- IFSW and the IFSW verification tool are used for sending and receiving commands.
- We use mosquitto as the internal messaging broker.
- We use telegraf as the status information collection tool.
- In addition, we use prometheus as the alert management tool, and we use the docker engine, etc., as the execution environment for user applications.
This chapter summarizes the specifications for the commands that are requested from the BUS-OBC command client and executed within the MISSION-OBC.
The list of commands that can be requested from the BUS-OBC command client is as follows.
| ID | Command name | Function | Parameters | Response |
|---|---|---|---|---|
| 0x00 | GetTelemetry | Obtain telemetry information | { "command_id": "0x00", "parameter": {} } | 0x80 Telemetry information |
| 0x01 | GetLog | Get system log information | { "command_id": "0x01", "parameter": [ {"time_id": "YYYY-MM-DD"} ] } | 0x81 System log information |
| 0x02 | Shutdown | Shut down MISSION-OBC | { "command_id": "0x02", "parameter": {} } | 0x82 MISSION-OBC shutdown result |
| 0x10 | PowerOnDevice | Start the device to be restored. | { "command_id": "0x10", "parameter": [ {"device_id": 0} //Device identifier to be restored //0: nvme0 //1: ssd0 //2: nvme1 ] } | 0x90 OS Restore Preprocessing Result |
| 0x11 | RestoreOS | Restore the OS image of the device to be restored. | { "command_id": "0x11", "parameter": [ {"device_id": 0}, //Device identifier to be restored //0: nvme0 //1: ssd0 //2: nvme1 {"progress_id": 0}, //Partition of the device to be restored //1: boot partition and root partition //2: root partition //3: Boot device switching only (do not process if there is no target for restoration, and only the device switching part of the post-processing will run) {"progress_bytes": 0} //0 (fixed value) ] } | 0x91 OS Restore Result |
| 0x12 | RestoreFile | Restores any file on the device to be restored. | { "command_id": "0x12", "parameter": [ {"installer_list": ["/opt/open-set/sys/restore_symbolic_link.sh"]} //Script name (array) for performing additional installation. If additional installation is not required, specify only the script that creates a symbolic link in the home directory. If additional installation is required, add the script to the array. ] } | 0x92 Post-processing results of OS restoration |
| 0x21 | DeployApp | Deploy apps uploaded from the ground. | { "command_id": "0x21", "parameter": [ {"deploy_method": "0x00"}, // "0x00" is fixed {"image_file": "hello.tar.gz"}, // App file name to be deployed (docker load or deb install) {"app_uid": "0x01"} // App user ID ] } | 0xA1 Container deployment result, deb file installation result |
| 0x22 | GetAppInfo | Obtain information (container information) about all deployed apps. | { "command_id": "0x22", "parameter": [ {"deploy_method": "0x00"} //"0x00" is fixed ] } | 0xA2 Information on deployed containers and installed deb files |
| 0x24 | GetFile | Get a file (text file) | { "command_id": "0x24", "parameter": [ {"config_file_name": "/export/home/exp-01/01_docker-compose.yml"} //text file name ] } | 0xA4 Parameter setting details |
| 0x25 | ExecuteApp | Execute the app | { "command_id": "0x25", "parameter": [ {"deploy_method": "0x00"}, //"0x00" fixed {"app_uid": "0x01"}, //Application user ID {"obs_id": "0x0001"} //Observation request number ] } | 0xA5 Container app and native app execution results |
| 0x27 | GetResult | Obtain the result of executing the app. | { "command_id": "0x27", "parameter": [ {"app_uid": "0x01"}, //App user ID {"obs_id": "0x0001"} //Observation request number ] } | 0xA7 Container app and native app execution results |
| 0x60 | MoveFile | Move a file | { "command_id": "0x60", "parameter": [ {"file_path": "hello.tar.gz"}, //File name {"attribute": "0x00"}, //Attribute ID //"0x00": App //"0x01": Application configuration file //"0x10": Execution condition file yaml (yml), sh file //"0x20": Satellite images taken in orbit //"0x21": Other files required by the AI application other than the above {"app_uid": "0x01"}, //Application user ID {"obs_id": "0x0001"} ]} | 0xE0 Result of placing a file that has been interfaced with FTP PUT in the application area |
| 0x61 | NotifyGetFile | Notifies that the satellite bus system has already acquired the file from MISSION-OBC. | { "command_id": "0x61", "parameter": [ {"path": "export/home/exp-01/obs-0001/toGround_arch/result_exp-01_obs-0001.tar.gz"}, // Filename (full path) {"delete_file": 1}, // Whether or not to delete the retrieved file // 0: Do not delete // 1: Delete {"app_uid": "0x01"}, // App user ID // 0x00 No user, the user number is entered in sequential order ] } | 0xE1 Post-processing result for files that have been interfaced with FTP GET |
| 0x62 | GetDirectory | Obtain information on a specific directory in MISSION-OBC | { "command_id": "0x62", "parameter": [ {"path": "./"} // Directory path (full path) ] } | 0xE2 MISSION-OBC specific directory information archived with tar, compressed with gzip, and encoded with base64 |
| 0x63 | DeleteFile | Deletes a specific file or container in MISSION-OBC. | { "command_id": "0x63", "parameter": [ {"target": "/mnt/open-set/trash"}, // Full path of the object to be deleted, or the docker image name of the target for deletion: tag name // e.g. {"target" : "AppXXX:latest"} {"type": "0x00"}, // normal file 0x00 / directory 0x01 / Docker image 0x02 {"date": "2023-01-01T00:00:00+09:00"} // Specify the time conditions for deletion (can be omitted). // Delete files whose last modified date is before this time. // Note that the system's standard time is not Japan Standard Time, but 32-bit Unix time (UTC)! // Format: // - If you do not specify a time zone (UTC): "YYYY-MM-ddThh:mm:ss" // - If you specify a time zone: "YYYY-MM-ddThh:mm:ss+hh:mm" // Example: To delete files created before 2023 in Japan time, specify it as "2023-01-01T00:00:00+09:00". ] } | 0xE3 Result of deleting a specific file or container within MISSION-OBC |
| 0x70 | ExecuteCommand | Execute a specific shell command within MISSION-OBC | { "command_id": "0x70", "parameter": [ {"shell": "date"} // The shell command to execute // The command must satisfy the following conditions // - It must be specified on a single line // - It must not include commands that require pre-installation // - It must not include commands that require input during execution // - Do not include commands that will destroy the system ] } | 0xF0 Result of a specific shell command executed within MISSION-OBC, archived with tar, compressed with gzip, and encoded with base64 |
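The request format in the table can also be assembled programmatically. The following is a minimal sketch, assuming a request is a JSON object with the `command_id` and `parameter` keys shown in the table. The helper name `build_command` is hypothetical and not part of Open-SET, and sending the request through IFSW is not shown.

```python
import json


def build_command(command_id, parameters=None):
    """Serialize a command request in the layout used by the command table.

    command_id: integer control command ID, for example 0x00 (GetTelemetry).
    parameters: list of single-key dicts, or None for commands without parameters.
    """
    if not 0x00 <= command_id <= 0x7F:
        raise ValueError("control command IDs must be in the range 0x00-0x7F")
    body = {
        "command_id": f"0x{command_id:02X}",
        # Commands without parameters use an empty object in the table.
        "parameter": parameters if parameters else {},
    }
    return json.dumps(body)


# GetTelemetry takes no parameters.
get_telemetry = build_command(0x00)

# DeleteFile, using the example values from the table.
delete_file = build_command(0x63, [
    {"target": "/mnt/open-set/trash"},
    {"type": "0x00"},
    {"date": "2023-01-01T00:00:00+09:00"},
])

print(get_telemetry)
print(delete_file)
```

Keeping the parameters as a list of single-key objects mirrors the table, so the order of the entries is preserved in the serialized request.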
The method for sending and receiving command request data is as follows.
- Check that IFSW is operating normally. If it is not operating normally, restart the service.
- Checking method:
sudo systemctl status ifsw*
- Restart method:
sudo systemctl restart ifsw*
- Start the status command server.
cd /opt/open-set/tests/tools/ifsw_verification/commands
python3 status_command_server.py
- Open another terminal window and start sending commands.
cd /opt/open-set/tests/tools/ifsw_verification/commands
python3 command_client_manual.py
If successful, you will be prompted to select a command ID as follows.
For example, if you want to execute the "GetTelemetry" command to obtain telemetry, enter 0x00.
==========================================
please select a command ID from the list:
ID | NAME
0x00 | GetTelemetry
0x01 | GetLog
0x02 | Shutdown
...
- After you send the command, its contents are displayed, followed by a response message from the IFSW server. If the command is received successfully, the following message is returned.
SERVER > Command accepted.
To add a command, change the following three places.
/opt/open-set/src/IFSW/command_handler
/opt/open-set/tests/tools/ifsw_verification/commands/status_command_handler.py
/opt/open-set/tests/tools/ifsw_verification/commands/BUSOBC_consts.py
First, add a folder with the following structure to /opt/open-set/src/IFSW/command_handler.
+-- command_XX_your_command
    +-- your_command.py
    +-- template.json
Here, replace "your_command" with the name of the command you want to add. "XX" is a two-digit hexadecimal ID; choose one that does not conflict with the IDs of existing commands. Command IDs are assigned according to the following criteria.
Range of IDs for control commands: 0x00-0x7F
- 0x00-0x0F: System-related (ex. telemetry, logging, power)
- 0x10-0x1F: Error Clear-related (ex. recovery)
- 0x20-0x5F: Mission-related (ex. application loading/execution)
- 0x60-0x6F: File operation-related
- 0x70-0x7F: Development-related
- 0x80-0xFF is reserved for status command IDs.
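When choosing an ID for a new command, it can help to check which range it falls in. The following is a small sketch based on the ranges listed above and on the pairing visible in the command table, where each status command ID is the command ID plus 0x80 (for example 0x25 and 0xA5). The function names and the shortened category labels are hypothetical.

```python
# Ranges taken from the list above; the category labels are shortened.
COMMAND_RANGES = [
    (0x00, 0x0F, "system"),
    (0x10, 0x1F, "error clear"),
    (0x20, 0x5F, "mission"),
    (0x60, 0x6F, "file operation"),
    (0x70, 0x7F, "development"),
]


def command_category(command_id):
    """Return the category of a control command ID, or raise ValueError."""
    for low, high, name in COMMAND_RANGES:
        if low <= command_id <= high:
            return name
    raise ValueError(f"0x{command_id:02X} is not a control command ID")


def status_command_id(command_id):
    """Return the status command ID paired with a control command ID."""
    command_category(command_id)  # rejects IDs outside 0x00-0x7F
    return command_id + 0x80


print(command_category(0x25))        # mission
print(hex(status_command_id(0x25)))  # 0xa5
```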
In template.json, please describe the default parameter values for the command in JSON format.
A command script file consists of the following three basic components. When creating your_command.py, edit the commented parts in accordance with the following format.
class InvokeCommand(BaseCommand):
    def check_parameter(self, *args):
        super().check_parameter()
        result = check_parameter(*args)
        return result

    def pre_process(self, params):  # If there are any parameters included in the return content of the command execution, include them here.
        super().pre_process()
        self.parameter = [{"response_code": ER_COMMAND_EXE}, {"data": params}]  # Write the return content in the event of a command error here.

    def main_process(self, *args):
        super().main_process()
        parameter = invoke_command(*args)
        return parameter


def check_parameter(parameter):  # Include the command parameters here.
    # Write the code to check the command parameters here.
    return True


def invoke_command(params):  # Include any parameters used in the command-specific processing and return content here.
    # Write the command-specific processing here.
    :
    parameter = [{"response_code": E_OK}]  # Describe the return content when the command ends normally here.
    return parameter
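As an illustration of what the parameter check could look like, the following is a minimal sketch that assumes the parameters arrive as a list of single-key objects, as in the command table. The `required` argument, the `flatten_parameter` helper, and the `device_id` example are hypothetical; how IFSW actually passes parameters to `check_parameter` is not specified here.

```python
def flatten_parameter(parameter):
    """Merge a list of single-key dicts into one dict."""
    merged = {}
    for entry in parameter:
        merged.update(entry)
    return merged


def check_parameter(parameter, required=(("device_id", int),)):
    """Return True only if every required key is present with the expected type."""
    if not isinstance(parameter, list):
        return False
    if not all(isinstance(entry, dict) for entry in parameter):
        return False
    merged = flatten_parameter(parameter)
    return all(isinstance(merged.get(key), expected)
               for key, expected in required)


print(check_parameter([{"device_id": 0}]))    # True
print(check_parameter([{"device_id": "0"}]))  # False
```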
Once you have made the above changes, restart the services related to IFSW in order to reflect the changes in MISSION-OBC.
sudo systemctl restart ifsw-*
Open /opt/open-set/tests/tools/ifsw_verification/commands/status_command_handler.py and edit the following items.
status_commandTBL = {
    0x80: EndGetTelemetry,
    0x81: EndGetLogInfo,
    0x82: EndShutdown,
    0x90: EndPowerOnDevice,
    0x91: EndRestoreOS,
    0x92: EndRestoreFile,
    0xA1: EndDeployApp,
    0xA2: EndGetAppInfo,
    0xA4: EndGetFile,
    0xA5: EndExecuteApp,
    0xA7: EndGetAppResult,
    0xE0: EndMoveFile,
    0xE1: EndNotifyGetFile,
    0xE2: EndGetDirectoryList,
    0xE3: EndDeleteFile,
    0xF0: EndExecuteCommand,
}
To this list, add the status command ID (command ID + 0x80) corresponding to the new command you want to add, and the function name (e.g. PutMonitorStatus) that will receive the status command.
Next, add the function that will receive the status command. In the case of GetTelemetry, the following is the corresponding function.
@timeout_decorator.timeout(COM_HANDLER_TIMEOUT, use_signals=False)
def EndGetTelemetry(response_code, telemetry, parameterCheck=False):
    if parameterCheck:
        return True
    print("Command executed : " + sys._getframe().f_code.co_name)
    print("Telemetry:\n")
    for id in range(len(telemetry)):
        print(TELEMETRY_FIELDS[id] + ": " + str(telemetry[id]))
    return
Define the function parameters according to the content returned by invoke_command in your_command.py, and add a parameter check if necessary. Make sure the function prints all the information you need, because its print output is what the status command server displays.
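A receiver for a new status command might look like the following. This is a sketch only: the command ID 0x71, the function name `EndYourCommand`, and the `data` argument are hypothetical, and the timeout decorator used by the existing handlers is omitted so that the example runs on its own.

```python
import sys


def EndYourCommand(response_code, data, parameterCheck=False):
    """Hypothetical receiver for the status command of a new command."""
    if parameterCheck:
        return True
    # The printed lines are what the status command server would display.
    print("Command executed : " + sys._getframe().f_code.co_name)
    print("response_code: " + str(response_code))
    print("data: " + str(data))
    return


# Hypothetical entry: command ID 0x71 pairs with status command ID 0x71 + 0x80.
status_commandTBL = {
    0x71 + 0x80: EndYourCommand,
}

handler = status_commandTBL[0xF1]
handler(0, ["example"])
```

The arguments after `response_code` would need to match the keys that `invoke_command` places in its return content.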
Open /opt/open-set/tests/tools/ifsw_verification/commands/BUSOBC_consts.py and edit the following items.
commandLIST = [
    "0x00-GetTelemetry",
    "0x01-GetLog",
    "0x02-Shutdown",
    "0x10-PowerOnDevice",
    "0x11-RestoreOS",
    "0x12-RestoreFile",
    "0x21-DeployApp",
    "0x22-GetAppInfo",
    "0x24-GetFile",
    "0x25-ExecuteApp",
    "0x27-GetResult",
    "0x60-MoveFile",
    "0x61-NotifyGetFile",
    "0x62-GetDirectory",
    "0x63-DeleteFile",
    "0x70-ExecuteCommand",
]
Add the command ID and command name corresponding to the newly added command to this list.
To delete a command, remove the relevant parts from the files updated in [2.3](#23-Adding Commands). Then restart the IFSW-related services to reflect the configuration changes.
This chapter summarizes the specifications for the telemetry collected within the MISSION-OBC.
We use telegraf as a tool for collecting MISSION-OBC status information. In addition, some values are managed in a database shared within Open-SET, and are updated by IFSW and IR handlers. The following is a list of the telemetry collected by Open-SET.
| Category | Item | telegraf_metrics | Description / Remarks | Byte Pos | Byte Size | Type | Bit | Display Type[Unit] |
|---|---|---|---|---|---|---|---|---|
| Time | Time UNIX time | Outputs UNIX time | 0 | 4 | uint32 | - | DEC[sec] | |
| Command Response | Command Counter*1 | Incremented each time a command is received successfully. | 4 | 1 | uint8 | - | DEC[count] | |
| Command Reject Counter*1 | Incremented when a command is rejected. | 5 | 1 | uint8 | - | DEC[count] | ||
| Last Executed Command | ID of the last executed command | 6 | 1 | uint8 | - | DEC | ||
| Last Reject command | ID of the last rejected command | 7 | 1 | uint8 | - | DEC | ||
| Last Reject Code | Code (response code) when command is rejected | 8 | 1 | uint8 | - | DEC | ||
| Status Command Counter*1 | client_status_command_counter | Incremented for each successful command reception. | 9 | 1 | uint8 | - | DEC[count] | |
| Status Command Reject Counter*1 | client_status_command_reject_counter | Incremented when a command is rejected. | 10 | 1 | uint8 | - | DEC[count] | |
| Status Last Executed Command | client_status_last_command_code | ID of the last executed command | 11 | 1 | uint8 | - | DEC | |
| Status Last Reject command | client_status_last_reject_command_code | ID of the last rejected command | 12 | 1 | uint8 | - | DEC | |
| System Status | shutdown_request | Each bit indicates whether or not a mode that requires a reboot is in effect. Unassigned bits are reserved. Bit positions are expressed in network byte order. | 13 | 1 | uint8 | - | STATUS | |
| Request to execute forced power-off | 0:not requested, 1: requested | 13 | 0 | - | 0 | STATUS | ||
| Request to execute command sequence [GetLog -> Get log via FTP -> Request Shut down] | 0:not requested, 1: requested | 13 | 0 | - | 1 | STATUS | ||
| Reserved | 13 | 0 | - | 2:7 | STATUS | |||
| dmesg lines | dmesg_length | Number of dmesg lines | 14 | 2 | uint16 | - | DEC[lines] | |
| dmesg error | dmesg_errornum | Number of dmesg error lines | 16 | 2 | uint16 | - | DEC[lines] | |
| syslog error | syslog_errnum | Number of error lines in syslog | 18 | 2 | uint16 | - | DEC[lines] | |
| System Uptime | system_uptime | Total number of seconds the system has been running since it was turned on (up to 65535, or 10 hours) | 20 | 4 | uint32 | - | DEC[sec] | |
| Service Status | sensor-server.status | Missing values are assumed to be 0. The value of each bit indicates the status of the following services. Unassigned bits are reserved. Bit positions are expressed in network byte order. | 24 | 4 | uint32 | - | STATUS | |
| Reserved | 24 | 0 | - | 0:16 | STATUS | |||
| status_network.status | 0:stop, 1: run | 24 | 0 | - | 17 | STATUS | ||
| status_ifsw-server.status | 0:stop, 1: run | 24 | 0 | - | 18 | STATUS | ||
| status_ifsw-cliant.status | 0:stop, 1: run | 24 | 0 | - | 19 | STATUS | ||
| status_ifsw-monitoring.status | 0:stop, 1: run | 24 | 0 | - | 20 | STATUS | ||
| status_alertmanager.status | 0:stop, 1: run | 24 | 0 | - | 21 | STATUS | ||
| status_webhook_receiver.status | 0:stop, 1: run | 24 | 0 | - | 22 | STATUS | ||
| status_prometheus.status | 0:stop, 1: run | 24 | 0 | - | 23 | STATUS | ||
| status_telegraf.status | 0:stop, 1: run | 24 | 0 | - | 24 | STATUS | ||
| status_mosquitto.status | 0:stop, 1: run | 24 | 0 | - | 25 | STATUS | ||
| status_docker.status | 0:stop, 1: run | 24 | 0 | - | 26 | STATUS | ||
| status_health_check.status | 0:stop, 1: run | 24 | 0 | - | 27 | STATUS | ||
| status_logrotate.status | 0:stop, 1: run | 24 | 0 | - | 28 | STATUS | ||
| status_vsftpd.status | 0:stop, 1: run | 24 | 0 | - | 29 | STATUS | ||
| status_ssh.status | 0:stop, 1: run | 24 | 0 | - | 30 | STATUS | ||
| status_systemd-timesyncd.status | 0:stop, 1: run | 24 | 0 | - | 31 | STATUS | ||
| failed status | status_failed_num | Number of services with status fail | 28 | 1 | uint8 | - | DEC[count] | |
| CPU/GPU Status | CPU load ratio | cpu_usage_user | CPU load ratio | 29 | 1 | uint8 | - | DEC[%] |
| mean CPU frequency | linux_cpu_scaling_cur_freq | Average CPU clock frequency (MHz) | 30 | 2 | uint16 | - | DEC[MHz] | |
| Num of Process | processes_running | Number of processes running | 32 | 2 | uint16 | - | DEC[count] | |
| RAM Free | mem_available | Available memory size | 34 | 2 | uint16 | - | DEC[MB] | |
| GPU load ratio | amd_rocm_smi_utilization_gpu | GPU load ratio | 36 | 1 | uint8 | - | DEC[%] | |
| GPU Memory clock | amd_rocm_smi_clocks_current_memory | 37 | 2 | float16 | - | DEC[GHz] | ||
| GPU Shader clock | amd_rocm_smi_clocks_current_sm | 39 | 2 | float16 | - | DEC[GHz] | ||
| GPU VRAM usage | amd_rocm_smi_memory_used/amd_rocm_smi_memory_total | 41 | 1 | uint8 | - | DEC[%] | ||
| GPU GTT (Graphics Translation Table) usage | amd_rocm_gttmem_used/amd_rocm_gttmem_total | 42 | 1 | uint8 | - | DEC[%] | ||
| Communication | throughput | net_packets_throughput | LAN communication throughput | 43 | 2 | float16 | - | DEC[Mbps] |
| tcp byte received (4000) | net_bytes_recv | TCP port received amount (4000), roll over | 45 | 2 | uint16 | - | DEC[MB] | |
| tcp byte sent (4003) | net_bytes_sent | TCP port sent amount (4003), Roll over | 47 | 2 | uint16 | - | DEC[MB] | |
| err_in | net_err_in | Cumulative number of received packets with errors | 49 | 1 | uint8 | - | DEC[count] | |
| err_out | net_err_out | Cumulative number of sent packets with errors | 50 | 1 | uint8 | - | DEC[count] | |
| drop_in | net_drop_in | Cumulative number of received packets that were dropped | 51 | 1 | uint8 | - | DEC[count] | |
| drop_out | net_drop_out | Cumulative number of sent packets that were dropped | 52 | 1 | uint8 | - | DEC[count] | |
| NVMe0 | SSD Temperature | smart_temperature | Disk temperature | 53 | 2 | float16 | - | DEC[mC] |
| Power on hours | smart_power_on_hours | Total time the disk has been powered on | 55 | 4 | uint32 | - | DEC[hour] | |
| Powercycle count | smart_power_cycle_count | Number of power cycles the disk has been through | 59 | 4 | uint32 | - | DEC[count] | |
| Error information log entries | smart_error_information_log_entries | Number of entries recorded in the error log | 63 | 1 | uint8 | - | DEC[count] | |
| Available spare | smart_available_spare | Percentage of available spare space | 64 | 1 | uint8 | - | DEC[%] | |
| Smart critical warning | smart_critical_warning | Status of critical warnings 0: No warnings, 1: Warnings have occurred. The value of each bit indicates the status of the following warning types. Unassigned bits are reserved. The bit positions are expressed in little endian format in accordance with the external specifications. | 65 | 1 | uint8 | - | STATUS | |
| Reserved | 65 | 0 | - | 7:6 | STATUS | |||
| Invalid Persistent Memory | Persistent Memory Region is in Read-Only state or reliability is questionable. | 65 | 0 | - | 5 | STATUS | ||
| Volatile Backup | There is a problem with the volatile memory backup mechanism. | 65 | 0 | - | 4 | STATUS | ||
| Media Read-only | The media is in Read-Only mode. | 65 | 0 | - | 3 | STATUS | ||
| Reliability Status | The reliability of the NVM subsystem has degraded. | 65 | 0 | - | 2 | STATUS | ||
| Temperature Threshold | The temperature has exceeded the upper threshold or dropped below the lower threshold. | 65 | 0 | - | 1 | STATUS | ||
| Available Spare | The amount of spare space has dropped below the lower threshold. | 65 | 0 | - | 0 | STATUS | ||
| Percentage Used | smart_percentage_used | The percentage of life used. | 66 | 1 | uint8 | 0 | DEC[%] | |
| Unsafe Shutdowns | smart_unsafe_shutdowns | Number of unsafe shutdowns | 67 | 2 | uint16 | - | DEC[count] | |
| Media And Data Integrity Errors | smart_media_and_data_integrity_errors | Number of media and data integrity errors | 69 | 1 | uint8 | - | DEC[count] | |
| Percentage Rate | storage_raw_read_error_rate | Error rate when reading data | 70 | 1 | uint8 | - | DEC[%] | |
| Reallocated_Sector Ct | storage_reallocated_sector_ct | Number of reallocated bad sectors | 71 | 1 | uint8 | - | DEC[count] | |
| Hardware_Ecc Recovered | storage_hardware_ecc_recovered | Number of errors recovered by ECC | 72 | 1 | uint8 | - | DEC[count] | |
| Reallocated_Event Count | storage_reallocated_event_count | Number of reallocated events | 73 | 1 | uint8 | - | DEC[count] | |
| Offline Uncorrectable | storage_offline_uncorrectable | Number of uncorrectable errors in offline scan | 74 | 1 | uint8 | - | DEC[count] | |
| Udma_Crc_Error Count | storage_udma_crc_error_count | Number of CRC errors during UDMA transfer | 75 | 1 | uint8 | - | DEC[count] | |
| NVMe1 | SSD Temperature | smart_temperature | Disk temperature | 76 | 2 | float16 | - | DEC[mC] |
| Power on hours | smart_power_on_hours | Total time the disk was powered on | 78 | 4 | uint32 | - | DEC[hour] | |
| Powercycle count | smart_power_cycle_count | Number of times the disk has been power-cycled | 82 | 4 | uint32 | - | DEC[count] | |
| Error information log entries | smart_error_information_log_entries | Number of entries recorded in the error log | 86 | 1 | uint8 | - | DEC[count] | |
| Available spare | smart_available_spare | Percentage of available spare space | 87 | 1 | uint8 | - | DEC[%] | |
| Smart critical warning | smart_critical_warning | Status of critical warnings 0: No warnings, 1: Warnings have occurred. The value of each bit indicates the status of the following warning types. Unassigned bits are reserved. The bit positions are expressed in little endian format in accordance with the external specifications. | 88 | 1 | uint8 | - | STATUS | |
| Reserved | 88 | 0 | - | 7:6 | STATUS | |||
| Invalid Persistent Memory | Persistent Memory Region is in Read-Only state or reliability is questionable | 88 | 0 | - | 5 | STATUS | ||
| Volatile Backup | There is a problem with the volatile memory backup mechanism | 88 | 0 | - | 4 | STATUS | ||
| Media Read-only | The media is in Read-Only mode. | 88 | 0 | - | 3 | STATUS | ||
| Reliability Status | The reliability of the NVM subsystem has degraded. | 88 | 0 | - | 2 | STATUS | ||
| Temperature Threshold | The temperature has exceeded the upper threshold or dropped below the lower threshold. | 88 | 0 | - | 1 | STATUS | ||
| Available Spare | The amount of spare space has dropped below the lower threshold. | 88 | 0 | - | 0 | STATUS | ||
| Percentage Used | smart_percentage_used | Percentage of life used. | 89 | 1 | uint8 | 0 | DEC[%] | |
| Unsafe Shutdowns | smart_unsafe_shutdowns | Number of unsafe shutdowns | 90 | 2 | uint16 | - | DEC[count] | |
| Media And Data Integrity Errors | smart_media_and_data_integrity_errors | Number of media and data integrity errors | 92 | 1 | uint8 | - | DEC[count] | |
| Percentage Rate | storage_raw_read_error_rate | Error rate when reading data | 93 | 1 | uint8 | - | DEC[%] | |
| Reallocated_Sector Ct | storage_reallocated_sector_ct | Number of reallocated bad sectors | 94 | 1 | uint8 | - | DEC[count] | |
| Hardware_Ecc Recovered | storage_hardware_ecc_recovered | Number of errors recovered by ECC | 95 | 1 | uint8 | - | DEC[count] | |
| Reallocated_Event Count | storage_reallocated_event_count | Number of reallocated events | 96 | 1 | uint8 | - | DEC[count] | |
| Offline Uncorrectable | storage_offline_uncorrectable | Number of uncorrectable errors in offline scan | 97 | 1 | uint8 | - | DEC[count] | |
| Udma_Crc_Error Count | storage_udma_crc_error_count | Number of CRC errors during UDMA transfer | 98 | 1 | uint8 | - | DEC[count] | |
| SSD | SSD Temperature | smart_temperature | Disk temperature | 99 | 2 | float16 | - | DEC[mC] |
| Power on hours | smart_power_on_hours | Total time the disk has been powered on | 101 | 4 | uint32 | - | DEC[hour] | |
| Powercycle count | smart_power_cycle_count | Number of times the disk has been power cycled | 105 | 4 | uint32 | - | DEC[count] | |
| Error information log entries | smart_error_information_log_entries | Number of entries recorded in the error log | 109 | 1 | uint8 | - | DEC[count] | |
| Available spare | smart_available_spare | Percentage of available spare space | 110 | 1 | uint8 | - | DEC[%] | |
| Smart critical warning | smart_critical_warning | Status of critical warnings 0: No warnings, 1: Warnings have occurred. The value of each bit indicates the status of the following warning types. Unassigned bits are reserved. The bit positions are expressed in little endian format in accordance with the external specifications. | 111 | 1 | uint8 | - | STATUS | |
| Reserved | 111 | 0 | - | 7:6 | STATUS | |||
| Invalid Persistent Memory | Persistent Memory Region is in Read-Only state or reliability is questionable | 111 | 0 | - | 5 | STATUS | ||
| Volatile Backup | There is a problem with the volatile memory backup mechanism | 111 | 0 | - | 4 | STATUS | ||
| Media Read-only | The media is in Read-Only mode. | 111 | 0 | - | 3 | STATUS | ||
| Reliability Status | The reliability of the NVM subsystem has degraded. | 111 | 0 | - | 2 | STATUS | ||
| Temperature Threshold | The temperature has exceeded the upper threshold or dropped below the lower threshold. | 111 | 0 | - | 1 | STATUS | ||
| Available Spare | The amount of spare space has dropped below the lower threshold. | 111 | 0 | - | 0 | STATUS | ||
| Percentage Used | smart_percentage_used | Percentage of life used. | 112 | 1 | uint8 | 0 | DEC[%] | |
| Unsafe Shutdowns | smart_unsafe_shutdowns | Number of unsafe shutdowns | 113 | 2 | uint16 | - | DEC[count] | |
| Media And Data Integrity Errors | smart_media_and_data_integrity_errors | Number of media and data integrity errors | 115 | 1 | uint8 | - | DEC[count] | |
| Percentage Rate | storage_raw_read_error_rate | Error rate when reading data | 116 | 1 | uint8 | - | DEC[%] | |
| Reallocated_Sector Ct | storage_reallocated_sector_ct | Number of reallocated bad sectors | 117 | 1 | uint8 | - | DEC[count] | |
| Hardware_Ecc Recovered | storage_hardware_ecc_recovered | Number of errors recovered by ECC | 118 | 1 | uint8 | - | DEC[count] | |
| Reallocated_Event Count | storage_reallocated_event_count | Number of reallocated events | 119 | 1 | uint8 | - | DEC[count] | |
| Offline Uncorrectable | storage_offline_uncorrectable | Number of uncorrectable errors in offline scan | 120 | 1 | uint8 | - | DEC[count] | |
| Udma_Crc_Error Count | storage_udma_crc_error_count | Number of CRC errors during UDMA transfer | 121 | 1 | uint8 | - | DEC[count] | |
| Storage | Disk Used Percentage(DATA) | disk_used_percent | Disk usage (data) | 122 | 1 | uint8 | - | Usage rate [%] |
| Disk Used Percentage(System) | disk_used_percent | Disk usage percentage (system) | 123 | 1 | uint8 | - | Usage rate [%] | |
| Disk Inodes Used Percentage(DATA) | disk_inodes_used_percent | i-nodes usage percentage (data) | 124 | 1 | uint8 | - | Usage rate [%] | |
| Disk Inodes Used Percentage(System) | disk_inodes_used_percent | i-nodes usage rate (system) | 125 | 1 | uint8 | - | usage rate [%] | |
| Boot Device | storage_boot_device | device being booted up | 126 | 1 | uint8 | - | STATUS | |
| Boot NVMe No | storage_nvme_number | NVMe number that is booting up | 127 | 1 | uint8 | - | STATUS | |
| container | Running Container num | docker_n_containers_running | Number of docker processes in docker ps | 128 | 1 | uint8 | - | DEC[count] |
| Exited Container num | docker_n_containers_exited | Number of docker processes in docker ps -a --filter "status=exited" | 129 | 1 | uint8 | - | DEC[count] | |
| cAdvisor | max_storage_used | max_storage_used | Total file system usage by running container services | 130 | 1 | uint8 | - | DEC[%] |
| memory_failed_counter | memory_failed_counter | Number of memory allocation failures + number of OOM (Out Of Memory) events | 131 | 1 | uint8 | - | DEC[count] | |
| Temperature monitor | gpu_temp | gpu_temp | GPU temperature | 132 | 2 | float16 | - | DEC[C] |
| cpu_temp | sensors_temp_input | CPU temperature | 134 | 2 | float16 | - | DEC[mC] | |
| FDIR | Event Occurrence | fdir_event_flag | 0: No event occurred, 1: Event occurred The value of each bit indicates the status of the following alert event types. The bit positions are expressed in network byte order. | 136 | 2 | uint16 | - | STATUS |
| event_loss_packet | TCP/IP packet loss occurrence flag | 136 | 0 | - | 0 | STATUS | ||
| event_storage_overload | Storage overuse occurrence flag | 136 | 0 | - | 1 | STATUS | ||
| event_high_gpu_temp | GPU temperature warning flag (application shutdown) | 136 | 0 | - | 2 | STATUS | ||
| event_too_high_gpu_temp | GPU temperature warning flag (immediate power supply shutdown) | 136 | 0 | - | 3 | STATUS | ||
| event_high_cpu_temp | CPU temperature warning flag (clock down) | 136 | 0 | - | 4 | STATUS | ||
| event_too_high_cpu_temp | CPU abnormally high temperature flag (immediate power supply stop) | 136 | 0 | - | 5 | STATUS | ||
| event_memory_overload | Memory overuse occurrence flag | 136 | 0 | - | 6 | STATUS | ||
| event_application_timeout | Application execution timeout detection flag | 136 | 0 | - | 7 | STATUS | ||
| event_too_high_storage_temp | Storage temperature too high flag (power supply stopped after logging) | 136 | 0 | - | 8 | STATUS | ||
| Reserved | 136 | 0 | - | 9:15 | STATUS | |||
| APP | App ID in running[0] | Application ID in running[0] | 138 | 1 | uint8 | - | HEX | |
| OBS ID in running[0] | Observation ID in running[0] | 139 | 2 | uint16 | - | HEX | ||
| Script No in running[0] | Running script number in running[0] | 141 | 1 | uint8 | - | HEX | ||
| App ID in running[1] | Application ID in running[1] | 142 | 1 | uint8 | - | HEX | ||
| OBS ID in running[1] | Observation ID in running[1] | 143 | 2 | uint16 | - | HEX | ||
| Script No in running[1] | Running script number in running[1] | 145 | 1 | uint8 | - | HEX | ||
| App ID in running[2] | Application ID in running[2] | 146 | 1 | uint8 | - | HEX | ||
| OBS ID in running[2] | Observation ID in running[2] | 147 | 2 | uint16 | - | HEX | ||
| Script No in running[2] | Running script number in running[2] | 149 | 1 | uint8 | - | HEX | ||
| EDAC | Corrected Errors | Number of corrected memory errors | 150 | 1 | uint8 | - | DEC[count] | |
| Uncorrected Errors | Number of uncorrectable errors in memory | 151 | 1 | uint8 | - | DEC[count] |
*1: The counter value returns to 0 and starts counting again when it exceeds 255.
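The byte positions and types in the table map directly onto fixed-offset decoding. The following is a minimal sketch that reads a few of the fields with Python's `struct` module. It assumes the telemetry arrives as a single 152-byte buffer laid out as in the table and that multi-byte fields are big-endian; both points are assumptions, and the field names and the `decode_telemetry` helper are hypothetical.

```python
import struct

# (name, byte position, struct format) for a few rows of the table.
# ">" selects big-endian; the byte order of multi-byte fields is an assumption.
FIELDS = [
    ("unix_time", 0, ">I"),               # uint32
    ("command_counter", 4, ">B"),         # uint8
    ("command_reject_counter", 5, ">B"),  # uint8
    ("dmesg_lines", 14, ">H"),            # uint16
    ("system_uptime", 20, ">I"),          # uint32
    ("cpu_load_ratio", 29, ">B"),         # uint8
    ("throughput", 43, ">e"),             # float16
]

PACKET_SIZE = 152  # the last row starts at byte 151 and is 1 byte long


def decode_telemetry(packet):
    """Decode the selected fields from a telemetry buffer."""
    if len(packet) < PACKET_SIZE:
        raise ValueError("telemetry packet is too short")
    return {name: struct.unpack_from(fmt, packet, pos)[0]
            for name, pos, fmt in FIELDS}


# Build a synthetic packet to exercise the decoder.
packet = bytearray(PACKET_SIZE)
struct.pack_into(">I", packet, 0, 1700000000)
struct.pack_into(">B", packet, 4, 7)
struct.pack_into(">H", packet, 14, 1234)
struct.pack_into(">e", packet, 43, 1.5)

decoded = decode_telemetry(bytes(packet))
print(decoded["unix_time"], decoded["dmesg_lines"], decoded["throughput"])
```

Driving the decoder from a table of offsets keeps it easy to extend when telemetry items are added.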
To add telemetry, change the following three locations.
/opt/open-set/src/IFSW/config/config.ini
/opt/open-set/src/IFSW/command_handler/command_00_get_telemetry/get_telemetry.py
/opt/open-set/tests/tools/ifsw_verification/commands/BUSOBC_consts.py
In addition, if you want to add a method for collecting telemetry, you will need to make changes to the following two locations.
/opt/open-set/src/FDIRModule/telegraf_tools/telegraf.conf
/opt/open-set/src/FDIRModule/telegraf_tools/execd_plugin_files
The former is the telegraf configuration file.
The latter is the management directory for the script files that are executed by the telegraf inputs.execd plugin.
For example, if your environment is equipped with an AMD GPU and ROCm is installed, you can enable inputs.amd_rocm_smi and the get_status_rocm plugin under inputs.execd in telegraf.conf to collect GPU-related telemetry.
For information on how to collect telemetry using telegraf, refer to the telegraf GitHub repository (https://github.com/influxdata/telegraf). The rest of this section explains the changes required in IFSW when adding telemetry.
This file contains the configuration values used within IFSW, and also includes the names of the metrics collected by telegraf.
Open /opt/open-set/src/IFSW/config/config.ini and add the names of the metrics corresponding to the telemetry you want to add to the following.
TELEGRAF_FIELDS = client_status_command_counter, client_status_reject_counter, ...(omitted), ue_count
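After editing, it can be useful to confirm that the new metric name is really present in the comma-separated list. The following is a rough sketch using Python's standard `configparser`; the section name `telegraf` and the helper `load_telegraf_fields` are assumptions made for illustration, so use the section that TELEGRAF_FIELDS actually belongs to in your config.ini.

```python
import configparser

# Shortened sample; the section name "telegraf" is an assumption.
SAMPLE_INI = """
[telegraf]
TELEGRAF_FIELDS = client_status_command_counter, client_status_reject_counter, ue_count
"""


def load_telegraf_fields(ini_text: str, section: str = "telegraf") -> list[str]:
    """Return the metric names listed in TELEGRAF_FIELDS as a clean list."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    raw = parser.get(section, "TELEGRAF_FIELDS")
    # Split on commas and drop surrounding whitespace and empty entries.
    return [name.strip() for name in raw.split(",") if name.strip()]


fields = load_telegraf_fields(SAMPLE_INI)
print("ue_count" in fields)
```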
This file links the values of the metrics collected by telegraf to telemetry and outputs them.
First, open /opt/open-set/src/IFSW/command_handler/command_00_GetTelemetry/GetTelemetry.py and add the name of the new telemetry and its initial value (an invalid value) to telemetry_initial_values, shown below.
def initialize_status_data():
    global_db = SqliteDict(DB_IFSW_GLOBAL_VARS)
    telemetry_initial_values = {
        # ======== Time ==========#
        "unix_time": np.uint32(time.time()),
        : (omitted)
        "ue_count": np.uint8(0),
    }
If the metric collected by telegraf can be used as the telemetry value as it is (metric = telemetry), the changes to this file are complete. Otherwise, create a class that links the metrics to the telemetry. For example, the telemetry "gpu_gtt_usage", which indicates GPU GTT usage as a percentage, is derived from the metrics "gpu_gtt_used" and "gpu_gtt_total" by the following class.
class GPUGTTUsage(BaseTelemetryMsgHandler):
    def update_status(self):
        status = "gpu_gtt_usage"
        try:
            var_type = type(self.status_dict[status])
            value_used = self.metrics_dict["gpu_gtt_used"][0]["value"]
            value_total = self.metrics_dict["gpu_gtt_total"][0]["value"]
            self.status_dict[status] = var_type(
                100*float(value_used)/float(value_total))
        except Exception as e:
            logging.warning(
                "Exception occurred when updating gpu_gtt_usage. \n%s", e)
Similarly, create a class (NewTelemetry) that links the telemetry (new_telemetry) and metrics (metrics) to be added according to the following format.
class NewTelemetry(BaseTelemetryMsgHandler):
    def update_status(self):
        status = "new_telemetry"
        try:
            var_type = type(self.status_dict[status])
            self.status_dict[status] = var_type(
                # Here, write the expression that computes the telemetry value from the metric.
                # The metric value can be obtained with float(self.metrics_dict["metrics"][0]["value"])
            )
        except Exception as e:
            logging.warning(
                "Exception occurred when updating new_telemetry. \n%s", e)
Please change the names of "new_telemetry", "metrics" and NewTelemetry to match the telemetry you are adding.
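To try the pattern outside of IFSW, the following self-contained sketch may help. It assumes a stand-in `BaseTelemetryMsgHandler` that only stores the two dictionaries; the real base class in IFSW may do more. The telemetry name `disk_usage` and the metric names `disk_used` and `disk_total` are hypothetical, and a plain `int` is used for the initial value where IFSW uses NumPy types.

```python
import logging


class BaseTelemetryMsgHandler:
    """Stand-in for the IFSW base class (assumption: it keeps these two dicts)."""

    def __init__(self, status_dict, metrics_dict):
        self.status_dict = status_dict
        self.metrics_dict = metrics_dict


class DiskUsage(BaseTelemetryMsgHandler):
    """Hypothetical telemetry: a percentage derived from two hypothetical metrics."""

    def update_status(self):
        status = "disk_usage"
        try:
            # Keep the type of the initial value so the telemetry size stays fixed.
            var_type = type(self.status_dict[status])
            used = float(self.metrics_dict["disk_used"][0]["value"])
            total = float(self.metrics_dict["disk_total"][0]["value"])
            self.status_dict[status] = var_type(100 * used / total)
        except Exception as e:
            # On any failure the previous (initial) value is left untouched.
            logging.warning("Exception occurred when updating %s.\n%s", status, e)


status = {"disk_usage": 0}
metrics = {"disk_used": [{"value": "25"}], "disk_total": [{"value": "100"}]}
DiskUsage(status, metrics).update_status()
print(status["disk_usage"])
```

Because every failure is caught and logged, a missing metric or a zero total leaves the telemetry at its initial (invalid) value instead of interrupting telemetry generation.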
Once you have created the class, add an instance of it to the SpecificTelemetryMsgHandler list in the compose_telemetry_msg method of TelemetryMsgAggregator.
def compose_telemetry_msg(self):
    BaseTelemetry = BaseTelemetryMsgHandler(
        self.status_data, self.metrics_data)
    for telemetry in self.status_data.keys():
        BaseTelemetry.update_status(telemetry)
    SpecificTelemetryMsgHandler = [
        ShutdownRequest(self.status_data, self.metrics_data),
        RamFree(self.status_data, self.metrics_data),
        :(omitted)
        StorageNVMeNumber(self.status_data, self.metrics_data),
    ]
Once you have done this, restart the services related to IFSW to apply the changes.
sudo systemctl restart 'ifsw-*'
Open /opt/open-set/tests/tools/ifsw_verification/commands/BUSOBC_consts.py and edit the following items.
TELEMETRY_FIELDS = [
    # ======== Time ==========#
    "unix_time",
    :(omitted)
    "ue_count",
]
Add the name of the new telemetry to this list.
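Since the same telemetry name has to appear both in telemetry_initial_values and in TELEMETRY_FIELDS, a quick comparison of the two can catch a forgotten entry. The following is a minimal sketch; `compare_telemetry_names` is a hypothetical helper, and the sample data is shortened and uses plain integers in place of the NumPy values.

```python
def compare_telemetry_names(initial_values: dict, telemetry_fields: list) -> dict:
    """Report telemetry names that appear in only one of the two places."""
    initial = set(initial_values)
    listed = set(telemetry_fields)
    return {
        "missing_from_TELEMETRY_FIELDS": sorted(initial - listed),
        "missing_from_initial_values": sorted(listed - initial),
    }


# Shortened sample: "new_telemetry" was added to IFSW but not to the verification tool.
initial_values = {"unix_time": 0, "ue_count": 0, "new_telemetry": 0}
telemetry_fields = ["unix_time", "ue_count"]
report = compare_telemetry_names(initial_values, telemetry_fields)
print(report)
```

This only compares names. Whether the order of entries also has to match is not checked here.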
To delete telemetry, remove the relevant parts from the files updated in [3.2](#32-adding-telemetry). Then restart the IFSW-related services to apply the configuration changes.
If you have made changes to the telegraf configuration file or the inputs.execd plugin script file in order to collect telemetry, please delete those changes as well.
This chapter summarizes the specifications of FDIR, which detects MISSION-OBC abnormalities and performs isolation and recovery processing.
Prometheus is used as the alert management tool, and it works in conjunction with telegraf to monitor abnormal values in metrics. The following is a list of FDIRs that perform detection, isolation, and recovery processing in Open-SET.
| No | FDIR Name | FD Contents | IR_ID*2 | request_shutdown value*3 |
|---|---|---|---|---|
| 1 | Storage High Temperature | Detects abnormal temperatures in storage devices. | 2 | 0x40 |
| 2 | Application Execution Time | Detects MISSION-OBC startup time overruns. | 4 | 0x00 |
| 3 | Memory Overuse | Detects failures to secure memory requested by container applications. | 6 | 0x00 |
| 4 | CPU high temperature (>=90°C) | Detects abnormal temperatures in the CPU of the MISSION-OBC. | 1 | 0x80 |
| 5 | CPU high temperature (>=70°C) | Detects abnormal temperatures in the CPU of the MISSION-OBC. | 3 | 0x00 |
| 6 | GPU high temperature (>=90°C) | Detects abnormal temperatures in the GPU installed in the MISSION-OBC. | 1 | 0x80 |
| 7 | GPU high temperature (>=70°C) | Detects abnormal temperatures in the GPU installed in the MISSION-OBC. | 4 | 0x00 |
| 8 | Storage Overuse | Detects excessive use of the file system used by the container application. | 4 | 0x00 |
| 9 | Packet Loss | Detects packet loss that occurs in the NIC inside the MISSION-OBC. | 6 | 0x00 |
*2: IR_ID is an ID that is linked to IR processing.
*3: request_shutdown is a flag value that notifies the BUS-OBC of the status inside the MISSION-OBC. The BUS-OBC acts according to the value of request_shutdown. The MISSION-OBC uses the following values.
0x00: No processing
0x40: Shut down after retrieving log information
0x80: Shut down immediately
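On the receiving side, the three values can be mapped to readable names. The following is a small sketch of such a mapping; the enum and function names are hypothetical, and how the BUS-OBC actually reacts to each value is outside the scope of this example.

```python
from enum import IntEnum


class RequestShutdown(IntEnum):
    """The request_shutdown values listed in note *3."""

    NO_PROCESSING = 0x00
    SHUTDOWN_AFTER_LOG = 0x40
    SHUTDOWN_IMMEDIATELY = 0x80


def describe_request_shutdown(value: int) -> str:
    """Return the name of a request_shutdown value, or mark it as unknown."""
    try:
        return RequestShutdown(value).name
    except ValueError:
        return f"UNKNOWN(0x{value:02X})"


print(describe_request_shutdown(0x40))
```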
The IR processing for each IR_ID is as follows.
| IR_ID | IR content |
|---|---|
| 1 | Perform shutdown and stop power supply from BUS-OBC to MISSION-OBC. |
| 2 | Retrieve log information and perform shutdown. |
| 3 | Perform clock down. |
| 4 | If a container application is running, stop the application. |
| 6 | Output a message to the log file indicating that packet loss has occurred. |
To add a monitoring item, change the following three locations.
/opt/open-set/src/EventHub/WebhookReceiver/config/config.ini
/opt/open-set/src/FDIRModule/prometheus_tools/rule_files/
/opt/open-set/src/FDIRModule/prometheus_tools/prometheus.yml
config.ini maps each alert name to its bit position in the telemetry fdir_event_flag.
Open /opt/open-set/src/EventHub/WebhookReceiver/config/config.ini and add the name and bit position of the new alert to the [alertTBL] section.
[alertTBL]
too_high_storage_temp = 7
application_timeout = 8
memory_overload = 9
too_high_cpu_temp = 10
high_cpu_temp = 11
too_high_gpu_temp = 12
high_gpu_temp = 13
storage_overload = 14
loss_packet = 15
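To illustrate how the bit positions relate to fdir_event_flag, the following sketch loads the table and lists the alerts whose bits are set in a flag value. It assumes that bit position 0 is the least significant bit, which is not stated in this document, so check the bit order against the telemetry specification. The function names are hypothetical.

```python
import configparser

ALERT_TBL_INI = """
[alertTBL]
too_high_storage_temp = 7
application_timeout = 8
memory_overload = 9
too_high_cpu_temp = 10
high_cpu_temp = 11
too_high_gpu_temp = 12
high_gpu_temp = 13
storage_overload = 14
loss_packet = 15
"""


def load_alert_bits(ini_text: str) -> dict[str, int]:
    """Read the [alertTBL] section into {alert name: bit position}."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    return {name: int(bit) for name, bit in parser["alertTBL"].items()}


def decode_fdir_event_flag(flag: int, alert_bits: dict[str, int]) -> list[str]:
    """Return the names of alerts whose bit is set (assumption: bit 0 = LSB)."""
    return sorted(name for name, bit in alert_bits.items() if flag & (1 << bit))


alert_bits = load_alert_bits(ALERT_TBL_INI)
# Example flag with bits 10 and 15 set.
print(decode_fdir_event_flag((1 << 10) | (1 << 15), alert_bits))
```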
The rule_files directory holds the files that define alert rules. First, create an alert definition file for each monitoring item.
Creating an alert definition
vi [file name].yml
How to write an alert file (sample.yml)
groups:
  - name: <alert rule name>
    rules:
      - alert: <alert name>
        expr: <alert conditions>
        for: <time until the conditions are judged to be true>
        labels:
          severity: <alert content>
        annotations:
          summary: <alert summary>
          description: <Explanation of alert conditions>
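As an illustration of the template with its placeholders filled in, the following sketch renders the text of a rule file from a handful of values. The metric name `cpu_temp`, the duration `1m`, the severity text, and the helper `render_alert_rule` are all assumptions made for this example. Only the alert name `high_cpu_temp` and the 70°C threshold are taken from the tables in this chapter.

```python
RULE_TEMPLATE = """\
groups:
  - name: {group}
    rules:
      - alert: {alert}
        expr: {expr}
        for: {duration}
        labels:
          severity: {severity}
        annotations:
          summary: "{summary}"
          description: "{description}"
"""


def render_alert_rule(group, alert, expr, duration, severity, summary, description):
    """Fill in the rule file template and return it as text."""
    return RULE_TEMPLATE.format(
        group=group, alert=alert, expr=expr, duration=duration,
        severity=severity, summary=summary, description=description,
    )


rule_text = render_alert_rule(
    group="high_cpu_temp_rule",
    alert="high_cpu_temp",
    expr="cpu_temp >= 70",  # hypothetical metric name
    duration="1m",          # hypothetical waiting time
    severity="warning",
    summary="CPU temperature is high",
    description="CPU temperature has been 70 degrees C or higher for 1 minute",
)
print(rule_text)
```

The printed text can be saved as the .yml file. The alert name presumably has to match the key registered in [alertTBL] so that the bit position can be looked up.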
Next, place the file you created in /opt/open-set/src/FDIRModule/prometheus_tools/rule_files.
sudo cp sample.yml /opt/open-set/src/FDIRModule/prometheus_tools/rule_files
Copy the contents of rule_files to the location where the prometheus executable is located.
sudo cp /opt/open-set/src/FDIRModule/prometheus_tools/rule_files/* /prometheus-2.47.0.linux-amd64/rule_files
prometheus.yml is the configuration file for prometheus.
Open /opt/open-set/src/FDIRModule/prometheus_tools/prometheus.yml and add the path of the rule file for the alert you want to add to the rule_files section shown below.
rule_files:
- "/prometheus-2.47.0.linux-amd64/rule_files/app_runtime.yml"
- "/prometheus-2.47.0.linux-amd64/rule_files/high_cpu_temp.yml"
- "/prometheus-2.47.0.linux-amd64/rule_files/too_high_cpu_temp.yml"
- "/prometheus-2.47.0.linux-amd64/rule_files/high_gpu_temp.yml"
- "/prometheus-2.47.0.linux-amd64/rule_files/too_high_gpu_temp.yml"
- "/prometheus-2.47.0.linux-amd64/rule_files/loss_packet.yml"
- "/prometheus-2.47.0.linux-amd64/rule_files/storage_overload.yml"
- "/prometheus-2.47.0.linux-amd64/rule_files/memory_overload.yml"
- "/prometheus-2.47.0.linux-amd64/rule_files/high_temp_storage.yml"
Next, copy this file to the location where the prometheus executable is located.
sudo cp /opt/open-set/src/FDIRModule/prometheus_tools/prometheus.yml /prometheus-2.47.0.linux-amd64
Once this is complete, restart the following FDIR-related services to apply the settings.
sudo systemctl restart prometheus.service
sudo systemctl restart webhook_receiver.service
sudo systemctl restart alertmanager.service
To delete a monitoring item, remove the relevant parts from the files updated in [4.2](#42-adding-monitoring-items). Then restart the FDIR-related services to apply the configuration changes.
IR events are IR processes that are executed in response to IR_ID. You can add IR events by saving script files that define IR processes in the following directory.
/opt/open-set/src/EventHub/WebhookReceiver/IR_events
First, create a script file for the IR event. The file name should be in the format XX_*.py, with a two-digit hexadecimal number XX at the beginning. XX is the IR_ID for that event.
Creating a script file
vi [file name].py
How to write a script file (FF_sample.py)
def ExecuteIR():
    response_code = E_OK
    try:
        # Write the IR event processing here.
        pass
    except Exception as e:
        logging.warning("Error: %s", e)
        response_code = ER_IR_EXE
    return response_code
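The file naming rule (XX_*.py, where XX is the IR_ID as a two-digit hexadecimal number) can be checked before the file is deployed. The following sketch extracts the IR_ID from a file name; it only illustrates the naming convention and is not how WebhookReceiver itself parses file names, which is not described here. `ir_id_from_filename` is a hypothetical helper.

```python
import re

# Two hexadecimal digits, an underscore, any remaining name, then ".py".
IR_FILE_PATTERN = re.compile(r"^([0-9A-Fa-f]{2})_.*\.py$")


def ir_id_from_filename(filename: str):
    """Return the IR_ID encoded in the file name, or None if the name does not fit."""
    match = IR_FILE_PATTERN.match(filename)
    return int(match.group(1), 16) if match else None


print(ir_id_from_filename("FF_sample.py"))  # 0xFF as a decimal integer
print(ir_id_from_filename("sample.py"))     # does not follow the rule
```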
Next, please place the file you created in /opt/open-set/src/EventHub/WebhookReceiver/IR_events.
sudo cp FF_sample.py /opt/open-set/src/EventHub/WebhookReceiver/IR_events
Once you have completed the above, restart the following FDIR-related services to apply the settings.
sudo systemctl restart prometheus.service
sudo systemctl restart webhook_receiver.service
sudo systemctl restart alertmanager.service
To delete an IR event, remove the relevant script file added in [4.4](#44-adding-ir-events).
This chapter explains the device recovery process. The following explains recovery operations in line with the default Open-SET specifications, but please customize as necessary to suit your own environment.
The default specification of Open-SET assumes that there are three storage devices, NVMe0, NVMe1, and SATA0, and of these, SATA0 is used as the storage device for recovery. Recovery is performed using the following three commands.
- 0x10 PowerOnDevice
- 0x11 RestoreOS
- 0x12 RestoreFile
The basic flow of recovery is as follows.
- If NVMe0/NVMe1 fails to boot for some reason, boot from SATA0.
- Execute the PowerOnDevice command on SATA0 to enable the device to be recovered (do not enable multiple devices at the same time).
- Execute the RestoreOS command on SATA0 to deploy the image file in /initialdata on SATA0 to the device to be recovered.
- Set the device to be recovered to the boot device and reboot.
- Confirm that the device to be recovered has started up.
- If necessary, use the RestoreFile command to execute the update script in /initialdata on SATA0 on the current device and apply the differences from the image file.
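As a sketch of what the first two commands in this flow look like on the wire, the following builds the JSON parameters for PowerOnDevice and RestoreOS in the format given in the command list of Chapter 2. The function names are hypothetical, and how the JSON is sent from the BUS-OBC command client is not covered. RestoreFile is left out because its parameters are not described in this part of the document.

```python
import json

# Device identifiers as given in the command list (0: nvme0, 1: ssd0, 2: nvme1).
DEVICE_IDS = {"nvme0": 0, "ssd0": 1, "nvme1": 2}


def power_on_device(device: str) -> str:
    """Build the PowerOnDevice (0x10) request for the device to be restored."""
    return json.dumps(
        {"command_id": "0x10", "parameter": [{"device_id": DEVICE_IDS[device]}]}
    )


def restore_os(device: str, progress_id: int) -> str:
    """Build the RestoreOS (0x11) request.

    progress_id: 1 = boot and root partitions, 2 = root partition,
    3 = boot device switching only.
    """
    if progress_id not in (1, 2, 3):
        raise ValueError("progress_id must be 1, 2 or 3")
    return json.dumps(
        {
            "command_id": "0x11",
            "parameter": [
                {"device_id": DEVICE_IDS[device]},
                {"progress_id": progress_id},
                {"progress_bytes": 0},  # fixed value
            ],
        }
    )


print(power_on_device("nvme0"))
print(restore_os("nvme0", 1))
```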
As a preliminary step, please place the image file and update application script in /initialdata on SATA0. Also, please check the following recovery-related settings in /opt/open-set/src/IFSW/config/config.ini.
[recovery_sequence]
(omitted)
DD_UPTIME_TIMEOUT = 1500 # Timeout for RestoreOS command (seconds since MISSION-OBC booted)
DD_BLOCK_SIZE = 1048576 # Block size when expanding image files with the RestoreOS command
MOUNT_WAIT_TIME = 3 # Time to wait for SATA0 to be enabled and mount to be executed with the RestoreFile command
DIR_MOUNT_RESTORE = /mnt/restore # Directory name to mount the device to be recovered with the RestoreOS command
DIR_INITIAL_DATA = /initialdata # Directory for recovery files
DIR_INSTALLER = / # Base path for unmounting SATA0 in the RestoreFile command
REBOOT_SCRIPT = /opt/open-set/sys/system_reboot # Directory for reboot script files
BOOT_ISO = boot_golden-image.iso # Image file name of the boot partition to be placed in /initialdata on SATA0
ROOT_ISO = root_golden-image.iso # Image file name of the root partition to be placed in /initialdata on SATA0
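To check these values from a script, the section can be read with Python's standard `configparser`. The sketch below is not taken from IFSW and works on a shortened copy of the settings. Note that the trailing `# ...` comments are only stripped when `inline_comment_prefixes` is set; with the default parser settings they would become part of the value.

```python
import configparser

SAMPLE_INI = """
[recovery_sequence]
DD_UPTIME_TIMEOUT = 1500 # Timeout for RestoreOS command
DD_BLOCK_SIZE = 1048576 # Block size when expanding image files
DIR_INITIAL_DATA = /initialdata # Directory for recovery files
BOOT_ISO = boot_golden-image.iso # Image file name of the boot partition
"""

# Strip "# ..." comments that follow a value on the same line.
parser = configparser.ConfigParser(inline_comment_prefixes=("#",))
parser.read_string(SAMPLE_INI)
section = parser["recovery_sequence"]

timeout_s = section.getint("DD_UPTIME_TIMEOUT")
block_size = section.getint("DD_BLOCK_SIZE")
boot_image = f'{section["DIR_INITIAL_DATA"]}/{section["BOOT_ISO"]}'

print(timeout_s, block_size, boot_image)
```

The default DD_BLOCK_SIZE of 1048576 bytes corresponds to 1 MiB (1024 × 1024).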
In addition, at the time of installation, the three recovery commands are disabled. To enable them, you need to change the contents of the following file.
/opt/open-set/src/IFSW/tools/control_device.py
Replace the following four placeholder commands in this file with the appropriate commands for your environment.
cmd = ["echo", "Powering on device:",
DEVICE_NAME_INFO[device_id]["storage_device"]]
->
Replace with the command to enable the storage device corresponding to device_id.
cmd = ["echo", "Setting boot device to:",
DEVICE_NAME_INFO[device_id]["storage_device"]]
->
Replace with a command that sets the storage device corresponding to device_id as the device to boot next time.
cmd = ["echo", "Setting boot device to: default"]
->
Replace with a command that sets the device to boot next time to the default.
cmd = ["echo", "Current boot device:"] # Edite here
->
Replace with a command that displays the name of the currently booted storage device.
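Before putting a replacement command into control_device.py, it can help to run it in isolation as the same kind of argument list. The following sketch runs a `cmd` list with `subprocess.run` and returns its output; it uses the installed `echo` placeholder as the example, and `run_device_command` is a hypothetical helper. How control_device.py itself executes `cmd` is not shown in this document, so treat this only as a way to test a command.

```python
import subprocess


def run_device_command(cmd: list[str], timeout: float = 10.0) -> str:
    """Run a command given as an argument list and return its standard output.

    Raises subprocess.CalledProcessError if the command exits with a non-zero
    status, and subprocess.TimeoutExpired if it runs longer than `timeout`.
    """
    result = subprocess.run(
        cmd, capture_output=True, text=True, check=True, timeout=timeout
    )
    return result.stdout.strip()


# The placeholder shipped with the installer only echoes its arguments.
print(run_device_command(["echo", "Powering on device:", "nvme0"]))
```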
When you have finished, restart the services related to IFSW to apply the changes.
sudo systemctl restart 'ifsw-*'
This chapter explains data transfer between BUS-OBC and MISSION-OBC using FTP commands.
The procedure for performing PUT/GET using SFTP from the client side is as follows.
sftp [user name]@[IP address/domain name of the connection destination]
Connected to [IP address/domain name of the connection destination].
sftp> get [path of the file to be acquired (@MISSION-OBC)]
sftp> put [path of file to be sent (@satellite bus system)]
Note that the connection uses the authentication information configured for vsftpd by the Installer, so files cannot be sent to directories for which that user has no write permission. Replace [user name] with your actual user name, and [IP address/domain name of the connection destination] with the IP address or domain name of the server configured in the vsftpd settings of the Installer.
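For repeated transfers, the same get/put steps can be written to a batch file and passed to `sftp -b`. The following sketch only builds the batch text and the command line and does not open a connection. The user name, address, and file paths are placeholders, the helper names are hypothetical, and paths are assumed to contain no whitespace. Batch mode cannot prompt for a password, so it requires non-interactive authentication such as a key.

```python
def build_sftp_batch(gets=(), puts=()) -> str:
    """Build sftp batch commands.

    gets: (remote path, local path) pairs to download.
    puts: (local path, remote path) pairs to upload.
    """
    lines = [f"get {remote} {local}" for remote, local in gets]
    lines += [f"put {local} {remote}" for local, remote in puts]
    lines.append("bye")
    return "\n".join(lines) + "\n"


def build_sftp_argv(user: str, host: str, batch_path: str) -> list[str]:
    """Build the argument list for running sftp in batch mode."""
    return ["sftp", "-b", batch_path, f"{user}@{host}"]


batch = build_sftp_batch(
    gets=[("/home/user/result.dat", "result.dat")],  # placeholder paths
    puts=[("update.tar", "/home/user/update.tar")],
)
print(batch)
print(build_sftp_argv("user", "192.0.2.10", "transfer.batch"))
```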
