1. Prerequisites

This document explains how to use the Open-SET software. Before you begin, install Open-SET with the installer (see the README).

The configuration of this software platform is as follows.

  • IFSW and the IFSW verification tool send and receive commands.
  • mosquitto is the internal messaging broker.
  • telegraf collects status information.
  • prometheus manages alerts, and the docker engine and related tools provide the execution environment for user applications.

Software architecture

2. Command Specifications

This chapter summarizes the specifications for the commands that are requested from the BUS-OBC command client and executed within the MISSION-OBC.

2.1. List of Commands

The list of commands that can be requested from the BUS-OBC command client is as follows.

ID Command name Function Parameters Response
0x00 GetTelemetry Obtain telemetry information { "command_id": "0x00",
  "parameter": {} }
0x80 Telemetry information
0x01 GetLog Get system log information { "command_id": "0x01",
  "parameter": [
    {"time_id": "YYYY-MM-DD"}
  ] }
0x81 System log information
0x02 Shutdown Shut down MISSION-OBC { "command_id": "0x02",
  "parameter": {} }
0x82 MISSION-OBC shutdown result
0x10 PowerOnDevice Start the device to be restored. { "command_id": "0x10",
  "parameter": [
    {"device_id": 0}
    //Device identifier to be restored
    //0: nvme0
    //1:ssd0
    //2:nvme1
  ] }
0x90 OS Restore Preprocessing Result
0x11 RestoreOS Restore the OS image of the device to be restored. { "command_id": "0x11",
  "parameter": [
    {"device_id": 0},
    //Device identifier to be restored
    //0: nvme0
    //1: ssd0
    //2: nvme1
    {"progress_id": 0},
    //Partition of the device to be restored
    //1: boot partition and root partition
    //2: root partition
    //3 : Boot device switching only (do not process if there is no target for restoration, and only the device switching part of the post-processing will run)
    {"progress_bytes": 0}
    //0 (fixed value)
  ] }
0x91 OS Restore Result
0x12 RestoreFile Restores any file on the device to be restored. { "command_id": "0x12",
  "parameter": [
    {"installer_list": ["/opt/open-set/sys/restore_symbolic_link.sh"]}
    //Script name (array) for performing additional installation. If additional installation is not required, specify only the script that creates a symbolic link in the home directory. If additional installation is required, add the script to the array.
  ] }
0x92 Post-processing results of OS restoration
0x21 DeployApp Deploy apps uploaded from the ground. { "command_id": "0x21",
  "parameter": [
    {"deploy_method": "0x00"},
    // "0x00" is fixed
      {"image_file": "hello.tar.gz"},
    // File name of the app to be deployed (via docker load or deb install)
    {"app_uid": "0x01"}
    // App user ID
  ] }
0xA1 Container deployment result, deb file installation result
0x22 GetAppInfo Obtain information (container information) about all deployed apps. { "command_id": "0x22",
  "parameter": [
    {"deploy_method": "0x00"}
    //"0x00" is fixed
  ] }
0xA2 Information on deployed containers and installed deb files
0x24 GetFile Get a file (text file) { "command_id": "0x24",
  "parameter": [
    {"config_file_name": "/export/home/exp-01/01_docker-compose.yml"}
    //text file name
  ] }
0xA4 Parameter setting details
0x25 ExecuteApp Execute the app { "command_id": "0x25",
  "parameter": [
    {"deploy_method": "0x00"},
    //"0x00" fixed
    {"app_uid": "0x01"},
    //Application user ID
    {"obs_id": "0x0001"}
    //Observation request number
  ] }
0xA5 Container app and native app execution results
0x27 GetResult Obtain the result of executing the app. { "command_id": "0x27",
  "parameter": [
    {"app_uid": "0x01"},
    //App user ID
    {"obs_id": "0x0001"}
    //Observation request number
  ] }
0xA7 Container app and native app execution results
0x60 MoveFile Move a file { "command_id": "0x60",
  "parameter": [
    {"file_path": "hello.tar.gz"},
    //File name
    {"attribute": "0x00"},
    //Attribute ID
    //"0x00": App
    //"0x01": Application configuration file
    //"0x10": Execution condition file yaml (yml), sh file
    //"0x20": Satellite images taken in orbit
    //"0x21": Other files required by the AI application other than the above
    {"app_uid": "0x01"},
    //Application user ID
    {"obs_id": "0x0001"}
    //Observation request number
  ] }
0xE0 Result of placing a file that has been interfaced with FTP PUT in the application area
0x61 NotifyGetFile Notifies that the satellite bus system has already acquired the file from MISSION-OBC. { "command_id": "0x61",
  "parameter": [
    {"path": "export/home/exp-01/obs-0001/toGround_arch/result_exp-01_obs-0001.tar.gz"},
    // Filename (full path)
    {"delete_file": 1},
    // Whether or not to delete the retrieved file
    // 0: Do not delete
    // 1: Delete
    {"app_uid": "0x01"}
    // App user ID
    // 0x00: no user; otherwise the user number, assigned in sequential order
  ] }
0xE1 Post-processing result for files that have been interfaced with FTP GET
0x62 GetDirectory Obtain information on a specific directory in MISSION-OBC { "command_id": "0x62",
  "parameter": [
    {"path": "./"}
    // Directory path (full path)
  ] }
0xE2 MISSION-OBC specific directory information archived with tar, compressed with gzip, and encoded with base64
0x63 DeleteFile Deletes a specific file or container in MISSION-OBC. { "command_id": "0x63",
  "parameter": [
    {"target": "/mnt/open-set/trash"},
    // Full path of the object to be deleted, or the docker image name of the target for deletion: tag name
    // e.g. {"target" : "AppXXX:latest"}
    {"type": "0x00"},
    // normal file 0x00 / directory 0x01 / Docker image 0x02
    {"date": "2023-01-01T00:00:00+09:00"}
    // Time condition for deletion (can be omitted).
    // Files whose last modified date is before this time are deleted.
    // Note that the system time is UTC (32-bit Unix time), not Japan Standard Time.
    // Format:
    // - Without a time zone (interpreted as UTC): "YYYY-MM-ddThh:mm:ss"
    // - With a time zone: "YYYY-MM-ddThh:mm:ss+hh:mm"
    // Example: To delete files last modified before 2023 in Japan time, specify "2023-01-01T00:00:00+09:00".
  ] }
0xE3 Result of deleting a specific file or container within MISSION-OBC
0x70 ExecuteCommand Execute a specific shell command within MISSION-OBC { "command_id": "0x70",
  "parameter": [
    {"shell": "date"}
    // The shell command to execute
    // The command must satisfy the following conditions
    // - It must be specified on a single line
    // - It must not include commands that require pre-installation
    // - It must not include commands that require input during execution
    // - Do not include commands that will destroy the system
  ] }
0xF0 Result of the shell command executed within MISSION-OBC, archived with tar, compressed with gzip, and encoded with base64
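
As a rough illustration of the request layout in the table above, the following Python sketch builds the JSON text for a command. The helper name build_command_request is hypothetical and not part of Open-SET. The sketch only assembles the JSON and does not cover how the command client transmits it.

```python
import json


def build_command_request(command_id, parameter=None):
    """Serialize a command request in the JSON layout used in the command table."""
    if not 0x00 <= command_id <= 0x7F:
        raise ValueError("control command IDs must be in the range 0x00-0x7F")
    body = {
        # The table writes the ID as a two-digit hexadecimal string, e.g. "0x01".
        "command_id": f"0x{command_id:02X}",
        # Commands without parameters use an empty object in the table.
        "parameter": parameter if parameter is not None else {},
    }
    return json.dumps(body)


# GetLog (0x01) takes a single "time_id" entry in YYYY-MM-DD form.
get_log = build_command_request(0x01, [{"time_id": "2024-01-31"}])
print(get_log)
```

Rejecting IDs above 0x7F in the helper keeps status command IDs (0x80-0xFF) from being sent as requests by mistake.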

2.2. How to send and receive command request data

The method for sending and receiving command request data is as follows.

  1. Check that IFSW is operating normally. If it is not, restart the service.
     • Check: sudo systemctl status ifsw*
     • Restart: sudo systemctl restart ifsw*
  2. Start the status command server.
     • cd /opt/open-set/tests/tools/ifsw_verification/commands
     • python3 status_command_server.py
  3. Open another terminal window and start the command client.
     • cd /opt/open-set/tests/tools/ifsw_verification/commands
     • python3 command_client_manual.py

If successful, you will be prompted to select a command ID as follows. For example, if you want to execute the "GetTelemetry" command to obtain telemetry, enter 0x00.

==========================================
please select a command ID from the list:
ID | NAME
0x00 | GetTelemetry
0x01 | GetLog
0x02 | Shutdown
...
  4. After you send the command, its contents are displayed, followed by a response message from the IFSW server. If the command is received successfully, the server returns "SERVER > Command accepted."

2.3. Adding a command

Change the following three places.

/opt/open-set/src/IFSW/command_handler
/opt/open-set/tests/tools/ifsw_verification/commands/status_command_handler.py
/opt/open-set/tests/tools/ifsw_verification/commands/BUSOBC_consts.py

2.3.1. Modifying command_handler

First, add a folder with the following structure to /opt/open-set/src/IFSW/command_handler.

 +-- command_XX_your_command
   +-- your_command.py
   +-- template.json

Here, replace "your_command" with the name of the command you want to add. "XX" is a two-digit hexadecimal ID; choose one that does not conflict with the IDs of existing commands. Command IDs are assigned according to the following criteria.

Range of IDs for control commands: 0x00-0x7F
- 0x00-0x0F: System-related (e.g. telemetry, logging, power)
- 0x10-0x1F: Error Clear-related (e.g. recovery)
- 0x20-0x5F: Mission-related (e.g. application loading/execution)
- 0x60-0x6F: File operation-related
- 0x70-0x7F: Development-related

Range of IDs for status commands: 0x80-0xFF (reserved)
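
The ID ranges above can be checked mechanically before you pick an ID. The following is a minimal sketch, with hypothetical helper names, that classifies an ID and tests whether it is still free. The set of existing IDs is copied from the command list in section 2.1.

```python
# Control command ID ranges as listed in this section.
ID_RANGES = [
    (0x00, 0x0F, "System-related"),
    (0x10, 0x1F, "Error Clear-related"),
    (0x20, 0x5F, "Mission-related"),
    (0x60, 0x6F, "File operation-related"),
    (0x70, 0x7F, "Development-related"),
]

# IDs already used by the commands in section 2.1.
EXISTING_IDS = {
    0x00, 0x01, 0x02, 0x10, 0x11, 0x12, 0x21, 0x22,
    0x24, 0x25, 0x27, 0x60, 0x61, 0x62, 0x63, 0x70,
}


def classify_command_id(command_id):
    """Return the category name for an ID between 0x00 and 0xFF."""
    for low, high, name in ID_RANGES:
        if low <= command_id <= high:
            return name
    if 0x80 <= command_id <= 0xFF:
        return "Reserved for status command IDs"
    raise ValueError("command ID must fit in one byte")


def is_free(command_id, existing_ids=EXISTING_IDS):
    """True if the ID is a control command ID that no existing command uses."""
    return 0x00 <= command_id <= 0x7F and command_id not in existing_ids


print(classify_command_id(0x23), is_free(0x23))
```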

In template.json, please describe the default parameter values for the command in JSON format.

A command script consists of the following three basic components. When creating your_command.py, edit the commented parts in accordance with the following format.

class InvokeCommand(BaseCommand):
    def check_parameter(self, *args):
        super().check_parameter()
        result = check_parameter(*args)
        return result

    def pre_process(self, params): # If there are any parameters included in the return content of the command execution, include them here.
        super().pre_process()
        self.parameter = [{"response_code": ER_COMMAND_EXE}, {"data": params}] # Write the return content in the event of a command error here.

    def main_process(self, *args):
        super().main_process()
        parameter = invoke_command(*args)
        return parameter


def check_parameter(parameter): # Include the command parameters here.
    # Write the code to check the command parameters here.
    return True


def invoke_command(params): # Include any parameters used in the command-specific processing and return content here.
    # Write the command-specific processing here.
    # ...
    parameter = [{"response_code": E_OK}] # Describe the return content when the command ends normally here.

    return parameter

Once you have made the above changes, restart the services related to IFSW in order to reflect the changes in MISSION-OBC.

sudo systemctl restart ifsw-*
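
As an illustration of the check_parameter hook in the skeleton above, here is a minimal sketch for a hypothetical command that takes the same parameter shape as GetLog, a list with one {"time_id": "YYYY-MM-DD"} entry. The validation rules are assumptions made for this example, not the checks that Open-SET itself performs.

```python
from datetime import datetime


def check_parameter(parameter):
    """Return True only if parameter looks like [{"time_id": "YYYY-MM-DD"}]."""
    if not isinstance(parameter, list) or len(parameter) != 1:
        return False
    entry = parameter[0]
    if not isinstance(entry, dict) or set(entry) != {"time_id"}:
        return False
    value = entry["time_id"]
    # Require the zero-padded ten-character form, e.g. "2024-01-05".
    if not isinstance(value, str) or len(value) != 10:
        return False
    try:
        # strptime also rejects impossible dates such as 2024-02-30.
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True


print(check_parameter([{"time_id": "2024-01-31"}]))
```

Returning False instead of raising keeps the hook compatible with the skeleton, where check_parameter simply returns its result.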

2.3.2. Change to status_command_handler

Open /opt/open-set/tests/tools/ifsw_verification/commands/status_command_handler.py and edit the following items.

status_commandTBL = {
    0x80: EndGetTelemetry,
    0x81: EndGetLogInfo,
    0x82: EndShutdown,
    0x90: EndPowerOnDevice,
    0x91: EndRestoreOS,
    0x92: EndRestoreFile,
    0xA1: EndDeployApp,
    0xA2: EndGetAppInfo,
    0xA4: EndGetFile,
    0xA5: EndExecuteApp,
    0xA7: EndGetAppResult,
    0xE0: EndMoveFile,
    0xE1: EndNotifyGetFile,
    0xE2: EndGetDirectoryList,
    0xE3: EndDeleteFile,
    0xF0: EndExecuteCommand,
}

Add to this table the status command ID (command ID + 0x80) of the new command and the name of the function (e.g. PutMonitorStatus) that will receive the status command.
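
The "command ID + 0x80" rule can be written as a one-line helper. The function name below is hypothetical; the arithmetic matches the existing pairs in the table, such as 0x00 and 0x80 or 0x27 and 0xA7.

```python
STATUS_ID_OFFSET = 0x80


def status_command_id(command_id):
    """Return the status command ID that answers the given control command ID."""
    if not 0x00 <= command_id <= 0x7F:
        raise ValueError("control command IDs must be in the range 0x00-0x7F")
    return command_id + STATUS_ID_OFFSET


print(hex(status_command_id(0x27)))
```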

Next, add the function that will receive the status command. In the case of GetTelemetry, the following is the corresponding function.

@timeout_decorator.timeout(COM_HANDLER_TIMEOUT, use_signals=False)
def EndGetTelemetry(response_code, telemetry, parameterCheck=False):
    if parameterCheck:
        return True

    print("Command executed : " + sys._getframe().f_code.co_name)
    print("Telemetry:\n")
    # Print each value next to the field name at the same position.
    for index, value in enumerate(telemetry):
        print(TELEMETRY_FIELDS[index] + ": " + str(value))
    return

Define the function parameters to match the content returned by invoke_command in your_command.py, and add a parameter check if necessary. Because the print output is displayed on the status command server, make sure that the function prints all the information you need.
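
For comparison with EndGetTelemetry, the sketch below shows what a receiver for a new command might look like. The command ID 0x23, the name EndYourCommand, and the data argument are all hypothetical. The sketch assumes that invoke_command returns [{"response_code": ...}, {"data": ...}] and that the values are passed to the receiver in that order. The timeout decorator is omitted so that the example runs without extra packages.

```python
import sys


def EndYourCommand(response_code, data, parameterCheck=False):
    """Hypothetical receiver for the status command of a new command."""
    if parameterCheck:
        return True

    print("Command executed : " + sys._getframe().f_code.co_name)
    print("response_code: " + str(response_code))
    print("data: " + str(data))
    return


# Register the receiver under the status command ID (command ID + 0x80).
status_commandTBL = {
    0x23 + 0x80: EndYourCommand,
}

status_commandTBL[0xA3](0, "example payload")
```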

2.3.3. Changing BUSOBC_consts

Open /opt/open-set/tests/tools/ifsw_verification/commands/BUSOBC_consts.py and edit the following items.

commandLIST = [
    "0x00-GetTelemetry",
    "0x01-GetLog",
    "0x02-Shutdown",
    "0x10-PowerOnDevice",
    "0x11-RestoreOS",
    "0x12-RestoreFile",
    "0x21-DeployApp",
    "0x22-GetAppInfo",
    "0x24-GetFile",
    "0x25-ExecuteApp",
    "0x27-GetResult",
    "0x60-MoveFile",
    "0x61-NotifyGetFile",
    "0x62-GetDirectory",
    "0x63-DeleteFile",
    "0x70-ExecuteCommand",
]

Add the command ID and command name corresponding to the newly added command to this list.
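
Entries in this list follow the pattern "0xNN-CommandName". The sketch below, using a hypothetical helper and a shortened copy of the list, formats a new entry and refuses an ID that is already taken. Keeping the list sorted by ID is a readability choice in this example, not a documented requirement.

```python
def entry_id(entry):
    """Extract the numeric ID from an entry such as "0x21-DeployApp"."""
    return int(entry.split("-", 1)[0], 16)


def add_command_entry(command_list, command_id, name):
    """Return a new list that includes the entry for the given command."""
    if command_id in {entry_id(item) for item in command_list}:
        raise ValueError(f"command ID 0x{command_id:02X} is already in use")
    entry = f"0x{command_id:02X}-{name}"
    return sorted(command_list + [entry], key=entry_id)


existing = ["0x21-DeployApp", "0x22-GetAppInfo", "0x24-GetFile", "0x25-ExecuteApp"]
updated = add_command_entry(existing, 0x23, "YourCommand")
print(updated)
```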

2.4. Deleting Commands

To delete a command, remove the relevant parts from the files updated in section 2.3 (Adding a command). Then restart the IFSW-related services to apply the changes.

3. Telemetry Specifications

This chapter summarizes the specifications for the telemetry collected within the MISSION-OBC.

3.1. Telemetry List

We use telegraf as a tool for collecting MISSION-OBC status information. In addition, some values are managed in a database shared within Open-SET, and are updated by IFSW and IR handlers. The following is a list of the telemetry collected by Open-SET.

Category Item telegraf_metrics Description / Remarks Byte Pos Byte Size Type Bit Display Type[Unit]
Time Time UNIX time Outputs UNIX time 0 4 uint32 - DEC[sec]
Command Response Command Counter*1 Incremented each time a command is received successfully. 4 1 uint8 - DEC[count]
Command Reject Counter*1 Incremented when a command is rejected. 5 1 uint8 - DEC[count]
Last Executed Command ID of the last executed command 6 1 uint8 - DEC
Last Reject command ID of the last rejected command 7 1 uint8 - DEC
Last Reject Code Code (response code) when command is rejected 8 1 uint8 - DEC
Status Command Counter*1 client_status_commnad_counter Incremented for each successful command reception. 9 1 uint8 - DEC[count]
Status Command Reject Counter*1 client_status_commnad_reject_counter Incremented when a command is rejected. 10 1 uint8 - DEC[count]
Status Last Executed Command client_status_last_command_code ID of the last executed command 11 1 uint8 - DEC
Status Last Reject command client_status_last_reject_command_code ID of the last rejected command 12 1 uint8 - DEC
System Status shutdown_request Each bit indicates whether or not a mode that requires a reboot is in effect. Unassigned bits are reserved. Bit positions are expressed in network byte order. 13 1 uint8 - STATUS
Request to execute forced power-off 0:not requested, 1: requested 13 0 - 0 STATUS
Request to execute command sequence [GetLog -> Get log via FTP -> Request Shut down] 0:not requested, 1: requested 13 0 - 1 STATUS
Reserved 13 0 - 2:7 STATUS
dmesg lines dmesg_length Number of dmesg lines 14 2 uint16 - DEC[lines]
dmesg error dmesg_errornum Number of dmesg error lines 16 2 uint16 - DEC[lines]
syslog error syslog_errnum Number of error lines in syslog 18 2 uint16 - DEC[lines]
System Uptime system_uptime Total number of seconds the system has been running since it was turned on (up to 65535, or 10 hours) 20 4 uint32 - DEC[sec]
Service Status sensor-server.status Missing values are assumed to be 0. The value of each bit indicates the status of the following services. Unassigned bits are reserved. Bit positions are expressed in network byte order. 24 4 uint32 - STATUS
Reserved 24 0 - 0:16 STATUS
status_network.status 0:stop, 1: run 24 0 - 17 STATUS
status_ifsw-server.status 0:stop, 1: run 24 0 - 18 STATUS
status_ifsw-cliant.status 0:stop, 1: run 24 0 - 19 STATUS
status_ifsw-monitoring.status 0:stop, 1: run 24 0 - 20 STATUS
status_alertmanager.status 0:stop, 1: run 24 0 - 21 STATUS
status_webhook_receiver.status 0:stop, 1: run 24 0 - 22 STATUS
status_prometheus.status 0:stop, 1: run 24 0 - 23 STATUS
status_telegraf.status 0:stop, 1: run 24 0 - 24 STATUS
status_mosquitto.status 0:stop, 1: run 24 0 - 25 STATUS
status_docker.status 0:stop, 1: run 24 0 - 26 STATUS
status_health_check.status 0:stop, 1: run 24 0 - 27 STATUS
status_logrotate.status 0:stop, 1: run 24 0 - 28 STATUS
status_vsftpd.status 0:stop, 1: run 24 0 - 29 STATUS
status_ssh.status 0:stop, 1: run 24 0 - 30 STATUS
status_systemd-timesyncd.status 0:stop, 1: run 24 0 - 31 STATUS
failed status status_failed_num Number of services with status fail 28 1 uint8 - DEC[count]
CPU/GPU Status CPU load ratio cpu_usage_user CPU load ratio 29 1 uint8 - DEC[%]
mean CPU frequency linux_cpu_scaling_cur_freq Average CPU clock frequency (MHz) 30 2 uint16 - DEC[MHz]
Num of Process processes_running Number of processes running 32 2 uint16 - DEC[count]
RAM Free mem_available Available memory size 34 2 uint16 - DEC[MB]
GPU load ratio amd_rocm_smi_utilization_gpu GPU load ratio 36 1 uint8 - DEC[%]
GPU Memory clock amd_rocm_smi_clocks_current_memory 37 2 float16 - DEC[GHz]
GPU Shader clock amd_rocm_smi_clocks_current_sm 39 2 float16 - DEC[GHz]
GPU VRAM usage amd_rocm_smi_memory_used/amd_rocm_smi_memory_total 41 1 uint8 - DEC[%]
GPU GTT (Graphics Translation Table) usage amd_rocm_gttmem_used/amd_rocm_gttmem_total 42 1 uint8 - DEC[%]
Communication throughput net_packets_throughput LAN communication throughput 43 2 float16 - DEC[Mbps]
tcp byte received (4000) net_bytes_recv TCP port received amount (4000), roll over 45 2 uint16 - DEC[MB]
tcp byte sent (4003) net_bytes_sent TCP port sent amount (4003), Roll over 47 2 uint16 - DEC[MB]
err_in net_err_in Cumulative number of received packets with errors 49 1 uint8 - DEC[count]
err_out net_err_out Cumulative number of transmitted packets with errors 50 1 uint8 - DEC[count]
drop_in net_drop_in Cumulative number of received packets that were dropped 51 1 uint8 - DEC[count]
drop_out net_drop_out Cumulative number of transmitted packets that were dropped 52 1 uint8 - DEC[count]
NVMe0 SSD Temperature smart_temperature Disk temperature 53 2 float16 - DEC[mC]
Power on hours smart_power_on_hours Total time the disk has been powered on 55 4 uint32 - DEC[hour]
Powercycle count smart_power_cycle_count Number of power cycles the disk has been through 59 4 uint32 - DEC[count]
Error information log entries smart_error_information_log_entries Number of entries recorded in the error log 63 1 uint8 - DEC[count]
Available spare smart_available_spare Percentage of available spare space 64 1 uint8 - DEC[%]
Smart critical warning smart_critical_warning Status of critical warnings 0: No warnings, 1: Warnings have occurred. The value of each bit indicates the status of the following warning types. Unassigned bits are reserved. The bit positions are expressed in little endian format in accordance with the external specifications. 65 1 uint8 - STATUS
Reserved 65 0 - 7:6 STATUS
Invalid Persistent Memory Persistent Memory Region is in Read-Only state or reliability is questionable. 65 0 - 5 STATUS
Volatile Backup There is a problem with the volatile memory backup mechanism. 65 0 - 4 STATUS
Media Read-only The media is in Read-Only mode. 65 0 - 3 STATUS
Reliability Status The reliability of the NVM subsystem has degraded. 65 0 - 2 STATUS
Temperature Threshold The temperature has exceeded the upper threshold or dropped below the lower threshold. 65 0 - 1 STATUS
Available Spare The amount of spare space has dropped below the lower threshold. 65 0 - 0 STATUS
Percentage Used smart_percentage_used The percentage of life used. 66 1 uint8 0 DEC[%]
Unsafe Shutdowns smart_unsafe_shutdowns Number of unsafe shutdowns 67 2 uint16 - DEC[count]
Media And Data Integrity Errors smart_media_and_data_integrity_errors Number of media and data integrity errors 69 1 uint8 - DEC[count]
Percentage Rate storage_raw_read_error_rate Error rate when reading data 70 1 uint8 - DEC[%]
Reallocated_Sector Ct storage_reallocated_sector_ct Number of reallocated bad sectors 71 1 uint8 - DEC[count]
Hardware_Ecc Recovered storage_hardware_ecc_recovered Number of errors recovered by ECC 72 1 uint8 - DEC[count]
Reallocated_Event Count storage_reallocated_event_count Number of reallocated events 73 1 uint8 - DEC[count]
Offline Uncorrectable storage_offline_uncorrectable Number of uncorrectable errors in offline scan 74 1 uint8 - DEC[count]
Udma_Crc_Error Count storage_udma_crc_error_count Number of CRC errors during UDMA transfer 75 1 uint8 - DEC[count]
NVMe1 SSD Temperature smart_temperature Disk temperature 76 2 float16 - DEC[mC]
Power on hours smart_power_on_hours Total time the disk was powered on 78 4 uint32 - DEC[hour]
Powercycle count smart_power_cycle_count Number of times the disk has been power-cycled 82 4 uint32 - DEC[count]
Error information log entries smart_error_information_log_entries Number of entries recorded in the error log 86 1 uint8 - DEC[count]
Available spare smart_available_spare Percentage of available spare space 87 1 uint8 - DEC[%]
Smart critical warning smart_critical_warning Status of critical warnings 0: No warnings, 1: Warnings have occurred. The value of each bit indicates the status of the following warning types. Unassigned bits are reserved. The bit positions are expressed in little endian format in accordance with the external specifications. 88 1 uint8 - STATUS
Reserved 88 0 - 7:6 STATUS
Invalid Persistent Memory Persistent Memory Region is in Read-Only state or reliability is questionable 88 0 - 5 STATUS
Volatile Backup There is a problem with the volatile memory backup mechanism 88 0 - 4 STATUS
Media Read-only The media is in Read-Only mode. 88 0 - 3 STATUS
Reliability Status The reliability of the NVM subsystem has degraded. 88 0 - 2 STATUS
Temperature Threshold The temperature has exceeded the upper threshold or dropped below the lower threshold. 88 0 - 1 STATUS
Available Spare The amount of spare space has dropped below the lower threshold. 88 0 - 0 STATUS
Percentage Used smart_percentage_used Percentage of life used. 89 1 uint8 0 DEC[%]
Unsafe Shutdowns smart_unsafe_shutdowns Number of unsafe shutdowns 90 2 uint16 - DEC[count]
Media And Data Integrity Errors smart_media_and_data_integrity_errors Number of media and data integrity errors 92 1 uint8 - DEC[count]
Percentage Rate storage_raw_read_error_rate Error rate when reading data 93 1 uint8 - DEC[%]
Reallocated_Sector Ct storage_reallocated_sector_ct Number of reallocated bad sectors 94 1 uint8 - DEC[count]
Hardware_Ecc Recovered storage_hardware_ecc_recovered Number of errors recovered by ECC 95 1 uint8 - DEC[count]
Reallocated_Event Count storage_reallocated_event_count Number of reallocated events 96 1 uint8 - DEC[count]
Offline Uncorrectable storage_offline_uncorrectable Number of uncorrectable errors in offline scan 97 1 uint8 - DEC[count]
Udma_Crc_Error Count storage_udma_crc_error_count Number of CRC errors during UDMA transfer 98 1 uint8 - DEC[count]
SSD SSD Temperature smart_temperature Disk temperature 99 2 float16 - DEC[mC]
Power on hours smart_power_on_hours Total time the disk has been powered on 101 4 uint32 - DEC[hour]
Powercycle count smart_power_cycle_count Number of times the disk has been power cycled 105 4 uint32 - DEC[count]
Error information log entries smart_error_information_log_entries Number of entries recorded in the error log 109 1 uint8 - DEC[count]
Available spare smart_available_spare Percentage of available spare space 110 1 uint8 - DEC[%]
Smart critical warning smart_critical_warning Status of critical warnings 0: No warnings, 1: Warnings have occurred. The value of each bit indicates the status of the following warning types. Unassigned bits are reserved. The bit positions are expressed in little endian format in accordance with the external specifications. 111 1 uint8 - STATUS
Reserved 111 0 - 7:6 STATUS
Invalid Persistent Memory Persistent Memory Region is in Read-Only state or reliability is questionable 111 0 - 5 STATUS
Volatile Backup There is a problem with the volatile memory backup mechanism 111 0 - 4 STATUS
Media Read-only The media is in Read-Only mode. 111 0 - 3 STATUS
Reliability Status The reliability of the NVM subsystem has degraded. 111 0 - 2 STATUS
Temperature Threshold The temperature has exceeded the upper threshold or dropped below the lower threshold. 111 0 - 1 STATUS
Available Spare The amount of spare space has dropped below the lower threshold. 111 0 - 0 STATUS
Percentage Used smart_percentage_used Percentage of life used. 112 1 uint8 0 DEC[%]
Unsafe Shutdowns smart_unsafe_shutdowns Number of unsafe shutdowns 113 2 uint16 - DEC[count]
Media And Data Integrity Errors smart_media_and_data_integrity_errors Number of media and data integrity errors 115 1 uint8 - DEC[count]
Percentage Rate storage_raw_read_error_rate Error rate when reading data 116 1 uint8 - DEC[%]
Reallocated_Sector Ct storage_reallocated_sector_ct Number of reallocated bad sectors 117 1 uint8 - DEC[count]
Hardware_Ecc Recovered storage_hardware_ecc_recovered Number of errors recovered by ECC 118 1 uint8 - DEC[count]
Reallocated_Event Count storage_reallocated_event_count Number of reallocated events 119 1 uint8 - DEC[count]
Offline Uncorrectable storage_offline_uncorrectable Number of uncorrectable errors in offline scan 120 1 uint8 - DEC[count]
Udma_Crc_Error Count storage_udma_crc_error_count Number of CRC errors during UDMA transfer 121 1 uint8 - DEC[count]
Storage Disk Used Percentage(DATA) disk_used_percent Disk usage (data) 122 1 uint8 - Usage rate [%]
Disk Used Percentage(System) disk_used_percent Disk usage percentage (system) 123 1 uint8 - Usage rate [%]
Disk Inodes Used Percentage(DATA) disk_inodes_used_percent i-nodes usage percentage (data) 124 1 uint8 - Usage rate [%]
Disk Inodes Used Percentage(System) disk_inodes_used_percent i-nodes usage rate (system) 125 1 uint8 - usage rate [%]
Boot Device storage_boot_device device being booted up 126 1 uint8 - STATUS
Boot NVMe No storage_nvme_number NVMe number that is booting up 127 1 uint8 - STATUS
container Running Container num docker_n_containers_running Number of docker processes in docker ps 128 1 uint8 - DEC[count]
Exited Container num docker_n_containers_exited Number of docker processes in docker ps -a --filter "status=exited" 129 1 uint8 - DEC[count]
cAdvisor max_storage_used max_storage_used Total file system usage by running container services 130 1 uint8 - DEC[%]
memory_failed_counter memory_failed_counter Number of memory allocation failures + number of OOM (Out Of Memory) events 131 1 uint8 - DEC[count]
Temperature monitor gpu_temp gpu_temp GPU temperature 132 2 float16 - DEC[C]
cpu_temp sensors_temp_input CPU temperature 134 2 float16 - DEC[mC]
FDIR Event Occurrence fdir_event_flag 0: No event occurred, 1: Event occurred The value of each bit indicates the status of the following alert event types. The bit positions are expressed in network byte order. 136 2 uint16 - STATUS
event_loss_packet TCP/IP packet loss occurrence flag 136 0 - 0 STATUS
event_storage_overload Storage overuse occurrence flag 136 0 - 1 STATUS
event_high_gpu_temp GPU temperature warning flag (application shutdown) 136 0 - 2 STATUS
event_too_high_gpu_temp GPU temperature warning flag (immediate power supply shutdown) 136 0 - 3 STATUS
event_high_cpu_temp CPU temperature warning flag (clock down) 136 0 - 4 STATUS
event_too_high_cpu_temp CPU abnormally high temperature flag (immediate power supply stop) 136 0 - 5 STATUS
event_memory_overload Memory overuse occurrence flag 136 0 - 6 STATUS
event_application_timeout Application execution timeout detection flag 136 0 - 7 STATUS
event_too_high_storage_temp Storage temperature too high flag (power supply stopped after logging) 136 0 - 8 STATUS
Reserved 136 0 - 9:15 STATUS
APP App ID in running[0] Application ID in running[0] 138 1 uint8 - HEX
OBS ID in running[0] Observation ID in running[0] 139 2 uint16 - HEX
Script No in running[0] Running script number in running[0] 141 1 uint8 - HEX
App ID in running[1] Application ID in running[1] 142 1 uint8 - HEX
OBS ID in running[1] Observation ID in running[1] 143 2 uint16 - HEX
Script No in running[1] Running script number in running[1] 145 1 uint8 - HEX
App ID in running[2] Application ID in running[2] 146 1 uint8 - HEX
OBS ID in running[2] Observation ID in running[2] 147 2 uint16 - HEX
Script No in running[2] Running script number in running[2] 149 1 uint8 - HEX
EDAC Corrected Errors Number of corrected memory errors 150 1 uint8 - DEC[count]
Uncorrected Errors Number of uncorrectable errors in memory 151 1 uint8 - DEC[count]

*1: The counter value returns to 0 and starts counting again when it exceeds 255.
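
To show how the byte positions in the table and the wrap-around in note *1 can be handled on the receiving side, the sketch below decodes the first 13 bytes of a telemetry packet (the Time and Command Response fields) and computes a wrap-aware counter difference. The field names are chosen for this example, and big-endian byte order for multi-byte values is an assumption. The table specifies network byte order only for bit positions.

```python
import struct

# Bytes 0-3: uint32 UNIX time; bytes 4-12: nine uint8 fields.
# Assumption: multi-byte values are big-endian (network byte order).
HEADER_FORMAT = ">I9B"
HEADER_FIELDS = (
    "unix_time",
    "command_counter",
    "command_reject_counter",
    "last_executed_command",
    "last_reject_command",
    "last_reject_code",
    "status_command_counter",
    "status_command_reject_counter",
    "status_last_executed_command",
    "status_last_reject_command",
)


def decode_header(packet):
    """Decode the Time and Command Response fields from a telemetry packet."""
    values = struct.unpack_from(HEADER_FORMAT, packet, 0)
    return dict(zip(HEADER_FIELDS, values))


def counter_delta(previous, current):
    """Events counted between two samples of a uint8 counter that wraps at 255."""
    return (current - previous) % 256


sample = struct.pack(HEADER_FORMAT, 1700000000, 254, 1, 0x25, 0x70, 3, 7, 0, 0xA5, 0)
print(decode_header(sample)["command_counter"], counter_delta(254, 2))
```

The difference is only meaningful if fewer than 256 events occurred between the two samples.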

3.2. Adding Telemetry

Change the following three locations.

/opt/open-set/src/IFSW/config/config.ini
/opt/open-set/src/IFSW/command_handler/command_00_get_telemetry/get_telemetry.py
/opt/open-set/tests/tools/ifsw_verification/commands/BUSOBC_consts.py

In addition, if you want to add a method for collecting telemetry, you will need to make changes to the following two locations.

/opt/open-set/src/FDIRModule/telegraf_tools/telegraf.conf
/opt/open-set/src/FDIRModule/telegraf_tools/execd_plugin_files

The former is the telegraf configuration file. The latter is the management directory for the script files that are executed by the telegraf inputs.execd plugin.
For example, if your environment is equipped with an AMD GPU and ROCm is installed, you can enable inputs.amd_rocm_smi and the get_status_rocm plugin under inputs.execd in telegraf.conf to collect GPU-related telemetry.

For information on how to collect telemetry using telegraf, please refer to telegraf's github (https://github.com/influxdata/telegraf). Here, we will explain the changes required in IFSW when adding telemetry.

3.2.1. Changes to config.ini

This file contains the configuration values used within IFSW, and also includes the names of the metrics collected by telegraf. Open /opt/open-set/src/IFSW/config/config.ini and add the names of the metrics corresponding to the telemetry you want to add to the following.

TELEGRAF_FIELDS = client_status_command_counter, client_status_reject_counter, ...(omitted), ue_count
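
The value of TELEGRAF_FIELDS is a comma-separated list. The sketch below, with hypothetical helper names, shows one way to append a metric name without creating a duplicate. It works on the value string only and assumes that IFSW splits the value on commas. The metric names in the example are placeholders.

```python
def parse_fields(raw_value):
    """Split a comma-separated TELEGRAF_FIELDS value into metric names."""
    return [name.strip() for name in raw_value.split(",") if name.strip()]


def append_field(raw_value, new_metric):
    """Return the value with new_metric appended, unless it is already listed."""
    fields = parse_fields(raw_value)
    if new_metric not in fields:
        fields.append(new_metric)
    return ", ".join(fields)


print(append_field("metric_a, metric_b", "metric_c"))
```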

3.2.2. Changes to get_telemetry.py

This file links the values of the metrics collected by telegraf to telemetry and outputs them. First, open /opt/open-set/src/IFSW/command_handler/command_00_get_telemetry/get_telemetry.py and add the telemetry names and their initial values (invalid values) to telemetry_initial_values, shown below.

def initialize_status_data():
    global_db = SqliteDict(DB_IFSW_GLOBAL_VARS)

    telemetry_initial_values = {
        # ======== Time ==========#
        "unix_time": np.uint32(time.time()),

         : (omitted) 
        "ue_count": np.uint8(0),
    }

If the metric collected by telegraf is used as the telemetry value without conversion, the changes are complete. Otherwise, create a class that links the metrics to the telemetry. For example, the telemetry "gpu_gtt_usage", which indicates GPU GTT usage, is derived from the metrics "gpu_gtt_used" and "gpu_gtt_total" by the following class.

class GPUGTTUsage(BaseTelemetryMsgHandler):
    def update_status(self):
        status = "gpu_gtt_usage"
        try:
            var_type = type(self.status_dict[status])
            value_used = self.metrics_dict["gpu_gtt_used"][0]["value"]
            value_total = self.metrics_dict["gpu_gtt_total"][0]["value"]
            # Usage as a percentage, cast to the type of the telemetry value.
            self.status_dict[status] = var_type(
                100*float(value_used)/float(value_total))
        except Exception as e:
            logging.warning(
                "Exception occurred when updating gpu_gtt_usage.\n%s", e)
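
The calculation inside this class can be tried in isolation. The sketch below uses a hand-written metrics dictionary in the same shape that the class reads (a list whose first element has a "value" key) and a plain int in place of the telemetry type. The function name and sample values are made up for illustration.

```python
def usage_percent(metrics_dict, used_key, total_key):
    """Compute used/total as a whole-number percentage."""
    used = float(metrics_dict[used_key][0]["value"])
    total = float(metrics_dict[total_key][0]["value"])
    # int() drops the fractional part of the percentage.
    return int(100 * used / total)


metrics = {
    "gpu_gtt_used": [{"value": "512"}],
    "gpu_gtt_total": [{"value": "2048"}],
}
print(usage_percent(metrics, "gpu_gtt_used", "gpu_gtt_total"))
```

If the total is reported as 0, the division raises ZeroDivisionError. In the class above, the broad except clause logs that case and leaves the previous telemetry value unchanged.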

Similarly, create a class (NewTelemetry) that links the telemetry (new_telemetry) and metrics (metrics) to be added according to the following format.

class NewTelemetry(BaseTelemetryMsgHandler):
    def update_status(self):
        status = "new_telemetry"
        try:
            var_type = type(self.status_dict[status])
            self.status_dict[status] = var_type(
                # Here, write the function for the metric that indicates the telemetry value.
                # The metric value can be obtained with float(self.metrics_dict["metrics"][0]["value"])

                )
        except Exception as e:
            logging.warning("Exception occurred when \
                            updating new_telemetry. \n%s", e)

Please change the names of "new_telemetry", "metrics" and NewTelemetry to match the telemetry you are adding.
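
The template above can be filled in as in the following sketch. Everything specific here is an assumption for illustration: the telemetry name swap_usage, the metric names swap_used and swap_total, and the stand-in BaseTelemetryMsgHandler, which only mimics the two attributes (status_dict, metrics_dict) used by the classes in this section so that the example runs on its own.

```python
import logging


class BaseTelemetryMsgHandler:
    """Stand-in for the IFSW base class (assumed shape, not the real code)."""

    def __init__(self, status_dict, metrics_dict):
        self.status_dict = status_dict
        self.metrics_dict = metrics_dict


class SwapUsage(BaseTelemetryMsgHandler):
    """Hypothetical telemetry: swap usage in percent from two metrics."""

    def update_status(self):
        status = "swap_usage"
        try:
            # Keep the type of the initial value so the telemetry type is stable.
            var_type = type(self.status_dict[status])
            used = float(self.metrics_dict["swap_used"][0]["value"])
            total = float(self.metrics_dict["swap_total"][0]["value"])
            self.status_dict[status] = var_type(100 * used / total)
        except Exception as e:
            # On any failure the initial (invalid) value is left untouched.
            logging.warning("Exception occurred when updating %s.\n%s", status, e)


status = {"swap_usage": 0}
metrics = {"swap_used": [{"value": "256"}], "swap_total": [{"value": "1024"}]}
SwapUsage(status, metrics).update_status()
print(status["swap_usage"])
```

Because the exception handler only logs, a missing metric or a zero total leaves the initial value in place rather than stopping telemetry generation.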

Once you have created the class, the next step is to add an instance of it to the SpecificTelemetryMsgHandler list in the compose_telemetry_msg method of TelemetryMsgAggregator.

    def compose_telemetry_msg(self):
        BaseTelemetry = BaseTelemetryMsgHandler(
            self.status_data, self.metrics_data)
        for telemetry in self.status_data.keys():
            BaseTelemetry.update_status(telemetry)

        SpecificTelemetryMsgHandler = [
            ShutdownRequest(self.status_data, self.metrics_data),
            RamFree(self.status_data, self.metrics_data),
            :(omitted)
            StorageNVMeNumber(self.status_data, self.metrics_data),
        ]       

Once you have done this, restart the services related to IFSW to apply the changes.

sudo systemctl restart ifsw-*

3.2.3. Changing BUSOBC_consts

Open /opt/open-set/tests/tools/ifsw_verification/commands/BUSOBC_consts.py and edit the following items.

TELEMETRY_FIELDS = [
    # ======== Time ==========#
    "unix_time",
    :(omitted)
    "ue_count",
]

Add the name of the new telemetry to this list.
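
Since the telemetry name has to be added both to telemetry_initial_values in GetTelemetry.py and to TELEMETRY_FIELDS in BUSOBC_consts.py, a quick consistency check can catch a forgotten entry. The following is a sketch only: the function find_mismatches and the sample data (including swap_usage) are hypothetical, and in practice the two inputs would be taken from the actual files.

```python
def find_mismatches(initial_values: dict, telemetry_fields: list) -> dict:
    """Report telemetry names that appear on one side only."""
    ifsw_names = set(initial_values)
    tool_names = set(telemetry_fields)
    return {
        "missing_in_busobc_consts": sorted(ifsw_names - tool_names),
        "missing_in_get_telemetry": sorted(tool_names - ifsw_names),
    }


# Hypothetical sample data: "swap_usage" was added on the IFSW side only.
telemetry_initial_values = {"unix_time": 0, "ue_count": 0, "swap_usage": 0}
TELEMETRY_FIELDS = ["unix_time", "ue_count"]

result = find_mismatches(telemetry_initial_values, TELEMETRY_FIELDS)
print(result)
```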

3.3. Deleting Telemetry

To delete telemetry, remove the relevant parts from the files updated in section 3.2 (Adding Telemetry). Then restart the IFSW-related services to apply the configuration changes.

If you have made changes to the telegraf configuration file or the inputs.execd plugin script file in order to collect telemetry, please delete those changes as well.

4. FDIR Specifications

This chapter summarizes the specifications of FDIR, which detects MISSION-OBC abnormalities and performs isolation and recovery processing.

4.1. FDIR List

Prometheus is used as the alert management tool, and it works in conjunction with telegraf to monitor abnormal values in metrics. The following is a list of FDIRs that perform detection, isolation, and recovery processing in Open-SET.

| No | FDIR Name | FD Contents | IR_ID*2 | request_shutdown value*3 |
|---|---|---|---|---|
| 1 | Storage High Temperature | Detects abnormal temperatures in storage devices. | 2 | 0x40 |
| 2 | Application Execution Time | Detects MISSION-OBC startup time overruns. | 4 | 0x00 |
| 3 | Memory Overuse | Detects failures to secure memory requested by container applications. | 6 | 0x00 |
| 4 | CPU high temperature (>=90°C) | Detects abnormal temperatures in the CPU of the MISSION-OBC. | 1 | 0x80 |
| 5 | CPU high temperature (>=70°C) | Detects abnormal temperatures in the CPU of the MISSION-OBC. | 3 | 0x00 |
| 6 | GPU high temperature (>=90°C) | Detects abnormal temperatures in the GPU installed in the MISSION-OBC. | 1 | 0x80 |
| 7 | GPU high temperature (>=70°C) | Detects abnormal temperatures in the GPU installed in the MISSION-OBC. | 4 | 0x00 |
| 8 | Storage Overuse | Detects excessive use of the file system used by the container application. | 4 | 0x00 |
| 9 | Packet Loss | Detects packet loss that occurs in the NIC inside the MISSION-OBC. | 6 | 0x00 |

*2: IR_ID is an ID that is linked to IR processing.
*3: request_shutdown is a flag value that notifies the BUS-OBC of the status inside the MISSION-OBC. The BUS-OBC acts according to the value of request_shutdown. The MISSION-OBC sets the following values.
        0x00: No processing
        0x40: Shut down after retrieving log information
        0x80: Shut down immediately

Next, we will show the IR processing for IR_ID.

| IR_ID | IR content |
|---|---|
| 1 | Perform shutdown and stop power supply from BUS-OBC to MISSION-OBC. |
| 2 | Retrieve log information and perform shutdown. |
| 3 | Perform clock down. |
| 4 | If a container application is running, stop the application. |
| 6 | Outputs a message to the log file indicating that packet loss has occurred. |
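
The relationship between the FDIR list, IR_ID, and request_shutdown can be written down as a small lookup, as sketched below. The values are transcribed from the two lists in this section; the names FDIR_TABLE, SHUTDOWN_ACTIONS, and describe_fdir are hypothetical and do not correspond to Open-SET code.

```python
# FDIR number -> (IR_ID, request_shutdown), transcribed from the FDIR list.
FDIR_TABLE = {
    1: (2, 0x40),  # Storage High Temperature
    2: (4, 0x00),  # Application Execution Time
    3: (6, 0x00),  # Memory Overuse
    4: (1, 0x80),  # CPU high temperature (>=90°C)
    5: (3, 0x00),  # CPU high temperature (>=70°C)
    6: (1, 0x80),  # GPU high temperature (>=90°C)
    7: (4, 0x00),  # GPU high temperature (>=70°C)
    8: (4, 0x00),  # Storage Overuse
    9: (6, 0x00),  # Packet Loss
}

# Meaning of each request_shutdown value as listed in note *3.
SHUTDOWN_ACTIONS = {
    0x00: "No processing",
    0x40: "Shut down after retrieving log information",
    0x80: "Shut down immediately",
}


def describe_fdir(fdir_no: int) -> str:
    """Summarize which IR_ID and shutdown request an FDIR entry maps to."""
    ir_id, request_shutdown = FDIR_TABLE[fdir_no]
    action = SHUTDOWN_ACTIONS[request_shutdown]
    return (f"FDIR {fdir_no}: IR_ID={ir_id}, "
            f"request_shutdown=0x{request_shutdown:02X} ({action})")


print(describe_fdir(1))
```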

4.2. Adding monitoring items

Change the following three locations.

/opt/open-set/src/EventHub/WebhookReceiver/config/config.ini
/opt/open-set/src/FDIRModule/prometheus_tools/rule_files/
/opt/open-set/src/FDIRModule/prometheus_tools/prometheus.yml

4.2.1. Modifying config.ini

config.ini sets the alert name and bit position in the telemetry fdir_event_flag. Open /opt/open-set/src/EventHub/WebhookReceiver/config/config.ini and add the alert names and bit positions you want to add to the following.

[alertTBL]
too_high_storage_temp = 7
application_timeout = 8
memory_overload = 9
too_high_cpu_temp = 10
high_cpu_temp = 11
too_high_gpu_temp = 12
high_gpu_temp = 13
storage_overload = 14
loss_packet = 15
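
To illustrate how a bit position relates to the value of fdir_event_flag, the sketch below sets and reads back bits using the positions from [alertTBL]. It assumes that bit position n corresponds to the value 1 << n (counted from the least significant bit); this numbering, and the helper names set_alert_bits and active_alerts, are assumptions and should be checked against the WebhookReceiver implementation.

```python
# Alert name -> bit position, copied from the [alertTBL] section above.
ALERT_TBL = {
    "too_high_storage_temp": 7,
    "application_timeout": 8,
    "memory_overload": 9,
    "too_high_cpu_temp": 10,
    "high_cpu_temp": 11,
    "too_high_gpu_temp": 12,
    "high_gpu_temp": 13,
    "storage_overload": 14,
    "loss_packet": 15,
}


def set_alert_bits(flag: int, alert_names) -> int:
    """Return flag with the bit of every named alert set (assumes LSB = bit 0)."""
    for name in alert_names:
        flag |= 1 << ALERT_TBL[name]
    return flag


def active_alerts(flag: int) -> list:
    """Return the alert names whose bit is set in flag."""
    return [name for name, bit in ALERT_TBL.items() if flag & (1 << bit)]


flag = set_alert_bits(0, ["too_high_storage_temp", "loss_packet"])
print(hex(flag), active_alerts(flag))
```

When adding a new alert, choose a bit position that is not already used by another entry in the table.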

4.2.2. Changing alertfiles

alertfiles manages the files that define alert rules. First, create an alert configuration file for each monitoring item.

Creating an alert definition
vi [file name].yml


How to write an alert file (sample.yml)
groups:
- name: <alert rule name>
  rules:
  - alert: <alert name>
    expr: <alert conditions>
    for: <time until the conditions are judged to be true>
    labels:
      severity: <alert content>
    annotations:
      summary: <alert summary>
      description: <Explanation of alert conditions>
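
As a sketch of how the placeholders might be filled in, the following script renders a rule file from the format above. All concrete values are hypothetical: the alert name swap_overload, the metric swap_used_percent in the expression, the threshold, and the duration are illustrative and are not part of Open-SET.

```python
RULE_TEMPLATE = """\
groups:
- name: {group}
  rules:
  - alert: {alert}
    expr: {expr}
    for: {duration}
    labels:
      severity: {severity}
    annotations:
      summary: "{summary}"
      description: "{description}"
"""


def render_rule(**values) -> str:
    """Fill the alert rule template with the given values."""
    return RULE_TEMPLATE.format(**values)


# Hypothetical values; the metric name and threshold are assumptions.
rule_text = render_rule(
    group="swap_overload_rules",
    alert="swap_overload",
    expr="swap_used_percent > 90",
    duration="1m",
    severity="warning",
    summary="Swap usage is high",
    description="Swap usage has exceeded 90% for 1 minute.",
)
print(rule_text)
```

The alert name used here would also need a matching entry in [alertTBL] as described in 4.2.1.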

Next, place the file you created in /opt/open-set/src/FDIRModule/prometheus_tools/rule_files.

sudo cp sample.yml /opt/open-set/src/FDIRModule/prometheus_tools/rule_files

Copy the contents of rule_files to the location where the prometheus executable is located.

sudo cp /opt/open-set/src/FDIRModule/prometheus_tools/rule_files/* /prometheus-2.47.0.linux-amd64/rule_files

4.2.3. Modifying prometheus.yml

prometheus.yml is the configuration file for prometheus. Open /opt/open-set/src/FDIRModule/prometheus_tools/prometheus.yml and add the path of the alert rule file you created to the following rule_files list.

rule_files:
    - "/prometheus-2.47.0.linux-amd64/rule_files/app_runtime.yml"
    - "/prometheus-2.47.0.linux-amd64/rule_files/high_cpu_temp.yml"
    - "/prometheus-2.47.0.linux-amd64/rule_files/too_high_cpu_temp.yml"
    - "/prometheus-2.47.0.linux-amd64/rule_files/high_gpu_temp.yml"
    - "/prometheus-2.47.0.linux-amd64/rule_files/too_high_gpu_temp.yml"
    - "/prometheus-2.47.0.linux-amd64/rule_files/loss_packet.yml"
    - "/prometheus-2.47.0.linux-amd64/rule_files/storage_overload.yml"
    - "/prometheus-2.47.0.linux-amd64/rule_files/memory_overload.yml"
    - "/prometheus-2.47.0.linux-amd64/rule_files/high_temp_storage.yml"

Next, copy this file to the location where the prometheus executable is located.

sudo cp /opt/open-set/src/FDIRModule/prometheus_tools/prometheus.yml /prometheus-2.47.0.linux-amd64

Once this is complete, restart the following FDIR-related services to apply the settings.

sudo systemctl restart prometheus.service
sudo systemctl restart webhook_receiver.service
sudo systemctl restart alertmanager.service

4.3. Deleting monitoring items

To delete monitoring items, remove the relevant parts from the files updated in section 4.2 (Adding monitoring items). Then restart the FDIR-related services to apply the configuration changes.

4.4. Adding IR Events

IR events are IR processes that are executed in response to IR_ID. You can add IR events by saving script files that define IR processes in the following directory.

/opt/open-set/src/EventHub/WebhookReceiver/IR_events

First, create a script file for the IR event. The file name should be in the format XX_*.py, with a two-digit hexadecimal number XX at the beginning. XX is the IR_ID for that event.

Creating a script file
vi [file name].py


How to write a script file (FF_sample.py)
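import logging

# E_OK and ER_IR_EXE are response-code constants. They are assumed to be
# defined in, or imported from, the WebhookReceiver code.


def ExecuteIR():
    response_code = E_OK
    try:
        # Write the IR event processing here.
        pass
    except Exception as e:
        logging.warning("Error: %s", e)
        response_code = ER_IR_EXE
    return response_code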

Next, please place the file you created in /opt/open-set/src/EventHub/WebhookReceiver/IR_events.

sudo cp FF_sample.py /opt/open-set/src/EventHub/WebhookReceiver/IR_events
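
The file naming rule (XX_*.py, where XX is the IR_ID as a two-digit hexadecimal number) can be checked before deploying a script, as in the sketch below. The function ir_id_from_filename and the regular expression are illustrative assumptions; they are not the lookup logic used by WebhookReceiver itself.

```python
import re

# Two hexadecimal digits, an underscore, any name, and the .py extension.
IR_SCRIPT_PATTERN = re.compile(r"^([0-9A-Fa-f]{2})_.*\.py$")


def ir_id_from_filename(filename: str):
    """Return the IR_ID encoded in the file name, or None if it does not match."""
    match = IR_SCRIPT_PATTERN.match(filename)
    if match is None:
        return None
    return int(match.group(1), 16)


print(ir_id_from_filename("FF_sample.py"))
print(ir_id_from_filename("sample.py"))
```

A result of None indicates that the file name does not follow the XX_*.py format and should be renamed.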

Once you have completed the above, restart the following FDIR-related services to apply the settings.

sudo systemctl restart prometheus.service
sudo systemctl restart webhook_receiver.service
sudo systemctl restart alertmanager.service

4.5. Deleting IR Events

IR events can be deleted by deleting the script files added in section 4.4 (Adding IR Events).

5. Recovery Processing

This chapter explains the device recovery process. The following explains recovery operations in line with the default Open-SET specifications, but please customize as necessary to suit your own environment.

The default specification of Open-SET assumes that there are three storage devices, NVMe0, NVMe1, and SATA0, and of these, SATA0 is used as the storage device for recovery. Recovery is performed using the following three commands.

  • 0x10 PowerOnDevice
  • 0x11 RestoreOS
  • 0x12 RestoreFile

The basic flow of recovery is as follows.

  1. If NVMe0/NVMe1 fails to boot for some reason, boot from SATA0.
  2. Execute the PowerOnDevice command on SATA0 to enable the device to be recovered (do not enable multiple devices at the same time).
  3. Execute the RestoreOS command on SATA0 to deploy the image file in /initialdata on SATA0 to the device to be recovered.
  4. Set the device to be recovered to the boot device and reboot.
  5. Confirm that the device to be recovered has started up.
  6. If necessary, use the RestoreFile command to execute the update script in /initialdata on SATA0 on the current device and apply the differences from the image file.
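
As an illustration of step 2, the sketch below builds the request for the PowerOnDevice command in the JSON format shown in the list of commands (2.1), where device_id 0 is nvme0, 1 is ssd0, and 2 is nvme1. The helper name build_power_on_device and the DEVICE_IDS dictionary are hypothetical; sending the request is left to the BUS-OBC command client and is not shown.

```python
import json

# device_id values as listed for the PowerOnDevice command in 2.1.
DEVICE_IDS = {"nvme0": 0, "ssd0": 1, "nvme1": 2}


def build_power_on_device(device_name: str) -> str:
    """Return the PowerOnDevice (0x10) request as a JSON string."""
    request = {
        "command_id": "0x10",
        "parameter": [{"device_id": DEVICE_IDS[device_name]}],
    }
    return json.dumps(request)


# Enable only one recovery target at a time, as noted in step 2.
print(build_power_on_device("nvme0"))
```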

As a preliminary step, please place the image file and update application script in /initialdata on SATA0. Also, please check the following recovery-related settings in /opt/open-set/src/IFSW/config/config.ini.

[recovery_sequence]
(omitted)
DD_UPTIME_TIMEOUT = 1500 # Timeout for RestoreOS command (seconds since MISSION-OBC booted)
DD_BLOCK_SIZE = 1048576 # Block size when expanding image files with the RestoreOS command
MOUNT_WAIT_TIME = 3 # Time to wait for SATA0 to be enabled and mount to be executed with the RestoreFile command
DIR_MOUNT_RESTORE = /mnt/restore # Directory name to mount the device to be recovered with the RestoreOS command
DIR_INITIAL_DATA = /initialdata # Directory for recovery files
DIR_INSTALLER = / # Base path for unmounting SATA0 in the RestoreFile command
REBOOT_SCRIPT = /opt/open-set/sys/system_reboot # Directory for reboot script files
BOOT_ISO = boot_golden-image.iso # Image file name of the boot partition to be placed in /initialdata on SATA0
ROOT_ISO = root_golden-image.iso # Image file name of the root partition to be placed in /initialdata on SATA0
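
To check these values programmatically, the section can be read with Python's standard configparser, as sketched below. The sample text is a shortened copy of the settings above. Passing inline_comment_prefixes is needed for the trailing "# ..." comments to be stripped from the values. Whether IFSW reads the file in this way is an assumption.

```python
import configparser
import posixpath

# Shortened copy of the [recovery_sequence] settings shown above.
SAMPLE = """\
[recovery_sequence]
DD_UPTIME_TIMEOUT = 1500 # Timeout for RestoreOS command
DD_BLOCK_SIZE = 1048576 # Block size when expanding image files
DIR_INITIAL_DATA = /initialdata # Directory for recovery files
ROOT_ISO = root_golden-image.iso # Image file name of the root partition
"""

# Without inline_comment_prefixes, the "# ..." text would stay in the value.
parser = configparser.ConfigParser(inline_comment_prefixes=("#",))
parser.read_string(SAMPLE)
section = parser["recovery_sequence"]

timeout_s = section.getint("DD_UPTIME_TIMEOUT")
block_size = section.getint("DD_BLOCK_SIZE")
# Full path of the root partition image expected on SATA0.
root_image = posixpath.join(section["DIR_INITIAL_DATA"], section["ROOT_ISO"])

print(timeout_s, block_size, root_image)
```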

In addition, at the time of installation, the three recovery commands are disabled. To enable them, you need to change the contents of the following file.

/opt/open-set/src/IFSW/tools/control_device.py

Replace the following four lines in this file with the appropriate commands for your environment.

cmd = ["echo", "Powering on device:",
       DEVICE_NAME_INFO[device_id]["storage_device"]]
->
Replace with the command to enable the storage device corresponding to device_id.

cmd = ["echo", "Setting boot device to:",
       DEVICE_NAME_INFO[device_id]["storage_device"]]
->
Replace with a command that sets the storage device corresponding to device_id as the device to boot next time.

cmd = ["echo", "Setting boot device to: default"]
->
Replace with a command that sets the device to boot next time to the default.

cmd = ["echo", "Current boot device:"] # Edite here
->
Replace with a command that displays the name of the currently booted storage device.

When you have finished, restart the services related to IFSW to apply the changes.

sudo systemctl restart ifsw-*

6. How to use FTP

This chapter explains data transfer between BUS-OBC and MISSION-OBC using FTP commands.

The procedure for performing PUT/GET using SFTP from the client side is as follows.

sftp [user name]@[IP address/domain name of the connection destination]
Connected to [IP address/domain name of the connection destination].
sftp> get [path of the file to be acquired (@MISSION-OBC)]
sftp> put [path of file to be sent (@satellite bus system)]

Please note that the connection uses the authentication information set for vsftpd by the Installer, so files cannot be sent to directories for which that user does not have write permission. Replace [user name] with your actual user name, and [IP address/domain name of the connection destination] with the IP address or domain name of the server configured in the vsftpd settings of the Installer.