1. Prerequisites
2. Command Specifications
3. Telemetry Specifications
4. FDIR Specifications
5. Recovery Processing
6. How to use FTP

1. Prerequisites

This document explains how to use the Open-SET software. First, install Open-SET with the installer (see README).

The configuration of this software platform is as follows.

IFSW and the IFSW verification tool are used for sending and receiving commands.
We use mosquitto as the internal messaging broker.
We use telegraf as the status information collection tool.
In addition, we use prometheus as the alert management tool, and we use the docker engine, etc., as the execution environment for user applications.

2. Command Specifications

This chapter summarizes the specifications for the commands that are requested from the BUS-OBC command client and executed within the MISSION-OBC.

2.1. List of Commands

The list of commands that can be requested from the BUS-OBC command client is as follows.

ID	Command name	Function	Parameters	Response
0x00	GetTelemetry	Obtain telemetry information	{ "command_id": "0x00", "parameter": {} }	0x80 Telemetry information
0x01	GetLog	Get system log information	{ "command_id": "0x01", "parameter": [ {"time_id": "YYYY-MM-DD"} ] }	0x81 System log information
0x02	Shutdown	Shut down MISSION-OBC	{ "command_id": "0x02", "parameter": {} }	0x82 MISSION-OBC shutdown result
0x10	PowerOnDevice	Start the device to be restored.	{ "command_id": "0x10", "parameter": [ {"device_id": 0} //Device identifier to be restored //0: nvme0 //1:ssd0 //2:nvme1 ] }	0x90 OS Restore Preprocessing Result
0x11	RestoreOS	Restore the OS image of the device to be restored.	{ "command_id": "0x11", "parameter": [ {"device_id": 0}, //Device identifier to be restored //0: nvme0 //1: ssd0 //2: nvme1 {"progress_id": 0}, //Partition of the device to be restored //1: boot partition and root partition //2: root partition //3 : Boot device switching only (do not process if there is no target for restoration, and only the device switching part of the post-processing will run) {"progress_bytes": 0} //0 (fixed value) ] }	0x91 OS Restore Result
0x12	RestoreFile	Restores any file on the device to be restored.	{ "command_id": "0x12", "parameter": [ {"installer_list": ["/opt/open-set/sys/restore_symbolic_link.sh"]} //Script name (array) for performing additional installation. If additional installation is not required, specify only the script that creates a symbolic link in the home directory. If additional installation is required, add the script to the array. ] }	0x92 Post-processing results of OS restoration
0x21	DeployApp	Deploy apps uploaded from the ground.	{ "command_id": "0x21", "parameter": [ {"deploy_method": "0x00"}, // "0x00" is fixed {"image_file": "hello.tar.gz"}, // App file name to be deployed docker load or deb install {"app_uid": "0x01"} // App user ID ] }	0xA1 Container deployment result, deb file installation result
0x22	GetAppInfo	Obtain information (container information) about all deployed apps.	{ "command_id": "0x22", "parameter": [ {"deploy_method": "0x00"} //"0x00" is fixed ] }	0xA2 Information on deployed containers and installed deb files
0x24	GetFile	Get a file (text file)	{ "command_id": "0x24", "parameter": [ {"config_file_name": "/export/home/exp-01/01_docker-compose.yml"} //text file name ] }	0xA4 Parameter setting details
0x25	ExecuteApp	Execute the app	{ "command_id": "0x25", "parameter": [ {"deploy_method": "0x00"}, //"0x00" fixed {"app_uid": "0x01"}, //Application user ID {"obs_id": "0x0001"} //Observation request number ] }	0xA5 Container app and native app execution results
0x27	GetResult	Obtain the result of executing the app.	{ "command_id": "0x27", "parameter": [ {"app_uid": "0x01"}, //App user ID {"obs_id": "0x0001"} //Observation request number ] }	0xA7 Container app and native app execution results
0x60	MoveFile	Move a file	{ "command_id": "0x60", "parameter": [ {"file_path": "hello.tar.gz"}, //File name {"attribute": "0x00"}, //Attribute ID //"0x00": App //"0x01": Application configuration file //"0x10": Execution condition file yaml (yml), sh file //"0x20": Satellite images taken in orbit //"0x21": Other files required by the AI application other than the above {"app_uid": "0x01"}, //Application user ID {"obs_id": "0x0001"} ]}	0xE0 Result of placing a file that has been interfaced with FTP PUT in the application area
0x61	NotifyGetFile	Notifies that the satellite bus system has already acquired the file from MISSION-OBC.	{ "command_id": "0x61", "parameter": [ {"path": "export/home/exp-01/obs-0001/toGround_arch/result_exp-01_obs-0001.tar.gz"}, // Filename (full path) {"delete_file": 1}, // Whether or not to delete the retrieved file // 0: Do not delete // 1: Delete {"app_uid": "0x01"}, // App user ID // 0x00 No user, the user number is entered in sequential order ] }	0xE1 Post-processing result for files that have been interfaced with FTP GET
0x62	GetDirectory	Obtain information on a specific directory in MISSION-OBC	{ "command_id": "0x62", "parameter": [ {"path": "./"} // Directory path (full path) ] }	0xE2 Information on the specified MISSION-OBC directory, archived with tar, compressed with gzip, and encoded with base64
0x63	DeleteFile	Deletes a specific file or container in MISSION-OBC.	{ "command_id": "0x63", "parameter": [ {"target": "/mnt/open-set/trash"}, // Full path of the object to be deleted, or the Docker image to delete in the form image name:tag name // e.g. {"target" : "AppXXX:latest"} {"type": "0x00"}, // normal file 0x00 / directory 0x01 / Docker image 0x02 {"date": "2023-01-01T00:00:00+09:00"} // Specify the time condition for deletion (can be omitted). // Files whose last modified date is before this time are deleted. // Note that the system's standard time is not Japan Standard Time, but 32-bit Unix time (UTC)! // Format: // - If you do not specify a time zone (UTC): "YYYY-MM-ddThh:mm:ss" // - If you specify a time zone: "YYYY-MM-ddThh:mm:ss+hh:mm" // Example: To delete files last modified before 2023 in Japan time, specify "2023-01-01T00:00:00+09:00". ] }	0xE3 Result of deleting a specific file or container within MISSION-OBC
0x70	ExecuteCommand	Execute a specific shell command within MISSION-OBC	{ "command_id": "0x70", "parameter": [ {"shell": "date"} // The shell command to execute // The command must satisfy the following conditions // - It must be specified on a single line // - It must not include commands that require pre-installation // - It must not include commands that require input during execution // - It must not include commands that will destroy the system ] }	0xF0 Result of the shell command executed within MISSION-OBC, archived with tar, compressed with gzip, and encoded with base64
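
The parameter column above can also be assembled programmatically. The following is a minimal sketch, assuming the request is serialized as a JSON string. The helper name `build_command` is hypothetical and not part of Open-SET, and how the string is handed to IFSW is not covered here. It builds a DeleteFile request whose "date" value carries the +09:00 offset described in the table.

```python
import json
from datetime import datetime, timedelta, timezone


def build_command(command_id, parameter):
    """Return a command request as a JSON string (hypothetical helper).

    command_id is an int such as 0x63 and is rendered as "0x63" to match
    the table above. parameter is the list (or dict) of parameter entries.
    """
    return json.dumps(
        {"command_id": f"0x{command_id:02X}", "parameter": parameter}
    )


# DeleteFile (0x63): time condition 2023-01-01 00:00:00 in Japan time.
JST = timezone(timedelta(hours=9))
cutoff = datetime(2023, 1, 1, tzinfo=JST).isoformat()

delete_request = build_command(
    0x63,
    [
        {"target": "/mnt/open-set/trash"},
        {"type": "0x00"},
        {"date": cutoff},
    ],
)
print(delete_request)
```

Using a timezone-aware `datetime` keeps the offset explicit, which avoids confusing Japan time with the UTC-based system time.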

2.2. How to send and receive command request data

The method for sending and receiving command request data is as follows.

Check that IFSW is operating normally. If it is not operating normally, restart the service.

Checking method: sudo systemctl status ifsw*
Restart method: sudo systemctl restart ifsw*

Start the status command server.

cd /opt/open-set/tests/tools/ifsw_verification/commands
python3 status_command_server.py

Open another terminal window and start sending commands.

cd /opt/open-set/tests/tools/ifsw_verification/commands
python3 command_client_manual.py

If successful, you will be prompted to select a command ID as follows. For example, if you want to execute the "GetTelemetry" command to obtain telemetry, enter 0x00.

==========================================
please select a command ID from the list:
ID | NAME
0x00 | GetTelemetry
0x01 | GetLog
0x02 | Shutdown
...

After you send a command, its contents are displayed, followed by a response message from the IFSW server. If the command was received successfully, the server returns "SERVER > Command accepted."

2.3. Adding a command

Change the following three places.

/opt/open-set/src/IFSW/command_handler
/opt/open-set/tests/tools/ifsw_verification/commands/status_command_handler.py
/opt/open-set/tests/tools/ifsw_verification/commands/BUSOBC_consts.py

2.3.1. Modifying command_handler

First, add a folder with the following structure to /opt/open-set/src/IFSW/command_handler.

 +-- command_XX_your_command
   +-- your_command.py
   +-- template.json

Here, replace "your_command" with the name of the command you want to add. "XX" is a two-digit hexadecimal ID; choose one that does not conflict with the IDs of existing commands. Command IDs are assigned according to the following criteria.

Control command IDs use the range 0x00-0x7F:
- 0x00-0x0F: System-related (e.g. telemetry, logging, power)
- 0x10-0x1F: Error Clear-related (e.g. recovery)
- 0x20-0x5F: Mission-related (e.g. application loading/execution)
- 0x60-0x6F: File operation-related
- 0x70-0x7F: Development-related

The range 0x80-0xFF is reserved for status command IDs.
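
When picking an ID for a new command, the ranges above can be checked mechanically. The following is a minimal sketch. The names `classify_command_id`, `is_free`, and `USED_IDS` are hypothetical helpers for illustration and do not exist in Open-SET. The used IDs are copied from the command list in 2.1.

```python
# Category boundaries copied from the list above.
COMMAND_ID_RANGES = [
    (0x00, 0x0F, "System-related"),
    (0x10, 0x1F, "Error Clear-related"),
    (0x20, 0x5F, "Mission-related"),
    (0x60, 0x6F, "File operation-related"),
    (0x70, 0x7F, "Development-related"),
]

# IDs already taken by the commands listed in 2.1.
USED_IDS = {
    0x00, 0x01, 0x02, 0x10, 0x11, 0x12, 0x21, 0x22,
    0x24, 0x25, 0x27, 0x60, 0x61, 0x62, 0x63, 0x70,
}


def classify_command_id(command_id):
    """Return the category name for a one-byte command ID."""
    if not 0x00 <= command_id <= 0xFF:
        raise ValueError("command ID must fit in one byte")
    for low, high, name in COMMAND_ID_RANGES:
        if low <= command_id <= high:
            return name
    return "Reserved for status commands"


def is_free(command_id, used_ids):
    """True if the ID is a control command ID that is not yet assigned."""
    return 0x00 <= command_id <= 0x7F and command_id not in used_ids


print(classify_command_id(0x28), is_free(0x28, USED_IDS))
```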

In template.json, please describe the default parameter values for the command in JSON format.

Command script files are written in the following three basic components. When creating your_command.py, please edit the commented parts in accordance with the following format.

class InvokeCommand(BaseCommand):
    def check_parameter(self, *args):
        super().check_parameter()
        result = check_parameter(*args)
        return result

    def pre_process(self, params): # If there are any parameters included in the return content of the command execution, include them here.
        super().pre_process()
        self.parameter = [{"response_code": ER_COMMAND_EXE}, {"data": params}] # Write the return content in the event of a command error here.

    def main_process(self, *args):
        super().main_process()
        parameter = invoke_command(*args)
        return parameter

def check_parameter(parameter): # Include the command parameters here.
    # Write the code to check the command parameters here.
    return True

def invoke_command(params): # Include any parameters used in the command-specific processing and return content here.
    # Write the command-specific processing here.
       :
    parameter = [{"response_code": E_OK}] # Describe the return content when the command ends normally here.

    return parameter

Once you have made the above changes, restart the services related to IFSW in order to reflect the changes in MISSION-OBC.

sudo systemctl restart ifsw-*
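
As an illustration of the module-level check_parameter function in the template, the following sketch validates a parameter list of the form used by GetLog, [ {"time_id": "YYYY-MM-DD"} ]. It is an assumption about what a reasonable check could look like, not the check that the existing handlers perform.

```python
from datetime import datetime


def check_parameter(parameter):
    """Return True only if parameter looks like [{"time_id": "YYYY-MM-DD"}]."""
    # The parameter must be a list with exactly one entry.
    if not isinstance(parameter, list) or len(parameter) != 1:
        return False
    entry = parameter[0]
    # The entry must be a dict whose only key is "time_id".
    if not isinstance(entry, dict) or set(entry) != {"time_id"}:
        return False
    # The value must parse as a calendar date.
    try:
        datetime.strptime(entry["time_id"], "%Y-%m-%d")
    except (TypeError, ValueError):
        return False
    return True


print(check_parameter([{"time_id": "2023-01-01"}]))
print(check_parameter([{"time_id": "2023-13-01"}]))
```

Returning False instead of raising keeps the function compatible with the template, where check_parameter returns a plain result value.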

2.3.2. Change to status_command_handler

Open /opt/open-set/tests/tools/ifsw_verification/commands/status_command_handler.py and edit the following items.

status_commandTBL = {
    0x80: EndGetTelemetry,
    0x81: EndGetLogInfo,
    0x82: EndShutdown,
    0x90: EndPowerOnDevice,
    0x91: EndRestoreOS,
    0x92: EndRestoreFile,
    0xA1: EndDeployApp,
    0xA2: EndGetAppInfo,
    0xA4: EndGetFile,
    0xA5: EndExecuteApp,
    0xA7: EndGetAppResult,
    0xE0: EndMoveFile,
    0xE1: EndNotifyGetFile,
    0xE2: EndGetDirectoryList,
    0xE3: EndDeleteFile,
    0xF0: EndExecuteCommand,
}

To this list, add the status command ID (command ID + 0x80) corresponding to the new command you want to add, and the function name (e.g. PutMonitorStatus) that will receive the status command.

Next, add the function that will receive the status command. In the case of GetTelemetry, the following is the corresponding function.

@timeout_decorator.timeout(COM_HANDLER_TIMEOUT, use_signals=False)
def EndGetTelemetry(response_code, telemetry, parameterCheck=False):
    if parameterCheck:
        return True

    print("Command executed : " + sys._getframe().f_code.co_name)
    print("Telemetry:\n")
    for index, value in enumerate(telemetry):
        print(TELEMETRY_FIELDS[index] + ": " + str(value))
    return

Define the function parameters according to the content returned by invoke_command in your_command.py. Add a parameter check if necessary. Because the print output is displayed on the status command server, make sure the function prints the necessary information.

2.3.3. Changing BUSOBC_consts

Open /opt/open-set/tests/tools/ifsw_verification/commands/BUSOBC_consts.py and edit the following items.

commandLIST = [
    "0x00-GetTelemetry",
    "0x01-GetLog",
    "0x02-Shutdown",
    "0x10-PowerOnDevice",
    "0x11-RestoreOS",
    "0x12-RestoreFile",
    "0x21-DeployApp",
    "0x22-GetAppInfo",
    "0x24-GetFile",
    "0x25-ExecuteApp",
    "0x27-GetResult",
    "0x60-MoveFile",
    "0x61-NotifyGetFile",
    "0x62-GetDirectory",
    "0x63-DeleteFile",
    "0x70-ExecuteCommand",
]

Add the command ID and command name corresponding to the newly added command to this list.
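
Because each command in commandLIST needs a matching status command ID (command ID + 0x80) in status_commandTBL, a small consistency check can catch a forgotten entry. The following is a sketch only. `parse_command_entry` and `missing_status_ids` are hypothetical helper names, and the entry "0x28-NewCommand" is an invented example.

```python
def parse_command_entry(entry):
    """Split an entry such as "0x00-GetTelemetry" into (0x00, "GetTelemetry")."""
    id_text, name = entry.split("-", 1)
    return int(id_text, 16), name


def missing_status_ids(command_list, status_ids):
    """Return the status IDs (command ID + 0x80) that have no handler entry."""
    expected = {parse_command_entry(entry)[0] + 0x80 for entry in command_list}
    return sorted(expected - set(status_ids))


# "0x28-NewCommand" is a hypothetical new command without a status handler.
command_list = ["0x00-GetTelemetry", "0x01-GetLog", "0x28-NewCommand"]
status_ids = [0x80, 0x81]  # keys of status_commandTBL in this example

print([hex(i) for i in missing_status_ids(command_list, status_ids)])  # ['0xa8']
```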

2.4. Deleting Commands

Commands can be deleted by removing the relevant parts from the files updated in [2.3](#23-adding-a-command). After that, restart the IFSW-related services to apply the changes.

3. Telemetry Specifications

This chapter summarizes the specifications for the telemetry collected within the MISSION-OBC.

3.1. Telemetry List

We use telegraf as a tool for collecting MISSION-OBC status information. In addition, some values are managed in a database shared within Open-SET, and are updated by IFSW and IR handlers. The following is a list of the telemetry collected by Open-SET.

Category	Item	telegraf_metrics	Description / Remarks	Byte Pos	Byte Size	Type	Bit	Display Type[Unit]
Time	Time UNIX time		Outputs UNIX time	0	4	uint32	-	DEC[sec]
Command Response	Command Counter*1		Incremented each time a command is received successfully.	4	1	uint8	-	DEC[count]
	Command Reject Counter*1		Incremented when a command is rejected.	5	1	uint8	-	DEC[count]
	Last Executed Command		ID of the last executed command	6	1	uint8	-	DEC
	Last Reject command		ID of the last rejected command	7	1	uint8	-	DEC
	Last Reject Code		Code (response code) when command is rejected	8	1	uint8	-	DEC
	Status Command Counter*1	client_status_commnad_counter	Incremented for each successful command reception.	9	1	uint8	-	DEC[count]
	Status Command Reject Counter*1	client_status_commnad_reject_counter	Incremented when a command is rejected.	10	1	uint8	-	DEC[count]
	Status Last Executed Command	client_status_last_command_code	ID of the last executed command	11	1	uint8	-	DEC
	Status Last Reject command	client_status_last_reject_command_code	ID of the last rejected command	12	1	uint8	-	DEC
System Status	shutdown_request		Each bit indicates whether or not a mode that requires a reboot is in effect. Unassigned bits are reserved. Bit positions are expressed in network byte order.	13	1	uint8	-	STATUS
		Request to execute forced power-off	0:not requested, 1: requested	13	0	-	0	STATUS
		Request to execute command sequence [GetLog -> Get log via FTP -> Request Shut down]	0:not requested, 1: requested	13	0	-	1	STATUS
		Reserved		13	0	-	2:7	STATUS
	dmesg lines	dmesg_length	Number of dmesg lines	14	2	uint16	-	DEC[lines]
	dmesg error	dmesg_errornum	Number of dmesg error lines	16	2	uint16	-	DEC[lines]
	syslog error	syslog_errnum	Number of error lines in syslog	18	2	uint16	-	DEC[lines]
	System Uptime	system_uptime	Total number of seconds the system has been running since it was turned on (up to 65535, or about 18.2 hours)	20	4	uint32	-	DEC[sec]
Service Status	sensor-server.status		Missing values are assumed to be 0. The value of each bit indicates the status of the following services. Unassigned bits are reserved. Bit positions are expressed in network byte order.	24	4	uint32	-	STATUS
		Reserved		24	0	-	0:16	STATUS
		status_network.status	0:stop, 1: run	24	0	-	17	STATUS
		status_ifsw-server.status	0:stop, 1: run	24	0	-	18	STATUS
		status_ifsw-cliant.status	0:stop, 1: run	24	0	-	19	STATUS
		status_ifsw-monitoring.status	0:stop, 1: run	24	0	-	20	STATUS
		status_alertmanager.status	0:stop, 1: run	24	0	-	21	STATUS
		status_webhook_receiver.status	0:stop, 1: run	24	0	-	22	STATUS
		status_prometheus.status	0:stop, 1: run	24	0	-	23	STATUS
		status_telegraf.status	0:stop, 1: run	24	0	-	24	STATUS
		status_mosquitto.status	0:stop, 1: run	24	0	-	25	STATUS
		status_docker.status	0:stop, 1: run	24	0	-	26	STATUS
		status_health_check.status	0:stop, 1: run	24	0	-	27	STATUS
		status_logrotate.status	0:stop, 1: run	24	0	-	28	STATUS
		status_vsftpd.status	0:stop, 1: run	24	0	-	29	STATUS
		status_ssh.status	0:stop, 1: run	24	0	-	30	STATUS
		status_systemd-timesyncd.status	0:stop, 1: run	24	0	-	31	STATUS
	failed status	status_failed_num	Number of services with status fail	28	1	uint8	-	DEC[count]
CPU/GPU Status	CPU load ratio	cpu_usage_user	CPU load ratio	29	1	uint8	-	DEC[%]
	mean CPU frequency	linux_cpu_scaling_cur_freq	Average CPU clock frequency (MHz)	30	2	uint16	-	DEC[MHz]
	Num of Process	processes_running	Number of processes running	32	2	uint16	-	DEC[count]
	RAM Free	mem_available	Available memory size	34	2	uint16	-	DEC[MB]
	GPU load ratio	amd_rocm_smi_utilization_gpu	GPU load ratio	36	1	uint8	-	DEC[%]
	GPU Memory clock	amd_rocm_smi_clocks_current_memory		37	2	float16	-	DEC[GHz]
	GPU Shader clock	amd_rocm_smi_clocks_current_sm		39	2	float16	-	DEC[GHz]
	GPU VRAM usage	amd_rocm_smi_memory_used/amd_rocm_smi_memory_total		41	1	uint8	-	DEC[%]
	GPU GTT (Graphics Translation Table) usage	amd_rocm_gttmem_used/amd_rocm_gttmem_total		42	1	uint8	-	DEC[%]
Communication	throughput	net_packets_throughput	LAN communication throughput	43	2	float16	-	DEC[Mbps]
	tcp byte received (4000)	net_bytes_recv	TCP port received amount (4000), roll over	45	2	uint16	-	DEC[MB]
	tcp byte sent (4003)	net_bytes_sent	TCP port sent amount (4003), Roll over	47	2	uint16	-	DEC[MB]
	err_in	net_err_in	Cumulative number of receive errors	49	1	uint8	-	DEC[count]
	err_out	net_err_out	Cumulative number of transmit errors	50	1	uint8	-	DEC[count]
	drop_in	net_drop_in	Cumulative number of incoming packets dropped	51	1	uint8	-	DEC[count]
	drop_out	net_drop_out	Cumulative number of outgoing packets dropped	52	1	uint8	-	DEC[count]
NVMe0	SSD Temperature	smart_temperature	Disk temperature	53	2	float16	-	DEC[mC]
	Power on hours	smart_power_on_hours	Total time the disk has been powered on	55	4	uint32	-	DEC[hour]
	Powercycle count	smart_power_cycle_count	Number of power cycles the disk has been through	59	4	uint32	-	DEC[count]
	Error information log entries	smart_error_information_log_entries	Number of entries recorded in the error log	63	1	uint8	-	DEC[count]
	Available spare	smart_available_spare	Percentage of available spare space	64	1	uint8	-	DEC[%]
	Smart critical warning	smart_critical_warning	Status of critical warnings 0: No warnings, 1: Warnings have occurred. The value of each bit indicates the status of the following warning types. Unassigned bits are reserved. The bit positions are expressed in little endian format in accordance with the external specifications.	65	1	uint8	-	STATUS
		Reserved		65	0	-	7:6	STATUS
		Invalid Persistent Memory	Persistent Memory Region is in Read-Only state or reliability is questionable.	65	0	-	5	STATUS
		Volatile Backup	There is a problem with the volatile memory backup mechanism.	65	0	-	4	STATUS
		Media Read-only	The media is in Read-Only mode.	65	0	-	3	STATUS
		Reliability Status	The reliability of the NVM subsystem has degraded.	65	0	-	2	STATUS
		Temperature Threshold	The temperature has exceeded the upper threshold or dropped below the lower threshold.	65	0	-	1	STATUS
		Available Spare	The amount of spare space has dropped below the lower threshold.	65	0	-	0	STATUS
	Percentage Used	smart_percentage_used	The percentage of life used.	66	1	uint8	0	DEC[%]
	Unsafe Shutdowns	smart_unsafe_shutdowns	Number of unsafe shutdowns	67	2	uint16	-	DEC[count]
	Media And Data Integrity Errors	smart_media_and_data_integrity_errors	Number of media and data integrity errors	69	1	uint8	-	DEC[count]
	Percentage Rate	storage_raw_read_error_rate	Error rate when reading data	70	1	uint8	-	DEC[%]
	Reallocated_Sector Ct	storage_reallocated_sector_ct	Number of reallocated bad sectors	71	1	uint8	-	DEC[count]
	Hardware_Ecc Recovered	storage_hardware_ecc_recovered	Number of errors recovered by ECC	72	1	uint8	-	DEC[count]
	Reallocated_Event Count	storage_reallocated_event_count	Number of reallocated events	73	1	uint8	-	DEC[count]
	Offline Uncorrectable	storage_offline_uncorrectable	Number of uncorrectable errors in offline scan	74	1	uint8	-	DEC[count]
	Udma_Crc_Error Count	storage_udma_crc_error_count	Number of CRC errors during UDMA transfer	75	1	uint8	-	DEC[count]
NVMe1	SSD Temperature	smart_temperature	Disk temperature	76	2	float16	-	DEC[mC]
	Power on hours	smart_power_on_hours	Total time the disk was powered on	78	4	uint32	-	DEC[hour]
	Powercycle count	smart_power_cycle_count	Number of times the disk has been power-cycled	82	4	uint32	-	DEC[count]
	Error information log entries	smart_error_information_log_entries	Number of entries recorded in the error log	86	1	uint8	-	DEC[count]
	Available spare	smart_available_spare	Percentage of available spare space	87	1	uint8	-	DEC[%]
	Smart critical warning	smart_critical_warning	Status of critical warnings 0: No warnings, 1: Warnings have occurred. The value of each bit indicates the status of the following warning types. Unassigned bits are reserved. The bit positions are expressed in little endian format in accordance with the external specifications.	88	1	uint8	-	STATUS
		Reserved		88	0	-	7:6	STATUS
		Invalid Persistent Memory	Persistent Memory Region is in Read-Only state or reliability is questionable	88	0	-	5	STATUS
		Volatile Backup	There is a problem with the volatile memory backup mechanism	88	0	-	4	STATUS
		Media Read-only	The media is in Read-Only mode.	88	0	-	3	STATUS
		Reliability Status	The reliability of the NVM subsystem has degraded.	88	0	-	2	STATUS
		Temperature Threshold	The temperature has exceeded the upper threshold or dropped below the lower threshold.	88	0	-	1	STATUS
		Available Spare	The amount of spare space has dropped below the lower threshold.	88	0	-	0	STATUS
	Percentage Used	smart_percentage_used	Percentage of life used.	89	1	uint8	0	DEC[%]
	Unsafe Shutdowns	smart_unsafe_shutdowns	Number of unsafe shutdowns	90	2	uint16	-	DEC[count]
	Media And Data Integrity Errors	smart_media_and_data_integrity_errors	Number of media and data integrity errors	92	1	uint8	-	DEC[count]
	Percentage Rate	storage_raw_read_error_rate	Error rate when reading data	93	1	uint8	-	DEC[%]
	Reallocated_Sector Ct	storage_reallocated_sector_ct	Number of reallocated bad sectors	94	1	uint8	-	DEC[count]
	Hardware_Ecc Recovered	storage_hardware_ecc_recovered	Number of errors recovered by ECC	95	1	uint8	-	DEC[count]
	Reallocated_Event Count	storage_reallocated_event_count	Number of reallocated events	96	1	uint8	-	DEC[count]
	Offline Uncorrectable	storage_offline_uncorrectable	Number of uncorrectable errors in offline scan	97	1	uint8	-	DEC[count]
	Udma_Crc_Error Count	storage_udma_crc_error_count	Number of CRC errors during UDMA transfer	98	1	uint8	-	DEC[count]
SSD	SSD Temperature	smart_temperature	Disk temperature	99	2	float16	-	DEC[mC]
	Power on hours	smart_power_on_hours	Total time the disk has been powered on	101	4	uint32	-	DEC[hour]
	Powercycle count	smart_power_cycle_count	Number of times the disk has been power cycled	105	4	uint32	-	DEC[count]
	Error information log entries	smart_error_information_log_entries	Number of entries recorded in the error log	109	1	uint8	-	DEC[count]
	Available spare	smart_available_spare	Percentage of available spare space	110	1	uint8	-	DEC[%]
	Smart critical warning	smart_critical_warning	Status of critical warnings 0: No warnings, 1: Warnings have occurred. The value of each bit indicates the status of the following warning types. Unassigned bits are reserved. The bit positions are expressed in little endian format in accordance with the external specifications.	111	1	uint8	-	STATUS
		Reserved		111	0	-	7:6	STATUS
		Invalid Persistent Memory	Persistent Memory Region is in Read-Only state or reliability is questionable	111	0	-	5	STATUS
		Volatile Backup	There is a problem with the volatile memory backup mechanism	111	0	-	4	STATUS
		Media Read-only	The media is in Read-Only mode.	111	0	-	3	STATUS
		Reliability Status	The reliability of the NVM subsystem has degraded.	111	0	-	2	STATUS
		Temperature Threshold	The temperature has exceeded the upper threshold or dropped below the lower threshold.	111	0	-	1	STATUS
		Available Spare	The amount of spare space has dropped below the lower threshold.	111	0	-	0	STATUS
	Percentage Used	smart_percentage_used	Percentage of life used.	112	1	uint8	0	DEC[%]
	Unsafe Shutdowns	smart_unsafe_shutdowns	Number of unsafe shutdowns	113	2	uint16	-	DEC[count]
	Media And Data Integrity Errors	smart_media_and_data_integrity_errors	Number of media and data integrity errors	115	1	uint8	-	DEC[count]
	Percentage Rate	storage_raw_read_error_rate	Error rate when reading data	116	1	uint8	-	DEC[%]
	Reallocated_Sector Ct	storage_reallocated_sector_ct	Number of reallocated bad sectors	117	1	uint8	-	DEC[count]
	Hardware_Ecc Recovered	storage_hardware_ecc_recovered	Number of errors recovered by ECC	118	1	uint8	-	DEC[count]
	Reallocated_Event Count	storage_reallocated_event_count	Number of reallocated events	119	1	uint8	-	DEC[count]
	Offline Uncorrectable	storage_offline_uncorrectable	Number of uncorrectable errors in offline scan	120	1	uint8	-	DEC[count]
	Udma_Crc_Error Count	storage_udma_crc_error_count	Number of CRC errors during UDMA transfer	121	1	uint8	-	DEC[count]
Storage	Disk Used Percentage(DATA)	disk_used_percent	Disk usage percentage (data)	122	1	uint8	-	Usage rate [%]
	Disk Used Percentage(System)	disk_used_percent	Disk usage percentage (system)	123	1	uint8	-	Usage rate [%]
	Disk Inodes Used Percentage(DATA)	disk_inodes_used_percent	i-node usage percentage (data)	124	1	uint8	-	Usage rate [%]
	Disk Inodes Used Percentage(System)	disk_inodes_used_percent	i-node usage percentage (system)	125	1	uint8	-	Usage rate [%]
	Boot Device	storage_boot_device	Device the system is currently booted from	126	1	uint8	-	STATUS
	Boot NVMe No	storage_nvme_number	Number of the NVMe device currently booted from	127	1	uint8	-	STATUS
container	Running Container num	docker_n_containers_running	Number of docker processes in docker ps	128	1	uint8	-	DEC[count]
	Exited Container num	docker_n_containers_exited	Number of docker processes in docker ps -a --filter "status=exited"	129	1	uint8	-	DEC[count]
cAdvisor	max_storage_used	max_storage_used	Total file system usage by running container services	130	1	uint8	-	DEC[%]
	memory_failed_counter	memory_failed_counter	Number of memory allocation failures + number of OOM (Out Of Memory) events	131	1	uint8	-	DEC[count]
Temperature monitor	gpu_temp	gpu_temp	GPU temperature	132	2	float16	-	DEC[C]
	cpu_temp	sensors_temp_input	CPU temperature	134	2	float16	-	DEC[mC]
FDIR	Event Occurrence	fdir_event_flag	0: No event occurred, 1: Event occurred The value of each bit indicates the status of the following alert event types. The bit positions are expressed in network byte order.	136	2	uint16	-	STATUS
		event_loss_packet	TCP/IP packet loss occurrence flag	136	0	-	0	STATUS
		event_storage_overload	Storage overuse occurrence flag	136	0	-	1	STATUS
		event_high_gpu_temp	GPU temperature warning flag (application shutdown)	136	0	-	2	STATUS
		event_too_high_gpu_temp	GPU temperature warning flag (immediate power supply shutdown)	136	0	-	3	STATUS
		event_high_cpu_temp	CPU temperature warning flag (clock down)	136	0	-	4	STATUS
		event_too_high_cpu_temp	CPU abnormally high temperature flag (immediate power supply stop)	136	0	-	5	STATUS
		event_memory_overload	Memory overuse occurrence flag	136	0	-	6	STATUS
		event_application_timeout	Application execution timeout detection flag	136	0	-	7	STATUS
		event_too_high_storage_temp	Storage temperature too high flag (power supply stopped after logging)	136	0	-	8	STATUS
		Reserved		136	0	-	9:15	STATUS
APP	App ID in running[0]		Application ID in running[0]	138	1	uint8	-	HEX
	OBS ID in running[0]		Observation ID in running[0]	139	2	uint16	-	HEX
	Script No in running[0]		Running script number in running[0]	141	1	uint8	-	HEX
	App ID in running[1]		Application ID in running[1]	142	1	uint8	-	HEX
	OBS ID in running[1]		Observation ID in running[1]	143	2	uint16	-	HEX
	Script No in running[1]		Running script number in running[1]	145	1	uint8	-	HEX
	App ID in running[2]		Application ID in running[2]	146	1	uint8	-	HEX
	OBS ID in running[2]		Observation ID in running[2]	147	2	uint16	-	HEX
	Script No in running[2]		Running script number in running[2]	149	1	uint8	-	HEX
EDAC	Corrected Errors		Number of corrected memory errors	150	1	uint8	-	DEC[count]
	Uncorrected Errors		Number of uncorrectable errors in memory	151	1	uint8	-	DEC[count]

*1: The counter value returns to 0 and starts counting again when it exceeds 255.
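
To make the byte layout concrete, the following sketch decodes bytes 0-13 of the table with Python's struct module, computes the difference between two samples of a wrapping uint8 counter (footnote *1), and lists the flags set in a "Smart critical warning" byte. Several points are assumptions: that the telemetry is available as one packed byte string laid out as in the table, that multi-byte fields are big-endian, and that bit 0 of the critical warning byte is the least significant bit. The field names in HEADER_FIELDS are illustrative labels, not identifiers from Open-SET.

```python
import struct

# Bytes 0-13: one uint32 followed by ten uint8 fields.
# ">" (big-endian) is an assumption; the table does not state the byte order.
HEADER_FORMAT = ">I10B"
HEADER_FIELDS = [
    "unix_time",
    "command_counter",
    "command_reject_counter",
    "last_executed_command",
    "last_reject_command",
    "last_reject_code",
    "status_command_counter",
    "status_command_reject_counter",
    "status_last_executed_command",
    "status_last_reject_command",
    "shutdown_request",
]

# Bit 0 is taken as the least significant bit (assumption).
CRITICAL_WARNING_BITS = [
    "Available Spare",
    "Temperature Threshold",
    "Reliability Status",
    "Media Read-only",
    "Volatile Backup",
    "Invalid Persistent Memory",
]


def decode_header(packet):
    """Decode bytes 0-13 of a telemetry byte string into a dict."""
    values = struct.unpack_from(HEADER_FORMAT, packet, 0)
    return dict(zip(HEADER_FIELDS, values))


def counter_delta(previous, current):
    """Events counted between two samples of a counter that wraps after 255."""
    return (current - previous) % 256


def critical_warnings(value):
    """Return the names of the warning bits set in a critical warning byte."""
    return [
        name
        for bit, name in enumerate(CRITICAL_WARNING_BITS)
        if value & (1 << bit)
    ]


# Synthetic 152-byte packet (bytes 0-151) used only for this example.
packet = struct.pack(HEADER_FORMAT, 1700000000, 3, 0, 0x22, 0, 0, 3, 0, 0xA2, 0, 0)
packet += bytes(152 - len(packet))

print(decode_header(packet)["last_executed_command"])
print(counter_delta(250, 4))
print(critical_warnings(0b00001010))
```

The counter difference is only meaningful if fewer than 256 events occurred between the two samples.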

3.2. Adding Telemetry

Change the following three locations.

/opt/open-set/src/IFSW/config/config.ini
/opt/open-set/src/IFSW/command_handler/command_00_get_telemetry/get_telemetry.py
/opt/open-set/tests/tools/ifsw_verification/commands/BUSOBC_consts.py

In addition, if you want to add a method for collecting telemetry, you will need to make changes to the following two locations.

/opt/open-set/src/FDIRModule/telegraf_tools/telegraf.conf
/opt/open-set/src/FDIRModule/telegraf_tools/execd_plugin_files

The former is the telegraf configuration file. The latter is the management directory for the script files that are executed by the telegraf inputs.execd plugin.
For example, if your environment is equipped with an AMD GPU and ROCm is installed, you can enable inputs.amd_rocm_smi and the get_status_rocm plugin under inputs.execd in telegraf.conf to collect GPU-related telemetry.

For information on how to collect telemetry using telegraf, please refer to telegraf's github (https://github.com/influxdata/telegraf). Here, we will explain the changes required in IFSW when adding telemetry.

3.2.1. Changes to config.ini

This file contains the configuration values used within IFSW, and also includes the names of the metrics collected by telegraf. Open /opt/open-set/src/IFSW/config/config.ini and add the names of the metrics corresponding to the telemetry you want to add to the following.

TELEGRAF_FIELDS = client_status_command_counter, client_status_reject_counter, ...(omitted), ue_count
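
The edit to TELEGRAF_FIELDS can be checked with Python's configparser. The following is a sketch under assumptions: the section name "telemetry" and the shortened field list in SAMPLE_INI are invented for the example, and the real config.ini may be structured differently. The helper `add_metric` is hypothetical.

```python
import configparser

# Shortened stand-in for config.ini; the section name is an assumption.
SAMPLE_INI = """
[telemetry]
TELEGRAF_FIELDS = client_status_command_counter, client_status_reject_counter, ue_count
"""


def add_metric(config_text, section, metric):
    """Return the TELEGRAF_FIELDS list with metric appended if it is missing."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep option names case-sensitive
    parser.read_string(config_text)
    raw = parser[section]["TELEGRAF_FIELDS"]
    fields = [field.strip() for field in raw.split(",")]
    if metric not in fields:
        fields.append(metric)
    return fields


# "new_metric" is a hypothetical metric name.
print(", ".join(add_metric(SAMPLE_INI, "telemetry", "new_metric")))
```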

3.2.2. Changes to get_telemetry.py

This file links the values of the metrics collected by telegraf to telemetry and outputs them. First, open /opt/open-set/src/IFSW/command_handler/command_00_get_telemetry/get_telemetry.py and add the telemetry names and their initial values (invalid values) to the following telemetry_initial_values.

def initialize_status_data():
    global_db = SqliteDict(DB_IFSW_GLOBAL_VARS)

    telemetry_initial_values = {
        # ======== Time ==========#
        "unix_time": np.uint32(time.time()),

         : (omitted) 
        "ue_count": np.uint8(0),
    }

If the metric collected by telegraf is used directly as the telemetry value, the changes are complete. Otherwise, create a class that links the metrics to the telemetry. For example, the telemetry "gpu_gtt_usage", which indicates GPU GTT usage, is derived from the metrics "gpu_gtt_used" and "gpu_gtt_total" by the following class.

class GPUGTTUsage(BaseTelemetryMsgHandler):
    def update_status(self):
        status = "gpu_gtt_usage"
        try:
            # Keep the type that was set in telemetry_initial_values.
            var_type = type(self.status_dict[status])
            value_used = self.metrics_dict["gpu_gtt_used"][0]["value"]
            value_total = self.metrics_dict["gpu_gtt_total"][0]["value"]
            # Usage in percent: 100 * used / total.
            self.status_dict[status] = var_type(
                100 * float(value_used) / float(value_total))
        except Exception as e:
            logging.warning(
                "Exception occurred when updating gpu_gtt_usage.\n%s", e)
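
The percentage calculation in this class can be tried out in isolation. The following sketch is not code from Open-SET: `usage_percent` is a hypothetical helper that computes the same 100 * used / total ratio, rejects a non-positive total explicitly instead of relying on the exception handler, and clamps the result to the uint8 range used by the telemetry.

```python
def usage_percent(used, total):
    """Return 100 * used / total as an integer clamped to 0..255 (uint8 range)."""
    used = float(used)
    total = float(total)
    if total <= 0:
        raise ValueError("total must be positive")
    # int() truncates toward zero, then the value is clamped to one byte.
    return max(0, min(255, int(100 * used / total)))


# Metric values may arrive as strings, so both str and numeric input work.
print(usage_percent("512", "2048"))  # 25
```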

Similarly, create a class (NewTelemetry) that links the telemetry (new_telemetry) and metrics (metrics) to be added according to the following format.

class NewTelemetry(BaseTelemetryMsgHandler):
    def update_status(self):
        status = "new_telemetry"
        try:
            var_type = type(self.status_dict[status])
            self.status_dict[status] = var_type(
                # Here, write the function for the metric that indicates the telemetry value.
                # The metric value can be obtained with float(self.metrics_dict["metrics"][0]["value"])

                )
        except Exception as e:
            logging.warning("Exception occurred when \
                            updating new_telemetry. \n%s", e)

Please change the names of "new_telemetry", "metrics" and NewTelemetry to match the telemetry you are adding.
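As a concrete illustration of the template, the following is a minimal, self-contained sketch. The stand-in `BaseTelemetryMsgHandler`, the telemetry name `swap_usage`, and the metric names `swap_used` and `swap_total` are hypothetical and exist only so that the example runs outside IFSW; the real base class in GetTelemetry.py may differ, and a plain `int` is used here in place of a numpy type.

```python
import logging


class BaseTelemetryMsgHandler:
    """Hypothetical stand-in for the IFSW base class (assumption)."""

    def __init__(self, status_dict, metrics_dict):
        self.status_dict = status_dict    # telemetry name -> current value
        self.metrics_dict = metrics_dict  # metric name -> list of samples


class SwapUsage(BaseTelemetryMsgHandler):
    """Links the hypothetical telemetry "swap_usage" to two hypothetical metrics."""

    def update_status(self):
        status = "swap_usage"
        try:
            # Reuse the type of the initial value so the telemetry type is unchanged.
            var_type = type(self.status_dict[status])
            used = float(self.metrics_dict["swap_used"][0]["value"])
            total = float(self.metrics_dict["swap_total"][0]["value"])
            self.status_dict[status] = var_type(100 * used / total)
        except Exception as e:
            # On any failure the initial (invalid) value is left in place.
            logging.warning("Exception occurred when updating %s.\n%s", status, e)


status = {"swap_usage": 0}
metrics = {"swap_used": [{"value": "512"}], "swap_total": [{"value": "2048"}]}
SwapUsage(status, metrics).update_status()
print(status["swap_usage"])
```

If a metric is missing, the `except` branch only logs a warning, so the telemetry keeps its initial value instead of interrupting telemetry generation.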

Once you have created the class, add an instance of it to the SpecificTelemetryMsgHandler list in the compose_telemetry_msg method of TelemetryMsgAggregator.

    def compose_telemetry_msg(self):
        BaseTelemetry = BaseTelemetryMsgHandler(
            self.status_data, self.metrics_data)
        for telemetry in self.status_data.keys():
            BaseTelemetry.update_status(telemetry)

        SpecificTelemetryMsgHandler = [
            ShutdownRequest(self.status_data, self.metrics_data),
            RamFree(self.status_data, self.metrics_data),
            :(omitted)
            StorageNVMeNumber(self.status_data, self.metrics_data),
        ]

Once you have done this, restart the services related to IFSW to apply the changes.

sudo systemctl restart ifsw-*

3.2.3. Changing BUSOBC_consts

Open /opt/open-set/tests/tools/ifsw_verification/commands/BUSOBC_consts.py and edit the following items.

TELEMETRY_FIELDS = [
    # ======== Time ==========#
    "unix_time",
    :(omitted)
    "ue_count",
]

Add the name of the new telemetry to this list.
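Because a new telemetry name has to be added in more than one file, a quick consistency check can catch a forgotten entry. The sketch below assumes that `telemetry_initial_values` and `TELEMETRY_FIELDS` are meant to contain the same names; the helper `missing_telemetry` and the shortened sample data are illustrative only.

```python
def missing_telemetry(initial_values, telemetry_fields):
    """Return (names not listed in the fields, names without an initial value)."""
    defined = set(initial_values)
    listed = set(telemetry_fields)
    return sorted(defined - listed), sorted(listed - defined)


# Shortened, illustrative data; the real definitions contain many more entries.
telemetry_initial_values = {"unix_time": 0, "ue_count": 0, "new_telemetry": 0}
TELEMETRY_FIELDS = ["unix_time", "ue_count"]

not_listed, not_initialized = missing_telemetry(
    telemetry_initial_values, TELEMETRY_FIELDS)
print(not_listed)       # names missing from TELEMETRY_FIELDS
print(not_initialized)  # names missing from telemetry_initial_values
```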

3.3. Deleting Telemetry

To delete telemetry, remove the relevant parts of the files updated in [3.2](#32-adding-telemetry). Then restart the IFSW-related services to apply the changes.

If you have made changes to the telegraf configuration file or the inputs.execd plugin script file in order to collect telemetry, please delete those changes as well.

4. FDIR Specifications

This chapter summarizes the specifications of FDIR, which detects MISSION-OBC abnormalities and performs isolation and recovery processing.

4.1. FDIR List

Prometheus is used as the alert management tool and works in conjunction with telegraf to monitor metrics for abnormal values. The following is a list of the FDIRs that perform detection, isolation, and recovery processing in Open-SET.

No	FDIR Name	FD Contents	IR_ID*2	request_shutdown value*3
1	Storage High Temperature	Detects abnormal temperatures in storage devices.	2	0x40
2	Application Execution Time	Detects MISSION-OBC startup time overruns.	4	0x00
3	Memory Overuse	Detects failures to secure memory requested by container applications.	6	0x00
4	CPU high temperature (>=90°C)	Detects abnormal temperatures in the CPU of the MISSION-OBC.	1	0x80
5	CPU high temperature (>=70°C)	Detects abnormal temperatures in the CPU of the MISSION-OBC.	3	0x00
6	GPU high temperature (>=90°C)	Detects abnormal temperatures in the GPU installed in the MISSION-OBC.	1	0x80
7	GPU high temperature (>=70°C)	Detects abnormal temperatures in the GPU installed in the MISSION-OBC.	4	0x00
8	Storage Overuse	Detects excessive use of the file system used by the container application.	4	0x00
9	Packet Loss	Detects packet loss that occurs in the NIC inside the MISSION-OBC.	6	0x00

*2: IR_ID is an ID that is linked to IR processing.
*3: request_shutdown is a flag value that notifies the BUS-OBC of the status inside the MISSION-OBC. The BUS-OBC acts according to the value of request_shutdown. The MISSION-OBC uses the following values.
        0x00: No processing
        0x40: Shut down after retrieving log information
        0x80: Shut down immediately
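On the receiving side, the three request_shutdown values can be decoded with a small enumeration. This is only a sketch: the names `RequestShutdown` and `describe_request_shutdown` are hypothetical, and treating any other value as unknown is an assumption.

```python
from enum import IntEnum


class RequestShutdown(IntEnum):
    """request_shutdown values listed above (class name is hypothetical)."""
    NO_PROCESSING = 0x00
    SHUTDOWN_AFTER_LOG = 0x40
    SHUTDOWN_IMMEDIATELY = 0x80


def describe_request_shutdown(value):
    """Return the symbolic name for a request_shutdown value, or "UNKNOWN"."""
    try:
        return RequestShutdown(value).name
    except ValueError:
        return "UNKNOWN"


print(describe_request_shutdown(0x40))
```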

The IR processing linked to each IR_ID is as follows.

IR_ID	IR content
1	Perform shutdown and stop power supply from BUS-OBC to MISSION-OBC.
2	Retrieve log information and perform shutdown.
3	Perform clock down.
4	If a container application is running, stop the application.
6	Output a message to the log file indicating that packet loss has occurred.

4.2. Adding monitoring items

Change the following three locations.

/opt/open-set/src/EventHub/WebhookReceiver/config/config.ini
/opt/open-set/src/FDIRModule/prometheus_tools/rule_files/
/opt/open-set/src/FDIRModule/prometheus_tools/prometheus.yml

4.2.1. Modifying config.ini

config.ini maps each alert name to its bit position in the fdir_event_flag telemetry. Open /opt/open-set/src/EventHub/WebhookReceiver/config/config.ini and add the name and bit position of each new alert to the following section.

[alertTBL]
too_high_storage_temp = 7
application_timeout = 8
memory_overload = 9
too_high_cpu_temp = 10
high_cpu_temp = 11
too_high_gpu_temp = 12
high_gpu_temp = 13
storage_overload = 14
loss_packet = 15
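To make the relationship between alert names and bit positions concrete, the sketch below parses the [alertTBL] section and computes a flag value for a set of firing alerts. It assumes that bit position N corresponds to `1 << N` (bit 0 being the least significant bit); the helper names are hypothetical, and the actual WebhookReceiver implementation may differ.

```python
import configparser

ALERT_TBL = """
[alertTBL]
too_high_storage_temp = 7
application_timeout = 8
memory_overload = 9
too_high_cpu_temp = 10
high_cpu_temp = 11
too_high_gpu_temp = 12
high_gpu_temp = 13
storage_overload = 14
loss_packet = 15
"""


def load_alert_bits(ini_text):
    """Parse the [alertTBL] section into {alert name: bit position}."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    return {name: int(bit) for name, bit in parser["alertTBL"].items()}


def fdir_event_flag(firing_alerts, alert_bits):
    """Set one bit per firing alert (assumes bit 0 is the least significant bit)."""
    flag = 0
    for name in firing_alerts:
        flag |= 1 << alert_bits[name]
    return flag


bits = load_alert_bits(ALERT_TBL)
print(hex(fdir_event_flag(["too_high_cpu_temp", "loss_packet"], bits)))
```

When adding an alert, choose a bit position that no existing alert uses so that the flags remain distinguishable.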

4.2.2. Changing alertfiles

alertfiles manages the files that define alert rules. First, create an alert configuration file for each monitoring item.

Creating an alert definition
vi [file name].yml


How to write an alert file (sample.yml)
groups:
  - name: <alert rule name>
    rules:
      - alert: <alert name>
        expr: <alert conditions>
        for: <time until the conditions are judged to be true>
        labels:
          severity: <alert content>
        annotations:
          summary: <alert summary>
          description: <Explanation of alert conditions>
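Indentation is significant in a rule file, so generating the text from a small helper can avoid layout mistakes. The sketch below renders one alerting rule in the layout Prometheus expects; the function name `render_rule_file` and all example values, including the metric `sample_metric`, are hypothetical.

```python
def render_rule_file(group, alert, expr, duration, severity, summary, description):
    """Return the text of a Prometheus rule file containing one alerting rule."""
    return (
        "groups:\n"
        f"  - name: {group}\n"
        "    rules:\n"
        f"      - alert: {alert}\n"
        f"        expr: {expr}\n"
        f"        for: {duration}\n"
        "        labels:\n"
        f"          severity: {severity}\n"
        "        annotations:\n"
        f'          summary: "{summary}"\n'
        f'          description: "{description}"\n'
    )


text = render_rule_file(
    group="sample",
    alert="sample_alert",
    expr="sample_metric > 80",  # hypothetical metric and threshold
    duration="1m",
    severity="warning",
    summary="Sample metric is high",
    description="sample_metric has stayed above 80 for 1 minute",
)
print(text)
```

The resulting file can be checked with `promtool check rules sample.yml` before it is copied into place, provided promtool from the Prometheus distribution is available.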

Next, place the file you created in /opt/open-set/src/FDIRModule/prometheus_tools/rule_files.

sudo cp sample.yml /opt/open-set/src/FDIRModule/prometheus_tools/rule_files

Copy the contents of rule_files to the location where the prometheus executable is located.

sudo cp /opt/open-set/src/FDIRModule/prometheus_tools/rule_files/* /prometheus-2.47.0.linux-amd64/rule_files

4.2.3. Modifying prometheus.yml

prometheus.yml is the configuration file for prometheus. Open /opt/open-set/src/FDIRModule/prometheus_tools/prometheus.yml and add the path of the rule file for the alert you want to add to the following rule_files list.

rule_files:
    - "/prometheus-2.47.0.linux-amd64/rule_files/app_runtime.yml"
    - "/prometheus-2.47.0.linux-amd64/rule_files/high_cpu_temp.yml"
    - "/prometheus-2.47.0.linux-amd64/rule_files/too_high_cpu_temp.yml"
    - "/prometheus-2.47.0.linux-amd64/rule_files/high_gpu_temp.yml"
    - "/prometheus-2.47.0.linux-amd64/rule_files/too_high_gpu_temp.yml"
    - "/prometheus-2.47.0.linux-amd64/rule_files/loss_packet.yml"
    - "/prometheus-2.47.0.linux-amd64/rule_files/storage_overload.yml"
    - "/prometheus-2.47.0.linux-amd64/rule_files/memory_overload.yml"
    - "/prometheus-2.47.0.linux-amd64/rule_files/high_temp_storage.yml"

Next, copy this file to the location where the prometheus executable is located.

sudo cp /opt/open-set/src/FDIRModule/prometheus_tools/prometheus.yml /prometheus-2.47.0.linux-amd64

Once this is complete, restart the following FDIR-related services to apply the settings.

sudo systemctl restart prometheus.service
sudo systemctl restart webhook_receiver.service
sudo systemctl restart alertmanager.service

4.3. Deleting monitoring items

To delete monitoring items, remove the relevant parts of the files updated in [4.2](#42-adding-monitoring-items). Then restart the FDIR-related services to apply the changes.

4.4. Adding IR Events

IR events are IR processes that are executed in response to IR_ID. You can add IR events by saving script files that define IR processes in the following directory.

/opt/open-set/src/EventHub/WebhookReceiver/IR_events

First, create a script file for the IR event. The file name should be in the format XX_*.py, with a two-digit hexadecimal number XX at the beginning. XX is the IR_ID for that event.
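The naming rule can be checked before a file is deployed. The sketch below extracts the IR_ID from a file name of the form XX_*.py; the helper name `ir_id_from_filename` is hypothetical, and how WebhookReceiver itself parses the file names is not described here.

```python
import re

# Two hexadecimal digits, an underscore, any remaining name, and ".py".
_IR_EVENT_NAME = re.compile(r"^([0-9A-Fa-f]{2})_.*\.py$")


def ir_id_from_filename(filename):
    """Return the IR_ID encoded in an IR event file name, or None if it does not match."""
    match = _IR_EVENT_NAME.match(filename)
    if match is None:
        return None
    return int(match.group(1), 16)


print(ir_id_from_filename("FF_sample.py"))
print(ir_id_from_filename("sample.py"))
```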

Creating a script file
vi [file name].py


How to write a script file (FF_sample.py)
def ExecuteIR():
    response_code = E_OK
    try:
        # Write the IR event processing here.
        pass  # Placeholder so that the empty template is valid Python; remove it once processing is added.
    except Exception as e:
        logging.warning("Error: %s", e)
        response_code = ER_IR_EXE
    return response_code

Next, please place the file you created in /opt/open-set/src/EventHub/WebhookReceiver/IR_events.

sudo cp FF_sample.py /opt/open-set/src/EventHub/WebhookReceiver/IR_events

Once you have completed the above, restart the following FDIR-related services to apply the settings.

sudo systemctl restart prometheus.service
sudo systemctl restart webhook_receiver.service
sudo systemctl restart alertmanager.service

4.5. Deleting IR Events

To delete an IR event, remove the script file that was added in [4.4](#44-adding-ir-events).

5. Recovery Processing

This chapter explains the device recovery process. The procedure below follows the default Open-SET specifications; customize it as necessary to suit your environment.

The default specification of Open-SET assumes that there are three storage devices, NVMe0, NVMe1, and SATA0, and of these, SATA0 is used as the storage device for recovery. Recovery is performed using the following three commands.

0x10 PowerOnDevice
0x11 RestoreOS
0x12 RestoreFile

The basic flow of recovery is as follows.

1. If NVMe0/NVMe1 fails to boot for some reason, boot from SATA0.
2. Execute the PowerOnDevice command on SATA0 to enable the device to be recovered (do not enable multiple devices at the same time).
3. Execute the RestoreOS command on SATA0 to deploy the image file in /initialdata on SATA0 to the device to be recovered.
4. Set the device to be recovered as the boot device and reboot.
5. Confirm that the device to be recovered has started up.
6. If necessary, use the RestoreFile command to execute the update script in /initialdata on SATA0 on the current device and apply the differences from the image file.
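The request bodies for the first two commands in this flow can be built as follows. The JSON layouts follow the command list in 2.1; the helper names are hypothetical, the transport to the MISSION-OBC is outside this sketch, and the RestoreFile request is omitted.

```python
import json

# Device identifiers as listed in 2.1: 0 = nvme0, 1 = ssd0, 2 = nvme1.
NVME0 = 0


def power_on_device(device_id):
    """Build the 0x10 PowerOnDevice request body."""
    return json.dumps({"command_id": "0x10", "parameter": [{"device_id": device_id}]})


def restore_os(device_id, progress_id):
    """Build the 0x11 RestoreOS request body.

    progress_id: 1 = boot and root partitions, 2 = root partition,
    3 = boot device switching only. progress_bytes is fixed to 0.
    """
    return json.dumps({
        "command_id": "0x11",
        "parameter": [
            {"device_id": device_id},
            {"progress_id": progress_id},
            {"progress_bytes": 0},
        ],
    })


print(power_on_device(NVME0))
print(restore_os(NVME0, 1))
```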

As a preliminary step, please place the image file and update application script in /initialdata on SATA0. Also, please check the following recovery-related settings in /opt/open-set/src/IFSW/config/config.ini.

[recovery_sequence]
(omitted)
DD_UPTIME_TIMEOUT = 1500 # Timeout for RestoreOS command (seconds since MISSION-OBC booted)
DD_BLOCK_SIZE = 1048576 # Block size when expanding image files with the RestoreOS command
MOUNT_WAIT_TIME = 3 # Time to wait for SATA0 to be enabled and mount to be executed with the RestoreFile command
DIR_MOUNT_RESTORE = /mnt/restore # Directory name to mount the device to be recovered with the RestoreOS command
DIR_INITIAL_DATA = /initialdata # Directory for recovery files
DIR_INSTALLER = / # Base path for unmounting SATA0 in the RestoreFile command
REBOOT_SCRIPT = /opt/open-set/sys/system_reboot # Directory for reboot script files
BOOT_ISO = boot_golden-image.iso # Image file name of the boot partition to be placed in /initialdata on SATA0
ROOT_ISO = root_golden-image.iso # Image file name of the root partition to be placed in /initialdata on SATA0
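To check these values from a script, the section can be read with Python's standard configparser. The sketch below uses a shortened copy of the section and enables inline `#` comments explicitly; whether IFSW reads the file in the same way is an assumption.

```python
import configparser

SAMPLE = """
[recovery_sequence]
DD_UPTIME_TIMEOUT = 1500 # Timeout for RestoreOS command
DD_BLOCK_SIZE = 1048576 # Block size when expanding image files
DIR_INITIAL_DATA = /initialdata # Directory for recovery files
BOOT_ISO = boot_golden-image.iso # Image file name of the boot partition
"""

# Inline comments are not stripped by default, so the prefix is enabled here.
parser = configparser.ConfigParser(inline_comment_prefixes=("#",))
parser.read_string(SAMPLE)
section = parser["recovery_sequence"]

timeout_s = section.getint("DD_UPTIME_TIMEOUT")
block_size = section.getint("DD_BLOCK_SIZE")
boot_image = section["DIR_INITIAL_DATA"] + "/" + section["BOOT_ISO"]

print(timeout_s, block_size, boot_image)
```

With these defaults the block size is 1 MiB (1048576 bytes), and the boot image is expected at /initialdata/boot_golden-image.iso on SATA0.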

In addition, at the time of installation, the three recovery commands are disabled. To enable them, you need to change the contents of the following file.

/opt/open-set/src/IFSW/tools/control_device.py

Replace the following four lines in this file with the appropriate commands for your environment.

cmd = ["echo", "Powering on device:",
       DEVICE_NAME_INFO[device_id]["storage_device"]]
->
Replace with the command to enable the storage device corresponding to device_id.

cmd = ["echo", "Setting boot device to:",
       DEVICE_NAME_INFO[device_id]["storage_device"]]
->
Replace with a command that sets the storage device corresponding to device_id as the device to boot next time.

cmd = ["echo", "Setting boot device to: default"]
->
Replace with a command that sets the device to boot next time to the default.

cmd = ["echo", "Current boot device:"] # Edite here
->
Replace with a command that displays the name of the currently booted storage device.
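Each `cmd` is an argument list rather than a shell string, so a replacement command should keep the same form. The sketch below shows how such a list can be executed and its output captured; that control_device.py runs the lists in this way is an assumption, and the helper `run_device_command` and the device name are illustrative.

```python
import subprocess


def run_device_command(cmd):
    """Run a command given as an argument list and return its standard output."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.strip()


# Placeholder form as shipped: echo only prints the request.
cmd = ["echo", "Powering on device:", "nvme0"]
print(run_device_command(cmd))
```

Because `check=True` raises `subprocess.CalledProcessError` on a non-zero exit status, a failing replacement command is reported instead of being silently ignored.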

When you have finished, restart the services related to IFSW to apply the changes.

sudo systemctl restart ifsw-*

6. How to use FTP

This chapter explains data transfer between BUS-OBC and MISSION-OBC using FTP commands.

The procedure for performing PUT/GET using SFTP from the client side is as follows.

sftp [user name]@[IP address/domain name of the connection destination]
Connected to [IP address/domain name of the connection destination].
sftp> get [path of the file to be acquired (@MISSION-OBC)]
sftp> put [path of file to be sent (@satellite bus system)]

Note that the connection uses the authentication information configured for vsftpd by the installer, so files cannot be sent to directories for which that user does not have write permission. Replace [user name] with your actual user name, and [IP address/domain name of the connection destination] with the IP address or domain name of the server configured in the vsftpd settings of the installer.
