When `bosh deploy` fails, the failing vm is sometimes missing from `bosh vms` (its record has been deleted from the bosh db) but still alive in Softlayer. If you do nothing, the next `bosh deploy` treats it as a missing vm and creates a new one with a different IP to backfill it. That breaks any case where the vm IP must stay unchanged, so the missing vm needs to be recovered in the bosh db by following this run book.
Scenario 1: `/var/vcap/bosh/user_data.json` exists on the failing vm.

1) Make sure the failing vm can still be seen from the Softlayer Portal and that `/var/vcap/bosh/user_data.json` exists on the failing vm.

2) Update the record of the failing vm in the bosh db. Taking `nats/0` as an example of the failing vm:

       update instances set vm_cid='12345678' where job='nats' and index=0;
       update instances set agent_id='1234abcd-12ab-34cd-56ef-123456abcdef' where job='nats' and index=0;

   The `vm_cid` can be taken from the Softlayer Portal address of the vm, such as `12345678` in https://control.softlayer.com/devices/details/12345678/virtualGuest

   The `agent_id` can be found in `/var/vcap/bosh/user_data.json` on the failing vm.

3) Log in to the failing vm and create `/var/vcap/bosh/spec.json` like this:

       {
         "job": {
           "name": "REPLACE_job_name"
         },
         "deployment": "REPLACE_deployment_name",
         "networks": {
           "default": {
             "cloud_properties": {
               "security_groups": [ "default", "cf" ]
             },
             "default": [ "dns", "gateway" ],
             "dns": [ "REPLACE_dns_ip", "10.0.80.11", "10.0.80.12" ],
             "gateway": "REPLACE_gateway_ip",
             "ip": "REPLACE_vm_ip",
             "netmask": "REPLACE_netmask_ip",
             "type": "dynamic"
           }
         },
         "index": REPLACE_job_index,
         "id": "",
         "persistent_disk": 0
       }

   - name: Replace with the job name. It's `nats` in the example.
   - dns: Replace with the dns ip. If the PowerDNS on the director is used, it's the director ip.
   - gateway: Replace with the gateway ip. It can be found with `route -n` or `netstat -rn`. See the example below.
   - ip: Replace with the vm ip. It can be found with `ifconfig`. See the example below.
   - netmask: Replace with the netmask. It can be found with `ifconfig`. See the example below.
   - deployment: Replace with the name of the deployment that the vm belongs to.
   - index: Replace with the job index. It's `0` in the example.
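
As a rough sketch, the `spec.json` template above could also be filled in with a small script instead of by hand. The helper name `build_spec` and the sample values are hypothetical and not part of bosh; the structure simply mirrors the template in this step, including the security groups and the two fixed dns servers shown there.

```python
import json


def build_spec(job_name, job_index, deployment, vm_ip, netmask, gateway_ip, dns_ip):
    """Return the spec.json content as a dict (hypothetical helper, not part of bosh)."""
    return {
        "job": {"name": job_name},
        "deployment": deployment,
        "networks": {
            "default": {
                # Security groups and the two fixed dns servers are copied
                # from the template in this run book.
                "cloud_properties": {"security_groups": ["default", "cf"]},
                "default": ["dns", "gateway"],
                "dns": [dns_ip, "10.0.80.11", "10.0.80.12"],
                "gateway": gateway_ip,
                "ip": vm_ip,
                "netmask": netmask,
                "type": "dynamic",
            }
        },
        "index": job_index,
        "id": "",
        "persistent_disk": 0,
    }


if __name__ == "__main__":
    # All values below are made-up placeholders for illustration only.
    spec = build_spec("nats", 0, "my-deployment", "10.0.0.5",
                      "255.255.255.192", "10.0.0.1", "10.0.0.2")
    print(json.dumps(spec, indent=2))
```

The printed JSON can be reviewed and then saved as `/var/vcap/bosh/spec.json` on the failing vm.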

   `netstat -rn` or `route -n` output example:

       Kernel IP routing table
       Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
       0.0.0.0         <gateway_ip>    0.0.0.0         UG    0      0        0 eth1
       9.0.0.0         10.121.120.193  255.0.0.0       UG    0      0        0 eth0
       10.0.0.0        10.121.120.193  255.0.0.0       UG    0      0        0 eth0
       10.121.120.192  0.0.0.0         255.255.255.192 U     0      0        0 eth0
       161.26.0.0      10.121.120.193  255.255.0.0     UG    0      0        0 eth0
       169.53.1.128    0.0.0.0         255.255.255.224 U     0      0        0 eth1

   `ifconfig` output example:

       eth0      Link encap:Ethernet  HWaddr 06:99:32:27:34:f7
                 inet addr:<vm_private_ip>  Bcast:10.121.120.255  Mask:<netmask_ip_private>
                 inet6 addr: fe80::499:32ff:fe27:34f7/64 Scope:Link
                 UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
                 RX packets:596214 errors:0 dropped:0 overruns:0 frame:0
                 TX packets:5065 errors:0 dropped:0 overruns:0 carrier:0
                 collisions:0 txqueuelen:1000
                 RX bytes:63759564 (63.7 MB)  TX bytes:661260 (661.2 KB)

       eth1      Link encap:Ethernet  HWaddr 06:da:3f:89:2a:b2
                 inet addr:<vm_public_ip>  Bcast:169.53.1.159  Mask:<netmask_ip_public>
                 inet6 addr: fe80::4da:3fff:fe89:2ab2/64 Scope:Link
                 UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
                 RX packets:65584 errors:0 dropped:0 overruns:0 frame:0
                 TX packets:21240 errors:0 dropped:0 overruns:0 carrier:0
                 collisions:0 txqueuelen:1000
                 RX bytes:5413249 (5.4 MB)  TX bytes:3236947 (3.2 MB)

       lo        Link encap:Local Loopback
                 inet addr:127.0.0.1  Mask:255.0.0.0
                 inet6 addr: ::1/128 Scope:Host
                 UP LOOPBACK RUNNING  MTU:65536  Metric:1
                 RX packets:6291 errors:0 dropped:0 overruns:0 frame:0
                 TX packets:6291 errors:0 dropped:0 overruns:0 carrier:0
                 collisions:0 txqueuelen:0
                 RX bytes:892748 (892.7 KB)  TX bytes:892748 (892.7 KB)

4) Restart the agent on the failing vm: `sv restart agent`

5) `bosh vms`: the missing vm is back

6) `bosh cck`: no problems
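
The two `update instances` statements shown earlier in this scenario can be generated from the Softlayer Portal address and the content of `user_data.json`. The following is only a sketch: the function names are hypothetical, the URL pattern is assumed to match the example address in this run book, and the generated statements should still be reviewed before they are run against the bosh db.

```python
import json
import re


def vm_cid_from_portal_url(url):
    """Extract the numeric device ID from a Softlayer Portal device URL."""
    match = re.search(r"/devices/details/(\d+)/virtualGuest", url)
    if match is None:
        raise ValueError("no device ID found in URL: %s" % url)
    return match.group(1)


def agent_id_from_user_data(user_data_text):
    """Read agent_id from the content of /var/vcap/bosh/user_data.json."""
    return json.loads(user_data_text)["agent_id"]


def build_update_sql(job, index, vm_cid, agent_id):
    """Build the two UPDATE statements for one instance (hypothetical helper)."""
    # Reject unexpected characters before embedding values in SQL text.
    if not re.fullmatch(r"\d+", vm_cid):
        raise ValueError("unexpected vm_cid: %r" % vm_cid)
    if not re.fullmatch(r"[0-9a-f-]+", agent_id):
        raise ValueError("unexpected agent_id: %r" % agent_id)
    if not re.fullmatch(r"[A-Za-z0-9_-]+", job):
        raise ValueError("unexpected job name: %r" % job)
    where = "where job='%s' and index=%d" % (job, index)
    return [
        "update instances set vm_cid='%s' %s;" % (vm_cid, where),
        "update instances set agent_id='%s' %s;" % (agent_id, where),
    ]


if __name__ == "__main__":
    url = "https://control.softlayer.com/devices/details/12345678/virtualGuest"
    user_data = '{"agent_id": "1234abcd-12ab-34cd-56ef-123456abcdef"}'
    for statement in build_update_sql("nats", 0, vm_cid_from_portal_url(url),
                                      agent_id_from_user_data(user_data)):
        print(statement)
```

The value checks are deliberately strict so that a mistyped ID fails loudly instead of ending up in the `instances` table.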
Scenario 2: `user_data.json` has not been generated on the failing vm.

Sometimes when `bosh deploy` fails, `user_data.json` has not yet been generated, for example after a create_vm failure. Make sure the failing vm can still be seen from the Softlayer Portal.

1) Do steps 1) and 2) of Scenario 1, except that the `agent_id` needs to be taken from the bosh debug log for the failure.

   bosh debug log example:

E, [2016-07-22 06:17:15 #13984] [canary_update(nats/0 (14109da6-cc51-49e8-8e99-d096e4f59550))] ERROR -- DirectorJobRunner: error creating vm: Creating Virtual_Guest with agent ID '1234abcd-12ab-34cd-56ef-123456abcdef': Attaching ephemeral disk to VirtualGuest `12345678`: Waiting for VirtualGuest `12345678` has Service Setup transaction complete: Getting Last Complete Transaction for virtual guest with ID '12345678': Get https://[user]:[api_key]@api.softlayer.com/rest/v3/SoftLayer_Virtual_Guest/12345678/getLastTransaction.json?objectMask=transactionGroup: net/http: TLS handshake timeout
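
Picking the agent ID out of a long debug log line by eye is error-prone, so a small script can help. This is a minimal sketch that assumes the error line is worded like the example above; the function name and the regular expressions are hypothetical, and other CPI versions may phrase the message differently. The vm ID it returns should still be cross-checked against the Softlayer Portal.

```python
import re

# Patterns assume log lines shaped like the example in this run book.
AGENT_ID_PATTERN = re.compile(r"agent ID '([0-9a-fA-F-]+)'")
VM_ID_PATTERN = re.compile(r"VirtualGuest `(\d+)`")


def parse_create_vm_error(log_line):
    """Return (agent_id, vm_id) found in a create_vm error line; either may be None."""
    agent = AGENT_ID_PATTERN.search(log_line)
    vm = VM_ID_PATTERN.search(log_line)
    return (agent.group(1) if agent else None, vm.group(1) if vm else None)


if __name__ == "__main__":
    # Shortened copy of the example log line.
    line = ("ERROR -- DirectorJobRunner: error creating vm: Creating Virtual_Guest "
            "with agent ID '1234abcd-12ab-34cd-56ef-123456abcdef': "
            "Attaching ephemeral disk to VirtualGuest `12345678`: ...")
    print(parse_create_vm_error(line))
```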

2) Log in to the failing vm and create `/var/vcap/bosh/user_data.json` from the example below.

   `user_data.json` example:

       {"agent_id":"1234abcd-12ab-34cd-56ef-123456abcdef","vm":{"name":"vm-1234abcd-12ab-34cd-56ef-123456abcdef","id":"vm-1234abcd-12ab-34cd-56ef-123456abcdef"},"mbus":"nats://nats:nats@[director_ip]:4222","ntp":[],"blobstore":{"provider":"dav","options":{"endpoint":"http://[director_ip]:25250","password":"[agent_password]","user":"agent"}},"networks":{"default":{"type":"dynamic","ip":"[vm_ip]","netmask":"[netmask]","gateway":"[gateway_ip]","dns":["[dns_ip]","10.0.80.11","10.0.80.12"],"default":["dns","gateway"],"preconfigured":true,"cloud_properties":{"security_groups":["default","cf"]}}},"disks":{"ephemeral":"/dev/xvdc","persistent":{}},"env":{}}

   Update the following items in the example:

   - `agent_id` can be found in the bosh debug log.
   - Keep `"persistent"` empty; it will be fixed later.
   - Refer to Scenario 1 to get `vm_ip`, `netmask` and `gateway_ip`.
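
The `user_data.json` example can likewise be filled in with a script. The sketch below copies its structure from the example in this scenario; the helper name `build_user_data` is hypothetical, and the `nats:nats` mbus credentials, the `vm-<agent_id>` naming, the `/dev/xvdc` ephemeral disk, and the two fixed dns servers are taken from that example as assumptions that may not hold for every environment.

```python
import json


def build_user_data(agent_id, director_ip, agent_password,
                    vm_ip, netmask, gateway_ip, dns_ip):
    """Return user_data.json content as a dict (hypothetical helper, not part of bosh)."""
    vm_name = "vm-" + agent_id  # naming pattern assumed from the example
    return {
        "agent_id": agent_id,
        "vm": {"name": vm_name, "id": vm_name},
        "mbus": "nats://nats:nats@%s:4222" % director_ip,
        "ntp": [],
        "blobstore": {
            "provider": "dav",
            "options": {
                "endpoint": "http://%s:25250" % director_ip,
                "password": agent_password,
                "user": "agent",
            },
        },
        "networks": {
            "default": {
                "type": "dynamic",
                "ip": vm_ip,
                "netmask": netmask,
                "gateway": gateway_ip,
                "dns": [dns_ip, "10.0.80.11", "10.0.80.12"],
                "default": ["dns", "gateway"],
                "preconfigured": True,
                "cloud_properties": {"security_groups": ["default", "cf"]},
            }
        },
        # "persistent" stays empty on purpose, as described in the list above.
        "disks": {"ephemeral": "/dev/xvdc", "persistent": {}},
        "env": {},
    }


if __name__ == "__main__":
    # Made-up placeholder values for illustration only.
    data = build_user_data("1234abcd-12ab-34cd-56ef-123456abcdef", "10.0.0.6",
                           "example-password", "10.0.0.5", "255.255.255.192",
                           "10.0.0.1", "10.0.0.6")
    print(json.dumps(data, separators=(",", ":")))
```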
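
3) Restart the agent on the failing vm: `sv restart agent`

4) Run `bosh cck`. If it reports inconsistent mount information for a disk, choose resolution 2 (Reattach disk to instance):

       Problem 1 of 1: Inconsistent mount information:
       Record shows that disk '12661853' should be mounted on 22354779.
       However it is currently :
           Not mounted in any VM.
         1. Ignore
         2. Reattach disk to instance
         3. Reattach disk and reboot instance
       Please choose a resolution [1 - 3]: 2

5) `bosh vms`: the missing vm is back

6) `bosh cck`: no problems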