Video recording and production done by OpenStack Foundation.
In very large-scale, heavily used OpenStack deployments like the ones we have at PayPal, resource leaks happen from time to time, and the content of the OpenStack databases becomes inconsistent across components because of the distributed nature of OpenStack.
The resource leaks and inconsistent data cause capacity shortages and operation failures in our cloud infrastructure.
Please note that it is important to find and fix the underlying issues in the code. In a production environment, however, third-party services and infrastructure can also push OpenStack into inconsistent states: for example, hardware failures in hypervisors or switches, backend storage issues, load balancer issues, out-of-sync database clusters, and RabbitMQ issues. As an enterprise cloud provider, we also need to resolve customer issues as quickly as possible, sometimes through a quick hack.
We would like to share our experiences and lessons learned on how to detect resource leaks and keep OpenStack consistent.
Just like fsck for a filesystem, we deployed a set of cleanup tools to check and repair the OpenStack cloud.
The tool set cleans up leaked resources and fixes inconsistent data not only in OpenStack itself, but also in other services used by OpenStack (DNS servers, the NSX controller, etc.).
Here is the list of items being cleaned up:
1. Zombie VMs: instances marked as deleted in the Nova DB but still running on hypervisors.
2. Zombie disk files on the hypervisor: large disk files left behind on the hypervisor for deleted VMs.
3. Inconsistent Cinder volume states across five different modules: the Nova DB in the API cell, the Nova DB in the compute cells, the Cinder DB, the libvirt.xml of the instance on the hypervisor, and the iSCSI sessions on the hypervisor.
5. Unused DNS entries for deleted VMs, and duplicate DNS entries for the same IP.
6. Orphan ports in Neutron DB which are no longer used by VMs or
7. Resource leaks in the NSX controller, for example: virtual ports, virtual switches, virtual routers, and security groups.
8. Nova quota and Cinder quota out of sync.
9. Inconsistencies caused by stale RPC messages.
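
Several of the checks above come down to comparing two views of the same resource and reporting the difference. The following is a minimal sketch in Python of items 1 and 5, assuming the Nova DB rows, the list of instances running on hypervisors, and the DNS records have already been fetched by some other means. The names `DbInstance`, `find_zombie_vms`, and `find_duplicate_dns` are hypothetical and are not taken from the tools described in this talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DbInstance:
    """Hypothetical, simplified view of one row in the Nova instances table."""
    uuid: str
    deleted: bool


def find_zombie_vms(db_instances, running_uuids):
    """Return UUIDs that run on a hypervisor but are marked deleted in the DB."""
    deleted = {inst.uuid for inst in db_instances if inst.deleted}
    return sorted(deleted & set(running_uuids))


def find_duplicate_dns(records):
    """Group (hostname, ip) pairs by IP and keep IPs with more than one hostname."""
    by_ip = {}
    for hostname, ip in records:
        by_ip.setdefault(ip, set()).add(hostname)
    return {ip: sorted(names) for ip, names in by_ip.items() if len(names) > 1}


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    db = [DbInstance("vm-a", deleted=True), DbInstance("vm-b", deleted=False)]
    print(find_zombie_vms(db, ["vm-a", "vm-b"]))
    print(find_duplicate_dns([("web1", "10.0.0.5"), ("old-web", "10.0.0.5")]))
```

A sketch like this only covers detection. Whether a reported zombie VM or duplicate record is actually safe to remove still has to be decided separately, for example by re-checking the state just before cleanup.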