.. _how_to_upgrade_persistent_instances:
.. _how_to_upgrade_persistent_instances_aws:

How to Upgrade Fedora Copr Persistent VMs (Amazon AWS)
******************************************************

This document describes the process of upgrading persistent VM instance(s)
(e.g., ``copr-fe-dev.aws.fedoraproject.org``) to a new Fedora version by
creating a completely new VM to replace the old one.

Requirements
============

* Access to the team's `Amazon AWS account`_ and proper configuration of that
  account according to the `README.md `_.

* Permissions to run playbooks on `batcave01 `_.

* Since we do not modify the public IPs (neither v4 nor v6), no DNS
  modifications should be required.  However, familiarize yourself with the
  `DNS SOP`_ in case of any issues.

* Make sure you have ``/usr/bin/aws`` installed and that you have a
  ``fedora-copr`` section in ``~/.aws/credentials``.

Pre-upgrade
===========

The goal is to complete as much pre-upgrade work as possible while focusing on
minimizing the **outage window** and only performing essential tasks that
cannot be done post-upgrade.

Avoid conducting the pre-upgrade too far in advance of the actual upgrade.
Ideally, perform this phase a couple of hours or a day before.

Announce the outage
-------------------

See the dedicated document :ref:`announcing_fedora_copr_outage`, namely the
"planned" outage state.

Check the hot-fixes
-------------------

The old set of instances (especially prod) has been running for quite some
time, likely accumulating several hotfixes over that period.  Research the
applied hotfixes and determine which of them need to be manually implemented
on the N+2 boxes (if any, note them).

First, check the `hot-fixed issues and PRs `_.

Then, check the file-system modifications::

    # over ssh on the _old_ box, search for weird things (ignore config changes
    # and /boot)
    [root@copr-be-dev ~][STG]# rpm -Va | grep -v -e /etc/ -e /boot/
    ...
    S.5....T.    /var/www/cgi-resalloc
    ...
    S.5....T.    /usr/lib/python3.12/site-packages/copr_backend/pulp.py
    ...

E.g., the ``/var/www/cgi-resalloc`` file is a weird change, but that one in
particular is covered `in playbooks `_.  The ``pulp.py`` change is important
to note though!  You may consult the ``dnf diff copr-backend`` output, find
the corresponding upstream PR on GitHub, and tag the PR with the ``hot-fixed``
label (if not already done).

Preparation
-----------

Ensure you have the `helper playbook repository`_ cloned locally and navigate
to the clone directory.

Review the ``dev.yml``, ``prod.yml``, and ``all.yml`` configurations in the
``./group_vars`` directory.  Pay particular attention to the data volume IDs
as **these MUST match the EC2 reality**.

In the following steps, you will run several playbooks on your machine.
During execution, explicitly specify two Ansible variables, ``copr_instance``
(set to either ``dev`` or ``prod``) and ``server_id`` (set to either
``frontend``, ``backend``, ``distgit``, or ``keygen``).  For example::

    $ opts=( -e copr_instance=dev -e server_id=keygen )
    $ ansible-playbook play-vm-migration-01-new-box.yml "${opts[@]}"

Identify the AMI (golden images) you want to use for the new VM instances.
Typically, upgrade to ``Fedora N+2`` (e.g., migrating infrastructure from
Fedora 37 to Fedora 39).  Visit the `Cloud Base Images`_ download page, locate
the **Launch on public cloud platforms** section for **x86_64-based
instances**, and click the button next to **Fedora Cloud 41 AWS** (ensure
JavaScript is enabled for this page!).  Note the ``ami-*`` ID in the **US East
(N. Virginia)** region (for example ``ami-0746fc234df9c1ee0``).  Specify this
``ami-*`` ID in ``group_vars/all.yml``, and ensure both
``group_vars/{dev,prod}.yml`` correctly reference it.

Double-check other machine parameters such as instance types, names, tags, IP
addresses, root volume sizes, etc.  Usually, the pre-filled defaults suffice,
but verification is recommended.

.. note::

   Use the `ec2instances.info`_ comparator to find the cheapest available
   instance type that meets our needs whenever more power is required.

.. note::

   Don't worry about ``old_instance_id`` and ``new_instance_id`` for now.  We
   will change them after running the first set of playbooks.

.. warning::

   The ``group_vars/`` directory serves as the primary source of truth for the
   Fedora Copr instances.  Update the configuration in this directory whenever
   you ad-hoc modify some EC2 instance parameters in the future!

The key pair named ``Ansible Key`` must be used.  This allows us to initially
run the playbooks from the ``batcave01`` box against the newly spawned VM.
The playbooks ensure that, subsequently, Fedora Copr team members can SSH
using their own keys, uploaded to FAS.

Backup the Current Let's Encrypt Certificates
---------------------------------------------

We will copy the certificate files used on the old set of VMs onto the new
VMs.  These certificates will remain in use until automatically renewed by the
certbot daemon.

The process begins by copying the certificate files to ``batcave01`` by
running the playbooks with the ``-t certbot`` option.  For instance::

    $ sudo rbac-playbook -l copr-keygen.aws.fedoraproject.org groups/copr-keygen.yml -t certbot

Do this for all the instances!

Launch new instances
--------------------

As simple as::

    $ opts=( -e copr_instance=dev -e server_id=keygen )
    $ ansible-playbook play-vm-migration-01-new-box.yml "${opts[@]}"

You'll see an output like::

    ok: [localhost] => {
        "msg": [
            "ElasticIP: not specified",
            "Instance ID: i-04ba36eb360187572",
            "Network ID: eni-048189f432f068270",
            "Unused Public IP: 100.24.62.79",
            "Private IP: 172.30.2.94"
        ]
    }

Now fix the corresponding ``new_instance_id`` and ``new_network_id`` options
in ``group_vars/{dev,prod}.yml`` according to the output.  Also update the
``old_instance_id`` and ``old_network_id`` options.
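
To reduce copy-paste mistakes when transferring these values, the IDs can be
pulled out of a saved copy of the playbook output.  The following is only a
minimal sketch: the helper name ``extract_ids`` and the log file name
``new-box.log`` are hypothetical, and the ``key: value`` output format assumes
the options are stored as plain YAML keys.  The helper only prints the values;
editing ``group_vars/{dev,prod}.yml`` remains a manual step.

```shell
#!/bin/bash
# Hypothetical helper (not part of the helper playbook repository).
# Reads the output of play-vm-migration-01-new-box.yml on stdin and prints
# the reported IDs under the option names used in group_vars/{dev,prod}.yml.
extract_ids() {
    sed -n \
        -e 's/.*"Instance ID: \(i-[0-9a-f]*\)".*/new_instance_id: \1/p' \
        -e 's/.*"Network ID: \(eni-[0-9a-f]*\)".*/new_network_id: \1/p'
}

# Possible usage (assumes the playbook output was saved with tee):
#   ansible-playbook play-vm-migration-01-new-box.yml "${opts[@]}" | tee new-box.log
#   extract_ids < new-box.log
```

Compare the printed values with the EC2 console before committing them, since
``group_vars/`` is treated as the source of truth.
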

Note the Private IP addresses
-----------------------------

Most of the communication within the Copr stack happens on public interfaces
via hostnames, with one exception.  Communication between ``backend`` and
``keygen`` is done on a private network behind a firewall, through IP
addresses that change when spawning fresh instances.  So once you know the
Backend's private IP, please do a `private IP change`_ in ansible.git.

Don't start the services after the first playbook run
-----------------------------------------------------

Set ``services_disabled: true`` for your instance in
``inventory/group_vars/copr_*_dev_aws`` for devel, or
``inventory/group_vars/copr_*_aws`` for production.

Pre-prepare the new VM - backend only!
--------------------------------------

.. note::

   Running the playbook against the new copr-backend server before shutting
   down the old one is possible.  This minimizes the outage duration with
   non-working DNF repositories on the backend, which is highly desirable.

However, to prevent any issues with Ansible, the following prerequisites are
necessary:

- A temporary volume attached to the new box that provides an ext4 filesystem
  with the ``copr-repo`` label.
- An existing temporary hostname (having an existing DNS record) to execute
  the playbook against it.

The volume, DNS record, and corresponding Elastic IP for this purpose have
already been prepared by the ``play-vm-migration-01-new-box.yml`` playbook
mentioned above.

.. note::

   The following inventory configuration should already be prepared for you in
   the "commented-out" form.

Ensure that the ``copr-be-dev-temp.aws.fedoraproject.org`` is specified in the
inventory in the following groups::

    copr_back_dev_aws
    staging
    cloud_aws

Similarly, use ``copr-be-temp.aws.fedoraproject.org`` in::

    copr_back_aws
    cloud_aws

For both cases, set the ``birthday=yes`` variable for the temporary hostname::

    [copr_back_dev_aws]
    copr-be-dev.aws.fedoraproject.org
    copr-be-dev-temp.aws.fedoraproject.org birthday=yes

On Batcave, execute the playbook against the temporary hostname::

    $ sudo rbac-playbook -l copr-be-dev-temp.aws.fedoraproject.org groups/copr-backend.yml
    $ sudo rbac-playbook -l copr-be-temp.aws.fedoraproject.org groups/copr-backend.yml

Once the playbook finishes successfully, remember to revert the inventory
changes we did here (commenting them out again).

Outage window
=============

When initiating this section, aim for time efficiency as the services will be
down and inaccessible to users.

Let users know
--------------

See :ref:`announcing_fedora_copr_outage` again, this time for the "ongoing"
outage state.

Move IPs and Volumes to the New Instances
-----------------------------------------

.. warning::

   Prepare to follow the instructions provided during the playbook run.
   You'll need to perform manual steps such as DB backups, consistency checks,
   etc.

Migrate the data volumes and IP addresses to the new machine.

For the Backend case, there is a separate playbook.  This playbook makes the
`results directory `_ unavailable temporarily, affecting every Copr consumer!
Ensure that the ``lighttpd`` service is running on the new server once the
playbook finishes, and that it hosts the correct results::

    $ ansible-playbook play-vm-migration-02-migrate-backend-box.yml "${opts[@]}"

For the rest of the systems (Frontend, DistGit, Keygen), use::

    $ ansible-playbook play-vm-migration-02-migrate-non-backend-box.yml "${opts[@]}"

Provision the new instances
---------------------------

In the fedora-infra ansible repository, edit the ``inventory/inventory`` file
and set the ``birthday=yes`` variable for your updated host, for example::

    [copr_front_dev_aws]
    copr.stg.fedoraproject.org birthday=yes

This is necessary to instruct the first playbook run on ``batcave01`` to sign
the new host certificates (avoiding later manipulation with ``known_hosts``).

On ``batcave01``, execute the playbook to provision the instance (ignore the
playbook for upgrading Copr packages).

For the dev instance, refer to
https://docs.pagure.org/copr.copr/how_to_release_copr.html#upgrade-dev-machines
and for production, refer to
https://docs.pagure.org/copr.copr/how_to_release_copr.html#upgrade-production-machines

It's possible that the playbook fails, but that typically isn't crucial at
this point.  If provisioning at least reaches the end of the ``base`` role,
revert the ``birthday=yes`` commit and proceed with the next steps.

The playbooks above have not automatically updated the systems.  If you prefer
to start on Fedora N+2 with an up-to-date set of packages, run ``dnf update``
now (manual step over ssh).

Get it working
--------------

If the old instances had hot-fixes, now is the ideal time to apply them on the
new instances.

Then rerun the playbook from the previous section, this time with the earlier
``services_disabled: true`` override dropped, i.e.::

    services_disabled: false

It should proceed with mounting data volumes but will likely not succeed.
Now, you'll need to debug and address the issues.  If necessary, modify and
rerun the playbook multiple times (ensuring that ``lighttpd`` keeps running on
the new backend the whole time).

.. note::

   Frontend - You'll likely need to manually upgrade the PostgreSQL database
   once you migrate to the new Fedora (new PG major version).  Refer to
   :ref:`Upgrade the database `.

Post-upgrade
============

By this point, every Copr service should be operational.

It's a good idea to test ``/usr/sbin/reboot`` now to debug potential boot
issues during the outage window, as future reboots are likely to occur at the
most inconvenient times.

Rename the instance names
-------------------------

Remove the ``-new`` name suffix from the new instances and add a ``-old``
suffix to the old instances.  This playbook should be executed only once for
all the infra instances::

    $ opts=( -e copr_instance=dev )  # or prod
    $ ansible-playbook play-vm-migration-03-rename-instances.yml "${opts[@]}"

Terminate the old instances
---------------------------

Once you no longer require the old VMs, you can terminate them using the
Amazon web UI.  You can do this immediately after the upgrade or wait a couple
of days (e.g. to keep the DB ``/backups`` for a while just in case of any
problems).

The old VMs are protected against accidental termination.  To disable this
option, click ``Actions``, navigate to ``Instance settings`` and then to
``Change termination protection``.

Final steps
-----------

See the dedicated document :ref:`announcing_fedora_copr_outage`, the
"resolved" section.

.. _`Fedora Infra OpenStack`: https://fedorainfracloud.org
.. _`OpenStack images dashboard`: https://fedorainfracloud.org/dashboard/project/images/
.. _`OpenStack instances dashboard`: https://fedorainfracloud.org/dashboard/project/instances/
.. _`Fedora infrastructure issue #7966`: https://pagure.io/fedora-infrastructure/issue/7966
.. _`fedora devel`: https://lists.fedorahosted.org/archives/list/devel@lists.fedoraproject.org/
.. _`copr devel`: https://lists.fedoraproject.org/archives/list/copr-devel@lists.fedorahosted.org/
.. _`Amazon AWS account`: https://id.fedoraproject.org/saml2/SSO/Redirect?SPIdentifier=urn:amazon:webservices&RelayState=https://console.aws.amazon.com
.. _`Cloud Base Images`: https://fedoraproject.org/cloud/download/
.. _`DNS SOP`: https://docs.fedoraproject.org/en-US/infra/sysadmin_guide/dns/
.. _`ec2instances.info`: https://ec2instances.info/
.. _`helper playbook repository`: https://github.com/fedora-copr/ansible-fedora-copr
.. _`playbook SOP`: https://docs.fedoraproject.org/en-US/infra/sysadmin_guide/ansible/
.. _`private IP change`: https://pagure.io/fedora-infra/ansible/c/6c80a870ff2a62e73da98f7607574e534369fb37