Duffy is the middle layer of ci.centos.org that manages the provisioning, maintenance, and teardown or rebuild of the nodes (physical hardware for now, VMs coming soon) used to run tests in the CI cluster. Details on what the ci.centos.org infrastructure is and how it works can be found at http://wiki.centos.org/QaWiki/CI, and details on the test cluster hardware can be found at http://wiki.centos.org/QaWiki/PubHardware

General Usage

Duffy exposes a REST API that can be called from the Jenkins instance. This API lets you request freshly installed machines and tear down nodes once your tests have finished running on them. Typically, nodes are deployed on demand and torn down immediately after a test run. We estimate this cycle usually lasts less than an hour.

Duffy opportunistically tries to set up an allocation-pool: a set of freshly installed machines running various operating systems. When a user job requests test nodes, they are allocated from this pool so that the job does not have to wait for a fresh provision. The target age for this allocation pool is 24 hrs; any node unused in this period is reinstalled. At the moment, the target pool size is:

  OS      Arch    Release  Standby nodes
  CentOS  x86_64  7        20
  CentOS  x86_64  6        6
  CentOS  i386    6        4
  CentOS  x86_64  5        4
  CentOS  i386    5        4

Nodes that are members of the allocation-pool are cycled, so we can ensure we stress every physical node in the test cluster.
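The pool targets listed above also bound how many nodes a single request can ask for. As a rough sketch, the table can be captured as a lookup that a job script checks before calling Duffy. The names POOL_TARGETS, max_request and validate_count are hypothetical helpers, not part of Duffy, and the numbers are only the targets quoted on this page at the time of writing.

```python
# Target standby nodes per (release, arch), copied from the table above.
# These values are a snapshot and may change on the real service.
POOL_TARGETS = {
    ("7", "x86_64"): 20,
    ("6", "x86_64"): 6,
    ("6", "i386"): 4,
    ("5", "x86_64"): 4,
    ("5", "i386"): 4,
}


def max_request(ver="7", arch="x86_64"):
    """Return the pool target for a release/arch pair, or 0 if unlisted."""
    return POOL_TARGETS.get((str(ver), arch), 0)


def validate_count(count, ver="7", arch="x86_64"):
    """Raise ValueError if count is outside 1..pool target for this pair."""
    limit = max_request(ver, arch)
    if not 1 <= count <= limit:
        raise ValueError(
            "count=%d is outside 1..%d for CentOS %s %s" % (count, limit, ver, arch)
        )
    return count
```

A check like this only catches obviously oversized requests early; the service itself remains the authority on what is actually available.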

Node Statuses

In Duffy, a machine is assigned one of the following states:

  State         Description
  Active        The machine is ready to be controlled by Duffy, but is not yet installed
  Ready         The machine is installed with CentOS and ready to be given by Duffy to a project requesting a node
  Deployed      The machine is provisioned and assigned to a user
  Failed        Either the reinstall failed, or the post-job teardown call failed
  Provisioning  The machine is in the process of being deployed
  Reserved      The machine will not be managed by Duffy
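
For scripts that track node status, the states above can be modelled as an enumeration. This is only an illustrative sketch: NodeState, can_allocate and managed_by_duffy are hypothetical names, and the two predicates encode nothing beyond the descriptions given on this page.

```python
from enum import Enum


class NodeState(Enum):
    """Node states as described on this page (values are the state names)."""
    ACTIVE = "Active"              # under Duffy's control, not yet installed
    READY = "Ready"                # installed, can be handed to a project
    DEPLOYED = "Deployed"          # provisioned and assigned to a user
    FAILED = "Failed"              # reinstall or post-job teardown failed
    PROVISIONING = "Provisioning"  # in the process of being deployed
    RESERVED = "Reserved"          # not managed by Duffy


def can_allocate(state):
    """Only Ready nodes are described as available to requesting projects."""
    return state is NodeState.READY


def managed_by_duffy(state):
    """Reserved nodes are the only ones described as outside Duffy's control."""
    return state is not NodeState.RESERVED
```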

User management

Every project can request one or more API keys; a key must be used in every call made to the API. Resource allocation and management for projects is done via these keys. Typically, we would expect every complete job to be run from its own API key.

Using from a Jenkins CI Job

There is an example Python-based Jenkins builder script available at https://github.com/kbsingh/centos-ci-scripts. This script requests a node, clones your test suite from an external git URL, logs in to the machine, and runs the tests. It lets users keep their test content, environment setup, etc. entirely out of Jenkins itself.

You can also use python-cicoclient, a Python client library and command-line tool, to request machines from Duffy.

API Calls available

Basic

  /Help :: Will return some raw text with details on functions available

Node Requests

/Node/get

Request nodes for your test run. The returned payload is JSON with details on how to connect to the nodes. You can request up to the allocation-pool limits.

Params:
  key=<api key> {Mandatory} This will be the key assigned for the job. You should request these via the process mentioned above.
  ver=<CentOS Linux ver> {Optional, defaults to '7'} The CentOS Linux version being requested; needs to be either 5, 6 or 7.
  arch=<host arch> {Optional, defaults to x86_64 } The machine architecture. At this point only x86_64 and i386 are supported.
  count=<node count> {Optional, defaults to 1 } The number of nodes being requested. This can be up to
          the total allocation pool capacity for that ver/arch

Examples: 
  The following 2 requests produce the same results:
      http://admin.ci.centos.org:8080/Node/get?key=9c67d9c6-b5e2-11e4-b2af-525400ea212d
      -and-
      http://admin.ci.centos.org:8080/Node/get?key=9c67d9c6-b5e2-11e4-b2af-525400ea212d&ver=7&arch=x86_64&count=1
  The returned payload would be JSON structured as:
  {
    "hosts": [
        "n14.hufty.ci.centos.org"
    ],
    "ssid": "28c11dd0-b5d7-11e4-b2af-525400ea212d"
  } 

  Requesting 4 nodes at one time :
  http://admin.ci.centos.org:8080/Node/get?key=9c67d9c6-b5e2-11e4-b2af-525400ea212d&ver=7&arch=x86_64&count=4
  Would return a larger hosts section:

  {
    "hosts": [
        "n14.hufty.ci.centos.org",
        "n15.hufty.ci.centos.org",
        "n21.hufty.ci.centos.org",
        "n27.hufty.ci.centos.org"
    ],
    "ssid": "2ce3779e-b5e3-11e4-b2af-525400ea212d"
  } 
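
The request and response shown above could be driven from a job script roughly as follows. This is a minimal sketch using only the Python standard library; the helper names node_get_url, parse_node_get and request_nodes are hypothetical, and the base URL is simply the one used in the examples on this page.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL taken from the examples on this page.
DUFFY_URL = "http://admin.ci.centos.org:8080"


def node_get_url(key, ver="7", arch="x86_64", count=1, base=DUFFY_URL):
    """Build the /Node/get URL with the documented parameters."""
    params = {"key": key, "ver": ver, "arch": arch, "count": count}
    return base + "/Node/get?" + urlencode(params)


def parse_node_get(payload):
    """Return (hosts, ssid) from the JSON text returned by /Node/get."""
    data = json.loads(payload)
    return data["hosts"], data["ssid"]


def request_nodes(key, **kwargs):
    """Perform the call; this needs network access to the Duffy service."""
    with urlopen(node_get_url(key, **kwargs)) as response:
        return parse_node_get(response.read().decode("utf-8"))
```

Keep the returned ssid for the lifetime of the job, since the teardown calls need it.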

      

/Node/done

Tear down allocated nodes; these will be taken away almost immediately and reinstalled. Once cleared, they are returned to the allocation pool as needed.

Params:
  key=<api key> {Mandatory} This will be the key assigned for the job. You should request these via the process mentioned above.
  ssid=<session-id> {Mandatory} This is the session-id returned by the /Node/get call. Resource allocation for jobs is tracked via these ssids, so it is required for the tear-down operation.

Examples:
  http://admin.ci.centos.org:8080/Node/done?key=9c67d9c6-b5e2-11e4-b2af-525400ea212d&ssid=2ce3779e-b5e3-11e4-b2af-525400ea212d
  Would drop all nodes currently provisioned for the user in that session.
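
A teardown call like the one above could be issued from a script along these lines. This is a sketch under the same assumptions as the example URL; node_done_url and release_nodes are hypothetical helper names, and the format of the response body is not documented here, so it is returned as raw text.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL taken from the examples on this page.
DUFFY_URL = "http://admin.ci.centos.org:8080"


def node_done_url(key, ssid, base=DUFFY_URL):
    """Build the /Node/done URL for a session returned by /Node/get."""
    return base + "/Node/done?" + urlencode({"key": key, "ssid": ssid})


def release_nodes(key, ssid):
    """Perform the call; this needs network access to the Duffy service."""
    with urlopen(node_done_url(key, ssid)) as response:
        return response.read().decode("utf-8")
```

Calling release_nodes from a finally block in the job script is one way to make sure nodes are handed back even when the tests raise an error.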
  

/Node/fail

Set all nodes allocated in the session into the failed state. This means the user's keychain is added to the root account on every node, and permission is granted to connect via the proxy jump host. At this point you have up to 12 hrs (can be tweaked as needed) to clear the machines and call /Node/done with the session-id. You can find the session-id in /root/session-info or in the payload returned from the /Node/get call.

Params:
  key=<api key> {Mandatory} This will be the key assigned for the job. You should request these via the process mentioned above.
  ssid=<session-id> {Mandatory} This is the session-id returned by the /Node/get call. Resource allocation for jobs is tracked via these ssids, so it is required to mark the session as failed.

Examples:
  http://admin.ci.centos.org:8080/Node/fail?key=9c67d9c6-b5e2-11e4-b2af-525400ea212d&ssid=2ce3779e-b5e3-11e4-b2af-525400ea212d
  Would mark all nodes in that session as being in 'fail state'; no warnings are generated for the next 12 hrs and there won't be any machine timeouts.
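
One possible way to use /Node/fail alongside /Node/done is to pick the endpoint from the test outcome, so that nodes from a failed run stay available for debugging. The sketch below only builds the URL; finish_url is a hypothetical helper, and the policy of keeping nodes on failure is an assumption, not something Duffy enforces.

```python
from urllib.parse import urlencode

# Base URL taken from the examples on this page.
DUFFY_URL = "http://admin.ci.centos.org:8080"


def finish_url(key, ssid, tests_passed, base=DUFFY_URL):
    """Return the /Node/done URL on success, or the /Node/fail URL otherwise."""
    action = "done" if tests_passed else "fail"
    query = urlencode({"key": key, "ssid": ssid})
    return "%s/Node/%s?%s" % (base, action, query)
```

After debugging a failed session, /Node/done still has to be called for that session-id to hand the machines back.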

/Inventory

Lists the present state of node allocation, including resources currently marked as idle. You can optionally pass it the API key ( ?key=<api key> ) to only return machines allocated to a specific user.
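
As an illustration of the optional key parameter, the inventory URL could be built as below. inventory_url and fetch_inventory are hypothetical helpers; the response format is not described on this page, so the body is returned as raw text rather than parsed.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL taken from the examples on this page.
DUFFY_URL = "http://admin.ci.centos.org:8080"


def inventory_url(key=None, base=DUFFY_URL):
    """Build the /Inventory URL, filtered to one API key when key is given."""
    url = base + "/Inventory"
    if key is not None:
        url += "?" + urlencode({"key": key})
    return url


def fetch_inventory(key=None):
    """Perform the call; this needs network access to the Duffy service."""
    with urlopen(inventory_url(key)) as response:
        return response.read().decode("utf-8")
```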

Admin

  /Overtime :: Lists nodes that have been running longer than the original requested lease 
               time (defaults to 60 minutes)

  /NodeSetAdmin/<node-id> :: Set given node-id into administrative downtime, this node 
                             will then no longer be used in the allocation pool

  /NodeRecover/<node-id> :: Set given node-id back into available set for allocation pool

  /PauseService :: Suspend new allocations, power down the entire allocation-pool. This 
                   will let existing jobs complete, but it will no longer allow new node 
                   requests to the service.

  /ResumeService :: Power up and provision the allocation-pool. The service needs to have 
                    been in a Pause state for this request to succeed.

QaWiki/CI/Duffy (last edited 2016-02-07 10:29:59 by BrianStinson)