Self-hosting a GPU server for friends

2024-06-10

Back during the Stable Diffusion craze (2022), there was a great deal on new NVIDIA RTX 3090 Ti cards with 24 GB of VRAM for $1000. Yang and I split the cost of building a new PC around one, so I decided to see if I could self-host it for us.

To keep the server relatively secure, I didn't want to expose it directly to the internet. I ended up running a VPS as a public WireGuard hub, with a separate WireGuard network carved out for the server and its users.

[Interface]
# The ML server's address inside the private WireGuard subnet
Address = 192.0.MLSUBNET.MLSERVERIP/32
PrivateKey = MLSERVERPRIVATEKEY

[Peer]
# The publicly reachable VPS acting as the hub
Endpoint = vpspublicserver.com:publicport
PublicKey = VPS_SERVERPUBLICKEY
# Route traffic for the whole WireGuard subnet through the hub
AllowedIPs = 192.0.MLSUBNET.0/24

This WireGuard config was used on the server and on clients. WireGuard ended up being pretty nice and noticeably performant for accessing the server from almost anywhere. I think there were some hole-punching issues on campus Wi-Fi, though.
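If you run into that, the usual fix is a PersistentKeepalive entry in the [Peer] section of clients behind restrictive NATs, which keeps the NAT mapping alive (25 seconds is the commonly suggested interval):

[Peer]
# ...same peer as above...
PersistentKeepalive = 25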

Originally we left the server on all day, but to conserve power (it draws 100-200 W in use) we started coordinating about when to turn the server on and off.

New user in a different country

More recently, another friend was doing ML research in the UK. She couldn't easily get access to GPU clusters through her university's long waitlist. We weren't using the server all day, so we decided to give her an account. Timezones made coordinating when to power the server on and off more difficult, so we investigated Wake-on-LAN (WoL). WoL lets you turn on a computer by sending a "magic packet" containing the powered-off machine's MAC address on the local network; the network card listens for this packet and powers the machine on when it receives one.
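The packet itself is tiny. Here's a minimal Rust sketch of what tools like wakeonlan do under the hood (the broadcast address matches the one used later in this post; the MAC is a placeholder):

use std::net::UdpSocket;

// A WoL "magic packet" is 6 bytes of 0xFF followed by the target's
// MAC address repeated 16 times, broadcast over UDP (port 9 by convention).
fn wake(mac: [u8; 6]) -> std::io::Result<()> {
    let mut packet = Vec::with_capacity(102);
    packet.extend_from_slice(&[0xFF; 6]);
    for _ in 0..16 {
        packet.extend_from_slice(&mac);
    }
    let socket = UdpSocket::bind("0.0.0.0:0")?;
    socket.set_broadcast(true)?;
    socket.send_to(&packet, "192.168.1.255:9")?;
    Ok(())
}

fn main() -> std::io::Result<()> {
    wake([0x00, 0x11, 0x22, 0x33, 0x44, 0x55]) // placeholder MAC
}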

Users could log in to the ML server remotely over WireGuard, but WoL only works on the local network. The trick is to run an always-on machine that sits on both the WireGuard network and the ML server's local network, so it can hit the ML server with a WoL packet.

We ended up using an old mini PC, but any old laptop or Raspberry Pi would work just fine.

To enable WoL, you surprisingly need to turn it on in the BIOS/UEFI interface and also enable it in the OS. (Arch wiki link)
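On Linux, the OS half usually comes down to ethtool. A sketch, assuming the interface is named enp3s0:

ethtool -s enp3s0 wol g

The g flag means "wake on magic packet"; running plain ethtool enp3s0 shows whether the NIC supports it under "Supports Wake-on". Note this setting typically doesn't persist across reboots on its own.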

Here's the core of the service. I wrote it in Rust/Rocket/SQLite for easy deployment and familiarity with the stack.

// Assumed imports for context; `Db` is a rocket_sync_db_pools-style
// connection pool wrapping a rusqlite::Connection (definition omitted).
use std::net::IpAddr;
use std::process::Command;

use chrono::{SecondsFormat, Utc};
use rocket::post;
use rocket::serde::json::{json, Json, Value};
use rusqlite::params;
use serde::Deserialize;

#[derive(Deserialize)]
struct Reason<'r> {
    name: &'r str,
    description: &'r str,
}

#[post("/", data = "<reason>")]
async fn start(ip: IpAddr, db: Db, reason: Json<Reason<'_>>) -> Value {
    // Shell out to the `wakeonlan` CLI to broadcast the magic packet
    // on the ML server's LAN.
    Command::new("wakeonlan")
        .arg("-i")
        .arg("192.168.1.255")
        .arg("SERVER MAC ADDRESS")
        .spawn()
        .expect("wakeonlan command failed to start");

    println!(
        r#"Starting wol for {} to do "{}""#,
        reason.name, reason.description
    );

    let name = reason.name.to_owned();
    let description = reason.description.to_owned();

    db.run(move |conn| {
        conn.execute(
            "INSERT INTO logs (timestamp, name, description, ip) VALUES (?1, ?2, ?3, ?4)",
            params![
                &Utc::now().to_rfc3339_opts(SecondsFormat::Micros, true),
                name,
                description,
                ip.to_string()
            ],
        )
    })
    .await
    .unwrap();

    json!({"status": "started"})
}

One of the more fun bits is that you're required to send your name and a description of what you're using the server for. There's another endpoint that lists these reasons out.
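Hitting the start endpoint looks something like this (the host and port are placeholders; 8000 is just Rocket's default):

curl -X POST http://ALWAYSONHOST:8000/ \
  -H 'Content-Type: application/json' \
  -d '{"name": "yang", "description": "resuming a training run"}'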

Jobs that run for a while

Another problem I encountered was forgetting to turn off the server after a job finished. A job could run for many hours, and I didn't want to have to log in later just to power the machine off. I ended up creating another service that turns off the server if no users are logged in and the load is low. This check runs every 30 minutes.

Because it was easy to add, this service also stores metrics (logged-in users, load, memory usage) every minute.

// Assumed imports; get_load(), get_users(), and get_used_memory() are
// small helpers (sketched after this block) that return system stats
// as strings.
use std::process::Command;
use std::sync::Mutex;
use std::time::Duration;

use chrono::Utc;
use rusqlite::Connection;
use tokio_cron_scheduler::{Job, JobScheduler, JobSchedulerError};

#[tokio::main]
async fn main() -> Result<(), JobSchedulerError> {
    let sched = JobScheduler::new().await?;

    let conn = Connection::open("database.db").unwrap();

    conn.execute(
        "CREATE TABLE IF NOT EXISTS metrics (
            id    INTEGER PRIMARY KEY,
            datetime  TEXT NOT NULL,
            users  TEXT,
            five_min_load TEXT,
            used_memory TEXT
        )",
        (),
    )
    .unwrap();

    let conn_mutex = Mutex::new(conn);

    sched
        // Seconds-first cron: fire at minute 0 and 30 of every hour,
        // i.e. every 30 minutes. (The original "0,30 * * * * *" would
        // have fired every 30 seconds.)
        .add(Job::new("0 0,30 * * * *", |_uuid, _l| {
            let load: f32 = get_load().parse().expect("couldn't parse load");
            let current_users = get_users();
            let now = Utc::now().to_rfc3339();

            if current_users.is_empty() && load <= 0.001 {
                println!(
                    "datetime: {}, msg: shutting down, len current_users: {}, load: {}",
                    now,
                    current_users.len(),
                    load
                );

                Command::new("sudo")
                    .arg("poweroff")
                    .spawn()
                    .expect("Unable to poweroff");
            }
        })?)
        .await?;

    sched
        // Every minute: snapshot users, load, and memory into SQLite.
        .add(Job::new("0 * * * * *", move |_uuid, _l| {
            let five_min_load = get_load();
            let current_users = get_users();
            let used_memory = get_used_memory();

            let now = Utc::now().to_rfc3339();

            let conn = conn_mutex.lock();
            match conn {
                Ok(c) => {
                    c.execute(
                        "INSERT INTO metrics (datetime, users, five_min_load, used_memory) VALUES (?1, ?2, ?3, ?4)",
                        (&now, &current_users, &five_min_load, &used_memory),
                    ).expect("Failed to insert into db");
                },
                Err(_) => println!("datetime: {}, error: unable to lock db mutex", now),
            }

            println!(
                "datetime: {}, memory: {}, current_users: {}, load: {}",
                now, used_memory,current_users,five_min_load
            );
        })?)
        .await?;

    sched.start().await?;

    loop {
        tokio::time::sleep(Duration::from_secs(10)).await;
    }
}
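The get_* helpers aren't shown above. Here's a rough sketch of what they could look like on Linux, reading /proc directly and shelling out to who (the exact fields and formats are my assumptions, not the original implementation):

use std::fs;
use std::process::Command;

// Five-minute load average: the second field of /proc/loadavg.
fn get_load() -> String {
    fs::read_to_string("/proc/loadavg")
        .expect("couldn't read /proc/loadavg")
        .split_whitespace()
        .nth(1)
        .expect("unexpected /proc/loadavg format")
        .to_string()
}

// Logged-in users via `who`; an empty string means nobody is logged in.
fn get_users() -> String {
    let out = Command::new("who").output().expect("who failed to run");
    String::from_utf8_lossy(&out.stdout).trim().to_string()
}

// Used memory in kB: MemTotal - MemAvailable from /proc/meminfo.
fn get_used_memory() -> String {
    let meminfo = fs::read_to_string("/proc/meminfo").expect("couldn't read /proc/meminfo");
    let field = |name: &str| -> u64 {
        meminfo
            .lines()
            .find(|l| l.starts_with(name))
            .and_then(|l| l.split_whitespace().nth(1))
            .and_then(|v| v.parse().ok())
            .expect("unexpected /proc/meminfo format")
    };
    format!("{} kB", field("MemTotal:") - field("MemAvailable:"))
}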

Programmatically getting power usage

My researcher friend needed granular power usage stats for her research, so I looked into IoT plugs with flashable open-source firmware and power monitoring. The Sonoff S31 was available for $8 and comes apart with screws rather than glue. You can flash it with an ESP8266/ESP32 firmware called Tasmota.

This was a great video on how to flash the plug, but I encountered some bugs.

Neither the Pi Pico debug probe nor a Raspberry Pi 1 Model A (I was desperate) could detect or program the device with esptool. An Adafruit FTDI Friend ended up working. I suspect part of the problem was a slight protocol incompatibility, or not enough power being supplied to the plug while flashing.

Once flashed, the plug couldn't connect to the Wi-Fi network. I ended up needing separate SSIDs for the 2.4 GHz and 5 GHz bands, since the ESP8266 only supports 2.4 GHz. Once I did that, the plug worked fine.

To enable power monitoring, you also need to go into Tasmota's module config and select Sonoff S31, which sets up the right configuration for power monitoring instead of just turning the plug on and off remotely.

curl 'http://LOCALPLUGIP/cm?cmnd=Status%208' ends up returning this JSON, which has all of the necessary stats!

{
  "StatusSNS": {
    "Time": "2024-06-11T04:45:41",
    "ENERGY": {
      "TotalStartTime": "2024-06-02T02:57:06",
      "Total": 5.078,
      "Yesterday": 0.438,
      "Today": 0.007,
      "Power": 1,
      "ApparentPower": 12,
      "ReactivePower": 12,
      "Factor": 0.12,
      "Voltage": 119,
      "Current": 0.098
    }
  }
}
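For her granular stats, a simple polling loop over that endpoint is enough. A minimal sketch, assuming reqwest (with its blocking feature), serde_json, and chrono as dependencies, and LOCALPLUGIP replaced with the plug's real address:

use std::{thread, time::Duration};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    loop {
        // Tasmota's Status 8 command returns the sensor snapshot shown above.
        let body: serde_json::Value =
            reqwest::blocking::get("http://LOCALPLUGIP/cm?cmnd=Status%208")?.json()?;
        // Instantaneous watts; log as timestamp,watts for easy graphing.
        let watts = &body["StatusSNS"]["ENERGY"]["Power"];
        println!("{},{}", chrono::Utc::now().to_rfc3339(), watts);
        thread::sleep(Duration::from_secs(1));
    }
}

If polling feels crude, Tasmota can also push telemetry over MQTT.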

Overall, I'd recommend running a GPU server for yourself, especially if you can find some good deals. Depending on your workload, it can be cheaper and more performant over time than renting compute by the hour (looking at you, AWS). Plus, it's more fun too.