Two-Phase Commit Protocol in Rust and Go

Dilawar Mahmood

2020-05-02

Distributed Systems, Performance Optimization, Software Engineering

Ever wondered what happens when you make a purchase online and your payment fails halfway through? How do distributed systems ensure that your money isn’t deducted while the order remains incomplete? These questions led my friends and me down a rabbit hole of implementing the two-phase commit protocol from scratch, choosing Rust for the coordinator and Go for the microservices.

What is Two-Phase Commit?

Two-phase commit (2PC) is a distributed algorithm that ensures transaction atomicity across multiple nodes. Think of it as a voting system: either all nodes agree to commit a transaction, or none of them do.

Architecture Overview

We built three main components to demonstrate the protocol:

Coordinator (Rust): Orchestrates the commit protocol
Wallet Service (Go): Handles user balances
Order Service (Go): Manages product inventory

The Coordinator

The coordinator is the brain of our system. Here’s the core Rust implementation:

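use std::io::Write;
use std::net::TcpStream;

// `Transaction`, `Error`, `READY`, `COMMIT_MSG`, and the `read_response`
// helper (e.g. an extension trait on `TcpStream`) are assumed to be defined
// elsewhere in the project. `Error` must implement `From<std::io::Error>`
// for the `?` operator to work on the socket calls below.

struct Coordinator {
    wallet_conn: TcpStream,
    order_conn: TcpStream,
}

impl Coordinator {
    /// Phase 1: send the transaction to both participants and collect votes.
    /// Returns `Ok(true)` only if both participants voted READY.
    fn prepare_phase(&mut self, transaction: Transaction) -> Result<bool, Error> {
        // Serialize once and reuse the bytes for both participants.
        let payload = transaction.serialize();
        self.wallet_conn.write_all(&payload)?;
        self.order_conn.write_all(&payload)?;

        // Both requests are in flight before the first vote is awaited.
        let wallet_vote = self.wallet_conn.read_response()?;
        let order_vote = self.order_conn.read_response()?;

        Ok(wallet_vote == READY && order_vote == READY)
    }

    /// Phase 2: tell both participants to commit.
    /// Call this only after `prepare_phase` returned `Ok(true)`.
    fn commit_phase(&mut self) -> Result<(), Error> {
        self.wallet_conn.write_all(COMMIT_MSG)?;
        self.order_conn.write_all(COMMIT_MSG)?;
        Ok(())
    }
}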

The coordinator implements two key phases:

Prepare Phase: The coordinator sends prepare messages to all participants and waits for their votes. If any participant votes “no” or times out, the transaction is aborted.
Commit Phase: If all participants voted “yes”, the coordinator tells everyone to commit. Otherwise, it sends abort messages.
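
The decision rule described by these two phases can be sketched as a small pure function. This is a minimal illustration rather than the project's actual code: the `Vote` and `Decision` enums and the `decide` function are hypothetical names, and a missing vote (`None`) stands in for a participant that timed out.

```rust
/// A participant's answer to the prepare message (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Vote {
    Yes,
    No,
}

/// The coordinator's final decision for the transaction (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Decision {
    Commit,
    Abort,
}

/// Decides the outcome from the collected votes.
/// `None` stands for a participant that did not answer in time.
pub fn decide(votes: &[Option<Vote>]) -> Decision {
    // Commit only on a unanimous "yes"; a "no" or a missing vote aborts.
    // An empty vote list also aborts, since nobody agreed to anything.
    if !votes.is_empty() && votes.iter().all(|v| *v == Some(Vote::Yes)) {
        Decision::Commit
    } else {
        Decision::Abort
    }
}
```

Keeping the rule free of I/O makes it easy to unit-test separately from the networking code.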

Microservices in Go

The microservices handle the actual business logic. Here’s a simplified version of our wallet service:

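import (
    "database/sql"
    "errors"
)

type WalletService struct {
    db *sql.DB
}

// handlePrepare checks the balance and deducts the amount inside tx.
// It neither commits nor rolls back: the caller is expected to keep tx open
// until the coordinator's decision arrives. A non-nil error means a "no" vote.
// float64 is used for brevity; integer minor units avoid rounding errors.
func (ws *WalletService) handlePrepare(tx *sql.Tx, userId int, amount float64) error {
    var balance float64
    // Scan returns sql.ErrNoRows if the user has no wallet row.
    err := tx.QueryRow("SELECT balance FROM wallets WHERE user_id = ?", userId).Scan(&balance)
    if err != nil {
        return err
    }

    if balance < amount {
        return errors.New("insufficient funds")
    }

    // The deduction only becomes visible to others once tx is committed.
    _, err = tx.Exec("UPDATE wallets SET balance = balance - ? WHERE user_id = ?", amount, userId)
    return err
}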
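
To show the whole participant lifecycle in one place, here is a minimal in-memory sketch in Rust. It is an illustration, not a port of the Go service: the `Wallet` type, its method names, the transaction ids, and the use of integer cents are all assumptions made for the example. Prepare reserves the funds, commit makes the reservation permanent, and abort returns it.

```rust
use std::collections::HashMap;

/// In-memory stand-in for the wallet table (balances in cents).
pub struct Wallet {
    balances: HashMap<u32, i64>,
    /// Funds reserved by prepared but undecided transactions: tx id -> (user, amount).
    pending: HashMap<u64, (u32, i64)>,
}

impl Wallet {
    pub fn new(balances: HashMap<u32, i64>) -> Self {
        Wallet { balances, pending: HashMap::new() }
    }

    /// Phase 1: returns true (a "yes" vote) only if the funds could be reserved.
    pub fn prepare(&mut self, tx_id: u64, user: u32, amount: i64) -> bool {
        if amount <= 0 || self.pending.contains_key(&tx_id) {
            return false;
        }
        match self.balances.get_mut(&user) {
            Some(balance) => {
                if *balance < amount {
                    return false; // insufficient funds
                }
                *balance -= amount;
            }
            None => return false, // unknown user
        }
        self.pending.insert(tx_id, (user, amount));
        true
    }

    /// Phase 2, commit: the reservation becomes permanent.
    pub fn commit(&mut self, tx_id: u64) {
        self.pending.remove(&tx_id);
    }

    /// Phase 2, abort: the reserved funds go back to the user.
    pub fn abort(&mut self, tx_id: u64) {
        if let Some((user, amount)) = self.pending.remove(&tx_id) {
            *self.balances.entry(user).or_insert(0) += amount;
        }
    }

    pub fn balance(&self, user: u32) -> Option<i64> {
        self.balances.get(&user).copied()
    }
}
```

In the database-backed version, the open SQL transaction plays the role of the `pending` map: committing it corresponds to `commit`, rolling it back to `abort`.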

Handling Failures

The interesting part comes when things go wrong. We implemented several failure scenarios, including a participant node failing during the prepare phase.
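
One common way to handle a silent or crashed participant is to treat every read failure as a "no" vote. The sketch below assumes a hypothetical one-byte wire format (`READY` = 1) and a hypothetical `read_vote` helper; neither is taken from the project. It is generic over `std::io::Read`, so it works on a `TcpStream` as well as on in-memory buffers. With a real socket, calling `TcpStream::set_read_timeout` beforehand makes an unresponsive participant surface as an I/O error instead of blocking forever.

```rust
use std::io::Read;

/// A participant's answer to the prepare message (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Vote {
    Yes,
    No,
}

/// Hypothetical one-byte wire code for a "ready" vote.
pub const READY: u8 = 1;

/// Reads one vote byte; every failure mode counts as "no".
pub fn read_vote<R: Read>(conn: &mut R) -> Vote {
    let mut buf = [0u8; 1];
    match conn.read_exact(&mut buf) {
        Ok(()) if buf[0] == READY => Vote::Yes,
        // Explicit "no", unknown byte, closed connection, or timed-out read.
        _ => Vote::No,
    }
}
```

Mapping errors to "no" is safe during the prepare phase because the coordinator has not yet promised anything and aborting is always allowed at that point.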

Performance Considerations

While 2PC ensures consistency, it comes with some drawbacks:

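Blocking: a participant that has voted “yes” must hold its locks until the coordinator’s decision arrives;
Network overhead: every transaction requires multiple round trips between the coordinator and each participant;
Single point of failure: if the coordinator crashes after collecting votes, prepared participants stay blocked until it recovers.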
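
The round-trip cost can be made concrete with a rough lower bound. The sketch below assumes the coordinator contacts all participants in parallel, waits for acknowledgements after sending its decision, and that processing and disk time are ignored; the function name `min_commit_latency` is made up for this example. Under those assumptions each phase is gated by the slowest participant, so the total is about twice the largest round-trip time.

```rust
use std::time::Duration;

/// Lower bound on 2PC latency: two sequential rounds (prepare/vote and
/// decision/ack), each gated by the slowest participant.
/// Returns `None` when there are no participants.
pub fn min_commit_latency(round_trips: &[Duration]) -> Option<Duration> {
    let slowest = round_trips.iter().max()?;
    Some(*slowest * 2)
}
```

For example, with participants at 5 ms and 40 ms round-trip time, the bound is 80 ms no matter how fast the nearer participant answers.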

Distributed Deployment

We deployed our system on Google Cloud Platform, using separate VMs for each component. This revealed interesting challenges around network latency and partial failures.

Testing Distributed Transactions

Testing distributed systems requires special consideration due to their concurrent and asynchronous nature. We built a comprehensive test suite that simulates various failure scenarios:

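use std::net::Shutdown;

#[test]
fn test_node_failure_during_prepare() {
    let mut coordinator = Coordinator::new();
    let transaction = Transaction::new(1, 100.0); // user_id = 1, amount = 100.0

    // Simulate node failure by closing the connection to the order service.
    coordinator
        .order_conn
        .shutdown(Shutdown::Both)
        .expect("failed to shut down order connection");

    assert!(matches!(
        coordinator.prepare_phase(transaction),
        Err(Error::Timeout)
    ));
}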
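
Failure scenarios can also be exercised without real sockets by swapping in a test double. The following is a sketch under a stated assumption: it presumes the coordinator were made generic over `Read + Write`, which the code shown earlier (holding `TcpStream` directly) is not. `FakeConn` and its fields are hypothetical names.

```rust
use std::io::{self, Read, Write};

/// Test double for a participant connection (hypothetical type).
pub struct FakeConn {
    /// Bytes the fake participant answers with.
    reply: io::Cursor<Vec<u8>>,
    /// Everything the coordinator wrote to this connection.
    pub sent: Vec<u8>,
    /// When true, every read fails like a timed-out socket.
    pub fail_reads: bool,
}

impl FakeConn {
    pub fn new(reply: &[u8]) -> Self {
        FakeConn {
            reply: io::Cursor::new(reply.to_vec()),
            sent: Vec::new(),
            fail_reads: false,
        }
    }
}

impl Read for FakeConn {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        if self.fail_reads {
            return Err(io::Error::new(io::ErrorKind::TimedOut, "simulated timeout"));
        }
        self.reply.read(buf)
    }
}

impl Write for FakeConn {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        // Record the bytes instead of sending them anywhere.
        self.sent.extend_from_slice(buf);
        Ok(buf.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}
```

A test can then flip `fail_reads` at a chosen point to simulate a participant that stops responding, and inspect `sent` to check which messages the coordinator emitted.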

Lessons Learned

Building a distributed transaction system taught us several things:

Rust’s ownership model is a good fit for the coordinator’s state: each connection has a single owner, and changing it requires exclusive (`&mut`) access
Go’s goroutines make concurrent transaction handling elegant
Network failures are the norm, not the exception
Testing distributed systems requires careful consideration of timing

For those interested in exploring the implementation details further, the complete code is available on GitHub. Note that the README is currently in Norwegian.