
Can Chisel translate FIRRTL to Verilog in parallel on multiple CPUs?


I designed a mesh array of registers in Chisel, say 32x32 bytes of D flip-flops, as an attempt to implement this kind of parallel hardware architecture in Chisel. The FIRRTL file is about 100k lines and looks like a netlist. Translating it from FIRRTL to Verilog then takes many hours, and during that time all of the processing runs on a single CPU. Could you enlighten me on how to make it run in parallel across multiple CPUs?

The key code:

val reg_vec = (0 to 31).map(i =>
  (0 to 31).map(j =>
    Module(new MyNodeOfReg(8))
  )
)

I am using the Scala compiler and code runner, version 2.11.8.

I wrote a run script like the one below; I execute ./run and then wait for ./target/Bench.v:

mkdir target
cp /opt/eda_tool/RISCV/top.cpp target
scalac -d target -cp $CP Top.scala Test.scala
scala  -cp $CP org.scalatest.run Test

Here scalac and scala are the commands provided by the Scala installation. My Test.scala is:

import chisel3._
import chisel3.util._
import chisel3.testers._
import org.scalatest._
import org.scalacheck._
import org.scalatest.prop._
import scala.sys.process._

class Bench() extends BasicTester {
  val dut = Module(new Top())
  val t = Reg(UInt(width=32),init=0.U)
  t := t+1.U

  when(t<=1.U) {
  }.elsewhen(t===100.U) {
    stop()
  }
}

class Test extends PropSpec with PropertyChecks {

  property("elaborate") {
    Driver.elaborate (() => { new Top() })
  }

  property("should return the correct result") {
    TesterDriver.execute(() => { new Bench() })
  }

}

The Top.scala is:

import chisel3._
import chisel3.util._

object ce_pm{
  val div = 4
  val e = 1   
  val ec= 1   
  val p = 10 // 16/div        
  val s = p*p       
  val w = s*e       

  val ext = 64      
  val extw= ext*e   

  val irp = 20 // 40/div  // InREG parameter
  val irn = irp*irp // InREG reg number
}

class Mux4(n: Int) extends Module {
  val io = IO(new Bundle{
    val i = Input(Vec(4,UInt(n.W)))
    val s = Input(UInt(2.W))
    val o = Output(UInt(n.W))
  })
  val mux00 = Wire(UInt(n.W))
  val mux01 = Wire(UInt(n.W))
  mux00 := Mux(io.s(0)===1.U,io.i(1),io.i(0))
  mux01 := Mux(io.s(0)===1.U,io.i(3),io.i(2))
  io.o  := Mux(io.s(1)===1.U,mux01,mux00)
}

class CEIO_TwoD_Torus extends Bundle {
  val n = Input(UInt(ce_pm.e.W))
  val s = Input(UInt(ce_pm.e.W))
  val w = Input(UInt(ce_pm.e.W))
  val e = Input(UInt(ce_pm.e.W))
}

class TwoD_TorusReg extends Module {
  val io = IO(new Bundle{
    val i = new CEIO_TwoD_Torus()
    val o = new CEIO_TwoD_Torus().flip
    val d = Input(UInt(ce_pm.e.W)) 
    val c = Input(Vec(4,UInt(1.W)))
  })
  val r = Reg(UInt(ce_pm.e.W),init=0.U)
  val u_mux4 = Module(new Mux4(ce_pm.e))
  u_mux4.io.i(0) := io.i.e
  u_mux4.io.i(1) := io.i.s
  u_mux4.io.i(2) := io.i.w
  u_mux4.io.i(3) := io.i.n
  u_mux4.io.s    := Cat(io.c(2),io.c(1))
  when (io.c(0) === 1.U) {
    when (io.c(3) === 0.U) {
      r := u_mux4.io.o
    } .otherwise {
      r := io.d
    }
  } 
  io.o.e := r
  io.o.s := r
  io.o.w := r
  io.o.n := r
}

class Top extends Module {
  val io = IO(new Bundle{
    val i = Input (UInt(ce_pm.extw.W))
    val o = Output(Vec(ce_pm.p,Vec(ce_pm.p,UInt(ce_pm.e.W))))
    val c = Input (UInt(7.W))
  })
  val n  = ce_pm.irp
  val r_vec = (0 to n-1).map ( i=>
                (0 to n-1).map ( j=>
                  Module(new TwoD_TorusReg)
                )
              )
  for (i <- 0 to n-1) {
    for (j <- 0 to n-1) {
      r_vec(i)(j).io.c(0) := io.c(1)
      r_vec(i)(j).io.c(3) := io.c(0)
      r_vec(i)(j).io.c(2) := io.c(2)
      r_vec(i)(j).io.c(1) := io.c(3)
    }
  }
  // out
  val m = ce_pm.p
  for (i <- 0 to m-1) {
    for (j <- 0 to m-1) {
      io.o(i)(j) := r_vec(i)(j).io.o.e
    }
  }
  //2-D-Torus interconnection
  for (i <- 1 to n-1) {
    for (j <- 1 to n-1) {
      r_vec(i)(j).io.i.w := r_vec(i)(j-1).io.o.e
      r_vec(i)(j).io.i.n := r_vec(i-1)(j).io.o.s
    }
  }
  for (i <- 0 to n-2) {
    for (j <- 0 to n-2) {
      r_vec(i)(j).io.i.e := r_vec(i)(j+1).io.o.w
      r_vec(i)(j).io.i.s := r_vec(i+1)(j).io.o.n
    }
  }
  for (i <- 0 to n-1) {
    r_vec(i)(0).io.i.w := r_vec(i)(n-1).io.o.e
    r_vec(0)(i).io.i.n := r_vec(n-1)(i).io.o.s
  }
}

Solution

  • This sounds like a very gnarly performance bug, so if you could provide more information about your design, that would be very helpful (code would be even better). You can also try the command-line option -ll info to report the runtime of each of the Firrtl passes.

    rocket-chip-based projects frequently generate hundreds of thousands to millions of lines of Firrtl, which usually compile in seconds to minutes. For this reason we have not yet felt the need to parallelize the code.

    EDIT: Thank you for adding the code!

    I'm struggling to reproduce the performance problems you're seeing. With irp = 32, compilation from Firrtl to Verilog takes about 4 seconds, and total compilation including Chisel takes about 8 seconds. Should I be changing other parameters as well? I am compiling with:

    object Main extends App {
      chisel3.Driver.execute(args, () => new Top)
    } 
    

    Can you share a little bit more about how you're building the Module?
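
To compare timings like the ones quoted above, here is a minimal sketch of two plain-Scala helpers. The names BuildTiming, timed, and withInfoLogging are hypothetical and not part of Chisel or Firrtl. The first measures wall-clock time around any block, and the second appends the -ll info option mentioned earlier unless a -ll option was already passed.

```scala
// Hypothetical helpers for illustration; nothing here depends on Chisel or Firrtl.
object BuildTiming {
  // Evaluate `body` once and return its result together with the
  // elapsed wall-clock time in milliseconds.
  def timed[A](body: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = body
    val elapsedMs = (System.nanoTime() - start) / 1000000L
    (result, elapsedMs)
  }

  // Append "-ll info" unless the caller already supplied a "-ll" option.
  // Only the short form "-ll" is checked here.
  def withInfoLogging(args: Array[String]): Array[String] =
    if (args.contains("-ll")) args else args ++ Array("-ll", "info")
}
```

Assuming the same Main object as above, the call could be wrapped as BuildTiming.timed { chisel3.Driver.execute(BuildTiming.withInfoLogging(args), () => new Top) } and the returned milliseconds printed next to the per-pass times that Firrtl reports, which should show whether the hours are spent in Chisel elaboration, in the Firrtl passes, or somewhere else in the run script.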