I designed a mesh array of registers in Chisel, say 32x32 bytes of D flip-flops, to try implementing this kind of parallel hardware architecture in Chisel. The FIRRTL file is about 100k lines and looks like a netlist. Translating the FIRRTL to Verilog then takes many hours, and during that time the work runs on a single CPU. Could you enlighten me on how to make it run in parallel across CPUs?
The key code:
val reg_vec = (0 to 31).map(i=>
  (0 to 31).map(j=>
    Module(new MyNodeOfReg(8))
  )
)
The Scala compiler and code runner are version 2.11.8.
I made a run script like this; I run ./run and then wait for ./target/Bench.v:
mkdir target
cp /opt/eda_tool/RISCV/top.cpp target
scalac -d target -cp $CP Top.scala Test.scala
scala -cp $CP org.scalatest.run Test
Here scalac/scala are the launchers generated by the Scala installation, and my Test.scala is:
import chisel3._
import chisel3.util._
import chisel3.testers._
import org.scalatest._
import org.scalacheck._
import org.scalatest.prop._
import scala.sys.process._
class Bench() extends BasicTester {
  val dut = Module(new Top())
  val t = Reg(UInt(width=32),init=0.U)
  t := t+1.U
  when(t<=1.U) {
  }.elsewhen(t===100.U) {
    stop()
  }
}
class Test extends PropSpec with PropertyChecks {
  property("elaborate") {
    Driver.elaborate (() => { new Top() })
  }
  property("should return the correct result") {
    TesterDriver.execute(() => { new Bench() })
  }
}
The Top.scala is:
import chisel3._
import chisel3.util._
object ce_pm{
  val div = 4
  val e = 1
  val ec= 1
  val p = 10 // 16/div
  val s = p*p
  val w = s*e
  val ext = 64
  val extw= ext*e
  val irp = 20 // 40/div // InREG parameter
  val irn = irp*irp // InREG reg number
}
class Mux4(n: Int) extends Module {
  val io = IO(new Bundle{
    val i = Input(Vec(4,UInt(n.W)))
    val s = Input(UInt(2.W))
    val o = Output(UInt(n.W))
  })
  val mux00 = Wire(UInt(n.W))
  val mux01 = Wire(UInt(n.W))
  mux00 := Mux(io.s(0)===1.U,io.i(1),io.i(0))
  mux01 := Mux(io.s(0)===1.U,io.i(3),io.i(2))
  io.o := Mux(io.s(1)===1.U,mux01,mux00)
}
class CEIO_TwoD_Torus extends Bundle {
  val n = Input(UInt(ce_pm.e.W))
  val s = Input(UInt(ce_pm.e.W))
  val w = Input(UInt(ce_pm.e.W))
  val e = Input(UInt(ce_pm.e.W))
}
class TwoD_TorusReg extends Module {
  val io = IO(new Bundle{
    val i = new CEIO_TwoD_Torus()
    val o = new CEIO_TwoD_Torus().flip
    val d = Input(UInt(ce_pm.e.W))
    val c = Input(Vec(4,UInt(1.W)))
  })
  val r = Reg(UInt(ce_pm.e.W),init=0.U)
  val u_mux4 = Module(new Mux4(ce_pm.e))
  u_mux4.io.i(0) := io.i.e
  u_mux4.io.i(1) := io.i.s
  u_mux4.io.i(2) := io.i.w
  u_mux4.io.i(3) := io.i.n
  u_mux4.io.s := Cat(io.c(2),io.c(1))
  when (io.c(0) === 1.U) {
    when (io.c(3) === 0.U) {
      r := u_mux4.io.o
    } .otherwise {
      r := io.d
    }
  }
  io.o.e := r
  io.o.s := r
  io.o.w := r
  io.o.n := r
}
class Top extends Module {
  val io = IO(new Bundle{
    val i = Input (UInt(ce_pm.extw.W))
    val o = Output(Vec(ce_pm.p,Vec(ce_pm.p,UInt(ce_pm.e.W))))
    val c = Input (UInt(7.W))
  })
  val n = ce_pm.irp
  val r_vec = (0 to n-1).map ( i=>
    (0 to n-1).map ( j=>
      Module(new TwoD_TorusReg)
    )
  )
  for (i <- 0 to n-1) {
    for (j <- 0 to n-1) {
      r_vec(i)(j).io.c(0) := io.c(1)
      r_vec(i)(j).io.c(3) := io.c(0)
      r_vec(i)(j).io.c(2) := io.c(2)
      r_vec(i)(j).io.c(1) := io.c(3)
    }
  }
  // out
  val m = ce_pm.p
  for (i <- 0 to m-1) {
    for (j <- 0 to m-1) {
      io.o(i)(j) := r_vec(i)(j).io.o.e
    }
  }
  //2-D-Torus interconnection
  for (i <- 1 to n-1) {
    for (j <- 1 to n-1) {
      r_vec(i)(j).io.i.w := r_vec(i)(j-1).io.o.e
      r_vec(i)(j).io.i.n := r_vec(i-1)(j).io.o.s
    }
  }
  for (i <- 0 to n-2) {
    for (j <- 0 to n-2) {
      r_vec(i)(j).io.i.e := r_vec(i)(j+1).io.o.w
      r_vec(i)(j).io.i.s := r_vec(i+1)(j).io.o.n
    }
  }
  for (i <- 0 to n-1) {
    r_vec(i)(0).io.i.w := r_vec(i)(n-1).io.o.e
    r_vec(0)(i).io.i.n := r_vec(n-1)(i).io.o.s
  }
}
This sounds like a very gnarly performance bug, so if you could provide more information about your design, that would be very helpful (code would be even better). You can also try the command-line option -ll info to report the runtime of each of the Firrtl passes.
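As a rough sketch of how the -ll info flag could be supplied when the design is built from a Scala entry point rather than a shell command, the helper below prepends it to whatever arguments the program already received. The object name LogArgs and the method withInfoLogging are hypothetical names made up for this illustration; they are not part of Chisel or Firrtl.

```scala
// Hypothetical helper (not a Chisel/Firrtl API): builds the argument array
// handed to the driver so that the log level is set to info.
object LogArgs {
  // Prepend "-ll info" unless the caller already chose a log level.
  def withInfoLogging(args: Array[String]): Array[String] =
    if (args.contains("-ll") || args.contains("--log-level")) args
    else Array("-ll", "info") ++ args
}

// Assumed usage, with chisel3 on the classpath and Top defined as in the question:
//   chisel3.Driver.execute(LogArgs.withInfoLogging(args), () => new Top)
```

The check for an existing flag just avoids passing the option twice; whether that matters depends on how the option parser treats duplicates.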
Projects based on rocket-chip frequently generate hundreds of thousands to millions of lines of Firrtl, which usually compile in seconds to minutes. For this reason we have not yet felt the need to parallelize the code.
EDIT: Thank you for adding the code!
I'm struggling to reproduce the performance problems you're seeing. With irp = 32, compilation from Firrtl to Verilog is taking about 4 seconds; total compilation including Chisel is taking about 8 seconds. Should I be changing other parameters as well? I am compiling with:
object Main extends App {
  chisel3.Driver.execute(args, () => new Top)
}
Can you share a little bit more about how you're building the Module?
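To compare numbers like the ones above on your own machine, one option is to wrap each build step in a small wall-clock timer. This is only a minimal sketch in plain Scala; the object name Timing and the method timed are hypothetical names chosen for the example, and the measurement includes JVM warm-up, so treat the figures as approximate.

```scala
// Hypothetical helper: runs a block of code and reports its wall-clock time.
object Timing {
  // Evaluates `body` once, prints the elapsed time, and returns
  // the result together with the elapsed milliseconds.
  def timed[A](label: String)(body: => A): (A, Double) = {
    val start = System.nanoTime()
    val result = body
    val ms = (System.nanoTime() - start) / 1e6
    println(f"$label took $ms%.1f ms")
    (result, ms)
  }
}

// Assumed usage, with chisel3 on the classpath and Top defined as in the question:
//   Timing.timed("Chisel + Firrtl") { chisel3.Driver.execute(args, () => new Top) }
```

Timing the whole driver call this way gives the total, which can then be set against the per-pass times that Firrtl reports at the info log level.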