Hardware Interview Questions - Study Guide | Hardware Interview

All Hardware Types Interview Questions

Digital Design/RTL

Topics

Overview

RTL Design involves designing and integrating digital logic modules for chips and IP blocks.

Successful interviews require both technical knowledge and strong behavioral preparation.

Behavioral Preparation

Critical for success. Be ready to discuss past projects with specific, measurable details.

"Improved circuit speed by 25%" is far more compelling than "worked hard on the project."

Prepare to explain:

• Architecture decisions and tradeoffs
• Performance improvements and optimization strategies
• Debugging approaches and problem-solving
• Design verification methods

Verilog/SystemVerilog

Verilog and SystemVerilog are essential languages to master. You don't need to be an expert, but you must understand the common constructs: blocking vs. non-blocking assignments, always and initial blocks, the three types of fork-join, synthesizable statements, and frequently tested modules like sequence detectors and edge detectors. More challenging ones include FIFOs and round-robin arbiters. SystemVerilog adds OOP features; I wasn't asked about them much this time, but they are frequently tested.
Be able to discuss and quickly implement the following structures:
• Simple Counter
• State Machine
• FIFOs (synchronous and asynchronous)
  • You should know the minimum depth of asynchronous FIFOs.
  • Also, you need a reasonable test plan for FIFOs.
  • Full/empty detection for circular FIFOs (pointer comparison vs counter methods)
• Arbiters (round-robin, priority, etc.)
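
As a reference point for the full/empty item above, here is a minimal sketch of a synchronous FIFO that uses the pointer-comparison method with one extra wrap bit per pointer. The module name `sync_fifo`, the port names, the reset style, and the power-of-two depth are assumptions made for illustration, not a required interface.

```systemverilog
module sync_fifo #(
  parameter int DW = 8,          // data width (assumed)
  parameter int AW = 2           // address width; depth = 2**AW
)(
  input  logic          clk,
  input  logic          rst_n,   // active-low synchronous reset (assumed)
  input  logic          wr_en,
  input  logic [DW-1:0] wr_data,
  input  logic          rd_en,
  output logic [DW-1:0] rd_data,
  output logic          full,
  output logic          empty
);
  logic [DW-1:0] mem [0:(1<<AW)-1];
  logic [AW:0]   wptr, rptr;     // extra MSB tells full apart from empty

  // Equal pointers: empty. Same index but different wrap bit: full.
  assign empty   = (wptr == rptr);
  assign full    = (wptr[AW] != rptr[AW]) &&
                   (wptr[AW-1:0] == rptr[AW-1:0]);
  assign rd_data = mem[rptr[AW-1:0]];   // combinational (show-ahead) read

  always_ff @(posedge clk) begin
    if (!rst_n) begin
      wptr <= '0;
      rptr <= '0;
    end else begin
      if (wr_en && !full) begin
        mem[wptr[AW-1:0]] <= wr_data;
        wptr <= wptr + 1'b1;
      end
      if (rd_en && !empty) rptr <= rptr + 1'b1;
    end
  end
endmodule
```

The counter method would replace the pointer comparison with an occupancy counter, which is simpler to reason about but costs an extra up/down counter.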

Digital Design Fundamentals

Logic design basics: universal gates, MUX implementations
Counter design
Basic multipliers and adders
State Machines (major topic):
• Draw state diagrams for scenarios
• Mealy vs Moore models - critical to master both
• Convert diagrams to Verilog - practice templates for quick implementation
• Binary vs one-hot encoding tradeoffs
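
A reusable template helps when converting a state diagram to Verilog quickly. The sketch below shows one common three-part style (state register, next-state logic, output logic) applied to a Moore detector for the overlapping sequence 101. The module name `seq101_moore`, the state names, and the synchronous active-high reset are assumptions for illustration.

```systemverilog
module seq101_moore (
  input  logic clk,
  input  logic rst,    // synchronous, active high (assumed)
  input  logic din,
  output logic match   // high for the cycle after "101" has been received
);
  typedef enum logic [1:0] {IDLE, S1, S10, S101} state_t;
  state_t state, next;

  // 1) State register
  always_ff @(posedge clk) begin
    if (rst) state <= IDLE;
    else     state <= next;
  end

  // 2) Next-state logic
  always_comb begin
    case (state)
      IDLE:    next = din ? S1   : IDLE;
      S1:      next = din ? S1   : S10;
      S10:     next = din ? S101 : IDLE;
      S101:    next = din ? S1   : S10;  // overlap: the last 1 can start a new match
      default: next = IDLE;
    endcase
  end

  // 3) Moore output depends on the state only
  assign match = (state == S101);
endmodule
```

A Mealy version would drop the S101 state and assert match when the state is S10 and din is 1, which saves a state but makes the output combinationally dependent on the input.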

Computer Architecture - (See Computer Architecture section)

OS Fundamentals (Important)
Cache (Important)
Memory hierarchy
5-stage pipeline basics
Out of order execution
Branch Prediction
Parallelism

Static Timing Analysis

Definitions: setup time, hold time, clock insertion delay, variation
Analyze Flop-Logic-Flop configurations:
• Detect violations
• Fix violations
Time borrowing, metastability, MTBF
Logic optimization:
• Karnaugh maps, mux/decoder implementations
• Glitches in combinational circuits - identify and resolve (hazard covers)

Programming & Scripting - (More details in DV section)

C/C++:
• Leetcode (View the most common DV questions at the bottom of the page)
• How to write a Makefile
• Typical C++ knowledge-based questions - OOP concepts, classes
• Given a snippet of code, be able to explain the logic.
Scripting: Python

Protocols & Interfaces

Understand the working principles of common SoC communication protocols such as AXI, PCIe, I2C, UART, and SPI; these are frequently asked about in interviews.

Having a project you're particularly proud of is a great asset. When interviewers ask me to describe a project on my resume, I always choose the one my teammates and I worked on from the very beginning. We had a clear understanding of all the design and optimization details, and the final result was quite good, so I felt confident discussing it.

Things you should cover:

Valid-Ready handshaking (ubiquitous)
AXI protocols if applicable
Backpressure mechanisms (valid-ready, valid-afull, valid-credit)
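
One simple way to picture valid-ready backpressure is a single register stage placed between a producer and a consumer. The following is a minimal sketch that registers valid and data only (ready still passes through combinationally); the module name `vr_stage` and the port names are assumptions for illustration.

```systemverilog
module vr_stage #(
  parameter int DW = 8
)(
  input  logic          clk,
  input  logic          rst,       // synchronous, active high (assumed)
  input  logic          in_valid,
  output logic          in_ready,
  input  logic [DW-1:0] in_data,
  output logic          out_valid,
  input  logic          out_ready,
  output logic [DW-1:0] out_data
);
  // Accept a new beat when the register is empty or is drained this cycle.
  assign in_ready = !out_valid || out_ready;

  always_ff @(posedge clk) begin
    if (rst) begin
      out_valid <= 1'b0;
    end else if (in_ready) begin
      out_valid <= in_valid;
      if (in_valid) out_data <= in_data;
    end
  end
endmodule
```

Because in_ready depends combinationally on out_ready, this stage does not break the ready timing path; a skid buffer is the usual extension when that path also needs a register.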

Clock Domain Crossing (CDC)

CDC techniques and synchronization methods
How to synchronize a single bit vs. multiple bits; Gray code and its potential problems
Know how to calculate the minimum clock cycle or propagation delay of a given circuit
Handling data crossing clock domains safely, some cases to consider:
• Slower clock domain to faster clock domain
• Faster clock domain to slower clock domain
• Crossing between clock domains with lots of data
• ... and know what happens if you don't handle it properly for these cases
Synchronizers (2-flop, multi-flop)
Handshaking protocols for CDC
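
For the single-bit case, the 2-flop synchronizer is the basic building block. A minimal sketch follows; the module name `sync_2ff`, the signal names, and the reset style are assumptions for illustration.

```systemverilog
module sync_2ff (
  input  logic clk_dst,   // destination-domain clock
  input  logic rst_n,     // active low, assumed already synchronous to clk_dst
  input  logic d_async,   // single-bit signal from another clock domain
  output logic q_sync     // version that is safe to use in the clk_dst domain
);
  logic meta;  // first stage; this flop may go metastable

  always_ff @(posedge clk_dst) begin
    if (!rst_n) begin
      meta   <= 1'b0;
      q_sync <= 1'b0;
    end else begin
      meta   <= d_async;
      q_sync <= meta;   // second stage gives the first a full cycle to settle
    end
  end
endmodule
```

This structure is only appropriate for one bit at a time (or a Gray-coded bus where one bit changes per update), and the source must hold the value long enough for the destination clock to sample it, which is the concern in the fast-to-slow case.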

FPGA-Specific Roles

FPGA vs ASIC differences
Optimized RTL for FPGA architectures (register abundance, DSP blocks)

RTL Design Fundamentals

What is the difference between wire and reg?

What's the difference between a Mealy and Moore machine? How will you convert your Mealy machine to Moore? What's an advantage of a Mealy machine?

Difference between blocking and non-blocking assignments? What's the difference between always_ff and always_comb?

What is the difference between bit and logic?

What are the tradeoffs between the FSM and shift register approaches?

What is pipelining? Explain the 5 stages of pipelining.

If we have a reset tree that's too big and we can't meet reset deassertion timing, what can we do in this case?

NVIDIA phone screen - Design an XOR gate using a 2:1 MUX.

Why are latches discouraged?

How do you do synchronous deassertion of reset?

What are the considerations before designing the microarchitecture?

FPGA Engineer - Optiver Take Home Test Question:

Select the boolean equation that matches the following truth table:

A	B	C	O
0	0	0	1
0	0	1	0
0	1	0	1
0	1	1	0
1	0	0	0
1	0	1	1
1	1	0	0
1	1	1	1

Pick ONE option:

a) (A xor B) and C
b) A or (B and C)
c) A xnor C
d) B
e) B and C
f) A xor (B xor C)

FPGA Engineer - Optiver Take Home Test Question:

If we used lookup tables (LUTs) with 4 inputs and 1 output to implement the LogicModule module below, how many lookup tables would be used?

module LogicModule (
  input logic Clk,
  input logic Rst,
  input logic [7:0] DataIn,
  output logic [7:0] DataOut
);

  always @(posedge Clk) begin
    DataOut[7] <= DataIn[0] | DataIn[1];
    DataOut[6] <= DataIn[1] | DataIn[2];
    DataOut[5] <= DataIn[2] | DataIn[3];
    DataOut[4] <= DataIn[3] | DataIn[4];
    DataOut[3] <= DataIn[4] | DataIn[5];
    DataOut[2] <= DataIn[5] | DataIn[6];
    DataOut[1] <= DataIn[6] | DataIn[7];
    DataOut[0] <= DataIn[7] | DataIn[0];
  end

endmodule

Verilog Timing Analysis Question:

What are the values of A and B at simulation times 0, 1, 2, and 3 time_units?

initial begin
  A = 0;
  B = 1;
end
// Section 1.1:
always @(posedge clk) begin
  A <= 2;
end
always @(posedge clk) begin
  B <= A;
end
// Section 1.2:
always @(posedge clk) begin
  A = 2;
end
always @(posedge clk) begin
  B <= A;
end
// Section 1.3:
always @(posedge clk) begin
  A <= 2;
end
always @(posedge clk) begin
  B = A;
end
// Section 1.4:
always @(posedge clk) begin
  A = 2;
end
always @(posedge clk) begin
  B = A;
end

Given the initial values a=1, b=0, c=0 just before a rising clock edge, compute the final values of a, b, c after one clock for each snippet.

Variant A – Non-blocking (<=)

always_ff @(posedge clk) begin
  a <= b;
  b <= c;
  c <= a;
end

Variant B – Blocking (=)

always @(posedge clk) begin
  a = b;
  b = c;
  c = a;
end

What are the final values of a, b, c for Variant A and Variant B? (Show your steps by drawing a waveform or a small timing table.)

Once you write RTL, how do you make sure that it synthesizes correctly? How do you make sure that there is no unintended latch or combinational loop?

What is wrong with this code?

always @(posedge clk) begin : pipeline
  Q1 = in;
  Q2 = Q1;
  Q3 = Q2;
end

Solution: The code uses blocking assignments inside a clocked block, so the current input propagates through Q1→Q2→Q3 in the same cycle. That doesn't model pipeline flops (each stage should capture the previous cycle's value). Use non-blocking assignments (<=), ideally inside always_ff, to model sequential behavior.
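
A corrected version might look like the sketch below. The wrapper module name `pipeline3` and the port list are assumptions added so the fragment compiles on its own; the signal names come from the question.

```systemverilog
module pipeline3 (
  input  logic clk,
  input  logic in,
  output logic Q1, Q2, Q3
);
  // Non-blocking assignments sample every right-hand side before any
  // left-hand side updates, so each stage captures the value the
  // previous stage held before this clock edge.
  always_ff @(posedge clk) begin : pipeline
    Q1 <= in;
    Q2 <= Q1;
    Q3 <= Q2;
  end
endmodule
```

With this change, a value presented on in appears on Q3 three clock edges later instead of on the same edge.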

How do you write a fixed-priority arbiter?

Meta phone screen - How do you write a round robin arbiter? Talk it out loud.

RTL Coding Challenges

Design a circuit that can detect 10X011 (X can be either 0 or 1)

Let's say we have a 3-bit signal coming in, and we want the output to be high when the input is 1, 2, or 4. How will you design that? (very simple question to start off)

Basic Verilog question - Write a Sequence Detector for 101

Basic Verilog question - Write a Sequence Detector for 101011

Swap 2 variables without using a temporary variable.

Meta phone screen - You have a valid/ready interface. How will you add a pipe stage?

AMD Xilinx - Write Verilog code for a 2-bit up/down counter. Inputs are clk, async rstb, up, and down; the output is count. Don't jump straight into writing RTL: first draw a truth table for ALL possible input combinations, then start coding. Drawing a circuit diagram afterward helps as well.

AMD Xilinx - An input bit pattern is coming in. Determine at any point if the number is divisible by 5 or not.

AMD Xilinx - Do a non-state machine approach for the previous problem (divisible by 5).

Apple RTL Role Interview Question: Find if a stream of bits is divisible by 5 for infinite length, 1 bit per cycle

Solution: https://electronics.stackexchange.com/questions/345189/vhdl-interview-question-detecting-if-a-number-can-be-divided-by-5-without-rema

Design a block with inputs d and clk and output match. Detect a pattern of 1101. How would you make this block configurable to detect all 4-bit patterns?

Write RTL for a 512:1 multiplexer.

NVIDIA phone screen - You have a 7 bit vector coming in a[6:0]. How would you find out the number of 1s in this vector?

Write RTL - You have 1 bit of data coming in per cycle. Shift that input into a 3-bit register and detect whether the register contents are divisible by 3. After reset, the register value will be 0.

Write RTL for a 128-bit memory with 32-bit data interface. What happens if both wr_en and rd_en are 1 in the same cycle?

Create a data buffer - type and number of samples are configurable. Depth of buffer can change (could be say 5, 10 or 100). Samples gathered at the same time periodically (every 10 clks). Read can happen at any time. We want the latest X samples to be available. We can stall writes, can't stall reads.

Amazon phone screen - You have a 10-bit tag coming in. Each 10-bit tag has to be assigned to a unique 4-bit ID and sent downstream. When the 4-bit ID comes back as response, the 10-bit tag has to be returned to the master. There is a valid and ready on the incoming tag side and a valid and ready on the downstream side to send the tag out. Design this block (Had to write RTL to design this)

Popular question - Design a traffic light controller (Hint - use a FSM)

Build a 32-bit adder with two 16-bit adders and other logic. The 16-bit adder doesn't have Cin.

Write RTL for an SRAM (behavioral model) with parameterizable depth and width.

Draw a circuit to detect even number of 1s.

Determine Sum and Carry for Half Adder and Full Adder. (Yep Even this. Actually took me a minute to derive Cout for a full-adder)

RTL for determining number of 1s in an input bit-stream.

Write a Verilog module that sorts the values in the given memory, so the lowest value is at address 0x00 and the highest value is at 0xFF. The memory has one read port and one write port.

MSFT Question

Implement a frame filter module that accepts input frames and forwards only good ones. The module should:

• Drop any frame with an error on any beat
• Drop frames with invalid start/end sequence
• Apply backpressure when FIFO/buffer is almost full
• Handle frames of variable length (1 beat to many beats)
• Ensure no partial frames are sent to the output

// Frame filter: accept input frames, forward only good ones
// Drop any frame with an error on any beat
// Drop frames with invalid start/end sequence
// Apply backpressure when FIFO/buffer is almost full
// FPGA-friendly implementation

module frame_filter #(
  parameter int DW = 512,
  parameter int PW = 6
)(
  input logic clk,
  input logic rst_n,

  // Incoming stream
  input logic in_vld, // valid beat
  input logic in_sof, // start-of-frame
  input logic in_eof, // end-of-frame
  input logic in_err, // error on this beat
  input logic [PW-1:0] in_tail_pad, // valid only when in_eof=1
  input logic [DW-1:0] in_data,
  output logic in_backpress, // assert to stall sender

  // Outgoing stream
  output logic out_vld,
  output logic out_sof,
  output logic out_eof,
  output logic [PW-1:0] out_tail_pad,
  output logic [DW-1:0] out_data,
  input logic out_stall
);

// Tasks:
// 1) Track when a frame starts and ends
// 2) If any beat in a frame has in_err=1, mark frame as bad
// 3) Do not send any beat of a bad frame to the output
// 4) Handle frames of variable length (1 beat to many beats)
// 5) Apply backpressure when buffer is almost full
// 6) What happens if frame never sends in_eof? (flush? timeout?)
// 7) Make sure you don't send partial frames to the output

endmodule

Microsoft Interview Question (asked to code this up in 45 minutes) - One value arrives each clock. Using a stack-based approach, track the second-largest value observed so far. When out_valid is asserted, output (latest_value, second_largest_so_far). Example stream: 1, 4, 5, 2, 3 → on first out_valid: latest=3, second_largest=4; on next out_valid: latest=2, second_largest=4. Duplicates count as the second-largest value. State assumptions (bit width, signed/unsigned, duplicate handling, reset behavior, latency)

Solution: Maintain two stacks: S1 holds the inputs in arrival order, and S2 holds the second-largest value as of each cycle. Add one flop, L1, to track the largest value.
On every new input:
1. Push the input onto S1.
2. Update S2 and L1:
  if input >= L1
    push L1 onto S2
    L1 <- input
  else if input >= head(S2)
    push input onto S2
  else
    push head(S2) onto S2

Please write RTL for the following register design.
Clk_50Mhz – 50 MHz input clock
Reset_n – Active-low reset
Data_in_1..4 – four 16-bit data inputs
Data_out – 16-bit data output
WE – write enable for selected input
Address – 2-bit selects which input to write

Design a circuit that counts number of 1s in a[3:0] if the only available component is a full-adder

 a3    a2    a1    a0
  |     |     |     |
+-----------------------+
|                       |
+-----------------------+
            |
          count

Design a component to rotate a 4x4 byte array & write out the output. Data comes in 4 cycles.

Input: One row of bytes per cycle (4 inputs for the array):

Cycle 0 : B3, B2, B1, B0
Cycle 1 : B7, B6, B5, B4
Cycle 2 : B11, B10, B9, B8
Cycle 3 : B15, B14, B13, B12

Output: 1 rotated column of bytes per cycle:

Cycle 0 : B12, B8, B4, B0
Cycle 1 : B13, B9, B5, B1
Cycle 2 : B14, B10, B6, B2
Cycle 3 : B15, B11, B7, B3

Create this Fibonacci generator system -

	___
Clk  --|   |
Rst  --|   |--fib[15:0]
Next --|___|

Next is a 1 bit pulse. When next is 1, generate next number in sequence. Hold previous value until next is 1.

Design this system:
We get an async event.
If ≥1 event: Count as 1 event
If 0 event: Miss.

Requirements:

• Create system to raise error flag if we have 2 misses every 40 cycles
• Modify to raise error flag for 2 misses in any 40 cycle window

How can a fixed 16-bit adder (black box) be used as two independent 8-bit adders? You may add external logic on inputs/outputs but cannot modify the adder internals. Normally it computes a[15:0]+b[15:0] → s[15:0]; now require a[7:0]+b[7:0] → s[7:0] and a[15:8]+b[15:8] → s[15:8] using that single adder. State assumptions (single-cycle vs multi-cycle/time-multiplexed, latency allowed, carry behavior).

Solution: At the input to the lower adder, force a[7] and b[7] to 0. Then at the output, S0[7] = S0_from_adder[7] ^ a[7] ^ b[7]. Upper sum remains as is.
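
The trick can be sketched in RTL as follows. The `+` operator stands in for the black-box adder, and the module name `dual_add8` and the intermediate signal names are assumptions for illustration.

```systemverilog
module dual_add8 (
  input  logic [15:0] a,
  input  logic [15:0] b,
  output logic [15:0] s    // s[7:0] = a[7:0]+b[7:0], s[15:8] = a[15:8]+b[15:8]
);
  logic [15:0] a_m, b_m, s_raw;

  // Clear bit 7 of both operands so bit 7 can never carry into bit 8.
  assign a_m = {a[15:8], 1'b0, a[6:0]};
  assign b_m = {b[15:8], 1'b0, b[6:0]};

  // Stand-in for the fixed 16-bit black-box adder.
  assign s_raw = a_m + b_m;

  // With both bit-7 inputs forced to 0, s_raw[7] is the carry out of bit 6.
  // The true sum bit 7 is a[7] ^ b[7] ^ carry_in.
  assign s = {s_raw[15:8], s_raw[7] ^ a[7] ^ b[7], s_raw[6:0]};
endmodule
```

The carry out of the lower 8-bit sum is discarded in this sketch; recovering it would need one more gate-level expression on a[7], b[7], and s_raw[7].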

Design a parameterized encoder, which converts an N bit one hot signal to a binary value, specifying the location of the set bit. It should not synthesize with priority and you can assume a "don't care" output for invalid inputs.

Design hardware to implement the IIR filter
H(z) = Y(z)/X(z) = (1 + 2z^-1 + z^-2) / (1 − 1.5z^-1 + 1.5z^-2).
Specify structure (Direct Form I/II or transposed), fixed-point widths/quantization, overflow/saturation behavior, latency, and reset strategy.

What is the minimum number of multipliers required to implement Y = AX^4 + BX^3 + CX^2 + DX?

Solution: 4, by factoring with Horner's rule: Y = X(X(X(AX + B) + C) + D).
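
As a combinational sketch of that factoring, each intermediate term below uses exactly one multiplier. The module name `poly_horner`, the unsigned operands, and the 5*W output width are assumptions for illustration.

```systemverilog
module poly_horner #(
  parameter int W = 8
)(
  input  logic [W-1:0]   a, b, c, d, x,
  output logic [5*W-1:0] y    // wide enough to hold the result without overflow
);
  logic [5*W-1:0] t1, t2, t3;

  // One multiplier per line, four in total.
  assign t1 = a  * x + b;   // AX + B
  assign t2 = t1 * x + c;   // AX^2 + BX + C
  assign t3 = t2 * x + d;   // AX^3 + BX^2 + CX + D
  assign y  = t3 * x;       // AX^4 + BX^3 + CX^2 + DX
endmodule
```

In a real design the chain would usually be pipelined, since four multipliers in series make a long combinational path.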

FIFO Design

Sync FIFO + RTL code. Full/empty condition

AMD Xilinx - Write Async FIFO RTL

Gray-to-binary and binary-to-Gray conversion + RTL code
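
For reference, both conversions fit in a few lines. This is a minimal sketch; the module name `gray_conv` and putting both directions in one module are assumptions made to keep the example compact.

```systemverilog
module gray_conv #(
  parameter int N = 4
)(
  input  logic [N-1:0] bin_in,
  output logic [N-1:0] gray_out,   // binary -> Gray
  input  logic [N-1:0] gray_in,
  output logic [N-1:0] bin_out     // Gray -> binary
);
  // Binary to Gray: XOR each bit with the next more significant bit.
  assign gray_out = bin_in ^ (bin_in >> 1);

  // Gray to binary: bit i is the XOR of all Gray bits at position i and above.
  always_comb begin
    for (int i = 0; i < N; i++)
      bin_out[i] = ^(gray_in >> i);
  end
endmodule
```

Binary-to-Gray is a single XOR level, while Gray-to-binary is an XOR chain whose depth grows with N, which matters when it sits in an async FIFO pointer path.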

How do you design a non-power-of-2 depth async FIFO?

Qualcomm phone screen -
a) A system has a 100MHz write clock. The write logic performs 16 write operations within 80 clock cycles. The write pattern is flexible (can be burst or random). What is the minimum read frequency required to ensure that the write operations are never backpressured?
b) After finding the frequency, what is the minimum depth that this FIFO should have?

Question from NVIDIA - FIFO Write: 250 MHz, FIFO Read: 200 MHz, 70 valid data burst every 100 cycles, calculate FIFO depth.

Solution: Time to write one word = 4 ns; time to write 70 words back to back = 280 ns; time to read one word = 5 ns; words read in 280 ns = 56; FIFO depth = 70 - 56 = 14. The key point is that he asked me to justify assuming a back-to-back burst of 70 when he had originally said 70 valid in 100 cycles.
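
The same steps can be written as one formula. As a sketch, assume a back-to-back burst of B writes, a write clock f_w, a read clock f_r, and a reader that drains continuously during the burst (the symbols are introduced here for illustration, and synchronizer latency is ignored):

```latex
\[
\text{depth} \;=\; B - \left\lfloor B \cdot \frac{f_r}{f_w} \right\rfloor
\;=\; 70 - \left\lfloor 70 \cdot \frac{200}{250} \right\rfloor
\;=\; 70 - 56 \;=\; 14
\]
```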

Calculate fifo depth and width: write side writes 18 bit/cycle, write frequency = 60MHz, read side, reads 20bit/cycle, read frequency = 36 MHz, burst size = 100 writes.

Solution: Width: 20 bits; depth: 30 entries (plus about 4 for synchronization).
I assume the burst is 100 write cycles, not 100 bits. So 1800 bits are written in 100 cycles, while the read side drains (36/60 x 20/18) x 1800 = 1200 bits in the same time. That leaves 600 bits to buffer; dividing by the 20-bit width gives 30 entries.
To account for synchronizer delay, assuming reads start after a delay, add 3-4 more entries.

Question from NVIDIA - Number generator with a synchronous reset; the first output after reset should be 1. Generate the sequence 1, 4, 9, 16, 25, 36, 49, … without using multipliers. How many adders are needed? (Hint: it's not "square the number.")

Solution: Use sum of consecutive odd numbers to generate squares.
Architecture:
• Register the output (out_r).
• Keep an odd-number counter starting at 3 and increment by 2.
• On each step: out_r ← out_r + odd_count.
Adders required: 2 (one for the odd counter, one for out_r + odd_count).
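
The architecture above maps to RTL almost line for line. In this sketch the module name `squares_gen`, the 32-bit width, and the active-high reset polarity are assumptions; out_r and odd_count follow the names used in the solution.

```systemverilog
module squares_gen (
  input  logic        clk,
  input  logic        rst,       // synchronous, active high (assumed)
  output logic [31:0] out_r      // 1, 4, 9, 16, ...
);
  logic [31:0] odd_count;

  always_ff @(posedge clk) begin
    if (rst) begin
      out_r     <= 32'd1;        // first value after reset
      odd_count <= 32'd3;        // next odd number to add
    end else begin
      out_r     <= out_r + odd_count;   // adder 1
      odd_count <= odd_count + 32'd2;   // adder 2
    end
  end
endmodule
```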

Is there something you could change in your design so that you could use a single port SRAM but still have the capability of continuous data coming in and out?

RAM Design

Design a RAM with a write port and a read port. The write port is 16 bits wide and the read port is 16 bits wide. The RAM should be able to store 16 words.

What is the difference between Open page and Closed page policy?

MSFT Hardware Engineer II. FPGA Virtualization/SDN team

How would you implement malloc() and free() in hardware (Verilog)?

module hw_malloc_free #(
    parameter DEPTH = 16, // number of memory blocks
    parameter ADDR_WIDTH = 4 // log2(DEPTH)
)(
    input wire clk,
    input wire rst,

    // Allocation request
    input wire alloc_req, // request to allocate a block
    output reg [ADDR_WIDTH-1:0] alloc_addr, // allocated address index

    // Free request
    input wire free_req, // request to free a block
    input wire [ADDR_WIDTH-1:0] free_addr, // address to free

    // Status
    output wire full, // no free blocks
    output wire empty // all blocks free
);

// Implementation is the exercise.

endmodule

FPGA Image Processing Interview Question

AXI-Stream 5x5 Line-Buffer Design

You are given an AXI-Stream video-style input:
• s_axis_tdata — 8-bit pixel
• s_axis_tvalid
• s_axis_tready
• s_axis_tlast — end of line
• s_axis_tuser — start of frame

Resolution is fixed (e.g., 1920 pixels per line), 1 pixel per cycle.

Question: Design an RTL block that outputs a 5x5 pixel window every cycle using only line buffers (BRAM-based).

Output is also AXI-Stream:
• m_axis_tdata — 25 pixels (5x5 window)
• m_axis_tvalid
• m_axis_tready
• m_axis_tuser — aligned to center pixel
• m_axis_tlast — aligned to center pixel

What you must explain in your answer:

1. How many line buffers are required and why?
2. How horizontal pixel delays are created for each line.
3. How the module knows when the 5x5 window is "valid."
4. How tuser and tlast must be delayed to align with the center of the 5x5 window.
5. What happens at borders (first 2 rows/columns).
6. How you keep the AXI-Stream protocol compliant (tvalid/tready).

Low-Power & Power Intent (UPF)

Latch vs FF for clock gating. Which is preferred and why?

Solution: Negative level-sensitive latch is preferred.

Why:
1) Area & Power: Latches are smaller/lower power than flops; savings multiply across thousands of ICGs.
2) Timing slack: A negative latch is transparent when clk=0, giving the enable nearly a full cycle to meet timing. A negedge FF gives ~½ cycle, tighter and riskier.
3) Glitch-free gating: In a standard ICG, latch output is ANDed with clk. While clk=0, the AND output stays low so enable changes can’t glitch the gated clock; latch closes at clk↑ and propagates a stable enable.
Using a posedge FF risks races at the AND; a negedge FF offers no benefit over the latch but costs more area/power.
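
A behavioral sketch of the latch-plus-AND structure described above is shown below. The module and signal names are assumptions for illustration, and this is a simulation model only; in practice the library's dedicated ICG cell is typically instantiated instead of building the structure from discrete gates.

```systemverilog
module icg_latch_and (
  input  logic clk,
  input  logic en,
  output logic gclk
);
  logic en_lat;

  // Negative level-sensitive latch: transparent while clk is low,
  // holds the enable while clk is high.
  always_latch begin
    if (!clk) en_lat <= en;
  end

  // en_lat can only change while clk is low, when the AND output is
  // already forced low, so gclk never sees a partial pulse.
  assign gclk = clk & en_lat;
endmodule
```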

How did you implement power-saving in your design?

How do you handle signals going from an ON to OFF domain and vice versa? How do you manage isolation, does it matter?

AMD Xilinx - If clock gating logic is moved into the DEN (Data Enable), will this cause LEC (Logical Equivalence Check) to fail or not? Why?

What are static and dynamic power, and what are some ways to reduce each?

Advanced RTL Topics

What are the different ways of arbitrating for resources? What is the difference between find-first and round-robin arbiters? What are the pros and cons of each, and when would you choose one over the other? What about the hardware cost tradeoffs? Can you write RTL for both?

NVIDIA phone screen - Open ended question: You have to design a memory controller block. It is an SRAM memory that you are accessing. How would you design the block?

Things to think about:
a) What interfaces would you use? How many?
b) How would you arbitrate between requests?
c) How many ports does memory have?
d) What other blocks should your memory controller have?

Apple phone screen - How would you prevent a RAW hazard in a system?

Design Verification

Topics

Overview

Design Verification involves writing tests for digital logic modules to ensure comprehensive coverage of all use cases and corner conditions.