Hi Guilin!
Thanks for your feedback. I have figured out the issue, and I am summarizing my findings here for anyone who runs into a similar problem.
1. I have enough licenses, so licensing does not affect the simulation time.
2. In short, the bottleneck for simulation speed is the memory bandwidth. As you mentioned before, increasing the number of cores does not reduce the simulation time linearly. In my case, 32 cores gave the fastest simulation time for a specific design. When I simulated another similar design in parallel, using another 32 cores on the same machine, the simulation speed halved. This is because the bandwidth between the CPU and RAM was already saturated by a single simulation, so the two simulations had to share it. Running a second simulation in parallel therefore makes each one take about twice as long.
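As a rough illustration of point 2, the behavior can be sketched with a simple model in which a run takes as long as the slower of its compute part and its memory-traffic part. All numbers and names below (`WORK_CORE_SECONDS`, `BYTES_MOVED`, `BANDWIDTH`, `sim_time`) are hypothetical and chosen only so that the job becomes bandwidth-limited at 32 cores; they are not measurements from my machine, and the model ignores effects such as parallel overhead and NUMA layout.

```python
# Toy model: wall time = max(compute time, memory-traffic time).
# Every constant here is a made-up assumption for illustration only.

WORK_CORE_SECONDS = 3200.0   # total compute work, in core-seconds (assumed)
BYTES_MOVED = 20e12          # total data moved between CPU and RAM (assumed)
BANDWIDTH = 200e9            # machine memory bandwidth in bytes/s (assumed)


def sim_time(cores, concurrent_jobs=1):
    """Estimated wall time in seconds for one job.

    The compute part shrinks with the number of cores given to the job.
    The memory part does not: it depends only on this job's share of the
    machine's memory bandwidth, assumed to be split evenly between jobs.
    """
    compute_time = WORK_CORE_SECONDS / cores
    memory_time = BYTES_MOVED / (BANDWIDTH / concurrent_jobs)
    return max(compute_time, memory_time)


for cores in (8, 16, 32, 64):
    print(f"1 job, {cores:2d} cores: {sim_time(cores):6.1f} s")

# Two jobs at once, 32 cores each, sharing the same memory bus.
print(f"2 jobs, 32 cores each: {sim_time(32, concurrent_jobs=2):6.1f} s per job")
```

In this sketch the time stops improving once the memory part dominates, and a second concurrent job doubles the time of each job because both compete for the same bandwidth.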
I hope this helps other people as well. I am attaching some useful links from the Ansys website regarding performance optimization.
Information on Hardware Specifications
Getting the Best FDTD Performance
Best,
Kaisar