Note: This discussion is about an older version of the COMSOL Multiphysics® software. The information provided may be out of date.
cluster computing
Posted Mar 6, 2013, 1:31 p.m. EST Cluster & Cloud Computing, Studies & Solvers Version 4.3a, Version 4.4 2 Replies
Can anyone help me figure out these errors and/or how to run a very large model on multiple nodes?
I am having a terrible time trying to run a very large model on multiple nodes of our cluster. I know that it can work, because I have successfully run smaller models on multiple nodes. But now I have a very large model, due to a very fine mesh, that will not run successfully. I keep getting memory errors. It's as if COMSOL does not have enough memory, but it also looks like COMSOL may not be utilizing all of the cores on each node. I cannot attach the log file because it is not a .mph file, so I have copied and pasted all of the information below.
I would be so grateful for any insight,
Kelley
12 bn124
12 bn120
12 bn119
12 bn118
running mpdallexit on bn124
LAUNCHED mpd on bn124 via
RUNNING: mpd on bn124
LAUNCHED mpd on bn120 via bn124
LAUNCHED mpd on bn119 via bn124
LAUNCHED mpd on bn118 via bn124
RUNNING: mpd on bn120
RUNNING: mpd on bn118
RUNNING: mpd on bn119
bn124
bn119
bn118
bn120
Warning: The number of allocated threads (48) exceeds the number of available physical cores (12)
Warning: The number of allocated threads (48) exceeds the number of available physical cores (12)
Warning: The number of allocated threads (48) exceeds the number of available physical cores (12)
Warning: The number of allocated threads (48) exceeds the number of available physical cores (12)
[0] MPI startup(): cannot open dynamic library libdat2.so.2
[0] MPI startup(): cannot open dynamic library libdat2.so
[0] MPI startup(): cannot open dynamic library libdat.so.1
[0] MPI startup(): cannot open dynamic library libdat.so
[2] MPI startup(): cannot open dynamic library libdat2.so.2
[2] MPI startup(): cannot open dynamic library libdat2.so
[2] MPI startup(): cannot open dynamic library libdat.so.1
[2] MPI startup(): cannot open dynamic library libdat.so
[3] MPI startup(): cannot open dynamic library libdat2.so.2
[3] MPI startup(): cannot open dynamic library libdat2.so
[3] MPI startup(): cannot open dynamic library libdat.so.1
[3] MPI startup(): cannot open dynamic library libdat.so
[1] MPI startup(): cannot open dynamic library libdat2.so.2
[1] MPI startup(): cannot open dynamic library libdat2.so
[1] MPI startup(): cannot open dynamic library libdat.so.1
[1] MPI startup(): cannot open dynamic library libdat.so
[0] MPI startup(): tcp data transfer mode
[1] MPI startup(): tcp data transfer mode
[2] MPI startup(): tcp data transfer mode
[3] MPI startup(): tcp data transfer mode
[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 11779 cn172 {0,1,2,3,4,5,6,7,8,9,10,11}
[0] MPI startup(): 1 27813 cn168 {0,1,2,3,4,5,6,7,8,9,10,11}
[0] MPI startup(): 2 10950 cn167 {0,1,2,3,4,5,6,7,8,9,10,11}
[0] MPI startup(): 3 17204 cn166 {0,1,2,3,4,5,6,7,8,9,10,11}
Node 0 is running on host: cn172
Node 0 has address: cn172
Node 1 is running on host: cn168
Node 1 has address: cn168
Node 2 is running on host: cn167
Node 2 has address: cn167
Node 3 is running on host: cn166
Node 3 has address: cn166
Warning: The total number of allocated threads (48) on host: cn168 exceeds the number of available physical cores (12)
Warning: The total number of allocated threads (48) on host: cn172 exceeds the number of available physical cores (12)
Warning: The total number of allocated threads (48) on host: cn167 exceeds the number of available physical cores (12)
Warning: The total number of allocated threads (48) on host: cn166 exceeds the number of available physical cores (12)
COMSOL 4.3a (Build: 161) starting in batch mode
*******************************************
***COMSOL 4.3.1.161 progress output file***
*******************************************
Tue Mar 05 16:35:19 PST 2013
Opening: /ibrix/home16/rabjohns/rundir/2.28mesh.025.mph
Open time: 12 s.
Running: Study 1
Number of vertex elements: 4
Number of edge elements: 400
Number of boundary elements: 10000
---------- Current Progress: 100 %
Memory: 563/563 8291/8291
Node 1:
Number of vertex elements: 4
Number of edge elements: 400
Number of boundary elements: 10000
Node 2:
Number of vertex elements: 4
Number of edge elements: 400
Number of boundary elements: 10000
Node 3:
Number of vertex elements: 4
Number of edge elements: 400
Number of boundary elements: 10000
----- Current Progress: 55 %
Memory: 575/575 8291/8291
------- Current Progress: 75 %
Memory: 604/604 8291/8291
Number of vertex elements: 8
Number of edge elements: 936
Number of boundary elements: 33600
Number of elements: 340000
---------- Current Progress: 100 %
Memory: 629/629 8355/8355
Minimum element quality: 1
Node 1:
Number of vertex elements: 8
Number of edge elements: 936
Number of boundary elements: 33600
Number of elements: 340000
Minimum element quality: 1
Node 2:
Number of vertex elements: 8
Number of edge elements: 936
Number of boundary elements: 33600
Number of elements: 340000
Minimum element quality: 1
Node 3:
Number of vertex elements: 8
Number of edge elements: 936
Number of boundary elements: 33600
Number of elements: 340000
Minimum element quality: 1
Current Progress: 0 %
Memory: 663/663 8355/8355
Memory: 900/900 8622/8622
---------- Current Progress: 100 %
Memory: 937/937 8610/8622
Time-Dependent Solver 1 in Solver 1 started at 5-Mar-2013 16:36:04.
Time-dependent solver (BDF)
Current Progress: 0 %
Memory: 1456/1456 9131/9131
Memory: 1520/1520 9195/9195
Warning: PARDISO is not distributed. Switching to MUMPS.
Memory: 2860/2860 10513/10513
Warning: PARDISO is not distributed. Switching to MUMPS.
Memory: 20575/20575 27794/27794
Memory: 22470/22470 30011/30011
Number of degrees of freedom solved for: 16726014.
Memory: 29744/29744 37274/37274
Memory: 35766/35766 43219/43219
Nonsymmetric matrix found.
Scales for dependent variables:
mod1.u: 9.2e-05
mod1.p: 1
Symmetric matrices found.
Format not changed since SOR line uses nonsymmetric storage.
Scales for dependent variables:
mod1.bc: 0.05
mod1.ac: 0.05
Memory: 43291/43291 50752/50752
Nonsymmetric matrix found.
Step Time Stepsize Res Jac Sol Order Tfail NLfail LinIt LinErr LinRes
Memory: 20186/43291 51079/51079
Memory: 35272/43291 83099/83099
Memory: 37667/43291 84978/84978
Memory: 38350/43291 85705/85705
Memory: 43061/43291 105054/105054
Memory: 47498/47498 111297/111297
Memory: 57658/57658 115600/115600
Memory: 59547/59547 117720/117720
Memory: 59832/59832 118281/118281
Memory: 61138/61138 116498/118281
Memory: 62323/62323 117852/118281
Memory: 64075/64075 119805/119805
Memory: 65421/65421 128049/128049
0 0 out 20 22 0 0
Group #1: 20 21 0
Group #2: 0 1 0 0 6.9e-310 6.9e-310
---------- Current Progress: 100 %
Memory: 19786/65421 27342/128049
Node 1:
Time-dependent solver (BDF)
Warning: PARDISO is not distributed. Switching to MUMPS.
Warning: PARDISO is not distributed. Switching to MUMPS.
Number of degrees of freedom solved for: 16726014.
Nonsymmetric matrix found.
Scales for dependent variables:
mod1.u: 9.2e-05
mod1.p: 1
Symmetric matrices found.
Format not changed since SOR line uses nonsymmetric storage.
Scales for dependent variables:
mod1.bc: 0.05
mod1.ac: 0.05
Nonsymmetric matrix found.
Step Time Stepsize Res Jac Sol Order Tfail NLfail LinIt LinErr LinRes
0 0 out 20 22 0 0
Group #1: 20 21 0
Group #2: 0 1 0 0 6.9e-310 6.9e-310
Node 2:
Time-dependent solver (BDF)
Warning: PARDISO is not distributed. Switching to MUMPS.
Warning: PARDISO is not distributed. Switching to MUMPS.
Number of degrees of freedom solved for: 16726014.
Nonsymmetric matrix found.
Scales for dependent variables:
mod1.u: 9.2e-05
mod1.p: 1
Symmetric matrices found.
Format not changed since SOR line uses nonsymmetric storage.
Scales for dependent variables:
mod1.bc: 0.05
mod1.ac: 0.05
Nonsymmetric matrix found.
Step Time Stepsize Res Jac Sol Order Tfail NLfail LinIt LinErr LinRes
0 0 out 20 22 0 0
Group #1: 20 21 0
Group #2: 0 1 0 0 3.2e-322 1.3e-311
Node 3:
Time-dependent solver (BDF)
Warning: PARDISO is not distributed. Switching to MUMPS.
Warning: PARDISO is not distributed. Switching to MUMPS.
Number of degrees of freedom solved for: 16726014.
Nonsymmetric matrix found.
Scales for dependent variables:
mod1.u: 9.2e-05
mod1.p: 1
Symmetric matrices found.
Format not changed since SOR line uses nonsymmetric storage.
Scales for dependent variables:
mod1.bc: 0.05
mod1.ac: 0.05
Nonsymmetric matrix found.
Step Time Stepsize Res Jac Sol Order Tfail NLfail LinIt LinErr LinRes
0 0 out 20 22 0 0
Group #1: 20 21 0
Group #2: 0 1 0 0 1.8e-322 6.3e-322
Time-Dependent Solver 1 in Solver 1: Solution time: 16046 s. (4 hours, 27 minutes, 26 seconds)
Exception:
com.comsol.util.exceptions.FlException: Failed to find consistent initial values
(rethrown as com.comsol.util.exceptions.FlException)
Messages:
The following feature has encountered a problem
Failed to find consistent initial values
Segregated group X#1
Out of memory LU factorization
Last time step is not converged
- Feature: Time-Dependent Solver 1 (sol1/t1)
- Error: Failed to find consistent initial values.
- Error on node 1: Failed_to_find_consistent_initial_values
- Error on node 2: Failed_to_find_consistent_initial_values
- Error on node 3: Failed_to_find_consistent_initial_values
Stack trace:
at com.comsol.solver.SolverOperation.addError(Unknown Source)
at com.comsol.solver.SolverOperation.execute(Unknown Source)
at com.comsol.model.internal.impl.SolverSequenceImpl.a(Unknown Source)
at com.comsol.model.internal.impl.SolverSequenceImpl.g(Unknown Source)
at com.comsol.model.internal.impl.SolverSequenceImpl$an.a(Unknown Source)
at com.comsol.model.internal.impl.SolverSequenceImpl$an.execute(Unknown Source)
at com.comsol.model.clientserver.ClientManagerImpl$d.call(Unknown Source)
at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: Exception:
com.comsol.util.exceptions.FlException: Failed to find consistent initial values
Messages:
Failed to find consistent initial values
Segregated group X#1
Out of memory LU factorization
Last time step is not converged
... 11 more
Saving: /ibrix/home16/rabjohns/rundir/2.28mesh.025_out.mph
Save time: 51 s.
Total time: 16142 s.
--- Job finished at: Tue Mar 5 21:04:28 PST 2013
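One way to put a number on the "Out of memory LU factorization" message is to pull the peaks out of the Memory: lines in the log above. The Python sketch below does that. It assumes the four numbers on each line are physical memory (current/peak) followed by virtual memory (current/peak), in MB, which is how I read COMSOL's progress output, so check that against the documentation for your version. The function name and sample lines are illustrative only. The sample lines are copied from the log.

```python
import re

# Matches lines such as "Memory: 65421/65421 128049/128049".
# Assumption: columns are physical current/peak, then virtual current/peak, in MB.
MEMORY_LINE = re.compile(r"Memory:\s*(\d+)/(\d+)\s+(\d+)/(\d+)")


def peak_memory_mb(log_text):
    """Return (peak_physical_mb, peak_virtual_mb) over all Memory lines, or None."""
    peaks = [
        (int(m.group(2)), int(m.group(4)))
        for m in MEMORY_LINE.finditer(log_text)
    ]
    if not peaks:
        return None
    return max(p for p, _ in peaks), max(v for _, v in peaks)


# A few Memory lines taken from the log above.
sample = """\
Memory: 43291/43291 50752/50752
Memory: 20186/43291 51079/51079
Memory: 65421/65421 128049/128049
Memory: 19786/65421 27342/128049
"""

physical, virtual = peak_memory_mb(sample)
print(f"peak physical: {physical} MB, peak virtual: {virtual} MB")
```

If the unit assumption holds, the process that wrote these lines peaked at roughly 65 GB physical and 128 GB virtual before the factorization failed. The Memory: lines appear only before the "Node 1:" section, so they presumably describe the first process alone. Comparing that peak with the RAM installed on one node would show whether the direct solver simply ran out of room.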
Hello Kelley Rabjohns
Your Discussion has gone 30 days without a reply. If you still need help with COMSOL and have an on-subscription license, please visit our Support Center for help.
If you do not hold an on-subscription license, you may find an answer in another Discussion or in the Knowledge Base.
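On the thread warnings near the top of the log: the allocation lists four hosts with 12 cores each, and every host reports 48 allocated threads against 12 physical cores. Since 4 x 12 = 48, one possible reading is that each process was given the job-wide core count instead of the per-node count. That is only a guess from the log, not something the post confirms. The Python sketch below redoes the arithmetic. The helper names are made up for illustration, and the input is assumed to be lines of the form "cores hostname", as at the start of the log.

```python
def parse_allocation(lines):
    """Parse scheduler lines like '12 bn124' into {host: cores}."""
    cores_by_host = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 2 and parts[0].isdigit():
            cores_by_host[parts[1]] = int(parts[0])
    return cores_by_host


def oversubscribed(cores_by_host, threads_per_process, processes_per_host=1):
    """Return {host: threads/cores} for hosts given more threads than cores."""
    result = {}
    for host, cores in cores_by_host.items():
        threads = threads_per_process * processes_per_host
        if threads > cores:
            result[host] = threads / cores
    return result


# Allocation lines as they appear at the start of the log.
allocation = parse_allocation(["12 bn124", "12 bn120", "12 bn119", "12 bn118"])

print(sum(allocation.values()))                            # total cores in the job
print(oversubscribed(allocation, threads_per_process=48))  # what the log reports
print(oversubscribed(allocation, threads_per_process=12))  # one thread per core
```

With 48 threads per process, every host comes out four times oversubscribed, which matches the warnings. With 12 threads per process, none do. If the launch command sets the per-process core count (in COMSOL batch commands this is, as far as I know, the -np option, with -nn for the number of processes), using the per-node value there would be the first thing to try. Fixing the thread count would address the warnings, though not necessarily the memory error.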