Problems starting cluster and crashes on SLES 12

Idar_Borlaug · July 3, 2015, 6:58am

Hi

I am sorry to have to ask this, but I can’t seem to get my cluster to work (this is my first time trying).

I have 3 nodes on VirtualBox machines running SLES 12.

./run-node.sh --log /tmp/eslog --db /tmp/db --cluster-dns escluster.mooo.com --run-projections All --cluster-size 3 --int-ip 192.168.33.11 --ext-ip 192.168.33.11 --cluster-gossip-port 3077

./run-node.sh --log /tmp/eslog --db /tmp/db --cluster-dns escluster.mooo.com --run-projections All --cluster-size 3 --int-ip 192.168.33.13 --ext-ip 192.168.33.13 --cluster-gossip-port 3077

./run-node.sh --log /tmp/eslog --db /tmp/db --cluster-dns escluster.mooo.com --run-projections All --cluster-size 3 --int-ip 192.168.33.12 --ext-ip 192.168.33.12 --cluster-gossip-port 3077

They can’t seem to find each other. Telnet between the nodes seems to work on the other ports, but not on 3077.

I also get lots of these errors after a few minutes of running:
[ERROR] FATAL UNHANDLED EXCEPTION: System.NullReferenceException: Object reference not set to an instance of an object

at System.Threading.Timer+Scheduler.SchedulerThread () [0x00000] in :0

at System.Threading.Thread.StartInternal () [0x00000] in :0

And

Stacktrace:

Native stacktrace:

./clusternode() [0x612962]

./clusternode() [0x5beb0b]

./clusternode() [0x4584f3]

/lib64/libpthread.so.0(+0xf890) [0x7fae3cbcf890]

Debug info from gdb:

warning: /etc/gdbinit.d/gdb-heap.py: No such file or directory

[New LWP 21830]

[New LWP 21826]

[New LWP 21825]

[New LWP 21824]

[New LWP 21823]

[New LWP 21822]

[New LWP 21821]

[New LWP 21820]

[New LWP 21819]

[New LWP 21818]

[New LWP 21817]

[New LWP 21816]

[New LWP 21815]

[New LWP 21813]

[New LWP 21812]

[New LWP 21811]

[New LWP 21809]

[Thread debugging using libthread_db enabled]

Using host libthread_db library “/lib64/libthread_db.so.1”.

0x00007fae3cbcc05f in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

Id Target Id Frame

18 Thread 0x7fae3c733700 (LWP 21809) “Finalizer” 0x00007fae3cbce010 in sem_wait () from /lib64/libpthread.so.0

17 Thread 0x7fae3adff700 (LWP 21811) “clusternode” 0x00007fae3c8f0d2d in read () from /lib64/libc.so.6

16 Thread 0x7fae3abfe700 (LWP 21812) “Timer-Scheduler” 0x00007fae3cbcf489 in waitpid () from /lib64/libpthread.so.0

15 Thread 0x7fae3afff700 (LWP 21813) “Threadpool moni” 0x00007fae3c909de4 in clock_nanosleep () from /lib64/libc.so.6

14 Thread 0x7fae3a5ff700 (LWP 21815) “clusternode” 0x00007fae3cbcc408 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

13 Thread 0x7fae3a3fe700 (LWP 21816) “clusternode” 0x00007fae3c909de4 in clock_nanosleep () from /lib64/libc.so.6

12 Thread 0x7fae3a1fd700 (LWP 21817) “clusternode” 0x00007fae3cbcc408 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

11 Thread 0x7fae39ffc700 (LWP 21818) “clusternode” 0x00007fae3cbcc408 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

10 Thread 0x7fae39dfb700 (LWP 21819) “clusternode” 0x00007fae3cbcc408 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

9 Thread 0x7fae39bfa700 (LWP 21820) “clusternode” 0x00007fae3cbcc408 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

8 Thread 0x7fae399f9700 (LWP 21821) “clusternode” 0x00007fae3cbcc408 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

7 Thread 0x7fae397f8700 (LWP 21822) “clusternode” 0x00007fae3cbcc408 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

6 Thread 0x7fae3a77b700 (LWP 21823) “clusternode” 0x00007fae3c8fd663 in epoll_wait () from /lib64/libc.so.6

5 Thread 0x7fae394ef700 (LWP 21824) “IO Threadpool w” 0x00007fae3cbce0f0 in sem_timedwait () from /lib64/libpthread.so.0

4 Thread 0x7fae394ae700 (LWP 21825) “clusternode” 0x00007fae3c909de4 in clock_nanosleep () from /lib64/libc.so.6

3 Thread 0x7fae392ad700 (LWP 21826) “Threadpool work” 0x00007fae3cbce0f0 in sem_timedwait () from /lib64/libpthread.so.0

2 Thread 0x7fae38aa9700 (LWP 21830) “Threadpool work” 0x00007fae3cbce0f0 in sem_timedwait () from /lib64/libpthread.so.0

1 Thread 0x7fae3d6f9780 (LWP 21808) “clusternode” 0x00007fae3cbcc05f in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

Thread 18 (Thread 0x7fae3c733700 (LWP 21809)):

#0 0x00007fae3cbce010 in sem_wait () from /lib64/libpthread.so.0

#1 0x0000000000564fa7 in mono_sem_wait (sem=sem@entry=0x1d193c0 <finalizer_sem>, alertable=alertable@entry=1) at mono-semaphore.c:101

#2 0x0000000000494a95 in finalizer_thread (unused=) at gc.c:1077

#3 0x0000000000532577 in start_wrapper_internal (data=) at threads.c:663

#4 start_wrapper (data=) at threads.c:710

#5 0x0000000000566a1e in inner_start_thread (arg=0x7ffc2fa107f0) at mono-threads-posix.c:88

#6 0x00007fae3cbc80a4 in start_thread () from /lib64/libpthread.so.0

#7 0x00007fae3c8fd08d in clone () from /lib64/libc.so.6

Thread 17 (Thread 0x7fae3adff700 (LWP 21811)):

#0 0x00007fae3c8f0d2d in read () from /lib64/libc.so.6

#1 0x0000000040e4dcbd in ?? ()

#2 0x00007fae2c002650 in ?? ()

#3 0x00007fae3adfee40 in ?? ()

#4 0x00007fae3c343f18 in ?? ()

#5 0x00007fae3c342d00 in ?? ()

#6 0x00007fae3c342af8 in ?? ()

#7 0x00007fae2c0025f0 in ?? ()

#8 0x0000000040e4dc30 in ?? ()

#9 0x00007fae3adfebc0 in ?? ()

#10 0x00007fae3adfeb00 in ?? ()

#11 0x0000000040e4dc30 in ?? ()

#12 0x00007fae3c343f38 in ?? ()

#13 0x0000000040e4db20 in ?? ()

#14 0x0000000000000000 in ?? ()

Any ideas on what is wrong?

pieter.germishuys · July 3, 2015, 7:46am

The nodes gossip on the internal HTTP port. By default the internal HTTP port is 2112.
In your case, the last argument needs to be set to --cluster-gossip-port 2112.

This information is available in the documentation:

http://docs.geteventstore.com/server/3.0.5/cluster-without-manager-nodes/
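
To make the change concrete, here is a minimal dry-run sketch of the three start commands with the gossip port set to 2112. It reuses the flags and IP addresses from the original post unchanged; the `node_cmd` helper and the `GOSSIP_PORT` variable are names made up for this illustration, and the script only prints the commands rather than starting anything.

```shell
# Dry run: print the start command for each node instead of executing it.
GOSSIP_PORT=2112   # default internal HTTP port, as stated in the reply above

# node_cmd is a hypothetical helper; $1 is the node IP, used for both
# --int-ip and --ext-ip exactly as in the original post.
node_cmd() {
  printf '%s\n' "./run-node.sh --log /tmp/eslog --db /tmp/db --cluster-dns escluster.mooo.com --run-projections All --cluster-size 3 --int-ip $1 --ext-ip $1 --cluster-gossip-port $GOSSIP_PORT"
}

for ip in 192.168.33.11 192.168.33.12 192.168.33.13; do
  node_cmd "$ip"
done
```

Each printed line is meant to be run on the node that owns that IP address.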

Greg_Young1 · July 3, 2015, 7:52am

The other error has come up on the list before; IIRC it is a glibc mismatch with the prebuilt binaries.
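
One rough way to look into the glibc theory is to compare the glibc version on the host with the newest `GLIBC_x.y` symbol version the binary references. This is only a sketch: the helper names `host_glibc` and `highest_glibc_tag` are invented here, it assumes `objdump` (binutils) is installed and that `./clusternode` is the bundled binary from the tar.gz, and it can only confirm or rule out the simplest kind of mismatch.

```shell
# Print the glibc version of this host. getconf ships with glibc, so this
# only works on glibc-based Linux systems.
host_glibc() {
  getconf GNU_LIBC_VERSION | awk '{print $2}'
}

# Read text on stdin (for example `objdump -T` output) and print the
# highest GLIBC_x.y[.z] symbol version it mentions.
highest_glibc_tag() {
  grep -o 'GLIBC_[0-9][0-9.]*' | sed 's/^GLIBC_//' |
    sort -t. -k1,1n -k2,2n -k3,3n | tail -n 1
}

# Only inspect the binary when it exists and objdump is available.
if [ -x ./clusternode ] && command -v objdump >/dev/null 2>&1; then
  echo "host glibc:      $(host_glibc)"
  echo "binary requires: $(objdump -T ./clusternode | highest_glibc_tag)"
fi
```

If the binary references a newer version than the host provides, that would support the mismatch theory; if not, the cause is probably elsewhere.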

Idar_Borlaug · July 3, 2015, 8:27am

Thanks, I don’t know where I got gossip port 3077 from.

What version of glibc do I need to run the tar.gz distribution? It fails on SLES 12, and also in docker https://registry.hub.docker.com/u/wkruse/eventstore/dockerfile/ on SLES 12.

Or do I need to compile it myself?

Greg_Young1 · July 3, 2015, 8:31am

My guess is that compiling on the box will make the issue go away.

Idar_Borlaug · July 3, 2015, 11:26am

I am now trying to compile and package Event Store on SLES 12.

That wasn’t that easy.

When I get to the packaging step, I get this error from mkbundle:

cc -o clusternode -Wall -D_REENTRANT -I/usr/lib64/pkgconfig/../../include/mono-2.0 clusternode.c -L/usr/lib64/pkgconfig/../../lib64 -Wl,-Bstatic -lmonosgen-2.0 -Wl,-Bdynamic -lmonosgen-2.0 -lm -lrt -ldl -lpthread clusternode.a

/usr/lib64/gcc/x86_64-suse-linux/4.8/../../../../x86_64-suse-linux/bin/ld: cannot find -lmonosgen-2.0

collect2: error: ld returned 1 exit status

That’s because there is no static libmonosgen-2.0 library installed.

OK, so then I tried to run the exe file directly and got:

[12223,01,11:26:35.213] Exiting with exit code: 4.

Exit reason: Appears that we are running in linux with a version 2 build of mono. This is generally not a good idea.We recommend running with 3.0 or higher (3.2 especially). If you really want to run with this version of mono use --force to override this error.

mono --version gives: Mono JIT compiler version 4.0.2 (Stable 4.0.2.5/c99aa0c Wed Jun 24 05:31:11 EDT 2015)

Unfortunately, I am new to using mono.

Greg_Young1 · July 3, 2015, 11:28am

Ah, that's been updated. You are running version 4 of mono, and the check
was for == 3. Just run with --force as it says and it should be OK.

I wouldn't worry about static linking for now (it's not needed to test).

Idar_Borlaug · July 3, 2015, 11:33am

ah ok

Greg_Young1 · July 3, 2015, 11:37am

It's been updated to check >= 3. There are quite a few old checks like
these from the time mono was transitioning between the Boehm GC and sgen
(Boehm was a really, really bad idea to run with!)
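
To illustrate the difference between the old `== 3` check and the `>= 3` form, here is a small sketch in shell. It is not Event Store's actual code (the real check lives inside the server); `mono_major` and `mono_is_supported` are hypothetical helpers that just parse the first line of `mono --version` and compare the major version.

```shell
# Extract the major version from `mono --version` output read on stdin.
mono_major() {
  sed -n 's/^Mono JIT compiler version \([0-9][0-9]*\)\..*/\1/p' | head -n 1
}

# Succeed when the major version is 3 or newer (the ">= 3" form of the check).
# An "== 3" check would wrongly reject mono 4.x here.
mono_is_supported() {
  [ -n "$1" ] && [ "$1" -ge 3 ]
}

# Only query mono when it is actually installed.
if command -v mono >/dev/null 2>&1; then
  major=$(mono --version | mono_major)
  if mono_is_supported "$major"; then
    echo "mono major version $major: ok"
  else
    echo "mono major version '$major': too old or not recognised"
  fi
fi
```

With the version string posted earlier in the thread (4.0.2), the parsed major version is 4, which passes `>= 3` but would have failed `== 3`.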

Idar_Borlaug · July 3, 2015, 11:44am

ah

mono EventStore.ClusterNode.exe --force --log /tmp/eslog --db /tmp/db --cluster-dns escluster.mooo.com --run-projections All --cluster-size 3 --int-ip 192.168.33.11 --ext-ip 192.168.33.11 --cluster-gossip-port 2112

Seems to still crash

[28344,10,11:44:13.063] ELECTIONS: (V=79) VIEWCHANGE FROM [192.168.33.13:2112, {0885d0a1-5788-481b-ad94-c4807203188c}].

Stacktrace:

Native stacktrace:

mono() [0x4b90c2]

mono() [0x5101be]

mono() [0x428e19]

/lib64/libpthread.so.0(+0xf890) [0x7fb320177890]

Debug info from gdb:

warning: /etc/gdbinit.d/gdb-heap.py: No such file or directory

warning: File “/usr/bin/mono-sgen-gdb.py” auto-loading has been declined by your `auto-load safe-path’ set to “$debugdir:$datadir/auto-load”.

To enable execution of this file add

add-auto-load-safe-path /usr/bin/mono-sgen-gdb.py

line to your configuration file “/home/vagrant/.gdbinit”.

To completely disable this security protection add

set auto-load safe-path /

line to your configuration file “/home/vagrant/.gdbinit”.

For more information about this security protection see the

“Auto-loading safe path” section in the GDB manual. E.g., run from the shell:

info “(gdb)Auto-loading safe path”

[New LWP 28366]

[New LWP 28365]

[New LWP 28361]

[New LWP 28360]

[New LWP 28359]

[New LWP 28358]

[New LWP 28357]

[New LWP 28356]

[New LWP 28355]

[New LWP 28354]

[New LWP 28353]

[New LWP 28352]

[New LWP 28351]

[New LWP 28349]

[New LWP 28348]

[New LWP 28347]

[New LWP 28345]

[Thread debugging using libthread_db enabled]

Using host libthread_db library “/lib64/libthread_db.so.1”.

0x00007fb32017405f in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

Id Target Id Frame

18 Thread 0x7fb31fa93700 (LWP 28345) “Finalizer” 0x00007fb320176010 in sem_wait () from /lib64/libpthread.so.0

17 Thread 0x7fb31d043700 (LWP 28347) “mono” 0x00007fb31fc81d2d in read () from /lib64/libc.so.6

16 Thread 0x7fb31ce42700 (LWP 28348) “Timer-Scheduler” 0x00007fb320177489 in waitpid () from /lib64/libpthread.so.0

15 Thread 0x7fb31cb05700 (LWP 28349) “Threadpool moni” 0x00007fb31fc9ade4 in clock_nanosleep () from /lib64/libc.so.6

14 Thread 0x7fb2f7ffe700 (LWP 28351) “mono” 0x00007fb320174408 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

13 Thread 0x7fb2f7bf1700 (LWP 28352) “mono” 0x00007fb31fc9ade4 in clock_nanosleep () from /lib64/libc.so.6

12 Thread 0x7fb2f79f0700 (LWP 28353) “mono” 0x00007fb320174408 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

11 Thread 0x7fb2f76e7700 (LWP 28354) “mono” 0x00007fb320174408 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

10 Thread 0x7fb2f74e6700 (LWP 28355) “mono” 0x00007fb320174408 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

9 Thread 0x7fb2f72e5700 (LWP 28356) “mono” 0x00007fb320174408 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

8 Thread 0x7fb2f70e4700 (LWP 28357) “mono” 0x00007fb320174408 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

7 Thread 0x7fb2f6ee3700 (LWP 28358) “mono” 0x00007fb320174408 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

6 Thread 0x7fb2f6ce2700 (LWP 28359) “mono” 0x00007fb31fc8e663 in epoll_wait () from /lib64/libc.so.6

5 Thread 0x7fb2f6c1f700 (LWP 28360) “IO Threadpool w” 0x00007fb3201760f0 in sem_timedwait () from /lib64/libpthread.so.0

4 Thread 0x7fb2f6ba9700 (LWP 28361) “mono” 0x00007fb31fc9ade4 in clock_nanosleep () from /lib64/libc.so.6

3 Thread 0x7fb2f6367700 (LWP 28365) “Threadpool work” 0x00007fb3201760f0 in sem_timedwait () from /lib64/libpthread.so.0

2 Thread 0x7fb2f6771700 (LWP 28366) “Threadpool work” 0x00007fb3201760f0 in sem_timedwait () from /lib64/libpthread.so.0

1 Thread 0x7fb320ca1780 (LWP 28344) “mono” 0x00007fb32017405f in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

Thread 18 (Thread 0x7fb31fa93700 (LWP 28345)):

#0 0x00007fb320176010 in sem_wait () from /lib64/libpthread.so.0

#1 0x000000000062a2f7 in mono_sem_wait ()

#2 0x00000000005ac4de in finalizer_thread ()

#3 0x0000000000590614 in start_wrapper ()

#4 0x000000000062f196 in inner_start_thread ()

#5 0x00007fb3201700a4 in start_thread () from /lib64/libpthread.so.0

#6 0x00007fb31fc8e08d in clone () from /lib64/libc.so.6

Thread 17 (Thread 0x7fb31d043700 (LWP 28347)):

#0 0x00007fb31fc81d2d in read () from /lib64/libc.so.6

#1 0x0000000041c349e5 in ?? ()

#2 0x00007fb310001450 in ?? ()

#3 0x00007fb31d042e30 in ?? ()

#4 0x00007fb31f747fa0 in ?? ()

#5 0x00007fb31f747070 in ?? ()

#6 0x00007fb31f746e28 in ?? ()

#7 0x00007fb310000bd0 in ?? ()

#8 0x0000000041c34960 in ?? ()

#9 0x00007fb31d042aa0 in ?? ()

#10 0x00007fb31d0429e0 in ?? ()

#11 0x0000000041c34960 in ?? ()

#12 0x00007fb31f747fc0 in ?? ()

#13 0x0000000041c34858 in ?? ()

#14 0x0000000000000000 in ?? ()

Thread 16 (Thread 0x7fb31ce42700 (LWP 28348)):

#0 0x00007fb320177489 in waitpid () from /lib64/libpthread.so.0

#1 0x00000000004b914f in mono_handle_native_sigsegv ()

#2 0x00000000005101be in mono_arch_handle_altstack_exception ()

#3 0x0000000000428e19 in mono_sigsegv_signal_handler ()

#4

#5 0x0000000000000000 in ?? ()

../../gdb/dwarf2-frame.c:692: internal-error: Unknown CFI encountered.

A problem internal to GDB has been detected,

further debugging may prove unreliable.

Quit this debugging session? (y or n) [answered Y; input not from terminal]

../../gdb/dwarf2-frame.c:692: internal-error: Unknown CFI encountered.

A problem internal to GDB has been detected,

further debugging may prove unreliable.

Create a core file of GDB? (y or n) [answered Y; input not from terminal]

Greg_Young1 · July 3, 2015, 11:45am

Can you try running it as described here and send up a full backtrace?

http://goodenoughsoftware.net/2014/03/01/debugging-segmentation-faults-in-mono/

This way we will have symbols etc. as opposed to just memory positions.
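
For anyone following along, a session of that kind might look roughly like the sketch below. It is not taken from the linked post; it is an assumption based on Mono's general gdb guidance, where the signals Mono uses internally are passed through silently. The command-file location is a temporary file created by the script, and the script only prints the gdb invocation (a dry run) rather than starting the node.

```shell
# Write a gdb command file. Mono uses these signals internally, so they are
# passed through without stopping; a crash should still stop the debugger.
cmds=$(mktemp)
cat > "$cmds" <<'EOF'
handle SIGXCPU SIG33 SIG35 SIGPWR nostop noprint
run
thread apply all backtrace
EOF

# Node arguments copied from the command posted earlier in the thread.
node_args="--force --log /tmp/eslog --db /tmp/db --cluster-dns escluster.mooo.com --run-projections All --cluster-size 3 --int-ip 192.168.33.11 --ext-ip 192.168.33.11 --cluster-gossip-port 2112"

# Dry run: print the invocation instead of executing it.
echo "gdb -batch -x $cmds --args mono --debug EventStore.ClusterNode.exe $node_args"
```

`thread apply all backtrace` prints the stack of every thread, which is usually more useful than a plain `backtrace` of the current one. Frames shown as `?? ()` are typically JIT-compiled managed code that gdb cannot name; Mono's runtime has a `mono_pmip` helper that can be called from the gdb prompt with such an address to look up the managed method, assuming the runtime was built with it available.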

Idar_Borlaug · July 3, 2015, 12:09pm

Hum… not sure this helps that much:

VND {20c47a2c-a849-4acf-be8d-a8b0bf054297} [Unknown, 192.168.33.12:1112, n/a, 192.168.33.12:1113, n/a, 192.168.33.12:2112, 192.168.33.12:2113] -1/0/0/E-1@-1:{00000000-0000-0000-0000-000000000000} | 2015-07-03 12:09:18.476

New:

MAN {00000000-0000-0000-0000-000000000000} [Manager, 192.168.33.13:2112, 192.168.33.13:2112] | 2015-07-03 12:09:18.496

VND {20c47a2c-a849-4acf-be8d-a8b0bf054297} [Unknown, 192.168.33.12:1112, n/a, 192.168.33.12:1113, n/a, 192.168.33.12:2112, 192.168.33.12:2113] -1/0/0/E-1@-1:{00000000-0000-0000-0000-000000000000} | 2015-07-03 12:09:18.476

MAN {00000000-0000-0000-0000-000000000000} [Manager, 192.168.33.11:2112, 192.168.33.11:2112] | 2015-07-03 12:09:18.496

Greg_Young1 · July 3, 2015, 12:10pm

And you are at a gdb prompt, yes? Type backtrace.

Idar_Borlaug · July 3, 2015, 12:11pm

(gdb) backtrace

#0 0x000000004019ee75 in ?? ()

#1 0x0000000000000000 in ?? ()

Idar_Borlaug · July 3, 2015, 12:12pm

It’s random when it happens: sometimes it runs for a few minutes, and sometimes it crashes just after starting.

This one was at boot:

[20140,10,12:09:18.523] ========== [192.168.33.12:2112] IS UNKNOWN!!! WHOA!!!

[20140,10,12:09:18.575] ELECTIONS: STARTING ELECTIONS.

[20140,10,12:09:18.576] ELECTIONS: (V=0) SHIFT TO LEADER ELECTION.

[20140,10,12:09:18.578] ELECTIONS: (V=0) VIEWCHANGE FROM [192.168.33.12:2112, {20c47a2c-a849-4acf-be8d-a8b0bf054297}].

[20140,10,12:09:18.581] SLOW BUS MSG [MainBus]: StartElections - 51ms. Handler: ElectionsService.

[20140,10,12:09:18.581] SLOW QUEUE MSG [MainQueue]: StartElections - 51ms. Q: 0/2.

Program received signal SIGSEGV, Segmentation fault.

[Switching to Thread 0x7fffceea1700 (LWP 20151)]

0x000000004019ee75 in ?? ()

(gdb) backtrace

#0 0x000000004019ee75 in ?? ()

#1 0x0000000000000000 in ?? ()