Problem Background
The ve short-video rendering service is written in C/C++ on top of OpenGL; server-side rendering currently supports Linux only. The rendering flow is: the parent process downloads the data and assets, then forks a rendering child process to do the actual rendering.
Recently we noticed that, with very low probability in production, the rendering child process would hang: stuck, making no progress, and emitting no log output at all.
The problem was very hard to reproduce under the existing production workflow, so the initial workaround was for the parent process to detect that the rendering process had failed to start, kill it with a signal, and retry the task.
I recently set aside some time to analyze and fix the problem properly. It turned out to be quite interesting, so I am writing it down here.
Finding the Problem
Reproducing the Problem
Because the rendering service has to download assets and prepare data up front, it is hard to get multiple tasks starting at the same time in production, so I wrote a dedicated multithreaded test program that runs multiple tasks locally.
The basic flow: start several threads at the same time, have each thread fork a child process to render, then wait for all threads to finish; a minimal sketch of this harness follows below.
After repeated runs, the hang reproduced with high probability.
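A minimal sketch of that harness in C with pthreads (render_one_task is a hypothetical stand-in for the real rendering entry point, and the thread/task count is arbitrary):

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

enum { NUM_TASKS = 8 };  /* arbitrary; more concurrency raises the odds */

/* Hypothetical stand-in for the real rendering entry point. */
static void render_one_task(int task_id)
{
    printf("rendering task %d in pid %d\n", task_id, (int)getpid());
}

static void *render_thread(void *arg)
{
    int task_id = (int)(intptr_t)arg;
    pid_t pid = fork();              /* each thread forks a rendering child */
    if (pid == 0) {
        render_one_task(task_id);
        _exit(0);                    /* skip atexit handlers in the child */
    }
    waitpid(pid, NULL, 0);           /* wait for this thread's child */
    return NULL;
}

int main(void)
{
    pthread_t tids[NUM_TASKS];
    /* Start all threads at (roughly) the same moment so fork() often races
       against another thread that is holding an internal glibc lock. */
    for (int i = 0; i < NUM_TASKS; i++)
        pthread_create(&tids[i], NULL, render_thread, (void *)(intptr_t)i);
    for (int i = 0; i < NUM_TASKS; i++)
        pthread_join(tids[i], NULL);
    return 0;
}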
Locating the Problem
A stable reproduction is the precondition for locating any bug. After slightly modifying the test program and building it with debug symbols, I could start it normally, attach gdb to the stuck process, and find exactly where it was blocked:
slayer@slayer-B250M-D3H:~$ sudo gdb -p 12680
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
...
[Thread debugging using libthread_db enabled]
...
(gdb) info threads
Id Target Id Frame
* 1 Thread 0x7ff6287a6700 (LWP 12680) "SXVideoEngineDe" __lll_lock_wait_private ()
at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
(gdb) thread 1
[Switching to thread 1 (Thread 0x7ff6287a6700 (LWP 12680))]
#0 __lll_lock_wait_private () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
95 in ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S
(gdb) where
#0 __lll_lock_wait_private () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
#1 0x00007ff6421bfc76 in __tz_convert (timer=0x7ff6424c8ac0 <tzset_lock>, use_localtime=1, tp=0x7ff6287a39d0)
at tzset.c:616
#2 0x00000000013b2c01 in el::base::utils::DateTime::buildTimeInfo (currTime=0x7ff6287a39a0, timeInfo=0x7ff6287a39d0)
at /home/slayer/workspace/SXVideoEngine-Core/Render/log/easylogging++.cc:1232
#3 0x00000000013b2751 in el::base::utils::DateTime::timevalToString[abi:cxx11](timeval, char const*, el::base::SubsecondPrecision const*) (tval=..., format=0x37bdbb0 "%Y-%M-%d %H:%m:%s,%g", ssPrec=0x37f2cdc)
at /home/slayer/workspace/SXVideoEngine-Core/Render/log/easylogging++.cc:1190
#4 0x00000000013b26da in el::base::utils::DateTime::getDateTime[abi:cxx11](char const*, el::base::SubsecondPrecision const*) (format=0x37bdbb0 "%Y-%M-%d %H:%m:%s,%g", ssPrec=0x37f2cdc)
at /home/slayer/workspace/SXVideoEngine-Core/Render/log/easylogging++.cc:1184
#5 0x00000000013ba36c in el::base::DefaultLogBuilder::build[abi:cxx11](el::LogMessage const*, bool) const (
this=0x37b9ff0, logMessage=0x7ff6287a4030, appendNewLine=true)
at /home/slayer/workspace/SXVideoEngine-Core/Render/log/easylogging++.cc:2412
#6 0x00000000013b9d94 in el::base::DefaultLogDispatchCallback::handle (this=0x37f8b20, data=0x7ff6287a3ee0)
at /home/slayer/workspace/SXVideoEngine-Core/Render/log/easylogging++.cc:2212
#7 0x00000000013bb2cf in el::base::LogDispatcher::dispatch (this=0x7ff6287a3f70)
at /home/slayer/workspace/SXVideoEngine-Core/Render/log/easylogging++.cc:2499
#8 0x00000000013bc0b5 in el::base::Writer::triggerDispatch (this=0x7ff6287a42e0)
at /home/slayer/workspace/SXVideoEngine-Core/Render/log/easylogging++.cc:2632
#9 0x00000000013bbdf8 in el::base::Writer::processDispatch (this=0x7ff6287a42e0)
at /home/slayer/workspace/SXVideoEngine-Core/Render/log/easylogging++.cc:2613
#10 0x00000000004392b8 in el::base::Writer::~Writer (this=0x7ff6287a42e0, __in_chrg=<optimized out>)
at /home/slayer/workspace/SXVideoEngine-Core/Render/log/easylogging++.h:3213
#11 0x00000000013aa518 in VEPrintLog (level=3, str=0x1ab9448 "Child process ve_render_process_start() : %s",
args=0x7ff6287a53a0) at /home/slayer/workspace/SXVideoEngine-Core/Render/log/corelog.cpp:135
#12 0x00000000013aa0ad in VELogInfo (str=0x1ab9448 "Child process ve_render_process_start() : %s")
at /home/slayer/workspace/SXVideoEngine-Core/Render/log/corelog.cpp:90
#13 0x000000000180b09d in ve_render_process_start (id=0x7ff6287a5740,
handler=0x43abb5 <MultiThreadVideoTest::Handler(unsigned long*, char const*)>)
at /home/slayer/workspace/SXVideoEngine-Core/Render/server/RenderProcess.cpp:597
#14 0x000000000043acea in MultiThreadVideoTest::RenderThread (context=0x7ffe95b37320)
at /home/slayer/workspace/SXVideoEngine-Core/Render/SXVideoEngineDemo.cpp:342
#15 0x00007ff6470e46ba in start_thread (arg=0x7ff6287a6700) at pthread_create.c:333
#16 0x00007ff64220941d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
(gdb)
This pins the deadlock to the following frames:
#0 __lll_lock_wait_private () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
#1 0x00007ff6421bfc76 in __tz_convert (timer=0x7ff6424c8ac0 <tzset_lock>, use_localtime=1, tp=0x7ff6287a39d0)
at tzset.c:616
#2 0x00000000013b2c01 in el::base::utils::DateTime::buildTimeInfo (currTime=0x7ff6287a39a0, timeInfo=0x7ff6287a39d0)
at /home/slayer/workspace/SXVideoEngine-Core/Render/log/easylogging++.cc:1232
Reading the source confirms that easylogging++ formats the current time when building each log line, and that this calls localtime_r. Looking at glibc's localtime.c, the call chain is:
localtime_r -> __localtime_r -> __localtime64_r -> __tz_convert
struct tm *
__localtime64_r (const __time64_t *t, struct tm *tp)
{
  return __tz_convert (*t, 1, tp);
}

/* Provide a 32-bit variant if needed.  */
#if __TIMESIZE != 64
struct tm *
__localtime_r (const time_t *t, struct tm *tp)
{
  __time64_t t64 = *t;
  return __localtime64_r (&t64, tp);
}
libc_hidden_def (__localtime64_r)
#endif

weak_alias (__localtime_r, localtime_r)
Continuing into the source of __tz_convert in tzset.c, we can see that it acquires a lock via __libc_lock_lock (tzset_lock); this is exactly where the process is deadlocked.
/* Return the `struct tm' representation of TIMER in the local timezone.
   Use local time if USE_LOCALTIME is nonzero, UTC otherwise.  */
struct tm *
__tz_convert (__time64_t timer, int use_localtime, struct tm *tp)
{
  long int leap_correction;
  int leap_extra_secs;

  __libc_lock_lock (tzset_lock);

  /* Update internal database according to current TZ setting.
     POSIX.1 8.3.7.2 says that localtime_r is not required to set tzname.
     This is a good idea since this allows at least a bit more parallelism.  */
  tzset_internal (tp == &_tmbuf && use_localtime);

  if (__use_tzfile)
    __tzfile_compute (timer, use_localtime, &leap_correction,
                      &leap_extra_secs, tp);
  else
    {
      if (! __offtime (timer, 0, tp))
        tp = NULL;
      else
        __tz_compute (timer, tp, use_localtime);
      leap_correction = 0L;
      leap_extra_secs = 0;
    }

  __libc_lock_unlock (tzset_lock);

  if (tp)
    {
      if (! use_localtime)
        {
          tp->tm_isdst = 0;
          tp->tm_zone = "GMT";
          tp->tm_gmtoff = 0L;
        }

      if (__offtime (timer, tp->tm_gmtoff - leap_correction, tp))
        tp->tm_sec += leap_extra_secs;
      else
        tp = NULL;
    }

  return tp;
}
__libc_lock_lock (tzset_lock);
performs the actual lock acquisition. tzset_lock itself is defined via the following macro:
__libc_lock_define_initialized (static, tzset_lock)
After macro expansion this becomes:
static __libc_lock_t tzset_lock;
and __libc_lock_t is ultimately defined in libc-lock.h as:
typedef pthread_mutex_t __libc_lock_t;
So this is confirmed to be a static, process-wide lock (a plain pthread_mutex_t).
- Why is the lock needed? __tz_convert reads and writes a number of global variables during the computation, such as _tmbuf, __tzname, __daylight and __timezone, so without the lock it would not be thread-safe.
- Why does it deadlock? fork() gives the child a copy of the parent's memory, including every lock together with its state at the instant of the fork. If some thread in the parent happens to hold tzset_lock at that exact moment, the child's copy of the lock starts out already locked; when the child later tries to acquire it, it waits forever, because no thread in the child will ever release it, and the parent unlocking afterwards only affects the parent's own copy.
That is the cause and root of the deadlock; the mechanism can be reproduced in isolation, as the sketch below shows. What remains is to pick the most suitable fix.
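A self-contained repro of just the mechanism (not the service code): one thread calls localtime_r in a loop, taking glibc's internal tzset_lock over and over, while the main thread forks children that also call localtime_r. Run long enough, a fork eventually snapshots the lock in the held state and the child blocks in __lll_lock_wait_private, exactly as in the backtrace above:

#include <pthread.h>
#include <stdio.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Repeatedly takes tzset_lock inside glibc via localtime_r. */
static void *hammer(void *arg)
{
    (void)arg;
    for (;;) {
        time_t now = time(NULL);
        struct tm tm;
        localtime_r(&now, &tm);
    }
    return NULL;
}

int main(void)
{
    pthread_t tid;
    pthread_create(&tid, NULL, hammer, NULL);

    for (int i = 0; i < 100000; i++) {
        pid_t pid = fork();
        if (pid == 0) {
            /* If the fork snapshot caught tzset_lock held, this call
               never returns and the child hangs forever. */
            time_t now = time(NULL);
            struct tm tm;
            localtime_r(&now, &tm);
            _exit(0);
        }
        waitpid(pid, NULL, 0);  /* the parent hanging here flags a stuck child */
    }
    puts("no deadlock observed in this run");
    return 0;
}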
Solution
There are many possible approaches. In the end I replaced localtime_r outright, borrowing the lock-free nolocks_localtime implementation from Redis 5.0.
Reference: https://github.com/antirez/redis/blob/5.0.0/src/localtime.c
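The Redis function does the calendar math itself with plain arithmetic, touching no globals and taking no internal locks, so it is safe to call even in a freshly forked child. A simplified sketch in the same spirit (not Redis's exact code: DST and leap seconds are ignored, t is assumed non-negative, and the UTC offset must be captured once at startup via tzset() before any threads or forks exist):

#include <stdio.h>
#include <time.h>

static int is_leap_year(long year)
{
    return (year % 4 == 0 && year % 100 != 0) || year % 400 == 0;
}

/* Lock-free localtime in the spirit of Redis 5.0's nolocks_localtime():
   pure arithmetic on the epoch seconds, no globals, no tzset_lock.
   tz_offset is seconds east of UTC (i.e. -timezone on glibc). */
static void nolocks_localtime(struct tm *tmp, time_t t, long tz_offset)
{
    static const int mdays[] = {31,28,31,30,31,30,31,31,30,31,30,31};
    t += tz_offset;                        /* shift from UTC to local time */
    long days = t / (3600 * 24);
    long secs = t % (3600 * 24);
    tmp->tm_hour = (int)(secs / 3600);
    tmp->tm_min  = (int)((secs % 3600) / 60);
    tmp->tm_sec  = (int)(secs % 60);
    tmp->tm_wday = (int)((days + 4) % 7);  /* 1970-01-01 was a Thursday */
    long year = 1970;
    while (days >= (is_leap_year(year) ? 366 : 365)) {
        days -= is_leap_year(year) ? 366 : 365;
        year++;
    }
    tmp->tm_year = (int)(year - 1900);
    tmp->tm_yday = (int)days;
    int mon = 0;
    for (;;) {
        int dim = mdays[mon] + (mon == 1 && is_leap_year(year));
        if (days < dim)
            break;
        days -= dim;
        mon++;
    }
    tmp->tm_mon  = mon;
    tmp->tm_mday = (int)(days + 1);
    tmp->tm_isdst = 0;
}

int main(void)
{
    tzset();                     /* once, before any threads or forks */
    long tz_offset = -timezone;  /* glibc's timezone is seconds west of UTC */
    struct tm tm;
    nolocks_localtime(&tm, time(NULL), tz_offset);
    printf("%04d-%02d-%02d %02d:%02d:%02d\n",
           tm.tm_year + 1900, tm.tm_mon + 1, tm.tm_mday,
           tm.tm_hour, tm.tm_min, tm.tm_sec);
    return 0;
}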
Summary
Use fork with great care in a multithreaded program that takes locks: any lock used by both the parent and the child may be copied in a locked state and deadlock the child.
To sum up the possible ways out:
- Avoid code that shares locks across fork in the first place.
- Call execv immediately after fork, discarding the parent's address-space contents and starting from a fresh process image; the new process is completely independent of the parent, so inherited lock state cannot cause trouble.
- Use pthread_atfork to ensure no shared lock is held across the fork: the prepare handler acquires the locks, and the parent and child handlers release them (see the sketch below). In practice this is inconvenient: when the locks live inside third-party libraries it is hard to cover all of them, and forcing locks open can leave the state they protect inconsistent.
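A sketch of the pthread_atfork pattern for a lock we do control (app_lock here is a hypothetical application mutex; glibc-internal locks such as tzset_lock cannot be reached this way, which is exactly the limitation noted above):

#include <pthread.h>
#include <sys/types.h>
#include <unistd.h>

/* A mutex the application owns and can therefore manage across fork(). */
static pthread_mutex_t app_lock = PTHREAD_MUTEX_INITIALIZER;

/* prepare: runs in the parent before fork(); taking the lock guarantees
   no other thread holds it (and the data it guards is consistent) at the
   instant the child's memory snapshot is taken. */
static void before_fork(void)       { pthread_mutex_lock(&app_lock); }

/* parent/child: run after fork() in each process; both copies of the
   lock get released, so the child does not inherit a locked mutex. */
static void after_fork_parent(void) { pthread_mutex_unlock(&app_lock); }
static void after_fork_child(void)  { pthread_mutex_unlock(&app_lock); }

int main(void)
{
    pthread_atfork(before_fork, after_fork_parent, after_fork_child);

    pid_t pid = fork();
    if (pid == 0) {
        /* Safe: app_lock is guaranteed to be unlocked here. */
        pthread_mutex_lock(&app_lock);
        pthread_mutex_unlock(&app_lock);
        _exit(0);
    }
    return 0;
}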