Tracing the root cause of abnormal loadavg values (Android)

Author: 邢少年 · Updated: 2022-12-23 · Category: Programming Languages

proc

  • NAME:

proc - process information pseudo-filesystem

  • DESCRIPTION:

The proc filesystem is a pseudo-filesystem which provides an interface to kernel data structures.
It is commonly mounted at /proc. Most of it is read-only, but some files allow kernel variables to
be changed.

In short, proc is not a real on-disk filesystem: it is an interface through which the kernel exposes its internal data, normally mounted at /proc. Most of its files are read-only, but writing to some of them changes kernel variables.

The text above is quoted from man proc; see that page for the full details.

loadavg

cat /proc/loadavg

/proc/loadavg
  The first three fields in this file are load average figures giving the number of
  jobs in the run queue (state R) or waiting for disk I/O (state D) averaged over 1, 5,
  and 15 minutes.

That is, the first three numbers are the average number of tasks that were either runnable (state R) or waiting on disk I/O (state D) over the last 1, 5, and 15 minutes.

  The fourth field consists of two numbers separated by a slash (/). The first of these
  is the number of currently runnable kernel scheduling entities (processes, threads).
  The value after the slash is the number of kernel scheduling entities that currently
  exist on the system.

  The fifth field is the PID of the process that was most recently created on the system.

So the fourth field has the form runnable/total, and the fifth field is the PID of the newest process.
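
To make the field layout concrete, here is a minimal user-space sketch in C that splits one /proc/loadavg line into its five fields. The struct loadavg_sample and the functions parse_loadavg() and read_loadavg() are names invented for this illustration; they are not part of any kernel or libc API.

```c
#include <stdio.h>

/* Fields of one /proc/loadavg line. The struct and field names are illustrative. */
struct loadavg_sample {
    double avg1, avg5, avg15; /* 1-, 5- and 15-minute load averages */
    long runnable;            /* scheduling entities currently runnable */
    long total;               /* scheduling entities that currently exist */
    long last_pid;            /* PID of the most recently created process */
};

/* Parse a line of the form "a b c runnable/total pid".
 * Returns 0 on success, -1 if the line does not match. */
int parse_loadavg(const char *line, struct loadavg_sample *out)
{
    int n = sscanf(line, "%lf %lf %lf %ld/%ld %ld",
                   &out->avg1, &out->avg5, &out->avg15,
                   &out->runnable, &out->total, &out->last_pid);
    return n == 6 ? 0 : -1;
}

/* Read and parse the live file on a Linux or Android system.
 * Returns 0 on success, -1 on any error. */
int read_loadavg(struct loadavg_sample *out)
{
    char buf[128];
    FILE *f = fopen("/proc/loadavg", "r");

    if (!f)
        return -1;
    if (!fgets(buf, sizeof buf, f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    return parse_loadavg(buf, out);
}
```

Calling read_loadavg() from a small main() on the device and printing the struct gives the same numbers as cat /proc/loadavg, only already split into typed fields.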

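1: Origin of the problem

While working on large-screen devices I ran into a problem: the numbers in loadavg were extremely high. An 8-core phone typically shows 3 to 4, yet the device on my desk reported more than 70. This puzzled me for a long time, so I recently set aside a solid block of time to study how the value is calculated.

The kernel's loadavg.c (fs/proc/loadavg.c) contains the following function, which produces the final output.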

static int loadavg_proc_show(struct seq_file *m, void *v)
{
   unsigned long avnrun[3];
   get_avenrun(avnrun, FIXED_1/200, 0);
   seq_printf(m, "%lu.%02lu %lu.%02lu %lu.%02lu %ld/%d %d\n",
      LOAD_INT(avnrun[0]), LOAD_FRAC(avnrun[0]),  // 1-minute average
      LOAD_INT(avnrun[1]), LOAD_FRAC(avnrun[1]),  // 5-minute average
      LOAD_INT(avnrun[2]), LOAD_FRAC(avnrun[2]),  // 15-minute average
      // nr_running() returns the runnable entities; nr_threads is the number of all existing entities
      nr_running() , nr_threads,
      // PID of the most recently created process
      task_active_pid_ns(current)->last_pid);
   return 0;
}

The code above shows that the load averages themselves come from get_avenrun(), so let's look at its implementation next.

unsigned long avenrun[3];
EXPORT_SYMBOL(avenrun); /* should be removed */
/**
 * get_avenrun - get the load average array
 * @loads: pointer to dest load array
 * @offset:    offset to add
 * @shift: shift count to shift the result left
 *
 * These values are estimates at best, so no need for locking.
 */
void get_avenrun(unsigned long *loads, unsigned long offset, int shift)
{
    // the data comes from the avenrun array
   loads[0] = (avenrun[0] + offset) << shift;
   loads[1] = (avenrun[1] + offset) << shift;
   loads[2] = (avenrun[2] + offset) << shift;
}
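
The values in avenrun[] are fixed-point numbers, which is why loadavg_proc_show() needs LOAD_INT and LOAD_FRAC to print them. The sketch below reproduces that conversion in user space. The macro definitions are copied from my reading of the 4.9 sources (FSHIFT and FIXED_1 in include/linux/sched.h, LOAD_INT and LOAD_FRAC in fs/proc/loadavg.c) and should be checked against your own tree; split_load() is a hypothetical helper, not a kernel function.

```c
#include <stdio.h>

#define FSHIFT   11               /* bits of fractional precision */
#define FIXED_1  (1UL << FSHIFT)  /* 1.0 in fixed point, i.e. 2048 */
#define LOAD_INT(x)  ((x) >> FSHIFT)
#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1 - 1)) * 100)

/* Split a raw avenrun[] value into the whole part and the two decimal digits
 * that /proc/loadavg prints. Hypothetical helper for illustration only. */
void split_load(unsigned long raw, unsigned long *whole, unsigned long *hundredths)
{
    /* Same offset as get_avenrun(avnrun, FIXED_1/200, 0): adding about 0.005
     * makes the truncation below round to the nearest hundredth. */
    unsigned long v = raw + FIXED_1 / 200;

    *whole = LOAD_INT(v);
    *hundredths = LOAD_FRAC(v);
}

/* Print a raw value in the same "%lu.%02lu" format the kernel uses. */
void print_load(unsigned long raw)
{
    unsigned long whole, hundredths;

    split_load(raw, &whole, &hundredths);
    printf("%lu.%02lu\n", whole, hundredths);
}
```

For example, a raw value of 70 * FIXED_1 + FIXED_1 / 2 splits into 70 and 50, which is printed as 70.50.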

2: Where the data comes from

Next we need to find where avenrun[] is assigned. Let's start with where the underlying data comes from.

  • Kernel version 4.9; code paths kernel/sched/core.c and kernel/sched/loadavg.c.

2.1: scheduler_tick

/*
 * This function gets called by the timer code, with HZ frequency.
 * We call it with interrupts disabled.
 * As the comment says, this is driven by the timer code at a frequency of HZ.
 */
void scheduler_tick(void)
{
   int cpu = smp_processor_id();
   struct rq *rq = cpu_rq(cpu);
   struct task_struct *curr = rq->curr;
   sched_clock_tick();
   raw_spin_lock(&rq->lock);
   walt_set_window_start(rq);
   walt_update_task_ravg(rq->curr, rq, TASK_UPDATE,
         walt_ktime_clock(), 0);
   update_rq_clock(rq);
   curr->sched_class->task_tick(rq, curr, 0);
   cpu_load_update_active(rq);
   calc_global_load_tick(rq); // the load accounting is triggered here
   raw_spin_unlock(&rq->lock);
   perf_event_task_tick();
#ifdef CONFIG_SMP
   rq->idle_balance = idle_cpu(cpu);
   trigger_load_balance(rq);
#endif
   rq_last_tick_reset(rq);
   if (curr->sched_class == &fair_sched_class)
      check_for_migration(rq, curr);
}

2.2: calc_global_load_tick

/*
 * Called from scheduler_tick() to periodically update this CPU's
 * active count.
 */
void calc_global_load_tick(struct rq *this_rq)
{
   long delta;
    // Skip redundant updates: do nothing until jiffies reaches calc_load_update (jiffies is explained at the end)
   if (time_before(jiffies, this_rq->calc_load_update))
      return;
   // Compute how much this CPU's active task count has changed
   delta  = calc_load_fold_active(this_rq, 0);
   if (delta)
       // Fold the change into calc_load_tasks; atomic_long_add is an atomic operation provided by the kernel
      atomic_long_add(delta, &calc_load_tasks);
    // Time of the next load update; LOAD_FREQ is defined in include/linux/sched.h
    //   #define LOAD_FREQ   (5*HZ+1)   /* 5 sec intervals */
   this_rq->calc_load_update += LOAD_FREQ;
}

2.3: calc_load_fold_active

long calc_load_fold_active(struct rq *this_rq, long adjust)
{
   long nr_active, delta = 0;
   nr_active = this_rq->nr_running - adjust; // runnable tasks on this runqueue; adjust is passed as 0 here, so it can be ignored
   nr_active += (long)this_rq->nr_uninterruptible; // plus the uninterruptible tasks accounted to this runqueue
    // calc_load_active holds the previous nr_running + nr_uninterruptible total; if it changed, compute the difference
   if (nr_active != this_rq->calc_load_active) {
      delta = nr_active - this_rq->calc_load_active;
      this_rq->calc_load_active = nr_active;
   }
    // Done; after returning, the caller adds delta to calc_load_tasks.
   return delta;
}
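
The delta bookkeeping is easier to follow outside the kernel. Below is a simplified user-space model of it: each CPU remembers what it last contributed and only reports the change, and the changes are summed into one global counter. struct rq_sim, fold_active() and sample_all() are invented for this sketch, and a plain long stands in for the kernel's atomic_long_t.

```c
/* Simplified stand-in for the per-CPU runqueue; only the fields used here. */
struct rq_sim {
    long nr_running;         /* runnable tasks on this CPU */
    long nr_uninterruptible; /* state-D tasks accounted to this CPU */
    long calc_load_active;   /* total this CPU contributed at the last sample */
};

/* Global sum over all CPUs; a plain long instead of atomic_long_t. */
static long calc_load_tasks;

/* Same logic as calc_load_fold_active() with adjust == 0:
 * report only the change since the previous sample. */
long fold_active(struct rq_sim *rq)
{
    long nr_active = rq->nr_running + rq->nr_uninterruptible;
    long delta = nr_active - rq->calc_load_active;

    rq->calc_load_active = nr_active;
    return delta;
}

/* One sampling pass over all CPUs, mirroring what calc_global_load_tick()
 * does on each CPU. Returns the updated global count. */
long sample_all(struct rq_sim *rqs, int ncpu)
{
    for (int i = 0; i < ncpu; i++)
        calc_load_tasks += fold_active(&rqs[i]);
    return calc_load_tasks;
}
```

With two CPUs holding {2 running, 30 uninterruptible} and {1 running, 40 uninterruptible}, the first pass yields 73. A later pass with unchanged counters adds nothing, because every delta is 0.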

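3: How the value is computed

Having seen where the data comes from, let's walk through how it is turned into the load average.

The first half of this path goes through the low-level high-resolution timer module, which I don't know very well, so I will only describe it briefly; interested readers can dig into it themselves. (The file is tick-sched.c; PlantUML does not allow "-" in a class name.)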

3.1: tick_sched_timer

/*
 * High resolution timer specific code
 */
 // Check whether the kernel enables high-resolution timers: CONFIG_HIGH_RES_TIMERS=y
#ifdef CONFIG_HIGH_RES_TIMERS  
/*
 * We rearm the timer until we get disabled by the idle code.
 * Called with interrupts disabled.
 */
 // tick_sched_timer is the expiry callback of the high-resolution timer, so it runs at the end of every tick period
static enum hrtimer_restart tick_sched_timer(struct hrtimer *timer)
{
   struct tick_sched *ts =
      container_of(timer, struct tick_sched, sched_timer);
   struct pt_regs *regs = get_irq_regs();
   ktime_t now = ktime_get();
   tick_sched_do_timer(now);
    ...
   return HRTIMER_RESTART;
}

3.2: calc_global_load

I'm skipping the timer functions in between; they are beyond the scope of this article, and I don't fully understand their logic either.

/*
 * calc_load - update the avenrun load estimates 10 ticks after the
 * CPUs have updated calc_load_tasks.
 *
 * Called from the global timer code.
 */
void calc_global_load(unsigned long ticks)
{
   long active, delta;
    // The update time seen earlier, plus 10 ticks, so the total interval is 5 s + 10 ticks
   if (time_before(jiffies, calc_load_update + 10))
      return;
   /*
    * Fold the 'old' idle-delta to include all NO_HZ cpus.
    */
    // Collect the task deltas that CPUs missed while they were idle in NO_HZ mode
   delta = calc_load_fold_idle();
   if (delta)
      atomic_long_add(delta, &calc_load_tasks); // fold them in
   active = atomic_long_read(&calc_load_tasks); // atomically read the global counter filled in earlier
   active = active > 0 ? active * FIXED_1 : 0; // convert to fixed point (multiply by FIXED_1)
   avenrun[0] = calc_load(avenrun[0], EXP_1, active); // 1-minute load
   avenrun[1] = calc_load(avenrun[1], EXP_5, active); // 5-minute load
   avenrun[2] = calc_load(avenrun[2], EXP_15, active); // 15-minute load
   calc_load_update += LOAD_FREQ; // schedule the next update
   /*
    * In case we idled for multiple LOAD_FREQ intervals, catch up in bulk.
    */
    // Having folded in the NO_HZ task deltas, the intervals missed in NO_HZ mode must be accounted for as well, otherwise the averages would be off.
   calc_global_nohz();
}

The code above mentions NO_HZ mode, a CPU idle concept that is explained at the end of this article. Next comes the actual load formula.

3.3: Calculation rule: calc_load

/*
 * a1 = a0 * e + a * (1 - e)
 */
static unsigned long
calc_load(unsigned long load, unsigned long exp, unsigned long active)
{
   unsigned long newload;
   newload = load * exp + active * (FIXED_1 - exp);
   if (active >= load)
      newload += FIXED_1-1;
   return newload / FIXED_1;
}

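The comment states the rule clearly and it is not complicated: a1 is the new load, a0 the previous load, a the current number of active tasks, and e the decay factor. Overall this matches what man proc says: the load average counts nr_running plus nr_uninterruptible. Both counters live in struct rq, which the scheduler code in core.c operates on; rq is one of the key data structures of the per-CPU run queue.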
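
To see what the formula does over time, the sketch below repeats the update with a constant number of active tasks. calc_load() is the same arithmetic as the kernel function above; the EXP_* constants are the ones I find in include/linux/sched.h for 4.9, and simulate_load() is a hypothetical helper added for this illustration.

```c
#define FSHIFT   11
#define FIXED_1  (1UL << FSHIFT)  /* 1.0 in fixed point */
#define EXP_1    1884UL           /* 1/exp(5sec/1min) in fixed point */
#define EXP_5    2014UL           /* 1/exp(5sec/5min) */
#define EXP_15   2037UL           /* 1/exp(5sec/15min) */

/* Same arithmetic as the kernel's calc_load(): a1 = a0 * e + a * (1 - e). */
unsigned long calc_load(unsigned long load, unsigned long exp, unsigned long active)
{
    unsigned long newload = load * exp + active * (FIXED_1 - exp);

    if (active >= load)
        newload += FIXED_1 - 1; /* round up while the load is rising */
    return newload / FIXED_1;
}

/* Apply `steps` consecutive 5-second updates while `tasks` tasks stay active.
 * Hypothetical helper; returns the resulting fixed-point load. */
unsigned long simulate_load(unsigned long start, unsigned long exp,
                            unsigned long tasks, int steps)
{
    unsigned long active = tasks * FIXED_1;
    unsigned long load = start;

    for (int i = 0; i < steps; i++)
        load = calc_load(load, exp, active);
    return load;
}
```

Starting from 0 with 70 tasks permanently counted as active, simulate_load(0, EXP_1, 70, 300) settles at exactly 70 * FIXED_1. In other words, a steady population of 70 running or uninterruptible tasks is enough to hold loadavg at 70, whether or not those tasks use any CPU time.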

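Analyzing the problem

Back to the original question: why does our device reach a load of 70+ without grinding to a halt? The code above does not answer that directly, but with the logic in hand the rest is straightforward.

  • 1: I printed the nr_running and nr_uninterruptible task counts and found that nr_running was normal; the problem was in nr_uninterruptible.
  • 2: Since the nr_uninterruptible count was the culprit, were that many tasks on our device really waiting on I/O? If they were, the device would be very sluggish. I captured a systrace, and everything looked normal.
  • 3: At this point I could only turn to a search engine. Searching for nr_uninterruptible turned up some clues.

Brief findings

Traditional UNIX load averages did not count uninterruptible tasks. Linux added them after it was argued that a load figure which ignores tasks waiting on I/O does not reflect the real load on the system.

Later, in articles by several Linux experts, I read that when an NFS mount runs into trouble, every thread accessing that filesystem ends up in the uninterruptible state and is counted in nr_uninterruptible. This part is very close to the kernel internals. (PS: if anyone can recommend good kernel books, please do.)

  • Conclusion: because the nr_uninterruptible count is abnormal, the load average does not reflect the device's real condition.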
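
The counts in step 1 were taken inside the kernel. As a rough user-space cross-check, one can also look at the state letter in /proc/<pid>/stat (or /proc/<pid>/task/<tid>/stat for individual threads), where D marks uninterruptible sleep. The sketch below only covers the parsing step and assumes the caller has already read the stat lines; stat_state() and count_state_d() are names made up for this example.

```c
#include <string.h>

/* Return the state letter (R, S, D, ...) from one /proc/<pid>/stat line,
 * or 0 if the line is malformed. The layout is "pid (comm) state ...". */
char stat_state(const char *line)
{
    /* comm may contain spaces or parentheses, so locate the LAST ')'. */
    const char *p = strrchr(line, ')');

    if (!p || p[1] != ' ' || p[2] == '\0')
        return 0;
    return p[2];
}

/* Count how many of the given stat lines are in uninterruptible sleep (D). */
int count_state_d(const char *const *lines, int n)
{
    int count = 0;

    for (int i = 0; i < n; i++)
        if (stat_state(lines[i]) == 'D')
            count++;
    return count;
}
```

This is a snapshot taken from user space, so it will not match the kernel counter exactly, but a large and persistent number of D-state threads that belong to one driver or filesystem is a strong hint about where the inflated load comes from.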

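Takeaways and summary

  • 1: The HZ mentioned in the scheduler_tick comment is the frequency of the periodic timer interrupt (the tick). It is tied to kernel config options such as CONFIG_HZ_250 and CONFIG_HZ_1000: with CONFIG_HZ_1000=y, CONFIG_HZ=1000, the kernel takes 1000 timer ticks per second, one every 1 s / 1000 = 1 ms. (CONFIG_HZ=250 is a common setting.)
  • 2: jiffies is the number of timer ticks counted so far; one jiffy lasts 1 s / HZ.
  • 3: struct rq is too long to paste in full, so only part of it is shown. It is defined in kernel/sched/sched.h; take a look if you are interested.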
   struct rq *rq = cpu_rq(cpu);
/*
 * This is the main, per-CPU runqueue data structure.
 *
 * Locking rule: those places that want to lock multiple runqueues
 * (such as the load balancing or the thread migration code), lock
 * acquire operations must be ordered by ascending &runqueue.
 */
struct rq {
   /* runqueue lock: */
   raw_spinlock_t lock;
   /*
    * nr_running and cpu_load should be in the same cacheline because
    * remote CPUs use both these fields when doing load calculation.
    */
   unsigned int nr_running; // <- counted in the load average
#ifdef CONFIG_NUMA_BALANCING
   unsigned int nr_numa_running;  
   unsigned int nr_preferred_running;
#endif
   #define CPU_LOAD_IDX_MAX 5
   unsigned long cpu_load[CPU_LOAD_IDX_MAX];
   unsigned int misfit_task;
#ifdef CONFIG_NO_HZ_COMMON
#ifdef CONFIG_SMP
   unsigned long last_load_update_tick;
#endif /* CONFIG_SMP */
   unsigned long nohz_flags;
#endif /* CONFIG_NO_HZ_COMMON */
#ifdef CONFIG_NO_HZ_FULL
   unsigned long last_sched_tick;
#endif
#ifdef CONFIG_CPU_QUIET
   /* time-based average load */
   u64 nr_last_stamp;
   u64 nr_running_integral;
   seqcount_t ave_seqcnt;
#endif
   /* capture load from *all* tasks on this cpu: */
   struct load_weight load;
   unsigned long nr_load_updates;
   u64 nr_switches;
   struct cfs_rq cfs;
   struct rt_rq rt;
   struct dl_rq dl;
#ifdef CONFIG_FAIR_GROUP_SCHED
   /* list of leaf cfs_rq on this cpu: */
   struct list_head leaf_cfs_rq_list;
   struct list_head *tmp_alone_branch;
#endif /* CONFIG_FAIR_GROUP_SCHED */
   /*
    * This is part of a global counter where only the total sum
    * over all CPUs matters. A task can increase this counter on
    * one CPU and if it got migrated afterwards it may decrease
    * it on another CPU. Always updated under the runqueue lock:
    */
   unsigned long nr_uninterruptible; // <- counted in the load average
   struct task_struct *curr, *idle, *stop;
   unsigned long next_balance;
   struct mm_struct *prev_mm;
   unsigned int clock_skip_update;
   u64 clock;
   u64 clock_task;
   atomic_t nr_iowait;
#ifdef CONFIG_SMP
   struct root_domain *rd;
   struct sched_domain *sd;
   unsigned long cpu_capacity;
   unsigned long cpu_capacity_orig;
   struct callback_head *balance_callback;
   unsigned char idle_balance;
   /* For active balancing */
   int active_balance;
   int push_cpu;
   struct task_struct *push_task;
   struct cpu_stop_work active_balance_work;
   /* cpu of this runqueue: */
   int cpu;
   int online;
    ...
};
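  • 4: High-resolution timers give the kernel timers with nanosecond-level resolution. Kernel config: CONFIG_HIGH_RES_TIMERS=y.
  • 5: NO_HZ stops the periodic timer tick while a CPU is idle, which reduces power consumption. Kernel config: CONFIG_NO_HZ=y and CONFIG_NO_HZ_IDLE=y. Conversely, if the device is not power-sensitive and runs from external power, this mode can be turned off to improve performance.
  • 6: Extracting the kernel config on Android: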
adb pull /proc/config.gz .
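
The relationship between HZ, jiffies and LOAD_FREQ from items 1 and 2 can be checked with a few lines of C. This is a user-space sketch: HZ is passed in as a parameter instead of being a compile-time constant, and ticks_to_ms(), load_freq_ticks() and ticks_before() are illustrative names. ticks_before() follows the idea behind the kernel's time_before() macro, which compares tick counters through a signed difference so that the result stays correct when the counter wraps around.

```c
/* Milliseconds covered by `ticks` timer ticks at `hz` ticks per second. */
unsigned long ticks_to_ms(unsigned long ticks, unsigned long hz)
{
    return ticks * 1000UL / hz;
}

/* LOAD_FREQ, i.e. (5*HZ+1), with HZ supplied at run time. */
unsigned long load_freq_ticks(unsigned long hz)
{
    return 5 * hz + 1;
}

/* Wraparound-safe "a is earlier than b" for tick counters.
 * Relies on the usual two's-complement conversion to long. */
int ticks_before(unsigned long a, unsigned long b)
{
    return (long)(a - b) < 0;
}
```

With HZ=250, LOAD_FREQ is 1251 ticks, or 5004 ms; with HZ=1000 it is 5001 ticks, or 5001 ms. Either way the load is sampled roughly every 5 seconds.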

Original article: https://juejin.cn/post/7169417417599401998
