
Tracing the Origin of a Problem Caused by Anomalous loadavg Data (Android)

Author: 邢少年 · Updated: 2022-12-23

proc

  • NAME:

proc - process information pseudo-filesystem

  • DESCRIPTION

The proc filesystem is a pseudo-filesystem which provides an interface to kernel data structures. It is commonly mounted at /proc. Most of it is read-only, but some files allow kernel variables to be changed.

In other words, proc is a pseudo-filesystem that provides an interface to kernel data structures. It is normally mounted at /proc on the device; most of its files are read-only, but some can be used to change kernel variables.

The exact meaning of each entry can be looked up with man proc; the information above was taken from the man page.

loadavg

cat /proc/loadavg

/proc/loadavg
The first three fields in this file are load average figures giving the number of jobs in the run queue (state R) or waiting for disk I/O (state D) averaged over 1, 5, and 15 minutes.

The first three numbers in this file are load averages: the average number of tasks in the run queue (state R) or waiting for disk I/O (state D) over the last 1, 5, and 15 minutes.

The first of these is the number of currently runnable kernel scheduling entities (processes, threads). The value after the slash is the number of kernel scheduling entities that currently exist on the system.

In the fourth field, the number before the slash is the count of currently runnable kernel scheduling entities (a scheduling entity being a process or thread); the number after the slash is the count of kernel scheduling entities that currently exist on the system.

The fifth field is the PID of the process that was most recently created on the system.

The fifth field is the PID of the most recently created process on the system.
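
To make the field layout concrete, here is a small user-space sketch (my own example, not from the article's sources) that reads /proc/loadavg and splits it into the five fields described above. A typical line looks like 0.52 0.58 0.59 2/1024 31974 (values are illustrative only):

#include <stdio.h>

int main(void)
{
   double avg1, avg5, avg15;
   int runnable, total, last_pid;
   FILE *f = fopen("/proc/loadavg", "r");

   if (!f)
      return 1;
   // field order: 1-min, 5-min, 15-min averages, runnable/total entities, last PID
   if (fscanf(f, "%lf %lf %lf %d/%d %d",
              &avg1, &avg5, &avg15, &runnable, &total, &last_pid) == 6)
      printf("1m=%.2f 5m=%.2f 15m=%.2f runnable=%d threads=%d last_pid=%d\n",
             avg1, avg5, avg15, runnable, total, last_pid);
   fclose(f);
   return 0;
}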

1: Origin of the problem

While working in the large-screen device space I ran into a problem: the values in loadavg were absurdly high. Where an 8-core phone sits around 3 or 4, the device on my desk reported a loadavg of 70+. The question bothered me for a long time, and recently I set aside a solid block of time to study how this number is actually calculated.

The kernel's loadavg.c contains the following function; as we can see, it is the one that ultimately produces the output.

static int loadavg_proc_show(struct seq_file *m, void *v)
{
   unsigned long avnrun[3];
   get_avenrun(avnrun, FIXED_1/200, 0);
   seq_printf(m, "%lu.%02lu %lu.%02lu %lu.%02lu %ld/%d %d\n",
      LOAD_INT(avnrun[0]), LOAD_FRAC(avnrun[0]),  // 1-minute average
      LOAD_INT(avnrun[1]), LOAD_FRAC(avnrun[1]),  // 5-minute average
      LOAD_INT(avnrun[2]), LOAD_FRAC(avnrun[2]),  // 15-minute average
      // runnable entities come from nr_running(); nr_threads is the total number of existing entities
      nr_running(), nr_threads,
      // PID of the most recently created process
      task_active_pid_ns(current)->last_pid);
   return 0;
}
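
The output works in fixed point. The constants and the LOAD_INT/LOAD_FRAC macros used above are defined in the 4.9 sources (include/linux/sched.h and fs/proc/loadavg.c; exact locations shift between kernel versions) roughly as follows. The FIXED_1/200 offset passed to get_avenrun() simply rounds the value to the nearest 1/100 before it is split into an integer part and a two-digit fraction:

#define FSHIFT      11              /* nr of bits of precision */
#define FIXED_1     (1 << FSHIFT)   /* 1.0 in fixed-point */
#define LOAD_FREQ   (5*HZ+1)        /* 5 sec intervals */
#define EXP_1       1884            /* 1/exp(5sec/1min) as fixed-point */
#define EXP_5       2014            /* 1/exp(5sec/5min) */
#define EXP_15      2037            /* 1/exp(5sec/15min) */

#define LOAD_INT(x)  ((x) >> FSHIFT)
#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)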

From the code above, the function that actually fetches the load averages is get_avenrun(); let's look at its implementation next.

unsigned long avenrun[3];
EXPORT_SYMBOL(avenrun); /* should be removed */
/**
 * get_avenrun - get the load average array
 * @loads: pointer to dest load array
 * @offset:    offset to add
 * @shift: shift count to shift the result left
 *
 * These values are estimates at best, so no need for locking.
 */
void get_avenrun(unsigned long *loads, unsigned long offset, int shift)
{
    // the data comes straight from the avenrun[] array
   loads[0] = (avenrun[0] + offset) << shift;
   loads[1] = (avenrun[1] + offset) << shift;
   loads[2] = (avenrun[2] + offset) << shift;
}

2: Where the data comes from

Next we look for where avenrun[] is assigned, starting from where the data originates.

  • Kernel version 4.9; code paths kernel/sched/core.c and kernel/sched/loadavg.c.

2.1: scheduler_tick

/*
 * This function gets called by the timer code, with HZ frequency.
 * We call it with interrupts disabled.
 * As the comment says, this function is driven by the timer code at HZ frequency.
 */
void scheduler_tick(void)
{
   int cpu = smp_processor_id();
   struct rq *rq = cpu_rq(cpu);
   struct task_struct *curr = rq->curr;
   sched_clock_tick();
   raw_spin_lock(&rq->lock);
   walt_set_window_start(rq);
   walt_update_task_ravg(rq->curr, rq, TASK_UPDATE,
         walt_ktime_clock(), 0);
   update_rq_clock(rq);
   curr->sched_class->task_tick(rq, curr, 0);
   cpu_load_update_active(rq);
   calc_global_load_tick(rq); // this is the call we are after
   raw_spin_unlock(&rq->lock);
   perf_event_task_tick();
#ifdef CONFIG_SMP
   rq->idle_balance = idle_cpu(cpu);
   trigger_load_balance(rq);
#endif
   rq_last_tick_reset(rq);
   if (curr->sched_class == &fair_sched_class)
      check_for_migration(rq, curr);
}

2.2: calc_global_load_tick

/*
 * Called from scheduler_tick() to periodically update this CPU's
 * active count.
 */
void calc_global_load_tick(struct rq *this_rq)
{
   long delta;
    // skip duplicate load updates; the filtering is done via jiffies (jiffies is explained later on)
   if (time_before(jiffies, this_rq->calc_load_update)) 
      return;
   // compute the change since the last update
   delta  = calc_load_fold_active(this_rq, 0);
   if (delta)
       // fold the delta into calc_load_tasks; atomic_long_add() is one of the kernel's atomic primitives
      atomic_long_add(delta, &calc_load_tasks);
    // the time of the next load update; LOAD_FREQ is defined in include/linux/sched.h:
    //   #define LOAD_FREQ   (5*HZ+1)   /* 5 sec intervals */
   this_rq->calc_load_update += LOAD_FREQ;
}
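
For reference, calc_load_tasks and the global calc_load_update (the rq->calc_load_update above is a separate per-CPU copy of the same deadline) are declared in kernel/sched/loadavg.c; shown here from memory of the 4.9 tree, so treat the exact declarations as approximate:

atomic_long_t calc_load_tasks;      /* global sum of R + D tasks folded in by every CPU */
unsigned long calc_load_update;     /* jiffies timestamp of the next global update */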

2.3: calc_load_fold_active

long calc_load_fold_active(struct rq *this_rq, long adjust)
{
   long nr_active, delta = 0;
   nr_active = this_rq->nr_running - adjust; // count the runqueue's nr_running tasks; adjust is passed as 0 here, so it can be ignored
   nr_active += (long)this_rq->nr_uninterruptible; // add the runqueue's nr_uninterruptible tasks
    // calc_load_active holds the previous nr_running + nr_uninterruptible total; if it changed, compute the difference
   if (nr_active != this_rq->calc_load_active) { 
      delta = nr_active - this_rq->calc_load_active;
      this_rq->calc_load_active = nr_active;
   }
    // once we return, the caller folds this delta into calc_load_tasks
   return delta;
}
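
Where does rq->nr_uninterruptible itself get incremented? In 4.9 it is maintained by activate_task()/deactivate_task() in kernel/sched/core.c, which adjust the counter whenever a task that "contributes to load" (roughly: a task in TASK_UNINTERRUPTIBLE that is neither frozen nor marked TASK_NOLOAD) is dequeued to sleep or woken back up. A simplified excerpt, paraphrased from the 4.9 sources:

void activate_task(struct rq *rq, struct task_struct *p, int flags)
{
   if (task_contributes_to_load(p))   // the task had been sleeping in D state
      rq->nr_uninterruptible--;
   enqueue_task(rq, p, flags);
}

void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
{
   if (task_contributes_to_load(p))   // the task is going to sleep in D state
      rq->nr_uninterruptible++;
   dequeue_task(rq, p, flags);
}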

3: How the numbers are computed

Having traced where the data comes from, let's now walk through how it is calculated.

The first half of this path involves the high-resolution timer code down in the timer subsystem, which I am not deeply familiar with; I will only introduce it briefly, and interested readers can dig into it themselves. (The file is tick-sched.c; in the original diagram the name was written without the '-', since PlantUML does not allow it in class names.)

3.1: tick_sched_timer

/*
 * High resolution timer specific code
 */
 // this path requires high-resolution timers to be enabled in the kernel: CONFIG_HIGH_RES_TIMERS=y
#ifdef CONFIG_HIGH_RES_TIMERS  
/*
 * We rearm the timer until we get disabled by the idle code.
 * Called with interrupts disabled.
 */
 // tick_sched_timer() is the expiry function of the high-resolution tick timer, i.e. it runs at the end of every tick period
static enum hrtimer_restart tick_sched_timer(struct hrtimer *timer)
{
   struct tick_sched *ts =
      container_of(timer, struct tick_sched, sched_timer);
   struct pt_regs *regs = get_irq_regs();
   ktime_t now = ktime_get();
   tick_sched_do_timer(now);
    ...
   return HRTIMER_RESTART;
}

3.2: calc_global_load

I will skip the timer-core functions in the middle of the chain; they are beyond the scope of this article, and I do not fully understand their logic either. A rough sketch of that bridge is shown below.
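
For orientation only, and simplified from the 4.9 sources: tick_sched_do_timer() calls tick_do_update_jiffies64() (on the CPU that currently owns the tick), which advances jiffies through do_timer() in kernel/time/timekeeping.c, and that is where the global load update is kicked off:

void do_timer(unsigned long ticks)
{
   jiffies_64 += ticks;
   calc_global_load(ticks);
}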

/*
 * calc_load - update the avenrun load estimates 10 ticks after the
 * CPUs have updated calc_load_tasks.
 *
 * Called from the global timer code.
 */
void calc_global_load(unsigned long ticks)
{
   long active, delta;
    // the calc_load_update timestamp we saw earlier, plus another 10 ticks: the total interval is 5 s + 10 ticks
   if (time_before(jiffies, calc_load_update + 10))
      return;
   /*
    * Fold the 'old' idle-delta to include all NO_HZ cpus.
    */
    // pick up the tasks that were missed while CPUs sat idle in NO_HZ mode
   delta = calc_load_fold_idle();
   if (delta)
      atomic_long_add(delta, &calc_load_tasks); // fold them into the global counter
   active = atomic_long_read(&calc_load_tasks); // atomically read the counter filled in earlier
   active = active > 0 ? active * FIXED_1 : 0; // convert to fixed-point by multiplying with FIXED_1
   avenrun[0] = calc_load(avenrun[0], EXP_1, active); // 1-minute load
   avenrun[1] = calc_load(avenrun[1], EXP_5, active); // 5-minute load
   avenrun[2] = calc_load(avenrun[2], EXP_15, active); // 15-minute load
   calc_load_update += LOAD_FREQ; // schedule the next update
   /*
    * In case we idled for multiple LOAD_FREQ intervals, catch up in bulk.
    */
    // having folded in the NO_HZ task counts, the ticks spent in NO_HZ mode must also be replayed, otherwise the averages drift
   calc_global_nohz();
}

A NO_HZ mode appears here; it is a CPU concept that I will cover separately at the end of the article. Below is the actual calculation rule for the load.

3.3: The calculation rule: calc_load

/*
 * a1 = a0 * e + a * (1 - e)
 */
static unsigned long
calc_load(unsigned long load, unsigned long exp, unsigned long active)
{
   unsigned long newload;
   newload = load * exp + active * (FIXED_1 - exp);
   if (active >= load)
      newload += FIXED_1-1;
   return newload / FIXED_1;
}
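
Plugging in the constants makes the formula concrete: with FIXED_1 = 2048 and EXP_1 = 1884 ≈ 2048·e^(-5s/1min), each 5-second update computes an exponentially decaying average of the active count. The small standalone program below (my own sketch, not kernel code) iterates calc_load() with a constant load of 70 R+D tasks and shows the 1-minute average climbing toward 70:

#include <stdio.h>

#define FSHIFT  11
#define FIXED_1 (1UL << FSHIFT)
#define EXP_1   1884                     /* ~ FIXED_1 * exp(-5s/1min) */

static unsigned long calc_load(unsigned long load, unsigned long exp,
                               unsigned long active)
{
   unsigned long newload = load * exp + active * (FIXED_1 - exp);
   if (active >= load)
      newload += FIXED_1 - 1;
   return newload / FIXED_1;
}

int main(void)
{
   unsigned long avg = 0;
   unsigned long active = 70 * FIXED_1;   /* 70 tasks in R/D state, as fixed-point */

   for (int i = 1; i <= 36; i++) {        /* 36 updates of 5 s each = 3 minutes */
      avg = calc_load(avg, EXP_1, active);
      if (i % 12 == 0)                    /* print once per simulated minute */
         printf("after %d min: %lu.%02lu\n", i / 12,
                avg >> FSHIFT, ((avg & (FIXED_1 - 1)) * 100) >> FSHIFT);
   }
   return 0;
}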

The kernel comment spells out the rule clearly, and it is not complicated. Taken as a whole, this matches what man proc told us: the system load counts the nr_running and nr_uninterruptible tasks. Both counters come from struct rq, used throughout kernel/sched/core.c; rq is one of the key structures describing a CPU's run queue.

Analyzing the problem

Back to the original question: why does our device report a system load of 70+ without grinding to a halt? The code above does not answer that directly, but once the logic is clear, the rest is simple.

  • 1: I printed the nr_running and nr_uninterruptible task counts and found that nr_running looked normal; the problem lay in the nr_uninterruptible count (a rough way to observe both from user space is sketched right after this list).
  • 2: So the problem is the number of nr_uninterruptible tasks. Does our device really have that many tasks waiting for I/O? If it did, the device would be extremely sluggish, yet a systrace capture showed everything was normal.
  • 3: At that point the only option left was a search engine, and the keyword nr_uninterruptible turned up some clues.
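
As an aside, a rough user-space approximation of those two counters (my own sketch, not from the kernel) is to scan /proc and count tasks in the R and D states. Note that this only looks at thread-group leaders, while the kernel counts every scheduling entity; for threads you would also walk /proc/[pid]/task/:

#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Count processes in state R (running) and D (uninterruptible sleep)
 * by reading the state field of /proc/[pid]/stat. */
int main(void)
{
   DIR *proc = opendir("/proc");
   struct dirent *de;
   int running = 0, uninterruptible = 0;

   if (!proc)
      return 1;
   while ((de = readdir(proc)) != NULL) {
      char path[288], line[512];
      FILE *f;

      if (!isdigit((unsigned char)de->d_name[0]))
         continue;
      snprintf(path, sizeof(path), "/proc/%s/stat", de->d_name);
      f = fopen(path, "r");
      if (!f)
         continue;
      if (fgets(line, sizeof(line), f)) {
         // the state letter follows the last ')' of the comm field
         char *p = strrchr(line, ')');
         if (p && p[1] == ' ') {
            if (p[2] == 'R')
               running++;
            else if (p[2] == 'D')
               uninterruptible++;
         }
      }
      fclose(f);
   }
   closedir(proc);
   printf("R (runnable): %d, D (uninterruptible): %d\n",
          running, uninterruptible);
   return 0;
}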

The result, in brief

To begin with, UNIX systems did not count nr_uninterruptible at all; Linux introduced it after the argument was made that leaving out tasks waiting for I/O fails to reflect the true load on the system.

Later, in articles by several Linux veterans, I came across the same observation: when an NFS mount runs into trouble, every thread accessing that filesystem can end up in the uninterruptible state and thus counted toward nr_uninterruptible. This knowledge sits very close to the kernel. (PS: if anyone can recommend kernel books covering this area, please do.)

  • Conclusion: because the nr_uninterruptible count was abnormal, the system load figure did not reflect the real state of the device.

Takeaways and summary

  • 1: The HZ mentioned in the scheduler_tick() comment is the kernel's periodic timer-tick rate, tied to the CONFIG_HZ_250 / CONFIG_HZ_1000 kernel options. For example, with CONFIG_HZ_1000=y, CONFIG_HZ=1000, meaning the kernel handles 1000 tick interrupts per second, one every 1s/1000. (CONFIG_HZ=250 is the common setting.)
  • 2: jiffies is simply the count of timer ticks since boot; one jiffy lasts 1/HZ seconds. For example, with CONFIG_HZ=250, LOAD_FREQ = 5*HZ+1 = 1251 jiffies, i.e. the load is sampled just over every five seconds.
  • 3: The rq structure is too long to paste in full; it is defined in kernel/sched/sched.h, and interested readers can look it up there.
   struct rq *rq = cpu_rq(cpu);
/*
 * This is the main, per-CPU runqueue data structure.
 *
 * Locking rule: those places that want to lock multiple runqueues
 * (such as the load balancing or the thread migration code), lock
 * acquire operations must be ordered by ascending &runqueue.
 */
struct rq {
   /* runqueue lock: */
   raw_spinlock_t lock;
   /*
    * nr_running and cpu_load should be in the same cacheline because
    * remote CPUs use both these fields when doing load calculation.
    */
   unsigned int nr_running; // here: the runnable task count
#ifdef CONFIG_NUMA_BALANCING
   unsigned int nr_numa_running;  
   unsigned int nr_preferred_running;
#endif
   #define CPU_LOAD_IDX_MAX 5
   unsigned long cpu_load[CPU_LOAD_IDX_MAX];
   unsigned int misfit_task;
#ifdef CONFIG_NO_HZ_COMMON
#ifdef CONFIG_SMP
   unsigned long last_load_update_tick;
#endif /* CONFIG_SMP */
   unsigned long nohz_flags;
#endif /* CONFIG_NO_HZ_COMMON */
#ifdef CONFIG_NO_HZ_FULL
   unsigned long last_sched_tick;
#endif
#ifdef CONFIG_CPU_QUIET
   /* time-based average load */
   u64 nr_last_stamp;
   u64 nr_running_integral;
   seqcount_t ave_seqcnt;
#endif
   /* capture load from *all* tasks on this cpu: */
   struct load_weight load;
   unsigned long nr_load_updates;
   u64 nr_switches;
   struct cfs_rq cfs;
   struct rt_rq rt;
   struct dl_rq dl;
#ifdef CONFIG_FAIR_GROUP_SCHED
   /* list of leaf cfs_rq on this cpu: */
   struct list_head leaf_cfs_rq_list;
   struct list_head *tmp_alone_branch;
#endif /* CONFIG_FAIR_GROUP_SCHED */
   /*
    * This is part of a global counter where only the total sum
    * over all CPUs matters. A task can increase this counter on
    * one CPU and if it got migrated afterwards it may decrease
    * it on another CPU. Always updated under the runqueue lock:
    */
   unsigned long nr_uninterruptible; // here: the uninterruptible task count
   struct task_struct *curr, *idle, *stop;
   unsigned long next_balance;
   struct mm_struct *prev_mm;
   unsigned int clock_skip_update;
   u64 clock;
   u64 clock_task;
   atomic_t nr_iowait;
#ifdef CONFIG_SMP
   struct root_domain *rd;
   struct sched_domain *sd;
   unsigned long cpu_capacity;
   unsigned long cpu_capacity_orig;
   struct callback_head *balance_callback;
   unsigned char idle_balance;
   /* For active balancing */
   int active_balance;
   int push_cpu;
   struct task_struct *push_task;
   struct cpu_stop_work active_balance_work;
   /* cpu of this runqueue: */
   int cpu;
   int online;
    ...
};
  • 4: High-resolution timers give the kernel nanosecond-level timing precision, far finer than the HZ tick; they are enabled with CONFIG_HIGH_RES_TIMERS=y.
  • 5: NO_HZ (tickless) mode stops the periodic tick while a CPU is idle, reducing power consumption; it is enabled with CONFIG_NO_HZ=y and CONFIG_NO_HZ_IDLE=y. Conversely, on a device that is not power-sensitive and runs on external power, the mode can be disabled in favour of performance.
  • 6: Pulling the kernel config from an Android device (decompress it afterwards and grep for the CONFIG_HZ / CONFIG_NO_HZ / CONFIG_HIGH_RES_TIMERS options):
adb pull /proc/config.gz .

Original article (in Chinese): https://juejin.cn/post/7169417417599401998
