
Merging overlapping chromosome regions in RepeatMasker prediction results with Python

Author: 生信工具箱 | Updated: 2022-08-23 | Category: Programming languages

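Preface

RepeatMasker is a program that predicts repetitive sequences using existing databases. It screens DNA sequences for interspersed repeats and low-complexity sequences, and is an important tool for repeat annotation.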

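Problem

We want to merge the repeats reported in a RepeatMasker result file: overlapping regions on the same chromosome should be collapsed into one, and regions separated by fewer than 50 bp should be treated as overlapping as well. How can we do this in Python and produce a new result file?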

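Approach

First, preprocess the file to extract the columns we need; lines that contain only '//' can be ignored.
Next, sort the records of each chromosome in ascending order.
Finally, take each chromosome in turn and walk through its sorted records with two pointers (start and end), in the manner of a sliding window, merging with gap = 50.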
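
The three steps above can be sketched as one small function. This is a minimal illustration, not the article's code: the name `merge_intervals`, the use of `itertools.groupby`, and the sample records are assumptions made for this example, and it treats every chromosome uniformly instead of setting single-record chromosomes aside.

```python
from itertools import groupby
from operator import itemgetter

def merge_intervals(records, gap=50):
    """Merge (chrom, start, end) records per chromosome.

    Records that overlap, or whose distance is below `gap` bp, are joined.
    """
    merged = []
    ordered = sorted(records)  # by chromosome, then start, then end
    for chrom, group in groupby(ordered, key=itemgetter(0)):
        start = end = None
        for _, s, e in group:
            if start is None:
                start, end = s, e          # first record of this chromosome
            elif s - end < gap:
                end = max(end, e)          # overlapping or close: extend
            else:
                merged.append((chrom, start, end))
                start, end = s, e          # begin a new merged interval
        merged.append((chrom, start, end))  # close the last interval
    return merged

# Hypothetical sample data
example = [("chr1", 100, 200), ("chr1", 150, 300), ("chr1", 340, 400),
           ("chr1", 500, 600), ("chr2", 10, 20)]
print(merge_intervals(example))
# [('chr1', 100, 400), ('chr1', 500, 600), ('chr2', 10, 20)]
```

In the sample, 150 falls inside 100-200 and 340 is only 40 bp past 300, so the first three records collapse into one; 500 is 100 bp past 400 and therefore starts a new interval.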

1. Preprocessing

We only need the first three columns of the result file here, and awk can extract them:

    # result.txt is the result file; pretreatment.txt is the preprocessed file.
    # Columns are written tab-separated so that they can be split reliably later.
    awk 'BEGIN { OFS = "\t" } { print $1, $2, $3 }' result.txt > pretreatment.txt
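
If awk is not available, the same preprocessing could be done in Python. The sketch below is only an illustration: the helper name `first_columns` and the sample lines are hypothetical, and it assumes the fields of the result file are separated by whitespace.

```python
def first_columns(lines, n=3):
    """Yield the first n whitespace-separated fields of each line, tab-joined."""
    for line in lines:
        fields = line.split()
        if fields:                      # skip blank lines
            yield "\t".join(fields[:n])

# Hypothetical sample lines
sample = ["chr1 100 200 LINE/L1\n", "//\n", "\n", "chr2\t5\t9\tSINE\n"]
print(list(first_columns(sample)))
# ['chr1\t100\t200', '//', 'chr2\t5\t9']
```

To process a real file, the function can be fed an open file handle, and each yielded string written to pretreatment.txt followed by a newline.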

2. Read pretreatment.txt as the input file

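a = []  # every record as (chromosome, start, end)
b = []  # chromosome name of every record
with open('pretreatment.txt', 'r') as f:
    for i in f:
        i = i.strip()
        if not i or i == '//':
            continue  # skip blank lines and '//' separator lines
        c = i.split()  # split on any whitespace (tabs or spaces)
        b.append(c[0])
        a.append((c[0], int(c[1]), int(c[2])))
print("Total number of records: " + str(len(a)))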
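
Real result files sometimes contain lines that do not fit the expected three-column layout. A slightly more defensive parser might look like the sketch below; the function name `parse_records` and the choice to raise `ValueError` on short lines are assumptions for illustration, not part of the original script.

```python
def parse_records(lines):
    """Parse 'chrom start end' lines into (chrom, start, end) tuples."""
    records = []
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields or fields[0] == "//":
            continue  # blank line or '//' separator line
        if len(fields) < 3:
            raise ValueError(f"line {number}: expected 3 columns, got {len(fields)}")
        records.append((fields[0], int(fields[1]), int(fields[2])))
    return records

# Hypothetical sample lines
sample = ["chr1\t100\t200\n", "//\n", "chr1 150 300\n", "\n"]
print(parse_records(sample))
# [('chr1', 100, 200), ('chr1', 150, 300)]
```

Reporting the line number makes it easier to locate a malformed record than the bare `IndexError` that indexing a short list would produce.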

3. Set aside single-record chromosomes, then sort

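from collections import Counter

# A chromosome with exactly one record cannot overlap anything,
# so its record is moved to result and written out unchanged later.
counts = Counter(b)
result = [i for i in a if counts[i[0]] == 1]
a = [i for i in a if counts[i[0]] > 1]  # build new lists instead of removing while iterating
print("Records left after setting aside single-record chromosomes: " + str(len(a)))

a.sort(key=lambda x: (x[0], x[1], x[2]))
# Sort in ascending order by column 1 (chromosome), then column 2 (start), then column 3 (end)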
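
The split into single-record chromosomes and everything else can be tried out on a small list. The sketch below is an illustration under assumptions: `split_singletons` and the sample records are hypothetical names and data introduced for this example.

```python
from collections import Counter

def split_singletons(records):
    """Return (records on single-record chromosomes, sorted remaining records)."""
    counts = Counter(chrom for chrom, _, _ in records)
    singles = [r for r in records if counts[r[0]] == 1]
    multi = sorted(r for r in records if counts[r[0]] > 1)
    return singles, multi

# Hypothetical sample data
records = [("chr2", 50, 80), ("chr10", 5, 9), ("chr2", 10, 40), ("chr3", 1, 2)]
singles, multi = split_singletons(records)
print(singles)  # [('chr10', 5, 9), ('chr3', 1, 2)]
print(multi)    # [('chr2', 10, 40), ('chr2', 50, 80)]
```

Note that chromosome names are compared as strings, so "chr10" sorts before "chr2". That does not affect merging, which only needs records of the same chromosome to be adjacent and ordered by start.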

4. Merge the records, gap = 50

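q = ''      # chromosome currently being processed
start = 0   # start of the current merged interval
end = 0     # end of the current merged interval
tem1 = []   # merged intervals of the current chromosome
tem2 = []   # one list of merged intervals per chromosome
gap = 50    # records closer than this many bp are merged
for i in a:
    if i[0] != q:
        # New chromosome: close the last interval of the previous one
        if q:
            tem1.append([q, start, end])
            tem2.append(tem1)
            tem1 = []
        q = i[0]
        start = i[1]
        end = i[2]
        continue
    if i[1] - end < gap:
        # Overlapping or closer than gap: extend the current interval
        end = max(end, i[2])
        continue
    tem1.append([q, start, end])
    start = i[1]
    end = i[2]
# Close the last interval of the last chromosome
if q:
    tem1.append([q, start, end])
    tem2.append(tem1)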
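
One way to gain confidence in a merge result is to check that no two neighbouring intervals on the same chromosome are still closer than the gap. The helper below is a sketch for that purpose; `find_violations` and the sample intervals are hypothetical and not part of the original script.

```python
def find_violations(intervals, gap=50):
    """Return neighbouring same-chromosome pairs that are still closer than gap."""
    violations = []
    ordered = sorted(intervals)
    for prev, cur in zip(ordered, ordered[1:]):
        if prev[0] == cur[0] and cur[1] - prev[2] < gap:
            violations.append((prev, cur))
    return violations

# Hypothetical merged output
merged = [("chr1", 100, 400), ("chr1", 500, 600), ("chr2", 10, 20)]
print(find_violations(merged))           # []
print(find_violations(merged, gap=150))  # [(('chr1', 100, 400), ('chr1', 500, 600))]
```

An empty list means the intervals are fully merged for the given gap; with a stricter gap of 150 the two chr1 intervals, which are 100 bp apart, are reported.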

5. Write the results to new_result.txt

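with open('new_result.txt', 'w') as f:
    # Merged intervals, one list per chromosome
    for i in tem2:
        for o in i:
            print(o[0], o[1], o[2], file=f)
    # Records from chromosomes that had only a single record
    for i in result:
        print(i[0], i[1], i[2], file=f)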
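
If the output should be tab-separated, or if the writing step needs to be tested without touching the disk, it can be wrapped in a function that accepts any file-like object. This is a sketch under assumptions: `write_intervals` and its `sep` parameter are hypothetical names introduced here.

```python
import io

def write_intervals(intervals, handle, sep="\t"):
    """Write one 'chrom<sep>start<sep>end' line per interval to handle."""
    for chrom, start, end in intervals:
        print(chrom, start, end, sep=sep, file=handle)

# Write to an in-memory buffer instead of a real file
buffer = io.StringIO()
write_intervals([("chr1", 100, 400), ("chr2", 10, 20)], buffer)
print(buffer.getvalue(), end="")
```

The same function works with a handle returned by `open('new_result.txt', 'w')`.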

6. Complete code

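from collections import Counter

# Step 2: read the preprocessed file
a = []  # every record as (chromosome, start, end)
b = []  # chromosome name of every record
with open('pretreatment.txt', 'r') as f:
    for i in f:
        i = i.strip()
        if not i or i == '//':
            continue  # skip blank lines and '//' separator lines
        c = i.split()  # split on any whitespace (tabs or spaces)
        b.append(c[0])
        a.append((c[0], int(c[1]), int(c[2])))
print("Total number of records: " + str(len(a)))

# Step 3: set aside chromosomes with a single record, then sort the rest
counts = Counter(b)
result = [i for i in a if counts[i[0]] == 1]
a = [i for i in a if counts[i[0]] > 1]
print("Records left after setting aside single-record chromosomes: " + str(len(a)))
a.sort(key=lambda x: (x[0], x[1], x[2]))  # ascending by chromosome, start, end

# Step 4: merge overlapping or close records chromosome by chromosome
q = ''      # chromosome currently being processed
start = 0   # start of the current merged interval
end = 0     # end of the current merged interval
tem1 = []   # merged intervals of the current chromosome
tem2 = []   # one list of merged intervals per chromosome
gap = 50    # records closer than this many bp are merged
for i in a:
    if i[0] != q:
        # New chromosome: close the last interval of the previous one
        if q:
            tem1.append([q, start, end])
            tem2.append(tem1)
            tem1 = []
        q = i[0]
        start = i[1]
        end = i[2]
        continue
    if i[1] - end < gap:
        # Overlapping or closer than gap: extend the current interval
        end = max(end, i[2])
        continue
    tem1.append([q, start, end])
    start = i[1]
    end = i[2]
# Close the last interval of the last chromosome
if q:
    tem1.append([q, start, end])
    tem2.append(tem1)

# Step 5: write merged intervals, then the single-record chromosomes
with open('new_result.txt', 'w') as f:
    for i in tem2:
        for o in i:
            print(o[0], o[1], o[2], file=f)
    for i in result:
        print(i[0], i[1], i[2], file=f)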

Original article: https://www.jianshu.com/p/42faa3a0228e
