The PTB Dataset
Its contents look like this:
One sentence is stored per line; rare words are replaced with the special token <unk>, and concrete numbers are replaced with "N".
we 're talking about years ago before anyone heard of asbestos having any questionable properties there is no asbestos in our products now neither <unk> nor the researchers who studied the workers were aware of any research on smokers of the kent cigarettes we have no useful information on whether users are at risk said james a. <unk> of boston 's <unk> cancer institute dr. <unk> led a team of researchers from the national cancer institute and the medical schools of harvard university and boston university
ptb.py
Using the PTB dataset:
From the line below, we can see that when the PTB dataset is loaded, all sentences are concatenated into one long word sequence, with each newline (i.e. each sentence boundary) replaced by an <eos> token.

words = open(file_path).read().replace('\n', '<eos>').strip().split()
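As a quick illustration, here is a minimal sketch with a made-up two-line "file" (it imitates the layout of the actual PTB files, which begin and end each line with a space, so <eos> ends up as a separate token after the replacement):

# hypothetical two-line input in PTB layout
text = " we 're talking about years ago \n there is no asbestos in our products now \n"
words = text.replace('\n', '<eos>').strip().split()
print(words)
# ['we', "'re", 'talking', 'about', 'years', 'ago', '<eos>',
#  'there', 'is', 'no', 'asbestos', 'in', 'our', 'products', 'now', '<eos>']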
ptb.py downloads the PTB dataset, caches it in a local directory, and then extracts corpus, word_to_id, and id_to_word from it.
import sys
import os
sys.path.append('..')
try:
    import urllib.request
except ImportError:
    raise ImportError('Use Python3!')
import pickle
import numpy as np


url_base = 'https://raw.githubusercontent.com/tomsercu/lstm/master/data/'
key_file = {
    'train': 'ptb.train.txt',
    'test': 'ptb.test.txt',
    'valid': 'ptb.valid.txt'
}
save_file = {
    'train': 'ptb.train.npy',
    'test': 'ptb.test.npy',
    'valid': 'ptb.valid.npy'
}
vocab_file = 'ptb.vocab.pkl'

dataset_dir = os.path.dirname(os.path.abspath(__file__))


def _download(file_name):
    file_path = dataset_dir + '/' + file_name
    if os.path.exists(file_path):
        return

    print('Downloading ' + file_name + ' ... ')

    try:
        urllib.request.urlretrieve(url_base + file_name, file_path)
    except urllib.error.URLError:
        import ssl
        ssl._create_default_https_context = ssl._create_unverified_context
        urllib.request.urlretrieve(url_base + file_name, file_path)

    print('Done')


def load_vocab():
    vocab_path = dataset_dir + '/' + vocab_file

    if os.path.exists(vocab_path):
        with open(vocab_path, 'rb') as f:
            word_to_id, id_to_word = pickle.load(f)
        return word_to_id, id_to_word

    word_to_id = {}
    id_to_word = {}
    data_type = 'train'
    file_name = key_file[data_type]
    file_path = dataset_dir + '/' + file_name

    _download(file_name)

    words = open(file_path).read().replace('\n', '<eos>').strip().split()

    for i, word in enumerate(words):
        if word not in word_to_id:
            tmp_id = len(word_to_id)
            word_to_id[word] = tmp_id
            id_to_word[tmp_id] = word

    with open(vocab_path, 'wb') as f:
        pickle.dump((word_to_id, id_to_word), f)

    return word_to_id, id_to_word


def load_data(data_type='train'):
    '''
    :param data_type: which split to load: 'train', 'test', or 'valid' ('val')
    :return: corpus, word_to_id, id_to_word
    '''
    if data_type == 'val': data_type = 'valid'
    save_path = dataset_dir + '/' + save_file[data_type]

    word_to_id, id_to_word = load_vocab()

    if os.path.exists(save_path):
        corpus = np.load(save_path)
        return corpus, word_to_id, id_to_word

    file_name = key_file[data_type]
    file_path = dataset_dir + '/' + file_name
    _download(file_name)

    words = open(file_path).read().replace('\n', '<eos>').strip().split()
    corpus = np.array([word_to_id[w] for w in words])

    np.save(save_path, corpus)
    return corpus, word_to_id, id_to_word


if __name__ == '__main__':
    for data_type in ('train', 'val', 'test'):
        load_data(data_type)
Using ptb.py
corpus holds the list of word IDs, id_to_word is the dictionary that maps word IDs to words, and word_to_id is the dictionary that maps words to word IDs.

Load the data with ptb.load_data(). The argument 'train', 'test', or 'valid' selects the training data, test data, or validation data respectively.
import sys
sys.path.append('..')
from dataset import ptb

corpus, word_to_id, id_to_word = ptb.load_data('train')

print('corpus size:', len(corpus))
print('corpus[:30]:', corpus[:30])
print()
print('id_to_word[0]:', id_to_word[0])
print('id_to_word[1]:', id_to_word[1])
print('id_to_word[2]:', id_to_word[2])
print()
print("word_to_id['car']:", word_to_id['car'])
print("word_to_id['happy']:", word_to_id['happy'])
print("word_to_id['lexus']:", word_to_id['lexus'])
Result:
corpus size: 929589
corpus[:30]: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29]

id_to_word[0]: aer
id_to_word[1]: banknote
id_to_word[2]: berlitz

word_to_id['car']: 3856
word_to_id['happy']: 4428
word_to_id['lexus']: 7426

Process finished with exit code 0
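Since word_to_id and id_to_word are inverses of each other, corpus can be decoded back into text. A small round-trip sketch (the expected output follows directly from the vocabulary shown above):

# IDs -> words, and word -> ID -> word
print(' '.join(id_to_word[i] for i in corpus[:3]))   # -> aer banknote berlitz
assert word_to_id[id_to_word[3856]] == 3856          # 'car' maps back to its own ID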
Applying the Count-Based Method to the PTB Dataset
The only real difference from the version without the PTB dataset is this line:
corpus, word_to_id, id_to_word = ptb.load_data('train')
The following line performs the dimensionality reduction, keeping only the first wordvec_size columns of U:
word_vecs = U[:, :wordvec_size]
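To make the slicing concrete, here is a toy sketch (the 6x6 matrix and k=2 are made up for illustration): a full SVD returns U with one column per vocabulary word, and the slice keeps only the leading columns as the dense word vectors.

import numpy as np

# toy stand-in for the PPMI matrix W (6 "words", so U is 6x6)
rng = np.random.default_rng(0)
W_toy = np.abs(rng.standard_normal((6, 6)))

U, S, V = np.linalg.svd(W_toy)
k = 2                      # stand-in for wordvec_size
word_vecs_toy = U[:, :k]   # keep the k most significant columns
print(U.shape, '->', word_vecs_toy.shape)   # (6, 6) -> (6, 2)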
The most time-consuming part of the whole script is this function call:
W = ppmi(C, verbose=True)
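ppmi() is slow because it evaluates every cell of the vocab_size x vocab_size co-occurrence matrix in a double Python loop; for PTB that is roughly 10,000 x 10,000, about 10^8 cells. A rough sketch of the computation (not the book's exact implementation), assuming the standard definition PPMI(x, y) = max(0, log2(C(x, y) * N / (C(x) * C(y)))):

import numpy as np

def ppmi_sketch(C, eps=1e-8):
    M = np.zeros_like(C, dtype=np.float32)
    N = np.sum(C)           # total co-occurrence count
    S = np.sum(C, axis=0)   # per-word counts
    for i in range(C.shape[0]):
        for j in range(C.shape[1]):
            # eps avoids log2(0) when C[i, j] == 0
            pmi = np.log2(C[i, j] * N / (S[i] * S[j]) + eps)
            M[i, j] = max(0, pmi)
    return M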
Full code:
import sys
sys.path.append('..')
import numpy as np
from common.util import most_similar, create_co_matrix, ppmi
from dataset import ptb

window_size = 2
wordvec_size = 100

corpus, word_to_id, id_to_word = ptb.load_data('train')
vocab_size = len(word_to_id)
print('counting co-occurrence ...')
C = create_co_matrix(corpus, vocab_size, window_size)
print('calculating PPMI ...')
W = ppmi(C, verbose=True)

print('calculating SVD ...')
#try:
# truncated SVD (fast!)
print("ok")
from sklearn.utils.extmath import randomized_svd
U, S, V = randomized_svd(W, n_components=wordvec_size, n_iter=5,
                         random_state=None)
#except ImportError:
#    # SVD (slow)
#    U, S, V = np.linalg.svd(W)

word_vecs = U[:, :wordvec_size]

querys = ['you', 'year', 'car', 'toyota']
for query in querys:
    most_similar(query, word_to_id, id_to_word, word_vecs, top=5)
The result below was produced with the plain np.linalg.svd(W):
[query] you
 i: 0.7016294002532959
 we: 0.6388039588928223
 anybody: 0.5868048667907715
 do: 0.5612815618515015
 'll: 0.512611985206604

[query] year
 month: 0.6957005262374878
 quarter: 0.691483736038208
 earlier: 0.6661213636398315
 last: 0.6327787041664124
 third: 0.6230476498603821

[query] car
 luxury: 0.6767407655715942
 auto: 0.6339930295944214
 vehicle: 0.5972712635993958
 cars: 0.5888376235961914
 truck: 0.5693157315254211

[query] toyota
 motor: 0.7481387853622437
 nissan: 0.7147319316864014
 motors: 0.6946366429328918
 lexus: 0.6553674340248108
 honda: 0.6343469619750977
The result below uses randomized_svd from the sklearn module: a Truncated SVD that uses random projections and computes only the components with the largest singular values, so it runs much faster than a conventional SVD.
calculating SVD ...
ok

[query] you
 i: 0.6678948998451233
 we: 0.6213737726211548
 something: 0.560122013092041
 do: 0.5594725608825684
 someone: 0.5490139126777649

[query] year
 month: 0.6444296836853027
 quarter: 0.6192560791969299
 next: 0.6152222156524658
 fiscal: 0.5712860226631165
 earlier: 0.5641934871673584

[query] car
 luxury: 0.6612467765808105
 auto: 0.6166062355041504
 corsica: 0.5270425081253052
 cars: 0.5142025947570801
 truck: 0.5030257105827332

[query] toyota
 motor: 0.7747215628623962
 motors: 0.6871038675308228
 lexus: 0.6786072850227356
 nissan: 0.6618651151657104
 mazda: 0.6237337589263916

Process finished with exit code 0
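To check the speed gap on your own machine, here is a minimal, self-contained benchmark sketch (the 1000x1000 random matrix is just a stand-in for the PPMI matrix; actual timings will vary with hardware and library versions):

import time
import numpy as np
from sklearn.utils.extmath import randomized_svd

A = np.random.rand(1000, 1000).astype(np.float32)

t0 = time.time()
np.linalg.svd(A)                      # full SVD: all singular values
print('full SVD:       %.2fs' % (time.time() - t0))

t0 = time.time()
randomized_svd(A, n_components=100,   # truncated: top 100 components only
               n_iter=5, random_state=None)
print('randomized SVD: %.2fs' % (time.time() - t0))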
Original article: https://www.cnblogs.com/jiangyiming/p/16102323.html