Hi! I'm trying to run DeepImpute on scATAC-Seq data.
I've filtered my dataset to 'high-quality' cells with at least 5,500 reads, and filtered my features (peaks) to those observed in more than 10 cells, which leaves close to 250k peaks. When I try to run the imputation on this, it crashes.
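For reference, the peak filter I applied was roughly the following (my own sketch; filter_peaks and the variable names are mine, not DeepImpute's):

```python
import pandas as pd

def filter_peaks(counts: pd.DataFrame, min_cells: int = 10) -> pd.DataFrame:
    """Keep only peaks (columns) detected in more than `min_cells` cells.

    `counts` is a cells x peaks DataFrame of raw fragment counts.
    """
    keep = (counts > 0).sum(axis=0) > min_cells
    return counts.loc[:, keep]
```
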
Input dataset is 7706 cells (rows) and 249255 genes (columns)
First 3 rows and columns:
W_14793_15289 W_37170_37548 W_46846_47099
AAACGAAAGTAATGTG-3 0 0 0
AAACGAACAGATGGCA-2 0 0 0
AAACGAACATTGTGAC-4 0 0 0
23040 genes selected for imputation
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
&lt;ipython-input-...&gt; in &lt;module&gt;
7 # Crashed, let's try with 50% of the data to fit the network.
8
----> 9 multinet.fit(MACS_data,cell_subset=1,minVMR=0.5)
~/.local/lib/python3.7/site-packages/deepimpute/multinet.py in fit(self, raw, cell_subset, NN_lim, genes_to_impute, ntop, minVMR, mode)
192 genes_to_impute = np.concatenate((genes_to_impute, fill_genes))
193
--> 194 covariance_matrix = get_distance_matrix(raw)
195
196 self.setTargets(raw.reindex(columns=genes_to_impute), mode=mode)
~/.local/lib/python3.7/site-packages/deepimpute/multinet.py in get_distance_matrix(raw)
22 potential_pred = raw.columns[raw.std() > 0]
23
---> 24 covariance_matrix = pd.DataFrame(np.abs(np.corrcoef(raw.T.loc[potential_pred])),
25 index=potential_pred,
26 columns=potential_pred).fillna(0)
&lt;__array_function__ internals&gt; in corrcoef(*args, **kwargs)
~/miniconda3/lib/python3.7/site-packages/numpy/lib/function_base.py in corrcoef(x, y, rowvar, bias, ddof)
2524 warnings.warn('bias and ddof have no effect and are deprecated',
2525 DeprecationWarning, stacklevel=3)
-> 2526 c = cov(x, y, rowvar)
2527 try:
2528 d = diag(c)
&lt;__array_function__ internals&gt; in cov(*args, **kwargs)
~/miniconda3/lib/python3.7/site-packages/numpy/lib/function_base.py in cov(m, y, rowvar, bias, ddof, fweights, aweights)
2452 else:
2453 X_T = (X*w).T
-> 2454 c = dot(X, X_T.conj())
2455 c *= np.true_divide(1, fact)
2456 return c.squeeze()
&lt;__array_function__ internals&gt; in dot(*args, **kwargs)
MemoryError: Unable to allocate array with shape (249255, 249255) and data type float64
Could you explain why the program is building an all_genes x all_genes correlation matrix? I am running this on a server with ~200 GB of memory. I can override the allocation error with echo 1 > /proc/sys/vm/overcommit_memory, but then it genuinely exhausts all the memory and crashes. Any thoughts would be appreciated!

If I'm understanding correctly, I cannot fit the model on the selected subset of genes and then apply it to the remaining genes afterwards, correct? Additionally, I realize DeepImpute is meant for scRNA-Seq, but I figured it should also be applicable to scATAC-Seq.
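Here's the arithmetic behind the allocation failure, plus the workaround I'm considering: pre-filtering to the most variable peaks so the correlation matrix fits in memory (top_variance_subset and the 20k cutoff are just my own sketch, not part of DeepImpute, and I haven't tested whether it hurts imputation quality):

```python
import pandas as pd

# Size of the all-peaks-by-all-peaks correlation matrix DeepImpute builds:
n_peaks = 249255
bytes_needed = n_peaks ** 2 * 8  # float64
print(f"{bytes_needed / 1e9:.0f} GB")  # ~497 GB, well past the server's ~200 GB

# Possible workaround (untested): keep only the top-variance peaks, so the
# correlation matrix shrinks; 20k peaks -> 20000**2 * 8 bytes ~= 3.2 GB.
def top_variance_subset(counts: pd.DataFrame, n_top: int = 20000) -> pd.DataFrame:
    """Keep the `n_top` highest-variance peaks (columns) of a cells x peaks frame."""
    keep = counts.var().nlargest(n_top).index
    return counts[keep]

# MACS_data_small = top_variance_subset(MACS_data)
# multinet.fit(MACS_data_small, cell_subset=1, minVMR=0.5)
```
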
On another note, is it possible to specify which normalization DeepImpute uses? Say, a square-root transform?
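In case pre-transforming the input is the intended route, this is what I had in mind (assuming fit will accept an already-transformed DataFrame; I'm not sure whether DeepImpute also applies its own normalization internally on top of this):

```python
import numpy as np
import pandas as pd

def sqrt_transform(counts: pd.DataFrame) -> pd.DataFrame:
    """Square-root transform raw counts (a common variance stabilizer
    for count data) before handing them to DeepImpute."""
    return np.sqrt(counts)

# multinet.fit(sqrt_transform(MACS_data), cell_subset=1, minVMR=0.5)
```
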