Classification of helical polymers with deep-learning language models

Many macromolecules in biological systems exist as helical polymers, such as amyloid fibrils (Fitzpatrick et al., 2017, Zhang et al., 2019, Zhang et al., 2020), microtubules (Löwe et al., 2001), actin (Galkin et al., 2010, von der Ecken et al., 2015), bacterial pili (Neuhaus, 2020), viruses (Weis et al., 2019), and phage tails (De Rosier and Klug, 1968). 3D reconstruction of these helical polymers is essential for elucidating structure–function relationships and for understanding related diseases at the atomic level (Li et al., 2022, Lövestam et al., 2022), which can in turn support drug discovery (Renaud, 2018). However, the 3D reconstruction of helical polymers is often hampered by sample heterogeneity (Egelman, 2007).

Helical reconstruction from cryo-EM images begins with the deposition of filaments on grids and the collection of projection images. The filaments in these images are then identified manually or through automated methods. Traditionally, the 3D structure of helical polymers is reconstructed in the Fourier space of the full-length filament: the helical parameters are determined by indexing the Fourier layer lines, and the reconstruction is based on Fourier-Bessel synthesis (De Rosier and Klug, 1968, Gonen et al., 2005, Yonekura et al., 2003). However, this method requires a long, straight, and uniform helical structure. When the helical structures are short, poorly ordered, or tilted out of the plane, the layer lines become weak and ambiguous, making it challenging to derive the helical parameters by Fourier-Bessel indexing (Diaz et al., 2010, Stewart, 1988). Iterative helical real-space reconstruction (IHRSR) does not require the helical polymers to be as long or as straight, because the polymers are computationally segmented along the helical axis, and the segments are analyzed like “single particles” and then reconstructed using the estimated helical parameters (Egelman, 2000, Desfosses et al., 2014, He and Scheres, 2017). However, the presence of heterogeneous types of helical structures can compromise the accuracy of the estimated helical parameters and the separation of the polymorphs into homogeneous subsets, and ultimately degrade the final reconstructions (He and Scheres, 2017, Ramey et al., 2009).

There are several approaches to separating different types of filaments, including manual selection (Fitzpatrick et al., 2017, Zhang et al., 2020), 2D classification (Cao et al., 2019), and 3D classification (Guerrero-Ferreira et al., 2019). Manual picking requires visually distinct features, for example, diameter or shape, and the researcher must have sufficient prior knowledge of the structural differences among the helical assemblies. Distinct types of helical structures can also have similar diameters or differ only in features beyond visual recognition (Guerrero-Ferreira et al., 2019). Manual selection is thus not only labor-intensive but also error-prone. While 2D classification can categorize different states and views of helical assemblies, it is limited by the fact that the filaments may not be homogeneous within each 2D class (Fig. 2B,C). Additionally, the features of the 2D class averages may not be clear enough to group them visually, because each 2D class reflects not only the filament type but also a different, unknown view and a different part of the helical pitch. The CHEP method (Pothula et al., 2019, Pothula et al., 2021) clusters the per-filament histograms of 2D class assignments to separate the different filament types. This approach ignores the relative positions of the segments along the filaments, which might explain its weak separation of amyloid filaments (Pothula et al., 2021). Although 3D classification is the most accurate approach, it requires prior knowledge of the helical parameters and an initial 3D model for each type of helical assembly, which restricts its use for experimental samples with limited prior information.
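To illustrate why ignoring segment order can lose information, the following minimal Python sketch builds a per-filament histogram of 2D class assignments for two toy filaments. The function name `class_histogram`, the four-class setup, and the class sequences are hypothetical and chosen only for illustration; this is not the published histogram-clustering implementation.

```python
from collections import Counter


def class_histogram(class_indices, n_classes):
    """Return the normalized histogram of 2D class assignments for one filament.

    The order of the segments along the filament is deliberately discarded,
    which is the property this sketch is meant to highlight.
    """
    counts = Counter(class_indices)
    total = len(class_indices)
    return [counts.get(c, 0) / total for c in range(n_classes)]


# Two toy filaments (hypothetical data): same class usage, different order.
filament_a = [0, 1, 2, 3, 0, 1, 2, 3]  # classes repeat in a regular pattern
filament_b = [3, 1, 0, 2, 2, 0, 1, 3]  # same classes, shuffled order

hist_a = class_histogram(filament_a, n_classes=4)
hist_b = class_histogram(filament_b, n_classes=4)

# The histograms are identical even though the sequences differ.
print(hist_a == hist_b)
print(filament_a == filament_b)
```

In this toy case the two filaments are indistinguishable by histogram alone, although their class sequences differ; a representation that keeps the sequential positions retains that difference.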

Here, we aim to develop a computational method to quantitatively separate the different types of helical structures in a cryo-EM dataset. After 2D classification, each helical segment image is assigned a 2D class index (ci, Fig. 1B). When combined with the sequential position of each segment along its filament, these indices can be treated as special sentences, with each 2D class index serving as a ‘word’ that is related to the filament type and the rotation around the helical axis (Fig. 1A,B). In recent years, self-supervised learning of natural languages has proven extremely powerful not only in the quantitative representation of words and sentences (Mikolov et al., 2013, Le and Mikolov, 2014, Devlin et al., 2019), with successful applications such as document clustering (Subakti et al., 2022) and the generation of new documents (Rothe et al., 2020), but also in biomedical research, for example, in protein informatics (Zou et al., 2019, Rives, 2021) and structural prediction (Chandra et al., 2023, Ferruz and Höcker, 2022). We therefore explored language-model techniques for classifying different helical structures.
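The sentence analogy can be sketched in a few lines of Python. The example below assumes each segment is described by a filament identifier, a position along the filament, and a 2D class index; the tuple layout, the `build_sentences` helper, the filament names, and the `c<index>` word format are all hypothetical and are not taken from the authors' software.

```python
from collections import defaultdict


def build_sentences(segments):
    """Group segments by filament and order them along the helical axis.

    segments: iterable of (filament_id, position_along_filament, class_index).
    Returns a dict mapping filament_id to a list of 'words', one per segment,
    sorted by position so that the sequential information is preserved.
    """
    by_filament = defaultdict(list)
    for filament_id, position, class_index in segments:
        by_filament[filament_id].append((position, class_index))

    sentences = {}
    for filament_id, items in by_filament.items():
        items.sort(key=lambda item: item[0])  # order by position along the filament
        sentences[filament_id] = [f"c{ci}" for _, ci in items]
    return sentences


# Toy segment table (hypothetical data), listed out of order on purpose.
segments = [
    ("mic1_fil1", 80.0, 7),
    ("mic1_fil1", 0.0, 12),
    ("mic1_fil1", 40.0, 3),
    ("mic1_fil2", 0.0, 5),
    ("mic1_fil2", 40.0, 5),
]

sentences = build_sentences(segments)
for filament_id, words in sentences.items():
    print(filament_id, " ".join(words))
```

Each resulting list of words could then be passed to a language model as one sentence, with one sentence per filament.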

In this Helical classification with Language Models (HLM) method, we examined two self-supervised learning models for helical classification: the classic word2vec model (Mikolov et al., 2013, Le and Mikolov, 2014) and the more recent Transformer-based BERT model (Devlin et al., 2019) (Fig. 1C). The word2vec model has only a single vector representation for each ‘word’, or 2D class. In contrast, the word embedding in the Transformer model is context-dependent and can distinguish the different meanings of the same word in different contexts. Both models were effective in separating different types of manually picked or auto-picked filaments in simulated and experimental datasets. For more challenging datasets, we augmented the 2D classification and class selection results with filament recovery to reduce the number of fragmented filaments (i.e., sentences with missing words). In one of the validation tests, HLM-Transformer led to the discovery of a subset of tau filaments in a public dataset that has an extra, unreported density adjacent to the tau protein.
