
[SOLVED] Grep or AWK to extract a block of text

Posted: Thursday 5 May 2016, 19:32
by cloudany
Hi everyone, another round, another question. I have a huge text file that looks like this:

Code:

>gi|526245010|ref|NC_021865.1| Paenibacillus phage phiIBB_Pl23, complete genome
TTTGTGGCATTGTGAACTCTCTTATACAAGCCGACACGACTACGATTTTCAACCAATTCCAGAAATGGTTTGAAAGCCGT
ACAGCAGACTTTGAAAAAGAGTGGAAAGAATGGCTGGAGAAGATGAAAGATCAAGGGGGAGGGAAATTTGGTGTAACATC
CGTCAATGGTAAGACCGGGGATGTAATACTGATGGCCAAACATGTAGGGGCCCCAAGTATAAATGATCTCAGAGCATACG
CCCTGAAAGGCGAACCCGCGGGGCAGTACACGCCTACGTTTTTGAATGGTTGGTACGTACAAGCAGGTGAAGTCAAAGGG
GTTTGCTACTACAAGGATCAATTCGGCTACGTCCACCTTTATGGCACTTGCTCAGGGACTAAAACTGAGTTTGGCACACC
TTTATTCAACCTCCCGGCTGGGTTTCGTCCAAGTGGTGTAATCCGCGTTGGTTGTCTGATGATCGACTTTGCAGACTACT
CCAGATCTATTCAATTTCTAGGTGTATATCCCAGTGGAGAGGTATTGATAGAAAGTTATGGGTTACCCGGTTTTGTCTCG
etc...
TTTAGCATTTTCCCCTCCTCATTTTACGGGCAAAGGTAGGTGTTTGAGATGGATAAGATCATACAAGCGAGTCGAGTAGA
CAAAGCAGGTAAATACGTAGAAGCGTTGGCACTCATAGAGAAAGATGGCAAGTATTTTAACCTTTTAGATCAAGAGGTAC
CCATAGATGATACTGTCGTATTTGATCCTTTACCCATGCCGATTTATACGCCTATTTGGGATTTTAAAACTAAGACATGG
AAAGAAGGTTTATCTCAAGAGGAAATTGATCAGATCAAAAATCGCCCAGATCCTCCAAACCCAATGAAGGTCATGGAAAA
>gi|460042095|ref|NC_020479.1| Bacillus phage Curly, complete genome
CCCCCACCCCGATGAGGTCGCCTTCTCGTTCCCTGACACCAGAAGCTAATAAGGAGGAATACATATGAGTAAAAGTTGGA
AAGACAAGGATTACGAAGCAAAACACGAGGCAGAACACATTAGACACTACATCAGGAAAGCCAAGAAAGACCCCGAGGCG
GACTATCTGGCGGAGCTGATGTAAAAGCTATTCAGACAACTAACGCAGACTTCATTGGATACATTGAGCACCACGACCGC
AAGATCATGGAATCACTCCTTGTACCGCCATCCCTTCTAGGATTATCTCGTGGTCAATCTGGCTCTTATGCGCTGTCTTC
etc...
>gi|460042095|ref|NC_020479.1| Bacillus phage Curly, complete genome
etc...
I need to extract (in bash) the genome block between two >gi... headers, and do this for 400 blocks. Repeating it 400 times is the least of my problems; the real problem is extracting the block. I tried:

Code:

awk '/>gi/,/^$/{print}' viral.1.1.genomic.fna
sed -n '/>gi/, /'\n'/p' viral.1.1.genomic.fna
without success. Could any of you help me sort this out? Thank you very much!

Re: Grep or AWK to extract a block of text

Posted: Friday 6 May 2016, 0:22
by crap0101
with awk it is better to use >gi as the record separator:

Code:

crap0101@orange:/tmp/foo$ cat input.txt 
>gi|526245010|ref|NC_021865.1| Paenibacillus phage phiIBB_Pl23, complete genome
TTTGTGGCATTGTGAACTCTCTTATACAAGCCGACACGACTACGATTTTCAACCAATTCCAGAAATGGTTTGAAAGCCGT
ACAGCAGACTTTGAAAAAGAGTGGAAAGAATGGCTGGAGAAGATGAAAGATCAAGGGGGAGGGAAATTTGGTGTAACATC
CGTC
>gi|526245010|ref1|NC_021865.1| Paenibacillus phage phiIBB_Pl23, complete genome
TTTGTGGCATTGTGAACTCTCTTATACAAGCCGACACGACTACGATTTTCAACCAATTCCAGAAATGGTTTGAAAGCCGT
ACAGCAGACTTTGAAAAAGAGTGGAAAGAATGGCTGGAGAAGATGAAAGATCAAGGGGGAGGGAAATTTGGTGTAACATC
CGTC
>gi|526245010|ref2|NC_021865.1| Paenibacillus phage phiIBB_Pl23, complete genome
TTTGTGGCATTGTGAACTCTCTTATACAAGCCGACACGACTACGATTTTCAACCAATTCCAGAAATGGTTTGAAAGCCGT
ACAGCAGACTTTGAAAAAGAGTGGAAAGAATGGCTGGAGAAGATGAAAGATCAAGGGGGAGGGAAATTTGGTGTAACATC
CGTC
crap0101@orange:/tmp/foo$ awk --posix -v RS='>gi' '{print $0 "###"}' input.txt
###
|526245010|ref|NC_021865.1| Paenibacillus phage phiIBB_Pl23, complete genome
TTTGTGGCATTGTGAACTCTCTTATACAAGCCGACACGACTACGATTTTCAACCAATTCCAGAAATGGTTTGAAAGCCGT
ACAGCAGACTTTGAAAAAGAGTGGAAAGAATGGCTGGAGAAGATGAAAGATCAAGGGGGAGGGAAATTTGGTGTAACATC
CGTC
###
|526245010|ref1|NC_021865.1| Paenibacillus phage phiIBB_Pl23, complete genome
TTTGTGGCATTGTGAACTCTCTTATACAAGCCGACACGACTACGATTTTCAACCAATTCCAGAAATGGTTTGAAAGCCGT
ACAGCAGACTTTGAAAAAGAGTGGAAAGAATGGCTGGAGAAGATGAAAGATCAAGGGGGAGGGAAATTTGGTGTAACATC
CGTC
###
|526245010|ref2|NC_021865.1| Paenibacillus phage phiIBB_Pl23, complete genome
TTTGTGGCATTGTGAACTCTCTTATACAAGCCGACACGACTACGATTTTCAACCAATTCCAGAAATGGTTTGAAAGCCGT
ACAGCAGACTTTGAAAAAGAGTGGAAAGAATGGCTGGAGAAGATGAAAGATCAAGGGGGAGGGAAATTTGGTGTAACATC
CGTC
###
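A possible variation, offered as a sketch rather than a definitive recipe: POSIX leaves the behaviour of a multi-character RS unspecified, and the separator text itself is dropped from each record (note the missing ">gi" in the output above). Counting header lines avoids both issues and keeps the header in the extracted block. The file name sample.fna, the variable n and the shortened sequences below are made up for illustration.

```shell
# Build a tiny sample file (hypothetical data, shortened sequences).
tmpdir=$(mktemp -d)
cat > "$tmpdir/sample.fna" <<'EOF'
>gi|1|ref|A.1| first genome
AAAA
CCCC
>gi|2|ref|B.1| second genome
GGGG
>gi|3|ref|C.1| third genome
TTTT
EOF

# Print only the n-th block, header line included.
# count goes up by one at every header; awk stops reading
# as soon as the header of block n+1 is reached.
block=$(awk -v n=2 '
    /^>gi/     { count++ }
    count > n  { exit }
    count == n { print }
' "$tmpdir/sample.fna")

printf '%s\n' "$block"
```

Because of the early exit, awk does not have to read the rest of a very large file once the requested block has been printed.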
but if you need to extract them and, presumably, put them into separate files, it is more convenient to use csplit:

Code:

crap0101@orange:/tmp/foo$ csplit input.txt '/>gi/' '{*}'
0
247
248
248
crap0101@orange:/tmp/foo$ l xx*
xx00  xx01  xx02  xx03
crap0101@orange:/tmp/foo$ cat xx00
crap0101@orange:/tmp/foo$ cat xx01
>gi|526245010|ref|NC_021865.1| Paenibacillus phage phiIBB_Pl23, complete genome
TTTGTGGCATTGTGAACTCTCTTATACAAGCCGACACGACTACGATTTTCAACCAATTCCAGAAATGGTTTGAAAGCCGT
ACAGCAGACTTTGAAAAAGAGTGGAAAGAATGGCTGGAGAAGATGAAAGATCAAGGGGGAGGGAAATTTGGTGTAACATC
CGTC
crap0101@orange:/tmp/foo$ cat xx02
>gi|526245010|ref1|NC_021865.1| Paenibacillus phage phiIBB_Pl23, complete genome
TTTGTGGCATTGTGAACTCTCTTATACAAGCCGACACGACTACGATTTTCAACCAATTCCAGAAATGGTTTGAAAGCCGT
ACAGCAGACTTTGAAAAAGAGTGGAAAGAATGGCTGGAGAAGATGAAAGATCAAGGGGGAGGGAAATTTGGTGTAACATC
CGTC
crap0101@orange:/tmp/foo$ cat xx03
>gi|526245010|ref2|NC_021865.1| Paenibacillus phage phiIBB_Pl23, complete genome
TTTGTGGCATTGTGAACTCTCTTATACAAGCCGACACGACTACGATTTTCAACCAATTCCAGAAATGGTTTGAAAGCCGT
ACAGCAGACTTTGAAAAAGAGTGGAAAGAATGGCTGGAGAAGATGAAAGATCAAGGGGGAGGGAAATTTGGTGTAACATC
CGTC
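A hedged sketch of the same csplit call with a few GNU csplit options added, assuming GNU coreutils is installed: it anchors the pattern at the start of the line, skips the empty first file (the empty xx00 above) and gives the pieces more descriptive names. The file name sample.fna, the genome_ prefix and the shortened sequences are assumptions made for illustration only.

```shell
# Sample input (hypothetical, shortened sequences).
tmpdir=$(mktemp -d)
cat > "$tmpdir/sample.fna" <<'EOF'
>gi|1|ref|A.1| first genome
AAAA
CCCC
>gi|2|ref|B.1| second genome
GGGG
>gi|3|ref|C.1| third genome
TTTT
EOF

# GNU csplit options used here:
#   -s  do not print the byte count of each piece
#   -z  do not create zero-length output files
#   -f  prefix for the output file names
#   -b  printf-style format for the numeric suffix
csplit -s -z -f "$tmpdir/genome_" -b '%03d.fna' \
    "$tmpdir/sample.fna" '/^>gi/' '{*}'

ls "$tmpdir"/genome_*.fna
```

Each resulting file holds one header line followed by its sequence lines, so concatenating the pieces in order gives back the original file.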

Re: Grep or AWK to extract a block of text

Posted: Friday 6 May 2016, 9:23
by cloudany
Thank you so much, I had never heard of "csplit" before. Thanks again.