<?xml version="1.0" encoding="UTF-8"?><rss version="2.0">	<channel>		<title>All Blog Comments</title>		<language>en-us</language>		<link>http://www.bioinformaticszen.com</link>		<description>All comments from Bioinformatics Zen</description><item>
<author>yannick wurm</author><title>yannick wurm - Scripting | software | Bioinformatics Zen</title><link>/software/scripting#IDComment42219114</link><description>cool thanks! </description><pubDate>Fri, 6 Nov 2009 12:09:25 +0000</pubDate><guid>/software/scripting#IDComment42219114</guid></item><item>
<author>michaelbarton</author><title>michaelbarton - Scripting | software | Bioinformatics Zen</title><link>/software/scripting#IDComment41954357</link><description>Hi Yannick. I use Rake mainly but I&amp;#039;ve been experimenting with boson and thor also. I outlined a simple project a while ago on github here: &lt;a href=&quot;http://github.com/michaelbarton/organised_experiments&quot; target=&quot;_blank&quot;&gt;http://github.com/michaelbarton/organised_experim...&lt;/a&gt; </description><pubDate>Wed, 4 Nov 2009 15:47:38 +0000</pubDate><guid>/software/scripting#IDComment41954357</guid></item><item>
<author>Yannick Wurm</author><title>Yannick Wurm - Scripting | software | Bioinformatics Zen</title><link>/software/scripting#IDComment41946153</link><description>So what do you use to manage workflow steps, mike? </description><pubDate>Wed, 4 Nov 2009 14:00:09 +0000</pubDate><guid>/software/scripting#IDComment41946153</guid></item><item>
<author> Maximilian</author><title> Maximilian - Dealing With Big Data In Bioinformatics | software | Bioinformatics Zen</title><link>/software/dealing-with-big-data-in-bioinformatics#IDComment40895689</link><description>OK, Mike, I agree: Exploratory data analysis is much easier to do in databases than in flat files. Concerning readability: ORMs introduce a huge layer of complexity to your whole source code. They need their own libraries, conventions, dependencies and rules, which the person reading the code has to get used to. It might be a matter of taste, and I might lack the experience with ORMs, but loading data from SQL is usually just a couple of lines and just a small part of my scripts, so I never had a problem with coding this in three lines of SQL... but this might depend on the type of project.  </description><pubDate>Tue, 27 Oct 2009 22:49:23 +0000</pubDate><guid>/software/dealing-with-big-data-in-bioinformatics#IDComment40895689</guid></item><item>
<author>michaelbarton</author><title>michaelbarton - Dealing With Big Data In Bioinformatics | software | Bioinformatics Zen</title><link>/software/dealing-with-big-data-in-bioinformatics#IDComment40893559</link><description>Hi Max,  I think I&amp;#039;ve covered some of these points above. I agree that using a database might not be the best choice in all cases. I think that some of the stuff Google do with map/reduce (e.g. HDFS) focuses on reading in one file and printing aggregated results out to another file. The reason, as you wrote, is that it is much faster than using a database.  As I touched on in response to Neil&amp;#039;s comment, I think using a database allows more flexibility in making changes when doing exploratory analysis. I might initially start off looping over some data but then later I need to query and join specific subsets, or produce results by groups. Databases can make these kinds of things easier to implement than if I was manipulating flat files with scripts.  I did work around the ORM and use raw SQL when I needed to. This was only two instances in my entire project though, and I would still stand by using an object relational mapping for database manipulation. I think ORMs make code much easier to read. I think being able to maintain the code is just as important as making sure it runs quickly. </description><pubDate>Tue, 27 Oct 2009 22:23:12 +0000</pubDate><guid>/software/dealing-with-big-data-in-bioinformatics#IDComment40893559</guid></item><item>
<author>michaelbarton</author><title>michaelbarton - Dealing With Big Data In Bioinformatics | software | Bioinformatics Zen</title><link>/software/dealing-with-big-data-in-bioinformatics#IDComment40891015</link><description>Hi Neil,  I understand that millions of records may not be that much given your examples of genome wide association studies and next generation sequencing. What I&amp;#039;ve written here however is my perspective on dealing with a very large dataset where usually I deal with much smaller data. I would hope that when I have to start dealing with even larger data these approaches might still be useful.  I agree with your point (and Max&amp;#039;s below) that using a database can be a security blanket. I am often guilty of using a database when just looping over flat files would be easier. Nevertheless I would still prefer to use a database because I think it allows me to be more flexible in how I can change my approach and the corresponding software in an exploratory research project, even if there is a performance penalty compared with flat files. From my experience, when using flat files, managing a project can become complicated as it gets hard to understand what each file and script means. I find that if all my projects follow the same database backed format then hopefully when I return to a project a few months later it should be relatively easy to pick up where I left off.  Hopefully in the future projects like MySQL Drizzle might make using a database faster. There are also the approaches Google takes such as sharding up a BigTable database. Also interesting is the NoSQL movement with databases like Tokyo Tyrant, CouchDB and MongoDB. All of these in the future may make database backed data storage much faster.  </description><pubDate>Tue, 27 Oct 2009 21:57:58 +0000</pubDate><guid>/software/dealing-with-big-data-in-bioinformatics#IDComment40891015</guid></item><item>
<author>Maximilian Haeussler</author><title>Maximilian Haeussler - Dealing With Big Data In Bioinformatics | software | Bioinformatics Zen</title><link>/software/dealing-with-big-data-in-bioinformatics#IDComment40719311</link><description>You can also denormalize your database a bit to gain some speed. Biomart.org is doing this.  I was against using ORMs in whole-genome pipelines in your last post when you were trying to convince us of their advantages and now you&amp;#039;re saying that you worked around it sometimes. &lt;a href=&quot;http://www.bioinformaticszen.com/software/using_a_database/#IDComment15789814&quot; target=&quot;_blank&quot;&gt;http://www.bioinformaticszen.com/software/using_a...&lt;/a&gt;   If you need more speed and you find yourself just iterating through the rows of a whole table, you had better not use a MySQL db but rather indexed text files. &lt;a href=&quot;http://www.bioinformaticszen.com/software/using_a_database/#IDComment15686718&quot; target=&quot;_blank&quot;&gt;http://www.bioinformaticszen.com/software/using_a...&lt;/a&gt; The UCSC genome browser is using text files for this reason.  Of course, when you&amp;#039;re implementing interactive web interfaces, then it&amp;#039;s fine to have everything in MySQL tables as there will be only one-row access at a time, but that&amp;#039;s a different problem... </description><pubDate>Mon, 26 Oct 2009 16:50:31 +0000</pubDate><guid>/software/dealing-with-big-data-in-bioinformatics#IDComment40719311</guid></item><item>
<author>Neil</author><title>Neil - Dealing With Big Data In Bioinformatics | software | Bioinformatics Zen</title><link>/software/dealing-with-big-data-in-bioinformatics#IDComment40232384</link><description>Only millions of records? :-)  A typical GWAS experiment has billions of records, and as for next-gen sequencing, you may have millions of &lt;b&gt;files&lt;/b&gt; let alone lines of data ...  So while the above offers some practical advice (to which I would add - use temporary tables! learn lots of computer languages! be nice to your sys admin!) at some point you have to decide whether your use of a database is necessary to the project, or whether it is just a comfort blanket, to make you feel like you&amp;#039;re managing the data, when all you&amp;#039;re doing is loading and unloading it.  My heuristic would be: if what you&amp;#039;re doing includes curation and audit - which may be most of bioinformatics - then stick it in a database.  If it does not, then the data can live in files outside, provided the database tracks where to find it, so you can process it in a loop.  </description><pubDate>Fri, 23 Oct 2009 15:24:24 +0000</pubDate><guid>/software/dealing-with-big-data-in-bioinformatics#IDComment40232384</guid></item><item>
<author>Steve</author><title>Steve - Using Code Blocks | r_programming | Bioinformatics Zen</title><link>/r_programming/data_analysis_using_r_functions_as_objects#IDComment38856690</link><description>Thanks. This will come in handy. </description><pubDate>Thu, 15 Oct 2009 19:36:52 +0000</pubDate><guid>/r_programming/data_analysis_using_r_functions_as_objects#IDComment38856690</guid></item><item>
<author>gio</author><title>gio - Using Code Blocks | r_programming | Bioinformatics Zen</title><link>/r_programming/data_analysis_using_r_functions_as_objects#IDComment38796021</link><description>thanks, looks like a good introductory tutorial to R </description><pubDate>Thu, 15 Oct 2009 08:36:18 +0000</pubDate><guid>/r_programming/data_analysis_using_r_functions_as_objects#IDComment38796021</guid></item><item>
<author>dalloliogm</author><title>dalloliogm - Using Code Blocks | r_programming | Bioinformatics Zen</title><link>/r_programming/data_analysis_using_r_functions_as_objects#IDComment38795981</link><description>thanks, looks like a good introductory tutorial to R </description><pubDate>Thu, 15 Oct 2009 08:35:48 +0000</pubDate><guid>/r_programming/data_analysis_using_r_functions_as_objects#IDComment38795981</guid></item><item>
<author>michaelbarton</author><title>michaelbarton - Keyboard, Command Line, And Text Files | tools | Bioinformatics Zen</title><link>/tools/keyboard,-command-line,-and-text-files#IDComment34725266</link><description>Thanks! </description><pubDate>Wed, 16 Sep 2009 19:50:24 +0000</pubDate><guid>/tools/keyboard,-command-line,-and-text-files#IDComment34725266</guid></item><item>
<author>Ian</author><title>Ian - Keyboard, Command Line, And Text Files | tools | Bioinformatics Zen</title><link>/tools/keyboard,-command-line,-and-text-files#IDComment34691095</link><description>Useful article and screen casts! </description><pubDate>Wed, 16 Sep 2009 15:52:11 +0000</pubDate><guid>/tools/keyboard,-command-line,-and-text-files#IDComment34691095</guid></item><item>
<author>michaelbarton</author><title>michaelbarton - Keyboard, Command Line, And Text Files | tools | Bioinformatics Zen</title><link>/tools/keyboard,-command-line,-and-text-files#IDComment33741740</link><description>Can you hear that? I thought that wasn&amp;#039;t audible. </description><pubDate>Wed, 9 Sep 2009 14:37:14 +0000</pubDate><guid>/tools/keyboard,-command-line,-and-text-files#IDComment33741740</guid></item><item>
<author>Kieren</author><title>Kieren - Keyboard, Command Line, And Text Files | tools | Bioinformatics Zen</title><link>/tools/keyboard,-command-line,-and-text-files#IDComment33739235</link><description>It&amp;#039;s good to hear your girlfriend in the background busy in the kitchen. </description><pubDate>Wed, 9 Sep 2009 13:56:38 +0000</pubDate><guid>/tools/keyboard,-command-line,-and-text-files#IDComment33739235</guid></item><item>
<author>michaelbarton</author><title>michaelbarton - Using A Database | software | Bioinformatics Zen</title><link>/software/using_a_database#IDComment27993427</link><description>Thank you for your comments Siva and Gioby.  Gioby was correct when he wrote that the intention of this post was to discuss using a database to store data locally, instead of flat files. The point I was trying to make was that it is better to organise data in a database, any kind of database, rather than use flat files.  In my experience many bioinformaticians don&amp;#039;t use a database unless they have been taught how to, and shown the reasons for using them. I believe that using a database is a much easier way to organise a project than through multiple flat files.  As for ORMs, they are not a silver bullet but they can make it much easier to interact programmatically with a database. The alternative is to manipulate SQL strings inside the programme code, which is not pleasant.   I agree with your point Siva when you wrote that SQL databases may have difficulty handling large amounts of web traffic. If speed is an issue such as with a website then other solutions of course should be considered, but for many bioinformaticians this is not an issue. </description><pubDate>Wed, 22 Jul 2009 20:37:21 +0000</pubDate><guid>/software/using_a_database#IDComment27993427</guid></item><item>
<author>gioby</author><title>gioby - Using A Database | software | Bioinformatics Zen</title><link>/software/using_a_database#IDComment27357264</link><description>No, here we are discussing using a database locally, as you could do with access/base, as if it were an enhanced version of a spreadsheet. There has been a long-running debate on flat files vs databases in bioinformatics: for example, read here &lt;a href=&quot;http://www.nodalpoint.org/2007/05/02/database_or_flat_text_file&quot; target=&quot;_blank&quot;&gt;http://www.nodalpoint.org/2007/05/02/database_or_...&lt;/a&gt;  I am using a database to organize my stuff here and I find it has some good advantages. It is installed on my local computer, and for the heavier stuff we have a centralized database on a more powerful machine.  You say that databases are bad for complex searches and for binary files, but then, what do you suggest as an alternative? I have used HDF5, which is a binary format organized as a hierarchical database, and I would like to try couchdb, which looks interesting.  Moreover I don&amp;#039;t agree that databases are bad for making difficult searches, and in any case an ORM library helps you a lot in these cases, because you can save complex SQL queries as a function and execute them within the code.  For example you can have an object called SAMPLES_TABLE and add a method to retrieve some samples based on specific conditions, and it is very useful. </description><pubDate>Wed, 15 Jul 2009 16:46:42 +0000</pubDate><guid>/software/using_a_database#IDComment27357264</guid></item><item>
<author>siva</author><title>siva - Using A Database | software | Bioinformatics Zen</title><link>/software/using_a_database#IDComment27337236</link><description>Gioby,  &amp;quot;&amp;quot;&amp;quot;the use of a database to store local data&amp;quot;&amp;quot;&amp;quot;  Also, the author &amp;#039;never&amp;#039; mentioned that it is for storing local data only.  The process, whether he knows it or not, was like a typical ETL (extract, transform and load) process of a data warehouse. Data in multiple formats and from multiple sites is extracted, then transformed, and finally loaded into the &amp;quot;&amp;quot;&amp;quot;local&amp;quot;&amp;quot;&amp;quot; database. &amp;quot;&amp;quot;&amp;quot;Here, the data is not only from local sources&amp;quot;&amp;quot;&amp;quot;  cheers, siva </description><pubDate>Wed, 15 Jul 2009 11:04:52 +0000</pubDate><guid>/software/using_a_database#IDComment27337236</guid></item><item>
<author>siva</author><title>siva - Using A Database | software | Bioinformatics Zen</title><link>/software/using_a_database#IDComment27335195</link><description>Agreed Gioby.  But what I want to say is that databases and ORMs are not always silver bullets.  For an organization&amp;#039;s internal purposes it is fine, but in the case where you need to access data beyond the DBMS server you need to give some thought to alternatives (all the more so if the data needs to be searched with a lot of join conditions). Also, for compound searches and image retrieval this approach just sucks.   Further, a lot depends on what you want from the data, which decides how the tables are organized.  Furthermore, I agree with Danny who suggested Berkeley DB.  Cheers, Siva </description><pubDate>Wed, 15 Jul 2009 10:33:40 +0000</pubDate><guid>/software/using_a_database#IDComment27335195</guid></item><item>
<author>gioby</author><title>gioby - Using A Database | software | Bioinformatics Zen</title><link>/software/using_a_database#IDComment27326008</link><description>In this post and comments we are discussing something different: the use of a database to store local data, as an alternative to flat files, and this has nothing to do with providing web services.  A typical problem for a bioinformatician is data management: you usually have many different pieces of information coming from other databases, results of experiments, manual annotations, etc... and you have to organize them on your hard disk in a way that makes it easy to access and retrieve them.  For example, you may want to study a set of genes, and for each of them you may have many different unrelated pieces of information: their positions on the chromosome, their transcripts, their activity, etc.. The traditional approach to storing such information is to use a flat file (a table, like oocalc/excel), but when you have to handle many different kinds of information you may end up with a lot of confusion in your data, like duplicated files, old versions, different scripts to access it... A database can be very useful in these cases, because it allows you to have a standard way to access data, and you can define relationships between tables, so it is a lot better once you have got through the learning curve. An ORM is an additional improvement to databases, because it allows you to think in terms of objects (e.g. genes, proteins, sequences) instead of tables, and store information to a db without worrying about many details. </description><pubDate>Wed, 15 Jul 2009 07:44:03 +0000</pubDate><guid>/software/using_a_database#IDComment27326008</guid></item>	</channel></rss>