Sometimes, we have situations where we get data as HTML data and we have to convert it to DataTable. From DataTable, we can export it to xlsx or use it according to our need.
The problem I faced with HTML data, was that I had to export a report in xlsx format and I had only HTML data. The HTML data was exporting into xlsx but with a format message, while opening the Excel. That means, it was writing the HTML to the Excel sheet as it appeared. But for the Excel sheet, we need columns to be written properly into each cell. xlsx doesn't support direct HTML conversion to xlsx cells.
To export HTML data to xlsx, I had to convert the HTML data to DataTable from where I can export the data easily and there will be no format message while opening the Excel. Below is the code to convert the HTML to DataTable.
- public static DataTable ConvertHTMLTablesToDataTable(string HTML)
- {
-
-
- DataTable dt = null;
- DataRow dr = null;
- DataColumn dc = null;
- string TableExpression = "<table[^>]*>(.*?)</table>";
- string HeaderExpression = "<th[^>]*>(.*?)</th>";
- string RowExpression = "<tr[^>]*>(.*?)</tr>";
- string ColumnExpression = "<td[^>]*>(.*?)</td>";
- bool HeadersExist = false;
- int iCurrentColumn = 0;
- int iCurrentRow = 0;
-
-
- MatchCollection Tables = Regex.Matches(HTML, TableExpression, RegexOptions.Multiline | RegexOptions.Singleline | RegexOptions.IgnoreCase);
-
-
- foreach (Match Table in Tables)
- {
-
-
- iCurrentRow = 0;
- HeadersExist = false;
-
-
- dt = new DataTable();
-
-
- if (Table.Value.Contains("<th"))
- {
-
- HeadersExist = true;
-
-
- MatchCollection Headers = Regex.Matches(Table.Value, HeaderExpression, RegexOptions.Multiline | RegexOptions.Singleline | RegexOptions.IgnoreCase);
-
-
- foreach (Match Header in Headers)
- {
-
- dt.Columns.Add(Header.Groups[1].ToString());
-
- }
- }
- else
- {
- for (int iColumns = 1; iColumns <= Regex.Matches(Regex.Matches(Regex.Matches(Table.Value, TableExpression, RegexOptions.Multiline | RegexOptions.Singleline | RegexOptions.IgnoreCase)[0].ToString(), RowExpression, RegexOptions.Multiline | RegexOptions.Singleline | RegexOptions.IgnoreCase)[0].ToString(), ColumnExpression, RegexOptions.Multiline | RegexOptions.Singleline | RegexOptions.IgnoreCase).Count; iColumns++)
- {
- dt.Columns.Add("Column " + iColumns);
- }
- }
-
-
- MatchCollection Rows = Regex.Matches(Table.Value, RowExpression, RegexOptions.Multiline | RegexOptions.Singleline | RegexOptions.IgnoreCase);
-
-
- foreach (Match Row in Rows)
- {
-
-
- if (!(iCurrentRow == 0 & HeadersExist == true))
- {
-
-
- dr = dt.NewRow();
- iCurrentColumn = 0;
-
-
- MatchCollection Columns = Regex.Matches(Row.Value, ColumnExpression, RegexOptions.Multiline | RegexOptions.Singleline | RegexOptions.IgnoreCase);
-
-
- foreach (Match Column in Columns)
- {
-
- DataColumnCollection columns = dt.Columns;
-
- if (!columns.Contains("Column " + iCurrentColumn))
- {
-
- dt.Columns.Add("Column " + iCurrentColumn);
- }
-
- dr[iCurrentColumn] = Column.Groups[1].ToString();
-
- iCurrentColumn += 1;
-
- }
-
-
- dt.Rows.Add(dr);
-
- }
-
-
- iCurrentRow += 1;
- }
-
-
- }
-
- return (dt);
-
- }
This line gets all the Table matches as defined in TableExpression variable.
- MatchCollection Tables = Regex.Matches(HTML, TableExpression, RegexOptions.Multiline | RegexOptions.Singleline | RegexOptions.IgnoreCase);
This line gets the total number of columns using ColumnExpressions as <td>. It created each <td> as a column.
- MatchCollection Columns = Regex.Matches(Row.Value, ColumnExpression, RegexOptions.Multiline | RegexOptions.Singleline | RegexOptions.IgnoreCase);
This line gets the header using HeaderExpression as <th>. It created each <th> as a Header.
- MatchCollection Headers = Regex.Matches(Table.Value, HeaderExpression, RegexOptions.Multiline | RegexOptions.Singleline | RegexOptions.IgnoreCase);
Use this code to convert HTML to DataTable. If you have found any problem, please comment below.